[jira] [Assigned] (SPARK-35182) Support driver-owned on-demand PVC

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35182:


Assignee: (was: Apache Spark)

> Support driver-owned on-demand PVC
> --
>
> Key: SPARK-35182
> URL: https://issues.apache.org/jira/browse/SPARK-35182
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
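For context, since the ticket body is empty: the sketch below is an assumption rather than anything stated in SPARK-35182. It shows how on-demand PVCs are requested for executors via configuration in current Spark on Kubernetes; the improvement asks for such PVCs to be owned by the driver. The volume name "data" and the sizes are placeholders.

{code:scala}
// Assumed background sketch, not taken from this ticket: requesting an
// "OnDemand" PVC for executors via Spark-on-Kubernetes configuration.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName", "OnDemand")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit", "100Gi")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path", "/data")
{code}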







[jira] [Created] (SPARK-35183) CombineConcats should call transformAllExpressions

2021-04-21 Thread Yingyi Bu (Jira)
Yingyi Bu created SPARK-35183:
-

 Summary: CombineConcats should call transformAllExpressions
 Key: SPARK-35183
 URL: https://issues.apache.org/jira/browse/SPARK-35183
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.1.0
Reporter: Yingyi Bu


 

{{plan transformExpressions \{ ... }}}

only applies the transformation to the node `plan` itself, not to its children. We
should call transformAllExpressions instead of transformExpressions in
CombineConcats.
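A minimal sketch of the difference described above, assuming Catalyst's QueryPlan API; the rewrite shown is a placeholder, not the actual CombineConcats rule.

{code:scala}
// transformExpressions only visits expressions of the given plan node;
// transformAllExpressions also descends into the node's children.
import org.apache.spark.sql.catalyst.expressions.Concat
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def rewriteConcatsEverywhere(plan: LogicalPlan): LogicalPlan =
  plan transformAllExpressions {
    case c: Concat => c // placeholder; a real rule would flatten nested Concats here
  }
{code}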






[jira] [Commented] (SPARK-35182) Support driver-owned on-demand PVC

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327137#comment-17327137
 ] 

Apache Spark commented on SPARK-35182:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32288

> Support driver-owned on-demand PVC
> --
>
> Key: SPARK-35182
> URL: https://issues.apache.org/jira/browse/SPARK-35182
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-35182) Support driver-owned on-demand PVC

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35182:


Assignee: Apache Spark

> Support driver-owned on-demand PVC
> --
>
> Key: SPARK-35182
> URL: https://issues.apache.org/jira/browse/SPARK-35182
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-35182) Support driver-owned on-demand PVC

2021-04-21 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-35182:
-

 Summary: Support driver-owned on-demand PVC
 Key: SPARK-35182
 URL: https://issues.apache.org/jira/browse/SPARK-35182
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun









[jira] [Resolved] (SPARK-34671) Support ZSTD compression in Parquet data sources

2021-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34671.
---
Resolution: Duplicate

> Support ZSTD compression in Parquet data sources
> 
>
> Key: SPARK-34671
> URL: https://issues.apache.org/jira/browse/SPARK-34671
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
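For context only, since this ticket was closed as a duplicate: a sketch of requesting zstd explicitly on a Parquet write today. It assumes a local SparkSession and zstd support in the underlying Parquet/Hadoop libraries; the path is a placeholder.

{code:scala}
// Context-only sketch: ask Parquet for zstd compression explicitly on write.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("zstd-parquet-demo").getOrCreate()
spark.range(1000).write.option("compression", "zstd").parquet("/tmp/zstd_demo")
{code}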







[jira] [Updated] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive

2021-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35096:
--
Fix Version/s: 3.0.3

> foreachBatch throws ArrayIndexOutOfBoundsException if schema is case 
> Insensitive
> 
>
> Key: SPARK-35096
> URL: https://issues.apache.org/jira/browse/SPARK-35096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> The code below works fine before Spark 3, but running it on Spark 3 throws
> java.lang.ArrayIndexOutOfBoundsException:
> {code:java}
> val inputPath = "/Users/xyz/data/testcaseInsensitivity"
> val output_path = "/Users/xyz/output"
> spark.range(10).write.format("parquet").save(inputPath)
> def process_row(microBatch: DataFrame, batchId: Long): Unit = {
>   val df = microBatch.select($"ID".alias("other")) // Doesn't work
>   df.write.format("parquet").mode("append").save(output_path)
> }
> val schema = new StructType().add("id", LongType)
> val stream_df = 
> spark.readStream.schema(schema).format("parquet").load(inputPath)
> stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _)
>   .start().awaitTermination()
> {code}
> Stack Trace:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
>   at scala.collection.mut

[jira] [Assigned] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34674:
-

Assignee: Sergey Kotlov

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Assignee: Sergey Kotlov
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, the
> Spark driver process doesn't terminate even after its main method has completed.
> This behaviour is different from Spark on YARN, where stopping the SparkContext
> manually is not required.
>  It looks like the problem is caused by non-daemon threads, which prevent the
> driver JVM process from terminating.
>  At least I see two non-daemon threads if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.
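A minimal sketch of the workaround the reporter describes, assuming a driver submitted in cluster mode on Kubernetes; the app name and job are placeholders.

{code:scala}
// Stop the SparkContext explicitly at the end of main so that lingering
// non-daemon threads (e.g. the OkHttp ones listed above) cannot keep the
// driver JVM, and hence the driver pod, alive.
import org.apache.spark.sql.SparkSession

object K8sDemoApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("k8s-demo").getOrCreate()
    try {
      spark.range(10).count() // placeholder work
    } finally {
      spark.stop()
    }
  }
}
{code}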






[jira] [Resolved] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34674.
---
Fix Version/s: 3.1.2
               3.2.0
   Resolution: Fixed

Issue resolved by pull request 32283
[https://github.com/apache/spark/pull/32283]

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Assignee: Sergey Kotlov
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, the
> Spark driver process doesn't terminate even after its main method has completed.
> This behaviour is different from Spark on YARN, where stopping the SparkContext
> manually is not required.
>  It looks like the problem is caused by non-daemon threads, which prevent the
> driver JVM process from terminating.
>  At least I see two non-daemon threads if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.






[jira] [Commented] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327118#comment-17327118
 ] 

Apache Spark commented on SPARK-27991:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/32287

> ShuffleBlockFetcherIterator should take Netty constant-factor overheads into 
> account when limiting number of simultaneous block fetches
> ---
>
> Key: SPARK-27991
> URL: https://issues.apache.org/jira/browse/SPARK-27991
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> ShuffleBlockFetcherIterator has logic to limit the number of simultaneous 
> block fetches. By default, this logic tries to keep the number of outstanding 
> block fetches [beneath a data size 
> limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274]
>  ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads 
> into account: even though a remote block might be, say, 4KB, there are 
> certain fixed-size internal overheads due to Netty buffer sizes which may 
> cause the actual space requirements to be larger.
> As a result, if a map stage produces a huge number of extremely tiny blocks 
> then we may see errors like
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 
> byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
> [...]
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
> 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
> at 
> io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
> at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
> [...]{code}
> SPARK-24989 is another report of this problem (but with a different proposed 
> fix).
> This problem can currently be mitigated by setting 
> {{spark.reducer.maxReqsInFlight}} to some non-IntMax value (SPARK-6166), 
> but this additional manual configuration step is cumbersome.
> Instead, I think that Spark should take these fixed overheads into account in 
> the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, 
> use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some 
> tricky details involved to make this work on all configurations (e.g. to use 
> a different minimum when direct buffers are disabled, etc.), but I think the 
> core idea behind the fix is pretty simple.
> This will improve Spark's stability and remove the configuration / tuning burden
> from end users.
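A sketch of the interim mitigation mentioned above; the values are arbitrary examples, not recommendations.

{code:scala}
// Bound the number of in-flight fetch requests in addition to the byte limit.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.reducer.maxReqsInFlight", "256") // default is Int.MaxValue (see SPARK-6166)
  .set("spark.reducer.maxSizeInFlight", "48m") // the maxBytesInFlight limit discussed above
{code}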






[jira] [Commented] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327119#comment-17327119
 ] 

Apache Spark commented on SPARK-27991:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/32287

> ShuffleBlockFetcherIterator should take Netty constant-factor overheads into 
> account when limiting number of simultaneous block fetches
> ---
>
> Key: SPARK-27991
> URL: https://issues.apache.org/jira/browse/SPARK-27991
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> ShuffleBlockFetcherIterator has logic to limit the number of simultaneous 
> block fetches. By default, this logic tries to keep the number of outstanding 
> block fetches [beneath a data size 
> limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274]
>  ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads 
> into account: even though a remote block might be, say, 4KB, there are 
> certain fixed-size internal overheads due to Netty buffer sizes which may 
> cause the actual space requirements to be larger.
> As a result, if a map stage produces a huge number of extremely tiny blocks 
> then we may see errors like
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 
> byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
> [...]
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
> 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
> at 
> io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
> at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
> [...]{code}
> SPARK-24989 is another report of this problem (but with a different proposed 
> fix).
> This problem can currently be mitigated by setting 
> {{spark.reducer.maxReqsInFlight}} to some non-IntMax value (SPARK-6166), 
> but this additional manual configuration step is cumbersome.
> Instead, I think that Spark should take these fixed overheads into account in 
> the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, 
> use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some 
> tricky details involved to make this work on all configurations (e.g. to use 
> a different minimum when direct buffers are disabled, etc.), but I think the 
> core idea behind the fix is pretty simple.
> This will improve Spark's stability and remove the configuration / tuning burden
> from end users.






[jira] [Assigned] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27991:


Assignee: (was: Apache Spark)

> ShuffleBlockFetcherIterator should take Netty constant-factor overheads into 
> account when limiting number of simultaneous block fetches
> ---
>
> Key: SPARK-27991
> URL: https://issues.apache.org/jira/browse/SPARK-27991
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> ShuffleBlockFetcherIterator has logic to limit the number of simultaneous 
> block fetches. By default, this logic tries to keep the number of outstanding 
> block fetches [beneath a data size 
> limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274]
>  ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads 
> into account: even though a remote block might be, say, 4KB, there are 
> certain fixed-size internal overheads due to Netty buffer sizes which may 
> cause the actual space requirements to be larger.
> As a result, if a map stage produces a huge number of extremely tiny blocks 
> then we may see errors like
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 
> byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
> [...]
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
> 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
> at 
> io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
> at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
> [...]{code}
> SPARK-24989 is another report of this problem (but with a different proposed 
> fix).
> This problem can currently be mitigated by setting 
> {{spark.reducer.maxReqsInFlight}} to some non-IntMax value (SPARK-6166), 
> but this additional manual configuration step is cumbersome.
> Instead, I think that Spark should take these fixed overheads into account in 
> the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, 
> use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some 
> tricky details involved to make this work on all configurations (e.g. to use 
> a different minimum when direct buffers are disabled, etc.), but I think the 
> core idea behind the fix is pretty simple.
> This will improve Spark's stability and remove the configuration / tuning burden
> from end users.






[jira] [Assigned] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27991:


Assignee: Apache Spark

> ShuffleBlockFetcherIterator should take Netty constant-factor overheads into 
> account when limiting number of simultaneous block fetches
> ---
>
> Key: SPARK-27991
> URL: https://issues.apache.org/jira/browse/SPARK-27991
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Major
>
> ShuffleBlockFetcherIterator has logic to limit the number of simultaneous 
> block fetches. By default, this logic tries to keep the number of outstanding 
> block fetches [beneath a data size 
> limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274]
>  ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads 
> into account: even though a remote block might be, say, 4KB, there are 
> certain fixed-size internal overheads due to Netty buffer sizes which may 
> cause the actual space requirements to be larger.
> As a result, if a map stage produces a huge number of extremely tiny blocks 
> then we may see errors like
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 
> byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
> [...]
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
> 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
> at 
> io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
> at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
> [...]{code}
> SPARK-24989 is another report of this problem (but with a different proposed 
> fix).
> This problem can currently be mitigated by setting 
> {{spark.reducer.maxReqsInFlight}} to some non-IntMax value (SPARK-6166), 
> but this additional manual configuration step is cumbersome.
> Instead, I think that Spark should take these fixed overheads into account in 
> the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, 
> use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some 
> tricky details involved to make this work on all configurations (e.g. to use 
> a different minimum when direct buffers are disabled, etc.), but I think the 
> core idea behind the fix is pretty simple.
> This will improve Spark's stability and remove the configuration / tuning burden
> from end users.






[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327109#comment-17327109
 ] 

Apache Spark commented on SPARK-35181:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32286

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
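For context, since the ticket body is empty: the codec can already be selected explicitly today, as the sketch below shows; the current default is lz4.

{code:scala}
// Opt into zstd for Spark's internal I/O compression (shuffle, broadcast, RDD blocks).
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.io.compression.codec", "zstd")
{code}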







[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327108#comment-17327108
 ] 

Apache Spark commented on SPARK-35181:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32286

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35181:


Assignee: (was: Apache Spark)

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35181:


Assignee: Apache Spark

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-04-21 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-35181:
-

 Summary: Use zstd for spark.io.compression.codec by default
 Key: SPARK-35181
 URL: https://issues.apache.org/jira/browse/SPARK-35181
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun









[jira] [Assigned] (SPARK-35180) Allow to build SparkR with SBT

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35180:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Allow to build SparkR with SBT
> --
>
> Key: SPARK-35180
> URL: https://issues.apache.org/jira/browse/SPARK-35180
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, SparkR can be built only with Maven.
> It would be helpful if we could build it with SBT as well.
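A purely hypothetical build.sbt-style sketch of the kind of task being asked for, reusing the existing R/install-dev.sh packaging script; the actual implementation would live in Spark's SBT build and may look quite different.

{code:scala}
// Hypothetical sketch only, not the actual implementation.
import scala.sys.process._

lazy val buildSparkR = taskKey[Unit]("Build the SparkR package")

buildSparkR := {
  val exitCode = Process(Seq("./R/install-dev.sh")).! // script already used by the Maven build
  if (exitCode != 0) sys.error(s"SparkR build failed with exit code $exitCode")
}
{code}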






[jira] [Commented] (SPARK-35180) Allow to build SparkR with SBT

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327093#comment-17327093
 ] 

Apache Spark commented on SPARK-35180:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32285

> Allow to build SparkR with SBT
> --
>
> Key: SPARK-35180
> URL: https://issues.apache.org/jira/browse/SPARK-35180
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, SparkR can be built only with Maven.
> It would be helpful if we could build it with SBT as well.






[jira] [Assigned] (SPARK-35180) Allow to build SparkR with SBT

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35180:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Allow to build SparkR with SBT
> --
>
> Key: SPARK-35180
> URL: https://issues.apache.org/jira/browse/SPARK-35180
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In the current master, SparkR can be built only with Maven.
> It would be helpful if we could build it with SBT as well.






[jira] [Created] (SPARK-35180) Allow to build SparkR with SBT

2021-04-21 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35180:
--

 Summary: Allow to build SparkR with SBT
 Key: SPARK-35180
 URL: https://issues.apache.org/jira/browse/SPARK-35180
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In the current master, SparkR can be built only with Maven.
It would be helpful if we could build it with SBT as well.






[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327088#comment-17327088
 ] 

Apache Spark commented on SPARK-35096:
--

User 'sandeep-katta' has created a pull request for this issue:
https://github.com/apache/spark/pull/32284

> foreachBatch throws ArrayIndexOutOfBoundsException if schema is case 
> Insensitive
> 
>
> Key: SPARK-35096
> URL: https://issues.apache.org/jira/browse/SPARK-35096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.1.2, 3.2.0
>
>
> The code below works fine before Spark 3, but running it on Spark 3 throws
> java.lang.ArrayIndexOutOfBoundsException:
> {code:java}
> val inputPath = "/Users/xyz/data/testcaseInsensitivity"
> val output_path = "/Users/xyz/output"
> spark.range(10).write.format("parquet").save(inputPath)
> def process_row(microBatch: DataFrame, batchId: Long): Unit = {
>   val df = microBatch.select($"ID".alias("other")) // Doesn't work
>   df.write.format("parquet").mode("append").save(output_path)
> }
> val schema = new StructType().add("id", LongType)
> val stream_df = 
> spark.readStream.schema(schema).format("parquet").load(inputPath)
> stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _)
>   .start().awaitTermination()
> {code}
> Stack Trace:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOpt

[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327089#comment-17327089
 ] 

Apache Spark commented on SPARK-35096:
--

User 'sandeep-katta' has created a pull request for this issue:
https://github.com/apache/spark/pull/32284

> foreachBatch throws ArrayIndexOutOfBoundsException if schema is case 
> Insensitive
> 
>
> Key: SPARK-35096
> URL: https://issues.apache.org/jira/browse/SPARK-35096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.1.2, 3.2.0
>
>
> The code below works fine before Spark 3, but running it on Spark 3 throws
> java.lang.ArrayIndexOutOfBoundsException:
> {code:java}
> val inputPath = "/Users/xyz/data/testcaseInsensitivity"
> val output_path = "/Users/xyz/output"
> spark.range(10).write.format("parquet").save(inputPath)
> def process_row(microBatch: DataFrame, batchId: Long): Unit = {
>   val df = microBatch.select($"ID".alias("other")) // Doesn't work
>   df.write.format("parquet").mode("append").save(output_path)
> }
> val schema = new StructType().add("id", LongType)
> val stream_df = 
> spark.readStream.schema(schema).format("parquet").load(inputPath)
> stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _)
>   .start().awaitTermination()
> {code}
> Stack Trace:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOpt

[jira] [Updated] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive

2021-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35096:
--
Fix Version/s: (was: 3.0.3)

> foreachBatch throws ArrayIndexOutOfBoundsException if schema is case 
> Insensitive
> 
>
> Key: SPARK-35096
> URL: https://issues.apache.org/jira/browse/SPARK-35096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.1.2, 3.2.0
>
>
> The code below works fine before Spark 3, but running it on Spark 3 throws
> java.lang.ArrayIndexOutOfBoundsException:
> {code:java}
> val inputPath = "/Users/xyz/data/testcaseInsensitivity"
> val output_path = "/Users/xyz/output"
> spark.range(10).write.format("parquet").save(inputPath)
> def process_row(microBatch: DataFrame, batchId: Long): Unit = {
>   val df = microBatch.select($"ID".alias("other")) // Doesn't work
>   df.write.format("parquet").mode("append").save(output_path)
> }
> val schema = new StructType().add("id", LongType)
> val stream_df = 
> spark.readStream.schema(schema).format("parquet").load(inputPath)
> stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _)
>   .start().awaitTermination()
> {code}
> Stack Trace:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
>   at scala.collection

[jira] [Commented] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327058#comment-17327058
 ] 

Apache Spark commented on SPARK-35177:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32281

> IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
> -
>
> Key: SPARK-35177
> URL: https://issues.apache.org/jira/browse/SPARK-35177
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Passing `INTERVAL '-178956970-8' YEAR TO MONTH` throws an exception.
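A self-contained repro sketch based on the one-line description above; -178956970 years and -8 months together equal Int.MinValue (-2147483648) months.

{code:scala}
// Currently this throws instead of producing an interval of Int.MinValue months.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("interval-repro").getOrCreate()
spark.sql("SELECT INTERVAL '-178956970-8' YEAR TO MONTH").show()
{code}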






[jira] [Assigned] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35177:


Assignee: (was: Apache Spark)

> IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
> -
>
> Key: SPARK-35177
> URL: https://issues.apache.org/jira/browse/SPARK-35177
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Passing `INTERVAL '-178956970-8' YEAR TO MONTH` throws an exception.






[jira] [Assigned] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35177:


Assignee: Apache Spark

> IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
> -
>
> Key: SPARK-35177
> URL: https://issues.apache.org/jira/browse/SPARK-35177
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Passing `INTERVAL '-178956970-8' YEAR TO MONTH` throws an exception.






[jira] [Commented] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327057#comment-17327057
 ] 

Apache Spark commented on SPARK-35177:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32281

> IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
> -
>
> Key: SPARK-35177
> URL: https://issues.apache.org/jira/browse/SPARK-35177
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Passing `INTERVAL '-178956970-8' YEAR TO MONTH` throws an exception.






[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327054#comment-17327054
 ] 

Apache Spark commented on SPARK-34674:
--

User 'kotlovs' has created a pull request for this issue:
https://github.com/apache/spark/pull/32283

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, the
> Spark driver process doesn't terminate even after its main method has completed.
> This behaviour is different from Spark on YARN, where stopping the SparkContext
> manually is not required.
>  It looks like the problem is caused by non-daemon threads, which prevent the
> driver JVM process from terminating.
>  At least I see two non-daemon threads if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.






[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327056#comment-17327056
 ] 

Apache Spark commented on SPARK-34674:
--

User 'kotlovs' has created a pull request for this issue:
https://github.com/apache/spark/pull/32283

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Priority: Major
>
> Hello!
>  I have run into a problem that if I don't call the method 
> sparkContext.stop() explicitly, then a Spark driver process doesn't terminate 
> even after its Main method has been completed. This behaviour is different 
> from spark on yarn, where the manual sparkContext stopping is not required.
>  It looks like, the problem is in using non-daemon threads, which prevent the 
> driver jvm process from terminating.
>  At least I see two non-daemon threads, if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> Docker image from the official release of spark-3.1.1 hadoop3.2 is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327048#comment-17327048
 ] 

Apache Spark commented on SPARK-35178:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32282

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0
>Reporter: Bruce Robbins
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35178.
---
Fix Version/s: 3.2.0
   3.1.2
   3.0.3
 Assignee: Sean R. Owen  (was: Bruce Robbins)
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/32277

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0
>Reporter: Bruce Robbins
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327046#comment-17327046
 ] 

Apache Spark commented on SPARK-35178:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32282

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31168) Upgrade Scala to 2.12.13

2021-04-21 Thread Jim Kleckner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327035#comment-17327035
 ] 

Jim Kleckner commented on SPARK-31168:
--

It appears that this fix [1] for 12038 was merged into Scala master [2] and has 
been released in Scala 2.13.5 [3], but it has not yet been released as Scala 2.12.14.

 

[1] [https://github.com/scala/scala/pull/9478]

[2] [https://github.com/scala/scala/pull/9495]

[3] [https://github.com/scala/scala/releases/tag/v2.13.5]

> Upgrade Scala to 2.12.13
> 
>
> Key: SPARK-31168
> URL: https://issues.apache.org/jira/browse/SPARK-31168
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. Highlights
>  * Performance improvements in the collections library: algorithmic 
> improvements and changes to avoid unnecessary allocations ([list of 
> PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+label%3Alibrary%3Acollections+label%3Aperformance])
>  * Performance improvements in the compiler ([list of 
> PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+-label%3Alibrary%3Acollections+label%3Aperformance+],
>  minor [effects in our 
> benchmarks|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1567985515850&to=1584355915694&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench@scalabench@])
>  * Improvements to {{-Yrepl-class-based}}, an alternative internal REPL 
> encoding that avoids deadlocks (details on 
> [#8712|https://github.com/scala/scala/pull/8712])
>  * A new {{-Yrepl-use-magic-imports}} flag that avoids deep class nesting in 
> the REPL, which can lead to deteriorating performance in long sessions 
> ([#8576|https://github.com/scala/scala/pull/8576])
>  * Fix some {{toX}} methods that could expose the underlying mutability of a 
> {{ListBuffer}}-generated collection 
> ([#8674|https://github.com/scala/scala/pull/8674])
> h3. JDK 9+ support
>  * ASM was upgraded to 7.3.1, allowing the optimizer to run on JDK 13+ 
> ([#8676|https://github.com/scala/scala/pull/8676])
>  * {{:javap}} in the REPL now works on JDK 9+ 
> ([#8400|https://github.com/scala/scala/pull/8400])
> h3. Other changes
>  * Support new labels for creating durations for consistency: 
> {{Duration("1m")}}, {{Duration("3 hrs")}} 
> ([#8325|https://github.com/scala/scala/pull/8325], 
> [#8450|https://github.com/scala/scala/pull/8450])
>  * Fix memory leak in runtime reflection's {{TypeTag}} caches 
> ([#8470|https://github.com/scala/scala/pull/8470]) and some thread safety 
> issues in runtime reflection 
> ([#8433|https://github.com/scala/scala/pull/8433])
>  * When using compiler plugins, the ordering of compiler phases may change 
> due to [#8427|https://github.com/scala/scala/pull/8427]
> For more details, see [https://github.com/scala/scala/releases/tag/v2.12.11].
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35110) Handle ANSI intervals in WindowExecBase

2021-04-21 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327034#comment-17327034
 ] 

jiaan.geng commented on SPARK-35110:


I'm working on it.

> Handle ANSI intervals in WindowExecBase
> ---
>
> Key: SPARK-35110
> URL: https://issues.apache.org/jira/browse/SPARK-35110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Handle YearMonthIntervalType and DayTimeIntervalType in createBoundOrdering():
> https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala#L97-L99



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35161) Safe version SQL functions

2021-04-21 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327032#comment-17327032
 ] 

jiaan.geng commented on SPARK-35161:


I see.

> Safe version SQL functions
> --
>
> Key: SPARK-35161
> URL: https://issues.apache.org/jira/browse/SPARK-35161
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Create new safe version SQL functions for existing SQL functions/operators, 
> which return NULL if an overflow or error occurs, so that:
> 1. Users can finish queries without interruption in ANSI mode.
> 2. Users can get NULLs instead of unreasonable results if overflow occurs 
> when ANSI mode is off.
> For example, the behavior of the following SQL operations is unreasonable:
> {code:java}
> 2147483647 + 2 => -2147483647
> CAST(2147483648L AS INT) => -2147483648
> {code}
> With the new safe version SQL functions:
> {code:java}
> TRY_ADD(2147483647, 2) => null
> TRY_CAST(2147483648L AS INT) => null
> {code}
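
Until the built-ins exist, a rough user-level approximation is possible. Below is a
minimal sketch using a Scala UDF; the name `try_add_int` and the local-mode session are
illustrative, not the proposed implementation.

{code:scala}
// User-level sketch approximating the proposed TRY_ADD semantics for Int inputs.
import org.apache.spark.sql.SparkSession

object TryAddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("try-add-sketch").getOrCreate()
    spark.udf.register("try_add_int", (a: Int, b: Int) =>
      try Some(Math.addExact(a, b))                 // exact addition, or...
      catch { case _: ArithmeticException => None } // ...NULL on overflow
    )
    spark.sql("SELECT try_add_int(2147483647, 2) AS result").show()  // result: null
    spark.stop()
  }
}
{code}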



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35117) UI progress bar no longer highlights in progress tasks

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327030#comment-17327030
 ] 

Apache Spark commented on SPARK-35117:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32281

> UI progress bar no longer highlights in progress tasks
> --
>
> Key: SPARK-35117
> URL: https://issues.apache.org/jira/browse/SPARK-35117
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.1.2, 3.2.0
>
>
> The Spark UI was updated to Bootstrap 4, and during the update the progress 
> bar in the UI was updated to highlight the whole bar once any tasks were in 
> progress, versus highlighting just the number of tasks that were in progress. 
> That was a great visual cue for seeing what percentage of the stage/job was 
> currently being worked on, and it'd be great to get that functionality back.
> The change can be found here: 
> https://github.com/apache/spark/pull/27370/files#diff-809c93c57cc59e5fe3c3eb54a24aa96a38147d02323f3e690ae6b5309a3284d2L448



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35078) Migrate to transformWithPruning or resolveWithPruning for expression rules

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327027#comment-17327027
 ] 

Apache Spark commented on SPARK-35078:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32280

> Migrate to transformWithPruning or resolveWithPruning for expression rules
> --
>
> Key: SPARK-35078
> URL: https://issues.apache.org/jira/browse/SPARK-35078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> E.g., rules in org/apache/spark/sql/catalyst/optimizer/expressions.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35078) Migrate to transformWithPruning or resolveWithPruning for expression rules

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35078:


Assignee: Apache Spark

> Migrate to transformWithPruning or resolveWithPruning for expression rules
> --
>
> Key: SPARK-35078
> URL: https://issues.apache.org/jira/browse/SPARK-35078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Apache Spark
>Priority: Major
>
> E.g., rules in org/apache/spark/sql/catalyst/optimizer/expressions.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35078) Migrate to transformWithPruning or resolveWithPruning for expression rules

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327029#comment-17327029
 ] 

Apache Spark commented on SPARK-35078:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32280

> Migrate to transformWithPruning or resolveWithPruning for expression rules
> --
>
> Key: SPARK-35078
> URL: https://issues.apache.org/jira/browse/SPARK-35078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> E.g., rules in org/apache/spark/sql/catalyst/optimizer/expressions.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35078) Migrate to transformWithPruning or resolveWithPruning for expression rules

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35078:


Assignee: (was: Apache Spark)

> Migrate to transformWithPruning or resolveWithPruning for expression rules
> --
>
> Key: SPARK-35078
> URL: https://issues.apache.org/jira/browse/SPARK-35078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> E.g., rules in org/apache/spark/sql/catalyst/optimizer/expressions.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32027) EventLoggingListener threw java.util.ConcurrentModificationException

2021-04-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32027.
--
Resolution: Duplicate
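
For context on the trace quoted below: the exception comes from iterating a
java.util.Properties (a Hashtable) while another thread mutates it. A standalone
sketch of that failure mode follows (not Spark code; the race is timing-dependent
and may not fire on every run).

{code:scala}
// Standalone sketch of the failure mode in the quoted trace: iterating a
// java.util.Properties while another thread mutates it.
object PropertiesCmeDemo {
  def main(args: Array[String]): Unit = {
    val props = new java.util.Properties()
    (1 to 100000).foreach(i => props.setProperty(s"k$i", "v"))
    val writer = new Thread(() => (1 to 100000).foreach(i => props.setProperty(s"extra$i", "x")))
    writer.start()
    try {
      val it = props.entrySet().iterator()
      while (it.hasNext) it.next()  // may throw ConcurrentModificationException
      println("no exception this run (the race is timing-dependent)")
    } catch {
      case e: java.util.ConcurrentModificationException => println(s"caught: $e")
    }
    writer.join()
  }
}
{code}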

> EventLoggingListener threw  java.util.ConcurrentModificationException
> -
>
> Key: SPARK-32027
> URL: https://issues.apache.org/jira/browse/SPARK-32027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> 20/06/18 20:22:25 ERROR AsyncEventQueue: Listener EventLoggingListener threw 
> an exception
> java.util.ConcurrentModificationException
>   at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.util.JsonProtocol$.mapToJson(JsonProtocol.scala:568)
>   at 
> org.apache.spark.util.JsonProtocol$.$anonfun$propertiesToJson$1(JsonProtocol.scala:574)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.util.JsonProtocol$.propertiesToJson(JsonProtocol.scala:573)
>   at 
> org.apache.spark.util.JsonProtocol$.jobStartToJson(JsonProtocol.scala:159)
>   at 
> org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:81)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:159)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
>   at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>   at 
> scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 20/06/18 20:22:25 ERROR AsyncEventQueue: Listener EventLoggingListener threw 
> an exception
> java.util.ConcurrentModificationException
>   at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.util.JsonProtocol$.mapToJson(JsonProtocol.scala:568)
>   at 
> org.apache.spark.util.JsonProtocol$.$anonfun$propertiesToJson$1(JsonProtocol.scala:574)

[jira] [Commented] (SPARK-34897) Support reconcile schemas based on index after nested column pruning

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327017#comment-17327017
 ] 

Apache Spark commented on SPARK-34897:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/32279

> Support reconcile schemas based on index after nested column pruning
> 
>
> Key: SPARK-34897
> URL: https://issues.apache.org/jira/browse/SPARK-34897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.sql(
>   """
> |CREATE TABLE `t1` (
> |  `_col0` INT,
> |  `_col1` STRING,
> |  `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>,
> |  `_col3` STRING)
> |USING orc
> |PARTITIONED BY (_col3)
> |""".stripMargin)
> spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')")
> spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show
> {code}
> Error message:
> {noformat}
> java.lang.AssertionError: assertion failed: The given data schema 
> struct<_col0:int,_col2:struct> has less fields than the actual ORC 
> physical schema, no idea which columns were dropped, fail to read. Try to 
> disable 
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34897) Support reconcile schemas based on index after nested column pruning

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327016#comment-17327016
 ] 

Apache Spark commented on SPARK-34897:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/32279

> Support reconcile schemas based on index after nested column pruning
> 
>
> Key: SPARK-34897
> URL: https://issues.apache.org/jira/browse/SPARK-34897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.sql(
>   """
> |CREATE TABLE `t1` (
> |  `_col0` INT,
> |  `_col1` STRING,
> |  `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>,
> |  `_col3` STRING)
> |USING orc
> |PARTITIONED BY (_col3)
> |""".stripMargin)
> spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')")
> spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show
> {code}
> Error message:
> {noformat}
> java.lang.AssertionError: assertion failed: The given data schema 
> struct<_col0:int,_col2:struct> has less fields than the actual ORC 
> physical schema, no idea which columns were dropped, fail to read. Try to 
> disable 
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34692) Support Not(Int) and Not(InSet) propagate null

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327010#comment-17327010
 ] 

Apache Spark commented on SPARK-34692:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/32278

> Support Not(Int) and Not(InSet) propagate null
> --
>
> Key: SPARK-34692
> URL: https://issues.apache.org/jira/browse/SPARK-34692
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> The semantics of `Not(In)` can be seen as `And(a != b, a != c)`, which 
> matches `NullIntolerant`.
> We already simplify a `NullIntolerant` expression to null if any of its 
> children is null, e.g. `a != null` => `null`. It's safe to do the same for 
> `Not(In)`/`Not(InSet)`.
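
A small illustration of the null propagation this relies on (standard three-valued
logic, shown from spark-shell where `spark` is predefined; a sketch, not the optimizer
change itself):

{code:scala}
// 1 <> 2 is true and 1 <> NULL is NULL, so the conjunction -- and hence NOT IN -- is NULL.
spark.sql("SELECT 1 NOT IN (2, CAST(NULL AS INT)) AS not_in_with_null").show()
// prints a single row whose value is null
{code}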



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34692) Support Not(Int) and Not(InSet) propagate null

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327009#comment-17327009
 ] 

Apache Spark commented on SPARK-34692:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/32278

> Support Not(Int) and Not(InSet) propagate null
> --
>
> Key: SPARK-34692
> URL: https://issues.apache.org/jira/browse/SPARK-34692
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> The semantics of `Not(In)` can be seen as `And(a != b, a != c)`, which 
> matches `NullIntolerant`.
> We already simplify a `NullIntolerant` expression to null if any of its 
> children is null, e.g. `a != null` => `null`. It's safe to do the same for 
> `Not(In)`/`Not(InSet)`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35179) Introduce hybrid join for sort merge join and shuffled hash join in AQE

2021-04-21 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326991#comment-17326991
 ] 

Cheng Su commented on SPARK-35179:
--

Thanks to [~cloud_fan] for the idea. Please comment or edit if this is not 
captured correctly.

> Introduce hybrid join for sort merge join and shuffled hash join in AQE
> ---
>
> Key: SPARK-35179
> URL: https://issues.apache.org/jira/browse/SPARK-35179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> Per discussion in 
> [https://github.com/apache/spark/pull/32210#issuecomment-823503243], we can 
> introduce some kind of {{HybridJoin}} operator in AQE and choose between 
> shuffled hash join and sort merge join for each task independently. For 
> example, based on partition size, task 1 can do a shuffled hash join while 
> task 2 does a sort merge join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32461) Shuffled hash join improvement

2021-04-21 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32461:
-
Affects Version/s: 3.2.0

> Shuffled hash join improvement
> --
>
> Key: SPARK-32461
> URL: https://issues.apache.org/jira/browse/SPARK-32461
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Cheng Su
>Priority: Major
>  Labels: release-notes
>
> Shuffled hash join avoids the sort that sort merge join requires. This advantage 
> shows up clearly when joining large tables, in terms of saving CPU and IO 
> (especially when an external sort would happen). In the latest master trunk, 
> shuffled hash join is disabled by default with the config 
> "spark.sql.join.preferSortMergeJoin"=true, in favor of reducing the risk of OOM. 
> However, shuffled hash join can be improved to a much better state (validated in 
> our internal fork). Creating this Jira to track overall progress.
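
For reference, the config mentioned above can be flipped per session; a minimal
illustration from spark-shell (where `spark` is predefined):

{code:scala}
// Let the planner consider shuffled hash join over sort merge join when its other
// conditions (e.g. build-side size) are met. Session-level, illustrative only.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
{code}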



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35179) Introduce hybrid join for sort merge join and shuffled hash join in AQE

2021-04-21 Thread Cheng Su (Jira)
Cheng Su created SPARK-35179:


 Summary: Introduce hybrid join for sort merge join and shuffled 
hash join in AQE
 Key: SPARK-35179
 URL: https://issues.apache.org/jira/browse/SPARK-35179
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


Per discussion in 
[https://github.com/apache/spark/pull/32210#issuecomment-823503243], we can 
introduce some kind of {{HybridJoin}} operator in AQE and choose between shuffled 
hash join and sort merge join for each task independently. For example, based on 
partition size, task 1 can do a shuffled hash join while task 2 does a sort merge 
join.
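
A toy sketch of the per-task decision described above. This is purely illustrative:
the names and the size-based rule are assumptions, not Spark's actual AQE code.

{code:scala}
// Toy model of choosing a join strategy per task from the build side's partition size.
sealed trait JoinStrategy
case object ShuffledHashJoin extends JoinStrategy
case object SortMergeJoin extends JoinStrategy

object HybridJoinSketch {
  def pick(buildSidePartitionBytes: Long, inMemoryThresholdBytes: Long): JoinStrategy =
    if (buildSidePartitionBytes <= inMemoryThresholdBytes) ShuffledHashJoin // hash map fits
    else SortMergeJoin                                                      // fall back to sorting

  def main(args: Array[String]): Unit = {
    val threshold = 64L * 1024 * 1024
    println(pick(8L * 1024 * 1024, threshold))    // ShuffledHashJoin
    println(pick(512L * 1024 * 1024, threshold))  // SortMergeJoin
  }
}
{code}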



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326924#comment-17326924
 ] 

Apache Spark commented on SPARK-35178:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/32277

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35178:


Assignee: Bruce Robbins  (was: Apache Spark)

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326921#comment-17326921
 ] 

Apache Spark commented on SPARK-35178:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/32277

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35178:


Assignee: Apache Spark  (was: Bruce Robbins)

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35178:


Affects Version/s: 2.4.7
   3.0.2
   3.1.1
 Assignee: Bruce Robbins

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326910#comment-17326910
 ] 

Bruce Robbins commented on SPARK-35178:
---

In INFRA-21767, Daniel Gruno responded:
{quote}
Please use this format instead:
https://www.apache.org/dyn/closer.lua/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz?action=download

that is, 
https://www.apache.org/dyn/closer.lua/path/to/file.tar.gz?action=download
{quote}

 

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Bruce Robbins
>Priority: Major
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326874#comment-17326874
 ] 

Bruce Robbins commented on SPARK-35178:
---

I also posted https://issues.apache.org/jira/browse/INFRA-21767. Maybe they 
have some insight.

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Bruce Robbins
>Priority: Major
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326770#comment-17326770
 ] 

Sean R. Owen commented on SPARK-35178:
--

I agree, it looks like the automatic redirector has changed behavior. It still 
sends you to an HTML page for the mirror, but previously that link would redirect 
straight to the download. While the script can fall back to archive.apache.org, it 
doesn't, because the HTML downloads successfully -- it just is not the distribution! 
Either we detect this, or we have to hack this more to get the mirror URL from the 
redirector and then attach the path to it.

> maven autodownload failing
> --
>
> Key: SPARK-35178
> URL: https://issues.apache.org/jira/browse/SPARK-35178
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Bruce Robbins
>Priority: Major
>
> I attempted to build a fresh clone of Spark using mvn (on two different 
> networks) and got this error:
> {noformat}
> exec: curl --silent --show-error -L 
> https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
> tar: Unrecognized archive format
> tar: Error exit delayed from previous errors.
> Using `mvn` from path: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
> build/mvn: line 126: 
> /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such 
> file or directory
> {noformat}
> if I change the mirror as below, the issue goes away:
> {noformat}
> -local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
> +local 
> APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35178) maven autodownload failing

2021-04-21 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-35178:
-

 Summary: maven autodownload failing
 Key: SPARK-35178
 URL: https://issues.apache.org/jira/browse/SPARK-35178
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.2.0
Reporter: Bruce Robbins


I attempted to build a fresh clone of Spark using mvn (on two different 
networks) and got this error:
{noformat}
exec: curl --silent --show-error -L 
https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.
Using `mvn` from path: 
/tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn
build/mvn: line 126: 
/tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such file 
or directory
{noformat}
if I change the mirror as below, the issue goes away:
{noformat}
-local 
APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}
+local 
APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'}
{noformat}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-04-21 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326716#comment-17326716
 ] 

L. C. Hsieh commented on SPARK-34198:
-

The major issue is the additional rocksdb dependency. For me, I'm not against 
it. But maybe others have strong preferences not to include it by default. I 
agree with [~kabhwan] that we may need to get a consensus from the community.

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS has only one built-in StateStore implementation, 
> HDFSBackedStateStore, which actually uses an in-memory map to store state rows. 
> As there are more and more streaming applications, some of them require large 
> state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large-state usage. But Spark SS still 
> lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore to 
> Spark SS. Given the concern about adding RocksDB as a direct dependency, our 
> plan is to add this StateStore as an external module first.
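
For reference, users would opt in to such an external module through the existing
provider-class config. A spark-shell style sketch follows; the RocksDB provider class
name below is hypothetical at this point.

{code:scala}
// Illustrative opt-in to a pluggable StateStore. The provider class name is
// hypothetical; the built-in default is HDFSBackedStateStoreProvider.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rocksdb-statestore-sketch")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")  // hypothetical
  .getOrCreate()
{code}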



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34692) Support Not(Int) and Not(InSet) propagate null

2021-04-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34692:
---

Assignee: ulysses you

> Support Not(Int) and Not(InSet) propagate null
> --
>
> Key: SPARK-34692
> URL: https://issues.apache.org/jira/browse/SPARK-34692
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> The semantics of `Not(In)` can be seen as `And(a != b, a != c)`, which 
> matches `NullIntolerant`.
> We already simplify a `NullIntolerant` expression to null if any of its 
> children is null, e.g. `a != null` => `null`. It's safe to do the same for 
> `Not(In)`/`Not(InSet)`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34692) Support Not(Int) and Not(InSet) propagate null

2021-04-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34692.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31797
[https://github.com/apache/spark/pull/31797]

> Support Not(Int) and Not(InSet) propagate null
> --
>
> Key: SPARK-34692
> URL: https://issues.apache.org/jira/browse/SPARK-34692
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> The semantics of `Not(In)` can be seen as `And(a != b, a != c)`, which 
> matches `NullIntolerant`.
> We already simplify a `NullIntolerant` expression to null if any of its 
> children is null, e.g. `a != null` => `null`. It's safe to do the same for 
> `Not(In)`/`Not(InSet)`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly

2021-04-21 Thread angerszhu (Jira)
angerszhu created SPARK-35177:
-

 Summary: IntervalUtils.fromYearMonthString can't handle 
Int.MinValue correctly
 Key: SPARK-35177
 URL: https://issues.apache.org/jira/browse/SPARK-35177
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


Passing `INTERVAL '-178956970-8' YEAR TO MONTH` throws an exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly

2021-04-21 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326643#comment-17326643
 ] 

angerszhu commented on SPARK-35177:
---

Will raise a PR soon.

> IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
> -
>
> Key: SPARK-35177
> URL: https://issues.apache.org/jira/browse/SPARK-35177
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Passing `INTERVAL '-178956970-8' YEAR TO MONTH` throws an exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32027) EventLoggingListener threw java.util.ConcurrentModificationException

2021-04-21 Thread Seulki jake Han (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326639#comment-17326639
 ] 

Seulki jake Han commented on SPARK-32027:
-

[~kristopherkane] Thank you. This problem is solved by SPARK-34731. This 
issue may be closed.

> EventLoggingListener threw  java.util.ConcurrentModificationException
> -
>
> Key: SPARK-32027
> URL: https://issues.apache.org/jira/browse/SPARK-32027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> 20/06/18 20:22:25 ERROR AsyncEventQueue: Listener EventLoggingListener threw 
> an exception
> java.util.ConcurrentModificationException
>   at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.util.JsonProtocol$.mapToJson(JsonProtocol.scala:568)
>   at 
> org.apache.spark.util.JsonProtocol$.$anonfun$propertiesToJson$1(JsonProtocol.scala:574)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.util.JsonProtocol$.propertiesToJson(JsonProtocol.scala:573)
>   at 
> org.apache.spark.util.JsonProtocol$.jobStartToJson(JsonProtocol.scala:159)
>   at 
> org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:81)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:159)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
>   at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>   at 
> scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 20/06/18 20:22:25 ERROR AsyncEventQueue: Listener EventLoggingListener threw 
> an exception
> java.util.ConcurrentModificationException
>   at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.util.JsonProtocol$.

[jira] [Created] (SPARK-35176) Raise TypeError in inappropriate type case rather than ValueError

2021-04-21 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-35176:
---

 Summary:  Raise TypeError in inappropriate type case rather than 
ValueError
 Key: SPARK-35176
 URL: https://issues.apache.org/jira/browse/SPARK-35176
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Yikun Jiang


There are many places where ValueError is used as the wrong error type.

When an operation or function is applied to an object of inappropriate type, we 
should use TypeError rather than ValueError.

such as:

[https://github.com/apache/spark/blob/355c39939d9e4c87ffc9538eb822a41cb2ff93fb/python/pyspark/sql/dataframe.py#L1137]

[https://github.com/apache/spark/blob/355c39939d9e4c87ffc9538eb822a41cb2ff93fb/python/pyspark/sql/dataframe.py#L1228]

 

We should make these corrections at an appropriate time; note that doing so will 
break existing code that catches the original ValueError.

 

[1] https://docs.python.org/3/library/exceptions.html#TypeError
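
A minimal sketch of the convention being asked for (the helper below is hypothetical, not actual PySpark code): reject the wrong kind of object with TypeError, and reject an unacceptable value of the right type with ValueError.

{code:python}
# Hypothetical helper illustrating the convention; not actual PySpark code.
def set_fraction(fraction):
    if not isinstance(fraction, float):
        # Inappropriate type of object -> TypeError
        raise TypeError("fraction must be a float, got %s" % type(fraction).__name__)
    if not 0.0 <= fraction <= 1.0:
        # Appropriate type but unacceptable value -> ValueError
        raise ValueError("fraction must be in [0.0, 1.0], got %r" % fraction)
    return fraction
{code}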



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35173) Support columns batch adding in PySpark.dataframe

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35173:


Assignee: Apache Spark

> Support columns batch adding in PySpark.dataframe
> -
>
> Key: SPARK-35173
> URL: https://issues.apache.org/jira/browse/SPARK-35173
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, PySpark's withColumn can only add a single column or replace an 
> existing column that has the same name, while the Scala side can add multiple 
> columns in one pass. [1]
>  
> Before this is added, the user can only call withColumn again and again, like:
>  
> {code:java}
> self.df.withColumn("key1", col("key1")).withColumn("key2", 
> col("key2")).withColumn("key3", col("key3")){code}
>  
> After this support is added, the user can use with_columns to complete batch 
> operations:
>  
> {code:java}
> self.df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), 
> col("key3")]){code}
>  
> [1] 
> [https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35173) Support columns batch adding in PySpark.dataframe

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35173:


Assignee: (was: Apache Spark)

> Support columns batch adding in PySpark.dataframe
> -
>
> Key: SPARK-35173
> URL: https://issues.apache.org/jira/browse/SPARK-35173
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> Currently, PySpark's withColumn can only add a single column or replace an 
> existing column that has the same name, while the Scala side can add multiple 
> columns in one pass. [1]
>  
> Before this is added, the user can only call withColumn again and again, like:
>  
> {code:java}
> self.df.withColumn("key1", col("key1")).withColumn("key2", 
> col("key2")).withColumn("key3", col("key3")){code}
>  
> After this support is added, the user can use with_columns to complete batch 
> operations:
>  
> {code:java}
> self.df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), 
> col("key3")]){code}
>  
> [1] 
> [https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35173) Support columns batch adding in PySpark.dataframe

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326574#comment-17326574
 ] 

Apache Spark commented on SPARK-35173:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32276

> Support columns batch adding in PySpark.dataframe
> -
>
> Key: SPARK-35173
> URL: https://issues.apache.org/jira/browse/SPARK-35173
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> Currently, PySpark's withColumn can only add a single column or replace an 
> existing column that has the same name, while the Scala side can add multiple 
> columns in one pass. [1]
>  
> Before this is added, the user can only call withColumn again and again, like:
>  
> {code:java}
> self.df.withColumn("key1", col("key1")).withColumn("key2", 
> col("key2")).withColumn("key3", col("key3")){code}
>  
> After this support is added, the user can use with_columns to complete batch 
> operations:
>  
> {code:java}
> self.df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), 
> col("key3")]){code}
>  
> [1] 
> [https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35142) `OneVsRest` classifier uses incorrect data type for `rawPrediction` column

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326518#comment-17326518
 ] 

Apache Spark commented on SPARK-35142:
--

User 'harupy' has created a pull request for this issue:
https://github.com/apache/spark/pull/32275

> `OneVsRest` classifier uses incorrect data type for `rawPrediction` column
> --
>
> Key: SPARK-35142
> URL: https://issues.apache.org/jira/browse/SPARK-35142
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0, 3.0.2, 3.1.0, 3.1.1
>Reporter: Harutaka Kawamura
>Priority: Major
>
> `OneVsRest` classifier uses an incorrect data type for the `rawPrediction` 
> column.
>  Code to reproduce the issue:
> {code:java}
> from pyspark.ml.classification import LogisticRegression, OneVsRest
> from pyspark.ml.linalg import Vectors
> from pyspark.sql import SparkSession
> from sklearn.datasets import load_iris
> spark = SparkSession.builder.getOrCreate()
> X, y = load_iris(return_X_y=True)
> df = spark.createDataFrame(
>  [(Vectors.dense(features), int(label)) for features, label in zip(X, y)], 
> ["features", "label"]
> )
> train, test = df.randomSplit([0.8, 0.2])
> lor = LogisticRegression(maxIter=5)
> ovr = OneVsRest(classifier=lor)
> ovrModel = ovr.fit(train)
> pred = ovrModel.transform(test)
> pred.printSchema()
> # This prints out:
> # root
> #  |-- features: vector (nullable = true)
> #  |-- label: long (nullable = true)
> #  |-- rawPrediction: string (nullable = true)  # <- should not be string
> #  |-- prediction: double (nullable = true)
> # pred.show()  # this fails because of the incorrect datatype{code}
> I ran the code above using GitHub Actions:
> [https://github.com/harupy/SPARK-35142/pull/1]
>  
> It looks like the UDF to compute the `rawPrediction` column is generated 
> without specifying the return type:
>  
> [https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/python/pyspark/ml/classification.py#L3154]
> {code:java}
> rawPredictionUDF = udf(func)
> {code}
>  
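
For context, a minimal sketch (assumed names, not the actual OneVsRest code) of how the declared return type of a Python UDF drives the resulting column's data type; pyspark.sql.functions.udf defaults to StringType when no returnType is given, which is consistent with the string rawPrediction column shown above.

{code:python}
# Sketch with assumed names, not the actual OneVsRest code: the declared
# return type of a Python UDF determines the column's data type, and udf()
# defaults to StringType when returnType is omitted.
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

to_vector_untyped = udf(lambda xs: Vectors.dense(xs))               # yields a string column
to_vector_typed = udf(lambda xs: Vectors.dense(xs), VectorUDT())    # yields a vector column
{code}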



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35142) `OneVsRest` classifier uses incorrect data type for `rawPrediction` column

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326517#comment-17326517
 ] 

Apache Spark commented on SPARK-35142:
--

User 'harupy' has created a pull request for this issue:
https://github.com/apache/spark/pull/32275

> `OneVsRest` classifier uses incorrect data type for `rawPrediction` column
> --
>
> Key: SPARK-35142
> URL: https://issues.apache.org/jira/browse/SPARK-35142
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0, 3.0.2, 3.1.0, 3.1.1
>Reporter: Harutaka Kawamura
>Priority: Major
>
> `OneVsRest` classifier uses an incorrect data type for the `rawPrediction` 
> column.
>  Code to reproduce the issue:
> {code:java}
> from pyspark.ml.classification import LogisticRegression, OneVsRest
> from pyspark.ml.linalg import Vectors
> from pyspark.sql import SparkSession
> from sklearn.datasets import load_iris
> spark = SparkSession.builder.getOrCreate()
> X, y = load_iris(return_X_y=True)
> df = spark.createDataFrame(
>  [(Vectors.dense(features), int(label)) for features, label in zip(X, y)], 
> ["features", "label"]
> )
> train, test = df.randomSplit([0.8, 0.2])
> lor = LogisticRegression(maxIter=5)
> ovr = OneVsRest(classifier=lor)
> ovrModel = ovr.fit(train)
> pred = ovrModel.transform(test)
> pred.printSchema()
> # This prints out:
> # root
> #  |-- features: vector (nullable = true)
> #  |-- label: long (nullable = true)
> #  |-- rawPrediction: string (nullable = true)  # <- should not be string
> #  |-- prediction: double (nullable = true)
> # pred.show()  # this fails because of the incorrect datatype{code}
> I ran the code above using GitHub Actions:
> [https://github.com/harupy/SPARK-35142/pull/1]
>  
> It looks like the UDF to compute the `rawPrediction` column is generated 
> without specifying the return type:
>  
> [https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/python/pyspark/ml/classification.py#L3154]
> {code:java}
> rawPredictionUDF = udf(func)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35175) Add linter for JavaScript source files

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35175:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Add linter for JavaScript source files
> --
>
> Key: SPARK-35175
> URL: https://issues.apache.org/jira/browse/SPARK-35175
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> In the current master, there is no linter for JavaScript sources.
> Let's add it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35175) Add linter for JavaScript source files

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326507#comment-17326507
 ] 

Apache Spark commented on SPARK-35175:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32274

> Add linter for JavaScript source files
> --
>
> Key: SPARK-35175
> URL: https://issues.apache.org/jira/browse/SPARK-35175
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> In the current master, there is no linter for JavaScript sources.
> Let's add it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35175) Add linter for JavaScript source files

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35175:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Add linter for JavaScript source files
> --
>
> Key: SPARK-35175
> URL: https://issues.apache.org/jira/browse/SPARK-35175
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> In the current master, there is no linter for JavaScript sources.
> Let's add it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35175) Add linter for JavaScript source files

2021-04-21 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35175:
---
Summary: Add linter for JavaScript source files  (was: Add linter for 
JavaScript sources)

> Add linter for JavaScript source files
> --
>
> Key: SPARK-35175
> URL: https://issues.apache.org/jira/browse/SPARK-35175
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> In the current master, there is no linter for JavaScript sources.
> Let's add it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35175) Add linter for JavaScript sources

2021-04-21 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35175:
--

 Summary: Add linter for JavaScript sources
 Key: SPARK-35175
 URL: https://issues.apache.org/jira/browse/SPARK-35175
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In the current master, there is no linter for JavaScript sources.
Let's add it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35143) Add default log config for spark-sql

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326498#comment-17326498
 ] 

Apache Spark commented on SPARK-35143:
--

User 'ChenDou2021' has created a pull request for this issue:
https://github.com/apache/spark/pull/32246

> Add default log config for spark-sql
> 
>
> Key: SPARK-35143
> URL: https://issues.apache.org/jira/browse/SPARK-35143
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Minor
>
> The default log level for spark-sql is WARN. It is confusing how to change the 
> log level, so we need a default config.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35143) Add default log config for spark-sql

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326496#comment-17326496
 ] 

Apache Spark commented on SPARK-35143:
--

User 'ChenDou2021' has created a pull request for this issue:
https://github.com/apache/spark/pull/32254

> Add default log config for spark-sql
> 
>
> Key: SPARK-35143
> URL: https://issues.apache.org/jira/browse/SPARK-35143
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Minor
>
> The default log level for spark-sql is WARN. It is confusing how to change the 
> log level, so we need a default config.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35143) Add default log config for spark-sql

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326495#comment-17326495
 ] 

Apache Spark commented on SPARK-35143:
--

User 'ChenDou2021' has created a pull request for this issue:
https://github.com/apache/spark/pull/32273

> Add default log config for spark-sql
> 
>
> Key: SPARK-35143
> URL: https://issues.apache.org/jira/browse/SPARK-35143
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Minor
>
> The default log level for spark-sql is WARN. It is confusing how to change the 
> log level, so we need a default config.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35140) Establish error message guidelines

2021-04-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35140:


Assignee: Karen Feng

> Establish error message guidelines
> --
>
> Key: SPARK-35140
> URL: https://issues.apache.org/jira/browse/SPARK-35140
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Assignee: Karen Feng
>Priority: Major
>
> In the SPIP: Standardize Exception Messages in Spark, there are three major 
> improvements proposed:
> # Group error messages in dedicated files.
> # Establish an error message guideline for developers.
> # Improve error message quality.
> The second step is to establish the error message guideline. This was 
> discussed in 
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Build-error-message-guideline-td31076.html
>  and added to the website in 
> https://github.com/apache/spark-website/pull/332. To increase visibility, the 
> guidelines should be accessible from the PR template.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35140) Establish error message guidelines

2021-04-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35140.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32241
[https://github.com/apache/spark/pull/32241]

> Establish error message guidelines
> --
>
> Key: SPARK-35140
> URL: https://issues.apache.org/jira/browse/SPARK-35140
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Assignee: Karen Feng
>Priority: Major
> Fix For: 3.2.0
>
>
> In the SPIP: Standardize Exception Messages in Spark, there are three major 
> improvements proposed:
> # Group error messages in dedicated files.
> # Establish an error message guideline for developers.
> # Improve error message quality.
> The second step is to establish the error message guideline. This was 
> discussed in 
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Build-error-message-guideline-td31076.html
>  and added to the website in 
> https://github.com/apache/spark-website/pull/332. To increase visibility, the 
> guidelines should be accessible from the PR template.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35154) Rpc env not shutdown when shutdown method call by endpoint onStop

2021-04-21 Thread LIU (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LIU updated SPARK-35154:

Description: 
When I run this code, the RPC thread hangs and does not close gracefully. I 
think that when the RPC thread calls shutdown from the onStop method, it tries 
to put MessageLoop.PoisonPill into the queue so that the threads in the RPC pool 
return and stop. In Spark 3.x this makes the other threads return and stop, but 
the current thread, which called onStop, waits for its own pool to stop, so the 
current thread never stops and the program hangs.

I'm not sure whether this needs to be improved or not.

 
{code:java}
test("Rpc env not shutdown when shutdown method call by endpoint onStop") {
  val rpcEndpoint = new RpcEndpoint {
    override val rpcEnv: RpcEnv = env

    override def onStop(): Unit = {
      env.shutdown()
      env.awaitTermination()
    }

    override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
      case m => context.reply(m)
    }
  }
  env.setupEndpoint("test", rpcEndpoint)
  rpcEndpoint.stop()
  env.awaitTermination()
}
{code}

> Rpc env not shutdown when shutdown method call by endpoint onStop
> -
>
> Key: SPARK-35154
> URL: https://issues.apache.org/jira/browse/SPARK-35154
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: spark-3.x
>Reporter: LIU
>Priority: Minor
>
> When I run this code, the RPC thread hangs and does not close gracefully. I 
> think that when the RPC thread calls shutdown from the onStop method, it tries 
> to put MessageLoop.PoisonPill into the queue so that the threads in the RPC 
> pool return and stop. In Spark 3.x this makes the other threads return and 
> stop, but the current thread, which called onStop, waits for its own pool to 
> stop, so the current thread never stops and the program hangs.
> I'm not sure whether this needs to be improved or not.
>  
> {code:java}
> test("Rpc env not shutdown when shutdown method call by endpoint onStop") {
>   val rpcEndpoint = new RpcEndpoint {
>     override val rpcEnv: RpcEnv = env
> 
>     override def onStop(): Unit = {
>       env.shutdown()
>       env.awaitTermination()
>     }
> 
>     override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
>       case m => context.reply(m)
>     }
>   }
>   env.setupEndpoint("test", rpcEndpoint)
>   rpcEndpoint.stop()
>   env.awaitTermination()
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35154) Rpc env not shutdown when shutdown method call by endpoint onStop

2021-04-21 Thread LIU (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LIU updated SPARK-35154:

Issue Type: Improvement  (was: Bug)

> Rpc env not shutdown when shutdown method call by endpoint onStop
> -
>
> Key: SPARK-35154
> URL: https://issues.apache.org/jira/browse/SPARK-35154
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: spark-3.x
>Reporter: LIU
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35154) Rpc env not shutdown when shutdown method call by endpoint onStop

2021-04-21 Thread LIU (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LIU updated SPARK-35154:

Priority: Minor  (was: Major)

> Rpc env not shutdown when shutdown method call by endpoint onStop
> -
>
> Key: SPARK-35154
> URL: https://issues.apache.org/jira/browse/SPARK-35154
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: spark-3.x
>Reporter: LIU
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35154) Rpc env not shutdown when shutdown method call by endpoint onStop

2021-04-21 Thread LIU (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LIU updated SPARK-35154:

Description: (was: When I run this code, the RPC thread hangs and does not 
close gracefully. I think that when the RPC thread calls shutdown from the 
onStop method, it tries to put MessageLoop.PoisonPill into the queue so that 
the threads in the RPC pool return and stop. In Spark 3.x this makes the other 
threads return and stop, but the current thread, which called onStop, waits for 
its own pool to stop, so the current thread never stops and the program hangs.

I'm not sure whether this needs to be improved or not.

 
{code:java}
test("Rpc env not shutdown when shutdown method call by endpoint onStop") {
  val rpcEndpoint = new RpcEndpoint {
    override val rpcEnv: RpcEnv = env

    override def onStop(): Unit = {
      env.shutdown()
      env.awaitTermination()
    }

    override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
      case m => context.reply(m)
    }
  }
  env.setupEndpoint("test", rpcEndpoint)
  rpcEndpoint.stop()
  env.awaitTermination()
}
{code}
 )

> Rpc env not shutdown when shutdown method call by endpoint onStop
> -
>
> Key: SPARK-35154
> URL: https://issues.apache.org/jira/browse/SPARK-35154
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: spark-3.x
>Reporter: LIU
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-04-21 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326460#comment-17326460
 ] 

Jungtaek Lim commented on SPARK-34198:
--

Please note that the decision was made through community discussion. If you need 
to change it, please bring it to the community with a rationale.

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which actually uses an in-memory map to store state 
> rows. As there are more and more streaming applications, some of them require 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. Given the concern about adding RocksDB as a direct dependency, 
> our plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35171) Declare the markdown package as a dependency of the SparkR package

2021-04-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35171.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32270
[https://github.com/apache/spark/pull/32270]

> Declare the markdown package as a dependency of the SparkR package
> --
>
> Key: SPARK-35171
> URL: https://issues.apache.org/jira/browse/SPARK-35171
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.2.0
>
>
> If pandoc is not installed locally, packaging via make-distribution will fail 
> with the following message:
> {quote}
> --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown
> Warning in engine$weave(file, quiet = quiet, encoding = enc) :
>   Pandoc (>= 1.12.3) not available. Falling back to R Markdown v1.
> Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
> The 'markdown' package should be declared as a dependency of the 'SparkR' 
> package (e.g., in the  'Suggests' field of DESCRIPTION), because the latter 
> contains vignette(s) built with the 'markdown' package. Please see 
> https://github.com/yihui/knitr/issues/1864 for more information.
> --- failed re-building ‘sparkr-vignettes.Rmd’
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35171) Declare the markdown package as a dependency of the SparkR package

2021-04-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35171:


Assignee: Yuanjian Li

> Declare the markdown package as a dependency of the SparkR package
> --
>
> Key: SPARK-35171
> URL: https://issues.apache.org/jira/browse/SPARK-35171
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> If pandoc is not installed locally, packaging via make-distribution will fail 
> with the following message:
> {quote}
> --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown
> Warning in engine$weave(file, quiet = quiet, encoding = enc) :
>   Pandoc (>= 1.12.3) not available. Falling back to R Markdown v1.
> Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
> The 'markdown' package should be declared as a dependency of the 'SparkR' 
> package (e.g., in the  'Suggests' field of DESCRIPTION), because the latter 
> contains vignette(s) built with the 'markdown' package. Please see 
> https://github.com/yihui/knitr/issues/1864 for more information.
> --- failed re-building ‘sparkr-vignettes.Rmd’
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35126) Execute jdbc cancellation method when jdbc load job is interrupted

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35126:


Assignee: (was: Apache Spark)

> Execute jdbc cancellation method when jdbc load job is interrupted
> --
>
> Key: SPARK-35126
> URL: https://issues.apache.org/jira/browse/SPARK-35126
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Environment version:
>  * spark3.1.1
>  * jdk1.8.201
>  * scala2.12
>  * mysql5.7.31
>  * mysql-connector-java-5.1.32.jar /mysql-connector-java-8.0.32.jar
>Reporter: zhangrenhua
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> I have a long-running Spark service that continuously receives and runs Spark 
> programs submitted by clients. One of these programs loads a JDBC table whose 
> query SQL is very complicated, so each execution takes a lot of time and 
> resources. The client may interrupt such a job at any time. I found that after 
> the job was interrupted, the database SELECT process was still executing and 
> had not been killed.
>  
> *Scene demonstration:*
> 1. Prepare two tables: SPARK_TEST1/SPARK_TEST2 (each of which has 1000 
> records)
> 2. Test code
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import java.util.concurrent.TimeUnit;
> /**
>  * jdbc load cancel test
>  *
>  * @author gavin
>  * @create 2021/4/18 10:58
>  */
> public class JdbcLoadCancelTest {
> public static void main(String[] args) throws Exception {
> final SparkConf sparkConf = new SparkConf();
> sparkConf.setAppName("jdbc load test");
> sparkConf.setMaster("local[*]");
> final SparkContext sparkContext = new SparkContext(sparkConf);
> final SparkSession sparkSession = new SparkSession(sparkContext);
> // This is a sql that takes about a minute to execute
> String querySql = "select t1.*\n" +
> "from SPARK_TEST1 t1\n" +
> "left join SPARK_TEST1 t2 on 1=1\n" +
> "left join (select aa from SPARK_TEST1 limit 3) t3  on 1=1";
> // Specify job information
> final String jobGroup = "test";
> sparkContext.clearJobGroup();
> sparkContext.setJobGroup(jobGroup, "test", true);
> // Start the independent thread to start the jdbc load test logic
> new Thread(() -> {
> final Dataset table = sparkSession.read()
> 
> .format("org.apache.spark.sql.execution.datasources.jdbc3")
> .option("url", 
> "jdbc:mysql://192.168.10.226:32320/test?useUnicode=true&characterEncoding=utf-8&useSSL=false")
> .option("user", "root")
> .option("password", "123456")
> .option("query", querySql)
> .load();
> // Print the first data
> System.out.println(table.limit(1).first());
> }).start();
> // Wait for the jdbc load job to start
> TimeUnit.SECONDS.sleep(10);
> // Cancel the job just now
> sparkContext.cancelJobGroup(jobGroup);
> // Simulate a long-running service without stopping the driver 
> process, which is used to wait for new jobs to be received
> TimeUnit.SECONDS.sleep(Integer.MAX_VALUE);
> }
> }
> {code}
>  
> 3. View the mysql process
> {code:java}
> select * from information_schema.`PROCESSLIST` where info is not null;{code}
> Ten seconds after the program started, the job was interrupted, but the 
> database query process had still not been killed.
>  
>  
>  
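
A condensed PySpark sketch of the same cancellation trigger (connection settings and query are placeholders); the ask in this ticket is that such a cancellation should also reach the database by invoking the JDBC driver's cancel call, presumably java.sql.Statement#cancel, instead of leaving the SELECT running.

{code:python}
# Condensed PySpark sketch of the report's scenario; URL, credentials and query
# are placeholders. Cancelling the job group interrupts the Spark job, but
# today the database-side query keeps running.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("jdbc load test").getOrCreate()
sc = spark.sparkContext

sc.setJobGroup("test", "jdbc load test", interruptOnCancel=True)

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/test")   # placeholder connection settings
      .option("user", "root")
      .option("password", "...")
      .option("query", "select ...")                  # a long-running query
      .load())

# While df.limit(1).first() is running in another thread, the client cancels:
sc.cancelJobGroup("test")
{code}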



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35126) Execute jdbc cancellation method when jdbc load job is interrupted

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35126:


Assignee: Apache Spark

> Execute jdbc cancellation method when jdbc load job is interrupted
> --
>
> Key: SPARK-35126
> URL: https://issues.apache.org/jira/browse/SPARK-35126
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Environment version:
>  * spark3.1.1
>  * jdk1.8.201
>  * scala2.12
>  * mysql5.7.31
>  * mysql-connector-java-5.1.32.jar /mysql-connector-java-8.0.32.jar
>Reporter: zhangrenhua
>Assignee: Apache Spark
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> I have a long-running Spark service that continuously receives and runs Spark 
> programs submitted by clients. One of these programs loads a JDBC table whose 
> query SQL is very complicated, so each execution takes a lot of time and 
> resources. The client may interrupt such a job at any time. I found that after 
> the job was interrupted, the database SELECT process was still executing and 
> had not been killed.
>  
> *Scene demonstration:*
> 1. Prepare two tables: SPARK_TEST1/SPARK_TEST2 (each of which has 1000 
> records)
> 2. Test code
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import java.util.concurrent.TimeUnit;
> /**
>  * jdbc load cancel test
>  *
>  * @author gavin
>  * @create 2021/4/18 10:58
>  */
> public class JdbcLoadCancelTest {
> public static void main(String[] args) throws Exception {
> final SparkConf sparkConf = new SparkConf();
> sparkConf.setAppName("jdbc load test");
> sparkConf.setMaster("local[*]");
> final SparkContext sparkContext = new SparkContext(sparkConf);
> final SparkSession sparkSession = new SparkSession(sparkContext);
> // This is a sql that takes about a minute to execute
> String querySql = "select t1.*\n" +
> "from SPARK_TEST1 t1\n" +
> "left join SPARK_TEST1 t2 on 1=1\n" +
> "left join (select aa from SPARK_TEST1 limit 3) t3  on 1=1";
> // Specify job information
> final String jobGroup = "test";
> sparkContext.clearJobGroup();
> sparkContext.setJobGroup(jobGroup, "test", true);
> // Start the independent thread to start the jdbc load test logic
> new Thread(() -> {
> final Dataset table = sparkSession.read()
> 
> .format("org.apache.spark.sql.execution.datasources.jdbc3")
> .option("url", 
> "jdbc:mysql://192.168.10.226:32320/test?useUnicode=true&characterEncoding=utf-8&useSSL=false")
> .option("user", "root")
> .option("password", "123456")
> .option("query", querySql)
> .load();
> // Print the first data
> System.out.println(table.limit(1).first());
> }).start();
> // Wait for the jdbc load job to start
> TimeUnit.SECONDS.sleep(10);
> // Cancel the job just now
> sparkContext.cancelJobGroup(jobGroup);
> // Simulate a long-running service without stopping the driver 
> process, which is used to wait for new jobs to be received
> TimeUnit.SECONDS.sleep(Integer.MAX_VALUE);
> }
> }
> {code}
>  
> 3. View the mysql process
> {code:java}
> select * from information_schema.`PROCESSLIST` where info is not null;{code}
> Ten seconds after the program started, the job was interrupted, but the 
> database query process had still not been killed.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35174) Avoid opening watch when waitAppCompletion is false

2021-04-21 Thread Jonathan Lafleche (Jira)
Jonathan Lafleche created SPARK-35174:
-

 Summary: Avoid opening watch when waitAppCompletion is false
 Key: SPARK-35174
 URL: https://issues.apache.org/jira/browse/SPARK-35174
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.1
Reporter: Jonathan Lafleche


In spark-submit, we currently [open a pod watch for any spark 
submission|https://github.com/apache/spark/blame/0494dc90af48ce7da0625485a4dc6917a244d580/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L150-L167].
 If WAIT_FOR_APP_COMPLETION is false, we then immediately ignore the result of 
the watcher and break out of the watcher.

When submitting spark applications at scale, this is a source of operational 
pain, since opening the watch relies on opening a websocket, which tends to run 
into subtle networking issues around negotiating the websocket connection.

I'd like to change this behaviour so that we eagerly check whether we are 
waiting on app completion, and avoid opening the watch altogether when 
WAIT_FOR_APP_COMPLETION is false.

Would you accept a contribution for that change, or are there any concerns I've 
overlooked?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-04-21 Thread Yuanjian Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326427#comment-17326427
 ] 

Yuanjian Li commented on SPARK-34198:
-

[~viirya] Since the RocksDBStateStore can address the major drawbacks of the 
current HDFS-based one, I think it's a better choice to add it directly as a 
built-in RocksDBStateStoreProvider. It's also more convenient for end users to 
choose it directly. If you agree, I will change the description and the title 
of this ticket.
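
Either way, end users would opt in through the existing state store provider config; a sketch follows (the RocksDB provider class name below is hypothetical and would only exist once this work lands):

{code:python}
# Sketch: selecting the StateStore implementation for streaming queries via the
# existing provider-class config. The RocksDB class name here is hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("statestore-provider-sketch")
         .config("spark.sql.streaming.stateStore.providerClass",
                 "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
         .getOrCreate())
{code}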

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which actually uses an in-memory map to store state 
> rows. As there are more and more streaming applications, some of them require 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. Given the concern about adding RocksDB as a direct dependency, 
> our plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35143) Add default log config for spark-sql

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326416#comment-17326416
 ] 

Apache Spark commented on SPARK-35143:
--

User 'ChenDou2021' has created a pull request for this issue:
https://github.com/apache/spark/pull/32273

> Add default log config for spark-sql
> 
>
> Key: SPARK-35143
> URL: https://issues.apache.org/jira/browse/SPARK-35143
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Minor
>
> The default log level for spark-sql is WARN. It is confusing how to change the 
> log level, so we need a default config.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35172) The implementation of RocksDBCheckpointMetadata

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326411#comment-17326411
 ] 

Apache Spark commented on SPARK-35172:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/32272

> The implementation of RocksDBCheckpointMetadata
> ---
>
> Key: SPARK-35172
> URL: https://issues.apache.org/jira/browse/SPARK-35172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The RocksDBCheckpointMetadata persists the metadata for each committed batch 
> in JSON format. The object contains all RocksDB file names and the number of 
> total keys.
>  The metadata binds closely with the directory structure of 
> RocksDBFileManager, as described in the design doc - [Directory Structure and 
> Format for Files stored in 
> DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2]
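
To make the shape concrete, a purely illustrative sketch of such a per-batch JSON object (field names are assumptions, not taken from the design doc):

{code:python}
# Purely illustrative sketch of the per-batch checkpoint metadata described
# above; the field names are assumptions, not the actual schema.
import json

checkpoint_metadata = {
    "sstFiles": ["000008.sst", "000010.sst"],  # RocksDB file names in this checkpoint
    "numKeys": 54321,                          # total number of keys in the state store
}
print(json.dumps(checkpoint_metadata, indent=2))
{code}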



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35172) The implementation of RocksDBCheckpointMetadata

2021-04-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326410#comment-17326410
 ] 

Apache Spark commented on SPARK-35172:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/32272

> The implementation of RocksDBCheckpointMetadata
> ---
>
> Key: SPARK-35172
> URL: https://issues.apache.org/jira/browse/SPARK-35172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The RocksDBCheckpointMetadata persists the metadata for each committed batch 
> in JSON format. The object contains all RocksDB file names and the number of 
> total keys.
>  The metadata binds closely with the directory structure of 
> RocksDBFileManager, as described in the design doc - [Directory Structure and 
> Format for Files stored in 
> DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35172) The implementation of RocksDBCheckpointMetadata

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35172:


Assignee: Apache Spark

> The implementation of RocksDBCheckpointMetadata
> ---
>
> Key: SPARK-35172
> URL: https://issues.apache.org/jira/browse/SPARK-35172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> The RocksDBCheckpointMetadata persists the metadata for each committed batch 
> in JSON format. The object contains all RocksDB file names and the number of 
> total keys.
>  The metadata binds closely with the directory structure of 
> RocksDBFileManager, as described in the design doc - [Directory Structure and 
> Format for Files stored in 
> DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35172) The implementation of RocksDBCheckpointMetadata

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35172:


Assignee: (was: Apache Spark)

> The implementation of RocksDBCheckpointMetadata
> ---
>
> Key: SPARK-35172
> URL: https://issues.apache.org/jira/browse/SPARK-35172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The RocksDBCheckpointMetadata persists the metadata for each committed batch 
> in JSON format. The object contains all RocksDB file names and the number of 
> total keys.
>  The metadata binds closely with the directory structure of 
> RocksDBFileManager, as described in the design doc - [Directory Structure and 
> Format for Files stored in 
> DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35172) The implementation of RocksDBCheckpointMetadata

2021-04-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35172:


Assignee: Apache Spark

> The implementation of RocksDBCheckpointMetadata
> ---
>
> Key: SPARK-35172
> URL: https://issues.apache.org/jira/browse/SPARK-35172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> The RocksDBCheckpointMetadata persists the metadata for each committed batch 
> in JSON format. The object contains all RocksDB file names and the number of 
> total keys.
>  The metadata binds closely with the directory structure of 
> RocksDBFileManager, as described in the design doc - [Directory Structure and 
> Format for Files stored in 
> DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35172) The implementation of RocksDBCheckpointMetadata

2021-04-21 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-35172:

Summary: The implementation of RocksDBCheckpointMetadata  (was: 
Implementation for RocksDBCheckpointMetadata)

> The implementation of RocksDBCheckpointMetadata
> ---
>
> Key: SPARK-35172
> URL: https://issues.apache.org/jira/browse/SPARK-35172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The RocksDBCheckpointMetadata persists the metadata for each committed batch 
> in JSON format. The object contains all RocksDB file names and the number of 
> total keys.
>  The metadata binds closely with the directory structure of 
> RocksDBFileManager, as described in the design doc - [Directory Structure and 
> Format for Files stored in 
> DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35172) Implementation for RocksDBCheckpointMetadata

2021-04-21 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-35172:

Description: 
The RocksDBCheckpointMetadata persists the metadata for each committed batch in 
JSON format. The object contains all RocksDB file names and the number of total 
keys.
 The metadata binds closely with the directory structure of RocksDBFileManager, 
as described in the design doc - [Directory Structure and Format for Files 
stored in 
DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2]

  was:
The RocksDBCheckpointMetadata persists the metadata for each committed batch in 
JSON format. The schema for the object contains all RocksDB file names and the 
number of total keys.
The metadata binds closely with the directory structure of RocksDBFileManager, 
as described in the design doc - [Directory Structure and Format for Files 
stored in 
DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2]


> Implementation for RocksDBCheckpointMetadata
> 
>
> Key: SPARK-35172
> URL: https://issues.apache.org/jira/browse/SPARK-35172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The RocksDBCheckpointMetadata persists the metadata for each committed batch 
> in JSON format. The object contains all RocksDB file names and the number of 
> total keys.
>  The metadata binds closely with the directory structure of 
> RocksDBFileManager, as described in the design doc - [Directory Structure and 
> Format for Files stored in 
> DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


