[jira] [Assigned] (SPARK-35182) Support driver-owned on-demand PVC
[ https://issues.apache.org/jira/browse/SPARK-35182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35182: Assignee: (was: Apache Spark) > Support driver-owned on-demand PVC > -- > > Key: SPARK-35182 > URL: https://issues.apache.org/jira/browse/SPARK-35182 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35183) CombineConcats should call transformAllExpressions
Yingyi Bu created SPARK-35183: - Summary: CombineConcats should call transformAllExpressions Key: SPARK-35183 URL: https://issues.apache.org/jira/browse/SPARK-35183 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.1.0 Reporter: Yingyi Bu {{plan transformExpressions \{ ... }}} only applies the transformation to the node `plan` itself, but not to its children. We should call transformAllExpressions instead of transformExpressions in CombineConcats. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
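[Editor's note] For context, a self-contained sketch of the difference the report points at. It models the behaviour with plain case classes rather than the real Catalyst classes: a rule applied via transformExpressions only rewrites expressions on the node it is called on, while transformAllExpressions also walks the child plan nodes, which is what CombineConcats needs.
{code:scala}
// Simplified model (not the actual Catalyst API) of why a rule applied with
// transformExpressions on the root misses Concats nested in child nodes.
sealed trait Expr
case class Lit(s: String) extends Expr
case class Concat(children: Seq[Expr]) extends Expr

case class PlanNode(exprs: Seq[Expr], children: Seq[PlanNode]) {
  // Analogue of transformExpressions: rewrites this node's expressions only.
  def transformExpressions(rule: PartialFunction[Expr, Expr]): PlanNode =
    copy(exprs = exprs.map(e => rule.applyOrElse(e, identity[Expr])))

  // Analogue of transformAllExpressions: also recurses into child nodes.
  def transformAllExpressions(rule: PartialFunction[Expr, Expr]): PlanNode =
    transformExpressions(rule)
      .copy(children = children.map(_.transformAllExpressions(rule)))
}

object CombineConcatsSketch extends App {
  // Flatten one level of nested Concat, roughly what CombineConcats does.
  val flatten: PartialFunction[Expr, Expr] = {
    case Concat(cs) if cs.exists(_.isInstanceOf[Concat]) =>
      Concat(cs.flatMap {
        case Concat(inner) => inner
        case e             => Seq(e)
      })
  }

  val child = PlanNode(Seq(Concat(Seq(Concat(Seq(Lit("a"), Lit("b"))), Lit("c")))), Nil)
  val root  = PlanNode(Seq(Lit("x")), Seq(child))

  println(root.transformExpressions(flatten))    // nested Concat in `child` is left untouched
  println(root.transformAllExpressions(flatten)) // nested Concat in `child` is flattened
}
{code}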
[jira] [Commented] (SPARK-35182) Support driver-owned on-demand PVC
[ https://issues.apache.org/jira/browse/SPARK-35182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327137#comment-17327137 ] Apache Spark commented on SPARK-35182: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/32288 > Support driver-owned on-demand PVC > -- > > Key: SPARK-35182 > URL: https://issues.apache.org/jira/browse/SPARK-35182 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35182) Support driver-owned on-demand PVC
[ https://issues.apache.org/jira/browse/SPARK-35182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35182: Assignee: Apache Spark > Support driver-owned on-demand PVC > -- > > Key: SPARK-35182 > URL: https://issues.apache.org/jira/browse/SPARK-35182 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35182) Support driver-owned on-demand PVC
Dongjoon Hyun created SPARK-35182: - Summary: Support driver-owned on-demand PVC Key: SPARK-35182 URL: https://issues.apache.org/jira/browse/SPARK-35182 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.2.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34671) Support ZSTD compression in Parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-34671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34671. --- Resolution: Duplicate > Support ZSTD compression in Parquet data sources > > > Key: SPARK-34671 > URL: https://issues.apache.org/jira/browse/SPARK-34671 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive
[ https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35096: -- Fix Version/s: 3.0.3 > foreachBatch throws ArrayIndexOutOfBoundsException if schema is case > Insensitive > > > Key: SPARK-35096 > URL: https://issues.apache.org/jira/browse/SPARK-35096 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Assignee: Sandeep Katta >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > Below code works fine before spark3, running on spark3 throws > java.lang.ArrayIndexOutOfBoundsException > {code:java} > val inputPath = "/Users/xyz/data/testcaseInsensitivity" > val output_path = "/Users/xyz/output" > spark.range(10).write.format("parquet").save(inputPath) > def process_row(microBatch: DataFrame, batchId: Long): Unit = { > val df = microBatch.select($"ID".alias("other")) // Doesn't work > df.write.format("parquet").mode("append").save(output_path) > } > val schema = new StructType().add("id", LongType) > val stream_df = > spark.readStream.schema(schema).format("parquet").load(inputPath) > stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _) > .start().awaitTermination() > {code} > Stack Trace: > {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149) > at > scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) > at > scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) > at scala.collection.mut
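[Editor's note] A minimal sketch of the workaround implied by the report's own "// Doesn't work" comment on the upper-case reference: referencing the column with the same case as the user-supplied schema should avoid the failing ObjectSerializerPruning path on Spark 3.0. Paths are placeholders; whether this sidesteps the bug in every case is an assumption, not something the ticket confirms.
{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{LongType, StructType}

object ForeachBatchCaseWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreachBatch-case").getOrCreate()
    import spark.implicits._

    val inputPath  = "/tmp/testcaseInsensitivity" // placeholder paths
    val outputPath = "/tmp/output"
    spark.range(10).write.format("parquet").save(inputPath)

    def processRow(microBatch: DataFrame, batchId: Long): Unit = {
      // Reference the column as "id", matching the declared schema, instead of "ID".
      val df = microBatch.select($"id".alias("other"))
      df.write.format("parquet").mode("append").save(outputPath)
    }

    val schema   = new StructType().add("id", LongType)
    val streamDf = spark.readStream.schema(schema).format("parquet").load(inputPath)
    streamDf.writeStream.trigger(Trigger.Once()).foreachBatch(processRow _)
      .start().awaitTermination()
  }
}
{code}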
[jira] [Assigned] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method
[ https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34674: - Assignee: Sergey Kotlov > Spark app on k8s doesn't terminate without call to sparkContext.stop() method > - > > Key: SPARK-34674 > URL: https://issues.apache.org/jira/browse/SPARK-34674 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Sergey Kotlov >Assignee: Sergey Kotlov >Priority: Major > > Hello! > I have run into a problem that if I don't call the method > sparkContext.stop() explicitly, then a Spark driver process doesn't terminate > even after its Main method has been completed. This behaviour is different > from spark on yarn, where the manual sparkContext stopping is not required. > It looks like, the problem is in using non-daemon threads, which prevent the > driver jvm process from terminating. > At least I see two non-daemon threads, if I don't call sparkContext.stop(): > {code:java} > Thread[OkHttp kubernetes.default.svc,5,main] > Thread[OkHttp kubernetes.default.svc Writer,5,main] > {code} > Could you tell please, if it is possible to solve this problem? > Docker image from the official release of spark-3.1.1 hadoop3.2 is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
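[Editor's note] A minimal sketch of the explicit-stop workaround described in the report, assuming a plain main-method application submitted to Kubernetes; wrapping the job in try/finally makes sure SparkContext.stop() runs even when the job fails.
{code:scala}
import org.apache.spark.sql.SparkSession

// Stop the SparkContext explicitly so the non-daemon OkHttp threads of the
// Kubernetes client cannot keep the driver JVM alive after main() returns.
object K8sApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("k8s-app").getOrCreate()
    try {
      spark.range(10).count() // application logic goes here
    } finally {
      spark.stop() // without this, the driver pod may keep running (SPARK-34674)
    }
  }
}
{code}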
[jira] [Resolved] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method
[ https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34674. --- Fix Version/s: 3.1.2 3.2.0 Resolution: Fixed Issue resolved by pull request 32283 [https://github.com/apache/spark/pull/32283] > Spark app on k8s doesn't terminate without call to sparkContext.stop() method > - > > Key: SPARK-34674 > URL: https://issues.apache.org/jira/browse/SPARK-34674 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Sergey Kotlov >Assignee: Sergey Kotlov >Priority: Major > Fix For: 3.2.0, 3.1.2 > > > Hello! > I have run into a problem that if I don't call the method > sparkContext.stop() explicitly, then a Spark driver process doesn't terminate > even after its Main method has been completed. This behaviour is different > from spark on yarn, where the manual sparkContext stopping is not required. > It looks like, the problem is in using non-daemon threads, which prevent the > driver jvm process from terminating. > At least I see two non-daemon threads, if I don't call sparkContext.stop(): > {code:java} > Thread[OkHttp kubernetes.default.svc,5,main] > Thread[OkHttp kubernetes.default.svc Writer,5,main] > {code} > Could you tell please, if it is possible to solve this problem? > Docker image from the official release of spark-3.1.1 hadoop3.2 is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches
[ https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327118#comment-17327118 ] Apache Spark commented on SPARK-27991: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/32287 > ShuffleBlockFetcherIterator should take Netty constant-factor overheads into > account when limiting number of simultaneous block fetches > --- > > Key: SPARK-27991 > URL: https://issues.apache.org/jira/browse/SPARK-27991 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > > ShuffleBlockFetcherIterator has logic to limit the number of simultaneous > block fetches. By default, this logic tries to keep the number of outstanding > block fetches [beneath a data size > limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274] > ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads > into account: even though a remote block might be, say, 4KB, there are > certain fixed-size internal overheads due to Netty buffer sizes which may > cause the actual space requirements to be larger. > As a result, if a map stage produces a huge number of extremely tiny blocks > then we may see errors like > {code:java} > org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 > byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485) > [...] > Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate > 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640) > at > io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594) > at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:146) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324) > [...]{code} > SPARK-24989 is another report of this problem (but with a different proposed > fix). > This problem can currently be mitigated by setting > {{spark.reducer.maxReqsInFlight}} to some some non-IntMax value (SPARK-6166), > but this additional manual configuration step is cumbersome. > Instead, I think that Spark should take these fixed overheads into account in > the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, > use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some > tricky details involved to make this work on all configurations (e.g. to use > a different minimum when direct buffers are disabled, etc.), but I think the > core idea behind the fix is pretty simple. > This will improve Spark's stability and removes configuration / tuning burden > from end users. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
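[Editor's note] A sketch of the interim mitigation the ticket mentions: bounding the number of in-flight fetch requests via spark.reducer.maxReqsInFlight in addition to the byte-based limit. The value 256 is only an illustrative choice, not a recommendation.
{code:scala}
import org.apache.spark.sql.SparkSession

object ShuffleFetchMitigation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-fetch-mitigation")
      // Cap concurrent fetch requests so Netty's fixed per-request buffer
      // overhead cannot exhaust direct memory when shuffle blocks are tiny.
      .config("spark.reducer.maxReqsInFlight", "256")
      .getOrCreate()

    // A shuffle-heavy stage that would otherwise issue many small fetches.
    spark.range(0, 10000000).repartition(2000).count()
    spark.stop()
  }
}
{code}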
[jira] [Commented] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches
[ https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327119#comment-17327119 ] Apache Spark commented on SPARK-27991: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/32287 > ShuffleBlockFetcherIterator should take Netty constant-factor overheads into > account when limiting number of simultaneous block fetches > --- > > Key: SPARK-27991 > URL: https://issues.apache.org/jira/browse/SPARK-27991 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > > ShuffleBlockFetcherIterator has logic to limit the number of simultaneous > block fetches. By default, this logic tries to keep the number of outstanding > block fetches [beneath a data size > limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274] > ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads > into account: even though a remote block might be, say, 4KB, there are > certain fixed-size internal overheads due to Netty buffer sizes which may > cause the actual space requirements to be larger. > As a result, if a map stage produces a huge number of extremely tiny blocks > then we may see errors like > {code:java} > org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 > byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485) > [...] > Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate > 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640) > at > io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594) > at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:146) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324) > [...]{code} > SPARK-24989 is another report of this problem (but with a different proposed > fix). > This problem can currently be mitigated by setting > {{spark.reducer.maxReqsInFlight}} to some some non-IntMax value (SPARK-6166), > but this additional manual configuration step is cumbersome. > Instead, I think that Spark should take these fixed overheads into account in > the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, > use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some > tricky details involved to make this work on all configurations (e.g. to use > a different minimum when direct buffers are disabled, etc.), but I think the > core idea behind the fix is pretty simple. > This will improve Spark's stability and removes configuration / tuning burden > from end users. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches
[ https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27991: Assignee: (was: Apache Spark) > ShuffleBlockFetcherIterator should take Netty constant-factor overheads into > account when limiting number of simultaneous block fetches > --- > > Key: SPARK-27991 > URL: https://issues.apache.org/jira/browse/SPARK-27991 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > > ShuffleBlockFetcherIterator has logic to limit the number of simultaneous > block fetches. By default, this logic tries to keep the number of outstanding > block fetches [beneath a data size > limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274] > ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads > into account: even though a remote block might be, say, 4KB, there are > certain fixed-size internal overheads due to Netty buffer sizes which may > cause the actual space requirements to be larger. > As a result, if a map stage produces a huge number of extremely tiny blocks > then we may see errors like > {code:java} > org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 > byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485) > [...] > Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate > 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640) > at > io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594) > at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:146) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324) > [...]{code} > SPARK-24989 is another report of this problem (but with a different proposed > fix). > This problem can currently be mitigated by setting > {{spark.reducer.maxReqsInFlight}} to some some non-IntMax value (SPARK-6166), > but this additional manual configuration step is cumbersome. > Instead, I think that Spark should take these fixed overheads into account in > the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, > use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some > tricky details involved to make this work on all configurations (e.g. to use > a different minimum when direct buffers are disabled, etc.), but I think the > core idea behind the fix is pretty simple. > This will improve Spark's stability and removes configuration / tuning burden > from end users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches
[ https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27991: Assignee: Apache Spark > ShuffleBlockFetcherIterator should take Netty constant-factor overheads into > account when limiting number of simultaneous block fetches > --- > > Key: SPARK-27991 > URL: https://issues.apache.org/jira/browse/SPARK-27991 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Major > > ShuffleBlockFetcherIterator has logic to limit the number of simultaneous > block fetches. By default, this logic tries to keep the number of outstanding > block fetches [beneath a data size > limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274] > ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads > into account: even though a remote block might be, say, 4KB, there are > certain fixed-size internal overheads due to Netty buffer sizes which may > cause the actual space requirements to be larger. > As a result, if a map stage produces a huge number of extremely tiny blocks > then we may see errors like > {code:java} > org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 > byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485) > [...] > Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate > 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640) > at > io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594) > at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:146) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324) > [...]{code} > SPARK-24989 is another report of this problem (but with a different proposed > fix). > This problem can currently be mitigated by setting > {{spark.reducer.maxReqsInFlight}} to some some non-IntMax value (SPARK-6166), > but this additional manual configuration step is cumbersome. > Instead, I think that Spark should take these fixed overheads into account in > the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, > use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some > tricky details involved to make this work on all configurations (e.g. to use > a different minimum when direct buffers are disabled, etc.), but I think the > core idea behind the fix is pretty simple. > This will improve Spark's stability and removes configuration / tuning burden > from end users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327109#comment-17327109 ] Apache Spark commented on SPARK-35181: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/32286 > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327108#comment-17327108 ] Apache Spark commented on SPARK-35181: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/32286 > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35181: Assignee: (was: Apache Spark) > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35181: Assignee: Apache Spark > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35181) Use zstd for spark.io.compression.codec by default
Dongjoon Hyun created SPARK-35181: - Summary: Use zstd for spark.io.compression.codec by default Key: SPARK-35181 URL: https://issues.apache.org/jira/browse/SPARK-35181 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.2.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
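[Editor's note] zstd is already an accepted value for spark.io.compression.codec; this ticket proposes making it the default. A sketch of opting in explicitly today:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object ZstdIoCodec {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("zstd-io-codec")
      // Compress internal data streams (shuffle, broadcast, spills) with zstd.
      .set("spark.io.compression.codec", "zstd")
    val spark = SparkSession.builder().config(conf).getOrCreate()

    spark.range(1000000).repartition(8).count() // shuffle output now uses zstd
    spark.stop()
  }
}
{code}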
[jira] [Assigned] (SPARK-35180) Allow to build SparkR with SBT
[ https://issues.apache.org/jira/browse/SPARK-35180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35180: Assignee: Kousuke Saruta (was: Apache Spark) > Allow to build SparkR with SBT > -- > > Key: SPARK-35180 > URL: https://issues.apache.org/jira/browse/SPARK-35180 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > In the current master, SparkR can be built only with Maven. > It's helpful if we can built it with SBT. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35180) Allow to build SparkR with SBT
[ https://issues.apache.org/jira/browse/SPARK-35180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327093#comment-17327093 ] Apache Spark commented on SPARK-35180: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32285 > Allow to build SparkR with SBT > -- > > Key: SPARK-35180 > URL: https://issues.apache.org/jira/browse/SPARK-35180 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > In the current master, SparkR can be built only with Maven. > It's helpful if we can built it with SBT. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35180) Allow to build SparkR with SBT
[ https://issues.apache.org/jira/browse/SPARK-35180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35180: Assignee: Apache Spark (was: Kousuke Saruta) > Allow to build SparkR with SBT > -- > > Key: SPARK-35180 > URL: https://issues.apache.org/jira/browse/SPARK-35180 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > In the current master, SparkR can be built only with Maven. > It's helpful if we can built it with SBT. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35180) Allow to build SparkR with SBT
Kousuke Saruta created SPARK-35180: -- Summary: Allow to build SparkR with SBT Key: SPARK-35180 URL: https://issues.apache.org/jira/browse/SPARK-35180 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta In the current master, SparkR can be built only with Maven. It would be helpful if we could build it with SBT. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive
[ https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327088#comment-17327088 ] Apache Spark commented on SPARK-35096: -- User 'sandeep-katta' has created a pull request for this issue: https://github.com/apache/spark/pull/32284 > foreachBatch throws ArrayIndexOutOfBoundsException if schema is case > Insensitive > > > Key: SPARK-35096 > URL: https://issues.apache.org/jira/browse/SPARK-35096 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Assignee: Sandeep Katta >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > Below code works fine before spark3, running on spark3 throws > java.lang.ArrayIndexOutOfBoundsException > {code:java} > val inputPath = "/Users/xyz/data/testcaseInsensitivity" > val output_path = "/Users/xyz/output" > spark.range(10).write.format("parquet").save(inputPath) > def process_row(microBatch: DataFrame, batchId: Long): Unit = { > val df = microBatch.select($"ID".alias("other")) // Doesn't work > df.write.format("parquet").mode("append").save(output_path) > } > val schema = new StructType().add("id", LongType) > val stream_df = > spark.readStream.schema(schema).format("parquet").load(inputPath) > stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _) > .start().awaitTermination() > {code} > Stack Trace: > {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) > at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149) > at > scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOpt
[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive
[ https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327089#comment-17327089 ] Apache Spark commented on SPARK-35096: -- User 'sandeep-katta' has created a pull request for this issue: https://github.com/apache/spark/pull/32284 > foreachBatch throws ArrayIndexOutOfBoundsException if schema is case > Insensitive > > > Key: SPARK-35096 > URL: https://issues.apache.org/jira/browse/SPARK-35096 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Assignee: Sandeep Katta >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > Below code works fine before spark3, running on spark3 throws > java.lang.ArrayIndexOutOfBoundsException > {code:java} > val inputPath = "/Users/xyz/data/testcaseInsensitivity" > val output_path = "/Users/xyz/output" > spark.range(10).write.format("parquet").save(inputPath) > def process_row(microBatch: DataFrame, batchId: Long): Unit = { > val df = microBatch.select($"ID".alias("other")) // Doesn't work > df.write.format("parquet").mode("append").save(output_path) > } > val schema = new StructType().add("id", LongType) > val stream_df = > spark.readStream.schema(schema).format("parquet").load(inputPath) > stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _) > .start().awaitTermination() > {code} > Stack Trace: > {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) > at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149) > at > scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOpt
[jira] [Updated] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive
[ https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35096: -- Fix Version/s: (was: 3.0.3) > foreachBatch throws ArrayIndexOutOfBoundsException if schema is case > Insensitive > > > Key: SPARK-35096 > URL: https://issues.apache.org/jira/browse/SPARK-35096 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Assignee: Sandeep Katta >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > Below code works fine before spark3, running on spark3 throws > java.lang.ArrayIndexOutOfBoundsException > {code:java} > val inputPath = "/Users/xyz/data/testcaseInsensitivity" > val output_path = "/Users/xyz/output" > spark.range(10).write.format("parquet").save(inputPath) > def process_row(microBatch: DataFrame, batchId: Long): Unit = { > val df = microBatch.select($"ID".alias("other")) // Doesn't work > df.write.format("parquet").mode("append").save(output_path) > } > val schema = new StructType().add("id", LongType) > val stream_df = > spark.readStream.schema(schema).format("parquet").load(inputPath) > stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _) > .start().awaitTermination() > {code} > Stack Trace: > {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149) > at > scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) > at > scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) > at scala.collection
[jira] [Commented] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
[ https://issues.apache.org/jira/browse/SPARK-35177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327058#comment-17327058 ] Apache Spark commented on SPARK-35177: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32281 > IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly > - > > Key: SPARK-35177 > URL: https://issues.apache.org/jira/browse/SPARK-35177 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > pass `INTERVAL '-178956970-8' YEAR TO MONTH ` throw exception -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
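[Editor's note] The failing literal sits exactly on the Int boundary: 178956970 years and 8 months is 2147483648 months, one more than Int.MaxValue, so the negative interval is representable but its positive magnitude is not. A quick check of that arithmetic (illustrative only, not the actual IntervalUtils code):
{code:scala}
import scala.util.Try

object YearMonthBoundary {
  def main(args: Array[String]): Unit = {
    // -178956970 years and -8 months is exactly Int.MinValue months.
    val months = -178956970 * 12 - 8
    println(months == Int.MinValue) // true

    // Building the positive magnitude first overflows, which is presumably why
    // parsing the negative literal fails.
    println(Try(Math.addExact(Math.multiplyExact(178956970, 12), 8)))
    // Failure(java.lang.ArithmeticException: integer overflow)
  }
}
{code}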
[jira] [Assigned] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
[ https://issues.apache.org/jira/browse/SPARK-35177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35177: Assignee: (was: Apache Spark) > IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly > - > > Key: SPARK-35177 > URL: https://issues.apache.org/jira/browse/SPARK-35177 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > pass `INTERVAL '-178956970-8' YEAR TO MONTH ` throw exception -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
[ https://issues.apache.org/jira/browse/SPARK-35177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35177: Assignee: Apache Spark > IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly > - > > Key: SPARK-35177 > URL: https://issues.apache.org/jira/browse/SPARK-35177 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > pass `INTERVAL '-178956970-8' YEAR TO MONTH ` throw exception -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
[ https://issues.apache.org/jira/browse/SPARK-35177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327057#comment-17327057 ] Apache Spark commented on SPARK-35177: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32281 > IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly > - > > Key: SPARK-35177 > URL: https://issues.apache.org/jira/browse/SPARK-35177 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > pass `INTERVAL '-178956970-8' YEAR TO MONTH ` throw exception -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method
[ https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327054#comment-17327054 ] Apache Spark commented on SPARK-34674: -- User 'kotlovs' has created a pull request for this issue: https://github.com/apache/spark/pull/32283 > Spark app on k8s doesn't terminate without call to sparkContext.stop() method > - > > Key: SPARK-34674 > URL: https://issues.apache.org/jira/browse/SPARK-34674 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Sergey Kotlov >Priority: Major > > Hello! > I have run into a problem that if I don't call the method > sparkContext.stop() explicitly, then a Spark driver process doesn't terminate > even after its Main method has been completed. This behaviour is different > from spark on yarn, where the manual sparkContext stopping is not required. > It looks like, the problem is in using non-daemon threads, which prevent the > driver jvm process from terminating. > At least I see two non-daemon threads, if I don't call sparkContext.stop(): > {code:java} > Thread[OkHttp kubernetes.default.svc,5,main] > Thread[OkHttp kubernetes.default.svc Writer,5,main] > {code} > Could you tell please, if it is possible to solve this problem? > Docker image from the official release of spark-3.1.1 hadoop3.2 is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method
[ https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327056#comment-17327056 ] Apache Spark commented on SPARK-34674: -- User 'kotlovs' has created a pull request for this issue: https://github.com/apache/spark/pull/32283 > Spark app on k8s doesn't terminate without call to sparkContext.stop() method > - > > Key: SPARK-34674 > URL: https://issues.apache.org/jira/browse/SPARK-34674 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Sergey Kotlov >Priority: Major > > Hello! > I have run into a problem that if I don't call the method > sparkContext.stop() explicitly, then a Spark driver process doesn't terminate > even after its Main method has been completed. This behaviour is different > from spark on yarn, where the manual sparkContext stopping is not required. > It looks like, the problem is in using non-daemon threads, which prevent the > driver jvm process from terminating. > At least I see two non-daemon threads, if I don't call sparkContext.stop(): > {code:java} > Thread[OkHttp kubernetes.default.svc,5,main] > Thread[OkHttp kubernetes.default.svc Writer,5,main] > {code} > Could you tell please, if it is possible to solve this problem? > Docker image from the official release of spark-3.1.1 hadoop3.2 is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327048#comment-17327048 ] Apache Spark commented on SPARK-35178: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/32282 > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0 >Reporter: Bruce Robbins >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35178. --- Fix Version/s: 3.2.0 3.1.2 3.0.3 Assignee: Sean R. Owen (was: Bruce Robbins) Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/32277 > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0 >Reporter: Bruce Robbins >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327046#comment-17327046 ] Apache Spark commented on SPARK-35178: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/32282 > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31168) Upgrade Scala to 2.12.13
[ https://issues.apache.org/jira/browse/SPARK-31168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327035#comment-17327035 ] Jim Kleckner commented on SPARK-31168: -- It appears that this fix [1] for 12038 merged into Scala master [2] and has been released in Scala 2.13.5 [3] but not yet released as Scala 2.12.14. [1] [https://github.com/scala/scala/pull/9478] [2] [https://github.com/scala/scala/pull/9495] [3] [https://github.com/scala/scala/releases/tag/v2.13.5] > Upgrade Scala to 2.12.13 > > > Key: SPARK-31168 > URL: https://issues.apache.org/jira/browse/SPARK-31168 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > h2. Highlights > * Performance improvements in the collections library: algorithmic > improvements and changes to avoid unnecessary allocations ([list of > PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+label%3Alibrary%3Acollections+label%3Aperformance]) > * Performance improvements in the compiler ([list of > PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+-label%3Alibrary%3Acollections+label%3Aperformance+], > minor [effects in our > benchmarks|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1567985515850&to=1584355915694&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench@scalabench@]) > * Improvements to {{-Yrepl-class-based}}, an alternative internal REPL > encoding that avoids deadlocks (details on > [#8712|https://github.com/scala/scala/pull/8712]) > * A new {{-Yrepl-use-magic-imports}} flag that avoids deep class nesting in > the REPL, which can lead to deteriorating performance in long sessions > ([#8576|https://github.com/scala/scala/pull/8576]) > * Fix some {{toX}} methods that could expose the underlying mutability of a > {{ListBuffer}}-generated collection > ([#8674|https://github.com/scala/scala/pull/8674]) > h3. JDK 9+ support > * ASM was upgraded to 7.3.1, allowing the optimizer to run on JDK 13+ > ([#8676|https://github.com/scala/scala/pull/8676]) > * {{:javap}} in the REPL now works on JDK 9+ > ([#8400|https://github.com/scala/scala/pull/8400]) > h3. Other changes > * Support new labels for creating durations for consistency: > {{Duration("1m")}}, {{Duration("3 hrs")}} > ([#8325|https://github.com/scala/scala/pull/8325], > [#8450|https://github.com/scala/scala/pull/8450]) > * Fix memory leak in runtime reflection's {{TypeTag}} caches > ([#8470|https://github.com/scala/scala/pull/8470]) and some thread safety > issues in runtime reflection > ([#8433|https://github.com/scala/scala/pull/8433]) > * When using compiler plugins, the ordering of compiler phases may change > due to [#8427|https://github.com/scala/scala/pull/8427] > For more details, see [https://github.com/scala/scala/releases/tag/v2.12.11]. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
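As a small illustration of one of the quoted highlights (the new duration labels), the following is a sketch against the Scala standard library only, independent of Spark:
{code:scala}
import scala.concurrent.duration.Duration

// Label spellings accepted by the updated parser, per the release notes.
val a = Duration("1m")    // 1 minute
val b = Duration("3 hrs") // 3 hours
println(a + b)            // 181 minutes
{code}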
[jira] [Commented] (SPARK-35110) Handle ANSI intervals in WindowExecBase
[ https://issues.apache.org/jira/browse/SPARK-35110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327034#comment-17327034 ] jiaan.geng commented on SPARK-35110: I'm working on it. > Handle ANSI intervals in WindowExecBase > --- > > Key: SPARK-35110 > URL: https://issues.apache.org/jira/browse/SPARK-35110 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Handle YearMonthIntervalType and DayTimeIntervalType in createBoundOrdering(): > https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala#L97-L99 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
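For context, the bound ordering in question evaluates range-frame boundaries like the one below. This is only an illustrative spark-shell query (the `events` table and its `ts`/`v` columns are made up, not from the ticket):
{code:scala}
// A range frame bounded by an interval; WindowExecBase.createBoundOrdering()
// has to build an ordering that can offset the ORDER BY column by the interval.
spark.sql(
  """
    |SELECT ts, v,
    |       sum(v) OVER (ORDER BY ts
    |                    RANGE BETWEEN INTERVAL 1 DAY PRECEDING AND CURRENT ROW) AS rolling_sum
    |FROM events
    |""".stripMargin)
{code}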
[jira] [Commented] (SPARK-35161) Safe version SQL functions
[ https://issues.apache.org/jira/browse/SPARK-35161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327032#comment-17327032 ] jiaan.geng commented on SPARK-35161: I see. > Safe version SQL functions > -- > > Key: SPARK-35161 > URL: https://issues.apache.org/jira/browse/SPARK-35161 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Priority: Major > > Create new safe version SQL functions for existing SQL functions/operators, > which returns NULL if overflow/error occurs. So that: > 1. Users can manage to finish queries without interruptions in ANSI mode. > 2. Users can get NULLs instead of unreasonable results if overflow occurs > when ANSI mode is off. > For example, the behavior of the following SQL operations is unreasonable: > {code:java} > 2147483647 + 2 => -2147483647 > CAST(2147483648L AS INT) => -2147483648 > {code} > With the new safe version SQL functions: > {code:java} > TRY_ADD(2147483647, 2) => null > TRY_CAST(2147483648L AS INT) => null > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
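For reference, the overflow behaviour that motivates the proposal can be reproduced today in a spark-shell session with ANSI mode off; the TRY_* calls are the proposed functions from the ticket and do not exist yet, so they appear only in comments.
{code:scala}
// Current behaviour with spark.sql.ansi.enabled=false: silent wrap-around.
spark.sql("SELECT 2147483647 + 2").show()          // -2147483647
spark.sql("SELECT CAST(2147483648 AS INT)").show() // -2147483648

// Proposed (per the ticket, not yet available):
//   SELECT TRY_ADD(2147483647, 2)       -- NULL
//   SELECT TRY_CAST(2147483648 AS INT)  -- NULL
{code}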
[jira] [Commented] (SPARK-35117) UI progress bar no longer highlights in progress tasks
[ https://issues.apache.org/jira/browse/SPARK-35117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327030#comment-17327030 ] Apache Spark commented on SPARK-35117: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32281 > UI progress bar no longer highlights in progress tasks > -- > > Key: SPARK-35117 > URL: https://issues.apache.org/jira/browse/SPARK-35117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > The Spark UI was updated to Bootstrap 4, and during the update the progress > bar in the UI was updated to highlight the whole bar once any tasks were in > progress, versus highlighting just the number of tasks that were in progress. > This was a great visual cue for seeing what percentage of the stage/job was > currently being worked on, and it'd be great to get that functionality back. > The change can be found here: > https://github.com/apache/spark/pull/27370/files#diff-809c93c57cc59e5fe3c3eb54a24aa96a38147d02323f3e690ae6b5309a3284d2L448 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35078) Migrate to transformWithPruning or resolveWithPruning for expression rules
[ https://issues.apache.org/jira/browse/SPARK-35078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327027#comment-17327027 ] Apache Spark commented on SPARK-35078: -- User 'sigmod' has created a pull request for this issue: https://github.com/apache/spark/pull/32280 > Migrate to transformWithPruning or resolveWithPruning for expression rules > -- > > Key: SPARK-35078 > URL: https://issues.apache.org/jira/browse/SPARK-35078 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 3.1.0 >Reporter: Yingyi Bu >Priority: Major > > E.g., rules in org/apache/spark/sql/catalyst/optimizer/expressions.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
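A rough sketch of what the migration looks like for a single expression rule, based on my reading of the tree-pattern pruning API (transformWithPruning, transformExpressionsUpWithPruning, TreePattern); the rule itself is invented for illustration and the exact signatures may differ from what lands in expressions.scala.
{code:scala}
import org.apache.spark.sql.catalyst.expressions.Not
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.trees.TreePattern.NOT

// Hypothetical rule: remove double negation, visiting only plan nodes and
// expression trees that actually contain a NOT pattern.
object RemoveDoubleNegation extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan =
    plan.transformWithPruning(_.containsPattern(NOT)) {
      case p: LogicalPlan =>
        p.transformExpressionsUpWithPruning(_.containsPattern(NOT)) {
          case Not(Not(child)) => child
        }
    }
}
{code}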
[jira] [Assigned] (SPARK-35078) Migrate to transformWithPruning or resolveWithPruning for expression rules
[ https://issues.apache.org/jira/browse/SPARK-35078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35078: Assignee: Apache Spark > Migrate to transformWithPruning or resolveWithPruning for expression rules > -- > > Key: SPARK-35078 > URL: https://issues.apache.org/jira/browse/SPARK-35078 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 3.1.0 >Reporter: Yingyi Bu >Assignee: Apache Spark >Priority: Major > > E.g., rules in org/apache/spark/sql/catalyst/optimizer/expressions.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35078) Migrate to transformWithPruning or resolveWithPruning for expression rules
[ https://issues.apache.org/jira/browse/SPARK-35078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327029#comment-17327029 ] Apache Spark commented on SPARK-35078: -- User 'sigmod' has created a pull request for this issue: https://github.com/apache/spark/pull/32280 > Migrate to transformWithPruning or resolveWithPruning for expression rules > -- > > Key: SPARK-35078 > URL: https://issues.apache.org/jira/browse/SPARK-35078 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 3.1.0 >Reporter: Yingyi Bu >Priority: Major > > E.g., rules in org/apache/spark/sql/catalyst/optimizer/expressions.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35078) Migrate to transformWithPruning or resolveWithPruning for expression rules
[ https://issues.apache.org/jira/browse/SPARK-35078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35078: Assignee: (was: Apache Spark) > Migrate to transformWithPruning or resolveWithPruning for expression rules > -- > > Key: SPARK-35078 > URL: https://issues.apache.org/jira/browse/SPARK-35078 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 3.1.0 >Reporter: Yingyi Bu >Priority: Major > > E.g., rules in org/apache/spark/sql/catalyst/optimizer/expressions.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32027) EventLoggingListener threw java.util.ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-32027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32027. -- Resolution: Duplicate > EventLoggingListener threw java.util.ConcurrentModificationException > - > > Key: SPARK-32027 > URL: https://issues.apache.org/jira/browse/SPARK-32027 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > 20/06/18 20:22:25 ERROR AsyncEventQueue: Listener EventLoggingListener threw > an exception > java.util.ConcurrentModificationException > at java.util.Hashtable$Enumerator.next(Hashtable.java:1387) > at > scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424) > at > scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.util.JsonProtocol$.mapToJson(JsonProtocol.scala:568) > at > org.apache.spark.util.JsonProtocol$.$anonfun$propertiesToJson$1(JsonProtocol.scala:574) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.util.JsonProtocol$.propertiesToJson(JsonProtocol.scala:573) > at > org.apache.spark.util.JsonProtocol$.jobStartToJson(JsonProtocol.scala:159) > at > org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:81) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97) > at > org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:159) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99) > at > org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > at > org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > 20/06/18 20:22:25 ERROR AsyncEventQueue: Listener EventLoggingListener threw > an exception > java.util.ConcurrentModificationException > at java.util.Hashtable$Enumerator.next(Hashtable.java:1387) > at > 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424) > at > scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.util.JsonProtocol$.mapToJson(JsonProtocol.scala:568) > at > org.apache.spark.util.JsonProtocol$.$anonfun$propertiesToJson$1(JsonProtocol.scala:574)
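The failure mode in the trace is iteration over a java.util.Properties (a Hashtable) while another thread mutates it. Below is a minimal stand-alone sketch, not Spark's actual listener code, showing both the race and the usual mitigation of snapshotting before iterating.
{code:scala}
import java.util.Properties

object CmeDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    (1 to 10000).foreach(i => props.setProperty(s"k$i", i.toString))

    // Mutate the properties concurrently with the iteration below.
    val writer = new Thread(() => (1 to 10000).foreach(i => props.setProperty(s"extra$i", "x")))
    writer.start()

    // Iterating the live Hashtable is fail-fast and may throw
    // java.util.ConcurrentModificationException, as in the trace above.
    try {
      val it = props.entrySet().iterator()
      while (it.hasNext) it.next()
    } catch {
      case e: java.util.ConcurrentModificationException => println(s"got: $e")
    }

    // Cloning first (Hashtable.clone() is synchronized) gives a private
    // snapshot that is safe to iterate.
    val snapshot = props.clone().asInstanceOf[Properties]
    val it2 = snapshot.entrySet().iterator()
    while (it2.hasNext) it2.next()

    writer.join()
  }
}
{code}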
[jira] [Commented] (SPARK-34897) Support reconcile schemas based on index after nested column pruning
[ https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327017#comment-17327017 ] Apache Spark commented on SPARK-34897: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/32279 > Support reconcile schemas based on index after nested column pruning > > > Key: SPARK-34897 > URL: https://issues.apache.org/jira/browse/SPARK-34897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:scala} > spark.sql( > """ > |CREATE TABLE `t1` ( > | `_col0` INT, > | `_col1` STRING, > | `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>, > | `_col3` STRING) > |USING orc > |PARTITIONED BY (_col3) > |""".stripMargin) > spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')") > spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show > {code} > Error message: > {noformat} > java.lang.AssertionError: assertion failed: The given data schema > struct<_col0:int,_col2:struct> has less fields than the actual ORC > physical schema, no idea which columns were dropped, fail to read. Try to > disable > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34897) Support reconcile schemas based on index after nested column pruning
[ https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327016#comment-17327016 ] Apache Spark commented on SPARK-34897: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/32279 > Support reconcile schemas based on index after nested column pruning > > > Key: SPARK-34897 > URL: https://issues.apache.org/jira/browse/SPARK-34897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:scala} > spark.sql( > """ > |CREATE TABLE `t1` ( > | `_col0` INT, > | `_col1` STRING, > | `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>, > | `_col3` STRING) > |USING orc > |PARTITIONED BY (_col3) > |""".stripMargin) > spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')") > spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show > {code} > Error message: > {noformat} > java.lang.AssertionError: assertion failed: The given data schema > struct<_col0:int,_col2:struct> has less fields than the actual ORC > physical schema, no idea which columns were dropped, fail to read. Try to > disable > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34692) Support Not(Int) and Not(InSet) propagate null
[ https://issues.apache.org/jira/browse/SPARK-34692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327010#comment-17327010 ] Apache Spark commented on SPARK-34692: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32278 > Support Not(Int) and Not(InSet) propagate null > -- > > Key: SPARK-34692 > URL: https://issues.apache.org/jira/browse/SPARK-34692 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.2.0 > > > The semantics of `Not(In)` could be seen like `And(a != b, a != c)` that > match the `NullIntolerant`. > As we already simplify the `NullIntolerant` expression to null if it's > children have null. E.g. `a != null` => `null`. It's safe to do this with > `Not(In)`/`Not(InSet)`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
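The null propagation being relied on is standard SQL three-valued logic and can be checked in a spark-shell session:
{code:scala}
// NOT IN over a list containing NULL behaves like And(a != b, a != c):
// a comparison against NULL is unknown, so a non-matching value yields NULL.
spark.sql("SELECT 3 NOT IN (1, 2, NULL)").show() // NULL
spark.sql("SELECT 1 NOT IN (1, NULL)").show()    // false (the match is definite)
{code}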
[jira] [Commented] (SPARK-34692) Support Not(Int) and Not(InSet) propagate null
[ https://issues.apache.org/jira/browse/SPARK-34692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327009#comment-17327009 ] Apache Spark commented on SPARK-34692: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32278 > Support Not(Int) and Not(InSet) propagate null > -- > > Key: SPARK-34692 > URL: https://issues.apache.org/jira/browse/SPARK-34692 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.2.0 > > > The semantics of `Not(In)` could be seen like `And(a != b, a != c)` that > match the `NullIntolerant`. > As we already simplify the `NullIntolerant` expression to null if it's > children have null. E.g. `a != null` => `null`. It's safe to do this with > `Not(In)`/`Not(InSet)`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35179) Introduce hybrid join for sort merge join and shuffled hash join in AQE
[ https://issues.apache.org/jira/browse/SPARK-35179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326991#comment-17326991 ] Cheng Su commented on SPARK-35179: -- Thanks to [~cloud_fan] for the idea. Please comment or edit if this is not captured correctly, thanks. > Introduce hybrid join for sort merge join and shuffled hash join in AQE > --- > > Key: SPARK-35179 > URL: https://issues.apache.org/jira/browse/SPARK-35179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Minor > > Per discussion in > [https://github.com/apache/spark/pull/32210#issuecomment-823503243] , we can > introduce some kind of {{HybridJoin}} operator in AQE, and we can choose to > do shuffled hash join vs sort merge join for each task independently, e.g. > based on partition size, task1 can do shuffled hash join, and task2 can do > sort merge join, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32461) Shuffled hash join improvement
[ https://issues.apache.org/jira/browse/SPARK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-32461: - Affects Version/s: 3.2.0 > Shuffled hash join improvement > -- > > Key: SPARK-32461 > URL: https://issues.apache.org/jira/browse/SPARK-32461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Cheng Su >Priority: Major > Labels: release-notes > > Shuffled hash join avoids the sort required by sort merge join. This advantage > shows up clearly when joining large tables, in terms of saving CPU and IO (in > case an external sort happens). In the latest master trunk, shuffled hash join is > disabled by default with config "spark.sql.join.preferSortMergeJoin"=true, > in favor of reducing the risk of OOM. However, shuffled hash join could be > improved to a better state (validated in our internal fork). Creating this > Jira to track overall progress. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
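For anyone experimenting with the current behaviour, shuffled hash join can already be preferred over sort merge join globally through the configuration mentioned above, or per query with a join hint. A quick spark-shell sketch (the table names t1/t2 are placeholders):
{code:scala}
// Prefer shuffled hash join over sort merge join where it is applicable.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

// Or request it for a single query via the SHUFFLE_HASH hint (Spark 3.0+).
spark.sql(
  """
    |SELECT /*+ SHUFFLE_HASH(t1) */ *
    |FROM t1 JOIN t2 ON t1.id = t2.id
    |""".stripMargin)
{code}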
[jira] [Created] (SPARK-35179) Introduce hybrid join for sort merge join and shuffled hash join in AQE
Cheng Su created SPARK-35179: Summary: Introduce hybrid join for sort merge join and shuffled hash join in AQE Key: SPARK-35179 URL: https://issues.apache.org/jira/browse/SPARK-35179 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Cheng Su Per discussion in [https://github.com/apache/spark/pull/32210#issuecomment-823503243] , we can introduce some kind of {{HybridJoin}} operator in AQE, and we can choose to do shuffled hash join vs sort merge join for each task independently, e.g. based on partition size, task1 can do shuffled hash join, and task2 can do sort merge join, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326924#comment-17326924 ] Apache Spark commented on SPARK-35178: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/32277 > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35178: Assignee: Bruce Robbins (was: Apache Spark) > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326921#comment-17326921 ] Apache Spark commented on SPARK-35178: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/32277 > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35178: Assignee: Apache Spark (was: Bruce Robbins) > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Major > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-35178: Affects Version/s: 2.4.7 3.0.2 3.1.1 Assignee: Bruce Robbins > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.7, 3.0.2, 3.1.1, 3.2.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326910#comment-17326910 ] Bruce Robbins commented on SPARK-35178: --- In INFRA-21767, Daniel Gruno responded: {quote} Please use this format instead: https://www.apache.org/dyn/closer.lua/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz?action=download that is, https://www.apache.org/dyn/closer.lua/path/to/file.tar.gz?action=download {quote} > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Bruce Robbins >Priority: Major > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326874#comment-17326874 ] Bruce Robbins commented on SPARK-35178: --- I also posted https://issues.apache.org/jira/browse/INFRA-21767. Maybe they have some insight. > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Bruce Robbins >Priority: Major > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35178) maven autodownload failing
[ https://issues.apache.org/jira/browse/SPARK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326770#comment-17326770 ] Sean R. Owen commented on SPARK-35178: -- I agree, it looks like the automatic redirector has changed behavior. It still sends you to an HTML page for the mirror, but previously that link would cause it to redirect straight to the download. While the script can fall back to archive.apache.org, it doesn't, because the HTML downloads successfully -- it just is not the distribution! Either we detect this or we have to hack this more to get the mirror URL from the redirector, then attach it to the path. > maven autodownload failing > -- > > Key: SPARK-35178 > URL: https://issues.apache.org/jira/browse/SPARK-35178 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Bruce Robbins >Priority: Major > > I attempted to build a fresh clone of Spark using mvn (on two different > networks) and got this error: > {noformat} > exec: curl --silent --show-error -L > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz > tar: Unrecognized archive format > tar: Error exit delayed from previous errors. > Using `mvn` from path: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn > build/mvn: line 126: > /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such > file or directory > {noformat} > if I change the mirror as below, the issue goes away: > {noformat} > -local > APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} > +local > APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35178) maven autodownload failing
Bruce Robbins created SPARK-35178: - Summary: maven autodownload failing Key: SPARK-35178 URL: https://issues.apache.org/jira/browse/SPARK-35178 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.2.0 Reporter: Bruce Robbins I attempted to build a fresh clone of Spark using mvn (on two different networks) and got this error: {noformat} exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz tar: Unrecognized archive format tar: Error exit delayed from previous errors. Using `mvn` from path: /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn build/mvn: line 126: /tmp/testmvn/spark-mvn-download/build/apache-maven-3.6.3/bin/mvn: No such file or directory {noformat} if I change the mirror as below, the issue goes away: {noformat} -local APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='} +local APACHE_MIRROR=${APACHE_MIRROR:-'https://https://downloads.apache.org'} {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326716#comment-17326716 ] L. C. Hsieh commented on SPARK-34198: - The major issue is the additional rocksdb dependency. For me, I'm not against it. But maybe others have strong preferences not to include it by default. I agree with [~kabhwan] that we may need to get a consensus from the community. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation > HDFSBackedStateStore. Actually it uses in-memory map to store state rows. As > there are more and more streaming applications, some of them requires to use > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management. So it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > We would like to explore the possibility to add RocksDB-based StateStore into > Spark SS. For the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34692) Support Not(Int) and Not(InSet) propagate null
[ https://issues.apache.org/jira/browse/SPARK-34692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34692: --- Assignee: ulysses you > Support Not(Int) and Not(InSet) propagate null > -- > > Key: SPARK-34692 > URL: https://issues.apache.org/jira/browse/SPARK-34692 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > > The semantics of `Not(In)` could be seen like `And(a != b, a != c)` that > match the `NullIntolerant`. > As we already simplify the `NullIntolerant` expression to null if it's > children have null. E.g. `a != null` => `null`. It's safe to do this with > `Not(In)`/`Not(InSet)`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34692) Support Not(Int) and Not(InSet) propagate null
[ https://issues.apache.org/jira/browse/SPARK-34692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34692. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31797 [https://github.com/apache/spark/pull/31797] > Support Not(Int) and Not(InSet) propagate null > -- > > Key: SPARK-34692 > URL: https://issues.apache.org/jira/browse/SPARK-34692 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.2.0 > > > The semantics of `Not(In)` could be seen like `And(a != b, a != c)` that > match the `NullIntolerant`. > As we already simplify the `NullIntolerant` expression to null if it's > children have null. E.g. `a != null` => `null`. It's safe to do this with > `Not(In)`/`Not(InSet)`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
angerszhu created SPARK-35177: - Summary: IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly Key: SPARK-35177 URL: https://issues.apache.org/jira/browse/SPARK-35177 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu Passing `INTERVAL '-178956970-8' YEAR TO MONTH` throws an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
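For the record, the failing literal is exactly the boundary case: -178956970 years and 8 months is Int.MinValue months. A one-liner to verify the arithmetic:
{code:scala}
// -178956970 years and -8 months expressed in months:
val months = -178956970L * 12 - 8
assert(months == Int.MinValue) // -2147483648
{code}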
[jira] [Commented] (SPARK-35177) IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly
[ https://issues.apache.org/jira/browse/SPARK-35177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326643#comment-17326643 ] angerszhu commented on SPARK-35177: --- Will raise a PR soon. > IntervalUtils.fromYearMonthString can't handle Int.MinValue correctly > - > > Key: SPARK-35177 > URL: https://issues.apache.org/jira/browse/SPARK-35177 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Passing `INTERVAL '-178956970-8' YEAR TO MONTH` throws an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32027) EventLoggingListener threw java.util.ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-32027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326639#comment-17326639 ] Seulki jake Han commented on SPARK-32027: - [~kristopherkane] Thank you. This problem is solved by the SPARK-34731. This issue may be closed. > EventLoggingListener threw java.util.ConcurrentModificationException > - > > Key: SPARK-32027 > URL: https://issues.apache.org/jira/browse/SPARK-32027 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > 20/06/18 20:22:25 ERROR AsyncEventQueue: Listener EventLoggingListener threw > an exception > java.util.ConcurrentModificationException > at java.util.Hashtable$Enumerator.next(Hashtable.java:1387) > at > scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424) > at > scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.util.JsonProtocol$.mapToJson(JsonProtocol.scala:568) > at > org.apache.spark.util.JsonProtocol$.$anonfun$propertiesToJson$1(JsonProtocol.scala:574) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.util.JsonProtocol$.propertiesToJson(JsonProtocol.scala:573) > at > org.apache.spark.util.JsonProtocol$.jobStartToJson(JsonProtocol.scala:159) > at > org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:81) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97) > at > org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:159) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99) > at > org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > at > org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > 20/06/18 20:22:25 ERROR AsyncEventQueue: Listener EventLoggingListener threw > an 
exception > java.util.ConcurrentModificationException > at java.util.Hashtable$Enumerator.next(Hashtable.java:1387) > at > scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424) > at > scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.util.JsonProtocol$.
[jira] [Created] (SPARK-35176) Raise TypeError in inappropriate type case rather than ValueError
Yikun Jiang created SPARK-35176: --- Summary: Raise TypeError in inappropriate type case rather than ValueError Key: SPARK-35176 URL: https://issues.apache.org/jira/browse/SPARK-35176 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Reporter: Yikun Jiang There are many places where ValueError is raised for the wrong kind of error. When an operation or function is applied to an object of an inappropriate type, we should raise TypeError [1] rather than ValueError, such as: [https://github.com/apache/spark/blob/355c39939d9e4c87ffc9538eb822a41cb2ff93fb/python/pyspark/sql/dataframe.py#L1137] [https://github.com/apache/spark/blob/355c39939d9e4c87ffc9538eb822a41cb2ff93fb/python/pyspark/sql/dataframe.py#L1228] We should make these corrections at an appropriate time; note that doing so will break any existing code that catches the original ValueError. [1] https://docs.python.org/3/library/exceptions.html#TypeError -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35173) Support columns batch adding in PySpark.dataframe
[ https://issues.apache.org/jira/browse/SPARK-35173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35173: Assignee: Apache Spark > Support columns batch adding in PySpark.dataframe > - > > Key: SPARK-35173 > URL: https://issues.apache.org/jira/browse/SPARK-35173 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > > Currently, PySpark can only use withColumn to add a single column or > replace an existing column with the same name, while the Scala API can > add multiple columns in one pass. [1] > > Before this is added, the user can only chain withColumn calls repeatedly: > > {code:java} > self.df.withColumn("key1", col("key1")).withColumn("key2", > col("key2")).withColumn("key3", col("key3")){code} > > With this support, the user can complete the batch operation in a single call: > > {code:java} > self.df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), > col("key3")]){code} > > [1] > [https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35173) Support columns batch adding in PySpark.dataframe
[ https://issues.apache.org/jira/browse/SPARK-35173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35173: Assignee: (was: Apache Spark) > Support columns batch adding in PySpark.dataframe > - > > Key: SPARK-35173 > URL: https://issues.apache.org/jira/browse/SPARK-35173 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Major > > Currently, PySpark can only use withColumn to add a single column or > replace an existing column with the same name, while the Scala API can > add multiple columns in one pass. [1] > > Before this is added, the user can only chain withColumn calls repeatedly: > > {code:java} > self.df.withColumn("key1", col("key1")).withColumn("key2", > col("key2")).withColumn("key3", col("key3")){code} > > With this support, the user can complete the batch operation in a single call: > > {code:java} > self.df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), > col("key3")]){code} > > [1] > [https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35173) Support columns batch adding in PySpark.dataframe
[ https://issues.apache.org/jira/browse/SPARK-35173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326574#comment-17326574 ] Apache Spark commented on SPARK-35173: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/32276 > Support columns batch adding in PySpark.dataframe > - > > Key: SPARK-35173 > URL: https://issues.apache.org/jira/browse/SPARK-35173 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Major > > Currently, PySpark can only use withColumn to add a single column or > replace an existing column that has the same name, while the Scala withColumn > can add multiple columns in one pass. [1] > > Before this is added, users can only call withColumn again and again, like: > > {code:java} > self.df.withColumn("key1", col("key1")).withColumn("key2", > col("key2")).withColumn("key3", col("key3")){code} > > After the support is added, users can use the proposed with_columns to complete the > batch operation in one call: > > {code:java} > self.df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), > col("key3")]){code} > > [1] > [https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35142) `OneVsRest` classifier uses incorrect data type for `rawPrediction` column
[ https://issues.apache.org/jira/browse/SPARK-35142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326518#comment-17326518 ] Apache Spark commented on SPARK-35142: -- User 'harupy' has created a pull request for this issue: https://github.com/apache/spark/pull/32275 > `OneVsRest` classifier uses incorrect data type for `rawPrediction` column > -- > > Key: SPARK-35142 > URL: https://issues.apache.org/jira/browse/SPARK-35142 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0, 3.0.2, 3.1.0, 3.1.1 >Reporter: Harutaka Kawamura >Priority: Major > > `OneVsRest` classifier uses an incorrect data type for the `rawPrediction` > column. > Code to reproduce the issue: > {code:java} > from pyspark.ml.classification import LogisticRegression, OneVsRest > from pyspark.ml.linalg import Vectors > from pyspark.sql import SparkSession > from sklearn.datasets import load_iris > spark = SparkSession.builder.getOrCreate() > X, y = load_iris(return_X_y=True) > df = spark.createDataFrame( > [(Vectors.dense(features), int(label)) for features, label in zip(X, y)], > ["features", "label"] > ) > train, test = df.randomSplit([0.8, 0.2]) > lor = LogisticRegression(maxIter=5) > ovr = OneVsRest(classifier=lor) > ovrModel = ovr.fit(train) > pred = ovrModel.transform(test) > pred.printSchema() > # This prints out: > # root > # |-- features: vector (nullable = true) > # |-- label: long (nullable = true) > # |-- rawPrediction: string (nullable = true) # <- should not be string > # |-- prediction: double (nullable = true) > # pred.show() # this fails because of the incorrect datatype{code} > I ran the code above using GitHub Actions: > [https://github.com/harupy/SPARK-35142/pull/1] > > It looks like the UDF to compute the `rawPrediction` column is generated > without specifying the return type: > > [https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/python/pyspark/ml/classification.py#L3154] > {code:java} > rawPredictionUDF = udf(func) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
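The snippet at the end of the report points at the likely root cause, so here is a hedged sketch of the corresponding fix: give the UDF an explicit VectorUDT return type instead of relying on the StringType default. The func body below is a simplified stand-in for the OneVsRest logic, and whether the linked pull request fixes the issue exactly this way is not confirmed here.

{code:python}
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

def func(scores):
    # Simplified stand-in: assemble the per-class raw scores into one vector.
    return Vectors.dense(scores)

# As reported: no return type, so the UDF defaults to StringType and the
# rawPrediction column comes out as a string.
# rawPredictionUDF = udf(func)

# Sketch of the fix: declare the vector return type explicitly.
rawPredictionUDF = udf(func, VectorUDT())
{code}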
[jira] [Commented] (SPARK-35142) `OneVsRest` classifier uses incorrect data type for `rawPrediction` column
[ https://issues.apache.org/jira/browse/SPARK-35142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326517#comment-17326517 ] Apache Spark commented on SPARK-35142: -- User 'harupy' has created a pull request for this issue: https://github.com/apache/spark/pull/32275 > `OneVsRest` classifier uses incorrect data type for `rawPrediction` column > -- > > Key: SPARK-35142 > URL: https://issues.apache.org/jira/browse/SPARK-35142 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0, 3.0.2, 3.1.0, 3.1.1 >Reporter: Harutaka Kawamura >Priority: Major > > `OneVsRest` classifier uses an incorrect data type for the `rawPrediction` > column. > Code to reproduce the issue: > {code:java} > from pyspark.ml.classification import LogisticRegression, OneVsRest > from pyspark.ml.linalg import Vectors > from pyspark.sql import SparkSession > from sklearn.datasets import load_iris > spark = SparkSession.builder.getOrCreate() > X, y = load_iris(return_X_y=True) > df = spark.createDataFrame( > [(Vectors.dense(features), int(label)) for features, label in zip(X, y)], > ["features", "label"] > ) > train, test = df.randomSplit([0.8, 0.2]) > lor = LogisticRegression(maxIter=5) > ovr = OneVsRest(classifier=lor) > ovrModel = ovr.fit(train) > pred = ovrModel.transform(test) > pred.printSchema() > # This prints out: > # root > # |-- features: vector (nullable = true) > # |-- label: long (nullable = true) > # |-- rawPrediction: string (nullable = true) # <- should not be string > # |-- prediction: double (nullable = true) > # pred.show() # this fails because of the incorrect datatype{code} > I ran the code above using GitHub Actions: > [https://github.com/harupy/SPARK-35142/pull/1] > > It looks like the UDF to compute the `rawPrediction` column is generated > without specifying the return type: > > [https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/python/pyspark/ml/classification.py#L3154] > {code:java} > rawPredictionUDF = udf(func) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35175) Add linter for JavaScript source files
[ https://issues.apache.org/jira/browse/SPARK-35175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35175: Assignee: Kousuke Saruta (was: Apache Spark) > Add linter for JavaScript source files > -- > > Key: SPARK-35175 > URL: https://issues.apache.org/jira/browse/SPARK-35175 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > In the current master, there is no linter for JavaScript sources. > Let's add it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35175) Add linter for JavaScript source files
[ https://issues.apache.org/jira/browse/SPARK-35175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326507#comment-17326507 ] Apache Spark commented on SPARK-35175: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32274 > Add linter for JavaScript source files > -- > > Key: SPARK-35175 > URL: https://issues.apache.org/jira/browse/SPARK-35175 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > In the current master, there is no linter for JavaScript sources. > Let's add it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35175) Add linter for JavaScript source files
[ https://issues.apache.org/jira/browse/SPARK-35175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35175: Assignee: Apache Spark (was: Kousuke Saruta) > Add linter for JavaScript source files > -- > > Key: SPARK-35175 > URL: https://issues.apache.org/jira/browse/SPARK-35175 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > In the current master, there is no linter for JavaScript sources. > Let's add it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35175) Add linter for JavaScript source files
[ https://issues.apache.org/jira/browse/SPARK-35175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-35175: --- Summary: Add linter for JavaScript source files (was: Add linter for JavaScript sources) > Add linter for JavaScript source files > -- > > Key: SPARK-35175 > URL: https://issues.apache.org/jira/browse/SPARK-35175 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > In the current master, there is no linter for JavaScript sources. > Let's add it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35175) Add linter for JavaScript sources
Kousuke Saruta created SPARK-35175: -- Summary: Add linter for JavaScript sources Key: SPARK-35175 URL: https://issues.apache.org/jira/browse/SPARK-35175 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta In the current master, there is no linter for JavaScript sources. Let's add it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35143) Add default log config for spark-sql
[ https://issues.apache.org/jira/browse/SPARK-35143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326498#comment-17326498 ] Apache Spark commented on SPARK-35143: -- User 'ChenDou2021' has created a pull request for this issue: https://github.com/apache/spark/pull/32246 > Add default log config for spark-sql > > > Key: SPARK-35143 > URL: https://issues.apache.org/jira/browse/SPARK-35143 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Minor > > The default log level for spark-sql is WARN. How to change the log level is > confusing, so we need a default config. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35143) Add default log config for spark-sql
[ https://issues.apache.org/jira/browse/SPARK-35143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326496#comment-17326496 ] Apache Spark commented on SPARK-35143: -- User 'ChenDou2021' has created a pull request for this issue: https://github.com/apache/spark/pull/32254 > Add default log config for spark-sql > > > Key: SPARK-35143 > URL: https://issues.apache.org/jira/browse/SPARK-35143 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Minor > > The default log level for spark-sql is WARN. How to change the log level is > confusing, so we need a default config. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35143) Add default log config for spark-sql
[ https://issues.apache.org/jira/browse/SPARK-35143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326495#comment-17326495 ] Apache Spark commented on SPARK-35143: -- User 'ChenDou2021' has created a pull request for this issue: https://github.com/apache/spark/pull/32273 > Add default log config for spark-sql > > > Key: SPARK-35143 > URL: https://issues.apache.org/jira/browse/SPARK-35143 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Minor > > The default log level for spark-sql is WARN. How to change the log level is > confusing, so we need a default config. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35140) Establish error message guidelines
[ https://issues.apache.org/jira/browse/SPARK-35140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-35140: Assignee: Karen Feng > Establish error message guidelines > -- > > Key: SPARK-35140 > URL: https://issues.apache.org/jira/browse/SPARK-35140 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: Karen Feng >Assignee: Karen Feng >Priority: Major > > In the SPIP: Standardize Exception Messages in Spark, there are three major > improvements proposed: > # Group error messages in dedicated files. > # Establish an error message guideline for developers. > # Improve error message quality. > The second step is to establish the error message guideline. This was > discussed in > http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Build-error-message-guideline-td31076.html > and added to the website in > https://github.com/apache/spark-website/pull/332. To increase visibility, the > guidelines should be accessible from the PR template. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35140) Establish error message guidelines
[ https://issues.apache.org/jira/browse/SPARK-35140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35140. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32241 [https://github.com/apache/spark/pull/32241] > Establish error message guidelines > -- > > Key: SPARK-35140 > URL: https://issues.apache.org/jira/browse/SPARK-35140 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: Karen Feng >Assignee: Karen Feng >Priority: Major > Fix For: 3.2.0 > > > In the SPIP: Standardize Exception Messages in Spark, there are three major > improvements proposed: > # Group error messages in dedicated files. > # Establish an error message guideline for developers. > # Improve error message quality. > The second step is to establish the error message guideline. This was > discussed in > http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Build-error-message-guideline-td31076.html > and added to the website in > https://github.com/apache/spark-website/pull/332. To increase visibility, the > guidelines should be accessible from the PR template. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35154) Rpc env not shutdown when shutdown method call by endpoint onStop
[ https://issues.apache.org/jira/browse/SPARK-35154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIU updated SPARK-35154: Description: When I run this test, the RPC thread hangs and does not shut down gracefully. I think that when the RPC thread calls shutdown from the onStop method, it puts MessageLoop.PoisonPill into the queue so that the threads in the RPC pool return and stop. In Spark 3.x this makes the other threads return and stop, but the current thread, which called onStop, then waits for its own pool to stop, so it never terminates and the program hangs. I'm not sure whether this needs to be improved or not. {code:java} test("Rpc env not shutdown when shutdown method call by endpoint onStop") { val rpcEndpoint = new RpcEndpoint { override val rpcEnv: RpcEnv = env override def onStop(): Unit = { env.shutdown() env.awaitTermination() } override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = { case m => context.reply(m) } } env.setupEndpoint("test", rpcEndpoint) rpcEndpoint.stop() env.awaitTermination() }{code} > Rpc env not shutdown when shutdown method call by endpoint onStop > - > > Key: SPARK-35154 > URL: https://issues.apache.org/jira/browse/SPARK-35154 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: spark-3.x >Reporter: LIU >Priority: Minor > > When I run this test, the RPC thread hangs and does not shut down gracefully. > I think that when the RPC thread calls shutdown from the onStop method, it puts > MessageLoop.PoisonPill into the queue so that the threads in the RPC pool return and stop. > In Spark 3.x this makes the other threads return and stop, but the current thread, which > called onStop, then waits for its own pool to stop, so it never terminates and the program hangs. > I'm not sure whether this needs to be improved or not. > > {code:java} > test("Rpc env not shutdown when shutdown method call by endpoint onStop") { > val rpcEndpoint = new RpcEndpoint { > override val rpcEnv: RpcEnv = env > override def onStop(): Unit = { > env.shutdown() > env.awaitTermination() > } > override def receiveAndReply(context: RpcCallContext): > PartialFunction[Any, Unit] = { > case m => context.reply(m) > } > } > env.setupEndpoint("test", rpcEndpoint) > rpcEndpoint.stop() > env.awaitTermination() > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35154) Rpc env not shutdown when shutdown method call by endpoint onStop
[ https://issues.apache.org/jira/browse/SPARK-35154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIU updated SPARK-35154: Issue Type: Improvement (was: Bug) > Rpc env not shutdown when shutdown method call by endpoint onStop > - > > Key: SPARK-35154 > URL: https://issues.apache.org/jira/browse/SPARK-35154 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: spark-3.x >Reporter: LIU >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35154) Rpc env not shutdown when shutdown method call by endpoint onStop
[ https://issues.apache.org/jira/browse/SPARK-35154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIU updated SPARK-35154: Priority: Minor (was: Major) > Rpc env not shutdown when shutdown method call by endpoint onStop > - > > Key: SPARK-35154 > URL: https://issues.apache.org/jira/browse/SPARK-35154 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: spark-3.x >Reporter: LIU >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35154) Rpc env not shutdown when shutdown method call by endpoint onStop
[ https://issues.apache.org/jira/browse/SPARK-35154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIU updated SPARK-35154: Description: (was: When I run this test, the RPC thread hangs and does not shut down gracefully. I think that when the RPC thread calls shutdown from the onStop method, it puts MessageLoop.PoisonPill into the queue so that the threads in the RPC pool return and stop. In Spark 3.x this makes the other threads return and stop, but the current thread, which called onStop, then waits for its own pool to stop, so it never terminates and the program hangs. I'm not sure whether this needs to be improved or not. {code:java} test("Rpc env not shutdown when shutdown method call by endpoint onStop") { val rpcEndpoint = new RpcEndpoint { override val rpcEnv: RpcEnv = env override def onStop(): Unit = { env.shutdown() env.awaitTermination() } override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = { case m => context.reply(m) } } env.setupEndpoint("test", rpcEndpoint) rpcEndpoint.stop() env.awaitTermination() }{code} ) > Rpc env not shutdown when shutdown method call by endpoint onStop > - > > Key: SPARK-35154 > URL: https://issues.apache.org/jira/browse/SPARK-35154 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 > Environment: spark-3.x >Reporter: LIU >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326460#comment-17326460 ] Jungtaek Lim commented on SPARK-34198: -- Please note that the decision was made through community discussion. If you want to change it, please bring it back to the community with a rationale. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently Spark SS has only one built-in StateStore implementation, > HDFSBackedStateStore, which actually uses an in-memory map to store state rows. As > there are more and more streaming applications, some of them require > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large state usage, but Spark SS > still lacks a built-in state store that meets this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore to > Spark SS. Given the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35171) Declare the markdown package as a dependency of the SparkR package
[ https://issues.apache.org/jira/browse/SPARK-35171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35171. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32270 [https://github.com/apache/spark/pull/32270] > Declare the markdown package as a dependency of the SparkR package > -- > > Key: SPARK-35171 > URL: https://issues.apache.org/jira/browse/SPARK-35171 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.2.0 > > > If we didn't install pandoc locally, the make-distribution package will fail > with the following message: > {quote} > --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown > Warning in engine$weave(file, quiet = quiet, encoding = enc) : > Pandoc (>= 1.12.3) not available. Falling back to R Markdown v1. > Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics: > The 'markdown' package should be declared as a dependency of the 'SparkR' > package (e.g., in the 'Suggests' field of DESCRIPTION), because the latter > contains vignette(s) built with the 'markdown' package. Please see > https://github.com/yihui/knitr/issues/1864 for more information. > --- failed re-building ‘sparkr-vignettes.Rmd’ > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35171) Declare the markdown package as a dependency of the SparkR package
[ https://issues.apache.org/jira/browse/SPARK-35171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-35171: Assignee: Yuanjian Li > Declare the markdown package as a dependency of the SparkR package > -- > > Key: SPARK-35171 > URL: https://issues.apache.org/jira/browse/SPARK-35171 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > > If we didn't install pandoc locally, the make-distribution package will fail > with the following message: > {quote} > --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown > Warning in engine$weave(file, quiet = quiet, encoding = enc) : > Pandoc (>= 1.12.3) not available. Falling back to R Markdown v1. > Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics: > The 'markdown' package should be declared as a dependency of the 'SparkR' > package (e.g., in the 'Suggests' field of DESCRIPTION), because the latter > contains vignette(s) built with the 'markdown' package. Please see > https://github.com/yihui/knitr/issues/1864 for more information. > --- failed re-building ‘sparkr-vignettes.Rmd’ > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35126) Execute jdbc cancellation method when jdbc load job is interrupted
[ https://issues.apache.org/jira/browse/SPARK-35126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35126: Assignee: (was: Apache Spark) > Execute jdbc cancellation method when jdbc load job is interrupted > -- > > Key: SPARK-35126 > URL: https://issues.apache.org/jira/browse/SPARK-35126 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 > Environment: Environment version: > * spark3.1.1 > * jdk1.8.201 > * scala2.12 > * mysql5.7.31 > * mysql-connector-java-5.1.32.jar /mysql-connector-java-8.0.32.jar >Reporter: zhangrenhua >Priority: Major > Original Estimate: 2h > Remaining Estimate: 2h > > I have a long-running Spark service that continuously receives and runs Spark > programs submitted by clients. One of these programs loads a JDBC table whose > query SQL is very complicated, and each execution takes a lot of time and > resources. The client may interrupt such a job at any time, and I found that > after the job was interrupted, the database SELECT process was still executing > and had not been killed. > > *Scene demonstration:* > 1. Prepare two tables: SPARK_TEST1/SPARK_TEST2 (each of which has 1000 > records) > 2. Test code > {code:java} > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.SparkSession; > import java.util.concurrent.TimeUnit; > /** > * jdbc load cancel test > * > * @author gavin > * @create 2021/4/18 10:58 > */ > public class JdbcLoadCancelTest { > public static void main(String[] args) throws Exception { > final SparkConf sparkConf = new SparkConf(); > sparkConf.setAppName("jdbc load test"); > sparkConf.setMaster("local[*]"); > final SparkContext sparkContext = new SparkContext(sparkConf); > final SparkSession sparkSession = new SparkSession(sparkContext); > // This is a sql that takes about a minute to execute > String querySql = "select t1.*\n" + > "from SPARK_TEST1 t1\n" + > "left join SPARK_TEST1 t2 on 1=1\n" + > "left join (select aa from SPARK_TEST1 limit 3) t3 on 1=1"; > // Specify job information > final String jobGroup = "test"; > sparkContext.clearJobGroup(); > sparkContext.setJobGroup(jobGroup, "test", true); > // Start the independent thread to start the jdbc load test logic > new Thread(() -> { > final Dataset<Row> table = sparkSession.read() > > .format("org.apache.spark.sql.execution.datasources.jdbc3") > .option("url", > "jdbc:mysql://192.168.10.226:32320/test?useUnicode=true&characterEncoding=utf-8&useSSL=false") > .option("user", "root") > .option("password", "123456") > .option("query", querySql) > .load(); > // Print the first data > System.out.println(table.limit(1).first()); > }).start(); > // Wait for the jdbc load job to start > TimeUnit.SECONDS.sleep(10); > // Cancel the job just now > sparkContext.cancelJobGroup(jobGroup); > // Simulate a long-running service without stopping the driver > process, which is used to wait for new jobs to be received > TimeUnit.SECONDS.sleep(Integer.MAX_VALUE); > } > } > {code} > > 3. View the mysql process > {code:java} > select * from information_schema.`PROCESSLIST` where info is not null;{code} > Ten seconds after the program started, the job was interrupted, but the > database query process was still running and had not been killed.
> > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35126) Execute jdbc cancellation method when jdbc load job is interrupted
[ https://issues.apache.org/jira/browse/SPARK-35126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35126: Assignee: Apache Spark > Execute jdbc cancellation method when jdbc load job is interrupted > -- > > Key: SPARK-35126 > URL: https://issues.apache.org/jira/browse/SPARK-35126 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 > Environment: Environment version: > * spark3.1.1 > * jdk1.8.201 > * scala2.12 > * mysql5.7.31 > * mysql-connector-java-5.1.32.jar /mysql-connector-java-8.0.32.jar >Reporter: zhangrenhua >Assignee: Apache Spark >Priority: Major > Original Estimate: 2h > Remaining Estimate: 2h > > I have a long-running Spark service that continuously receives and runs Spark > programs submitted by clients. One of these programs loads a JDBC table whose > query SQL is very complicated, and each execution takes a lot of time and > resources. The client may interrupt such a job at any time, and I found that > after the job was interrupted, the database SELECT process was still executing > and had not been killed. > > *Scene demonstration:* > 1. Prepare two tables: SPARK_TEST1/SPARK_TEST2 (each of which has 1000 > records) > 2. Test code > {code:java} > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.SparkSession; > import java.util.concurrent.TimeUnit; > /** > * jdbc load cancel test > * > * @author gavin > * @create 2021/4/18 10:58 > */ > public class JdbcLoadCancelTest { > public static void main(String[] args) throws Exception { > final SparkConf sparkConf = new SparkConf(); > sparkConf.setAppName("jdbc load test"); > sparkConf.setMaster("local[*]"); > final SparkContext sparkContext = new SparkContext(sparkConf); > final SparkSession sparkSession = new SparkSession(sparkContext); > // This is a sql that takes about a minute to execute > String querySql = "select t1.*\n" + > "from SPARK_TEST1 t1\n" + > "left join SPARK_TEST1 t2 on 1=1\n" + > "left join (select aa from SPARK_TEST1 limit 3) t3 on 1=1"; > // Specify job information > final String jobGroup = "test"; > sparkContext.clearJobGroup(); > sparkContext.setJobGroup(jobGroup, "test", true); > // Start the independent thread to start the jdbc load test logic > new Thread(() -> { > final Dataset<Row> table = sparkSession.read() > > .format("org.apache.spark.sql.execution.datasources.jdbc3") > .option("url", > "jdbc:mysql://192.168.10.226:32320/test?useUnicode=true&characterEncoding=utf-8&useSSL=false") > .option("user", "root") > .option("password", "123456") > .option("query", querySql) > .load(); > // Print the first data > System.out.println(table.limit(1).first()); > }).start(); > // Wait for the jdbc load job to start > TimeUnit.SECONDS.sleep(10); > // Cancel the job just now > sparkContext.cancelJobGroup(jobGroup); > // Simulate a long-running service without stopping the driver > process, which is used to wait for new jobs to be received > TimeUnit.SECONDS.sleep(Integer.MAX_VALUE); > } > } > {code} > > 3. View the mysql process > {code:java} > select * from information_schema.`PROCESSLIST` where info is not null;{code} > Ten seconds after the program started, the job was interrupted, but the > database query process was still running and had not been killed.
> > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35174) Avoid opening watch when waitAppCompletion is false
Jonathan Lafleche created SPARK-35174: - Summary: Avoid opening watch when waitAppCompletion is false Key: SPARK-35174 URL: https://issues.apache.org/jira/browse/SPARK-35174 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.1.1 Reporter: Jonathan Lafleche In spark-submit, we currently [open a pod watch for any spark submission|https://github.com/apache/spark/blame/0494dc90af48ce7da0625485a4dc6917a244d580/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L150-L167]. If WAIT_FOR_APP_COMPLETION is false, we then immediately ignore the result of the watcher and break out of the watcher. When submitting spark applications at scale, this is a source of operational pain, since opening the watch relies on opening a websocket, which tends to run into subtle networking issues around negotiating the websocket connection. I'd like to change this behaviour so that we eagerly check whether we are waiting on app completion, and avoid opening the watch altogether when WAIT_FOR_APP_COMPLETION is false. Would you accept a contribution for that change, or are there any concerns I've overlooked? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326427#comment-17326427 ] Yuanjian Li commented on SPARK-34198: - [~viirya] Since the RocksDBStateStore can solve the major drawbacks of the current HDFS-based one, I think it's a better choice to add it directly as a built-in RocksDBStateStoreProvider. That would also make it convenient for end users to choose it directly. If you agree, I will change the description and the title of this ticket. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently Spark SS has only one built-in StateStore implementation, > HDFSBackedStateStore, which actually uses an in-memory map to store state rows. As > there are more and more streaming applications, some of them require > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large state usage, but Spark SS > still lacks a built-in state store that meets this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore to > Spark SS. Given the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35143) Add default log config for spark-sql
[ https://issues.apache.org/jira/browse/SPARK-35143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326416#comment-17326416 ] Apache Spark commented on SPARK-35143: -- User 'ChenDou2021' has created a pull request for this issue: https://github.com/apache/spark/pull/32273 > Add default log config for spark-sql > > > Key: SPARK-35143 > URL: https://issues.apache.org/jira/browse/SPARK-35143 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Minor > > The default log level for spark-sql is WARN. How to change the log level is > confusing, so we need a default config. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35172) The implementation of RocksDBCheckpointMetadata
[ https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326411#comment-17326411 ] Apache Spark commented on SPARK-35172: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/32272 > The implementation of RocksDBCheckpointMetadata > --- > > Key: SPARK-35172 > URL: https://issues.apache.org/jira/browse/SPARK-35172 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Priority: Major > > The RocksDBCheckpointMetadata persists the metadata for each committed batch > in JSON format. The object contains all RocksDB file names and the number of > total keys. > The metadata binds closely with the directory structure of > RocksDBFileManager, as described in the design doc - [Directory Structure and > Format for Files stored in > DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
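To make the metadata described above concrete, here is a purely illustrative Python sketch of a checkpoint-metadata object serialized to JSON; the class and field names (sst_files, num_keys) are invented for this example and do not reproduce Spark's actual RocksDBCheckpointMetadata schema, which is specified in the linked design doc.

{code:python}
import json
from dataclasses import dataclass, asdict
from typing import List

# Hypothetical stand-in: which RocksDB files make up a committed batch,
# plus the total number of keys, persisted as JSON.
@dataclass
class CheckpointMetadata:
    sst_files: List[str]
    num_keys: int

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(s: str) -> "CheckpointMetadata":
        return CheckpointMetadata(**json.loads(s))

meta = CheckpointMetadata(sst_files=["000009.sst", "000013.sst"], num_keys=12345)
print(meta.to_json())  # {"sst_files": ["000009.sst", "000013.sst"], "num_keys": 12345}
{code}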
[jira] [Commented] (SPARK-35172) The implementation of RocksDBCheckpointMetadata
[ https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326410#comment-17326410 ] Apache Spark commented on SPARK-35172: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/32272 > The implementation of RocksDBCheckpointMetadata > --- > > Key: SPARK-35172 > URL: https://issues.apache.org/jira/browse/SPARK-35172 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Priority: Major > > The RocksDBCheckpointMetadata persists the metadata for each committed batch > in JSON format. The object contains all RocksDB file names and the number of > total keys. > The metadata binds closely with the directory structure of > RocksDBFileManager, as described in the design doc - [Directory Structure and > Format for Files stored in > DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35172) The implementation of RocksDBCheckpointMetadata
[ https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35172: Assignee: Apache Spark > The implementation of RocksDBCheckpointMetadata > --- > > Key: SPARK-35172 > URL: https://issues.apache.org/jira/browse/SPARK-35172 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > > The RocksDBCheckpointMetadata persists the metadata for each committed batch > in JSON format. The object contains all RocksDB file names and the number of > total keys. > The metadata binds closely with the directory structure of > RocksDBFileManager, as described in the design doc - [Directory Structure and > Format for Files stored in > DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35172) The implementation of RocksDBCheckpointMetadata
[ https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35172: Assignee: (was: Apache Spark) > The implementation of RocksDBCheckpointMetadata > --- > > Key: SPARK-35172 > URL: https://issues.apache.org/jira/browse/SPARK-35172 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Priority: Major > > The RocksDBCheckpointMetadata persists the metadata for each committed batch > in JSON format. The object contains all RocksDB file names and the number of > total keys. > The metadata binds closely with the directory structure of > RocksDBFileManager, as described in the design doc - [Directory Structure and > Format for Files stored in > DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35172) The implementation of RocksDBCheckpointMetadata
[ https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35172: Assignee: Apache Spark > The implementation of RocksDBCheckpointMetadata > --- > > Key: SPARK-35172 > URL: https://issues.apache.org/jira/browse/SPARK-35172 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > > The RocksDBCheckpointMetadata persists the metadata for each committed batch > in JSON format. The object contains all RocksDB file names and the number of > total keys. > The metadata binds closely with the directory structure of > RocksDBFileManager, as described in the design doc - [Directory Structure and > Format for Files stored in > DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35172) The implementation of RocksDBCheckpointMetadata
[ https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-35172: Summary: The implementation of RocksDBCheckpointMetadata (was: Implementation for RocksDBCheckpointMetadata) > The implementation of RocksDBCheckpointMetadata > --- > > Key: SPARK-35172 > URL: https://issues.apache.org/jira/browse/SPARK-35172 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Priority: Major > > The RocksDBCheckpointMetadata persists the metadata for each committed batch > in JSON format. The object contains all RocksDB file names and the number of > total keys. > The metadata binds closely with the directory structure of > RocksDBFileManager, as described in the design doc - [Directory Structure and > Format for Files stored in > DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35172) Implementation for RocksDBCheckpointMetadata
[ https://issues.apache.org/jira/browse/SPARK-35172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-35172: Description: The RocksDBCheckpointMetadata persists the metadata for each committed batch in JSON format. The object contains all RocksDB file names and the number of total keys. The metadata binds closely with the directory structure of RocksDBFileManager, as described in the design doc - [Directory Structure and Format for Files stored in DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2] was: The RocksDBCheckpointMetadata persists the metadata for each committed batch in JSON format. The schema for the object contains all RocksDB file names and the number of total keys. The metadata binds closely with the directory structure of RocksDBFileManager, as described in the design doc - [Directory Structure and Format for Files stored in DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2] > Implementation for RocksDBCheckpointMetadata > > > Key: SPARK-35172 > URL: https://issues.apache.org/jira/browse/SPARK-35172 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Priority: Major > > The RocksDBCheckpointMetadata persists the metadata for each committed batch > in JSON format. The object contains all RocksDB file names and the number of > total keys. > The metadata binds closely with the directory structure of > RocksDBFileManager, as described in the design doc - [Directory Structure and > Format for Files stored in > DFS|https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org