[jira] [Created] (SPARK-30400) Test failure in SQL module on ppc64le
AK97 created SPARK-30400: Summary: Test failure in SQL module on ppc64le Key: SPARK-30400 URL: https://issues.apache.org/jira/browse/SPARK-30400 Project: Spark Issue Type: Test Components: Build Affects Versions: 2.4.0 Environment: os: rhel 7.6 arch: ppc64le Reporter: AK97 I have been trying to build Apache Spark on rhel_7.6/ppc64le; however, the test cases are failing in the SQL module with the following errors: - CREATE TABLE USING AS SELECT based on the file without write permission *** FAILED *** Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown (CreateTableAsSelectSuite.scala:92) - create a table, drop it and create another one with the same name *** FAILED *** org.apache.spark.sql.AnalysisException: Table default.jsonTable already exists. You need to drop it first.; at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:159) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) I would like some help understanding the cause of these failures. I am running the build on a high-end VM with good connectivity. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23626) DAGScheduler blocked due to JobSubmitted event
[ https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-23626. -- Resolution: Won't Fix > DAGScheduler blocked due to JobSubmitted event > --- > > Key: SPARK-23626 > URL: https://issues.apache.org/jira/browse/SPARK-23626 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.1, 2.3.3, 2.4.3, 3.0.0 >Reporter: Ajith S >Priority: Major > > DAGScheduler becomes a bottleneck in the cluster when multiple JobSubmitted > events have to be processed, because DAGSchedulerEventProcessLoop is single threaded > and it blocks other events in the queue, such as TaskCompletion. > Handling a JobSubmitted event is time consuming depending on the nature of the job > (for example: calculating parent stage dependencies, shuffle dependencies, > partitions), and thus it blocks all subsequent events from being processed. > > I see multiple JIRAs referring to this behavior: > https://issues.apache.org/jira/browse/SPARK-2647 > https://issues.apache.org/jira/browse/SPARK-4961 > > Similarly, in my cluster the partition calculation for some jobs is time consuming > (similar to the stack in SPARK-2647), which slows down the > DAGSchedulerEventProcessLoop and in turn slows down user jobs, even when their tasks > finish within seconds, because TaskCompletion events are processed > at a slower rate due to the blockage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
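The bottleneck described above is generic to any single-threaded event loop: one slow handler delays every queued event behind it. The sketch below is not Spark's DAGSchedulerEventProcessLoop; it is a minimal stand-in with made-up event types and handler costs, showing why a slow JobSubmitted handler holds up TaskCompletion handling.

{code:java}
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical event types standing in for DAGScheduler events.
sealed trait Event
case class JobSubmitted(id: Int) extends Event      // expensive: stage/partition computation
case class TaskCompletion(id: Int) extends Event    // cheap, but queued behind JobSubmitted

object SingleThreadedLoopDemo {
  private val queue = new LinkedBlockingQueue[Event]()

  // One consumer thread processes every event in arrival order, so a slow
  // JobSubmitted handler delays all TaskCompletion events behind it.
  private val loop = new Thread(() => {
    while (true) {
      queue.take() match {
        case JobSubmitted(id)   => Thread.sleep(5000); println(s"job $id planned")
        case TaskCompletion(id) => println(s"task $id recorded")
      }
    }
  })

  def main(args: Array[String]): Unit = {
    loop.setDaemon(true)
    loop.start()
    queue.put(JobSubmitted(1))                            // takes ~5s to handle
    (1 to 10).foreach(i => queue.put(TaskCompletion(i)))  // all wait behind the job
    Thread.sleep(7000)                                    // give the loop time to drain
  }
}
{code}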
[jira] [Updated] (SPARK-30399) Bucketing is not compatible with partitioning in practice
[ https://issues.apache.org/jira/browse/SPARK-30399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shay Elbaz updated SPARK-30399: --- Description: When using a Spark bucketed table, Spark uses as many partitions as the number of buckets for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" tables, but is quite disastrous for _time-partitioned_ tables. In our use case, a daily-partitioned key-value table grows by 100GB every day, so after 100 days there are 10TB of data we want to join with. For this scenario, we need thousands of buckets if we want every task to successfully *read and sort* all of its data in a map-side join. But in that case, every daily increment would emit thousands of small files, leading to other big issues. In practice, and hoping for some hidden optimization, we set the number of buckets to 1000 and backfilled such a table with 10TB. When trying to join with the smallest input, every executor was killed by Yarn due to over-allocating memory in the sorting phase. Even without such failures, it would take every executor an unreasonable amount of time to locally sort all its data. A question on SO remained unanswered for a while, so I thought I would ask here - is it by design that buckets cannot be used in time-partitioned tables, or am I doing something wrong? was: When using a Spark bucketed table, Spark uses as many partitions as the number of buckets for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" tables, but is quite disastrous for _time-partitioned_ tables. In our use case, a daily-partitioned key-value table grows by 100GB every day, so after 100 days there are 10TB of data we want to join with - for this scenario, we need thousands of buckets if we want every task to successfully *read and sort* all of its data in a map-side join. But in that case, every daily increment would emit thousands of small files, leading to other big issues. In practice, and hoping for some hidden optimization, we set the number of buckets to 1000 and backfilled such a table with 10TB. When trying to join with the smallest input, every executor was killed by Yarn due to over-allocating memory in the sorting phase. Even without such failures, it would take every executor an unreasonable amount of time to locally sort all its data. A question on SO remained unanswered for a while, so I thought I would ask here - is it by design that buckets cannot be used in time-partitioned tables, or am I doing something wrong? > Bucketing is not compatible with partitioning in practice > --- > > Key: SPARK-30399 > URL: https://issues.apache.org/jira/browse/SPARK-30399 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: HDP 2.7 >Reporter: Shay Elbaz >Priority: Minor > > When using a Spark bucketed table, Spark uses as many partitions as the > number of buckets for the map-side join > (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" > tables, but is quite disastrous for _time-partitioned_ tables. In our use case, > a daily-partitioned key-value table grows by 100GB every day, so after 100 days > there are 10TB of data we want to join with. For this scenario, we need thousands > of buckets if we want every task to successfully *read and sort* all of its data > in a map-side join. But in that case, every daily increment would emit thousands > of small files, leading to other big issues. 
> In practice, and hoping for some hidden optimization, we set the number of > buckets to 1000 and backfilled such a table with 10TB. When trying to join with > the smallest input, every executor was killed by Yarn due to over-allocating > memory in the sorting phase. Even without such failures, it would take every > executor an unreasonable amount of time to locally sort all its data. > A question on SO remained unanswered for a while, so I thought I would ask here - > is it by design that buckets cannot be used in time-partitioned tables, or am I > doing something wrong? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30399) Bucketing is not compatible with partitioning in practice
Shay Elbaz created SPARK-30399: -- Summary: Bucketing is not compatible with partitioning in practice Key: SPARK-30399 URL: https://issues.apache.org/jira/browse/SPARK-30399 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: HDP 2.7 Reporter: Shay Elbaz When using a Spark bucketed table, Spark uses as many partitions as the number of buckets for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" tables, but is quite disastrous for _time-partitioned_ tables. In our use case, a daily-partitioned key-value table grows by 100GB every day, so after 100 days there are 10TB of data we want to join with - for this scenario, we need thousands of buckets if we want every task to successfully *read and sort* all of its data in a map-side join. But in that case, every daily increment would emit thousands of small files, leading to other big issues. In practice, and hoping for some hidden optimization, we set the number of buckets to 1000 and backfilled such a table with 10TB. When trying to join with the smallest input, every executor was killed by Yarn due to over-allocating memory in the sorting phase. Even without such failures, it would take every executor an unreasonable amount of time to locally sort all its data. A question on SO remained unanswered for a while, so I thought I would ask here - is it by design that buckets cannot be used in time-partitioned tables, or am I doing something wrong? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
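To make the file-explosion arithmetic concrete, here is a minimal sketch of the table layout described above. The table and column names (`events`, `dt`, `key`) are illustrative, not from the ticket: with daily partitions and 1000 buckets, each daily increment writes on the order of a thousand bucket files, and a bucketed join reads each bucket with a single task that must sort that bucket's rows across every date partition.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketing-vs-partitioning").getOrCreate()

// Hypothetical daily key-value data: 'dt' is the daily partition, 'key' the join key.
val df = spark.range(0, 1000000L).selectExpr(
  "id AS key",
  "cast(id % 100 AS string) AS value",
  "date_add('2019-01-01', cast(id % 3 AS int)) AS dt")

// Every daily partition is split into `numBuckets` bucket files, so 100 days x 1000
// buckets is on the order of 100,000 small files.
df.write
  .partitionBy("dt")
  .bucketBy(1000, "key")
  .sortBy("key")
  .format("parquet")
  .saveAsTable("events")

// A join on the bucket key uses one task per bucket, and each task has to read and
// sort its bucket's rows from all date partitions, which is where the memory
// pressure described in the report comes from.
val small = spark.range(0, 1000L).selectExpr("id AS key")
spark.table("events").join(small, "key").count()
{code}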
[jira] [Commented] (SPARK-24202) Separate SQLContext dependency from SparkSession.implicits
[ https://issues.apache.org/jira/browse/SPARK-24202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006214#comment-17006214 ] Sam hendley commented on SPARK-24202: - I agree that this would be a very valuable change; was there a reason this was closed without comment? > Separate SQLContext dependency from SparkSession.implicits > -- > > Key: SPARK-24202 > URL: https://issues.apache.org/jira/browse/SPARK-24202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gerard Maas >Priority: Major > Labels: bulk-closed > > The current implementation of the implicits in SparkSession passes the > currently active SQLContext to the SQLImplicits class. This implies that all > usage of these (extremely helpful) implicits requires the prior creation of a > SparkSession instance. > Usage is typically done as follows: > > {code:java} > val sparkSession = SparkSession.builder() > .getOrCreate() > import sparkSession.implicits._ > {code} > > This is OK in user code, but it burdens the creation of library code that > uses Spark, where static imports for _Encoder_ support are required. > A simple example would be: > > {code:java} > class SparkTransformation[In: Encoder, Out: Encoder] { > def transform(ds: Dataset[In]): Dataset[Out] > } > {code} > > Attempting to compile such code would result in the following exception: > {code:java} > Unable to find encoder for type stored in a Dataset. Primitive types (Int, > String, etc) and Product types (case classes) are supported by importing > spark.implicits._ Support for serializing other types will be added in > future releases.{code} > The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two > utilities that transform an _RDD_ or a local collection into a _Dataset_. > These are 2 of the 46 implicit conversions offered by this class. > The request is to move the two implicit methods that depend on the > SQLContext instance into a separate class: > {code:java} > SQLImplicits#214-229 > /** > * Creates a [[Dataset]] from an RDD. > * > * @since 1.6.0 > */ > implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = > { > DatasetHolder(_sqlContext.createDataset(rdd)) > } > /** > * Creates a [[Dataset]] from a local Seq. > * @since 1.6.0 > */ > implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): > DatasetHolder[T] = { > DatasetHolder(_sqlContext.createDataset(s)) > }{code} > By separating the static methods from these two methods that depend on > _sqlContext_, we could provide static imports for all the other functionality > and only require the instance-bound implicits for the RDD and collection support > (which is an uncommon use case these days). > As this potentially breaks the current interface, it might be a candidate for > Spark 3.0, although there's nothing stopping us from creating a separate > hierarchy for the static encoders already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
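As a concrete illustration of the library-code pain point, the sketch below derives "static" encoders from `org.apache.spark.sql.Encoders`, which needs no SparkSession or SQLContext instance. This is a user-side workaround sketch, not the `SQLImplicits` split that the ticket proposes.

{code:java}
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// Library-side "static" encoders: nothing here requires a SparkSession instance,
// unlike importing sparkSession.implicits._.
object StaticEncoders {
  implicit val intEncoder: Encoder[Int] = Encoders.scalaInt
  implicit val stringEncoder: Encoder[String] = Encoders.STRING
  // Product encoders can be derived per case class the same way, e.g. for a class C:
  // implicit val cEncoder: Encoder[C] = Encoders.product[C]
}

// The library class from the description now compiles without a SparkSession in
// scope, as long as callers bring suitable encoders (StaticEncoders._ or their own).
abstract class SparkTransformation[In: Encoder, Out: Encoder] {
  def transform(ds: Dataset[In]): Dataset[Out]
}
{code}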
[jira] [Commented] (SPARK-26734) StackOverflowError on WAL serialization caused by large receivedBlockQueue
[ https://issues.apache.org/jira/browse/SPARK-26734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006211#comment-17006211 ] Stephen commented on SPARK-26734: - Is this bug fixed? I still see the same error at Spark 2.4.3 > StackOverflowError on WAL serialization caused by large receivedBlockQueue > -- > > Key: SPARK-26734 > URL: https://issues.apache.org/jira/browse/SPARK-26734 > Project: Spark > Issue Type: Bug > Components: Block Manager, DStreams >Affects Versions: 2.3.1, 2.3.2, 2.4.0 > Environment: spark 2.4.0 streaming job > java 1.8 > scala 2.11.12 >Reporter: Ross M. Lodge >Assignee: Ross M. Lodge >Priority: Major > Fix For: 2.3.4, 2.4.1, 3.0.0 > > > We encountered an intermittent StackOverflowError with a stack trace similar > to: > > {noformat} > Exception in thread "JobGenerator" java.lang.StackOverflowError > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509){noformat} > The name of the thread has been seen to be either "JobGenerator" or > "streaming-start", depending on when in the lifecycle of the job the problem > occurs. It appears to only occur in streaming jobs with checkpointing and > WAL enabled; this has prevented us from upgrading to v2.4.0. > > Via debugging, we tracked this down to allocateBlocksToBatch in > ReceivedBlockTracker: > {code:java} > /** > * Allocate all unallocated blocks to the given batch. > * This event will get written to the write ahead log (if enabled). > */ > def allocateBlocksToBatch(batchTime: Time): Unit = synchronized { > if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) { > val streamIdToBlocks = streamIds.map { streamId => > (streamId, getReceivedBlockQueue(streamId).clone()) > }.toMap > val allocatedBlocks = AllocatedBlocks(streamIdToBlocks) > if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) { > streamIds.foreach(getReceivedBlockQueue(_).clear()) > timeToAllocatedBlocks.put(batchTime, allocatedBlocks) > lastAllocatedBatchTime = batchTime > } else { > logInfo(s"Possibly processed batch $batchTime needs to be processed > again in WAL recovery") > } > } else { > // This situation occurs when: > // 1. WAL is ended with BatchAllocationEvent, but without > BatchCleanupEvent, > // possibly processed batch job or half-processed batch job need to be > processed again, > // so the batchTime will be equal to lastAllocatedBatchTime. > // 2. Slow checkpointing makes recovered batch time older than WAL > recovered > // lastAllocatedBatchTime. > // This situation will only occurs in recovery time. 
> logInfo(s"Possibly processed batch $batchTime needs to be processed again > in WAL recovery") > } > } > {code} > Prior to 2.3.1, this code did > {code:java} > getReceivedBlockQueue(streamId).dequeueAll(x => true){code} > but it was changed as part of SPARK-23991 to > {code:java} > getReceivedBlockQueue(streamId).clone(){code} > We've not been able to reproduce this in a test of the actual above method, > but we've been able to produce a test that reproduces it by putting a lot of > values into the queue: > > {code:java} > class SerializationFailureTest extends FunSpec { > private val logger = LoggerFactory.getLogger(getClass) > private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo] > describe("Queue") { > it("should be serializable") { > runTest(1062) > } > it("should not be serializable") { > runTest(1063) > } > it("should DEFINITELY not be serializable") { > runTest(199952) > } > } > private def runTest(mx: Int): Array[Byte] = { > try { > val random = new scala.util.Random() > val queue = new ReceivedBlockQueue() > for (_ <- 0 until mx) { > queue += ReceivedBlockInfo( > streamId = 0, > numRecords = Some(random.nextInt(5)), > metadataOption = None, > blockStoreResult = WriteAheadLogBasedStoreResult( > blockId = StreamBlockId(0, random.nextInt()), > numRecords = Some(random.nextInt(5)), > walRecordHandle = FileBasedWriteAheadLogSegment( > path = > s"""hdfs://foo.bar.com:8080/spark/streaming/BAZ/7/receivedData/0/log-${random.nextInt()}-${random.nextInt()}""", > offset = random.nextLong(), > length
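The reproduction above works because, in Scala 2.11/2.12, `mutable.Queue` is backed by a linked list, so default Java serialization recurses once per element and can blow the stack for queues with a few thousand entries. The sketch below shows the failure shape and a generic way around it (copying to an indexed collection before serializing); it is not the fix that was merged for this ticket.

{code:java}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.mutable

object QueueSerializationSketch {
  def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val queue = mutable.Queue[Int]()
    (1 to 200000).foreach(queue += _)

    // Serializing the linked queue recurses element by element and typically throws
    // java.lang.StackOverflowError with a default-sized stack, just like the WAL
    // write in the report.
    try serialize(queue)
    catch { case _: StackOverflowError => println("queue: StackOverflowError") }

    // Copying to an indexed, flat collection first avoids the deep recursion.
    val flat: Vector[Int] = queue.toVector
    println(s"vector: ${serialize(flat).length} bytes, no deep recursion")
  }
}
{code}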
[jira] [Updated] (SPARK-30339) Avoid failing twice in function lookup
[ https://issues.apache.org/jira/browse/SPARK-30339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-30339: - Fix Version/s: 2.4.5 > Avoid failing twice in function lookup > -- > > Key: SPARK-30339 > URL: https://issues.apache.org/jira/browse/SPARK-30339 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > Currently, if function lookup fails, Spark gives it a second chance by > casting decimal-typed arguments to double. But for cases where no decimal type > exists, it's meaningless to look up again, and doing so causes extra cost such as > an unnecessary metastore access. We should throw the exception directly in these > cases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30363) Add Documentation for Refresh Resources
[ https://issues.apache.org/jira/browse/SPARK-30363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30363. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27023 [https://github.com/apache/spark/pull/27023] > Add Documentation for Refresh Resources > --- > > Key: SPARK-30363 > URL: https://issues.apache.org/jira/browse/SPARK-30363 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Assignee: Rakesh Raushan >Priority: Minor > Fix For: 3.0.0 > > > Refresh Resources is not documented in the docs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30363) Add Documentation for Refresh Resources
[ https://issues.apache.org/jira/browse/SPARK-30363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30363: Assignee: Rakesh Raushan > Add Documentation for Refresh Resources > --- > > Key: SPARK-30363 > URL: https://issues.apache.org/jira/browse/SPARK-30363 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Assignee: Rakesh Raushan >Priority: Minor > > Refresh Resources is not documented in the docs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30389) ADD JAR should not allow adding files other than jar files
[ https://issues.apache.org/jira/browse/SPARK-30389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30389. -- Resolution: Not A Problem > ADD JAR should not allow adding files other than jar files > > > Key: SPARK-30389 > URL: https://issues.apache.org/jira/browse/SPARK-30389 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > spark-sql> add jar /opt/abhi/udf/test1jar/12.txt; > ADD JAR /opt/abhi/udf/test1jar/12.txt > Added [/opt/abhi/udf/test1jar/12.txt] to class path > spark-sql> list jar; > spark://vm1:45169/jars/12.txt -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30392) Documentation for the date_trunc function is incorrect
[ https://issues.apache.org/jira/browse/SPARK-30392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30392. -- Resolution: Not A Problem Yes, this is being reported against 2.3.0 apparently > Documentation for the date_trunc function is incorrect > -- > > Key: SPARK-30392 > URL: https://issues.apache.org/jira/browse/SPARK-30392 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ashley >Priority: Minor > Attachments: Date_trunc.PNG > > > The documentation for the date_trunc function includes a few sample SELECT > statements to show how the function works: > [https://spark.apache.org/docs/2.3.0/api/sql/#date_trunc] > In the sample SELECT statements, the inputs to the function are swapped: > *The docs show:* SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); > *The docs _should_ show:* SELECT date_trunc('YEAR', > '2015-03-05T09:32:05.359'); > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
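For reference, the corrected argument order matches the function signature `date_trunc(fmt, ts)`. A quick check, assuming an active `SparkSession` named `spark`; the values in the comments are the expected results per the documented 2.3+ behavior:

{code:java}
// Format string first, timestamp second: the order the docs' examples should show.
spark.sql("SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359')").show(false)
// 2015-01-01 00:00:00

spark.sql("SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359')").show(false)
// 2015-03-05 09:00:00
{code}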
[jira] [Resolved] (SPARK-30336) Move Kafka consumer related classes to its own package
[ https://issues.apache.org/jira/browse/SPARK-30336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30336. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26991 [https://github.com/apache/spark/pull/26991] > Move Kafka consumer related classes to its own package > -- > > Key: SPARK-30336 > URL: https://issues.apache.org/jira/browse/SPARK-30336 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.0.0 > > > There are too many classes placed in the package "org.apache.spark.sql.kafka010"; > these classes should be grouped by purpose. > As part of the change in SPARK-21869, we moved producer-related classes out to > "org.apache.spark.sql.kafka010.producer" and exposed only the necessary > classes/methods outside the package. We can apply the same approach to consumer-related > classes as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30336) Move Kafka consumer related classes to its own package
[ https://issues.apache.org/jira/browse/SPARK-30336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30336: Assignee: Jungtaek Lim > Move Kafka consumer related classes to its own package > -- > > Key: SPARK-30336 > URL: https://issues.apache.org/jira/browse/SPARK-30336 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > > There are too many classes placed in the package "org.apache.spark.sql.kafka010"; > these classes should be grouped by purpose. > As part of the change in SPARK-21869, we moved producer-related classes out to > "org.apache.spark.sql.kafka010.producer" and exposed only the necessary > classes/methods outside the package. We can apply the same approach to consumer-related > classes as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30339) Avoid to fail twice in function lookup
[jira] [Updated] (SPARK-30339) Avoid failing twice in function lookup
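A sketch of the control flow the description asks for is below. The helper names (`lookupFunction`, `ResolvedFunction`, the toy registry) are hypothetical and are not Spark's internal API; the point is only that the decimal-to-double retry should be attempted when a decimal argument is actually present, and the failure surfaced immediately otherwise.

{code:java}
import org.apache.spark.sql.types.{DataType, DecimalType, DoubleType}

// Hypothetical result type and registry standing in for the catalog/metastore;
// these are NOT Spark internals.
case class ResolvedFunction(name: String, argTypes: Seq[DataType])

val registry = Map.empty[(String, Seq[DataType]), ResolvedFunction]

def lookupFunction(name: String, argTypes: Seq[DataType]): Option[ResolvedFunction] =
  registry.get((name, argTypes))

def lookupWithFallback(name: String, argTypes: Seq[DataType]): ResolvedFunction =
  lookupFunction(name, argTypes).getOrElse {
    if (argTypes.exists(_.isInstanceOf[DecimalType])) {
      // Retry with decimal arguments widened to double, as the description explains.
      val widened: Seq[DataType] =
        argTypes.map { case _: DecimalType => DoubleType; case t => t }
      lookupFunction(name, widened)
        .getOrElse(throw new NoSuchElementException(s"Undefined function: $name"))
    } else {
      // No decimal argument: the second lookup can only fail and would just add
      // cost (e.g. another metastore round trip), so fail fast instead.
      throw new NoSuchElementException(s"Undefined function: $name")
    }
  }
{code}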
[jira] [Commented] (SPARK-27692) Optimize evaluation of udf that is deterministic and has literal inputs
[ https://issues.apache.org/jira/browse/SPARK-27692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006127#comment-17006127 ] Maciej Szymkiewicz commented on SPARK-27692: - Could you explain what the value of this proposal is over just passing a literal? > Optimize evaluation of udf that is deterministic and has literal inputs > --- > > Key: SPARK-27692 > URL: https://issues.apache.org/jira/browse/SPARK-27692 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sunitha Kambhampati >Priority: Major > > A deterministic UDF is a UDF for which the following is true: given a specific > input, the output of the UDF will be the same no matter how many times you > execute the UDF. > When the inputs to the UDF are all literals and the UDF is deterministic, we can > optimize this by evaluating the UDF once and reusing the output, instead of > evaluating the UDF for every row in the query. > This is valid only if the UDF is deterministic and the inputs are literals; > otherwise we should not and cannot apply this optimization. > *Testing:* > We have used this internally and have seen significant performance > improvements for some very expensive UDFs (as expected). > In the PR, I have added unit tests. > *Credits:* > Thanks to Guy Khazma ([https://github.com/guykhazma]) from the IBM Haifa > Research Team for the idea and the original implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
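A user-level illustration of the scenario (the UDF, column names, and data are made up, not from the ticket): with a deterministic UDF and all-literal inputs, the call below computes the same value for every row, so folding it into a single evaluation is safe.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder().appName("deterministic-udf-demo").getOrCreate()

// A deliberately expensive UDF; Spark UDFs are deterministic unless
// .asNondeterministic() is called, so this one qualifies for the optimization.
val expensive = udf((x: Int) => { Thread.sleep(10); x * x })

val df = spark.range(0, 1000000L).toDF("id")

// All inputs are literals, so the value is identical for every row. Today the UDF
// runs once per row; with the proposed optimization the plan could be rewritten to
// the equivalent of .withColumn("c", lit(1764)), evaluated only once.
df.withColumn("c", expensive(lit(42))).count()
{code}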
[jira] [Updated] (SPARK-30393) Too many ProvisionedThroughputExceededExceptions while recovering from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-30393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen updated SPARK-30393: Environment: I am using EMR 5.25.0, Spark 2.4.3, and spark-streaming-kinesis-asl 2.4.3. I have 6 r5.4xlarge nodes in my cluster, with plenty of memory, and 6 Kinesis shards; I even increased to 12 shards but still see the Kinesis error (was: I am using EMR 5.23.0, Spark 2.4.3, and spark-streaming-kinesis-asl 2.4.3. I have 6 r5.4xlarge nodes in my cluster, with plenty of memory, and 6 Kinesis shards; I even increased to 12 shards but still see the Kinesis error) > Too many ProvisionedThroughputExceededExceptions while recovering from checkpoint > - > > Key: SPARK-30393 > URL: https://issues.apache.org/jira/browse/SPARK-30393 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 2.4.3 > Environment: I am using EMR 5.25.0, Spark 2.4.3, and > spark-streaming-kinesis-asl 2.4.3. I have 6 r5.4xlarge nodes in my cluster, with plenty > of memory, and 6 Kinesis shards; I even increased to 12 shards but still see the > Kinesis error >Reporter: Stephen >Priority: Major > Attachments: kinesisexceedreadlimit.png, > kinesisusagewhilecheckpointrecoveryerror.png, > sparkuiwhilecheckpointrecoveryerror.png > > > I have a Spark application which consumes from Kinesis with 6 shards. Data is > produced to Kinesis at, at most, 2000 records/second. At non-peak times data > only comes in at 200 records/second. Each record is 0.5 KB, so 6 shards > are enough to handle that. > I use reduceByKeyAndWindow and mapWithState in the program, and the sliding > window is one hour long. > Recently I have been trying to checkpoint the application to S3. I am testing this > at non-peak time, so the incoming data rate is very low, around 200 records/sec. I > run the Spark application by creating a new context, and the checkpoint is created in > S3, but when I kill the app and restart it, it fails to recover from the > checkpoint: the error message is the following, my Spark UI shows all > the batches are stuck, and the checkpoint recovery takes a long time > to complete, from 15 minutes to over an hour. > I found lots of error messages in the log related to Kinesis exceeding the read > limit: > {quote}19/12/24 00:15:21 WARN TaskSetManager: Lost task 571.0 in stage 33.0 > (TID 4452, ip-172-17-32-11.ec2.internal, executor 9): > org.apache.spark.SparkException: Gave up after 3 retries while getting shard > iterator from sequence number > 49601654074184110438492229476281538439036626028298502210, last exception: > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288) > bq. at scala.Option.getOrElse(Option.scala:121) > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282) > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getKinesisIterator(KinesisBackedBlockRDD.scala:246) > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:206) > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162) > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133) > bq. at > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > bq. at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > bq. 
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > bq. at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > bq. at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187) > bq. at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > bq. at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > bq. at org.apache.spark.scheduler.Task.run(Task.scala:121) > bq. at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > bq. at > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > bq. at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > bq. at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > bq. at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > bq. at java.lang.Thread.run(Thread.java:748) > bq.
[jira] [Created] (SPARK-30398) PCA/RegressionMetrics/RowMatrix avoid unnecessary computation
zhengruifeng created SPARK-30398: Summary: PCA/RegressionMetrics/RowMatrix avoid unnecessary computation Key: SPARK-30398 URL: https://issues.apache.org/jira/browse/SPARK-30398 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 3.0.0 Reporter: zhengruifeng use Summarizer instead of MultivariateOnlineSummarizer to avoid computation of unused metrics -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
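For context, the saving comes from requesting only the metrics that are actually needed. A minimal sketch of the DataFrame-based Summarizer API (the column names and data are illustrative):

{code:java}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("summarizer-demo").getOrCreate()
import spark.implicits._

// Toy feature vectors.
val df = Seq(
  (Vectors.dense(1.0, 2.0, 3.0), 1.0),
  (Vectors.dense(4.0, 5.0, 6.0), 1.0)
).toDF("features", "weight")

// Summarizer computes only the requested metrics ("mean" and "variance" here),
// whereas MultivariateOnlineSummarizer maintains every statistic; skipping the
// unused ones is the saving the ticket is after in PCA / RegressionMetrics / RowMatrix.
df.select(Summarizer.metrics("mean", "variance").summary($"features").as("summary"))
  .select("summary.mean", "summary.variance")
  .show(false)
{code}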
[jira] [Created] (SPARK-30397) [pyspark] Writer applied to custom model changes type of keys' dict from int to str
Jean-Marc Montanier created SPARK-30397: --- Summary: [pyspark] Writer applied to custom model changes type of keys' dict from int to str Key: SPARK-30397 URL: https://issues.apache.org/jira/browse/SPARK-30397 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: Jean-Marc Montanier Hello, I have a custom model that I'm trying to persist. Within this custom model there is a python dict mapping from int to int. When the model is saved (with write().save('path')), the keys of the dict are modified from int to str. You can find bellow a code to reproduce the issue: {code:python} #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ @author: Jean-Marc Montanier @date: 2019/12/31 """ from pyspark.sql import SparkSession from pyspark import keyword_only from pyspark.ml import Pipeline, PipelineModel from pyspark.ml import Estimator, Model from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable from pyspark.ml.param import Param, Params from pyspark.ml.param.shared import HasInputCol, HasOutputCol from pyspark.sql.types import IntegerType from pyspark.sql.functions import udf spark = SparkSession \ .builder \ .appName("ImputeNormal") \ .getOrCreate() class CustomFit(Estimator, HasInputCol, HasOutputCol, DefaultParamsReadable, DefaultParamsWritable, ): @keyword_only def __init__(self, inputCol="inputCol", outputCol="outputCol"): super(CustomFit, self).__init__() self._setDefault(inputCol="inputCol", outputCol="outputCol") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only def setParams(self, inputCol="inputCol", outputCol="outputCol"): """ setParams(self, inputCol="inputCol", outputCol="outputCol") """ kwargs = self._input_kwargs self._set(**kwargs) return self def _fit(self, data): inputCol = self.getInputCol() outputCol = self.getOutputCol() categories = data.where(data[inputCol].isNotNull()) \ .groupby(inputCol) \ .count() \ .orderBy("count", ascending=False) \ .limit(2) categories = dict(categories.toPandas().set_index(inputCol)["count"]) for cat in categories: categories[cat] = int(categories[cat]) return CustomModel(categories=categories, input_col=inputCol, output_col=outputCol) class CustomModel(Model, DefaultParamsReadable, DefaultParamsWritable): input_col = Param(Params._dummy(), "input_col", "Name of the input column") output_col = Param(Params._dummy(), "output_col", "Name of the output column") categories = Param(Params._dummy(), "categories", "Top categories") def __init__(self, categories: dict = None, input_col="input_col", output_col="output_col"): super(CustomModel, self).__init__() self._set(categories=categories, input_col=input_col, output_col=output_col) def get_output_col(self) -> str: """ output_col getter :return: """ return self.getOrDefault(self.output_col) def get_input_col(self) -> str: """ input_col getter :return: """ return self.getOrDefault(self.input_col) def get_categories(self): """ categories getter :return: """ return self.getOrDefault(self.categories) def _transform(self, data): input_col = self.get_input_col() output_col = self.get_output_col() categories = self.get_categories() def get_cat(val): if val is None: return -1 if val not in categories: return -1 return int(categories[val]) get_cat_udf = udf(get_cat, IntegerType()) df = data.withColumn(output_col, get_cat_udf(input_col)) return df def test_without_write(): fit_df = spark.createDataFrame([[10]] * 5 + [[11]] * 4 + [[12]] * 3 + [[None]] * 2, ['input']) custom_fit = CustomFit(inputCol='input', outputCol='output') pipeline = 
Pipeline(stages=[custom_fit]) pipeline_model = pipeline.fit(fit_df) print("Categories: {}".format(pipeline_model.stages[0].get_categories())) transform_df = spark.createDataFrame([[10]] * 2 + [[11]] * 2 + [[12]] * 2 + [[None]] * 2, ['input']) test = pipeline_model.transform(transform_df) test.show() # This output is the expected output def test_with_write(): fit_df = spark.createDataFrame([[10]] * 5 + [[11]] * 4 + [[12]] * 3 + [[None]] * 2, ['input']) custom_fit = CustomFit(inputCol='input', outputCol='output') pi
[jira] [Updated] (SPARK-30396) When there are multiple DISTINCT aggregate expressions acting on different fields, any DISTINCT aggregate expression allows the use of the FILTER clause
[ https://issues.apache.org/jira/browse/SPARK-30396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30396: --- Description: This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 This ticket will support: When there are multiple DISTINCT aggregate expressions acting on different fields, any DISTINCT aggregate expression allows the use of the FILTER clause such as: select class_id, count(distinct sex), sum(distinct id) filter (where sex = 'man') from student group by class_id; select class_id, count(distinct sex) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id; was: This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 This ticket will support: When there are multiple DISTINCT aggregate expressions acting on different fields, any DISTINCT aggregate expression allows the use of the FILTER clause > When there are multiple DISTINCT aggregate expressions acting on different > fields, any DISTINCT aggregate expression allows the use of the FILTER clause > > > Key: SPARK-30396 > URL: https://issues.apache.org/jira/browse/SPARK-30396 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 > This ticket will support: > When there are multiple DISTINCT aggregate expressions acting on different > fields, any DISTINCT aggregate expression allows the use of the FILTER clause > such as: > select class_id, count(distinct sex), sum(distinct id) filter (where sex = > 'man') from student group by class_id; > select class_id, count(distinct sex) filter (where class_id = 1), > sum(distinct id) filter (where sex = 'man') from student group by class_id; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30395) When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
[ https://issues.apache.org/jira/browse/SPARK-30395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30395: --- Description: This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 This ticket will support: When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause such as: {code:java} select sum(distinct id) filter (where sex = 'man') from student; select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id; select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student; select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id; select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student; select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id; select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id; {code} but not support: {code:java} select class_id, count(distinct sex), sum(distinct id) filter (where sex = 'man') from student group by class_id; select class_id, count(distinct sex) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;{code} https://issues.apache.org/jira/browse/SPARK-30396 used for later. was: This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 This ticket will support: When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause such as: {code:java} select sum(distinct id) filter (where sex = 'man') from student; select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id; select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student; select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id; select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student; select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id; select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id; {code} but not support: > When one or more DISTINCT aggregate expressions operate on the same field, > the DISTINCT aggregate expression allows the use of the FILTER clause > > > Key: SPARK-30395 > URL: https://issues.apache.org/jira/browse/SPARK-30395 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 > This ticket will support: > When one or more DISTINCT aggregate expressions operate on the same field, > the DISTINCT aggregate expression allows the use of the FILTER clause > such as: > > {code:java} > select sum(distinct id) filter (where sex = 'man') from student; > select class_id, sum(distinct id) filter (where sex = 'man') from student > group by class_id; > select count(id) filter (where class_id = 1), sum(distinct id) filter (where > sex = 'man') from student; > select class_id, count(id) filter (where class_id 
= 1), sum(distinct id) > filter (where sex = 'man') from student group by class_id; > select sum(distinct id), sum(distinct id) filter (where sex = 'man') from > student; > select class_id, sum(distinct id), sum(distinct id) filter (where sex = > 'man') from student group by class_id; > select class_id, count(id), count(id) filter (where class_id = 1), > sum(distinct id), sum(distinct id) filter (where sex = 'man') from student > group by class_id; > {code} > > but not support: > > {code:java} > select class_id, count(distinct sex), sum(distinct id) filter (where sex = > 'man') from student group by class_id; > select class_id, count(distinct sex) filter (where class_id = 1), > sum(distinct id) filter (where sex = 'man') from student group by > class_id;{code} > https://issues.apache.org/jira/browse/SPARK-30396 used for later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
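The example queries above reference a `student` table. Here is a minimal sketch to create one for experimentation; the schema and data are inferred from the columns used in the queries (`id`, `sex`, `class_id`), not taken from the ticket.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("filter-clause-demo").getOrCreate()
import spark.implicits._

// Columns inferred from the example queries; the rows are made up.
Seq(
  (1, "man", 1), (2, "woman", 1), (3, "man", 1),
  (4, "man", 2), (5, "woman", 2), (6, "woman", 2)
).toDF("id", "sex", "class_id").createOrReplaceTempView("student")

// One of the forms this ticket proposes to support; it is expected to fail until the
// corresponding change lands.
spark.sql("""
  SELECT class_id,
         count(id) FILTER (WHERE class_id = 1),
         sum(DISTINCT id) FILTER (WHERE sex = 'man')
  FROM student
  GROUP BY class_id
""").show()
{code}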
[jira] [Updated] (SPARK-30395) When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
[ https://issues.apache.org/jira/browse/SPARK-30395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30395: --- Description: This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 This ticket will support: When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause such as: {code:java} select sum(distinct id) filter (where sex = 'man') from student; select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id; select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student; select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id; select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student; select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id; select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id; {code} but not support: was: This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 This ticket will support: When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause > When one or more DISTINCT aggregate expressions operate on the same field, > the DISTINCT aggregate expression allows the use of the FILTER clause > > > Key: SPARK-30395 > URL: https://issues.apache.org/jira/browse/SPARK-30395 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 > This ticket will support: > When one or more DISTINCT aggregate expressions operate on the same field, > the DISTINCT aggregate expression allows the use of the FILTER clause > such as: > > {code:java} > select sum(distinct id) filter (where sex = 'man') from student; > select class_id, sum(distinct id) filter (where sex = 'man') from student > group by class_id; > select count(id) filter (where class_id = 1), sum(distinct id) filter (where > sex = 'man') from student; > select class_id, count(id) filter (where class_id = 1), sum(distinct id) > filter (where sex = 'man') from student group by class_id; > select sum(distinct id), sum(distinct id) filter (where sex = 'man') from > student; > select class_id, sum(distinct id), sum(distinct id) filter (where sex = > 'man') from student group by class_id; > select class_id, count(id), count(id) filter (where class_id = 1), > sum(distinct id), sum(distinct id) filter (where sex = 'man') from student > group by class_id; > {code} > > but not support: > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30396) When there are multiple DISTINCT aggregate expressions acting on different fields, any DISTINCT aggregate expression allows the use of the FILTER clause
jiaan.geng created SPARK-30396: -- Summary: When there are multiple DISTINCT aggregate expressions acting on different fields, any DISTINCT aggregate expression allows the use of the FILTER clause Key: SPARK-30396 URL: https://issues.apache.org/jira/browse/SPARK-30396 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jiaan.geng This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 This ticket will support: When there are multiple DISTINCT aggregate expressions acting on different fields, any DISTINCT aggregate expression allows the use of the FILTER clause -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30395) When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
jiaan.geng created SPARK-30395: -- Summary: When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause Key: SPARK-30395 URL: https://issues.apache.org/jira/browse/SPARK-30395 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jiaan.geng This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986 This ticket will support: When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org