[jira] [Updated] (SPARK-23811) FetchFailed comes before Success of same task will cause child stage never succeed
[ https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-23811: Description: This is a bug caused by abnormal scenario describe below: # ShuffleMapTask 1.0 running, this task will fetch data from ExecutorA # ExecutorA Lost, trigger `mapOutputTracker.removeOutputsOnExecutor(execId)` , shuffleStatus changed. # Speculative ShuffleMapTask 1.1 start, got a FetchFailed immediately. # ShuffleMapTask 1.0 finally succeed, but because of 1.1's FetchFailed, stage still mark as failed stage. # ShuffleMapTask 1 is the last task of its stage, so this stage will never succeed because of there's no missing task DagScheduler can get. was: This is a bug caused by abnormal scenario describe below: # ShuffleMapTask 1.0 running, this task will fetch data from ExecutorA # ExecutorA Lost, trigger `mapOutputTracker.removeOutputsOnExecutor(execId)` , shuffleStatus changed. # Speculative ShuffleMapTask 1.1 start, got a FetchFailed immediately. # ShuffleMapTask 1 is the last task of its stage, so this stage will never succeed because of there's no missing task DagScheduler can get. > FetchFailed comes before Success of same task will cause child stage never > succeed > -- > > Key: SPARK-23811 > URL: https://issues.apache.org/jira/browse/SPARK-23811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Li Yuanjian >Priority: Major > Attachments: 1.png, 2.png > > > This is a bug caused by abnormal scenario describe below: > # ShuffleMapTask 1.0 running, this task will fetch data from ExecutorA > # ExecutorA Lost, trigger `mapOutputTracker.removeOutputsOnExecutor(execId)` > , shuffleStatus changed. > # Speculative ShuffleMapTask 1.1 start, got a FetchFailed immediately. > # ShuffleMapTask 1.0 finally succeed, but because of 1.1's FetchFailed, > stage still mark as failed stage. > # ShuffleMapTask 1 is the last task of its stage, so this stage will never > succeed because of there's no missing task DagScheduler can get. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420215#comment-16420215 ] Jepson commented on SPARK-22968: Adjust these parameters,keep monitor. “request.timeout.ms“ -> (21: java.lang.Integer) “session.timeout.ms“ -> (18: java.lang.Integer) “heartbeat.interval.ms“ -> (6000: java.lang.Integer) > java.lang.IllegalStateException: No current assignment for partition kssh-2 > --- > > Key: SPARK-22968 > URL: https://issues.apache.org/jira/browse/SPARK-22968 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: Kafka: 0.10.0 (CDH5.12.0) > Apache Spark 2.1.1 > Spark streaming+Kafka >Reporter: Jepson >Priority: Major > Original Estimate: 96h > Remaining Estimate: 96h > > *Kafka Broker:* > {code:java} >message.max.bytes : 2621440 > {code} > *Spark Streaming+Kafka Code:* > {code:java} > , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: > 1048576 > , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 > , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 > , "heartbeat.interval.ms" -> (5000: java.lang.Integer) > , "receive.buffer.bytes" -> (10485760: java.lang.Integer) > {code} > *Error message:* > {code:java} > 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously > assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, > kssh-1] for group use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined > group use_a_separate_group_id_for_each_stream with generation 4 > 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned > partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for > time 1515116907000 ms > java.lang.IllegalStateException: No current assignment for partition kssh-2 > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerato
[jira] [Assigned] (SPARK-23743) IsolatedClientLoader.isSharedClass returns an unindented result against `slf4j` keyword
[ https://issues.apache.org/jira/browse/SPARK-23743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-23743: --- Assignee: Jongyoul Lee > IsolatedClientLoader.isSharedClass returns an unindented result against > `slf4j` keyword > --- > > Key: SPARK-23743 > URL: https://issues.apache.org/jira/browse/SPARK-23743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jongyoul Lee >Assignee: Jongyoul Lee >Priority: Minor > Fix For: 2.4.0 > > > {{isSharedClass}} returns if some classes can/should be shared or not. It > checks if the classes names have some keywords or start with some names. > Following the logic, it can occur unintended behaviors when a custom package > has {{slf4j}} inside the package or class name. As I guess, the first > intention seems to figure out the class containing {{org.slf4j}}. It would be > better to change the comparison logic to {{name.startsWith("org.slf4j")}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23743) IsolatedClientLoader.isSharedClass returns an unindented result against `slf4j` keyword
[ https://issues.apache.org/jira/browse/SPARK-23743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-23743. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20860 [https://github.com/apache/spark/pull/20860] > IsolatedClientLoader.isSharedClass returns an unindented result against > `slf4j` keyword > --- > > Key: SPARK-23743 > URL: https://issues.apache.org/jira/browse/SPARK-23743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jongyoul Lee >Priority: Minor > Fix For: 2.4.0 > > > {{isSharedClass}} returns if some classes can/should be shared or not. It > checks if the classes names have some keywords or start with some names. > Following the logic, it can occur unintended behaviors when a custom package > has {{slf4j}} inside the package or class name. As I guess, the first > intention seems to figure out the class containing {{org.slf4j}}. It would be > better to change the comparison logic to {{name.startsWith("org.slf4j")}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23829) spark-sql-kafka source in spark 2.3 causes reading stream failure frequently
Norman Bai created SPARK-23829: -- Summary: spark-sql-kafka source in spark 2.3 causes reading stream failure frequently Key: SPARK-23829 URL: https://issues.apache.org/jira/browse/SPARK-23829 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Norman Bai In spark 2.3 , it provides a source "spark-sql-kafka-0-10_2.11". When I wanted to read from my kafka-0.10.2.1 cluster, it throws out an error "*java.util.concurrent.TimeoutException: Cannot fetch record for offset in 12000 milliseconds*" frequently , and the job thus failed. I searched on google & stackoverflow for a while, and found many other people who got this excption too, and nobody gave an answer. I debuged the source code, found nothing, but I guess it's because the dependency spark-sql-kafka-0-10_2.11 is using. {code:java} org.apache.spark spark-sql-kafka-0-10_2.11 2.3.0 kafka-clients org.apache.kafka org.apache.kafka kafka-clients 0.10.2.1 {code} I excluded it from maven ,and added another version , rerun the code , and now it works. I guess something is wrong on kafka-clients0.10.0.1 working with kafka0.10.2.1, or more kafka versions. Hope for an explanation. Here is the error stack. {code:java} [ERROR] 2018-03-30 13:34:11,404 [stream execution thread for [id = 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = b3e18aa6-358f-43f6-a077-e34db0822df6]] org.apache.spark.sql.execution.streaming.MicroBatchExecution logError - Query [id = 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = b3e18aa6-358f-43f6-a077-e34db0822df6] terminated with error org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost, executor driver): java.util.concurrent.TimeoutException: Cannot fetch record for offset 6481521 in 12 milliseconds at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:230) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at o
[jira] [Resolved] (SPARK-23808) Test spark sessions should set default session
[ https://issues.apache.org/jira/browse/SPARK-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23808. - Resolution: Fixed Assignee: Jose Torres > Test spark sessions should set default session > -- > > Key: SPARK-23808 > URL: https://issues.apache.org/jira/browse/SPARK-23808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > SparkSession.getOrCreate() ensures that the session it returns is set as a > default. Test code (TestSparkSession and TestHiveSparkSession) shortcuts > around this method, and thus a default is never set. We need to set it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23808) Test spark sessions should set default session
[ https://issues.apache.org/jira/browse/SPARK-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23808: Fix Version/s: 2.4.0 2.3.1 > Test spark sessions should set default session > -- > > Key: SPARK-23808 > URL: https://issues.apache.org/jira/browse/SPARK-23808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > SparkSession.getOrCreate() ensures that the session it returns is set as a > default. Test code (TestSparkSession and TestHiveSparkSession) shortcuts > around this method, and thus a default is never set. We need to set it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23805) support vector-size validation and Inference
[ https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420113#comment-16420113 ] zhengruifeng commented on SPARK-23805: -- [~sethah] [~josephkb] Are you interested in this? > support vector-size validation and Inference > > > Key: SPARK-23805 > URL: https://issues.apache.org/jira/browse/SPARK-23805 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Major > > I think it maybe miningful to unify the usage of \{{AttributeGroup}} and > support vector-size validation and inference in algs. > My thoughts are: > * In \{{transformSchema}}, validate the input vector-size if possible. If > the input vector-size can be obtained from schema, check it. > ** Suppose a \{{PCA}} estimator with k=4, the \{{transformSchema}} will > require the vector-size to be no more than 4. > ** Suppose a \{{PCAModel}} trained with vectors of length 10, the > \{{transformSchema}} will require the vector-size to be 10. > * In \{{transformSchema}}, inference the output vector-size if possible. > ** Suppose a \{{PCA}} estimator with k=4, the \{{transformSchema}} will > return a schema with output vector-size=4. > ** Suppose a \{{PCAModel}} trained with k=4, the \{{transformSchema}} will > return a schema with output vector-size=4. > * In \{{transform}}, inference the output vector-size if possible. > * In \{{fit}}, obtain the input vector-size from schema if possible. This > can help eliminating redundant \{{first}} jobs. > > Current PR only modifies \{{PCA}} and \{{MaxAbsScaler}} to illustrate my > idea. Since the validation and inference is quite alg-speciafic, we may need > to sperate the task into several small subtasks. > How do you think about this? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuliang updated SPARK-5928: --- Comment: was deleted (was: !image-2018-03-29-11-52-32-075.png|width=741,height=189! >From the picture, The shuffle read size is over 2GB, but job not failed. I wonder if anyone can answer me? ) > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid >Priority: Major > Attachments: image-2018-03-29-11-52-32-075.png > > > If a shuffle block is over 2GB, the shuffle fails, with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode, it only happens on > remote fetches. I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io
[jira] [Commented] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources
[ https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419995#comment-16419995 ] Apache Spark commented on SPARK-23825: -- User 'dvogelbacher' has created a pull request for this issue: https://github.com/apache/spark/pull/20943 > [K8s] Spark pods should request memory + memoryOverhead as resources > > > Key: SPARK-23825 > URL: https://issues.apache.org/jira/browse/SPARK-23825 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: David Vogelbacher >Priority: Major > > We currently request {{spark.[driver,executor].memory}} as memory from > Kubernetes (e.g., > [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). > The limit is set to {{spark.[driver,executor].memory + > spark.kubernetes.[driver,executor].memoryOverhead}}. > This seems to be using Kubernetes wrong. > [How Pods with resource limits are > run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], > states: > {noformat} > If a Container exceeds its memory request, it is likely that its Pod will be > evicted whenever the node runs out of memory. > {noformat} > Thus, if a the spark driver/executor uses {{memory + memoryOverhead}} > memory, it can be evicted. While an executor might get restarted (but it > would still be very bad performance-wise), the driver would be hard to > recover. > I think spark should be able to run with the requested (and, thus, > guaranteed) resources from Kubernetes without being in danger of termination > without needing to rely on optional available resources. > Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes > (and this should also be the limit). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources
[ https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23825: Assignee: Apache Spark > [K8s] Spark pods should request memory + memoryOverhead as resources > > > Key: SPARK-23825 > URL: https://issues.apache.org/jira/browse/SPARK-23825 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: David Vogelbacher >Assignee: Apache Spark >Priority: Major > > We currently request {{spark.[driver,executor].memory}} as memory from > Kubernetes (e.g., > [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). > The limit is set to {{spark.[driver,executor].memory + > spark.kubernetes.[driver,executor].memoryOverhead}}. > This seems to be using Kubernetes wrong. > [How Pods with resource limits are > run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], > states: > {noformat} > If a Container exceeds its memory request, it is likely that its Pod will be > evicted whenever the node runs out of memory. > {noformat} > Thus, if a the spark driver/executor uses {{memory + memoryOverhead}} > memory, it can be evicted. While an executor might get restarted (but it > would still be very bad performance-wise), the driver would be hard to > recover. > I think spark should be able to run with the requested (and, thus, > guaranteed) resources from Kubernetes without being in danger of termination > without needing to rely on optional available resources. > Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes > (and this should also be the limit). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources
[ https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23825: Assignee: (was: Apache Spark) > [K8s] Spark pods should request memory + memoryOverhead as resources > > > Key: SPARK-23825 > URL: https://issues.apache.org/jira/browse/SPARK-23825 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: David Vogelbacher >Priority: Major > > We currently request {{spark.[driver,executor].memory}} as memory from > Kubernetes (e.g., > [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). > The limit is set to {{spark.[driver,executor].memory + > spark.kubernetes.[driver,executor].memoryOverhead}}. > This seems to be using Kubernetes wrong. > [How Pods with resource limits are > run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], > states: > {noformat} > If a Container exceeds its memory request, it is likely that its Pod will be > evicted whenever the node runs out of memory. > {noformat} > Thus, if a the spark driver/executor uses {{memory + memoryOverhead}} > memory, it can be evicted. While an executor might get restarted (but it > would still be very bad performance-wise), the driver would be hard to > recover. > I think spark should be able to run with the requested (and, thus, > guaranteed) resources from Kubernetes without being in danger of termination > without needing to rely on optional available resources. > Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes > (and this should also be the limit). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23822) Improve error message for Parquet schema mismatches
[ https://issues.apache.org/jira/browse/SPARK-23822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuchen Huo updated SPARK-23822: --- Component/s: (was: Input/Output) SQL > Improve error message for Parquet schema mismatches > --- > > Key: SPARK-23822 > URL: https://issues.apache.org/jira/browse/SPARK-23822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuchen Huo >Priority: Major > > If a user attempts to read Parquet files with mismatched schemas and schema > merging is disabled then this may result in a very confusing > UnsupportedOperationException and ParquetDecodingException errors from > Parquet. > e.g. > {code:java} > Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/") > Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/") > spark.read.parquet(s"$path/").collect() > {code} > Would result in > {code:java} > Caused by: java.lang.UnsupportedOperationException: Unimplemented type: > IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:474) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:617) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources
[ https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419950#comment-16419950 ] David Vogelbacher commented on SPARK-23825: --- addressed by https://github.com/apache/spark/pull/20943 > [K8s] Spark pods should request memory + memoryOverhead as resources > > > Key: SPARK-23825 > URL: https://issues.apache.org/jira/browse/SPARK-23825 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: David Vogelbacher >Priority: Major > > We currently request {{spark.[driver,executor].memory}} as memory from > Kubernetes (e.g., > [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). > The limit is set to {{spark.[driver,executor].memory + > spark.kubernetes.[driver,executor].memoryOverhead}}. > This seems to be using Kubernetes wrong. > [How Pods with resource limits are > run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], > states: > {noformat} > If a Container exceeds its memory request, it is likely that its Pod will be > evicted whenever the node runs out of memory. > {noformat} > Thus, if a the spark driver/executor uses {{memory + memoryOverhead}} > memory, it can be evicted. While an executor might get restarted (but it > would still be very bad performance-wise), the driver would be hard to > recover. > I think spark should be able to run with the requested (and, thus, > guaranteed) resources from Kubernetes without being in danger of termination > without needing to rely on optional available resources. > Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes > (and this should also be the limit). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23828) PySpark StringIndexerModel should have constructor from labels
Bryan Cutler created SPARK-23828: Summary: PySpark StringIndexerModel should have constructor from labels Key: SPARK-23828 URL: https://issues.apache.org/jira/browse/SPARK-23828 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.3.0 Reporter: Bryan Cutler The Scala StringIndexerModel has an alternate constructor that will create the model from an array of label strings. This should be in the Python API also, maybe as a classmethod {code:java} model = StringIndexerModel.from_labels(['a', 'b']){code} See SPARK-15009 which added a similar constructor to CountVectorizerModel -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15009) PySpark CountVectorizerModel should be able to construct from vocabulary list
[ https://issues.apache.org/jira/browse/SPARK-15009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419939#comment-16419939 ] Apache Spark commented on SPARK-15009: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/20942 > PySpark CountVectorizerModel should be able to construct from vocabulary list > - > > Key: SPARK-15009 > URL: https://issues.apache.org/jira/browse/SPARK-15009 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 2.4.0 > > > Like the Scala version, PySpark CountVectorizerModel should be able to > construct the model from given a vocabulary list. > For example > {noformat} > cvm = CountVectorizerModel(["a", "b", "c"]) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23827) StreamingJoinExec should ensure that input data is partitioned into specific number of partitions
[ https://issues.apache.org/jira/browse/SPARK-23827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23827: Assignee: Apache Spark (was: Tathagata Das) > StreamingJoinExec should ensure that input data is partitioned into specific > number of partitions > - > > Key: SPARK-23827 > URL: https://issues.apache.org/jira/browse/SPARK-23827 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Critical > > Currently, the requiredChildDistribution does not specify the partitions. > This can cause the weird corner cases where the child's distribution is > `SinglePartition` which satisfies the required distribution of > `ClusterDistribution(no-num-partition-requirement)`, thus eliminating the > shuffle needed to repartition input data into the required number of > partitions (i.e. same as state stores). That can lead to "file not found" > errors on the state store delta files as the micro-batch-with-no-shuffle will > not run certain tasks and therefore not generate the expected state store > delta files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23827) StreamingJoinExec should ensure that input data is partitioned into specific number of partitions
[ https://issues.apache.org/jira/browse/SPARK-23827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419925#comment-16419925 ] Apache Spark commented on SPARK-23827: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/20941 > StreamingJoinExec should ensure that input data is partitioned into specific > number of partitions > - > > Key: SPARK-23827 > URL: https://issues.apache.org/jira/browse/SPARK-23827 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Currently, the requiredChildDistribution does not specify the partitions. > This can cause the weird corner cases where the child's distribution is > `SinglePartition` which satisfies the required distribution of > `ClusterDistribution(no-num-partition-requirement)`, thus eliminating the > shuffle needed to repartition input data into the required number of > partitions (i.e. same as state stores). That can lead to "file not found" > errors on the state store delta files as the micro-batch-with-no-shuffle will > not run certain tasks and therefore not generate the expected state store > delta files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23827) StreamingJoinExec should ensure that input data is partitioned into specific number of partitions
[ https://issues.apache.org/jira/browse/SPARK-23827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23827: Assignee: Tathagata Das (was: Apache Spark) > StreamingJoinExec should ensure that input data is partitioned into specific > number of partitions > - > > Key: SPARK-23827 > URL: https://issues.apache.org/jira/browse/SPARK-23827 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Currently, the requiredChildDistribution does not specify the partitions. > This can cause the weird corner cases where the child's distribution is > `SinglePartition` which satisfies the required distribution of > `ClusterDistribution(no-num-partition-requirement)`, thus eliminating the > shuffle needed to repartition input data into the required number of > partitions (i.e. same as state stores). That can lead to "file not found" > errors on the state store delta files as the micro-batch-with-no-shuffle will > not run certain tasks and therefore not generate the expected state store > delta files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-22711. -- Resolution: Workaround Closing this because it seems wordnet is not serializable with Cloudpickle and it's preferable to import inside the executor function rather than globally. > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo distributed installation of Spark 2.2.0 >Reporter: Prateek >Priority: Major > Attachments: Jira_Spark_minimized_code.py > > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a Pyspark program with spark-submit command this error is > thrown. > It happens when for code like below > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > Traceback (most recent call last): > File "/home/prateek/Project/textrank.py", line 299, in > summaryRDD = sentenceTokensReduceRDD.map(lambda m: > get_summary(m)).reduceByKey(lambda c,v :c+v) > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, > in reduceByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, > in combineByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, > in partitionBy > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, > in _jrdd > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, > in _wrap_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, > in _prepare_for_python_RDD > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 460, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 704, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 148, in dump > File "/usr/lib/python3.5/pickle.py", line 408, in dump > self.save(obj) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) >
[jira] [Updated] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API
[ https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edwina Lu updated SPARK-23429: -- Description: Add new executor level memory metrics ( jvmUsedMemory, onHeapExecutionMemory, offHeapExecutionMemory, onHeapStorageMemory, offHeapStorageMemory, onHeapUnifiedMemory, and offHeapUnifiedMemory), and expose these via the executors REST API. This information will help provide insight into how executor and driver JVM memory is used, and for the different memory regions. It can be used to help determine good values for spark.executor.memory, spark.driver.memory, spark.memory.fraction, and spark.memory.storageFraction. Add an ExecutorMetrics class, with jvmUsedMemory, onHeapExecutionMemory, offHeapExecutionMemory, onHeapStorageMemory, and offHeapStorageMemory. This will track the memory usage at the executor level. The new ExecutorMetrics will be sent by executors to the driver as part of the Heartbeat. A heartbeat will be added for the driver as well, to collect these metrics for the driver. Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there is a new peak value for one of the memory metrics for an executor and stage. Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize additional logging. Analysis on a set of sample applications showed an increase of 0.25% in the size of the Spark history log, with this approach. Modify the AppStatusListener to collect snapshots of peak values for each memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and storageMemory, and list of active stages. Add the new memory metrics (snapshots of peak values for each memory metric) to the executors REST API. This is a subtask for SPARK-23206. Please refer to the design doc for that ticket for more details. was: Add new executor level memory metrics ( jvmUsedMemory, executionMemory, storageMemory, and unifiedMemory), and expose these via the executors REST API. This information will help provide insight into how executor and driver JVM memory is used, and for the different memory regions. It can be used to help determine good values for spark.executor.memory, spark.driver.memory, spark.memory.fraction, and spark.memory.storageFraction. Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and storageMemory. This will track the memory usage at the executor level. The new ExecutorMetrics will be sent by executors to the driver as part of the Heartbeat. A heartbeat will be added for the driver as well, to collect these metrics for the driver. Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there is a new peak value for one of the memory metrics for an executor and stage. Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize additional logging. Analysis on a set of sample applications showed an increase of 0.25% in the size of the Spark history log, with this approach. Modify the AppStatusListener to collect snapshots of peak values for each memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and storageMemory, and list of active stages. Add the new memory metrics (snapshots of peak values for each memory metric) to the executors REST API. This is a subtask for SPARK-23206. Please refer to the design doc for that ticket for more details. > Add executor memory metrics to heartbeat and expose in executors REST API > - > > Key: SPARK-23429 > URL: https://issues.apache.org/jira/browse/SPARK-23429 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > > Add new executor level memory metrics ( jvmUsedMemory, onHeapExecutionMemory, > offHeapExecutionMemory, onHeapStorageMemory, offHeapStorageMemory, > onHeapUnifiedMemory, and offHeapUnifiedMemory), and expose these via the > executors REST API. This information will help provide insight into how > executor and driver JVM memory is used, and for the different memory regions. > It can be used to help determine good values for spark.executor.memory, > spark.driver.memory, spark.memory.fraction, and spark.memory.storageFraction. > Add an ExecutorMetrics class, with jvmUsedMemory, onHeapExecutionMemory, > offHeapExecutionMemory, onHeapStorageMemory, and offHeapStorageMemory. This > will track the memory usage at the executor level. The new ExecutorMetrics > will be sent by executors to the driver as part of the Heartbeat. A heartbeat > will be added for the driver as well, to collect these metrics for the driver. > Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there > is a new peak value for one of the memory metrics for an execu
[jira] [Created] (SPARK-23827) StreamingJoinExec should ensure that input data is partitioned into specific number of partitions
Tathagata Das created SPARK-23827: - Summary: StreamingJoinExec should ensure that input data is partitioned into specific number of partitions Key: SPARK-23827 URL: https://issues.apache.org/jira/browse/SPARK-23827 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Tathagata Das Assignee: Tathagata Das Currently, the requiredChildDistribution does not specify the partitions. This can cause the weird corner cases where the child's distribution is `SinglePartition` which satisfies the required distribution of `ClusterDistribution(no-num-partition-requirement)`, thus eliminating the shuffle needed to repartition input data into the required number of partitions (i.e. same as state stores). That can lead to "file not found" errors on the state store delta files as the micro-batch-with-no-shuffle will not run certain tasks and therefore not generate the expected state store delta files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23826) TestHiveSparkSession should set default session
Jose Torres created SPARK-23826: --- Summary: TestHiveSparkSession should set default session Key: SPARK-23826 URL: https://issues.apache.org/jira/browse/SPARK-23826 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Jose Torres The fix for TestSparkSession breaks hive/testOnly, because many of the tests both instantiate a TestHiveSparkSession and call SparkSession.getOrCreate(). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API
[ https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23429: Assignee: (was: Apache Spark) > Add executor memory metrics to heartbeat and expose in executors REST API > - > > Key: SPARK-23429 > URL: https://issues.apache.org/jira/browse/SPARK-23429 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > > Add new executor level memory metrics ( jvmUsedMemory, executionMemory, > storageMemory, and unifiedMemory), and expose these via the executors REST > API. This information will help provide insight into how executor and driver > JVM memory is used, and for the different memory regions. It can be used to > help determine good values for spark.executor.memory, spark.driver.memory, > spark.memory.fraction, and spark.memory.storageFraction. > Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and > storageMemory. This will track the memory usage at the executor level. The > new ExecutorMetrics will be sent by executors to the driver as part of the > Heartbeat. A heartbeat will be added for the driver as well, to collect these > metrics for the driver. > Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there > is a new peak value for one of the memory metrics for an executor and stage. > Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize > additional logging. Analysis on a set of sample applications showed an > increase of 0.25% in the size of the Spark history log, with this approach. > Modify the AppStatusListener to collect snapshots of peak values for each > memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and > storageMemory, and list of active stages. > Add the new memory metrics (snapshots of peak values for each memory metric) > to the executors REST API. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API
[ https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419861#comment-16419861 ] Apache Spark commented on SPARK-23429: -- User 'edwinalu' has created a pull request for this issue: https://github.com/apache/spark/pull/20940 > Add executor memory metrics to heartbeat and expose in executors REST API > - > > Key: SPARK-23429 > URL: https://issues.apache.org/jira/browse/SPARK-23429 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > > Add new executor level memory metrics ( jvmUsedMemory, executionMemory, > storageMemory, and unifiedMemory), and expose these via the executors REST > API. This information will help provide insight into how executor and driver > JVM memory is used, and for the different memory regions. It can be used to > help determine good values for spark.executor.memory, spark.driver.memory, > spark.memory.fraction, and spark.memory.storageFraction. > Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and > storageMemory. This will track the memory usage at the executor level. The > new ExecutorMetrics will be sent by executors to the driver as part of the > Heartbeat. A heartbeat will be added for the driver as well, to collect these > metrics for the driver. > Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there > is a new peak value for one of the memory metrics for an executor and stage. > Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize > additional logging. Analysis on a set of sample applications showed an > increase of 0.25% in the size of the Spark history log, with this approach. > Modify the AppStatusListener to collect snapshots of peak values for each > memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and > storageMemory, and list of active stages. > Add the new memory metrics (snapshots of peak values for each memory metric) > to the executors REST API. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API
[ https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23429: Assignee: Apache Spark > Add executor memory metrics to heartbeat and expose in executors REST API > - > > Key: SPARK-23429 > URL: https://issues.apache.org/jira/browse/SPARK-23429 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Assignee: Apache Spark >Priority: Major > > Add new executor level memory metrics ( jvmUsedMemory, executionMemory, > storageMemory, and unifiedMemory), and expose these via the executors REST > API. This information will help provide insight into how executor and driver > JVM memory is used, and for the different memory regions. It can be used to > help determine good values for spark.executor.memory, spark.driver.memory, > spark.memory.fraction, and spark.memory.storageFraction. > Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and > storageMemory. This will track the memory usage at the executor level. The > new ExecutorMetrics will be sent by executors to the driver as part of the > Heartbeat. A heartbeat will be added for the driver as well, to collect these > metrics for the driver. > Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there > is a new peak value for one of the memory metrics for an executor and stage. > Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize > additional logging. Analysis on a set of sample applications showed an > increase of 0.25% in the size of the Spark history log, with this approach. > Modify the AppStatusListener to collect snapshots of peak values for each > memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and > storageMemory, and list of active stages. > Add the new memory metrics (snapshots of peak values for each memory metric) > to the executors REST API. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources
[ https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419857#comment-16419857 ] David Vogelbacher commented on SPARK-23825: --- Will make a PR shortly, cc [~mcheah] > [K8s] Spark pods should request memory + memoryOverhead as resources > > > Key: SPARK-23825 > URL: https://issues.apache.org/jira/browse/SPARK-23825 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: David Vogelbacher >Priority: Major > > We currently request {{spark.[driver,executor].memory}} as memory from > Kubernetes (e.g., > [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). > The limit is set to {{spark.[driver,executor].memory + > spark.kubernetes.[driver,executor].memoryOverhead}}. > This seems to be using Kubernetes wrong. > [How Pods with resource limits are > run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], > states: > {noformat} > If a Container exceeds its memory request, it is likely that its Pod will be > evicted whenever the node runs out of memory. > {noformat} > Thus, if a the spark driver/executor uses {{memory + memoryOverhead}} > memory, it can be evicted. While an executor might get restarted (but it > would still be very bad performance-wise), the driver would be hard to > recover. > I think spark should be able to run with the requested (and, thus, > guaranteed) resources from Kubernetes without being in danger of termination > without needing to rely on optional available resources. > Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes > (and this should also be the limit). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources
[ https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Vogelbacher updated SPARK-23825: -- Description: We currently request {{spark.[driver,executor].memory}} as memory from Kubernetes (e.g., [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). The limit is set to {{spark.[driver,executor].memory + spark.kubernetes.[driver,executor].memoryOverhead}}. This seems to be using Kubernetes wrong. [How Pods with resource limits are run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], states: {noformat} If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory. {noformat} Thus, if a the spark driver/executor uses {{memory + memoryOverhead}} memory, it can be evicted. While an executor might get restarted (but it would still be very bad performance-wise), the driver would be hard to recover. I think spark should be able to run with the requested (and, thus, guaranteed) resources from Kubernetes without being in danger of termination without needing to rely on optional available resources. Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes (and this should also be the limit). was: We currently request {{spark.{driver,executor}.memory}} as memory from Kubernetes (e.g., [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). The limit is set to {{spark.{driver,executor}.memory + spark.kubernetes.{driver,executor}.memoryOverhead}}. This seems to be using Kubernetes wrong. [How Pods with resource limits are run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], states" {noformat} If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory. {noformat} Thus, if a the spark driver/executor uses {{memory + memoryOverhead}} memory, it can be evicted. While an executor might get restarted (but it would still be very bad performance-wise), the driver would be hard to recover. I think spark should be able to run with the requested (and, thus, guaranteed) resources from Kubernetes. It shouldn't rely on optional resources above the request and, therefore, be in danger of termination on high cluster utilization. Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes (and this should also be the limit). > [K8s] Spark pods should request memory + memoryOverhead as resources > > > Key: SPARK-23825 > URL: https://issues.apache.org/jira/browse/SPARK-23825 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: David Vogelbacher >Priority: Major > > We currently request {{spark.[driver,executor].memory}} as memory from > Kubernetes (e.g., > [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). > The limit is set to {{spark.[driver,executor].memory + > spark.kubernetes.[driver,executor].memoryOverhead}}. > This seems to be using Kubernetes wrong. > [How Pods with resource limits are > run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], > states: > {noformat} > If a Container exceeds its memory request, it is likely that its Pod will be > evicted whenever the node runs out of memory. > {noformat} > Thus, if a the spark driver/executor uses {{memory + memoryOverhead}} > memory, it can be evicted. While an executor might get restarted (but it > would still be very bad performance-wise), the driver would be hard to > recover. > I think spark should be able to run with the requested (and, thus, > guaranteed) resources from Kubernetes without being in danger of termination > without needing to rely on optional available resources. > Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes > (and this should also be the limit). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources
[ https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Vogelbacher updated SPARK-23825: -- Description: We currently request {{spark.{driver,executor}.memory}} as memory from Kubernetes (e.g., [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). The limit is set to {{spark.{driver,executor}.memory + spark.kubernetes.{driver,executor}.memoryOverhead}}. This seems to be using Kubernetes wrong. [How Pods with resource limits are run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], states" {noformat} If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory. {noformat} Thus, if a the spark driver/executor uses {{memory + memoryOverhead}} memory, it can be evicted. While an executor might get restarted (but it would still be very bad performance-wise), the driver would be hard to recover. I think spark should be able to run with the requested (and, thus, guaranteed) resources from Kubernetes. It shouldn't rely on optional resources above the request and, therefore, be in danger of termination on high cluster utilization. Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes (and this should also be the limit). was: We currently request `spark.{driver,executor}.memory` as memory from Kubernetes (e.g., [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). The limit is set to `spark.{driver,executor}.memory + spark.kubernetes.{driver,executor}.memoryOverhead`. This seems to be using Kubernetes wrong. [How Pods with resource limits are run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], states" {noformat} If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory. {noformat} Thus, if a the spark driver/executor uses `memory + memoryOverhead` memory, it can be evicted. While an executor might get restarted (but it would still be very bad performance-wise), the driver would be hard to recover. I think spark should be able to run with the requested (and, thus, guaranteed) resources from Kubernetes. It shouldn't rely on optional resources above the request and, therefore, be in danger of termination on high cluster utilization. Thus, we shoud request `memory + memoryOverhead` memory from Kubernetes (and this should also be the limit). > [K8s] Spark pods should request memory + memoryOverhead as resources > > > Key: SPARK-23825 > URL: https://issues.apache.org/jira/browse/SPARK-23825 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: David Vogelbacher >Priority: Major > > We currently request {{spark.{driver,executor}.memory}} as memory from > Kubernetes (e.g., > [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). > The limit is set to {{spark.{driver,executor}.memory + > spark.kubernetes.{driver,executor}.memoryOverhead}}. > This seems to be using Kubernetes wrong. > [How Pods with resource limits are > run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], > states" > {noformat} > If a Container exceeds its memory request, it is likely that its Pod will be > evicted whenever the node runs out of memory. > {noformat} > Thus, if a the spark driver/executor uses {{memory + memoryOverhead}} > memory, it can be evicted. While an executor might get restarted (but it > would still be very bad performance-wise), the driver would be hard to > recover. > I think spark should be able to run with the requested (and, thus, > guaranteed) resources from Kubernetes. It shouldn't rely on optional > resources above the request and, therefore, be in danger of termination on > high cluster utilization. > Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes > (and this should also be the limit). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources
David Vogelbacher created SPARK-23825: - Summary: [K8s] Spark pods should request memory + memoryOverhead as resources Key: SPARK-23825 URL: https://issues.apache.org/jira/browse/SPARK-23825 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.3.0 Reporter: David Vogelbacher We currently request `spark.{driver,executor}.memory` as memory from Kubernetes (e.g., [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]). The limit is set to `spark.{driver,executor}.memory + spark.kubernetes.{driver,executor}.memoryOverhead`. This seems to be using Kubernetes wrong. [How Pods with resource limits are run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run], states" {noformat} If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory. {noformat} Thus, if a the spark driver/executor uses `memory + memoryOverhead` memory, it can be evicted. While an executor might get restarted (but it would still be very bad performance-wise), the driver would be hard to recover. I think spark should be able to run with the requested (and, thus, guaranteed) resources from Kubernetes. It shouldn't rely on optional resources above the request and, therefore, be in danger of termination on high cluster utilization. Thus, we shoud request `memory + memoryOverhead` memory from Kubernetes (and this should also be the limit). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20169) Groupby Bug with Sparksql
[ https://issues.apache.org/jira/browse/SPARK-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419808#comment-16419808 ] Maryann Xue commented on SPARK-20169: - [~smilegator], I think this is also caused by SPARK-23368, so could you please assign someone else to review https://github.com/apache/spark/pull/20613 if the original person still does not respond in a few days? > Groupby Bug with Sparksql > - > > Key: SPARK-20169 > URL: https://issues.apache.org/jira/browse/SPARK-20169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Bin Wu >Priority: Major > > We find a potential bug in Catalyst optimizer which cannot correctly > process "groupby". You can reproduce it by following simple example: > = > from pyspark.sql.functions import * > #e=sc.parallelize([(1,2),(1,3),(1,4),(2,1),(3,1),(4,1)]).toDF(["src","dst"]) > e = spark.read.csv("graph.csv", header=True) > r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src']) > r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src') > jr = e.join(r1, 'src') > jr.show() > r2 = jr.groupBy('dst').count() > r2.show() > = > FYI, "graph.csv" contains exactly the same data as the commented line. > You can find that jr is: > |src|dst|count| > | 3| 1|1| > | 1| 4|3| > | 1| 3|3| > | 1| 2|3| > | 4| 1|1| > | 2| 1|1| > But, after the last groupBy, the 3 rows with dst = 1 are not grouped together: > |dst|count| > | 1|1| > | 4|1| > | 3|1| > | 2|1| > | 1|1| > | 1|1| > If we build jr directly from raw data (commented line), this error will not > show up. So > we suspect that there is a bug in the Catalyst optimizer when multiple joins > and groupBy's > are being optimized. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23784) Cannot use custom Aggregator with groupBy/agg
[ https://issues.apache.org/jira/browse/SPARK-23784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419764#comment-16419764 ] Joshua Howard commented on SPARK-23784: --- Correct. It had been answered, but I am just now closing this out. > Cannot use custom Aggregator with groupBy/agg > -- > > Key: SPARK-23784 > URL: https://issues.apache.org/jira/browse/SPARK-23784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joshua Howard >Priority: Major > > I have code > [here|http://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work] > where I am trying to use an Aggregator with both the select and agg > functions. I cannot seem to get this to work in Spark 2.3.0. > [Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html] > is a blog post that appears to be using this functionality in Spark 1.6, but > It appears to no longer work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-23784) Cannot use custom Aggregator with groupBy/agg
[ https://issues.apache.org/jira/browse/SPARK-23784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshua Howard closed SPARK-23784. - See SO link. > Cannot use custom Aggregator with groupBy/agg > -- > > Key: SPARK-23784 > URL: https://issues.apache.org/jira/browse/SPARK-23784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joshua Howard >Priority: Major > > I have code > [here|http://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work] > where I am trying to use an Aggregator with both the select and agg > functions. I cannot seem to get this to work in Spark 2.3.0. > [Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html] > is a blog post that appears to be using this functionality in Spark 1.6, but > It appears to no longer work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] V Luong resolved SPARK-2. - Resolution: Won't Fix > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419745#comment-16419745 ] V Luong commented on SPARK-2: - [~bago.amirbekian] thank you, that is indeed a good solution available in Spark 2.3. I'm using that successfully. Will close this issue now. > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23824) Make inpurityStats publicly accessible in ml.tree.Node
Barry Becker created SPARK-23824: Summary: Make inpurityStats publicly accessible in ml.tree.Node Key: SPARK-23824 URL: https://issues.apache.org/jira/browse/SPARK-23824 Project: Spark Issue Type: Wish Components: ML Affects Versions: 2.1.1 Reporter: Barry Becker This is minor, but it is also a very easy fix. I would like to visualize the structure of a decision tree model, but currently the only means of obtaining the label distribution data at each node of the tree is hidden within each ml.tree.Node inside the impurityStats. I'm pretty sure that the fix for this is as easy as removing the private[ml] qualifier from occurrences of private[ml] def impurityStats: ImpurityCalculator and override private[ml] val impurityStats: ImpurityCalculator As a workaround, I've put my class that needs access into a org.apache.spark.ml.tree package in my own repository, but I would really like to not have to do that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23823) ResolveReferences loses correct origin
[ https://issues.apache.org/jira/browse/SPARK-23823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23823: Assignee: (was: Apache Spark) > ResolveReferences loses correct origin > -- > > Key: SPARK-23823 > URL: https://issues.apache.org/jira/browse/SPARK-23823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiahui Jiang >Priority: Major > > Introduced in [https://github.com/apache/spark/pull/19585] > ResolveReferences stopped doing transfromsUp after this change and Attributes > sometimes lose its correct origin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23823) ResolveReferences loses correct origin
[ https://issues.apache.org/jira/browse/SPARK-23823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23823: Assignee: Apache Spark > ResolveReferences loses correct origin > -- > > Key: SPARK-23823 > URL: https://issues.apache.org/jira/browse/SPARK-23823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiahui Jiang >Assignee: Apache Spark >Priority: Major > > Introduced in [https://github.com/apache/spark/pull/19585] > ResolveReferences stopped doing transfromsUp after this change and Attributes > sometimes lose its correct origin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23823) ResolveReferences loses correct origin
[ https://issues.apache.org/jira/browse/SPARK-23823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419482#comment-16419482 ] Apache Spark commented on SPARK-23823: -- User 'JiahuiJiang' has created a pull request for this issue: https://github.com/apache/spark/pull/20939 > ResolveReferences loses correct origin > -- > > Key: SPARK-23823 > URL: https://issues.apache.org/jira/browse/SPARK-23823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiahui Jiang >Priority: Major > > Introduced in [https://github.com/apache/spark/pull/19585] > ResolveReferences stopped doing transfromsUp after this change and Attributes > sometimes lose its correct origin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23823) ResolveReferences loses correct origin
Jiahui Jiang created SPARK-23823: Summary: ResolveReferences loses correct origin Key: SPARK-23823 URL: https://issues.apache.org/jira/browse/SPARK-23823 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Jiahui Jiang Introduced in [https://github.com/apache/spark/pull/19585] ResolveReferences stopped doing transfromsUp after this change and Attributes sometimes lose its correct origin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23639) SparkSQL CLI fails talk to Kerberized metastore when use proxy user
[ https://issues.apache.org/jira/browse/SPARK-23639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23639. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 20784 [https://github.com/apache/spark/pull/20784] > SparkSQL CLI fails talk to Kerberized metastore when use proxy user > --- > > Key: SPARK-23639 > URL: https://issues.apache.org/jira/browse/SPARK-23639 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > In SparkSQLCLI, SessionState generates before SparkContext instantiating. > When we use --proxy-user to impersonate, it's unable to initializing a > metastore client to talk to the secured metastore for no kerberos ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23639) SparkSQL CLI fails talk to Kerberized metastore when use proxy user
[ https://issues.apache.org/jira/browse/SPARK-23639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23639: -- Assignee: Kent Yao > SparkSQL CLI fails talk to Kerberized metastore when use proxy user > --- > > Key: SPARK-23639 > URL: https://issues.apache.org/jira/browse/SPARK-23639 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > In SparkSQLCLI, SessionState generates before SparkContext instantiating. > When we use --proxy-user to impersonate, it's unable to initializing a > metastore client to talk to the secured metastore for no kerberos ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state
[ https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23785: -- Assignee: Sahil Takiar > LauncherBackend doesn't check state of connection before setting state > -- > > Key: SPARK-23785 > URL: https://issues.apache.org/jira/browse/SPARK-23785 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Assignee: Sahil Takiar >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > Found in HIVE-18533 while trying to integration with the > {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its > connection to the {{LauncherServer}} before trying to run {{setState}} - > which sends a {{SetState}} message on the connection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state
[ https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23785. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 20893 [https://github.com/apache/spark/pull/20893] > LauncherBackend doesn't check state of connection before setting state > -- > > Key: SPARK-23785 > URL: https://issues.apache.org/jira/browse/SPARK-23785 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Assignee: Sahil Takiar >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > Found in HIVE-18533 while trying to integration with the > {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its > connection to the {{LauncherServer}} before trying to run {{setState}} - > which sends a {{SetState}} message on the connection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23821) Collection function: flatten
[ https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marek Novotny updated SPARK-23821: -- Summary: Collection function: flatten (was: Collection functions: flatten) > Collection function: flatten > > > Key: SPARK-23821 > URL: https://issues.apache.org/jira/browse/SPARK-23821 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Add the flatten function that transforms an Array of Arrays column into an > Array elements column. if the array structure contains more than two levels > of nesting, the function removes one nesting level > Example: > {{flatten(array(array(1, 2, 3), array(3, 4, 5), array(6, 7, 8)) => > [1,2,3,4,5,6,7,8,9]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23821) Collection functions: flatten
[ https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23821: Assignee: (was: Apache Spark) > Collection functions: flatten > - > > Key: SPARK-23821 > URL: https://issues.apache.org/jira/browse/SPARK-23821 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Add the flatten function that transforms an Array of Arrays column into an > Array elements column. if the array structure contains more than two levels > of nesting, the function removes one nesting level > Example: > {{flatten(array(array(1, 2, 3), array(3, 4, 5), array(6, 7, 8)) => > [1,2,3,4,5,6,7,8,9]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23821) Collection functions: flatten
[ https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23821: Assignee: Apache Spark > Collection functions: flatten > - > > Key: SPARK-23821 > URL: https://issues.apache.org/jira/browse/SPARK-23821 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Assignee: Apache Spark >Priority: Major > > Add the flatten function that transforms an Array of Arrays column into an > Array elements column. if the array structure contains more than two levels > of nesting, the function removes one nesting level > Example: > {{flatten(array(array(1, 2, 3), array(3, 4, 5), array(6, 7, 8)) => > [1,2,3,4,5,6,7,8,9]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23821) Collection functions: flatten
[ https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419426#comment-16419426 ] Apache Spark commented on SPARK-23821: -- User 'mn-mikke' has created a pull request for this issue: https://github.com/apache/spark/pull/20938 > Collection functions: flatten > - > > Key: SPARK-23821 > URL: https://issues.apache.org/jira/browse/SPARK-23821 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Add the flatten function that transforms an Array of Arrays column into an > Array elements column. if the array structure contains more than two levels > of nesting, the function removes one nesting level > Example: > {{flatten(array(array(1, 2, 3), array(3, 4, 5), array(6, 7, 8)) => > [1,2,3,4,5,6,7,8,9]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-20498. --- Resolution: Fixed Fix Version/s: 2.3.0 > RandomForestRegressionModel should expose getMaxDepth in PySpark > > > Key: SPARK-20498 > URL: https://issues.apache.org/jira/browse/SPARK-20498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Xin Ren >Priority: Minor > Fix For: 2.3.0 > > > Currently it isn't clear hot to get the max depth of a > RandomForestRegressionModel (eg, after doing a grid search) > It is possible to call > {{regressor._java_obj.getMaxDepth()}} > but most other decision trees allow > {{regressor.getMaxDepth()}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23822) Improve error message for Parquet schema mismatches
[ https://issues.apache.org/jira/browse/SPARK-23822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuchen Huo updated SPARK-23822: --- Description: If a user attempts to read Parquet files with mismatched schemas and schema merging is disabled then this may result in a very confusing UnsupportedOperationException and ParquetDecodingException errors from Parquet. e.g. {code:java} Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/") Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/") spark.read.parquet(s"$path/").collect() {code} Would result in {code:java} Caused by: java.lang.UnsupportedOperationException: Unimplemented type: IntegerType at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:474) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:617) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code} was: If a user attempts to read Parquet files with mismatched schemas and schema merging is disabled then this may result in a very confusing UnsupportedOperationException and ParquetDecodingException errors from Parquet. e.g. {code:java} Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/") Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/") spark.read.parquet(s"$path/").collect() {code} Would result in > Improve error message for Parquet schema mismatches > --- > > Key: SPARK-23822 > URL: https://issues.apache.org/jira/browse/SPARK-23822 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Yuchen Huo >Priority: Major > > If a user attempts to read Parquet files with mismatched schemas and schema > merging is disabled then this may result in a very confusing > UnsupportedOperationException and ParquetDecodingException errors from > Parquet. > e.g. > {code:java} > Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/") > Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/") > spark.read.parquet(s"$path/").collect() > {code} > Would result in > {code:java} > Caused by: java.lang.UnsupportedOperationException: Unimplemented type: > IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:474) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReade
[jira] [Updated] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20498: -- Target Version/s: (was: 2.4.0) > RandomForestRegressionModel should expose getMaxDepth in PySpark > > > Key: SPARK-20498 > URL: https://issues.apache.org/jira/browse/SPARK-20498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Xin Ren >Priority: Minor > > Currently it isn't clear hot to get the max depth of a > RandomForestRegressionModel (eg, after doing a grid search) > It is possible to call > {{regressor._java_obj.getMaxDepth()}} > but most other decision trees allow > {{regressor.getMaxDepth()}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419397#comment-16419397 ] Joseph K. Bradley commented on SPARK-20498: --- I'll close this since Bryan's PR mostly solved this issue, but feel free to reopen it if needed. Sorry for the delay. > RandomForestRegressionModel should expose getMaxDepth in PySpark > > > Key: SPARK-20498 > URL: https://issues.apache.org/jira/browse/SPARK-20498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Xin Ren >Priority: Minor > > Currently it isn't clear hot to get the max depth of a > RandomForestRegressionModel (eg, after doing a grid search) > It is possible to call > {{regressor._java_obj.getMaxDepth()}} > but most other decision trees allow > {{regressor.getMaxDepth()}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20498: -- Shepherd: (was: Joseph K. Bradley) > RandomForestRegressionModel should expose getMaxDepth in PySpark > > > Key: SPARK-20498 > URL: https://issues.apache.org/jira/browse/SPARK-20498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Xin Ren >Priority: Minor > > Currently it isn't clear hot to get the max depth of a > RandomForestRegressionModel (eg, after doing a grid search) > It is possible to call > {{regressor._java_obj.getMaxDepth()}} > but most other decision trees allow > {{regressor.getMaxDepth()}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23821) Collection functions: flatten
Marek Novotny created SPARK-23821: - Summary: Collection functions: flatten Key: SPARK-23821 URL: https://issues.apache.org/jira/browse/SPARK-23821 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0 Reporter: Marek Novotny Add the flatten function that transforms an Array of Arrays column into an Array elements column. if the array structure contains more than two levels of nesting, the function removes one nesting level Example: {{flatten(array(array(1, 2, 3), array(3, 4, 5), array(6, 7, 8)) => [1,2,3,4,5,6,7,8,9]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23822) Improve error message for Parquet schema mismatches
Yuchen Huo created SPARK-23822: -- Summary: Improve error message for Parquet schema mismatches Key: SPARK-23822 URL: https://issues.apache.org/jira/browse/SPARK-23822 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.3.0 Reporter: Yuchen Huo If a user attempts to read Parquet files with mismatched schemas and schema merging is disabled then this may result in a very confusing UnsupportedOperationException and ParquetDecodingException errors from Parquet. e.g. {code:java} Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/") Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/") spark.read.parquet(s"$path/").collect() {code} Would result in -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Ivanfi updated SPARK-20297: -- Comment: was deleted (was: Sorry, commented to the wrong JIRA.) > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Major > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23704) PySpark access of individual trees in random forest is slow
[ https://issues.apache.org/jira/browse/SPARK-23704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-23704: - Component/s: PySpark > PySpark access of individual trees in random forest is slow > --- > > Key: SPARK-23704 > URL: https://issues.apache.org/jira/browse/SPARK-23704 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.1 > Environment: PySpark 2.2.1 / Windows 10 >Reporter: Julian King >Priority: Minor > > Making predictions from a randomForestClassifier PySpark is much faster than > making predictions from an individual tree contained within the .trees > attribute. > In fact, the model.transform call without an action is more than 10x slower > for an individual tree vs the model.transform call for the random forest > model. > See > [https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark] > for example with timing. > Ideally: > * Getting a prediction from a single tree should be comparable to or faster > than getting predictions from the whole tree > * Getting all the predictions from all the individual trees should be > comparable in speed to getting the predictions from the random forest > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419338#comment-16419338 ] Zoltan Ivanfi commented on SPARK-20297: --- Sorry, commented to the wrong JIRA. > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Major > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Ivanfi updated SPARK-20297: -- Comment: was deleted (was: Could you please clarify how those DECIMALS were written in the first place? * If some manual configuration was done to allow Spark to choose this representation, then we are fine. * If an upstream Spark version wrote data using this representation by default, that's a valid reason to feel mildly uncomfortable. * If a downstream Spark version wrote data using this representation by default, then we should open a JIRA to prevent CDH Spark from doing so until Hive and Impala supports it. Thanks!) > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Major > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419335#comment-16419335 ] Zoltan Ivanfi commented on SPARK-20297: --- Could you please clarify how those DECIMALS were written in the first place? * If some manual configuration was done to allow Spark to choose this representation, then we are fine. * If an upstream Spark version wrote data using this representation by default, that's a valid reason to feel mildly uncomfortable. * If a downstream Spark version wrote data using this representation by default, then we should open a JIRA to prevent CDH Spark from doing so until Hive and Impala supports it. Thanks! > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Major > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23503) continuous execution should sequence committed epochs
[ https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419136#comment-16419136 ] Efim Poberezkin commented on SPARK-23503: - [~joseph.torres] Good day Jose. From what I've figured about Continuous Execution implementation an epoch coordinator is created per streaming query run and is able to store state. I've added tracking of last committed epoch and of waiting epochs to it to enforce epoch sequencing. Could you please correct me if my understanding of your implementation is not correct and take a look at my approach? > continuous execution should sequence committed epochs > - > > Key: SPARK-23503 > URL: https://issues.apache.org/jira/browse/SPARK-23503 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > Currently, the EpochCoordinator doesn't enforce a commit order. If a message > for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for > commit earlier, epoch n + 1 will be committed. > > This is either incorrect or needlessly confusing, because it's not safe to > start from the end offset of epoch n + 1 until epoch n is committed. > EpochCoordinator should enforce this sequencing. > > Note that this is not actually a problem right now, because the commit > messages go through the same RPC channel from the same place. But we > shouldn't implicitly bake this assumption in. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23723) New charset option for json datasource
[ https://issues.apache.org/jira/browse/SPARK-23723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419123#comment-16419123 ] Apache Spark commented on SPARK-23723: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/20937 > New charset option for json datasource > -- > > Key: SPARK-23723 > URL: https://issues.apache.org/jira/browse/SPARK-23723 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Currently JSON Reader can read json files in different charset/encodings. The > JSON Reader uses the jackson-json library to automatically detect the charset > of input text/stream. Here you can see the method which detects encoding: > [https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174] > > The detectEncoding method checks the BOM > ([https://en.wikipedia.org/wiki/Byte_order_mark]) at the beginning of a text. > The BOM can be in the file but it is not mandatory. If it is not present, the > auto detection mechanism can select wrong charset. And as a consequence of > that, the user cannot read the json file. *The proposed option will allow to > bypass the auto detection mechanism and set the charset explicitly.* > > The charset option is already exposed as a CSV option: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88] > . I propose to add the same option for JSON. > > Regarding to JSON Writer, *the charset option will give to the user > opportunity* to read json files in charset different from UTF-8, modify the > dataset and *write results back to json files in the original encoding.* At > the moment it is not possible to do because the result can be saved in UTF-8 > only. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23724) Custom record separator for jsons in charsets different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419124#comment-16419124 ] Apache Spark commented on SPARK-23724: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/20937 > Custom record separator for jsons in charsets different from UTF-8 > -- > > Key: SPARK-23724 > URL: https://issues.apache.org/jira/browse/SPARK-23724 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > The option should define a sequence of bytes between two consecutive json > records. Currently the separator is detected automatically by hadoop library: > > [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L185-L254] > > The method is able to recognize only *\r, \n* and *\r\n* in UTF-8 encoding. > It doesn't work in the cases if encoding of input stream is different from > UTF-8. The option should allow to users explicitly set separator/delimiter of > json records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23503) continuous execution should sequence committed epochs
[ https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419111#comment-16419111 ] Apache Spark commented on SPARK-23503: -- User 'efimpoberezkin' has created a pull request for this issue: https://github.com/apache/spark/pull/20936 > continuous execution should sequence committed epochs > - > > Key: SPARK-23503 > URL: https://issues.apache.org/jira/browse/SPARK-23503 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > Currently, the EpochCoordinator doesn't enforce a commit order. If a message > for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for > commit earlier, epoch n + 1 will be committed. > > This is either incorrect or needlessly confusing, because it's not safe to > start from the end offset of epoch n + 1 until epoch n is committed. > EpochCoordinator should enforce this sequencing. > > Note that this is not actually a problem right now, because the commit > messages go through the same RPC channel from the same place. But we > shouldn't implicitly bake this assumption in. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23503) continuous execution should sequence committed epochs
[ https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23503: Assignee: (was: Apache Spark) > continuous execution should sequence committed epochs > - > > Key: SPARK-23503 > URL: https://issues.apache.org/jira/browse/SPARK-23503 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > Currently, the EpochCoordinator doesn't enforce a commit order. If a message > for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for > commit earlier, epoch n + 1 will be committed. > > This is either incorrect or needlessly confusing, because it's not safe to > start from the end offset of epoch n + 1 until epoch n is committed. > EpochCoordinator should enforce this sequencing. > > Note that this is not actually a problem right now, because the commit > messages go through the same RPC channel from the same place. But we > shouldn't implicitly bake this assumption in. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23503) continuous execution should sequence committed epochs
[ https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23503: Assignee: Apache Spark > continuous execution should sequence committed epochs > - > > Key: SPARK-23503 > URL: https://issues.apache.org/jira/browse/SPARK-23503 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Apache Spark >Priority: Major > > Currently, the EpochCoordinator doesn't enforce a commit order. If a message > for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for > commit earlier, epoch n + 1 will be committed. > > This is either incorrect or needlessly confusing, because it's not safe to > start from the end offset of epoch n + 1 until epoch n is committed. > EpochCoordinator should enforce this sequencing. > > Note that this is not actually a problem right now, because the commit > messages go through the same RPC channel from the same place. But we > shouldn't implicitly bake this assumption in. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances
[ https://issues.apache.org/jira/browse/SPARK-21960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419104#comment-16419104 ] Leonel Atencio commented on SPARK-21960: This is a very important issue, because right now, Streaming DRA does not include an `initExecutors` property. Some streaming jobs need a minimal amount of executors to work properly. Right now, you can only wait for the DRA algorithm to fire add executors until your minimum is reached. This is an issue. > Spark Streaming Dynamic Allocation should respect spark.executor.instances > -- > > Key: SPARK-21960 > URL: https://issues.apache.org/jira/browse/SPARK-21960 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Karthik Palaniappan >Priority: Minor > > This check enforces that spark.executor.instances (aka --num-executors) is > either unset or explicitly set to 0. > https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207 > If spark.executor.instances is unset, the check is fine, and the property > defaults to 2. Spark requests the cluster manager for 2 executors to start > with, then adds/removes executors appropriately. > However, if you explicitly set it to 0, the check also succeeds, but Spark > never asks the cluster manager for any executors. When running on YARN, I > repeatedly saw: > {code:java} > 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > {code} > I noticed that at least Google Dataproc and Ambari explicitly set > spark.executor.instances to a positive number, meaning that to use dynamic > allocation, you would have to edit spark-defaults.conf to remove the > property. That's obnoxious. > In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value > for --num-executors or --conf spark.executor.instances: > https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9 > It is much more reasonable for Streaming DRA to use spark.executor.instances, > just like Core DRA. I'll open a pull request to remove the check if there are > no objections. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23820) Allow the long form of call sites to be recorded in the log
Michael Mior created SPARK-23820: Summary: Allow the long form of call sites to be recorded in the log Key: SPARK-23820 URL: https://issues.apache.org/jira/browse/SPARK-23820 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: Michael Mior It would be nice if the long form of the callsite information could be included in the log. An example of what I'm proposing is here: https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-991) Report call sites of operators in Python
[ https://issues.apache.org/jira/browse/SPARK-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419099#comment-16419099 ] Michael Mior commented on SPARK-991: Maybe I'm missing something here, but I'm not seeing a Python stack trace captured when I look at the RDD call site information. > Report call sites of operators in Python > > > Key: SPARK-991 > URL: https://issues.apache.org/jira/browse/SPARK-991 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Matei Zaharia >Priority: Major > Labels: Starter > Fix For: 0.9.0 > > > Similar to our call site reporting in Java, we can get a stack trace in > Python with traceback.extract_stack. It would be nice to set that stack trace > as the name on the corresponding RDDs so it can appear in the UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22618) RDD.unpersist can cause fatal exception when used with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-22618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419062#comment-16419062 ] Brad commented on SPARK-22618: -- Yeah the fix for broadcaset unpersist should be basically the same. Thanks Thomas. > RDD.unpersist can cause fatal exception when used with dynamic allocation > - > > Key: SPARK-22618 > URL: https://issues.apache.org/jira/browse/SPARK-22618 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Brad >Assignee: Brad >Priority: Minor > Fix For: 2.3.0 > > > If you use rdd.unpersist() with dynamic allocation, then an executor can be > deallocated while your rdd is being removed, which will throw an uncaught > exception killing your job. > I looked into different ways of preventing this error from occurring but > couldn't come up with anything that wouldn't require a big change. I propose > the best fix is just to catch and log IOExceptions in unpersist() so they > don't kill your job. This will match the effective behavior when executors > are lost from dynamic allocation in other parts of the code. > In the worst case scenario I think this could lead to RDD partitions getting > left on executors after they were unpersisted, but this is probably better > than the whole job failing. I think in most cases the IOException would be > due to the executor dieing for some reason, which is effectively the same > result as unpersisting the rdd from that executor anyway. > I noticed this exception in a job that loads a 100GB dataset on a cluster > where we use dynamic allocation heavily. Here is the relevant stack trace > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:276) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > Exception in thread "main" org.apache.spark.SparkException: Exception thrown > in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:131) > at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:1806) > at org.apache.spark.rdd.RDD.unpersist(RDD.scala:217) > at > com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.doWorkload(CacheTest.scala:62) > at > com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:40) > at > com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.run(CacheTest.scala:33) > at > com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78) > at > com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSeria
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419053#comment-16419053 ] Steve Loughran commented on SPARK-23534: [~nchammas] bq. Cloudera still ships 2.6 and EMR is on 2.7. EMR and google have 2.8.x options & CDH may be based on 2.6 but they've backported and updated a lot. The aws & azure connectors on (HDP, CDH, HD/Insights) are all very much 2.8.x+ source and dependencies bq. Hadoop releases have historically been very disruptive, with backwards-incompatible changes even in minor releases. I don't disagree. But having the profiles so you can build and test this stuff is how to find out the problems. The HBase team have already been fairly busy, driving the development of a shaded client adequate for their needs...spark & hadoop 3 helps identify & fix regressions, push that shaded jar to be complete, etc. Profiles => build & tests => bug reports => fixes. [~Bidek] bq. I am afraid of using of older version of azure-storage because of all the security issues that have been found and fixed in the newer version, Hadoop 3.x is still shipping with Azure storage 5.4.0, and as that upgrade came from Microsoft (HADOOP-14662) I think they'll be up to speed with any security issues. That's in Hadoop 2.9.0 BTW. If you really want that storage lib, build Spark against with {{-Phadoop-2.7 -Dhadoop.version=2.9.0}}. Hive will be happy. > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > > Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make > sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark > run on Hadoop 3.0. > The work includes: > # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0. > # Test to see if there's dependency issues with Hadoop 3.0. > # Investigating the feasibility to use shaded client jars (HADOOP-11804). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding
[ https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-23807: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-23534 > Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and > binding > --- > > Key: SPARK-23807 > URL: https://issues.apache.org/jira/browse/SPARK-23807 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Major > > Hadoop 3, and particular Hadoop 3.1 adds: > * Java 8 as the minimum (and currently sole) supported Java version > * A new "hadoop-cloud-storage" module intended to be a minimal dependency > POM for all the cloud connectors in the version of hadoop built against > * The ability to declare a committer for any FileOutputFormat which > supercedes the classic FileOutputCommitter -in both a job and for a specific > FS URI > * A shaded client JAR, though not yet one complete enough for spark. > * Lots of other features and fixes. > The basic work of building spark with hadoop 3 is one of just doing the build > with {{-Dhadoop.version=3.x.y}}; however that > * Doesn't build on SBT (dependency resolution of zookeeper JAR) > * Misses the new cloud features > The ZK dependency can be fixed everywhere by explicitly declaring the ZK > artifact, instead of relying on curator to pull it in; this needs a profile > to declare the right ZK version, obviously.. > To use the cloud features spark the hadoop-3 profile should declare that the > spark-hadoop-cloud module depends on —and only on— the > hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud > storage, and a source package which is only built and tested when build > against Hadoop 3.1+ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding
[ https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419017#comment-16419017 ] Steve Loughran commented on SPARK-23807: yes, this profile is part of the hadoop 3 support, "necessary but not sufficient", given the hive issue > Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and > binding > --- > > Key: SPARK-23807 > URL: https://issues.apache.org/jira/browse/SPARK-23807 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Major > > Hadoop 3, and particular Hadoop 3.1 adds: > * Java 8 as the minimum (and currently sole) supported Java version > * A new "hadoop-cloud-storage" module intended to be a minimal dependency > POM for all the cloud connectors in the version of hadoop built against > * The ability to declare a committer for any FileOutputFormat which > supercedes the classic FileOutputCommitter -in both a job and for a specific > FS URI > * A shaded client JAR, though not yet one complete enough for spark. > * Lots of other features and fixes. > The basic work of building spark with hadoop 3 is one of just doing the build > with {{-Dhadoop.version=3.x.y}}; however that > * Doesn't build on SBT (dependency resolution of zookeeper JAR) > * Misses the new cloud features > The ZK dependency can be fixed everywhere by explicitly declaring the ZK > artifact, instead of relying on curator to pull it in; this needs a profile > to declare the right ZK version, obviously.. > To use the cloud features spark the hadoop-3 profile should declare that the > spark-hadoop-cloud module depends on —and only on— the > hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud > storage, and a source package which is only built and tested when build > against Hadoop 3.1+ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15125) CSV data source recognizes empty quoted strings in the input as null.
[ https://issues.apache.org/jira/browse/SPARK-15125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418925#comment-16418925 ] Max Murphy commented on SPARK-15125: [~snanda] That would not allow ,, to be distinguished from ,"", whereas we want ,, => null and ,"", => "". > CSV data source recognizes empty quoted strings in the input as null. > -- > > Key: SPARK-15125 > URL: https://issues.apache.org/jira/browse/SPARK-15125 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Suresh Thalamati >Priority: Major > > CSV data source does not differentiate between empty quoted strings and empty > fields as null. In some scenarios user would want to differentiate between > these values, especially in the context of SQL where NULL , and empty string > have different meanings If input data happens to be dump from traditional > relational data source, users will see different results for the SQL queries. > {code} > Repro: > Test Data: (test.csv) > year,make,model,comment,price > 2017,Tesla,Mode 3,looks nice.,35000.99 > 2016,Chevy,Bolt,"",29000.00 > 2015,Porsche,"",, > scala> val df= sqlContext.read.format("csv").option("header", > "true").option("inferSchema", "true").option("nullValue", > null).load("/tmp/test.csv") > df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more > fields] > scala> df.show > ++---+--+---++ > |year| make| model|comment| price| > ++---+--+---++ > |2017| Tesla|Mode 3|looks nice.|35000.99| > |2016| Chevy| Bolt| null| 29000.0| > |2015|Porsche| null| null|null| > ++---+--+---++ > Expected: > ++---+--+---++ > |year| make| model|comment| price| > ++---+--+---++ > |2017| Tesla|Mode 3|looks nice.|35000.99| > |2016| Chevy| Bolt| | 29000.0| > |2015|Porsche| | null|null| > ++---+--+---++ > {code} > Testing a fix for the this issue. I will give a shot at submitting a PR for > this soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23811) FetchFailed comes before Success of same task will cause child stage never succeed
[ https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-23811: Summary: FetchFailed comes before Success of same task will cause child stage never succeed (was: Same tasks' FetchFailed event comes before Success will cause child stage never succeed) > FetchFailed comes before Success of same task will cause child stage never > succeed > -- > > Key: SPARK-23811 > URL: https://issues.apache.org/jira/browse/SPARK-23811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Li Yuanjian >Priority: Major > Attachments: 1.png, 2.png > > > This is a bug caused by abnormal scenario describe below: > # ShuffleMapTask 1.0 running, this task will fetch data from ExecutorA > # ExecutorA Lost, trigger `mapOutputTracker.removeOutputsOnExecutor(execId)` > , shuffleStatus changed. > # Speculative ShuffleMapTask 1.1 start, got a FetchFailed immediately. > # ShuffleMapTask 1 is the last task of its stage, so this stage will never > succeed because of there's no missing task DagScheduler can get. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23819) InMemoryTableScanExec prunes orderable complex types due to out of date ColumnStats
[ https://issues.apache.org/jira/browse/SPARK-23819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23819: Assignee: Apache Spark > InMemoryTableScanExec prunes orderable complex types due to out of date > ColumnStats > --- > > Key: SPARK-23819 > URL: https://issues.apache.org/jira/browse/SPARK-23819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Patrick Woody >Assignee: Apache Spark >Priority: Major > > The data types that can be compared via BinaryComparison was expanded in > SPARK-21110 now include Arrays/Structs/etc, but ColumnStats would still have > hard coded upper/lower bounds for these types. > InMemoryTableScanExec used to be safe against these comparisons because the > predicate would fail type checking. Now that it passes, the statistics > unintentionally allow pruning of the partition, causing correctness issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23819) InMemoryTableScanExec prunes orderable complex types due to out of date ColumnStats
[ https://issues.apache.org/jira/browse/SPARK-23819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23819: Assignee: (was: Apache Spark) > InMemoryTableScanExec prunes orderable complex types due to out of date > ColumnStats > --- > > Key: SPARK-23819 > URL: https://issues.apache.org/jira/browse/SPARK-23819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Patrick Woody >Priority: Major > > The data types that can be compared via BinaryComparison was expanded in > SPARK-21110 now include Arrays/Structs/etc, but ColumnStats would still have > hard coded upper/lower bounds for these types. > InMemoryTableScanExec used to be safe against these comparisons because the > predicate would fail type checking. Now that it passes, the statistics > unintentionally allow pruning of the partition, causing correctness issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23819) InMemoryTableScanExec prunes orderable complex types due to out of date ColumnStats
[ https://issues.apache.org/jira/browse/SPARK-23819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418777#comment-16418777 ] Apache Spark commented on SPARK-23819: -- User 'pwoody' has created a pull request for this issue: https://github.com/apache/spark/pull/20935 > InMemoryTableScanExec prunes orderable complex types due to out of date > ColumnStats > --- > > Key: SPARK-23819 > URL: https://issues.apache.org/jira/browse/SPARK-23819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Patrick Woody >Priority: Major > > The data types that can be compared via BinaryComparison was expanded in > SPARK-21110 now include Arrays/Structs/etc, but ColumnStats would still have > hard coded upper/lower bounds for these types. > InMemoryTableScanExec used to be safe against these comparisons because the > predicate would fail type checking. Now that it passes, the statistics > unintentionally allow pruning of the partition, causing correctness issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23818) an official UDF interface for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23818: Assignee: Apache Spark > an official UDF interface for Spark SQL > --- > > Key: SPARK-23818 > URL: https://issues.apache.org/jira/browse/SPARK-23818 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > > Nowadays the Spark UDF interface is pretty simple: it's basically just a > lambda function. > We should come up with a new one that can: > 1. operate on InternalRow directly for better performance > 2. define deterministic/mutable/nullable in the UDF itself > 3. java friendly -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23818) an official UDF interface for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418754#comment-16418754 ] Apache Spark commented on SPARK-23818: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/20934 > an official UDF interface for Spark SQL > --- > > Key: SPARK-23818 > URL: https://issues.apache.org/jira/browse/SPARK-23818 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > Nowadays the Spark UDF interface is pretty simple: it's basically just a > lambda function. > We should come up with a new one that can: > 1. operate on InternalRow directly for better performance > 2. define deterministic/mutable/nullable in the UDF itself > 3. java friendly -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23818) an official UDF interface for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23818: Assignee: (was: Apache Spark) > an official UDF interface for Spark SQL > --- > > Key: SPARK-23818 > URL: https://issues.apache.org/jira/browse/SPARK-23818 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > Nowadays the Spark UDF interface is pretty simple: it's basically just a > lambda function. > We should come up with a new one that can: > 1. operate on InternalRow directly for better performance > 2. define deterministic/mutable/nullable in the UDF itself > 3. java friendly -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23819) InMemoryTableScanExec prunes orderable complex types due to out of date ColumnStats
Patrick Woody created SPARK-23819: - Summary: InMemoryTableScanExec prunes orderable complex types due to out of date ColumnStats Key: SPARK-23819 URL: https://issues.apache.org/jira/browse/SPARK-23819 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Patrick Woody The data types that can be compared via BinaryComparison was expanded in SPARK-21110 now include Arrays/Structs/etc, but ColumnStats would still have hard coded upper/lower bounds for these types. InMemoryTableScanExec used to be safe against these comparisons because the predicate would fail type checking. Now that it passes, the statistics unintentionally allow pruning of the partition, causing correctness issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23818) an official UDF interface for Spark SQL
Wenchen Fan created SPARK-23818: --- Summary: an official UDF interface for Spark SQL Key: SPARK-23818 URL: https://issues.apache.org/jira/browse/SPARK-23818 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Nowadays the Spark UDF interface is pretty simple: it's basically just a lambda function. We should come up with a new one that can: 1. operate on InternalRow directly for better performance 2. define deterministic/mutable/nullable in the UDF itself 3. java friendly -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23770) Expose repartitionByRange in SparkR
[ https://issues.apache.org/jira/browse/SPARK-23770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23770: Assignee: Hyukjin Kwon > Expose repartitionByRange in SparkR > --- > > Key: SPARK-23770 > URL: https://issues.apache.org/jira/browse/SPARK-23770 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > SPARK-22614 added repartitionByRange. It'd be good to have it in R side too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23770) Expose repartitionByRange in SparkR
[ https://issues.apache.org/jira/browse/SPARK-23770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23770. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20902 [https://github.com/apache/spark/pull/20902] > Expose repartitionByRange in SparkR > --- > > Key: SPARK-23770 > URL: https://issues.apache.org/jira/browse/SPARK-23770 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > SPARK-22614 added repartitionByRange. It'd be good to have it in R side too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20384) supporting value classes over primitives in DataSets
[ https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418691#comment-16418691 ] Furcy Pin edited comment on SPARK-20384 at 3/29/18 10:10 AM: - +1 on this issue. I think the generic use case is that the spark Encoder magic to automatically transform a DataFrame into a case class currently only work for base types. This is great if you have a {code:java} case class Table(id: Long, attribute: String) {code} with simple attributes, BUT, if you want to wrap your attribute into another simple class like this {code:java} case class Attribute(value: String) { // some specific methods... } case class Table(id: Long, attribute: Attribute){code} Then this won't work automatically, unless the "attribute" column in your DataFrame is a struct itself. The problem is that currently there doesn't seem to be any simple way to achieve this, which really limits the usefulness of the whole Encoder magic. And if a nice, simple way to achieve this exists, please document it as I did not find it. EDIT: after giving it some thought, I tried to do this: {code:java} implicit class Attribute(value: String) case class Table(id: Long, attribute: Attribute){code} But it does not work either. If it were possible like this, it would be a nice way to do it. was (Author: fpin): +1 on this issue. I think the generic use case is that the spark Encoder magic to automatically transform a DataFrame into a case class currently only work for base types. This is great if you have a {code:java} case class Table(id: Long, attribute: String) {code} with simple attributes, BUT, if you want to wrap your attribute into another simple class like this {code:java} case class Attribute(value: String) { // some specific methods... } case class Table(id: Long, attribute: Attribute){code} Then this won't work automatically, unless the "attribute" column in your DataFrame is a struct itself. The problem is that currently there doesn't seem to be any simple way to achieve this, which really limits the usefulness of the whole Encoder magic. And if a nice, simple way to achieve this exists, please document it as I did not find it. > supporting value classes over primitives in DataSets > > > Key: SPARK-20384 > URL: https://issues.apache.org/jira/browse/SPARK-20384 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 2.1.0 >Reporter: Daniel Davis >Priority: Minor > > As a spark user who uses value classes in scala for modelling domain objects, > I also would like to make use of them for datasets. > For example, I would like to use the {{User}} case class which is using a > value-class for it's {{id}} as the type for a DataSet: > - the underlying primitive should be mapped to the value-class column > - function on the column (for example comparison ) should only work if > defined on the value-class and use these implementation > - show() should pick up the toString method of the value-class > {code} > case class Id(value: Long) extends AnyVal { > def toString: String = value.toHexString > } > case class User(id: Id, name: String) > val ds = spark.sparkContext > .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS() > .withColumnRenamed("_1", "id") > .withColumnRenamed("_2", "name") > // mapping should work > val usrs = ds.as[User] > // show should use toString > usrs.show() > // comparison with long should throw exception, as not defined on Id > usrs.col("id") > 0L > {code} > For example `.show()` should use the toString of the `Id` value class: > {noformat} > +---+---+ > | id| name| > +---+---+ > | 0| name-0| > | 1| name-1| > | 2| name-2| > | 3| name-3| > | 4| name-4| > | 5| name-5| > | 6| name-6| > | 7| name-7| > | 8| name-8| > | 9| name-9| > | A|name-10| > | B|name-11| > | C|name-12| > +---+---+ > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20384) supporting value classes over primitives in DataSets
[ https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418691#comment-16418691 ] Furcy Pin commented on SPARK-20384: --- +1 on this issue. I think the generic use case is that the spark Encoder magic to automatically transform a DataFrame into a case class currently only work for base types. This is great if you have a {code:java} case class Table(id: Long, attribute: String) {code} with simple attributes, BUT, if you want to wrap your attribute into another simple class like this {code:java} case class Attribute(value: String) { // some specific methods... } case class Table(id: Long, attribute: Attribute){code} Then this won't work automatically, unless the "attribute" column in your DataFrame is a struct itself. The problem is that currently there doesn't seem to be any simple way to achieve this, which really limits the usefulness of the whole Encoder magic. And if a nice, simple way to achieve this exists, please document it as I did not find it. > supporting value classes over primitives in DataSets > > > Key: SPARK-20384 > URL: https://issues.apache.org/jira/browse/SPARK-20384 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 2.1.0 >Reporter: Daniel Davis >Priority: Minor > > As a spark user who uses value classes in scala for modelling domain objects, > I also would like to make use of them for datasets. > For example, I would like to use the {{User}} case class which is using a > value-class for it's {{id}} as the type for a DataSet: > - the underlying primitive should be mapped to the value-class column > - function on the column (for example comparison ) should only work if > defined on the value-class and use these implementation > - show() should pick up the toString method of the value-class > {code} > case class Id(value: Long) extends AnyVal { > def toString: String = value.toHexString > } > case class User(id: Id, name: String) > val ds = spark.sparkContext > .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS() > .withColumnRenamed("_1", "id") > .withColumnRenamed("_2", "name") > // mapping should work > val usrs = ds.as[User] > // show should use toString > usrs.show() > // comparison with long should throw exception, as not defined on Id > usrs.col("id") > 0L > {code} > For example `.show()` should use the toString of the `Id` value class: > {noformat} > +---+---+ > | id| name| > +---+---+ > | 0| name-0| > | 1| name-1| > | 2| name-2| > | 3| name-3| > | 4| name-4| > | 5| name-5| > | 6| name-6| > | 7| name-7| > | 8| name-8| > | 9| name-9| > | A|name-10| > | B|name-11| > | C|name-12| > +---+---+ > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22342) refactor schedulerDriver registration
[ https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418617#comment-16418617 ] Stavros Kontopoulos edited comment on SPARK-22342 at 3/29/18 9:43 AM: -- [~susanxhuynh] I guess we need to create a bug for this or change this issue type. was (Author: skonto): [~susanxhuynh] I guess we need to create a bug for this or change this is issue type. > refactor schedulerDriver registration > - > > Key: SPARK-22342 > URL: https://issues.apache.org/jira/browse/SPARK-22342 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This is an umbrella issue for working on: > https://github.com/apache/spark/pull/13143 > and handle the multiple re-registration issue which invalidates an offer. > To test: > dcos spark run --verbose --name=spark-nohive --submit-args="--driver-cores > 1 --conf spark.cores.max=1 --driver-memory 512M --class > org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar"; > master log: > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3085 hierarchical.cpp:303] Added framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3085 hierarchical.cpp:412] Deactivated framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3090 hierarchical.cpp:380] Activated framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: (was: duibi1.zip) > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: test.JPG, test1.JPG, test2.JPG > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: (was: duibi2.zip) > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: test.JPG, test1.JPG, test2.JPG > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: duibi2.zip > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: duibi1.zip, duibi2.zip, test.JPG, test1.JPG, test2.JPG > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: duibi1.zip > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: duibi1.zip, duibi2.zip, test.JPG, test1.JPG, test2.JPG > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: (was: shiro.ini) > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: test.JPG, test1.JPG, test2.JPG > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: (was: SecurityRestApi.java) > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: test.JPG, test1.JPG, test2.JPG > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: (was: zeppelin-site.xml) > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: test.JPG, test1.JPG, test2.JPG > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: (was: ZeppelinConfiguration.java) > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: test.JPG, test1.JPG, test2.JPG > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: (was: SecurityUtils.java) > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: test.JPG, test1.JPG, test2.JPG > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: (was: LoginRestApi.java) > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: SecurityRestApi.java, SecurityUtils.java, > ZeppelinConfiguration.java, shiro.ini, test.JPG, test1.JPG, test2.JPG, > zeppelin-site.xml > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: (was: GetUserList.java) > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: SecurityRestApi.java, SecurityUtils.java, > ZeppelinConfiguration.java, shiro.ini, test.JPG, test1.JPG, test2.JPG, > zeppelin-site.xml > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: (was: GbdLdapRealm.java) > SQL which has large ‘case when’ expressions may cause code generation beyond > 64KB > - > > Key: SPARK-21337 > URL: https://issues.apache.org/jira/browse/SPARK-21337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 > Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2 >Reporter: fengchaoge >Priority: Major > Fix For: 2.1.1 > > Attachments: SecurityRestApi.java, SecurityUtils.java, > ZeppelinConfiguration.java, shiro.ini, test.JPG, test1.JPG, test2.JPG, > zeppelin-site.xml > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org