[jira] [Assigned] (SPARK-27551) Uninformative error message for mismatched types in when().otherwise()
[ https://issues.apache.org/jira/browse/SPARK-27551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27551: Assignee: (was: Apache Spark) > Uninformative error message for mismatched types in when().otherwise() > - > > Key: SPARK-27551 > URL: https://issues.apache.org/jira/browse/SPARK-27551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Huon Wilson >Priority: Minor > > When a {{when(...).otherwise(...)}} construct has a type error, the error > message can be quite uninformative, since it just splats out a potentially > large chunk of code and says the types don't match. For instance: > {code:none} > scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + > 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as > "y")))) > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = > CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS > BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * > CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data > type mismatch: THEN and ELSE expressions should all be same type or coercible > to a common type;; > ... > {code} > The problem is that the structs have different field names ({{x}} vs {{y}}), but > it's not obvious that this is the case, and this is a relatively simple case > of a single {{select}} expression. > It would be great for the error message to at least include the types that > Spark has computed, to help clarify what might have gone wrong. 
For instance, > {{greatest}} and {{least}} write out the expression with the types instead of > values: > {code:none} > scala> spark.range(100).select(greatest('id, struct(lit("x")))) > org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`id`, > named_struct('col1', 'x'))' due to data type mismatch: The expressions should > all have the same type, got GREATEST(bigint, struct<col1:string>).;; > {code} > For the example above, this might look like: > {code:none} > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = > CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS > BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * > CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data > type mismatch: THEN and ELSE expressions should all be same type or coercible > to a common type, got CASE WHEN ... THEN array<struct<x:bigint>> ELSE > array<struct<y:bigint>> END;; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
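[Editor's note] The field-name mismatch described in the report suggests a workaround (a hypothetical sketch, not part of the ticket): once both branches use the same struct field name, the branches share the type array<struct<x:bigint>> and the expression resolves without error.
{code:none}
scala> // Hypothetical workaround sketch: give both branches the same field name ("x")
scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as "x"))))
{code}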
[jira] [Assigned] (SPARK-27551) Uninformative error message for mismatched types in when().otherwise()
[ https://issues.apache.org/jira/browse/SPARK-27551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27551: Assignee: Apache Spark > Uninformative error message for mismatched types in when().otherwise() > - > > Key: SPARK-27551 > URL: https://issues.apache.org/jira/browse/SPARK-27551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Huon Wilson >Assignee: Apache Spark >Priority: Minor > > When a {{when(...).otherwise(...)}} construct has a type error, the error > message can be quite uninformative, since it just splats out a potentially > large chunk of code and says the types don't match. For instance: > {code:none} > scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + > 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as > "y")))) > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = > CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS > BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * > CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data > type mismatch: THEN and ELSE expressions should all be same type or coercible > to a common type;; > ... > {code} > The problem is that the structs have different field names ({{x}} vs {{y}}), but > it's not obvious that this is the case, and this is a relatively simple case > of a single {{select}} expression. > It would be great for the error message to at least include the types that > Spark has computed, to help clarify what might have gone wrong. 
For instance, > {{greatest}} and {{least}} write out the expression with the types instead of > values: > {code:none} > scala> spark.range(100).select(greatest('id, struct(lit("x")))) > org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`id`, > named_struct('col1', 'x'))' due to data type mismatch: The expressions should > all have the same type, got GREATEST(bigint, struct<col1:string>).;; > {code} > For the example above, this might look like: > {code:none} > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = > CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS > BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * > CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data > type mismatch: THEN and ELSE expressions should all be same type or coercible > to a common type, got CASE WHEN ... THEN array<struct<x:bigint>> ELSE > array<struct<y:bigint>> END;; > {code}
[jira] [Commented] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/deser issue
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825788#comment-16825788 ] Xiao Li commented on SPARK-27216: - Thanks, Josh! We will add it to the release notes. [~cloud_fan] > Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/deser issue > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Labels: correctness > Fix For: 2.3.4, 2.4.2, 3.0.0 > > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks, but > RoaringBitmap 0.5.11 cannot be serialized/deserialized with the unsafe KryoSerializer. > The unit test below reproduces the issue: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} > Upgrading to the latest version, 0.7.45, fixes it.
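[Editor's note] Until the upgraded dependency is available, one possible mitigation (an inference from the comments above, not an official recommendation) is to leave Kryo's unsafe mode at its default, since the corruption only occurs in the non-default configuration:
{code:none}
# spark-defaults.conf: keep Kryo unsafe I/O disabled (the default)
# on Spark builds that still bundle RoaringBitmap 0.5.11
spark.kryo.unsafe    false
{code}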
[jira] [Assigned] (SPARK-27462) Spark Hive cannot choose some columns in the target table flexibly when running insert into.
[ https://issues.apache.org/jira/browse/SPARK-27462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27462: Assignee: Apache Spark > Spark Hive cannot choose some columns in the target table flexibly when running > insert into. > -- > > Key: SPARK-27462 > URL: https://issues.apache.org/jira/browse/SPARK-27462 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Spark SQL does not support choosing a subset of columns in the target table > when running > {code:java} > insert into tableA select ... from tableB;{code} > This feature is supported by Hive, so I think this grammar should be > consistent with Hive. > Hive supports the following forms of 'insert into': > {code:java} > insert into gja_test_spark select * from gja_test; > insert into gja_test_spark(key, value, other) select key, value, other from > gja_test; > insert into gja_test_spark(key, value) select value, other from gja_test; > insert into gja_test_spark(key, other) select value, other from gja_test; > insert into gja_test_spark(value, other) select value, other from > gja_test;{code}
[jira] [Assigned] (SPARK-27462) Spark Hive cannot choose some columns in the target table flexibly when running insert into.
[ https://issues.apache.org/jira/browse/SPARK-27462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27462: Assignee: (was: Apache Spark) > Spark Hive cannot choose some columns in the target table flexibly when running > insert into. > -- > > Key: SPARK-27462 > URL: https://issues.apache.org/jira/browse/SPARK-27462 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Priority: Major > > Spark SQL does not support choosing a subset of columns in the target table > when running > {code:java} > insert into tableA select ... from tableB;{code} > This feature is supported by Hive, so I think this grammar should be > consistent with Hive. > Hive supports the following forms of 'insert into': > {code:java} > insert into gja_test_spark select * from gja_test; > insert into gja_test_spark(key, value, other) select key, value, other from > gja_test; > insert into gja_test_spark(key, value) select value, other from gja_test; > insert into gja_test_spark(key, other) select value, other from gja_test; > insert into gja_test_spark(value, other) select value, other from > gja_test;{code}
[jira] [Commented] (SPARK-23178) Kryo Unsafe problems with count distinct from cache
[ https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825783#comment-16825783 ] Josh Rosen commented on SPARK-23178: This might be fixed by SPARK-27216 > Kryo Unsafe problems with count distinct from cache > --- > > Key: SPARK-23178 > URL: https://issues.apache.org/jira/browse/SPARK-23178 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 >Reporter: KIryl Sultanau >Priority: Minor > Attachments: Unsafe-issue.png, Unsafe-off.png > > > Spark incorrectly processes cached data with the Kryo & Unsafe options. > A distinct count from cache doesn't work correctly. An example is available below: > {quote}val spark = SparkSession > .builder > .appName("unsafe-issue") > .master("local[*]") > .config("spark.serializer", > "org.apache.spark.serializer.KryoSerializer") > .config("spark.kryo.unsafe", "true") > .config("spark.kryo.registrationRequired", "false") > .getOrCreate() > val devicesDF = spark.read.format("csv") > .option("header", "true") > .option("delimiter", "\t") > .load("/data/Devices.tsv").cache() > val gatewaysDF = spark.read.format("csv") > .option("header", "true") > .option("delimiter", "\t") > .load("/data/Gateways.tsv").cache() > val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), > "inner").cache() > devJoinedDF.printSchema() > println(devJoinedDF.count()) > println(devJoinedDF.select("DeviceId").distinct().count()) > println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count()) > println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count()) > {quote}
[jira] [Comment Edited] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/deser issue
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825782#comment-16825782 ] Josh Rosen edited comment on SPARK-27216 at 4/25/19 6:38 AM: - I've added the {{correctness}} label to this ticket because it sounds like it can cause query correctness issues in pre-2.4 versions of Spark (for example, I think SPARK-23178 might be a report of this). Note for future readers: this problem only occurs in a non-default configuration (the default is {{spark.kryo.unsafe == false}}). /cc [~smilegator] was (Author: joshrosen): I've added the {{correctness}} label to this ticket because it sounds like it can cause query correctness issues in pre-2.4 versions of Spark (for example, I think SPARK-27216 might be a report of this). Note for future readers: this problem only occurs in a non-default configuration (the default is {{spark.kryo.unsafe == false}}). /cc [~smilegator] > Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/deser issue > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Fix For: 2.3.4, 2.4.2, 3.0.0 > > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks, but > RoaringBitmap 0.5.11 cannot be serialized/deserialized with the unsafe KryoSerializer. 
> The unit test below reproduces the issue: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} > Upgrading to the latest version, 0.7.45, fixes it.
[jira] [Updated] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/deser issue
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-27216: --- Labels: correctness (was: ) > Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/deser issue > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Labels: correctness > Fix For: 2.3.4, 2.4.2, 3.0.0 > > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks, but > RoaringBitmap 0.5.11 cannot be serialized/deserialized with the unsafe KryoSerializer. > The unit test below reproduces the issue: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} > Upgrading to the latest version, 0.7.45, fixes it.
[jira] [Commented] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/deser issue
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825782#comment-16825782 ] Josh Rosen commented on SPARK-27216: I've added the {{correctness}} label to this ticket because it sounds like it can cause query correctness issues in pre-2.4 versions of Spark (for example, I think SPARK-23178 might be a report of this). Note for future readers: this problem only occurs in a non-default configuration (the default is {{spark.kryo.unsafe == false}}). /cc [~smilegator] > Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/deser issue > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Fix For: 2.3.4, 2.4.2, 3.0.0 > > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks, but > RoaringBitmap 0.5.11 cannot be serialized/deserialized with the unsafe KryoSerializer. > The unit test below reproduces the issue: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} > Upgrading to the latest version, 0.7.45, fixes it.
[jira] [Comment Edited] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle
[ https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825770#comment-16825770 ] Josh Rosen edited comment on SPARK-27530 at 4/25/19 6:32 AM: - This specific error message was added in SPARK-24160. As discussed in that ticket's PR, zero-sized blocks should never be requested, so the receipt of a zero-sized block indicates either a bug in the shuffle-request-issuing logic or a data corruption / truncation bug somewhere in the shuffle stack. *Update*: I see someone linked SPARK-27216 to this ticket; that sounds like a plausible cause / fix. was (Author: joshrosen): This specific error message was added in SPARK-24160. As discussed in that ticket's PR, zero-sized blocks should never be requested, so the receipt of a zero-sized block indicates either a bug in the shuffle-request-issuing logic or a data corruption / truncation bug somewhere in the shuffle stack. *Update*: I see someone linked SPARK-27216 to this ticket; that sounds like a plausible cause. 
> FetchFailedException: Received a zero-size buffer for block shuffle > --- > > Key: SPARK-27530 > URL: https://issues.apache.org/jira/browse/SPARK-27530 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Adrian Muraru >Priority: Major > > I'm getting this in a large shuffle: > {code:java} > org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer > for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, > 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) > at > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.IOException: Received a zero-size buffer for block > shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) > (expectedApproxSize = 33708, isNetworkReqDone=false){code}
[jira] [Comment Edited] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle
[ https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825770#comment-16825770 ] Josh Rosen edited comment on SPARK-27530 at 4/25/19 6:32 AM: - This specific error message was added in SPARK-24160. As discussed in that ticket's PR, zero-sized blocks should never be requested, so the receipt of a zero-sized block indicates either a bug in the shuffle-request-issuing logic or a data corruption / truncation bug somewhere in the shuffle stack. *Update*: I see someone linked SPARK-27216 to this ticket; that sounds like a plausible cause. was (Author: joshrosen): This specific error message was added in SPARK-24160. As discussed in that ticket's PR, zero-sized blocks should never be requested, so the receipt of a zero-sized block indicates either a bug in the shuffle-request-issuing logic or a data corruption / truncation bug somewhere in the shuffle stack. > FetchFailedException: Received a zero-size buffer for block shuffle > --- > > Key: SPARK-27530 > URL: https://issues.apache.org/jira/browse/SPARK-27530 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Adrian Muraru >Priority: Major > > I'm getting this in a large shuffle: > {code:java} > org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer > for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, > 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) > at > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.IOException: Received a zero-size buffer for block > shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) > (expectedApproxSize = 33708, isNetworkReqDone=false){code}
[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hui WANG updated SPARK-27555: - Description: *You can see the details in the attachment called Try.pdf.* I have already seen https://issues.apache.org/jira/browse/SPARK-17620 and https://issues.apache.org/jira/browse/SPARK-18397, and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use "spark.sql.sources.default", which is parquet, as the storage format in the "create table as select" scenario. But my case is a plain create table without select. When I set hive.default.fileformat=parquet in hive-site.xml or set spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, the created table still uses the textfile fileformat. It seems HiveSerDe gets the value of the hive.default.fileformat parameter from SQLConf. The parameter values in SQLConf are copied from SparkContext's SparkConf at SparkSession initialization, while the configuration parameters in hive-site.xml are loaded into SparkContext's hadoopConfiguration by SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop configuration, so the setting does not take effect. was: I have already seen https://issues.apache.org/jira/browse/SPARK-17620 and https://issues.apache.org/jira/browse/SPARK-18397, and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use "spark.sql.sources.default", which is parquet, as the storage format in the "create table as select" scenario. But my case is a plain create table without select. When I set hive.default.fileformat=parquet in hive-site.xml or set spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, the created table still uses the textfile fileformat. 
It seems HiveSerDe gets the value of the hive.default.fileformat parameter from SQLConf. The parameter values in SQLConf are copied from SparkContext's SparkConf at SparkSession initialization, while the configuration parameters in hive-site.xml are loaded into SparkContext's hadoopConfiguration by SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop configuration, so the setting does not take effect. > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Priority: Major > Attachments: Try.pdf > > > *You can see the details in the attachment called Try.pdf.* > I have already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397, > and I checked the Spark source code: setting > "spark.sql.hive.convertCTAS=true" makes Spark use > "spark.sql.sources.default", which is parquet, as the storage format in the > "create table as select" scenario. > But my case is a plain create table without select. When I set > hive.default.fileformat=parquet in hive-site.xml or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, the > created table still uses the textfile fileformat. > > It seems HiveSerDe gets the value of the hive.default.fileformat parameter > from SQLConf. > The parameter values in SQLConf are copied from SparkContext's SparkConf at > SparkSession initialization, while the configuration parameters in > hive-site.xml are loaded into SparkContext's hadoopConfiguration > by SharedState, and all configs prefixed with "spark.hadoop" are set on the > Hadoop configuration, so the setting does not take effect. 
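[Editor's note] For reference, the setting that the reporter expects to take effect would normally be declared in hive-site.xml as below (a standard Hive configuration snippet, shown only to illustrate what Spark 2.3.2 reportedly ignores here):
{code:xml}
<!-- hive-site.xml: per this report, Spark still creates TEXTFILE tables despite this setting -->
<property>
  <name>hive.default.fileformat</name>
  <value>parquet</value>
</property>
{code}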
[jira] [Commented] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825772#comment-16825772 ] Hui WANG commented on SPARK-27555: -- OK, I have already attached the reproduction details in the attachment area as Try.pdf. > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Priority: Major > Attachments: Try.pdf > > > I have already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397, > and I checked the Spark source code: setting > "spark.sql.hive.convertCTAS=true" makes Spark use > "spark.sql.sources.default", which is parquet, as the storage format in the > "create table as select" scenario. > But my case is a plain create table without select. When I set > hive.default.fileformat=parquet in hive-site.xml or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, the > created table still uses the textfile fileformat. > > It seems HiveSerDe gets the value of the hive.default.fileformat parameter > from SQLConf. > The parameter values in SQLConf are copied from SparkContext's SparkConf at > SparkSession initialization, while the configuration parameters in > hive-site.xml are loaded into SparkContext's hadoopConfiguration > by SharedState, and all configs prefixed with "spark.hadoop" are set on the > Hadoop configuration, so the setting does not take effect.
[jira] [Commented] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle
[ https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825770#comment-16825770 ] Josh Rosen commented on SPARK-27530: This specific error message was added in SPARK-24160. As discussed in that ticket's PR, zero-sized blocks should never be requested, so the receipt of a zero-sized block indicates either a bug in the shuffle-request-issuing logic or a data corruption / truncation bug somewhere in the shuffle stack. > FetchFailedException: Received a zero-size buffer for block shuffle > --- > > Key: SPARK-27530 > URL: https://issues.apache.org/jira/browse/SPARK-27530 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Adrian Muraru >Priority: Major > > I'm getting this in a large shuffle: > {code:java} > org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer > for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, > 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) > at > 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.IOException: Received a zero-size buffer for block > shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) > (expectedApproxSize = 33708, isNetworkReqDone=false){code} > -- This message was sent by Atlassian JIRA 
(v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
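As a rough illustration of the invariant Josh Rosen describes (zero-sized blocks should never be requested), the validation can be sketched like this — a hedged simulation in plain Python, not Spark's actual ShuffleBlockFetcherIterator code; the names simply mirror the error message:

```python
# Hypothetical sketch of the zero-size-buffer check described above.

class ZeroSizeBufferError(IOError):
    pass

def validate_fetched_block(block_id, buffer, expected_approx_size):
    """Reject empty buffers outright: a zero-size block should never be
    requested, so receiving one indicates either a bug in the
    shuffle-request-issuing logic or corruption/truncation in the
    shuffle stack."""
    if len(buffer) == 0:
        raise ZeroSizeBufferError(
            f"Received a zero-size buffer for block {block_id} "
            f"(expectedApproxSize = {expected_approx_size})")
    return buffer

# A non-empty buffer passes through unchanged.
assert validate_fetched_block("shuffle_2_9167_1861", b"data", 33708) == b"data"

# An empty buffer raises, matching the exception in the report.
try:
    validate_fetched_block("shuffle_2_9167_1861", b"", 33708)
except ZeroSizeBufferError as e:
    print(e)
```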
[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hui WANG updated SPARK-27555: - Attachment: (was: Try1.jpg) > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Priority: Major > Attachments: Try.pdf > > > I have already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397, > and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use > "spark.sql.sources.default", which is parquet, as the storage format in the "create > table as select" scenario. > But my case is a plain create table, without a select. When I set > hive.default.fileformat=parquet in hive-site.xml or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after > creating a table the Hive table still uses the textfile format. > > It seems HiveSerDe reads the hive.default.fileformat parameter > from SQLConf. > The parameter values in SQLConf are copied from the SparkContext's SparkConf at > SparkSession initialization, while the configuration parameters in > hive-site.xml are loaded into the SparkContext's hadoopConfiguration > by SharedState, and all settings with the "spark.hadoop" prefix are applied to the > Hadoop configuration, so the setting never reaches SQLConf and does not take effect. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hui WANG updated SPARK-27555: - Attachment: Try.pdf > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Priority: Major > Attachments: Try.pdf > > > I have already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397, > and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use > "spark.sql.sources.default", which is parquet, as the storage format in the "create > table as select" scenario. > But my case is a plain create table, without a select. When I set > hive.default.fileformat=parquet in hive-site.xml or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after > creating a table the Hive table still uses the textfile format. > > It seems HiveSerDe reads the hive.default.fileformat parameter > from SQLConf. > The parameter values in SQLConf are copied from the SparkContext's SparkConf at > SparkSession initialization, while the configuration parameters in > hive-site.xml are loaded into the SparkContext's hadoopConfiguration > by SharedState, and all settings with the "spark.hadoop" prefix are applied to the > Hadoop configuration, so the setting never reaches SQLConf and does not take effect. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hui WANG updated SPARK-27555: - Attachment: Try1.jpg > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Priority: Major > Attachments: Try1.jpg > > > I have already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397, > and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use > "spark.sql.sources.default", which is parquet, as the storage format in the "create > table as select" scenario. > But my case is a plain create table, without a select. When I set > hive.default.fileformat=parquet in hive-site.xml or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after > creating a table the Hive table still uses the textfile format. > > It seems HiveSerDe reads the hive.default.fileformat parameter > from SQLConf. > The parameter values in SQLConf are copied from the SparkContext's SparkConf at > SparkSession initialization, while the configuration parameters in > hive-site.xml are loaded into the SparkContext's hadoopConfiguration > by SharedState, and all settings with the "spark.hadoop" prefix are applied to the > Hadoop configuration, so the setting never reaches SQLConf and does not take effect. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hui WANG updated SPARK-27555: - Description: I have already seen https://issues.apache.org/jira/browse/SPARK-17620 and https://issues.apache.org/jira/browse/SPARK-18397, and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use "spark.sql.sources.default", which is parquet, as the storage format in the "create table as select" scenario. But my case is a plain create table, without a select. When I set hive.default.fileformat=parquet in hive-site.xml or set spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after creating a table the Hive table still uses the textfile format. It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf. The parameter values in SQLConf are copied from the SparkContext's SparkConf at SparkSession initialization, while the configuration parameters in hive-site.xml are loaded into the SparkContext's hadoopConfiguration by SharedState, and all settings with the "spark.hadoop" prefix are applied to the Hadoop configuration, so the setting never reaches SQLConf and does not take effect. was: I have already seen https://issues.apache.org/jira/browse/SPARK-17620 and https://issues.apache.org/jira/browse/SPARK-18397, and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use "spark.sql.sources.default", which is parquet, as the storage format in the "create table as select" scenario. But my case is a plain create table, without a select. When I set hive.default.fileformat=parquet in hive-site.xml or set spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after creating a table the Hive table still uses the textfile format.
It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf. The parameter values in SQLConf are copied from the SparkContext's SparkConf at SparkSession initialization, while the configuration parameters in hive-site.xml are loaded into the SparkContext's hadoopConfiguration by SharedState, and all settings with the "spark.hadoop" prefix are applied to the Hadoop configuration, so the setting never reaches SQLConf and does not take effect. Try1: vi /opt/spark/conf/spark-defaults.conf hive.default.fileformat parquetfile Then open spark-sql; it tells me Try2: vi /opt/spark/conf/spark-defaults.conf spark.hadoop.hive.default.fileformat parquetfile Then open spark-sql. And then I set hive.default.fileformat directly in the current spark-sql REPL. Try3: Edit hive-site.xml directly. vi hive-site.xml Then open spark-sql. > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Priority: Major > > > I have already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397, > and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use > "spark.sql.sources.default", which is parquet, as the storage format in the "create > table as select" scenario. > But my case is a plain create table, without a select. When I set > hive.default.fileformat=parquet in hive-site.xml or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after > creating a table the Hive table still uses the textfile format.
> > It seems HiveSerDe reads the hive.default.fileformat parameter > from SQLConf. > The parameter values in SQLConf are copied from the SparkContext's SparkConf at > SparkSession initialization, while the configuration parameters in > hive-site.xml are loaded into the SparkContext's hadoopConfiguration > by SharedState, and all settings with the "spark.hadoop" prefix are applied to the > Hadoop configuration, so the setting never reaches SQLConf and does not take effect. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hui WANG updated SPARK-27555: - Description: I have already seen https://issues.apache.org/jira/browse/SPARK-17620 and https://issues.apache.org/jira/browse/SPARK-18397, and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use "spark.sql.sources.default", which is parquet, as the storage format in the "create table as select" scenario. But my case is a plain create table, without a select. When I set hive.default.fileformat=parquet in hive-site.xml or set spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after creating a table the Hive table still uses the textfile format. It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf. The parameter values in SQLConf are copied from the SparkContext's SparkConf at SparkSession initialization, while the configuration parameters in hive-site.xml are loaded into the SparkContext's hadoopConfiguration by SharedState, and all settings with the "spark.hadoop" prefix are applied to the Hadoop configuration, so the setting never reaches SQLConf and does not take effect. Try1: vi /opt/spark/conf/spark-defaults.conf hive.default.fileformat parquetfile Then open spark-sql; it tells me Try2: vi /opt/spark/conf/spark-defaults.conf spark.hadoop.hive.default.fileformat parquetfile Then open spark-sql. And then I set hive.default.fileformat directly in the current spark-sql REPL. Try3: Edit hive-site.xml directly. vi hive-site.xml Then open spark-sql. was: I have already seen https://issues.apache.org/jira/browse/SPARK-17620 and https://issues.apache.org/jira/browse/SPARK-18397, and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use "spark.sql.sources.default", which is parquet, as the storage format in the "create table as select" scenario. But my case is a plain create table, without a select.
When I set hive.default.fileformat=parquet in hive-site.xml or set spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after creating a table the Hive table still uses the textfile format. It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf. The parameter values in SQLConf are copied from the SparkContext's SparkConf at SparkSession initialization, while the configuration parameters in hive-site.xml are loaded into the SparkContext's hadoopConfiguration by SharedState, and all settings with the "spark.hadoop" prefix are applied to the Hadoop configuration, so the setting never reaches SQLConf and does not take effect. > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Priority: Major > > > I have already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397, > and I checked the Spark source code: setting "spark.sql.hive.convertCTAS=true" makes Spark use > "spark.sql.sources.default", which is parquet, as the storage format in the "create > table as select" scenario. > But my case is a plain create table, without a select. When I set > hive.default.fileformat=parquet in hive-site.xml or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after > creating a table the Hive table still uses the textfile format.
> > It seems HiveSerDe reads the hive.default.fileformat parameter > from SQLConf. > The parameter values in SQLConf are copied from the SparkContext's SparkConf at > SparkSession initialization, while the configuration parameters in > hive-site.xml are loaded into the SparkContext's hadoopConfiguration > by SharedState, and all settings with the "spark.hadoop" prefix are applied to the > Hadoop configuration, so the setting never reaches SQLConf and does not take effect. > > > Try1: > vi /opt/spark/conf/spark-defaults.conf > hive.default.fileformat parquetfile > Then open spark-sql; it tells me > > > Try2: > vi /opt/spark/conf/spark-defaults.conf > spark.hadoop.hive.default.fileformat parquetfile > Then open spark-sql > > > And then I set hive.default.fileformat directly in the current spark-sql REPL, > > > > Try3: > Edit hive-site.xml directly. > vi hive-site.xml > > Then open spark-sql > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For addi
[jira] [Assigned] (SPARK-27494) Null values don't work in Kafka source v2
[ https://issues.apache.org/jira/browse/SPARK-27494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27494: Assignee: (was: Apache Spark) > Null values don't work in Kafka source v2 > - > > Key: SPARK-27494 > URL: https://issues.apache.org/jira/browse/SPARK-27494 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > > Right now Kafka source v2 doesn't support null values. The issue is in > org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow > which doesn't handle null values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
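The fix direction is for the converter to map null keys and values through explicitly rather than assuming non-null bytes (Kafka tombstone records, for example, carry a null value). A minimal sketch of such null-safe conversion, in plain Python with illustrative names rather than the actual KafkaRecordToUnsafeRowConverter API:

```python
# Illustrative null-safe record conversion (not Spark's converter):
# Kafka allows a record's key and/or value to be null, so the conversion
# must propagate None instead of failing on it.

def record_to_row(record):
    """Convert a (key, value, topic, partition, offset) tuple to a row
    dict, preserving nulls rather than raising on them."""
    key, value, topic, partition, offset = record
    return {
        "key": bytes(key) if key is not None else None,
        "value": bytes(value) if value is not None else None,
        "topic": topic,
        "partition": partition,
        "offset": offset,
    }

# A tombstone record (null value) converts cleanly instead of raising.
row = record_to_row((b"k", None, "t", 0, 42))
print(row["value"])  # None
```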
[jira] [Assigned] (SPARK-27494) Null values don't work in Kafka source v2
[ https://issues.apache.org/jira/browse/SPARK-27494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27494: Assignee: Apache Spark > Null values don't work in Kafka source v2 > - > > Key: SPARK-27494 > URL: https://issues.apache.org/jira/browse/SPARK-27494 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Major > > Right now Kafka source v2 doesn't support null values. The issue is in > org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow > which doesn't handle null values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27526) Driver OOM error occurs while writing parquet file with Append mode
[ https://issues.apache.org/jira/browse/SPARK-27526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825764#comment-16825764 ] Hyukjin Kwon commented on SPARK-27526: -- Please avoid setting a target version; that field is usually reserved for committers. > Driver OOM error occurs while writing parquet file with Append mode > --- > > Key: SPARK-27526 > URL: https://issues.apache.org/jira/browse/SPARK-27526 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.1.1 > Environment: centos6.7 >Reporter: senyoung >Priority: Major > Labels: oom > > As in this user code: > {code:java} > someDataFrame.write > .mode(SaveMode.Append) > .partitionBy(somePartitionKeySeqs) > .parquet(targetPath); > {code} > When Spark tries to write parquet files into HDFS with SaveMode.Append, > it must check that the requested partition columns > match the existing files; however, this behavior caches the info of all leaf > files under the targetPath. > This can easily trigger an OOM when there are too many files in the targetPath. > This behavior is useful when someone needs exact correctness, but I > think it should be optional to avoid the OOM. > The relevant code is here: > {code:java} > //package org.apache.spark.sql.execution.datasources > //case class DataSource > private def writeInFileFormat(format: FileFormat, mode: SaveMode, data: > DataFrame): Unit = { > ... > /* can we make it optional? */ > if (mode == SaveMode.Append) { > val existingPartitionColumns = Try { > /** > * getOrInferFileFormatSchema(format, justPartitioning = true): > * this method may cause an OOM when there are too many files; could we just sample > * a limited number of files rather than listing all existing files?
> */ > getOrInferFileFormatSchema(format, justPartitioning = true) > ._2.fieldNames.toList > }.getOrElse(Seq.empty[String]) > val sameColumns = > existingPartitionColumns.map(_.toLowerCase()) == > partitionColumns.map(_.toLowerCase()) > if (existingPartitionColumns.nonEmpty && !sameColumns) { > throw new AnalysisException( > s"""Requested partitioning does not match existing partitioning. > |Existing partitioning columns: > | ${existingPartitionColumns.mkString(", ")} > |Requested partitioning columns: > | ${partitionColumns.mkString(", ")} > |""".stripMargin) > } > } > ... > } > private def getOrInferFileFormatSchema( > format: FileFormat, > justPartitioning: Boolean = false): (StructType, StructType) = { > lazy val tempFileIndex = { > val allPaths = caseInsensitiveOptions.get("path") ++ paths > val hadoopConf = sparkSession.sessionState.newHadoopConf() > val globbedPaths = allPaths.toSeq.flatMap { path => > val hdfsPath = new Path(path) > val fs = hdfsPath.getFileSystem(hadoopConf) > val qualified = hdfsPath.makeQualified(fs.getUri, > fs.getWorkingDirectory) > SparkHadoopUtil.get.globPathIfNecessary(qualified) > }.toArray >/** > * InMemoryFileIndex caches the info of all files: OOM risk >*/ > new InMemoryFileIndex(sparkSession, globbedPaths, options, None) > } > ... > } > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
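The reporter's suggestion — sample a limited number of files instead of caching every leaf file's info — can be sketched as follows (a self-contained Python illustration under that assumption, not Spark's DataSource code): partition columns are recoverable from the key=value segments of a bounded sample of paths, so the full listing never needs to be materialized on the driver.

```python
# Hedged sketch of the sampling idea from the report above. Illustrative
# only; not Spark's implementation, and it assumes Hive-style
# key=value partition directories.
import itertools

def partition_columns_from_paths(paths, sample_size=100):
    """Derive partition column names from key=value path segments,
    looking at no more than sample_size paths."""
    columns = []
    for path in itertools.islice(paths, sample_size):
        for segment in path.strip("/").split("/"):
            if "=" in segment:
                name = segment.split("=", 1)[0]
                if name not in columns:
                    columns.append(name)
    return columns

# Works on a generator, so a huge listing is consumed lazily and the
# driver never holds more than the sample.
def fake_listing():
    for i in range(10_000_000):
        yield f"/data/year=2019/month={i % 12}/part-{i}.parquet"

print(partition_columns_from_paths(fake_listing()))  # ['year', 'month']
```

The trade-off noted in the report applies: sampling gives up the exact-correctness guarantee of the full check, which is why the reporter proposes making it optional rather than the default.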
[jira] [Updated] (SPARK-27526) Driver OOM error occurs while writing parquet file with Append mode
[ https://issues.apache.org/jira/browse/SPARK-27526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27526: - Target Version/s: (was: 2.1.1) > Driver OOM error occurs while writing parquet file with Append mode > --- > > Key: SPARK-27526 > URL: https://issues.apache.org/jira/browse/SPARK-27526 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.1.1 > Environment: centos6.7 >Reporter: senyoung >Priority: Major > Labels: oom > > As in this user code: > {code:java} > someDataFrame.write > .mode(SaveMode.Append) > .partitionBy(somePartitionKeySeqs) > .parquet(targetPath); > {code} > When Spark tries to write parquet files into HDFS with SaveMode.Append, > it must check that the requested partition columns > match the existing files; however, this behavior caches the info of all leaf > files under the targetPath. > This can easily trigger an OOM when there are too many files in the targetPath. > This behavior is useful when someone needs exact correctness, but I > think it should be optional to avoid the OOM. > The relevant code is here: > {code:java} > //package org.apache.spark.sql.execution.datasources > //case class DataSource > private def writeInFileFormat(format: FileFormat, mode: SaveMode, data: > DataFrame): Unit = { > ... > /* can we make it optional? */ > if (mode == SaveMode.Append) { > val existingPartitionColumns = Try { > /** > * getOrInferFileFormatSchema(format, justPartitioning = true): > * this method may cause an OOM when there are too many files; could we just sample > * a limited number of files rather than listing all existing files?
> */ > getOrInferFileFormatSchema(format, justPartitioning = true) > ._2.fieldNames.toList > }.getOrElse(Seq.empty[String]) > val sameColumns = > existingPartitionColumns.map(_.toLowerCase()) == > partitionColumns.map(_.toLowerCase()) > if (existingPartitionColumns.nonEmpty && !sameColumns) { > throw new AnalysisException( > s"""Requested partitioning does not match existing partitioning. > |Existing partitioning columns: > | ${existingPartitionColumns.mkString(", ")} > |Requested partitioning columns: > | ${partitionColumns.mkString(", ")} > |""".stripMargin) > } > } > ... > } > private def getOrInferFileFormatSchema( > format: FileFormat, > justPartitioning: Boolean = false): (StructType, StructType) = { > lazy val tempFileIndex = { > val allPaths = caseInsensitiveOptions.get("path") ++ paths > val hadoopConf = sparkSession.sessionState.newHadoopConf() > val globbedPaths = allPaths.toSeq.flatMap { path => > val hdfsPath = new Path(path) > val fs = hdfsPath.getFileSystem(hadoopConf) > val qualified = hdfsPath.makeQualified(fs.getUri, > fs.getWorkingDirectory) > SparkHadoopUtil.get.globPathIfNecessary(qualified) > }.toArray >/** > * InMemoryFileIndex caches the info of all files: OOM risk >*/ > new InMemoryFileIndex(sparkSession, globbedPaths, options, None) > } > ... > } > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27529) Spark Streaming consumer dies with kafka.common.OffsetOutOfRangeException
[ https://issues.apache.org/jira/browse/SPARK-27529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825763#comment-16825763 ] Hyukjin Kwon commented on SPARK-27529: -- Spark 1.x is EOL. Can you check if it works in a higher version? > Spark Streaming consumer dies with kafka.common.OffsetOutOfRangeException > - > > Key: SPARK-27529 > URL: https://issues.apache.org/jira/browse/SPARK-27529 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.5.0 >Reporter: Dmitry Goldenberg >Priority: Major > > We have a Spark Streaming consumer which at a certain point started > consistently failing upon a restart with the below error. > Some details: > * Spark version is 1.5.0. > * Kafka version is 0.8.2.1 (2.10-0.8.2.1). > * The topic is configured with: retention.ms=1471228928, > max.message.bytes=1. > * The consumer runs with auto.offset.reset=smallest. > * No checkpointing is currently enabled. > I don't see anything in the Spark or Kafka docs to understand why this is > happening. From googling around: > {noformat} > https://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/ > Finally, I’ll repeat that any semantics beyond at-most-once require that you > have sufficient log retention in Kafka. If you’re seeing things like > OffsetOutOfRangeException, it’s probably because you underprovisioned Kafka > storage, not because something’s wrong with Spark or Kafka.{noformat} > Also looking at SPARK-12693 and SPARK-11693, I don't understand the possible > causes. > {noformat} > You've under-provisioned Kafka storage and / or Spark compute capacity. > The result is that data is being deleted before it has been > processed.{noformat} > All we're trying to do is start the consumer and consume from the topic from > the earliest available offset. Why would we not be able to do that? How can > the offsets be out of range if we're saying, just read from the earliest > available? 
> Since we have the retention.ms set to 1 year and we created the topic just a > few weeks ago, I'd not expect any deletion being done by Kafka as we're > consuming. > I'd like to understand the actual cause of this error. Any recommendations on > a workaround would be appreciated. > Stack traces: > {noformat} > 2019-04-19 11:35:17,147 ERROR org.apache.spark.scheduler > .TaskSetManager: Task 10 in stage 147.0 failed 4 times; aborting job > 2019-04-19 11:35:17,160 ERROR > org.apache.spark.streaming.scheduler.JobScheduler: Error running job > streaming job 1555682554000 ms.0 > org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in > stage 147.0 failed 4 times, most recent failure: Lost task > 10.3 in stage 147.0 (TID 2368, 10.150.0.58): > kafka.common.OffsetOutOfRangeException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at java.lang.Class.newInstance(Class.java:442) > at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86) > at > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184) > at > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193) > at > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29) > at > com.acme.consumer.kafka.spark.ProcessPartitionFunction.call(ProcessPartitionFunction.java:69) > at > com.acme.consumer.kafka.spark.ProcessPartitionFunction.call(ProcessPartitionFunction.java:24) > at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:222) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:222) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala
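The failure mode can be reproduced conceptually with a tiny simulation (plain Python; all names are illustrative, not the Kafka or Spark API): the broker's earliest retained offset advances as retention deletes log segments, so an offset resolved as "smallest" at one point in time can fall out of range by the time it is fetched.

```python
# Self-contained simulation of how an OffsetOutOfRangeException arises:
# the broker's log start offset moves forward as retention deletes
# segments, so a fetch at an offset below it fails even though the
# consumer originally asked for "the earliest available".

class OffsetOutOfRange(Exception):
    pass

class FakePartitionLog:
    def __init__(self, earliest, latest):
        self.earliest = earliest  # first offset still on disk
        self.latest = latest      # next offset to be written

    def fetch(self, offset):
        if not (self.earliest <= offset < self.latest):
            raise OffsetOutOfRange(
                f"offset {offset} outside [{self.earliest}, {self.latest})")
        return f"message-{offset}"

log = FakePartitionLog(earliest=500, latest=900)

# "auto.offset.reset=smallest" was resolved earlier, when earliest was 100:
stale_start = 100

# Retention has since deleted offsets 100-499, so the fetch fails...
try:
    log.fetch(stale_start)
except OffsetOutOfRange as e:
    print(e)

# ...while re-resolving the earliest offset at fetch time succeeds.
print(log.fetch(log.earliest))  # message-500
```

This is consistent with the Cloudera quote above: if data is deleted before it is processed, the stored or computed offset range no longer exists on the broker, regardless of what the consumer's reset policy said at startup.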
[jira] [Commented] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle
[ https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825762#comment-16825762 ] Hyukjin Kwon commented on SPARK-27530: -- Please avoid to set Critical+ which is usually reserved for committers. > FetchFailedException: Received a zero-size buffer for block shuffle > --- > > Key: SPARK-27530 > URL: https://issues.apache.org/jira/browse/SPARK-27530 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Adrian Muraru >Priority: Major > > I'm getting this in a large shuffle: > {code:java} > org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer > for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, > 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) > at > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.IOException: Received a zero-size buffer for block > shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) > (expectedApproxSize = 33708, isNetworkReqDone=false){code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle
[ https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27530: - Priority: Major (was: Critical) > FetchFailedException: Received a zero-size buffer for block shuffle > --- > > Key: SPARK-27530 > URL: https://issues.apache.org/jira/browse/SPARK-27530 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Adrian Muraru >Priority: Major > > I'm getting this in a large shuffle: > {code:java} > org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer > for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, > 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) > at > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.IOException: Received a zero-size buffer for block > shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) > (expectedApproxSize = 33708, isNetworkReqDone=false){code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27477) Kafka token provider should have provided dependency on Spark
[ https://issues.apache.org/jira/browse/SPARK-27477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27477: Assignee: Apache Spark > Kafka token provider should have provided dependency on Spark > - > > Key: SPARK-27477 > URL: https://issues.apache.org/jira/browse/SPARK-27477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark 3.0.0-SNAPSHOT > commit 38fc8e2484aa4971d1f2c115da61fc96f36e7868 > Author: Sean Owen > Date: Sat Apr 13 22:27:25 2019 +0900 > [MINOR][DOCS] Fix some broken links in docs >Reporter: koert kuipers >Assignee: Apache Spark >Priority: Trivial > Fix For: 3.0.0 > > > currently the external module spark-token-provider-kafka-0-10 has a compile > dependency on spark-core. this means spark-sql-kafka-0-10 also has a > transitive compile dependency on spark-core. > since spark-sql-kafka-0-10 is not bundled with spark but instead has to be > added to an application that runs on spark this dependency should be > provided, not compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27477) Kafka token provider should have provided dependency on Spark
[ https://issues.apache.org/jira/browse/SPARK-27477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27477: Assignee: (was: Apache Spark) > Kafka token provider should have provided dependency on Spark > - > > Key: SPARK-27477 > URL: https://issues.apache.org/jira/browse/SPARK-27477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark 3.0.0-SNAPSHOT > commit 38fc8e2484aa4971d1f2c115da61fc96f36e7868 > Author: Sean Owen > Date: Sat Apr 13 22:27:25 2019 +0900 > [MINOR][DOCS] Fix some broken links in docs >Reporter: koert kuipers >Priority: Trivial > Fix For: 3.0.0 > > > currently the external module spark-token-provider-kafka-0-10 has a compile > dependency on spark-core. this means spark-sql-kafka-0-10 also has a > transitive compile dependency on spark-core. > since spark-sql-kafka-0-10 is not bundled with spark but instead has to be > added to an application that runs on spark this dependency should be > provided, not compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
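The change requested above amounts to declaring the Spark dependency with Maven's provided scope in the module's POM. A minimal sketch (the artifact IDs come from the ticket; the Scala suffix and version shown are illustrative assumptions, not taken from the actual build files):

```xml
<!-- Sketch: mark spark-core as a provided dependency of the
     spark-token-provider-kafka-0-10 module. With provided scope the jar is
     available at compile time but is NOT pulled transitively into
     applications that add spark-sql-kafka-0-10, since those applications
     already run on a Spark distribution that supplies spark-core.
     The _2.12 suffix and version are illustrative. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.0.0-SNAPSHOT</version>
  <scope>provided</scope>
</dependency>
```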
[jira] [Resolved] (SPARK-27537) spark-2.4.1/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value size is not a member of Object
[ https://issues.apache.org/jira/browse/SPARK-27537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27537. -- Resolution: Duplicate JDK 11 is in progress > spark-2.4.1/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value > size is not a member of Object > - > > Key: SPARK-27537 > URL: https://issues.apache.org/jira/browse/SPARK-27537 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.3.0, 2.4.1 > Environment: Machine:aarch64 > OS:Red Hat Enterprise Linux Server release 7.4 > Kernel:4.11.0-44.el7a > spark version: spark-2.4.1 > java:openjdk version "11.0.2" 2019-01-15 > OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.2+9) > scala:2.11.12 > gcc version:4.8.5 >Reporter: dingwei2019 >Priority: Major > Labels: build, test > > {code} > [ERROR]: [Error] > $SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value > size is not a member of Object > [ERROR]: [Error] > $SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.869:Value > size is not a member of Object > {code} > ERROR: two errors found > Below is the related code: > {code} > test("toString") { > val empty = Matrices.ones(0, 0) > empty.toString(0, 0) > > val mat = Matrices.rand(5, 10, new Random()) > mat.toString(-1, -5) > mat.toString(0, 0) > mat.toString(Int.MinValue, Int.MinValue) > mat.toString(Int.MaxValue, Int.MaxValue) > var lines = mat.toString(6, 50).lines.toArray > assert(lines.size == 5 && lines.forall(_.size <= 50)) > > lines = mat.toString(5, 100).lines.toArray > assert(lines.size == 5 && lines.forall(_.size <= 100)) > } > {code} > {code} > test("numNonzeros and numActives") { > val dm1 = Matrices.dense(3, 2, Array(0, 0, -1, 1, 0, 1)) > assert(dm1.numNonzeros === 3) > assert(dm1.numActives === 6) > val sm1 = Matrices.sparse(3, 2, Array(0, 2, 3), Array(0, 2, 1), Array(0.0, > -1.2, 0.0)) > assert(sm1.numNonzeros === 1) > 
assert(sm1.numActives === 3) > } > {code} > what shall i do to solve this problem, and when will spark support jdk11? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27540) Add 'meanAveragePrecision_at_k' metric to RankingMetrics
[ https://issues.apache.org/jira/browse/SPARK-27540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27540: Assignee: (was: Apache Spark) > Add 'meanAveragePrecision_at_k' metric to RankingMetrics > > > Key: SPARK-27540 > URL: https://issues.apache.org/jira/browse/SPARK-27540 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.4.1 >Reporter: Pham Nguyen Tuan Anh >Priority: Minor > > Sometimes, we only focus on MAP of top-k results. > This ticket adds MAP@k to RankingMetrics, besides existing MAP. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27540) Add 'meanAveragePrecision_at_k' metric to RankingMetrics
[ https://issues.apache.org/jira/browse/SPARK-27540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27540: Assignee: Apache Spark > Add 'meanAveragePrecision_at_k' metric to RankingMetrics > > > Key: SPARK-27540 > URL: https://issues.apache.org/jira/browse/SPARK-27540 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.4.1 >Reporter: Pham Nguyen Tuan Anh >Assignee: Apache Spark >Priority: Minor > > Sometimes, we only focus on MAP of top-k results. > This ticket adds MAP@k to RankingMetrics, besides existing MAP. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27537) spark-2.4.1/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value size is not a member of Object
[ https://issues.apache.org/jira/browse/SPARK-27537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27537: - Description: {code} [ERROR]: [Error] $SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value size is not a member of Object [ERROR]: [Error] $SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.869:Value size is not a member of Object {code} ERROR: two errors found Below is the related code: {code} test("toString") { val empty = Matrices.ones(0, 0) empty.toString(0, 0) val mat = Matrices.rand(5, 10, new Random()) mat.toString(-1, -5) mat.toString(0, 0) mat.toString(Int.MinValue, Int.MinValue) mat.toString(Int.MaxValue, Int.MaxValue) var lines = mat.toString(6, 50).lines.toArray assert(lines.size == 5 && lines.forall(_.size <= 50)) lines = mat.toString(5, 100).lines.toArray assert(lines.size == 5 && lines.forall(_.size <= 100)) } {code} {code} test("numNonzeros and numActives") { val dm1 = Matrices.dense(3, 2, Array(0, 0, -1, 1, 0, 1)) assert(dm1.numNonzeros === 3) assert(dm1.numActives === 6) val sm1 = Matrices.sparse(3, 2, Array(0, 2, 3), Array(0, 2, 1), Array(0.0, -1.2, 0.0)) assert(sm1.numNonzeros === 1) assert(sm1.numActives === 3) } {code} what shall i do to solve this problem, and when will spark support jdk11? 
[jira] [Comment Edited] (SPARK-27537) spark-2.4.1/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value size is not a member of Object
[ https://issues.apache.org/jira/browse/SPARK-27537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822845#comment-16822845 ] Hyukjin Kwon edited comment on SPARK-27537 at 4/25/19 6:05 AM: --- The problem shows up in the Spark ML test module; although it is only a test module, I want to understand it. From the description above, this looks like an incompatibility between Java 11 and Scala 2.11.12: if I switch back to JDK 8, the problem disappears. My analysis: when a method has both a Java and a Scala implementation, the Java method is resolved; otherwise the Scala method is used. Java 11 adds a lines method to String, which collides with the lines method Scala provides on StringLike. The Scala method returns an Iterator; Iterator.toArray returns an Array, and Array has a size method, so the Scala resolution works: lines (Iterator) --> toArray (Array) --> size. But Java 11's String.lines returns a Stream, and Stream.toArray returns Object[], which has no size method. That is exactly what the error reports: (Stream) --> toArray (Object[]) --> no size method. What should I do to solve this problem? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
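The clash described in the comment above can be seen directly from Java. This sketch shows the JDK 11 side of the ambiguity: String.lines() (new in Java 11) returns a java.util.stream.Stream, and the no-argument Stream.toArray() returns Object[], which carries only a length field, not the size method that Scala's Array (reached via StringLike.lines on JDK 8) would provide:

```java
// Sketch of the JDK 11 / Scala 2.11 resolution clash described above.
// On JDK 11, "text.lines" resolves to String.lines(), which returns a
// Stream<String>; Stream.toArray() with no arguments yields Object[],
// so a subsequent .size (a Scala Array method) fails to compile.
import java.util.stream.Stream;

public class LinesClash {
    public static void main(String[] args) {
        String text = "a\nb\nc";
        Stream<String> lines = text.lines();   // JDK 11+ only
        Object[] arr = lines.toArray();        // erased element type: Object[]
        System.out.println(arr.length);        // prints 3; no size() here
    }
}
```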
[jira] [Assigned] (SPARK-22044) explain function with codegen and cost parameters
[ https://issues.apache.org/jira/browse/SPARK-22044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22044: Assignee: (was: Apache Spark) > explain function with codegen and cost parameters > - > > Key: SPARK-22044 > URL: https://issues.apache.org/jira/browse/SPARK-22044 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > {{explain}} operator creates {{ExplainCommand}} runnable command that accepts > (among other things) {{codegen}} and {{cost}} arguments. > There's no version of {{explain}} to allow for this. That's however possible > using SQL which is kind of surprising (given how much focus is devoted to the > Dataset API). > This is to have another {{explain}} with {{codegen}} and {{cost}} arguments, > i.e. > {code} > def explain(codegen: Boolean = false, cost: Boolean = false): Unit > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22044) explain function with codegen and cost parameters
[ https://issues.apache.org/jira/browse/SPARK-22044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22044: Assignee: Apache Spark > explain function with codegen and cost parameters > - > > Key: SPARK-22044 > URL: https://issues.apache.org/jira/browse/SPARK-22044 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Minor > > {{explain}} operator creates {{ExplainCommand}} runnable command that accepts > (among other things) {{codegen}} and {{cost}} arguments. > There's no version of {{explain}} to allow for this. That's however possible > using SQL which is kind of surprising (given how much focus is devoted to the > Dataset API). > This is to have another {{explain}} with {{codegen}} and {{cost}} arguments, > i.e. > {code} > def explain(codegen: Boolean = false, cost: Boolean = false): Unit > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
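For reference, a sketch of the SQL route the ticket says already exposes these flags, next to the proposed Dataset overload (here `df` stands for any registered Dataset; the overload shown is the ticket's proposal, not an API that exists as of 2.2.0):

```scala
// What works today: the flags are reachable only through SQL.
spark.sql("EXPLAIN CODEGEN SELECT * FROM t").show(truncate = false)
spark.sql("EXPLAIN COST SELECT * FROM t").show(truncate = false)

// Proposed Dataset-side equivalent (sketch from the ticket):
// def explain(codegen: Boolean = false, cost: Boolean = false): Unit
df.explain(codegen = true)   // would include generated code
df.explain(cost = true)      // would include cost/statistics information
```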
[jira] [Resolved] (SPARK-27538) sparksql could not start in jdk11, exception org.datanucleus.exceptions.NucleusException: The java type java.lang.Long (jdbc-type='', sql-type="") cant be mapped for th
[ https://issues.apache.org/jira/browse/SPARK-27538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27538. -- Resolution: Duplicate > sparksql could not start in jdk11, exception > org.datanucleus.exceptions.NucleusException: The java type java.lang.Long > (jdbc-type='', sql-type="") cant be mapped for this datastore. No mapping is > available. > -- > > Key: SPARK-27538 > URL: https://issues.apache.org/jira/browse/SPARK-27538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.1 > Environment: Machine:aarch64 > OS:Red Hat Enterprise Linux Server release 7.4 > Kernel:4.11.0-44.el7a > spark version: spark-2.4.1 > java:openjdk version "11.0.2" 2019-01-15 > OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.2+9) > scala:2.11.12 >Reporter: dingwei2019 >Priority: Major > Labels: features > > [root@172-19-18-8 spark-2.4.1-bin-hadoop2.7-bak]# bin/spark-sql > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/home/dingwei/spark-2.4.1-bin-x86/spark-2.4.1-bin-hadoop2.7-bak/jars/spark-unsafe_2.11-2.4.1.jar) > to method java.nio.Bits.unaligned() > WARNING: Please consider reporting this to the maintainers of > org.apache.spark.unsafe.Platform > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > 2019-04-22 11:27:34,419 WARN util.NativeCodeLoader: Unable to load > native-hadoop library for your platform... 
using builtin-java classes where > applicable > 2019-04-22 11:27:35,306 INFO metastore.HiveMetaStore: 0: Opening raw store > with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore > 2019-04-22 11:27:35,330 INFO metastore.ObjectStore: ObjectStore, initialize > called > 2019-04-22 11:27:35,492 INFO DataNucleus.Persistence: Property > hive.metastore.integral.jdo.pushdown unknown - will be ignored > 2019-04-22 11:27:35,492 INFO DataNucleus.Persistence: Property > datanucleus.cache.level2 unknown - will be ignored > 2019-04-22 11:27:37,012 INFO metastore.ObjectStore: Setting MetaStore object > pin classes with > hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order" > 2019-04-22 11:27:37,638 WARN DataNucleus.Query: Query for candidates of > org.apache.hadoop.hive.metastore.model.MDatabase and subclasses resulted in > no possible candidates > The java type java.lang.Long (jdbc-type="", sql-type="") cant be mapped for > this datastore. No mapping is available. > org.datanucleus.exceptions.NucleusException: The java type java.lang.Long > (jdbc-type="", sql-type="") cant be mapped for this datastore. No mapping is > available. 
> at > org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.getDatastoreMappingClass(RDBMSMappingManager.java:1215) > at > org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.createDatastoreMapping(RDBMSMappingManager.java:1378) > at > org.datanucleus.store.rdbms.table.AbstractClassTable.addDatastoreId(AbstractClassTable.java:392) > at > org.datanucleus.store.rdbms.table.ClassTable.initializePK(ClassTable.java:1087) > at > org.datanucleus.store.rdbms.table.ClassTable.preInitialize(ClassTable.java:247) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTable(RDBMSStoreManager.java:3118) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTables(RDBMSStoreManager.java:2909) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTablesAndValidate(RDBMSStoreManager.java:3182) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2841) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:122) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.addClasses(RDBMSStoreManager.java:1605) > at > org.datanucleus.store.AbstractStoreManager.addClass(AbstractStoreManager.java:954) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:679) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:408) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:947) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:370) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1744) > at org.datanucleus.store.query.Query.executeWithArray(Que
[jira] [Comment Edited] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825756#comment-16825756 ] belvey edited comment on SPARK-13510 at 4/25/19 5:58 AM: - @Mike in my case I set it to '536870912' (512m) ,and it can be set to '512m' as spark will treat them equally. was (Author: belvey): @Mike in my case I set it to '536870912' (512m) ,and it can be set to '512m' as spark will treat it equally. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen >Priority: Major > Attachments: spark-13510.diff > > > In our cluster, when I test spark-1.6.0 with a sql, it throw exception and > failed. > {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffling a big block (like 1G), the task will allocate > the same amount of memory, so it can easily throw "FetchFailedException: Direct buffer > memory". > If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In MapReduce shuffle, it first judges whether the block can be cached in > memory, but Spark doesn't. 
> If the block is larger than what we can cache in memory, we should write it to disk. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
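The fetch-to-disk behaviour proposed above is, as later comments on this issue note, what `spark.maxRemoteBlockSizeFetchToMem` controls in Spark 2.3+. A minimal Python sketch of the decision (the function and constant names here are illustrative, not Spark's internals):

```python
# Sketch of the "spill oversized remote blocks to disk" decision described above.
# Names (fetch_destination, MAX_FETCH_TO_MEM) are hypothetical, not Spark's API.

MAX_FETCH_TO_MEM = 2**31 - 1 - 512  # default mentioned below: Integer.MAX_VALUE - 512 bytes

def fetch_destination(block_size: int, max_in_mem: int = MAX_FETCH_TO_MEM) -> str:
    """Decide whether a remote shuffle block is buffered in memory or streamed to disk."""
    return "memory" if block_size <= max_in_mem else "disk"

# The ~915.4 MB block from the log above fits under the default threshold...
assert fetch_destination(915 * 1024 * 1024) == "memory"
# ...but lowering the threshold to 512m would force it to disk.
assert fetch_destination(915 * 1024 * 1024, max_in_mem=512 * 1024 * 1024) == "disk"
```

With the default threshold, huge blocks are still buffered in direct memory, which is why lowering the setting avoids the OutOfMemoryError shown in the log.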
[jira] [Comment Edited] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806307#comment-16806307 ] belvey edited comment on SPARK-13510 at 4/25/19 5:57 AM: - [~shenhong], hello Hong Shen, I am using Spark 2.0 and facing the same issue; I am not sure whether your fix was merged into Spark 2. It is very kind of you to post your PR. edit: I found that the solution had already been added to Spark 2.3 and later. I am not sure whether it is Hong Shen's PR, but it is similar to what Hong Shen described. For Spark 2.3 and later we can use "spark.maxRemoteBlockSizeFetchToMem" to control the maximum remote block size that a shuffle fetch is allowed to buffer in memory; its default value is (Integer.MAX_VALUE - 512) bytes. was (Author: belvey): [~shenhong] , hello hongsheng, I am using spark2.0 facing the same issue, I am not sure if it's merged into spark2. it's very kind for you to post your pr. edit: I found that solution had already been added to spark2.3 and later. i am not sure if is hong shen's pr , but the solution is similar to what hong shen's said. And for spark2.3 and later we can use "spark.maxRemoteBlockSizeFetchToMem" to control the max block size allowed for shuffle fetching data that catched in memory, it's default value is (Interger.max-512) bytes. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen >Priority: Major > Attachments: spark-13510.diff
[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825756#comment-16825756 ] belvey commented on SPARK-13510: @Mike in my case I set it to '536870912' (512m), and it can also be set to '512m', as Spark treats them equally. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen >Priority: Major > Attachments: spark-13510.diff -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
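The two spellings in the comment above really do denote the same value: '512m' is 512 * 1024 * 1024 = 536870912 bytes. A toy parser (far simpler than Spark's actual byte-string handling in `JavaUtils.byteStringAsBytes`, which accepts more suffixes and forms) illustrates the equivalence:

```python
# Simplified byte-string parser showing why '512m' and '536870912' are
# equivalent settings for spark.maxRemoteBlockSizeFetchToMem. This is an
# illustrative sketch, not Spark's real parser.
UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def byte_string_as_bytes(s: str) -> int:
    s = s.strip().lower()
    suffix = s[-1] if s[-1] in UNITS else ""   # digits fall through to the "" unit
    number = s[:-1] if suffix else s
    return int(number) * UNITS[suffix]

assert byte_string_as_bytes("512m") == byte_string_as_bytes("536870912") == 536870912
```

Either form therefore sets the same threshold; the choice is purely readability.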
[jira] [Updated] (SPARK-27553) Operation log is not closed when close session
[ https://issues.apache.org/jira/browse/SPARK-27553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27553: - Description: On Windows 1. start spark-shell 2. start hive server in shell by org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark.sqlContext) 3. beeline connect to hive server 3.1 connect beeline -u jdbc:hive2://localhost:1 3.2 Run SQL show tables; 3.3 quit beeline !quit Get exception log {code} 19/04/24 11:38:22 ERROR HiveSessionImpl: Failed to cleanup session log dir: SessionHandle [5827428b-d140-4fc0-8ad4-721c39b3ead0] java.io.IOException: Unable to delete file: C:\Users\test\AppData\Local\Temp\test\operation_logs\5827428b-d140-4fc0-8ad4-721c39b3ead0\df9cd631-66e7-4303-9a41-a09bdefcf888 at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279) at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270) at org.apache.hive.service.cli.session.HiveSessionImpl.cleanupSessionLogDir(HiveSessionImpl.java:671) at org.apache.hive.service.cli.session.HiveSessionImpl.close(HiveSessionImpl.java:643) at org.apache.hive.service.cli.session.HiveSessionImplwithUGI.close(HiveSessionImplwithUGI.java:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) at com.sun.proxy.$Proxy19.close(Unknown Source) at org.apache.hive.service.cli.session.SessionManager.closeSession(SessionManager.java:280) at org.apache.spark.sql.hive.thriftserver.SparkSQLSessionManager.closeSession(SparkSQLSessionManager.scala:76) at org.apache.hive.service.cli.CLIService.closeSession(CLIService.java:237) at org.apache.hive.service.cli.thrift.ThriftCLIService.CloseSession(ThriftCLIService.java:397) at org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1273) at org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1258) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} was: On Window 1. start spark-shell 2. start hive server in shell by org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark.sqlContext) 3. 
beeline connect to hive server 3.1 connect beeline -u jdbc:hive2://localhost:1 3.2 Run SQL show tables; 3.3 quit beeline !quit Get exception log 19/04/24 11:38:22 ERROR HiveSessionImpl: Failed to cleanup ses sion log dir: SessionHandle [5827428b-d140-4fc0-8ad4-721c39b3ead0] java.io.IOException: Unable to delete file: C:\Users\test\AppData\Local\Temp\test\operation_logs\5827428b-d140-4fc0-8ad4-721c39b3ead0\df9cd631-66e7-4303-9a4 1-a09bdefcf888 at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279) at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270) at org.apache.hive.service.cli.session.HiveSessionImpl.cleanupSessionLogDir(HiveSessionImpl.java:671) at org.apache.hive.service.cli.session.HiveSessionImpl.close(HiveSessionImpl.java:643) at org.apache.hive.service.cli.session.HiveSessionImplwithUGI.close(HiveSessionImplwithUGI.java:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:4
[jira] [Commented] (SPARK-27553) Operation log is not closed when close session
[ https://issues.apache.org/jira/browse/SPARK-27553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825751#comment-16825751 ] Hyukjin Kwon commented on SPARK-27553: -- The cause is probably that a file is not being closed somewhere while some code tries to delete it; that does not work on Windows. > Operation log is not closed when close session > -- > > Key: SPARK-27553 > URL: https://issues.apache.org/jira/browse/SPARK-27553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: pin_zhang >Priority: Major > > On Windows > 1. start spark-shell > 2. start hive server in shell by > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark.sqlContext) > 3. beeline connect to hive server > 3.1 connect > beeline -u jdbc:hive2://localhost:1 > 3.2 Run SQL > show tables; > 3.3 quit beeline > !quit > Get exception log > {code} > 19/04/24 11:38:22 ERROR HiveSessionImpl: Failed to cleanup session log dir: SessionHandle [5827428b-d140-4fc0-8ad4-721c39b3ead0] > java.io.IOException: Unable to delete file: > C:\Users\test\AppData\Local\Temp\test\operation_logs\5827428b-d140-4fc0-8ad4-721c39b3ead0\df9cd631-66e7-4303-9a41-a09bdefcf888 > at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279) > at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) > at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) > at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270) > at > org.apache.hive.service.cli.session.HiveSessionImpl.cleanupSessionLogDir(HiveSessionImpl.java:671) > at > org.apache.hive.service.cli.session.HiveSessionImpl.close(HiveSessionImpl.java:643) > at > org.apache.hive.service.cli.session.HiveSessionImplwithUGI.close(HiveSessionImplwithUGI.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy19.close(Unknown Source) > at > org.apache.hive.service.cli.session.SessionManager.closeSession(SessionManager.java:280) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLSessionManager.closeSession(SparkSQLSessionManager.scala:76) > at org.apache.hive.service.cli.CLIService.closeSession(CLIService.java:237) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.CloseSession(ThriftCLIService.java:397) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1273) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1258) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA 
(v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
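The diagnosis above boils down to a close-before-delete ordering problem: on Windows, a directory cannot be removed while a file inside it is still open. A small, self-contained Python illustration of the safe pattern (the file names here are hypothetical; the real cleanup lives in HiveSessionImpl.cleanupSessionLogDir):

```python
import os
import shutil
import tempfile

# Illustration of the comment above: close every handle in the session log
# directory *before* deleting it. On Windows, skipping the close step makes
# the delete fail with "Unable to delete file"; with it, cleanup succeeds
# on every platform.
log_dir = tempfile.mkdtemp(prefix="operation_logs_")
op_log = open(os.path.join(log_dir, "session-op.log"), "w")
op_log.write("show tables;\n")

op_log.close()          # the missing step: close the operation log first
shutil.rmtree(log_dir)  # now the directory cleanup succeeds
assert not os.path.exists(log_dir)
```

POSIX systems tolerate deleting open files (the inode lingers until the last handle closes), which is why the bug only surfaces on Windows.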
[jira] [Resolved] (SPARK-27554) org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
[ https://issues.apache.org/jira/browse/SPARK-27554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27554. -- Resolution: Invalid There is no evidence that this is an issue in Spark; it is probably a Kerberos problem in your environment setup. Check whether you ran {{kinit}} first. > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN, KERBEROS] > --- > > Key: SPARK-27554 > URL: https://issues.apache.org/jira/browse/SPARK-27554 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.0, 2.4.1 > Environment: cdh5.16.1 > spark2.4.0 > spark2.4.1 > spark2.4.2 >Reporter: Jepson >Priority: Major > Original Estimate: 96h > Remaining Estimate: 96h > > {code} > #basic env > JAVA_HOME=/usr/java/jdk1.8.0_181 > HADOOP_CONF_DIR=/etc/hadoop/conf > export SPARK_HOME=/opt/software/spark/spark-2.4.2 > {code} > {code} > #project env > KERBEROS_USER=h...@hadoop.com > KERBEROS_USER_KEYTAB=/etc/kerberos/hdfs.keytab > PROJECT_MAIN_CLASS=com.jy.dwexercise.OrderDeliveryModel > PROJECT_JAR=/opt/maintain/scripts/bas-1.0-SNAPSHOT.jar > {code} > {code} > #spark resource > DRIVER_MEMORY=4G > EXECUTORS_NUM=20 > EXECUTOR_MEMORY=6G > EXECUTOR_VCORES=4 > QUEQUE=idss > {code} > {code} > #submit job > /opt/software/spark/spark-2.4.2/bin/spark-submit \ > --master yarn \ > --deploy-mode cluster \ > --queue ${QUEQUE} \ > --driver-memory ${DRIVER_MEMORY} \ > --num-executors ${EXECUTORS_NUM} \ > --executor-memory ${EXECUTOR_MEMORY} \ > --executor-cores ${EXECUTOR_VCORES} \ > --principal ${KERBEROS_USER} \ > --keytab ${KERBEROS_USER_KEYTAB} \ > --class ${PROJECT_MAIN_CLASS} \ > --conf > "spark.yarn.archive=hdfs://jybigdata/spark/spark-archive-20190422.zip" \ > --conf "spark.sql.autoBroadcastJoinThreshold=20971520" \ > ${PROJECT_JAR} > {code} > > {code} > *ERROR log:* > 19/04/24 13:35:07 INFO ConfiguredRMFailoverProxyProvider: Failing over to rm78 > 19/04/24 13:35:07 WARN UserGroupInformation: 
PriviledgedActionException > as:hdfs (auth:SIMPLE) > cause:org.apache.hadoop.security.AccessControlException: Client cannot > authenticate via:[TOKEN, KERBEROS] > 19/04/24 13:35:07 WARN Client: Exception encountered while connecting to the > server : org.apache.hadoop.security.AccessControlException: Client cannot > authenticate via:[TOKEN, KERBEROS] > 19/04/24 13:35:07 WARN UserGroupInformation: PriviledgedActionException > as:hdfs (auth:SIMPLE) cause:java.io.IOException: > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN, KERBEROS] > 19/04/24 13:35:07 INFO RetryInvocationHandler: Exception while invoking > getClusterMetrics of class ApplicationClientProtocolPBClientImpl over rm78 > after 10 fail over attempts. Trying to fail over immediately. > java.io.IOException: Failed on local exception: java.io.IOException: > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN, KERBEROS]; Host Details : local host is: > "hadoopnode143/192.168.209.143"; destination host is: > "hadoopmanager136":8032; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1508) > at org.apache.hadoop.ipc.Client.call(Client.java:1441) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231) > at com.sun.proxy.$Proxy25.getClusterMetrics(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:202) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy26.getClusterMetrics(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:483) > at > org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:164) > at > org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:164) > at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) > at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:59) > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:163) > at > org.apache.spark.scheduler.cluster.
[jira] [Updated] (SPARK-27554) org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
[ https://issues.apache.org/jira/browse/SPARK-27554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27554: - Description: {code} #basic env JAVA_HOME=/usr/java/jdk1.8.0_181 HADOOP_CONF_DIR=/etc/hadoop/conf export SPARK_HOME=/opt/software/spark/spark-2.4.2 {code} {code} #project env KERBEROS_USER=h...@hadoop.com KERBEROS_USER_KEYTAB=/etc/kerberos/hdfs.keytab PROJECT_MAIN_CLASS=com.jy.dwexercise.OrderDeliveryModel PROJECT_JAR=/opt/maintain/scripts/bas-1.0-SNAPSHOT.jar {code} {code} #spark resource DRIVER_MEMORY=4G EXECUTORS_NUM=20 EXECUTOR_MEMORY=6G EXECUTOR_VCORES=4 QUEQUE=idss {code} {code} #submit job /opt/software/spark/spark-2.4.2/bin/spark-submit \ --master yarn \ --deploy-mode cluster \ --queue ${QUEQUE} \ --driver-memory ${DRIVER_MEMORY} \ --num-executors ${EXECUTORS_NUM} \ --executor-memory ${EXECUTOR_MEMORY} \ --executor-cores ${EXECUTOR_VCORES} \ --principal ${KERBEROS_USER} \ --keytab ${KERBEROS_USER_KEYTAB} \ --class ${PROJECT_MAIN_CLASS} \ --conf "spark.yarn.archive=hdfs://jybigdata/spark/spark-archive-20190422.zip" \ --conf "spark.sql.autoBroadcastJoinThreshold=20971520" \ ${PROJECT_JAR} {code} {code} *ERROR log:* 19/04/24 13:35:07 INFO ConfiguredRMFailoverProxyProvider: Failing over to rm78 19/04/24 13:35:07 WARN UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 19/04/24 13:35:07 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 19/04/24 13:35:07 WARN UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 19/04/24 13:35:07 INFO RetryInvocationHandler: Exception while invoking getClusterMetrics of class 
ApplicationClientProtocolPBClientImpl over rm78 after 10 fail over attempts. Trying to fail over immediately. java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "hadoopnode143/192.168.209.143"; destination host is: "hadoopmanager136":8032; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1508) at org.apache.hadoop.ipc.Client.call(Client.java:1441) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231) at com.sun.proxy.$Proxy25.getClusterMetrics(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:202) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy26.getClusterMetrics(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:483) at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:164) at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:164) at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:59) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:163) at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:183) at org.apache.spark.SparkContext.(SparkContext.scala:501) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926) at com.jiuye.dwexercise.OrderDeliveryModel$.(OrderDeliveryModel.scala:25) at com.jiuye.dwexercise.OrderDeliveryModel$.(OrderDeliveryModel.scala) at com.jiuye.dwexercise.OrderDeliveryModel$$anonfun$5.apply(OrderDeliveryModel.scala:55) at com.jiuye.dwexercise.OrderDeliveryModel$$anonfun$5.apply(OrderDeliveryModel.scala:55) at org.apache.spark.sql.catalyst.express
[jira] [Commented] (SPARK-27551) Uniformative error message for mismatched types in when().otherwise()
[ https://issues.apache.org/jira/browse/SPARK-27551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825749#comment-16825749 ] Dongjoon Hyun commented on SPARK-27551: --- Hi, [~huonw]. I updated this from `BUG` to `Improvement`. > Uniformative error message for mismatched types in when().otherwise() > - > > Key: SPARK-27551 > URL: https://issues.apache.org/jira/browse/SPARK-27551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Huon Wilson >Priority: Minor > > When a {{when(...).otherwise(...)}} construct has a type error, the error > message can be quite uninformative, since it just splats out a potentially > large chunk of code and says the types don't match. For instance: > {code:none} > scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + > 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as > "y" > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = > CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS > BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * > CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data > type mismatch: THEN and ELSE expressions should all be same type or coercible > to a common type;; > ... > {code} > The problem is the structs have different field names ({{x}} vs {{y}}), but > it's not obvious that this is the case, and this is a relatively simple case > of a single {{select}} expression. > It would be great for the error message to at least include the types that > Spark has computed, to help clarify what might have gone wrong. 
For instance, > {{greatest}} and {{least}} write out the expression with the types instead of > values: > {code:none} > scala> spark.range(100).select(greatest('id, struct(lit("x" > org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`id`, > named_struct('col1', 'x'))' due to data type mismatch: The expressions should > all have the same type, got GREATEST(bigint, struct).;; > {code} > For the example above, this might look like: > {code:none} > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = > CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS > BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * > CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data > type mismatch: THEN and ELSE expressions should all be same type or coercible > to a common type, got CASE WHEN ... THEN array> ELSE > array> END;; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
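The proposed suffix only needs the resolved data types of each branch. As a minimal illustration (a hypothetical Python helper for formatting, not Spark's implementation, which would live in the Scala type-coercion checks), the suffix could be built like this:

```python
def case_when_type_summary(then_types, else_type=None):
    # Hypothetical helper: format the *types* of a CASE expression's
    # branches, mirroring the style of the GREATEST(bigint, struct<...>) hint.
    parts = ["CASE WHEN ..."]
    parts += [f"THEN {t}" for t in then_types]
    if else_type is not None:
        parts.append(f"ELSE {else_type}")
    parts.append("END")
    return " ".join(parts)

# For the mismatched-struct example above:
print(case_when_type_summary(["array<struct<x:bigint>>"],
                             "array<struct<y:bigint>>"))
# CASE WHEN ... THEN array<struct<x:bigint>> ELSE array<struct<y:bigint>> END
```

Rendering the types rather than the full expression makes the x-vs-y field-name mismatch immediately visible.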
[jira] [Updated] (SPARK-27551) Uniformative error message for mismatched types in when().otherwise()
[ https://issues.apache.org/jira/browse/SPARK-27551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27551: -- Priority: Minor (was: Major) > Uniformative error message for mismatched types in when().otherwise() > - > > Key: SPARK-27551 > URL: https://issues.apache.org/jira/browse/SPARK-27551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Huon Wilson >Priority: Minor > > When a {{when(...).otherwise(...)}} construct has a type error, the error > message can be quite uninformative, since it just splats out a potentially > large chunk of code and says the types don't match. For instance: > {code:none} > scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + > 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as > "y" > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = > CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS > BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * > CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data > type mismatch: THEN and ELSE expressions should all be same type or coercible > to a common type;; > ... > {code} > The problem is the structs have different field names ({{x}} vs {{y}}), but > it's not obvious that this is the case, and this is a relatively simple case > of a single {{select}} expression. > It would be great for the error message to at least include the types that > Spark has computed, to help clarify what might have gone wrong. 
For instance, > {{greatest}} and {{least}} write out the expression with the types instead of > values: > {code:none} > scala> spark.range(100).select(greatest('id, struct(lit("x" > org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`id`, > named_struct('col1', 'x'))' due to data type mismatch: The expressions should > all have the same type, got GREATEST(bigint, struct).;; > {code} > For the example above, this might look like: > {code:none} > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = > CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS > BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * > CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data > type mismatch: THEN and ELSE expressions should all be same type or coercible > to a common type, got CASE WHEN ... THEN array> ELSE > array> END;; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27557) Add copybutton to spark Python API docs for easier copying of code-blocks
[ https://issues.apache.org/jira/browse/SPARK-27557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825746#comment-16825746 ] Hyukjin Kwon commented on SPARK-27557: -- Please go ahead with a PR. > Add copybutton to spark Python API docs for easier copying of code-blocks > - > > Key: SPARK-27557 > URL: https://issues.apache.org/jira/browse/SPARK-27557 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.4.2 >Reporter: Sangram G >Priority: Minor > Labels: Documentation, PySpark > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Add a non-intrusive button for python API documentation, which will remove > ">>>" prompts and outputs of code - for easier copying of code. > For example: The below code-snippet in the document is difficult to copy due > to ">>>" prompts > {code} > >>> l = [('Alice', 1)] > >>> spark.createDataFrame(l).collect() > [Row(_1='Alice', _2=1)] > {code} > Becomes this - after the copybutton in the corner of the code block is pressed > - which is easier to copy > {code} > l = [('Alice', 1)] > spark.createDataFrame(l).collect() > {code} > Sample Screenshot for reference: > [https://screenshots.firefox.com/ZfBXZrpINKdt9QZg/www269.lunapic.com] > This can be done simply by adding a copybutton.js script in the > python/docs/_static folder and calling it at setup time from > python/docs/conf.py.
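For reference, the wiring the reporter describes is small. A sketch of the python/docs/conf.py hook (assuming copybutton.js has been dropped into python/docs/_static; on Sphinx 1.8+ the call is app.add_js_file, while older Sphinx versions use app.add_javascript):

```python
# python/docs/conf.py (sketch)
html_static_path = ['_static']  # copybutton.js lives in python/docs/_static/

def setup(app):
    # Sphinx invokes setup() at doc-build time and injects the script
    # into every generated HTML page.
    try:
        app.add_js_file('copybutton.js')     # Sphinx >= 1.8
    except AttributeError:
        app.add_javascript('copybutton.js')  # older Sphinx
```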
[jira] [Updated] (SPARK-27551) Uniformative error message for mismatched types in when().otherwise()
[ https://issues.apache.org/jira/browse/SPARK-27551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27551: -- Issue Type: Improvement (was: Bug) > Uniformative error message for mismatched types in when().otherwise() > - > > Key: SPARK-27551 > URL: https://issues.apache.org/jira/browse/SPARK-27551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Huon Wilson >Priority: Major > > When a {{when(...).otherwise(...)}} construct has a type error, the error > message can be quite uninformative, since it just splats out a potentially > large chunk of code and says the types don't match. For instance: > {code:none} > scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + > 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as > "y" > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = > CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS > BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * > CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data > type mismatch: THEN and ELSE expressions should all be same type or coercible > to a common type;; > ... > {code} > The problem is the structs have different field names ({{x}} vs {{y}}), but > it's not obvious that this is the case, and this is a relatively simple case > of a single {{select}} expression. > It would be great for the error message to at least include the types that > Spark has computed, to help clarify what might have gone wrong. 
For instance, > {{greatest}} and {{least}} write out the expression with the types instead of > values: > {code:none} > scala> spark.range(100).select(greatest('id, struct(lit("x" > org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`id`, > named_struct('col1', 'x'))' due to data type mismatch: The expressions should > all have the same type, got GREATEST(bigint, struct).;; > {code} > For the example above, this might look like: > {code:none} > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = > CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS > BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * > CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data > type mismatch: THEN and ELSE expressions should all be same type or coercible > to a common type, got CASE WHEN ... THEN array> ELSE > array> END;; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825747#comment-16825747 ] Hyukjin Kwon commented on SPARK-27555: -- Can you post a self-contained reproducer, please? > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Priority: Major > > I have already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397, > and I checked the Spark source code for that change: with > "spark.sql.hive.convertCTAS=true" set, Spark will use > "spark.sql.sources.default", which is parquet, as the storage format in the "create > table as select" scenario. > But my case is just a create table without a select. When I set > hive.default.fileformat=parquet in hive-site.xml, or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after > creating a table, the Hive table still uses the textfile format. > > It seems HiveSerDe gets the value of the hive.default.fileformat parameter > from SQLConf. > The parameter values in SQLConf are copied from SparkContext's SparkConf at > SparkSession initialization, while the configuration parameters in > hive-site.xml are loaded into SparkContext's hadoopConfiguration > by SharedState, and all configs carrying the "spark.hadoop" prefix are set into the > Hadoop configuration, so the setting does not take effect.
[jira] [Updated] (SPARK-27557) Add copybutton to spark Python API docs for easier copying of code-blocks
[ https://issues.apache.org/jira/browse/SPARK-27557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27557: - Description: Add a non-intrusive button for python API documentation, which will remove ">>>" prompts and outputs of code - for easier copying of code. For example: The below code-snippet in the document is difficult to copy due to ">>>" prompts {code} >>> l = [('Alice', 1)] >>> spark.createDataFrame(l).collect() [Row(_1='Alice', _2=1)] {code} Becomes this - After the copybutton in the corner of of code-block is pressed - which is easier to copy {code} l = [('Alice', 1)] spark.createDataFrame(l).collect() {code} Sample Screenshot for reference: [https://screenshots.firefox.com/ZfBXZrpINKdt9QZg/www269.lunapic.com] This can be easily done only by adding a copybutton.js script in python/docs/_static folder and calling it at setup time from python/docs/conf.py. was: Add a non-intrusive button for python API documentation, which will remove ">>>" prompts and outputs of code - for easier copying of code. For example: The below code-snippet in the document is difficult to copy due to ">>>" prompts >>> l = [('Alice', 1)] >>> spark.createDataFrame(l).collect() [Row(_1='Alice', _2=1)] Becomes this - After the copybutton in the corner of of code-block is pressed - which is easier to copy l = [('Alice', 1)] spark.createDataFrame(l).collect() Sample Screenshot for reference: [https://screenshots.firefox.com/ZfBXZrpINKdt9QZg/www269.lunapic.com] This can be easily done only by adding a copybutton.js script in python/docs/_static folder and calling it at setup time from python/docs/conf.py. 
> Add copybutton to spark Python API docs for easier copying of code-blocks > - > > Key: SPARK-27557 > URL: https://issues.apache.org/jira/browse/SPARK-27557 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.4.2 >Reporter: Sangram G >Priority: Minor > Labels: Documentation, PySpark > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Add a non-intrusive button for python API documentation, which will remove > ">>>" prompts and outputs of code - for easier copying of code. > For example: The below code-snippet in the document is difficult to copy due > to ">>>" prompts > {code} > >>> l = [('Alice', 1)] > >>> spark.createDataFrame(l).collect() > [Row(_1='Alice', _2=1)] > {code} > Becomes this - after the copybutton in the corner of the code block is pressed > - which is easier to copy > {code} > l = [('Alice', 1)] > spark.createDataFrame(l).collect() > {code} > Sample Screenshot for reference: > [https://screenshots.firefox.com/ZfBXZrpINKdt9QZg/www269.lunapic.com] > This can be done simply by adding a copybutton.js script in the > python/docs/_static folder and calling it at setup time from > python/docs/conf.py.
[jira] [Commented] (SPARK-27558) NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang
[ https://issues.apache.org/jira/browse/SPARK-27558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825743#comment-16825743 ] Hyukjin Kwon commented on SPARK-27558: -- Please avoid setting Critical+, which is usually reserved for committers. > NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter > causing tasks to hang > > > Key: SPARK-27558 > URL: https://issues.apache.org/jira/browse/SPARK-27558 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.2 >Reporter: Alessandro Bellina >Priority: Major > > We see an NPE in the UnsafeInMemorySorter.getMemoryUsage function (due to the > array we are accessing there being null). This looks to be caused by a Spark > OOM when UnsafeInMemorySorter is trying to spill. > This is likely a symptom of > https://issues.apache.org/jira/browse/SPARK-21492. The real question for this > ticket is, could we handle things more gracefully, rather than NPE. For > example: > Remove this: > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L182 > so when this fails (and store the new array into a temporary): > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L186 > we don't end up with a null "array". This state is causing one of our jobs to > hang infinitely (we think) due to the original allocation error.
> Stack trace for reference > {noformat} > 2019-04-23 08:57:14,989 [Executor task launch worker for task 46729] ERROR > org.apache.spark.TaskContextImpl - Error in TaskCompletionListener > java.lang.NullPointerException > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getMemoryUsage(UnsafeInMemorySorter.java:208) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getMemoryUsage(UnsafeExternalSorter.java:249) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.updatePeakMemoryUsed(UnsafeExternalSorter.java:253) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.freeMemory(UnsafeExternalSorter.java:296) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:328) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.lambda$new$0(UnsafeExternalSorter.java:178) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:131) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:129) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:129) > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117) > at org.apache.spark.scheduler.Task.run(Task.scala:119) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at 
java.lang.Thread.run(Thread.java:748) > 2019-04-23 08:57:15,069 [Executor task launch worker for task 46729] ERROR > org.apache.spark.executor.Executor - Exception in task 102.0 in stage 28.0 > (TID 46729) > org.apache.spark.util.TaskCompletionListenerException: null > Previous exception in task: Unable to acquire 65536 bytes of memory, got 0 > org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157) > > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98) > > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:186) > > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:229) > > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:204) > > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:283) > > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96) > > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(Uns
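The fix sketched in the description is the usual allocate-into-a-temporary pattern: acquire the replacement array first and only let go of the old one once the allocation has succeeded, so a failed allocation can never leave the field null. A toy Python model of that ordering (illustration only; the real code is Java in UnsafeInMemorySorter.reset):

```python
class InMemorySorter:
    """Toy model of the pointer-array life cycle (not Spark's code)."""

    def __init__(self, allocate, size=16):
        self._allocate = allocate   # may raise MemoryError, like allocateArray
        self.array = allocate(size)

    def reset(self):
        # Allocate the replacement *before* dropping the old array: if this
        # raises, self.array is still the old, valid array instead of None,
        # so a later memory_usage() call cannot blow up.
        new_array = self._allocate(len(self.array))
        self.array = new_array

    def memory_usage(self):
        return len(self.array)      # would fail if self.array were None


def failing_alloc(n):
    # Simulates the Spark OOM seen in the stack trace above.
    raise MemoryError("Unable to acquire 65536 bytes of memory, got 0")

sorter = InMemorySorter(lambda n: [0] * n)
sorter._allocate = failing_alloc
try:
    sorter.reset()
except MemoryError:
    pass
print(sorter.memory_usage())  # prints 16: the old array survived the failed reset
```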
[jira] [Updated] (SPARK-27558) NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang
[ https://issues.apache.org/jira/browse/SPARK-27558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27558: - Priority: Major (was: Critical) > NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter > causing tasks to hang > > > Key: SPARK-27558 > URL: https://issues.apache.org/jira/browse/SPARK-27558 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.2 >Reporter: Alessandro Bellina >Priority: Major > > We see an NPE in the UnsafeInMemorySorter.getMemoryUsage function (due to the > array we are accessing there being null). This looks to be caused by a Spark > OOM when UnsafeInMemorySorter is trying to spill. > This is likely a symptom of > https://issues.apache.org/jira/browse/SPARK-21492. The real question for this > ticket is, could we handle things more gracefully, rather than NPE. For > example: > Remove this: > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L182 > so when this fails (and store the new array into a temporary): > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L186 > we don't end up with a null "array". This state is causing one of our jobs to > hang infinitely (we think) due to the original allocation error. 
> Stack trace for reference > {noformat} > 2019-04-23 08:57:14,989 [Executor task launch worker for task 46729] ERROR > org.apache.spark.TaskContextImpl - Error in TaskCompletionListener > java.lang.NullPointerException > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getMemoryUsage(UnsafeInMemorySorter.java:208) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getMemoryUsage(UnsafeExternalSorter.java:249) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.updatePeakMemoryUsed(UnsafeExternalSorter.java:253) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.freeMemory(UnsafeExternalSorter.java:296) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:328) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.lambda$new$0(UnsafeExternalSorter.java:178) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:131) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:129) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:129) > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117) > at org.apache.spark.scheduler.Task.run(Task.scala:119) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at 
java.lang.Thread.run(Thread.java:748) > 2019-04-23 08:57:15,069 [Executor task launch worker for task 46729] ERROR > org.apache.spark.executor.Executor - Exception in task 102.0 in stage 28.0 > (TID 46729) > org.apache.spark.util.TaskCompletionListenerException: null > Previous exception in task: Unable to acquire 65536 bytes of memory, got 0 > org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157) > > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98) > > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:186) > > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:229) > > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:204) > > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:283) > > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96) > > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:348) > > org.apache.spark.util.collection.unsafe.sort.UnsafeExt
[jira] [Resolved] (SPARK-27559) Nullable in a given schema is not respected when reading from parquet
[ https://issues.apache.org/jira/browse/SPARK-27559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27559. -- Resolution: Duplicate > Nullable in a given schema is not respected when reading from parquet > - > > Key: SPARK-27559 > URL: https://issues.apache.org/jira/browse/SPARK-27559 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.2 >Reporter: colin fang >Priority: Minor > > Even if I specify a schema when reading from parquet, nullable is not reset. > {code:java} > spark.range(10, numPartitions=1).write.mode('overwrite').parquet('tmp') > df1 = spark.read.parquet('tmp') > df1.printSchema() > # root > # |-- id: long (nullable = true) > df2 = spark.read.schema(StructType([StructField('id', LongType(), > False)])).parquet('tmp') > df2.printSchema() > # root > # |-- x: long (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27559) Nullable in a given schema is not respected when reading from parquet
[ https://issues.apache.org/jira/browse/SPARK-27559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825742#comment-16825742 ] Hyukjin Kwon commented on SPARK-27559: -- It's a dup of SPARK-16472 > Nullable in a given schema is not respected when reading from parquet > - > > Key: SPARK-27559 > URL: https://issues.apache.org/jira/browse/SPARK-27559 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.2 >Reporter: colin fang >Priority: Minor > > Even if I specify a schema when reading from parquet, nullable is not reset. > {code:java} > spark.range(10, numPartitions=1).write.mode('overwrite').parquet('tmp') > df1 = spark.read.parquet('tmp') > df1.printSchema() > # root > # |-- id: long (nullable = true) > df2 = spark.read.schema(StructType([StructField('id', LongType(), > False)])).parquet('tmp') > df2.printSchema() > # root > # |-- x: long (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27559) Nullable in a given schema is not respected when reading from parquet
[ https://issues.apache.org/jira/browse/SPARK-27559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27559: - Component/s: (was: Spark Core) SQL > Nullable in a given schema is not respected when reading from parquet > - > > Key: SPARK-27559 > URL: https://issues.apache.org/jira/browse/SPARK-27559 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.2 >Reporter: colin fang >Priority: Minor > > Even if I specify a schema when reading from parquet, nullable is not reset. > {code:java} > spark.range(10, numPartitions=1).write.mode('overwrite').parquet('tmp') > df1 = spark.read.parquet('tmp') > df1.printSchema() > # root > # |-- id: long (nullable = true) > df2 = spark.read.schema(StructType([StructField('id', LongType(), > False)])).parquet('tmp') > df2.printSchema() > # root > # |-- x: long (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27505) autoBroadcastJoinThreshold including bigger table
[ https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825739#comment-16825739 ] Hyukjin Kwon commented on SPARK-27505: -- I meant copy-and-pasteable code to reproduce this error > autoBroadcastJoinThreshold including bigger table > - > > Key: SPARK-27505 > URL: https://issues.apache.org/jira/browse/SPARK-27505 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Hive table with Spark 2.3.1 on Azure, using Azure > storage as storage layer >Reporter: Mike Chan >Priority: Major > Attachments: explain_plan.txt > > > I'm seeing a case where, when a certain table is exposed to a broadcast join, the > query eventually fails with a remote block error. > > First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely > 10485760. > [inline screenshot omitted] > > Then we ran the query. In the SQL plan, we found that one table > that is 25MB in size is broadcast as well. > > [inline screenshot omitted] > > Also, desc extended shows the table is 24452111 bytes. It is a Hive table. We > always ran into the error when this table was broadcast.
Below is the sample error: > > Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt > remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > > Also attached the physical plan if you're interested. One thing to note: if I turn autoBroadcastJoinThreshold down to 5MB, this query > executes successfully and default.product is NOT broadcast. > However, when I switch to another query that selects > even fewer columns than the previous one, at 5MB this table still gets > broadcast and fails with the same error. I even changed it to 1MB, and still > the same.
[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825721#comment-16825721 ] Mike Chan commented on SPARK-13510: --- Hi [~belvey] I'm having a similar issue and our "spark.maxRemoteBlockSizeFetchToMem" is at 188. From the forums I can tell this parameter should be set below 2GB. How do you set your parameter? Should it be "2g" or 2 * 1024 * 1024 * 1024 = 2147483648? I'm at Spark 2.3 on Azure > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen >Priority: Major > Attachments: spark-13510.diff > > > In our cluster, when I tested spark-1.6.0 with a SQL query, it threw an exception and > failed. > {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffling a big block (like 1G), the task will allocate > the same amount of memory, so it will easily throw "FetchFailedException: Direct buffer > memory". > If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In MapReduce shuffle, it first judges whether the block can be cached in > memory, but Spark doesn't. 
> If the block is larger than what we can cache in memory, we should write it to disk. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
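Regarding Mike Chan's question above about "2g" versus the raw byte count: both spellings denote the same quantity, since Spark accepts JVM-style size suffixes for configs like spark.maxRemoteBlockSizeFetchToMem, and (as the thread notes) the value must stay below 2 GB. The helper below is my own sketch of the conversion, not Spark's actual parser:

```python
def to_bytes(size):
    """Convert a JVM-style size string ('2g', '512m', '188k') or raw bytes to bytes."""
    units = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    s = str(size).strip().lower()
    if s and s[-1] in units:
        return int(s[:-1]) * units[s[-1]]
    return int(s)

# "2g" and 2 * 1024 * 1024 * 1024 name the same number of bytes...
print(to_bytes("2g"))  # 2147483648
# ...which is one byte over a signed 32-bit Int.MaxValue (2147483647),
# hence the advice to keep this config strictly below 2 GB.
print(to_bytes("2g") - 1 == 2 ** 31 - 1)  # True
```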
[jira] [Created] (SPARK-27563) automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite
Wenchen Fan created SPARK-27563: --- Summary: automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite Key: SPARK-27563 URL: https://issues.apache.org/jira/browse/SPARK-27563 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20462) Spark-Kinesis Direct Connector
[ https://issues.apache.org/jira/browse/SPARK-20462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825669#comment-16825669 ] Long Sun commented on SPARK-20462: -- [~gaurav24], [~arushkharbanda], Any update on this? "The user does not have to implement their own checkpointing logic and ensuring checkpointing only occurs after the terminal action of a user’s Spark job ensures that no data is lost" The idea above from the proposal is excellent. But I wonder why *Spark + Kinesis* doesn't provide an '*offset commit API*' like *Spark + Kafka 0.10.0 or higher* does. > Spark-Kinesis Direct Connector > --- > > Key: SPARK-20462 > URL: https://issues.apache.org/jira/browse/SPARK-20462 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Lauren Moos >Priority: Major > > I'd like to propose and vet the design for a direct connector between > Spark and Kinesis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data
[ https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-27562: Description: We've seen some shuffle data corruption during the shuffle read phase. As described in SPARK-26089, Spark only checks small shuffle blocks before PR #23453, which was proposed by ankuriitg. There are two changes/improvements made in PR #23453. 1. Large blocks are checked up to maxBytesInFlight/3 size in a similar way as smaller blocks, so if a large block is corrupt at the start, that block will be re-fetched, and if that also fails, a FetchFailureException will be thrown. 2. If large blocks are corrupt after size maxBytesInFlight/3, then any IOException thrown while reading the stream will be converted to a FetchFailureException. This is slightly more aggressive than originally intended, but since the consumer of the stream may have already read some records and processed them, we can't just re-fetch the block; we need to fail the whole task. Additionally, we also thought about adding a new type of TaskEndReason, which would retry the task a couple of times before failing the previous stage, but given the complexity involved in that solution we decided not to proceed in that direction. However, I think there still exist some problems with the current verification mechanism for shuffle transmitted data: - For a large block, it is checked up to maxBytesInFlight/3 size when fetching shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, it cannot be detected in the data fetch phase. This has been described in the previous section. - Only compressed or wrapped blocks are checked; I think we should also check those blocks which are not wrapped. We complete the verification mechanism for shuffle transmitted data as follows: First, we choose CRC32 for the checksum verification of shuffle data. CRC is also used for checksum verification in Hadoop; it is simple and fast. 
In the shuffle write phase, after completing the partitioned file, we compute the CRC32 value for each partition and then write these digests along with the indexes into the shuffle index file. For SortShuffleWriter and UnsafeShuffleWriter, there is only one partitioned file per shuffle map task, so the computation of the digests (computing the digest for each partition based on the indexes of this partitioned file) is cheap. For BypassMergeSortShuffleWriter, the number of reduce partitions is less than bypassMergeThreshold, so the cost of digest computation is acceptable. In the shuffle read phase, the digest value is passed along with the block data, and we recompute the digest of the fetched data to compare it with the original digest value. Recomputing the digest only needs an additional buffer (2048 bytes) for computing the CRC32 value. After recomputing, we reset the fetched data input stream: if it is markSupported we only need to reset it; otherwise it is a FileSegmentManagedBuffer and we need to recreate it. So this verification mechanism proposed for shuffle transmitted data is cheap and complete. was: We've seen some shuffle data corruption during the shuffle read phase. As described in SPARK-26089, Spark only checks small shuffle blocks before PR #23453, which was proposed by ankuriitg. There are two changes/improvements made in PR #23453. 1. Large blocks are checked up to maxBytesInFlight/3 size in a similar way as smaller blocks, so if a large block is corrupt at the start, that block will be re-fetched, and if that also fails, a FetchFailureException will be thrown. 2. If large blocks are corrupt after size maxBytesInFlight/3, then any IOException thrown while reading the stream will be converted to a FetchFailureException. This is slightly more aggressive than originally intended, but since the consumer of the stream may have already read some records and processed them, we can't just re-fetch the block; we need to fail the whole task. 
Additionally, we also thought about adding a new type of TaskEndReason, which would retry the task a couple of times before failing the previous stage, but given the complexity involved in that solution we decided not to proceed in that direction. However, I think there still exist some problems with the current verification mechanism for shuffle transmitted data: - For a large block, it is checked up to maxBytesInFlight/3 size when fetching shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, it cannot be detected in the data fetch phase. This has been described in the previous section. - Only compressed or wrapped blocks are checked; I think we should also check those blocks which are not wrapped. We complete the verification mechanism for shuffle transmitted data as follows: First, we choose CRC32 for the checksum verification of shuffle data. CRC is also used for checksum verification in Hadoop; it is
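The write-then-verify scheme proposed above can be sketched in a few lines. This is an illustrative model only, with invented helper names — in Spark the logic would live in the shuffle writers and the fetch path — assuming one CRC32 digest is stored per partition at write time and recomputed with the small fixed buffer (2048 bytes) mentioned in the proposal at read time:

```python
import zlib

CHUNK = 2048  # read-side buffer size mentioned in the proposal

def write_phase_digests(partitions):
    """Shuffle write: one CRC32 digest per partition of the partitioned file."""
    return [zlib.crc32(p) & 0xFFFFFFFF for p in partitions]

def read_phase_verify(block, expected):
    """Shuffle read: recompute the digest in CHUNK-sized pieces and compare."""
    crc = 0
    for i in range(0, len(block), CHUNK):
        crc = zlib.crc32(block[i:i + CHUNK], crc)  # running checksum
    return (crc & 0xFFFFFFFF) == expected

partitions = [b"records for partition 0", b"records for partition 1"]
digests = write_phase_digests(partitions)
print(read_phase_verify(partitions[1], digests[1]))             # True: intact block
print(read_phase_verify(b"X" + partitions[1][1:], digests[1]))  # False: corrupt block
```

Feeding the running `crc` value back into `zlib.crc32` is what keeps the read-side memory cost at one small buffer regardless of block size.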
[jira] [Updated] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data
[ https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-27562: Description: SPARK-27562 We've seen some shuffle data corruption during the shuffle read phase. As described in SPARK-26089, Spark only checks small shuffle blocks before PR #23453, which was proposed by ankuriitg. There are two changes/improvements made in PR #23453. 1. Large blocks are checked up to maxBytesInFlight/3 size in a similar way as smaller blocks, so if a large block is corrupt at the start, that block will be re-fetched, and if that also fails, a FetchFailureException will be thrown. 2. If large blocks are corrupt after size maxBytesInFlight/3, then any IOException thrown while reading the stream will be converted to a FetchFailureException. This is slightly more aggressive than originally intended, but since the consumer of the stream may have already read some records and processed them, we can't just re-fetch the block; we need to fail the whole task. Additionally, we also thought about adding a new type of TaskEndReason, which would retry the task a couple of times before failing the previous stage, but given the complexity involved in that solution we decided not to proceed in that direction. However, I think there still exist some problems with the current verification mechanism for shuffle transmitted data: - For a large block, it is checked up to maxBytesInFlight/3 size when fetching shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, it cannot be detected in the data fetch phase. This has been described in the previous section. - Only compressed or wrapped blocks are checked; I think we should also check those blocks which are not wrapped. We complete the verification mechanism for shuffle transmitted data as follows: First, we choose CRC32 for the checksum verification of shuffle data. CRC is also used for checksum verification in Hadoop; it is simple and fast. 
In the shuffle write phase, after completing the partitioned file, we compute the CRC32 value for each partition and then write these digests along with the indexes into the shuffle index file. For SortShuffleWriter and UnsafeShuffleWriter, there is only one partitioned file per shuffle map task, so the computation of the digests (computing the digest for each partition based on the indexes of this partitioned file) is cheap. For BypassMergeSortShuffleWriter, the number of reduce partitions is less than bypassMergeThreshold, so the cost of digest computation is acceptable. In the shuffle read phase, the digest value is passed along with the block data, and we recompute the digest of the fetched data to compare it with the original digest value. Recomputing the digest only needs an additional buffer (2048 bytes) for computing the CRC32 value. After recomputing, we reset the fetched data input stream: if it is markSupported we only need to reset it; otherwise it is a FileSegmentManagedBuffer and we need to recreate it. So this verification mechanism proposed for shuffle transmitted data is cheap and complete. was: I will update this later. > Complete the verification mechanism for shuffle transmitted data > > > Key: SPARK-27562 > URL: https://issues.apache.org/jira/browse/SPARK-27562 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: feiwang >Priority: Major > > SPARK-27562 > We've seen some shuffle data corruption during the shuffle read phase. > As described in SPARK-26089, Spark only checks small shuffle blocks before > PR #23453, which was proposed by ankuriitg. > There are two changes/improvements made in PR #23453. > 1. Large blocks are checked up to maxBytesInFlight/3 size in a similar way as > smaller blocks, so if a > large block is corrupt at the start, that block will be re-fetched, and if > that also fails, > a FetchFailureException will be thrown. > 2. 
If large blocks are corrupt after size maxBytesInFlight/3, then any > IOException thrown while > reading the stream will be converted to a FetchFailureException. This is > slightly more aggressive > than originally intended, but since the consumer of the stream may have > already read some records and processed them, we can't just re-fetch the > block; we need to fail the whole task. Additionally, we also thought about > adding a new type of TaskEndReason, which would retry the task a couple > of times before failing the previous stage, but given the complexity involved > in that solution we decided not to proceed in that direction. > However, I think there still exist some problems with the current shuffle > transmitted data verification mechanism: > - For a large block, it is checked up to maxBytesInF
[jira] [Updated] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data
[ https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-27562: Description: We've seen some shuffle data corruption during the shuffle read phase. As described in SPARK-26089, Spark only checks small shuffle blocks before PR #23453, which was proposed by ankuriitg. There are two changes/improvements made in PR #23453. 1. Large blocks are checked up to maxBytesInFlight/3 size in a similar way as smaller blocks, so if a large block is corrupt at the start, that block will be re-fetched, and if that also fails, a FetchFailureException will be thrown. 2. If large blocks are corrupt after size maxBytesInFlight/3, then any IOException thrown while reading the stream will be converted to a FetchFailureException. This is slightly more aggressive than originally intended, but since the consumer of the stream may have already read some records and processed them, we can't just re-fetch the block; we need to fail the whole task. Additionally, we also thought about adding a new type of TaskEndReason, which would retry the task a couple of times before failing the previous stage, but given the complexity involved in that solution we decided not to proceed in that direction. However, I think there still exist some problems with the current verification mechanism for shuffle transmitted data: - For a large block, it is checked up to maxBytesInFlight/3 size when fetching shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, it cannot be detected in the data fetch phase. This has been described in the previous section. - Only compressed or wrapped blocks are checked; I think we should also check those blocks which are not wrapped. We complete the verification mechanism for shuffle transmitted data as follows: First, we choose CRC32 for the checksum verification of shuffle data. CRC is also used for checksum verification in Hadoop; it is simple and fast. 
In the shuffle write phase, after completing the partitioned file, we compute the CRC32 value for each partition and then write these digests along with the indexes into the shuffle index file. For SortShuffleWriter and UnsafeShuffleWriter, there is only one partitioned file per shuffle map task, so the computation of the digests (computing the digest for each partition based on the indexes of this partitioned file) is cheap. For BypassMergeSortShuffleWriter, the number of reduce partitions is less than bypassMergeThreshold, so the cost of digest computation is acceptable. In the shuffle read phase, the digest value is passed along with the block data, and we recompute the digest of the fetched data to compare it with the original digest value. Recomputing the digest only needs an additional buffer (2048 bytes) for computing the CRC32 value. After recomputing, we reset the fetched data input stream: if it is markSupported we only need to reset it; otherwise it is a FileSegmentManagedBuffer and we need to recreate it. So this verification mechanism proposed for shuffle transmitted data is cheap and complete. was: SPARK-27562 We've seen some shuffle data corruption during the shuffle read phase. As described in SPARK-26089, Spark only checks small shuffle blocks before PR #23453, which was proposed by ankuriitg. There are two changes/improvements made in PR #23453. 1. Large blocks are checked up to maxBytesInFlight/3 size in a similar way as smaller blocks, so if a large block is corrupt at the start, that block will be re-fetched, and if that also fails, a FetchFailureException will be thrown. 2. If large blocks are corrupt after size maxBytesInFlight/3, then any IOException thrown while reading the stream will be converted to a FetchFailureException. This is slightly more aggressive than originally intended, but since the consumer of the stream may have already read some records and processed them, we can't just re-fetch the block; we need to fail the whole task. 
Additionally, we also thought about adding a new type of TaskEndReason, which would retry the task a couple of times before failing the previous stage, but given the complexity involved in that solution we decided not to proceed in that direction. However, I think there still exist some problems with the current verification mechanism for shuffle transmitted data: - For a large block, it is checked up to maxBytesInFlight/3 size when fetching shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, it cannot be detected in the data fetch phase. This has been described in the previous section. - Only compressed or wrapped blocks are checked; I think we should also check those blocks which are not wrapped. We complete the verification mechanism for shuffle transmitted data as follows: First, we choose CRC32 for the checksum verification of shuffle data. CRC is also used for checksum verification in
[jira] [Updated] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data
[ https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-27562: Description: I will update this later. > Complete the verification mechanism for shuffle transmitted data > > > Key: SPARK-27562 > URL: https://issues.apache.org/jira/browse/SPARK-27562 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: feiwang >Priority: Major > > I will update this later. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24192) Invalid Spark URL in local spark session since upgrading from org.apache.spark:spark-sql_2.11:2.2.1 to org.apache.spark:spark-sql_2.11:2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825616#comment-16825616 ] Aseem Patni commented on SPARK-24192: - export SPARK_LOCAL_HOSTNAME=localhost should also fix this. > Invalid Spark URL in local spark session since upgrading from > org.apache.spark:spark-sql_2.11:2.2.1 to org.apache.spark:spark-sql_2.11:2.3.0 > > > Key: SPARK-24192 > URL: https://issues.apache.org/jira/browse/SPARK-24192 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Tal Barda >Priority: Major > Labels: HeartbeatReceiver, config, session, spark, spark-conf, > spark-session > Original Estimate: 24h > Remaining Estimate: 24h > > Since updating to Spark 2.3.0, tests run in my CI (Codeship) fail > due to an allegedly invalid Spark URL when creating the (local) Spark context. > Here's a log from my _*mvn clean install*_ command execution: > {quote}{{2018-05-03 13:18:47.668 ERROR 5533 --- [ main] > org.apache.spark.SparkContext : Error initializing SparkContext. 
> org.apache.spark.SparkException: Invalid Spark URL: > spark://HeartbeatReceiver@railsonfire_61eb1c99-232b-49d0-abb5-a1eb9693516b_52bcc09bb48b:44284 > at > org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66) > ~[spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134) > ~[spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101) > ~[spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109) > ~[spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32) > ~[spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.executor.Executor.<init>(Executor.scala:155) > ~[spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59) > ~[spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126) > ~[spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164) > ~[spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.SparkContext.<init>(SparkContext.scala:500) > ~[spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486) > [spark-core_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930) > [spark-sql_2.11-2.3.0.jar:2.3.0] at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921) > [spark-sql_2.11-2.3.0.jar:2.3.0] at scala.Option.getOrElse(Option.scala:121) > [scala-library-2.11.8.jar:na] at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) > [spark-sql_2.11-2.3.0.jar:2.3.0] at > com.planck.spark_data_features_extractors.utils.SparkConfig.sparkSession(SparkConfig.java:43) > [classes/:na] at > 
com.planck.spark_data_features_extractors.utils.SparkConfig$$EnhancerBySpringCGLIB$$66dd1f72.CGLIB$sparkSession$0() > [classes/:na] at > com.planck.spark_data_features_extractors.utils.SparkConfig$$EnhancerBySpringCGLIB$$66dd1f72$$FastClassBySpringCGLIB$$a213b647.invoke() > [classes/:na] at > org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228) > [spring-core-5.0.2.RELEASE.jar:5.0.2.RELEASE] at > org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:361) > [spring-context-5.0.2.RELEASE.jar:5.0.2.RELEASE] at > com.planck.spark_data_features_extractors.utils.SparkConfig$$EnhancerBySpringCGLIB$$66dd1f72.sparkSession() > [classes/:na] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ~[na:1.8.0_171] at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[na:1.8.0_171] at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[na:1.8.0_171] at java.lang.reflect.Method.invoke(Method.java:498) > ~[na:1.8.0_171] at > org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:154) > [spring-beans-5.0.2.RELEASE.jar:5.0.2.RELEASE] at > org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:579) > [spring-beans-5.0.2.RELEASE.jar:5.0.2.RELEASE] at > org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateUsingFactoryMethod(AbstractAuto
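Note the rejected URL in the trace above contains a CI container hostname with underscores (spark://HeartbeatReceiver@railsonfire_…:44284). Underscores are not legal in RFC 1123 hostnames, which is a common reason Spark's URL parsing rejects the address and why overriding the name with SPARK_LOCAL_HOSTNAME=localhost helps. A quick check using a hand-rolled pattern (not Spark's own validation code):

```python
import re

# RFC 1123 hostname label: letters, digits, and interior hyphens; no underscores.
LABEL = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$")

def valid_hostname(host):
    """Return True if every dot-separated label of the hostname is RFC 1123 valid."""
    return all(LABEL.match(label) for label in host.split("."))

print(valid_hostname("localhost"))                      # True
print(valid_hostname("railsonfire_61eb1c99_52bcc09b"))  # False: underscores
```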
[jira] [Created] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data
feiwang created SPARK-27562: --- Summary: Complete the verification mechanism for shuffle transmitted data Key: SPARK-27562 URL: https://issues.apache.org/jira/browse/SPARK-27562 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 2.4.0 Reporter: feiwang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27312) PropertyGraph <-> GraphX conversions
[ https://issues.apache.org/jira/browse/SPARK-27312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27312: - Assignee: Weichen Xu > PropertyGraph <-> GraphX conversions > > > Key: SPARK-27312 > URL: https://issues.apache.org/jira/browse/SPARK-27312 > Project: Spark > Issue Type: Story > Components: Graph, GraphX >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > As a user, I can convert a GraphX graph into a PropertyGraph and a > PropertyGraph into a GraphX graph if they are compatible. > * Scala only > * Whether this is an internal API is pending design discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27300) Create the new graph projects in Spark and set up build/test
[ https://issues.apache.org/jira/browse/SPARK-27300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27300: - Assignee: Martin Junghanns (was: Xiangrui Meng) > Create the new graph projects in Spark and set up build/test > > > Key: SPARK-27300 > URL: https://issues.apache.org/jira/browse/SPARK-27300 > Project: Spark > Issue Type: Story > Components: Graph >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Martin Junghanns >Priority: Major > > * Create graph projects in the Spark repo that work with both Maven and sbt. > * Add internal and external dependencies. > * Add a dummy test that automatically runs with the PR build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27300) Create the new graph projects in Spark and set up build/test
[ https://issues.apache.org/jira/browse/SPARK-27300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27300: - Assignee: Xiangrui Meng > Create the new graph projects in Spark and set up build/test > > > Key: SPARK-27300 > URL: https://issues.apache.org/jira/browse/SPARK-27300 > Project: Spark > Issue Type: Story > Components: Graph >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > * Create graph projects in the Spark repo that work with both Maven and sbt. > * Add internal and external dependencies. > * Add a dummy test that automatically runs with the PR build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27561) Support "lateral column alias references" to allow column aliases to be used within SELECT clauses
[ https://issues.apache.org/jira/browse/SPARK-27561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-27561: --- Description: Amazon Redshift has a feature called "lateral column alias references": [https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/]. Quoting from that blogpost: {quote}The support for lateral column alias reference enables you to write queries without repeating the same expressions in the SELECT list. For example, you can define the alias 'probability' and use it within the same select statement: {code:java} select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data; {code} {quote} There's more information about this feature on [https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:] {quote}The benefit of the lateral alias reference is you don't need to repeat the aliased expression when building more complex expressions in the same target list. When Amazon Redshift parses this type of reference, it just inlines the previously defined aliases. If there is a column with the same name defined in the FROM clause as the previously aliased expression, the column in the FROM clause takes priority. For example, in the above query if there is a column named 'probability' in table raw_data, the 'probability' in the second expression in the target list will refer to that column instead of the alias name 'probability'. {quote} It would be nice if Spark supported this syntax. I don't think that this is standard SQL, so it might be a good idea to research if other SQL databases support similar syntax (and to see if they implement the same column resolution strategy as Redshift). We should also consider whether this needs to be feature-flagged as part of a specific SQL compatibility mode / dialect. 
One possibly-related existing ticket: SPARK-9338, which discusses the use of SELECT aliases in GROUP BY expressions. /cc [~hvanhovell] was: Amazon Redshift has a feature called "lateral column alias references: [https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/]. Quoting from that blogpost: {quote}The support for lateral column alias reference enables you to write queries without repeating the same expressions in the SELECT list. For example, you can define the alias 'probability' and use it within the same select statement: {code:java} select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data; {code} {quote} There's more information about this feature on [https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:] {quote}The benefit of the lateral alias reference is you don't need to repeat the aliased expression when building more complex expressions in the same target list. When Amazon Redshift parses this type of reference, it just inlines the previously defined aliases. If there is a column with the same name defined in the FROM clause as the previously aliased expression, the column in the FROM clause takes priority. For example, in the above query if there is a column named 'probability' in table raw_data, the 'probability' in the second expression in the target list will refer to that column instead of the alias name 'probability'. {quote} It would be nice if Spark supported this syntax. I don't think that this is standard SQL, so it might be a good idea to research if other SQL databases support similar syntax (and to see if they implement the same column resolution strategy as Redshift). We should also consider whether this needs to be feature-flagged as part of a specific SQL compatibility mode / dialect. One possibly-related existing ticket: SPARK-9338, which discusses the use of SELECT aliases in GROUP BY expressions. 
/cc [~hvanhovell] > Support "lateral column alias references" to allow column aliases to be used > within SELECT clauses > -- > > Key: SPARK-27561 > URL: https://issues.apache.org/jira/browse/SPARK-27561 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > > Amazon Redshift has a feature called "lateral column alias references": > [https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/]. > Quoting from that blogpost: > {quote}The support for lateral column alias reference enables you to write > queries without repeating the same expressions in the SELECT list. For > example, you can define the alias 'probability' and use it within the same > select statement: > {code:java} > select c
[jira] [Updated] (SPARK-27561) Support "lateral column alias references" to allow column aliases to be used within SELECT clauses
[ https://issues.apache.org/jira/browse/SPARK-27561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-27561: --- Description: Amazon Redshift has a feature called "lateral column alias references: [https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/]. Quoting from that blogpost: {quote}The support for lateral column alias reference enables you to write queries without repeating the same expressions in the SELECT list. For example, you can define the alias 'probability' and use it within the same select statement: {code:java} select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data; {code} {quote} There's more information about this feature on [https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:] {quote}The benefit of the lateral alias reference is you don't need to repeat the aliased expression when building more complex expressions in the same target list. When Amazon Redshift parses this type of reference, it just inlines the previously defined aliases. If there is a column with the same name defined in the FROM clause as the previously aliased expression, the column in the FROM clause takes priority. For example, in the above query if there is a column named 'probability' in table raw_data, the 'probability' in the second expression in the target list will refer to that column instead of the alias name 'probability'. {quote} It would be nice if Spark supported this syntax. I don't think that this is standard SQL, so it might be a good idea to research if other SQL databases support similar syntax (and to see if they implement the same column resolution strategy as Redshift). We should also consider whether this needs to be feature-flagged as part of a specific SQL compatibility mode / dialect. 
One possibly-related existing ticket: SPARK-9338, which discusses the use of SELECT aliases in GROUP BY expressions. /cc [~hvanhovell] was: Amazon Redshift has a feature called "lateral column alias references: [https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/]. Quoting from that blogpost: {quote}The support for lateral column alias reference enables you to write queries without repeating the same expressions in the SELECT list. For example, you can define the alias 'probability' and use it within the same select statement: {code:java} select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data; {code} {quote} There's more information about this feature on [https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:] {quote}The benefit of the lateral alias reference is you don't need to repeat the aliased expression when building more complex expressions in the same target list. When Amazon Redshift parses this type of reference, it just inlines the previously defined aliases. If there is a column with the same name defined in the FROM clause as the previously aliased expression, the column in the FROM clause takes priority. For example, in the above query if there is a column named 'probability' in table raw_data, the 'probability' in the second expression in the target list will refer to that column instead of the alias name 'probability'. {quote} It would be nice if Spark supported this syntax. I don't think that this is standard SQL, so it might be a good idea to research if other SQL databases support similar syntax (and to see if they implement the same column resolution strategy as Redshift). One possibly-related existing ticket: SPARK-9338, which discusses the use of SELECT aliases in GROUP BY expressions. 
/cc [~hvanhovell] > Support "lateral column alias references" to allow column aliases to be used > within SELECT clauses > -- > > Key: SPARK-27561 > URL: https://issues.apache.org/jira/browse/SPARK-27561 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > > Amazon Redshift has a feature called "lateral column alias references: > [https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/]. > Quoting from that blogpost: > {quote}The support for lateral column alias reference enables you to write > queries without repeating the same expressions in the SELECT list. For > example, you can define the alias 'probability' and use it within the same > select statement: > {code:java} > select clicks / impressions as probability, round(100 * probability, 1) as > percentage from raw_data; > {code} > {quote} > There's
[jira] [Created] (SPARK-27561) Support "lateral column alias references" to allow column aliases to be used within SELECT clauses
Josh Rosen created SPARK-27561: -- Summary: Support "lateral column alias references" to allow column aliases to be used within SELECT clauses Key: SPARK-27561 URL: https://issues.apache.org/jira/browse/SPARK-27561 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0 Reporter: Josh Rosen Amazon Redshift has a feature called "lateral column alias references": [https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/]. Quoting from that blogpost: {quote}The support for lateral column alias reference enables you to write queries without repeating the same expressions in the SELECT list. For example, you can define the alias 'probability' and use it within the same select statement: {code:java} select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data; {code} {quote} There's more information about this feature on [https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:] {quote}The benefit of the lateral alias reference is you don't need to repeat the aliased expression when building more complex expressions in the same target list. When Amazon Redshift parses this type of reference, it just inlines the previously defined aliases. If there is a column with the same name defined in the FROM clause as the previously aliased expression, the column in the FROM clause takes priority. For example, in the above query if there is a column named 'probability' in table raw_data, the 'probability' in the second expression in the target list will refer to that column instead of the alias name 'probability'. {quote} It would be nice if Spark supported this syntax. I don't think that this is standard SQL, so it might be a good idea to research if other SQL databases support similar syntax (and to see if they implement the same column resolution strategy as Redshift). 
One possibly-related existing ticket: SPARK-9338, which discusses the use of SELECT aliases in GROUP BY expressions. /cc [~hvanhovell] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
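The resolution strategy Redshift documents above (inline a previously defined alias, unless a FROM-clause column of the same name shadows it) can be modeled in a few lines. This is an illustrative sketch of the semantics only, not Spark's analyzer; the function name and the regex-based tokenizing are hypothetical:

```python
import re

def resolve_select_list(select_items, from_columns):
    """Resolve lateral column alias references, Redshift-style (sketch).

    select_items: list of (alias, expression) pairs in SELECT-list order.
    An identifier that matches an earlier alias is inlined, unless a
    column of the same name exists in the FROM clause, which wins.
    """
    aliases = {}
    resolved = []
    for alias, expr in select_items:
        def substitute(match):
            name = match.group(0)
            if name in from_columns:      # FROM column takes priority
                return name
            if name in aliases:           # inline the earlier alias
                return "(" + aliases[name] + ")"
            return name
        new_expr = re.sub(r"[A-Za-z_]\w*", substitute, expr)
        aliases[alias] = new_expr
        resolved.append((alias, new_expr))
    return resolved

# The Redshift example: 'probability' is inlined into the second item.
items = [("probability", "clicks / impressions"),
         ("percentage", "round(100 * probability, 1)")]
print(resolve_select_list(items, from_columns={"clicks", "impressions"}))
```

Passing a `from_columns` set that also contains "probability" leaves the second expression untouched, matching the shadowing rule quoted above.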
[jira] [Commented] (SPARK-27542) SparkHadoopWriter doesn't call setWorkOutputPath, causing NPEs when using certain legacy OutputFormats
[ https://issues.apache.org/jira/browse/SPARK-27542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825560#comment-16825560 ] Josh Rosen commented on SPARK-27542: [~shivuson...@gmail.com], unfortunately I don't have a self-contained example which I can share right now, but the following skeleton is a rough starting point: {code:java} import org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat val data: RDD[(Void, org.apache.parquet.example.data.Group)] = ??? data.saveAsHadoopFile( path = "output", keyClass = classOf[Void], valueClass = classOf[org.apache.parquet.example.data.Group], outputFormatClass = classOf[DeprecatedParquetOutputFormat[org.apache.parquet.example.data.Group]] ) {code} > SparkHadoopWriter doesn't set call setWorkOutputPath, causing NPEs when using > certain legacy OutputFormats > -- > > Key: SPARK-27542 > URL: https://issues.apache.org/jira/browse/SPARK-27542 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > > In Hadoop MapReduce, tasks call {{FileOutputFormat.setWorkOutputPath()}} > after configuring the output committer: > [https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L611] > > Spark doesn't do this: > [https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L115] > As a result, certain legacy output formats can fail to work out-of-the-box on > Spark. In particular, > {{org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat}} can fail > with NullPointerExceptions, e.g. 
> {code:java} > java.lang.NullPointerException > at org.apache.hadoop.fs.Path.(Path.java:105) > at org.apache.hadoop.fs.Path.(Path.java:94) > at > org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69) > [...] > at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:96) > {code} > It looks like someone on GitHub has hit the same problem: > https://gist.github.com/themodernlife/e3b07c23ba978f6cc98b73e3f3609abe > Tez had a very similar bug: https://issues.apache.org/jira/browse/TEZ-3348 > We might be able to fix this by having Spark mimic Hadoop's logic. I'm unsure > of whether that change would pose compatibility risks for other existing > workloads, though. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
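The failure mechanism above can be modeled without Hadoop: the legacy output format derives its work file from a conf entry that Hadoop's Task sets (via FileOutputFormat.setWorkOutputPath) but Spark's SparkHadoopWriter does not. The Python functions below are illustrative stand-ins, not the real Hadoop API; a KeyError models the NPE:

```python
def get_default_work_file(conf, filename):
    # Stand-in for DeprecatedParquetOutputFormat.getDefaultWorkFile: it
    # builds a Path from the task's work output dir. When that dir was
    # never set, Hadoop constructs Path(null) and throws an NPE; here
    # the missing key raises KeyError to model the same failure.
    return conf["mapreduce.task.output.dir"] + "/" + filename

def set_work_output_path(conf, committer_work_path):
    # Stand-in for FileOutputFormat.setWorkOutputPath, which Hadoop's
    # Task calls after configuring the output committer and Spark
    # currently does not.
    conf["mapreduce.task.output.dir"] = committer_work_path

conf = {}
try:
    get_default_work_file(conf, "part-00000.parquet")
except KeyError:
    print("NPE-equivalent: work output path was never set")

set_work_output_path(conf, "output/_temporary/attempt_0")
print(get_default_work_file(conf, "part-00000.parquet"))
```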
[jira] [Updated] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle
[ https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Muraru updated SPARK-27530: -- Description: I'm getting this in a large shuffle: {code:java} org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at 
org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Received a zero-size buffer for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false){code} was: I'm getting this in a large shuffle: org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) 
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
[jira] [Created] (SPARK-27560) HashPartitioner uses Object.hashCode which is not seeded
Andrew McHarg created SPARK-27560: - Summary: HashPartitioner uses Object.hashCode which is not seeded Key: SPARK-27560 URL: https://issues.apache.org/jira/browse/SPARK-27560 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.4.0 Environment: Notebook is running spark v2.4.0 local[*] Python 3.6.6 (default, Sep 6 2018, 13:10:03) [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin I imagine this would reproduce on all operating systems and most versions of spark though. Reporter: Andrew McHarg Forgive the quality of the bug report here, I am a pyspark user and not super familiar with the internals of spark, yet it seems I have a strange corner case with the HashPartitioner. This may already be known, but repartition with HashPartitioner seems to assign everything to the same partition if data that was partitioned by the same column is only partially read (say one partition). I suppose it is an obvious consequence of Object.hashCode being deterministic, but it took a while to track down. Steps to repro:
# Get dataframe with a bunch of uuids say 1
# repartition(100, 'uuid_column')
# save to parquet
# read from parquet
# collect()[:100] then filter using pyspark.sql.functions isin (yes I know this is bad and sampleBy should probably be used here)
# repartition(10, 'uuid_column')
# Resulting dataframe will have all of its data in one single partition
Jupyter notebook for the above: https://gist.github.com/robo-hamburger/4752a40cb643318464e58ab66cf7d23e I think an easy fix would be to seed the HashPartitioner like many hashtable libraries do to avoid denial of service attacks. It also might be the case this is obvious behavior for more experienced spark users :) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
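The collapse described in the repro above is easy to reproduce without Spark: with an unseeded, deterministic hash, every key taken from one partition of a 100-way partitioning satisfies hash(k) % 100 == 0, which implies hash(k) % 10 == 0, so a 10-way repartition puts them all in a single partition. A minimal pure-Python model (Spark's hash function differs in detail, but the arithmetic is the same):

```python
def assign_partition(key, num_partitions):
    # Unseeded, deterministic hash partitioning, as in HashPartitioner.
    return hash(key) % num_partitions

# Keys that all landed in partition 0 of a 100-way partitioning...
keys_from_one_partition = [k for k in range(10_000)
                           if assign_partition(k, 100) == 0]

# ...all collapse into one partition when repartitioned 10 ways,
# because hash(k) % 100 == 0 implies hash(k) % 10 == 0.
new_parts = {assign_partition(k, 10) for k in keys_from_one_partition}
print(new_parts)  # {0}
```

Seeding the hash per-partitioner, as the reporter suggests, would break the arithmetic relationship between the two modulus values and restore an even spread.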
[jira] [Created] (SPARK-27559) Nullable in a given schema is not respected when reading from parquet
colin fang created SPARK-27559: -- Summary: Nullable in a given schema is not respected when reading from parquet Key: SPARK-27559 URL: https://issues.apache.org/jira/browse/SPARK-27559 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.2 Reporter: colin fang Even if I specify a schema when reading from parquet, nullable is not reset.
{code:java}
spark.range(10, numPartitions=1).write.mode('overwrite').parquet('tmp')
df1 = spark.read.parquet('tmp')
df1.printSchema()
# root
# |-- id: long (nullable = true)
df2 = spark.read.schema(StructType([StructField('id', LongType(), False)])).parquet('tmp')
df2.printSchema()
# root
# |-- id: long (nullable = true)
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27558) NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang
[ https://issues.apache.org/jira/browse/SPARK-27558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alessandro Bellina updated SPARK-27558: --- Description: We see an NPE in the UnsafeInMemorySorter.getMemoryUsage function (due to the array we are accessing there being null). This looks to be caused by a Spark OOM when UnsafeInMemorySorter is trying to spill. This is likely a symptom of https://issues.apache.org/jira/browse/SPARK-21492. The real question for this ticket is: could we handle this more gracefully, rather than throwing an NPE? For example: remove this line: https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L182 and store the new array into a temporary, so that when this allocation fails: https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L186 we don't end up with a null "array". This state is causing one of our jobs to hang infinitely (we think) due to the original allocation error. 
Stack trace for reference {noformat} 2019-04-23 08:57:14,989 [Executor task launch worker for task 46729] ERROR org.apache.spark.TaskContextImpl - Error in TaskCompletionListener java.lang.NullPointerException at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getMemoryUsage(UnsafeInMemorySorter.java:208) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getMemoryUsage(UnsafeExternalSorter.java:249) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.updatePeakMemoryUsed(UnsafeExternalSorter.java:253) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.freeMemory(UnsafeExternalSorter.java:296) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:328) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.lambda$new$0(UnsafeExternalSorter.java:178) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) at org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:131) at org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:129) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:129) at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117) at org.apache.spark.scheduler.Task.run(Task.scala:119) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2019-04-23 08:57:15,069 [Executor task launch worker for task 
46729] ERROR org.apache.spark.executor.Executor - Exception in task 102.0 in stage 28.0 (TID 46729) org.apache.spark.util.TaskCompletionListenerException: null Previous exception in task: Unable to acquire 65536 bytes of memory, got 0 org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157) org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98) org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:186) org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:229) org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:204) org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:283) org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96) org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:348) org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:403) org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:135) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage23.sort_addToSorter_0$(Unknown Source) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage23.processNext(Unknown Source) org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) org.apache.spark.sql.execution.WholeStageCodegenExec$$
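The fix proposed in the description, allocating the replacement pointer array into a temporary and assigning the field only after allocation succeeds, can be sketched as follows. This is a Python model of the suggested Java change, with an injectable allocator standing in for allocateArray; the class and method names are hypothetical:

```python
class InMemorySorterModel:
    """Model of UnsafeInMemorySorter's reset(): keep the old array valid
    if allocating the replacement fails (e.g. a Spark OOM while spilling)."""

    def __init__(self, alloc):
        self._alloc = alloc
        self.array = alloc(4)

    def reset_unsafe(self, n):
        # Current behavior (sketch): the field is invalidated before the
        # new allocation, so a failed allocation leaves array == None and
        # get_memory_usage() later dies on it (the NPE above).
        self.array = None
        self.array = self._alloc(n)

    def reset_safe(self, n):
        # Proposed behavior: allocate into a temporary first; on failure
        # the old array survives and cleanup paths keep working.
        tmp = self._alloc(n)
        self.array = tmp

    def get_memory_usage(self):
        return len(self.array) * 8  # would fail on a None array

def failing_alloc(n):
    if n > 4:
        raise MemoryError("Unable to acquire memory")  # models the Spark OOM
    return [0] * n

s = InMemorySorterModel(failing_alloc)
try:
    s.reset_safe(1024)
except MemoryError:
    pass
print(s.get_memory_usage())  # 32: old array still usable
```

With the unsafe variant the same failed allocation leaves the field None, which is the state that makes the TaskCompletionListener cleanup path throw and mask the original OOM.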
[jira] [Created] (SPARK-27558) NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang
Alessandro Bellina created SPARK-27558: -- Summary: NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang Key: SPARK-27558 URL: https://issues.apache.org/jira/browse/SPARK-27558 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.2, 2.3.3 Reporter: Alessandro Bellina We see an NPE in the UnsafeInMemorySorter.getMemoryUsage function (due to the array we are accessing there being null). This looks to be caused by a Spark OOM when UnsafeInMemorySorter is trying to spill. This is likely a symptom of https://issues.apache.org/jira/browse/SPARK-21492. The real question for this ticket is: could we handle this more gracefully, rather than throwing an NPE? For example: allocate the new array into a temporary here: https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L182 so that when this fails: https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L186 we don't end up with a null "array". This state is causing one of our jobs to hang infinitely (we think) due to the original allocation error. 
Stack trace for reference {noformat} 2019-04-23 08:57:14,989 [Executor task launch worker for task 46729] ERROR org.apache.spark.TaskContextImpl - Error in TaskCompletionListener java.lang.NullPointerException at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getMemoryUsage(UnsafeInMemorySorter.java:208) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getMemoryUsage(UnsafeExternalSorter.java:249) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.updatePeakMemoryUsed(UnsafeExternalSorter.java:253) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.freeMemory(UnsafeExternalSorter.java:296) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:328) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.lambda$new$0(UnsafeExternalSorter.java:178) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) at org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:131) at org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:129) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:129) at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117) at org.apache.spark.scheduler.Task.run(Task.scala:119) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2019-04-23 08:57:15,069 [Executor task launch worker for task 
46729] ERROR org.apache.spark.executor.Executor - Exception in task 102.0 in stage 28.0 (TID 46729) org.apache.spark.util.TaskCompletionListenerException: null Previous exception in task: Unable to acquire 65536 bytes of memory, got 0 org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157) org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98) org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:186) org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:229) org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:204) org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:283) org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96) org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:348) org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:403) org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:135) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage23.sort_addToSorter_0$(Unknown Source) org.apache.spark.sql.catalyst.expressions.Genera
[jira] [Commented] (SPARK-25888) Service requests for persist() blocks via external service after dynamic deallocation
[ https://issues.apache.org/jira/browse/SPARK-25888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825248#comment-16825248 ] Attila Zsolt Piros commented on SPARK-25888: I am working on this. > Service requests for persist() blocks via external service after dynamic > deallocation > - > > Key: SPARK-25888 > URL: https://issues.apache.org/jira/browse/SPARK-25888 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Shuffle, YARN >Affects Versions: 2.3.2 >Reporter: Adam Kennedy >Priority: Major > > Large and highly multi-tenant Spark on YARN clusters with diverse job > execution often display terrible utilization rates (we have observed as low > as 3-7% CPU at max container allocation, but 50% CPU utilization on even a > well policed cluster is not uncommon). > As a sizing example, consider a scenario with 1,000 nodes, 50,000 cores, 250 > users and 50,000 runs of 1,000 distinct applications per week, with > predominantly Spark including a mixture of ETL, Ad Hoc tasks and PySpark > Notebook jobs (no streaming) > Utilization problems appear to be due in large part to difficulties with > persist() blocks (DISK or DISK+MEMORY) preventing dynamic deallocation. > In situations where an external shuffle service is present (which is typical > on clusters of this type) we already solve this for the shuffle block case by > offloading the IO handling of shuffle blocks to the external service, > allowing dynamic deallocation to proceed. > Allowing Executors to transfer persist() blocks to some external "shuffle" > service in a similar manner would be an enormous win for Spark multi-tenancy > as it would limit deallocation blocking scenarios to only MEMORY-only cache() > scenarios. 
> I'm not sure if I'm correct, but I seem to recall, from the original > external shuffle service commits, that this may have been considered at the > time, but getting shuffle blocks moved to the external shuffle service was > the first priority. > With support for external persist() DISK blocks in place, we could also then > handle deallocation of DISK+MEMORY, as the memory instance could first be > dropped, changing the block to DISK only, and then further transferred to the > shuffle service. > We have tried to resolve the persist() issue via extensive user training, but > that has typically only allowed us to improve utilization of the worst > offenders (10% utilization) up to around 40-60% utilization, as the need for > persist() is often legitimate and occurs during the middle stages of a job. > In a healthy multi-tenant scenario, a large job might spool up to say 10,000 > cores, persist() data, release executors across a long tail down to 100 > cores, and then spool back up to 10,000 cores for the following stage without > impact on the persist() data. > In an ideal world, if a new executor started up on a node on which blocks > had been transferred to the shuffle service, the new executor might even be > able to "recapture" control of those blocks (if that would help with > performance in some way). > And the behavior of gradually expanding up and down several times over the > course of a job would not just improve utilization, but would allow resources > to more easily be redistributed to other jobs which start on the cluster > during the long-tail periods, which would improve multi-tenancy and bring us > closer to optimal "envy free" YARN scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27557) Add copybutton to spark Python API docs for easier copying of code-blocks
Sangram G created SPARK-27557: - Summary: Add copybutton to spark Python API docs for easier copying of code-blocks Key: SPARK-27557 URL: https://issues.apache.org/jira/browse/SPARK-27557 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 2.4.2 Reporter: Sangram G Add a non-intrusive button to the Python API documentation which will remove ">>>" prompts and code output - for easier copying of code. For example: the code snippet below is difficult to copy due to the ">>>" prompts >>> l = [('Alice', 1)] >>> spark.createDataFrame(l).collect() [Row(_1='Alice', _2=1)] It becomes this - which is easier to copy - after the copybutton in the corner of the code-block is pressed: l = [('Alice', 1)] spark.createDataFrame(l).collect() Sample screenshot for reference: [https://screenshots.firefox.com/ZfBXZrpINKdt9QZg/www269.lunapic.com] This can be done simply by adding a copybutton.js script in the python/docs/_static folder and calling it at setup time from python/docs/conf.py.
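The client-side behaviour the button needs is easy to model. The following Python sketch is illustrative only (the real feature is the copybutton.js JavaScript running in the browser); it shows the prompt/output filtering involved:

```python
# Illustrative model of what copybutton.js does client-side: keep only the
# input lines of a doctest-style snippet, stripping ">>> " and "... "
# prompts and dropping interpreter output entirely.
def strip_prompts(snippet: str) -> str:
    kept = []
    for line in snippet.splitlines():
        if line.startswith((">>> ", "... ")):
            kept.append(line[4:])  # drop the 4-character prompt
        # lines without a prompt are interpreter output: omitted
    return "\n".join(kept)

doc = """>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]"""
print(strip_prompts(doc))
```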
[jira] [Resolved] (SPARK-27550) Fix `test-dependencies.sh` not to use `kafka-0-8` profile for Scala-2.12
[ https://issues.apache.org/jira/browse/SPARK-27550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27550. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.4.3 This is resolved via https://github.com/apache/spark/pull/24445 > Fix `test-dependencies.sh` not to use `kafka-0-8` profile for Scala-2.12 > > > Key: SPARK-27550 > URL: https://issues.apache.org/jira/browse/SPARK-27550 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.3 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.3 > > > Since SPARK-27274 deprecated Scala-2.11 at Spark 2.4.1, we need to test > Scala-2.12 more. > Kafka 0.8 doesn't have Scala-2.12 artifacts, e.g., > `org.apache.kafka:kafka_2.12:jar:0.8.2.1`. This issue aims to fix the > `test-dependencies.sh` script to understand the Scala binary version. > {code:java} > $ dev/change-scala-version.sh 2.12 > $ dev/test-dependencies.sh > Using `mvn` from path: /usr/local/bin/mvn > Using `mvn` from path: /usr/local/bin/mvn > Performing Maven install for hadoop-2.6 > Using `mvn` from path: /usr/local/bin/mvn > [ERROR] Failed to execute goal on project spark-streaming-kafka-0-8_2.12: > Could not resolve dependencies for project > org.apache.spark:spark-streaming-kafka-0-8_2.12:jar:spark-335572: Could not > find artifact org.apache.kafka:kafka_2.12:jar:0.8.2.1 in central > (https://repo.maven.apache.org/maven2) -> [Help 1] > {code} > Please note that this issue doesn't aim to update `spark-deps-hadoop-2.6` > manifest here.
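The shape of the fix can be sketched as follows. This is an illustrative Python model, not the actual shell logic of `test-dependencies.sh`, and the profile names other than `kafka-0-8` are placeholders: profiles are chosen from the Scala *binary* version, so `kafka-0-8` is only activated where Kafka 0.8 artifacts exist (Scala 2.11).

```python
# Sketch: derive the Scala binary version (e.g. "2.12.8" -> "2.12") and
# include the kafka-0-8 profile only for Scala 2.11, since Kafka 0.8 has
# no 2.12 artifacts. Profile names besides kafka-0-8 are illustrative.
def profiles_for(scala_version: str) -> list:
    binary = ".".join(scala_version.split(".")[:2])
    profiles = ["kinesis-asl", "yarn"]
    if binary == "2.11":
        profiles.insert(0, "kafka-0-8")
    return profiles

print(profiles_for("2.11.12"))  # includes kafka-0-8
print(profiles_for("2.12.8"))   # skips it
```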
[jira] [Updated] (SPARK-27556) Exclude com.zaxxer:HikariCP-java7 from hadoop-yarn-server-web-proxy
[ https://issues.apache.org/jira/browse/SPARK-27556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27556: Description: {noformat} [INFO] | +- org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:3.2.0:compile [INFO] | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:3.2.0:compile [INFO] | | +- org.apache.hadoop:hadoop-yarn-registry:jar:3.2.0:compile [INFO] | | | \- commons-daemon:commons-daemon:jar:1.0.13:compile [INFO] | | +- org.apache.geronimo.specs:geronimo-jcache_1.0_spec:jar:1.0-alpha-1:compile [INFO] | | +- org.ehcache:ehcache:jar:3.3.1:compile [INFO] | | +- com.zaxxer:HikariCP-java7:jar:2.4.12:compile{noformat} {noformat} [INFO] +- org.apache.hive:hive-metastore:jar:2.3.4:compile [INFO] | +- javolution:javolution:jar:5.5.1:compile [INFO] | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile [INFO] | +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile [INFO] | +- com.zaxxer:HikariCP:jar:2.5.1:compile{noformat} was: {noformat} [INFO] | +- org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:3.2.0:compile [INFO] | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:3.2.0:compile [INFO] | | +- org.apache.hadoop:hadoop-yarn-registry:jar:3.2.0:compile [INFO] | | | \- commons-daemon:commons-daemon:jar:1.0.13:compile [INFO] | | +- org.apache.geronimo.specs:geronimo-jcache_1.0_spec:jar:1.0-alpha-1:compile [INFO] | | +- org.ehcache:ehcache:jar:3.3.1:compile [INFO] | | +- com.zaxxer:HikariCP-java7:jar:2.4.12:compile [INFO] | | \- com.microsoft.sqlserver:mssql-jdbc:jar:6.2.1.jre7:runtime{noformat} {noformat} [INFO] +- org.apache.hive:hive-metastore:jar:2.3.4:compile [INFO] | +- javolution:javolution:jar:5.5.1:compile [INFO] | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile [INFO] | +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile [INFO] | +- com.zaxxer:HikariCP:jar:2.5.1:compile{noformat} > Exclude com.zaxxer:HikariCP-java7 from hadoop-yarn-server-web-proxy > --- > > Key: SPARK-27556 > URL: 
https://issues.apache.org/jira/browse/SPARK-27556 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > [INFO] | +- org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:3.2.0:compile > [INFO] | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:3.2.0:compile > [INFO] | | +- org.apache.hadoop:hadoop-yarn-registry:jar:3.2.0:compile > [INFO] | | | \- commons-daemon:commons-daemon:jar:1.0.13:compile > [INFO] | | +- > org.apache.geronimo.specs:geronimo-jcache_1.0_spec:jar:1.0-alpha-1:compile > [INFO] | | +- org.ehcache:ehcache:jar:3.3.1:compile > [INFO] | | +- com.zaxxer:HikariCP-java7:jar:2.4.12:compile{noformat} > {noformat} > [INFO] +- org.apache.hive:hive-metastore:jar:2.3.4:compile > [INFO] | +- javolution:javolution:jar:5.5.1:compile > [INFO] | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile > [INFO] | +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile > [INFO] | +- com.zaxxer:HikariCP:jar:2.5.1:compile{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
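The exclusion the issue title asks for would look roughly like the pom.xml fragment below. This is a hedged sketch: the coordinates come from the dependency tree above, but the enclosing dependency element and the version property name are assumptions about Spark's actual pom.

```xml
<!-- Hypothetical sketch: exclude HikariCP-java7 where the YARN web proxy
     dependency is declared; ${yarn.version} is an assumed property name. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-yarn-server-web-proxy</artifactId>
  <version>${yarn.version}</version>
  <exclusions>
    <exclusion>
      <groupId>com.zaxxer</groupId>
      <artifactId>HikariCP-java7</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

This leaves hive-metastore's newer `com.zaxxer:HikariCP:2.5.1` as the single HikariCP on the classpath.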
[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825134#comment-16825134 ] Udbhav Agrawal commented on SPARK-25262: Thanks [~rvesse] i will try to work on this > Make Spark local dir volumes configurable with Spark on Kubernetes > -- > > Key: SPARK-25262 > URL: https://issues.apache.org/jira/browse/SPARK-25262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Priority: Major > > As discussed during review of the design document for SPARK-24434 while > providing pod templates will provide more in-depth customisation for Spark on > Kubernetes there are some things that cannot be modified because Spark code > generates pod specs in very specific ways. > The particular issue identified relates to handling on {{spark.local.dirs}} > which is done by {{LocalDirsFeatureStep.scala}}. For each directory > specified, or a single default if no explicit specification, it creates a > Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation > this will be backed by the node storage > (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some > compute environments this may be extremely undesirable. For example with > diskless compute resources the node storage will likely be a non-performant > remote mounted disk, often with limited capacity. For such environments it > would likely be better to set {{medium: Memory}} on the volume per the K8S > documentation to use a {{tmpfs}} volume instead. > Another closely related issue is that users might want to use a different > volume type to back the local directories and there is no possibility to do > that. > Pod templates will not really solve either of these issues because Spark is > always going to attempt to generate a new volume for each local directory and > always going to set these as {{emptyDir}}. 
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: > * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} > volumes > * Modify the logic to check if there is a volume already defined with the > name and if so skip generating a volume definition for it
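The two proposed changes can be sketched as follows. This is a hypothetical Python model, not Spark's actual Scala builder API; the function name, dict shapes, and volume names are illustrative:

```python
# Sketch of the proposal: (1) a toggle for tmpfs-backed emptyDir volumes,
# and (2) skip generating a volume when the pod template already defines
# one with the same name.
def build_local_dir_volumes(local_dir_names, template_volumes, use_tmpfs=False):
    existing = {v["name"] for v in template_volumes}
    generated = []
    for name in local_dir_names:
        if name in existing:
            continue  # respect the user's template-defined volume
        spec = {"medium": "Memory"} if use_tmpfs else {}
        generated.append({"name": name, "emptyDir": spec})
    return generated

vols = build_local_dir_volumes(
    ["spark-local-dir-1", "spark-local-dir-2"],
    [{"name": "spark-local-dir-1", "hostPath": {"path": "/fast-ssd"}}],
    use_tmpfs=True)
print(vols)  # only spark-local-dir-2 gets a generated tmpfs-backed emptyDir
```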
[jira] [Created] (SPARK-27556) Exclude com.zaxxer:HikariCP-java7 from hadoop-yarn-server-web-proxy
Yuming Wang created SPARK-27556: --- Summary: Exclude com.zaxxer:HikariCP-java7 from hadoop-yarn-server-web-proxy Key: SPARK-27556 URL: https://issues.apache.org/jira/browse/SPARK-27556 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.0.0 Reporter: Yuming Wang {noformat} [INFO] | +- org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:3.2.0:compile [INFO] | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:3.2.0:compile [INFO] | | +- org.apache.hadoop:hadoop-yarn-registry:jar:3.2.0:compile [INFO] | | | \- commons-daemon:commons-daemon:jar:1.0.13:compile [INFO] | | +- org.apache.geronimo.specs:geronimo-jcache_1.0_spec:jar:1.0-alpha-1:compile [INFO] | | +- org.ehcache:ehcache:jar:3.3.1:compile [INFO] | | +- com.zaxxer:HikariCP-java7:jar:2.4.12:compile [INFO] | | \- com.microsoft.sqlserver:mssql-jdbc:jar:6.2.1.jre7:runtime{noformat} {noformat} [INFO] +- org.apache.hive:hive-metastore:jar:2.3.4:compile [INFO] | +- javolution:javolution:jar:5.5.1:compile [INFO] | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile [INFO] | +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile [INFO] | +- com.zaxxer:HikariCP:jar:2.5.1:compile{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825132#comment-16825132 ] Rob Vesse commented on SPARK-25262: --- [~Udbhav Agrawal] Yes I think an approach like that would be acceptable to the community (and if not then I don't know what will be). If you want to take a stab at doing this please feel free > Make Spark local dir volumes configurable with Spark on Kubernetes > -- > > Key: SPARK-25262 > URL: https://issues.apache.org/jira/browse/SPARK-25262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Priority: Major > > As discussed during review of the design document for SPARK-24434 while > providing pod templates will provide more in-depth customisation for Spark on > Kubernetes there are some things that cannot be modified because Spark code > generates pod specs in very specific ways. > The particular issue identified relates to handling on {{spark.local.dirs}} > which is done by {{LocalDirsFeatureStep.scala}}. For each directory > specified, or a single default if no explicit specification, it creates a > Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation > this will be backed by the node storage > (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some > compute environments this may be extremely undesirable. For example with > diskless compute resources the node storage will likely be a non-performant > remote mounted disk, often with limited capacity. For such environments it > would likely be better to set {{medium: Memory}} on the volume per the K8S > documentation to use a {{tmpfs}} volume instead. > Another closely related issue is that users might want to use a different > volume type to back the local directories and there is no possibility to do > that. 
> Pod templates will not really solve either of these issues because Spark is > always going to attempt to generate a new volume for each local directory and > always going to set these as {{emptyDir}}. > Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: > * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} > volumes > * Modify the logic to check if there is a volume already defined with the > name and if so skip generating a volume definition for it
[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824993#comment-16824993 ] Udbhav Agrawal commented on SPARK-25262: [~rvesse] for the second part, can we move {{LocalDirsFeatureStep}} out of baseFeatures and provide a new configuration, something like {color:#6a8759}spark.kubernetes.emptyDir.disable{color}, and handle it in {{KubernetesDriverBuilder.scala}}? The user would then have the flexibility to back the local directories with a volume type other than {{emptyDir}} through pod templates. > Make Spark local dir volumes configurable with Spark on Kubernetes > -- > > Key: SPARK-25262 > URL: https://issues.apache.org/jira/browse/SPARK-25262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Priority: Major > > As discussed during review of the design document for SPARK-24434 while > providing pod templates will provide more in-depth customisation for Spark on > Kubernetes there are some things that cannot be modified because Spark code > generates pod specs in very specific ways. > The particular issue identified relates to handling of {{spark.local.dirs}} > which is done by {{LocalDirsFeatureStep.scala}}. For each directory > specified, or a single default if no explicit specification, it creates a > Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation > this will be backed by the node storage > (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some > compute environments this may be extremely undesirable. For example with > diskless compute resources the node storage will likely be a non-performant > remote mounted disk, often with limited capacity. For such environments it > would likely be better to set {{medium: Memory}} on the volume per the K8S > documentation to use a {{tmpfs}} volume instead. 
> Another closely related issue is that users might want to use a different > volume type to back the local directories and there is no possibility to do > that. > Pod templates will not really solve either of these issues because Spark is > always going to attempt to generate a new volume for each local directory and > always going to set these as {{emptyDir}}. > Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: > * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} > volumes > * Modify the logic to check if there is a volume already defined with the > name and if so skip generating a volume definition for it
[jira] [Updated] (SPARK-27354) Move incompatible code from the hive-thriftserver module to sql/hive-thriftserver/v1.2.1
[ https://issues.apache.org/jira/browse/SPARK-27354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27354: Summary: Move incompatible code from the hive-thriftserver module to sql/hive-thriftserver/v1.2.1 (was: Add a new empty hive-thriftserver module for Hive 2.3.4 ) > Move incompatible code from the hive-thriftserver module to > sql/hive-thriftserver/v1.2.1 > > > Key: SPARK-27354 > URL: https://issues.apache.org/jira/browse/SPARK-27354 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > When we upgraded the built-in Hive to 2.3.4, the current > {{hive-thriftserver}} module is not compatible with these Hive changes: > # HIVE-12442 HiveServer2: Refactor/repackage HiveServer2's Thrift code so > that it can be used in the tasks > # HIVE-12237 Use slf4j as logging facade > # HIVE-13169 HiveServer2: Support delegation token based connection when > using http transport > So we should add a new {{hive-thriftserver}} module for Hive 2.3.4: > 1. Add a new empty module for Hive 2.3.4 named {{hive-thriftserverV2}}. > 2. Make {{hive-thriftserver}} activate only when testing with > hadoop-2.7. > 3. Make {{hive-thriftserverV2}} activate only when testing with > hadoop-3.2. >
[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hui WANG updated SPARK-27555: - Description: I already seen https://issues.apache.org/jira/browse/SPARK-17620 and https://issues.apache.org/jira/browse/SPARK-18397 and I check source code of Spark for the change of set "spark.sql.hive.covertCTAS=true" and then spark will use "spark.sql.sources.default" which is parquet as storage format in "create table as select" scenario. But my case is just create table without select. When I set hive.default.fileformat=parquet in hive-site.xml or set spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after create a table, when i check the hive table, it still use textfile fileformat. It seems HiveSerDe gets the value of the hive.default.fileformat parameter from SQLConf The parameter values in SQLConf are copied from SparkContext's SparkConf at SparkSession initialization, while the configuration parameters in hive-site.xml are loaded into SparkContext's hadoopConfiguration parameters by SharedState, And all the config with "spark.hadoop" conf are setted to hadoopconfig, so the configuration does not take effect. was: I already seen https://issues.apache.org/jira/browse/SPARK-17620 and https://issues.apache.org/jira/browse/SPARK-18397 and I check source code of Spark for the change of set "spark.sql.hive.covertCTAS=true" and then spark will use "spark.sql.sources.default" which is parquet as storage format in "create table as select" scenario. But my case is just create table without select. When I set hive.default.fileformat=parquet in hive-site.xml or set spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after create a table, when i check the hive table, it still use textfile fileformat. 
HiveSerDe gets the value of the hive.default.fileformat parameter from SQLConf The parameter values in SQLConf are copied from SparkContext's SparkConf at SparkSession initialization, while the configuration parameters in hive-site.xml are loaded into SparkContext's hadoopConfiguration parameters by SharedState, And all the config with "spark.hadoop" conf are setted to hadoopconfig, so the configuration does not take effect. > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Priority: Major > > I already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397 > and I check source code of Spark for the change of set > "spark.sql.hive.covertCTAS=true" and then spark will use > "spark.sql.sources.default" which is parquet as storage format in "create > table as select" scenario. > But my case is just create table without select. When I set > hive.default.fileformat=parquet in hive-site.xml or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after > create a table, when i check the hive table, it still use textfile fileformat. > > It seems HiveSerDe gets the value of the hive.default.fileformat parameter > from SQLConf > The parameter values in SQLConf are copied from SparkContext's SparkConf at > SparkSession initialization, while the configuration parameters in > hive-site.xml are loaded into SparkContext's hadoopConfiguration parameters > by SharedState, And all the config with "spark.hadoop" conf are setted to > hadoopconfig, so the configuration does not take effect. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
Hui WANG created SPARK-27555: Summary: cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf Key: SPARK-27555 URL: https://issues.apache.org/jira/browse/SPARK-27555 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2 Reporter: Hui WANG I have already seen https://issues.apache.org/jira/browse/SPARK-17620 and https://issues.apache.org/jira/browse/SPARK-18397 and I checked Spark's source code for the behavior where setting "spark.sql.hive.convertCTAS=true" makes Spark use "spark.sql.sources.default", which is parquet, as the storage format in the "create table as select" scenario. But my case is just creating a table without a select. When I set hive.default.fileformat=parquet in hive-site.xml or set spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after creating a table, when I check the Hive table it still uses the textfile fileformat. HiveSerDe gets the value of the hive.default.fileformat parameter from SQLConf. The parameter values in SQLConf are copied from SparkContext's SparkConf at SparkSession initialization, while the configuration parameters in hive-site.xml are loaded into SparkContext's hadoopConfiguration by SharedState, and all config keys with the "spark.hadoop" prefix are set on the Hadoop configuration, so the setting does not take effect.
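The routing described in the report can be modelled with plain dictionaries. This is an illustrative toy model, not Spark code: SQLConf is seeded from plain SparkConf keys, while "spark.hadoop.*" keys are stripped of their prefix and routed to the Hadoop Configuration only, which HiveSerDe never reads.

```python
# Toy model of the config routing described above (illustrative only):
# keys with the spark.hadoop. prefix go to the Hadoop configuration,
# everything else to SQLConf. HiveSerDe reads SQLConf, so the file
# format set via spark.hadoop.* falls back to the textfile default.
spark_conf = {"spark.hadoop.hive.default.fileformat": "parquet"}

PREFIX = "spark.hadoop."
sql_conf = {k: v for k, v in spark_conf.items() if not k.startswith(PREFIX)}
hadoop_conf = {k[len(PREFIX):]: v for k, v in spark_conf.items()
               if k.startswith(PREFIX)}

print(sql_conf.get("hive.default.fileformat", "textfile"))  # falls back to textfile
print(hadoop_conf["hive.default.fileformat"])               # parquet, but unread by HiveSerDe
```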
[jira] [Assigned] (SPARK-27512) Decimal parsing leads to unexpected type inference
[ https://issues.apache.org/jira/browse/SPARK-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27512: Assignee: Hyukjin Kwon > Decimal parsing leads to unexpected type inference > -- > > Key: SPARK-27512 > URL: https://issues.apache.org/jira/browse/SPARK-27512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark 3.0.0-SNAPSHOT from this commit: > {code:bash} > commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed > Author: Dilip Biswal > Date: Mon Apr 15 21:26:45 2019 +0800 > {code} >Reporter: koert kuipers >Assignee: Hyukjin Kwon >Priority: Minor > > {code:bash} > $ hadoop fs -text test.bsv > x|y > 1|1,2 > 2|2,3 > 3|3,4 > {code} > in spark 2.4.1: > {code:bash} > scala> val data = spark.read.format("csv").option("header", > true).option("delimiter", "|").option("inferSchema", true).load("test.bsv") > scala> data.printSchema > root > |-- x: integer (nullable = true) > |-- y: string (nullable = true) > scala> data.show > +---+---+ > | x| y| > +---+---+ > | 1|1,2| > | 2|2,3| > | 3|3,4| > +---+---+ > {code} > in spark 3.0.0-SNAPSHOT: > {code:bash} > scala> val data = spark.read.format("csv").option("header", > true).option("delimiter", "|").option("inferSchema", true).load("test.bsv") > scala> data.printSchema > root > |-- x: integer (nullable = true) > |-- y: decimal(2,0) (nullable = true) > scala> data.show > +---+---+ > | x| y| > +---+---+ > | 1| 12| > | 2| 23| > | 3| 34| > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27512) Decimal parsing leads to unexpected type inference
[ https://issues.apache.org/jira/browse/SPARK-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27512. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24437 [https://github.com/apache/spark/pull/24437] > Decimal parsing leads to unexpected type inference > -- > > Key: SPARK-27512 > URL: https://issues.apache.org/jira/browse/SPARK-27512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark 3.0.0-SNAPSHOT from this commit: > {code:bash} > commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed > Author: Dilip Biswal > Date: Mon Apr 15 21:26:45 2019 +0800 > {code} >Reporter: koert kuipers >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > {code:bash} > $ hadoop fs -text test.bsv > x|y > 1|1,2 > 2|2,3 > 3|3,4 > {code} > in spark 2.4.1: > {code:bash} > scala> val data = spark.read.format("csv").option("header", > true).option("delimiter", "|").option("inferSchema", true).load("test.bsv") > scala> data.printSchema > root > |-- x: integer (nullable = true) > |-- y: string (nullable = true) > scala> data.show > +---+---+ > | x| y| > +---+---+ > | 1|1,2| > | 2|2,3| > | 3|3,4| > +---+---+ > {code} > in spark 3.0.0-SNAPSHOT: > {code:bash} > scala> val data = spark.read.format("csv").option("header", > true).option("delimiter", "|").option("inferSchema", true).load("test.bsv") > scala> data.printSchema > root > |-- x: integer (nullable = true) > |-- y: decimal(2,0) (nullable = true) > scala> data.show > +---+---+ > | x| y| > +---+---+ > | 1| 12| > | 2| 23| > | 3| 34| > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
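The surprising `decimal(2,0)` inference can be reproduced outside Spark with a small model. This is an illustrative guess at the mechanism, not Spark's actual CSV parser: a lenient decimal parse that treats ',' as a grouping separator turns "1,2" into 12, so every value in column y parses as a two-digit decimal and the column infers decimal(2,0) instead of string.

```python
# Illustrative model: a locale-style parse that ignores ',' grouping
# separators makes "1,2" a valid decimal (12), mirroring the 3.0.0
# inference change reported above.
from decimal import Decimal, InvalidOperation

def lenient_decimal(s: str):
    """Parse s as a decimal, treating ',' as a grouping separator."""
    try:
        return Decimal(s.replace(",", ""))
    except InvalidOperation:
        return None  # not a decimal; a real inferrer would fall back to string

print([lenient_decimal(v) for v in ["1,2", "2,3", "3,4"]])
```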