[jira] [Assigned] (SPARK-27551) Uninformative error message for mismatched types in when().otherwise()

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27551:


Assignee: (was: Apache Spark)

> Uninformative error message for mismatched types in when().otherwise()
> -
>
> Key: SPARK-27551
> URL: https://issues.apache.org/jira/browse/SPARK-27551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Huon Wilson
>Priority: Minor
>
> When a {{when(...).otherwise(...)}} construct has a type error, the error 
> message can be quite uninformative, since it just splats out a potentially 
> large chunk of code and says the types don't match. For instance:
> {code:none}
> scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 
> 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as 
> "y"))))
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = 
> CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS 
> BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * 
> CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data 
> type mismatch: THEN and ELSE expressions should all be same type or coercible 
> to a common type;;
> ...
> {code}
> The problem is that the structs have different field names ({{x}} vs {{y}}), but 
> that is not obvious from the message, and this is a relatively simple case 
> with a single {{select}} expression.
> It would be great for the error message to at least include the types that 
> Spark has computed, to help clarify what might have gone wrong. For instance, 
> {{greatest}} and {{least}} write out the expression with the types instead of 
> values:
> {code:none}
> scala> spark.range(100).select(greatest('id, struct(lit("x"))))
> org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`id`, 
> named_struct('col1', 'x'))' due to data type mismatch: The expressions should 
> all have the same type, got GREATEST(bigint, struct<col1:string>).;;
> {code}
> For the example above, this might look like:
> {code:none}
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = 
> CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS 
> BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * 
> CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data 
> type mismatch: THEN and ELSE expressions should all be same type or coercible 
> to a common type, got CASE WHEN ... THEN array<struct<x:bigint>> ELSE 
> array<struct<y:bigint>> END;;
> {code}
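A quick way to see the mismatch in practice, sketched here for illustration (it assumes a spark-shell session where {{spark}} and {{spark.implicits._}} are in scope; it is not part of the ticket): print the schema of each branch separately, which exposes the differing field names, and the expression resolves once both branches agree on a name:

{code:scala}
import org.apache.spark.sql.functions.{array, struct, when}
import spark.implicits._

val df = spark.range(100)

// Inspect each branch on its own: the element structs differ only in field name.
df.select(array(struct(($"id" * 123456789 + 123456789).as("x")))).printSchema() // element struct has x: long
df.select(array(struct(($"id" * 987654321 + 987654321).as("y")))).printSchema() // element struct has y: long

// Renaming the second branch's field to match makes when/otherwise type-check.
df.select(
  when($"id" === 1, array(struct(($"id" * 123456789 + 123456789).as("x"))))
    .otherwise(array(struct(($"id" * 987654321 + 987654321).as("x"))))).show(3)
{code}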



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27551) Uninformative error message for mismatched types in when().otherwise()

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27551:


Assignee: Apache Spark

> Uninformative error message for mismatched types in when().otherwise()
> -
>
> Key: SPARK-27551
> URL: https://issues.apache.org/jira/browse/SPARK-27551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Huon Wilson
>Assignee: Apache Spark
>Priority: Minor
>
> When a {{when(...).otherwise(...)}} construct has a type error, the error 
> message can be quite uninformative, since it just splats out a potentially 
> large chunk of code and says the types don't match. For instance:
> {code:none}
> scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 
> 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as 
> "y"))))
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = 
> CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS 
> BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * 
> CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data 
> type mismatch: THEN and ELSE expressions should all be same type or coercible 
> to a common type;;
> ...
> {code}
> The problem is that the structs have different field names ({{x}} vs {{y}}), but 
> that is not obvious from the message, and this is a relatively simple case 
> with a single {{select}} expression.
> It would be great for the error message to at least include the types that 
> Spark has computed, to help clarify what might have gone wrong. For instance, 
> {{greatest}} and {{least}} write out the expression with the types instead of 
> values:
> {code:none}
> scala> spark.range(100).select(greatest('id, struct(lit("x"))))
> org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`id`, 
> named_struct('col1', 'x'))' due to data type mismatch: The expressions should 
> all have the same type, got GREATEST(bigint, struct<col1:string>).;;
> {code}
> For the example above, this might look like:
> {code:none}
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = 
> CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS 
> BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * 
> CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data 
> type mismatch: THEN and ELSE expressions should all be same type or coercible 
> to a common type, got CASE WHEN ... THEN array<struct<x:bigint>> ELSE 
> array<struct<y:bigint>> END;;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue

2019-04-24 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825788#comment-16825788
 ] 

Xiao Li commented on SPARK-27216:
-

Thanks, Josh! We will add it to the release note. [~cloud_fan]

> Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
> -
>
> Key: SPARK-27216
> URL: https://issues.apache.org/jira/browse/SPARK-27216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.4, 2.4.2, 3.0.0
>
>
> HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks, but 
> RoaringBitmap 0.5.11 cannot be serialized/deserialized with the unsafe KryoSerializer.
> The unit test below reproduces the issue:
> {code}
>   test("kryo serialization with RoaringBitmap") {
> val bitmap = new RoaringBitmap
> bitmap.add(1787)
> val safeSer = new KryoSerializer(conf).newInstance()
> val bitmap2 : RoaringBitmap = 
> safeSer.deserialize(safeSer.serialize(bitmap))
> assert(bitmap2.equals(bitmap))
> conf.set("spark.kryo.unsafe", "true")
> val unsafeSer = new KryoSerializer(conf).newInstance()
> val bitmap3 : RoaringBitmap = 
> unsafeSer.deserialize(unsafeSer.serialize(bitmap))
> assert(bitmap3.equals(bitmap)) // this will fail
>   }
> {code}
> Upgrading to the latest version, 0.7.45, fixes the issue.
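For builds that cannot move to a Spark release containing this fix, one possible mitigation, sketched here as an assumption rather than advice from the ticket, is to pin the fixed RoaringBitmap version in the application build (the coordinates below are the standard Maven Central ones), or simply to leave {{spark.kryo.unsafe}} at its default of {{false}}:

{code:scala}
// build.sbt fragment (sketch): force the RoaringBitmap version that fixes the
// unsafe Kryo serialization problem onto the classpath.
dependencyOverrides += "org.roaringbitmap" % "RoaringBitmap" % "0.7.45"
{code}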



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27462) Spark hive can not choose some columns in target table flexibly, when running insert into.

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27462:


Assignee: Apache Spark

> Spark hive can not choose some columns in target table flexibly, when running 
> insert into.
> --
>
> Key: SPARK-27462
> URL: https://issues.apache.org/jira/browse/SPARK-27462
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL does not support choosing a subset of columns in the target table 
> when running
> {code:java}
> insert into tableA select ... from tableB;{code}
> This feature is supported by Hive, so I think this grammar should be 
> consistent with Hive.
> Hive supports the following forms of 'insert into':
> {code:java}
> insert into gja_test_spark select * from gja_test;
> insert into gja_test_spark(key, value, other) select key, value, other from 
> gja_test;
> insert into gja_test_spark(key, value) select value, other from gja_test;
> insert into gja_test_spark(key, other) select value, other from gja_test;
> insert into gja_test_spark(value, other) select value, other from 
> gja_test;{code}
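Until such syntax is supported, a workaround sketch (my illustration, assuming the target table has columns {{key}}, {{value}}, {{other}} of type string) is to pad and reorder the projection on the Spark side, since {{DataFrameWriter.insertInto}} matches columns by position rather than by name:

{code:scala}
import org.apache.spark.sql.functions.lit
import spark.implicits._

// Emulates: insert into gja_test_spark(value, other) select value, other from gja_test
// by supplying an explicit NULL for the omitted column and matching the target
// table's column order.
spark.table("gja_test")
  .select(
    lit(null).cast("string").as("key"), // column left out of the insert column list
    $"value",
    $"other")
  .write
  .insertInto("gja_test_spark")
{code}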



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27462) Spark hive can not choose some columns in target table flexibly, when running insert into.

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27462:


Assignee: (was: Apache Spark)

> Spark hive can not choose some columns in target table flexibly, when running 
> insert into.
> --
>
> Key: SPARK-27462
> URL: https://issues.apache.org/jira/browse/SPARK-27462
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL does not support choosing a subset of columns in the target table 
> when running
> {code:java}
> insert into tableA select ... from tableB;{code}
> This feature is supported by Hive, so I think this grammar should be 
> consistent with Hive.
> Hive supports the following forms of 'insert into':
> {code:java}
> insert into gja_test_spark select * from gja_test;
> insert into gja_test_spark(key, value, other) select key, value, other from 
> gja_test;
> insert into gja_test_spark(key, value) select value, other from gja_test;
> insert into gja_test_spark(key, other) select value, other from gja_test;
> insert into gja_test_spark(value, other) select value, other from 
> gja_test;{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2019-04-24 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825783#comment-16825783
 ] 

Josh Rosen commented on SPARK-23178:


This might be fixed by SPARK-27216

> Kryo Unsafe problems with count distinct from cache
> ---
>
> Key: SPARK-23178
> URL: https://issues.apache.org/jira/browse/SPARK-23178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: KIryl Sultanau
>Priority: Minor
> Attachments: Unsafe-issue.png, Unsafe-off.png
>
>
> Spark incorrectly processes cached data when the Kryo and unsafe options are enabled.
> A distinct count over cached data doesn't return correct results. An example is below:
> {quote}val spark = SparkSession
>      .builder
>      .appName("unsafe-issue")
>      .master("local[*]")
>      .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>      .config("spark.kryo.unsafe", "true")
>      .config("spark.kryo.registrationRequired", "false")
>      .getOrCreate()
>     val devicesDF = spark.read.format("csv")
>      .option("header", "true")
>      .option("delimiter", "\t")
>      .load("/data/Devices.tsv").cache()
>     val gatewaysDF = spark.read.format("csv")
>      .option("header", "true")
>      .option("delimiter", "\t")
>      .load("/data/Gateways.tsv").cache()
>     val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
> "inner").cache()
>      devJoinedDF.printSchema()
>     println(devJoinedDF.count())
>     println(devJoinedDF.select("DeviceId").distinct().count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())
> {quote}
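Given the pointer to SPARK-27216 above, and the note on that ticket that the underlying RoaringBitmap bug only triggers with the non-default unsafe Kryo setting, a minimal workaround sketch (same session setup as in the report; my assumption, not a confirmed fix) is to leave {{spark.kryo.unsafe}} at its default:

{code:scala}
import org.apache.spark.sql.SparkSession

// Same session as in the report, but with the unsafe Kryo path left at its
// default (false), which sidesteps the RoaringBitmap ser/deser bug that
// SPARK-27216 identifies as a plausible cause of the wrong counts.
val spark = SparkSession.builder
  .appName("unsafe-issue-workaround")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.unsafe", "false") // default; the corruption needs "true"
  .config("spark.kryo.registrationRequired", "false")
  .getOrCreate()
{code}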



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue

2019-04-24 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825782#comment-16825782
 ] 

Josh Rosen edited comment on SPARK-27216 at 4/25/19 6:38 AM:
-

I've added the {{correctness}} label to this ticket because it sounds like it 
can cause query correctness issues in pre-2.4 versions of Spark (for example, I 
think SPARK-23178 might be a report of this).

Note for future readers: this problem only occurs in a non-default 
configuration (the default is {{spark.kryo.unsafe == false}}).

/cc [~smilegator]


was (Author: joshrosen):
I've added the {{correctness}} label to this ticket because it sounds like it 
can cause query correctness issues in pre-2.4 versions of Spark (for example, I 
think SPARK-27216 might be a report of this).

Note for future readers: this problem only occurs in a non-default 
configuration (the default is {{spark.kryo.unsafe == false}}).

/cc [~smilegator]

> Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
> -
>
> Key: SPARK-27216
> URL: https://issues.apache.org/jira/browse/SPARK-27216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 2.3.4, 2.4.2, 3.0.0
>
>
> HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks, but 
> RoaringBitmap 0.5.11 cannot be serialized/deserialized with the unsafe KryoSerializer.
> The unit test below reproduces the issue:
> {code}
>   test("kryo serialization with RoaringBitmap") {
> val bitmap = new RoaringBitmap
> bitmap.add(1787)
> val safeSer = new KryoSerializer(conf).newInstance()
> val bitmap2 : RoaringBitmap = 
> safeSer.deserialize(safeSer.serialize(bitmap))
> assert(bitmap2.equals(bitmap))
> conf.set("spark.kryo.unsafe", "true")
> val unsafeSer = new KryoSerializer(conf).newInstance()
> val bitmap3 : RoaringBitmap = 
> unsafeSer.deserialize(unsafeSer.serialize(bitmap))
> assert(bitmap3.equals(bitmap)) // this will fail
>   }
> {code}
> Upgrading to the latest version, 0.7.45, fixes the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue

2019-04-24 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27216:
---
Labels: correctness  (was: )

> Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
> -
>
> Key: SPARK-27216
> URL: https://issues.apache.org/jira/browse/SPARK-27216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.4, 2.4.2, 3.0.0
>
>
> HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks, but 
> RoaringBitmap 0.5.11 cannot be serialized/deserialized with the unsafe KryoSerializer.
> The unit test below reproduces the issue:
> {code}
>   test("kryo serialization with RoaringBitmap") {
> val bitmap = new RoaringBitmap
> bitmap.add(1787)
> val safeSer = new KryoSerializer(conf).newInstance()
> val bitmap2 : RoaringBitmap = 
> safeSer.deserialize(safeSer.serialize(bitmap))
> assert(bitmap2.equals(bitmap))
> conf.set("spark.kryo.unsafe", "true")
> val unsafeSer = new KryoSerializer(conf).newInstance()
> val bitmap3 : RoaringBitmap = 
> unsafeSer.deserialize(unsafeSer.serialize(bitmap))
> assert(bitmap3.equals(bitmap)) // this will fail
>   }
> {code}
> Upgrading to the latest version, 0.7.45, fixes the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue

2019-04-24 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825782#comment-16825782
 ] 

Josh Rosen commented on SPARK-27216:


I've added the {{correctness}} label to this ticket because it sounds like it 
can cause query correctness issues in pre-2.4 versions of Spark (for example, I 
think SPARK-27216 might be a report of this).

Note for future readers: this problem only occurs in a non-default 
configuration (the default is {{spark.kryo.unsafe == false}}).

/cc [~smilegator]

> Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
> -
>
> Key: SPARK-27216
> URL: https://issues.apache.org/jira/browse/SPARK-27216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 2.3.4, 2.4.2, 3.0.0
>
>
> HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks, but 
> RoaringBitmap 0.5.11 cannot be serialized/deserialized with the unsafe KryoSerializer.
> The unit test below reproduces the issue:
> {code}
>   test("kryo serialization with RoaringBitmap") {
> val bitmap = new RoaringBitmap
> bitmap.add(1787)
> val safeSer = new KryoSerializer(conf).newInstance()
> val bitmap2 : RoaringBitmap = 
> safeSer.deserialize(safeSer.serialize(bitmap))
> assert(bitmap2.equals(bitmap))
> conf.set("spark.kryo.unsafe", "true")
> val unsafeSer = new KryoSerializer(conf).newInstance()
> val bitmap3 : RoaringBitmap = 
> unsafeSer.deserialize(unsafeSer.serialize(bitmap))
> assert(bitmap3.equals(bitmap)) // this will fail
>   }
> {code}
> Upgrading to the latest version, 0.7.45, fixes the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle

2019-04-24 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825770#comment-16825770
 ] 

Josh Rosen edited comment on SPARK-27530 at 4/25/19 6:32 AM:
-

This specific error message was added in SPARK-24160. As discussed in that 
ticket's PR, zero-sized blocks should never be requested, so the receipt of a 
zero-sized block indicates either a bug in the shuffle-request-issuing logic or 
a data corruption / truncation bug somewhere in the shuffle stack.

*Update*: I see someone linked  SPARK-27216 to this ticket; that sounds like a 
plausible cause / fix.


was (Author: joshrosen):
This specific error message was added in SPARK-24160. As discussed in that 
ticket's PR, zero-sized blocks should never be requested, so the receipt of a 
zero-sized block indicates either a bug in the shuffle-request-issuing logic or 
a data corruption / truncation bug somewhere in the shuffle stack.

*Update*: I see someone linked  SPARK-27216 to this ticket; that sounds like a 
plausible cause.

> FetchFailedException: Received a zero-size buffer for block shuffle
> ---
>
> Key: SPARK-27530
> URL: https://issues.apache.org/jira/browse/SPARK-27530
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Adrian Muraru
>Priority: Major
>
> I'm getting this in a large shuffle:
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer 
> for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 
> 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Received a zero-size buffer for block 
> shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) 
> (expectedApproxSize = 33708, isNetworkReqDone=false){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle

2019-04-24 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825770#comment-16825770
 ] 

Josh Rosen edited comment on SPARK-27530 at 4/25/19 6:32 AM:
-

This specific error message was added in SPARK-24160. As discussed in that 
ticket's PR, zero-sized blocks should never be requested, so the receipt of a 
zero-sized block indicates either a bug in the shuffle-request-issuing logic or 
a data corruption / truncation bug somewhere in the shuffle stack.

*Update*: I see someone linked  SPARK-27216 to this ticket; that sounds like a 
plausible cause.


was (Author: joshrosen):
This specific error message was added in SPARK-24160. As discussed in that 
ticket's PR, zero-sized blocks should never be requested, so the receipt of a 
zero-sized block indicates either a bug in the shuffle-request-issuing logic or 
a data corruption / truncation bug somewhere in the shuffle stack.

> FetchFailedException: Received a zero-size buffer for block shuffle
> ---
>
> Key: SPARK-27530
> URL: https://issues.apache.org/jira/browse/SPARK-27530
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Adrian Muraru
>Priority: Major
>
> I'm getting this in a large shuffle:
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer 
> for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 
> 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Received a zero-size buffer for block 
> shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) 
> (expectedApproxSize = 33708, isNetworkReqDone=false){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-24 Thread Hui WANG (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui WANG updated SPARK-27555:
-
Description: 
*You can see the details in the attached Try.pdf.*

I have already seen https://issues.apache.org/jira/browse/SPARK-17620 

and https://issues.apache.org/jira/browse/SPARK-18397

and I checked the Spark source code for the change where setting 
"spark.sql.hive.convertCTAS=true" makes Spark use 
"spark.sql.sources.default" (parquet) as the storage format in the "create table 
as select" scenario.

But my case is a plain create table without a select. When I set 
hive.default.fileformat=parquet in hive-site.xml or 
spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
create a table, the Hive table still uses the textfile fileformat.

 

It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.

The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
SparkSession initialization, while the configuration parameters in 
hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
configuration, so the setting does not take effect.

 

 

 

  was:
I have already seen https://issues.apache.org/jira/browse/SPARK-17620 

and https://issues.apache.org/jira/browse/SPARK-18397

and I checked the Spark source code for the change where setting 
"spark.sql.hive.convertCTAS=true" makes Spark use 
"spark.sql.sources.default" (parquet) as the storage format in the "create table 
as select" scenario.

But my case is a plain create table without a select. When I set 
hive.default.fileformat=parquet in hive-site.xml or 
spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
create a table, the Hive table still uses the textfile fileformat.

 

It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.

The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
SparkSession initialization, while the configuration parameters in 
hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
configuration, so the setting does not take effect.

 

 

 


> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Priority: Major
> Attachments: Try.pdf
>
>
> *You can see the details in the attached Try.pdf.*
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397
> and I checked the Spark source code for the change where setting 
> "spark.sql.hive.convertCTAS=true" makes Spark use 
> "spark.sql.sources.default" (parquet) as the storage format in the "create 
> table as select" scenario.
> But my case is a plain create table without a select. When I set 
> hive.default.fileformat=parquet in hive-site.xml or 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
> create a table, the Hive table still uses the textfile fileformat.
>  
> It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
> SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
> configuration, so the setting does not take effect.
>  
>  
>  
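A small diagnostic sketch of the mechanism described above (my addition, assuming a spark-shell session): HiveSerDe consults SQLConf, while a value set with the {{spark.hadoop.}} prefix lands only in the Hadoop configuration, so checking both sides shows where the setting actually ended up:

{code:scala}
// Where did hive.default.fileformat land? A value that exists only in the
// Hadoop configuration is invisible to HiveSerDe, which reads SQLConf.
println("SQLConf view:     " + spark.conf.getOption("hive.default.fileformat"))
println("Hadoop conf view: " +
  Option(spark.sparkContext.hadoopConfiguration.get("hive.default.fileformat")))

// Setting the key through SQLConf (spark.conf.set here, or SET ... in spark-sql)
// is the path the description says HiveSerDe actually observes.
spark.conf.set("hive.default.fileformat", "parquet")
{code}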



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-24 Thread Hui WANG (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825772#comment-16825772
 ] 

Hui WANG commented on SPARK-27555:
--

ok, I already attached the details of reproduce in the attachment area called 
Try.pdf.

> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Priority: Major
> Attachments: Try.pdf
>
>
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397
> and I checked the Spark source code for the change where setting 
> "spark.sql.hive.convertCTAS=true" makes Spark use 
> "spark.sql.sources.default" (parquet) as the storage format in the "create 
> table as select" scenario.
> But my case is a plain create table without a select. When I set 
> hive.default.fileformat=parquet in hive-site.xml or 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
> create a table, the Hive table still uses the textfile fileformat.
>  
> It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
> SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
> configuration, so the setting does not take effect.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle

2019-04-24 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825770#comment-16825770
 ] 

Josh Rosen commented on SPARK-27530:


This specific error message was added in SPARK-24160. As discussed in that 
ticket's PR, zero-sized blocks should never be requested, so the receipt of a 
zero-sized block indicates either a bug in the shuffle-request-issuing logic or 
a data corruption / truncation bug somewhere in the shuffle stack.

> FetchFailedException: Received a zero-size buffer for block shuffle
> ---
>
> Key: SPARK-27530
> URL: https://issues.apache.org/jira/browse/SPARK-27530
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Adrian Muraru
>Priority: Major
>
> I'm getting this in a large shuffle:
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer 
> for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 
> 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Received a zero-size buffer for block 
> shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) 
> (expectedApproxSize = 33708, isNetworkReqDone=false){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-24 Thread Hui WANG (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui WANG updated SPARK-27555:
-
Attachment: (was: Try1.jpg)

> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Priority: Major
> Attachments: Try.pdf
>
>
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397
> and I checked the Spark source code for the change where setting 
> "spark.sql.hive.convertCTAS=true" makes Spark use 
> "spark.sql.sources.default" (parquet) as the storage format in the "create 
> table as select" scenario.
> But my case is a plain create table without a select. When I set 
> hive.default.fileformat=parquet in hive-site.xml or 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
> create a table, the Hive table still uses the textfile fileformat.
>  
> It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
> SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
> configuration, so the setting does not take effect.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-24 Thread Hui WANG (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui WANG updated SPARK-27555:
-
Attachment: Try.pdf

> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Priority: Major
> Attachments: Try.pdf
>
>
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397
> and I checked the Spark source code for the change where setting 
> "spark.sql.hive.convertCTAS=true" makes Spark use 
> "spark.sql.sources.default" (parquet) as the storage format in the "create 
> table as select" scenario.
> But my case is a plain create table without a select. When I set 
> hive.default.fileformat=parquet in hive-site.xml or 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
> create a table, the Hive table still uses the textfile fileformat.
>  
> It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
> SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
> configuration, so the setting does not take effect.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-24 Thread Hui WANG (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui WANG updated SPARK-27555:
-
Attachment: Try1.jpg

> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Priority: Major
> Attachments: Try1.jpg
>
>
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397
> and I checked the Spark source code for the change where setting 
> "spark.sql.hive.convertCTAS=true" makes Spark use 
> "spark.sql.sources.default" (parquet) as the storage format in the "create 
> table as select" scenario.
> But my case is a plain create table without a select. When I set 
> hive.default.fileformat=parquet in hive-site.xml or 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
> create a table, the Hive table still uses the textfile fileformat.
>  
> It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
> SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
> configuration, so the setting does not take effect.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-24 Thread Hui WANG (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui WANG updated SPARK-27555:
-
Description: 
I have already seen https://issues.apache.org/jira/browse/SPARK-17620 

and https://issues.apache.org/jira/browse/SPARK-18397

and I checked the Spark source code for the change where setting 
"spark.sql.hive.convertCTAS=true" makes Spark use 
"spark.sql.sources.default" (parquet) as the storage format in the "create table 
as select" scenario.

But my case is a plain create table without a select. When I set 
hive.default.fileformat=parquet in hive-site.xml or 
spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
create a table, the Hive table still uses the textfile fileformat.

 

It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.

The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
SparkSession initialization, while the configuration parameters in 
hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
configuration, so the setting does not take effect.

 

 

 

  was:
I have already seen https://issues.apache.org/jira/browse/SPARK-17620 

and https://issues.apache.org/jira/browse/SPARK-18397

and I checked the Spark source code for the change where setting 
"spark.sql.hive.convertCTAS=true" makes Spark use 
"spark.sql.sources.default" (parquet) as the storage format in the "create table 
as select" scenario.

But my case is a plain create table without a select. When I set 
hive.default.fileformat=parquet in hive-site.xml or 
spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
create a table, the Hive table still uses the textfile fileformat.

 

It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.

The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
SparkSession initialization, while the configuration parameters in 
hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
configuration, so the setting does not take effect.

 

 

Try1:

vi /opt/spark/conf/spark-defaults.conf

hive.default.fileformat parquetfile

Then open spark-sql; it tells me:

 

 

Try2:

vi /opt/spark/conf/spark-defaults.conf

spark.hadoop.hive.default.fileformat parquetfile

Then open spark-sql:

 

 

And Then, I set hive.default.fileformat directly in current spark-sql repl,

 

 

 

Try3:

Edit hive-site.xml directly.

vi hive-site.xml

 

Then open spark-sql:

 


> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Priority: Major
>
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397
> and I checked the Spark source code for the change where setting 
> "spark.sql.hive.convertCTAS=true" makes Spark use 
> "spark.sql.sources.default" (parquet) as the storage format in the "create 
> table as select" scenario.
> But my case is a plain create table without a select. When I set 
> hive.default.fileformat=parquet in hive-site.xml or 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
> create a table, the Hive table still uses the textfile fileformat.
>  
> It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
> SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
> configuration, so the setting does not take effect.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-24 Thread Hui WANG (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui WANG updated SPARK-27555:
-
Description: 
I have already seen https://issues.apache.org/jira/browse/SPARK-17620 

and https://issues.apache.org/jira/browse/SPARK-18397

and I checked the Spark source code for the change where setting 
"spark.sql.hive.convertCTAS=true" makes Spark use 
"spark.sql.sources.default" (parquet) as the storage format in the "create table 
as select" scenario.

But my case is a plain create table without a select. When I set 
hive.default.fileformat=parquet in hive-site.xml or 
spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
create a table, the Hive table still uses the textfile fileformat.

 

It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.

The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
SparkSession initialization, while the configuration parameters in 
hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
configuration, so the setting does not take effect.

 

 

Try1:

vi /opt/spark/conf/spark-defaults.conf

hive.default.fileformat parquetfile

Then open spark-sql; it tells me:

 

 

Try2:

vi /opt/spark/conf/spark-defaults.conf

spark.hadoop.hive.default.fileformat parquetfile

Then open spark-sql:

 

 

And Then, I set hive.default.fileformat directly in current spark-sql repl,

 

 

 

Try3:

Edit hive-site.xml directly.

vi hive-site.xml

 

Then open spark-sql:

 

  was:
I have already seen https://issues.apache.org/jira/browse/SPARK-17620 

and https://issues.apache.org/jira/browse/SPARK-18397

and I checked the Spark source code for the change where setting 
"spark.sql.hive.convertCTAS=true" makes Spark use 
"spark.sql.sources.default" (parquet) as the storage format in the "create table 
as select" scenario.

But my case is a plain create table without a select. When I set 
hive.default.fileformat=parquet in hive-site.xml or 
spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
create a table, the Hive table still uses the textfile fileformat.

 

It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.

The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
SparkSession initialization, while the configuration parameters in 
hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
configuration, so the setting does not take effect.


> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Priority: Major
>
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397
> and I checked the Spark source code for the change where setting 
> "spark.sql.hive.convertCTAS=true" makes Spark use 
> "spark.sql.sources.default" (parquet) as the storage format in the "create 
> table as select" scenario.
> But my case is a plain create table without a select. When I set 
> hive.default.fileformat=parquet in hive-site.xml or 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf and then 
> create a table, the Hive table still uses the textfile fileformat.
>  
> It seems HiveSerDe reads the hive.default.fileformat parameter from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
> SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all configs prefixed with "spark.hadoop" are set on the Hadoop 
> configuration, so the setting does not take effect.
>  
>  
> Try1:
> vi /opt/spark/conf/spark-defaults.conf
> hive.default.fileformat parquetfile
> Then open spark-sql; it tells me:
>  
>  
> Try2:
> vi /opt/spark/conf/spark-defaults.conf
> spark.hadoop.hive.default.fileformat parquetfile
> Then open spark-sql:
>  
>  
> And Then, I set hive.default.fileformat directly in current spark-sql repl,
>  
>  
>  
> Try3:
> Edit hive-site.xml directly.
> vi hive-site.xml
>  
> Then open spark-sql:
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-27494) Null values don't work in Kafka source v2

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27494:


Assignee: (was: Apache Spark)

> Null values don't work in Kafka source v2
> -
>
> Key: SPARK-27494
> URL: https://issues.apache.org/jira/browse/SPARK-27494
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> Right now Kafka source v2 doesn't support null values. The issue is in 
> org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow 
> which doesn't handle null values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27494) Null values don't work in Kafka source v2

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27494:


Assignee: Apache Spark

> Null values don't work in Kafka source v2
> -
>
> Key: SPARK-27494
> URL: https://issues.apache.org/jira/browse/SPARK-27494
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> Right now Kafka source v2 doesn't support null values. The issue is in 
> org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow 
> which doesn't handle null values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27526) Driver OOM error occurs while writing parquet file with Append mode

2019-04-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825764#comment-16825764
 ] 

Hyukjin Kwon commented on SPARK-27526:
--

Please avoid setting the target version; it is usually reserved for committers.

> Driver OOM error occurs while writing parquet file with Append mode
> ---
>
> Key: SPARK-27526
> URL: https://issues.apache.org/jira/browse/SPARK-27526
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.1
> Environment: centos6.7
>Reporter: senyoung
>Priority: Major
>  Labels: oom
>
> Given the user code below:
> {code:java}
> someDataFrame.write
> .mode(SaveMode.Append)
> .partitionBy(somePartitionKeySeqs)
> .parquet(targetPath);
> {code}
> When Spark tries to write parquet files into HDFS with SaveMode.Append, it 
> must check that the requested partition columns match those of the existing 
> files. However, this check caches all leaf fileInfos under the "targetPath", 
> which can easily trigger an OOM when there are too many files in the 
> targetPath. This behavior is useful when exact correctness is needed, but I 
> think it should be optional so that the OOM can be avoided.
> The relevant code is here:
> {code:java}
> //package org.apache.spark.sql.execution.datasources
> //case class DataSource
> private def writeInFileFormat(format: FileFormat, mode: SaveMode, data: 
> DataFrame): Unit = {
> ...
> /**
>  * Can we make it optional?
>  */
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>   /**
>    * getOrInferFileFormatSchema(format, justPartitioning = true) may cause an
>    * OOM when there are too many files; could we just sample a limited number
>    * of files rather than scanning all existing files?
>    */
> getOrInferFileFormatSchema(format, justPartitioning = true)
> ._2.fieldNames.toList
>   }.getOrElse(Seq.empty[String])
>   val sameColumns =
> existingPartitionColumns.map(_.toLowerCase()) == 
> partitionColumns.map(_.toLowerCase())
>   if (existingPartitionColumns.nonEmpty && !sameColumns) {
> throw new AnalysisException(
>   s"""Requested partitioning does not match existing partitioning.
>  |Existing partitioning columns:
>  |  ${existingPartitionColumns.mkString(", ")}
>  |Requested partitioning columns:
>  |  ${partitionColumns.mkString(", ")}
>  |""".stripMargin)
>   }
> }
> ...
> }
> private def getOrInferFileFormatSchema(
> format: FileFormat,
> justPartitioning: Boolean = false): (StructType, StructType) = {
>   lazy val tempFileIndex = {
> val allPaths = caseInsensitiveOptions.get("path") ++ paths
> val hadoopConf = sparkSession.sessionState.newHadoopConf()
> val globbedPaths = allPaths.toSeq.flatMap { path =>
>   val hdfsPath = new Path(path)
>   val fs = hdfsPath.getFileSystem(hadoopConf)
>   val qualified = hdfsPath.makeQualified(fs.getUri, 
> fs.getWorkingDirectory)
>   SparkHadoopUtil.get.globPathIfNecessary(qualified)
> }.toArray
>     /**
>      * InMemoryFileIndex.refresh0() caches info for all files: OOM risk.
>      */
> new InMemoryFileIndex(sparkSession, globbedPaths, options, None)
>   }
> ...
> }
> {code}
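> A rough sketch of the opt-out being asked for; the config name 
> spark.sql.sources.validatePartitionColumnsOnAppend below is purely 
> illustrative, not an existing Spark setting:
> {code:java}
> // hypothetical guard around the existing check in writeInFileFormat
> val validateOnAppend = sparkSession.sessionState.conf
>   .getConfString("spark.sql.sources.validatePartitionColumnsOnAppend", "true")
>   .toBoolean
> if (mode == SaveMode.Append && validateOnAppend) {
>   // ... existing partition-column comparison, unchanged ...
> }
> {code}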
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27526) Driver OOM error occurs while writing parquet file with Append mode

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27526:
-
Target Version/s:   (was: 2.1.1)

> Driver OOM error occurs while writing parquet file with Append mode
> ---
>
> Key: SPARK-27526
> URL: https://issues.apache.org/jira/browse/SPARK-27526
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.1
> Environment: centos6.7
>Reporter: senyoung
>Priority: Major
>  Labels: oom
>
> Given the user code below:
> {code:java}
> someDataFrame.write
> .mode(SaveMode.Append)
> .partitionBy(somePartitionKeySeqs)
> .parquet(targetPath);
> {code}
> When Spark tries to write parquet files into HDFS with SaveMode.Append, it 
> must check that the requested partition columns match those of the existing 
> files. However, this check caches all leaf fileInfos under the "targetPath", 
> which can easily trigger an OOM when there are too many files in the 
> targetPath. This behavior is useful when exact correctness is needed, but I 
> think it should be optional so that the OOM can be avoided.
> The relevant code is here:
> {code:java}
> //package org.apache.spark.sql.execution.datasources
> //case class DataSource
> private def writeInFileFormat(format: FileFormat, mode: SaveMode, data: 
> DataFrame): Unit = {
> ...
> /**
>  * Can we make it optional?
>  */
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>   /**
>    * getOrInferFileFormatSchema(format, justPartitioning = true) may cause an
>    * OOM when there are too many files; could we just sample a limited number
>    * of files rather than scanning all existing files?
>    */
> getOrInferFileFormatSchema(format, justPartitioning = true)
> ._2.fieldNames.toList
>   }.getOrElse(Seq.empty[String])
>   val sameColumns =
> existingPartitionColumns.map(_.toLowerCase()) == 
> partitionColumns.map(_.toLowerCase())
>   if (existingPartitionColumns.nonEmpty && !sameColumns) {
> throw new AnalysisException(
>   s"""Requested partitioning does not match existing partitioning.
>  |Existing partitioning columns:
>  |  ${existingPartitionColumns.mkString(", ")}
>  |Requested partitioning columns:
>  |  ${partitionColumns.mkString(", ")}
>  |""".stripMargin)
>   }
> }
> ...
> }
> private def getOrInferFileFormatSchema(
> format: FileFormat,
> justPartitioning: Boolean = false): (StructType, StructType) = {
>   lazy val tempFileIndex = {
> val allPaths = caseInsensitiveOptions.get("path") ++ paths
> val hadoopConf = sparkSession.sessionState.newHadoopConf()
> val globbedPaths = allPaths.toSeq.flatMap { path =>
>   val hdfsPath = new Path(path)
>   val fs = hdfsPath.getFileSystem(hadoopConf)
>   val qualified = hdfsPath.makeQualified(fs.getUri, 
> fs.getWorkingDirectory)
>   SparkHadoopUtil.get.globPathIfNecessary(qualified)
> }.toArray
>     /**
>      * InMemoryFileIndex.refresh0() caches info for all files: OOM risk.
>      */
> new InMemoryFileIndex(sparkSession, globbedPaths, options, None)
>   }
> ...
> }
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27529) Spark Streaming consumer dies with kafka.common.OffsetOutOfRangeException

2019-04-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825763#comment-16825763
 ] 

Hyukjin Kwon commented on SPARK-27529:
--

Spark 1.x is EOL. Can you check if it works in a higher version?

> Spark Streaming consumer dies with kafka.common.OffsetOutOfRangeException
> -
>
> Key: SPARK-27529
> URL: https://issues.apache.org/jira/browse/SPARK-27529
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.0
>Reporter: Dmitry Goldenberg
>Priority: Major
>
> We have a Spark Streaming consumer which at a certain point started 
> consistently failing upon a restart with the below error.
> Some details:
>  * Spark version is 1.5.0.
>  * Kafka version is 0.8.2.1 (2.10-0.8.2.1).
>  * The topic is configured with: retention.ms=1471228928, 
> max.message.bytes=1.
>  * The consumer runs with auto.offset.reset=smallest.
>  * No checkpointing is currently enabled.
> I don't see anything in the Spark or Kafka doc to understand why this is 
> happening. From googling around,
> {noformat}
> https://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/
> Finally, I’ll repeat that any semantics beyond at-most-once require that you 
> have sufficient log retention in Kafka. If you’re seeing things like 
> OffsetOutOfRangeException, it’s probably because you underprovisioned Kafka 
> storage, not because something’s wrong with Spark or Kafka.{noformat}
> Also looking at SPARK-12693 and SPARK-11693, I don't understand the possible 
> causes.
> {noformat}
> You've under-provisioned Kafka storage and / or Spark compute capacity.
> The result is that data is being deleted before it has been 
> processed.{noformat}
> All we're trying to do is start the consumer and consume from the topic from 
> the earliest available offset. Why would we not be able to do that? How can 
> the offsets be out of range if we're saying, just read from the earliest 
> available?
> Since we have the retention.ms set to 1 year and we created the topic just a 
> few weeks ago, I'd not expect any deletion being done by Kafka as we're 
> consuming.
> I'd like to understand the actual cause of this error. Any recommendations on 
> a workaround would be appreciated.
> Stack traces:
> {noformat}
> 2019-04-19 11:35:17,147 ERROR org.apache.spark.scheduler
> .TaskSetManager: Task 10 in stage 147.0 failed 4 times; aborting job
> 2019-04-19 11:35:17,160 ERROR 
> org.apache.spark.streaming.scheduler.JobScheduler: Error running job 
> streaming job 1555682554000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in 
> stage 147.0 failed 4 times, most recent failure: Lost task
> 10.3 in stage 147.0 (TID 2368, 10.150.0.58): 
> kafka.common.OffsetOutOfRangeException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at java.lang.Class.newInstance(Class.java:442)
> at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86)
> at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
> at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
> at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
> at 
> com.acme.consumer.kafka.spark.ProcessPartitionFunction.call(ProcessPartitionFunction.java:69)
> at 
> com.acme.consumer.kafka.spark.ProcessPartitionFunction.call(ProcessPartitionFunction.java:24)
> at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:222)
> at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:222)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala

[jira] [Commented] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle

2019-04-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825762#comment-16825762
 ] 

Hyukjin Kwon commented on SPARK-27530:
--

Please avoid setting Critical+, which is usually reserved for committers.

> FetchFailedException: Received a zero-size buffer for block shuffle
> ---
>
> Key: SPARK-27530
> URL: https://issues.apache.org/jira/browse/SPARK-27530
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Adrian Muraru
>Priority: Major
>
> I'm getting this in a large shuffle:
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer 
> for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 
> 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Received a zero-size buffer for block 
> shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) 
> (expectedApproxSize = 33708, isNetworkReqDone=false){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27530:
-
Priority: Major  (was: Critical)

> FetchFailedException: Received a zero-size buffer for block shuffle
> ---
>
> Key: SPARK-27530
> URL: https://issues.apache.org/jira/browse/SPARK-27530
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Adrian Muraru
>Priority: Major
>
> I'm getting this in a large shuffle:
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer 
> for block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 
> 39439, None) (expectedApproxSize = 33708, isNetworkReqDone=false)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Received a zero-size buffer for block 
> shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) 
> (expectedApproxSize = 33708, isNetworkReqDone=false){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27477) Kafka token provider should have provided dependency on Spark

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27477:


Assignee: Apache Spark

> Kafka token provider should have provided dependency on Spark
> -
>
> Key: SPARK-27477
> URL: https://issues.apache.org/jira/browse/SPARK-27477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark 3.0.0-SNAPSHOT
> commit 38fc8e2484aa4971d1f2c115da61fc96f36e7868
> Author: Sean Owen 
> Date:   Sat Apr 13 22:27:25 2019 +0900
> [MINOR][DOCS] Fix some broken links in docs
>Reporter: koert kuipers
>Assignee: Apache Spark
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Currently the external module spark-token-provider-kafka-0-10 has a compile 
> dependency on spark-core. This means spark-sql-kafka-0-10 also has a 
> transitive compile dependency on spark-core.
> Since spark-sql-kafka-0-10 is not bundled with Spark but instead has to be 
> added to an application that runs on Spark, this dependency should be 
> provided, not compile.
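> For comparison, this is roughly how an application marks the artifacts it 
> expects the Spark installation to provide (sbt syntax; the version number is 
> illustrative):
> {code}
> libraryDependencies ++= Seq(
>   // provided by the cluster's Spark installation at runtime
>   "org.apache.spark" %% "spark-core" % "3.0.0" % "provided",
>   // bundled with the application; currently drags spark-core back in at compile scope
>   "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.0"
> )
> {code}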



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27477) Kafka token provider should have provided dependency on Spark

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27477:


Assignee: (was: Apache Spark)

> Kafka token provider should have provided dependency on Spark
> -
>
> Key: SPARK-27477
> URL: https://issues.apache.org/jira/browse/SPARK-27477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark 3.0.0-SNAPSHOT
> commit 38fc8e2484aa4971d1f2c115da61fc96f36e7868
> Author: Sean Owen 
> Date:   Sat Apr 13 22:27:25 2019 +0900
> [MINOR][DOCS] Fix some broken links in docs
>Reporter: koert kuipers
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Currently the external module spark-token-provider-kafka-0-10 has a compile 
> dependency on spark-core. This means spark-sql-kafka-0-10 also has a 
> transitive compile dependency on spark-core.
> Since spark-sql-kafka-0-10 is not bundled with Spark but instead has to be 
> added to an application that runs on Spark, this dependency should be 
> provided, not compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27537) spark-2.4.1/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value size is not a member of Object

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27537.
--
Resolution: Duplicate

JDK 11 support is in progress

> spark-2.4.1/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value
>  size is not a member of Object
> -
>
> Key: SPARK-27537
> URL: https://issues.apache.org/jira/browse/SPARK-27537
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.3.0, 2.4.1
> Environment: Machine:aarch64
> OS:Red Hat Enterprise Linux Server release 7.4
> Kernel:4.11.0-44.el7a
> spark version: spark-2.4.1
> java:openjdk version "11.0.2" 2019-01-15
>           OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.2+9)
> scala:2.11.12
> gcc version:4.8.5
>Reporter: dingwei2019
>Priority: Major
>  Labels: build, test
>
> {code}
> [ERROR]: [Error] 
> $SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value
>  size is not a member of Object
> [ERROR]: [Error] 
> $SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.869:Value
>  size is not a member of Object
> {code}
> ERROR: two errors found
> Below is the related code:
> {code}
> test("toString") {
>   val empty = Matrices.ones(0, 0)
>   empty.toString(0, 0)
>  
>   val mat = Matrices.rand(5, 10, new Random())
>   mat.toString(-1, -5)
>   mat.toString(0, 0)
>   mat.toString(Int.MinValue, Int.MinValue)
>   mat.toString(Int.MaxValue, Int.MaxValue)
>   var lines = mat.toString(6, 50).lines.toArray
>   assert(lines.size == 5 && lines.forall(_.size <= 50))
>  
>   lines = mat.toString(5, 100).lines.toArray
>   assert(lines.size == 5 && lines.forall(_.size <= 100))
> }
> {code}
> {code}
> test("numNonzeros and numActives") {
>   val dm1 = Matrices.dense(3, 2, Array(0, 0, -1, 1, 0, 1))
>   assert(dm1.numNonzeros === 3)
>   assert(dm1.numActives === 6)
>   val sm1 = Matrices.sparse(3, 2, Array(0, 2, 3), Array(0, 2, 1), Array(0.0, 
> -1.2, 0.0))
>   assert(sm1.numNonzeros === 1)
>   assert(sm1.numActives === 3)
> }
> {code}
> What shall I do to solve this problem, and when will Spark support JDK 11?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27540) Add 'meanAveragePrecision_at_k' metric to RankingMetrics

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27540:


Assignee: (was: Apache Spark)

> Add 'meanAveragePrecision_at_k' metric to RankingMetrics
> 
>
> Key: SPARK-27540
> URL: https://issues.apache.org/jira/browse/SPARK-27540
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.4.1
>Reporter: Pham Nguyen Tuan Anh
>Priority: Minor
>
> Sometimes, we only focus on MAP of top-k results.
> This ticket adds MAP@k to RankingMetrics, besides existing MAP.
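> For context, a minimal sketch against the existing API (run in spark-shell; 
> the commented-out MAP@k call is the proposed addition, not an existing 
> method):
> {code}
> import org.apache.spark.mllib.evaluation.RankingMetrics
> // (ranked predictions, ground-truth relevant items) per query
> val predictionAndLabels = sc.parallelize(Seq(
>   (Array(1, 2, 3, 4, 5), Array(1, 3)),
>   (Array(6, 7, 8, 9, 10), Array(7))
> ))
> val metrics = new RankingMetrics(predictionAndLabels)
> metrics.meanAveragePrecision             // MAP over the full ranking (already available)
> metrics.precisionAt(3)                   // precision@k (already available)
> // metrics.meanAveragePrecisionAt(3)     // the MAP@k this ticket proposes
> {code}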



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27540) Add 'meanAveragePrecision_at_k' metric to RankingMetrics

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27540:


Assignee: Apache Spark

> Add 'meanAveragePrecision_at_k' metric to RankingMetrics
> 
>
> Key: SPARK-27540
> URL: https://issues.apache.org/jira/browse/SPARK-27540
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.4.1
>Reporter: Pham Nguyen Tuan Anh
>Assignee: Apache Spark
>Priority: Minor
>
> Sometimes, we only focus on MAP of top-k results.
> This ticket adds MAP@k to RankingMetrics, besides existing MAP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27537) spark-2.4.1/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value size is not a member of Object

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27537:
-
Description: 
{code}
[ERROR]: [Error] 
$SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value
 size is not a member of Object
[ERROR]: [Error] 
$SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.869:Value
 size is not a member of Object
{code}

ERROR: two errors found

Below is the related code:

{code}
test("toString") {
  val empty = Matrices.ones(0, 0)
  empty.toString(0, 0)
 
  val mat = Matrices.rand(5, 10, new Random())
  mat.toString(-1, -5)
  mat.toString(0, 0)
  mat.toString(Int.MinValue, Int.MinValue)
  mat.toString(Int.MaxValue, Int.MaxValue)
  var lines = mat.toString(6, 50).lines.toArray
  assert(lines.size == 5 && lines.forall(_.size <= 50))
 
  lines = mat.toString(5, 100).lines.toArray
  assert(lines.size == 5 && lines.forall(_.size <= 100))
}
{code}

{code}
test("numNonzeros and numActives") {
  val dm1 = Matrices.dense(3, 2, Array(0, 0, -1, 1, 0, 1))
  assert(dm1.numNonzeros === 3)
  assert(dm1.numActives === 6)

  val sm1 = Matrices.sparse(3, 2, Array(0, 2, 3), Array(0, 2, 1), Array(0.0, 
-1.2, 0.0))
  assert(sm1.numNonzeros === 1)
  assert(sm1.numActives === 3)
}
{code}


What shall I do to solve this problem, and when will Spark support JDK 11?


  was:
[ERROR]: [Error] 
$SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value
 size is not a member of Object

[ERROR]: [Error] 
$SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.869:Value
 size is not a member of Object

ERROR: two errors found

Below is the related code:
856   test("toString") {
857 val empty = Matrices.ones(0, 0)
858 empty.toString(0, 0)
859 
860 val mat = Matrices.rand(5, 10, new Random())
861 mat.toString(-1, -5)
862 mat.toString(0, 0)
863 mat.toString(Int.MinValue, Int.MinValue)
864 mat.toString(Int.MaxValue, Int.MaxValue)
865 var lines = mat.toString(6, 50).lines.toArray
866 assert(lines.size == 5 && lines.forall(_.size <= 50))
867 
868 lines = mat.toString(5, 100).lines.toArray
869 assert(lines.size == 5 && lines.forall(_.size <= 100))
870   }
871 
872   test("numNonzeros and numActives") {
873 val dm1 = Matrices.dense(3, 2, Array(0, 0, -1, 1, 0, 1))
874 assert(dm1.numNonzeros === 3)
875 assert(dm1.numActives === 6)
876 
877 val sm1 = Matrices.sparse(3, 2, Array(0, 2, 3), Array(0, 2, 1), 
Array(0.0, -1.2, 0.0))
878 assert(sm1.numNonzeros === 1)
879 assert(sm1.numActives === 3)


what shall i do to solve this problem, and when will spark support jdk11?



> spark-2.4.1/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value
>  size is not a member of Object
> -
>
> Key: SPARK-27537
> URL: https://issues.apache.org/jira/browse/SPARK-27537
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.3.0, 2.4.1
> Environment: Machine:aarch64
> OS:Red Hat Enterprise Linux Server release 7.4
> Kernel:4.11.0-44.el7a
> spark version: spark-2.4.1
> java:openjdk version "11.0.2" 2019-01-15
>           OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.2+9)
> scala:2.11.12
> gcc version:4.8.5
>Reporter: dingwei2019
>Priority: Major
>  Labels: build, test
>
> {code}
> [ERROR]: [Error] 
> $SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value
>  size is not a member of Object
> [ERROR]: [Error] 
> $SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.869:Value
>  size is not a member of Object
> {code}
> ERROR: two errors found
> Below is the related code:
> {code}
> test("toString") {
>   val empty = Matrices.ones(0, 0)
>   empty.toString(0, 0)
>  
>   val mat = Matrices.rand(5, 10, new Random())
>   mat.toString(-1, -5)
>   mat.toString(0, 0)
>   mat.toString(Int.MinValue, Int.MinValue)
>   mat.toString(Int.MaxValue, Int.MaxValue)
>   var lines = mat.toString(6, 50).lines.toArray
>   assert(lines.size == 5 && lines.forall(_.size <= 50))
>  
>   lines = mat.toString(5, 100).lines.toArray
>   assert(lines.size == 5 && lines.forall(_.size <= 100))
> }
> {code}
> {code}
> test("numNonzeros and numActives") {
>   val dm1 = Matrices.dense(3, 2, Array(0, 0, -1, 1, 0, 1))
>   assert(dm1.numNonzeros === 3)
>   assert(dm1.numActives === 6)
>   val sm1 = Matrices.sparse(3, 2, Array(0, 2, 3), Array(0, 2, 1), Array(0.0, 
> -1.2, 0.0))
>   assert(sm1.numNonzeros === 1)
>   assert(sm1.numActives === 3)
> }
> {code}
> What shall I do to solve this problem, and when will Spark support JDK 11?

[jira] [Comment Edited] (SPARK-27537) spark-2.4.1/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value size is not a member of Object

2019-04-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822845#comment-16822845
 ] 

Hyukjin Kwon edited comment on SPARK-27537 at 4/25/19 6:05 AM:
---

The problem is found in the Spark ML test module; although it is only a test 
module, I want to figure it out.
From the description above, it seems to be an incompatibility between Java 11 
and Scala 2.11.12.

If I change my JDK to JDK 8, there is no problem.

Below is my analysis:

It seems that if a method already exists on the Java class, that method is 
used; otherwise the Scala extension method is used.
The 'String' class in Java 11 adds a lines() method, and this method conflicts 
with the Scala extension method.

Scala has a lines method on 'StringLike' which returns an Iterator;
Iterator in Scala has a toArray method which returns an Array;
Array in Scala has a size method, so when the Scala method is picked there is 
no problem:
lines (Iterator) --> toArray (Array) --> size

But Java 11 adds the lines method on 'String', which returns a Stream;
Stream in Java 11 has a toArray method, which returns Object;
Object has no 'size' member, which is what the error says:
lines (Stream) --> toArray (Object) --> no size member.

What shall I do to solve this problem?
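A minimal sketch of one way to dodge the ambiguity in the test above (mat as in 
the quoted test); this just sidesteps the extension method with an explicit 
split and is not necessarily the official fix:

{code}
// instead of: var lines = mat.toString(6, 50).lines.toArray
val lines = mat.toString(6, 50).split("\n")   // plain Array[String], no Java/Scala overload clash
assert(lines.length == 5 && lines.forall(_.length <= 50))
{code}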


was (Author: dingwei2019):
the question is found in spark ml test module, althrough this is an test 
module, i want to figure it out.
from the describe above, it seems an incompatible problem between java 11 and 
scala 2.11.12.

if I change my jdk to jdk8, and there is no problem.

Below is my analysis:

it seems in spark if  a method has implementation in java, spark will use java 
method, or will use scala method.
 'string' class in java11 adds the lines method. This method conflicts with the 
scala syntax.

scala has lines method in 'stringlike' class, the method return an Iterator;
Iterator in scala has a toArray method, the method return an Array;
the class array in scala has a size method. so if spark use scala method, it 
will have no problem.
lines(Iterator)-->toArray(Array)-->size

But Java11 adds lines method in 'string', this will return a Stream;
Stream in java11 has toArray method, and will return Object;
Object has no 'size' method. This is what the error says.
(Stream)-->(Object)toArray-->has no size method.

what shall i do to solve this problem.

> spark-2.4.1/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value
>  size is not a member of Object
> -
>
> Key: SPARK-27537
> URL: https://issues.apache.org/jira/browse/SPARK-27537
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.3.0, 2.4.1
> Environment: Machine:aarch64
> OS:Red Hat Enterprise Linux Server release 7.4
> Kernel:4.11.0-44.el7a
> spark version: spark-2.4.1
> java:openjdk version "11.0.2" 2019-01-15
>           OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.2+9)
> scala:2.11.12
> gcc version:4.8.5
>Reporter: dingwei2019
>Priority: Major
>  Labels: build, test
>
> {code}
> [ERROR]: [Error] 
> $SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.866:Value
>  size is not a member of Object
> [ERROR]: [Error] 
> $SPARK_HOME/mlib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala.869:Value
>  size is not a member of Object
> {code}
> ERROR: two errors found
> Below is the related code:
> {code}
> test("toString") {
>   val empty = Matrices.ones(0, 0)
>   empty.toString(0, 0)
>  
>   val mat = Matrices.rand(5, 10, new Random())
>   mat.toString(-1, -5)
>   mat.toString(0, 0)
>   mat.toString(Int.MinValue, Int.MinValue)
>   mat.toString(Int.MaxValue, Int.MaxValue)
>   var lines = mat.toString(6, 50).lines.toArray
>   assert(lines.size == 5 && lines.forall(_.size <= 50))
>  
>   lines = mat.toString(5, 100).lines.toArray
>   assert(lines.size == 5 && lines.forall(_.size <= 100))
> }
> {code}
> {code}
> test("numNonzeros and numActives") {
>   val dm1 = Matrices.dense(3, 2, Array(0, 0, -1, 1, 0, 1))
>   assert(dm1.numNonzeros === 3)
>   assert(dm1.numActives === 6)
>   val sm1 = Matrices.sparse(3, 2, Array(0, 2, 3), Array(0, 2, 1), Array(0.0, 
> -1.2, 0.0))
>   assert(sm1.numNonzeros === 1)
>   assert(sm1.numActives === 3)
> }
> {code}
> What shall I do to solve this problem, and when will Spark support JDK 11?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22044) explain function with codegen and cost parameters

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22044:


Assignee: (was: Apache Spark)

> explain function with codegen and cost parameters
> -
>
> Key: SPARK-22044
> URL: https://issues.apache.org/jira/browse/SPARK-22044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {{explain}} operator creates {{ExplainCommand}} runnable command that accepts 
> (among other things) {{codegen}} and {{cost}} arguments.
> There's no version of {{explain}} to allow for this. That's however possible 
> using SQL which is kind of surprising (given how much focus is devoted to the 
> Dataset API).
> This is to have another {{explain}} with {{codegen}} and {{cost}} arguments, 
> i.e.
> {code}
> def explain(codegen: Boolean = false, cost: Boolean = false): Unit
> {code}
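> For reference, the SQL route that already exposes these flags, plus the 
> proposed Dataset-side call (commented out, since it does not exist yet):
> {code}
> spark.sql("EXPLAIN CODEGEN SELECT 1").show(truncate = false)
> spark.sql("EXPLAIN COST SELECT 1").show(truncate = false)
> // spark.range(1).explain(codegen = true, cost = false)   // proposed API
> {code}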



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22044) explain function with codegen and cost parameters

2019-04-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22044:


Assignee: Apache Spark

> explain function with codegen and cost parameters
> -
>
> Key: SPARK-22044
> URL: https://issues.apache.org/jira/browse/SPARK-22044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
>
> {{explain}} operator creates {{ExplainCommand}} runnable command that accepts 
> (among other things) {{codegen}} and {{cost}} arguments.
> There's no version of {{explain}} to allow for this. That's however possible 
> using SQL which is kind of surprising (given how much focus is devoted to the 
> Dataset API).
> This is to have another {{explain}} with {{codegen}} and {{cost}} arguments, 
> i.e.
> {code}
> def explain(codegen: Boolean = false, cost: Boolean = false): Unit
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27538) sparksql could not start in jdk11, exception org.datanucleus.exceptions.NucleusException: The java type java.lang.Long (jdbc-type='', sql-type="") cant be mapped for th

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27538.
--
Resolution: Duplicate

> sparksql could not start in jdk11, exception 
> org.datanucleus.exceptions.NucleusException: The java type java.lang.Long 
> (jdbc-type='', sql-type="") cant be mapped for this datastore. No mapping is 
> available.
> --
>
> Key: SPARK-27538
> URL: https://issues.apache.org/jira/browse/SPARK-27538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.1
> Environment: Machine:aarch64
> OS:Red Hat Enterprise Linux Server release 7.4
> Kernel:4.11.0-44.el7a
> spark version: spark-2.4.1
> java:openjdk version "11.0.2" 2019-01-15
>   OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.2+9)
> scala:2.11.12
>Reporter: dingwei2019
>Priority: Major
>  Labels: features
>
> [root@172-19-18-8 spark-2.4.1-bin-hadoop2.7-bak]# bin/spark-sql 
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/home/dingwei/spark-2.4.1-bin-x86/spark-2.4.1-bin-hadoop2.7-bak/jars/spark-unsafe_2.11-2.4.1.jar)
>  to method java.nio.Bits.unaligned()
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 2019-04-22 11:27:34,419 WARN util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> 2019-04-22 11:27:35,306 INFO metastore.HiveMetaStore: 0: Opening raw store 
> with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 2019-04-22 11:27:35,330 INFO metastore.ObjectStore: ObjectStore, initialize 
> called
> 2019-04-22 11:27:35,492 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 2019-04-22 11:27:35,492 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 2019-04-22 11:27:37,012 INFO metastore.ObjectStore: Setting MetaStore object 
> pin classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 2019-04-22 11:27:37,638 WARN DataNucleus.Query: Query for candidates of 
> org.apache.hadoop.hive.metastore.model.MDatabase and subclasses resulted in 
> no possible candidates
> The java type java.lang.Long (jdbc-type="", sql-type="") cant be mapped for 
> this datastore. No mapping is available.
> org.datanucleus.exceptions.NucleusException: The java type java.lang.Long 
> (jdbc-type="", sql-type="") cant be mapped for this datastore. No mapping is 
> available.
>  at 
> org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.getDatastoreMappingClass(RDBMSMappingManager.java:1215)
>  at 
> org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.createDatastoreMapping(RDBMSMappingManager.java:1378)
>  at 
> org.datanucleus.store.rdbms.table.AbstractClassTable.addDatastoreId(AbstractClassTable.java:392)
>  at 
> org.datanucleus.store.rdbms.table.ClassTable.initializePK(ClassTable.java:1087)
>  at 
> org.datanucleus.store.rdbms.table.ClassTable.preInitialize(ClassTable.java:247)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTable(RDBMSStoreManager.java:3118)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTables(RDBMSStoreManager.java:2909)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTablesAndValidate(RDBMSStoreManager.java:3182)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2841)
>  at 
> org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:122)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.addClasses(RDBMSStoreManager.java:1605)
>  at 
> org.datanucleus.store.AbstractStoreManager.addClass(AbstractStoreManager.java:954)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:679)
>  at 
> org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:408)
>  at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:947)
>  at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:370)
>  at org.datanucleus.store.query.Query.executeQuery(Query.java:1744)
>  at org.datanucleus.store.query.Query.executeWithArray(Que

[jira] [Comment Edited] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory

2019-04-24 Thread belvey (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825756#comment-16825756
 ] 

belvey edited comment on SPARK-13510 at 4/25/19 5:58 AM:
-

@Mike in my case I set it to '536870912' (512m), and it can also be set to 
'512m', as Spark treats them equally.


was (Author: belvey):
@Mike in my case I set it to '536870912' (512m)  ,and it can be set to '512m' 
as spark will treat it equally. 

> Shuffle may throw FetchFailedException: Direct buffer memory
> 
>
> Key: SPARK-13510
> URL: https://issues.apache.org/jira/browse/SPARK-13510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Hong Shen
>Priority: Major
> Attachments: spark-13510.diff
>
>
> In our cluster, when I test spark-1.6.0 with a SQL query, it throws an 
> exception and fails.
> {code}
> 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request 
> for 1 blocks (915.4 MB) from 10.196.134.220:7337
> 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch 
> from 10.196.134.220:7337 (executor id 122)
> 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 
> to /10.196.134.220:7337
> 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in 
> connection from /10.196.134.220:7337
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>   at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:744)
> 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 
> requests outstanding when connection from /10.196.134.220:7337 is closed
> 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block 
> shuffle_3_81_2, and will not retry (0 retries)
> {code}
>   The reason is that when shuffling a big block (like 1G), the task will 
> allocate the same amount of memory, so it will easily throw 
> "FetchFailedException: Direct buffer memory".
>   If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will 
> throw 
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
> at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> {code}
>   
>   In MapReduce shuffle, it first judges whether the block can be cached in 
> memory, but Spark doesn't. 
>   If the block is larger than what we can cache in memory, we should write it 
> to disk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory

2019-04-24 Thread belvey (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806307#comment-16806307
 ] 

belvey edited comment on SPARK-13510 at 4/25/19 5:57 AM:
-

[~shenhong], hello Hong Shen, I am using Spark 2.0 and facing the same issue. I 
am not sure if it's merged into Spark 2; it would be very kind of you to post 
your PR.

 edit:

I found that a solution had already been added in Spark 2.3 and later. I am not 
sure if it is Hong Shen's PR, but the solution is similar to what Hong Shen 
described. For Spark 2.3 and later we can use 
"spark.maxRemoteBlockSizeFetchToMem" to control the maximum block size allowed 
for shuffle fetches that are buffered in memory; its default value is 
(Integer.MAX_VALUE - 512) bytes.
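For example, when building the session (the 512m value is just the sizing 
mentioned in the other comments; pick what fits your executors):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-fetch-to-disk-example")
  // remote shuffle blocks larger than this are fetched to disk instead of buffered in memory (Spark 2.3+)
  .config("spark.maxRemoteBlockSizeFetchToMem", "512m")
  .getOrCreate()
{code}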


was (Author: belvey):
[~shenhong] , hello hongsheng, I am using spark2.0 facing the same issue,  I am 
not sure if it's merged into spark2. it's very kind for you to post your pr.

 edit:

I found that solution had already been added to spark2.3 and later.  i am not 
sure if is hong shen's pr , but the solution is similar to what hong shen's 
said. And for spark2.3 and later we can use 
"spark.maxRemoteBlockSizeFetchToMem" to control the max block size allowed for 
shuffle fetching data that catched in memory, it's default value is 
(Interger.max-512)  bytes.

> Shuffle may throw FetchFailedException: Direct buffer memory
> 
>
> Key: SPARK-13510
> URL: https://issues.apache.org/jira/browse/SPARK-13510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Hong Shen
>Priority: Major
> Attachments: spark-13510.diff
>
>
> In our cluster, when I test spark-1.6.0 with a SQL query, it throws an 
> exception and fails.
> {code}
> 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request 
> for 1 blocks (915.4 MB) from 10.196.134.220:7337
> 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch 
> from 10.196.134.220:7337 (executor id 122)
> 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 
> to /10.196.134.220:7337
> 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in 
> connection from /10.196.134.220:7337
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>   at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:744)
> 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 
> requests outstanding when connection from /10.196.134.220:7337 is closed
> 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block 
> shuffle_3_81_2, and will not retry (0 retries)
> {code}
>   The reason is that when shuffling a big block (like 1G), the task will 
> allocate the same amount of memory, so it will easily throw 
> "FetchFailedException: Direct buffer memory".
>   If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will 
> throw 
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
> at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> {code}

[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory

2019-04-24 Thread belvey (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825756#comment-16825756
 ] 

belvey commented on SPARK-13510:


@Mike in my case I set it to '536870912' (512m), and it can also be set to 
'512m', as Spark treats them equally. 

> Shuffle may throw FetchFailedException: Direct buffer memory
> 
>
> Key: SPARK-13510
> URL: https://issues.apache.org/jira/browse/SPARK-13510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Hong Shen
>Priority: Major
> Attachments: spark-13510.diff
>
>
> In our cluster, when I test spark-1.6.0 with a SQL query, it throws an 
> exception and fails.
> {code}
> 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request 
> for 1 blocks (915.4 MB) from 10.196.134.220:7337
> 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch 
> from 10.196.134.220:7337 (executor id 122)
> 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 
> to /10.196.134.220:7337
> 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in 
> connection from /10.196.134.220:7337
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>   at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:744)
> 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 
> requests outstanding when connection from /10.196.134.220:7337 is closed
> 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block 
> shuffle_3_81_2, and will not retry (0 retries)
> {code}
>   The reason is that when shuffling a big block (like 1G), the task will 
> allocate the same amount of memory, so it will easily throw 
> "FetchFailedException: Direct buffer memory".
>   If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will 
> throw 
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
> at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> {code}
>   
>   In MapReduce shuffle, it first judges whether the block can be cached in 
> memory, but Spark doesn't. 
>   If the block is larger than what we can cache in memory, we should write it 
> to disk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27553) Operation log is not closed when close session

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27553:
-
Description: 
On Windows:
1. Start spark-shell.
2. Start the Hive server in the shell via 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark.sqlContext)
3. Connect beeline to the Hive server:
    3.1 connect 
          beeline -u jdbc:hive2://localhost:1
    3.2 run SQL
          show tables;
    3.3 quit beeline
          !quit
The following exception is logged:
{code}
 19/04/24 11:38:22 ERROR HiveSessionImpl: Failed to cleanup ses
sion log dir: SessionHandle [5827428b-d140-4fc0-8ad4-721c39b3ead0]
java.io.IOException: Unable to delete file: 
C:\Users\test\AppData\Local\Temp\test\operation_logs\5827428b-d140-4fc0-8ad4-721c39b3ead0\df9cd631-66e7-4303-9a4
1-a09bdefcf888
 at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
 at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
 at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
 at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
 at 
org.apache.hive.service.cli.session.HiveSessionImpl.cleanupSessionLogDir(HiveSessionImpl.java:671)
 at 
org.apache.hive.service.cli.session.HiveSessionImpl.close(HiveSessionImpl.java:643)
 at 
org.apache.hive.service.cli.session.HiveSessionImplwithUGI.close(HiveSessionImplwithUGI.java:109)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
 at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
 at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
 at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
 at com.sun.proxy.$Proxy19.close(Unknown Source)
 at 
org.apache.hive.service.cli.session.SessionManager.closeSession(SessionManager.java:280)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLSessionManager.closeSession(SparkSQLSessionManager.scala:76)
 at org.apache.hive.service.cli.CLIService.closeSession(CLIService.java:237)
 at 
org.apache.hive.service.cli.thrift.ThriftCLIService.CloseSession(ThriftCLIService.java:397)
 at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1273)
 at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1258)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
 at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{code}


  was:
On Window
1. start spark-shell
2. start hive server in shell by 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark.sqlContext)
3. beeline connect to hive server
    3.1 connect 
          beeline -u jdbc:hive2://localhost:1
    3.2 Run SQL
          show tables;
    3.3 quit beeline
          !quit
Get exception log
 19/04/24 11:38:22 ERROR HiveSessionImpl: Failed to cleanup ses
sion log dir: SessionHandle [5827428b-d140-4fc0-8ad4-721c39b3ead0]
java.io.IOException: Unable to delete file: 
C:\Users\test\AppData\Local\Temp\test\operation_logs\5827428b-d140-4fc0-8ad4-721c39b3ead0\df9cd631-66e7-4303-9a4
1-a09bdefcf888
 at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
 at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
 at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
 at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
 at 
org.apache.hive.service.cli.session.HiveSessionImpl.cleanupSessionLogDir(HiveSessionImpl.java:671)
 at 
org.apache.hive.service.cli.session.HiveSessionImpl.close(HiveSessionImpl.java:643)
 at 
org.apache.hive.service.cli.session.HiveSessionImplwithUGI.close(HiveSessionImplwithUGI.java:109)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:4

[jira] [Commented] (SPARK-27553) Operation log is not closed when close session

2019-04-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825751#comment-16825751
 ] 

Hyukjin Kwon commented on SPARK-27553:
--

The cause is probably that a file is not being closed somewhere while some code 
tries to delete it; that doesn't work on Windows.
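
As an illustration of that behaviour, a small self-contained sketch (not Spark 
code; the file name and contents are made up): on Windows, deleting a file that 
still has an open handle fails, while it succeeds once the handle is closed.

{code}
import java.io.{File, FileWriter}

object OpenHandleDeleteDemo {
  def main(args: Array[String]): Unit = {
    val log = File.createTempFile("operation_log", ".tmp")
    val writer = new FileWriter(log)
    writer.write("dummy operation log entry")
    writer.flush()
    // On Windows this typically prints false while the handle is still open;
    // on most POSIX systems the delete succeeds regardless.
    println(s"delete while open:  ${log.delete()}")
    writer.close()
    println(s"delete after close: ${log.delete()}")
  }
}
{code}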

> Operation log is not closed when close session
> --
>
> Key: SPARK-27553
> URL: https://issues.apache.org/jira/browse/SPARK-27553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: pin_zhang
>Priority: Major
>
> On Window
> 1. start spark-shell
> 2. start hive server in shell by 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark.sqlContext)
> 3. beeline connect to hive server
>     3.1 connect 
>           beeline -u jdbc:hive2://localhost:1
>     3.2 Run SQL
>           show tables;
>     3.3 quit beeline
>           !quit
> Get exception log
> {code}
>  19/04/24 11:38:22 ERROR HiveSessionImpl: Failed to cleanup ses
> sion log dir: SessionHandle [5827428b-d140-4fc0-8ad4-721c39b3ead0]
> java.io.IOException: Unable to delete file: 
> C:\Users\test\AppData\Local\Temp\test\operation_logs\5827428b-d140-4fc0-8ad4-721c39b3ead0\df9cd631-66e7-4303-9a4
> 1-a09bdefcf888
>  at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>  at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>  at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
>  at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
>  at 
> org.apache.hive.service.cli.session.HiveSessionImpl.cleanupSessionLogDir(HiveSessionImpl.java:671)
>  at 
> org.apache.hive.service.cli.session.HiveSessionImpl.close(HiveSessionImpl.java:643)
>  at 
> org.apache.hive.service.cli.session.HiveSessionImplwithUGI.close(HiveSessionImplwithUGI.java:109)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
>  at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>  at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>  at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>  at com.sun.proxy.$Proxy19.close(Unknown Source)
>  at 
> org.apache.hive.service.cli.session.SessionManager.closeSession(SessionManager.java:280)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLSessionManager.closeSession(SparkSQLSessionManager.scala:76)
>  at org.apache.hive.service.cli.CLIService.closeSession(CLIService.java:237)
>  at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.CloseSession(ThriftCLIService.java:397)
>  at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1273)
>  at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1258)
>  at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>  at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>  at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
>  at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27554) org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27554.
--
Resolution: Invalid

There is no evidence that this is an issue in Spark. It is probably a problem 
with the Kerberos setup in your environment. Check whether you ran {{kinit}} 
first.
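
If a ticket cache from {{kinit}} is not available, a hedged sketch of logging in 
from the keytab explicitly with the standard Hadoop {{UserGroupInformation}} API 
(the principal and keytab path below are placeholders, not values from this 
report):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Placeholder principal and keytab path -- substitute your own values.
val principal = "user@EXAMPLE.COM"
val keytab = "/path/to/user.keytab"

val hadoopConf = new Configuration() // picks up *-site.xml from HADOOP_CONF_DIR
UserGroupInformation.setConfiguration(hadoopConf)
UserGroupInformation.loginUserFromKeytab(principal, keytab)
println(UserGroupInformation.getLoginUser) // should report the Kerberos principal
{code}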

> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN, KERBEROS]
> ---
>
> Key: SPARK-27554
> URL: https://issues.apache.org/jira/browse/SPARK-27554
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.0, 2.4.1
> Environment: cdh5.16.1
> spark2.4.0
> spark2.4.1
> spark2.4.2
>Reporter: Jepson
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> {code}
> #basic env 
>  JAVA_HOME=/usr/java/jdk1.8.0_181
>  HADOOP_CONF_DIR=/etc/hadoop/conf
>  export SPARK_HOME=/opt/software/spark/spark-2.4.2
> {code}
> {code}
> #project env
>  KERBEROS_USER=h...@hadoop.com
>  KERBEROS_USER_KEYTAB=/etc/kerberos/hdfs.keytab
>  PROJECT_MAIN_CLASS=com.jy.dwexercise.OrderDeliveryModel
>  PROJECT_JAR=/opt/maintain/scripts/bas-1.0-SNAPSHOT.jar
> {code}
> {code}
> #spark resource
>  DRIVER_MEMORY=4G
>  EXECUTORS_NUM=20
>  EXECUTOR_MEMORY=6G
>  EXECUTOR_VCORES=4
>  QUEQUE=idss
> {code}
> {code}
> #submit job
>  /opt/software/spark/spark-2.4.2/bin/spark-submit \
>  --master yarn \
>  --deploy-mode cluster \
>  --queue ${QUEQUE} \
>  --driver-memory ${DRIVER_MEMORY} \
>  --num-executors ${EXECUTORS_NUM} \
>  --executor-memory ${EXECUTOR_MEMORY} \
>  --executor-cores ${EXECUTOR_VCORES} \
>  --principal ${KERBEROS_USER} \
>  --keytab ${KERBEROS_USER_KEYTAB} \
>  --class ${PROJECT_MAIN_CLASS} \
>  --conf 
> "spark.yarn.archive=hdfs://jybigdata/spark/spark-archive-20190422.zip" \
>  --conf "spark.sql.autoBroadcastJoinThreshold=20971520" \
>  ${PROJECT_JAR}
> {code}
>  
> {code}
> *ERROR log:*
> 19/04/24 13:35:07 INFO ConfiguredRMFailoverProxyProvider: Failing over to rm78
>  19/04/24 13:35:07 WARN UserGroupInformation: PriviledgedActionException 
> as:hdfs (auth:SIMPLE) 
> cause:org.apache.hadoop.security.AccessControlException: Client cannot 
> authenticate via:[TOKEN, KERBEROS]
>  19/04/24 13:35:07 WARN Client: Exception encountered while connecting to the 
> server : org.apache.hadoop.security.AccessControlException: Client cannot 
> authenticate via:[TOKEN, KERBEROS]
>  19/04/24 13:35:07 WARN UserGroupInformation: PriviledgedActionException 
> as:hdfs (auth:SIMPLE) cause:java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN, KERBEROS]
>  19/04/24 13:35:07 INFO RetryInvocationHandler: Exception while invoking 
> getClusterMetrics of class ApplicationClientProtocolPBClientImpl over rm78 
> after 10 fail over attempts. Trying to fail over immediately.
>  java.io.IOException: Failed on local exception: java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN, KERBEROS]; Host Details : local host is: 
> "hadoopnode143/192.168.209.143"; destination host is: 
> "hadoopmanager136":8032; 
>  at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1508)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1441)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
>  at com.sun.proxy.$Proxy25.getClusterMetrics(Unknown Source)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:202)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>  at com.sun.proxy.$Proxy26.getClusterMetrics(Unknown Source)
>  at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:483)
>  at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:164)
>  at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:164)
>  at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>  at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:59)
>  at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:163)
>  at 
> org.apache.spark.scheduler.cluster.

[jira] [Updated] (SPARK-27554) org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27554:
-
Description: 
{code}
#basic env 
 JAVA_HOME=/usr/java/jdk1.8.0_181
 HADOOP_CONF_DIR=/etc/hadoop/conf
 export SPARK_HOME=/opt/software/spark/spark-2.4.2
{code}

{code}
#project env
 KERBEROS_USER=h...@hadoop.com
 KERBEROS_USER_KEYTAB=/etc/kerberos/hdfs.keytab
 PROJECT_MAIN_CLASS=com.jy.dwexercise.OrderDeliveryModel
 PROJECT_JAR=/opt/maintain/scripts/bas-1.0-SNAPSHOT.jar
{code}

{code}
#spark resource
 DRIVER_MEMORY=4G
 EXECUTORS_NUM=20
 EXECUTOR_MEMORY=6G
 EXECUTOR_VCORES=4
 QUEQUE=idss
{code}

{code}
#submit job
 /opt/software/spark/spark-2.4.2/bin/spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --queue ${QUEQUE} \
 --driver-memory ${DRIVER_MEMORY} \
 --num-executors ${EXECUTORS_NUM} \
 --executor-memory ${EXECUTOR_MEMORY} \
 --executor-cores ${EXECUTOR_VCORES} \
 --principal ${KERBEROS_USER} \
 --keytab ${KERBEROS_USER_KEYTAB} \
 --class ${PROJECT_MAIN_CLASS} \
 --conf "spark.yarn.archive=hdfs://jybigdata/spark/spark-archive-20190422.zip" \
 --conf "spark.sql.autoBroadcastJoinThreshold=20971520" \
 ${PROJECT_JAR}
{code}

 
{code}
*ERROR log:*

19/04/24 13:35:07 INFO ConfiguredRMFailoverProxyProvider: Failing over to rm78
 19/04/24 13:35:07 WARN UserGroupInformation: PriviledgedActionException 
as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: 
Client cannot authenticate via:[TOKEN, KERBEROS]
 19/04/24 13:35:07 WARN Client: Exception encountered while connecting to the 
server : org.apache.hadoop.security.AccessControlException: Client cannot 
authenticate via:[TOKEN, KERBEROS]
 19/04/24 13:35:07 WARN UserGroupInformation: PriviledgedActionException 
as:hdfs (auth:SIMPLE) cause:java.io.IOException: 
org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
via:[TOKEN, KERBEROS]
 19/04/24 13:35:07 INFO RetryInvocationHandler: Exception while invoking 
getClusterMetrics of class ApplicationClientProtocolPBClientImpl over rm78 
after 10 fail over attempts. Trying to fail over immediately.
 java.io.IOException: Failed on local exception: java.io.IOException: 
org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
via:[TOKEN, KERBEROS]; Host Details : local host is: 
"hadoopnode143/192.168.209.143"; destination host is: "hadoopmanager136":8032; 
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
 at org.apache.hadoop.ipc.Client.call(Client.java:1508)
 at org.apache.hadoop.ipc.Client.call(Client.java:1441)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
 at com.sun.proxy.$Proxy25.getClusterMetrics(Unknown Source)
 at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:202)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
 at com.sun.proxy.$Proxy26.getClusterMetrics(Unknown Source)
 at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:483)
 at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:164)
 at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:164)
 at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
 at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:59)
 at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:163)
 at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
 at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:183)
 at org.apache.spark.SparkContext.(SparkContext.scala:501)
 at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
 at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
 at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
 at scala.Option.getOrElse(Option.scala:121)
 at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
 at com.jiuye.dwexercise.OrderDeliveryModel$.(OrderDeliveryModel.scala:25)
 at com.jiuye.dwexercise.OrderDeliveryModel$.(OrderDeliveryModel.scala)
 at 
com.jiuye.dwexercise.OrderDeliveryModel$$anonfun$5.apply(OrderDeliveryModel.scala:55)
 at 
com.jiuye.dwexercise.OrderDeliveryModel$$anonfun$5.apply(OrderDeliveryModel.scala:55)
 at 
org.apache.spark.sql.catalyst.express

[jira] [Updated] (SPARK-27554) org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27554:
-
Description: 
{code}
#basic env 
 JAVA_HOME=/usr/java/jdk1.8.0_181
 HADOOP_CONF_DIR=/etc/hadoop/conf
 export SPARK_HOME=/opt/software/spark/spark-2.4.2
{code}

{code}
#project env
 KERBEROS_USER=h...@hadoop.com
 KERBEROS_USER_KEYTAB=/etc/kerberos/hdfs.keytab
 PROJECT_MAIN_CLASS=com.jy.dwexercise.OrderDeliveryModel
 PROJECT_JAR=/opt/maintain/scripts/bas-1.0-SNAPSHOT.jar
{code}

{code}
#spark resource
 DRIVER_MEMORY=4G
 EXECUTORS_NUM=20
 EXECUTOR_MEMORY=6G
 EXECUTOR_VCORES=4
 QUEQUE=idss
{code}

{code}
#submit job
 /opt/software/spark/spark-2.4.2/bin/spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --queue ${QUEQUE} \
 --driver-memory ${DRIVER_MEMORY} \
 --num-executors ${EXECUTORS_NUM} \
 --executor-memory ${EXECUTOR_MEMORY} \
 --executor-cores ${EXECUTOR_VCORES} \
 --principal ${KERBEROS_USER} \
 --keytab ${KERBEROS_USER_KEYTAB} \
 --class ${PROJECT_MAIN_CLASS} \
 --conf "spark.yarn.archive=hdfs://jybigdata/spark/spark-archive-20190422.zip" \
 --conf "spark.sql.autoBroadcastJoinThreshold=20971520" \
 ${PROJECT_JAR}

 

*ERROR log:*

19/04/24 13:35:07 INFO ConfiguredRMFailoverProxyProvider: Failing over to rm78
 19/04/24 13:35:07 WARN UserGroupInformation: PriviledgedActionException 
as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: 
Client cannot authenticate via:[TOKEN, KERBEROS]
 19/04/24 13:35:07 WARN Client: Exception encountered while connecting to the 
server : org.apache.hadoop.security.AccessControlException: Client cannot 
authenticate via:[TOKEN, KERBEROS]
 19/04/24 13:35:07 WARN UserGroupInformation: PriviledgedActionException 
as:hdfs (auth:SIMPLE) cause:java.io.IOException: 
org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
via:[TOKEN, KERBEROS]
 19/04/24 13:35:07 INFO RetryInvocationHandler: Exception while invoking 
getClusterMetrics of class ApplicationClientProtocolPBClientImpl over rm78 
after 10 fail over attempts. Trying to fail over immediately.
 java.io.IOException: Failed on local exception: java.io.IOException: 
org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
via:[TOKEN, KERBEROS]; Host Details : local host is: 
"hadoopnode143/192.168.209.143"; destination host is: "hadoopmanager136":8032; 
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
 at org.apache.hadoop.ipc.Client.call(Client.java:1508)
 at org.apache.hadoop.ipc.Client.call(Client.java:1441)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
 at com.sun.proxy.$Proxy25.getClusterMetrics(Unknown Source)
 at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:202)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
 at com.sun.proxy.$Proxy26.getClusterMetrics(Unknown Source)
 at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:483)
 at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:164)
 at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:164)
 at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
 at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:59)
 at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:163)
 at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
 at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:183)
 at org.apache.spark.SparkContext.(SparkContext.scala:501)
 at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
 at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
 at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
 at scala.Option.getOrElse(Option.scala:121)
 at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
 at com.jiuye.dwexercise.OrderDeliveryModel$.(OrderDeliveryModel.scala:25)
 at com.jiuye.dwexercise.OrderDeliveryModel$.(OrderDeliveryModel.scala)
 at 
com.jiuye.dwexercise.OrderDeliveryModel$$anonfun$5.apply(OrderDeliveryModel.scala:55)
 at 
com.jiuye.dwexercise.OrderDeliveryModel$$anonfun$5.apply(OrderDeliveryModel.scala:55)
 at 
org.apache.spark.sql.catalyst.expressions.Generate

[jira] [Commented] (SPARK-27551) Uniformative error message for mismatched types in when().otherwise()

2019-04-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825749#comment-16825749
 ] 

Dongjoon Hyun commented on SPARK-27551:
---

Hi, [~huonw]. I updated this from `BUG` to `Improvement`.

> Uniformative error message for mismatched types in when().otherwise()
> -
>
> Key: SPARK-27551
> URL: https://issues.apache.org/jira/browse/SPARK-27551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Huon Wilson
>Priority: Minor
>
> When a {{when(...).otherwise(...)}} construct has a type error, the error 
> message can be quite uninformative, since it just splats out a potentially 
> large chunk of code and says the types don't match. For instance:
> {code:none}
> scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 
> 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as 
> "y"
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = 
> CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS 
> BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * 
> CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data 
> type mismatch: THEN and ELSE expressions should all be same type or coercible 
> to a common type;;
> ...
> {code}
> The problem is the structs have different field names ({{x}} vs {{y}}), but 
> it's not obvious that this is the case, and this is a relatively simple case 
> of a single {{select}} expression.
> It would be great for the error message to at least include the types that 
> Spark has computed, to help clarify what might have gone wrong. For instance, 
> {{greatest}} and {{least}} write out the expression with the types instead of 
> values:
> {code:none}
> scala> spark.range(100).select(greatest('id, struct(lit("x"
> org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`id`, 
> named_struct('col1', 'x'))' due to data type mismatch: The expressions should 
> all have the same type, got GREATEST(bigint, struct).;;
> {code}
> For the example above, this might look like:
> {code:none}
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = 
> CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS 
> BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * 
> CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data 
> type mismatch: THEN and ELSE expressions should all be same type or coercible 
> to a common type, got CASE WHEN ... THEN array> ELSE 
> array> END;;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27551) Uniformative error message for mismatched types in when().otherwise()

2019-04-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27551:
--
Priority: Minor  (was: Major)

> Uniformative error message for mismatched types in when().otherwise()
> -
>
> Key: SPARK-27551
> URL: https://issues.apache.org/jira/browse/SPARK-27551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Huon Wilson
>Priority: Minor
>
> When a {{when(...).otherwise(...)}} construct has a type error, the error 
> message can be quite uninformative, since it just splats out a potentially 
> large chunk of code and says the types don't match. For instance:
> {code:none}
> scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 
> 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as 
> "y"
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = 
> CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS 
> BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * 
> CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data 
> type mismatch: THEN and ELSE expressions should all be same type or coercible 
> to a common type;;
> ...
> {code}
> The problem is the structs have different field names ({{x}} vs {{y}}), but 
> it's not obvious that this is the case, and this is a relatively simple case 
> of a single {{select}} expression.
> It would be great for the error message to at least include the types that 
> Spark has computed, to help clarify what might have gone wrong. For instance, 
> {{greatest}} and {{least}} write out the expression with the types instead of 
> values:
> {code:none}
> scala> spark.range(100).select(greatest('id, struct(lit("x"
> org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`id`, 
> named_struct('col1', 'x'))' due to data type mismatch: The expressions should 
> all have the same type, got GREATEST(bigint, struct).;;
> {code}
> For the example above, this might look like:
> {code:none}
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = 
> CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS 
> BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * 
> CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data 
> type mismatch: THEN and ELSE expressions should all be same type or coercible 
> to a common type, got CASE WHEN ... THEN array> ELSE 
> array> END;;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27557) Add copybutton to spark Python API docs for easier copying of code-blocks

2019-04-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825746#comment-16825746
 ] 

Hyukjin Kwon commented on SPARK-27557:
--

Please go ahead and open a PR.

> Add copybutton to spark Python API docs for easier copying of code-blocks
> -
>
> Key: SPARK-27557
> URL: https://issues.apache.org/jira/browse/SPARK-27557
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.4.2
>Reporter: Sangram G
>Priority: Minor
>  Labels: Documentation, PySpark
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Add a non-intrusive button for python API documentation, which will remove 
> ">>>" prompts and outputs of code - for easier copying of code.
> For example: The below code-snippet in the document is difficult to copy due 
> to ">>>" prompts
> {code}
> >>> l = [('Alice', 1)]
> >>> spark.createDataFrame(l).collect()
> [Row(_1='Alice', _2=1)]
> {code}
> After the copy button in the corner of the code block is pressed, it becomes 
> this, which is easier to copy:
> {code}
> l = [('Alice', 1)]
> spark.createDataFrame(l).collect()
> {code}
> Sample Screenshot for reference: 
> [https://screenshots.firefox.com/ZfBXZrpINKdt9QZg/www269.lunapic.com]
> This can be easily done only by adding a copybutton.js script in 
> python/docs/_static folder and calling it at setup time from 
> python/docs/conf.py.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27551) Uniformative error message for mismatched types in when().otherwise()

2019-04-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27551:
--
Issue Type: Improvement  (was: Bug)

> Uniformative error message for mismatched types in when().otherwise()
> -
>
> Key: SPARK-27551
> URL: https://issues.apache.org/jira/browse/SPARK-27551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>
> When a {{when(...).otherwise(...)}} construct has a type error, the error 
> message can be quite uninformative, since it just splats out a potentially 
> large chunk of code and says the types don't match. For instance:
> {code:none}
> scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 
> 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as 
> "y"
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = 
> CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS 
> BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * 
> CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data 
> type mismatch: THEN and ELSE expressions should all be same type or coercible 
> to a common type;;
> ...
> {code}
> The problem is the structs have different field names ({{x}} vs {{y}}), but 
> it's not obvious that this is the case, and this is a relatively simple case 
> of a single {{select}} expression.
> It would be great for the error message to at least include the types that 
> Spark has computed, to help clarify what might have gone wrong. For instance, 
> {{greatest}} and {{least}} write out the expression with the types instead of 
> values:
> {code:none}
> scala> spark.range(100).select(greatest('id, struct(lit("x"
> org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`id`, 
> named_struct('col1', 'x'))' due to data type mismatch: The expressions should 
> all have the same type, got GREATEST(bigint, struct).;;
> {code}
> For the example above, this might look like:
> {code:none}
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = 
> CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS 
> BIGINT)) + CAST(123456789 AS BIGINT ELSE array(named_struct('y', ((`id` * 
> CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT END' due to data 
> type mismatch: THEN and ELSE expressions should all be same type or coercible 
> to a common type, got CASE WHEN ... THEN array> ELSE 
> array> END;;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825747#comment-16825747
 ] 

Hyukjin Kwon commented on SPARK-27555:
--

Can you post a self-contained reproducer, please?

> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Priority: Major
>
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397,
> and I checked the Spark source code for the change where setting 
> "spark.sql.hive.convertCTAS=true" makes Spark use 
> "spark.sql.sources.default" (parquet) as the storage format in the "create 
> table as select" scenario.
> But my case is just "create table" without a select. When I set 
> hive.default.fileformat=parquet in hive-site.xml, or 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, and then 
> create a table, the Hive table still uses the textfile fileformat.
>  
> It seems HiveSerDe gets the value of the hive.default.fileformat parameter 
> from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf 
> at SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all settings with the "spark.hadoop" prefix go into the 
> Hadoop configuration as well, so the setting does not take effect.
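
Assuming that analysis is correct, a hedged sketch of two workarounds that put 
the value into the session's SQLConf directly instead of only into the Hadoop 
configuration (behaviour may differ across Spark versions; the table name is 
hypothetical):

{code}
import org.apache.spark.sql.SparkSession

// Option 1: pass the Hive setting as a session option at build time.
val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .config("hive.default.fileformat", "parquet")
  .getOrCreate()

// Option 2: set it at runtime before creating the table.
spark.sql("SET hive.default.fileformat=parquet")
spark.sql("CREATE TABLE demo_defaults (id BIGINT)")
{code}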



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27557) Add copybutton to spark Python API docs for easier copying of code-blocks

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27557:
-
Description: 
Add a non-intrusive button for python API documentation, which will remove 
">>>" prompts and outputs of code - for easier copying of code.

For example: The below code-snippet in the document is difficult to copy due to 
">>>" prompts

{code}
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
{code}

After the copy button in the corner of the code block is pressed, it becomes 
this, which is easier to copy:

{code}
l = [('Alice', 1)]
spark.createDataFrame(l).collect()
{code}

Sample Screenshot for reference: 
[https://screenshots.firefox.com/ZfBXZrpINKdt9QZg/www269.lunapic.com]

This can be easily done only by adding a copybutton.js script in 
python/docs/_static folder and calling it at setup time from 
python/docs/conf.py.

  was:
Add a non-intrusive button for python API documentation, which will remove 
">>>" prompts and outputs of code - for easier copying of code.

 For example: The below code-snippet in the document is difficult to copy due 
to ">>>" prompts

   >>> l = [('Alice', 1)]

   >>> spark.createDataFrame(l).collect()

    [Row(_1='Alice', _2=1)]


Becomes this - After the copybutton in the corner of of code-block is pressed - 
which is easier to copy 


l = [('Alice', 1)]

spark.createDataFrame(l).collect()

Sample Screenshot for reference: 
[https://screenshots.firefox.com/ZfBXZrpINKdt9QZg/www269.lunapic.com]

 

This can be easily done only by adding a copybutton.js script in 
python/docs/_static folder and calling it at setup time from 
python/docs/conf.py.


> Add copybutton to spark Python API docs for easier copying of code-blocks
> -
>
> Key: SPARK-27557
> URL: https://issues.apache.org/jira/browse/SPARK-27557
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.4.2
>Reporter: Sangram G
>Priority: Minor
>  Labels: Documentation, PySpark
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Add a non-intrusive button for python API documentation, which will remove 
> ">>>" prompts and outputs of code - for easier copying of code.
> For example: The below code-snippet in the document is difficult to copy due 
> to ">>>" prompts
> {code}
> >>> l = [('Alice', 1)]
> >>> spark.createDataFrame(l).collect()
> [Row(_1='Alice', _2=1)]
> {code}
> After the copy button in the corner of the code block is pressed, it becomes 
> this, which is easier to copy:
> {code}
> l = [('Alice', 1)]
> spark.createDataFrame(l).collect()
> {code}
> Sample Screenshot for reference: 
> [https://screenshots.firefox.com/ZfBXZrpINKdt9QZg/www269.lunapic.com]
> This can be easily done only by adding a copybutton.js script in 
> python/docs/_static folder and calling it at setup time from 
> python/docs/conf.py.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27558) NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang

2019-04-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825743#comment-16825743
 ] 

Hyukjin Kwon commented on SPARK-27558:
--

Please avoid setting Critical+, which is usually reserved for committers.

> NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter 
> causing tasks to hang
> 
>
> Key: SPARK-27558
> URL: https://issues.apache.org/jira/browse/SPARK-27558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.2
>Reporter: Alessandro Bellina
>Priority: Major
>
> We see an NPE in the UnsafeInMemorySorter.getMemoryUsage function (due to the 
> array we are accessing there being null). This looks to be caused by a Spark 
> OOM when UnsafeInMemorySorter is trying to spill.
> This is likely a symptom of 
> https://issues.apache.org/jira/browse/SPARK-21492. The real question for this 
> ticket is, could we handle things more gracefully, rather than NPE. For 
> example:
> Remove this:
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L182
> so when this fails (and store the new array into a temporary):
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L186
> we don't end up with a null "array". This state is causing one of our jobs to 
> hang infinitely (we think) due to the original allocation error.
> Stack trace for reference
> {noformat}
> 2019-04-23 08:57:14,989 [Executor task launch worker for task 46729] ERROR 
> org.apache.spark.TaskContextImpl  - Error in TaskCompletionListener
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getMemoryUsage(UnsafeInMemorySorter.java:208)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getMemoryUsage(UnsafeExternalSorter.java:249)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.updatePeakMemoryUsed(UnsafeExternalSorter.java:253)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.freeMemory(UnsafeExternalSorter.java:296)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:328)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.lambda$new$0(UnsafeExternalSorter.java:178)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:131)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:129)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:129)
>   at 
> org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
>   at org.apache.spark.scheduler.Task.run(Task.scala:119)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2019-04-23 08:57:15,069 [Executor task launch worker for task 46729] ERROR 
> org.apache.spark.executor.Executor  - Exception in task 102.0 in stage 28.0 
> (TID 46729)
> org.apache.spark.util.TaskCompletionListenerException: null
> Previous exception in task: Unable to acquire 65536 bytes of memory, got 0
>   org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
>   
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
>   
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:186)
>   
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:229)
>   
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:204)
>   
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:283)
>   
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96)
>   
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(Uns

[jira] [Updated] (SPARK-27558) NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27558:
-
Priority: Major  (was: Critical)

> NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter 
> causing tasks to hang
> 
>
> Key: SPARK-27558
> URL: https://issues.apache.org/jira/browse/SPARK-27558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.2
>Reporter: Alessandro Bellina
>Priority: Major
>
> We see an NPE in the UnsafeInMemorySorter.getMemoryUsage function (due to the 
> array we are accessing there being null). This looks to be caused by a Spark 
> OOM when UnsafeInMemorySorter is trying to spill.
> This is likely a symptom of 
> https://issues.apache.org/jira/browse/SPARK-21492. The real question for this 
> ticket is, could we handle things more gracefully, rather than NPE. For 
> example:
> Remove this:
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L182
> so when this fails (and store the new array into a temporary):
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L186
> we don't end up with a null "array". This state is causing one of our jobs to 
> hang infinitely (we think) due to the original allocation error.
> Stack trace for reference
> {noformat}
> 2019-04-23 08:57:14,989 [Executor task launch worker for task 46729] ERROR 
> org.apache.spark.TaskContextImpl  - Error in TaskCompletionListener
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getMemoryUsage(UnsafeInMemorySorter.java:208)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getMemoryUsage(UnsafeExternalSorter.java:249)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.updatePeakMemoryUsed(UnsafeExternalSorter.java:253)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.freeMemory(UnsafeExternalSorter.java:296)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:328)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.lambda$new$0(UnsafeExternalSorter.java:178)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:131)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:129)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:129)
>   at 
> org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
>   at org.apache.spark.scheduler.Task.run(Task.scala:119)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2019-04-23 08:57:15,069 [Executor task launch worker for task 46729] ERROR 
> org.apache.spark.executor.Executor  - Exception in task 102.0 in stage 28.0 
> (TID 46729)
> org.apache.spark.util.TaskCompletionListenerException: null
> Previous exception in task: Unable to acquire 65536 bytes of memory, got 0
>   org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
>   
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
>   
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:186)
>   
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:229)
>   
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:204)
>   
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:283)
>   
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96)
>   
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:348)
>   
> org.apache.spark.util.collection.unsafe.sort.UnsafeExt

[jira] [Resolved] (SPARK-27559) Nullable in a given schema is not respected when reading from parquet

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27559.
--
Resolution: Duplicate

> Nullable in a given schema is not respected when reading from parquet
> -
>
> Key: SPARK-27559
> URL: https://issues.apache.org/jira/browse/SPARK-27559
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: colin fang
>Priority: Minor
>
> Even if I specify a schema when reading from parquet, nullable is not reset.
> {code:java}
> spark.range(10, numPartitions=1).write.mode('overwrite').parquet('tmp')
> df1 = spark.read.parquet('tmp')
> df1.printSchema()
> # root
> #  |-- id: long (nullable = true)
> df2 = spark.read.schema(StructType([StructField('id', LongType(), 
> False)])).parquet('tmp')
> df2.printSchema()
> # root
> #  |-- id: long (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27559) Nullable in a given schema is not respected when reading from parquet

2019-04-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825742#comment-16825742
 ] 

Hyukjin Kwon commented on SPARK-27559:
--

It's a duplicate of SPARK-16472.

> Nullable in a given schema is not respected when reading from parquet
> -
>
> Key: SPARK-27559
> URL: https://issues.apache.org/jira/browse/SPARK-27559
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: colin fang
>Priority: Minor
>
> Even if I specify a schema when reading from parquet, nullable is not reset.
> {code:java}
> spark.range(10, numPartitions=1).write.mode('overwrite').parquet('tmp')
> df1 = spark.read.parquet('tmp')
> df1.printSchema()
> # root
> #  |-- id: long (nullable = true)
> df2 = spark.read.schema(StructType([StructField('id', LongType(), 
> False)])).parquet('tmp')
> df2.printSchema()
> # root
> #  |-- id: long (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27559) Nullable in a given schema is not respected when reading from parquet

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27559:
-
Component/s: (was: Spark Core)
 SQL

> Nullable in a given schema is not respected when reading from parquet
> -
>
> Key: SPARK-27559
> URL: https://issues.apache.org/jira/browse/SPARK-27559
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: colin fang
>Priority: Minor
>
> Even if I specify a schema when reading from parquet, nullable is not reset.
> {code:java}
> spark.range(10, numPartitions=1).write.mode('overwrite').parquet('tmp')
> df1 = spark.read.parquet('tmp')
> df1.printSchema()
> # root
> #  |-- id: long (nullable = true)
> df2 = spark.read.schema(StructType([StructField('id', LongType(), 
> False)])).parquet('tmp')
> df2.printSchema()
> # root
> #  |-- id: long (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27505) autoBroadcastJoinThreshold including bigger table

2019-04-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825739#comment-16825739
 ] 

Hyukjin Kwon commented on SPARK-27505:
--

I meant copy-and-pasteable code to reproduce this error.

> autoBroadcastJoinThreshold including bigger table
> -
>
> Key: SPARK-27505
> URL: https://issues.apache.org/jira/browse/SPARK-27505
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Hive table with Spark 2.3.1 on Azure, using Azure 
> storage as storage layer
>Reporter: Mike Chan
>Priority: Major
> Attachments: explain_plan.txt
>
>
> I'm hitting a case where, when a certain table is exposed to a broadcast join, 
> the query eventually fails with a remote block error. 
>  
> First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 10485760.
> (screenshot attachment omitted)
>  
> Then we run the query. In the SQL plan we found that one table that is 25MB 
> in size is broadcast as well.
>  
> (screenshot attachment omitted)
>  
> Also, in desc extended the table is 24452111 bytes. It is a Hive table. We 
> always run into the error when this table is broadcast. Below is a sample 
> error:
>  
> Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt 
> remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) 
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>  
> The physical plan is also attached if you're interested. One thing to note: 
> if I turn autoBroadcastJoinThreshold down to 5MB, this query executes 
> successfully and default.product is NOT broadcast.
> However, when I switch to another query that selects even fewer columns than 
> the previous one, this table still gets broadcast at 5MB and fails with the 
> same error. I even changed it to 1MB and got the same result.
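
For readers debugging similar behaviour, a hedged sketch (assuming a 
spark-shell style session named {{spark}}) of inspecting and adjusting the 
threshold at runtime; the values are illustrative only:

{code}
// Inspect the current threshold on a live session.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

// Lower the threshold to 5 MB (the value is in bytes) ...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 5L * 1024 * 1024)

// ... or disable automatic broadcast joins entirely while investigating.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
{code}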



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory

2019-04-24 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825721#comment-16825721
 ] 

Mike Chan commented on SPARK-13510:
---

Hi [~belvey], I'm having a similar issue and our 
"spark.maxRemoteBlockSizeFetchToMem" is at 188. From the forums I can tell 
this parameter should be set below 2GB. How do you set your parameter? Should 
it be "2g" or 2 * 1024 * 1024 * 1024 = 2147483648? I'm on Spark 2.3 on Azure.

> Shuffle may throw FetchFailedException: Direct buffer memory
> 
>
> Key: SPARK-13510
> URL: https://issues.apache.org/jira/browse/SPARK-13510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Hong Shen
>Priority: Major
> Attachments: spark-13510.diff
>
>
> In our cluster, when I test spark-1.6.0 with a sql, it throw exception and 
> failed.
> {code}
> 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request 
> for 1 blocks (915.4 MB) from 10.196.134.220:7337
> 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch 
> from 10.196.134.220:7337 (executor id 122)
> 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 
> to /10.196.134.220:7337
> 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in 
> connection from /10.196.134.220:7337
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>   at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:744)
> 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 
> requests outstanding when connection from /10.196.134.220:7337 is closed
> 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block 
> shuffle_3_81_2, and will not retry (0 retries)
> {code}
>   The reason is that when shuffling a big block (like 1G), the task will 
> allocate the same amount of memory, so it easily throws 
> "FetchFailedException: Direct buffer memory".
>   If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it 
> will throw 
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
> at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> {code}
>   
>   In MapReduce shuffle, it first checks whether the block can be cached in 
> memory, but Spark doesn't. 
>   If the block is larger than what we can cache in memory, we should write it 
> to disk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27563) automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite

2019-04-24 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27563:
---

 Summary: automatically get the latest Spark versions in 
HiveExternalCatalogVersionsSuite
 Key: SPARK-27563
 URL: https://issues.apache.org/jira/browse/SPARK-27563
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20462) Spark-Kinesis Direct Connector

2019-04-24 Thread Long Sun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825669#comment-16825669
 ] 

Long Sun commented on SPARK-20462:
--

[~gaurav24], [~arushkharbanda],

Any update on this?

"The user does not have to implement their own checkpointing logic and ensuring 
checkpointing only occurs after the terminal action of a user’s Spark job 
ensures that no data is lost" The idea above is excellent in the above proposal.

But at least, I doubt why *Spark + Kinesis* don't provide '*offset commit API*' 
like *Spark+Kafka 0.10.0 or higher.* 
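
For reference, the Spark + Kafka 0.10 offset-commit API referred to above looks roughly like this (a sketch with made-up topic, group and broker names):

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setMaster("local[2]").setAppName("offset-commit-demo")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
  "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
  "group.id" -> "demo",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("some-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // process the batch (the terminal action) here, then commit so the stored
  // offsets only advance after the work has succeeded
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}

Something equivalent for Kinesis would let applications advance their checkpoint only after the batch has been fully processed.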

> Spark-Kinesis Direct Connector 
> ---
>
> Key: SPARK-20462
> URL: https://issues.apache.org/jira/browse/SPARK-20462
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Lauren Moos
>Priority: Major
>
> I'd like to propose and the vet the design for a direct connector between 
> Spark and Kinesis. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data

2019-04-24 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27562:

Description: 
We've seen some shuffle data corruption during the shuffle read phase. 

As described in SPARK-26089, Spark only checks small shuffle blocks before PR 
#23453, which was proposed by ankuriitg.

There are two changes/improvements made in PR #23453.

1. Large blocks are checked up to maxBytesInFlight/3 size in a similar way as 
smaller blocks, so if a large block is corrupt at the start, that block will be 
re-fetched, and if that also fails, FetchFailureException will be thrown.
2. If large blocks are corrupt after size maxBytesInFlight/3, then any 
IOException thrown while reading the stream will be converted to 
FetchFailureException. This is slightly more aggressive than was originally 
intended, but since the consumer of the stream may have already read some 
records and processed them, we can't just re-fetch the block, we need to fail 
the whole task. Additionally, we also thought about maybe adding a new type of 
TaskEndReason, which would re-try the task a couple of times before failing the 
previous stage, but given the complexity involved in that solution we decided 
not to proceed in that direction.

However, I think there still exist some problems with the current verification 
mechanism for shuffle transmitted data:

- For a large block, it is checked up to maxBytesInFlight/3 size when fetching 
shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, it 
cannot be detected in the data fetch phase. This has been described in the 
previous section.
- Only the compressed or wrapped blocks are checked; I think we should also 
check those blocks which are not wrapped.

We complete the verification mechanism for shuffle transmitted data as follows.

Firstly, we choose crc32 for the checksum verification of shuffle data. CRC is 
also used for checksum verification in Hadoop; it is simple and fast.

In the shuffle write phase, after completing the partitionedFile, we compute 
the crc32 value for each partition and then write these digests together with 
the indexes into the shuffle index file.

For the sortShuffleWriter and the unsafe shuffle writer, there is only one 
partitionedFile for a shuffleMapTask, so the computation of the digests 
(computing the digest for each partition based on the indexes of this 
partitionedFile) is cheap.

For the bypassShuffleWriter, the number of reduce partitions is less than 
byPassMergeThreshold, so the cost of computing the digests is acceptable.

In the shuffle read phase, the digest value is passed along with the block 
data, and we recompute the digest of the data obtained to compare it with the 
original digest value.
Recomputing the digest of the obtained data only needs an additional buffer 
(2048 bytes) for computing the crc32 value.
After recomputing, we reset the obtained data InputStream: if it is 
markSupported we only need to reset it, otherwise it is a 
FileSegmentManagedBuffer and we need to recreate it.

So, this verification mechanism proposed for shuffle transmitted data is cheap 
and complete.
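
As a rough illustration of the read-side check described above (a sketch only, not the actual patch; the method name is made up):

{code:java}
import java.io.InputStream
import java.util.zip.CRC32

// Recompute a crc32 digest over a block's stream using a small reusable buffer,
// like the additional 2048-byte buffer mentioned above.
def crc32Of(in: InputStream): Long = {
  val crc = new CRC32()
  val buf = new Array[Byte](2048)
  var n = in.read(buf)
  while (n != -1) {
    crc.update(buf, 0, n)
    n = in.read(buf)
  }
  crc.getValue
}
// The returned value would then be compared with the digest stored next to the
// partition offsets in the shuffle index file at write time.
{code}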


  was:
We've seen some shuffle data corruption during shuffle read phase. 

As described in SPARK-26089, spark only checks small  shuffle blocks before PR 
#23543, which is proposed by ankuriitg.

There are two changes/improvements that are made in PR #23543.

1. Large blocks are checked upto maxBytesInFlight/3 size in a similar way as 
smaller blocks, so if a
large block is corrupt in the starting, that block will be re-fetched and if 
that also fails,
FetchFailureException will be thrown.
2. If large blocks are corrupt after size maxBytesInFlight/3, then any 
IOException thrown while
reading the stream will be converted to FetchFailureException. This is slightly 
more aggressive
than was originally intended but since the consumer of the stream may have 
already read some records and processed them, we can't just re-fetch the block, 
we need to fail the whole task. Additionally, we also thought about maybe 
adding a new type of TaskEndReason, which would re-try the task couple of times 
before failing the previous stage, but given the complexity involved in that 
solution we decided to not proceed in that direction.

However, I think there still exists some problems with the current shuffle 
transmitted data verification mechanism:

- For a large block, it is checked upto  maxBytesInFlight/3 size when fetching 
shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, it 
can not be detected in data fetch phase.  This has been described in the 
previous section.
- Only the compressed or wrapped blocks are checked, I think we should also 
check thease blocks which are not wrapped.

We complete the verification mechanism for shuffle transmitted data:

Firstly, we choose crc32 for the checksum verification  of shuffle data.

Crc is also used for checksum verification in hadoop, it is

[jira] [Updated] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data

2019-04-24 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27562:

Description: 
SPARK-27562

We've seen some shuffle data corruption during the shuffle read phase. 

As described in SPARK-26089, Spark only checks small shuffle blocks before PR 
#23543, which was proposed by ankuriitg.

There are two changes/improvements made in PR #23543.

1. Large blocks are checked up to maxBytesInFlight/3 size in a similar way as 
smaller blocks, so if a large block is corrupt at the start, that block will be 
re-fetched, and if that also fails, FetchFailureException will be thrown.
2. If large blocks are corrupt after size maxBytesInFlight/3, then any 
IOException thrown while reading the stream will be converted to 
FetchFailureException. This is slightly more aggressive than was originally 
intended, but since the consumer of the stream may have already read some 
records and processed them, we can't just re-fetch the block, we need to fail 
the whole task. Additionally, we also thought about maybe adding a new type of 
TaskEndReason, which would re-try the task a couple of times before failing the 
previous stage, but given the complexity involved in that solution we decided 
not to proceed in that direction.

However, I think there still exist some problems with the current verification 
mechanism for shuffle transmitted data:

- For a large block, it is checked up to maxBytesInFlight/3 size when fetching 
shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, it 
cannot be detected in the data fetch phase. This has been described in the 
previous section.
- Only the compressed or wrapped blocks are checked; I think we should also 
check those blocks which are not wrapped.

We complete the verification mechanism for shuffle transmitted data as follows.

Firstly, we choose crc32 for the checksum verification of shuffle data. CRC is 
also used for checksum verification in Hadoop; it is simple and fast.

In the shuffle write phase, after completing the partitionedFile, we compute 
the crc32 value for each partition and then write these digests together with 
the indexes into the shuffle index file.

For the sortShuffleWriter and the unsafe shuffle writer, there is only one 
partitionedFile for a shuffleMapTask, so the computation of the digests 
(computing the digest for each partition based on the indexes of this 
partitionedFile) is cheap.

For the bypassShuffleWriter, the number of reduce partitions is less than 
byPassMergeThreshold, so the cost of computing the digests is acceptable.

In the shuffle read phase, the digest value is passed along with the block 
data, and we recompute the digest of the data obtained to compare it with the 
original digest value.
Recomputing the digest of the obtained data only needs an additional buffer 
(2048 bytes) for computing the crc32 value.
After recomputing, we reset the obtained data InputStream: if it is 
markSupported we only need to reset it, otherwise it is a 
FileSegmentManagedBuffer and we need to recreate it.

So, this verification mechanism proposed for shuffle transmitted data is cheap 
and complete.


  was:I will update this lately.


> Complete the verification mechanism for shuffle transmitted data
> 
>
> Key: SPARK-27562
> URL: https://issues.apache.org/jira/browse/SPARK-27562
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: feiwang
>Priority: Major
>
> SPARK-27562
> We've seen some shuffle data corruption during shuffle read phase. 
> As described in SPARK-26089, spark only checks small  shuffle blocks before 
> PR #23543, which is proposed by ankuriitg.
> There are two changes/improvements that are made in PR #23543.
> 1. Large blocks are checked upto maxBytesInFlight/3 size in a similar way as 
> smaller blocks, so if a
> large block is corrupt in the starting, that block will be re-fetched and if 
> that also fails,
> FetchFailureException will be thrown.
> 2. If large blocks are corrupt after size maxBytesInFlight/3, then any 
> IOException thrown while
> reading the stream will be converted to FetchFailureException. This is 
> slightly more aggressive
> than was originally intended but since the consumer of the stream may have 
> already read some records and processed them, we can't just re-fetch the 
> block, we need to fail the whole task. Additionally, we also thought about 
> maybe adding a new type of TaskEndReason, which would re-try the task couple 
> of times before failing the previous stage, but given the complexity involved 
> in that solution we decided to not proceed in that direction.
> However, I think there still exists some problems with the current shuffle 
> transmitted data verification mechanism:
> - For a large block, it is checked upto  maxBytesInF

[jira] [Updated] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data

2019-04-24 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27562:

Description: 
We've seen some shuffle data corruption during the shuffle read phase. 

As described in SPARK-26089, Spark only checks small shuffle blocks before PR 
#23543, which was proposed by ankuriitg.

There are two changes/improvements made in PR #23543.

1. Large blocks are checked up to maxBytesInFlight/3 size in a similar way as 
smaller blocks, so if a large block is corrupt at the start, that block will be 
re-fetched, and if that also fails, FetchFailureException will be thrown.
2. If large blocks are corrupt after size maxBytesInFlight/3, then any 
IOException thrown while reading the stream will be converted to 
FetchFailureException. This is slightly more aggressive than was originally 
intended, but since the consumer of the stream may have already read some 
records and processed them, we can't just re-fetch the block, we need to fail 
the whole task. Additionally, we also thought about maybe adding a new type of 
TaskEndReason, which would re-try the task a couple of times before failing the 
previous stage, but given the complexity involved in that solution we decided 
not to proceed in that direction.

However, I think there still exist some problems with the current verification 
mechanism for shuffle transmitted data:

- For a large block, it is checked up to maxBytesInFlight/3 size when fetching 
shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, it 
cannot be detected in the data fetch phase. This has been described in the 
previous section.
- Only the compressed or wrapped blocks are checked; I think we should also 
check those blocks which are not wrapped.

We complete the verification mechanism for shuffle transmitted data as follows.

Firstly, we choose crc32 for the checksum verification of shuffle data. CRC is 
also used for checksum verification in Hadoop; it is simple and fast.

In the shuffle write phase, after completing the partitionedFile, we compute 
the crc32 value for each partition and then write these digests together with 
the indexes into the shuffle index file.

For the sortShuffleWriter and the unsafe shuffle writer, there is only one 
partitionedFile for a shuffleMapTask, so the computation of the digests 
(computing the digest for each partition based on the indexes of this 
partitionedFile) is cheap.

For the bypassShuffleWriter, the number of reduce partitions is less than 
byPassMergeThreshold, so the cost of computing the digests is acceptable.

In the shuffle read phase, the digest value is passed along with the block 
data, and we recompute the digest of the data obtained to compare it with the 
original digest value.
Recomputing the digest of the obtained data only needs an additional buffer 
(2048 bytes) for computing the crc32 value.
After recomputing, we reset the obtained data InputStream: if it is 
markSupported we only need to reset it, otherwise it is a 
FileSegmentManagedBuffer and we need to recreate it.

So, this verification mechanism proposed for shuffle transmitted data is cheap 
and complete.


  was:
SPARK-27562

We've seen some shuffle data corruption during shuffle read phase. 

As described in SPARK-26089, spark only checks small  shuffle blocks before PR 
#23543, which is proposed by ankuriitg.

There are two changes/improvements that are made in PR #23543.

1. Large blocks are checked upto maxBytesInFlight/3 size in a similar way as 
smaller blocks, so if a
large block is corrupt in the starting, that block will be re-fetched and if 
that also fails,
FetchFailureException will be thrown.
2. If large blocks are corrupt after size maxBytesInFlight/3, then any 
IOException thrown while
reading the stream will be converted to FetchFailureException. This is slightly 
more aggressive
than was originally intended but since the consumer of the stream may have 
already read some records and processed them, we can't just re-fetch the block, 
we need to fail the whole task. Additionally, we also thought about maybe 
adding a new type of TaskEndReason, which would re-try the task couple of times 
before failing the previous stage, but given the complexity involved in that 
solution we decided to not proceed in that direction.

However, I think there still exists some problems with the current shuffle 
transmitted data verification mechanism:

- For a large block, it is checked upto  maxBytesInFlight/3 size when fetching 
shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, it 
can not be detected in data fetch phase.  This has been described in the 
previous section.
- Only the compressed or wrapped blocks are checked, I think we should also 
check thease blocks which are not wrapped.

We complete the verification mechanism for shuffle transmitted data:

Firstly, we choose crc32 for the checksum verification  of shuffle data.

Crc is also used for checksum verification in 

[jira] [Updated] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data

2019-04-24 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27562:

Description: I will update this lately.

> Complete the verification mechanism for shuffle transmitted data
> 
>
> Key: SPARK-27562
> URL: https://issues.apache.org/jira/browse/SPARK-27562
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: feiwang
>Priority: Major
>
> I will update this lately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24192) Invalid Spark URL in local spark session since upgrading from org.apache.spark:spark-sql_2.11:2.2.1 to org.apache.spark:spark-sql_2.11:2.3.0

2019-04-24 Thread Aseem Patni (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825616#comment-16825616
 ] 

Aseem Patni commented on SPARK-24192:
-

export SPARK_LOCAL_HOSTNAME=localhost should also fix this. 
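
An equivalent in-application workaround (a sketch; it assumes the root cause is a local hostname, such as one containing underscores, that cannot be turned into a valid spark:// RPC URL):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("tests")
  // should have the same effect as exporting SPARK_LOCAL_HOSTNAME=localhost
  .config("spark.driver.host", "localhost")
  .getOrCreate()
{code}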

> Invalid Spark URL in local spark session since upgrading from 
> org.apache.spark:spark-sql_2.11:2.2.1 to org.apache.spark:spark-sql_2.11:2.3.0
> 
>
> Key: SPARK-24192
> URL: https://issues.apache.org/jira/browse/SPARK-24192
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Tal Barda
>Priority: Major
>  Labels: HeartbeatReceiver, config, session, spark, spark-conf, 
> spark-session
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Since updating to Spark 2.3.0, tests which are run in my CI (Codeship) fail 
> due to an allegedly invalid Spark URL when creating the (local) Spark 
> context. Here's a log from my _*mvn clean install*_ command execution:
> {quote}{{2018-05-03 13:18:47.668 ERROR 5533 --- [ main] 
> org.apache.spark.SparkContext : Error initializing SparkContext. 
> org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@railsonfire_61eb1c99-232b-49d0-abb5-a1eb9693516b_52bcc09bb48b:44284
>  at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)
>  ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.executor.Executor.(Executor.scala:155) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)
>  ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
>  ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
>  ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.SparkContext.(SparkContext.scala:500) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486) 
> [spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
>  [spark-sql_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
>  [spark-sql_2.11-2.3.0.jar:2.3.0] at scala.Option.getOrElse(Option.scala:121) 
> [scala-library-2.11.8.jar:na] at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) 
> [spark-sql_2.11-2.3.0.jar:2.3.0] at 
> com.planck.spark_data_features_extractors.utils.SparkConfig.sparkSession(SparkConfig.java:43)
>  [classes/:na] at 
> com.planck.spark_data_features_extractors.utils.SparkConfig$$EnhancerBySpringCGLIB$$66dd1f72.CGLIB$sparkSession$0()
>  [classes/:na] at 
> com.planck.spark_data_features_extractors.utils.SparkConfig$$EnhancerBySpringCGLIB$$66dd1f72$$FastClassBySpringCGLIB$$a213b647.invoke()
>  [classes/:na] at 
> org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228) 
> [spring-core-5.0.2.RELEASE.jar:5.0.2.RELEASE] at 
> org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:361)
>  [spring-context-5.0.2.RELEASE.jar:5.0.2.RELEASE] at 
> com.planck.spark_data_features_extractors.utils.SparkConfig$$EnhancerBySpringCGLIB$$66dd1f72.sparkSession()
>  [classes/:na] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[na:1.8.0_171] at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[na:1.8.0_171] at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[na:1.8.0_171] at java.lang.reflect.Method.invoke(Method.java:498) 
> ~[na:1.8.0_171] at 
> org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:154)
>  [spring-beans-5.0.2.RELEASE.jar:5.0.2.RELEASE] at 
> org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:579)
>  [spring-beans-5.0.2.RELEASE.jar:5.0.2.RELEASE] at 
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateUsingFactoryMethod(AbstractAuto

[jira] [Created] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data

2019-04-24 Thread feiwang (JIRA)
feiwang created SPARK-27562:
---

 Summary: Complete the verification mechanism for shuffle 
transmitted data
 Key: SPARK-27562
 URL: https://issues.apache.org/jira/browse/SPARK-27562
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.4.0
Reporter: feiwang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27312) PropertyGraph <-> GraphX conversions

2019-04-24 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-27312:
-

Assignee: Weichen Xu

> PropertyGraph <-> GraphX conversions
> 
>
> Key: SPARK-27312
> URL: https://issues.apache.org/jira/browse/SPARK-27312
> Project: Spark
>  Issue Type: Story
>  Components: Graph, GraphX
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> As a user, I can convert a GraphX graph into a PropertyGraph and a 
> PropertyGraph into a GraphX graph if they are compatible.
> * Scala only
> * Whether this is an internal API is pending design discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27300) Create the new graph projects in Spark and set up build/test

2019-04-24 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-27300:
-

Assignee: Martin Junghanns  (was: Xiangrui Meng)

> Create the new graph projects in Spark and set up build/test
> 
>
> Key: SPARK-27300
> URL: https://issues.apache.org/jira/browse/SPARK-27300
> Project: Spark
>  Issue Type: Story
>  Components: Graph
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Martin Junghanns
>Priority: Major
>
> * Create graph projects in Spark repo that works with both maven and sbt.
> * Add internal and external dependencies.
> * Add a dummy test that automatically runs with PR build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27300) Create the new graph projects in Spark and set up build/test

2019-04-24 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-27300:
-

Assignee: Xiangrui Meng

> Create the new graph projects in Spark and set up build/test
> 
>
> Key: SPARK-27300
> URL: https://issues.apache.org/jira/browse/SPARK-27300
> Project: Spark
>  Issue Type: Story
>  Components: Graph
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> * Create graph projects in Spark repo that works with both maven and sbt.
> * Add internal and external dependencies.
> * Add a dummy test that automatically runs with PR build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27561) Support "lateral column alias references" to allow column aliases to be used within SELECT clauses

2019-04-24 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27561:
---
Description: 
Amazon Redshift has a feature called "lateral column alias references": 
[https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/].
 Quoting from that blogpost:
{quote}The support for lateral column alias reference enables you to write 
queries without repeating the same expressions in the SELECT list. For example, 
you can define the alias 'probability' and use it within the same select 
statement:
{code:java}
select clicks / impressions as probability, round(100 * probability, 1) as 
percentage from raw_data;
{code}
{quote}
There's more information about this feature on 
[https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:]
{quote}The benefit of the lateral alias reference is you don't need to repeat 
the aliased expression when building more complex expressions in the same 
target list. When Amazon Redshift parses this type of reference, it just 
inlines the previously defined aliases. If there is a column with the same name 
defined in the FROM clause as the previously aliased expression, the column in 
the FROM clause takes priority. For example, in the above query if there is a 
column named 'probability' in table raw_data, the 'probability' in the second 
expression in the target list will refer to that column instead of the alias 
name 'probability'.
{quote}
It would be nice if Spark supported this syntax. I don't think that this is 
standard SQL, so it might be a good idea to research if other SQL databases 
support similar syntax (and to see if they implement the same column resolution 
strategy as Redshift).

We should also consider whether this needs to be feature-flagged as part of a 
specific SQL compatibility mode / dialect.

One possibly-related existing ticket: SPARK-9338, which discusses the use of 
SELECT aliases in GROUP BY expressions.

/cc [~hvanhovell]

  was:
Amazon Redshift has a feature called "lateral column alias references: 
[https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/].
 Quoting from that blogpost:
{quote}The support for lateral column alias reference enables you to write 
queries without repeating the same expressions in the SELECT list. For example, 
you can define the alias 'probability' and use it within the same select 
statement:
{code:java}
select clicks / impressions as probability, round(100 * probability, 1) as 
percentage from raw_data;
{code}
{quote}
There's more information about this feature on 
[https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:]
{quote}The benefit of the lateral alias reference is you don't need to repeat 
the aliased expression when building more complex expressions in the same 
target list. When Amazon Redshift parses this type of reference, it just 
inlines the previously defined aliases. If there is a column with the same name 
defined in the FROM clause as the previously aliased expression, the column in 
the FROM clause takes priority. For example, in the above query if there is a 
column named 'probability' in table raw_data, the 'probability' in the second 
expression in the target list will refer to that column instead of the alias 
name 'probability'.
{quote}
It would be nice if Spark supported this syntax. I don't think that this is 
standard SQL, so it might be a good idea to research if other SQL databases 
support similar syntax (and to see if they implement the same column resolution 
strategy as Redshift).

We should also consider whether this needs to be feature-flagged as part of a 
specific SQL compatibility mode / dialect.

One possibly-related existing ticket: SPARK-9338, which discusses the use of 
SELECT aliases in GROUP BY expressions.

/cc [~hvanhovell]


> Support "lateral column alias references" to allow column aliases to be used 
> within SELECT clauses
> --
>
> Key: SPARK-27561
> URL: https://issues.apache.org/jira/browse/SPARK-27561
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> Amazon Redshift has a feature called "lateral column alias references": 
> [https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/].
>  Quoting from that blogpost:
> {quote}The support for lateral column alias reference enables you to write 
> queries without repeating the same expressions in the SELECT list. For 
> example, you can define the alias 'probability' and use it within the same 
> select statement:
> {code:java}
> select c

[jira] [Updated] (SPARK-27561) Support "lateral column alias references" to allow column aliases to be used within SELECT clauses

2019-04-24 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27561:
---
Description: 
Amazon Redshift has a feature called "lateral column alias references: 
[https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/].
 Quoting from that blogpost:
{quote}The support for lateral column alias reference enables you to write 
queries without repeating the same expressions in the SELECT list. For example, 
you can define the alias 'probability' and use it within the same select 
statement:
{code:java}
select clicks / impressions as probability, round(100 * probability, 1) as 
percentage from raw_data;
{code}
{quote}
There's more information about this feature on 
[https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:]
{quote}The benefit of the lateral alias reference is you don't need to repeat 
the aliased expression when building more complex expressions in the same 
target list. When Amazon Redshift parses this type of reference, it just 
inlines the previously defined aliases. If there is a column with the same name 
defined in the FROM clause as the previously aliased expression, the column in 
the FROM clause takes priority. For example, in the above query if there is a 
column named 'probability' in table raw_data, the 'probability' in the second 
expression in the target list will refer to that column instead of the alias 
name 'probability'.
{quote}
It would be nice if Spark supported this syntax. I don't think that this is 
standard SQL, so it might be a good idea to research if other SQL databases 
support similar syntax (and to see if they implement the same column resolution 
strategy as Redshift).

We should also consider whether this needs to be feature-flagged as part of a 
specific SQL compatibility mode / dialect.

One possibly-related existing ticket: SPARK-9338, which discusses the use of 
SELECT aliases in GROUP BY expressions.

/cc [~hvanhovell]

  was:
Amazon Redshift has a feature called "lateral column alias references: 
[https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/].
 Quoting from that blogpost:
{quote}The support for lateral column alias reference enables you to write 
queries without repeating the same expressions in the SELECT list. For example, 
you can define the alias 'probability' and use it within the same select 
statement:
{code:java}
select clicks / impressions as probability, round(100 * probability, 1) as 
percentage from raw_data;
{code}
{quote}
There's more information about this feature on 
[https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:]
{quote}The benefit of the lateral alias reference is you don't need to repeat 
the aliased expression when building more complex expressions in the same 
target list. When Amazon Redshift parses this type of reference, it just 
inlines the previously defined aliases. If there is a column with the same name 
defined in the FROM clause as the previously aliased expression, the column in 
the FROM clause takes priority. For example, in the above query if there is a 
column named 'probability' in table raw_data, the 'probability' in the second 
expression in the target list will refer to that column instead of the alias 
name 'probability'.
{quote}
It would be nice if Spark supported this syntax. I don't think that this is 
standard SQL, so it might be a good idea to research if other SQL databases 
support similar syntax (and to see if they implement the same column resolution 
strategy as Redshift).

One possibly-related existing ticket: SPARK-9338, which discusses the use of 
SELECT aliases in GROUP BY expressions.

/cc [~hvanhovell]


> Support "lateral column alias references" to allow column aliases to be used 
> within SELECT clauses
> --
>
> Key: SPARK-27561
> URL: https://issues.apache.org/jira/browse/SPARK-27561
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> Amazon Redshift has a feature called "lateral column alias references: 
> [https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/].
>  Quoting from that blogpost:
> {quote}The support for lateral column alias reference enables you to write 
> queries without repeating the same expressions in the SELECT list. For 
> example, you can define the alias 'probability' and use it within the same 
> select statement:
> {code:java}
> select clicks / impressions as probability, round(100 * probability, 1) as 
> percentage from raw_data;
> {code}
> {quote}
> There's 

[jira] [Created] (SPARK-27561) Support "lateral column alias references" to allow column aliases to be used within SELECT clauses

2019-04-24 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-27561:
--

 Summary: Support "lateral column alias references" to allow column 
aliases to be used within SELECT clauses
 Key: SPARK-27561
 URL: https://issues.apache.org/jira/browse/SPARK-27561
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: Josh Rosen


Amazon Redshift has a feature called "lateral column alias references: 
[https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/].
 Quoting from that blogpost:
{quote}The support for lateral column alias reference enables you to write 
queries without repeating the same expressions in the SELECT list. For example, 
you can define the alias 'probability' and use it within the same select 
statement:
{code:java}
select clicks / impressions as probability, round(100 * probability, 1) as 
percentage from raw_data;
{code}
{quote}
There's more information about this feature on 
[https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:]
{quote}The benefit of the lateral alias reference is you don't need to repeat 
the aliased expression when building more complex expressions in the same 
target list. When Amazon Redshift parses this type of reference, it just 
inlines the previously defined aliases. If there is a column with the same name 
defined in the FROM clause as the previously aliased expression, the column in 
the FROM clause takes priority. For example, in the above query if there is a 
column named 'probability' in table raw_data, the 'probability' in the second 
expression in the target list will refer to that column instead of the alias 
name 'probability'.
{quote}
It would be nice if Spark supported this syntax. I don't think that this is 
standard SQL, so it might be a good idea to research if other SQL databases 
support similar syntax (and to see if they implement the same column resolution 
strategy as Redshift).

One possibly-related existing ticket: SPARK-9338, which discusses the use of 
SELECT aliases in GROUP BY expressions.

/cc [~hvanhovell]
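
For comparison, the same query in Spark today has to repeat the aliased expression (a sketch; raw_data and its columns are the hypothetical table from the Redshift example above):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
spark.sql("""
  SELECT clicks / impressions AS probability,
         round(100 * (clicks / impressions), 1) AS percentage  -- expression repeated
  FROM raw_data
""")
{code}

With lateral column alias references, the second expression could instead be written as round(100 * probability, 1).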



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27542) SparkHadoopWriter doesn't set call setWorkOutputPath, causing NPEs when using certain legacy OutputFormats

2019-04-24 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825560#comment-16825560
 ] 

Josh Rosen commented on SPARK-27542:


[~shivuson...@gmail.com], unfortunately I don't have a self-contained example 
which I can share right now, but the following skeleton is a rough starting 
point:
{code:java}
import org.apache.spark.rdd.RDD
import org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat

val data: RDD[(Void, org.apache.parquet.example.data.Group)] = ???

data.saveAsHadoopFile(
  path = "output",
  keyClass = classOf[Void],
  valueClass = classOf[org.apache.parquet.example.data.Group],
  outputFormatClass = 
classOf[DeprecatedParquetOutputFormat[org.apache.parquet.example.data.Group]]
)
{code}

> SparkHadoopWriter doesn't set call setWorkOutputPath, causing NPEs when using 
> certain legacy OutputFormats
> --
>
> Key: SPARK-27542
> URL: https://issues.apache.org/jira/browse/SPARK-27542
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> In Hadoop MapReduce, tasks call {{FileOutputFormat.setWorkOutputPath()}} 
> after configuring the  output committer: 
> [https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L611]
>  
> Spark doesn't do this: 
> [https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L115]
> As a result, certain legacy output formats can fail to work out-of-the-box on 
> Spark. In particular, 
> {{org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat}} can fail 
> with NullPointerExceptions, e.g.
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.Path.(Path.java:105)
>   at org.apache.hadoop.fs.Path.(Path.java:94)
>   at 
> org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69)
> [...]
>   at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:96)
> {code}
> It looks like someone on GitHub has hit the same problem: 
> https://gist.github.com/themodernlife/e3b07c23ba978f6cc98b73e3f3609abe
> Tez had a very similar bug: https://issues.apache.org/jira/browse/TEZ-3348
> We might be able to fix this by having Spark mimic Hadoop's logic. I'm unsure 
> of whether that change would pose compatibility risks for other existing 
> workloads, though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle

2019-04-24 Thread Adrian Muraru (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Muraru updated SPARK-27530:
--
Description: 
I'm getting this in a large shuffle:
{code:java}
org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer for 
block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, 
None) (expectedApproxSize = 33708, isNetworkReqDone=false)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Received a zero-size buffer for block 
shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, None) 
(expectedApproxSize = 33708, isNetworkReqDone=false){code}
 

  was:
I'm getting this in a large shuffle:

org.apache.spark.shuffle.FetchFailedException: Received a zero-size buffer for 
block shuffle_2_9167_1861 from BlockManagerId(2665, ip-172-25-44-74, 39439, 
None) (expectedApproxSize = 33708, isNetworkReqDone=false)
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:438)
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
 at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
 at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
 at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90)
 at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

[jira] [Created] (SPARK-27560) HashPartitioner uses Object.hashCode which is not seeded

2019-04-24 Thread Andrew McHarg (JIRA)
Andrew McHarg created SPARK-27560:
-

 Summary: HashPartitioner uses Object.hashCode which is not seeded
 Key: SPARK-27560
 URL: https://issues.apache.org/jira/browse/SPARK-27560
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.4.0
 Environment: Notebook is running spark v2.4.0 local[*]

Python 3.6.6 (default, Sep  6 2018, 13:10:03)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin

I imagine this would reproduce on all operating systems and most versions of 
spark though.
Reporter: Andrew McHarg


Forgive the quality of the bug report here, I am a pyspark user and not super 
familiar with the internals of spark, yet it seems I have a strange corner case 
with the HashPartitioner.

This may already be known, but repartition with HashPartitioner seems to assign 
everything to the same partition if data that was partitioned by the same 
column is only partially read (say, one partition). I suppose it is an obvious 
consequence of Object.hashCode being deterministic, but it took some time to 
track down.

Steps to repro:
 # Get dataframe with a bunch of uuids say 1
 # repartition(100, 'uuid_column')
 # save to parquet
 # read from parquet
 # collect()[:100] then filter using pyspark.sql.functions isin (yes I know 
this is bad and sampleBy should probably be used here)
 # repartition(10, 'uuid_column')
 # Resulting dataframe will have all of its data in one single partition

Jupyter notebook for the above: 
https://gist.github.com/robo-hamburger/4752a40cb643318464e58ab66cf7d23e

I think an easy fix would be to seed the HashPartitioner, like many hashtable 
libraries do to avoid denial-of-service attacks. It also might be the case that 
this is obvious behavior for more experienced Spark users :)
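
One way to see why the collapse happens (a sketch with a stand-in hash function; only determinism and the fact that 10 divides 100 matter, not the exact hash Spark uses):

{code:java}
import java.util.UUID

val uuids = (1 to 100000).map(_ => UUID.randomUUID().toString)
val hash: String => Int = _.hashCode   // deterministic stand-in for the real hash

// Rows that landed in partition 7 of 100 all satisfy hash % 100 == 7 ...
val onePartition = uuids.filter(u => Math.floorMod(hash(u), 100) == 7)

// ... and since 10 divides 100, they all share the same partition of 10.
val partitionsOf10 = onePartition.map(u => Math.floorMod(hash(u), 10)).toSet
println(partitionsOf10)   // Set(7)
{code}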



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27559) Nullable in a given schema is not respected when reading from parquet

2019-04-24 Thread colin fang (JIRA)
colin fang created SPARK-27559:
--

 Summary: Nullable in a given schema is not respected when reading 
from parquet
 Key: SPARK-27559
 URL: https://issues.apache.org/jira/browse/SPARK-27559
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.2
Reporter: colin fang


Even if I specify a schema when reading from parquet, nullable is not reset.

{code:java}
spark.range(10, numPartitions=1).write.mode('overwrite').parquet('tmp')
df1 = spark.read.parquet('tmp')
df1.printSchema()
# root
#  |-- id: long (nullable = true)
df2 = spark.read.schema(StructType([StructField('id', LongType(), 
False)])).parquet('tmp')
df2.printSchema()
# root
#  |-- id: long (nullable = true)
{code}
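
A possible workaround until this is addressed (a sketch, assuming the goal is just to end up with the non-nullable schema): re-apply the schema after reading, which does keep the nullable flags as given.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()
val schema = StructType(Seq(StructField("id", LongType, nullable = false)))

// createDataFrame trusts the supplied schema, including its nullability
val df3 = spark.createDataFrame(spark.read.parquet("tmp").rdd, schema)
df3.printSchema()
// root
//  |-- id: long (nullable = false)
{code}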



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27558) NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang

2019-04-24 Thread Alessandro Bellina (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Bellina updated SPARK-27558:
---
Description: 
We see an NPE in the UnsafeInMemorySorter.getMemoryUsage function (due to the 
array we are accessing there being null). This looks to be caused by a Spark 
OOM when UnsafeInMemorySorter is trying to spill.

This is likely a symptom of https://issues.apache.org/jira/browse/SPARK-21492. 
The real question for this ticket is: could we handle things more gracefully, 
rather than throwing an NPE? For example:

Remove this:

https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L182

so when this fails (and store the new array into a temporary):

https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L186

we don't end up with a null "array". This state is causing one of our jobs to 
hang infinitely (we think) due to the original allocation error.
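
The general shape of that proposal, illustrated outside of Spark (the names here are made up, not the actual UnsafeInMemorySorter fields), is to swap the buffer in only after the allocation has succeeded:

{code:java}
class PointerArrayHolder(initialSize: Int) {
  private var array: Array[Long] = new Array[Long](initialSize)

  def reset(): Unit = {
    // may throw an OutOfMemoryError; if it does, `array` still points at the
    // old, valid buffer instead of being left null
    val fresh = new Array[Long](initialSize)
    array = fresh
  }

  // callers such as a getMemoryUsage() equivalent never observe a null array
  def memoryUsage: Long = array.length.toLong * 8
}
{code}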

Stack trace for reference

{noformat}
2019-04-23 08:57:14,989 [Executor task launch worker for task 46729] ERROR 
org.apache.spark.TaskContextImpl  - Error in TaskCompletionListener
java.lang.NullPointerException
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getMemoryUsage(UnsafeInMemorySorter.java:208)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getMemoryUsage(UnsafeExternalSorter.java:249)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.updatePeakMemoryUsed(UnsafeExternalSorter.java:253)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.freeMemory(UnsafeExternalSorter.java:296)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:328)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.lambda$new$0(UnsafeExternalSorter.java:178)
at 
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118)
at 
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118)
at 
org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:131)
at 
org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:129)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:129)
at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
at org.apache.spark.scheduler.Task.run(Task.scala:119)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


2019-04-23 08:57:15,069 [Executor task launch worker for task 46729] ERROR 
org.apache.spark.executor.Executor  - Exception in task 102.0 in stage 28.0 
(TID 46729)
org.apache.spark.util.TaskCompletionListenerException: null

Previous exception in task: Unable to acquire 65536 bytes of memory, got 0
org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)

org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)

org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:186)

org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:229)

org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:204)

org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:283)

org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96)

org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:348)

org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:403)

org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:135)

org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage23.sort_addToSorter_0$(Unknown
 Source)

org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage23.processNext(Unknown
 Source)

org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

org.apache.spark.sql.execution.WholeStageCodegenExec$$

[jira] [Created] (SPARK-27558) NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang

2019-04-24 Thread Alessandro Bellina (JIRA)
Alessandro Bellina created SPARK-27558:
--

 Summary: NPE in TaskCompletionListener due to Spark OOM in 
UnsafeExternalSorter causing tasks to hang
 Key: SPARK-27558
 URL: https://issues.apache.org/jira/browse/SPARK-27558
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.2, 2.3.3
Reporter: Alessandro Bellina


We see an NPE in the UnsafeInMemorySorter.getMemoryUsage function (due to the 
array we are accessing there being null). This looks to be caused by a Spark 
OOM when UnsafeInMemorySorter is trying to spill.

This is likely a symptom of https://issues.apache.org/jira/browse/SPARK-21492. 
The real question for this ticket is whether we could handle this more gracefully rather than throwing an NPE. For example:

allocate the new array into a temporary here:

https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L182

so when this fails:

https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L186

we don't end up with a null "array". This state is causing one of our jobs to 
hang indefinitely (we think) due to the original allocation error.

Stack trace for reference

{noformat}
2019-04-23 08:57:14,989 [Executor task launch worker for task 46729] ERROR 
org.apache.spark.TaskContextImpl  - Error in TaskCompletionListener
java.lang.NullPointerException
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getMemoryUsage(UnsafeInMemorySorter.java:208)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getMemoryUsage(UnsafeExternalSorter.java:249)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.updatePeakMemoryUsed(UnsafeExternalSorter.java:253)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.freeMemory(UnsafeExternalSorter.java:296)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:328)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.lambda$new$0(UnsafeExternalSorter.java:178)
at 
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118)
at 
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118)
at 
org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:131)
at 
org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:129)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:129)
at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
at org.apache.spark.scheduler.Task.run(Task.scala:119)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


2019-04-23 08:57:15,069 [Executor task launch worker for task 46729] ERROR 
org.apache.spark.executor.Executor  - Exception in task 102.0 in stage 28.0 
(TID 46729)
org.apache.spark.util.TaskCompletionListenerException: null

Previous exception in task: Unable to acquire 65536 bytes of memory, got 0
org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)

org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)

org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:186)

org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:229)

org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:204)

org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:283)

org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96)

org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:348)

org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:403)

org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:135)

org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage23.sort_addToSorter_0$(Unknown
 Source)

org.apache.spark.sql.catalyst.expressions.Genera

[jira] [Commented] (SPARK-25888) Service requests for persist() blocks via external service after dynamic deallocation

2019-04-24 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825248#comment-16825248
 ] 

Attila Zsolt Piros commented on SPARK-25888:


I am working on this.

> Service requests for persist() blocks via external service after dynamic 
> deallocation
> -
>
> Key: SPARK-25888
> URL: https://issues.apache.org/jira/browse/SPARK-25888
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle, YARN
>Affects Versions: 2.3.2
>Reporter: Adam Kennedy
>Priority: Major
>
> Large and highly multi-tenant Spark on YARN clusters with diverse job 
> execution often display terrible utilization rates (we have observed as low 
> as 3-7% CPU at max container allocation, but 50% CPU utilization on even a 
> well policed cluster is not uncommon).
> As a sizing example, consider a scenario with 1,000 nodes, 50,000 cores, 250 
> users and 50,000 runs of 1,000 distinct applications per week, with 
> predominantly Spark including a mixture of ETL, Ad Hoc tasks and PySpark 
> Notebook jobs (no streaming)
> Utilization problems appear to be due in large part to difficulties with 
> persist() blocks (DISK or DISK+MEMORY) preventing dynamic deallocation.
> In situations where an external shuffle service is present (which is typical 
> on clusters of this type) we already solve this for the shuffle block case by 
> offloading the IO handling of shuffle blocks to the external service, 
> allowing dynamic deallocation to proceed.
> Allowing Executors to transfer persist() blocks to some external "shuffle" 
> service in a similar manner would be an enormous win for Spark multi-tenancy 
> as it would limit deallocation blocking scenarios to only MEMORY-only cache() 
> scenarios.
> I'm not sure if I'm correct, but I seem to recall from the original external 
> shuffle service commits that this may have been considered at the time, but 
> getting shuffle blocks moved to the external shuffle service was the first 
> priority.
> With support for external persist() DISK blocks in place, we could also then 
> handle deallocation of DISK+MEMORY, as the memory instance could first be 
> dropped, changing the block to DISK only, and then further transferred to the 
> shuffle service.
> We have tried to resolve the persist() issue via extensive user training, but 
> that has typically only allowed us to improve utilization of the worst 
> offenders (10% utilization) up to around 40-60% utilization, as the need for 
> persist() is often legitimate and occurs during the middle stages of a job.
> In a healthy multi-tenant scenario, a large job might spool up to say 10,000 
> cores, persist() data, release executors across a long tail down to 100 
> cores, and then spool back up to 10,000 cores for the following stage without 
> impact on the persist() data.
> In an ideal world, if a new executor started up on a node on which blocks 
> had been transferred to the shuffle service, the new executor might even be 
> able to "recapture" control of those blocks (if that would help with 
> performance in some way).
> And the behavior of gradually expanding up and down several times over the 
> course of a job would not just improve utilization, but would allow resources 
> to more easily be redistributed to other jobs which start on the cluster 
> during the long-tail periods, which would improve multi-tenancy and bring us 
> closer to optimal "envy free" YARN scheduling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27557) Add copybutton to spark Python API docs for easier copying of code-blocks

2019-04-24 Thread Sangram G (JIRA)
Sangram G created SPARK-27557:
-

 Summary: Add copybutton to spark Python API docs for easier 
copying of code-blocks
 Key: SPARK-27557
 URL: https://issues.apache.org/jira/browse/SPARK-27557
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Affects Versions: 2.4.2
Reporter: Sangram G


Add a non-intrusive button to the Python API documentation that removes the 
">>>" prompts and code outputs, so that code blocks are easier to copy.

 For example, the code snippet below in the documentation is difficult to copy 
because of the ">>>" prompts:

   >>> l = [('Alice', 1)]

   >>> spark.createDataFrame(l).collect()

    [Row(_1='Alice', _2=1)]


After the copy button in the corner of the code block is pressed, it becomes 
the following, which is easier to copy:


l = [('Alice', 1)]

spark.createDataFrame(l).collect()

Sample Screenshot for reference: 
[https://screenshots.firefox.com/ZfBXZrpINKdt9QZg/www269.lunapic.com]

 

This can be done simply by adding a copybutton.js script to the 
python/docs/_static folder and loading it at setup time from 
python/docs/conf.py.
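
A minimal sketch of the conf.py hook, assuming a copybutton.js has already been dropped into python/docs/_static (the exact Sphinx call depends on the Sphinx version used by the docs build):

{code:python}
# python/docs/conf.py -- hypothetical sketch, not the actual patch.
# Assumes python/docs/_static/copybutton.js exists.

def setup(app):
    # Sphinx >= 1.8 prefers add_js_file; older releases only have add_javascript.
    if hasattr(app, 'add_js_file'):
        app.add_js_file('copybutton.js')
    else:
        app.add_javascript('copybutton.js')
{code}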



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27550) Fix `test-dependencies.sh` not to use `kafka-0-8` profile for Scala-2.12

2019-04-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27550.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.4.3

This is resolved via https://github.com/apache/spark/pull/24445

> Fix `test-dependencies.sh` not to use `kafka-0-8` profile for Scala-2.12
> 
>
> Key: SPARK-27550
> URL: https://issues.apache.org/jira/browse/SPARK-27550
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.3
>
>
> Since SPARK-27274 deprecated Scala-2.11 at Spark 2.4.1, we need to test 
> Scala-2.12 more.
> Kafka 0.8 doesn't have Scala-2.12 artifacts, e.g., 
> `org.apache.kafka:kafka_2.12:jar:0.8.2.1`. This issue aims to fix the 
> `test-dependencies.sh` script so that it understands the Scala binary version.
> {code:java}
> $ dev/change-scala-version.sh 2.12
> $ dev/test-dependencies.sh
> Using `mvn` from path: /usr/local/bin/mvn
> Using `mvn` from path: /usr/local/bin/mvn
> Performing Maven install for hadoop-2.6
> Using `mvn` from path: /usr/local/bin/mvn
> [ERROR] Failed to execute goal on project spark-streaming-kafka-0-8_2.12: 
> Could not resolve dependencies for project 
> org.apache.spark:spark-streaming-kafka-0-8_2.12:jar:spark-335572: Could not 
> find artifact org.apache.kafka:kafka_2.12:jar:0.8.2.1 in central 
> (https://repo.maven.apache.org/maven2) -> [Help 1]
> {code}
> Please note that this issue doesn't aim to update `spark-deps-hadoop-2.6` 
> manifest here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27556) Exclude com.zaxxer:HikariCP-java7 from hadoop-yarn-server-web-proxy

2019-04-24 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27556:

Description: 
{noformat}
[INFO] | +- org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:3.2.0:compile
[INFO] | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:3.2.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-yarn-registry:jar:3.2.0:compile
[INFO] | | | \- commons-daemon:commons-daemon:jar:1.0.13:compile
[INFO] | | +- 
org.apache.geronimo.specs:geronimo-jcache_1.0_spec:jar:1.0-alpha-1:compile
[INFO] | | +- org.ehcache:ehcache:jar:3.3.1:compile
[INFO] | | +- com.zaxxer:HikariCP-java7:jar:2.4.12:compile{noformat}
{noformat}
[INFO] +- org.apache.hive:hive-metastore:jar:2.3.4:compile
[INFO] | +- javolution:javolution:jar:5.5.1:compile
[INFO] | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] | +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile
[INFO] | +- com.zaxxer:HikariCP:jar:2.5.1:compile{noformat}

  was:
{noformat}
[INFO] | +- org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:3.2.0:compile
[INFO] | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:3.2.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-yarn-registry:jar:3.2.0:compile
[INFO] | | | \- commons-daemon:commons-daemon:jar:1.0.13:compile
[INFO] | | +- 
org.apache.geronimo.specs:geronimo-jcache_1.0_spec:jar:1.0-alpha-1:compile
[INFO] | | +- org.ehcache:ehcache:jar:3.3.1:compile
[INFO] | | +- com.zaxxer:HikariCP-java7:jar:2.4.12:compile
[INFO] | | \- 
com.microsoft.sqlserver:mssql-jdbc:jar:6.2.1.jre7:runtime{noformat}
{noformat}
[INFO] +- org.apache.hive:hive-metastore:jar:2.3.4:compile
[INFO] | +- javolution:javolution:jar:5.5.1:compile
[INFO] | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] | +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile
[INFO] | +- com.zaxxer:HikariCP:jar:2.5.1:compile{noformat}


> Exclude com.zaxxer:HikariCP-java7 from hadoop-yarn-server-web-proxy
> ---
>
> Key: SPARK-27556
> URL: https://issues.apache.org/jira/browse/SPARK-27556
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> [INFO] | +- org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:3.2.0:compile
> [INFO] | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:3.2.0:compile
> [INFO] | | +- org.apache.hadoop:hadoop-yarn-registry:jar:3.2.0:compile
> [INFO] | | | \- commons-daemon:commons-daemon:jar:1.0.13:compile
> [INFO] | | +- 
> org.apache.geronimo.specs:geronimo-jcache_1.0_spec:jar:1.0-alpha-1:compile
> [INFO] | | +- org.ehcache:ehcache:jar:3.3.1:compile
> [INFO] | | +- com.zaxxer:HikariCP-java7:jar:2.4.12:compile{noformat}
> {noformat}
> [INFO] +- org.apache.hive:hive-metastore:jar:2.3.4:compile
> [INFO] | +- javolution:javolution:jar:5.5.1:compile
> [INFO] | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
> [INFO] | +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile
> [INFO] | +- com.zaxxer:HikariCP:jar:2.5.1:compile{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2019-04-24 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825134#comment-16825134
 ] 

Udbhav Agrawal commented on SPARK-25262:


Thanks [~rvesse] i will try to work on this

> Make Spark local dir volumes configurable with Spark on Kubernetes
> --
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434 while 
> providing pod templates will provide more in-depth customisation for Spark on 
> Kubernetes there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27556) Exclude com.zaxxer:HikariCP-java7 from hadoop-yarn-server-web-proxy

2019-04-24 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27556:
---

 Summary: Exclude com.zaxxer:HikariCP-java7 from 
hadoop-yarn-server-web-proxy
 Key: SPARK-27556
 URL: https://issues.apache.org/jira/browse/SPARK-27556
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.0.0
Reporter: Yuming Wang


{noformat}
[INFO] | +- org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:3.2.0:compile
[INFO] | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:3.2.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-yarn-registry:jar:3.2.0:compile
[INFO] | | | \- commons-daemon:commons-daemon:jar:1.0.13:compile
[INFO] | | +- 
org.apache.geronimo.specs:geronimo-jcache_1.0_spec:jar:1.0-alpha-1:compile
[INFO] | | +- org.ehcache:ehcache:jar:3.3.1:compile
[INFO] | | +- com.zaxxer:HikariCP-java7:jar:2.4.12:compile
[INFO] | | \- 
com.microsoft.sqlserver:mssql-jdbc:jar:6.2.1.jre7:runtime{noformat}
{noformat}
[INFO] +- org.apache.hive:hive-metastore:jar:2.3.4:compile
[INFO] | +- javolution:javolution:jar:5.5.1:compile
[INFO] | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] | +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile
[INFO] | +- com.zaxxer:HikariCP:jar:2.5.1:compile{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2019-04-24 Thread Rob Vesse (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825132#comment-16825132
 ] 

Rob Vesse commented on SPARK-25262:
---

[~Udbhav Agrawal] Yes I think an approach like that would be acceptable to the 
community (and if not then I don't know what will be).  If you want to take a 
stab at doing this please feel free

> Make Spark local dir volumes configurable with Spark on Kubernetes
> --
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434 while 
> providing pod templates will provide more in-depth customisation for Spark on 
> Kubernetes there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2019-04-24 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824993#comment-16824993
 ] 

Udbhav Agrawal commented on SPARK-25262:


[~rvesse] for the second part, can we move {{LocalDirsFeatureStep}} out of the 
base features and provide a new configuration, something like 
{{spark.kubernetes.emptyDir.disable}}, and handle it in 
{{KubernetesDriverBuilder.scala}}? Then the user would have the flexibility to 
back the local directories with a volume type other than {{emptyDir}} through 
pod templates.

> Make Spark local dir volumes configurable with Spark on Kubernetes
> --
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434 while 
> providing pod templates will provide more in-depth customisation for Spark on 
> Kubernetes there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27354) Move incompatible code from the hive-thriftserver module to sql/hive-thriftserver/v1.2.1

2019-04-24 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27354:

Summary: Move incompatible code from the hive-thriftserver module to 
sql/hive-thriftserver/v1.2.1  (was: Add a new empty hive-thriftserver module 
for Hive 2.3.4 )

> Move incompatible code from the hive-thriftserver module to 
> sql/hive-thriftserver/v1.2.1
> 
>
> Key: SPARK-27354
> URL: https://issues.apache.org/jira/browse/SPARK-27354
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> After upgrading the built-in Hive to 2.3.4, the current 
> {{hive-thriftserver}} module is no longer compatible, due to Hive changes such as:
>  # HIVE-12442 HiveServer2: Refactor/repackage HiveServer2's Thrift code so 
> that it can be used in the tasks
>  # HIVE-12237 Use slf4j as logging facade
>  # HIVE-13169 HiveServer2: Support delegation token based connection when 
> using http transport
> So we should add a new {{hive-thriftserver}} module for Hive 2.3.4:
> 1. Add a new empty module for Hive 2.3.4 named {{hive-thriftserverV2}}.
> 2. Activate {{hive-thriftserver}} only when testing with hadoop-2.7.
> 3. Activate {{hive-thriftserverV2}} only when testing with hadoop-3.2.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-24 Thread Hui WANG (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui WANG updated SPARK-27555:
-
Description: 
I have already seen https://issues.apache.org/jira/browse/SPARK-17620 

and https://issues.apache.org/jira/browse/SPARK-18397,

and I checked the Spark source code: with "spark.sql.hive.convertCTAS=true" set, 
Spark will use "spark.sql.sources.default", which is parquet, as the storage 
format in the "create table as select" scenario.

But my case is just create table, without a select. When I set 
hive.default.fileformat=parquet in hive-site.xml, or set 
spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, and then 
create a table, the Hive table still uses the textfile fileformat.

It seems HiveSerDe gets the value of the hive.default.fileformat parameter from 
SQLConf.

The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
SparkSession initialization, while the configuration parameters in 
hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
SharedState, and all configs with the "spark.hadoop" prefix are set into the 
Hadoop configuration, so the setting does not take effect.

  was:
I already seen https://issues.apache.org/jira/browse/SPARK-17620 

and https://issues.apache.org/jira/browse/SPARK-18397

and I check source code of Spark for the change of  set 
"spark.sql.hive.covertCTAS=true" and then spark will use 
"spark.sql.sources.default" which is parquet as storage format in "create table 
as select" scenario.

But my case is just create table without select. When I set  
hive.default.fileformat=parquet in hive-site.xml or set  
spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after 
create a table, when i check the hive table, it still use textfile fileformat.

 

HiveSerDe gets the value of the hive.default.fileformat parameter from SQLConf

The parameter values in SQLConf are copied from SparkContext's SparkConf at 
SparkSession initialization, while the configuration parameters in 
hive-site.xml are loaded into SparkContext's hadoopConfiguration parameters by 
SharedState, And all the config with "spark.hadoop" conf are setted to 
hadoopconfig, so the configuration does not take effect.


> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Priority: Major
>
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397,
> and I checked the Spark source code: with "spark.sql.hive.convertCTAS=true" 
> set, Spark will use "spark.sql.sources.default", which is parquet, as the 
> storage format in the "create table as select" scenario.
> But my case is just create table, without a select. When I set 
> hive.default.fileformat=parquet in hive-site.xml, or set 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, and then 
> create a table, the Hive table still uses the textfile fileformat.
>  
> It seems HiveSerDe gets the value of the hive.default.fileformat parameter 
> from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf 
> at SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all configs with the "spark.hadoop" prefix are set into the 
> Hadoop configuration, so the setting does not take effect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-24 Thread Hui WANG (JIRA)
Hui WANG created SPARK-27555:


 Summary: cannot create table by using the hive default fileformat 
in both hive-site.xml and spark-defaults.conf
 Key: SPARK-27555
 URL: https://issues.apache.org/jira/browse/SPARK-27555
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: Hui WANG


I have already seen https://issues.apache.org/jira/browse/SPARK-17620 

and https://issues.apache.org/jira/browse/SPARK-18397,

and I checked the Spark source code: with "spark.sql.hive.convertCTAS=true" set, 
Spark will use "spark.sql.sources.default", which is parquet, as the storage 
format in the "create table as select" scenario.

But my case is just create table, without a select. When I set 
hive.default.fileformat=parquet in hive-site.xml, or set 
spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, and then 
create a table, the Hive table still uses the textfile fileformat.

HiveSerDe gets the value of the hive.default.fileformat parameter from SQLConf.

The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
SparkSession initialization, while the configuration parameters in 
hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
SharedState, and all configs with the "spark.hadoop" prefix are set into the 
Hadoop configuration, so the setting does not take effect.
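
A possible workaround sketch (my own untested suggestion, not from this report): since HiveSerDe reads hive.default.fileformat from SQLConf rather than from the Hadoop configuration, setting the key without the spark.hadoop prefix may take effect:

{code:python}
# Hypothetical workaround sketch (untested): put hive.default.fileformat where
# SQLConf can see it, instead of prefixing it with spark.hadoop (which only
# reaches hadoopConfiguration).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("hive.default.fileformat", "parquet")   # no spark.hadoop prefix
         .enableHiveSupport()
         .getOrCreate())

# or, for an already-running session:
spark.conf.set("hive.default.fileformat", "parquet")

spark.sql("CREATE TABLE t_default_fmt (id INT)")
spark.sql("DESCRIBE FORMATTED t_default_fmt").show(truncate=False)
{code}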



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27512) Decimal parsing leads to unexpected type inference

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27512:


Assignee: Hyukjin Kwon

> Decimal parsing leads to unexpected type inference
> --
>
> Key: SPARK-27512
> URL: https://issues.apache.org/jira/browse/SPARK-27512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark 3.0.0-SNAPSHOT from this commit:
> {code:bash}
> commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed
> Author: Dilip Biswal 
> Date:   Mon Apr 15 21:26:45 2019 +0800
> {code}
>Reporter: koert kuipers
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> {code:bash}
> $ hadoop fs -text test.bsv
> x|y
> 1|1,2
> 2|2,3
> 3|3,4
> {code}
> in spark 2.4.1:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: string (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1|1,2|
> |  2|2,3|
> |  3|3,4|
> +---+---+
> {code}
> in spark 3.0.0-SNAPSHOT:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: decimal(2,0) (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1| 12|
> |  2| 23|
> |  3| 34|
> +---+---+
> {code}
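
As a hedged mitigation sketch (my own, untested, and not part of the report): supplying an explicit schema instead of inferSchema keeps the comma-containing column a string regardless of how inference behaves, e.g. in PySpark:

{code:python}
# Hypothetical mitigation sketch: pin the types explicitly so y stays a string.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("x", IntegerType(), True),
    StructField("y", StringType(), True),
])
data = (spark.read.format("csv")
        .option("header", True)
        .option("delimiter", "|")
        .schema(schema)
        .load("test.bsv"))
data.printSchema()
{code}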



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27512) Decimal parsing leads to unexpected type inference

2019-04-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27512.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24437
[https://github.com/apache/spark/pull/24437]

> Decimal parsing leads to unexpected type inference
> --
>
> Key: SPARK-27512
> URL: https://issues.apache.org/jira/browse/SPARK-27512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark 3.0.0-SNAPSHOT from this commit:
> {code:bash}
> commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed
> Author: Dilip Biswal 
> Date:   Mon Apr 15 21:26:45 2019 +0800
> {code}
>Reporter: koert kuipers
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code:bash}
> $ hadoop fs -text test.bsv
> x|y
> 1|1,2
> 2|2,3
> 3|3,4
> {code}
> in spark 2.4.1:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: string (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1|1,2|
> |  2|2,3|
> |  3|3,4|
> +---+---+
> {code}
> in spark 3.0.0-SNAPSHOT:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: decimal(2,0) (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1| 12|
> |  2| 23|
> |  3| 34|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org