[jira] [Resolved] (SPARK-27514) Empty window expression results in error in optimizer

2019-04-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27514.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24411
[https://github.com/apache/spark/pull/24411]

> Empty window expression results in error in optimizer
> -
>
> Key: SPARK-27514
> URL: https://issues.apache.org/jira/browse/SPARK-27514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yifei Huang
>Assignee: Yifei Huang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the optimizer will break on the following code:
> {code:java}
> import java.util
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.sum
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
> val schema = StructType(Seq(
>   StructField("colA", StringType, true),
>   StructField("colB", IntegerType, true)
> ))
> var df = sqlContext.sparkSession.createDataFrame(new util.ArrayList[Row](), schema)
> val w = Window.partitionBy("colA")
> df = df.withColumn("col1", sum("colB").over(w))
> df = df.withColumn("col3", sum("colB").over(w))
> df = df.withColumn("col4", sum("col3").over(w))
> df = df.withColumn("col2", sum("col1").over(w))
> df = df.select("col2")
> df.explain(true)
> {code}
> with the following stacktrace:
> {code:java}
> next on empty iterator
> java.util.NoSuchElementException: next on empty iterator
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
> at scala.collection.IterableLike$class.head(IterableLike.scala:107)
> at 
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:48)
> at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
> at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:803)
> at 
> org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:798)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:281)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Log
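
For context, the NoSuchElementException comes from CollapseWindow calling {{head}} on an empty collection of window expressions when adjacent Window nodes carry none. Below is a minimal sketch of the kind of emptiness guard that avoids this; it is an illustration written against the Catalyst {{Window}} logical node, not the exact patch merged in PR 24411, and the real rule also checks that the outer window does not reference the inner window's output:

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Window}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: only collapse adjacent Window nodes when both actually carry window
// expressions, so the rewrite never calls .head on an empty buffer.
object CollapseNonEmptyWindows extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
    case outer @ Window(weOuter, ps1, os1, Window(weInner, ps2, os2, grandChild))
        if weOuter.nonEmpty && weInner.nonEmpty &&  // guard against empty expression lists
           ps1 == ps2 && os1 == os2 =>
      outer.copy(windowExpressions = weInner ++ weOuter, child = grandChild)
  }
}
{code}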

[jira] [Assigned] (SPARK-27514) Empty window expression results in error in optimizer

2019-04-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27514:
---

Assignee: Yifei Huang

> Empty window expression results in error in optimizer
> -
>
> Key: SPARK-27514
> URL: https://issues.apache.org/jira/browse/SPARK-27514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yifei Huang
>Assignee: Yifei Huang
>Priority: Major
>
> Currently, the optimizer will break on the following code:
> {code:java}
> val schema = StructType(Seq(
>   StructField("colA", StringType, true),
>   StructField("colB", IntegerType, true)
> ))
> var df = sqlContext.sparkSession.createDataFrame(new util.ArrayList[Row](), 
> schema)
> val w = Window.partitionBy("colA")
> df = df.withColumn("col1", sum("colB").over(w))
> df = df.withColumn("col3", sum("colB").over(w))
> df = df.withColumn("col4", sum("col3").over(w))
> df = df.withColumn("col2", sum("col1").over(w))
> df = df.select("col2")
> df.explain(true)
> {code}
> with the following stacktrace:
> {code:java}
> next on empty iterator
> java.util.NoSuchElementException: next on empty iterator
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
> at scala.collection.IterableLike$class.head(IterableLike.scala:107)
> at 
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:48)
> at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
> at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:803)
> at 
> org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:798)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:281)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.sc

[jira] [Comment Edited] (SPARK-27367) Faster RoaringBitmap Serialization with v0.8.0

2019-04-18 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821653#comment-16821653
 ] 

Liang-Chi Hsieh edited comment on SPARK-27367 at 4/19/19 4:32 AM:
--

I did upgrade it locally, but the performance improvement doesn't seem very 
obvious. Maybe the optimization is only significant on larger bitmaps; I'm not 
sure whether in Spark we will have bitmaps large enough to take advantage of 
this optimization.

I compared 0.7.45 (used in current Spark) and 0.8.1 (the latest release); apart 
from the serde to ByteBuffer, I didn't see other noticeable commits.

So, do we still want to upgrade to 0.8.1? If so, I can make a PR.

 


was (Author: viirya):
I do upgrade it in local. But seems the performance improvement isn't so 
obvious. Maybe the optimization is only significant on larger bitmap. I'm not 
sure if in Spark we will have large bitmap that can take advantage of this 
optimization.

I compare 0.7.45 (used in current master) and 0.8.1 (latest release), except 
for serde to bytebuffer, I didn't see other noticeable commits.

So, do we still want to upgrade to 0.8.1? If so, I can make a PR.

 

> Faster RoaringBitmap Serialization with v0.8.0
> --
>
> Key: SPARK-27367
> URL: https://issues.apache.org/jira/browse/SPARK-27367
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> RoaringBitmap 0.8.0 adds faster serde, but also requires us to change how we 
> call the serde routines slightly to take advantage of it (see the sketch after 
> the links below). This is probably a worthwhile optimization, as every shuffle 
> map task with a large number of partitions generates these bitmaps, and the 
> driver especially has to deserialize many of these messages.
> See 
> * https://github.com/apache/spark/pull/24264#issuecomment-479675572
> * https://github.com/RoaringBitmap/RoaringBitmap/pull/325
> * https://github.com/RoaringBitmap/RoaringBitmap/issues/319
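
As an illustration only (not code from Spark or from the linked PRs), a minimal sketch of the ByteBuffer-based serde this refers to, assuming the {{serialize(ByteBuffer)}}/{{deserialize(ByteBuffer)}} overloads introduced around RoaringBitmap 0.8.x:

{code:scala}
import java.nio.ByteBuffer
import org.roaringbitmap.RoaringBitmap

// Build a small bitmap (in Spark, HighlyCompressedMapStatus keeps its
// empty-block set in such a bitmap).
val bitmap = RoaringBitmap.bitmapOf(1, 2, 3, 100000)
bitmap.runOptimize()

// Serialize straight into a ByteBuffer instead of going through a DataOutputStream.
val buf = ByteBuffer.allocate(bitmap.serializedSizeInBytes())
bitmap.serialize(buf)
buf.flip()

// Deserialize from the same buffer and check the round trip.
val roundTripped = new RoaringBitmap()
roundTripped.deserialize(buf)
assert(roundTripped == bitmap)
{code}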



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27367) Faster RoaringBitmap Serialization with v0.8.0

2019-04-18 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821653#comment-16821653
 ] 

Liang-Chi Hsieh commented on SPARK-27367:
-

I did upgrade it locally, but the performance improvement doesn't seem very 
obvious. Maybe the optimization is only significant on larger bitmaps; I'm not 
sure whether in Spark we will have bitmaps large enough to take advantage of 
this optimization.

I compared 0.7.45 (used in current master) and 0.8.1 (the latest release); apart 
from the serde to ByteBuffer, I didn't see other noticeable commits.

So, do we still want to upgrade to 0.8.1? If so, I can make a PR.

 

> Faster RoaringBitmap Serialization with v0.8.0
> --
>
> Key: SPARK-27367
> URL: https://issues.apache.org/jira/browse/SPARK-27367
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> RoaringBitmap 0.8.0 adds faster serde, but also requires us to change how we 
> call the serde routines slightly to take advantage of it. This is probably a 
> worthwhile optimization, as every shuffle map task with a large number of 
> partitions generates these bitmaps, and the driver especially has to 
> deserialize many of these messages.
> See 
> * https://github.com/apache/spark/pull/24264#issuecomment-479675572
> * https://github.com/RoaringBitmap/RoaringBitmap/pull/325
> * https://github.com/RoaringBitmap/RoaringBitmap/issues/319



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27518) Show statistics in Optimized Logical Plan in the "Details" On SparkSQL ui page when CBO is enabled

2019-04-18 Thread peng bo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peng bo updated SPARK-27518:

Attachment: SPARK-27518-1.jpg

> Show statistics in Optimized Logical Plan in the "Details" On SparkSQL ui 
> page when CBO is enabled
> --
>
> Key: SPARK-27518
> URL: https://issues.apache.org/jira/browse/SPARK-27518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: peng bo
>Priority: Major
> Attachments: SPARK-27518-1.jpg
>
>
> A {{Statistics}} snapshot for the current query is really helpful for finding 
> out why a query runs slowly, especially when the slowdown is caused by a bad 
> join reordering while {{CBO}} is enabled.
> This issue is to show statistics in the Optimized Logical Plan in the "Details" 
> section on the Spark SQL UI page when CBO is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27518) Show statistics in Optimized Logical Plan in the "Details" On SparkSQL ui page when CBO is enabled

2019-04-18 Thread peng bo (JIRA)
peng bo created SPARK-27518:
---

 Summary: Show statistics in Optimized Logical Plan in the 
"Details" On SparkSQL ui page when CBO is enabled
 Key: SPARK-27518
 URL: https://issues.apache.org/jira/browse/SPARK-27518
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Web UI
Affects Versions: 3.0.0
Reporter: peng bo


A {{Statistics}} snapshot for the current query is really helpful for finding out 
why a query runs slowly, especially when the slowdown is caused by a bad join 
reordering while {{CBO}} is enabled.

This issue is to show statistics in the Optimized Logical Plan in the "Details" 
section on the Spark SQL UI page when CBO is enabled.
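
As a point of reference only (not part of this proposal), a rough sketch of how the same statistics can already be inspected from code with CBO enabled; the table {{t}} and its columns are hypothetical:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cbo-stats-demo")
  .config("spark.sql.cbo.enabled", "true")               // turn on cost-based optimization
  .config("spark.sql.cbo.joinReorder.enabled", "true")   // let CBO reorder joins
  .getOrCreate()

// Column-level statistics are what CBO's estimates are built from.
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS id, name")

val df = spark.sql("SELECT * FROM t WHERE id > 10")

// Optimized logical plan annotated with sizeInBytes/rowCount estimates.
println(df.queryExecution.stringWithStats)

// The SQL equivalent.
spark.sql("EXPLAIN COST SELECT * FROM t WHERE id > 10").show(false)
{code}

The proposal here is essentially to surface the same per-operator statistics in the "Details" section of the SQL tab, so they can be read without re-running the query from a shell.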



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22044) explain function with codegen and cost parameters

2019-04-18 Thread Huon Wilson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821645#comment-16821645
 ] 

Huon Wilson commented on SPARK-22044:
-

I think this would be great, since the current ways to do it are moderately 
annoying, and the mismatch with the direct {{EXPLAIN CODEGEN}} and {{EXPLAIN 
COST}} in SQL is a bit jarring/unexpected. (A sketch of the current workarounds 
follows the list below.)


* For {{codegen}}, there's the workaround of using 
{{df.queryExecution.debug.codegen}} in Scala, but this is somewhat awkward to 
use from pyspark ({{df._jdf.queryExecution().debug().codegen()}}, which doesn't 
use Python's {{stdout}} for printing, and so can't be captured easily, if 
required), and very awkward for sparkR (I believe 
{{invisible(sparkR.callJMethod(sparkR.callJMethod(sparkR.callJMethod(df@sdf, 
"queryExecution"), "debug"), "codegen"))}}, but again, cannot be captured via 
{{capture.output}} easily). 
* For {{cost}}, there's a similar workaround of using 
{{df.queryExecution.stringWithStats}}, but this has the same awkwardness as 
{{codegen}} when called from pyspark and sparkR.
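
A short sketch of the Scala-side workarounds mentioned above, plus the SQL forms that already accept the flags; {{spark}} is an active SparkSession and the query is just a placeholder:

{code:scala}
val df = spark.range(10).selectExpr("id", "id * 2 AS doubled")

// Generated whole-stage-codegen Java source (what EXPLAIN CODEGEN shows in SQL).
df.queryExecution.debug.codegen()

// Plan annotated with statistics (what EXPLAIN COST shows in SQL).
println(df.queryExecution.stringWithStats)

// SQL already exposes both directly.
spark.sql("EXPLAIN CODEGEN SELECT id, id * 2 FROM range(10)").show(false)
spark.sql("EXPLAIN COST SELECT id, id * 2 FROM range(10)").show(false)
{code}

The proposed {{explain(codegen = ..., cost = ...)}} overload would make the same output reachable uniformly from Scala, Python and R without reaching into {{queryExecution}}.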

> explain function with codegen and cost parameters
> -
>
> Key: SPARK-22044
> URL: https://issues.apache.org/jira/browse/SPARK-22044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The {{explain}} operator creates an {{ExplainCommand}} runnable command that 
> accepts (among other things) {{codegen}} and {{cost}} arguments.
> There's no version of {{explain}} that exposes these arguments. It is, however, 
> possible using SQL, which is kind of surprising (given how much focus is 
> devoted to the Dataset API).
> This is to add another {{explain}} with {{codegen}} and {{cost}} arguments, 
> i.e.
> {code}
> def explain(codegen: Boolean = false, cost: Boolean = false): Unit
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27465) Kafka Client 0.11.0.0 is not Supporting the kafkatestutils package

2019-04-18 Thread Praveen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821639#comment-16821639
 ] 

Praveen commented on SPARK-27465:
-

Hi Shahid,

Can you please let me know if you have any update on this issue?

> Kafka Client 0.11.0.0 is not Supporting the kafkatestutils package
> --
>
> Key: SPARK-27465
> URL: https://issues.apache.org/jira/browse/SPARK-27465
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Praveen
>Priority: Critical
>
> Hi Team,
> We are getting the exceptions below with Kafka client version 0.11.0.0 for the 
> KafkaTestUtils package, but it works fine when we use Kafka client version 
> 0.10.0.1. Please suggest the way forward. We are using the class 
> "org.apache.spark.streaming.kafka010.KafkaTestUtils", and the Spark Streaming 
> version is 2.2.3 and above.
>  
> ERROR:
> java.lang.NoSuchMethodError: 
> kafka.server.KafkaServer$.$lessinit$greater$default$2()Lkafka/utils/Time;
>  at 
> org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:110)
>  at 
> org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:107)
>  at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2234)
>  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>  at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2226)
>  at 
> org.apache.spark.streaming.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:107)
>  at 
> org.apache.spark.streaming.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:122)
>  at 
> com.netcracker.rms.smart.esp.ESPTestEnv.prepareKafkaTestUtils(ESPTestEnv.java:203)
>  at com.netcracker.rms.smart.esp.ESPTestEnv.setUp(ESPTestEnv.java:157)
>  at 
> com.netcracker.rms.smart.esp.TestEventStreamProcessor.setUp(TestEventStreamProcessor.java:58)
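
For what it's worth, the NoSuchMethodError above looks like a binary incompatibility: the default-argument method of kafka.server.KafkaServer's constructor that returns kafka.utils.Time no longer exists in the 0.11 broker jar, while KafkaTestUtils (compiled against 0.10.x) starts an embedded broker from those classes. A hedged sbt sketch of the dependency pinning the reporter describes as working; the exact layout of the test project is an assumption:

{code:scala}
// build.sbt (sketch): keep the embedded-broker classes on the 0.10.x line that
// spark-streaming-kafka-0-10's KafkaTestUtils was compiled against.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.3",
  "org.apache.kafka" %% "kafka"         % "0.10.0.1" % Test,  // broker classes (kafka.server.KafkaServer)
  "org.apache.kafka"  % "kafka-clients" % "0.10.0.1" % Test
)
{code}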



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27517) python.PythonRDD: Error while sending iterator

2019-04-18 Thread lucasysfeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lucasysfeng updated SPARK-27517:

Attachment: spark.stderr

>  python.PythonRDD: Error while sending iterator
> ---
>
> Key: SPARK-27517
> URL: https://issues.apache.org/jira/browse/SPARK-27517
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: lucasysfeng
>Priority: Major
> Attachments: spark.stderr
>
>
> When using the collect function, the following exception is occasionally thrown:
> ERROR python.PythonRDD: Error while sending iterator
> java.net.SocketTimeoutException: Accept timed out
>  at java.net.PlainSocketImpl.socketAccept(Native Method)
>  at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
>  at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>  at java.net.ServerSocket.accept(ServerSocket.java:513)
>  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:702)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27517) python.PythonRDD: Error while sending iterator

2019-04-18 Thread lucasysfeng (JIRA)
lucasysfeng created SPARK-27517:
---

 Summary:  python.PythonRDD: Error while sending iterator
 Key: SPARK-27517
 URL: https://issues.apache.org/jira/browse/SPARK-27517
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: lucasysfeng


When using the collect function, the following exception is occasionally thrown:

ERROR python.PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
 at java.net.PlainSocketImpl.socketAccept(Native Method)
 at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
 at java.net.ServerSocket.implAccept(ServerSocket.java:545)
 at java.net.ServerSocket.accept(ServerSocket.java:513)
 at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:702)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27516) java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]

2019-04-18 Thread lucasysfeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lucasysfeng updated SPARK-27516:

Attachment: driver_gc

> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
> 
>
> Key: SPARK-27516
> URL: https://issues.apache.org/jira/browse/SPARK-27516
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: linux
> YARN  cluster mode
>Reporter: lucasysfeng
>Priority: Minor
> Attachments: driver_gc, spark.stderr
>
>
>  
> {code:java}
> #! /usr/bin/env python
> # -*- coding: utf-8 -*-
> from pyspark import SparkContext
> from pyspark.sql import SparkSession
> if __name__ == '__main__':
> spark = SparkSession.builder.appName('sparktest').getOrCreate()
> # Other code is omitted below
> {code}
>  
> *The code is simple, but occasionally throws the following exception:*
>  19/04/15 21:30:00 ERROR yarn.ApplicationMaster: Uncaught exception: 
>  java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:400)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:253)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:771)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:769)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>  
> I know spark.yarn.am.waitTime can increase how long the ApplicationMaster 
> waits for the SparkContext to be initialized.
>  Why does SparkContext initialization take so long?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27516) java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]

2019-04-18 Thread lucasysfeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lucasysfeng updated SPARK-27516:

Attachment: spark.stderr

> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
> 
>
> Key: SPARK-27516
> URL: https://issues.apache.org/jira/browse/SPARK-27516
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: linux
> YARN  cluster mode
>Reporter: lucasysfeng
>Priority: Minor
> Attachments: driver_gc, spark.stderr
>
>
>  
> {code:java}
> #! /usr/bin/env python
> # -*- coding: utf-8 -*-
> from pyspark import SparkContext
> from pyspark.sql import SparkSession
> if __name__ == '__main__':
> spark = SparkSession.builder.appName('sparktest').getOrCreate()
> # Other code is omitted below
> {code}
>  
> *The code is simple, but occasionally throws the following exception:*
>  19/04/15 21:30:00 ERROR yarn.ApplicationMaster: Uncaught exception: 
>  java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:400)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:253)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:771)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:769)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>  
> I know spark.yarn.am.waitTime can increase how long the ApplicationMaster 
> waits for the SparkContext to be initialized.
>  Why does SparkContext initialization take so long?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27516) java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]

2019-04-18 Thread lucasysfeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lucasysfeng updated SPARK-27516:

Description: 
 
{code:java}
#! /usr/bin/env python
# -*- coding: utf-8 -*-

from pyspark import SparkContext
from pyspark.sql import SparkSession

if __name__ == '__main__':
spark = SparkSession.builder.appName('sparktest').getOrCreate()
# Other code is omitted below
{code}
 

*The code is simple, but occasionally throws the following exception:*
 19/04/15 21:30:00 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:400)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:253)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:771)
 at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
 at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743)
 at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:769)
 at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)

 

I know spark.yarn.am.waitTime can increase how long the ApplicationMaster waits 
for the SparkContext to be initialized.
 Why does SparkContext initialization take so long?

 

 

  was:
 
{code:java}
#! /usr/bin/env python
# -*- coding: utf-8 -*-

from pyspark import SparkContext
from pyspark.sql import SparkSession

if __name__ == '__main__':
spark = SparkSession.builder.appName('sparktest').getOrCreate()
# Other code is omitted below
{code}
 

*The code is simple, but occasionally throws the following exception:*
19/04/15 21:30:00 ERROR yarn.ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:400)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:253)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:771)
 at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
 at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743)
 at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:769)
 at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)


I know spark.yarn.am.waitTime can increase the sparkcontext initialization time.
Why does SparkContext initialization take so long?

 

 


> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
> 
>
> Key: SPARK-27516
> URL: https://issues.apache.org/jira/browse/SPARK-27516
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: linux
> YARN  cluster mode
>Reporter: lucasysfeng
>Priority: Minor
>
>  
> {code:java}
> #! /usr/bin/env python
> # -*- coding: utf-8 -*-
> from pyspark import SparkContext
> from pyspark.sql import SparkSession
> if __name__ == '__main__':
> spark = SparkSession.builder.appName('sparktest').getOrCreate()
> # Other code is omitted below
> {code}
>  
> *The code is simple, but occasionally throws the following exception:*
>  19/04/15 21:30:00 ERROR yarn.ApplicationMaster: Uncaught exception: 
>  java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>  at org.apache.spar

[jira] [Created] (SPARK-27516) java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]

2019-04-18 Thread lucasysfeng (JIRA)
lucasysfeng created SPARK-27516:
---

 Summary: java.util.concurrent.TimeoutException: Futures timed out 
after [100000 milliseconds]
 Key: SPARK-27516
 URL: https://issues.apache.org/jira/browse/SPARK-27516
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
 Environment: linux

YARN  cluster mode
Reporter: lucasysfeng


 
{code:java}
#! /usr/bin/env python
# -*- coding: utf-8 -*-

from pyspark import SparkContext
from pyspark.sql import SparkSession

if __name__ == '__main__':
spark = SparkSession.builder.appName('sparktest').getOrCreate()
# Other code is omitted below
{code}
 

*The code is simple, but occasionally throws the following exception:*
19/04/15 21:30:00 ERROR yarn.ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:400)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:253)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:771)
 at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
 at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743)
 at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
 at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:769)
 at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)


I know spark.yarn.am.waitTime can increase how long the ApplicationMaster waits 
for the SparkContext to be initialized.
Why does SparkContext initialization take so long?

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6

2019-04-18 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp reopened SPARK-25079:
-

going to reopen this until i'm done w/the branch-2.3 and -2.4 python deployment.

> [PYTHON] upgrade python 3.4 -> 3.6
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Fix For: 3.0.0
>
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired

2019-04-18 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27515:

Description: 
When submitting a Spark YARN application, we first create a container launch 
context and store the relevant tokens.
For each ApplicationMaster attempt, the original tokens are transferred for 
connecting to YARN.
However, the original HDFS delegation tokens are transferred as well.
For a Spark Streaming application whose ApplicationMaster fails after it has 
run for a long time, the HDFS token stored in the container launch context may 
have expired.
When the new ApplicationMaster attempt runs prepareLocalResources, it accesses 
HDFS and fails because the token has expired.
This error occurred while we were rolling-upgrading our cluster.

  was:
When submit a spark yarn application, we first create a container launch 
context and store the relative tokens.
And for each attempt of applicationMaster, it would transfer origin tokens.
However, it also transfer origin hdfs delegation tokens.
For a spark streaming application, if its applicationMaster failed when it has 
run for a long duration.
The hdfs token stored in container launch context may be expired.
When the new attempt applicationMaster prepareLocalResources, it would access 
the hdfs and failed for token expired.
This error occured when we rolling upgrading our cluster.


> [Deploy] When application master retry after a long time running, the hdfs 
> delegation token may be expired
> --
>
> Key: SPARK-27515
> URL: https://issues.apache.org/jira/browse/SPARK-27515
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.2
>Reporter: feiwang
>Priority: Major
>
> When submitting a Spark YARN application, we first create a container launch 
> context and store the relevant tokens.
> For each ApplicationMaster attempt, the original tokens are transferred for 
> connecting to YARN.
> However, the original HDFS delegation tokens are transferred as well.
> For a Spark Streaming application whose ApplicationMaster fails after it has 
> run for a long time, the HDFS token stored in the container launch context may 
> have expired.
> When the new ApplicationMaster attempt runs prepareLocalResources, it accesses 
> HDFS and fails because the token has expired.
> This error occurred while we were rolling-upgrading our cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired

2019-04-18 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27515:

Description: 
When submitting a Spark YARN application, we first create a container launch 
context and store the relevant tokens.
For each ApplicationMaster attempt, the original tokens are transferred for 
connecting to YARN.
However, the original HDFS delegation tokens are transferred as well.
For a Spark Streaming application whose ApplicationMaster fails after it has 
run for a long time, the HDFS token stored in the container launch context may 
have expired.
When the new ApplicationMaster attempt runs prepareLocalResources, it accesses 
HDFS and fails because the token has expired.
This error occurred while we were rolling-upgrading our cluster.

  was:
When submit a spark yarn application, we first create a container launch 
context and store the relative tokens.
And for each attempt of applicationMaster, it would transfer origin tokens for 
connecting to yarn .
However, it also transfer origin hdfs delegation tokens.
For a spark streaming application, if its applicationMaster failed when it has 
run for a long duration.
The hdfs token stored in container launch context may be expired.
When the new attempt applicationMaster prepareLocalResources, it would access 
the hdfs and failed for token expired.
This error occured when we rolling upgrading our cluster.


> [Deploy] When application master retry after a long time running, the hdfs 
> delegation token may be expired
> --
>
> Key: SPARK-27515
> URL: https://issues.apache.org/jira/browse/SPARK-27515
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.2
>Reporter: feiwang
>Priority: Major
>
> When submitting a Spark YARN application, we first create a container launch 
> context and store the relevant tokens.
> For each ApplicationMaster attempt, the original tokens are transferred for 
> connecting to YARN.
> However, the original HDFS delegation tokens are transferred as well.
> For a Spark Streaming application whose ApplicationMaster fails after it has 
> run for a long time, the HDFS token stored in the container launch context may 
> have expired.
> When the new ApplicationMaster attempt runs prepareLocalResources, it accesses 
> HDFS and fails because the token has expired.
> This error occurred while we were rolling-upgrading our cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired

2019-04-18 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27515:

Description: 
When submitting a Spark YARN application, we first create a container launch 
context and store the relevant tokens.
For each ApplicationMaster attempt, the original tokens are transferred.
However, the original HDFS delegation tokens are transferred as well.
For a Spark Streaming application whose ApplicationMaster fails after it has 
run for a long time, the HDFS token stored in the container launch context may 
have expired.
When the new ApplicationMaster attempt runs prepareLocalResources, it accesses 
HDFS and fails because the token has expired.
This error occurred while we were rolling-upgrading our cluster.

  was:
When submit a spark yarn application, we first create a container launch 
context and store the relative tokens.
And for each attempt of applicationMaster, it would transfer origin tokens to 
connect yarn.
However, it also transfer origin hdfs delegation tokens.
For a spark streaming application, if its applicationMaster failed when it has 
run for a long duration.
The hdfs token stored in container launch context may be expired.
When the new attempt applicationMaster prepareLocalResources, it would access 
the hdfs and failed for token expired.
This error occured when we rolling upgrading our cluster.


> [Deploy] When application master retry after a long time running, the hdfs 
> delegation token may be expired
> --
>
> Key: SPARK-27515
> URL: https://issues.apache.org/jira/browse/SPARK-27515
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.2
>Reporter: feiwang
>Priority: Major
>
> When submitting a Spark YARN application, we first create a container launch 
> context and store the relevant tokens.
> For each ApplicationMaster attempt, the original tokens are transferred.
> However, the original HDFS delegation tokens are transferred as well.
> For a Spark Streaming application whose ApplicationMaster fails after it has 
> run for a long time, the HDFS token stored in the container launch context may 
> have expired.
> When the new ApplicationMaster attempt runs prepareLocalResources, it accesses 
> HDFS and fails because the token has expired.
> This error occurred while we were rolling-upgrading our cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired

2019-04-18 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27515:

Description: 
When submitting a Spark YARN application, we first create a container launch 
context and store the relevant tokens.
For each ApplicationMaster attempt, the original tokens are transferred to 
connect to YARN.
However, the original HDFS delegation tokens are transferred as well.
For a Spark Streaming application whose ApplicationMaster fails after it has 
run for a long time, the HDFS token stored in the container launch context may 
have expired.
When the new ApplicationMaster attempt runs prepareLocalResources, it accesses 
HDFS and fails because the token has expired.
This error occurred while we were rolling-upgrading our cluster.

> [Deploy] When application master retry after a long time running, the hdfs 
> delegation token may be expired
> --
>
> Key: SPARK-27515
> URL: https://issues.apache.org/jira/browse/SPARK-27515
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.2
>Reporter: feiwang
>Priority: Major
>
> When submitting a Spark YARN application, we first create a container launch 
> context and store the relevant tokens.
> For each ApplicationMaster attempt, the original tokens are transferred to 
> connect to YARN.
> However, the original HDFS delegation tokens are transferred as well.
> For a Spark Streaming application whose ApplicationMaster fails after it has 
> run for a long time, the HDFS token stored in the container launch context may 
> have expired.
> When the new ApplicationMaster attempt runs prepareLocalResources, it accesses 
> HDFS and fails because the token has expired.
> This error occurred while we were rolling-upgrading our cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired

2019-04-18 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27515.

Resolution: Duplicate

> [Deploy] When application master retry after a long time running, the hdfs 
> delegation token may be expired
> --
>
> Key: SPARK-27515
> URL: https://issues.apache.org/jira/browse/SPARK-27515
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.2
>Reporter: feiwang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired

2019-04-18 Thread feiwang (JIRA)
feiwang created SPARK-27515:
---

 Summary: [Deploy] When application master retry after a long time 
running, the hdfs delegation token may be expired
 Key: SPARK-27515
 URL: https://issues.apache.org/jira/browse/SPARK-27515
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.3.2
Reporter: feiwang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27501) Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress present stream

2019-04-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27501.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/24397

> Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress 
> present stream
> ---
>
> Key: SPARK-27501
> URL: https://issues.apache.org/jira/browse/SPARK-27501
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27501) Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress present stream

2019-04-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27501:


Assignee: Yuming Wang

> Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress 
> present stream
> ---
>
> Key: SPARK-27501
> URL: https://issues.apache.org/jira/browse/SPARK-27501
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6

2019-04-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25079.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24266
[https://github.com/apache/spark/pull/24266]

> [PYTHON] upgrade python 3.4 -> 3.6
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Fix For: 3.0.0
>
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15348) Hive ACID

2019-04-18 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821557#comment-16821557
 ] 

Xiao Li commented on SPARK-15348:
-

Please follow the announcement of the upcoming Spark+AI summit. 

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0
>Reporter: Ran Haim
>Priority: Major
>
> Spark does not support any features of Hive's transactional tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction was done.
> Also, it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27514) Empty window expression results in error in optimizer

2019-04-18 Thread Yifei Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821552#comment-16821552
 ] 

Yifei Huang commented on SPARK-27514:
-

[~cloud_fan] [~dongjoon] 

> Empty window expression results in error in optimizer
> -
>
> Key: SPARK-27514
> URL: https://issues.apache.org/jira/browse/SPARK-27514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yifei Huang
>Priority: Major
>
> Currently, the optimizer will break on the following code:
> {code:java}
> val schema = StructType(Seq(
>   StructField("colA", StringType, true),
>   StructField("colB", IntegerType, true)
> ))
> var df = sqlContext.sparkSession.createDataFrame(new util.ArrayList[Row](), 
> schema)
> val w = Window.partitionBy("colA")
> df = df.withColumn("col1", sum("colB").over(w))
> df = df.withColumn("col3", sum("colB").over(w))
> df = df.withColumn("col4", sum("col3").over(w))
> df = df.withColumn("col2", sum("col1").over(w))
> df = df.select("col2")
> df.explain(true)
> {code}
> with the following stacktrace:
> {code:java}
> next on empty iterator
> java.util.NoSuchElementException: next on empty iterator
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
> at scala.collection.IterableLike$class.head(IterableLike.scala:107)
> at 
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:48)
> at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
> at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:803)
> at 
> org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:798)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:281)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformU

[jira] [Created] (SPARK-27514) Empty window expression results in error in optimizer

2019-04-18 Thread Yifei Huang (JIRA)
Yifei Huang created SPARK-27514:
---

 Summary: Empty window expression results in error in optimizer
 Key: SPARK-27514
 URL: https://issues.apache.org/jira/browse/SPARK-27514
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yifei Huang


Currently, the optimizer will break on the following code:
{code:java}
val schema = StructType(Seq(
  StructField("colA", StringType, true),
  StructField("colB", IntegerType, true)
))

var df = sqlContext.sparkSession.createDataFrame(new util.ArrayList[Row](), 
schema)
val w = Window.partitionBy("colA")
df = df.withColumn("col1", sum("colB").over(w))
df = df.withColumn("col3", sum("colB").over(w))
df = df.withColumn("col4", sum("col3").over(w))
df = df.withColumn("col2", sum("col1").over(w))
df = df.select("col2")
df.explain(true)
{code}
with the following stacktrace:
{code:java}
next on empty iterator
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike$class.head(IterableLike.scala:107)
at 
scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:48)
at 
scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:48)
at 
org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:803)
at 
org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:798)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:281)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328)
at org.apache.spa

[jira] [Commented] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue

2019-04-18 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821450#comment-16821450
 ] 

Sean Owen commented on SPARK-27068:
---

I dunno, is it common that you want to research a job from 1000 jobs ago? 
This is also what the history server's output is for.

> Support failed jobs ui and completed jobs ui use different queue
> 
>
> Key: SPARK-27068
> URL: https://issues.apache.org/jira/browse/SPARK-27068
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Priority: Major
>
> For some long-running applications, we may want to check the cause of some failed 
> jobs.
> But once most jobs have completed, the failed jobs' UI may disappear, so we could 
> use a different queue for these two kinds of jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27513) Spark tarball with binaries should have files owned by uid 0

2019-04-18 Thread koert kuipers (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koert kuipers updated SPARK-27513:
--
Description: 
currently the tarball is created in dev/make-distribution.sh like this:

{code:bash}
tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME"
{code}

the problem with this is that if root unpacks this tarball the files are owned 
by whatever the uid is of the person that created the tarball. this uid 
probably doesn't exist or belongs to a different, unrelated user. this is 
unlikely to be what anyone wants.

for other users this problem doesn't exist, since tar is not allowed to change 
the uid. so when they unpack the tarball the files are owned by them.

it is more typical to set the uid and gid to 0 for a tarball. that way when 
root unpacks it the files are owned by root. so like this:

{code:bash}
tar czf "spark-$VERSION-bin-$NAME.tgz" --numeric-owner --owner=0 --group=0 -C 
"$SPARK_HOME" "$TARDIR_NAME"
{code}



  was:
currently the tarball is created in dev/make-distribution.sh like this:

{code:bash}
tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME"
{code}

the problem with this is that if root unpacks this tarball the files are owned 
by whatever the uid is of the person that created the tarball. this uid 
probably doesnt exist or belongs to a different unrelated user. this is 
unlikely to be what anyone wants.

for other users this problem doesnt exist since tar is now allowed to change 
uid. so when they unpack the tarball the files are owned by them.

it is more typical to set the uid and gid to 0 for a tarball. that way when 
root unpacks it the files are owned by root. so like this:

{code:bash}
tar czf "spark-$VERSION-bin-$NAME.tgz" --numeric-owner --owner=0 --group=0 -C 
"$SPARK_HOME" "$TARDIR_NAME
{code}




> Spark tarball with binaries should have files owned by uid 0
> 
>
> Key: SPARK-27513
> URL: https://issues.apache.org/jira/browse/SPARK-27513
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.1
>Reporter: koert kuipers
>Priority: Minor
> Fix For: 3.0.0
>
>
> currently the tarball is created in dev/make-distribution.sh like this:
> {code:bash}
> tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME"
> {code}
> the problem with this is that if root unpacks this tarball the files are 
> owned by whatever the uid is of the person that created the tarball. this uid 
> probably doesn't exist or belongs to a different, unrelated user. this is 
> unlikely to be what anyone wants.
> for other users this problem doesn't exist, since tar is not allowed to change 
> the uid. so when they unpack the tarball the files are owned by them.
> it is more typical to set the uid and gid to 0 for a tarball. that way when 
> root unpacks it the files are owned by root. so like this:
> {code:bash}
> tar czf "spark-$VERSION-bin-$NAME.tgz" --numeric-owner --owner=0 --group=0 -C 
> "$SPARK_HOME" "$TARDIR_NAME"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27513) Spark tarball with binaries should have files owned by uid 0

2019-04-18 Thread koert kuipers (JIRA)
koert kuipers created SPARK-27513:
-

 Summary: Spark tarball with binaries should have files owned by 
uid 0
 Key: SPARK-27513
 URL: https://issues.apache.org/jira/browse/SPARK-27513
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.1
Reporter: koert kuipers
 Fix For: 3.0.0


currently the tarball is created in dev/make-distribution.sh like this:

{code:bash}
tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME"
{code}

the problem with this is that if root unpacks this tarball the files are owned 
by whatever the uid is of the person that created the tarball. this uid 
probably doesn't exist or belongs to a different, unrelated user. this is 
unlikely to be what anyone wants.

for other users this problem doesn't exist, since tar is not allowed to change 
the uid. so when they unpack the tarball the files are owned by them.

it is more typical to set the uid and gid to 0 for a tarball. that way when 
root unpacks it the files are owned by root. so like this:

{code:bash}
tar czf "spark-$VERSION-bin-$NAME.tgz" --numeric-owner --owner=0 --group=0 -C 
"$SPARK_HOME" "$TARDIR_NAME
{code}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27512) Decimal parsing leads to unexpected type inference

2019-04-18 Thread koert kuipers (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koert kuipers updated SPARK-27512:
--
Summary: Decimal parsing leads to unexpected type inference  (was: Decimal 
parsing leading to unexpected type inference)

> Decimal parsing leads to unexpected type inference
> --
>
> Key: SPARK-27512
> URL: https://issues.apache.org/jira/browse/SPARK-27512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark 3.0.0-SNAPSHOT from this commit:
> {code:bash}
> commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed
> Author: Dilip Biswal 
> Date:   Mon Apr 15 21:26:45 2019 +0800
> {code}
>Reporter: koert kuipers
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code:bash}
> $ hadoop fs -text test.bsv
> x|y
> 1|1,2
> 2|2,3
> 3|3,4
> {code}
> in spark 2.4.1:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: string (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1|1,2|
> |  2|2,3|
> |  3|3,4|
> +---+---+
> {code}
> in spark 3.0.0-SNAPSHOT:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: decimal(2,0) (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1| 12|
> |  2| 23|
> |  3| 34|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27512) Decimal parsing leading to unexpected type inference

2019-04-18 Thread koert kuipers (JIRA)
koert kuipers created SPARK-27512:
-

 Summary: Decimal parsing leading to unexpected type inference
 Key: SPARK-27512
 URL: https://issues.apache.org/jira/browse/SPARK-27512
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: spark 3.0.0-SNAPSHOT from this commit:
{code:bash}
commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed
Author: Dilip Biswal 
Date:   Mon Apr 15 21:26:45 2019 +0800
{code}
Reporter: koert kuipers
 Fix For: 3.0.0


{code:bash}
$ hadoop fs -text test.bsv
x|y
1|1,2
2|2,3
3|3,4
{code}

in spark 2.4.1:
{code:bash}
scala> val data = spark.read.format("csv").option("header", 
true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")

scala> data.printSchema
root
 |-- x: integer (nullable = true)
 |-- y: string (nullable = true)

scala> data.show
+---+---+
|  x|  y|
+---+---+
|  1|1,2|
|  2|2,3|
|  3|3,4|
+---+---+
{code}

in spark 3.0.0-SNAPSHOT:
{code:bash}
scala> val data = spark.read.format("csv").option("header", 
true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")

scala> data.printSchema
root
 |-- x: integer (nullable = true)
 |-- y: decimal(2,0) (nullable = true)

scala> data.show
+---+---+
|  x|  y|
+---+---+
|  1| 12|
|  2| 23|
|  3| 34|
+---+---+
{code}
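
A hedged workaround sketch (assuming the same test.bsv file and an existing SparkSession {{spark}}): supplying an explicit schema bypasses inference entirely, so "1,2" stays a string instead of being inferred as decimal(2,0).

{code:scala}
// Workaround sketch: an explicit schema disables inference for these columns,
// so the comma-containing values are kept as plain strings.
val data = spark.read
  .format("csv")
  .option("header", true)
  .option("delimiter", "|")
  .schema("x INT, y STRING") // DDL-style schema string
  .load("test.bsv")
{code}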




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27511) Spark Streaming Driver Memory

2019-04-18 Thread Badri Krishnan (JIRA)
Badri Krishnan created SPARK-27511:
--

 Summary: Spark Streaming Driver Memory
 Key: SPARK-27511
 URL: https://issues.apache.org/jira/browse/SPARK-27511
 Project: Spark
  Issue Type: Question
  Components: DStreams
Affects Versions: 2.4.0
Reporter: Badri Krishnan


Hello Apache Spark Community.

We are currently facing an issue with one of our Spark Streaming jobs, which 
consumes data from an IBM MQ. It runs on an AWS EMR cluster using DStreams 
and checkpointing.

Our Spark Streaming job failed with several containers exiting with error code 
143. Checking the container logs, one of the killed container's 
stdout logs [1] shows the error below: (Exit code from container 
container_1553356041292_0001_15_04 is : 143)

2019-03-28 19:32:26,569 ERROR [dispatcher-event-loop-3] 
org.apache.spark.streaming.receiver.ReceiverSupervisorImpl:Error stopping 
receiver 2 org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to 
ip-**-***-*.***.***.com/**.**.***.**:*
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more

These containers exited with code 143 because they were not able to reach the 
application master (the driver process).

Amazon mentioned that the application master is consuming more memory and 
recommended that we double it. As the AM runs on the driver, we were asked to increase 
spark.driver.memory from 1.4G to 3G. The question left unanswered was 
whether increasing the memory would solve the problem or merely delay the failure. As 
this is an always-running streaming application, do we need to do anything 
to understand whether the memory usage builds up over time, or are 
there any properties that need to be set specifically for how the AM (application 
master) works for a streaming application? Any input on how to track the AM 
memory usage or other insights would be helpful.
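
A hedged sketch of one way to watch the AM/driver heap over time (all names and values below are examples, not recommendations): enable GC logging on the driver and compare heap-after-GC sizes across days to see whether usage actually grows or just spikes.

{code:scala}
// Sketch only (placeholder app name and batch interval). spark.driver.memory and
// spark.driver.extraJavaOptions are shown in SparkConf for readability; in
// cluster mode they only take effect when passed at submit time, e.g.
//   spark-submit --driver-memory 3g \
//     --conf spark.driver.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
// The GC log in the AM container then shows whether heap-after-GC keeps growing.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("mq-streaming-job")     // master/deploy mode supplied by spark-submit on EMR/YARN
  .set("spark.driver.memory", "3g")   // the value that was suggested
  .set("spark.driver.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps") // Java 8 GC logging flags

val ssc = new StreamingContext(conf, Seconds(30))
{code}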

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue

2019-04-18 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821286#comment-16821286
 ] 

shahid edited comment on SPARK-27068 at 4/18/19 4:31 PM:
-

cc [~srowen] Can we raise a PR for the issue? The actual issue is that, when there 
are lots of jobs, the UI cleans up older jobs once the number of jobs exceeds a 
threshold. Eventually it removes failed jobs as well, so if a user wants to see the 
reason for a failure, it won't be available in the UI. 
 The solution could be to remove jobs only from the successful jobs table 
and retain the failed or killed jobs table. Kindly give your feedback.


was (Author: shahid):
cc [~srowen] Can we raise a PR for the issue?. The actual issue is, when there 
are lots of jobs, UI cleans older jobs if the number of jobs exceeds a 
threshold. Eventually it removes failure jobs as well. If user want to see the 
reason for failure, it won't be available in UI. 
 The solution could be, we can remove the jobs only from successful jobs table 
and retain failed of killed jobs table. Kindly give the feedback

> Support failed jobs ui and completed jobs ui use different queue
> 
>
> Key: SPARK-27068
> URL: https://issues.apache.org/jira/browse/SPARK-27068
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Priority: Major
>
> For some long-running applications, we may want to check the cause of some failed 
> jobs.
> But once most jobs have completed, the failed jobs' UI may disappear, so we could 
> use a different queue for these two kinds of jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue

2019-04-18 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821286#comment-16821286
 ] 

shahid commented on SPARK-27068:


cc [~srowen] Can we raise a PR for the issue? The actual issue is that, when there 
are lots of jobs, the UI cleans up older jobs once the number of jobs exceeds a 
threshold. Eventually it removes failed jobs as well, so if a user wants to see the 
reason for a failure, it won't be available in the UI. 
 The solution could be to remove jobs only from the successful jobs table 
and retain the failed or killed jobs table. Kindly give your feedback.
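
For reference, a minimal sketch of the retention knob involved (the app name and limit below are placeholders): the live UI keeps only the most recent {{spark.ui.retainedJobs}} jobs (1000 by default) and drops older entries, successful and failed alike, which is why the failure reason can disappear.

{code:scala}
// Sketch only: raising spark.ui.retainedJobs keeps failed jobs visible longer
// in the live UI, at the cost of more driver memory for UI state.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ui-retention-demo")
  .config("spark.ui.retainedJobs", "5000") // default is 1000
  .getOrCreate()
{code}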

> Support failed jobs ui and completed jobs ui use different queue
> 
>
> Key: SPARK-27068
> URL: https://issues.apache.org/jira/browse/SPARK-27068
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Priority: Major
>
> For some long-running applications, we may want to check the cause of some failed 
> jobs.
> But once most jobs have completed, the failed jobs' UI may disappear, so we could 
> use a different queue for these two kinds of jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24655) [K8S] Custom Docker Image Expectations and Documentation

2019-04-18 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821270#comment-16821270
 ] 

Thomas Graves commented on SPARK-24655:
---

From the linked issues it seems the goals would be:
 * Support more than the Alpine image base - i.e. a glibc-based version
 * Allow adding at least certain support like GPUs - although this may just 
mean making the base image configurable
 * Allow overriding the start commands for things like using Jupyter docker 
images
 * Add Python pip requirements (and I assume the same would be nice for R) - is there 
something generic we can do to make this easy?

Correct me if I'm wrong, but for anything Spark related you should be able to use 
Spark confs, like env variables, e.g. 
{{spark.kubernetes.driverEnv.[EnvironmentVariableName]}} and 
{{spark.executorEnv.[EnvironmentVariableName]}}. Otherwise you could just use the Dockerfile built here as 
a base and build on it. 

I think we would just want to try to make it easy for the common cases and 
allow users to override things we may have hardcoded, so they can reuse it 
as a base.

[~mcheah] From the original description, why do we want to avoid rebuilding 
the image if the Spark version changes? It seems OK to allow users to override it to 
point to their own Spark version (which they could then use to do this), but I 
would think normally you would build a new docker image for a new version of 
Spark? Dependencies may have changed, the docker template may have changed, 
etc. It seems that if they really wanted this, they would just specify their own 
docker image as a base and just add the Spark pieces - is that what you are 
getting at? We could make the base image an argument to the docker-image-tool.sh 
script.
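
A hedged sketch of that point (the env-variable name and value are placeholders): purely configuration-level customization can be injected through Spark confs instead of being baked into a custom image.

{code:scala}
// Sketch only: MY_SETTING is a placeholder environment variable. These confs
// set env vars on the driver and executor pods without touching the Dockerfile;
// the k8s master and deploy mode are supplied by spark-submit.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("k8s-env-demo")
  .config("spark.kubernetes.driverEnv.MY_SETTING", "some-value") // driver pod env var
  .config("spark.executorEnv.MY_SETTING", "some-value")          // executor pod env var
  .getOrCreate()
{code}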

> [K8S] Custom Docker Image Expectations and Documentation
> 
>
> Key: SPARK-24655
> URL: https://issues.apache.org/jira/browse/SPARK-24655
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Matt Cheah
>Priority: Major
>
> A common use case we want to support with Kubernetes is the usage of custom 
> Docker images. Some examples include:
>  * A user builds an application using Gradle or Maven, using Spark as a 
> compile-time dependency. The application's jars (both the custom-written jars 
> and the dependencies) need to be packaged in a docker image that can be run 
> via spark-submit.
>  * A user builds a PySpark or R application and desires to include custom 
> dependencies
>  * A user wants to switch the base image from Alpine to CentOS while using 
> either built-in or custom jars
> We currently do not document how these custom Docker images are supposed to 
> be built, nor do we guarantee stability of these Docker images with various 
> spark-submit versions. To illustrate how this can break down, suppose for 
> example we decide to change the names of environment variables that denote 
> the driver/executor extra JVM options specified by 
> {{spark.[driver|executor].extraJavaOptions}}. If we change the environment 
> variable spark-submit provides then the user must update their custom 
> Dockerfile and build new images.
> Rather than jumping to an implementation immediately though, it's worth 
> taking a step back and considering these matters from the perspective of the 
> end user. Towards that end, this ticket will serve as a forum where we can 
> answer at least the following questions, and any others pertaining to the 
> matter:
>  # What would be the steps a user would need to take to build a custom Docker 
> image, given their desire to customize the dependencies and the content (OS 
> or otherwise) of said images?
>  # How can we ensure the user does not need to rebuild the image if only the 
> spark-submit version changes?
> The end deliverable for this ticket is a design document, and then we'll 
> create sub-issues for the technical implementation and documentation of the 
> contract.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27510) Master fall into dead loop while launching executor failed in Worker

2019-04-18 Thread wuyi (JIRA)
wuyi created SPARK-27510:


 Summary: Master fall into dead loop while launching executor 
failed in Worker
 Key: SPARK-27510
 URL: https://issues.apache.org/jira/browse/SPARK-27510
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.3.0, 2.2.0, 2.1.0, 2.0.0, 1.6.0
Reporter: wuyi


In standalone mode, when launching an executor via ExecutorRunner.start() always 
fails in the Worker, the Master will continue to launch new executors for the same 
Worker indefinitely.

The issue is easy to reproduce by running a unit test in local-cluster mode 
with a wrong spark.test.home (e.g. /tmp). The test then gets stuck, and we can 
see endless executor directories under 
/tmp/work/app/.
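
A hedged repro sketch along the lines described (it assumes a Spark build whose tests honor {{spark.test.home}}; paths and sizes are placeholders):

{code:scala}
// Sketch only: local-cluster[2,1,1024] starts 2 workers with 1 core and 1024 MB
// each. With a bogus spark.test.home every executor launch fails in the Worker,
// the Master keeps requesting replacements, and the worker work dir (under
// /tmp/work/ in this setup) fills with executor directories.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local-cluster[2,1,1024]")
  .setAppName("dead-loop-repro")
  .set("spark.test.home", "/tmp") // deliberately wrong
val sc = new SparkContext(conf)
sc.parallelize(1 to 10).count()   // hangs while the Master retries indefinitely
{code}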



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27502) Update nested schema benchmark result for Orc V2

2019-04-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27502.
---
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24399

> Update nested schema benchmark result for Orc V2
> 
>
> Key: SPARK-27502
> URL: https://issues.apache.org/jira/browse/SPARK-27502
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 3.0.0
>
>
> We added nested schema pruning support to Orc V2 recently. The benchmark 
> result should be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27502) Update nested schema benchmark result for Orc V2

2019-04-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27502:
--
Issue Type: Sub-task  (was: Test)
Parent: SPARK-25603

> Update nested schema benchmark result for Orc V2
> 
>
> Key: SPARK-27502
> URL: https://issues.apache.org/jira/browse/SPARK-27502
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> We added nested schema pruning support to Orc V2 recently. The benchmark 
> result should be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25422) flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)

2019-04-18 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821178#comment-16821178
 ] 

Imran Rashid commented on SPARK-25422:
--

[~magnusfa] handling blocks > 2GB was broken in many ways before 2.4.  You'll 
need to upgrade to 2.4.0 to get it to work.  There are no known issues with large 
blocks in 2.4.0 (as far as I know, anyway).

> flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated 
> (encryption = on) (with replication as stream)
> 
>
> Key: SPARK-25422
> URL: https://issues.apache.org/jira/browse/SPARK-25422
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> stacktrace
> {code}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 7, localhost, executor 1): java.io.IOException: 
> org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of 
> broadcast_0: 1651574976 != 1165629262
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: corrupt remote block 
> broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313)
>   ... 13 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27509) enable connection in cluster mode for output in client machine

2019-04-18 Thread Ilya Brodetsky (JIRA)
Ilya Brodetsky created SPARK-27509:
--

 Summary: enable connection in cluster mode for output in client 
machine
 Key: SPARK-27509
 URL: https://issues.apache.org/jira/browse/SPARK-27509
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 3.0.0
Reporter: Ilya Brodetsky


While working on Spark I implemented a feature where, when you submit a Spark 
job in cluster mode, you can enable an option so that when the job finishes the 
client machine reads from a file that the Spark job wrote to. It provides a quick 
and easy way to get the output of, say, a dynamic Spark SQL query on the client 
without manually running an HDFS download command or anything like that. There are 
some prerequisites for it (such as having the configuration for HDFS communication), 
but I think it is entirely possible to implement and would help many users have an 
established communication line with a Spark job.

What do you think about this option? 
I would be glad to hear about any issues with this idea or reasons it shouldn't be 
implemented. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27460) Running slowest test suites in their own forked JVMs for higher parallelism

2019-04-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27460.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24373
[https://github.com/apache/spark/pull/24373]

> Running slowest test suites in their own forked JVMs for higher parallelism
> ---
>
> Key: SPARK-27460
> URL: https://issues.apache.org/jira/browse/SPARK-27460
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Gengliang Wang
>Priority: Critical
> Fix For: 3.0.0
>
>
> We should modify SparkBuild so that the largest / slowest test suites (or 
> collections of suites) can run in their own forked JVMs, allowing them to be 
> run in parallel with each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27508) Reduce test time of HiveClientSuites

2019-04-18 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-27508.

Resolution: Won't Fix

Sorry, I think I made a mistake. The test time is about the same either way.



> Reduce test time of HiveClientSuites
> 
>
> Key: SPARK-27508
> URL: https://issues.apache.org/jira/browse/SPARK-27508
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> The test time of HiveClientSuites on Jenkins is about 3.5 minutes.
> The test suite itself is sometimes flaky.
> I find that changing the default table from a managed table to an external table 
> can speed up the tests, while the test scenarios are still covered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27508) Reduce test time of HiveClientSuites

2019-04-18 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-27508:
---
Component/s: (was: SQL)
 Tests

> Reduce test time of HiveClientSuites
> 
>
> Key: SPARK-27508
> URL: https://issues.apache.org/jira/browse/SPARK-27508
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> The test time of HiveClientSuites on Jenkins is about 3.5 minutes.
> The test suite itself is sometimes flaky.
> I find that changing the default table from a managed table to an external table 
> can speed up the tests, while the test scenarios are still covered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27508) Reduce test time of HiveClientSuites

2019-04-18 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27508:
--

 Summary: Reduce test time of HiveClientSuites
 Key: SPARK-27508
 URL: https://issues.apache.org/jira/browse/SPARK-27508
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


The test time of HiveClientSuites on Jenkins is about 3.5 minutes.
The test suite itself is sometimes flaky.

I find that changing the default table from a managed table to an external table can 
speed up the tests, while the test scenarios are still covered.
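
For context, a hedged sketch of the managed/external distinction being referred to (database, table, and path names are placeholders): dropping an external table only removes the metastore entry, so per-test cleanup is cheaper than for a managed table, whose data files must also be deleted.

{code:scala}
// Sketch only, assuming a Hive-enabled SparkSession `spark`; all names are placeholders.
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS test_db.ext_tbl (key INT, value STRING)
  STORED AS PARQUET
  LOCATION '/tmp/hive-client-suite/ext_tbl'
""")
// DROP TABLE on an external table removes only the metastore entry; the files
// under LOCATION are left in place.
spark.sql("DROP TABLE test_db.ext_tbl")
{code}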




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input

2019-04-18 Thread Michael Chirico (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Chirico updated SPARK-27507:

Description: 
Some long JSON objects are parsed incorrectly by {{get_json_object}}.

The specific string we noticed this on can't be shared, but here's some 
reproduction in Pyspark:
{code:java}
# v2.3.1

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

from string import ascii_lowercase

# create a long string
alpha_rep = ascii_lowercase*1000

# create a simple query on a simple json object which contains this string
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''

def run_q(s):
return len(spark.sql(test_q.format(s)).collect()[0][0])

def diagnose(s):
out_len = run_q(s)
# input & output should be identical (length match is a necessary condition)
print('Input length: %d\tOutput length: %d' % (len(s), out_len))
return True

def test_l(n):
diagnose(alpha_rep[:n])
return True

test_l(2264)
test_l(2265)
test_l(2667)
test_l(2666)
test_l(2668)
test_l(len(alpha_rep)){code}
With results on my instance:
{code:java}
Input length: 2264  Output length: 2264
Input length: 2265  Output length: 2265
Input length: 2667  Output length: 2660 < problematic!!
Input length: 2666  Output length: 2666
Input length: 2668  Output length: 2661 < problematic!!
Input length: 26000 Output length: 26000
{code}
It's strange that the error triggers for some lengths, but it's apparently not 
exclusively about the input being large.

 

More details from a {{pandas}} exploration:
{code:java}
import pandas as pd

DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})

N = DF.shape[0]
# note -- takes about 20 minutes to run on my machine
for ii in range(N):
DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
if ii % 520 == 0:
print("%.0f%% Done" % (100.0*ii/N))

DF[DF['n'] != DF['m']].shape
# (1326, 2)

DF['miss'] = DF['n'] - DF['m']
DF.plot('n', 'miss')
{code}
Plot attached

So it appears to fail for a narrowly defined range of about 1300 characters 
before recovering and continuing to function as expected.

  was:
Some long JSON objects are parsed incorrectly by {{get_json_object}}.

The specific string we noticed this on can't be shared, but here's some 
reproduction in Pyspark:
{code:java}
# v2.3.1

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

from string import ascii_lowercase

# create a long string
alpha_rep = ascii_lowercase*1000

# create a simple query on a simple json object which contains this string
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''

def run_q(s):
return len(spark.sql(test_q.format(s)).collect()[0][0])

def diagnose(s):
out_len = run_q(s)
# input & output should be identical (length match is a necessary condition)
print('Input length: %d\tOutput length: %d' % (len(s), out_len))
return True

def test_l(n):
diagnose(alpha_rep[:n])
return True

test_l(2264)
test_l(2265)
test_l(2667)
test_l(2666)
test_l(2668)
test_l(len(alpha_rep)){code}
With results on my instance:
{code:java}
Input length: 2264  Output length: 2264
Input length: 2265  Output length: 2265
Input length: 2667  Output length: 2660
Input length: 2666  Output length: 2666
Input length: 2668  Output length: 2661 < problematic!!
Input length: 26000 Output length: 26000
{code}
It's strange that the error triggers for some lengths, but it's apparently not 
exclusively about the input being large.

 

More details from a {{pandas}} exploration:
{code:java}
import pandas as pd

DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})

N = DF.shape[0]
# note -- takes about 20 minutes to run on my machine
for ii in range(N):
DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
if ii % 520 == 0:
print("%.0f%% Done" % (100.0*ii/N))

DF[DF['n'] != DF['m']].shape
# (1326, 2)

DF['miss'] = DF['n'] - DF['m']
DF.plot('n', 'miss')
{code}
Plot attached

So it appears to fail for a narrowly defined range of about 1300 characters 
before recovering and continuing to function as expected.


> get_json_object fails somewhat arbitrarily on long input
> 
>
> Key: SPARK-27507
> URL: https://issues.apache.org/jira/browse/SPARK-27507
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Michael Chirico
>Priority: Major
> Attachments: Screen Shot 2019-04-18 at 7.13.02 PM.png
>
>
> Some long JSON objects are parsed incorrectly by {{get_json_object}}.
> The specific string we noticed this on can't be shared, but here's some 
> reproduction in Pyspark:
> {code:java}
> # v2.3.1
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> from string import ascii_lowercase
> 

[jira] [Updated] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input

2019-04-18 Thread Michael Chirico (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Chirico updated SPARK-27507:

Attachment: Screen Shot 2019-04-18 at 7.13.02 PM.png

> get_json_object fails somewhat arbitrarily on long input
> 
>
> Key: SPARK-27507
> URL: https://issues.apache.org/jira/browse/SPARK-27507
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Michael Chirico
>Priority: Major
> Attachments: Screen Shot 2019-04-18 at 7.13.02 PM.png
>
>
> Some long JSON objects are parsed incorrectly by {{get_json_object}}.
> The specific string we noticed this on can't be shared, but here's some 
> reproduction in Pyspark:
> {code:java}
> # v2.3.1
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> from string import ascii_lowercase
> # create a long string
> alpha_rep = ascii_lowercase*1000
> # create a simple query on a simple json object which contains this string
> test_q = '''
> select get_json_object('{{"a": "{}"}}', '$.a')
> '''
> def run_q(s):
> return len(spark.sql(test_q.format(s)).collect()[0][0])
> def diagnose(s):
> out_len = run_q(s)
> # input & output should be identical (length match is a necessary 
> condition)
> print('Input length: %d\tOutput length: %d' % (len(s), out_len))
> return True
> def test_l(n):
> diagnose(alpha_rep[:n])
> return True
> test_l(2264)
> test_l(2265)
> test_l(2667)
> test_l(2666)
> test_l(2668)
> test_l(len(alpha_rep)){code}
> With results on my instance:
> {code:java}
> Input length: 2264  Output length: 2264
> Input length: 2265  Output length: 2265
> Input length: 2667  Output length: 2660
> Input length: 2666  Output length: 2666
> Input length: 2668  Output length: 2661 < problematic!!
> Input length: 26000 Output length: 26000
> {code}
> It's strange that the error triggers for some lengths, but it's apparently 
> not exclusively about the input being large.
>  
> More details from a {{pandas}} exploration:
> {code:java}
> import pandas as pd
> DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})
> N = DF.shape[0]
> # note -- takes about 20 minutes to run on my machine
> for ii in range(N):
> DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
> if ii % 520 == 0:
> print("%.0f%% Done" % (100.0*ii/N))
> DF[DF['n'] != DF['m']].shape
> # (1326, 2)
> DF['miss'] = DF['n'] - DF['m']
> DF.plot('n', 'miss')
> {code}
> Plot here:
> [https://imgur.com/vCPLNwy]
> So it appears to fail for a narrowly defined range of about 1300 characters 
> before recovering and continuing to function as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input

2019-04-18 Thread Michael Chirico (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Chirico updated SPARK-27507:

Description: 
Some long JSON objects are parsed incorrectly by {{get_json_object}}.

The specific string we noticed this on can't be shared, but here's some 
reproduction in Pyspark:
{code:java}
# v2.3.1

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

from string import ascii_lowercase

# create a long string
alpha_rep = ascii_lowercase*1000

# create a simple query on a simple json object which contains this string
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''

def run_q(s):
return len(spark.sql(test_q.format(s)).collect()[0][0])

def diagnose(s):
out_len = run_q(s)
# input & output should be identical (length match is a necessary condition)
print('Input length: %d\tOutput length: %d' % (len(s), out_len))
return True

def test_l(n):
diagnose(alpha_rep[:n])
return True

test_l(2264)
test_l(2265)
test_l(2667)
test_l(2666)
test_l(2668)
test_l(len(alpha_rep)){code}
With results on my instance:
{code:java}
Input length: 2264  Output length: 2264
Input length: 2265  Output length: 2265
Input length: 2667  Output length: 2660
Input length: 2666  Output length: 2666
Input length: 2668  Output length: 2661 < problematic!!
Input length: 26000 Output length: 26000
{code}
It's strange that the error triggers for some lengths, but it's apparently not 
exclusively about the input being large.

 

More details from a {{pandas}} exploration:
{code:java}
import pandas as pd

DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})

N = DF.shape[0]
# note -- takes about 20 minutes to run on my machine
for ii in range(N):
DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
if ii % 520 == 0:
print("%.0f%% Done" % (100.0*ii/N))

DF[DF['n'] != DF['m']].shape
# (1326, 2)

DF['miss'] = DF['n'] - DF['m']
DF.plot('n', 'miss')
{code}
Plot attached

So it appears to fail for a narrowly defined range of about 1300 characters 
before recovering and continuing to function as expected.

  was:
Some long JSON objects are parsed incorrectly by {{get_json_object}}.

The specific string we noticed this on can't be shared, but here's some 
reproduction in Pyspark:
{code:java}
# v2.3.1

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

from string import ascii_lowercase

# create a long string
alpha_rep = ascii_lowercase*1000

# create a simple query on a simple json object which contains this string
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''

def run_q(s):
return len(spark.sql(test_q.format(s)).collect()[0][0])

def diagnose(s):
out_len = run_q(s)
# input & output should be identical (length match is a necessary condition)
print('Input length: %d\tOutput length: %d' % (len(s), out_len))
return True

def test_l(n):
diagnose(alpha_rep[:n])
return True

test_l(2264)
test_l(2265)
test_l(2667)
test_l(2666)
test_l(2668)
test_l(len(alpha_rep)){code}
With results on my instance:
{code:java}
Input length: 2264  Output length: 2264
Input length: 2265  Output length: 2265
Input length: 2667  Output length: 2660
Input length: 2666  Output length: 2666
Input length: 2668  Output length: 2661 < problematic!!
Input length: 26000 Output length: 26000
{code}
It's strange that the error triggers for some lengths, but it's apparently not 
exclusively about the input being large.

 

More details from a {{pandas}} exploration:
{code:java}
import pandas as pd

DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})

N = DF.shape[0]
# note -- takes about 20 minutes to run on my machine
for ii in range(N):
DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
if ii % 520 == 0:
print("%.0f%% Done" % (100.0*ii/N))

DF[DF['n'] != DF['m']].shape
# (1326, 2)

DF['miss'] = DF['n'] - DF['m']
DF.plot('n', 'miss')
{code}
Plot here:

[https://imgur.com/vCPLNwy]

So it appears to fail for a narrowly defined range of about 1300 characters 
before recovering and continuing to function as expected.


> get_json_object fails somewhat arbitrarily on long input
> 
>
> Key: SPARK-27507
> URL: https://issues.apache.org/jira/browse/SPARK-27507
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Michael Chirico
>Priority: Major
> Attachments: Screen Shot 2019-04-18 at 7.13.02 PM.png
>
>
> Some long JSON objects are parsed incorrectly by {{get_json_object}}.
> The specific string we noticed this on can't be shared, but here's some 
> reproduction in Pyspark:
> {code:java}
> # v2.3.1
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> from string import ascii_lowerc

[jira] [Created] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input

2019-04-18 Thread Michael Chirico (JIRA)
Michael Chirico created SPARK-27507:
---

 Summary: get_json_object fails somewhat arbitrarily on long input
 Key: SPARK-27507
 URL: https://issues.apache.org/jira/browse/SPARK-27507
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.1
Reporter: Michael Chirico


Some long JSON objects are parsed incorrectly by {{get_json_object}}.

The specific string we noticed this on can't be shared, but here's some 
reproduction in Pyspark:
{code:java}
# v2.3.1

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

from string import ascii_lowercase

# create a long string
alpha_rep = ascii_lowercase*1000

# create a simple query on a simple json object which contains this string
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''

def run_q(s):
return len(spark.sql(test_q.format(s)).collect()[0][0])

def diagnose(s):
out_len = run_q(s)
# input & output should be identical (length match is a necessary condition)
print('Input length: %d\tOutput length: %d' % (len(s), out_len))
return True

def test_l(n):
diagnose(alpha_rep[:n])
return True

test_l(2264)
test_l(2265)
test_l(2667)
test_l(2666)
test_l(2668)
test_l(len(alpha_rep)){code}
With results on my instance:
{code:java}
Input length: 2264  Output length: 2264
Input length: 2265  Output length: 2265
Input length: 2667  Output length: 2660
Input length: 2666  Output length: 2666
Input length: 2668  Output length: 2661 < problematic!!
Input length: 26000 Output length: 26000
{code}
It's strange that the error triggers for some lengths, but it's apparently not 
exclusively about the input being large.

 

More details from a {{pandas}} exploration:
{code:java}
import pandas as pd

DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})

N = DF.shape[0]
# note -- takes about 20 minutes to run on my machine
for ii in range(N):
DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
if ii % 520 == 0:
print("%.0f%% Done" % (100.0*ii/N))

DF[DF['n'] != DF['m']].shape
# (1326, 2)

DF['miss'] = DF['n'] - DF['m']
DF.plot('n', 'miss')
{code}
Plot here:

[https://imgur.com/vCPLNwy]

So it appears to fail for a narrowly defined range of about 1300 characters 
before recovering and continuing to function as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27506) Function `from_avro` doesn't allow deserialization of data using other compatible schemas

2019-04-18 Thread Gianluca Amori (JIRA)
Gianluca Amori created SPARK-27506:
--

 Summary: Function `from_avro` doesn't allow deserialization of 
data using other compatible schemas
 Key: SPARK-27506
 URL: https://issues.apache.org/jira/browse/SPARK-27506
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: Gianluca Amori


 SPARK-24768 and its subtasks introduced support for reading and writing Avro data by 
parsing a binary column in Avro format and converting it into its corresponding 
Catalyst value (and vice versa).

 

The current implementation has the limitation that an event must be deserialized 
with the exact same schema with which it was serialized. This breaks 
one of the most important features of Avro, schema evolution 
[https://docs.confluent.io/current/schema-registry/avro.html] - most 
importantly, the ability to read old data with a newer (compatible) schema 
without breaking the consumer.

 

The GenericDatumReader in the Avro library already supports passing an optional 
*writer's schema* (the schema with which the record was serialized) alongside a 
mandatory *reader's schema* (the schema with which the record is going to be 
deserialized). The proposed change is to do the same in the from_avro function, 
allowing the possibility to pass an optional writer's schema to be used in the 
deserialization.
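
To illustrate the Avro-library capability being referenced (the schemas below are made-up examples): {{GenericDatumReader}} accepts both a writer's schema and a reader's schema, which is exactly the pair that {{from_avro}} currently cannot express.

{code:scala}
// Sketch only: Event and its fields are made-up schemas. Data written with the
// old writer schema can be decoded against a newer, compatible reader schema
// (the added field falls back to its default value).
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"long"}
    |]}""".stripMargin)

val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"tag","type":"string","default":""}
    |]}""".stripMargin)

val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
{code}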



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27505) autoBroadcastJoinThreshold including bigger table

2019-04-18 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820951#comment-16820951
 ] 

Mike Chan commented on SPARK-27505:
---

Result of "desc extended" on the table:

Statistics: 24452111 bytes

> autoBroadcastJoinThreshold including bigger table
> -
>
> Key: SPARK-27505
> URL: https://issues.apache.org/jira/browse/SPARK-27505
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Hive table with Spark 2.3.1 on Azure, using Azure 
> storage as storage layer
>Reporter: Mike Chan
>Priority: Major
> Attachments: explain_plan.txt
>
>
> I'm working on a case where, when a certain table is exposed to a broadcast join, 
> the query eventually fails with a remote block error. 
>  
> First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 10485760:
> !https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.2&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ96l-PZQKRrU2lSlUA7MGbz1DAK62y0fMFOG07rfgI3oXkalm4An9eHtd6hX3hsKDd9EJK46cGTaqj_qKVrzs7xLyJgvx8XHuu36HSSfBtxW9OnrckzikIDRPI&disp=emb&realattid=ii_jumg5jxd1|width=542,height=66!
>  
> Then we ran the query. In the SQL plan, we found that one table that is 25MB in 
> size is broadcast as well.
>  
> !https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.1&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ_Fx_sEOI2n4yYfOn0gCUYqFYMDrxsSzd-S9ehtl67Imi87NN3y8cCFUOrHwKYO3MTfi3LVCIGg7J9jEuqnlqa76pvrUaAzEKSUm9VtBoH-Zsf9qepJiS4NKLE&disp=emb&realattid=ii_jumg53fq0|width=227,height=542!
>  
> Per desc extended, the table is 24452111 bytes. It is a Hive table. We always run 
> into an error when this table is broadcast. Below is a sample error:
>  
> Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt 
> remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) 
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>  
> The physical plan is also attached if you're interested. One thing to note: if I 
> turn autoBroadcastJoinThreshold down to 5MB, this query executes successfully and 
> default.product is NOT broadcast.
> However, when I switch to another query that selects even fewer columns than the 
> previous one, this table is still broadcast at 5MB and fails with the same error. 
> I even changed the threshold to 1MB with the same result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27505) autoBroadcastJoinThreshold including bigger table

2019-04-18 Thread Mike Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Chan updated SPARK-27505:
--
Attachment: explain_plan.txt

> autoBroadcastJoinThreshold including bigger table
> -
>
> Key: SPARK-27505
> URL: https://issues.apache.org/jira/browse/SPARK-27505
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Hive table with Spark 2.3.1 on Azure, using Azure 
> storage as storage layer
>Reporter: Mike Chan
>Priority: Major
> Attachments: explain_plan.txt
>
>
> I'm working on a case where, when a certain table is exposed to a broadcast join, 
> the query eventually fails with a remote block error. 
>  
> First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 10485760:
> !https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.2&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ96l-PZQKRrU2lSlUA7MGbz1DAK62y0fMFOG07rfgI3oXkalm4An9eHtd6hX3hsKDd9EJK46cGTaqj_qKVrzs7xLyJgvx8XHuu36HSSfBtxW9OnrckzikIDRPI&disp=emb&realattid=ii_jumg5jxd1|width=542,height=66!
>  
> Then we ran the query. In the SQL plan, we found that one table that is 25MB in 
> size is broadcast as well.
>  
> !https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.1&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ_Fx_sEOI2n4yYfOn0gCUYqFYMDrxsSzd-S9ehtl67Imi87NN3y8cCFUOrHwKYO3MTfi3LVCIGg7J9jEuqnlqa76pvrUaAzEKSUm9VtBoH-Zsf9qepJiS4NKLE&disp=emb&realattid=ii_jumg53fq0|width=227,height=542!
>  
> Per desc extended, the table is 24452111 bytes. It is a Hive table. We always run 
> into an error when this table is broadcast. Below is a sample error:
>  
> Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt 
> remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) 
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>  
> The physical plan is also attached if you're interested. One thing to note: if I 
> turn autoBroadcastJoinThreshold down to 5MB, this query executes successfully and 
> default.product is NOT broadcast.
> However, when I switch to another query that selects even fewer columns than the 
> previous one, this table is still broadcast at 5MB and fails with the same error. 
> I even changed the threshold to 1MB with the same result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27505) autoBroadcastJoinThreshold including bigger table

2019-04-18 Thread Mike Chan (JIRA)
Mike Chan created SPARK-27505:
-

 Summary: autoBroadcastJoinThreshold including bigger table
 Key: SPARK-27505
 URL: https://issues.apache.org/jira/browse/SPARK-27505
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.3.1
 Environment: Hive table with Spark 2.3.1 on Azure, using Azure storage 
as storage layer
Reporter: Mike Chan


I'm working on a case where, when a certain table is exposed to a broadcast join, the 
query eventually fails with a remote block error. 
 
First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 10485760:
!https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.2&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ96l-PZQKRrU2lSlUA7MGbz1DAK62y0fMFOG07rfgI3oXkalm4An9eHtd6hX3hsKDd9EJK46cGTaqj_qKVrzs7xLyJgvx8XHuu36HSSfBtxW9OnrckzikIDRPI&disp=emb&realattid=ii_jumg5jxd1|width=542,height=66!
 
Then we ran the query. In the SQL plan, we found that one table that is 25MB in size 
is broadcast as well.
 
!https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.1&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ_Fx_sEOI2n4yYfOn0gCUYqFYMDrxsSzd-S9ehtl67Imi87NN3y8cCFUOrHwKYO3MTfi3LVCIGg7J9jEuqnlqa76pvrUaAzEKSUm9VtBoH-Zsf9qepJiS4NKLE&disp=emb&realattid=ii_jumg53fq0|width=227,height=542!
 
Per desc extended, the table is 24452111 bytes. It is a Hive table. We always run 
into an error when this table is broadcast. Below is a sample error:
 
Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt remote 
block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 at 
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
 at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
 at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) 
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)

 
The physical plan is also attached if you're interested. One thing to note: if I turn 
autoBroadcastJoinThreshold down to 5MB, this query executes successfully and 
default.product is NOT broadcast.
However, when I switch to another query that selects even fewer columns than the 
previous one, this table is still broadcast at 5MB and fails with the same error. I 
even changed the threshold to 1MB with the same result.
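
A hedged sketch of the threshold settings discussed above ({{spark}} is an active 
SparkSession; -1 is the documented value for disabling size-based automatic 
broadcast):
{code:java}
// The 10MB threshold mentioned in the report, and the workaround of
// disabling automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024) // 10485760 bytes
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)               // disable auto broadcast
{code}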



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A

2019-04-18 Thread Yu-Jhe Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Jhe Li resolved SPARK-24492.
---
Resolution: Won't Do

This hasn't happened again since upgrading to 2.3.2, so I'm going to close this issue.

> Endless attempted task when TaskCommitDenied exception writing to S3A
> -
>
> Key: SPARK-24492
> URL: https://issues.apache.org/jira/browse/SPARK-24492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yu-Jhe Li
>Priority: Critical
> Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 
> 2018-05-16 上午11.10.57.png
>
>
> Hi, when we ran a Spark application under spark-2.2.0 on AWS spot instances and 
> wrote output files to S3, some tasks retried endlessly and all of them failed with 
> a TaskCommitDenied exception. This happened when we ran the Spark application on 
> instances with network issues. (It runs well on healthy spot instances.)
> Sorry, I can't find an easy way to reproduce this issue; here's all I can provide.
> The Spark UI shows (in the attachments) that one task of stage 112 failed due to 
> FetchFailedException (a network issue) and a new stage 112 (retry 1) was attempted. 
> But in stage 112 (retry 1), all tasks failed due to the TaskCommitDenied exception 
> and kept retrying (they never succeeded and caused lots of S3 requests).
> On the other side, the driver logs show:
>  # task 123.0 in stage 112.0 failed due to FetchFailedException (a network issue 
> caused a corrupted file)
>  # a warning message from OutputCommitCoordinator
>  # task 92.0 in stage 112.1 failed when writing rows
>  # the failed tasks keep retrying but never succeed
> {noformat}
> 2018-05-16 02:38:055 WARN  TaskSetManager:66 - Lost task 123.0 in stage 112.0 
> (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 
> 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message=
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.T

[jira] [Created] (SPARK-27504) File source V2: support refreshing metadata cache

2019-04-18 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27504:
--

 Summary: File source V2: support refreshing metadata cache
 Key: SPARK-27504
 URL: https://issues.apache.org/jira/browse/SPARK-27504
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


In file source V1, if a file is deleted manually, reading the DataFrame/Table throws 
an exception with the suggestion message "It is possible the underlying files have 
been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH 
TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved."
After refreshing the table/DataFrame, reads should return correct results.

We should follow the same behavior in file source V2.
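
A hedged sketch of the recovery path that the message suggests ({{spark}} is an 
active SparkSession; the table name is illustrative):
{code:java}
// Invalidate the cached file listing, then re-read; both forms are equivalent.
spark.sql("REFRESH TABLE my_table")
spark.catalog.refreshTable("my_table")
{code}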



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27503) JobGenerator thread exit for some fatal errors but application keeps running

2019-04-18 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-27503:
-

 Summary: JobGenerator thread exit for some fatal errors but 
application keeps running
 Key: SPARK-27503
 URL: https://issues.apache.org/jira/browse/SPARK-27503
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 3.0.0
Reporter: Genmao Yu


The JobGenerator thread (and some other EventLoop threads) may exit on a fatal error, 
such as an OOM, while the Spark Streaming job keeps running with no batch jobs being 
generated. Currently, we only report non-fatal errors. 
{code}
override def run(): Unit = {
  try {
while (!stopped.get) {
  val event = eventQueue.take()
  try {
onReceive(event)
  } catch {
case NonFatal(e) =>
  try {
onError(e)
  } catch {
case NonFatal(e) => logError("Unexpected error in " + name, e)
  }
  }
}
  } catch {
case ie: InterruptedException => // exit even if eventQueue is not empty
case NonFatal(e) => logError("Unexpected error in " + name, e)
  }
}
{code}

In some corner cases, these event threads exit with an OOM error while the driver 
thread keeps running.
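
A minimal, self-contained illustration of that gap (a toy loop, not Spark's actual 
EventLoop; it assumes nothing beyond the standard library):
{code:java}
import java.util.concurrent.LinkedBlockingDeque
import scala.util.control.NonFatal

// NonFatal does not match fatal errors such as OutOfMemoryError, so neither
// catch clause below fires: the loop's thread dies silently while the rest
// of the process keeps running.
object EventLoopGapSketch {
  private val eventQueue = new LinkedBlockingDeque[String]()

  private def onReceive(event: String): Unit =
    if (event == "boom") throw new OutOfMemoryError("simulated fatal error")
    else println(s"handled $event")

  def run(): Unit = {
    try {
      while (true) {
        val event = eventQueue.take()
        try onReceive(event)
        catch { case NonFatal(e) => println(s"non-fatal in handler: $e") }
      }
    } catch {
      case _: InterruptedException => // normal shutdown path
      case NonFatal(e) => println(s"non-fatal outside handler: $e")
      // An OutOfMemoryError matches neither case and terminates this thread.
    }
  }

  def main(args: Array[String]): Unit = {
    eventQueue.put("ok")
    eventQueue.put("boom")
    run() // the simulated OOM escapes run() and kills this thread here
  }
}
{code}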



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27502) Update nested schema benchmark result for Orc V2

2019-04-18 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-27502:
---

 Summary: Update nested schema benchmark result for Orc V2
 Key: SPARK-27502
 URL: https://issues.apache.org/jira/browse/SPARK-27502
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


We added nested schema pruning support to Orc V2 recently. The benchmark result 
should be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect

2019-04-18 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820787#comment-16820787
 ] 

Yuming Wang edited comment on SPARK-27475 at 4/18/19 7:08 AM:
--

sbt can do it, but some of the results are incorrect; it adds {{jetty-*-9.4.12.jar}}: 
{code:java}
build/sbt  "show assembly/compile:dependencyClasspath" -Phadoop-3.2 | grep 
"Attributed(" | awk -F "/" '{print $NF}' | sed 's/'\)'//g' | sort | grep -v 
spark
{code}



was (Author: q79969786):
sbt can do it, but some result is incorrect. It added {{jetty-*-9.4.12.jar}}: 
{code:java}
build/sbt  "show assembly/compile:dependencyClasspath" -Phadoop-3.2 | grep 
"Attributed(" | rev | sed \ 's/^.//' | cut -d "/" -f 1 | rev | sort | grep -v 
spark
{code}


> dev/deps/spark-deps-hadoop-3.2 is incorrect
> ---
>
> Key: SPARK-27475
> URL: https://issues.apache.org/jira/browse/SPARK-27475
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect

2019-04-18 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820787#comment-16820787
 ] 

Yuming Wang commented on SPARK-27475:
-

sbt can do it, but some of the results are incorrect; it adds {{jetty-*-9.4.12.jar}}: 
{code:java}
build/sbt  "show assembly/compile:dependencyClasspath" -Phadoop-3.2 | grep 
"Attributed(" | rev | sed \ 's/^.//' | cut -d "/" -f 1 | rev | sort | grep -v 
spark
{code}


> dev/deps/spark-deps-hadoop-3.2 is incorrect
> ---
>
> Key: SPARK-27475
> URL: https://issues.apache.org/jira/browse/SPARK-27475
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org