[jira] [Commented] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A

2018-06-08 Thread Yu-Jhe Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506792#comment-16506792
 ] 

Yu-Jhe Li commented on SPARK-24492:
---

[~ste...@apache.org]: I think the TaskCommitDenied exception is not caused by 
S3 eventual consistency. As [~jiangxb1987] mentioned, another task may have 
acquired the lock for commit permission.

Btw, in this case, we did not enable speculation.

> Endless attempted task when TaskCommitDenied exception writing to S3A
> -
>
> Key: SPARK-24492
> URL: https://issues.apache.org/jira/browse/SPARK-24492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yu-Jhe Li
>Priority: Critical
> Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 
> 2018-05-16 上午11.10.57.png
>
>
> Hi, when we run a Spark application under spark-2.2.0 on AWS spot instances and 
> write output files to S3, some tasks retry endlessly and all of them fail with a 
> TaskCommitDenied exception. This happens when we run the Spark application on 
> instances with network issues. (It runs well on healthy spot instances.)
> Sorry, I can't find an easy way to reproduce this issue; here's all I can 
> provide.
> The Spark UI (see attachments) shows that one task of stage 112 failed due to 
> FetchFailedException (a network issue) and a new stage 112 (retry 1) was 
> attempted. But in stage 112 (retry 1), all tasks failed due to the 
> TaskCommitDenied exception and keep retrying (they never succeed and cause lots 
> of S3 requests).
> On the other side, the driver logs show:
>  # task 123.0 in stage 112.0 failed due to FetchFailedException (a network 
> issue caused a corrupted file)
>  # a warning message from OutputCommitCoordinator
>  # task 92.0 in stage 112.1 failed when writing rows
>  # the failed tasks keep retrying, but never succeed
> {noformat}
> 2018-05-16 02:38:055 WARN  TaskSetManager:66 - Lost task 123.0 in stage 112.0 
> (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 
> 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message=
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:80)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   

[jira] [Resolved] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative

2018-06-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24468.
-
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 21499
[https://github.com/apache/spark/pull/21499]

> DecimalType `adjustPrecisionScale` might fail when scale is negative
> 
>
> Key: SPARK-24468
> URL: https://issues.apache.org/jira/browse/SPARK-24468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifei Wu
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0, 2.3.2
>
>
> Hi, I am using the MySQL JDBC Driver along with Spark to run some SQL queries.
> When multiplying a LongType with a decimal in scientific notation, say
> {code:java}
> spark.sql("select some_int * 2.34E10 from t"){code}
> , the decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be 
> cast as decimal(20,0).
>  
> So according to the rules in comments:
> {code:java}
> /*
>  *   Operation    Result Precision                        Result Scale
>  *   ------------------------------------------------------------------------
>  *   e1 + e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 - e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 * e2      p1 + p2 + 1                             s1 + s2
>  *   e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)      max(6, s1 + p2 + 1)
>  *   e1 % e2      min(p1-s1, p2-s2) + max(s1, s2)         max(s1, s2)
>  *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2)         max(s1, s2)
> */
> {code}
> their product will be decimal(3+20+1, -8+0), i.e. decimal(24, -8), which fails 
> the assertion (scale >= 0) at DecimalType.scala:166.
>  
> My current workaround is to set 
> spark.sql.decimalOperations.allowPrecisionLoss to false.
>  
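
For reference, a minimal sketch of the reported scenario and workaround. It 
assumes a table t with a LongType column some_int, as in the snippet above; 
everything else (object name, sample values) is illustrative:

{code:scala}
// Minimal sketch of the reported scenario (table `t` and column `some_int`
// are taken from the report; the sample values are made up).
import org.apache.spark.sql.SparkSession

object Spark24468Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-24468-sketch")
      .master("local[*]")
      // Reported workaround: disable precision-loss handling so the
      // negative-scale product no longer trips the scale >= 0 assertion.
      .config("spark.sql.decimalOperations.allowPrecisionLoss", "false")
      .getOrCreate()
    import spark.implicits._

    Seq(1L, 2L).toDF("some_int").createOrReplaceTempView("t")

    // Per the report, 2.34E10 is parsed as decimal(3, -8) and some_int is
    // cast to decimal(20, 0), so the product type would be decimal(24, -8).
    spark.sql("select some_int * 2.34E10 from t").show()

    spark.stop()
  }
}
{code}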



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative

2018-06-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24468:
---

Assignee: Marco Gaido

> DecimalType `adjustPrecisionScale` might fail when scale is negative
> 
>
> Key: SPARK-24468
> URL: https://issues.apache.org/jira/browse/SPARK-24468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifei Wu
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.3.2, 2.4.0
>
>
> Hi, I am using the MySQL JDBC Driver along with Spark to run some SQL queries.
> When multiplying a LongType with a decimal in scientific notation, say
> {code:java}
> spark.sql("select some_int * 2.34E10 from t"){code}
> , the decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be 
> cast as decimal(20,0).
>  
> So according to the rules in comments:
> {code:java}
> /*
>  *   Operation    Result Precision                        Result Scale
>  *   ------------------------------------------------------------------------
>  *   e1 + e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 - e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 * e2      p1 + p2 + 1                             s1 + s2
>  *   e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)      max(6, s1 + p2 + 1)
>  *   e1 % e2      min(p1-s1, p2-s2) + max(s1, s2)         max(s1, s2)
>  *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2)         max(s1, s2)
> */
> {code}
> their product will be decimal(3+20+1, -8+0), i.e. decimal(24, -8), which fails 
> the assertion (scale >= 0) at DecimalType.scala:166.
>  
> My current workaround is to set 
> spark.sql.decimalOperations.allowPrecisionLoss to false.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs

2018-06-08 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24412.
-
Resolution: Fixed

> Adding docs about automagical type casting in `isin` and `isInCollection` APIs
> --
>
> Key: SPARK-24412
> URL: https://issues.apache.org/jira/browse/SPARK-24412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: thrvskn
>Priority: Major
>  Labels: starter
> Fix For: 2.4.0
>
>
> We should let users know that those two APIs will compare data with 
> autocasting.
> See https://github.com/apache/spark/pull/21416#discussion_r191491943 for 
> detail.
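
As a hedged illustration of the auto-casting these docs are meant to describe 
(the column name and values below are made up; isInCollection is the 2.4.0 API):

{code:scala}
// Illustrative sketch only: an Int column compared against String values.
// With the auto-casting described above, both calls are expected to match
// the rows with id 1 and 2 rather than fail on the type mismatch.
import org.apache.spark.sql.SparkSession

object IsinCastingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("isin-casting-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("id")

    df.filter($"id".isin("1", "2")).show()                // varargs form
    df.filter($"id".isInCollection(Seq("1", "2"))).show() // collection form (2.4.0+)

    spark.stop()
  }
}
{code}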



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs

2018-06-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506730#comment-16506730
 ] 

Apache Spark commented on SPARK-24412:
--

User 'trvskn' has created a pull request for this issue:
https://github.com/apache/spark/pull/21519

> Adding docs about automagical type casting in `isin` and `isInCollection` APIs
> --
>
> Key: SPARK-24412
> URL: https://issues.apache.org/jira/browse/SPARK-24412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: thrvskn
>Priority: Major
>  Labels: starter
> Fix For: 2.4.0
>
>
> We should let users know that those two APIs will compare data with 
> autocasting.
> See https://github.com/apache/spark/pull/21416#discussion_r191491943 for 
> detail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs

2018-06-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24412:


Assignee: thrvskn  (was: Apache Spark)

> Adding docs about automagical type casting in `isin` and `isInCollection` APIs
> --
>
> Key: SPARK-24412
> URL: https://issues.apache.org/jira/browse/SPARK-24412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: thrvskn
>Priority: Major
>  Labels: starter
> Fix For: 2.4.0
>
>
> We should let users know that those two APIs will compare data with 
> autocasting.
> See https://github.com/apache/spark/pull/21416#discussion_r191491943 for 
> detail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs

2018-06-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24412:


Assignee: Apache Spark  (was: thrvskn)

> Adding docs about automagical type casting in `isin` and `isInCollection` APIs
> --
>
> Key: SPARK-24412
> URL: https://issues.apache.org/jira/browse/SPARK-24412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
> Fix For: 2.4.0
>
>
> We should let users know that those two APIs will compare data with 
> autocasting.
> See https://github.com/apache/spark/pull/21416#discussion_r191491943 for 
> detail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-06-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506728#comment-16506728
 ] 

Apache Spark commented on SPARK-24502:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21518

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
> stopped.
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at 
> org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
>   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-06-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24502:


Assignee: Apache Spark  (was: Wenchen Fan)

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
> stopped.
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at 
> org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
>   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-06-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24502:


Assignee: Wenchen Fan  (was: Apache Spark)

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
> stopped.
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at 
> org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
>   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-06-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-24502:
---

 Summary: flaky test: UnsafeRowSerializerSuite
 Key: SPARK-24502
 URL: https://issues.apache.org/jira/browse/SPARK-24502
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-06-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24502:

Description: 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/

{code}
sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
stopped.
at 
org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
at 
org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
at 
org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
at 
org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
at 
org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
at 
org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
at 
org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
at 
org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
at 
org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
at 
org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
at 
org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
at 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
at 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
...
{code}

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
> stopped.
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at 
> org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at 
> 

[jira] [Updated] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-06-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24502:

Labels: flaky-test  (was: )

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher

2018-06-08 Thread Nicholas Parker (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506709#comment-16506709
 ] 

Nicholas Parker commented on SPARK-24501:
-

PR: https://github.com/apache/spark/pull/21516

> Add metrics for Mesos Driver and Dispatcher
> ---
>
> Key: SPARK-24501
> URL: https://issues.apache.org/jira/browse/SPARK-24501
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.1
>Reporter: Nicholas Parker
>Priority: Minor
>
> The Mesos Dispatcher currently only has three gauge metrics, for the number 
> of waiting, launched, and retrying drivers, defined in 
> MesosClusterSchedulerSource.scala. Meanwhile, the Mesos Driver doesn't have 
> any metrics defined at all.
> Implement additional metrics for both the Mesos Dispatcher and Mesos Driver, 
> including e.g. types of events received, how long it takes jobs to run, for 
> use by operators to monitor and/or diagnose their Spark environment.
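
A hedged sketch (not the actual change) of how an additional Dropwizard gauge 
could be registered alongside the three existing dispatcher gauges; the metric 
name and the queuedDrivers callback are illustrative:

{code:scala}
// Illustrative sketch only: registering an extra Dropwizard (Codahale) gauge
// in the style of the existing dispatcher gauges. `queuedDrivers` stands in
// for whatever scheduler state would back the metric.
import com.codahale.metrics.{Gauge, MetricRegistry}

class DispatcherMetricsSketch(queuedDrivers: () => Int) {
  val metricRegistry = new MetricRegistry()

  metricRegistry.register(
    MetricRegistry.name("mesos_cluster", "waitingDrivers"),
    new Gauge[Int] {
      // Evaluated each time a metrics sink polls the registry.
      override def getValue: Int = queuedDrivers()
    })
}
{code}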



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher

2018-06-08 Thread Nicholas Parker (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Parker updated SPARK-24501:

Comment: was deleted

(was: PR: https://github.com/apache/spark/pull/21516)

> Add metrics for Mesos Driver and Dispatcher
> ---
>
> Key: SPARK-24501
> URL: https://issues.apache.org/jira/browse/SPARK-24501
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.1
>Reporter: Nicholas Parker
>Priority: Minor
>
> The Mesos Dispatcher currently only has three gauge metrics, for the number 
> of waiting, launched, and retrying drivers, defined in 
> MesosClusterSchedulerSource.scala. Meanwhile, the Mesos Driver doesn't have 
> any metrics defined at all.
> Implement additional metrics for both the Mesos Dispatcher and Mesos Driver, 
> including e.g. types of events received, how long it takes jobs to run, for 
> use by operators to monitor and/or diagnose their Spark environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher

2018-06-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24501:


Assignee: (was: Apache Spark)

> Add metrics for Mesos Driver and Dispatcher
> ---
>
> Key: SPARK-24501
> URL: https://issues.apache.org/jira/browse/SPARK-24501
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.1
>Reporter: Nicholas Parker
>Priority: Minor
>
> The Mesos Dispatcher currently only has three gauge metrics, for the number 
> of waiting, launched, and retrying drivers, defined in 
> MesosClusterSchedulerSource.scala. Meanwhile, the Mesos Driver doesn't have 
> any metrics defined at all.
> Implement additional metrics for both the Mesos Dispatcher and Mesos Driver, 
> including e.g. types of events received, how long it takes jobs to run, for 
> use by operators to monitor and/or diagnose their Spark environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher

2018-06-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24501:


Assignee: Apache Spark

> Add metrics for Mesos Driver and Dispatcher
> ---
>
> Key: SPARK-24501
> URL: https://issues.apache.org/jira/browse/SPARK-24501
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.1
>Reporter: Nicholas Parker
>Assignee: Apache Spark
>Priority: Minor
>
> The Mesos Dispatcher currently only has three gauge metrics, for the number 
> of waiting, launched, and retrying drivers, defined in 
> MesosClusterSchedulerSource.scala. Meanwhile, the Mesos Driver doesn't have 
> any metrics defined at all.
> Implement additional metrics for both the Mesos Dispatcher and Mesos Driver, 
> including e.g. types of events received, how long it takes jobs to run, for 
> use by operators to monitor and/or diagnose their Spark environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher

2018-06-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506707#comment-16506707
 ] 

Apache Spark commented on SPARK-24501:
--

User 'nickbp' has created a pull request for this issue:
https://github.com/apache/spark/pull/21516

> Add metrics for Mesos Driver and Dispatcher
> ---
>
> Key: SPARK-24501
> URL: https://issues.apache.org/jira/browse/SPARK-24501
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.1
>Reporter: Nicholas Parker
>Priority: Minor
>
> The Mesos Dispatcher currently only has three gauge metrics, for the number 
> of waiting, launched, and retrying drivers, defined in 
> MesosClusterSchedulerSource.scala. Meanwhile, the Mesos Driver doesn't have 
> any metrics defined at all.
> Implement additional metrics for both the Mesos Dispatcher and Mesos Driver, 
> including e.g. types of events received, how long it takes jobs to run, for 
> use by operators to monitor and/or diagnose their Spark environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24372) Create script for preparing RCs

2018-06-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24372:


Assignee: (was: Apache Spark)

> Create script for preparing RCs
> ---
>
> Key: SPARK-24372
> URL: https://issues.apache.org/jira/browse/SPARK-24372
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Currently, when preparing RCs, the RM has to invoke many scripts manually, 
> make sure that it is being done in the correct environment, and set all the 
> correct environment variables, which differ from one script to the other.
> It would be much easier for RMs if all of that were automated as much as possible.
> I'm working on something like this as part of releasing 2.3.1, and plan to 
> send my scripts for review after the release is done (i.e. after I make sure 
> the scripts are working properly).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24372) Create script for preparing RCs

2018-06-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24372:


Assignee: Apache Spark

> Create script for preparing RCs
> ---
>
> Key: SPARK-24372
> URL: https://issues.apache.org/jira/browse/SPARK-24372
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> Currently, when preparing RCs, the RM has to invoke many scripts manually, 
> make sure that it is being done in the correct environment, and set all the 
> correct environment variables, which differ from one script to the other.
> It would be much easier for RMs if all of that were automated as much as possible.
> I'm working on something like this as part of releasing 2.3.1, and plan to 
> send my scripts for review after the release is done (i.e. after I make sure 
> the scripts are working properly).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24372) Create script for preparing RCs

2018-06-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506703#comment-16506703
 ] 

Apache Spark commented on SPARK-24372:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21515

> Create script for preparing RCs
> ---
>
> Key: SPARK-24372
> URL: https://issues.apache.org/jira/browse/SPARK-24372
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Currently, when preparing RCs, the RM has to invoke many scripts manually, 
> make sure that it is being done in the correct environment, and set all the 
> correct environment variables, which differ from one script to the other.
> It would be much easier for RMs if all of that were automated as much as possible.
> I'm working on something like this as part of releasing 2.3.1, and plan to 
> send my scripts for review after the release is done (i.e. after I make sure 
> the scripts are working properly).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher

2018-06-08 Thread Nicholas Parker (JIRA)
Nicholas Parker created SPARK-24501:
---

 Summary: Add metrics for Mesos Driver and Dispatcher
 Key: SPARK-24501
 URL: https://issues.apache.org/jira/browse/SPARK-24501
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.3.1
Reporter: Nicholas Parker


The Mesos Dispatcher currently only has three gauge metrics, for the number of 
waiting, launched, and retrying drivers, defined in 
MesosClusterSchedulerSource.scala. Meanwhile, the Mesos Driver doesn't have any 
metrics defined at all.

Implement additional metrics for both the Mesos Dispatcher and Mesos Driver, 
including e.g. types of events received, how long it takes jobs to run, for use 
by operators to monitor and/or diagnose their Spark environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch

2018-06-08 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24106:
---
Fix Version/s: (was: 2.4.0)
   (was: 2.3.0)

> Spark Structure Streaming with RF model taking long time in processing 
> probability for each mini batch
> --
>
> Key: SPARK-24106
> URL: https://issues.apache.org/jira/browse/SPARK-24106
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
> Environment: Spark yarn / Standalone cluster
> 2 master nodes - 32 cores - 124 GB
> 9 worker nodes - 32 cores - 124 GB
> Kafka input and output topic with 6 partition
>Reporter: Tamilselvan Veeramani
>Priority: Major
>  Labels: performance
>
> The RandomForestClassificationModel is broadcast to the executors for every 
> mini batch in Spark Streaming while trying to compute the probability.
> RF model size: 45 MB
> Spark Kafka streaming job jar size: 8 MB (including Kafka dependencies)
> The following log shows the model being broadcast to the executors for every 
> mini batch when we call rf_model.transform(dataset).select("probability").
> Because of this, task deserialization time increases to 6 to 7 seconds for 
> the 45 MB RF model, although processing time is just 400 to 600 ms per mini 
> batch.
> 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: 
> KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
>  18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
> 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory 
> on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
> After 2 to 3 weeks of struggle, I found a potential solution which will help 
> many people who are looking to use an RF model for “probability” in a 
> real-time streaming context.
> The transformImpl method of the RandomForestClassificationModel class 
> implements only “prediction” in the current version of Spark; it can be 
> extended to implement “probability” as well.
> I have modified the code and deployed it on our server, and it runs as fast 
> as 400 ms to 500 ms for every mini batch.
> I see many people out there facing this issue with no solution provided in 
> any of the forums. Can you please review and include this fix in the next 
> release? Thanks.
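
For reference, a minimal sketch of the call pattern described above (the model 
path is a placeholder, and the local DataFrame stands in for one mini batch 
that already carries a "features" vector column):

{code:scala}
// Minimal sketch of the reported call pattern (the model path is a
// placeholder; the local DataFrame stands in for one mini batch from Kafka
// that already has a "features" vector column).
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object RfProbabilitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rf-probability-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the per-mini-batch data described in the report.
    val batch = Seq(Tuple1(Vectors.dense(0.1, 0.9))).toDF("features")

    val rfModel = RandomForestClassificationModel.load("/models/rf_model")

    // Selecting "probability" is the step reported to re-broadcast the model
    // on every mini batch, unlike the "prediction"-only transformImpl path.
    rfModel.transform(batch).select("probability").show()

    spark.stop()
  }
}
{code}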



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch

2018-06-08 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24106:
---
Target Version/s:   (was: 2.2.1, 2.3.0)

> Spark Structure Streaming with RF model taking long time in processing 
> probability for each mini batch
> --
>
> Key: SPARK-24106
> URL: https://issues.apache.org/jira/browse/SPARK-24106
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
> Environment: Spark yarn / Standalone cluster
> 2 master nodes - 32 cores - 124 GB
> 9 worker nodes - 32 cores - 124 GB
> Kafka input and output topic with 6 partition
>Reporter: Tamilselvan Veeramani
>Priority: Major
>  Labels: performance
>
> The RandomForestClassificationModel is broadcast to the executors for every 
> mini batch in Spark Streaming while trying to compute the probability.
> RF model size: 45 MB
> Spark Kafka streaming job jar size: 8 MB (including Kafka dependencies)
> The following log shows the model being broadcast to the executors for every 
> mini batch when we call rf_model.transform(dataset).select("probability").
> Because of this, task deserialization time increases to 6 to 7 seconds for 
> the 45 MB RF model, although processing time is just 400 to 600 ms per mini 
> batch.
> 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: 
> KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
>  18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
> 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory 
> on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
> After 2 to 3 weeks of struggle, I found a potential solution which will help 
> many people who are looking to use an RF model for “probability” in a 
> real-time streaming context.
> The transformImpl method of the RandomForestClassificationModel class 
> implements only “prediction” in the current version of Spark; it can be 
> extended to implement “probability” as well.
> I have modified the code and deployed it on our server, and it runs as fast 
> as 400 ms to 500 ms for every mini batch.
> I see many people out there facing this issue with no solution provided in 
> any of the forums. Can you please review and include this fix in the next 
> release? Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2018-06-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22860:


Assignee: (was: Apache Spark)

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Priority: Major
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword values passed via the CLI to the executor 
> processes. The ExecutorRunner should mask these passwords so that they do not 
> appear in the worker's log files at INFO level. In this example, you can see 
> my 'SuperSecretPassword' in a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}
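
A hedged sketch (not the actual patch) of the kind of redaction being asked for, 
applied to a launch command before it is logged; the object name and the key 
list are illustrative:

{code:scala}
// Illustrative sketch only: redact password-bearing -D options before a
// launch command is written to the worker log at INFO level.
object LaunchCommandRedaction {
  private val sensitiveKeys = Seq(
    "spark.ssl.keyStorePassword",
    "spark.ssl.trustStorePassword")

  /** Replace the value of any sensitive -Dkey=value argument with a placeholder. */
  def redact(command: Seq[String]): Seq[String] = command.map { arg =>
    sensitiveKeys.find(key => arg.startsWith(s"-D$key=")) match {
      case Some(key) => s"-D$key=*********(redacted)"
      case None      => arg
    }
  }
}

// Example: log the redacted form, e.g.
//   LaunchCommandRedaction.redact(command).mkString("\"", "\" \"", "\"")
{code}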



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2018-06-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22860:


Assignee: Apache Spark

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Assignee: Apache Spark
>Priority: Major
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword values passed via the CLI to the executor 
> processes. The ExecutorRunner should mask these passwords so that they do not 
> appear in the worker's log files at INFO level. In this example, you can see 
> my 'SuperSecretPassword' in a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2018-06-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506693#comment-16506693
 ] 

Apache Spark commented on SPARK-22860:
--

User 'tooptoop4' has created a pull request for this issue:
https://github.com/apache/spark/pull/21514

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Priority: Major
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword values passed via the CLI to the executor 
> processes. The ExecutorRunner should mask these passwords so that they do not 
> appear in the worker's log files at INFO level. In this example, you can see 
> my 'SuperSecretPassword' in a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs

2018-06-08 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24412:
---

Assignee: thrvskn

> Adding docs about automagical type casting in `isin` and `isInCollection` APIs
> --
>
> Key: SPARK-24412
> URL: https://issues.apache.org/jira/browse/SPARK-24412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: thrvskn
>Priority: Major
>  Labels: starter
> Fix For: 2.4.0
>
>
> We should let users know that those two APIs will compare data with 
> autocasting.
> See https://github.com/apache/spark/pull/21416#discussion_r191491943 for 
> detail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23010) Add integration testing for Kubernetes backend into the apache/spark repository

2018-06-08 Thread Matt Cheah (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah resolved SPARK-23010.

   Resolution: Fixed
Fix Version/s: 2.4.0

Some tests are missing but we got a basic set of the tests and the 
infrastructure into master.

> Add integration testing for Kubernetes backend into the apache/spark 
> repository
> ---
>
> Key: SPARK-23010
> URL: https://issues.apache.org/jira/browse/SPARK-23010
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
> Fix For: 2.4.0
>
>
> Add tests for the scheduler backend into apache/spark
> /xref: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Integration-testing-and-Scheduler-Backends-td23105.html






[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-08 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506521#comment-16506521
 ] 

Felix Cheung edited comment on SPARK-24359 at 6/8/18 8:16 PM:
--

Thanks Joseph - correct, one possibility is to branch 2.4.0 into branch-2.4.0.1 
so that it contains only the official branch-2.4 release commits plus any 
alterations for SparkML, and then tag it as v2.4.0.1.

I think many of us would agree that a separate repo is only for convenience - so if 
someone signs up to handle the branching, commit porting, etc., and we get the 
community to vote on such a "SparkML only" release, then it is ok. Thinking about 
it, though, we would still officially have a Spark 2.4.0.1 release (hopefully with 
no change from 2.4.0) in addition to SparkML 2.4.0.1, due to the way the 
release/tag process works.

 


was (Author: felixcheung):
Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 
such that it contains only branch-2.3 official release commits + any 
alterations for SparkML and then tag it as v2.4.0.1.

I think many of us would agree to separate repo only for convenience - so if 
one would sign up to handle the branching and commit porting etc and we get 
community to vote on such a "SparkML only" release, then it is ok. Though 
thinking about it we would still have officially a Spark 2.4.0.1 release (with 
no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way 
the release/tag process work.

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark 

[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-08 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506521#comment-16506521
 ] 

Felix Cheung edited comment on SPARK-24359 at 6/8/18 8:16 PM:
--

Thanks Joseph - correct, one possibility is to branch 2.4.0 into branch-2.4.0.1 
so that it contains only the official branch-2.4 release commits plus any 
alterations for SparkML, and then tag it as v2.4.0.1.

I think many of us would agree that a separate repo is only for convenience - so if 
someone signs up to handle the branching, commit porting, etc., and we get the 
community to vote on such a "SparkML only" release, then it is ok. Thinking about 
it, though, we would still officially have a Spark 2.4.0.1 release (hopefully with 
no change from 2.4.0) in addition to SparkML 2.4.0.1, due to the way the 
release/tag process works. And this 2.4.0.1 would likely be visible in the release 
share, Maven, etc. too.

 


was (Author: felixcheung):
Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 
such that it contains only branch-2.4 official release commits + any 
alterations for SparkML and then tag it as v2.4.0.1.

I think many of us would agree to separate repo only for convenience - so if 
one would sign up to handle the branching and commit porting etc and we get 
community to vote on such a "SparkML only" release, then it is ok. Though 
thinking about it we would still have officially a Spark 2.4.0.1 release (with 
no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way 
the release/tag process work.

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-08 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506521#comment-16506521
 ] 

Felix Cheung commented on SPARK-24359:
--

Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 
such that it contains only branch-2.3 official release commits + any 
alterations for SparkML and then tag it as v2.4.0.1.

I think many of us would agree to separate repo only for convenience - so if 
one would sign up to handle the branching and commit porting etc and we get 
community to vote on such a "SparkML only" release, then it is ok. Though 
thinking about it we would still have officially a Spark 2.4.0.1 release (with 
no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way 
the release/tag process work.

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
> When calls need to be chained, 

[jira] [Commented] (SPARK-19826) spark.ml Python API for PIC

2018-06-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506503#comment-16506503
 ] 

Apache Spark commented on SPARK-19826:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21513

> spark.ml Python API for PIC
> ---
>
> Key: SPARK-19826
> URL: https://issues.apache.org/jira/browse/SPARK-19826
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
>







[jira] [Created] (SPARK-24500) UnsupportedOperationException when trying to execute Union plan with Stream of children

2018-06-08 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created SPARK-24500:
---

 Summary: UnsupportedOperationException when trying to execute 
Union plan with Stream of children
 Key: SPARK-24500
 URL: https://issues.apache.org/jira/browse/SPARK-24500
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Bogdan Raducanu


To reproduce:
{code}
import org.apache.spark.sql.catalyst.plans.logical._
def range(i: Int) = Range(1, i, 1, 1)
val union = Union(Stream(range(3), range(5), range(7)))
spark.sessionState.planner.plan(union).next().execute()
{code}

produces

{code}
java.lang.UnsupportedOperationException
  at 
org.apache.spark.sql.execution.PlanLater.doExecute(SparkStrategies.scala:55)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
{code}

The SparkPlan looks like this:

{code}
:- Range (1, 3, step=1, splits=1)
:- PlanLater Range (1, 5, step=1, splits=Some(1))
+- PlanLater Range (1, 7, step=1, splits=Some(1))
{code}

So not all of it was planned (some PlanLater still in there).
This appears to be a longstanding issue.
I traced it to the use of var in TreeNode.
For example in mapChildren:
{code}
case args: Traversable[_] => args.map {
  case arg: TreeNode[_] if containsChild(arg) =>
val newChild = f(arg.asInstanceOf[BaseType])
if (!(newChild fastEquals arg)) {
  changed = true
{code}

If args is a Stream, the map above is evaluated lazily for everything past the head, 
so changed is never set for the deferred elements, ultimately causing the method to 
return the original plan with its children untransformed.
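
A self-contained illustration of that laziness (plain Scala 2.11/2.12 {{Stream}}, independent of TreeNode): the mapped function runs eagerly only for the head, so a flag set for a tail element is still false when read right after the {{map}}.

{code:scala}
var changed = false
val args = Stream(1, 2, 3)

val mapped = args.map { x =>
  if (x > 1) changed = true   // the "change" happens only on tail elements
  x
}

println(changed)              // false: the tail of the Stream has not been forced yet
mapped.toList                 // force the whole Stream
println(changed)              // true, but the caller has already read the flag by now
{code}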






[jira] [Updated] (SPARK-22666) Spark datasource for image format

2018-06-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-22666:
--
Target Version/s: 2.4.0

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused some 
> issues for some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.
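
For illustration, a sketch of how the proposed reader interface would be used (hypothetical until this ticket lands; the nested field names assume the existing {{ImageSchema}}):

{code:scala}
// spark-shell style; an existing SparkSession `spark` is assumed.
val images = spark.read.format("image").load("s3a://my-bucket/images/")

images.printSchema()   // expected: an `image` struct column (origin, height, width, ...)
images.select("image.origin", "image.height", "image.width").show(false)
{code}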






[jira] [Commented] (SPARK-22666) Spark datasource for image format

2018-06-08 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506471#comment-16506471
 ] 

Xiangrui Meng commented on SPARK-22666:
---

[~mhamilton] Do you want to take this task?

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused some 
> issues for some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.






[jira] [Updated] (SPARK-22666) Spark datasource for image format

2018-06-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-22666:
--
Shepherd: Xiangrui Meng

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused some 
> issues for some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.






[jira] [Resolved] (SPARK-24213) Power Iteration Clustering in the SparkML throws exception, when the ID is IntType

2018-06-08 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid resolved SPARK-24213.

Resolution: Fixed

> Power Iteration Clustering in the SparkML throws exception, when the ID is 
> IntType
> --
>
> Key: SPARK-24213
> URL: https://issues.apache.org/jira/browse/SPARK-24213
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> While running the code, PowerIterationClustering in spark ML throws exception.
> {code:scala}
> import org.apache.spark.ml.clustering.PowerIterationClustering
> val data = spark.createDataFrame(Seq(
> (0, Array(1), Array(0.9)),
> (1, Array(2), Array(0.9)),
> (2, Array(3), Array(0.9)),
> (3, Array(4), Array(0.1)),
> (4, Array(5), Array(0.9))
> )).toDF("id", "neighbors", "similarities")
> val result = new PowerIterationClustering()
> .setK(2)
> .setMaxIter(10)
> .setInitMode("random")
> .transform(data)
> .select("id","prediction")
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given 
> input columns: [id, neighbors, similarities];;
> 'Project [id#215, 'prediction]
> +- AnalysisBarrier
>   +- Project [id#215, neighbors#216, similarities#217]
>  +- Join Inner, (id#215 = id#234)
> :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS 
> similarities#217]
> :  +- LocalRelation [_1#209, _2#210, _3#211]
> +- Project [cast(id#230L as int) AS id#234]
>+- LogicalRDD [id#230L, prediction#231], false
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {code}






[jira] [Resolved] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-06-08 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid resolved SPARK-24217.

Resolution: Fixed

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display the prediction and id corresponding to all the nodes.  
> Currently PIC does not return the cluster indices of neighbour IDs that do not 
> appear in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  






[jira] [Resolved] (SPARK-23984) PySpark Bindings for K8S

2018-06-08 Thread Matt Cheah (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah resolved SPARK-23984.

   Resolution: Fixed
Fix Version/s: 2.4.0

> PySpark Bindings for K8S
> 
>
> Key: SPARK-23984
> URL: https://issues.apache.org/jira/browse/SPARK-23984
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, PySpark
>Affects Versions: 2.3.0
>Reporter: Ilan Filonenko
>Priority: Major
> Fix For: 2.4.0
>
>
> This ticket is tracking the ongoing work of moving the upsteam work from 
> [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python 
> bindings for Spark on Kubernetes. 
> The points of focus are: dependency management, increased non-JVM memory 
> overhead default values, and modified Docker images to include Python 
> Support. 






[jira] [Commented] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2018-06-08 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506456#comment-16506456
 ] 

Wenchen Fan commented on SPARK-20712:
-

I looked into this. This is actually a Hive bug and I feel it's impossible to 
fix it on the Spark side, even with Spark 2.0. [~maver1ck] can you double check 
that you can use Spark 2.0 to read the same table without problems?

Ideally Hive would report the error when creating the table; I'm not sure how 
the malformed table slipped into the Hive metastore.

> [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has 
> length greater than 4000 bytes
> ---
>
> Key: SPARK-20712
> URL: https://issues.apache.org/jira/browse/SPARK-20712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.3.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I have following issue.
> I'm trying to read a table from Hive where one of the columns is nested, so its 
> schema is longer than 4000 bytes.
> Everything worked on Spark 2.0.2. On 2.1.1 I'm getting Exception:
> {code}
> >> spark.read.table("SOME_TABLE")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table
> return self._df(self._jreader.table(tableName))
>   File 
> "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o71.table.
> : org.apache.spark.SparkException: Cannot recognize hive type string: 
> SOME_VERY_LONG_FIELD_TYPE
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> 

[jira] [Assigned] (SPARK-17756) java.lang.ClassCastException when using cartesian with DStream.transform

2018-06-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-17756:


Assignee: Hyukjin Kwon

> java.lang.ClassCastException when using cartesian with DStream.transform
> 
>
> Key: SPARK-17756
> URL: https://issues.apache.org/jira/browse/SPARK-17756
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, PySpark
>Affects Versions: 2.0.0, 2.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Steps to reproduce:
> {code}
> from pyspark.streaming import StreamingContext
> ssc = StreamingContext(spark.sparkContext, 10)
> (ssc
> .queueStream([sc.range(10)])
> .transform(lambda rdd: rdd.cartesian(rdd))
> .pprint())
> ssc.start()
> ## 16/10/01 21:34:30 ERROR JobScheduler: Error generating jobs for time 
> 147535047 ms
> ## java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD ## 
> cannot be cast to org.apache.spark.api.java.JavaRDD
> ##at com.sun.proxy.$Proxy15.call(Unknown Source)
> ##
> {code}
> A dummy fix is to put {{map(lambda x: x)}} which suggests it is a problem 
> similar to https://issues.apache.org/jira/browse/SPARK-16589






[jira] [Resolved] (SPARK-17756) java.lang.ClassCastException when using cartesian with DStream.transform

2018-06-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17756.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19498
[https://github.com/apache/spark/pull/19498]

> java.lang.ClassCastException when using cartesian with DStream.transform
> 
>
> Key: SPARK-17756
> URL: https://issues.apache.org/jira/browse/SPARK-17756
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, PySpark
>Affects Versions: 2.0.0, 2.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Steps to reproduce:
> {code}
> from pyspark.streaming import StreamingContext
> ssc = StreamingContext(spark.sparkContext, 10)
> (ssc
> .queueStream([sc.range(10)])
> .transform(lambda rdd: rdd.cartesian(rdd))
> .pprint())
> ssc.start()
> ## 16/10/01 21:34:30 ERROR JobScheduler: Error generating jobs for time 
> 147535047 ms
> ## java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD ## 
> cannot be cast to org.apache.spark.api.java.JavaRDD
> ##at com.sun.proxy.$Proxy15.call(Unknown Source)
> ##
> {code}
> A dummy fix is to put {{map(lambda x: x)}} which suggests it is a problem 
> similar to https://issues.apache.org/jira/browse/SPARK-16589






[jira] [Commented] (SPARK-24258) SPIP: Improve PySpark support for ML Matrix and Vector types

2018-06-08 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506399#comment-16506399
 ] 

Li Jin commented on SPARK-24258:


I ran into [~mengxr] and chatted about this. It seems a good first step is to make 
the tensor type a first-class type in Spark DataFrame. For the operations, there are 
concerns about having to add many linear-algebra functions to the Spark codebase, 
so it's not clear whether that's a good idea.

Any thoughts?

> SPIP: Improve PySpark support for ML Matrix and Vector types
> 
>
> Key: SPARK-24258
> URL: https://issues.apache.org/jira/browse/SPARK-24258
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Leif Walsh
>Priority: Major
>
> h1. Background and Motivation:
> In Spark ML ({{pyspark.ml.linalg}}), there are four column types you can 
> construct, {{SparseVector}}, {{DenseVector}}, {{SparseMatrix}}, and 
> {{DenseMatrix}}.  In PySpark, you can construct one of these vectors with 
> {{VectorAssembler}}, and then you can run python UDFs on these columns, and 
> use {{toArray()}} to get numpy ndarrays and do things with them.  They also 
> have a small native API where you can compute {{dot()}}, {{norm()}}, and a 
> few other things with them (I think these are computed in scala, not python, 
> could be wrong).
> For statistical applications, having the ability to manipulate columns of 
> matrix and vector values (from here on, I will use the term tensor to refer 
> to arrays of arbitrary dimensionality, matrices are 2-tensors and vectors are 
> 1-tensors) would be powerful.  For example, you could use PySpark to reshape 
> your data in parallel, assemble some matrices from your raw data, and then 
> run some statistical computation on them using UDFs leveraging python 
> libraries like statsmodels, numpy, tensorflow, and scikit-learn.
> I propose enriching the {{pyspark.ml.linalg}} types in the following ways:
> # Expand the set of column operations one can apply to tensor columns beyond 
> the few functions currently available on these types.  Ideally, the API 
> should aim to be as wide as the numpy ndarray API, but would wrap Breeze 
> operations.  For example, we should provide {{DenseVector.outerProduct()}} so 
> that a user could write something like {{df.withColumn("XtX", 
> df["X"].outerProduct(df["X"]))}}.
> # Make sure all ser/de mechanisms (including Arrow) understand these types, 
> and faithfully represent them as natural types in all languages (in scala and 
> java, Breeze objects, in python, numpy ndarrays rather than the 
> pyspark.ml.linalg types that wrap them, in SparkR, I'm not sure what, but 
> something natural) when applying UDFs or collecting with {{toPandas()}}.
> # Improve the construction of these types from scalar columns.  The 
> {{VectorAssembler}} API is not very ergonomic.  I propose something like 
> {{df.withColumn("predictors", Vector.of(df["feature1"], df["feature2"], 
> df["feature3"]))}}.
> h1. Target Personas:
> Data scientists, machine learning practitioners, machine learning library 
> developers.
> h1. Goals:
> This would allow users to do more statistical computation in Spark natively, 
> and would allow users to apply python statistical computation to data in 
> Spark using UDFs.
> h1. Non-Goals:
> I suppose one non-goal is to reimplement something like statsmodels using 
> Breeze data structures and computation.  That could be seen as an effort to 
> enrich Spark ML itself, but is out of scope of this effort.  This effort is 
> just to make it possible and easy to apply existing python libraries to 
> tensor values in parallel.
> h1. Proposed API Changes:
> Add the above APIs to PySpark and the other language bindings.  I think the 
> list is:
> # {{pyspark.ml.linalg.Vector.of(*columns)}}
> # {{pyspark.ml.linalg.Matrix.of(...)}} (not sure how best to provide this)
> # For each of the matrix and vector types in {{pyspark.ml.linalg}}, add more 
> methods like {{outerProduct}}, {{matmul}}, {{kron}}, etc.  
> https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.linalg.html has a 
> good list to look at.
> Also, change python UDFs so that these tensor types are passed to the python 
> function not as \{Sparse,Dense\}\{Matrix,Vector\} objects that wrap 
> {{numpy.ndarray}}, but as {{numpy.ndarray}} objects by themselves, and 
> interpret return values that are {{numpy.ndarray}} objects back into the 
> spark types.
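
For context, the column-level operations the proposal asks for (e.g. the {{XtX}} outer-product example above) are the kind of thing users currently write by hand against the underlying linear-algebra library inside a UDF; a minimal Breeze sketch of just that operation (Breeze ships with Spark, the values are made up):

{code:scala}
import breeze.linalg.DenseVector

val x = DenseVector(1.0, 2.0, 3.0)
val xtx = x * x.t   // 3x3 outer product, returned as a Breeze DenseMatrix
println(xtx)
{code}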






[jira] [Comment Edited] (SPARK-24467) VectorAssemblerEstimator

2018-06-08 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506334#comment-16506334
 ] 

Nick Pentreath edited comment on SPARK-24467 at 6/8/18 5:59 PM:


Yeah, the estimator would return a {{Model}} from {{fit}}, right? So I don't 
think a new estimator could return the existing {{VectorAssembler}}; it would 
probably need to return a new {{VectorAssemblerModel}}. Though perhaps the 
existing one can be made a Model without breaking things.


was (Author: mlnick):
Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't 
think a new estimator could return the existing {{VectorAssembler}} but would 
probably need to return a new {{VectorAssemblerModel}}

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.






[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator

2018-06-08 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506334#comment-16506334
 ] 

Nick Pentreath commented on SPARK-24467:


Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't 
think a new estimator could return the existing {{VectorAssembler}} but would 
probably need to return a new {{VectorAssemblerModel}}

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.






[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-06-08 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506285#comment-16506285
 ] 

Li Yuanjian commented on SPARK-24499:
-

No problem, thanks for pinging me; our pleasure.
We'll collect and translate some internal user demos, along with tips gathered 
while migrating MR jobs to Spark.

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html






[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-08 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506250#comment-16506250
 ] 

Xiao Li commented on SPARK-24498:
-

Thank you!

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less time 
> to compile than Janino. However, in other cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696
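
For illustration, "JDK compiler" here means compiling generated Java source at runtime through {{javax.tools}} instead of Janino. A minimal, self-contained sketch of that mechanism (the class name and generated source are made up; this is not the proposed integration):

{code:scala}
import java.net.URI
import java.util.Arrays
import javax.tools.{JavaFileObject, SimpleJavaFileObject, ToolProvider}

// Wrap generated Java source held in a String so javax.tools can compile it.
class StringSource(className: String, code: String) extends SimpleJavaFileObject(
    URI.create(s"string:///${className.replace('.', '/')}.java"), JavaFileObject.Kind.SOURCE) {
  override def getCharContent(ignoreEncodingErrors: Boolean): CharSequence = code
}

object JdkCompileSketch extends App {
  val compiler = ToolProvider.getSystemJavaCompiler   // null when running on a plain JRE
  val source = new StringSource("GeneratedIterator",
    "public class GeneratedIterator { public int apply(int x) { return x + 1; } }")
  val task = compiler.getTask(null, null, null,
    Arrays.asList("-d", System.getProperty("java.io.tmpdir")), null, Arrays.asList(source))
  println(s"compiled: ${task.call()}")                // writes GeneratedIterator.class to the tmp dir
}
{code}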






[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-08 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506239#comment-16506239
 ] 

Kazuaki Ishizaki commented on SPARK-24498:
--

Hi [~smilegator], definitely, I am interested in this task. I will investigate this 
issue.

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less time 
> to compile than Janino. However, in other cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696






[jira] [Resolved] (SPARK-24454) ml.image doesn't have __all__ explicitly defined

2018-06-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24454.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21483
[https://github.com/apache/spark/pull/21483]

> ml.image doesn't have __all__ explicitly defined
> 
>
> Key: SPARK-24454
> URL: https://issues.apache.org/jira/browse/SPARK-24454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> ml/image.py doesn't have __all__ explicitly defined. It will import all 
> global names by default (only ImageSchema for now), which is not a good 
> practice. We should add __all__ to image.py.






[jira] [Assigned] (SPARK-24454) ml.image doesn't have __all__ explicitly defined

2018-06-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24454:
-

Assignee: Hyukjin Kwon

> ml.image doesn't have __all__ explicitly defined
> 
>
> Key: SPARK-24454
> URL: https://issues.apache.org/jira/browse/SPARK-24454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> ml/image.py doesn't have __all__ explicitly defined. It will import all 
> global names by default (only ImageSchema for now), which is not a good 
> practice. We should add __all__ to image.py.






[jira] [Assigned] (SPARK-24477) Import submodules under pyspark.ml by default

2018-06-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24477:
-

Assignee: Hyukjin Kwon

> Import submodules under pyspark.ml by default
> -
>
> Key: SPARK-24477
> URL: https://issues.apache.org/jira/browse/SPARK-24477
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Right now, we do not import submodules under pyspark.ml by default. So users 
> cannot do
> {code}
> from pyspark import ml
> kmeans = ml.clustering.KMeans(...)
> {code}
> I created this JIRA to discuss whether we should import the submodules by default. 
> It will change the behavior of
> {code}
> from pyspark.ml import *
> {code}
> But it saves users from otherwise unnecessary imports.
> cc [~hyukjin.kwon]






[jira] [Resolved] (SPARK-24477) Import submodules under pyspark.ml by default

2018-06-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24477.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21483
[https://github.com/apache/spark/pull/21483]

> Import submodules under pyspark.ml by default
> -
>
> Key: SPARK-24477
> URL: https://issues.apache.org/jira/browse/SPARK-24477
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Right now, we do not import submodules under pyspark.ml by default. So users 
> cannot do
> {code}
> from pyspark import ml
> kmeans = ml.clustering.KMeans(...)
> {code}
> I created this JIRA to discuss whether we should import the submodules by default. 
> It will change the behavior of
> {code}
> from pyspark.ml import *
> {code}
> But it saves users from otherwise unnecessary imports.
> cc [~hyukjin.kwon]






[jira] [Commented] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A

2018-06-08 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506232#comment-16506232
 ] 

Jiang Xingbo commented on SPARK-24492:
--

It is possible that one task attempt acquires the permission to commit output 
but does not finish performing the commit for a while. In the meantime, another 
attempt of the same task (e.g. a speculative run) may also ask to commit but 
fail with a TaskCommitDenied exception. In this case, the current behavior of 
retrying without counting the failure toward the task failure count can lead to 
the task being retried indefinitely until it gets the permission to commit, which 
can waste a lot of resources if the task is short. Instead, I propose to skip 
retrying the task attempt in case of a TaskCommitDenied exception, since that means 
another attempt is performing the commit for the same task, and we can wait until it 
finishes (if the commit finishes successfully then nothing is left to be done; if 
it fails then we can still retry).
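
To make the scenario concrete, here is a tiny self-contained sketch of the "first attempt wins" bookkeeping (this is not Spark's OutputCommitCoordinator, just an illustration of why a denied attempt gains nothing from retrying while the winning attempt is still running):

{code:scala}
import scala.collection.mutable

// The first attempt that asks for a partition wins the right to commit; later attempts are denied.
class CommitAuthorizer {
  private val winners = mutable.Map.empty[Int, Int]   // partition -> winning attempt number

  def canCommit(partition: Int, attempt: Int): Boolean = synchronized {
    winners.getOrElseUpdate(partition, attempt) == attempt
  }
}

object CommitDemo extends App {
  val auth = new CommitAuthorizer
  println(auth.canCommit(partition = 0, attempt = 0))   // true: this attempt may commit
  println(auth.canCommit(partition = 0, attempt = 1))   // false: the speculative attempt is denied
}
{code}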

> Endless attempted task when TaskCommitDenied exception writing to S3A
> -
>
> Key: SPARK-24492
> URL: https://issues.apache.org/jira/browse/SPARK-24492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yu-Jhe Li
>Priority: Critical
> Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 
> 2018-05-16 上午11.10.57.png
>
>
> Hi, when we run Spark application under spark-2.2.0 on AWS spot instance and 
> output file to S3, some tasks endless retry and all of them failed with 
> TaskCommitDenied exception. This happened when we run Spark application on 
> some network issue instances. (it runs well on healthy spot instances)
> Sorry, I can find a easy way to reproduce this issue, here's all I can 
> provide.
> The Spark UI shows (in attachments) one task of stage 112 failed due to 
> FetchFailedException (it is network issue) and attempt to retry a new stage 
> 112 (retry 1). But in stage 112 (retry 1), all task failed due to 
> TaskCommitDenied exception, and keep retry (it never succeed and cause lots 
> of S3 requests).
> On the other side, driver logs shows:
>  # task 123.0 in stage 112.0 failed due to FetchFailedException (network 
> issue cause corrupted file)
>  # warning message from OutputCommitCoordinator
>  # task 92.0 in stage 112.1 failed when writing rows
>  # keep retry the failed tasks, but never succeed
> {noformat}
> 2018-05-16 02:38:055 WARN  TaskSetManager:66 - Lost task 123.0 in stage 112.0 
> (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 
> 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message=
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at 

[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-06-08 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506219#comment-16506219
 ] 

Xiao Li commented on SPARK-24499:
-

Apache Spark is widely adopted in the production systems of Baidu. [~XuanYuan] 
Could your team lead this effort and improve the existing documentation? 

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html






[jira] [Created] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-06-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24499:
---

 Summary: Documentation improvement of Spark core and SQL
 Key: SPARK-24499
 URL: https://issues.apache.org/jira/browse/SPARK-24499
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


The current documentation in Apache Spark lacks enough code examples and tips. 
If needed, we should also split the page of 
https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
separate pages like what we did for 
https://spark.apache.org/docs/latest/ml-guide.html







[jira] [Updated] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-06-08 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24499:

Component/s: Spark Core
 Documentation

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html






[jira] [Commented] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-08 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506183#comment-16506183
 ] 

Marco Gaido commented on SPARK-24481:
-

Yes, because before SPARK-22520 code generation was disabled for large case-when 
expressions (and in that case we fell back to interpreted execution, which has 
much worse performance). For your use case, you can try disabling whole-stage 
code generation by setting {{spark.sql.codegen.wholeStage}} to {{false}}. Or, if 
the only problem is logging, you can set the log level for the class 
{{org.apache.spark.sql.execution.WholeStageCodegenExec}} to ERROR to get rid of 
the warning messages.
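For reference, a minimal sketch of both workarounds from a spark-shell session 
(this assumes the log4j 1.x backend that ships with stock Spark 2.3; adjust the 
logger call for other logging setups):
{code:java}
import org.apache.log4j.{Level, Logger}

// Option 1: disable whole-stage code generation for this session only.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// Option 2: keep codegen enabled but silence the "grows beyond 64 KB" warnings.
Logger.getLogger("org.apache.spark.sql.execution.WholeStageCodegenExec")
  .setLevel(Level.ERROR)
{code}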

> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0 and Databricks Cloud 4.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors.  Happens with large case 
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://;,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
>  
> Exception:
> {code:java}
> 18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method 
> "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method 
> "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>   at 
> org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
>   at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1444)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1523)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1520)
>   at 
> com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at 

[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-08 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506159#comment-16506159
 ] 

Xiao Li commented on SPARK-24498:
-

cc [~kiszk] Are you interested in this task? Could you investigate this issue?

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less 
> time to compile than Janino. However, in other cases Janino is better. We 
> should support both for our runtime codegen; Janino will still be our default 
> runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696






[jira] [Created] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24498:
---

 Summary: Add JDK compiler for runtime codegen
 Key: SPARK-24498
 URL: https://issues.apache.org/jira/browse/SPARK-24498
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


In some cases, the JDK compiler can generate smaller bytecode and take less time 
to compile than Janino. However, in other cases Janino is better. We should 
support both for our runtime codegen; Janino will still be our default runtime 
codegen compiler. 

See the related JIRAs in DRILL: 
- https://issues.apache.org/jira/browse/DRILL-1155
- https://issues.apache.org/jira/browse/DRILL-4778
- https://issues.apache.org/jira/browse/DRILL-5696
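For illustration only, a hedged sketch of the standard in-JDK compiler API 
(javax.tools) that this ticket proposes to use as an alternative; this is not 
Spark's actual integration, just the bare API driven against a temporary 
directory:
{code:java}
import java.nio.file.Files
import javax.tools.ToolProvider

// Write a tiny Java source file to a temporary directory.
val src =
  """public class HelloCodegen {
    |  public static int add(int a, int b) { return a + b; }
    |}""".stripMargin
val dir = Files.createTempDirectory("jdk-codegen-demo")
val file = dir.resolve("HelloCodegen.java")
Files.write(file, src.getBytes("UTF-8"))

// Compile it with the JDK's built-in compiler (null if only a JRE is installed).
val compiler = ToolProvider.getSystemJavaCompiler()
val exitCode = compiler.run(null, null, null, file.toString)
println(s"javac exit code: $exitCode, classes written to $dir")
{code}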








[jira] [Resolved] (SPARK-24191) Scala example code for Power Iteration Clustering in Spark ML examples

2018-06-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24191.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21248
[https://github.com/apache/spark/pull/21248]

> Scala example code for Power Iteration Clustering in Spark ML examples
> --
>
> Key: SPARK-24191
> URL: https://issues.apache.org/jira/browse/SPARK-24191
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Examples, ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Trivial
> Fix For: 2.4.0
>
>
> We need to provide an example code for Power Iteration Clustering in Spark ML 
> examples.
>  






[jira] [Resolved] (SPARK-24224) Java example code for Power Iteration Clustering in spark.ml

2018-06-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24224.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21283
[https://github.com/apache/spark/pull/21283]

> Java example code for Power Iteration Clustering in spark.ml
> 
>
> Key: SPARK-24224
> URL: https://issues.apache.org/jira/browse/SPARK-24224
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Trivial
> Fix For: 2.4.0
>
>
> Add a java example code for Power iteration clustering in spark.ml examples






[jira] [Assigned] (SPARK-24224) Java example code for Power Iteration Clustering in spark.ml

2018-06-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24224:
-

Assignee: shahid
Target Version/s:   (was: 2.4.0)
Priority: Trivial  (was: Major)
   Fix Version/s: (was: 2.4.0)

> Java example code for Power Iteration Clustering in spark.ml
> 
>
> Key: SPARK-24224
> URL: https://issues.apache.org/jira/browse/SPARK-24224
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Trivial
>
> Add a java example code for Power iteration clustering in spark.ml examples






[jira] [Assigned] (SPARK-24191) Scala example code for Power Iteration Clustering in Spark ML examples

2018-06-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24191:
-

Assignee: shahid
Target Version/s:   (was: 2.4.0)
Priority: Trivial  (was: Major)
   Fix Version/s: (was: 2.4.0)

> Scala example code for Power Iteration Clustering in Spark ML examples
> --
>
> Key: SPARK-24191
> URL: https://issues.apache.org/jira/browse/SPARK-24191
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Examples, ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Trivial
>
> We need to provide an example code for Power Iteration Clustering in Spark ML 
> examples.
>  






[jira] [Created] (SPARK-24497) Support recursive SQL query

2018-06-08 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-24497:
---

 Summary: Support recursive SQL query
 Key: SPARK-24497
 URL: https://issues.apache.org/jira/browse/SPARK-24497
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang


h3. *Examples*

Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
represents the structure of an organization as an adjacency list.
{code:sql}
CREATE TABLE department (
id INTEGER PRIMARY KEY,  -- department ID
parent_department INTEGER REFERENCES department, -- upper department ID
name TEXT -- department name
);

INSERT INTO department (id, parent_department, "name")
VALUES
 (0, NULL, 'ROOT'),
 (1, 0, 'A'),
 (2, 1, 'B'),
 (3, 2, 'C'),
 (4, 2, 'D'),
 (5, 0, 'E'),
 (6, 4, 'F'),
 (7, 5, 'G');

-- department structure represented here is as follows:
--
-- ROOT-+->A-+->B-+->C
--      |         |
--      |         +->D-+->F
--      +->E-+->G
{code}
 
 To extract all departments under A, you can use the following recursive query:
{code:sql}
WITH RECURSIVE subdepartment AS
(
-- non-recursive term
SELECT * FROM department WHERE name = 'A'

UNION ALL

-- recursive term
SELECT d.*
FROM
department AS d
JOIN
subdepartment AS sd
ON (d.parent_department = sd.id)
)
SELECT *
FROM subdepartment
ORDER BY name;
{code}
More details:

[http://wiki.postgresql.org/wiki/CTEReadme]

[https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
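Since Spark SQL (as of 2.x) has no {{WITH RECURSIVE}}, here is a hedged sketch of 
how the subdepartment query above could be emulated today with an iterative 
DataFrame loop (table and column names follow the example; the data setup is 
illustrative and the loop assumes an acyclic hierarchy):
{code:java}
import org.apache.spark.sql.DataFrame
import spark.implicits._

// Illustrative data matching the "department" table above.
val department = Seq[(Int, Option[Int], String)](
  (0, None,    "ROOT"), (1, Some(0), "A"), (2, Some(1), "B"), (3, Some(2), "C"),
  (4, Some(2), "D"),    (5, Some(0), "E"), (6, Some(4), "F"), (7, Some(5), "G")
).toDF("id", "parent_department", "name")

// Non-recursive term.
var result: DataFrame = department.filter($"name" === "A")
var frontier: DataFrame = result

// Recursive term: repeatedly join the previous frontier back against the
// table until no new child departments are found.
var added = true
while (added) {
  val next = department.as("d")
    .join(frontier.as("sd"), $"d.parent_department" === $"sd.id")
    .select($"d.*")
  added = next.count() > 0
  if (added) { result = result.union(next); frontier = next }
}
result.orderBy("name").show()
{code}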

 






[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506023#comment-16506023
 ] 

Saisai Shao commented on SPARK-23534:
-

Yes, it is a Hadoop 2.8+ issue, but we don't have a 2.8 profile for Spark; 
instead we have a 3.1 profile, so it will affect our Hadoop 3 build. That's why 
I also added a link here.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).






[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-06-08 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506021#comment-16506021
 ] 

Steve Loughran commented on SPARK-23534:


[~jerryshao] is that the HDFS token identifier thing? In which case, it's 
actually a 2.8+ issue

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).






[jira] [Commented] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A

2018-06-08 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506009#comment-16506009
 ] 

Steve Loughran commented on SPARK-24492:


The retry problem looks like something in the commit protocol's error handling; 
it probably merits a closer look.

But why is the exception being raised? That's SPARK-18883: S3 list inconsistency 
means your task attempt directory does not appear in S3 LIST calls, so the 
mimicked rename (list everything, COPY *, DELETE *) fails fast.

You cannot safely use S3 via the S3A connector as a destination of work without 
a consistency layer (HADOOP-13345) or an S3-specific committer (SPARK-23977, 
HADOOP-13786). Even when the task appears to succeed, the listing may have 
missed newly created files, so the input to the rename is incorrect.

Without those or some other consistency layer (e.g. consistent EMRFS), you need 
to commit to a consistent store (e.g. HDFS) and then copy the results to S3 
afterwards.
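To make the last point concrete, a minimal hedged sketch (the bucket and paths 
are placeholders, and only stock Hadoop FileSystem/FileUtil APIs are used):
{code:java}
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// 1) Commit the job output to a consistent store (HDFS) first.
val df = spark.range(100).toDF("id")           // stand-in for the real job output
val hdfsOut = new Path("hdfs:///tmp/job-output")
df.write.mode("overwrite").parquet(hdfsOut.toString)

// 2) Copy the finished directory to S3 as a separate, post-commit step.
val conf = spark.sparkContext.hadoopConfiguration
val srcFs = hdfsOut.getFileSystem(conf)
val dstFs = FileSystem.get(new URI("s3a://my-bucket/"), conf)
FileUtil.copy(srcFs, hdfsOut, dstFs, new Path("s3a://my-bucket/job-output"),
  /* deleteSource = */ false, conf)
{code}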


> Endless attempted task when TaskCommitDenied exception writing to S3A
> -
>
> Key: SPARK-24492
> URL: https://issues.apache.org/jira/browse/SPARK-24492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yu-Jhe Li
>Priority: Critical
> Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 
> 2018-05-16 上午11.10.57.png
>
>
> Hi, when we run Spark application under spark-2.2.0 on AWS spot instance and 
> output file to S3, some tasks endless retry and all of them failed with 
> TaskCommitDenied exception. This happened when we run Spark application on 
> some network issue instances. (it runs well on healthy spot instances)
> Sorry, I can find a easy way to reproduce this issue, here's all I can 
> provide.
> The Spark UI shows (in attachments) one task of stage 112 failed due to 
> FetchFailedException (it is network issue) and attempt to retry a new stage 
> 112 (retry 1). But in stage 112 (retry 1), all task failed due to 
> TaskCommitDenied exception, and keep retry (it never succeed and cause lots 
> of S3 requests).
> On the other side, driver logs shows:
>  # task 123.0 in stage 112.0 failed due to FetchFailedException (network 
> issue cause corrupted file)
>  # warning message from OutputCommitCoordinator
>  # task 92.0 in stage 112.1 failed when writing rows
>  # keep retry the failed tasks, but never succeed
> {noformat}
> 2018-05-16 02:38:055 WARN  TaskSetManager:66 - Lost task 123.0 in stage 112.0 
> (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 
> 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message=
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at 

[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A

2018-06-08 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-24492:
---
Summary: Endless attempted task when TaskCommitDenied exception writing to 
S3A  (was: Endless attempted task when TaskCommitDenied exception)

> Endless attempted task when TaskCommitDenied exception writing to S3A
> -
>
> Key: SPARK-24492
> URL: https://issues.apache.org/jira/browse/SPARK-24492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yu-Jhe Li
>Priority: Critical
> Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 
> 2018-05-16 上午11.10.57.png
>
>
> Hi, when we run Spark application under spark-2.2.0 on AWS spot instance and 
> output file to S3, some tasks endless retry and all of them failed with 
> TaskCommitDenied exception. This happened when we run Spark application on 
> some network issue instances. (it runs well on healthy spot instances)
> Sorry, I can find a easy way to reproduce this issue, here's all I can 
> provide.
> The Spark UI shows (in attachments) one task of stage 112 failed due to 
> FetchFailedException (it is network issue) and attempt to retry a new stage 
> 112 (retry 1). But in stage 112 (retry 1), all task failed due to 
> TaskCommitDenied exception, and keep retry (it never succeed and cause lots 
> of S3 requests).
> On the other side, driver logs shows:
>  # task 123.0 in stage 112.0 failed due to FetchFailedException (network 
> issue cause corrupted file)
>  # warning message from OutputCommitCoordinator
>  # task 92.0 in stage 112.1 failed when writing rows
>  # keep retry the failed tasks, but never succeed
> {noformat}
> 2018-05-16 02:38:055 WARN  TaskSetManager:66 - Lost task 123.0 in stage 112.0 
> (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 
> 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message=
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> 

[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming

2018-06-08 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-24476:
---
Priority: Minor  (was: Major)

> java.net.SocketTimeoutException: Read timed out under jets3t while running 
> the Spark Structured Streaming
> -
>
> Key: SPARK-24476
> URL: https://issues.apache.org/jira/browse/SPARK-24476
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Minor
> Attachments: socket-timeout-exception
>
>
> We are working on spark streaming application using spark structured 
> streaming with checkpointing in s3. When we start the application, the 
> application runs just fine for sometime  then it crashes with the error 
> mentioned below. The amount of time it will run successfully varies from time 
> to time, sometimes it will run for 2 days without any issues then crashes, 
> sometimes it will crash after 4hrs/ 24hrs. 
> Our streaming application joins(left and inner) multiple sources from kafka 
> and also s3 and aurora database.
> Can you please let us know how to solve this problem?
> Is it possible to somehow tweak the SocketTimeout-Time? 
> Here, I'm pasting the few line of complete exception log below. Also attached 
> the complete exception to the issue.
> *_Exception:_*
> *_Caused by: java.net.SocketTimeoutException: Read timed out_*
>         _at java.net.SocketInputStream.socketRead0(Native Method)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:150)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:121)_
>         _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_
>         _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_
>         _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_
>         _at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_
>  






[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming

2018-06-08 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-24476:
---
Summary: java.net.SocketTimeoutException: Read timed out under jets3t while 
running the Spark Structured Streaming  (was: java.net.SocketTimeoutException: 
Read timed out Exception while running the Spark Structured Streaming in 2.3.0)

> java.net.SocketTimeoutException: Read timed out under jets3t while running 
> the Spark Structured Streaming
> -
>
> Key: SPARK-24476
> URL: https://issues.apache.org/jira/browse/SPARK-24476
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Major
> Attachments: socket-timeout-exception
>
>
> We are working on spark streaming application using spark structured 
> streaming with checkpointing in s3. When we start the application, the 
> application runs just fine for sometime  then it crashes with the error 
> mentioned below. The amount of time it will run successfully varies from time 
> to time, sometimes it will run for 2 days without any issues then crashes, 
> sometimes it will crash after 4hrs/ 24hrs. 
> Our streaming application joins(left and inner) multiple sources from kafka 
> and also s3 and aurora database.
> Can you please let us know how to solve this problem?
> Is it possible to somehow tweak the SocketTimeout-Time? 
> Here, I'm pasting the few line of complete exception log below. Also attached 
> the complete exception to the issue.
> *_Exception:_*
> *_Caused by: java.net.SocketTimeoutException: Read timed out_*
>         _at java.net.SocketInputStream.socketRead0(Native Method)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:150)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:121)_
>         _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_
>         _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_
>         _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_
>         _at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_
>  






[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming

2018-06-08 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-24476:
---
Component/s: (was: Spark Core)
 Structured Streaming

> java.net.SocketTimeoutException: Read timed out under jets3t while running 
> the Spark Structured Streaming
> -
>
> Key: SPARK-24476
> URL: https://issues.apache.org/jira/browse/SPARK-24476
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Major
> Attachments: socket-timeout-exception
>
>
> We are working on spark streaming application using spark structured 
> streaming with checkpointing in s3. When we start the application, the 
> application runs just fine for sometime  then it crashes with the error 
> mentioned below. The amount of time it will run successfully varies from time 
> to time, sometimes it will run for 2 days without any issues then crashes, 
> sometimes it will crash after 4hrs/ 24hrs. 
> Our streaming application joins(left and inner) multiple sources from kafka 
> and also s3 and aurora database.
> Can you please let us know how to solve this problem?
> Is it possible to somehow tweak the SocketTimeout-Time? 
> Here, I'm pasting the few line of complete exception log below. Also attached 
> the complete exception to the issue.
> *_Exception:_*
> *_Caused by: java.net.SocketTimeoutException: Read timed out_*
>         _at java.net.SocketInputStream.socketRead0(Native Method)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:150)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:121)_
>         _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_
>         _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_
>         _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_
>         _at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_
>  






[jira] [Commented] (SPARK-24476) java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0

2018-06-08 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506002#comment-16506002
 ] 

Steve Loughran commented on SPARK-24476:


Switch from s3n to the s3a connector and see if the problem goes away; if not, 
look at the Hadoop S3A documentation for the relevant tuning options. 
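A hedged sketch of the suggested change for a structured streaming job 
checkpointing to S3 (bucket, paths, and timeout values are illustrative; verify 
the property names against the S3A documentation for your Hadoop version):
{code:java}
// Use s3a:// URIs instead of s3n:// and raise the S3A client timeouts/retries.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.connection.timeout", "200000")    // socket timeout, in ms
hadoopConf.set("fs.s3a.attempts.maximum", "20")

val stream = spark.readStream.format("rate").load()      // stand-in source
val query = stream.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/output/my-query")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/my-query")
  .start()
{code}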

> java.net.SocketTimeoutException: Read timed out Exception while running the 
> Spark Structured Streaming in 2.3.0
> ---
>
> Key: SPARK-24476
> URL: https://issues.apache.org/jira/browse/SPARK-24476
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Major
> Attachments: socket-timeout-exception
>
>
> We are working on spark streaming application using spark structured 
> streaming with checkpointing in s3. When we start the application, the 
> application runs just fine for sometime  then it crashes with the error 
> mentioned below. The amount of time it will run successfully varies from time 
> to time, sometimes it will run for 2 days without any issues then crashes, 
> sometimes it will crash after 4hrs/ 24hrs. 
> Our streaming application joins(left and inner) multiple sources from kafka 
> and also s3 and aurora database.
> Can you please let us know how to solve this problem?
> Is it possible to somehow tweak the SocketTimeout-Time? 
> Here, I'm pasting the few line of complete exception log below. Also attached 
> the complete exception to the issue.
> *_Exception:_*
> *_Caused by: java.net.SocketTimeoutException: Read timed out_*
>         _at java.net.SocketInputStream.socketRead0(Native Method)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:150)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:121)_
>         _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_
>         _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_
>         _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_
>         _at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_
>  






[jira] [Commented] (SPARK-24496) CLONE - JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.

2018-06-08 Thread SHAILENDRA SHAHANE (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506001#comment-16506001
 ] 

SHAILENDRA SHAHANE commented on SPARK-24496:


This issue is still there. I tried to fetch data from MongoDB and got the 
following exception while converting the RDD to a DataFrame.

-Code --

SQLContext sparkSQLContext = spark.sqlContext();
 DataFrameReader dfr = spark.read()
 .format("com.mongodb.spark.sql") 
 .option("floatAsBigDecimal", "true");
 Dataset rbkp = dfr.load();

-- OR 

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
 JavaMongoRDD rdd = MongoSpark.load(jsc);
 
 Dataset rbkp = rdd.toDF();



Spark version 2.3

MongoDB Version - 3.4 and 3.6

Data Sample-

{"_id":"5b0d31f892549e10b61d962a","RSEG_MANDT":"800","RSEG_EBELN":"4500017749","RSEG_EBELP":"00020","RSEG_BELNR":"11","RSEG_BUZEI":"02","RSEG_GJAHR":"2013","RBKP_BUDAT":"2013-10-04","RSEG_MENGE":\{"$numberDecimal":"30.000"},"RSEG_LFBNR":"500472","RSEG_LFGJA":"2013","RSEG_LFPOS":"0002","NOT_ACCOUNT_MAINTENANCE":\{"$numberDecimal":"1.00"},"RBKP_CPUTIMESTAMP":"2013-10-04T10:32:02.000Z","RBKP_WAERS":"USD","RSEG_BNKAN":\{"$numberDecimal":"0.00"},"RSEG_WRBTR":\{"$numberDecimal":"2340.00"},"RSEG_SHKZG":"S"}

> CLONE - JSON data source fails to infer floats as decimal when precision is 
> bigger than 38 or scale is bigger than precision.
> -
>
> Key: SPARK-24496
> URL: https://issues.apache.org/jira/browse/SPARK-24496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: SHAILENDRA SHAHANE
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: SparkJiraIssue08062018.txt
>
>
> Currently, JSON data source supports {{floatAsBigDecimal}} option, which 
> reads floats as {{DecimalType}}.
> I noticed there are several restrictions in Spark {{DecimalType}} below:
> 1. The precision cannot be bigger than 38.
> 2. scale cannot be bigger than precision. 
> However, with the option above, it reads {{BigDecimal}} which does not follow 
> the conditions above.
> This could be observed as below:
> {code}
> def simpleFloats: RDD[String] =
>   sqlContext.sparkContext.parallelize(
> """{"a": 0.01}""" ::
> """{"a": 0.02}""" :: Nil)
> val jsonDF = sqlContext.read
>   .option("floatAsBigDecimal", "true")
>   .json(simpleFloats)
> jsonDF.printSchema()
> {code}
> throws an exception below:
> {code}
> org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater 
> than precision (1).;
>   at org.apache.spark.sql.types.DecimalType.(DecimalType.scala:44)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> ...
> {code}
> Since JSON data source infers {{DataType}} as {{StringType}} when it fails to 
> infer, it might have to be inferred as {{StringType}} or maybe just simply 
> {{DoubleType}}






[jira] [Updated] (SPARK-24496) CLONE - JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.

2018-06-08 Thread SHAILENDRA SHAHANE (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHAILENDRA SHAHANE updated SPARK-24496:
---
Attachment: SparkJiraIssue08062018.txt

> CLONE - JSON data source fails to infer floats as decimal when precision is 
> bigger than 38 or scale is bigger than precision.
> -
>
> Key: SPARK-24496
> URL: https://issues.apache.org/jira/browse/SPARK-24496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: SHAILENDRA SHAHANE
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: SparkJiraIssue08062018.txt
>
>
> Currently, JSON data source supports {{floatAsBigDecimal}} option, which 
> reads floats as {{DecimalType}}.
> I noticed there are several restrictions in Spark {{DecimalType}} below:
> 1. The precision cannot be bigger than 38.
> 2. scale cannot be bigger than precision. 
> However, with the option above, it reads {{BigDecimal}} which does not follow 
> the conditions above.
> This could be observed as below:
> {code}
> def simpleFloats: RDD[String] =
>   sqlContext.sparkContext.parallelize(
> """{"a": 0.01}""" ::
> """{"a": 0.02}""" :: Nil)
> val jsonDF = sqlContext.read
>   .option("floatAsBigDecimal", "true")
>   .json(simpleFloats)
> jsonDF.printSchema()
> {code}
> throws an exception below:
> {code}
> org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater 
> than precision (1).;
>   at org.apache.spark.sql.types.DecimalType.(DecimalType.scala:44)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> ...
> {code}
> Since JSON data source infers {{DataType}} as {{StringType}} when it fails to 
> infer, it might have to be inferred as {{StringType}} or maybe just simply 
> {{DoubleType}}






[jira] [Created] (SPARK-24496) CLONE - JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.

2018-06-08 Thread SHAILENDRA SHAHANE (JIRA)
SHAILENDRA SHAHANE created SPARK-24496:
--

 Summary: CLONE - JSON data source fails to infer floats as decimal 
when precision is bigger than 38 or scale is bigger than precision.
 Key: SPARK-24496
 URL: https://issues.apache.org/jira/browse/SPARK-24496
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: SHAILENDRA SHAHANE
Assignee: Hyukjin Kwon
 Fix For: 2.0.0


Currently, JSON data source supports {{floatAsBigDecimal}} option, which reads 
floats as {{DecimalType}}.

I noticed there are several restrictions in Spark {{DecimalType}} below:

1. The precision cannot be bigger than 38.
2. scale cannot be bigger than precision. 

However, with the option above, it reads {{BigDecimal}} which does not follow 
the conditions above.

This could be observed as below:

{code}
def simpleFloats: RDD[String] =
  sqlContext.sparkContext.parallelize(
"""{"a": 0.01}""" ::
"""{"a": 0.02}""" :: Nil)

val jsonDF = sqlContext.read
  .option("floatAsBigDecimal", "true")
  .json(simpleFloats)
jsonDF.printSchema()
{code}

throws an exception below:

{code}
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater 
than precision (1).;
at org.apache.spark.sql.types.DecimalType.(DecimalType.scala:44)
at 
org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
at 
org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
at 
org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59)
at 
org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249)
at 
org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57)
at 
org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
...
{code}

Since JSON data source infers {{DataType}} as {{StringType}} when it fails to 
infer, it might have to be inferred as {{StringType}} or maybe just simply 
{{DoubleType}}






[jira] [Created] (SPARK-24495) SortMergeJoin with duplicate keys wrong results

2018-06-08 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created SPARK-24495:
---

 Summary: SortMergeJoin with duplicate keys wrong results
 Key: SPARK-24495
 URL: https://issues.apache.org/jira/browse/SPARK-24495
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Bogdan Raducanu


To reproduce:
{code:java}
// The bug is in SortMergeJoin; the shuffles themselves are correct. With the
// default of 200 shuffle partitions, the data may be split into partitions so
// small that the SortMergeJoin can no longer return wrong results.
spark.conf.set("spark.sql.shuffle.partitions", "1")
// disable this, otherwise it would filter results before join, hiding the bug
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
sql("select id as a1 from range(1000)").createOrReplaceTempView("t1")
sql("select id * 2 as b1, -id as b2 from 
range(1000)").createOrReplaceTempView("t2")

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
sql("""select b1, a1, b2 FROM t1 INNER JOIN t2 ON b1 = a1 AND b2 = a1""").show
{code}
In the results, it's expected that all columns are equal (see join condition).

But the result is:
{code:java}
+---+---+---+
| b1| a1| b2|
+---+---+---+
|  0|  0|  0|
|  2|  2| -1|
|  4|  4| -2|
|  6|  6| -3|
|  8|  8| -4|

{code}
I traced it to {{EnsureRequirements.reorder}} which was introduced by 
[https://github.com/apache/spark/pull/16985] and 
[https://github.com/apache/spark/pull/20041]

It leads to an incorrect plan:
{code:java}
== Physical Plan ==
*(5) Project [b1#735672L, a1#735669L, b2#735673L]
+- *(5) SortMergeJoin [a1#735669L, a1#735669L], [b1#735672L, b1#735672L], Inner
   :- *(2) Sort [a1#735669L ASC NULLS FIRST, a1#735669L ASC NULLS FIRST], 
false, 0
   :  +- Exchange hashpartitioning(a1#735669L, a1#735669L, 1)
   : +- *(1) Project [id#735670L AS a1#735669L]
   :+- *(1) Range (0, 1000, step=1, splits=8)
   +- *(4) Sort [b1#735672L ASC NULLS FIRST, b2#735673L ASC NULLS FIRST], 
false, 0
  +- Exchange hashpartitioning(b1#735672L, b2#735673L, 1)
 +- *(3) Project [(id#735674L * 2) AS b1#735672L, -id#735674L AS 
b2#735673L]
+- *(3) Range (0, 1000, step=1, splits=8)
{code}
The SortMergeJoin keys are wrong: key b2 is missing completely.
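As a sanity check (not a fix), the same query under the default broadcast-join 
settings should only return rows where all three columns are equal; this sketch 
assumes the temp views from the reproduction above are still registered:
{code:java}
// Re-enable broadcast joins so the plan avoids the buggy SortMergeJoin path.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
sql("select b1, a1, b2 FROM t1 INNER JOIN t2 ON b1 = a1 AND b2 = a1").show()
{code}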






[jira] [Resolved] (SPARK-24119) Add interpreted execution to SortPrefix expression

2018-06-08 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-24119.
---
   Resolution: Fixed
 Assignee: Bruce Robbins
Fix Version/s: 2.4.0

> Add interpreted execution to SortPrefix expression
> --
>
> Key: SPARK-24119
> URL: https://issues.apache.org/jira/browse/SPARK-24119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.4.0
>
>
> [~hvanhovell] [~kiszk]
> I noticed SortPrefix did not support interpreted execution when I was testing 
> the PR for SPARK-24043. Somehow it was not covered by the umbrella Jira for 
> adding interpreted execution (SPARK-23580)
> Since I had to implement interpreted execution for SortPrefix to complete 
> testing, I am creating this Jira. If there's no good reason why eval wasn't 
> implemented, I will make the PR in a few days.
>  
>  






[jira] [Created] (SPARK-24494) Give users possibility to skip own classes in SparkContext.getCallSite()

2018-06-08 Thread Florian Kaspar (JIRA)
Florian Kaspar created SPARK-24494:
--

 Summary: Give users possibility to skip own classes in 
SparkContext.getCallSite()
 Key: SPARK-24494
 URL: https://issues.apache.org/jira/browse/SPARK-24494
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Affects Versions: 2.2.0
 Environment: any (currently using Spark 2.2.0 in both local mode and 
with YARN)
Reporter: Florian Kaspar


org.apache.spark.SparkContext.getCallSite() uses 
org.apache.spark.util.Utils.getCallSite() which by default skips Spark and 
Scala classes.

It would be nice to be able to add user-defined classes/patterns to skip here 
as well.

We have one central class that acts as some kind of proxy for RDD 
transformations and is used by many other classes.

In the SparkUI we would like to see the real origin of a transformation being 
not the proxy but its caller. This would make the UI even more useful to us.
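
Until something like this is supported, one possible workaround (a rough sketch only; {{ProxyOps}} and its method are hypothetical stand-ins for the central proxy class described above) is to set the call site explicitly around the proxied transformation using the existing public {{SparkContext.setCallSite}} / {{clearCallSite}} API:
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object ProxyOps {
  // Wrap a transformation so the Spark UI shows the real caller instead of the proxy.
  def mapThroughProxy[T, U: ClassTag](sc: SparkContext, rdd: RDD[T])(f: T => U): RDD[U] = {
    // Frame 0 is Thread.getStackTrace, frame 1 is this method, frame 2 is the caller.
    val caller = Thread.currentThread().getStackTrace()(2)
    sc.setCallSite(s"${caller.getClassName}.${caller.getMethodName}:${caller.getLineNumber}")
    try {
      rdd.map(f)
    } finally {
      sc.clearCallSite()
    }
  }
}
{code}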






[jira] [Resolved] (SPARK-24487) Add support for RabbitMQ.

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24487.
-
Resolution: Won't Fix

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.






[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505786#comment-16505786
 ] 

Saisai Shao commented on SPARK-24487:
-

I'm sure the Spark community will not merge this into the Spark code base. You can 
either maintain a package yourself or contribute it to Apache Bahir.

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.






[jira] [Updated] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-23151:

Issue Type: Sub-task  (was: Dependency upgrade)
Parent: SPARK-23534

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark package
> only supports Hadoop 2.7, i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication is
> that using up-to-date Kinesis libraries alongside S3 causes a clash w.r.t. the
> aws-java-sdk.






[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505783#comment-16505783
 ] 

Saisai Shao commented on SPARK-23151:
-

I will convert this to a subtask of SPARK-23534.

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark package
> only supports Hadoop 2.7, i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication is
> that using up-to-date Kinesis libraries alongside S3 causes a clash w.r.t. the
> aws-java-sdk.






[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505771#comment-16505771
 ] 

Saisai Shao commented on SPARK-23534:
-

Spark built with Hadoop 3 will fail token renewal in the long-running case (with 
keytab and principal).

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already stepped, or will soon step, into Hadoop 3.0, so we 
> should also make sure Spark can run with Hadoop 3.0. This Jira tracks the work to make 
> Spark run on Hadoop 3.0.
> The work includes:
>  # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
>  # Test whether there are dependency issues with Hadoop 3.0.
>  # Investigate the feasibility of using the shaded client jars (HADOOP-11804).






[jira] [Comment Edited] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505768#comment-16505768
 ] 

Saisai Shao edited comment on SPARK-24493 at 6/8/18 6:56 AM:
-

Adding more background: the issue happens when building Spark with Hadoop 2.8+.

In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to get the 
renew interval for the HDFS token, but due to a missing service loader file, Hadoop 
fails to create the {{TokenIdentifier}} and returns *null*, which leads to an 
unexpected renew interval (Long.MaxValue).

The related code in Hadoop is:

{code:java}
private static Class<? extends TokenIdentifier>
    getClassForIdentifier(Text kind) {
  Class<? extends TokenIdentifier> cls = null;
  synchronized (Token.class) {
    if (tokenKindMap == null) {
      tokenKindMap = Maps.newHashMap();
      for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
        tokenKindMap.put(id.getKind(), id.getClass());
      }
    }
    cls = tokenKindMap.get(kind);
  }
  if (cls == null) {
    LOG.debug("Cannot find class for token kind " + kind);
    return null;
  }
  return cls;
}
{code}

The problem is:

The HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, but 
the service loader descriptor file 
"META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the 
hadoop-hdfs jar. The Spark submit process/driver process (depending on client or 
cluster mode) only relies on the hadoop-hdfs-client jar, not the hadoop-hdfs jar, 
so the ServiceLoader fails to find the HDFS "DelegationTokenIdentifier" class and 
returns null.

The issue is due to the change in HADOOP-6200. Previously we only had build 
profiles for Hadoop 2.6 and 2.7, so this was not an issue, but we now have a 
build profile for Hadoop 3.1, so token renewal will fail with Hadoop 3.1.

This is a Hadoop issue; I'm creating a Spark Jira to track it and bump the 
version once the Hadoop side is fixed.
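
A quick way to check whether a given classpath is affected (a diagnostic sketch only, meant to run in the same JVM/classpath as the submitting process; it assumes the Hadoop client jars are already there):
{code:scala}
import java.util.ServiceLoader
import org.apache.hadoop.security.token.TokenIdentifier
import scala.collection.JavaConverters._

// List the token kinds registered via the TokenIdentifier service loader.
val kinds = ServiceLoader.load(classOf[TokenIdentifier]).asScala.map(_.getKind.toString).toSet
println(s"Registered token kinds: $kinds")
// If "HDFS_DELEGATION_TOKEN" is missing, Token.decodeIdentifier() returns null for
// HDFS tokens and Spark falls back to a renew interval of Long.MaxValue.
println(kinds.contains("HDFS_DELEGATION_TOKEN"))
{code}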



> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , 

[jira] [Commented] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505768#comment-16505768
 ] 

Saisai Shao commented on SPARK-24493:
-

Adding more background: the issue happens when building Spark with Hadoop 2.8+.

In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to get the 
renew interval for the HDFS token, but due to a missing service loader file, Hadoop 
fails to create the {{TokenIdentifier}} and returns *null*, which leads to an 
unexpected renew interval (Long.MaxValue).

The related code in Hadoop is:
{code:java}
private static Class<? extends TokenIdentifier>
    getClassForIdentifier(Text kind) {
  Class<? extends TokenIdentifier> cls = null;
  synchronized (Token.class) {
    if (tokenKindMap == null) {
      tokenKindMap = Maps.newHashMap();
      for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
        tokenKindMap.put(id.getKind(), id.getClass());
      }
    }
    cls = tokenKindMap.get(kind);
  }
  if (cls == null) {
    LOG.debug("Cannot find class for token kind " + kind);
    return null;
  }
  return cls;
}
{code}


The problem is:

The HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, but 
the service loader descriptor file 
"META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the 
hadoop-hdfs jar. The Spark submit process/driver process (depending on client or 
cluster mode) only relies on the hadoop-hdfs-client jar, not the hadoop-hdfs jar, 
so the ServiceLoader fails to find the HDFS "DelegationTokenIdentifier" class and 
returns null.

The issue is due to the change in HADOOP-6200. Previously we only had build 
profiles for Hadoop 2.6 and 2.7, so this was not an issue, but we now have a 
build profile for Hadoop 3.1, so token renewal will fail with Hadoop 3.1.

This is a Hadoop issue; I'm creating a Spark Jira to track it and bump the 
version once the Hadoop side is fixed.

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> 

[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24493:

Summary: Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3  
(was: Kerberos Ticket Renewal is failing in long running Spark job)

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505760#comment-16505760
 ] 

Saisai Shao commented on SPARK-24487:
-

You can add it in your own package with the DataSource API. There's no point in 
adding it to the Spark code base.

 

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.






[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.

2018-06-08 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505759#comment-16505759
 ] 

Michał Jurkiewicz commented on SPARK-24487:
---

I meant adding support for RabbitMQ in Spark Structured Streaming and Spark 
Streaming. From what I can see, only Kafka is supported.

[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]

https://spark.apache.org/docs/latest/streaming-programming-guide.html

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.






[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24493:

Component/s: Spark Core

> Kerberos Ticket Renewal is failing in long running Spark job
> 
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at 

[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24493:

Component/s: (was: Spark Core)
 YARN

> Kerberos Ticket Renewal is failing in long running Spark job
> 
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24493:

Priority: Major  (was: Blocker)

> Kerberos Ticket Renewal is failing in long running Spark job
> 
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at