[jira] [Commented] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506792#comment-16506792 ]

Yu-Jhe Li commented on SPARK-24492:
-----------------------------------

[~ste...@apache.org]: I think the TaskCommitDenied exception is not caused by S3 eventual consistency. As [~jiangxb1987] mentioned, another task may have obtained the lock for commit permission. By the way, speculation was not enabled in this case.

> Endless attempted task when TaskCommitDenied exception writing to S3A
> ---------------------------------------------------------------------
>
>                 Key: SPARK-24492
>                 URL: https://issues.apache.org/jira/browse/SPARK-24492
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Yu-Jhe Li
>            Priority: Critical
>         Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 2018-05-16 上午11.10.57.png
>
> Hi, when we run a Spark application under spark-2.2.0 on AWS spot instances and write output files to S3, some tasks retry endlessly and all of them fail with a TaskCommitDenied exception. This happens when the application runs on instances with network issues (it runs well on healthy spot instances).
> Sorry, I can't find an easy way to reproduce this issue; here is all I can provide.
> The Spark UI shows (in attachments) that one task of stage 112 failed due to a FetchFailedException (a network issue) and that the stage was retried as stage 112 (retry 1). But in stage 112 (retry 1), all tasks failed due to the TaskCommitDenied exception and kept retrying (they never succeeded and caused lots of S3 requests).
> On the other hand, the driver logs show:
> # task 123.0 in stage 112.0 failed due to a FetchFailedException (a network issue caused a corrupted file)
> # a warning message from OutputCommitCoordinator
> # task 92.0 in stage 112.1 failed when writing rows
> # the failed tasks kept retrying but never succeeded
> {noformat}
> 2018-05-16 02:38:055 WARN TaskSetManager:66 - Lost task 123.0 in stage 112.0 (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message=
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
>   at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
>   at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403)
>   at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
>   at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:80)
>   at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
>   at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>
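The comment on SPARK-24492 suggests that another task attempt may already hold the commit permission. The sketch below models that first-committer-wins idea in plain Java. It is a simplified, hypothetical model: the class `CommitArbiter` and its methods are invented for illustration and are not Spark's `OutputCommitCoordinator` API. It only shows how a permission that is never released would cause every later attempt for the same partition to be denied.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical first-committer-wins arbiter; not Spark's actual implementation. */
final class CommitArbiter {
    // partition id -> attempt number that currently holds the commit permission
    private final Map<Integer, Integer> holders = new HashMap<>();

    /** Grants permission to the first attempt that asks for a partition; denies all others. */
    synchronized boolean canCommit(int partition, int attempt) {
        Integer holder = holders.putIfAbsent(partition, attempt);
        return holder == null || holder == attempt;
    }

    /** Releases the permission, but only if the failed attempt was the holder. */
    synchronized void attemptFailed(int partition, int attempt) {
        holders.remove(partition, attempt);
    }
}
```

In this model, if the failure of the holding attempt is never reported, every subsequent attempt for that partition is denied indefinitely, which resembles the retry loop in the report. Whether that is the actual cause in SPARK-24492 is not established in this thread.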
[jira] [Resolved] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative
[ https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-24468.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.2
                   2.4.0

Issue resolved by pull request 21499
[https://github.com/apache/spark/pull/21499]

> DecimalType `adjustPrecisionScale` might fail when scale is negative
> --------------------------------------------------------------------
>
>                 Key: SPARK-24468
>                 URL: https://issues.apache.org/jira/browse/SPARK-24468
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Yifei Wu
>            Assignee: Marco Gaido
>            Priority: Minor
>             Fix For: 2.4.0, 2.3.2
>
> Hi, I am using the MySQL JDBC driver with Spark to run some SQL queries.
> When multiplying a LongType by a decimal in scientific notation, say
> {code:java}
> spark.sql("select some_int * 2.34E10 from t"){code}
> the decimal 2.34E10 is treated as decimal(3,-8), and some_int is cast to decimal(20,0).
>
> So according to the rules in the comments:
> {code:java}
> /*
>  *   Operation    Result Precision                        Result Scale
>  *
>  *   e1 + e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 - e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 * e2      p1 + p2 + 1                             s1 + s2
>  *   e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)      max(6, s1 + p2 + 1)
>  *   e1 % e2      min(p1-s1, p2-s2) + max(s1, s2)         max(s1, s2)
>  *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2)         max(s1, s2)
>  */
> {code}
> their product will be decimal(3+20+1,-8+0), which fails the assertion (scale>=0) at DecimalType.scala:166.
>
> My current workaround is to set spark.sql.decimalOperations.allowPrecisionLoss to false.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
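The arithmetic in the SPARK-24468 report can be checked with a small worked example. The sketch below applies only the multiplication row of the quoted table (precision p1 + p2 + 1, scale s1 + s2) to the types named in the report. The class `DecimalRules` and its method are hypothetical names for illustration, not Spark APIs. It also assumes that `java.math.BigDecimal` is a fair stand-in for how the literal gets its precision and scale; it happens to yield the same decimal(3,-8) that the report describes.

```java
import java.math.BigDecimal;

/** Illustrative arithmetic only; the names here are hypothetical, not Spark APIs. */
final class DecimalRules {
    /** Result type of e1 * e2 per the quoted table: {p1 + p2 + 1, s1 + s2}. */
    static int[] multiply(int p1, int s1, int p2, int s2) {
        return new int[] {p1 + p2 + 1, s1 + s2};
    }

    public static void main(String[] args) {
        // BigDecimal stores 2.34E10 as unscaled value 234 with scale -8.
        BigDecimal literal = new BigDecimal("2.34E10");
        int p1 = literal.precision(); // 3
        int s1 = literal.scale();     // -8

        // The report states that a LongType operand is cast to decimal(20,0).
        int[] result = multiply(p1, s1, 20, 0);
        System.out.println("decimal(" + result[0] + "," + result[1] + ")");
        System.out.println("scale >= 0 holds: " + (result[1] >= 0));
    }
}
```

The resulting scale of -8 is what violates the `scale >= 0` assumption mentioned in the report.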
[jira] [Assigned] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative
[ https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-24468:
-----------------------------------

    Assignee: Marco Gaido

> DecimalType `adjustPrecisionScale` might fail when scale is negative
> --------------------------------------------------------------------
>
>                 Key: SPARK-24468
>                 URL: https://issues.apache.org/jira/browse/SPARK-24468
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Yifei Wu
>            Assignee: Marco Gaido
>            Priority: Minor
>             Fix For: 2.3.2, 2.4.0
[jira] [Resolved] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs
[ https://issues.apache.org/jira/browse/SPARK-24412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai resolved SPARK-24412.
-----------------------------
    Resolution: Fixed

> Adding docs about automagical type casting in `isin` and `isInCollection` APIs
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-24412
>                 URL: https://issues.apache.org/jira/browse/SPARK-24412
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: DB Tsai
>            Assignee: thrvskn
>            Priority: Major
>              Labels: starter
>             Fix For: 2.4.0
>
> We should let users know that those two APIs will compare data with autocasting.
> See https://github.com/apache/spark/pull/21416#discussion_r191491943 for details.
[jira] [Commented] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs
[ https://issues.apache.org/jira/browse/SPARK-24412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506730#comment-16506730 ]

Apache Spark commented on SPARK-24412:
--------------------------------------

User 'trvskn' has created a pull request for this issue:
https://github.com/apache/spark/pull/21519

> Adding docs about automagical type casting in `isin` and `isInCollection` APIs
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-24412
>                 URL: https://issues.apache.org/jira/browse/SPARK-24412
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: DB Tsai
>            Assignee: thrvskn
>            Priority: Major
>              Labels: starter
>             Fix For: 2.4.0
[jira] [Assigned] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs
[ https://issues.apache.org/jira/browse/SPARK-24412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24412:
------------------------------------

    Assignee: thrvskn  (was: Apache Spark)

> Adding docs about automagical type casting in `isin` and `isInCollection` APIs
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-24412
>                 URL: https://issues.apache.org/jira/browse/SPARK-24412
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: DB Tsai
>            Assignee: thrvskn
>            Priority: Major
>              Labels: starter
>             Fix For: 2.4.0
[jira] [Assigned] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs
[ https://issues.apache.org/jira/browse/SPARK-24412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24412:
------------------------------------

    Assignee: Apache Spark  (was: thrvskn)

> Adding docs about automagical type casting in `isin` and `isInCollection` APIs
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-24412
>                 URL: https://issues.apache.org/jira/browse/SPARK-24412
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: DB Tsai
>            Assignee: Apache Spark
>            Priority: Major
>              Labels: starter
>             Fix For: 2.4.0
[jira] [Commented] (SPARK-24502) flaky test: UnsafeRowSerializerSuite
[ https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506728#comment-16506728 ]

Apache Spark commented on SPARK-24502:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21518

> flaky test: UnsafeRowSerializerSuite
> ------------------------------------
>
>                 Key: SPARK-24502
>                 URL: https://issues.apache.org/jira/browse/SPARK-24502
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>              Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped.
>   at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
>   at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
>   at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
>   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
>   at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
>   at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
>   at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
>   at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
>   at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
>   at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
>   at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
>   ...
> {code}
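Read bottom-up, the SPARK-24502 trace shows `SQLConf.get` reaching a lazily built `SharedState`, whose constructor tries to add a listener to a `LiveListenerBus` that has already been stopped. The sketch below reproduces that shape in plain Java. The classes `TinyBus` and `LazyState` are hypothetical stand-ins, not Spark code, and the thread itself does not state why the bus was stopped, so the ordering shown here is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a listener bus that rejects registrations once stopped. */
final class TinyBus {
    private final List<Runnable> listeners = new ArrayList<>();
    private boolean stopped = false;

    synchronized void addListener(Runnable listener) {
        if (stopped) {
            throw new IllegalStateException("TinyBus is stopped.");
        }
        listeners.add(listener);
    }

    synchronized void stop() {
        stopped = true;
    }
}

/** Hypothetical lazily built state whose constructor registers a listener on the bus. */
final class LazyState {
    LazyState(TinyBus bus) {
        bus.addListener(() -> { });
    }
}
```

Constructing a `LazyState` before `stop()` succeeds, while constructing one afterwards throws `IllegalStateException`. A test that first touches such state after the bus has been shut down would fail in the same way as the trace.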
[jira] [Assigned] (SPARK-24502) flaky test: UnsafeRowSerializerSuite
[ https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24502:
------------------------------------

    Assignee: Apache Spark  (was: Wenchen Fan)

> flaky test: UnsafeRowSerializerSuite
> ------------------------------------
>
>                 Key: SPARK-24502
>                 URL: https://issues.apache.org/jira/browse/SPARK-24502
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Apache Spark
>            Priority: Major
>              Labels: flaky-test
[jira] [Assigned] (SPARK-24502) flaky test: UnsafeRowSerializerSuite
[ https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24502:
------------------------------------

    Assignee: Wenchen Fan  (was: Apache Spark)

> flaky test: UnsafeRowSerializerSuite
> ------------------------------------
>
>                 Key: SPARK-24502
>                 URL: https://issues.apache.org/jira/browse/SPARK-24502
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>              Labels: flaky-test
[jira] [Created] (SPARK-24502) flaky test: UnsafeRowSerializerSuite
Wenchen Fan created SPARK-24502:
-----------------------------------

             Summary: flaky test: UnsafeRowSerializerSuite
                 Key: SPARK-24502
                 URL: https://issues.apache.org/jira/browse/SPARK-24502
             Project: Spark
          Issue Type: Test
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Wenchen Fan
            Assignee: Wenchen Fan
[jira] [Updated] (SPARK-24502) flaky test: UnsafeRowSerializerSuite
[ https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-24502:
--------------------------------
    Description: 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/

{code}
sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped.
  at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
  at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
  at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
  at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
  at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
  at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
  at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
  at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
  at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
  at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
  at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
  at scala.Option.map(Option.scala:146)
  at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
  at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
  at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
  at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
  at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
  at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
  at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
  at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
  at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
  at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
  ...
{code}

> flaky test: UnsafeRowSerializerSuite
> ------------------------------------
>
>                 Key: SPARK-24502
>                 URL: https://issues.apache.org/jira/browse/SPARK-24502
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>              Labels: flaky-test
[jira] [Updated] (SPARK-24502) flaky test: UnsafeRowSerializerSuite
[ https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-24502:
--------------------------------
    Labels: flaky-test  (was: )

> flaky test: UnsafeRowSerializerSuite
> ------------------------------------
>
>                 Key: SPARK-24502
>                 URL: https://issues.apache.org/jira/browse/SPARK-24502
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>              Labels: flaky-test
[jira] [Commented] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-24501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506709#comment-16506709 ]

Nicholas Parker commented on SPARK-24501:
-----------------------------------------

PR: https://github.com/apache/spark/pull/21516

> Add metrics for Mesos Driver and Dispatcher
> -------------------------------------------
>
>                 Key: SPARK-24501
>                 URL: https://issues.apache.org/jira/browse/SPARK-24501
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Parker
>            Priority: Minor
>
> The Mesos Dispatcher currently only has three gauge metrics, for the number of waiting, launched, and retrying drivers, defined in MesosClusterSchedulerSource.scala. Meanwhile, the Mesos Driver doesn't have any metrics defined at all.
> Implement additional metrics for both the Mesos Dispatcher and Mesos Driver, including e.g. types of events received, how long it takes jobs to run, for use by operators to monitor and/or diagnose their Spark environment.
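The SPARK-24501 description mentions three gauge metrics for waiting, launched, and retrying drivers. As a rough illustration of what a gauge is (a value computed when it is read, not a counter that is incremented), here is a sketch using only the JDK. The class `DriverGauges` and the metric names are hypothetical; they do not reproduce `MesosClusterSchedulerSource.scala` or any metrics library API.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Supplier;

/** Hypothetical gauge registry; each gauge reports a current value when it is read. */
final class DriverGauges {
    final Queue<String> waiting = new ArrayDeque<>();
    final Queue<String> launched = new ArrayDeque<>();
    final Queue<String> retrying = new ArrayDeque<>();
    private final Map<String, Supplier<Integer>> gauges = new LinkedHashMap<>();

    DriverGauges() {
        // Each gauge is a callback over live state, so no explicit updates are needed.
        gauges.put("waitingDrivers", waiting::size);
        gauges.put("launchedDrivers", launched::size);
        gauges.put("retryDrivers", retrying::size);
    }

    /** Reads every gauge at call time and returns a name -> value snapshot. */
    Map<String, Integer> snapshot() {
        Map<String, Integer> out = new LinkedHashMap<>();
        gauges.forEach((name, gauge) -> out.put(name, gauge.get()));
        return out;
    }
}
```

The additional metrics the issue asks for, such as event counts by type or job run times, would typically need counters or timers rather than gauges, since they accumulate over time.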
[jira] [Issue Comment Deleted] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-24501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Parker updated SPARK-24501:
------------------------------------
    Comment: was deleted

(was: PR: https://github.com/apache/spark/pull/21516)

> Add metrics for Mesos Driver and Dispatcher
> -------------------------------------------
>
>                 Key: SPARK-24501
>                 URL: https://issues.apache.org/jira/browse/SPARK-24501
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Parker
>            Priority: Minor
[jira] [Assigned] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-24501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24501:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add metrics for Mesos Driver and Dispatcher
> -------------------------------------------
>
>                 Key: SPARK-24501
>                 URL: https://issues.apache.org/jira/browse/SPARK-24501
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Parker
>            Priority: Minor
[jira] [Assigned] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-24501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24501:
------------------------------------

    Assignee: Apache Spark

> Add metrics for Mesos Driver and Dispatcher
> -------------------------------------------
>
>                 Key: SPARK-24501
>                 URL: https://issues.apache.org/jira/browse/SPARK-24501
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Parker
>            Assignee: Apache Spark
>            Priority: Minor
[jira] [Commented] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-24501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506707#comment-16506707 ] Apache Spark commented on SPARK-24501: -- User 'nickbp' has created a pull request for this issue: https://github.com/apache/spark/pull/21516 > Add metrics for Mesos Driver and Dispatcher > --- > > Key: SPARK-24501 > URL: https://issues.apache.org/jira/browse/SPARK-24501 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.1 >Reporter: Nicholas Parker >Priority: Minor > > The Mesos Dispatcher currently only has three gauge metrics, for the number > of waiting, launched, and retrying drivers, defined in > MesosClusterSchedulerSource.scala. Meanwhile, the Mesos Driver doesn't have > any metrics defined at all. > Implement additional metrics for both the Mesos Dispatcher and Mesos Driver, > including e.g. types of events received, how long it takes jobs to run, for > use by operators to monitor and/or diagnose their Spark environment.
[jira] [Assigned] (SPARK-24372) Create script for preparing RCs
[ https://issues.apache.org/jira/browse/SPARK-24372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24372: Assignee: (was: Apache Spark) > Create script for preparing RCs > --- > > Key: SPARK-24372 > URL: https://issues.apache.org/jira/browse/SPARK-24372 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > Currently, when preparing RCs, the RM has to invoke many scripts manually, > make sure that is being done in the correct environment, and set all the > correct environment variables, which differ from one script to the other. > It will be much easier for RMs if all that was automated as much as possible. > I'm working on something like this as part of releasing 2.3.1, and plan to > send my scripts for review after the release is done (i.e. after I make sure > the scripts are working properly).
[jira] [Assigned] (SPARK-24372) Create script for preparing RCs
[ https://issues.apache.org/jira/browse/SPARK-24372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24372: Assignee: Apache Spark > Create script for preparing RCs > --- > > Key: SPARK-24372 > URL: https://issues.apache.org/jira/browse/SPARK-24372 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > Currently, when preparing RCs, the RM has to invoke many scripts manually, > make sure that is being done in the correct environment, and set all the > correct environment variables, which differ from one script to the other. > It will be much easier for RMs if all that was automated as much as possible. > I'm working on something like this as part of releasing 2.3.1, and plan to > send my scripts for review after the release is done (i.e. after I make sure > the scripts are working properly).
[jira] [Commented] (SPARK-24372) Create script for preparing RCs
[ https://issues.apache.org/jira/browse/SPARK-24372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506703#comment-16506703 ] Apache Spark commented on SPARK-24372: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/21515 > Create script for preparing RCs > --- > > Key: SPARK-24372 > URL: https://issues.apache.org/jira/browse/SPARK-24372 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > Currently, when preparing RCs, the RM has to invoke many scripts manually, > make sure that is being done in the correct environment, and set all the > correct environment variables, which differ from one script to the other. > It will be much easier for RMs if all that was automated as much as possible. > I'm working on something like this as part of releasing 2.3.1, and plan to > send my scripts for review after the release is done (i.e. after I make sure > the scripts are working properly).
[jira] [Created] (SPARK-24501) Add metrics for Mesos Driver and Dispatcher
Nicholas Parker created SPARK-24501: --- Summary: Add metrics for Mesos Driver and Dispatcher Key: SPARK-24501 URL: https://issues.apache.org/jira/browse/SPARK-24501 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.3.1 Reporter: Nicholas Parker The Mesos Dispatcher currently only has three gauge metrics, for the number of waiting, launched, and retrying drivers, defined in MesosClusterSchedulerSource.scala. Meanwhile, the Mesos Driver doesn't have any metrics defined at all. Implement additional metrics for both the Mesos Dispatcher and Mesos Driver, including e.g. types of events received, how long it takes jobs to run, for use by operators to monitor and/or diagnose their Spark environment.
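The kinds of metrics requested in SPARK-24501 can be illustrated with a minimal sketch. This is plain Python, not Spark's metrics system and not the code in MesosClusterSchedulerSource.scala; the class `DispatcherMetrics`, its method names, and the event names used below are all hypothetical and chosen only for illustration. It shows three gauges that are sampled on demand, a counter per event type, and a mean job duration.

```python
from collections import Counter
from typing import Callable, Dict, List


class DispatcherMetrics:
    """Hypothetical in-process metrics holder (illustration only)."""

    def __init__(self) -> None:
        # Gauges are callables sampled at snapshot time, so they always
        # report the current value rather than a stored copy.
        self._gauges: Dict[str, Callable[[], int]] = {}
        # One counter per type of event received.
        self.events: Counter = Counter()
        # Durations of finished jobs, in seconds.
        self.durations_s: List[float] = []

    def register_gauge(self, name: str, read: Callable[[], int]) -> None:
        self._gauges[name] = read

    def record_event(self, event_type: str) -> None:
        self.events[event_type] += 1

    def record_job_duration(self, seconds: float) -> None:
        self.durations_s.append(seconds)

    def snapshot(self) -> Dict[str, float]:
        values: Dict[str, float] = {n: read() for n, read in self._gauges.items()}
        for event_type, count in self.events.items():
            values[f"events.{event_type}"] = count
        if self.durations_s:
            values["jobDuration.meanSeconds"] = sum(self.durations_s) / len(self.durations_s)
        return values


# Assumed example state: driver queues as plain lists.
waiting, launched, retrying = ["driver-1", "driver-2"], ["driver-3"], []
metrics = DispatcherMetrics()
metrics.register_gauge("waitingDrivers", lambda: len(waiting))
metrics.register_gauge("launchedDrivers", lambda: len(launched))
metrics.register_gauge("retryingDrivers", lambda: len(retrying))
metrics.record_event("statusUpdate")
metrics.record_event("statusUpdate")
metrics.record_event("resourceOffers")
metrics.record_job_duration(12.0)
metrics.record_job_duration(18.0)
print(metrics.snapshot())
```

Reading gauges lazily keeps the metrics object decoupled from the scheduler state: the queues can change at any time and the next snapshot reflects it without explicit updates.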
[jira] [Updated] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch
[ https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-24106: --- Fix Version/s: (was: 2.4.0) (was: 2.3.0) > Spark Structure Streaming with RF model taking long time in processing > probability for each mini batch > -- > > Key: SPARK-24106 > URL: https://issues.apache.org/jira/browse/SPARK-24106 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark yarn / Standalone cluster > 2 master nodes - 32 cores - 124 GB > 9 worker nodes - 32 cores - 124 GB > Kafka input and output topic with 6 partitions >Reporter: Tamilselvan Veeramani >Priority: Major > Labels: performance > > RandomForestClassificationModel is broadcast to executors for every mini batch > in Spark streaming when computing the probability. > RF model size: 45 MB > Spark Kafka streaming job jar size: 8 MB (including Kafka dependencies) > The following log shows the model being broadcast to executors for every mini batch when we > call rf_model.transform(dataset).select("probability"). 
> As a result, task deserialization time also increases to 6 to 7 seconds > for the 45 MB RF model, although processing time is just 400 to 600 ms per mini > batch > 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: > KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5)) > 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106 > 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory > on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB) > After 2 to 3 weeks of struggle, I found a potential solution which will > help many people who are looking to use an RF model for “probability” in a real > time streaming context. > The transformImpl method of the RandomForestClassificationModel class > implements only “prediction” in the current version of Spark. It can be > leveraged to implement “probability” as well. > I have modified the code and deployed it on our server, and it runs in > 400 ms to 500 ms for every mini batch. > I see many people out there facing this issue, with no solution provided in any > of the forums. Can you please review and put this fix in the next release? Thanks
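The reporter's idea of producing “probability” in the same pass that already produces “prediction” can be sketched as follows. This is a plain Python illustration, not the code of RandomForestClassificationModel or its transformImpl method; the function `forest_predict` and its input layout (one list of per-class leaf counts per tree) are assumptions made for the example. The point is only that both outputs derive from one aggregated vote vector, so no second model pass is needed.

```python
from typing import List, Sequence, Tuple


def forest_predict(tree_votes: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Combine per-tree class counts into (prediction, probability).

    tree_votes[t][c] is the (hypothetical) class-c count in the leaf that
    tree t routed the row to.
    """
    num_classes = len(tree_votes[0])
    raw = [0.0] * num_classes
    for votes in tree_votes:
        total = sum(votes)
        for c, v in enumerate(votes):
            # Each tree contributes its normalized class distribution.
            raw[c] += v / total
    raw_sum = sum(raw)
    # Probability is the raw vote vector scaled to sum to 1.
    probability = [r / raw_sum for r in raw]
    # Prediction is the class with the largest aggregated vote.
    prediction = max(range(num_classes), key=lambda c: raw[c])
    return prediction, probability


# Three trees, two classes (made-up counts).
prediction, probability = forest_predict([[3, 1], [1, 1], [0, 4]])
print(prediction, probability)
```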
[jira] [Updated] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch
[ https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-24106: --- Target Version/s: (was: 2.2.1, 2.3.0) > Spark Structure Streaming with RF model taking long time in processing > probability for each mini batch > -- > > Key: SPARK-24106 > URL: https://issues.apache.org/jira/browse/SPARK-24106 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark yarn / Standalone cluster > 2 master nodes - 32 cores - 124 GB > 9 worker nodes - 32 cores - 124 GB > Kafka input and output topic with 6 partitions >Reporter: Tamilselvan Veeramani >Priority: Major > Labels: performance > > RandomForestClassificationModel is broadcast to executors for every mini batch > in Spark streaming when computing the probability. > RF model size: 45 MB > Spark Kafka streaming job jar size: 8 MB (including Kafka dependencies) > The following log shows the model being broadcast to executors for every mini batch when we > call rf_model.transform(dataset).select("probability"). 
> As a result, task deserialization time also increases to 6 to 7 seconds > for the 45 MB RF model, although processing time is just 400 to 600 ms per mini > batch > 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: > KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5)) > 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106 > 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory > on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB) > After 2 to 3 weeks of struggle, I found a potential solution which will > help many people who are looking to use an RF model for “probability” in a real > time streaming context. > The transformImpl method of the RandomForestClassificationModel class > implements only “prediction” in the current version of Spark. It can be > leveraged to implement “probability” as well. > I have modified the code and deployed it on our server, and it runs in > 400 ms to 500 ms for every mini batch. > I see many people out there facing this issue, with no solution provided in any > of the forums. Can you please review and put this fix in the next release? Thanks
[jira] [Assigned] (SPARK-22860) Spark workers log ssl passwords passed to the executors
[ https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22860: Assignee: (was: Apache Spark) > Spark workers log ssl passwords passed to the executors > --- > > Key: SPARK-22860 > URL: https://issues.apache.org/jira/browse/SPARK-22860 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Felix K. >Priority: Major > > The workers log the spark.ssl.keyStorePassword and > spark.ssl.trustStorePassword passed by cli to the executor processes. The > ExecutorRunner should escape passwords to not appear in the worker's log > files in INFO level. In this example, you can see my 'SuperSecretPassword' in > a worker log: > {code} > 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: > "/global/myapp/oem/jdk/bin/java" "-cp" > "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar > [...] > :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" > "-Dspark.authenticate.enableSaslEncryption=true" > "-Dspark.ssl.keyStorePassword=SuperSecretPassword" > "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" > "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" > "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" > "-Dspark.ssl.protocol=TLS" > "-Dspark.ssl.trustStorePassword=SuperSecretPassword" > "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" > "-Dmyapp.config.directory=/global/myapp/application/config" > "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer" > > "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" > "-XX:+UseG1GC" "-XX:+UseStringDeduplication" > "-Dthings.loader.export.zzz_files=false" > "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties" > "-XX:+HeapDumpOnOutOfMemoryError" 
"-XX:+UseStringDeduplication" > "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" > "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" > "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" > "--worker-url" "spark://Worker@192.168.0.1:59530" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22860) Spark workers log ssl passwords passed to the executors
[ https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22860: Assignee: Apache Spark > Spark workers log ssl passwords passed to the executors > --- > > Key: SPARK-22860 > URL: https://issues.apache.org/jira/browse/SPARK-22860 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Felix K. >Assignee: Apache Spark >Priority: Major > > The workers log the spark.ssl.keyStorePassword and > spark.ssl.trustStorePassword passed by cli to the executor processes. The > ExecutorRunner should escape passwords to not appear in the worker's log > files in INFO level. In this example, you can see my 'SuperSecretPassword' in > a worker log: > {code} > 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: > "/global/myapp/oem/jdk/bin/java" "-cp" > "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar > [...] > :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" > "-Dspark.authenticate.enableSaslEncryption=true" > "-Dspark.ssl.keyStorePassword=SuperSecretPassword" > "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" > "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" > "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" > "-Dspark.ssl.protocol=TLS" > "-Dspark.ssl.trustStorePassword=SuperSecretPassword" > "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" > "-Dmyapp.config.directory=/global/myapp/application/config" > "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer" > > "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" > "-XX:+UseG1GC" "-XX:+UseStringDeduplication" > "-Dthings.loader.export.zzz_files=false" > "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties" > 
"-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" > "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" > "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" > "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" > "--worker-url" "spark://Worker@192.168.0.1:59530" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22860) Spark workers log ssl passwords passed to the executors
[ https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506693#comment-16506693 ] Apache Spark commented on SPARK-22860: -- User 'tooptoop4' has created a pull request for this issue: https://github.com/apache/spark/pull/21514 > Spark workers log ssl passwords passed to the executors > --- > > Key: SPARK-22860 > URL: https://issues.apache.org/jira/browse/SPARK-22860 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Felix K. >Priority: Major > > The workers log the spark.ssl.keyStorePassword and > spark.ssl.trustStorePassword passed by cli to the executor processes. The > ExecutorRunner should escape passwords to not appear in the worker's log > files in INFO level. In this example, you can see my 'SuperSecretPassword' in > a worker log: > {code} > 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: > "/global/myapp/oem/jdk/bin/java" "-cp" > "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar > [...] 
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" > "-Dspark.authenticate.enableSaslEncryption=true" > "-Dspark.ssl.keyStorePassword=SuperSecretPassword" > "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" > "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" > "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" > "-Dspark.ssl.protocol=TLS" > "-Dspark.ssl.trustStorePassword=SuperSecretPassword" > "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" > "-Dmyapp.config.directory=/global/myapp/application/config" > "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer" > > "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" > "-XX:+UseG1GC" "-XX:+UseStringDeduplication" > "-Dthings.loader.export.zzz_files=false" > "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties" > "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" > "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" > "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" > "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" > "--worker-url" "spark://Worker@192.168.0.1:59530" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
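The behavior asked for in SPARK-22860, keeping password values out of the logged launch command, can be sketched as below. This is a plain Python illustration and not the change made in the linked pull request or in ExecutorRunner; the function `redact_command`, the key pattern, and the replacement text are assumptions. It masks the value of any `-Dkey=value` option whose key looks sensitive and leaves every other argument untouched.

```python
import re
from typing import List

# Assumption: a -D option is sensitive if its key mentions one of these words.
SENSITIVE_KEY = re.compile(r"password|secret|token", re.IGNORECASE)
REDACTED = "*********(redacted)"


def redact_command(args: List[str]) -> List[str]:
    """Return a copy of args that is safe to write to a log."""
    safe = []
    for arg in args:
        if arg.startswith("-D") and "=" in arg:
            key, _, _value = arg.partition("=")
            # Only the key is inspected; the value is never logged.
            if SENSITIVE_KEY.search(key):
                arg = f"{key}={REDACTED}"
        safe.append(arg)
    return safe


command = [
    "/global/myapp/oem/jdk/bin/java",
    "-Dspark.ssl.keyStorePassword=SuperSecretPassword",
    "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks",
    "-Dspark.ssl.trustStorePassword=SuperSecretPassword",
    "-Dspark.ssl.enabled=true",
]
safe_command = redact_command(command)
print(" ".join(f'"{a}"' for a in safe_command))
```

Matching on the key rather than the value means the keystore path, which also contains no secret, is still logged in full, while both password options are masked. The original list is not modified, so the real values can still be passed to the executor process.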
[jira] [Assigned] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs
[ https://issues.apache.org/jira/browse/SPARK-24412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-24412: --- Assignee: thrvskn > Adding docs about automagical type casting in `isin` and `isInCollection` APIs > -- > > Key: SPARK-24412 > URL: https://issues.apache.org/jira/browse/SPARK-24412 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: thrvskn >Priority: Major > Labels: starter > Fix For: 2.4.0 > > > We should let users know that those two APIs will compare data with > autocasting. > See https://github.com/apache/spark/pull/21416#discussion_r191491943 for > detail.
[jira] [Resolved] (SPARK-23010) Add integration testing for Kubernetes backend into the apache/spark repository
[ https://issues.apache.org/jira/browse/SPARK-23010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah resolved SPARK-23010. Resolution: Fixed Fix Version/s: 2.4.0 Some tests are missing but we got a basic set of the tests and the infrastructure into master. > Add integration testing for Kubernetes backend into the apache/spark > repository > --- > > Key: SPARK-23010 > URL: https://issues.apache.org/jira/browse/SPARK-23010 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > Fix For: 2.4.0 > > > Add tests for the scheduler backend into apache/spark > /xref: > http://apache-spark-developers-list.1001551.n3.nabble.com/Integration-testing-and-Scheduler-Backends-td23105.html
[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506521#comment-16506521 ] Felix Cheung edited comment on SPARK-24359 at 6/8/18 8:16 PM: -- Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 such that it contains only branch-2.4 official release commits + any alterations for SparkML and then tag it as v2.4.0.1. I think many of us would agree to separate repo only for convenience - so if one would sign up to handle the branching and commit porting etc and we get community to vote on such a "SparkML only" release, then it is ok. Though thinking about it we would still have officially a Spark 2.4.0.1 release (with no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way the release/tag process work. was (Author: felixcheung): Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 such that it contains only branch-2.3 official release commits + any alterations for SparkML and then tag it as v2.4.0.1. I think many of us would agree to separate repo only for convenience - so if one would sign up to handle the branching and commit porting etc and we get community to vote on such a "SparkML only" release, then it is ok. Though thinking about it we would still have officially a Spark 2.4.0.1 release (with no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way the release/tag process work. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R-v3.pdf, SparkML_ ML Pipelines in R.pdf > > > h1. 
Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5 the (new) SparkML API which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o] > has matured significantly. It allows users build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also we propose a code-gen approach for this > package to minimize work needed to expose future MLlib API, but sparklyr’s > API is manually written. > h1. Target Personas > * Existing SparkR users who need more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. 
Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independent from SparkR > * After setting up a Spark session R users can > ** create a pipeline by chaining individual components and specifying their > parameters > ** tune a pipeline in parallel, taking advantage of Spark > ** inspect a pipeline’s parameters and evaluation metrics > ** repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in API, we will choose based on the following > list of priorities. The API choice that addresses a higher priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. > * *Interoperability with rest of Spark
[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506521#comment-16506521 ] Felix Cheung edited comment on SPARK-24359 at 6/8/18 8:16 PM: -- Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 such that it contains only branch-2.4 official release commits + any alterations for SparkML and then tag it as v2.4.0.1. I think many of us would agree to separate repo only for convenience - so if one would sign up to handle the branching and commit porting etc and we get community to vote on such a "SparkML only" release, then it is ok. Though thinking about it we would still have officially a Spark 2.4.0.1 release (with no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way the release/tag process work. And likely this 2.4.0.1 would be visible in release share, maven etc. too was (Author: felixcheung): Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 such that it contains only branch-2.4 official release commits + any alterations for SparkML and then tag it as v2.4.0.1. I think many of us would agree to separate repo only for convenience - so if one would sign up to handle the branching and commit porting etc and we get community to vote on such a "SparkML only" release, then it is ok. Though thinking about it we would still have officially a Spark 2.4.0.1 release (with no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way the release/tag process work. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R-v3.pdf, SparkML_ ML Pipelines in R.pdf > > > h1. 
Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5 the (new) SparkML API which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o] > has matured significantly. It allows users build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also we propose a code-gen approach for this > package to minimize work needed to expose future MLlib API, but sparklyr’s > API is manually written. > h1. Target Personas > * Existing SparkR users who need more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. 
Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independent from SparkR > * After setting up a Spark session R users can > ** create a pipeline by chaining individual components and specifying their > parameters > ** tune a pipeline in parallel, taking advantage of Spark > ** inspect a pipeline’s parameters and evaluation metrics > ** repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in API, we will choose based on the following > list of priorities. The API choice that addresses a higher priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506521#comment-16506521 ] Felix Cheung commented on SPARK-24359: -- Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 such that it contains only branch-2.3 official release commits + any alterations for SparkML and then tag it as v2.4.0.1. I think many of us would agree to separate repo only for convenience - so if one would sign up to handle the branching and commit porting etc and we get community to vote on such a "SparkML only" release, then it is ok. Though thinking about it we would still have officially a Spark 2.4.0.1 release (with no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way the release/tag process work. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R-v3.pdf, SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5 the (new) SparkML API which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o] > has matured significantly. It allows users build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. 
> *Why not SparkR?* > SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also we propose a code-gen approach for this > package to minimize work needed to expose future MLlib API, but sparklyr’s > API is manually written. > h1. Target Personas > * Existing SparkR users who need more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independent from SparkR > * After setting up a Spark session R users can > ** create a pipeline by chaining individual components and specifying their > parameters > ** tune a pipeline in parallel, taking advantage of Spark > ** inspect a pipeline’s parameters and evaluation metrics > ** repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in API, we will choose based on the following > list of priorities. The API choice that addresses a higher priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity.
> * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. > * *Interoperability with rest of Spark components:* We will keep the R API > as thin as possible and keep all functionality implementation in JVM/Scala. > * *Being natural to R users:* Ultimate users of this package are R users and > they should find it easy and natural to use. > The API will follow familiar R function syntax, where the object is passed as > the first argument of the method: do_something(obj, arg1, arg2). All > functions are snake_case (e.g., {{spark_logistic_regression()}} and > {{set_max_iter()}}). If a constructor gets arguments, they will be named > arguments. For example: > {code:java} > > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code} > When calls need to be chained,
[jira] [Commented] (SPARK-19826) spark.ml Python API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506503#comment-16506503 ] Apache Spark commented on SPARK-19826: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/21513 > spark.ml Python API for PIC > --- > > Key: SPARK-19826 > URL: https://issues.apache.org/jira/browse/SPARK-19826 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24500) UnsupportedOperationException when trying to execute Union plan with Stream of children
Bogdan Raducanu created SPARK-24500:
---
Summary: UnsupportedOperationException when trying to execute Union plan with Stream of children
Key: SPARK-24500
URL: https://issues.apache.org/jira/browse/SPARK-24500
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Bogdan Raducanu

To reproduce:
{code}
import org.apache.spark.sql.catalyst.plans.logical._
def range(i: Int) = Range(1, i, 1, 1)
val union = Union(Stream(range(3), range(5), range(7)))
spark.sessionState.planner.plan(union).next().execute()
{code}
produces
{code}
java.lang.UnsupportedOperationException
  at org.apache.spark.sql.execution.PlanLater.doExecute(SparkStrategies.scala:55)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
{code}
The SparkPlan looks like this:
{code}
:- Range (1, 3, step=1, splits=1)
:- PlanLater Range (1, 5, step=1, splits=Some(1))
+- PlanLater Range (1, 7, step=1, splits=Some(1))
{code}
So not all of it was planned (some PlanLater still in there).
This appears to be a longstanding issue. I traced it to the use of var in TreeNode.
For example in mapChildren:
{code}
case args: Traversable[_] => args.map {
  case arg: TreeNode[_] if containsChild(arg) =>
    val newChild = f(arg.asInstanceOf[BaseType])
    if (!(newChild fastEquals arg)) {
      changed = true
{code}
If args is a Stream then changed will never be set here, ultimately causing the method to return the original plan.
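The failure mode reported here, a mutable flag that is assigned inside a function mapped over a lazy collection, can be reproduced outside Spark. The sketch below is only an analogy in plain Python, not Catalyst code: Python's built-in map is lazy, so a flag set inside the mapped function is still unset when it is read right after the map call. The details of Scala's Stream differ, and the names map_children, rewrite, visit and children are hypothetical, chosen only for illustration.

```python
def map_children(children, rewrite, force):
    """Apply `rewrite` to each child and return the new children if any changed.

    Mirrors the shape of the quoted mapChildren snippet: a mutable flag is set
    as a side effect of the mapping function. With force=True the lazy result
    is materialized before the flag is read.
    """
    changed = False

    def visit(child):
        nonlocal changed
        new_child = rewrite(child)
        if new_child != child:
            changed = True
        return new_child

    mapped = map(visit, children)  # lazy: visit() has not run yet
    if force:
        mapped = list(mapped)      # runs visit() on every child
    # When `mapped` is still lazy, `changed` is False at this point, so the
    # original children are returned even though a rewrite would apply.
    return list(mapped) if changed else list(children)


children = ["PlanLater(3)", "PlanLater(5)", "PlanLater(7)"]


def rewrite(child):
    # Stand-in for planning a child: replace the placeholder node name.
    return child.replace("PlanLater", "Range")


lazy_result = map_children(children, rewrite, force=False)
forced_result = map_children(children, rewrite, force=True)
```

In this sketch lazy_result is the untouched input while forced_result contains the rewritten children, which is the same symptom as placeholders surviving in the plan. Materializing the collection before reading the flag is one possible remedy; whether that is the right fix for TreeNode is not established by this example.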
[jira] [Updated] (SPARK-22666) Spark datasource for image format
[ https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-22666: -- Target Version/s: 2.4.0 > Spark datasource for image format > - > > Key: SPARK-22666 > URL: https://issues.apache.org/jira/browse/SPARK-22666 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Priority: Major > > The current API for the new image format is implemented as a standalone > feature, in order to make it reside within the mllib package. As discussed in > SPARK-21866, users should be able to load images through the more common > spark source reader interface. > This ticket is concerned with adding image reading support in the spark > source API, through either of the following interfaces: > - {{spark.read.format("image")...}} > - {{spark.read.image}} > The output is a dataframe that contains images (and the file names for > example), following the semantics discussed already in SPARK-21866. > A few technical notes: > * since the functionality is implemented in {{mllib}}, calling this function > may fail at runtime if users have not imported the {{spark-mllib}} dependency > * How to deal with very flat directories? It is common to have millions of > files in a single "directory" (like in S3), which seems to have caused some > issues to some users. If this issue is too complex to handle in this ticket, > it can be dealt with separately.
[jira] [Commented] (SPARK-22666) Spark datasource for image format
[ https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506471#comment-16506471 ] Xiangrui Meng commented on SPARK-22666: --- [~mhamilton] Do you want to take this task? > Spark datasource for image format > - > > Key: SPARK-22666 > URL: https://issues.apache.org/jira/browse/SPARK-22666 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Priority: Major
[jira] [Updated] (SPARK-22666) Spark datasource for image format
[ https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-22666: -- Shepherd: Xiangrui Meng > Spark datasource for image format > - > > Key: SPARK-22666 > URL: https://issues.apache.org/jira/browse/SPARK-22666 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Priority: Major
[jira] [Resolved] (SPARK-24213) Power Iteration Clustering in the SparkML throws exception, when the ID is IntType
[ https://issues.apache.org/jira/browse/SPARK-24213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid resolved SPARK-24213. Resolution: Fixed > Power Iteration Clustering in the SparkML throws exception, when the ID is > IntType > -- > > Key: SPARK-24213 > URL: https://issues.apache.org/jira/browse/SPARK-24213 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: shahid >Priority: Major > Fix For: 2.4.0 > > > While running the code, PowerIterationClustering in spark ML throws exception. > {code:scala} > val data = spark.createDataFrame(Seq( > (0, Array(1), Array(0.9)), > (1, Array(2), Array(0.9)), > (2, Array(3), Array(0.9)), > (3, Array(4), Array(0.1)), > (4, Array(5), Array(0.9)) > )).toDF("id", "neighbors", "similarities") > val result = new PowerIterationClustering() > .setK(2) > .setMaxIter(10) > .setInitMode("random") > .transform(data) > .select("id","prediction") > {code} > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given > input columns: [id, neighbors, similarities];; > 'Project [id#215, 'prediction] > +- AnalysisBarrier > +- Project [id#215, neighbors#216, similarities#217] > +- Join Inner, (id#215 = id#234) > :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS > similarities#217] > : +- LocalRelation [_1#209, _2#210, _3#211] > +- Project [cast(id#230L as int) AS id#234] >+- LogicalRDD [id#230L, prediction#231], false > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > {code}
[jira] [Resolved] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.
[ https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid resolved SPARK-24217. Resolution: Fixed > Power Iteration Clustering is not displaying cluster indices corresponding to > some vertices. > > > Key: SPARK-24217 > URL: https://issues.apache.org/jira/browse/SPARK-24217 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: shahid >Priority: Major > Fix For: 2.4.0 > > > We should display the prediction and id for all the nodes. > Currently PIC does not return the cluster indices of neighbour IDs that do > not appear in the ID column. > As per the definition of PIC clustering, given in the code, > PIC takes an affinity matrix between items (or vertices) as input. An > affinity matrix > is a symmetric matrix whose entries are non-negative similarities between > items. > PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each > input row includes: > * {{idCol}}: vertex ID > * {{neighborsCol}}: neighbors of vertex in {{idCol}} > * {{similaritiesCol}}: non-negative weights (similarities) of edges between > the vertex > in {{idCol}} and each neighbor in {{neighborsCol}} > * *"PIC returns a cluster assignment for each input vertex."* It appends a > new column {{predictionCol}} > containing the cluster assignment in {{[0,k)}} for each row (vertex). >
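To make the gap described in this issue concrete, the sketch below works through the adjacency-list input format in plain Python, without Spark. It collects every vertex that appears either in the id column or in a neighbors list and reports the ones that never appear as an id, which are the vertices the issue says receive no cluster index. The sample rows and the helper name all_vertices are illustrative assumptions, not part of the PIC implementation.

```python
# Each row is (id, neighbors, similarities), following the input format above.
rows = [
    (0, [1], [0.9]),
    (1, [2], [0.9]),
    (2, [3], [0.9]),
    (3, [4], [0.1]),
    (4, [5], [0.9]),
]


def all_vertices(rows):
    """Return every vertex mentioned as an id or as a neighbor."""
    vertices = set()
    for vertex_id, neighbors, _similarities in rows:
        vertices.add(vertex_id)
        vertices.update(neighbors)
    return vertices


# Vertices that only ever appear in a neighbors list.
id_column = {vertex_id for vertex_id, _, _ in rows}
missing_from_id_column = sorted(all_vertices(rows) - id_column)
```

With these rows, vertex 5 is part of the graph but has no row of its own, so an output keyed only on the id column cannot carry its cluster assignment.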
[jira] [Resolved] (SPARK-23984) PySpark Bindings for K8S
[ https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah resolved SPARK-23984. Resolution: Fixed Fix Version/s: 2.4.0 > PySpark Bindings for K8S > > > Key: SPARK-23984 > URL: https://issues.apache.org/jira/browse/SPARK-23984 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, PySpark >Affects Versions: 2.3.0 >Reporter: Ilan Filonenko >Priority: Major > Fix For: 2.4.0 > > > This ticket is tracking the ongoing work of moving the upstream work from > [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python > bindings for Spark on Kubernetes. > The points of focus are: dependency management, increased non-JVM memory > overhead default values, and modified Docker images to include Python > Support.
[jira] [Commented] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes
[ https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506456#comment-16506456 ] Wenchen Fan commented on SPARK-20712: - I looked into this. This is actually a Hive bug, and I feel it's impossible to fix on the Spark side, even with Spark 2.0. [~maver1ck] can you double check that you can use Spark 2.0 to read the same table without problems? Ideally Hive would report the error when creating the table; I'm not sure how the malformed table slipped into the Hive metastore. > [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has > length greater than 4000 bytes > --- > > Key: SPARK-20712 > URL: https://issues.apache.org/jira/browse/SPARK-20712 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.3.0 >Reporter: Maciej Bryński >Priority: Critical > > Hi, > I have the following issue. > I'm trying to read a table from Hive in which one of the columns is nested, so its > schema is longer than 4000 bytes. > Everything worked on Spark 2.0.2. On 2.1.1 I'm getting an exception: > {code} > >> spark.read.table("SOME_TABLE") > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table > return self._df(self._jreader.table(tableName)) > File > "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 1133, in __call__ > File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o71.table.
> : org.apache.spark.SparkException: Cannot recognize hive type string: > SOME_VERY_LONG_FIELD_TYPE > at > org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359) > at > org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at >
[jira] [Assigned] (SPARK-17756) java.lang.ClassCastException when using cartesian with DStream.transform
[ https://issues.apache.org/jira/browse/SPARK-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-17756: Assignee: Hyukjin Kwon > java.lang.ClassCastException when using cartesian with DStream.transform > > > Key: SPARK-17756 > URL: https://issues.apache.org/jira/browse/SPARK-17756 > Project: Spark > Issue Type: Bug > Components: DStreams, PySpark >Affects Versions: 2.0.0, 2.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > Steps to reproduce:
> {code}
> from pyspark.streaming import StreamingContext
> ssc = StreamingContext(spark.sparkContext, 10)
> (ssc
>     .queueStream([sc.range(10)])
>     .transform(lambda rdd: rdd.cartesian(rdd))
>     .pprint())
> ssc.start()
> ## 16/10/01 21:34:30 ERROR JobScheduler: Error generating jobs for time 147535047 ms
> ## java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD cannot be cast to org.apache.spark.api.java.JavaRDD
> ##     at com.sun.proxy.$Proxy15.call(Unknown Source)
> ##
> {code}
> A dummy fix is to put {{map(lambda x: x)}} which suggests it is a problem similar to https://issues.apache.org/jira/browse/SPARK-16589
[jira] [Resolved] (SPARK-17756) java.lang.ClassCastException when using cartesian with DStream.transform
[ https://issues.apache.org/jira/browse/SPARK-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-17756. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 19498 [https://github.com/apache/spark/pull/19498] > java.lang.ClassCastException when using cartesian with DStream.transform > > > Key: SPARK-17756 > URL: https://issues.apache.org/jira/browse/SPARK-17756 > Project: Spark > Issue Type: Bug > Components: DStreams, PySpark >Affects Versions: 2.0.0, 2.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0
[jira] [Commented] (SPARK-24258) SPIP: Improve PySpark support for ML Matrix and Vector types
[ https://issues.apache.org/jira/browse/SPARK-24258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506399#comment-16506399 ] Li Jin commented on SPARK-24258: I ran into [~mengxr] and chatted about this. Seems a good first step is to have tensor type to be first-class type in Spark DataFrame. For operations, there is concerns about having to add many many linear algebra functions in Spark codebase, so it's not clear whether it's a good idea. Any thoughts? > SPIP: Improve PySpark support for ML Matrix and Vector types > > > Key: SPARK-24258 > URL: https://issues.apache.org/jira/browse/SPARK-24258 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Leif Walsh >Priority: Major > > h1. Background and Motivation: > In Spark ML ({{pyspark.ml.linalg}}), there are four column types you can > construct, {{SparseVector}}, {{DenseVector}}, {{SparseMatrix}}, and > {{DenseMatrix}}. In PySpark, you can construct one of these vectors with > {{VectorAssembler}}, and then you can run python UDFs on these columns, and > use {{toArray()}} to get numpy ndarrays and do things with them. They also > have a small native API where you can compute {{dot()}}, {{norm()}}, and a > few other things with them (I think these are computed in scala, not python, > could be wrong). > For statistical applications, having the ability to manipulate columns of > matrix and vector values (from here on, I will use the term tensor to refer > to arrays of arbitrary dimensionality, matrices are 2-tensors and vectors are > 1-tensors) would be powerful. For example, you could use PySpark to reshape > your data in parallel, assemble some matrices from your raw data, and then > run some statistical computation on them using UDFs leveraging python > libraries like statsmodels, numpy, tensorflow, and scikit-learn. 
> I propose enriching the {{pyspark.ml.linalg}} types in the following ways: > # Expand the set of column operations one can apply to tensor columns beyond > the few functions currently available on these types. Ideally, the API > should aim to be as wide as the numpy ndarray API, but would wrap Breeze > operations. For example, we should provide {{DenseVector.outerProduct()}} so > that a user could write something like {{df.withColumn("XtX", > df["X"].outerProduct(df["X"]))}}. > # Make sure all ser/de mechanisms (including Arrow) understand these types, > and faithfully represent them as natural types in all languages (in scala and > java, Breeze objects, in python, numpy ndarrays rather than the > pyspark.ml.linalg types that wrap them, in SparkR, I'm not sure what, but > something natural) when applying UDFs or collecting with {{toPandas()}}. > # Improve the construction of these types from scalar columns. The > {{VectorAssembler}} API is not very ergonomic. I propose something like > {{df.withColumn("predictors", Vector.of(df["feature1"], df["feature2"], > df["feature3"]))}}. > h1. Target Personas: > Data scientists, machine learning practitioners, machine learning library > developers. > h1. Goals: > This would allow users to do more statistical computation in Spark natively, > and would allow users to apply python statistical computation to data in > Spark using UDFs. > h1. Non-Goals: > I suppose one non-goal is to reimplement something like statsmodels using > Breeze data structures and computation. That could be seen as an effort to > enrich Spark ML itself, but is out of scope of this effort. This effort is > just to make it possible and easy to apply existing python libraries to > tensor values in parallel. > h1. Proposed API Changes: > Add the above APIs to PySpark and the other language bindings. 
I think the > list is: > # {{pyspark.ml.linalg.Vector.of(*columns)}} > # {{pyspark.ml.linalg.Matrix.of( provide this>)}} > # For each of the matrix and vector types in {{pyspark.ml.linalg}}, add more > methods like {{outerProduct}}, {{matmul}}, {{kron}}, etc. > https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.linalg.html has a > good list to look at. > Also, change python UDFs so that these tensor types are passed to the python > function not as \{Sparse,Dense\}\{Matrix,Vector\} objects that wrap > {{numpy.ndarray}}, but as {{numpy.ndarray}} objects by themselves, and > interpret return values that are {{numpy.ndarray}} objects back into the > spark types.
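As a rough illustration of what a column operation such as the proposed outerProduct would compute for each row, the sketch below uses plain Python lists in place of a DataFrame column of vectors. Nothing here is PySpark API: outerProduct is only proposed in this issue, and the names outer_product, x_column and xtx_column are hypothetical. With NumPy arrays the same per-row result could be obtained with numpy.outer(x, x).

```python
def outer_product(x, y):
    """Return the outer product of two vectors as a list of rows."""
    return [[xi * yj for yj in y] for xi in x]


# A stand-in for a column of vector values, one vector per row.
x_column = [
    [1.0, 2.0],
    [0.0, 3.0],
]

# Row-wise application: each vector is combined with itself, yielding one
# 2x2 matrix per row, as a column expression over "X" would.
xtx_column = [outer_product(x, x) for x in x_column]
```

Each input vector of length n produces an n-by-n matrix, so a vector column maps to a matrix column with one matrix per row.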
[jira] [Comment Edited] (SPARK-24467) VectorAssemblerEstimator
[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506334#comment-16506334 ] Nick Pentreath edited comment on SPARK-24467 at 6/8/18 5:59 PM: Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't think a new estimator could return the existing {{VectorAssembler}} but would probably need to return a new {{VectorAssemblerModel}}. Though perhaps the existing one can be made a Model without breaking things. was (Author: mlnick): Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't think a new estimator could return the existing {{VectorAssembler}} but would probably need to return a new {{VectorAssemblerModel}} > VectorAssemblerEstimator > > > Key: SPARK-24467 > URL: https://issues.apache.org/jira/browse/SPARK-24467 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > In [SPARK-22346], I believe I made a wrong API decision: I recommended adding > `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since > I thought the latter option would break most workflows. However, I should > have proposed: > * Add a Param to VectorAssembler for specifying the sizes of Vectors in the > inputCols. This Param can be optional. If not given, then VectorAssembler > will behave as it does now. If given, then VectorAssembler can use that info > instead of figuring out the Vector sizes via metadata or examining Rows in > the data (though it could do consistency checks). > * Add a VectorAssemblerEstimator which gets the Vector lengths from data and > produces a VectorAssembler with the vector lengths Param specified. > This will not break existing workflows. Migrating to > VectorAssemblerEstimator will be easier than adding VectorSizeHint since it > will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other > things in the future which require vector length metadata, so we could > consider keeping it rather than deprecating it.
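The two-part proposal above can be sketched in plain Python to show how the pieces would fit together. This is a toy model under stated assumptions, not Spark ML code: the classes only borrow the names from the issue, rows are dictionaries of lists rather than DataFrame rows, the sizes argument stands in for the proposed optional Param, and taking lengths from the first row is a hypothetical rule chosen for brevity.

```python
class VectorAssembler:
    """Toy stand-in for the transformer: concatenates the input vectors.

    `sizes` plays the role of the proposed optional Param. When it is None the
    assembler does no length checking, matching "behave as it does now".
    """

    def __init__(self, input_cols, sizes=None):
        self.input_cols = list(input_cols)
        self.sizes = sizes

    def transform(self, row):
        assembled = []
        for col in self.input_cols:
            vector = row[col]
            # Consistency check against the declared length, if one was given.
            if self.sizes is not None and len(vector) != self.sizes[col]:
                raise ValueError(
                    f"column {col!r}: expected length {self.sizes[col]}, "
                    f"got {len(vector)}"
                )
            assembled.extend(vector)
        return assembled


class VectorAssemblerEstimator:
    """Toy estimator: learns vector lengths from data, then builds an assembler."""

    def __init__(self, input_cols):
        self.input_cols = list(input_cols)

    def fit(self, rows):
        first = rows[0]  # hypothetical rule: read the lengths off the first row
        sizes = {col: len(first[col]) for col in self.input_cols}
        return VectorAssembler(self.input_cols, sizes)


rows = [
    {"a": [1.0, 2.0], "b": [3.0]},
    {"a": [4.0, 5.0], "b": [6.0]},
]
assembler = VectorAssemblerEstimator(["a", "b"]).fit(rows)
assembled = [assembler.transform(row) for row in rows]
```

The point of the design is visible in the last three lines: the user never types the vector lengths, yet the fitted assembler carries them and can reject rows that disagree. Whether fit should return the existing transformer or a separate model type is the open question raised in the comments on this issue.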
[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator
[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506334#comment-16506334 ] Nick Pentreath commented on SPARK-24467: Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't think a new estimator could return the existing {{VectorAssembler}} but would probably need to return a new {{VectorAssemblerModel}} > VectorAssemblerEstimator > > > Key: SPARK-24467 > URL: https://issues.apache.org/jira/browse/SPARK-24467 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major
[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL
[ https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506285#comment-16506285 ] Li Yuanjian commented on SPARK-24499: - No problem, thanks for pinging me; it's our pleasure. We'll collect and translate some internal user demos, along with some hints from migrating MR jobs to Spark. > Documentation improvement of Spark core and SQL > --- > > Key: SPARK-24499 > URL: https://issues.apache.org/jira/browse/SPARK-24499 > Project: Spark > Issue Type: New Feature > Components: Documentation, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > The current documentation in Apache Spark lacks enough code examples and > tips. If needed, we should also split the > https://spark.apache.org/docs/latest/sql-programming-guide.html page into multiple > separate pages, as we did for > https://spark.apache.org/docs/latest/ml-guide.html
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506250#comment-16506250 ] Xiao Li commented on SPARK-24498: - Thank you! > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > to compile than Janino. However, in other cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506239#comment-16506239 ] Kazuaki Ishizaki commented on SPARK-24498: -- Hi [~smilegator] Definitely, I am interested in this task. I will investigate this issue. > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > to compile than Janino. However, in other cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696
[jira] [Resolved] (SPARK-24454) ml.image doesn't have __all__ explicitly defined
[ https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-24454. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21483 [https://github.com/apache/spark/pull/21483] > ml.image doesn't have __all__ explicitly defined > > > Key: SPARK-24454 > URL: https://issues.apache.org/jira/browse/SPARK-24454 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.4.0 > > > ml/image.py doesn't have __all__ explicitly defined. It will import all > global names by default (only ImageSchema for now), which is not a good > practice. We should add __all__ to image.py.
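The effect described in SPARK-24454 can be reproduced without Spark. The sketch below is only an illustration under stated assumptions: the module source is a made-up stand-in for ml/image.py (the `os` import and the empty `ImageSchema` class are hypothetical), and `star_import_names` is a helper invented here. It shows that a star-import picks up every public global unless `__all__` narrows the list.

```python
import sys
import types


def star_import_names(source: str, module_name: str) -> list:
    """Build a throwaway module from `source` and return the names that
    `from <module_name> import *` would bring into an empty namespace."""
    module = types.ModuleType(module_name)
    exec(source, module.__dict__)
    sys.modules[module_name] = module  # make it importable by name
    try:
        namespace = {}
        exec(f"from {module_name} import *", namespace)
    finally:
        del sys.modules[module_name]
    # Drop dunder entries such as __builtins__ that exec() adds itself.
    return sorted(name for name in namespace if not name.startswith("__"))


# Hypothetical stand-in for a module with one public class and one helper import.
WITHOUT_ALL = "import os\nclass ImageSchema:\n    pass\n"
WITH_ALL = WITHOUT_ALL + "__all__ = ['ImageSchema']\n"

leaky = star_import_names(WITHOUT_ALL, "toy_image_without_all")
tidy = star_import_names(WITH_ALL, "toy_image_with_all")
print(leaky)  # the helper import leaks out alongside ImageSchema
print(tidy)   # only the name listed in __all__
```

Without `__all__`, the helper import `os` leaks into the importing namespace; with it, only `ImageSchema` is exported.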
[jira] [Assigned] (SPARK-24454) ml.image doesn't have __all__ explicitly defined
[ https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-24454: - Assignee: Hyukjin Kwon > ml.image doesn't have __all__ explicitly defined > > > Key: SPARK-24454 > URL: https://issues.apache.org/jira/browse/SPARK-24454 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.4.0 > > > ml/image.py doesn't have __all__ explicitly defined. It will import all > global names by default (only ImageSchema for now), which is not a good > practice. We should add __all__ to image.py.
[jira] [Assigned] (SPARK-24477) Import submodules under pyspark.ml by default
[ https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-24477: - Assignee: Hyukjin Kwon > Import submodules under pyspark.ml by default > - > > Key: SPARK-24477 > URL: https://issues.apache.org/jira/browse/SPARK-24477 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > Right now, we do not import submodules under pyspark.ml by default. So users > cannot do
> {code}
> from pyspark import ml
> kmeans = ml.clustering.KMeans(...)
> {code}
> I created this JIRA to discuss whether we should import the submodules by default. > It will change the behavior of
> {code}
> from pyspark.ml import *
> {code}
> But it saves users from writing extra imports.
> cc [~hyukjin.kwon]
[jira] [Resolved] (SPARK-24477) Import submodules under pyspark.ml by default
[ https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-24477. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21483 [https://github.com/apache/spark/pull/21483] > Import submodules under pyspark.ml by default > - > > Key: SPARK-24477 > URL: https://issues.apache.org/jira/browse/SPARK-24477 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > Right now, we do not import submodules under pyspark.ml by default. So users > cannot do
> {code}
> from pyspark import ml
> kmeans = ml.clustering.KMeans(...)
> {code}
> I created this JIRA to discuss whether we should import the submodules by default. > It will change the behavior of
> {code}
> from pyspark.ml import *
> {code}
> But it saves users from writing extra imports.
> cc [~hyukjin.kwon]
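The behavior discussed in SPARK-24477 is general Python packaging behavior and can be shown without pyspark. The following is a rough sketch under stated assumptions: it writes a throwaway package to a temporary directory (the package names `toy_ml_lazy` and `toy_ml_eager`, the helper `submodule_visible`, and the one-class `clustering.py` are all invented for this illustration) and checks whether `package.clustering` is reachable after a plain import of the package.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def submodule_visible(init_source: str, package: str) -> bool:
    """Create a temporary package whose __init__.py contains `init_source`
    and report whether `<package>.clustering` exists after importing only
    the package itself."""
    with tempfile.TemporaryDirectory() as root:
        pkg_dir = Path(root) / package
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text(init_source)
        (pkg_dir / "clustering.py").write_text("class KMeans:\n    pass\n")
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(package)
            return hasattr(module, "clustering")
        finally:
            # Clean up so repeated calls do not see stale state.
            sys.path.remove(root)
            stale = [n for n in sys.modules
                     if n == package or n.startswith(package + ".")]
            for name in stale:
                del sys.modules[name]


# Empty __init__.py: the submodule is not loaded, so attribute access would fail.
lazy = submodule_visible("", "toy_ml_lazy")
# __init__.py imports the submodule: it becomes an attribute of the package.
eager = submodule_visible("from . import clustering\n", "toy_ml_eager")
print(lazy, eager)
```

The second variant mirrors the change being discussed: once the package's `__init__.py` imports its submodules, `ml.clustering.KMeans(...)` style access works after importing only the package.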
[jira] [Commented] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506232#comment-16506232 ] Jiang Xingbo commented on SPARK-24492: -- It is possible that one task attempt acquired the permission to commit its output but has not finished the commit for a while. In the meantime, another attempt of the same task (e.g. a speculative run) may also ask to commit and fail with a TaskCommitDenied exception. In this case, the current behavior of retrying without counting the failure toward the task failure count can lead to the task being retried indefinitely until it gets the permission to commit, which can waste a lot of resources if the task is short. Instead, I propose to skip retrying the task attempt on a TaskCommitDenied exception, since that means another attempt is performing the commit for the same task, and we can wait until it finishes (if the commit finishes successfully then there is nothing left to do; if it fails then we can still retry). > Endless attempted task when TaskCommitDenied exception writing to S3A > - > > Key: SPARK-24492 > URL: https://issues.apache.org/jira/browse/SPARK-24492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yu-Jhe Li >Priority: Critical > Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 > 2018-05-16 上午11.10.57.png > > > Hi, when we run a Spark application under spark-2.2.0 on AWS spot instances and > output files to S3, some tasks retry endlessly and all of them fail with a > TaskCommitDenied exception. This happened when we ran the Spark application on > instances with network issues (it runs well on healthy spot instances). > Sorry, I can't find an easy way to reproduce this issue; here's all I can > provide. > The Spark UI shows (in attachments) that one task of stage 112 failed due to a > FetchFailedException (a network issue) and a retry of stage > 112 (retry 1) was attempted. 
But in stage 112 (retry 1), all tasks failed due to TaskCommitDenied exceptions and keep retrying (they never succeed and cause lots of S3 requests).
> On the other hand, the driver log shows:
> # task 123.0 in stage 112.0 failed due to FetchFailedException (a network issue caused a corrupted file)
> # a warning message from OutputCommitCoordinator
> # task 92.0 in stage 112.1 failed when writing rows
> # the failed tasks keep being retried, but never succeed
> {noformat}
> 2018-05-16 02:38:055 WARN TaskSetManager:66 - Lost task 123.0 in stage 112.0 (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message=
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
> at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80)
> at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
> at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at
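The retry behavior described in Jiang Xingbo's comment on SPARK-24492 can be made concrete with a toy model. This is a deliberately simplified sketch and not Spark's OutputCommitCoordinator: `ToyCommitCoordinator`, `count_denied_attempts`, and the notion of "ticks" are all hypothetical. It assumes attempt 0 holds the commit permission for `commit_ticks` scheduling rounds and that at most one new attempt asks to commit per round.

```python
class ToyCommitCoordinator:
    """First attempt to ask for a partition wins; everyone else is denied.
    (Toy model only, not Spark's implementation.)"""

    def __init__(self):
        self._holder = {}  # partition -> attempt number holding the permission

    def can_commit(self, partition: int, attempt: int) -> bool:
        holder = self._holder.setdefault(partition, attempt)
        return holder == attempt


def count_denied_attempts(commit_ticks: int, retry_on_denial: bool) -> int:
    """Count denied attempts while attempt 0 is still busy committing.

    retry_on_denial=True  models the current behavior in the comment:
                          a denied attempt is relaunched immediately and the
                          denial is not counted as a task failure.
    retry_on_denial=False models the proposal: after a denial, wait for the
                          permission holder instead of launching a new attempt.
    """
    coordinator = ToyCommitCoordinator()
    coordinator.can_commit(partition=0, attempt=0)  # attempt 0 wins the permission
    denied = 0
    attempt = 1
    for _ in range(commit_ticks):
        if not coordinator.can_commit(partition=0, attempt=attempt):
            denied += 1
            if not retry_on_denial:
                break  # proposal: stop relaunching, wait for the holder
            attempt += 1  # current behavior: relaunch right away
    return denied


current = count_denied_attempts(commit_ticks=1000, retry_on_denial=True)
proposed = count_denied_attempts(commit_ticks=1000, retry_on_denial=False)
print(current, proposed)
```

In this model the number of wasted attempts grows with how long the permission holder takes to finish, which matches the "endless" retries in the report when the holder never completes.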
[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL
[ https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506219#comment-16506219 ] Xiao Li commented on SPARK-24499: - Apache Spark is widely adopted in the production systems of Baidu. [~XuanYuan] Could your team lead this effort and improve the existing documentation? > Documentation improvement of Spark core and SQL > --- > > Key: SPARK-24499 > URL: https://issues.apache.org/jira/browse/SPARK-24499 > Project: Spark > Issue Type: New Feature > Components: Documentation, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > The current documentation in Apache Spark lacks enough code examples and > tips. If needed, we should also split the > https://spark.apache.org/docs/latest/sql-programming-guide.html page into multiple > separate pages, as we did for > https://spark.apache.org/docs/latest/ml-guide.html
[jira] [Created] (SPARK-24499) Documentation improvement of Spark core and SQL
Xiao Li created SPARK-24499: --- Summary: Documentation improvement of Spark core and SQL Key: SPARK-24499 URL: https://issues.apache.org/jira/browse/SPARK-24499 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li The current documentation in Apache Spark lacks enough code examples and tips. If needed, we should also split the https://spark.apache.org/docs/latest/sql-programming-guide.html page into multiple separate pages, as we did for https://spark.apache.org/docs/latest/ml-guide.html
[jira] [Updated] (SPARK-24499) Documentation improvement of Spark core and SQL
[ https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24499: Component/s: Spark Core Documentation > Documentation improvement of Spark core and SQL > --- > > Key: SPARK-24499 > URL: https://issues.apache.org/jira/browse/SPARK-24499 > Project: Spark > Issue Type: New Feature > Components: Documentation, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > The current documentation in Apache Spark lacks enough code examples and > tips. If needed, we should also split the > https://spark.apache.org/docs/latest/sql-programming-guide.html page into multiple > separate pages, as we did for > https://spark.apache.org/docs/latest/ml-guide.html
[jira] [Commented] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506183#comment-16506183 ] Marco Gaido commented on SPARK-24481: - Yes, because before SPARK-22520 code generation was disabled for large CASE WHEN expressions (in that case we fell back to interpreted execution, which has much worse performance). For your use case, you can try disabling whole-stage code generation by setting {{spark.sql.codegen.wholeStage}} to {{false}}. Or, if the only problem is logging, you can set the log level for the class {{org.apache.spark.sql.execution.WholeStageCodegenExec}} to ERROR to get rid of the warning messages. > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 and Databricks Cloud 4.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. 
Happens with large case
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
>
> var rdd = sc.parallelize(Array("""{
>   "event":
>   {
>     "timestamp": 1521086591110,
>     "event_name": "yu",
>     "page":
>     {
>       "page_url": "https://",
>       "page_name": "es"
>     },
>     "properties":
>     {
>       "id": "87",
>       "action": "action",
>       "navigate_action": "navigate_action"
>     }
>   }
> }
> """))
> var df = spark.read.json(rdd)
> df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
>   .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap(
>   "action" -> expr("coalesce(action, navigation_action) as action"),
>   "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
>   df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')")
> df.show
> {code}
>
> Exception:
> {code:java}
> 18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
> at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
> at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1444)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1523)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1520)
> at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
> at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
> at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
> at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
> at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
> at
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506159#comment-16506159 ] Xiao Li commented on SPARK-24498: - cc [~kiszk] Are you interested in this task? Could you investigate this issue? > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > to compile than Janino. However, in other cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696
[jira] [Created] (SPARK-24498) Add JDK compiler for runtime codegen
Xiao Li created SPARK-24498: --- Summary: Add JDK compiler for runtime codegen Key: SPARK-24498 URL: https://issues.apache.org/jira/browse/SPARK-24498 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li In some cases, the JDK compiler can generate smaller bytecode and take less time to compile than Janino. However, in other cases, Janino is better. We should support both for our runtime codegen. Janino will still be our default runtime codegen compiler. See the related JIRAs in DRILL: - https://issues.apache.org/jira/browse/DRILL-1155 - https://issues.apache.org/jira/browse/DRILL-4778 - https://issues.apache.org/jira/browse/DRILL-5696
[jira] [Resolved] (SPARK-24191) Scala example code for Power Iteration Clustering in Spark ML examples
[ https://issues.apache.org/jira/browse/SPARK-24191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24191. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21248 [https://github.com/apache/spark/pull/21248] > Scala example code for Power Iteration Clustering in Spark ML examples > -- > > Key: SPARK-24191 > URL: https://issues.apache.org/jira/browse/SPARK-24191 > Project: Spark > Issue Type: Documentation > Components: Documentation, Examples, ML >Affects Versions: 2.4.0 >Reporter: shahid >Assignee: shahid >Priority: Trivial > Fix For: 2.4.0 > > > We need to provide example code for Power Iteration Clustering in the Spark ML > examples.
[jira] [Resolved] (SPARK-24224) Java example code for Power Iteration Clustering in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-24224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24224. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21283 [https://github.com/apache/spark/pull/21283] > Java example code for Power Iteration Clustering in spark.ml > > > Key: SPARK-24224 > URL: https://issues.apache.org/jira/browse/SPARK-24224 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.4.0 >Reporter: shahid >Assignee: shahid >Priority: Trivial > Fix For: 2.4.0 > > > Add Java example code for Power Iteration Clustering to the spark.ml examples.
[jira] [Assigned] (SPARK-24224) Java example code for Power Iteration Clustering in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-24224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-24224: - Assignee: shahid Target Version/s: (was: 2.4.0) Priority: Trivial (was: Major) Fix Version/s: (was: 2.4.0) > Java example code for Power Iteration Clustering in spark.ml > > > Key: SPARK-24224 > URL: https://issues.apache.org/jira/browse/SPARK-24224 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.4.0 >Reporter: shahid >Assignee: shahid >Priority: Trivial > > Add Java example code for Power Iteration Clustering to the spark.ml examples.
[jira] [Assigned] (SPARK-24191) Scala example code for Power Iteration Clustering in Spark ML examples
[ https://issues.apache.org/jira/browse/SPARK-24191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-24191: - Assignee: shahid Target Version/s: (was: 2.4.0) Priority: Trivial (was: Major) Fix Version/s: (was: 2.4.0) > Scala example code for Power Iteration Clustering in Spark ML examples > -- > > Key: SPARK-24191 > URL: https://issues.apache.org/jira/browse/SPARK-24191 > Project: Spark > Issue Type: Documentation > Components: Documentation, Examples, ML >Affects Versions: 2.4.0 >Reporter: shahid >Assignee: shahid >Priority: Trivial > > We need to provide example code for Power Iteration Clustering in the Spark ML > examples.
[jira] [Created] (SPARK-24497) Support recursive SQL query
Yuming Wang created SPARK-24497: --- Summary: Support recursive SQL query Key: SPARK-24497 URL: https://issues.apache.org/jira/browse/SPARK-24497 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0 Reporter: Yuming Wang

h3. *Examples*

Here is an example of {{WITH RECURSIVE}} clause usage. Table "department" represents the structure of an organization as an adjacency list.

{code:sql}
CREATE TABLE department (
    id INTEGER PRIMARY KEY,  -- department ID
    parent_department INTEGER REFERENCES department,  -- upper department ID
    name TEXT  -- department name
);

INSERT INTO department (id, parent_department, "name")
VALUES
    (0, NULL, 'ROOT'),
    (1, 0, 'A'),
    (2, 1, 'B'),
    (3, 2, 'C'),
    (4, 2, 'D'),
    (5, 0, 'E'),
    (6, 4, 'F'),
    (7, 5, 'G');

-- department structure represented here is as follows:
--
-- ROOT-+->A-+->B-+->C
--      |         |
--      |         +->D-+->F
--      +->E-+->G
{code}

To extract all departments under A, you can use the following recursive query:

{code:sql}
WITH RECURSIVE subdepartment AS
(
    -- non-recursive term
    SELECT * FROM department WHERE name = 'A'

    UNION ALL

    -- recursive term
    SELECT d.*
    FROM department AS d
    JOIN subdepartment AS sd
        ON (d.parent_department = sd.id)
)
SELECT *
FROM subdepartment
ORDER BY name;
{code}

More details:
[http://wiki.postgresql.org/wiki/CTEReadme]
[https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
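The recursive query above can be tried today in an engine that already supports {{WITH RECURSIVE}}. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a stand-in; it says nothing about how Spark would implement the feature, and the helper names `build_department_table` and `subdepartment_names` are made up for this illustration.

```python
import sqlite3


def build_department_table() -> sqlite3.Connection:
    """Create the example 'department' adjacency list in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE department (
            id INTEGER PRIMARY KEY,                           -- department ID
            parent_department INTEGER REFERENCES department,  -- upper department ID
            name TEXT                                         -- department name
        );
        INSERT INTO department (id, parent_department, name) VALUES
            (0, NULL, 'ROOT'), (1, 0, 'A'), (2, 1, 'B'), (3, 2, 'C'),
            (4, 2, 'D'), (5, 0, 'E'), (6, 4, 'F'), (7, 5, 'G');
        """
    )
    return conn


def subdepartment_names(conn: sqlite3.Connection, root_name: str) -> list:
    """Return the names of `root_name` and every department below it."""
    rows = conn.execute(
        """
        WITH RECURSIVE subdepartment AS (
            -- non-recursive term: the starting department
            SELECT * FROM department WHERE name = ?
            UNION ALL
            -- recursive term: children of rows found so far
            SELECT d.*
            FROM department AS d
            JOIN subdepartment AS sd ON d.parent_department = sd.id
        )
        SELECT name FROM subdepartment ORDER BY name
        """,
        (root_name,),
    ).fetchall()
    return [name for (name,) in rows]


conn = build_department_table()
under_a = subdepartment_names(conn, "A")
print(under_a)
```

For the data above, the query returns A together with B, C, D, and F; E and G are excluded because they hang off ROOT rather than A.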
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506023#comment-16506023 ] Saisai Shao commented on SPARK-23534: - Yes, it is a Hadoop 2.8+ issue, but we don't have a 2.8 profile for Spark; instead we have a 3.1 profile, so it will affect our Hadoop 3 build. That's why I also added a link here. > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > > Major Hadoop vendors have already moved, or will move, to Hadoop 3.0, so we should also make > sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark > run on Hadoop 3.0. > The work includes: > # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0. > # Test to see if there are dependency issues with Hadoop 3.0. > # Investigate the feasibility of using shaded client jars (HADOOP-11804).
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506021#comment-16506021 ] Steve Loughran commented on SPARK-23534: [~jerryshao] Is that the HDFS token identifier thing? If so, it's actually a 2.8+ issue. > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > > Major Hadoop vendors have already moved, or will move, to Hadoop 3.0, so we should also make > sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark > run on Hadoop 3.0. > The work includes: > # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0. > # Test to see if there are dependency issues with Hadoop 3.0. > # Investigate the feasibility of using shaded client jars (HADOOP-11804).
[jira] [Commented] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506009#comment-16506009 ] Steve Loughran commented on SPARK-24492: The retry problem looks like an issue with the commit protocol's error handling; it probably merits a closer look. But why is the exception being raised? That's SPARK-18883: S3 list inconsistency is keeping your task attempt directory from appearing in S3 LIST calls, so the mimicked rename (list everything, COPY *, DELETE *) is failing fast. You cannot safely use S3 via the S3A connector as a destination of work without a consistency layer (HADOOP-13345) or an S3-specific committer (SPARK-23977, HADOOP-13786). Even when the task appears to succeed, the listing may have missed newly created files, so the input is incorrect. Without those or some other consistency layer (e.g. consistent EMRFS), you need to commit to a consistent store (e.g. HDFS) and then copy the results to S3 afterwards. > Endless attempted task when TaskCommitDenied exception writing to S3A > - > > Key: SPARK-24492 > URL: https://issues.apache.org/jira/browse/SPARK-24492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yu-Jhe Li >Priority: Critical > Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 > 2018-05-16 上午11.10.57.png > > > Hi, when we run a Spark application under spark-2.2.0 on AWS spot instances and > output files to S3, some tasks retry endlessly and all of them fail with a > TaskCommitDenied exception. This happened when we ran the Spark application on > instances with network issues (it runs well on healthy spot instances). > Sorry, I can't find an easy way to reproduce this issue; here's all I can > provide. > The Spark UI shows (in attachments) that one task of stage 112 failed due to a > FetchFailedException (a network issue) and a retry of stage > 112 (retry 1) was attempted. 
But in stage 112 (retry 1), all task failed due to > TaskCommitDenied exception, and keep retry (it never succeed and cause lots > of S3 requests). > On the other side, driver logs shows: > # task 123.0 in stage 112.0 failed due to FetchFailedException (network > issue cause corrupted file) > # warning message from OutputCommitCoordinator > # task 92.0 in stage 112.1 failed when writing rows > # keep retry the failed tasks, but never succeed > {noformat} > 2018-05-16 02:38:055 WARN TaskSetManager:66 - Lost task 123.0 in stage 112.0 > (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, > 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message= > org.apache.spark.shuffle.FetchFailedException: Stream is corrupted > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) > at > 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at
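The "mimicked rename" hazard described in the comment above (list everything, COPY, DELETE) can be illustrated with a toy model. This is only a minimal sketch in plain Java, not the S3A connector's code: the class `ListingLagDemo`, its methods, and the object keys are all hypothetical, and the lagging listing is simulated with a flag rather than real S3 behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model (hypothetical, not S3A code): GET/PUT are consistent, LIST may lag.
class ListingLagDemo {
    private final Map<String, String> objects = new TreeMap<>(); // visible to GET
    private final Set<String> listed = new TreeSet<>();          // visible to LIST

    /** Stores an object; visibleInList=false simulates a listing that has not caught up. */
    void put(String key, String data, boolean visibleInList) {
        objects.put(key, data);
        if (visibleInList) {
            listed.add(key);
        }
    }

    boolean exists(String key) {
        return objects.containsKey(key);
    }

    /** Returns a snapshot of the keys under prefix that LIST currently reports. */
    List<String> list(String prefix) {
        List<String> keys = new ArrayList<>();
        for (String key : listed) {
            if (key.startsWith(prefix)) {
                keys.add(key);
            }
        }
        return keys;
    }

    /** Mimicked rename: LIST the source, then COPY and DELETE each listed key. */
    int rename(String src, String dst) {
        List<String> keys = list(src);
        for (String key : keys) {
            String target = dst + key.substring(src.length());
            put(target, objects.get(key), true); // COPY
            objects.remove(key);                 // DELETE
            listed.remove(key);
        }
        return keys.size();
    }

    public static void main(String[] args) {
        ListingLagDemo store = new ListingLagDemo();
        store.put("tmp/attempt_0/part-0", "rows 0-99", true);
        store.put("tmp/attempt_0/part-1", "rows 100-199", false); // not yet listed
        int moved = store.rename("tmp/attempt_0/", "out/");
        System.out.println("moved=" + moved
            + " part-0 committed=" + store.exists("out/part-0")
            + " part-1 committed=" + store.exists("out/part-1"));
    }
}
```

In this model the rename reports success after moving one object, while `part-1` is silently left behind in the attempt directory. That is the "task appears to succeed but the listing missed files" case from the comment.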
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-24492: --- Summary: Endless attempted task when TaskCommitDenied exception writing to S3A (was: Endless attempted task when TaskCommitDenied exception)
[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-24476: --- Priority: Minor (was: Major) > java.net.SocketTimeoutException: Read timed out under jets3t while running > the Spark Structured Streaming > - > > Key: SPARK-24476 > URL: https://issues.apache.org/jira/browse/SPARK-24476 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: bharath kumar avusherla >Priority: Minor > Attachments: socket-timeout-exception > > > We are working on spark streaming application using spark structured > streaming with checkpointing in s3. When we start the application, the > application runs just fine for sometime then it crashes with the error > mentioned below. The amount of time it will run successfully varies from time > to time, sometimes it will run for 2 days without any issues then crashes, > sometimes it will crash after 4hrs/ 24hrs. > Our streaming application joins(left and inner) multiple sources from kafka > and also s3 and aurora database. > Can you please let us know how to solve this problem? > Is it possible to somehow tweak the SocketTimeout-Time? > Here, I'm pasting the few line of complete exception log below. Also attached > the complete exception to the issue. 
> *_Exception:_* > *_Caused by: java.net.SocketTimeoutException: Read timed out_* > _at java.net.SocketInputStream.socketRead0(Native Method)_ > _at java.net.SocketInputStream.read(SocketInputStream.java:150)_ > _at java.net.SocketInputStream.read(SocketInputStream.java:121)_ > _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_ > _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_ > _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_ > _at > sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_ > _at > sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_ > _at > sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_ > _at > org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_ > _at > org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-24476: --- Summary: java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming (was: java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0)
[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-24476: --- Component/s: (was: Spark Core) Structured Streaming
[jira] [Commented] (SPARK-24476) java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506002#comment-16506002 ] Steve Loughran commented on SPARK-24476: Switch from s3n to the s3a connector, see if it goes away...if not look at the hadoop s3a documentation about tuning this stuff.
[jira] [Commented] (SPARK-24496) CLONE - JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.
[ https://issues.apache.org/jira/browse/SPARK-24496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506001#comment-16506001 ] SHAILENDRA SHAHANE commented on SPARK-24496: This issue is still there. I tried to fetch data from MongoDB and got the following exception while converting the RDD to DF.
Code:
{code:java}
SQLContext sparkSQLContext = spark.sqlContext();
DataFrameReader dfr = spark.read()
    .format("com.mongodb.spark.sql")
    .option("floatAsBigDecimal", "true");
Dataset rbkp = dfr.load();
{code}
OR
{code:java}
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaMongoRDD rdd = MongoSpark.load(jsc);
Dataset rbkp = rdd.toDF();
{code}
Spark version: 2.3
MongoDB version: 3.4 and 3.6
Data sample:
{"_id":"5b0d31f892549e10b61d962a","RSEG_MANDT":"800","RSEG_EBELN":"4500017749","RSEG_EBELP":"00020","RSEG_BELNR":"11","RSEG_BUZEI":"02","RSEG_GJAHR":"2013","RBKP_BUDAT":"2013-10-04","RSEG_MENGE":{"$numberDecimal":"30.000"},"RSEG_LFBNR":"500472","RSEG_LFGJA":"2013","RSEG_LFPOS":"0002","NOT_ACCOUNT_MAINTENANCE":{"$numberDecimal":"1.00"},"RBKP_CPUTIMESTAMP":"2013-10-04T10:32:02.000Z","RBKP_WAERS":"USD","RSEG_BNKAN":{"$numberDecimal":"0.00"},"RSEG_WRBTR":{"$numberDecimal":"2340.00"},"RSEG_SHKZG":"S"}
> CLONE - JSON data source fails to infer floats as decimal when precision is > bigger than 38 or scale is bigger than precision. > - > > Key: SPARK-24496 > URL: https://issues.apache.org/jira/browse/SPARK-24496 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: SHAILENDRA SHAHANE >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.0.0 > > Attachments: SparkJiraIssue08062018.txt > > > Currently, JSON data source supports {{floatAsBigDecimal}} option, which > reads floats as {{DecimalType}}. > I noticed there are several restrictions in Spark {{DecimalType}} below: > 1. The precision cannot be bigger than 38. > 2. scale cannot be bigger than precision. > However, with the option above, it reads {{BigDecimal}} which does not follow > the conditions above.
> This could be observed as below: > {code} > def simpleFloats: RDD[String] = > sqlContext.sparkContext.parallelize( > """{"a": 0.01}""" :: > """{"a": 0.02}""" :: Nil) > val jsonDF = sqlContext.read > .option("floatAsBigDecimal", "true") > .json(simpleFloats) > jsonDF.printSchema() > {code} > throws an exception below: > {code} > org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater > than precision (1).; > at org.apache.spark.sql.types.DecimalType.(DecimalType.scala:44) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > ... > {code} > Since JSON data source infers {{DataType}} as {{StringType}} when it fails to > infer, it might have to be inferred as {{StringType}} or maybe just simply > {{DoubleType}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24496) CLONE - JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.
[ https://issues.apache.org/jira/browse/SPARK-24496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHAILENDRA SHAHANE updated SPARK-24496: --- Attachment: SparkJiraIssue08062018.txt
[jira] [Created] (SPARK-24496) CLONE - JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.
SHAILENDRA SHAHANE created SPARK-24496:
--
Summary: CLONE - JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.
Key: SPARK-24496
URL: https://issues.apache.org/jira/browse/SPARK-24496
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: SHAILENDRA SHAHANE
Assignee: Hyukjin Kwon
Fix For: 2.0.0

Currently, JSON data source supports {{floatAsBigDecimal}} option, which reads floats as {{DecimalType}}.
I noticed there are several restrictions in Spark {{DecimalType}} below:
1. The precision cannot be bigger than 38.
2. scale cannot be bigger than precision.
However, with the option above, it reads {{BigDecimal}} which does not follow the conditions above.
This could be observed as below:
{code}
def simpleFloats: RDD[String] =
  sqlContext.sparkContext.parallelize(
    """{"a": 0.01}""" ::
    """{"a": 0.02}""" :: Nil)

val jsonDF = sqlContext.read
  .option("floatAsBigDecimal", "true")
  .json(simpleFloats)
jsonDF.printSchema()
{code}
throws an exception below:
{code}
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
  at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
  at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
  at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
  at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59)
  at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249)
  at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57)
  at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
  at scala.collection.Iterator$class.foreach(Iterator.scala:742)
  ...
{code}
Since JSON data source infers {{DataType}} as {{StringType}} when it fails to infer, it might have to be inferred as {{StringType}} or maybe just simply {{DoubleType}}.
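The "scale (2) cannot be greater than precision (1)" message in the report follows directly from how `java.math.BigDecimal` represents a literal such as 0.01: the unscaled value is 1 (one digit of precision) and the scale is 2. The following is a minimal sketch in plain Java, independent of Spark. The class `DecimalShape` and the helper `fitsDecimalType` are hypothetical names, and the limit of 38 is taken from the report rather than from Spark's source.

```java
import java.math.BigDecimal;

// Hypothetical helper; it only restates the two restrictions listed in the report.
class DecimalShape {
    // Upper bound on precision as stated in the report.
    static final int MAX_PRECISION = 38;

    /** Returns true when the value's precision and scale satisfy both restrictions. */
    static boolean fitsDecimalType(BigDecimal value) {
        return value.precision() <= MAX_PRECISION && value.scale() <= value.precision();
    }

    public static void main(String[] args) {
        for (String literal : new String[] {"0.01", "0.02", "1.25"}) {
            BigDecimal value = new BigDecimal(literal);
            System.out.printf("%s -> precision=%d scale=%d fits=%b%n",
                literal, value.precision(), value.scale(), fitsDecimalType(value));
        }
    }
}
```

For `0.01` and `0.02` this reports precision 1 and scale 2, so the second restriction is violated; `1.25` has precision 3 and scale 2 and passes.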
[jira] [Created] (SPARK-24495) SortMergeJoin with duplicate keys wrong results
Bogdan Raducanu created SPARK-24495:
---
Summary: SortMergeJoin with duplicate keys wrong results
Key: SPARK-24495
URL: https://issues.apache.org/jira/browse/SPARK-24495
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Bogdan Raducanu

To reproduce:
{code:java}
// the bug is in SortMergeJoin but the Shuffles are correct. with the default 200 it might split
// the data in such small partitions that the SortMergeJoin cannot return wrong results anymore
spark.conf.set("spark.sql.shuffle.partitions", "1")
// disable this, otherwise it would filter results before join, hiding the bug
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
sql("select id as a1 from range(1000)").createOrReplaceTempView("t1")
sql("select id * 2 as b1, -id as b2 from range(1000)").createOrReplaceTempView("t2")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
sql("""select b1, a1, b2 FROM t1 INNER JOIN t2 ON b1 = a1 AND b2 = a1""").show
{code}
In the results, it's expected that all columns are equal (see join condition).
But the result is:
{code:java}
+---+---+---+
| b1| a1| b2|
+---+---+---+
|  0|  0|  0|
|  2|  2| -1|
|  4|  4| -2|
|  6|  6| -3|
|  8|  8| -4|
{code}
I traced it to {{EnsureRequirements.reorder}} which was introduced by [https://github.com/apache/spark/pull/16985] and [https://github.com/apache/spark/pull/20041]
It leads to an incorrect plan:
{code:java}
== Physical Plan ==
*(5) Project [b1#735672L, a1#735669L, b2#735673L]
+- *(5) SortMergeJoin [a1#735669L, a1#735669L], [b1#735672L, b1#735672L], Inner
   :- *(2) Sort [a1#735669L ASC NULLS FIRST, a1#735669L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(a1#735669L, a1#735669L, 1)
   :     +- *(1) Project [id#735670L AS a1#735669L]
   :        +- *(1) Range (0, 1000, step=1, splits=8)
   +- *(4) Sort [b1#735672L ASC NULLS FIRST, b2#735673L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(b1#735672L, b2#735673L, 1)
         +- *(3) Project [(id#735674L * 2) AS b1#735672L, -id#735674L AS b2#735673L]
            +- *(3) Range (0, 1000, step=1, splits=8)
{code}
The SortMergeJoin keys are wrong: key b2 is missing completely.
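The expected output of the reproduction can be checked without Spark by evaluating the join condition directly. This is a minimal sketch in plain Java, assuming `range(1000)` yields ids 0 to 999 as in the report. The class `ExpectedJoin` is a hypothetical name and the nested loop is a brute-force stand-in for the join, not Spark's algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Brute-force reference for: t1 INNER JOIN t2 ON b1 = a1 AND b2 = a1
class ExpectedJoin {
    /** Returns the rows (b1, a1, b2) that satisfy both join predicates. */
    static List<long[]> join(int rows) {
        List<long[]> out = new ArrayList<>();
        for (long a1 = 0; a1 < rows; a1++) {       // t1: a1 = id
            for (long id = 0; id < rows; id++) {   // t2: b1 = id * 2, b2 = -id
                long b1 = id * 2;
                long b2 = -id;
                if (b1 == a1 && b2 == a1) {
                    out.add(new long[] {b1, a1, b2});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (long[] row : join(1000)) {
            System.out.println(Arrays.toString(row)); // prints [0, 0, 0]
        }
    }
}
```

Only the row with all three columns equal to 0 satisfies both predicates, since `2 * id = -id` holds only for `id = 0`. Rows such as `(2, 2, -1)` in the reported output therefore cannot come from a correct join.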
[jira] [Resolved] (SPARK-24119) Add interpreted execution to SortPrefix expression
[ https://issues.apache.org/jira/browse/SPARK-24119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-24119. --- Resolution: Fixed Assignee: Bruce Robbins Fix Version/s: 2.4.0 > Add interpreted execution to SortPrefix expression > -- > > Key: SPARK-24119 > URL: https://issues.apache.org/jira/browse/SPARK-24119 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.4.0 > > > [~hvanhovell] [~kiszk] > I noticed SortPrefix did not support interpreted execution when I was testing > the PR for SPARK-24043. Somehow it was not covered by the umbrella Jira for > adding interpreted execution (SPARK-23580) > Since I had to implement interpreted execution for SortPrefix to complete > testing, I am creating this Jira. If there's no good reason why eval wasn't > implemented, I will make the PR in a few days.
[jira] [Created] (SPARK-24494) Give users possibility to skip own classes in SparkContext.getCallSite()
Florian Kaspar created SPARK-24494: -- Summary: Give users possibility to skip own classes in SparkContext.getCallSite() Key: SPARK-24494 URL: https://issues.apache.org/jira/browse/SPARK-24494 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 2.2.0 Environment: any (currently using Spark 2.2.0 in both local mode and with YARN) Reporter: Florian Kaspar org.apache.spark.SparkContext.getCallSite() uses org.apache.spark.util.Utils.getCallSite() which by default skips Spark and Scala classes. It would be nice to be able to add user-defined classes/patterns to skip here as well. We have one central class that acts as some kind of proxy for RDD transformations and is used by many other classes. In the SparkUI we would like to see the real origin of a transformation being not the proxy but its caller. This would make the UI even more useful to us.
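The behaviour requested above can be sketched as a stack walk that skips frames by class-name prefix. This is only an illustration in plain Java and not Spark's `Utils.getCallSite` implementation. The class `CallSiteDemo`, the method `firstUserFrame`, and the `com.example` class names are all hypothetical, and the stack is hand-built so the result does not depend on the running JVM.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; this is not Spark's Utils.getCallSite implementation.
class CallSiteDemo {
    /** Returns the first frame whose class name starts with none of the skip prefixes, or null. */
    static StackTraceElement firstUserFrame(StackTraceElement[] stack, List<String> skipPrefixes) {
        for (StackTraceElement frame : stack) {
            String className = frame.getClassName();
            boolean skipped = skipPrefixes.stream().anyMatch(className::startsWith);
            if (!skipped) {
                return frame;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hand-built stack, innermost frame first; all class names are made up for illustration.
        StackTraceElement[] stack = {
            new StackTraceElement("org.apache.spark.rdd.RDD", "map", "RDD.scala", 1),
            new StackTraceElement("com.example.RddProxy", "map", "RddProxy.java", 42),
            new StackTraceElement("com.example.ReportJob", "run", "ReportJob.java", 17),
        };
        List<String> defaults = Arrays.asList("org.apache.spark.", "scala.");
        List<String> withProxy =
            Arrays.asList("org.apache.spark.", "scala.", "com.example.RddProxy");

        System.out.println(firstUserFrame(stack, defaults).getClassName());  // com.example.RddProxy
        System.out.println(firstUserFrame(stack, withProxy).getClassName()); // com.example.ReportJob
    }
}
```

With only the default prefixes the proxy class is reported as the call site. Adding the proxy to the skip list moves the reported origin to its caller, which is what the wish asks for.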
[jira] [Resolved] (SPARK-24487) Add support for RabbitMQ.
[ https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao resolved SPARK-24487.
---------------------------------
Resolution: Won't Fix

> Add support for RabbitMQ.
> -------------------------
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Michał Jurkiewicz
> Priority: Major
>
> Add support for RabbitMQ.
[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.
[ https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505786#comment-16505786 ]

Saisai Shao commented on SPARK-24487:
-------------------------------------

I'm sure the Spark community will not merge this into the Spark code base. You can either maintain a package yourself or contribute it to Apache Bahir.

> Add support for RabbitMQ.
> -------------------------
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Michał Jurkiewicz
> Priority: Major
>
> Add support for RabbitMQ.
[jira] [Updated] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao updated SPARK-23151:
--------------------------------
Issue Type: Sub-task (was: Dependency upgrade)
Parent: SPARK-23534

> Provide a distribution of Spark with Hadoop 3.0
> -----------------------------------------------
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 2.2.0, 2.2.1
> Reporter: Louis Burke
> Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark package
> only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication is
> that using up to date Kinesis libraries alongside s3 causes a clash w.r.t
> aws-java-sdk.
[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505783#comment-16505783 ]

Saisai Shao commented on SPARK-23151:
-------------------------------------

I will convert this to a sub-task of SPARK-23534.

> Provide a distribution of Spark with Hadoop 3.0
> -----------------------------------------------
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
> Issue Type: Dependency upgrade
> Components: Spark Core
> Affects Versions: 2.2.0, 2.2.1
> Reporter: Louis Burke
> Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark package
> only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication is
> that using up to date Kinesis libraries alongside s3 causes a clash w.r.t
> aws-java-sdk.
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505771#comment-16505771 ]

Saisai Shao commented on SPARK-23534:
-------------------------------------

Spark with Hadoop 3 will fail to renew tokens in the long-running case (with keytab and principal).

> Spark run on Hadoop 3.0.0
> -------------------------
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 2.3.0
> Reporter: Saisai Shao
> Priority: Major
>
> Major Hadoop vendors have already moved, or will move, to Hadoop 3.0, so we should
> make sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark
> run on Hadoop 3.0.
> The work includes:
> # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
> # Test whether there are dependency issues with Hadoop 3.0.
> # Investigate the feasibility of using shaded client jars (HADOOP-11804).
[jira] [Comment Edited] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505768#comment-16505768 ]

Saisai Shao edited comment on SPARK-24493 at 6/8/18 6:56 AM:
-------------------------------------------------------------

Adding more background: the issue occurs when Spark is built with Hadoop 2.8+.

In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to get the renew interval for the HDFS token, but because the service loader file is missing, Hadoop fails to create the {{TokenIdentifier}} and returns *null*, which leads to an unexpected renew interval (Long.MaxValue).

The related code in Hadoop is:
{code}
private static Class getClassForIdentifier(Text kind) {
  Class cls = null;
  synchronized (Token.class) {
    if (tokenKindMap == null) {
      // Lazily build the kind -> class map from the registered providers.
      tokenKindMap = Maps.newHashMap();
      for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
        tokenKindMap.put(id.getKind(), id.getClass());
      }
    }
    cls = tokenKindMap.get(kind);
  }
  if (cls == null) {
    LOG.debug("Cannot find class for token kind " + kind);
    return null;
  }
  return cls;
}
{code}

The problem is: the HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, but the service loader description file "META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the hadoop-hdfs jar. The Spark local submit process/driver process (depending on client or cluster mode) relies only on the hadoop-hdfs-client jar, not on the hadoop-hdfs jar. So the ServiceLoader fails to find the HDFS "DelegationTokenIdentifier" class and the lookup returns null.

The issue is due to the change in HADOOP-6200. Previously we only had build profiles for Hadoop 2.6 and 2.7, so there was no issue. But we now have a build profile for Hadoop 3.1, so token renewal will fail with Hadoop 3.1.

This is a Hadoop issue; I am creating a Spark Jira to track it and to bump the version when the Hadoop side is fixed.

was (Author: jerryshao): Adding more background: the issue occurs when Spark is built with Hadoop 2.8+.
In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to get the renew interval for the HDFS token, but because the service loader file is missing, Hadoop fails to create the {{TokenIdentifier}} and returns *null*, which leads to an unexpected renew interval (Long.MaxValue).

The related code in Hadoop is:

  private static Class getClassForIdentifier(Text kind) {
    Class cls = null;
    synchronized (Token.class) {
      if (tokenKindMap == null) {
        tokenKindMap = Maps.newHashMap();
        for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
          tokenKindMap.put(id.getKind(), id.getClass());
        }
      }
      cls = tokenKindMap.get(kind);
    }
    if (cls == null) {
      LOG.debug("Cannot find class for token kind " + kind);
      return null;
    }
    return cls;
  }

The problem is: the HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, but the service loader description file "META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the hadoop-hdfs jar. The Spark local submit process/driver process (depending on client or cluster mode) relies only on the hadoop-hdfs-client jar, not on the hadoop-hdfs jar. So the ServiceLoader fails to find the HDFS "DelegationTokenIdentifier" class and the lookup returns null.

The issue is due to the change in HADOOP-6200. Previously we only had build profiles for Hadoop 2.6 and 2.7, so there was no issue. But we now have a build profile for Hadoop 3.1, so token renewal will fail with Hadoop 3.1.

This is a Hadoop issue; I am creating a Spark Jira to track it and to bump the version when the Hadoop side is fixed.

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --------------------------------------------------------------
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, YARN
> Affects Versions: 2.3.0
> Reporter: Asif M
> Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job.
I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, ,
[jira] [Commented] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505768#comment-16505768 ]

Saisai Shao commented on SPARK-24493:
-------------------------------------

Adding more background: the issue occurs when Spark is built with Hadoop 2.8+.

In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to get the renew interval for the HDFS token, but because the service loader file is missing, Hadoop fails to create the {{TokenIdentifier}} and returns *null*, which leads to an unexpected renew interval (Long.MaxValue).

The related code in Hadoop is:

  private static Class getClassForIdentifier(Text kind) {
    Class cls = null;
    synchronized (Token.class) {
      if (tokenKindMap == null) {
        tokenKindMap = Maps.newHashMap();
        for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
          tokenKindMap.put(id.getKind(), id.getClass());
        }
      }
      cls = tokenKindMap.get(kind);
    }
    if (cls == null) {
      LOG.debug("Cannot find class for token kind " + kind);
      return null;
    }
    return cls;
  }

The problem is: the HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, but the service loader description file "META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the hadoop-hdfs jar. The Spark local submit process/driver process (depending on client or cluster mode) relies only on the hadoop-hdfs-client jar, not on the hadoop-hdfs jar. So the ServiceLoader fails to find the HDFS "DelegationTokenIdentifier" class and the lookup returns null.

The issue is due to the change in HADOOP-6200. Previously we only had build profiles for Hadoop 2.6 and 2.7, so there was no issue. But we now have a build profile for Hadoop 3.1, so token renewal will fail with Hadoop 3.1.

This is a Hadoop issue; I am creating a Spark Jira to track it and to bump the version when the Hadoop side is fixed.
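The failure mode described in the comment above can be reproduced in isolation with plain `java.util.ServiceLoader`. The sketch below is only an illustration, not Hadoop code: `DemoTokenIdentifier`, `DemoDelegationTokenIdentifier` and `classForKind` are hypothetical stand-ins, and the example assumes no `META-INF/services` file registers the demo interface. The implementation class is on the classpath and can be instantiated directly, yet the kind-based lookup finds nothing because the registration file is absent.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

final class ServiceLoaderLookupDemo {
    /** Hypothetical stand-in for Hadoop's TokenIdentifier. */
    public interface DemoTokenIdentifier {
        String getKind();
    }

    /** Present on the classpath, but deliberately not listed in any META-INF/services file. */
    public static final class DemoDelegationTokenIdentifier implements DemoTokenIdentifier {
        @Override
        public String getKind() {
            return "HDFS_DELEGATION_TOKEN";
        }
    }

    /**
     * Builds a kind -> class map from whatever providers ServiceLoader can
     * discover, then looks up the requested kind. Returns null if no
     * registered provider reports that kind.
     */
    static Class<? extends DemoTokenIdentifier> classForKind(String kind) {
        Map<String, Class<? extends DemoTokenIdentifier>> kinds = new HashMap<>();
        for (DemoTokenIdentifier id : ServiceLoader.load(DemoTokenIdentifier.class)) {
            kinds.put(id.getKind(), id.getClass());
        }
        return kinds.get(kind);
    }

    public static void main(String[] args) {
        // The class itself is loadable and usable.
        DemoTokenIdentifier direct = new DemoDelegationTokenIdentifier();
        System.out.println("direct kind: " + direct.getKind());

        // Discovery by kind depends on the registration file, not on the class being present.
        System.out.println("lookup result: " + classForKind(direct.getKind()));
    }
}
```

A caller that treats a null lookup as "no information" and falls back to a default is how a missing registration file can surface as an unexpected value far from the actual cause, so checking for null at the call site is worth considering.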
> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3 > -- > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, , executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, > renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, > sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 > 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+ > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1445) > at org.apache.hadoop.ipc.Client.call(Client.java:1355) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at >
[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24493: Summary: Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3 (was: Kerberos Ticket Renewal is failing in long running Spark job) > Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3 > -- > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, , executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, > renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, > sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 > 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+ > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1445) > at org.apache.hadoop.ipc.Client.call(Client.java:1355) > at > 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) > at > 
org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189) > at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.
[ https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505760#comment-16505760 ]

Saisai Shao commented on SPARK-24487:
-------------------------------------

You can add it in your own package using the DataSource API. There is no point in adding it to the Spark code base.

> Add support for RabbitMQ.
> -------------------------
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Michał Jurkiewicz
> Priority: Major
>
> Add support for RabbitMQ.
[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.
[ https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505759#comment-16505759 ]

Michał Jurkiewicz commented on SPARK-24487:
-------------------------------------------

I meant adding support for RabbitMQ to Spark Structured Streaming and Spark Streaming. From what I can see, only Kafka is supported.
[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]
https://spark.apache.org/docs/latest/streaming-programming-guide.html

> Add support for RabbitMQ.
> -------------------------
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Michał Jurkiewicz
> Priority: Major
>
> Add support for RabbitMQ.
[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24493: Component/s: Spark Core > Kerberos Ticket Renewal is failing in long running Spark job > > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, , executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, > renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, > sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 > 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+ > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1445) > at org.apache.hadoop.ipc.Client.call(Client.java:1355) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86) > at > 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189) > at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at
[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24493: Component/s: (was: Spark Core) YARN > Kerberos Ticket Renewal is failing in long running Spark job > > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, , executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, > renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, > sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 > 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+ > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1445) > at org.apache.hadoop.ipc.Client.call(Client.java:1355) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86) > at > 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189) > at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao updated SPARK-24493:
    Priority: Major  (was: Blocker)

> Kerberos Ticket Renewal is failing in long running Spark job
>
>                 Key: SPARK-24493
>                 URL: https://issues.apache.org/jira/browse/SPARK-24493
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.3.0
>            Reporter: Asif M
>            Priority: Major
>
> Kerberos ticket renewal is failing in a long-running Spark job. I added the
> two Kerberos properties below to the HDFS configuration and ran a Spark
> Streaming job
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]):
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=1800000 (30min)
> dfs.namenode.delegation.token.renew-interval=900000 (15min)
> {noformat}
>
> The Spark job failed at 15 min with the error below:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 (TID 7290, , executor 1): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 18:56:51,276+0000 expected renewal time: 2018-06-04 18:56:13,875+0000
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1445)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1355)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>     at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>     at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
>     at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
>     at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
>     at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:186)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at