[jira] [Created] (SPARK-22769) When driver stopping, there is problem: RpcEnv already stopped
KaiXinXIaoLei created SPARK-22769: - Summary: When driver stopping, there is problem: RpcEnv already stopped Key: SPARK-22769 URL: https://issues.apache.org/jira/browse/SPARK-22769 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: KaiXinXIaoLei I run "spark-sql --master yarn --num-executors 1000 -f createTable.sql". When the task is finished, there is an error: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. I think the log level should be warning, not error. {noformat} 17/12/12 18:20:44 INFO MemoryStore: MemoryStore cleared 17/12/12 18:20:44 INFO BlockManager: BlockManager stopped 17/12/12 18:20:44 INFO BlockManagerMaster: BlockManagerMaster stopped 17/12/12 18:20:44 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message. org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:152) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570) at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
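The reporter's point is that an `RpcEnvStoppedException` raised while the driver is already shutting down is an expected race, not a fault, so it warrants WARN rather than ERROR. A minimal pure-Python sketch of that log-level policy (the function name `handle_one_way_message` and the stand-in exception class are invented here; this is not Spark's actual dispatcher code):

```python
import logging

class RpcEnvStoppedException(Exception):
    """Stand-in for org.apache.spark.rpc.RpcEnvStoppedException (invented here)."""

logger = logging.getLogger("rpc")

def handle_one_way_message(dispatch, message):
    """Dispatch a one-way RPC message, demoting the expected shutdown race.

    A message arriving after the RpcEnv has stopped is routine during driver
    shutdown, so it is logged at WARNING; any other failure stays an ERROR.
    """
    try:
        dispatch(message)
    except RpcEnvStoppedException as exc:
        logger.warning("RpcEnv already stopped; dropping one-way message: %s", exc)
    except Exception:
        logger.error("Error while invoking RpcHandler#receive() for one-way message.",
                     exc_info=True)
```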
[jira] [Resolved] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided
[ https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22042. - Resolution: Fixed Assignee: Tejas Patil Fix Version/s: 2.3.0 > ReorderJoinPredicates can break when child's partitioning is not decided > > > Key: SPARK-22042 > URL: https://issues.apache.org/jira/browse/SPARK-22042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.3.0 > > > When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its > children, the children may not be properly constructed as the child-subtree > still has to go through other planner rules. > In this particular case, the child is `SortMergeJoinExec`. Since the required > `Exchange` operators are not in place (because `EnsureRequirements` runs > _after_ `ReorderJoinPredicates`), the join's children would not have > partitioning defined. This breaks while creating the `PartitioningCollection` > here: > https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69 > Small repro: > {noformat} > context.sql("SET spark.sql.autoBroadcastJoinThreshold=0") > val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", > "k") > df.write.format("parquet").saveAsTable("table1") > df.write.format("parquet").saveAsTable("table2") > df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table") > sql(""" > SELECT * > FROM ( > SELECT a.i, a.j, a.k > FROM bucketed_table a > JOIN table1 b > ON a.i = b.i > ) c > JOIN table2 > ON c.i = table2.i > """).explain > {noformat} > This fails with: > {noformat} > java.lang.IllegalArgumentException: requirement failed: > PartitioningCollection requires all of its partitionings have the same > numPartitions. 
> at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:324) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69) > at > org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201) > at > 
org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201) > at > org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201) > at > org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:147) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91) > at org.apache.spark.sql.Dataset.explain(Dataset.scala:464) > at
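The `require` that fires here demands that every partitioning in a `PartitioningCollection` report the same `numPartitions`; before `EnsureRequirements` has inserted the exchanges, the join's children disagree. A rough Python sketch of that guard, with partitionings modeled as plain dicts (an assumption made purely for illustration; Spark's real class works on `Partitioning` objects):

```python
def partitioning_collection(partitionings):
    """Sketch of the require() guard at the top of PartitioningCollection.

    Each partitioning is modeled as a dict carrying 'num_partitions'.
    Raises ValueError, mirroring Scala's IllegalArgumentException, when
    the children report differing partition counts.
    """
    nums = {p["num_partitions"] for p in partitionings}
    if len(nums) != 1:
        raise ValueError(
            "requirement failed: PartitioningCollection requires all of its "
            "partitionings have the same numPartitions."
        )
    return list(partitionings)
```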
[jira] [Resolved] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN
[ https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22700. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19894 [https://github.com/apache/spark/pull/19894] > Bucketizer.transform incorrectly drops row containing NaN > - > > Key: SPARK-22700 > URL: https://issues.apache.org/jira/browse/SPARK-22700 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0, 2.3.0 >Reporter: zhengruifeng > Fix For: 2.3.0 > > > {code} > import org.apache.spark.ml.feature._ > val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, > Double.NaN))).toDF("a", "b") > val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity) > val bucketizer: Bucketizer = new > Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits) > bucketizer.setHandleInvalid("skip") > scala> df.show > +---+---+ > | a| b| > +---+---+ > |2.3|3.0| > |NaN|3.0| > |6.7|NaN| > +---+---+ > scala> bucketizer.transform(df).show > +---+---+---+ > | a| b| aa| > +---+---+---+ > |2.3|3.0|0.0| > +---+---+---+ > {code} > When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly > dropped, though column 'b' is not an input column
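The intended semantics, which the fix restores, can be modeled in a few lines of plain Python: with `handleInvalid="skip"`, only a NaN in the *input* column should drop a row, so the `(6.7, NaN)` row must survive because column `b` is not an input column. The toy `bucketize` below uses invented names and dict rows for illustration; the real implementation is Spark ML's Scala `Bucketizer`:

```python
import bisect
import math

def bucketize(rows, input_col, splits, handle_invalid="error"):
    """Toy model of Bucketizer.transform for a single input column.

    rows is a list of dicts and splits an ascending list of bucket
    boundaries. With handle_invalid='skip', only a NaN in input_col
    drops a row -- NaN in any other column is kept.
    """
    out = []
    for row in rows:
        value = row[input_col]
        if math.isnan(value):
            if handle_invalid == "skip":
                continue  # drop the row only for NaN in the input column
            raise ValueError("NaN seen in input column %r" % input_col)
        bucket = bisect.bisect_right(splits, value) - 1
        out.append({**row, input_col + "_bucket": float(bucket)})
    return out
```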
[jira] [Assigned] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN
[ https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-22700: -- Assignee: zhengruifeng > Bucketizer.transform incorrectly drops row containing NaN > - > > Key: SPARK-22700 > URL: https://issues.apache.org/jira/browse/SPARK-22700 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0, 2.3.0 >Reporter: zhengruifeng >Assignee: zhengruifeng > Fix For: 2.3.0 > > > {code} > import org.apache.spark.ml.feature._ > val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, > Double.NaN))).toDF("a", "b") > val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity) > val bucketizer: Bucketizer = new > Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits) > bucketizer.setHandleInvalid("skip") > scala> df.show > +---+---+ > | a| b| > +---+---+ > |2.3|3.0| > |NaN|3.0| > |6.7|NaN| > +---+---+ > scala> bucketizer.transform(df).show > +---+---+---+ > | a| b| aa| > +---+---+---+ > |2.3|3.0|0.0| > +---+---+---+ > {code} > When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly > dropped, though column 'b' is not an input column
[jira] [Updated] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19809: Component/s: (was: Input/Output) SQL > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun > Fix For: 2.3.0 > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >
[jira] [Resolved] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19809. - Resolution: Fixed > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun > Fix For: 2.3.0 > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at >
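The NPE stems from split generation assuming every ORC file has a readable footer, which a zero-byte file lacks. One defensive pattern is to filter empty files out before computing splits; the sketch below is an illustration of that idea only, not the actual fix, whose details live in the ORC input format:

```python
import os

def nonempty_files(paths):
    """Filter out zero-byte files before computing input splits.

    A 0-byte ORC file has no footer, and split generation that assumes
    one is present can dereference null; skipping such files up front
    avoids the crash. (Sketch only -- not the real Spark/Hive code.)
    """
    return [p for p in paths if os.path.getsize(p) > 0]
```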
[jira] [Commented] (SPARK-20849) Document R DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-20849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288740#comment-16288740 ] Apache Spark commented on SPARK-20849: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/19963 > Document R DecisionTree > --- > > Key: SPARK-20849 > URL: https://issues.apache.org/jira/browse/SPARK-20849 > Project: Spark > Issue Type: Improvement > Components: Documentation, SparkR >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 2.3.0 > > > add sparkr example for {{decisionTree}}
[jira] [Resolved] (SPARK-22644) Make ML testsuite support StructuredStreaming test
[ https://issues.apache.org/jira/browse/SPARK-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22644. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19843 [https://github.com/apache/spark/pull/19843] > Make ML testsuite support StructuredStreaming test > -- > > Key: SPARK-22644 > URL: https://issues.apache.org/jira/browse/SPARK-22644 > Project: Spark > Issue Type: Test > Components: ML >Affects Versions: 2.2.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.3.0 > > > We need to add some helper code to make testing ML transformers & models > easier with streaming data. These tests might help us catch any remaining > issues and we could encourage future PRs to use these tests to prevent new > Models & Transformers from having issues.
[jira] [Updated] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prateek updated SPARK-22711: Affects Version/s: 2.2.1 > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo distributed installation of Spark 2.2.0 >Reporter: Prateek > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a Pyspark program with the spark-submit command, this error is > thrown. > It happens with code like the following: > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > Traceback (most recent call last): > File "/home/prateek/Project/textrank.py", line 299, in > summaryRDD = sentenceTokensReduceRDD.map(lambda m: > get_summary(m)).reduceByKey(lambda c,v :c+v) > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, > in reduceByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, > in combineByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, > in partitionBy > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, > in _jrdd > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, > in _wrap_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, > in _prepare_for_python_RDD > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 460, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 704, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 148, in dump > 
File "/usr/lib/python3.5/pickle.py", line 408, in dump > self.save(obj) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File 
"/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File
[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288722#comment-16288722 ] Prateek commented on SPARK-22711: - This is not linear regression, do you want me to send the code? > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.2.0 > Environment: Ubuntu pseudo distributed installation of Spark 2.2.0 >Reporter: Prateek > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a Pyspark program with spark-submit command this error is > thrown. > It happens when for code like below > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > Traceback (most recent call last): > File "/home/prateek/Project/textrank.py", line 299, in > summaryRDD = sentenceTokensReduceRDD.map(lambda m: > get_summary(m)).reduceByKey(lambda c,v :c+v) > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, > in reduceByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, > in combineByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, > in partitionBy > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, > in _jrdd > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, > in _wrap_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, > in _prepare_for_python_RDD > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 460, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 704, in dumps > File 
"/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 148, in dump > File "/usr/lib/python3.5/pickle.py", line 408, in dump > self.save(obj) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > 
f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list
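For background on this class of failure: PySpark serializes closures with cloudpickle precisely because the standard-library pickler serializes functions by reference (module plus qualified name), so lambdas and nested functions, like the `lambda c,v :c+v` in the report, cannot round-trip through plain `pickle` at all. A small stdlib-only probe illustrating the limitation (the helper names are invented for this sketch; it does not reproduce the cloudpickle bug itself):

```python
import pickle

def add_one(x):
    """A module-level function: the stdlib pickler can serialize it by reference."""
    return x + 1

def can_pickle(obj):
    """Return True if the standard-library pickler can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
```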
[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288714#comment-16288714 ] Prateek commented on SPARK-22711: - With regular Python it works fine. I also tried with the latest Spark, 2.2.1. > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.2.0 > Environment: Ubuntu pseudo distributed installation of Spark 2.2.0 >Reporter: Prateek > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a PySpark program with the spark-submit command, this error is > thrown. > It happens for code like the following: > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > Traceback (most recent call last): > File "/home/prateek/Project/textrank.py", line 299, in > summaryRDD = sentenceTokensReduceRDD.map(lambda m: > get_summary(m)).reduceByKey(lambda c,v :c+v) > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, > in reduceByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, > in combineByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, > in partitionBy > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, > in _jrdd > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, > in _wrap_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, > in _prepare_for_python_RDD > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 460, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 704, in dumps > File 
"/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 148, in dump > File "/usr/lib/python3.5/pickle.py", line 408, in dump > self.save(obj) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > 
f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770,
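For context on why cloudpickle shows up throughout this traceback (a hedged sketch, not a reproduction of this exact bug): the standard-library pickle cannot serialize lambdas at all, which is why PySpark routes closures such as `lambda m: function_x(m)` through its bundled cloudpickle. The PicklingError reported in this issue is raised deeper, inside cloudpickle's `save_function` path, when a class seen on the driver does not match the class cloudpickle reconstructs.

```python
import pickle

# Standard pickle refuses to serialize a lambda, so PySpark cannot use it
# for RDD closures; it ships cloudpickle instead. This only illustrates the
# baseline limitation, not the __newobj__ class-mismatch bug itself.
try:
    pickle.dumps(lambda m: m + 1)
    raised = False
except Exception:  # PicklingError or AttributeError depending on context
    raised = True
print(raised)  # prints: True
```

A common workaround for errors in this area is to move the function (and any classes it closes over) into a module that is shipped to executors via `--py-files`, so that plain pickle-by-reference can be used instead of serializing the closure body.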
[jira] [Resolved] (SPARK-22768) Spark SQL alter table and beeline alter table Metadata is different
[ https://issues.apache.org/jira/browse/SPARK-22768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22768. --- Resolution: Invalid It's very unclear what you're asking, but this should start on the mailing list. > Spark SQL alter table and beeline alter table Metadata is different > --- > > Key: SPARK-22768 > URL: https://issues.apache.org/jira/browse/SPARK-22768 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Spark 2.2 + hadoop 2.6 + hivemeta (2) >Reporter: 吴志龙 > > Steps to reproduce (I have two Hive metastore servers): > 1. Create a new table with 2 fields in beeline (connected to hivemeta1). > 2. Alter the table in spark-sql so it has 3 fields (connected to hivemeta2). > 3. Query the table in beeline: it throws > java.lang.IndexOutOfBoundsException: toIndex = 3, and desc table does not find the > columns. > Questions: 1. beeline metadata is not synchronized > 2. Hive metastore data synchronization problem > 3. This is a very serious problem > This issue does not solve my problem: > https://issues.apache.org/jira/browse/SPARK-18355 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22768) Spark SQL alter table and beeline alter table Metadata is different
吴志龙 created SPARK-22768: --- Summary: Spark SQL alter table and beeline alter table Metadata is different Key: SPARK-22768 URL: https://issues.apache.org/jira/browse/SPARK-22768 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Environment: Spark 2.2 + hadoop 2.6 + hivemeta (2) Reporter: 吴志龙 Steps to reproduce (I have two Hive metastore servers): 1. Create a new table with 2 fields in beeline (connected to hivemeta1). 2. Alter the table in spark-sql so it has 3 fields (connected to hivemeta2). 3. Query the table in beeline: it throws java.lang.IndexOutOfBoundsException: toIndex = 3, and desc table does not find the columns. Questions: 1. beeline metadata is not synchronized 2. Hive metastore data synchronization problem 3. This is a very serious problem This issue does not solve my problem: https://issues.apache.org/jira/browse/SPARK-18355 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22767) use ctx.addReferenceObj in InSet and ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-22767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22767: Assignee: Apache Spark (was: Wenchen Fan) > use ctx.addReferenceObj in InSet and ScalaUDF > - > > Key: SPARK-22767 > URL: https://issues.apache.org/jira/browse/SPARK-22767 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22767) use ctx.addReferenceObj in InSet and ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-22767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22767: Assignee: Wenchen Fan (was: Apache Spark) > use ctx.addReferenceObj in InSet and ScalaUDF > - > > Key: SPARK-22767 > URL: https://issues.apache.org/jira/browse/SPARK-22767 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22767) use ctx.addReferenceObj in InSet and ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-22767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288670#comment-16288670 ] Apache Spark commented on SPARK-22767: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19962 > use ctx.addReferenceObj in InSet and ScalaUDF > - > > Key: SPARK-22767 > URL: https://issues.apache.org/jira/browse/SPARK-22767 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22767) use ctx.addReferenceObj in InSet and ScalaUDF
Wenchen Fan created SPARK-22767: --- Summary: use ctx.addReferenceObj in InSet and ScalaUDF Key: SPARK-22767 URL: https://issues.apache.org/jira/browse/SPARK-22767 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22496) beeline display operation log
[ https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288642#comment-16288642 ] Apache Spark commented on SPARK-22496: -- User 'ChenjunZou' has created a pull request for this issue: https://github.com/apache/spark/pull/19961 > beeline display operation log > - > > Key: SPARK-22496 > URL: https://issues.apache.org/jira/browse/SPARK-22496 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: StephenZou >Priority: Minor > Fix For: 2.3.0 > > > For now,when end user runs queries in beeline or in hue through STS, > no logs are displayed, end user will wait until the job finishes or fails. > Progress information is needed to inform end users how the job is running if > they are not familiar with yarn RM or standalone spark master ui. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22600) Fix 64kb limit for deeply nested expressions under wholestage codegen
[ https://issues.apache.org/jira/browse/SPARK-22600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22600: --- Assignee: Liang-Chi Hsieh > Fix 64kb limit for deeply nested expressions under wholestage codegen > - > > Key: SPARK-22600 > URL: https://issues.apache.org/jira/browse/SPARK-22600 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > This is an extension of SPARK-22543 to fix 64kb compile error for deeply > nested expressions under wholestage codegen. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22600) Fix 64kb limit for deeply nested expressions under wholestage codegen
[ https://issues.apache.org/jira/browse/SPARK-22600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22600. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19813 [https://github.com/apache/spark/pull/19813] > Fix 64kb limit for deeply nested expressions under wholestage codegen > - > > Key: SPARK-22600 > URL: https://issues.apache.org/jira/browse/SPARK-22600 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > Fix For: 2.3.0 > > > This is an extension of SPARK-22543 to fix 64kb compile error for deeply > nested expressions under wholestage codegen. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288623#comment-16288623 ] Dongjoon Hyun commented on SPARK-19809: --- Since Hive 1.2.1 library code path still has this problem, users may hit this when spark.sql.hive.convertMetastoreOrc=false. However, after SPARK-22279, Apache Spark with the default configuration doesn't hit this bug. The PR adds a test coverage for `convertMetastoreOrc=true (default)` on both `native` and `hive` ORC implementation in order to prevent regression. > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun > Fix For: 2.3.0 > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at
[jira] [Assigned] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19809: Assignee: Apache Spark (was: Dongjoon Hyun) > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Apache Spark > Fix For: 2.3.0 > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > 
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288622#comment-16288622 ] Apache Spark commented on SPARK-19809: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19960 > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun > Fix For: 2.3.0 > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
[jira] [Assigned] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19809: Assignee: Dongjoon Hyun (was: Apache Spark) > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun > Fix For: 2.3.0 > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > 
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
[jira] [Resolved] (SPARK-22716) revisit ctx.addReferenceObj
[ https://issues.apache.org/jira/browse/SPARK-22716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22716. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19916 [https://github.com/apache/spark/pull/19916] > revisit ctx.addReferenceObj > --- > > Key: SPARK-22716 > URL: https://issues.apache.org/jira/browse/SPARK-22716 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > Fix For: 2.3.0 > > > `ctx.addReferenceObj` creates a global variable just to refer to an object, > which seems like overkill. We should revisit it and always use > `ctx.addReferenceMinorObj` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22716) Avoid the creation of mutable states in addReferenceObj
[ https://issues.apache.org/jira/browse/SPARK-22716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22716: --- Assignee: Marco Gaido > Avoid the creation of mutable states in addReferenceObj > --- > > Key: SPARK-22716 > URL: https://issues.apache.org/jira/browse/SPARK-22716 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Marco Gaido > Fix For: 2.3.0 > > > `ctx.addReferenceObj` creates a global variable just to refer to an object, > which seems like overkill. We should revisit it and always use > `ctx.addReferenceMinorObj`
[jira] [Updated] (SPARK-22716) Avoid the creation of mutable states in addReferenceObj
[ https://issues.apache.org/jira/browse/SPARK-22716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22716: Summary: Avoid the creation of mutable states in addReferenceObj (was: revisit ctx.addReferenceObj) > Avoid the creation of mutable states in addReferenceObj > --- > > Key: SPARK-22716 > URL: https://issues.apache.org/jira/browse/SPARK-22716 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > Fix For: 2.3.0 > > > `ctx.addReferenceObj` creates a global variable just to refer to an object, > which seems like overkill. We should revisit it and always use > `ctx.addReferenceMinorObj`
[jira] [Commented] (SPARK-22755) Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return different results
[ https://issues.apache.org/jira/browse/SPARK-22755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288608#comment-16288608 ] Kevin Zhang commented on SPARK-22755: - [~ksunitha] Thanks, it helps a lot > Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return > different results > - > > Key: SPARK-22755 > URL: https://issues.apache.org/jira/browse/SPARK-22755 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kevin Zhang > Fix For: 2.2.1 > > > Both of the following SQL statements > {code:sql} > select ((946-885)*1.000/946 < 0.1) > {code} > and > {code:sql} > select ((946-885)*1.0/946 < 0.100) > {code} > return true, while the following statement > {code:sql} > select ((946-885)*1.0/946 < 0.1) > {code} > returns false
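For context on the quoted expressions: under exact decimal arithmetic all three comparisons should evaluate to true, since (946-885)*1.0/946 is roughly 0.0645. A quick check with Python's decimal module (an illustrative sketch, not Spark's own decimal implementation) confirms the expected value:

```python
from decimal import Decimal

# (946 - 885) * 1.0 / 946 = 61 / 946, roughly 0.0645
val = (Decimal(946) - Decimal(885)) * Decimal("1.0") / Decimal(946)
print(val)                   # about 0.0644820...
print(val < Decimal("0.1"))  # True under exact arithmetic
```

The differing Spark results appear to come from how the precision and scale of the decimal division result are derived from the literal's scale (1.0 vs 1.000), which is what the comment thread above discusses.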
[jira] [Reopened] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-19809: - > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun > Fix For: 2.3.0 > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at >
[jira] [Updated] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive
[ https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-22574: --- Fix Version/s: (was: 2.2.2) (was: 2.3.0) > Wrong request causing Spark Dispatcher going inactive > - > > Key: SPARK-22574 > URL: https://issues.apache.org/jira/browse/SPARK-22574 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Submit >Affects Versions: 2.2.0 >Reporter: German Schiavon Matteo >Assignee: German Schiavon Matteo >Priority: Minor > > Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher > puts the Dispatcher in a bad state and makes it inactive as a Mesos framework. > The class CreateSubmissionRequest initialises its arguments to null as follows: > {code:title=CreateSubmissionRequest.scala|borderStyle=solid} > var appResource: String = null > var mainClass: String = null > var appArgs: Array[String] = null > var sparkProperties: Map[String, String] = null > var environmentVariables: Map[String, String] = null > {code} > Some of these variables are checked, but not all of them; for example, > appArgs and environmentVariables are not. > If _appArgs_ is not set, the following error occurs: > {code:title=error|borderStyle=solid} > 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers. 
> Exception in thread "Thread-22" java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) > {code} > This happens because the code accesses the field without checking whether it is null.
[jira] [Reopened] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive
[ https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-22574: > Wrong request causing Spark Dispatcher going inactive > - > > Key: SPARK-22574 > URL: https://issues.apache.org/jira/browse/SPARK-22574 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Submit >Affects Versions: 2.2.0 >Reporter: German Schiavon Matteo >Assignee: German Schiavon Matteo >Priority: Minor > > Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher > puts the Dispatcher in a bad state and makes it inactive as a Mesos framework. > The class CreateSubmissionRequest initialises its arguments to null as follows: > {code:title=CreateSubmissionRequest.scala|borderStyle=solid} > var appResource: String = null > var mainClass: String = null > var appArgs: Array[String] = null > var sparkProperties: Map[String, String] = null > var environmentVariables: Map[String, String] = null > {code} > Some of these variables are checked, but not all of them; for example, > appArgs and environmentVariables are not. > If _appArgs_ is not set, the following error occurs: > {code:title=error|borderStyle=solid} > 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers. 
> Exception in thread "Thread-22" java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) > {code} > This happens because the code accesses the field without checking whether it is null.
[jira] [Assigned] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive
[ https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22574: -- Assignee: (was: German Schiavon Matteo) > Wrong request causing Spark Dispatcher going inactive > - > > Key: SPARK-22574 > URL: https://issues.apache.org/jira/browse/SPARK-22574 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Submit >Affects Versions: 2.2.0 >Reporter: German Schiavon Matteo >Priority: Minor > > Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher > puts the Dispatcher in a bad state and makes it inactive as a Mesos framework. > The class CreateSubmissionRequest initialises its arguments to null as follows: > {code:title=CreateSubmissionRequest.scala|borderStyle=solid} > var appResource: String = null > var mainClass: String = null > var appArgs: Array[String] = null > var sparkProperties: Map[String, String] = null > var environmentVariables: Map[String, String] = null > {code} > Some of these variables are checked, but not all of them; for example, > appArgs and environmentVariables are not. > If _appArgs_ is not set, the following error occurs: > {code:title=error|borderStyle=solid} > 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers. 
> Exception in thread "Thread-22" java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) > {code} > This happens because the code accesses the field without checking whether it is null.
[jira] [Commented] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive
[ https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288506#comment-16288506 ] Marcelo Vanzin commented on SPARK-22574: (Commit above was reverted.) > Wrong request causing Spark Dispatcher going inactive > - > > Key: SPARK-22574 > URL: https://issues.apache.org/jira/browse/SPARK-22574 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Submit >Affects Versions: 2.2.0 >Reporter: German Schiavon Matteo >Priority: Minor > > Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher > puts the Dispatcher in a bad state and makes it inactive as a Mesos framework. > The class CreateSubmissionRequest initialises its arguments to null as follows: > {code:title=CreateSubmissionRequest.scala|borderStyle=solid} > var appResource: String = null > var mainClass: String = null > var appArgs: Array[String] = null > var sparkProperties: Map[String, String] = null > var environmentVariables: Map[String, String] = null > {code} > Some of these variables are checked, but not all of them; for example, > appArgs and environmentVariables are not. > If _appArgs_ is not set, the following error occurs: > {code:title=error|borderStyle=solid} > 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers. 
> Exception in thread "Thread-22" java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) > {code} > This happens because the code accesses the field without checking whether it is null.
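The reports above all reduce to unvalidated request fields reaching the scheduler. A minimal sketch of the kind of validation the issue calls for (in Python purely for illustration; the real Dispatcher code is Scala, and only the field names are taken from the quoted CreateSubmissionRequest):

```python
def validate_submission(req: dict) -> None:
    """Reject a CreateSubmissionRequest-like payload whose fields are unset (None)."""
    required = ["appResource", "mainClass", "appArgs",
                "sparkProperties", "environmentVariables"]
    missing = [name for name in required if req.get(name) is None]
    if missing:
        # Fail fast with a client-facing error instead of a scheduler-side NPE
        raise ValueError("missing required fields: " + ", ".join(missing))

ok = {"appResource": "app.jar", "mainClass": "Main", "appArgs": [],
      "sparkProperties": {}, "environmentVariables": {}}
validate_submission(ok)  # passes: empty collections are fine, None is not
```

Validating at submission time keeps a single bad request from wedging the whole framework, which is the failure mode the issue describes.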
[jira] [Assigned] (SPARK-22766) Install R linter package in spark lib directory
[ https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22766: Assignee: Apache Spark > Install R linter package in spark lib directory > --- > > Key: SPARK-22766 > URL: https://issues.apache.org/jira/browse/SPARK-22766 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Hossein Falaki >Assignee: Apache Spark > > The {{dev/lint-r.R}} script uses devtools to install the {{jimhester/lintr}} > package in the default site library location, which is > {{/usr/local/lib/R/site-library}}. This is not recommended and can fail > because we run this script as jenkins while that directory is owned > by root. > We need to install the linter package in a local directory.
[jira] [Commented] (SPARK-22766) Install R linter package in spark lib directory
[ https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288428#comment-16288428 ] Apache Spark commented on SPARK-22766: -- User 'falaki' has created a pull request for this issue: https://github.com/apache/spark/pull/19959 > Install R linter package in spark lib directory > --- > > Key: SPARK-22766 > URL: https://issues.apache.org/jira/browse/SPARK-22766 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Hossein Falaki > > The {{dev/lint-r.R}} script uses devtools to install the {{jimhester/lintr}} > package in the default site library location, which is > {{/usr/local/lib/R/site-library}}. This is not recommended and can fail > because we run this script as jenkins while that directory is owned > by root. > We need to install the linter package in a local directory.
[jira] [Assigned] (SPARK-22766) Install R linter package in spark lib directory
[ https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22766: Assignee: (was: Apache Spark) > Install R linter package in spark lib directory > --- > > Key: SPARK-22766 > URL: https://issues.apache.org/jira/browse/SPARK-22766 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Hossein Falaki > > The {{dev/lint-r.R}} script uses devtools to install the {{jimhester/lintr}} > package in the default site library location, which is > {{/usr/local/lib/R/site-library}}. This is not recommended and can fail > because we run this script as jenkins while that directory is owned > by root. > We need to install the linter package in a local directory.
[jira] [Commented] (SPARK-21087) CrossValidator, TrainValidationSplit should collect all models when fitting: Scala API
[ https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288426#comment-16288426 ] Apache Spark commented on SPARK-21087: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/19958 > CrossValidator, TrainValidationSplit should collect all models when fitting: > Scala API > -- > > Key: SPARK-21087 > URL: https://issues.apache.org/jira/browse/SPARK-21087 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu > Fix For: 2.3.0 > > > We add a parameter controlling whether to collect the full model list during > CrossValidator/TrainValidationSplit training (the default is NOT to collect, to > avoid OOM from the change). > Add a method to CrossValidatorModel/TrainValidationSplitModel that allows the user to > get the model list. > CrossValidatorModelWriter adds an "option" that allows the user to control whether to > persist the model list to disk. > Note: when persisting the model list, use indices as the sub-model path
[jira] [Created] (SPARK-22766) Install R linter package in spark lib directory
Hossein Falaki created SPARK-22766: -- Summary: Install R linter package in spark lib directory Key: SPARK-22766 URL: https://issues.apache.org/jira/browse/SPARK-22766 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.1 Reporter: Hossein Falaki The {{dev/lint-r.R}} script uses devtools to install the {{jimhester/lintr}} package in the default site library location, which is {{/usr/local/lib/R/site-library}}. This is not recommended and can fail because we run this script as jenkins while that directory is owned by root. We need to install the linter package in a local directory.
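One way to meet that last requirement (a hedged sketch only; the directory layout and the use of R_LIBS_USER are assumptions, not the change in the linked PR) is to create a writable library under SPARK_HOME and point R at it before installing:

```python
import os

# Hypothetical Spark-local R library path (falls back to the current directory)
local_lib = os.path.join(os.environ.get("SPARK_HOME", os.getcwd()), "R", "lib")
os.makedirs(local_lib, exist_ok=True)

# R consults R_LIBS_USER for per-user package installs, so no root access is needed
os.environ["R_LIBS_USER"] = local_lib
print(local_lib)
# The lint script would then install into it, e.g.:
#   Rscript -e 'devtools::install_github("jimhester/lintr")'
```

Because the library lives inside the Spark checkout, a Jenkins user can write to it without touching the root-owned site library.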
[jira] [Commented] (SPARK-22722) Test Coverage for Type Coercion Compatibility
[ https://issues.apache.org/jira/browse/SPARK-22722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288425#comment-16288425 ] Xiao Li commented on SPARK-22722: - We need to run the same test cases in Hive. Thus, please ensure the queries you added can be executed in Hive with very trivial changes. > Test Coverage for Type Coercion Compatibility > - > > Key: SPARK-22722 > URL: https://issues.apache.org/jira/browse/SPARK-22722 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Yuming Wang > > Hive compatibility is pretty important for the users who run or migrate both > Hive and Spark SQL. > We plan to add a SQLConf for type coercion compatibility > (spark.sql.typeCoercion.mode). Users can choose Spark's native mode (default) > or Hive mode (hive). > Before we deliver the Hive compatibility mode, we plan to write a set of test > cases that can be easily run in both Spark and Hive sides. We can easily > compare whether they are the same or not. When new typeCoercion rules are > added, we also can easily track the changes. These test cases can also be > backported to the previous Spark versions for determining the changes we > made.
[jira] [Commented] (SPARK-22722) Test Coverage for Type Coercion Compatibility
[ https://issues.apache.org/jira/browse/SPARK-22722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288422#comment-16288422 ] Xiao Li commented on SPARK-22722: - [~q79969786] I assigned this ticket to you. Thanks for your work! > Test Coverage for Type Coercion Compatibility > - > > Key: SPARK-22722 > URL: https://issues.apache.org/jira/browse/SPARK-22722 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Yuming Wang > > Hive compatibility is pretty important for the users who run or migrate both > Hive and Spark SQL. > We plan to add a SQLConf for type coercion compatibility > (spark.sql.typeCoercion.mode). Users can choose Spark's native mode (default) > or Hive mode (hive). > Before we deliver the Hive compatibility mode, we plan to write a set of test > cases that can be easily run in both Spark and Hive sides. We can easily > compare whether they are the same or not. When new typeCoercion rules are > added, we also can easily track the changes. These test cases can also be > backported to the previous Spark versions for determining the changes we > made.
[jira] [Assigned] (SPARK-22722) Test Coverage for Type Coercion Compatibility
[ https://issues.apache.org/jira/browse/SPARK-22722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-22722: --- Assignee: Yuming Wang > Test Coverage for Type Coercion Compatibility > - > > Key: SPARK-22722 > URL: https://issues.apache.org/jira/browse/SPARK-22722 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Yuming Wang > > Hive compatibility is pretty important for the users who run or migrate both > Hive and Spark SQL. > We plan to add a SQLConf for type coercion compatibility > (spark.sql.typeCoercion.mode). Users can choose Spark's native mode (default) > or Hive mode (hive). > Before we deliver the Hive compatibility mode, we plan to write a set of test > cases that can be easily run in both Spark and Hive sides. We can easily > compare whether they are the same or not. When new typeCoercion rules are > added, we also can easily track the changes. These test cases can also be > backported to the previous Spark versions for determining the changes we > made.
[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR
[ https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288414#comment-16288414 ] Xuefu Zhang commented on SPARK-22765: - I wouldn't say that MR is static, at least not static in Spark's sense. MR allocates an executor for each map or reduce task and the executor exits when the running task completes. This would avoid the inefficiency in dynamic allocation where an executor has to slowly die out (0 idle time doesn't really work). Secondly, this proposal doesn't really intend to go back to the MR paradigm. Instead, I propose a scheduling scheme similar to MR but enhanced to fit into Spark's DAG execution model. To be clear, the proposal here is not to replace dynamic allocation. Rather, it provides an alternative that's more efficiency-centric than dynamic allocation. I understand there are a lot of details lacking, but I'd like to start a discussion now and hopefully something concrete will come out soon. > Create a new executor allocation scheme based on that of MR > --- > > Key: SPARK-22765 > URL: https://issues.apache.org/jira/browse/SPARK-22765 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: Xuefu Zhang > > Many users migrating their workload from MR to Spark find a significant > resource consumption hike (e.g., SPARK-22683). While this might not be a > concern for users that are more performance centric, for others conscious > about cost, such a hike creates a migration obstacle. This situation can get > worse as more users are moving to the cloud. > Dynamic allocation makes it possible for Spark to be deployed in a multi-tenant > environment. With its performance-centric design, its inefficiency has also > unfortunately shown up, especially when compared with MR. Thus, it's believed > that an MR-styled scheduler still has its merit. 
Based on our research, the > inefficiency associated with dynamic allocation comes in many aspects such as > executors idling out, bigger executors, many stages (rather than only 2 stages > in MR) in a Spark job, etc. > Rather than fine-tuning dynamic allocation for efficiency, the proposal here > is to add a new, efficiency-centric scheduling scheme based on that of MR. > Such an MR-based scheme can be further enhanced and better adapted to Spark's > execution model. This alternative is expected to offer a good performance > improvement over MR while achieving similar or even better efficiency. > Inputs are greatly welcome!
[jira] [Assigned] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-19809: Assignee: Dongjoon Hyun > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun > Fix For: 2.3.0 > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at >
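Before the in-reader fix landed (Spark 2.3.0, pull request 19948), one workaround was to keep zero-byte files out of the input path list before handing it to the ORC reader. A minimal sketch of that idea in plain Scala — the helper name and the `.orc` suffix filter are illustrative, not part of any Spark API:

```scala
import java.nio.file.{Files, Path}

// Keep only non-empty .orc files under `dir`; the zero-byte ones are
// what trips OrcInputFormat's split computation with an NPE.
// (Hypothetical helper; with HDFS you would use FileSystem.listStatus
// and FileStatus.getLen instead of java.nio.)
def nonEmptyOrcFiles(dir: Path): Vector[Path] = {
  val listing = Files.list(dir)
  var kept = Vector.empty[Path]
  try {
    listing.forEach { p =>
      if (p.toString.endsWith(".orc") && Files.size(p) > 0) kept :+= p
    }
  } finally listing.close()
  kept
}
```

On versions with the fix, none of this filtering is needed.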
[jira] [Resolved] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19809. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19948 [https://github.com/apache/spark/pull/19948] > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid > Fix For: 2.3.0 > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288391#comment-16288391 ] Michael Armbrust commented on SPARK-20928: -- An update on this. We've started to create subtasks to break down the process of adding this new execution mode, and we are targeting an alpha version in 2.3. The basics of the new API for Sources and Sinks (for both microbatch and continuous mode) have been up for a few days if people want to see more details. We'll follow with PRs to add a continuous execution engine and an implementation of a continuous Kafka connector. Regarding some of the questions: - The version we are targeting for 2.3 will only support map operations and thus will not support shuffles / aggregating by windows (although the window() operator is just a projection so will work for window assignment). - I think the API is designed in such a way that we can build a streaming shuffle that aligns on epochIds in the future, allowing us to easily extend the continuous engine to handle stateful operations as well. > SPIP: Continuous Processing Mode for Structured Streaming > - > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust > Labels: SPIP > Attachments: Continuous Processing in Structured Streaming Design > Sketch.pdf > > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. 
> For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user-requested change in parallelism). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22764) Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"
[ https://issues.apache.org/jira/browse/SPARK-22764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22764: Assignee: (was: Apache Spark) > Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons" > -- > > Key: SPARK-22764 > URL: https://issues.apache.org/jira/browse/SPARK-22764 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Saw this in a PR builder: > {noformat} > [info] - Cancelling stages/jobs with custom reasons. *** FAILED *** (135 > milliseconds) > [info] Expected exception org.apache.spark.SparkException to be thrown, but > no exception was thrown (SparkContextSuite.scala:531) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:822) > [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) > {noformat} > From the logs, the job is finishing before the test code cancels it: > {noformat} > 17/12/12 11:00:41.680 Executor task launch worker for task 1 INFO Executor: > Finished task 0.0 in stage 1.0 (TID 1). 
703 bytes result sent to driver > 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSetManager: Finished task > 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (executor driver) (1/1) > 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSchedulerImpl: Removed > TaskSet 1.0, whose tasks have all completed, from pool > 17/12/12 11:00:41.681 dag-scheduler-event-loop INFO DAGScheduler: ResultStage > 1 (apply at Assertions.scala:805) finished in 0.066 s > 17/12/12 11:00:41.681 pool-1-thread-1-ScalaTest-running-SparkContextSuite > INFO DAGScheduler: Job 1 finished: apply at Assertions.scala:805, took > 0.066946 s > 17/12/12 11:00:41.682 spark-listener-group-shared INFO DAGScheduler: Asked to > cancel job 1 > {noformat}
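The race above can be closed by making the task block until the test has actually issued the cancellation, so the job can no longer finish first. A sketch of that pattern with plain JDK primitives — names are illustrative, and the real suite would do the cancel from a SparkListener rather than a bare thread:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Gate released only after the "test" side has requested the cancel.
val cancelRequested = new CountDownLatch(1)
var outcome = "not run"

// The "task" parks on the gate instead of finishing immediately.
val task = new Thread(() => {
  cancelRequested.await(10, TimeUnit.SECONDS)
  outcome = "cancelled before finishing"
})
task.start()

// Test side: request the cancellation first, then unblock the task.
cancelRequested.countDown()
task.join() // join gives us visibility of `outcome` without volatile
```

With this ordering enforced, the expected SparkException can no longer be skipped by a fast job.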
[jira] [Assigned] (SPARK-22764) Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"
[ https://issues.apache.org/jira/browse/SPARK-22764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22764: Assignee: Apache Spark > Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons" > -- > > Key: SPARK-22764 > URL: https://issues.apache.org/jira/browse/SPARK-22764 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > Saw this in a PR builder: > {noformat} > [info] - Cancelling stages/jobs with custom reasons. *** FAILED *** (135 > milliseconds) > [info] Expected exception org.apache.spark.SparkException to be thrown, but > no exception was thrown (SparkContextSuite.scala:531) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:822) > [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) > {noformat} > From the logs, the job is finishing before the test code cancels it: > {noformat} > 17/12/12 11:00:41.680 Executor task launch worker for task 1 INFO Executor: > Finished task 0.0 in stage 1.0 (TID 1). 
703 bytes result sent to driver > 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSetManager: Finished task > 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (executor driver) (1/1) > 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSchedulerImpl: Removed > TaskSet 1.0, whose tasks have all completed, from pool > 17/12/12 11:00:41.681 dag-scheduler-event-loop INFO DAGScheduler: ResultStage > 1 (apply at Assertions.scala:805) finished in 0.066 s > 17/12/12 11:00:41.681 pool-1-thread-1-ScalaTest-running-SparkContextSuite > INFO DAGScheduler: Job 1 finished: apply at Assertions.scala:805, took > 0.066946 s > 17/12/12 11:00:41.682 spark-listener-group-shared INFO DAGScheduler: Asked to > cancel job 1 > {noformat}
[jira] [Assigned] (SPARK-21867) Support async spilling in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21867: Assignee: Apache Spark > Support async spilling in UnsafeShuffleWriter > - > > Key: SPARK-21867 > URL: https://issues.apache.org/jira/browse/SPARK-21867 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Apache Spark >Priority: Minor > Attachments: Async ShuffleExternalSorter.pdf > > > Currently, Spark tasks are single-threaded. But we see it could greatly > improve the performance of the jobs, if we can multi-thread some part of it. > For example, profiling our map tasks, which reads large amount of data from > HDFS and spill to disks, we see that we are blocked on HDFS read and spilling > majority of the time. Since both these operations are IO intensive the > average CPU consumption during map phase is significantly low. In theory, > both HDFS read and spilling can be done in parallel if we had additional > memory to store data read from HDFS while we are spilling the last batch read. > Let's say we have 1G of shuffle memory available per task. Currently, in case > of map task, it reads from HDFS and the records are stored in the available > memory buffer. Once we hit the memory limit and there is no more space to > store the records, we sort and spill the content to disk. While we are > spilling to disk, since we do not have any available memory, we can not read > from HDFS concurrently. > Here we propose supporting async spilling for UnsafeShuffleWriter, so that we > can support reading from HDFS when sort and spill is happening > asynchronously. Let's say the total 1G of shuffle memory can be split into > two regions - active region and spilling region - each of size 500 MB. We > start with reading from HDFS and filling the active region. 
Once we hit the > limit of the active region, we issue an asynchronous spill, while flipping the > active region and spilling region. While the spill is happening > asynchronously, we still have 500 MB of memory available to read the data > from HDFS. This way we can amortize the high disk/network IO cost during > spilling. > We made a prototype hack to implement this feature and we could see our map > tasks were as much as 40% faster.
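The active/spilling flip described above can be sketched in a few lines of plain Scala. This is a hypothetical illustration of the double-buffering idea, not the real UnsafeShuffleWriter: when the active region fills, it is handed to a background thread to spill, and inserts continue into a fresh region, blocking only if the previous spill has not finished yet.

```scala
import java.util.concurrent.{Executors, Future => JFuture}

// Double-buffered spiller sketch: `regionSize` records per region,
// `spill` is the (caller-supplied) function that writes a batch out.
final class DoubleBufferedSpiller(regionSize: Int, spill: Vector[Int] => Unit) {
  private val pool = Executors.newSingleThreadExecutor()
  private var active = Vector.empty[Int]
  private var pending: JFuture[_] = null

  def insert(record: Int): Unit = {
    active :+= record
    if (active.size >= regionSize) {
      if (pending != null) pending.get() // previous region must be spilled first
      val flipped = active               // flip: active becomes the spilling region
      active = Vector.empty[Int]
      pending = pool.submit(new Runnable { def run(): Unit = spill(flipped) })
    }
  }

  def close(): Unit = {
    if (pending != null) pending.get()   // drain the in-flight spill
    if (active.nonEmpty) spill(active)   // flush the partial last region
    pool.shutdown()
  }
}
```

Inserts (standing in for HDFS reads) overlap with the background spill exactly as the proposal describes, at the cost of holding two regions' worth of memory.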
[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288363#comment-16288363 ] Apache Spark commented on SPARK-21867: -- User 'ericvandenbergfb' has created a pull request for this issue: https://github.com/apache/spark/pull/19955 > Support async spilling in UnsafeShuffleWriter > - > > Key: SPARK-21867 > URL: https://issues.apache.org/jira/browse/SPARK-21867 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Priority: Minor > Attachments: Async ShuffleExternalSorter.pdf > > > Currently, Spark tasks are single-threaded. But we see it could greatly > improve the performance of the jobs, if we can multi-thread some part of it. > For example, profiling our map tasks, which reads large amount of data from > HDFS and spill to disks, we see that we are blocked on HDFS read and spilling > majority of the time. Since both these operations are IO intensive the > average CPU consumption during map phase is significantly low. In theory, > both HDFS read and spilling can be done in parallel if we had additional > memory to store data read from HDFS while we are spilling the last batch read. > Let's say we have 1G of shuffle memory available per task. Currently, in case > of map task, it reads from HDFS and the records are stored in the available > memory buffer. Once we hit the memory limit and there is no more space to > store the records, we sort and spill the content to disk. While we are > spilling to disk, since we do not have any available memory, we can not read > from HDFS concurrently. > Here we propose supporting async spilling for UnsafeShuffleWriter, so that we > can support reading from HDFS when sort and spill is happening > asynchronously. Let's say the total 1G of shuffle memory can be split into > two regions - active region and spilling region - each of size 500 MB. 
We > start with reading from HDFS and filling the active region. Once we hit the > limit of the active region, we issue an asynchronous spill, while flipping the > active region and spilling region. While the spill is happening > asynchronously, we still have 500 MB of memory available to read the data > from HDFS. This way we can amortize the high disk/network IO cost during > spilling. > We made a prototype hack to implement this feature and we could see our map > tasks were as much as 40% faster.
[jira] [Assigned] (SPARK-21867) Support async spilling in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21867: Assignee: (was: Apache Spark) > Support async spilling in UnsafeShuffleWriter > - > > Key: SPARK-21867 > URL: https://issues.apache.org/jira/browse/SPARK-21867 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Priority: Minor > Attachments: Async ShuffleExternalSorter.pdf > > > Currently, Spark tasks are single-threaded. But we see it could greatly > improve the performance of the jobs, if we can multi-thread some part of it. > For example, profiling our map tasks, which reads large amount of data from > HDFS and spill to disks, we see that we are blocked on HDFS read and spilling > majority of the time. Since both these operations are IO intensive the > average CPU consumption during map phase is significantly low. In theory, > both HDFS read and spilling can be done in parallel if we had additional > memory to store data read from HDFS while we are spilling the last batch read. > Let's say we have 1G of shuffle memory available per task. Currently, in case > of map task, it reads from HDFS and the records are stored in the available > memory buffer. Once we hit the memory limit and there is no more space to > store the records, we sort and spill the content to disk. While we are > spilling to disk, since we do not have any available memory, we can not read > from HDFS concurrently. > Here we propose supporting async spilling for UnsafeShuffleWriter, so that we > can support reading from HDFS when sort and spill is happening > asynchronously. Let's say the total 1G of shuffle memory can be split into > two regions - active region and spilling region - each of size 500 MB. We > start with reading from HDFS and filling the active region. 
Once we hit the > limit of the active region, we issue an asynchronous spill, while flipping the > active region and spilling region. While the spill is happening > asynchronously, we still have 500 MB of memory available to read the data > from HDFS. This way we can amortize the high disk/network IO cost during > spilling. > We made a prototype hack to implement this feature and we could see our map > tasks were as much as 40% faster.
[jira] [Commented] (SPARK-22764) Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"
[ https://issues.apache.org/jira/browse/SPARK-22764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288364#comment-16288364 ] Apache Spark commented on SPARK-22764: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19956 > Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons" > -- > > Key: SPARK-22764 > URL: https://issues.apache.org/jira/browse/SPARK-22764 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Saw this in a PR builder: > {noformat} > [info] - Cancelling stages/jobs with custom reasons. *** FAILED *** (135 > milliseconds) > [info] Expected exception org.apache.spark.SparkException to be thrown, but > no exception was thrown (SparkContextSuite.scala:531) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:822) > [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) > {noformat} > From the logs, the job is finishing before the test code cancels it: > {noformat} > 17/12/12 11:00:41.680 Executor task launch worker for task 1 INFO Executor: > Finished task 0.0 in stage 1.0 (TID 1). 
703 bytes result sent to driver > 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSetManager: Finished task > 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (executor driver) (1/1) > 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSchedulerImpl: Removed > TaskSet 1.0, whose tasks have all completed, from pool > 17/12/12 11:00:41.681 dag-scheduler-event-loop INFO DAGScheduler: ResultStage > 1 (apply at Assertions.scala:805) finished in 0.066 s > 17/12/12 11:00:41.681 pool-1-thread-1-ScalaTest-running-SparkContextSuite > INFO DAGScheduler: Job 1 finished: apply at Assertions.scala:805, took > 0.066946 s > 17/12/12 11:00:41.682 spark-listener-group-shared INFO DAGScheduler: Asked to > cancel job 1 > {noformat}
[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR
[ https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288353#comment-16288353 ] Sean Owen commented on SPARK-22765: --- I'm not clear why this needs another allocation scheme. You say dynamic allocation has overhead at runtime -- yes -- and M/R doesn't because it's static. So why not disable dynamic allocation? Things you're identifying as "problems" are just because Spark is a generalization; you can write a bunch of independent 2-stage map-reduce jobs if you want. Killing idle executors is the point of dynamic allocation, not a problem. I don't see any detail on how this differs from anything else in Spark. > Create a new executor allocation scheme based on that of MR > --- > > Key: SPARK-22765 > URL: https://issues.apache.org/jira/browse/SPARK-22765 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: Xuefu Zhang > > Many users migrating their workload from MR to Spark find a significant > resource consumption hike (e.g., SPARK-22683). While this might not be a > concern for users that are more performance centric, for others conscious > about cost, such a hike creates a migration obstacle. This situation can get > worse as more users are moving to the cloud. > Dynamic allocation makes it possible for Spark to be deployed in a multi-tenant > environment. With its performance-centric design, its inefficiency has also > unfortunately shown up, especially when compared with MR. Thus, it's believed > that an MR-styled scheduler still has its merit. Based on our research, the > inefficiency associated with dynamic allocation comes in many aspects such as > executors idling out, bigger executors, many stages (rather than 2 stages only > in MR) in a Spark job, etc. > Rather than fine-tuning dynamic allocation for efficiency, the proposal here > is to add a new, efficiency-centric scheduling scheme based on that of MR. 
> Such an MR-based scheme can be further enhanced and be more adapted to the Spark > execution model. This alternative is expected to offer good performance > improvement (compared to MR) still with similar to or even better efficiency > than MR. > Inputs are greatly welcome!
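For reference, the static, MR-like behavior the comment above alludes to is available today by turning dynamic allocation off and pinning the executor count at submit time. These are standard Spark configuration properties; the sizes below are purely illustrative:

```shell
# Static allocation: a fixed executor fleet for the whole job, so no
# executors are idle-killed or re-requested at runtime.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 200 \
  --executor-memory 4g \
  --executor-cores 2 \
  my-job.jar   # placeholder application jar
```

Whether this matches MR's efficiency profile for a given workload is exactly the question the ticket debates.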
[jira] [Created] (SPARK-22765) Create a new executor allocation scheme based on that of MR
Xuefu Zhang created SPARK-22765: --- Summary: Create a new executor allocation scheme based on that of MR Key: SPARK-22765 URL: https://issues.apache.org/jira/browse/SPARK-22765 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.6.0 Reporter: Xuefu Zhang Many users migrating their workload from MR to Spark find a significant resource consumption hike (e.g., SPARK-22683). While this might not be a concern for users that are more performance centric, for others conscious about cost, such a hike creates a migration obstacle. This situation can get worse as more users are moving to the cloud. Dynamic allocation makes it possible for Spark to be deployed in a multi-tenant environment. With its performance-centric design, its inefficiency has also unfortunately shown up, especially when compared with MR. Thus, it's believed that an MR-styled scheduler still has its merit. Based on our research, the inefficiency associated with dynamic allocation comes in many aspects such as executors idling out, bigger executors, many stages (rather than 2 stages only in MR) in a Spark job, etc. Rather than fine-tuning dynamic allocation for efficiency, the proposal here is to add a new, efficiency-centric scheduling scheme based on that of MR. Such an MR-based scheme can be further enhanced and be more adapted to the Spark execution model. This alternative is expected to offer good performance improvement (compared to MR) still with similar to or even better efficiency than MR. Inputs are greatly welcome!
[jira] [Created] (SPARK-22764) Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"
Marcelo Vanzin created SPARK-22764: -- Summary: Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons" Key: SPARK-22764 URL: https://issues.apache.org/jira/browse/SPARK-22764 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.3.0 Reporter: Marcelo Vanzin Priority: Minor Saw this in a PR builder: {noformat} [info] - Cancelling stages/jobs with custom reasons. *** FAILED *** (135 milliseconds) [info] Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown (SparkContextSuite.scala:531) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:822) [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) {noformat} >From the logs, the job is finishing before the test code cancels it: {noformat} 17/12/12 11:00:41.680 Executor task launch worker for task 1 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 
703 bytes result sent to driver 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (executor driver) (1/1) 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 17/12/12 11:00:41.681 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 1 (apply at Assertions.scala:805) finished in 0.066 s 17/12/12 11:00:41.681 pool-1-thread-1-ScalaTest-running-SparkContextSuite INFO DAGScheduler: Job 1 finished: apply at Assertions.scala:805, took 0.066946 s 17/12/12 11:00:41.682 spark-listener-group-shared INFO DAGScheduler: Asked to cancel job 1 {noformat}
[jira] [Updated] (SPARK-22644) Make ML testsuite support StructuredStreaming test
[ https://issues.apache.org/jira/browse/SPARK-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22644: -- Target Version/s: 2.3.0 > Make ML testsuite support StructuredStreaming test > -- > > Key: SPARK-22644 > URL: https://issues.apache.org/jira/browse/SPARK-22644 > Project: Spark > Issue Type: Test > Components: ML >Affects Versions: 2.2.0 >Reporter: Weichen Xu >Priority: Minor > > We need to add some helper code to make testing ML transformers & models > easier with streaming data. These tests might help us catch any remaining > issues and we could encourage future PRs to use these tests to prevent new > Models & Transformers from having issues.
[jira] [Assigned] (SPARK-22644) Make ML testsuite support StructuredStreaming test
[ https://issues.apache.org/jira/browse/SPARK-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-22644: - Assignee: Weichen Xu > Make ML testsuite support StructuredStreaming test > -- > > Key: SPARK-22644 > URL: https://issues.apache.org/jira/browse/SPARK-22644 > Project: Spark > Issue Type: Test > Components: ML >Affects Versions: 2.2.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > > We need to add some helper code to make testing ML transformers & models > easier with streaming data. These tests might help us catch any remaining > issues and we could encourage future PRs to use these tests to prevent new > Models & Transformers from having issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22644) Make ML testsuite support StructuredStreaming test
[ https://issues.apache.org/jira/browse/SPARK-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22644: -- Shepherd: Joseph K. Bradley > Make ML testsuite support StructuredStreaming test > -- > > Key: SPARK-22644 > URL: https://issues.apache.org/jira/browse/SPARK-22644 > Project: Spark > Issue Type: Test > Components: ML >Affects Versions: 2.2.0 >Reporter: Weichen Xu >Priority: Minor > > We need to add some helper code to make testing ML transformers & models > easier with streaming data. These tests might help us catch any remaining > issues and we could encourage future PRs to use these tests to prevent new > Models & Transformers from having issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288305#comment-16288305 ] Joseph K. Bradley commented on SPARK-22126: --- Hi all, thanks for the comments. I realized my use of "Spark job" was imprecise above, so I extended that explanation within the linked design doc. Check out the new sections: * "Use cases for model-specific optimizations & parallelism" * "Other Q" This should answer both [~WeichenXu123] and [~bago.amirbekian]. [~tomas.nykodym], with the current state of Spark master (2.3), we already let users shoot themselves in the foot by running big jobs in parallel. We try to make it less likely by marking the "parallelism" Param as an expert-only API. I think it will be hard to prevent this without completely removing the "parallelism" Param, but please let us know if you have ideas for a better user-facing API. > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357; more discussion is here: > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > I copy the discussion from the gist here: > I propose to design the API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, suppose we are doing cross-validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Let's > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that returns such > params). 
It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). > {quote} > In this example, we can see that models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread, > and models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. There are 4 paramMaps in this example, but > we can generate at most two threads to compute the models for them. > The API above allows "callable.call()" to return multiple models; the return > type is {code}Map[Int, M]{code}, where the key is an integer marking the paramMap > index of the corresponding model. Using the example above, there are 4 paramMaps > but only 2 callable objects are returned: one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). 
> and the default "fitCallables/fit with paramMaps" can be implemented as > follows: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { >fitCallables(dataset, paramMaps) > .flatMap(_.call().toSeq).sortBy(_._1).map(_._2) > } > {code} > If we use the API I proposed above, the code in > [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159] > can be changed to: > {code} > val trainingDataset = sparkSession.createDataFrame(training, > schema).cache() > val validationDataset = sparkSession.createDataFrame(validation, > schema).cache() > // Fit models in a Future for training in parallel > val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { > callable => > Future[Map[Int, Model[_]]] { > val modelMap = callable.call() > if (collectSubModelsParam) { >... > } > modelMap > } (executionContext) > } > // Unpersist training data only when all models have trained > Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, > executionContext) > .onComplete { _ => trainingDataset.unpersist() } (executionContext) > // Evaluate models in a Future that will calculate a metric and allow > model to be
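The index-keyed flattening used by the default fit above can be exercised in plain Scala. The sketch below uses String stand-ins for models and ParamMaps; FitCallablesSketch and its types are hypothetical illustrations, not Spark API:

```scala
import java.util.concurrent.Callable

object FitCallablesSketch {
  // Each callable may train several models that share a model-optimized
  // param (e.g. maxIter) and returns them keyed by original paramMap index.
  def fitCallables(paramGroups: Seq[Seq[(Int, String)]]): Seq[Callable[Map[Int, String]]] =
    paramGroups.map { group =>
      new Callable[Map[Int, String]] {
        override def call(): Map[Int, String] =
          group.map { case (idx, params) => idx -> s"model($params)" }.toMap
      }
    }

  // The default fit() flattens all callables' results and restores the
  // original paramMap order by sorting on the index key.
  def fit(paramGroups: Seq[Seq[(Int, String)]]): Seq[String] =
    fitCallables(paramGroups)
      .flatMap(_.call().toSeq)
      .sortBy(_._1)
      .map(_._2)

  def main(args: Array[String]): Unit = {
    // Two groups as in the example: regParam fixed per group, maxIter swept.
    val groups = Seq(
      Seq(0 -> "regParam=0.1,maxIter=5", 1 -> "regParam=0.1,maxIter=10"),
      Seq(2 -> "regParam=0.3,maxIter=5", 3 -> "regParam=0.3,maxIter=10"))
    println(fit(groups).mkString("; "))
  }
}
```

The sort step is what lets a callable return its models in any order while fit still yields one model per paramMap, in paramMap order.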
[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288298#comment-16288298 ] Eric Vandenberg commented on SPARK-21867: - 1. The default would be 1, so it does not change default behavior. It is currently configurable to set this higher (spark.shuffle.async.num.sorter=2). 2. Yes, the number of spill files could increase; that's one reason this is not on by default. This could be an issue if it hits file system limits, etc. in extreme cases. For the jobs we've tested, this wasn't a problem. We think this improvement has the biggest impact on larger jobs (we've seen CPU reductions of ~30% in some large jobs); it may not help as much for smaller jobs with fewer spills. 3. When a sorter hits the threshold, it will kick off an asynchronous spill and then continue inserting into another sorter (assuming one is available). It could make sense to raise the threshold; this would result in larger spill files. There is some risk that raising it pushes it too high, causing an OOM and then needing to lower it again. I'm thinking the algorithm could be improved by more accurately calculating and enforcing the threshold based on available memory over time; however, doing this would require exposing some memory allocation metrics not currently available (in the memory manager), so I opted not to do that for now. 4. Yes, too many open files/buffers could be an issue. So for now this is something to look at enabling case by case as part of performance tuning. > Support async spilling in UnsafeShuffleWriter > - > > Key: SPARK-21867 > URL: https://issues.apache.org/jira/browse/SPARK-21867 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Priority: Minor > Attachments: Async ShuffleExternalSorter.pdf > > > Currently, Spark tasks are single-threaded. But we see that it could greatly > improve the performance of jobs if we can multi-thread some parts of them. 
> For example, profiling our map tasks, which read large amounts of data from > HDFS and spill to disk, we see that we are blocked on HDFS reads and spilling > for the majority of the time. Since both of these operations are IO-intensive, the > average CPU consumption during the map phase is significantly low. In theory, > both HDFS reads and spilling can be done in parallel if we had additional > memory to store data read from HDFS while we are spilling the last batch read. > Let's say we have 1G of shuffle memory available per task. Currently, in the case > of a map task, it reads from HDFS and the records are stored in the available > memory buffer. Once we hit the memory limit and there is no more space to > store the records, we sort and spill the content to disk. While we are > spilling to disk, since we do not have any available memory, we cannot read > from HDFS concurrently. > Here we propose supporting async spilling for UnsafeShuffleWriter, so that we > can keep reading from HDFS while sort and spill are happening > asynchronously. Let's say the total 1G of shuffle memory can be split into > two regions - an active region and a spilling region - each of size 500 MB. We > start with reading from HDFS and filling the active region. Once we hit the > limit of the active region, we issue an asynchronous spill, while flipping the > active region and spilling region. While the spill is happening > asynchronously, we still have 500 MB of memory available to read data > from HDFS. This way we can amortize the high disk/network IO cost during > spilling. > We made a prototype hack to implement this feature and saw our map > tasks run as much as 40% faster. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
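The active/spilling double-buffer scheme described above can be sketched in a few lines of plain Scala. DoubleBufferWriter is a hypothetical stand-in, not Spark's ShuffleExternalSorter; "spilling" here is simulated by sorting a batch on another thread while inserts continue into a fresh region:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

final class DoubleBufferWriter(regionSize: Int) {
  private var active = new ArrayBuffer[Int]()              // region being filled
  private val spills = ArrayBuffer[Future[Vector[Int]]]()  // in-flight async "spills"

  def insert(record: Int): Unit = {
    active += record
    if (active.length >= regionSize) {
      val toSpill = active            // hand off the full region
      active = new ArrayBuffer[Int]() // keep inserting into a fresh region
      // "Spill" asynchronously: sort the batch on another thread.
      spills += Future { toSpill.sorted.toVector }
    }
  }

  // Wait for outstanding spills; returns spilled batches plus any tail.
  def close(): Vector[Vector[Int]] = {
    val done = spills.map(f => Await.result(f, Duration.Inf)).toVector
    if (active.nonEmpty) done :+ active.sorted.toVector else done
  }
}
```

The key property is that insert never blocks on a spill in progress, which mirrors the proposal's claim that reads can proceed while the other region spills.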
[jira] [Commented] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288262#comment-16288262 ] Xuefu Zhang commented on SPARK-22683: - Hi [~jcuquemelle], Thanks for working on this and bringing up the efficiency problem associated with dynamic allocation. A significant increase in resource consumption was also experienced in our company when workloads were migrated from MR to Spark (via Hive). Thus, I believe there is a strong need to improve Spark's efficiency in addition to its performance. While your proposal has its merits, I largely concur with Sean that it might not be universally applicable: it solves a particular workload rather than a class of problems. Take MR as an example: it also allocates as many mappers/reducers as the number of map or reduce tasks, yet offers higher efficiency than Spark in many cases. The inefficiency associated with dynamic allocation comes in many forms, such as executors idling out, bigger executors, and many stages (rather than only 2 stages in MR) in a Spark job. As there is a class of users conscious of resource consumption, especially as many move their workloads to the cloud, a more generic solution is needed for such users. I have been thinking about a proposal that introduces an MR-based resource allocation in parallel with dynamic allocation. Such an allocation mechanism is based on the MR style, but can be further enhanced to beat MR and be better adapted to Spark's execution model. This would be a great alternative to dynamic allocation. While dynamic allocation is certainly performance-centric, the new allocation scheme can still offer good performance improvement (compared to MR) while being efficiency-centric. As a starting point, I'm going to create a JIRA and move the discussion along this proposal over there. You're welcome to share your thoughts and/or contribute. Thanks. 
> Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > Let's say an executor has spark.executor.cores / spark.task.cpus taskSlots. > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency > but wastes resources when tasks are small relative to the executor allocation > overhead. > By adding tasksPerExecutorSlot, it becomes possible to specify how many tasks > a single slot should ideally execute, to mitigate the overhead of executor > allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
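A back-of-envelope sketch of the proposed knob, assuming the semantics described in the issue (divide the default one-task-per-slot executor request by tasksPerExecutorSlot). ExecutorCountSketch and executorsWanted are hypothetical names for illustration, not Spark code:

```scala
object ExecutorCountSketch {
  // Default dynamic allocation asks for one task slot per pending task;
  // tasksPerExecutorSlot > 1 lets each slot run several tasks in sequence,
  // shrinking the executor request proportionally.
  def executorsWanted(pendingTasks: Int,
                      executorCores: Int,
                      taskCpus: Int,
                      tasksPerExecutorSlot: Int = 1): Int = {
    val slotsPerExecutor = executorCores / taskCpus
    val tasksPerExecutor = slotsPerExecutor * tasksPerExecutorSlot
    // ceil(pendingTasks / tasksPerExecutor) using integer arithmetic
    (pendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
  }
}
```

For example, 1000 pending tasks with 4-core executors and 1 CPU per task yield a request of 250 executors by default, but only 50 with tasksPerExecutorSlot = 5, trading some latency for fewer allocations.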
[jira] [Assigned] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies
[ https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22757: Assignee: Apache Spark > Init-container in the driver/executor pods for downloading remote dependencies > -- > > Key: SPARK-22757 > URL: https://issues.apache.org/jira/browse/SPARK-22757 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies
[ https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22757: Assignee: (was: Apache Spark) > Init-container in the driver/executor pods for downloading remote dependencies > -- > > Key: SPARK-22757 > URL: https://issues.apache.org/jira/browse/SPARK-22757 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies
[ https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288213#comment-16288213 ] Apache Spark commented on SPARK-22757: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/19954 > Init-container in the driver/executor pods for downloading remote dependencies > -- > > Key: SPARK-22757 > URL: https://issues.apache.org/jira/browse/SPARK-22757 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive
[ https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22574. Resolution: Fixed Fix Version/s: 2.3.0 2.2.2 Issue resolved by pull request 19793 [https://github.com/apache/spark/pull/19793] > Wrong request causing Spark Dispatcher going inactive > - > > Key: SPARK-22574 > URL: https://issues.apache.org/jira/browse/SPARK-22574 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Submit >Affects Versions: 2.2.0 >Reporter: German Schiavon Matteo >Priority: Minor > Fix For: 2.2.2, 2.3.0 > > > Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher > causes a bad state in the Dispatcher, making it inactive as a Mesos framework. > The class CreateSubmissionRequest initialises its arguments to null as follows: > {code:title=CreateSubmissionRequest.scala|borderStyle=solid} > var appResource: String = null > var mainClass: String = null > var appArgs: Array[String] = null > var sparkProperties: Map[String, String] = null > var environmentVariables: Map[String, String] = null > {code} > Some of these variables are checked, but not all of them; for example, > appArgs and environmentVariables are not. > If you don't set _appArgs_, it will cause the following error: > {code:title=error|borderStyle=solid} > 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers. 
> Exception in thread "Thread-22" java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) > {code} > This happens because the code accesses _appArgs_ without checking whether it is null. > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive
[ https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22574: -- Assignee: German Schiavon Matteo > Wrong request causing Spark Dispatcher going inactive > - > > Key: SPARK-22574 > URL: https://issues.apache.org/jira/browse/SPARK-22574 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Submit >Affects Versions: 2.2.0 >Reporter: German Schiavon Matteo >Assignee: German Schiavon Matteo >Priority: Minor > Fix For: 2.2.2, 2.3.0 > > > Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher > causes a bad state in the Dispatcher, making it inactive as a Mesos framework. > The class CreateSubmissionRequest initialises its arguments to null as follows: > {code:title=CreateSubmissionRequest.scala|borderStyle=solid} > var appResource: String = null > var mainClass: String = null > var appArgs: Array[String] = null > var sparkProperties: Map[String, String] = null > var environmentVariables: Map[String, String] = null > {code} > Some of these variables are checked, but not all of them; for example, > appArgs and environmentVariables are not. > If you don't set _appArgs_, it will cause the following error: > {code:title=error|borderStyle=solid} > 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers. 
> Exception in thread "Thread-22" java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) > {code} > This happens because the code accesses _appArgs_ without checking whether it is null. > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-22289. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.2 > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Nic Eggert >Assignee: yuhao yang > Fix For: 2.2.2, 2.3.0 > > > I think this was introduced in SPARK-20047. > Trying to call save on a logistic regression model trained with bounds on its > parameters throws an error. This seems to be because Spark doesn't know how > to serialize the Matrix parameter. > Model is set up like this: > {code} > val calibrator = new LogisticRegression() > .setFeaturesCol("uncalibrated_probability") > .setLabelCol("label") > .setWeightCol("weight") > .setStandardization(false) > .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0))) > .setFamily("binomial") > .setProbabilityCol("probability") > .setPredictionCol("logistic_prediction") > .setRawPredictionCol("logistic_raw_prediction") > {code} > {code} > 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. 
> at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277) > at > org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253) > at > org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > -snip- > {code} -- This message was sent by 
Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22755) Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return different results
[ https://issues.apache.org/jira/browse/SPARK-22755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati resolved SPARK-22755. - Resolution: Fixed Fix Version/s: 2.2.1 > Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return > different results > - > > Key: SPARK-22755 > URL: https://issues.apache.org/jira/browse/SPARK-22755 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kevin Zhang > Fix For: 2.2.1 > > > both of the following sql statements > {code:sql} > select ((946-885)*1.000/946 < 0.1) > {code} > and > {code:sql} > select ((946-885)*1.0/946 < 0.100) > {code} > return true, while the following statement > {code:sql} > select ((946-885)*1.0/946 < 0.1) > {code} > returns false -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22755) Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return different results
[ https://issues.apache.org/jira/browse/SPARK-22755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288061#comment-16288061 ] Sunitha Kambhampati commented on SPARK-22755: - SPARK-21332 fixed this issue. The changes for it are also in Spark 2.2.1, which was released on Dec 1. I tried it on Spark 2.2.1 and the queries return true. > Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return > different results > - > > Key: SPARK-22755 > URL: https://issues.apache.org/jira/browse/SPARK-22755 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kevin Zhang > > Both of the following SQL statements > {code:sql} > select ((946-885)*1.000/946 < 0.1) > {code} > and > {code:sql} > select ((946-885)*1.0/946 < 0.100) > {code} > return true, while the following statement > {code:sql} > select ((946-885)*1.0/946 < 0.1) > {code} > returns false -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
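For reference, the mathematically exact value of the expression is well below 0.1, so all three statements should return true. Plain Scala BigDecimal (which divides with full DECIMAL128 precision, unlike Spark SQL's literal-scale-dependent decimal rules) shows the underlying value; DecimalCompareSketch is just an illustration, not a reproduction of Spark's decimal semantics:

```scala
object DecimalCompareSketch {
  def main(args: Array[String]): Unit = {
    // (946 - 885) * 1.0 / 946 = 61 / 946, which is roughly 0.0645
    val v = BigDecimal(946 - 885) * BigDecimal("1.0") / BigDecimal(946)
    println(v)
    println(v < BigDecimal("0.1"))
  }
}
```

The bug was that the scale of the literal (1.0 vs 1.000) changed the precision carried through Spark SQL's decimal division, flipping the comparison; the exact arithmetic above is unaffected by the literal's written scale.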
[jira] [Updated] (SPARK-22729) Add getTruncateQuery to JdbcDialect
[ https://issues.apache.org/jira/browse/SPARK-22729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22729: Issue Type: Improvement (was: New Feature) > Add getTruncateQuery to JdbcDialect > --- > > Key: SPARK-22729 > URL: https://issues.apache.org/jira/browse/SPARK-22729 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel van der Ende >Assignee: Daniel van der Ende > Fix For: 2.3.0 > > > In order to enable truncate for PostgreSQL databases in Spark JDBC, a change > is needed to the query used for truncating a PostgreSQL table. By default, > PostgreSQL will automatically truncate any descendant tables if a TRUNCATE > query is executed. As this may result in (unwanted) side-effects, the query > used for the truncate should be specified separately for PostgreSQL, > specifying only to TRUNCATE a single table. > This will also resolve SPARK-22717 > See PostgreSQL documentation > https://www.postgresql.org/docs/current/static/sql-truncate.html > This change will still not let users truncate a table with cascade enabled > (which would also truncate tables with foreign key constraints to the table). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22729) Add getTruncateQuery to JdbcDialect
[ https://issues.apache.org/jira/browse/SPARK-22729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22729. - Resolution: Fixed Fix Version/s: 2.3.0 > Add getTruncateQuery to JdbcDialect > --- > > Key: SPARK-22729 > URL: https://issues.apache.org/jira/browse/SPARK-22729 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel van der Ende >Assignee: Daniel van der Ende > Fix For: 2.3.0 > > > In order to enable truncate for PostgreSQL databases in Spark JDBC, a change > is needed to the query used for truncating a PostgreSQL table. By default, > PostgreSQL will automatically truncate any descendant tables if a TRUNCATE > query is executed. As this may result in (unwanted) side-effects, the query > used for the truncate should be specified separately for PostgreSQL, > specifying only to TRUNCATE a single table. > This will also resolve SPARK-22717 > See PostgreSQL documentation > https://www.postgresql.org/docs/current/static/sql-truncate.html > This change will still not let users truncate a table with cascade enabled > (which would also truncate tables with foreign key constraints to the table). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22729) Add getTruncateQuery to JdbcDialect
[ https://issues.apache.org/jira/browse/SPARK-22729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-22729: --- Assignee: Daniel van der Ende > Add getTruncateQuery to JdbcDialect > --- > > Key: SPARK-22729 > URL: https://issues.apache.org/jira/browse/SPARK-22729 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel van der Ende >Assignee: Daniel van der Ende > Fix For: 2.3.0 > > > In order to enable truncate for PostgreSQL databases in Spark JDBC, a change > is needed to the query used for truncating a PostgreSQL table. By default, > PostgreSQL will automatically truncate any descendant tables if a TRUNCATE > query is executed. As this may result in (unwanted) side-effects, the query > used for the truncate should be specified separately for PostgreSQL, > specifying only to TRUNCATE a single table. > This will also resolve SPARK-22717 > See PostgreSQL documentation > https://www.postgresql.org/docs/current/static/sql-truncate.html > This change will still not let users truncate a table with cascade enabled > (which would also truncate tables with foreign key constraints to the table). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22742) Spark2.x does not support read data from Hive 2.2 and 2.3
[ https://issues.apache.org/jira/browse/SPARK-22742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288044#comment-16288044 ] Zhongting Hu commented on SPARK-22742: -- Thanks [~smilegator] for reopening it. Could I get a rough estimate of which upcoming Spark release will have Hive 2.3 support? Thanks > Spark2.x does not support read data from Hive 2.2 and 2.3 > - > > Key: SPARK-22742 > URL: https://issues.apache.org/jira/browse/SPARK-22742 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Zhongting Hu > > Hive's latest release is 2.3.2, but Spark does not support reading from its metastore yet. The latest Hive metastore version Spark supports is 2.1.
[jira] [Commented] (SPARK-19834) csv escape of quote escape
[ https://issues.apache.org/jira/browse/SPARK-19834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288025#comment-16288025 ] Soonmok Kwon commented on SPARK-19834: -- I will re-open this soon (in a week). > csv escape of quote escape > -- > > Key: SPARK-19834 > URL: https://issues.apache.org/jira/browse/SPARK-19834 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Soonmok Kwon >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > A DataFrame is stored in CSV format and loaded again. When there's a backslash > followed by a quotation mark, CSV reading seems to produce an error. > reference: > http://stackoverflow.com/questions/42607208/spark-csv-error-when-reading-backslash-and-quotation-mark
[jira] [Commented] (SPARK-22755) Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return different results
[ https://issues.apache.org/jira/browse/SPARK-22755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287994#comment-16287994 ] Sunitha Kambhampati commented on SPARK-22755: - A) TRUNK: - My trunk codeline is sync'd up to Dec 7, commit 2d4c2b0bdf89badf25f2d1d98903125e48e7cd5c - That said, I tried these statements now in spark sql repl and the answer is true. - I have not set any configuration, it is just the defaults in my dev env. B) Spark 2.2: - I downloaded the spark 2.2 binaries and tried the queries in spark-sql repl and I can repro the issue. {code} select ((946-885)*1.0/946 < 0.1) -> returns false select ((946-885)*1.0/946 < 0.100) -> returns true {code} So it looks like some fix has gone in as I cannot repro it on trunk. I am not sure which issue has fixed it but just wanted to add this info for now. > Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return > different results > - > > Key: SPARK-22755 > URL: https://issues.apache.org/jira/browse/SPARK-22755 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kevin Zhang > > both of the following sql statements > {code:sql} > select ((946-885)*1.000/946 < 0.1) > {code} > and > {code:sql} > select ((946-885)*1.0/946 < 0.100) > {code} > return true, while the following statement > {code:sql} > select ((946-885)*1.0/946 < 0.1) > {code} > returns false -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
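The flip-flopping result comes down to the scale at which the decimal division result is kept before the comparison. A rough stdlib illustration of the symptom (this mimics the effect only; Spark's actual decimal precision/scale promotion rules are different):

```python
from decimal import Decimal, ROUND_HALF_UP

# (946-885)*1.0/946 is about 0.0645. If the division result is rounded to
# scale 1 (one digit after the point) before comparing with 0.1, it becomes
# 0.1 and the "< 0.1" test flips to false; at scale 3 it stays 0.064 < 0.1.
# This models the symptom only; Spark's decimal promotion rules differ.
ratio = (Decimal(946) - Decimal(885)) / Decimal(946)

at_scale_1 = ratio.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
at_scale_3 = ratio.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

print(at_scale_1 < Decimal("0.1"))  # False
print(at_scale_3 < Decimal("0.1"))  # True
```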
[jira] [Resolved] (SPARK-16986) "Started" time, "Completed" time and "Last Updated" time in history server UI are not user local time
[ https://issues.apache.org/jira/browse/SPARK-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16986. Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 2.3.0 > "Started" time, "Completed" time and "Last Updated" time in history server UI > are not user local time > - > > Key: SPARK-16986 > URL: https://issues.apache.org/jira/browse/SPARK-16986 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Weiqing Yang >Assignee: Yuming Wang >Priority: Minor > Fix For: 2.3.0 > > > Currently, "Started" time, "Completed" time and "Last Updated" time in > history server UI are GMT. They should be the user local time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-16986) "Started" time, "Completed" time and "Last Updated" time in history server UI are not user local time
[ https://issues.apache.org/jira/browse/SPARK-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-16986: > "Started" time, "Completed" time and "Last Updated" time in history server UI > are not user local time > - > > Key: SPARK-16986 > URL: https://issues.apache.org/jira/browse/SPARK-16986 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Weiqing Yang >Priority: Minor > > Currently, "Started" time, "Completed" time and "Last Updated" time in > history server UI are GMT. They should be the user local time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22763) SHS: Ignore unknown events and parse through the file
[ https://issues.apache.org/jira/browse/SPARK-22763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287924#comment-16287924 ] Apache Spark commented on SPARK-22763: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/19953 > SHS: Ignore unknown events and parse through the file > - > > Key: SPARK-22763 > URL: https://issues.apache.org/jira/browse/SPARK-22763 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang > > While spark code changes, there are new events in event log: > https://github.com/apache/spark/pull/19649 > And we used to maintain a whitelist to avoid exceptions: > https://github.com/apache/spark/pull/15663/files > Currently Spark history server will stop parsing on unknown events or > unrecognized properties. We may still see part of the UI data. > For better compatibility, we can ignore unknown events and parse through the > log file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22763) SHS: Ignore unknown events and parse through the file
[ https://issues.apache.org/jira/browse/SPARK-22763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22763: Assignee: Apache Spark > SHS: Ignore unknown events and parse through the file > - > > Key: SPARK-22763 > URL: https://issues.apache.org/jira/browse/SPARK-22763 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Apache Spark > > While spark code changes, there are new events in event log: > https://github.com/apache/spark/pull/19649 > And we used to maintain a whitelist to avoid exceptions: > https://github.com/apache/spark/pull/15663/files > Currently Spark history server will stop parsing on unknown events or > unrecognized properties. We may still see part of the UI data. > For better compatibility, we can ignore unknown events and parse through the > log file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19809: -- Affects Version/s: (was: 2.2.0) 2.2.1 > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at >
[jira] [Assigned] (SPARK-22763) SHS: Ignore unknown events and parse through the file
[ https://issues.apache.org/jira/browse/SPARK-22763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22763: Assignee: (was: Apache Spark) > SHS: Ignore unknown events and parse through the file > - > > Key: SPARK-22763 > URL: https://issues.apache.org/jira/browse/SPARK-22763 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang > > While spark code changes, there are new events in event log: > https://github.com/apache/spark/pull/19649 > And we used to maintain a whitelist to avoid exceptions: > https://github.com/apache/spark/pull/15663/files > Currently Spark history server will stop parsing on unknown events or > unrecognized properties. We may still see part of the UI data. > For better compatibility, we can ignore unknown events and parse through the > log file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22763) SHS: Ignore unknown events and parse through the file
Gengliang Wang created SPARK-22763: -- Summary: SHS: Ignore unknown events and parse through the file Key: SPARK-22763 URL: https://issues.apache.org/jira/browse/SPARK-22763 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1 Reporter: Gengliang Wang While spark code changes, there are new events in event log: https://github.com/apache/spark/pull/19649 And we used to maintain a whitelist to avoid exceptions: https://github.com/apache/spark/pull/15663/files Currently Spark history server will stop parsing on unknown events or unrecognized properties. We may still see part of the UI data. For better compatibility, we can ignore unknown events and parse through the log file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
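The proposed tolerant parsing is straightforward to sketch. In the sketch below the event names and single-line JSON format follow real Spark event logs, but the parser itself is a hypothetical simplification of what the history server would do:

```python
import json

# Sketch of "ignore unknown events and keep parsing". Real event logs are
# JSON lines keyed by "Event"; the known-event set here is truncated for
# illustration.
KNOWN_EVENTS = {"SparkListenerApplicationStart", "SparkListenerJobStart"}

def parse_event_log(lines):
    parsed, skipped = [], 0
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            skipped += 1  # malformed line: skip instead of aborting the file
            continue
        if event.get("Event") not in KNOWN_EVENTS:
            skipped += 1  # unknown event type: skip instead of aborting
            continue
        parsed.append(event)
    return parsed, skipped

log = [
    '{"Event": "SparkListenerApplicationStart"}',
    '{"Event": "SomeBrandNewEvent"}',  # unknown: previously stopped parsing
    '{"Event": "SparkListenerJobStart"}',
]
events, skipped = parse_event_log(log)
print(len(events), skipped)  # 2 1
```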
[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets
[ https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287735#comment-16287735 ] Adamos Loizou commented on SPARK-22351: --- Hello guys, once more I've run into this problem, now with ADT/sealed-hierarchy examples. For reference, there are already people facing this issue ([stack overflow link|https://stackoverflow.com/questions/41030073/encode-an-adt-sealed-trait-hierarchy-into-spark-dataset-column]). Here is an example:
{code:java}
sealed trait Fruit
case object Apple extends Fruit
case object Orange extends Fruit
case class Bag(quantity: Int, fruit: Fruit)

Seq(Bag(1, Apple), Bag(3, Orange)).toDS // <- This fails because it can't find an encoder for Fruit
{code}
Ideally I'd like to be able to create my own encoder and tell it, for example, to use the case object's toString for mapping it to a String column. How feasible would it be to expose an API for creating custom encoders? Unfortunately, not having this significantly limits how generalised and typesafe our models can be. Thank you. > Support user-created custom Encoders for Datasets > - > > Key: SPARK-22351 > URL: https://issues.apache.org/jira/browse/SPARK-22351 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adamos Loizou >Priority: Minor > > It would be very helpful if we could easily support creating custom encoders > for classes in Spark SQL. > This is to allow a user to properly define a business model using types of > their choice. They can then map them to Spark SQL types without being forced > to pollute their model with the built-in mappable types (e.g. > {{java.sql.Timestamp}}). > Specifically in our case, we tend to use either the Java 8 time API or the > joda time API for dates instead of {{java.sql.Timestamp}}, whose API is quite > limited compared to the others.
> Ideally we would like to be able to have a dataset of such a class:
> {code:java}
> case class Person(name: String, dateOfBirth: org.joda.time.LocalDate)
> implicit def localDateTimeEncoder: Encoder[LocalDate] = ??? // we define something that maps to Spark SQL TimestampType
> ...
> // read csv and map it to model
> val people: Dataset[Person] = spark.read.csv("/my/path/file.csv").as[Person]
> {code}
> While this was possible in Spark 1.6, it's no longer the case in Spark 2.x. It's also not straightforward how to support that using an {{ExpressionEncoder}} (any tips would be much appreciated). > Thanks.
[jira] [Comment Edited] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286253#comment-16286253 ] Julien Cuquemelle edited comment on SPARK-22683 at 12/12/17 3:14 PM: - The impression I get from our discussion is that you mainly focus on the latency of the jobs, and that the current setting is optimized for that, which is why you consider the current setup sufficient. If you consider your previous example from a resource usage point of view, my proposal would allow to have about the same resource usage in both scenarios, but the current setup doubles the resource usage of the workload with small tasks... I've tried to experiment with the parameters you've proposed, but right now I don't have a solution to optimize my type of workload (not every single job) for resource consumption. I don't know if the majority of Spark users run on idle clusters, but ours is routinely full, so for us resource usage is more important than latency. was (Author: jcuquemelle): The impression I get from our discussion is that you mainly focus on the latency of the jobs, and that the current setting is optimized for that, which is why you consider the current setup sufficient. If you consider your previous example from a resource usage point of view, my proposal would allow to have about the same resource usage in both scenarios, but the current setup doubles the resource usage of the workload with small tasks... I've tried to experiment with the parameters you've proposed, but right now I don't have a solution to optimize my type of workload (nor every single job) for resource consumption. I don't know if the majority of Spark users run on idle clusters, but ours is routinely full, so for us resource usage is more important than latency. 
> Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
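The proposed knob can be modeled in a few lines. Parameter names below mirror the JIRA discussion and are not actual Spark configuration keys:

```python
import math

# Rough model of the proposal: instead of one executor slot per pending
# task, ask each slot to run tasksPerExecutorSlot tasks on average, trading
# latency for lower executor-allocation overhead on small tasks.
def target_executors(pending_tasks, executor_cores, task_cpus,
                     tasks_per_executor_slot=1):
    slots_per_executor = executor_cores // task_cpus
    needed_slots = math.ceil(pending_tasks / tasks_per_executor_slot)
    return math.ceil(needed_slots / slots_per_executor)

# 10000 small tasks, 4 slots per executor:
print(target_executors(10000, 4, 1))                             # 2500
print(target_executors(10000, 4, 1, tasks_per_executor_slot=6))  # 417
```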
[jira] [Commented] (SPARK-21322) support histogram in filter cardinality estimation
[ https://issues.apache.org/jira/browse/SPARK-21322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287691#comment-16287691 ] Apache Spark commented on SPARK-21322: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19952 > support histogram in filter cardinality estimation > -- > > Key: SPARK-21322 > URL: https://issues.apache.org/jira/browse/SPARK-21322 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ron Hu >Assignee: Ron Hu > Fix For: 2.3.0 > > > Histogram is effective in dealing with skewed distribution. After we > generate histogram information for column statistics, we need to adjust > filter estimation based on histogram data structure. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
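For context, the core of histogram-based filter estimation for a predicate like {{col < v}} on an equi-height histogram looks roughly like this (a generic sketch of the technique, not Spark's actual FilterEstimation code):

```python
# Equi-height histogram: each bucket holds an equal fraction of rows, so
# selectivity of "col < v" is the count of fully-covered buckets plus a
# linear interpolation inside the partially-covered bucket. This handles
# skew better than assuming a uniform distribution over [min, max].
def less_than_selectivity(bucket_bounds, value):
    # bucket_bounds: sorted endpoints [b0, b1, ..., bn] of n equi-height buckets
    n = len(bucket_bounds) - 1
    if value <= bucket_bounds[0]:
        return 0.0
    if value >= bucket_bounds[-1]:
        return 1.0
    covered = 0.0
    for lo, hi in zip(bucket_bounds, bucket_bounds[1:]):
        if value >= hi:
            covered += 1.0  # whole bucket falls below the value
        elif value > lo:
            covered += (value - lo) / (hi - lo)  # partial bucket
    return covered / n

# Skewed column: most values fall in [0, 10), few in [10, 1000)
bounds = [0, 2, 5, 10, 1000]
print(less_than_selectivity(bounds, 10))  # 0.75
```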
[jira] [Resolved] (SPARK-22761) 64KB JVM bytecode limit problem with GLM
[ https://issues.apache.org/jira/browse/SPARK-22761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22761. --- Resolution: Duplicate Without more info this sounds like a possible duplicate of several other issues. > 64KB JVM bytecode limit problem with GLM > > > Key: SPARK-22761 > URL: https://issues.apache.org/jira/browse/SPARK-22761 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Alan Lai > > {code:java} > GLM > {code} (and presumably other mllib tools) > can throw an exception due to the 64KB JVM bytecode limit when used with > a lot of variables/arguments (~2k).
[jira] [Commented] (SPARK-21914) Running examples as tests in SQL builtin function documentation
[ https://issues.apache.org/jira/browse/SPARK-21914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287538#comment-16287538 ] Hyukjin Kwon commented on SPARK-21914: -- I totally misunderstood your comment [~rxin]. I was so focused on the elapsed time of the tests at that time, but you meant the actual test coverage. Would it be okay if the diff is small? Let me try this locally and maybe open a PR if the diff is small enough to show it. Otherwise, I will leave this closed if it's complex and big. > Running examples as tests in SQL builtin function documentation > --- > > Key: SPARK-21914 > URL: https://issues.apache.org/jira/browse/SPARK-21914 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > It looks like we have added many examples in {{ExpressionDescription}} for builtin > functions. > Actually, if I have seen correctly, we have fixed many examples so far in > some minor PRs, and they sometimes require adding the examples as tests to sql and > golden files. > As we have formatted examples in {{ExpressionDescription.examples}} - > https://github.com/apache/spark/blob/ba327ee54c32b11107793604895bd38559804858/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionDescription.java#L44-L50, > and we have {{SQLQueryTestSuite}}, I think we could run the examples as tests > like Python's doctests. > The rough approach I am thinking of: > 1. Load the examples in {{ExpressionDescription}}. > 2. Identify queries by {{>}}. > 3. Identify the rest of them as the expected results. > 4. Run the examples by reusing {{SQLQueryTestSuite}} if possible. > 5. Compare the output by reusing {{SQLQueryTestSuite}} if possible. > Advantages of doing this that I can think of for now: > - Reduce the number of PRs to fix the examples. > - De-duplicate the test cases that should be added into sql and golden files. > - Correct documentation with correct examples. > - Reduce reviewing costs for documentation fix PRs.
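The parsing steps listed in the description can be illustrated as follows (a hypothetical simplification; the real implementation would reuse {{SQLQueryTestSuite}} to run and compare the queries):

```python
# Sketch of the parsing step described above: split an ExpressionDescription
# "Examples:" block into (query, expected-result) pairs, treating lines that
# start with "> " as queries and the following lines as expected output.
# The block format follows the ExpressionDescription convention; the parser
# itself is illustrative.
def split_examples(example_text):
    pairs, query, result = [], None, []
    for line in example_text.strip().splitlines():
        line = line.strip()
        if line.startswith("> "):
            if query is not None:
                pairs.append((query, "\n".join(result)))
            query, result = line[2:], []
        elif query is not None:
            result.append(line)
    if query is not None:
        pairs.append((query, "\n".join(result)))
    return pairs

examples = """
> SELECT abs(-1);
 1
> SELECT abs(2);
 2
"""
print(split_examples(examples))
```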
[jira] [Commented] (SPARK-18580) Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream
[ https://issues.apache.org/jira/browse/SPARK-18580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287496#comment-16287496 ] Alexey Lipodat commented on SPARK-18580: Guys, could you please review Oleksandr's pull request above? This feature would be very useful for us. Thank you! > Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream > --- > > Key: SPARK-18580 > URL: https://issues.apache.org/jira/browse/SPARK-18580 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Oleg Muravskiy > > Currently, {{spark.streaming.kafka.maxRatePerPartition}} is used as the > initial rate when backpressure is enabled. This is too aggressive for the > application while it is still warming up. > This is similar to SPARK-11627, applying the solution provided there to > DirectKafkaInputDStream.
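The requested behavior amounts to preferring the configured initial rate over the per-partition cap while no backpressure estimate exists yet. A hedged sketch (the function and its shape are illustrative; only the two configuration names come from the issue):

```python
# Toy model of the proposed starting rate: when backpressure is enabled but
# has not produced an estimate yet, start from the total
# spark.streaming.backpressure.initialRate instead of the (usually much
# larger) spark.streaming.kafka.maxRatePerPartition cap.
def initial_records_per_partition(max_rate_per_partition, initial_rate,
                                  batch_seconds, num_partitions):
    if initial_rate is not None:
        rate = min(max_rate_per_partition, initial_rate / num_partitions)
    else:
        rate = max_rate_per_partition  # current behavior: cap used as start
    return int(rate * batch_seconds)

# 1000 records/sec total initial rate across 10 partitions, 2s batches:
print(initial_records_per_partition(10000, 1000, 2, 10))  # 200
```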
[jira] [Commented] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.
[ https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287489#comment-16287489 ] Apache Spark commented on SPARK-22760: -- User 'KaiXinXiaoLei' has created a pull request for this issue: https://github.com/apache/spark/pull/19951 > where driver is stopping, and some executors lost because of > YarnSchedulerBackend.stop, then there is a problem. > - > > Key: SPARK-22760 > URL: https://issues.apache.org/jira/browse/SPARK-22760 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.2.1 >Reporter: KaiXinXIaoLei > Attachments: 微信图片_20171212094100.jpg > > > Using SPARK-14228 , i still find a problem: > {noformat} > 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to > shut down > 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63. > 17/12/12 15:34:45 ERROR Inbox: Ignoring error > org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it > has been stopped. 
> at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163) > at > org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133) > at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356) > at > org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > and sometimes, the below problem is also exists: > {noformat} > 17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped > 17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! 
> 17/12/11 15:50:53 ERROR Inbox: Ignoring error > org.apache.spark.SparkException: Unsupported message > OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container > container_e05_1512975871311_0007_01_69 exited because of a YARN event > (e.g., pre-emption) and not because of an error in the running job.)) from > 101.8.73.53:42930 > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) > at > org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134) > at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186) > at >
[jira] [Commented] (SPARK-22450) Safely register class for mllib
[ https://issues.apache.org/jira/browse/SPARK-22450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287483#comment-16287483 ] Apache Spark commented on SPARK-22450: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/19950 > Safely register class for mllib > --- > > Key: SPARK-22450 > URL: https://issues.apache.org/jira/browse/SPARK-22450 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Xianyang Liu >Assignee: Xianyang Liu > Fix For: 2.3.0 > > > There are still some algorithms based on mllib, such as KMeans. For now, > many mllib common class (such as: Vector, DenseVector, SparseVector, Matrix, > DenseMatrix, SparseMatrix) are not registered in Kryo. So there are some > performance issues for those object serialization or deserialization. > Previously dicussed: https://github.com/apache/spark/pull/19586 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287461#comment-16287461 ] Eyal Farago commented on SPARK-21867: - [~ericvandenbergfb], looks good. A few questions though:
1. Initial number of sorters?
2. Assuming the initial number of sorters is >1, doesn't this mean you're potentially increasing the number of spills? If a single sorter would spill after N records, a multi-sorter setup will spill after N/k records (where k is the number of sorters), so doesn't this mean more spills and merges?
3. When a sorter hits the spill threshold, does it spill immediately, or will it keep going if there's available execution memory? It might make sense to spill and raise the threshold in this condition...
4. Merging spills may require some attention: compression, encryption, etc., and avoiding too many open files/buffers at the same time.
Looking forward to the PR.
> Support async spilling in UnsafeShuffleWriter > - > > Key: SPARK-21867 > URL: https://issues.apache.org/jira/browse/SPARK-21867 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Priority: Minor > Attachments: Async ShuffleExternalSorter.pdf > > > Currently, Spark tasks are single-threaded. But we see that it could greatly > improve the performance of jobs if we could multi-thread some parts of them. > For example, profiling our map tasks, which read large amounts of data from > HDFS and spill to disk, we see that we are blocked on HDFS reads and spilling > for the majority of the time. Since both these operations are IO intensive, the > average CPU consumption during the map phase is significantly low. In theory, > both HDFS reads and spilling can be done in parallel if we had additional > memory to store data read from HDFS while we are spilling the last batch read. > Let's say we have 1G of shuffle memory available per task.
Currently, in the case > of a map task, it reads from HDFS and the records are stored in the available > memory buffer. Once we hit the memory limit and there is no more space to > store the records, we sort and spill the content to disk. While we are > spilling to disk, since we do not have any available memory, we cannot read > from HDFS concurrently. > Here we propose supporting async spilling for UnsafeShuffleWriter, so that we > can keep reading from HDFS while the sort and spill happen > asynchronously. Let's say the total 1G of shuffle memory is split into > two regions - an active region and a spilling region - each of size 500 MB. We > start by reading from HDFS and filling the active region. Once we hit the > limit of the active region, we issue an asynchronous spill, while flipping the > active region and the spilling region. While the spill is happening > asynchronously, we still have 500 MB of memory available to read the data > from HDFS. This way we can amortize the high disk/network IO cost during > spilling. > We made a prototype hack to implement this feature and saw our map > tasks run as much as 40% faster.
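The flip-the-regions scheme described above can be sketched as a small double-buffered structure. This is a hedged illustration only, not Spark's actual UnsafeShuffleWriter: the class and method names are invented, "records" are plain ints, region capacity is counted in records rather than bytes, and "spilling" just sorts a batch on a background thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Double-buffered async spilling: when the active region fills, it is handed
// to a background thread to sort and "spill" while a fresh region keeps
// accepting records, so ingest and spilling overlap.
public class AsyncSpillBuffer {
    private final int regionCapacity;                 // records per region
    private final ExecutorService spillPool = Executors.newSingleThreadExecutor();
    private final List<List<Integer>> spills = new ArrayList<>(); // spilled, sorted batches
    private List<Integer> active = new ArrayList<>();
    private Future<?> pendingSpill = CompletableFuture.completedFuture(null);

    public AsyncSpillBuffer(int regionCapacity) {
        this.regionCapacity = regionCapacity;
    }

    public void insert(int record) {
        if (active.size() == regionCapacity) {
            await(pendingSpill);                      // at most one spill in flight
            final List<Integer> spilling = active;    // flip the two regions
            active = new ArrayList<>();
            pendingSpill = spillPool.submit(() -> {
                spilling.sort(null);                  // sort-and-spill off the insert path
                synchronized (spills) { spills.add(spilling); }
            });
        }
        active.add(record);
    }

    /** Waits for any in-flight spill and returns the spilled batches. */
    public List<List<Integer>> close() {
        await(pendingSpill);
        spillPool.shutdown();
        return spills;
    }

    private static void await(Future<?> f) {
        try { f.get(); } catch (Exception e) { throw new RuntimeException(e); }
    }
}
```

With two equal regions, inserts only block when a new spill is requested while the previous one has not finished, which is exactly the amortization the proposal describes.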
[jira] [Commented] (SPARK-22761) 64KB JVM bytecode limit problem with GLM
[ https://issues.apache.org/jira/browse/SPARK-22761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287394#comment-16287394 ] Marco Gaido commented on SPARK-22761: - This is probably solved by SPARK-6. Can you try to reproduce it on the current master branch? > 64KB JVM bytecode limit problem with GLM > > > Key: SPARK-22761 > URL: https://issues.apache.org/jira/browse/SPARK-22761 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Alan Lai > > {code:java} > GLM > {code} (and presumably other mllib tools) > can throw an exception due to the 64KB JVM bytecode limit when used with > a lot of variables/arguments (~2k).
[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.
[ https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-22760: -- Description: Using {{SPARK-14228}}, I still see a problem: {noformat} 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63. 17/12/12 15:34:45 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356) at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} and sometimes the problem below also occurs: {noformat} 17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped 17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/12/11 15:50:53 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Unsupported message OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container container_e05_1512975871311_0007_01_69 exited because of a YARN event (e.g., pre-emption) and not because of an error in the running job.)) from 101.8.73.53:42930 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. 
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at scala.util.Success.foreach(Try.scala:236) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) {noformat} I analyzed the reason. When the number of executors is big, and YarnSchedulerBackend.stopped=False after YarnSchedulerBackend.stop()
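The point of the report is that during this shutdown race, messages arriving after the scheduler endpoint has stopped are expected, so surfacing them at ERROR is misleading. A hedged sketch of the alternative behaviour (the class is invented for illustration and is not Spark's actual Dispatcher): once stopped, drop late messages with a warning rather than raising an error.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.logging.Logger;

// Minimal inbox that tolerates messages arriving after stop(): instead of
// throwing "Could not find CoarseGrainedScheduler", it logs a warning and
// drops the message, since peers may still be disconnecting during shutdown.
public class StoppableInbox {
    private static final Logger log = Logger.getLogger(StoppableInbox.class.getName());
    private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private volatile boolean stopped = false;

    /** Returns true if the message was accepted, false if dropped after stop. */
    public boolean postMessage(String message) {
        if (stopped) {
            // Benign during shutdown: warn and drop instead of erroring out.
            log.warning("Endpoint already stopped; dropping message: " + message);
            return false;
        }
        return queue.offer(message);
    }

    public void stop() { stopped = true; }

    public int pendingMessages() { return queue.size(); }
}
```

Messages posted before stop() are queued normally; anything posted afterwards is acknowledged as dropped, so callers can react without an exception unwinding through the message loop.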
[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.
[ https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-22760: -- Description: Using SPARK-14228 {{monospaced text}} , i still find a problem: {noformat} 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63. 17/12/12 15:34:45 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356) at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} and sometimes, the below problem is also exists: {noformat} 17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped 17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/12/11 15:50:53 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Unsupported message OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container container_e05_1512975871311_0007_01_69 exited because of a YARN event (e.g., pre-emption) and not because of an error in the running job.)) from 101.8.73.53:42930 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. 
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at scala.util.Success.foreach(Try.scala:236) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) {noformat} I analysis this reason. When
[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.
[ https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-22760: -- Description: Using SPARK-14228 , i still find a problem: {noformat} 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63. 17/12/12 15:34:45 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356) at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} and sometimes, the below problem is also exists: {noformat} 17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped 17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/12/11 15:50:53 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Unsupported message OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container container_e05_1512975871311_0007_01_69 exited because of a YARN event (e.g., pre-emption) and not because of an error in the running job.)) from 101.8.73.53:42930 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. 
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at scala.util.Success.foreach(Try.scala:236) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) {noformat} I analysis this reason. When the number of
[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.
[ https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-22760: -- Description: Using SPARK-14228 , i still find a problem: {noformat} 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63. 17/12/12 15:34:45 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356) at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} and sometimes, the below problem is also exists: {noformat} 17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped 17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/12/11 15:50:53 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Unsupported message OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container container_e05_1512975871311_0007_01_69 exited because of a YARN event (e.g., pre-emption) and not because of an error in the running job.)) from 101.8.73.53:42930 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. 
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at scala.util.Success.foreach(Try.scala:236) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) {noformat} I analysis this reason. When the number of
[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.
[ https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-22760: -- Description: Use SPARK-14228 , i find a problem: {panel:title=My title} Some text with a title {panel} {noformat} *no* further _formatting_ is done here {noformat} @ ||Heading 1||Heading 2|| |Mention someone by typing their name...|Col A2| 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63. 17/12/12 15:34:45 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356) at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at 
org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) and sometimes, 17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped 17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/12/11 15:50:53 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Unsupported message OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container container_e05_1512975871311_0007_01_69 exited because of a YARN event (e.g., pre-emption) and not because of an error in the running job.)) from 101.8.73.53:42930 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. 
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at scala.util.Success.foreach(Try.scala:236) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) at
[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.
[ https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-22760: -- Description: Use SPARK-14228 , i find a problem: ^superscript text^ ~subscript text~ ??citation?? 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63. 17/12/12 15:34:45 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356) at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) and sometimes, 17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped 17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/12/11 15:50:53 ERROR Inbox: Ignoring error org.apache.spark.SparkException: Unsupported message OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container container_e05_1512975871311_0007_01_69 exited because of a YARN event (e.g., pre-emption) and not because of an error in the running job.)) from 101.8.73.53:42930 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. 
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255) at scala.util.Success.foreach(Try.scala:236) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206) I analysis this reason. When the number of executors is big, and
[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.
[ https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-22760: -- Description: Using SPARK-14228, I found a problem:
{noformat}
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
17/12/12 15:34:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped.
	at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
	at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
	at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
	at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
	at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{noformat}
and sometimes:
{noformat}
17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped
17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/12/11 15:50:53 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Unsupported message OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container container_e05_1512975871311_0007_01_69 exited because of a YARN event (e.g., pre-emption) and not because of an error in the running job.)) from 101.8.73.53:42930
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
	at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
	at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186)
	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512)
	at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
	at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
	at scala.util.Success.foreach(Try.scala:236)
	at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
	at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
{noformat}
I analyzed the reason: when the number of executors is large, and YarnSchedulerBackend.stopped=False after YarnSchedulerBackend.stop()
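The race in this report is that executor-lost handlers keep posting RPC messages after the scheduler backend has begun shutting down. A minimal illustrative sketch of the mitigation the reporter implies (downgrading the expected shutdown-time failure to a warning) might look like the following. The class and flag names here are made up for illustration; they are not Spark's actual API:

{noformat}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical wrapper: tolerate RPC sends that race with driver shutdown.
class ShutdownTolerantSender(log: String => Unit) {
  private val stopped = new AtomicBoolean(false)

  def stop(): Unit = stopped.set(true)

  // post() stands in for posting a one-way RPC message. Once stop() has
  // been called, a "could not find endpoint" failure is expected, so it
  // is logged at WARN instead of propagating as an ERROR.
  def send(message: Any)(post: Any => Unit): Unit =
    try post(message)
    catch {
      case e: IllegalStateException if stopped.get() =>
        log(s"WARN: ignoring '${e.getMessage}' raised after stop()")
      // before stop() the same failure is unexpected and should propagate
    }
}
{noformat}

The key design point is that the stopped flag must be set before the endpoints are torn down, otherwise the guard cannot distinguish a shutdown race from a real failure.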
[jira] [Commented] (SPARK-22752) FileNotFoundException while reading from Kafka
[ https://issues.apache.org/jira/browse/SPARK-22752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287305#comment-16287305 ] Marco Gaido commented on SPARK-22752: - Hi [~zsxwing], thanks for looking at this. The checkpointDir is on HDFS. Inside it we have:
{noformat}
metadata (0.1 kB)
commits
offsets
sources
state
{noformat}
where {{commits}} contains files from {{0}} to {{13}}, {{offsets}} from {{0}} to {{14}}, {{sources}} contains only one file, which is {{0/0}}, and {{state}} contains only the directory {{0/0}}, which is empty. Do you need any other information? Do you have any idea what the root cause of the issue might be? Thanks. > FileNotFoundException while reading from Kafka > -- > > Key: SPARK-22752 > URL: https://issues.apache.org/jira/browse/SPARK-22752 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Marco Gaido > > We are running a stateful structured streaming job which reads from Kafka and > writes to HDFS, and we are hitting this exception: > {noformat} > 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): > java.lang.IllegalStateException: Error reading delta file > /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, > part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta > does not exist > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) > at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121) > {noformat} > Of course, the file doesn't exist in HDFS, and in the {{state/0/0}} directory > there is no file at all, while we do have some files in the commits and offsets > folders. I am not sure about the reason for this behavior. It seems to happen > the second time the job is started, after the first run failed, so it > looks like task failures can trigger it. Or it might be related to > watermarks, since there were some problems with the incoming data for > which the watermark was filtering out all the incoming data. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
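For readers following the paths in the trace above: the checkpoint layout can be inferred from the error message itself, which names {{/checkpointDir/state/0/0/1.delta}} for {{HDFSStateStoreProvider[id = (op=0, part=0)]}}. A small sketch of that naming convention, derived only from the paths quoted in this report (not from Spark's source):

{noformat}
// Inferred layout: state/<operatorId>/<partitionId>/<version>.delta
def deltaPath(checkpointDir: String, op: Int, part: Int, version: Long): String =
  s"$checkpointDir/state/$op/$part/$version.delta"

// The failing read in the report corresponds to:
// deltaPath("/checkpointDir", 0, 0, 1) == "/checkpointDir/state/0/0/1.delta"
{noformat}

This is why an empty {{state/0/0}} directory is fatal: on restart the provider expects to replay version 1's delta file from exactly that path.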
[jira] [Commented] (SPARK-22749) df.na().fill(0) over an unnamed FIRST() fails the query
[ https://issues.apache.org/jira/browse/SPARK-22749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287251#comment-16287251 ] Hyukjin Kwon commented on SPARK-22749: -- ping [~efrat.s] > df.na().fill(0) over an unnamed FIRST() fails the query > --- > > Key: SPARK-22749 > URL: https://issues.apache.org/jira/browse/SPARK-22749 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Efrat > > When I created a query with FIRST() and didn't rename the field I queried > over, applying .na().fill(0) generated syntax that failed the query. > Renaming the field using AS solved it, but I guess it's a bug in fillNA that > doesn't support it.
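The failing pattern and the AS workaround described in this report can be sketched as follows. This is an illustrative reconstruction, assuming a SparkSession {{spark}} and a DataFrame {{df}} with columns {{k}} and {{v}}; the column names are hypothetical:

{noformat}
import org.apache.spark.sql.functions.first

// Without an alias, the aggregate column gets an auto-generated name;
// the reporter observed that na.fill(0) then produces SQL that fails.
val unnamed = df.groupBy("k").agg(first("v"))
// unnamed.na.fill(0)   // reported to fail the query

// Workaround from the report: name the column explicitly with an alias.
val fixed = df.groupBy("k").agg(first("v").as("first_v")).na.fill(0)
{noformat}

The workaround works because {{na.fill}} must reference each numeric column by name in the rewritten plan, and an explicit alias gives it a stable, quotable identifier.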