[jira] [Created] (SPARK-22769) When driver stopping, there is problem: RpcEnv already stopped

2017-12-12 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-22769:
-

 Summary: When driver stopping, there is problem: RpcEnv already 
stopped
 Key: SPARK-22769
 URL: https://issues.apache.org/jira/browse/SPARK-22769
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: KaiXinXIaoLei


I run "spark-sql --master yarn  --num-executors 1000 -f createTable.sql". When 
task is finished, there is a error: 
org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. I think 
the log level should be warning, not error.
{noformat}
17/12/12 18:20:44 INFO MemoryStore: MemoryStore cleared
17/12/12 18:20:44 INFO BlockManager: BlockManager stopped
17/12/12 18:20:44 INFO BlockManagerMaster: BlockManagerMaster stopped
17/12/12 18:20:44 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() for one-way message.
org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:152)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570)
at 
org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
{noformat}
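A minimal sketch of the kind of change suggested above (illustrative only, not the 
actual Spark patch; RpcEnvStoppedException is private[spark], so a stand-in class 
is used here):
{code}
import org.slf4j.LoggerFactory

object OneWayMessageHandlerSketch {
  private val log = LoggerFactory.getLogger(getClass)

  // Stand-in for org.apache.spark.rpc.RpcEnvStoppedException, which is private[spark].
  class RpcEnvStoppedException extends IllegalStateException("RpcEnv already stopped.")

  // Deliver a one-way message, but treat "RpcEnv already stopped" as an expected
  // shutdown race: drop the message and log at WARN instead of ERROR.
  def postOneWay(message: Any)(deliver: Any => Unit): Unit = {
    try {
      deliver(message)
    } catch {
      case e: RpcEnvStoppedException =>
        log.warn(s"Dropping one-way message $message: ${e.getMessage}")
    }
  }
}
{code}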






[jira] [Resolved] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided

2017-12-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22042.
-
   Resolution: Fixed
 Assignee: Tejas Patil
Fix Version/s: 2.3.0

> ReorderJoinPredicates can break when child's partitioning is not decided
> 
>
> Key: SPARK-22042
> URL: https://issues.apache.org/jira/browse/SPARK-22042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3.0
>
>
> When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its 
> children, the children may not be fully constructed yet, as the child subtree 
> still has to go through other planner rules.
> In this particular case, the child is `SortMergeJoinExec`. Since the required 
> `Exchange` operators are not yet in place (because `EnsureRequirements` runs 
> _after_ `ReorderJoinPredicates`), the join's children do not have their 
> partitioning defined. This breaks while creating the `PartitioningCollection` 
> here: 
> https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69
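> To illustrate the failure in isolation, here is a minimal sketch using Spark's 
> internal catalyst classes (the partition counts are made up): building a 
> `PartitioningCollection` from child partitionings with different numPartitions 
> hits the same requirement failure shown in the stack trace below.
> {code}
> import org.apache.spark.sql.catalyst.expressions.Literal
> import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, PartitioningCollection}
>
> // Before EnsureRequirements has inserted the Exchange operators, the two join
> // children can report different numbers of partitions.
> val left  = HashPartitioning(Seq(Literal(1)), numPartitions = 8)
> val right = HashPartitioning(Seq(Literal(1)), numPartitions = 200)
>
> // Throws: requirement failed: PartitioningCollection requires all of its
> // partitionings have the same numPartitions.
> PartitioningCollection(Seq(left, right))
> {code}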
> Small repro:
> {noformat}
> context.sql("SET spark.sql.autoBroadcastJoinThreshold=0")
> val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", 
> "k")
> df.write.format("parquet").saveAsTable("table1")
> df.write.format("parquet").saveAsTable("table2")
> df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table")
> sql("""
>   SELECT *
>   FROM (
> SELECT a.i, a.j, a.k
> FROM bucketed_table a
> JOIN table1 b
> ON a.i = b.i
>   ) c
>   JOIN table2
>   ON c.i = table2.i
> """).explain
> {noformat}
> This fails with :
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: 
> PartitioningCollection requires all of its partitionings have the same 
> numPartitions.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:324)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69)
>   at 
> org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:147)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset.explain(Dataset.scala:464)
>   at 

[jira] [Resolved] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2017-12-12 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22700.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19894
[https://github.com/apache/spark/pull/19894]

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: zhengruifeng
> Fix For: 2.3.0
>
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, 
> Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new 
> Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly 
> dropped, even though column 'b' is not an input column.
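> For reference, the expected output (a hedged sketch, computed by hand from the 
> splits above: only the row with NaN in the input column 'a' should be skipped, 
> and 6.7 falls into bucket 1.0):
> {noformat}
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> |6.7|NaN|1.0|
> +---+---+---+
> {noformat}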






[jira] [Assigned] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2017-12-12 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22700:
--

Assignee: zhengruifeng

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
> Fix For: 2.3.0
>
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, 
> Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new 
> Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly 
> dropped, even though column 'b' is not an input column.






[jira] [Updated] (SPARK-19809) NullPointerException on zero-size ORC file

2017-12-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19809:

Component/s: (was: Input/Output)
 SQL

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> When reading from a Hive ORC table, if there are some zero-byte files we get a 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   

[jira] [Resolved] (SPARK-19809) NullPointerException on zero-size ORC file

2017-12-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19809.
-
Resolution: Fixed

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> When reading from a Hive ORC table, if there are some zero-byte files we get a 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> 

[jira] [Commented] (SPARK-20849) Document R DecisionTree

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288740#comment-16288740
 ] 

Apache Spark commented on SPARK-20849:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/19963

> Document R DecisionTree
> ---
>
> Key: SPARK-20849
> URL: https://issues.apache.org/jira/browse/SPARK-20849
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SparkR
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.3.0
>
>
> Add a SparkR example for {{decisionTree}}.






[jira] [Resolved] (SPARK-22644) Make ML testsuite support StructuredStreaming test

2017-12-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22644.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19843
[https://github.com/apache/spark/pull/19843]

> Make ML testsuite support StructuredStreaming test
> --
>
> Key: SPARK-22644
> URL: https://issues.apache.org/jira/browse/SPARK-22644
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.3.0
>
>
> We need to add some helper code to make testing ML transformers & models 
> easier with streaming data. These tests should help us catch any remaining 
> issues, and we could encourage future PRs to use them to prevent new 
> Models & Transformers from having such issues.
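> A minimal sketch of that kind of streaming test (illustrative only, not the 
> helper API added by the PR; it assumes a local SparkSession and uses the 
> internal MemoryStream source):
> {code}
> import org.apache.spark.ml.feature.Binarizer
> import org.apache.spark.sql.{SparkSession, SQLContext}
> import org.apache.spark.sql.execution.streaming.MemoryStream
>
> val spark = SparkSession.builder().master("local[2]").appName("ml-streaming-sketch").getOrCreate()
> import spark.implicits._
> implicit val sqlCtx: SQLContext = spark.sqlContext
>
> // Feed rows through a streaming source, run a simple ML Transformer on them,
> // and collect the result from a memory sink to compare with the batch output.
> val input = MemoryStream[Double]
> val binarizer = new Binarizer().setInputCol("value").setOutputCol("bin").setThreshold(0.5)
> val query = binarizer.transform(input.toDF())
>   .writeStream.format("memory").queryName("streamed").start()
> input.addData(0.1, 0.9)
> query.processAllAvailable()
> spark.table("streamed").show()   // expect rows (0.1, 0.0) and (0.9, 1.0)
> query.stop()
> {code}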






[jira] [Updated] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2017-12-12 Thread Prateek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prateek updated SPARK-22711:

Affects Version/s: 2.2.1

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown.
> It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File 

[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2017-12-12 Thread Prateek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288722#comment-16288722
 ] 

Prateek commented on SPARK-22711:
-

This is not linear regression. Do you want me to send the code?

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown.
> It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list

[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2017-12-12 Thread Prateek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288714#comment-16288714
 ] 

Prateek commented on SPARK-22711:
-

With regular Python it's working fine.
I also tried with the latest Spark, 2.2.1.

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown.
> It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, 

[jira] [Resolved] (SPARK-22768) Spark SQL alter table and beeline alter table Metadata is different

2017-12-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22768.
---
Resolution: Invalid

It's very unclear what you're asking, but this should start on the mailing list.

> Spark SQL alter table and beeline alter table Metadata is different
> ---
>
> Key: SPARK-22768
> URL: https://issues.apache.org/jira/browse/SPARK-22768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 + hadoop 2.6 + hivemeta (2)
>Reporter: 吴志龙
>
> Steps to reproduce the metadata problem (I have two Hive metastore servers):
>   1. Create a new table with 2 fields in beeline (connected to hivemeta1).
>   2. Alter the table structure in spark-sql to 3 fields (connected to hivemeta2).
>   3. Query the data in beeline: the query reports that the table has three fields 
> and fails with java.lang.IndexOutOfBoundsException: toIndex = 3, and DESC TABLE 
> does not show the new column.
> Questions: 1. The beeline metadata is not synchronized.
> 2. This looks like a Hive metastore data synchronization problem.
> 3. This is a very serious problem.
> This URL cannot solve my problem:
>  https://issues.apache.org/jira/browse/SPARK-18355






[jira] [Created] (SPARK-22768) Spark SQL alter table and beeline alter table Metadata is different

2017-12-12 Thread JIRA
吴志龙 created SPARK-22768:
---

 Summary: Spark SQL alter table and beeline alter table Metadata is 
different
 Key: SPARK-22768
 URL: https://issues.apache.org/jira/browse/SPARK-22768
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
 Environment: Spark 2.2 + hadoop 2.6 + hivemeta (2)
Reporter: 吴志龙


Steps to reproduce the metadata problem (I have two Hive metastore servers):
  1. Create a new table with 2 fields in beeline (connected to hivemeta1).
  2. Alter the table structure in spark-sql to 3 fields (connected to hivemeta2).
  3. Query the data in beeline: the query reports that the table has three fields 
and fails with java.lang.IndexOutOfBoundsException: toIndex = 3, and DESC TABLE 
does not show the new column.
Questions: 1. The beeline metadata is not synchronized.
2. This looks like a Hive metastore data synchronization problem.
3. This is a very serious problem.
This URL cannot solve my problem:
 https://issues.apache.org/jira/browse/SPARK-18355






[jira] [Assigned] (SPARK-22767) use ctx.addReferenceObj in InSet and ScalaUDF

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22767:


Assignee: Apache Spark  (was: Wenchen Fan)

> use ctx.addReferenceObj in InSet and ScalaUDF
> -
>
> Key: SPARK-22767
> URL: https://issues.apache.org/jira/browse/SPARK-22767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-22767) use ctx.addReferenceObj in InSet and ScalaUDF

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22767:


Assignee: Wenchen Fan  (was: Apache Spark)

> use ctx.addReferenceObj in InSet and ScalaUDF
> -
>
> Key: SPARK-22767
> URL: https://issues.apache.org/jira/browse/SPARK-22767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-22767) use ctx.addReferenceObj in InSet and ScalaUDF

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288670#comment-16288670
 ] 

Apache Spark commented on SPARK-22767:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19962

> use ctx.addReferenceObj in InSet and ScalaUDF
> -
>
> Key: SPARK-22767
> URL: https://issues.apache.org/jira/browse/SPARK-22767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Created] (SPARK-22767) use ctx.addReferenceObj in InSet and ScalaUDF

2017-12-12 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22767:
---

 Summary: use ctx.addReferenceObj in InSet and ScalaUDF
 Key: SPARK-22767
 URL: https://issues.apache.org/jira/browse/SPARK-22767
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Commented] (SPARK-22496) beeline display operation log

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288642#comment-16288642
 ] 

Apache Spark commented on SPARK-22496:
--

User 'ChenjunZou' has created a pull request for this issue:
https://github.com/apache/spark/pull/19961

> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, when an end user runs queries in beeline or in Hue through the STS, 
> no logs are displayed; the end user just waits until the job finishes or fails. 
> Progress information is needed to show end users how the job is running if 
> they are not familiar with the YARN RM or the standalone Spark master UI.






[jira] [Assigned] (SPARK-22600) Fix 64kb limit for deeply nested expressions under wholestage codegen

2017-12-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22600:
---

Assignee: Liang-Chi Hsieh

> Fix 64kb limit for deeply nested expressions under wholestage codegen
> -
>
> Key: SPARK-22600
> URL: https://issues.apache.org/jira/browse/SPARK-22600
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> This is an extension of SPARK-22543 to fix the 64KB compile error for deeply 
> nested expressions under whole-stage codegen.
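> A hedged repro sketch of the class of failure being fixed (the expression depth 
> and column name are made up; with enough nesting, the single generated method 
> produced by whole-stage codegen can exceed the JVM's 64KB method limit):
> {code}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.{col, lit}
>
> val spark = SparkSession.builder().master("local[2]").appName("codegen-64kb-sketch").getOrCreate()
>
> // Build one deeply nested arithmetic expression over the "id" column.
> val deep = (1 to 3000).foldLeft(col("id"))((acc, _) => acc + lit(1))
> spark.range(1).select(deep.as("nested")).collect()
> {code}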






[jira] [Resolved] (SPARK-22600) Fix 64kb limit for deeply nested expressions under wholestage codegen

2017-12-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22600.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19813
[https://github.com/apache/spark/pull/19813]

> Fix 64kb limit for deeply nested expressions under wholestage codegen
> -
>
> Key: SPARK-22600
> URL: https://issues.apache.org/jira/browse/SPARK-22600
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> This is an extension of SPARK-22543 to fix the 64KB compile error for deeply 
> nested expressions under whole-stage codegen.






[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file

2017-12-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288623#comment-16288623
 ] 

Dongjoon Hyun commented on SPARK-19809:
---

Since the Hive 1.2.1 library code path still has this problem, users may hit it 
when spark.sql.hive.convertMetastoreOrc=false. However, after SPARK-22279, 
Apache Spark with the default configuration doesn't hit this bug. The PR adds 
test coverage for `convertMetastoreOrc=true` (the default) on both the `native` and 
`hive` ORC implementations in order to prevent regression.
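For reference, a minimal sketch of the two read paths (assumes a Hive-enabled 
SparkSession named `spark`; the table name is hypothetical):
{code}
// Default path after SPARK-22279: metastore ORC tables are converted and read with
// Spark's own ORC support, which does not hit the NPE on zero-size files.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.sql("SELECT * FROM some_orc_table").show()

// Forcing the Hive 1.2.1 reader path may still hit the NullPointerException above
// when the table directory contains zero-byte files.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.sql("SELECT * FROM some_orc_table").show()
{code}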


> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> When reading from a Hive ORC table, if there are some zero-byte files we get a 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at 

[jira] [Assigned] (SPARK-19809) NullPointerException on zero-size ORC file

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19809:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Apache Spark
> Fix For: 2.3.0
>
>
> When reading from a Hive ORC table, if there are some zero-byte files we get a 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> 

[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288622#comment-16288622
 ] 

Apache Spark commented on SPARK-19809:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19960

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 

[jira] [Assigned] (SPARK-19809) NullPointerException on zero-size ORC file

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19809:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> 

[jira] [Resolved] (SPARK-22716) revisit ctx.addReferenceObj

2017-12-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22716.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19916
[https://github.com/apache/spark/pull/19916]

> revisit ctx.addReferenceObj
> ---
>
> Key: SPARK-22716
> URL: https://issues.apache.org/jira/browse/SPARK-22716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
> Fix For: 2.3.0
>
>
> `ctx.addReferenceObj` creates a global variable just for referring to an object, 
> which seems like overkill. We should revisit it and always use 
> `ctx.addReferenceMinorObj`.
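
For illustration, a minimal sketch of the pattern under discussion: a codegen context that emits one mutable global field per referenced object versus one that returns an inline cast into a shared references array. The class and method bodies below are simplified stand-ins, not Spark's actual CodegenContext.

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for a codegen context; not Spark's CodegenContext.
class MiniCodegenContext {
  val references = new ArrayBuffer[Any]()
  val globalFields = new ArrayBuffer[String]()

  // Heavier pattern: emit a mutable global field that caches the dereferenced object.
  def addReferenceObj(name: String, obj: Any, className: String): String = {
    val idx = references.length
    references += obj
    val field = s"$name$idx"
    globalFields += s"private $className $field = ($className) references[$idx];"
    field
  }

  // Lighter pattern: return an inline cast into the references array,
  // so no extra mutable state is generated.
  def addReferenceMinorObj(obj: Any, className: String): String = {
    val idx = references.length
    references += obj
    s"(($className) references[$idx])"
  }
}

object Demo extends App {
  val ctx = new MiniCodegenContext
  println(ctx.addReferenceObj("expr", "someState", "java.lang.String"))  // expr0
  println(ctx.addReferenceMinorObj("otherState", "java.lang.String"))    // ((java.lang.String) references[1])
  ctx.globalFields.foreach(println)  // only the first call produced a global field
}
{code}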



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22716) Avoid the creation of mutable states in addReferenceObj

2017-12-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22716:
---

Assignee: Marco Gaido

> Avoid the creation of mutable states in addReferenceObj
> ---
>
> Key: SPARK-22716
> URL: https://issues.apache.org/jira/browse/SPARK-22716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> `ctx.addReferenceObj` creates a global variable just for referring to an object, 
> which seems like overkill. We should revisit it and always use 
> `ctx.addReferenceMinorObj`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22716) Avoid the creation of mutable states in addReferenceObj

2017-12-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22716:

Summary: Avoid the creation of mutable states in addReferenceObj  (was: 
revisit ctx.addReferenceObj)

> Avoid the creation of mutable states in addReferenceObj
> ---
>
> Key: SPARK-22716
> URL: https://issues.apache.org/jira/browse/SPARK-22716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
> Fix For: 2.3.0
>
>
> `ctx.addReferenceObj` creates a global variable just for referring to an object, 
> which seems like overkill. We should revisit it and always use 
> `ctx.addReferenceMinorObj`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22755) Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return different results

2017-12-12 Thread Kevin Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288608#comment-16288608
 ] 

Kevin Zhang commented on SPARK-22755:
-

[~ksunitha] Thanks, it helps a lot

> Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return 
> different results
> -
>
> Key: SPARK-22755
> URL: https://issues.apache.org/jira/browse/SPARK-22755
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
> Fix For: 2.2.1
>
>
> Both of the following SQL statements
> {code:sql}
> select ((946-885)*1.000/946 < 0.1)
> {code}
> and
> {code:sql}
> select ((946-885)*1.0/946 < 0.100)
> {code}
> return true, while the following statement 
> {code:sql}
> select ((946-885)*1.0/946 < 0.1)
> {code}
> returns false
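
One way such a discrepancy can arise in decimal arithmetic, illustrated below with plain java.math.BigDecimal rather than Spark's actual type-coercion rules: if the intermediate division result is rounded to a decimal with too small a scale, 61/946 ≈ 0.0645 rounds up to 0.1, and 0.1 < 0.1 is false, while a larger scale keeps the value below 0.1.

{code:scala}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object DecimalScaleDemo extends App {
  val numerator   = new JBigDecimal("61")   // 946 - 885
  val denominator = new JBigDecimal("946")
  val threshold   = new JBigDecimal("0.1")

  // The same division carried out at two different result scales (HALF_UP rounding).
  val atScale1 = numerator.divide(denominator, 1, RoundingMode.HALF_UP) // 0.1
  val atScale3 = numerator.divide(denominator, 3, RoundingMode.HALF_UP) // 0.064

  println(s"scale 1: $atScale1 < 0.1 ? ${atScale1.compareTo(threshold) < 0}") // false
  println(s"scale 3: $atScale3 < 0.1 ? ${atScale3.compareTo(threshold) < 0}") // true
}
{code}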



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19809) NullPointerException on zero-size ORC file

2017-12-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-19809:
-

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> 

[jira] [Updated] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-12-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-22574:
---
Fix Version/s: (was: 2.2.2)
   (was: 2.3.0)

> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Assignee: German Schiavon Matteo
>Priority: Minor
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts the 
> Dispatcher in a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these variables are checked, but not all of them; for example 
> appArgs and environmentVariables are not. 
> If you don't set _appArgs_, it will cause the following error: 
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This is because the code accesses appArgs without checking whether it is null.
>  
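
One straightforward mitigation is to validate the request before it reaches the scheduler. The sketch below is illustrative only: the case class mirrors the fields listed above, but the validator and its method names are assumptions, not the actual Spark REST submission code.

{code:scala}
// Illustrative sketch only; not the actual Spark REST submission code.
case class SubmissionRequest(
    appResource: String,
    mainClass: String,
    appArgs: Array[String],
    sparkProperties: Map[String, String],
    environmentVariables: Map[String, String])

object SubmissionValidator {
  /** Returns an error message for each required field that is missing. */
  def validate(req: SubmissionRequest): Seq[String] = {
    val checks = Seq(
      Option(req.appResource).isEmpty          -> "appResource is not set",
      Option(req.mainClass).isEmpty            -> "mainClass is not set",
      Option(req.appArgs).isEmpty              -> "appArgs is not set",
      Option(req.sparkProperties).isEmpty      -> "sparkProperties is not set",
      Option(req.environmentVariables).isEmpty -> "environmentVariables is not set")
    checks.collect { case (missing, msg) if missing => msg }
  }
}

object ValidateDemo extends App {
  val bad = SubmissionRequest("app.jar", "Main", appArgs = null,
    sparkProperties = Map.empty, environmentVariables = null)
  // Rejecting such a request up front avoids a NullPointerException deep in the scheduler.
  SubmissionValidator.validate(bad).foreach(println)
}
{code}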



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-12-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-22574:


> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Assignee: German Schiavon Matteo
>Priority: Minor
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts the 
> Dispatcher in a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these variables are checked, but not all of them; for example 
> appArgs and environmentVariables are not. 
> If you don't set _appArgs_, it will cause the following error: 
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This is because the code accesses appArgs without checking whether it is null.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-12-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22574:
--

Assignee: (was: German Schiavon Matteo)

> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Priority: Minor
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts the 
> Dispatcher in a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these variables are checked, but not all of them; for example 
> appArgs and environmentVariables are not. 
> If you don't set _appArgs_, it will cause the following error: 
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This is because the code accesses appArgs without checking whether it is null.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-12-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288506#comment-16288506
 ] 

Marcelo Vanzin commented on SPARK-22574:


(Commit above was reverted.)

> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Priority: Minor
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts the 
> Dispatcher in a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these variables are checked, but not all of them; for example 
> appArgs and environmentVariables are not. 
> If you don't set _appArgs_, it will cause the following error: 
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This is because the code accesses appArgs without checking whether it is null.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22766) Install R linter package in spark lib directory

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22766:


Assignee: Apache Spark

> Install R linter package in spark lib directory
> ---
>
> Key: SPARK-22766
> URL: https://issues.apache.org/jira/browse/SPARK-22766
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Hossein Falaki
>Assignee: Apache Spark
>
> The {{dev/lint-r.R}} script uses devtools to install the {{jimhester/lintr}} 
> package into the default site library location, which is 
> {{/usr/local/lib/R/site-library}}. This is not recommended and can fail 
> because we run this script as the jenkins user while that directory is owned 
> by root.
> We need to install the linter package in a local directory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22766) Install R linter package in spark lib directory

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288428#comment-16288428
 ] 

Apache Spark commented on SPARK-22766:
--

User 'falaki' has created a pull request for this issue:
https://github.com/apache/spark/pull/19959

> Install R linter package in spark lib directory
> ---
>
> Key: SPARK-22766
> URL: https://issues.apache.org/jira/browse/SPARK-22766
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Hossein Falaki
>
> The {{dev/lint-r.R}} script uses devtools to install the {{jimhester/lintr}} 
> package into the default site library location, which is 
> {{/usr/local/lib/R/site-library}}. This is not recommended and can fail 
> because we run this script as the jenkins user while that directory is owned 
> by root.
> We need to install the linter package in a local directory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22766) Install R linter package in spark lib directory

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22766:


Assignee: (was: Apache Spark)

> Install R linter package in spark lib directory
> ---
>
> Key: SPARK-22766
> URL: https://issues.apache.org/jira/browse/SPARK-22766
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Hossein Falaki
>
> The {{dev/lint-r.R}} script uses devtools to install the {{jimhester/lintr}} 
> package into the default site library location, which is 
> {{/usr/local/lib/R/site-library}}. This is not recommended and can fail 
> because we run this script as the jenkins user while that directory is owned 
> by root.
> We need to install the linter package in a local directory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21087) CrossValidator, TrainValidationSplit should collect all models when fitting: Scala API

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288426#comment-16288426
 ] 

Apache Spark commented on SPARK-21087:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/19958

> CrossValidator, TrainValidationSplit should collect all models when fitting: 
> Scala API
> --
>
> Key: SPARK-21087
> URL: https://issues.apache.org/jira/browse/SPARK-21087
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> Add a parameter that controls whether to collect the full model list when 
> CrossValidator/TrainValidationSplit is fitting (the default is not to collect, 
> so the change does not cause OOM by itself).
> Add a method to CrossValidatorModel/TrainValidationSplitModel that allows the user to 
> get the model list.
> CrossValidatorModelWriter adds an "option" that allows the user to control whether to 
> persist the model list to disk.
> Note: when persisting the model list, use indices as the sub-model path.
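
A hypothetical usage sketch of the proposal follows. The parameter, accessor, and writer-option names (setCollectSubModels, subModels, persistSubModels) are assumptions drawn from the description above, not a confirmed API, and `training` is an assumed DataFrame with "label" and "features" columns.

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setCollectSubModels(true)   // proposed flag; default would be false to avoid OOM

val cvModel = cv.fit(training)      // `training`: assumed DataFrame, defined elsewhere
val subModels = cvModel.subModels   // proposed accessor: one model per fold x param map

// Proposed writer option controlling whether sub-models are persisted to disk.
cvModel.write.option("persistSubModels", "true").save("/tmp/cv-model")
{code}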



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22766) Install R linter package in spark lib directory

2017-12-12 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-22766:
--

 Summary: Install R linter package in spark lib directory
 Key: SPARK-22766
 URL: https://issues.apache.org/jira/browse/SPARK-22766
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.1
Reporter: Hossein Falaki


The {{dev/lint-r.R}} script uses devtools to install the {{jimhester/lintr}} 
package into the default site library location, which is 
{{/usr/local/lib/R/site-library}}. This is not recommended and can fail because 
we run this script as the jenkins user while that directory is owned by root.

We need to install the linter package in a local directory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22722) Test Coverage for Type Coercion Compatibility

2017-12-12 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288425#comment-16288425
 ] 

Xiao Li commented on SPARK-22722:
-

We need to run the same test cases in Hive, so please ensure the queries you 
add can be executed in Hive with only trivial changes. 

> Test Coverage for Type Coercion Compatibility
> -
>
> Key: SPARK-22722
> URL: https://issues.apache.org/jira/browse/SPARK-22722
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>
> Hive compatibility is pretty important for users who run both Hive and Spark SQL 
> or migrate between them.
> We plan to add a SQLConf for type coercion compatibility 
> (spark.sql.typeCoercion.mode). Users can choose Spark's native mode (default) 
> or Hive mode (hive).
> Before we deliver the Hive compatibility mode, we plan to write a set of test 
> cases that can easily be run on both the Spark and Hive sides, so we can easily 
> compare whether the results are the same. When new type-coercion rules are 
> added, we can also easily track the changes. These test cases can also be 
> backported to previous Spark versions to determine the changes we 
> made.
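
If the conf ships as described, user code would presumably look like the sketch below. The conf name spark.sql.typeCoercion.mode and its values are taken from the description above and are not confirmed here; the query is just a generic coercion-sensitive example.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coercion-demo")
  .master("local[*]")
  .getOrCreate()

// Proposed conf from this ticket: "native" (default) vs "hive".
spark.conf.set("spark.sql.typeCoercion.mode", "hive")

// A query whose result can depend on type-coercion rules
// (string vs. numeric comparison is a classic case).
spark.sql("SELECT '2' > 1 AS coerced").show()
{code}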



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22722) Test Coverage for Type Coercion Compatibility

2017-12-12 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288422#comment-16288422
 ] 

Xiao Li commented on SPARK-22722:
-

[~q79969786] I assigned this ticket to you. Thanks for your work!

> Test Coverage for Type Coercion Compatibility
> -
>
> Key: SPARK-22722
> URL: https://issues.apache.org/jira/browse/SPARK-22722
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>
> Hive compatibility is pretty important for users who run both Hive and Spark SQL 
> or migrate between them.
> We plan to add a SQLConf for type coercion compatibility 
> (spark.sql.typeCoercion.mode). Users can choose Spark's native mode (default) 
> or Hive mode (hive).
> Before we deliver the Hive compatibility mode, we plan to write a set of test 
> cases that can easily be run on both the Spark and Hive sides, so we can easily 
> compare whether the results are the same. When new type-coercion rules are 
> added, we can also easily track the changes. These test cases can also be 
> backported to previous Spark versions to determine the changes we 
> made.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22722) Test Coverage for Type Coercion Compatibility

2017-12-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-22722:
---

Assignee: Yuming Wang

> Test Coverage for Type Coercion Compatibility
> -
>
> Key: SPARK-22722
> URL: https://issues.apache.org/jira/browse/SPARK-22722
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>
> Hive compatibility is pretty important for users who run both Hive and Spark SQL 
> or migrate between them.
> We plan to add a SQLConf for type coercion compatibility 
> (spark.sql.typeCoercion.mode). Users can choose Spark's native mode (default) 
> or Hive mode (hive).
> Before we deliver the Hive compatibility mode, we plan to write a set of test 
> cases that can easily be run on both the Spark and Hive sides, so we can easily 
> compare whether the results are the same. When new type-coercion rules are 
> added, we can also easily track the changes. These test cases can also be 
> backported to previous Spark versions to determine the changes we 
> made.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288414#comment-16288414
 ] 

Xuefu Zhang commented on SPARK-22765:
-

I wouldn't say that MR is static, at least not static in Spark's sense. MR 
allocates an executor for each map or reduce task, and the executor exits when 
the running task completes. This avoids the inefficiency in dynamic 
allocation where an executor has to slowly die out (an idle timeout of 0 doesn't 
really work). Secondly, this proposal doesn't really intend to go back to the MR 
paradigm. Instead, I propose a scheduling scheme similar to MR's but enhanced to 
fit into Spark's DAG execution model.

To be clear, the proposal here is not to replace dynamic allocation. Rather, it 
provides an alternative that's more efficiency-centric than dynamic allocation. 
I understand many details are still lacking, but I'd like to start a 
discussion now, and hopefully something concrete will come out soon.

> Create a new executor allocation scheme based on that of MR
> ---
>
> Key: SPARK-22765
> URL: https://issues.apache.org/jira/browse/SPARK-22765
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>
> Many users migrating their workload from MR to Spark find a significant 
> resource consumption hike (e.g., SPARK-22683). While this might not be a 
> concern for users who are more performance-centric, for others conscious 
> about cost such a hike creates a migration obstacle. This situation can get 
> worse as more users move to the cloud.
> Dynamic allocation makes it possible for Spark to be deployed in a multi-tenant 
> environment. With its performance-centric design, its inefficiency has also 
> unfortunately shown up, especially when compared with MR. Thus, it's believed 
> that an MR-styled scheduler still has merit. Based on our research, the 
> inefficiency associated with dynamic allocation comes from many aspects, such as 
> executors idling out, bigger executors, and the many stages (rather than only 2 
> stages in MR) in a Spark job.
> Rather than fine-tuning dynamic allocation for efficiency, the proposal here 
> is to add a new, efficiency-centric scheduling scheme based on that of MR. 
> Such an MR-based scheme can be further enhanced and better adapted to Spark's 
> execution model. This alternative is expected to offer a good performance 
> improvement compared to MR, with similar or even better efficiency 
> than MR.
> Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19809) NullPointerException on zero-size ORC file

2017-12-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-19809:


Assignee: Dongjoon Hyun

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> 

[jira] [Resolved] (SPARK-19809) NullPointerException on zero-size ORC file

2017-12-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19809.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19948
[https://github.com/apache/spark/pull/19948]

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
> Fix For: 2.3.0
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> 

[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming

2017-12-12 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288391#comment-16288391
 ] 

Michael Armbrust commented on SPARK-20928:
--

An update on this. We've started to create subtasks to break down the process of 
adding this new execution mode, and we are targeting an alpha version in 2.3. 
The basics of the new API for Sources and Sinks (for both microbatch and 
continuous mode) have been up for a few days if people want to see more details. 
We'll follow with PRs to add a continuous execution engine and an 
implementation of a continuous Kafka connector.

Regarding some of the questions:
 - The version we are targeting for 2.3 will only support map operations and 
thus will not support shuffles / aggregating by windows (although the window() 
operator is just a projection, so it will work for window assignment).
 - I think the API is designed in such a way that we can build a streaming 
shuffle that aligns on epochIds in the future, allowing us to easily extend the 
continuous engine to handle stateful operations as well.
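
For reference, a minimal sketch of what user code might look like once the continuous engine lands. The Trigger.Continuous API shown here is an assumption about the eventual surface based on the direction described above, the Kafka hosts and topic names are placeholders, and the pipeline is deliberately map-only per the first bullet.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("continuous-demo").getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .load()

// Map-only pipeline: no shuffles or windowed aggregations in the first cut.
val parsed = input.selectExpr("CAST(value AS STRING) AS line")

val query = parsed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "events-out")
  .option("checkpointLocation", "/tmp/continuous-checkpoint")
  .trigger(Trigger.Continuous("1 second"))  // assumed continuous trigger with a checkpoint interval
  .start()
{code}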

> SPIP: Continuous Processing Mode for Structured Streaming
> -
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>  Labels: SPIP
> Attachments: Continuous Processing in Structured Streaming Design 
> Sketch.pdf
>
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output, 
> however, it would be useful if processing could happen continuously. This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when it needs to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22764) Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22764:


Assignee: (was: Apache Spark)

> Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"
> --
>
> Key: SPARK-22764
> URL: https://issues.apache.org/jira/browse/SPARK-22764
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Saw this in a PR builder:
> {noformat}
> [info] - Cancelling stages/jobs with custom reasons. *** FAILED *** (135 
> milliseconds)
> [info]   Expected exception org.apache.spark.SparkException to be thrown, but 
> no exception was thrown (SparkContextSuite.scala:531)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
> [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
> {noformat}
> From the logs, the job is finishing before the test code cancels it:
> {noformat}
> 17/12/12 11:00:41.680 Executor task launch worker for task 1 INFO Executor: 
> Finished task 0.0 in stage 1.0 (TID 1). 703 bytes result sent to driver
> 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSetManager: Finished task 
> 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (executor driver) (1/1)
> 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSchedulerImpl: Removed 
> TaskSet 1.0, whose tasks have all completed, from pool 
> 17/12/12 11:00:41.681 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 
> 1 (apply at Assertions.scala:805) finished in 0.066 s
> 17/12/12 11:00:41.681 pool-1-thread-1-ScalaTest-running-SparkContextSuite 
> INFO DAGScheduler: Job 1 finished: apply at Assertions.scala:805, took 
> 0.066946 s
> 17/12/12 11:00:41.682 spark-listener-group-shared INFO DAGScheduler: Asked to 
> cancel job 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22764) Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22764:


Assignee: Apache Spark

> Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"
> --
>
> Key: SPARK-22764
> URL: https://issues.apache.org/jira/browse/SPARK-22764
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> Saw this in a PR builder:
> {noformat}
> [info] - Cancelling stages/jobs with custom reasons. *** FAILED *** (135 
> milliseconds)
> [info]   Expected exception org.apache.spark.SparkException to be thrown, but 
> no exception was thrown (SparkContextSuite.scala:531)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
> [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
> {noformat}
> From the logs, the job is finishing before the test code cancels it:
> {noformat}
> 17/12/12 11:00:41.680 Executor task launch worker for task 1 INFO Executor: 
> Finished task 0.0 in stage 1.0 (TID 1). 703 bytes result sent to driver
> 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSetManager: Finished task 
> 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (executor driver) (1/1)
> 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSchedulerImpl: Removed 
> TaskSet 1.0, whose tasks have all completed, from pool 
> 17/12/12 11:00:41.681 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 
> 1 (apply at Assertions.scala:805) finished in 0.066 s
> 17/12/12 11:00:41.681 pool-1-thread-1-ScalaTest-running-SparkContextSuite 
> INFO DAGScheduler: Job 1 finished: apply at Assertions.scala:805, took 
> 0.066946 s
> 17/12/12 11:00:41.682 spark-listener-group-shared INFO DAGScheduler: Asked to 
> cancel job 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21867) Support async spilling in UnsafeShuffleWriter

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21867:


Assignee: Apache Spark

> Support async spilling in UnsafeShuffleWriter
> -
>
> Key: SPARK-21867
> URL: https://issues.apache.org/jira/browse/SPARK-21867
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Apache Spark
>Priority: Minor
> Attachments: Async ShuffleExternalSorter.pdf
>
>
> Currently, Spark tasks are single-threaded, but we have seen that job performance 
> could be greatly improved if parts of a task were multi-threaded. 
> For example, profiling our map tasks, which read large amounts of data from 
> HDFS and spill to disk, we see that they are blocked on HDFS reads and on spilling 
> for the majority of the time. Since both of these operations are IO-intensive, the 
> average CPU consumption during the map phase is quite low. In theory, the HDFS 
> read and the spill could happen in parallel if we had additional memory to store 
> data read from HDFS while the last batch read is being spilled.
> Let's say we have 1 GB of shuffle memory available per task. Currently, a map task 
> reads from HDFS and stores the records in the available memory buffer. Once we 
> hit the memory limit and there is no more space to store records, we sort and 
> spill the content to disk. While we are spilling to disk, since we do not have any 
> available memory, we cannot read from HDFS concurrently.
> Here we propose supporting async spilling for UnsafeShuffleWriter, so that we can 
> keep reading from HDFS while the sort and spill happen asynchronously. Let's say 
> the total 1 GB of shuffle memory is split into two regions - an active region and 
> a spilling region - each of size 500 MB. We start by reading from HDFS and filling 
> the active region. Once we hit the limit of the active region, we issue an 
> asynchronous spill while flipping the active and spilling regions. While the spill 
> is happening asynchronously, we still have 500 MB of memory available to read 
> data from HDFS. This way we can amortize the high disk/network IO cost of 
> spilling.
> We made a prototype hack to implement this feature and saw our map tasks run as 
> much as 40% faster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288363#comment-16288363
 ] 

Apache Spark commented on SPARK-21867:
--

User 'ericvandenbergfb' has created a pull request for this issue:
https://github.com/apache/spark/pull/19955

> Support async spilling in UnsafeShuffleWriter
> -
>
> Key: SPARK-21867
> URL: https://issues.apache.org/jira/browse/SPARK-21867
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Priority: Minor
> Attachments: Async ShuffleExternalSorter.pdf
>
>
> Currently, Spark tasks are single-threaded, but we have seen that job performance 
> could be greatly improved if parts of a task were multi-threaded. 
> For example, profiling our map tasks, which read large amounts of data from 
> HDFS and spill to disk, we see that they are blocked on HDFS reads and on spilling 
> for the majority of the time. Since both of these operations are IO-intensive, the 
> average CPU consumption during the map phase is quite low. In theory, the HDFS 
> read and the spill could happen in parallel if we had additional memory to store 
> data read from HDFS while the last batch read is being spilled.
> Let's say we have 1 GB of shuffle memory available per task. Currently, a map task 
> reads from HDFS and stores the records in the available memory buffer. Once we 
> hit the memory limit and there is no more space to store records, we sort and 
> spill the content to disk. While we are spilling to disk, since we do not have any 
> available memory, we cannot read from HDFS concurrently.
> Here we propose supporting async spilling for UnsafeShuffleWriter, so that we can 
> keep reading from HDFS while the sort and spill happen asynchronously. Let's say 
> the total 1 GB of shuffle memory is split into two regions - an active region and 
> a spilling region - each of size 500 MB. We start by reading from HDFS and filling 
> the active region. Once we hit the limit of the active region, we issue an 
> asynchronous spill while flipping the active and spilling regions. While the spill 
> is happening asynchronously, we still have 500 MB of memory available to read 
> data from HDFS. This way we can amortize the high disk/network IO cost of 
> spilling.
> We made a prototype hack to implement this feature and saw our map tasks run as 
> much as 40% faster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21867) Support async spilling in UnsafeShuffleWriter

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21867:


Assignee: (was: Apache Spark)

> Support async spilling in UnsafeShuffleWriter
> -
>
> Key: SPARK-21867
> URL: https://issues.apache.org/jira/browse/SPARK-21867
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Priority: Minor
> Attachments: Async ShuffleExternalSorter.pdf
>
>
> Currently, Spark tasks are single-threaded, but we have seen that job performance 
> could be greatly improved if parts of a task were multi-threaded. 
> For example, profiling our map tasks, which read large amounts of data from 
> HDFS and spill to disk, we see that they are blocked on HDFS reads and on spilling 
> for the majority of the time. Since both of these operations are IO-intensive, the 
> average CPU consumption during the map phase is quite low. In theory, the HDFS 
> read and the spill could happen in parallel if we had additional memory to store 
> data read from HDFS while the last batch read is being spilled.
> Let's say we have 1 GB of shuffle memory available per task. Currently, a map task 
> reads from HDFS and stores the records in the available memory buffer. Once we 
> hit the memory limit and there is no more space to store records, we sort and 
> spill the content to disk. While we are spilling to disk, since we do not have any 
> available memory, we cannot read from HDFS concurrently.
> Here we propose supporting async spilling for UnsafeShuffleWriter, so that we can 
> keep reading from HDFS while the sort and spill happen asynchronously. Let's say 
> the total 1 GB of shuffle memory is split into two regions - an active region and 
> a spilling region - each of size 500 MB. We start by reading from HDFS and filling 
> the active region. Once we hit the limit of the active region, we issue an 
> asynchronous spill while flipping the active and spilling regions. While the spill 
> is happening asynchronously, we still have 500 MB of memory available to read 
> data from HDFS. This way we can amortize the high disk/network IO cost of 
> spilling.
> We made a prototype hack to implement this feature and saw our map tasks run as 
> much as 40% faster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22764) Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288364#comment-16288364
 ] 

Apache Spark commented on SPARK-22764:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19956

> Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"
> --
>
> Key: SPARK-22764
> URL: https://issues.apache.org/jira/browse/SPARK-22764
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Saw this in a PR builder:
> {noformat}
> [info] - Cancelling stages/jobs with custom reasons. *** FAILED *** (135 
> milliseconds)
> [info]   Expected exception org.apache.spark.SparkException to be thrown, but 
> no exception was thrown (SparkContextSuite.scala:531)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
> [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
> {noformat}
> From the logs, the job is finishing before the test code cancels it:
> {noformat}
> 17/12/12 11:00:41.680 Executor task launch worker for task 1 INFO Executor: 
> Finished task 0.0 in stage 1.0 (TID 1). 703 bytes result sent to driver
> 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSetManager: Finished task 
> 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (executor driver) (1/1)
> 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSchedulerImpl: Removed 
> TaskSet 1.0, whose tasks have all completed, from pool 
> 17/12/12 11:00:41.681 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 
> 1 (apply at Assertions.scala:805) finished in 0.066 s
> 17/12/12 11:00:41.681 pool-1-thread-1-ScalaTest-running-SparkContextSuite 
> INFO DAGScheduler: Job 1 finished: apply at Assertions.scala:805, took 
> 0.066946 s
> 17/12/12 11:00:41.682 spark-listener-group-shared INFO DAGScheduler: Asked to 
> cancel job 1
> {noformat}
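A rough sketch of how the race above can be closed in a test (an illustration only, not the contents of the PR): block the task on a semaphore shared through an object, so the job cannot finish before the cancellation has been issued.

{code}
import java.util.concurrent.Semaphore
import org.apache.spark.{SparkConf, SparkContext, SparkException}

// JVM-wide semaphores; referencing them through an object avoids
// closure-serialization issues in local mode.
object CancelSketchState {
  val taskStarted = new Semaphore(0)
  val cancelIssued = new Semaphore(0)
}

object CancelWithReasonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("cancel-sketch"))

    val jobThread = new Thread(new Runnable {
      override def run(): Unit = {
        try {
          sc.parallelize(1 to 1, 1).foreach { _ =>
            CancelSketchState.taskStarted.release()   // signal that the task is running
            CancelSketchState.cancelIssued.acquire()  // keep the task alive until cancelled
          }
        } catch {
          case e: SparkException => println(s"job cancelled: ${e.getMessage}")
        }
      }
    })
    jobThread.start()

    CancelSketchState.taskStarted.acquire()           // the task is definitely running now
    sc.cancelAllJobs()                                // cancellation can no longer lose the race
    CancelSketchState.cancelIssued.release()          // let the blocked task unwind
    jobThread.join()
    sc.stop()
  }
}
{code}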



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288353#comment-16288353
 ] 

Sean Owen commented on SPARK-22765:
---

I'm not clear why this needs another allocation scheme. You say dynamic 
allocation has overhead at runtime -- yes -- and M/R doesn't because it's 
static. So why not disable dynamic allocation? Things you're identifying as 
"problems" are just because Spark is a generalization; you can write a bunch of 
independent 2-stage map-reduce jobs if you want. Killing idle executors is the 
point of dynamic allocation, not a problem.

I don't see any detail on how this differs from anything else in Spark. 

> Create a new executor allocation scheme based on that of MR
> ---
>
> Key: SPARK-22765
> URL: https://issues.apache.org/jira/browse/SPARK-22765
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>
> Many users migrating their workloads from MR to Spark find a significant 
> resource consumption hike (e.g., SPARK-22683). While this might not be a 
> concern for users that are more performance-centric, for others conscious about 
> cost such a hike creates a migration obstacle. This situation can get worse as 
> more users move to the cloud.
> Dynamic allocation makes it possible for Spark to be deployed in a multi-tenant 
> environment. With its performance-centric design, its inefficiency has also 
> unfortunately shown up, especially when compared with MR. Thus, it's believed 
> that an MR-style scheduler still has its merits. Based on our research, the 
> inefficiency associated with dynamic allocation comes from many aspects, such 
> as executors idling out, bigger executors, and many stages (rather than only 2 
> stages in MR) in a Spark job.
> Rather than fine-tuning dynamic allocation for efficiency, the proposal here is 
> to add a new, efficiency-centric scheduling scheme based on that of MR. Such an 
> MR-based scheme can be further enhanced and better adapted to Spark's execution 
> model. This alternative is expected to offer a good performance improvement 
> (compared to MR) while achieving similar or even better efficiency than MR.
> Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-12 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created SPARK-22765:
---

 Summary: Create a new executor allocation scheme based on that of 
MR
 Key: SPARK-22765
 URL: https://issues.apache.org/jira/browse/SPARK-22765
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.6.0
Reporter: Xuefu Zhang


Many users migrating their workloads from MR to Spark find a significant resource 
consumption hike (e.g., SPARK-22683). While this might not be a concern for users 
that are more performance-centric, for others conscious about cost such a hike 
creates a migration obstacle. This situation can get worse as more users move to 
the cloud.

Dynamic allocation makes it possible for Spark to be deployed in a multi-tenant 
environment. With its performance-centric design, its inefficiency has also 
unfortunately shown up, especially when compared with MR. Thus, it's believed that 
an MR-style scheduler still has its merits. Based on our research, the inefficiency 
associated with dynamic allocation comes from many aspects, such as executors 
idling out, bigger executors, and many stages (rather than only 2 stages in MR) in 
a Spark job.

Rather than fine-tuning dynamic allocation for efficiency, the proposal here is to 
add a new, efficiency-centric scheduling scheme based on that of MR. Such an 
MR-based scheme can be further enhanced and better adapted to Spark's execution 
model. This alternative is expected to offer a good performance improvement 
(compared to MR) while achieving similar or even better efficiency than MR.

Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22764) Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"

2017-12-12 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-22764:
--

 Summary: Flaky test: SparkContextSuite "Cancelling stages/jobs 
with custom reasons"
 Key: SPARK-22764
 URL: https://issues.apache.org/jira/browse/SPARK-22764
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin
Priority: Minor


Saw this in a PR builder:

{noformat}
[info] - Cancelling stages/jobs with custom reasons. *** FAILED *** (135 
milliseconds)
[info]   Expected exception org.apache.spark.SparkException to be thrown, but 
no exception was thrown (SparkContextSuite.scala:531)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
[info]   at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
[info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)

{noformat}

From the logs, the job is finishing before the test code cancels it:

{noformat}
17/12/12 11:00:41.680 Executor task launch worker for task 1 INFO Executor: 
Finished task 0.0 in stage 1.0 (TID 1). 703 bytes result sent to driver
17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSetManager: Finished task 
0.0 in stage 1.0 (TID 1) in 13 ms on localhost (executor driver) (1/1)
17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSchedulerImpl: Removed 
TaskSet 1.0, whose tasks have all completed, from pool 
17/12/12 11:00:41.681 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 1 
(apply at Assertions.scala:805) finished in 0.066 s
17/12/12 11:00:41.681 pool-1-thread-1-ScalaTest-running-SparkContextSuite INFO 
DAGScheduler: Job 1 finished: apply at Assertions.scala:805, took 0.066946 s
17/12/12 11:00:41.682 spark-listener-group-shared INFO DAGScheduler: Asked to 
cancel job 1
{noformat}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22644) Make ML testsuite support StructuredStreaming test

2017-12-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22644:
--
Target Version/s: 2.3.0

> Make ML testsuite support StructuredStreaming test
> --
>
> Key: SPARK-22644
> URL: https://issues.apache.org/jira/browse/SPARK-22644
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>
> We need to add some helper code to make testing ML transformers & models 
> easier with streaming data. These tests might help us catch any remaining 
> issues and we could encourage future PRs to use these tests to prevent new 
> Models & Transformers from having issues.
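One possible shape for such a helper (a sketch only, assuming spark-sql's MemoryStream test source; the helper name and signature are hypothetical): run the same rows through a Transformer both as a static DataFrame and as a stream written to a memory sink, then compare the collected output.

{code}
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.{Encoder, Row, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

object StreamingMLTestSketch {
  // Hypothetical helper: verify a Transformer gives the same result column on
  // streaming input as on static input.
  def checkStreamingTransform[A : Encoder](
      spark: SparkSession,
      data: Seq[A],
      transformer: Transformer,
      outputCol: String): Unit = {
    import spark.implicits._
    implicit val sqlCtx = spark.sqlContext

    // Static path.
    val staticOut = transformer.transform(data.toDF()).select(outputCol).collect()

    // Streaming path: feed the same rows through a memory stream and collect
    // the transformed rows from a memory sink.
    val stream = MemoryStream[A]
    val query = transformer.transform(stream.toDF())
      .select(outputCol)
      .writeStream
      .format("memory")
      .queryName("transformed")
      .start()
    stream.addData(data: _*)
    query.processAllAvailable()
    val streamOut = spark.table("transformed").collect()
    query.stop()

    // Order is not guaranteed for the streaming output, so compare sorted rows.
    val ord = Ordering.by((r: Row) => r.toString)
    assert(staticOut.sorted(ord).sameElements(streamOut.sorted(ord)))
  }
}
{code}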



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22644) Make ML testsuite support StructuredStreaming test

2017-12-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22644:
-

Assignee: Weichen Xu

> Make ML testsuite support StructuredStreaming test
> --
>
> Key: SPARK-22644
> URL: https://issues.apache.org/jira/browse/SPARK-22644
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> We need to add some helper code to make testing ML transformers & models 
> easier with streaming data. These tests might help us catch any remaining 
> issues and we could encourage future PRs to use these tests to prevent new 
> Models & Transformers from having issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22644) Make ML testsuite support StructuredStreaming test

2017-12-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22644:
--
Shepherd: Joseph K. Bradley

> Make ML testsuite support StructuredStreaming test
> --
>
> Key: SPARK-22644
> URL: https://issues.apache.org/jira/browse/SPARK-22644
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>
> We need to add some helper code to make testing ML transformers & models 
> easier with streaming data. These tests might help us catch any remaining 
> issues and we could encourage future PRs to use these tests to prevent new 
> Models & Transformers from having issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288305#comment-16288305
 ] 

Joseph K. Bradley commented on SPARK-22126:
---

Hi all, thanks for the comments.  I realized my use of "Spark job" was 
imprecise above, so I extended that explanation within the linked design doc.  
Check out the new sections:
* "Use cases for model-specific optimizations & parallelism"
* "Other Q"

This should answer both [~WeichenXu123] and [~bago.amirbekian].

[~tomas.nykodym], with the current state of Spark master (2.3), we already let 
users shoot themselves in the foot by running big jobs in parallel.  We try to 
make it less likely by marking the "parallelism" Param as an expert-only API.  
I think it will be hard to prevent this without completely removing the 
"parallelism" Param, but please say if you have ideas for a better user-facing 
API.

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> More discussion is here:
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> I copy the discussion from the gist here:
> I propose to design the API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, suppose we are doing cross 
> validation and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). 
> Let's say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. via a new method in Estimator that returns such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that models computed from ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread, 
> and models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, 
> maxIter=10)) in another thread. There are 4 paramMaps, but we can generate at 
> most two threads to compute the models for them.
> The API above allows "callable.call()" to return multiple models; the return 
> type is {code}Map[Int, M]{code}, where the integer key marks the paramMap 
> index of the corresponding model. Using the example above, there are 4 paramMaps 
> but only 2 callable objects are returned: one callable object for ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) and another for ((regParam=0.3, 
> maxIter=5), (regParam=0.3, maxIter=10)).
> The default "fitCallables/fit with paramMaps" can be implemented as 
> follows:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
>     Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
>     new Callable[Map[Int, M]] {
>       override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
>     }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>   fitCallables(dataset, paramMaps)
>     .flatMap(_.call().toSeq)
>     .sortBy(_._1)
>     .map(_._2)
> }
> {code}
> If we use the API proposed above, the code in 
> [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   // Fit models in a Future for training in parallel
>   val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { 
> callable =>
>  Future[Map[Int, Model[_]]] {
> val modelMap = callable.call()
> if (collectSubModelsParam) {
>...
> }
> modelMap
>  } (executionContext)
>   }
>   // Unpersist training data only when all models have trained
>   Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, 
> executionContext)
> .onComplete { _ => trainingDataset.unpersist() } (executionContext)
>   // Evaluate models in a Future that will calculate a metric and allow 
> model to be 

[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter

2017-12-12 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288298#comment-16288298
 ] 

Eric Vandenberg commented on SPARK-21867:
-

1. The default would be 1, so it does not change the default behavior. It is 
currently configurable to set this higher (spark.shuffle.async.num.sorter=2).
2. Yes, the number of spill files could increase; that's one reason this is not 
on by default. This could be an issue if it hits file system limits, etc., in 
extreme cases. For the jobs we've tested, this wasn't a problem. We think this 
improvement has the biggest impact on larger jobs (we've seen CPU reduction of 
~30% in some large jobs); it may not help as much for smaller jobs with fewer 
spills.
3. When a sorter hits the threshold, it will kick off an asynchronous spill and 
then continue inserting into another sorter (assuming one is available). It 
could make sense to raise the threshold, which would result in larger spill 
files. There is some risk that raising it pushes memory use too high, causing an 
OOM and then needing to lower it again. I think the algorithm could be improved 
by more accurately calculating and enforcing the threshold based on available 
memory over time; however, doing this would require exposing some memory 
allocation metrics not currently available (in the memory manager), so I opted 
not to do that for now.
4. Yes, too many open files/buffers could be an issue. So for now this is 
something to look at enabling case by case as part of performance tuning.
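For intuition, here is a toy model of the sorter flip described in point 3 (a sketch only, not the UnsafeShuffleWriter changes; the class and method names are made up): once the active buffer crosses its threshold it is handed off to a background spill while new records keep flowing into a fresh buffer, mirroring the two 500 MB regions and the two-sorter (spark.shuffle.async.num.sorter=2) setup.

{code}
import java.util.concurrent.Executors
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Toy double-buffer: one region fills while the other spills in the background.
class AsyncSpillSketch(thresholdBytes: Long) {
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())
  private var active = new ArrayBuffer[Array[Byte]]()
  private var activeBytes = 0L
  private var pendingSpill: Future[Unit] = Future.successful(())

  def insert(record: Array[Byte]): Unit = {
    if (activeBytes + record.length > thresholdBytes) flipAndSpill()
    active += record
    activeBytes += record.length
  }

  private def flipAndSpill(): Unit = {
    // At most one spill in flight: wait for the previous one, so there are
    // only two regions total (one filling, one spilling).
    Await.result(pendingSpill, Duration.Inf)
    val toSpill = active
    active = new ArrayBuffer[Array[Byte]]()
    activeBytes = 0L
    pendingSpill = Future {
      // Sort and write `toSpill` to disk here; this runs concurrently with
      // further insert() calls filling the new active buffer.
      toSpill.sortBy(_.length)
      ()
    }
  }
}
{code}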


> Support async spilling in UnsafeShuffleWriter
> -
>
> Key: SPARK-21867
> URL: https://issues.apache.org/jira/browse/SPARK-21867
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Priority: Minor
> Attachments: Async ShuffleExternalSorter.pdf
>
>
> Currently, Spark tasks are single-threaded, but we have seen that job performance 
> could be greatly improved if parts of a task were multi-threaded. 
> For example, profiling our map tasks, which read large amounts of data from 
> HDFS and spill to disk, we see that they are blocked on HDFS reads and on spilling 
> for the majority of the time. Since both of these operations are IO-intensive, the 
> average CPU consumption during the map phase is quite low. In theory, the HDFS 
> read and the spill could happen in parallel if we had additional memory to store 
> data read from HDFS while the last batch read is being spilled.
> Let's say we have 1 GB of shuffle memory available per task. Currently, a map task 
> reads from HDFS and stores the records in the available memory buffer. Once we 
> hit the memory limit and there is no more space to store records, we sort and 
> spill the content to disk. While we are spilling to disk, since we do not have any 
> available memory, we cannot read from HDFS concurrently.
> Here we propose supporting async spilling for UnsafeShuffleWriter, so that we can 
> keep reading from HDFS while the sort and spill happen asynchronously. Let's say 
> the total 1 GB of shuffle memory is split into two regions - an active region and 
> a spilling region - each of size 500 MB. We start by reading from HDFS and filling 
> the active region. Once we hit the limit of the active region, we issue an 
> asynchronous spill while flipping the active and spilling regions. While the spill 
> is happening asynchronously, we still have 500 MB of memory available to read 
> data from HDFS. This way we can amortize the high disk/network IO cost of 
> spilling.
> We made a prototype hack to implement this feature and saw our map tasks run as 
> much as 40% faster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number

2017-12-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288262#comment-16288262
 ] 

Xuefu Zhang commented on SPARK-22683:
-

Hi [~jcuquemelle], thanks for working on this and bringing up the efficiency 
problem associated with dynamic allocation. A significant resource consumption 
increase has also been experienced at our company when workloads are migrated 
from MR to Spark (via Hive). Thus, I believe there is a strong need to improve 
Spark's efficiency in addition to its performance.

While your proposal has its merits, I largely concur with Sean that it might 
address a particular workload rather than being universally applicable to a 
class of problems. Take MR as an example: it also allocates as many 
mappers/reducers as there are map or reduce tasks, yet offers higher efficiency 
than Spark in many cases. The inefficiency associated with dynamic allocation 
comes from many aspects, such as executors idling out, bigger executors, and 
many stages (rather than only 2 stages in MR) in a Spark job. As there is a 
class of users conscious about resource consumption, especially as many move 
their workloads to the cloud, a more generic solution is needed for such users.

I have been thinking about a proposal that introduces an MR-based resource 
allocation scheme in parallel with dynamic allocation. Such an allocation 
mechanism is based on the MR style, but can be further enhanced to beat MR and 
be better adapted to Spark's execution model. This would be a great alternative 
to dynamic allocation.

While dynamic allocation is certainly performance-centric, the new allocation 
scheme can still offer a good performance improvement (compared to MR) while 
being efficiency-centric.

As a starting point, I'm going to create a JIRA and move the discussion of this 
proposal over there. You're welcome to share your thoughts and/or contribute.

Thanks.

> Allow tuning the number of dynamically allocated executors wrt task number
> --
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> Let's say an executor has spark.executor.cores / spark.task.cpus task slots.
> The current dynamic allocation policy allocates enough executors
> to have each task slot execute a single task, which minimizes latency
> but wastes resources when tasks are small relative to the executor allocation
> overhead.
> By adding the tasksPerExecutorSlot setting, it becomes possible to specify how 
> many tasks a single slot should ideally execute, to mitigate the overhead of 
> executor allocation.
> PR: https://github.com/apache/spark/pull/19881
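For a rough sense of the effect, here is a back-of-the-envelope sketch (not the code from the PR; the function and parameter names are illustrative) of how such a knob would change the executor target:

{code}
// Illustrative only: taskSlots = spark.executor.cores / spark.task.cpus.
object ExecutorTargetSketch {
  def targetExecutors(pendingTasks: Int,
                      executorCores: Int,
                      taskCpus: Int,
                      tasksPerExecutorSlot: Int): Int = {
    val taskSlots = executorCores / taskCpus
    // Default dynamic allocation behaves like tasksPerExecutorSlot = 1: one
    // slot per outstanding task. Larger values trade latency for fewer,
    // better-utilized executors.
    math.ceil(pendingTasks.toDouble / (taskSlots * tasksPerExecutorSlot)).toInt
  }

  def main(args: Array[String]): Unit = {
    // 10,000 pending tasks, 4 cores per executor, 1 CPU per task:
    println(targetExecutors(10000, 4, 1, 1)) // 2500 executors requested
    println(targetExecutors(10000, 4, 1, 6)) // 417 executors requested
  }
}
{code}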



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22757:


Assignee: Apache Spark

> Init-container in the driver/executor pods for downloading remote dependencies
> --
>
> Key: SPARK-22757
> URL: https://issues.apache.org/jira/browse/SPARK-22757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22757:


Assignee: (was: Apache Spark)

> Init-container in the driver/executor pods for downloading remote dependencies
> --
>
> Key: SPARK-22757
> URL: https://issues.apache.org/jira/browse/SPARK-22757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288213#comment-16288213
 ] 

Apache Spark commented on SPARK-22757:
--

User 'liyinan926' has created a pull request for this issue:
https://github.com/apache/spark/pull/19954

> Init-container in the driver/executor pods for downloading remote dependencies
> --
>
> Key: SPARK-22757
> URL: https://issues.apache.org/jira/browse/SPARK-22757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-12-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22574.

   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

Issue resolved by pull request 19793
[https://github.com/apache/spark/pull/19793]

> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts 
> the Dispatcher into a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these variables are checked, but not all of them; for example, 
> appArgs and environmentVariables are not.
> If you don't set _appArgs_, it will cause the following error:
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This happens because the scheduler accesses _appArgs_ without checking 
> whether it is null.
>  
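A minimal illustration of the kind of defensive check that avoids the NPE (a sketch only, not the merged patch; the helper and type names are made up): treat missing appArgs/environmentVariables as empty instead of null before the scheduler consumes the request.

{code}
// Hypothetical helper: normalize a submission's nullable fields so downstream
// code (e.g. the Mesos scheduler) never sees null.
object SubmissionSanitizer {
  case class Sanitized(
      appArgs: Array[String],
      environmentVariables: Map[String, String])

  def sanitize(appArgs: Array[String],
               environmentVariables: Map[String, String]): Sanitized = {
    Sanitized(
      Option(appArgs).getOrElse(Array.empty[String]),
      Option(environmentVariables).getOrElse(Map.empty[String, String]))
  }
}
{code}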



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-12-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22574:
--

Assignee: German Schiavon Matteo

> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Assignee: German Schiavon Matteo
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts 
> the Dispatcher into a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these variables are checked, but not all of them; for example, 
> appArgs and environmentVariables are not.
> If you don't set _appArgs_, it will cause the following error:
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This happens because the scheduler accesses _appArgs_ without checking 
> whether it is null.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-12-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-22289.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>Assignee: yuhao yang
> Fix For: 2.2.2, 2.3.0
>
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}
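For reference, the data that needs to be serialized is small; a hedged sketch of what a JSON encoding of the bounds matrix could look like (using json4s, which the ML read/write utilities already depend on; the helper name is made up and this is not the actual fix):

{code}
import org.apache.spark.ml.linalg.DenseMatrix
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Hypothetical helper: encode the bounds matrix as numRows/numCols/values,
// which is enough information to rebuild a DenseMatrix on load.
object MatrixJsonSketch {
  def matrixToJson(m: DenseMatrix): String = compact(render(
    ("numRows" -> m.numRows) ~ ("numCols" -> m.numCols) ~ ("values" -> m.values.toSeq)))
}
{code}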



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22755) Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return different results

2017-12-12 Thread Sunitha Kambhampati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati resolved SPARK-22755.
-
   Resolution: Fixed
Fix Version/s: 2.2.1

> Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return 
> different results
> -
>
> Key: SPARK-22755
> URL: https://issues.apache.org/jira/browse/SPARK-22755
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
> Fix For: 2.2.1
>
>
> Both of the following SQL statements
> {code:sql}
> select ((946-885)*1.000/946 < 0.1)
> {code}
> and
> {code:sql}
> select ((946-885)*1.0/946 < 0.100)
> {code}
> return true, while the following statement 
> {code:sql}
> select ((946-885)*1.0/946 < 0.1)
> {code}
> returns false



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22755) Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return different results

2017-12-12 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288061#comment-16288061
 ] 

Sunitha Kambhampati commented on SPARK-22755:
-

SPARK-21332 fixed this issue.
The changes for it are also in Spark 2.2.1, which was released on Dec 1. I tried 
it on Spark 2.2.1 and the queries return true.



> Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return 
> different results
> -
>
> Key: SPARK-22755
> URL: https://issues.apache.org/jira/browse/SPARK-22755
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> Both of the following SQL statements
> {code:sql}
> select ((946-885)*1.000/946 < 0.1)
> {code}
> and
> {code:sql}
> select ((946-885)*1.0/946 < 0.100)
> {code}
> return true, while the following statement 
> {code:sql}
> select ((946-885)*1.0/946 < 0.1)
> {code}
> returns false



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22729) Add getTruncateQuery to JdbcDialect

2017-12-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22729:

Issue Type: Improvement  (was: New Feature)

> Add getTruncateQuery to JdbcDialect
> ---
>
> Key: SPARK-22729
> URL: https://issues.apache.org/jira/browse/SPARK-22729
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Assignee: Daniel van der Ende
> Fix For: 2.3.0
>
>
> In order to enable truncate for PostgreSQL databases in Spark JDBC, a change 
> is needed to the query used for truncating a PostgreSQL table. By default, 
> PostgreSQL will automatically truncate any descendant tables if a TRUNCATE 
> query is executed. As this may result in (unwanted) side-effects, the query 
> used for the truncate should be specified separately for PostgreSQL, 
> specifying only to TRUNCATE a single table.
> This will also resolve SPARK-22717
> See PostgreSQL documentation 
> https://www.postgresql.org/docs/current/static/sql-truncate.html
> This change will still not let users truncate a table with cascade enabled 
> (which would also truncate tables with foreign key constraints to the table).
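A sketch of the dialect-level hook being described (assuming a getTruncateQuery-style method; the trait here is a stand-in for illustration, not the actual JdbcDialect API): the PostgreSQL variant adds ONLY so descendant tables are left untouched, per the linked documentation.

{code}
// Stand-in trait for illustration; the real hook would live on JdbcDialect.
trait TruncateQueryDialect {
  def getTruncateQuery(table: String): String = s"TRUNCATE TABLE $table"
}

object PostgresTruncateSketch extends TruncateQueryDialect {
  // ONLY restricts the TRUNCATE to the named table and leaves descendant
  // (inherited) tables alone.
  override def getTruncateQuery(table: String): String = s"TRUNCATE TABLE ONLY $table"
}
{code}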



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22729) Add getTruncateQuery to JdbcDialect

2017-12-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22729.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add getTruncateQuery to JdbcDialect
> ---
>
> Key: SPARK-22729
> URL: https://issues.apache.org/jira/browse/SPARK-22729
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Assignee: Daniel van der Ende
> Fix For: 2.3.0
>
>
> In order to enable truncate for PostgreSQL databases in Spark JDBC, a change 
> is needed to the query used for truncating a PostgreSQL table. By default, 
> PostgreSQL will automatically truncate any descendant tables if a TRUNCATE 
> query is executed. As this may result in (unwanted) side-effects, the query 
> used for the truncate should be specified separately for PostgreSQL, 
> specifying only to TRUNCATE a single table.
> This will also resolve SPARK-22717
> See PostgreSQL documentation 
> https://www.postgresql.org/docs/current/static/sql-truncate.html
> This change will still not let users truncate a table with cascade enabled 
> (which would also truncate tables with foreign key constraints to the table).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22729) Add getTruncateQuery to JdbcDialect

2017-12-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-22729:
---

Assignee: Daniel van der Ende

> Add getTruncateQuery to JdbcDialect
> ---
>
> Key: SPARK-22729
> URL: https://issues.apache.org/jira/browse/SPARK-22729
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Assignee: Daniel van der Ende
> Fix For: 2.3.0
>
>
> In order to enable truncate for PostgreSQL databases in Spark JDBC, a change 
> is needed to the query used for truncating a PostgreSQL table. By default, 
> PostgreSQL will automatically truncate any descendant tables if a TRUNCATE 
> query is executed. As this may result in (unwanted) side-effects, the query 
> used for the truncate should be specified separately for PostgreSQL, 
> specifying only to TRUNCATE a single table.
> This will also resolve SPARK-22717
> See PostgreSQL documentation 
> https://www.postgresql.org/docs/current/static/sql-truncate.html
> This change will still not let users truncate a table with cascade enabled 
> (which would also truncate tables with foreign key constraints to the table).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22742) Spark2.x does not support read data from Hive 2.2 and 2.3

2017-12-12 Thread Zhongting Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288044#comment-16288044
 ] 

Zhongting Hu commented on SPARK-22742:
--

Thanks [~smilegator] for reopening it. Can I get a rough estimate of which 
upcoming Spark release will have Hive 2.3 support? Thanks.

> Spark2.x does not support read data from Hive 2.2 and 2.3
> -
>
> Key: SPARK-22742
> URL: https://issues.apache.org/jira/browse/SPARK-22742
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Zhongting Hu
>
> Hive has released its latest version, 2.3.2, but Spark doesn't support reading 
> from its metastore yet. The latest supported Hive metastore version is 2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19834) csv escape of quote escape

2017-12-12 Thread Soonmok Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288025#comment-16288025
 ] 

Soonmok Kwon commented on SPARK-19834:
--

I will re-open this soon (in a week).

> csv escape of quote escape
> --
>
> Key: SPARK-19834
> URL: https://issues.apache.org/jira/browse/SPARK-19834
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Soonmok Kwon
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> A DataFrame is stored in CSV format and loaded again. When there is a backslash 
> followed by a quotation mark, CSV reading seems to produce an error.
> reference:
> http://stackoverflow.com/questions/42607208/spark-csv-error-when-reading-backslash-and-quotation-mark
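A hedged repro sketch in spark-shell (the path is illustrative); it round-trips a value containing a backslash immediately followed by a quotation mark:

{code:java}
// Requires an active SparkSession named `spark`, as in spark-shell.
import spark.implicits._

val df = Seq("""a\"b""").toDF("col")             // the cell value is: a\"b
df.write.mode("overwrite").csv("/tmp/spark-19834")
spark.read.csv("/tmp/spark-19834").show(false)   // on affected versions the value is reported to come back mangled
{code}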



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22755) Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return different results

2017-12-12 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287994#comment-16287994
 ] 

Sunitha Kambhampati commented on SPARK-22755:
-


A) TRUNK:
- My trunk codeline is sync'd up to Dec 7, commit 
2d4c2b0bdf89badf25f2d1d98903125e48e7cd5c
- That said, I tried these statements just now in the spark-sql REPL and the answer is 
true.
- I have not set any configuration; it is just the defaults in my dev env. 

B) Spark 2.2:
- I downloaded the spark 2.2 binaries and tried the queries in spark-sql repl 
and I can repro the issue. 
{code}
select ((946-885)*1.0/946 < 0.1) -> returns false
select ((946-885)*1.0/946 < 0.100)  -> returns true
{code}

So it looks like some fix has gone in, as I cannot repro it on trunk.
I am not sure which issue fixed it, but I just wanted to add this info for now. 
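A hedged way to inspect what is going on (spark-shell; the inferred decimal type of the division is where the 2.2 behavior appears to diverge):

{code:java}
// 61 / 946 ≈ 0.0645, so the comparison should evaluate to true.
val df = spark.sql("select (946-885)*1.0/946 as ratio, ((946-885)*1.0/946 < 0.1) as lt")
df.printSchema()   // shows the decimal precision/scale Spark infers for `ratio`
df.show(false)     // on the Spark 2.2 binaries `lt` comes back false, on trunk it is true
{code}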

> Expression (946-885)*1.0/946 < 0.1 and (946-885)*1.000/946 < 0.1 return 
> different results
> -
>
> Key: SPARK-22755
> URL: https://issues.apache.org/jira/browse/SPARK-22755
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> both of the following sql statements
> {code:sql}
> select ((946-885)*1.000/946 < 0.1)
> {code}
> and
> {code:sql}
> select ((946-885)*1.0/946 < 0.100)
> {code}
> return true, while the following statement 
> {code:sql}
> select ((946-885)*1.0/946 < 0.1)
> {code}
> returns false



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16986) "Started" time, "Completed" time and "Last Updated" time in history server UI are not user local time

2017-12-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16986.

   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 2.3.0

> "Started" time, "Completed" time and "Last Updated" time in history server UI 
> are not user local time
> -
>
> Key: SPARK-16986
> URL: https://issues.apache.org/jira/browse/SPARK-16986
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Weiqing Yang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, "Started" time, "Completed" time and "Last Updated" time in 
> history server UI are GMT. They should be the user local time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-16986) "Started" time, "Completed" time and "Last Updated" time in history server UI are not user local time

2017-12-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-16986:


> "Started" time, "Completed" time and "Last Updated" time in history server UI 
> are not user local time
> -
>
> Key: SPARK-16986
> URL: https://issues.apache.org/jira/browse/SPARK-16986
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Weiqing Yang
>Priority: Minor
>
> Currently, "Started" time, "Completed" time and "Last Updated" time in 
> history server UI are GMT. They should be the user local time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22763) SHS: Ignore unknown events and parse through the file

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287924#comment-16287924
 ] 

Apache Spark commented on SPARK-22763:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/19953

> SHS: Ignore unknown events and parse through the file
> -
>
> Key: SPARK-22763
> URL: https://issues.apache.org/jira/browse/SPARK-22763
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>
> As Spark code changes, new events are added to the event log:
> https://github.com/apache/spark/pull/19649
> And we used to maintain a whitelist to avoid exceptions:
> https://github.com/apache/spark/pull/15663/files
> Currently the Spark history server stops parsing when it hits an unknown event 
> or an unrecognized property, so we may only see part of the UI data.
> For better compatibility, we can ignore unknown events and parse through the 
> log file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22763) SHS: Ignore unknown events and parse through the file

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22763:


Assignee: Apache Spark

> SHS: Ignore unknown events and parse through the file
> -
>
> Key: SPARK-22763
> URL: https://issues.apache.org/jira/browse/SPARK-22763
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>
> As Spark code changes, new events are added to the event log:
> https://github.com/apache/spark/pull/19649
> And we used to maintain a whitelist to avoid exceptions:
> https://github.com/apache/spark/pull/15663/files
> Currently the Spark history server stops parsing when it hits an unknown event 
> or an unrecognized property, so we may only see part of the UI data.
> For better compatibility, we can ignore unknown events and parse through the 
> log file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19809) NullPointerException on zero-size ORC file

2017-12-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19809:
--
Affects Version/s: (was: 2.2.0)
   2.2.1

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>
> When reading from a Hive ORC table that contains some zero-byte files, we get a 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> 
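A hedged repro sketch (table name and path are illustrative): drop a zero-byte file into an ORC-backed Hive table's directory and query it.

{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumes spark-shell with Hive support and an existing ORC table `my_orc_table`.
val warehouseDir = new Path("/user/hive/warehouse/my_orc_table")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.create(new Path(warehouseDir, "empty-part.orc")).close()   // creates a zero-byte file

spark.sql("select * from my_orc_table limit 1").show()
// On affected versions this fails with the NullPointerException in
// OrcInputFormat.getSplits shown above.
{code}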

[jira] [Assigned] (SPARK-22763) SHS: Ignore unknown events and parse through the file

2017-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22763:


Assignee: (was: Apache Spark)

> SHS: Ignore unknown events and parse through the file
> -
>
> Key: SPARK-22763
> URL: https://issues.apache.org/jira/browse/SPARK-22763
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>
> As Spark code changes, new events are added to the event log:
> https://github.com/apache/spark/pull/19649
> And we used to maintain a whitelist to avoid exceptions:
> https://github.com/apache/spark/pull/15663/files
> Currently the Spark history server stops parsing when it hits an unknown event 
> or an unrecognized property, so we may only see part of the UI data.
> For better compatibility, we can ignore unknown events and parse through the 
> log file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22763) SHS: Ignore unknown events and parse through the file

2017-12-12 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-22763:
--

 Summary: SHS: Ignore unknown events and parse through the file
 Key: SPARK-22763
 URL: https://issues.apache.org/jira/browse/SPARK-22763
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Gengliang Wang


As Spark code changes, new events are added to the event log:
https://github.com/apache/spark/pull/19649

And we used to maintain a whitelist to avoid exceptions:
https://github.com/apache/spark/pull/15663/files

Currently the Spark history server stops parsing when it hits an unknown event or 
an unrecognized property, so we may only see part of the UI data.

For better compatibility, we can ignore unknown events and parse through the 
log file.
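A minimal sketch of the proposed behavior (not the actual ReplayListenerBus code): skip lines whose event type cannot be resolved instead of aborting the whole replay.

{code:java}
// `parse` stands in for whatever turns a JSON line into a SparkListenerEvent.
def replay(lines: Iterator[String], parse: String => AnyRef)(post: AnyRef => Unit): Unit = {
  lines.foreach { line =>
    try {
      post(parse(line))
    } catch {
      case _: ClassNotFoundException =>
        // Unknown event type written by a newer Spark: skip this line and keep
        // parsing the rest of the file instead of stopping here.
    }
  }
}
{code}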



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets

2017-12-12 Thread Adamos Loizou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287735#comment-16287735
 ] 

Adamos Loizou commented on SPARK-22351:
---

Hello guys, I've run into this problem once more, now with ADT/sealed-hierarchy 
examples.
For reference, other people are already facing this issue ([stack overflow 
link|https://stackoverflow.com/questions/41030073/encode-an-adt-sealed-trait-hierarchy-into-spark-dataset-column]).
Here is an example:

{code:java}
sealed trait Fruit
case object Apple extends Fruit
case object Orange extends Fruit

case class Bag(quantity: Int, fruit: Fruit)

Seq(Bag(1, Apple), Bag(3, Orange)).toDS // <- This fails because it can't find 
an encoder for Fruit
{code}

Ideally I'd like to be able to create my own encoder and tell it, for example, 
to use the case object's toString method to map it to a String column.

How feasible would it be to expose an API for creating custom encoders?
Unfortunately, not having this limits the capacity for generalised and typesafe 
models quite a bit.

Thank you.
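For what it's worth, a hedged workaround sketch for the example above (not the custom-encoder API this ticket asks for): fall back to a Kryo encoder for the whole Bag, which works but leaves a single binary column instead of a (quantity, fruit) layout.

{code:java}
import org.apache.spark.sql.Encoders

val ds = spark.createDataset(Seq(Bag(1, Apple), Bag(3, Orange)))(Encoders.kryo[Bag])
ds.printSchema()   // a single `value: binary` column, so it is not queryable by field
{code}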

> Support user-created custom Encoders for Datasets
> -
>
> Key: SPARK-22351
> URL: https://issues.apache.org/jira/browse/SPARK-22351
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Adamos Loizou
>Priority: Minor
>
> It would be very helpful if we could easily support creating custom encoders 
> for classes in Spark SQL.
> This is to allow a user to properly define a business model using types of 
> their choice. They can then map them to Spark SQL types without being forced 
> to pollute their model with the built-in mappable types (e.g. 
> {{java.sql.Timestamp}}).
> Specifically in our case, we tend to use either the Java 8 time API or the 
> joda time API for dates instead of {{java.sql.Timestamp}} whose API is quite 
> limited compared to the others.
> Ideally we would like to be able to have a dataset of such a class:
> {code:java}
> case class Person(name: String, dateOfBirth: org.joda.time.LocalDate)
> implicit def localDateTimeEncoder: Encoder[LocalDate] = ??? // we define 
> something that maps to Spark SQL TimestampType
> ...
> // read csv and map it to model
> val people:Dataset[Person] = spark.read.csv("/my/path/file.csv").as[Person]
> {code}
> While this was possible in Spark 1.6, it's no longer the case in Spark 2.x.
> It's also not straightforward how to support that using an 
> {{ExpressionEncoder}} (any tips would be much appreciated).
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number

2017-12-12 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286253#comment-16286253
 ] 

Julien Cuquemelle edited comment on SPARK-22683 at 12/12/17 3:14 PM:
-

The impression I get from our discussion is that you mainly focus on the 
latency of the jobs, and that the current setting is optimized for that, which 
is why you consider the current setup sufficient.

If you consider your previous example from a resource usage point of view, my 
proposal would allow roughly the same resource usage in both scenarios, 
whereas the current setup doubles the resource usage of the workload with small 
tasks... 

I've tried to experiment with the parameters you've proposed, but right now I 
don't have a solution to optimize my type of workload (not every single job) 
for resource consumption. 

I don't know if the majority of Spark users run on idle clusters, but ours is 
routinely full, so for us resource usage is more important than latency.


was (Author: jcuquemelle):
The impression I get from our discussion is that you mainly focus on the 
latency of the jobs, and that the current setting is optimized for that, which 
is why you consider the current setup sufficient.

If you consider your previous example from a resource usage point of view, my 
proposal would allow roughly the same resource usage in both scenarios, 
whereas the current setup doubles the resource usage of the workload with small 
tasks... 

I've tried to experiment with the parameters you've proposed, but right now I 
don't have a solution to optimize my type of workload (nor every single job) 
for resource consumption. 

I don't know if the majority of Spark users run on idle clusters, but ours is 
routinely full, so for us resource usage is more important than latency.

> Allow tuning the number of dynamically allocated executors wrt task number
> --
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> Let's say an executor has spark.executor.cores / spark.task.cpus task slots.
> The current dynamic allocation policy allocates enough executors for each task 
> slot to execute a single task. This minimizes latency, but wastes resources when 
> tasks are small relative to the executor allocation overhead.
> By adding tasksPerExecutorSlot, it becomes possible to specify how many tasks a 
> single slot should ideally execute, to mitigate the overhead of executor 
> allocation (a worked example follows below).
> PR: https://github.com/apache/spark/pull/19881
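A worked example of the arithmetic (the numbers and the tasksPerExecutorSlot name follow the proposal, not an existing Spark config):

{code:java}
val executorCores = 4                              // spark.executor.cores
val taskCpus      = 1                              // spark.task.cpus
val taskSlots     = executorCores / taskCpus       // 4 slots per executor
val pendingTasks  = 20000

// Current policy: one task per slot => ceil(20000 / 4) = 5000 executors requested.
val executorsToday = math.ceil(pendingTasks.toDouble / taskSlots).toInt

// With tasksPerExecutorSlot = 6, each slot is expected to run ~6 tasks,
// so only ceil(20000 / (4 * 6)) = 834 executors are requested.
val tasksPerExecutorSlot = 6
val executorsProposed = math.ceil(pendingTasks.toDouble / (taskSlots * tasksPerExecutorSlot)).toInt
{code}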



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21322) support histogram in filter cardinality estimation

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287691#comment-16287691
 ] 

Apache Spark commented on SPARK-21322:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19952

> support histogram in filter cardinality estimation
> --
>
> Key: SPARK-21322
> URL: https://issues.apache.org/jira/browse/SPARK-21322
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ron Hu
>Assignee: Ron Hu
> Fix For: 2.3.0
>
>
> Histogram is effective in dealing with skewed distribution.  After we 
> generate histogram information for column statistics, we need to adjust 
> filter estimation based on histogram data structure.
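A generic, hedged sketch of how an equi-height histogram feeds a `col < v` selectivity estimate (illustrative only; it does not mirror Spark's actual estimation code):

{code:java}
// Each equi-height bin holds roughly the same number of rows.
case class Bin(lo: Double, hi: Double)

def selectivityLessThan(bins: Seq[Bin], v: Double): Double = {
  val perBin = 1.0 / bins.length                   // fraction of rows per bin
  bins.map { b =>
    if (v >= b.hi) perBin                          // the whole bin qualifies
    else if (v <= b.lo) 0.0                        // no rows in this bin qualify
    else perBin * (v - b.lo) / (b.hi - b.lo)       // partial bin: assume uniform values within the bin
  }.sum
}
{code}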



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22761) 64KB JVM bytecode limit problem with GLM

2017-12-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22761.
---
Resolution: Duplicate

Without more info this sounds like a possible duplicate of several other issues.

> 64KB JVM bytecode limit problem with GLM
> 
>
> Key: SPARK-22761
> URL: https://issues.apache.org/jira/browse/SPARK-22761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Alan Lai
>
> {code:java}
> GLM
> {code} (and presumably other mllib tools)
>  can throw an exception due to the 64KB JVM bytecode limit when used with 
> a lot of variables/arguments (~2k).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21914) Running examples as tests in SQL builtin function documentation

2017-12-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287538#comment-16287538
 ] 

Hyukjin Kwon commented on SPARK-21914:
--

I totally misunderstood your comment [~rxin]. I was so focused on the elapsed 
time in the tests at that time, but you meant the actual test coverage.

Would it be okay if the diff is small? Let me try this locally and maybe 
open a PR to show it if the diff is small. Otherwise, I will leave this closed if 
it's complex and big.

> Running examples as tests in SQL builtin function documentation
> ---
>
> Key: SPARK-21914
> URL: https://issues.apache.org/jira/browse/SPARK-21914
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> It looks like we have added many examples in {{ExpressionDescription}} for builtin 
> functions.
> Actually, if I have seen correctly, we have fixed many of these examples in minor 
> PRs so far, and they sometimes need to be added as SQL tests and golden files.
> As we have formatted examples in {{ExpressionDescription.examples}} - 
> https://github.com/apache/spark/blob/ba327ee54c32b11107793604895bd38559804858/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionDescription.java#L44-L50,
>  and we have `SQLQueryTestSuite`, I think we could run the examples as tests 
> like Python's doctests.
> Rough way I am thinking:
> 1. Loads the example in {{ExpressionDescription}}.
> 2. identify queries by {{>}}.
> 3. identify the rest of them as the results (a parsing sketch follows after this 
> description).
> 4. run the examples by reusing {{SQLQueryTestSuite}} if possible.
> 5. compare the output by reusing {{SQLQueryTestSuite}} if possible.
> Advantages of doing this I could think for now:
> - Reduce the number of PRs to fix the examples
> - De-duplicate the test cases that should be added into sql and golden files.
> - Correct documentation with correct examples.
> - Reduce reviewing costs for documentation fix PRs.
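A rough, hedged sketch of steps 2-3 above (the function name is ours; the real work would reuse {{SQLQueryTestSuite}}): split an {{ExpressionDescription.examples}} string into (query, expected output) pairs.

{code:java}
def parseExamples(examples: String): Seq[(String, String)] = {
  val lines = examples.split("\n").map(_.trim).filter(_.nonEmpty)
  val queryIdx = lines.indices.filter(i => lines(i).startsWith(">"))
  queryIdx.map { i =>
    val end      = queryIdx.find(_ > i).getOrElse(lines.length)   // next query or end of text
    val query    = lines(i).stripPrefix(">").trim                 // the SQL to run
    val expected = lines.slice(i + 1, end).mkString("\n")         // the documented result
    query -> expected
  }
}
{code}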



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18580) Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream

2017-12-12 Thread Alexey Lipodat (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287496#comment-16287496
 ] 

Alexey Lipodat commented on SPARK-18580:


Guys, could you please review Oleksandr's pull request above?
This feature is very useful for us. 
Thank you!

> Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream
> ---
>
> Key: SPARK-18580
> URL: https://issues.apache.org/jira/browse/SPARK-18580
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Oleg Muravskiy
>
> Currently `spark.streaming.kafka.maxRatePerPartition` is used as the 
> initial rate when backpressure is enabled. This is too high for the 
> application while it is still warming up.
> This is similar to SPARK-11627, applying the solution provided there to 
> DirectKafkaInputDStream.
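A hedged illustration of the settings involved (the values are examples only):

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "100")     // what this ticket wants honored at startup
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")  // today this cap also serves as the initial rate for direct Kafka streams
{code}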



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287489#comment-16287489
 ] 

Apache Spark commented on SPARK-22760:
--

User 'KaiXinXiaoLei' has created a pull request for this issue:
https://github.com/apache/spark/pull/19951

> where driver is stopping, and some executors lost because of 
> YarnSchedulerBackend.stop, then there is a problem. 
> -
>
> Key: SPARK-22760
> URL: https://issues.apache.org/jira/browse/SPARK-22760
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
> Attachments: 微信图片_20171212094100.jpg
>
>
> Using SPARK-14228, I still find a problem:
> {noformat}
> 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to 
> shut down
> 17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
> 17/12/12 15:34:45 ERROR Inbox: Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
> has been stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> and sometimes the problem below also exists:
> {noformat}
> 17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped
> 17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 17/12/11 15:50:53 ERROR Inbox: Ignoring error
> org.apache.spark.SparkException: Unsupported message 
> OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container 
> container_e05_1512975871311_0007_01_69 exited because of a YARN event 
> (e.g., pre-emption) and not because of an error in the running job.)) from 
> 101.8.73.53:42930
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
> at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
> at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
> at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186)
> at 
> 

[jira] [Commented] (SPARK-22450) Safely register class for mllib

2017-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287483#comment-16287483
 ] 

Apache Spark commented on SPARK-22450:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/19950

> Safely register class for mllib
> ---
>
> Key: SPARK-22450
> URL: https://issues.apache.org/jira/browse/SPARK-22450
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>Assignee: Xianyang Liu
> Fix For: 2.3.0
>
>
> There are still some algorithms based on mllib, such as KMeans. For now, many 
> common mllib classes (such as Vector, DenseVector, SparseVector, Matrix, 
> DenseMatrix, SparseMatrix) are not registered with Kryo, so there are some 
> performance issues when serializing or deserializing those objects.
> Previously discussed: https://github.com/apache/spark/pull/19586
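A hedged illustration of what registering those classes manually looks like today (the ticket proposes doing this safely inside Spark itself):

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[org.apache.spark.mllib.linalg.Vector],
    classOf[org.apache.spark.mllib.linalg.DenseVector],
    classOf[org.apache.spark.mllib.linalg.SparseVector],
    classOf[org.apache.spark.mllib.linalg.Matrix],
    classOf[org.apache.spark.mllib.linalg.DenseMatrix],
    classOf[org.apache.spark.mllib.linalg.SparseMatrix]))
{code}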



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter

2017-12-12 Thread Eyal Farago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287461#comment-16287461
 ] 

Eyal Farago commented on SPARK-21867:
-

[~ericvandenbergfb], looks good, a few questions though:
1. initial number of sorters?
2. assuming initial number of sorters >1, doesn't this mean you're potentially 
increasing number of spills? if a single sorter would spill after N records, a 
multi-sorter will spill after N/k records (where k is the number of sorters), 
doesn't this mean more spills and merges?
3. when a sorter hits the spill threshold, does it immediately spill or will it 
keep going if there's available execution memory? it might make sense to spill 
and raise the threshold in this condition...
4. merging spills may require some attention: compression, encryption, etc. 
also avoiding too many open files/buffers at the same time

Looking forward to the PR.

> Support async spilling in UnsafeShuffleWriter
> -
>
> Key: SPARK-21867
> URL: https://issues.apache.org/jira/browse/SPARK-21867
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Priority: Minor
> Attachments: Async ShuffleExternalSorter.pdf
>
>
> Currently, Spark tasks are single-threaded. But we see it could greatly 
> improve the performance of the jobs, if we can multi-thread some part of it. 
> For example, profiling our map tasks, which read large amounts of data from 
> HDFS and spill to disk, we see that we are blocked on HDFS reads and spilling 
> the majority of the time. Since both of these operations are IO intensive, the 
> average CPU consumption during the map phase is significantly low. In theory, 
> both HDFS read and spilling can be done in parallel if we had additional 
> memory to store data read from HDFS while we are spilling the last batch read.
> Let's say we have 1G of shuffle memory available per task. Currently, in case 
> of map task, it reads from HDFS and the records are stored in the available 
> memory buffer. Once we hit the memory limit and there is no more space to 
> store the records, we sort and spill the content to disk. While we are 
> spilling to disk, since we do not have any available memory, we can not read 
> from HDFS concurrently. 
> Here we propose supporting async spilling for UnsafeShuffleWriter, so that we 
> can support reading from HDFS when sort and spill is happening 
> asynchronously.  Let's say the total 1G of shuffle memory can be split into 
> two regions - active region and spilling region - each of size 500 MB. We 
> start with reading from HDFS and filling the active region. Once we hit the 
> limit of the active region, we issue an asynchronous spill, while flipping the 
> active region and the spilling region. While the spill is happening 
> asynchronously, we still have 500 MB of memory available to read data 
> from HDFS. This way we can amortize the high disk/network IO cost during 
> spilling.
> We made a prototype hack to implement this feature and we could see our map 
> tasks were as much as 40% faster. 
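A toy, hedged sketch of the flip idea described above (illustrative only, not the UnsafeShuffleWriter design, and ignoring memory accounting): keep accepting records in one region while the other spills on a background thread.

{code:java}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

class DoubleBufferedSpiller[T](capacity: Int)(spill: Seq[T] => Unit) {
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())
  private var active = Vector.empty[T]

  def insert(record: T): Unit = {
    active :+= record
    if (active.size >= capacity) {
      val full = active            // hand the full region to the background spiller
      active = Vector.empty[T]     // flip: keep reading/inserting into the fresh region
      Future(spill(full))          // sort-and-spill happens asynchronously
    }
  }
}
{code}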



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22761) 64KB JVM bytecode limit problem with GLM

2017-12-12 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287394#comment-16287394
 ] 

Marco Gaido commented on SPARK-22761:
-

this is probably solved by SPARK-6. Can you try to reproduce on current 
master branch?

> 64KB JVM bytecode limit problem with GLM
> 
>
> Key: SPARK-22761
> URL: https://issues.apache.org/jira/browse/SPARK-22761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Alan Lai
>
> {code:java}
> GLM
> {code} (and presumably other mllib tools)
>  can throw an exception due to the 64KB JVM bytecode limit when used with 
> a lot of variables/arguments (~2k).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.

2017-12-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-22760:
--
Description: 
Using   {{SPARK-14228}} , i still find a problem:
{noformat}
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut 
down
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
17/12/12 15:34:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
has been stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

and sometimes,  the below problem is also exists:

{noformat}
17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped
17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/12/11 15:50:53 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Unsupported message 
OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container 
container_e05_1512975871311_0007_01_69 exited because of a YARN event 
(e.g., pre-emption) and not because of an error in the running job.)) from 
101.8.73.53:42930
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at scala.util.Success.foreach(Try.scala:236)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
{noformat}

I analysis this reason. When the number of 

[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.

2017-12-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-22760:
--
Description: 
Using  SPARK-14228 {{monospaced text}} , i still find a problem:
{noformat}
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut 
down
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
17/12/12 15:34:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
has been stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

and sometimes,  the below problem is also exists:

{noformat}
17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped
17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/12/11 15:50:53 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Unsupported message 
OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container 
container_e05_1512975871311_0007_01_69 exited because of a YARN event 
(e.g., pre-emption) and not because of an error in the running job.)) from 
101.8.73.53:42930
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at scala.util.Success.foreach(Try.scala:236)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
{noformat}

I analysis this reason. When 

[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.

2017-12-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-22760:
--
Description: 
Using   SPARK-14228 , i still find a problem:
{noformat}
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut 
down
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
17/12/12 15:34:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
has been stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

and sometimes,  the below problem is also exists:

{noformat}
17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped
17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/12/11 15:50:53 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Unsupported message 
OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container 
container_e05_1512975871311_0007_01_69 exited because of a YARN event 
(e.g., pre-emption) and not because of an error in the running job.)) from 
101.8.73.53:42930
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at scala.util.Success.foreach(Try.scala:236)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
{noformat}

I analysis this reason. When the number of 

[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.

2017-12-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-22760:
--
Description: 
Using  SPARK-14228  , i still find a problem:
{noformat}
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut 
down
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
17/12/12 15:34:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
has been stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

and sometimes,  the below problem is also exists:

{noformat}
17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped
17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/12/11 15:50:53 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Unsupported message 
OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container 
container_e05_1512975871311_0007_01_69 exited because of a YARN event 
(e.g., pre-emption) and not because of an error in the running job.)) from 
101.8.73.53:42930
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at scala.util.Success.foreach(Try.scala:236)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
{noformat}

I analysis this reason. When the number of 

[jira] [Updated] (SPARK-22760) where driver is stopping, and some executors lost because of YarnSchedulerBackend.stop, then there is a problem.

2017-12-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-22760:
--
Description: 
Use SPARK-14228 , i find a problem:




17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut 
down
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
17/12/12 15:34:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
has been stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


and sometimes, 

17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped
17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/12/11 15:50:53 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Unsupported message 
OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container 
container_e05_1512975871311_0007_01_69 exited because of a YARN event 
(e.g., pre-emption) and not because of an error in the running job.)) from 
101.8.73.53:42930
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at scala.util.Success.foreach(Try.scala:236)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
at 

[jira] [Updated] (SPARK-22760) When the driver is stopping and some executors are lost because of YarnSchedulerBackend.stop, there is a problem.

2017-12-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-22760:
--
Description: 
Using SPARK-14228, I found a problem:

17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut 
down
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
17/12/12 15:34:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
has been stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


and sometimes, 

17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped
17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/12/11 15:50:53 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Unsupported message 
OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container 
container_e05_1512975871311_0007_01_69 exited because of a YARN event 
(e.g., pre-emption) and not because of an error in the running job.)) from 
101.8.73.53:42930
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at scala.util.Success.foreach(Try.scala:236)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)

I analyzed the reason. When the number of executors is large, and 

[jira] [Updated] (SPARK-22760) When the driver is stopping and some executors are lost because of YarnSchedulerBackend.stop, there is a problem.

2017-12-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-22760:
--
Description: 
Using SPARK-14228, I found a problem:
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to 
shut down
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
17/12/12 15:34:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
has been stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

and sometimes, 

17/12/11 15:50:53 INFO YarnClientSchedulerBackend: Stopped
17/12/11 15:50:53 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/12/11 15:50:53 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Unsupported message 
OneWayMessage(101.8.73.53:42930,RemoveExecutor(68,Executor for container 
container_e05_1512975871311_0007_01_69 exited because of a YARN event 
(e.g., pre-emption) and not because of an error in the running job.)) from 
101.8.73.53:42930
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:118)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$2.apply(Inbox.scala:117)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:126)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:186)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$handleExecutorDisconnectedFromDriver$1.apply(YarnSchedulerBackend.scala:255)
at scala.util.Success.foreach(Try.scala:236)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)
at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:206)

I analyzed the reason: when the number of executors is large, 
YarnSchedulerBackend.stopped remains false after YarnSchedulerBackend.stop() 
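
A minimal, self-contained Scala sketch of the pattern this report argues for 
(these are not Spark's real classes; DriverRef, SafeSend, and the message 
strings are made-up names): a send to an already-stopped RpcEnv during shutdown 
is an expected condition and can be logged at WARN instead of surfacing as ERROR.

{noformat}
// Hypothetical sketch only, not Spark's actual fix.
class RpcEnvStoppedException extends IllegalStateException("RpcEnv already stopped.")

// Stand-in for an endpoint reference whose backing RpcEnv may already be stopped.
class DriverRef {
  @volatile var stopped: Boolean = false
  def send(msg: Any): Unit =
    if (stopped) throw new RpcEnvStoppedException else println(s"delivered: $msg")
}

object SafeSend {
  // During shutdown a failed send is expected: log a warning and move on.
  def safeSend(ref: DriverRef, msg: Any): Unit =
    try ref.send(msg)
    catch {
      case _: RpcEnvStoppedException =>
        Console.err.println(s"WARN: dropping '$msg', RpcEnv already stopped")
    }
}

object Demo extends App {
  val driver = new DriverRef
  SafeSend.safeSend(driver, "RemoveExecutor(63)") // delivered normally
  driver.stopped = true                           // driver begins shutting down
  SafeSend.safeSend(driver, "RemoveExecutor(68)") // logged at WARN, not ERROR
}
{noformat}

The real change would of course live in Spark's own send/receive paths; this 
only shows the shape of the guard.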

[jira] [Commented] (SPARK-22752) FileNotFoundException while reading from Kafka

2017-12-12 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287305#comment-16287305
 ] 

Marco Gaido commented on SPARK-22752:
-

Hi [~zsxwing], thanks for looking at this. The checkpointDir is on HDFS. Inside 
it we have:
{noformat}
metadata . 0.1 kB
commits
offsets
sources
state
{noformat}
where {{commits}} contains files from {{0}} to {{13}}, {{offsets}} contains 
files from {{0}} to {{14}}, {{sources}} contains only the file {{0/0}}, and 
{{state}} contains only the directory {{0/0}}, which is empty.
Do you need any other information? Do you have an idea of what the root cause of 
the issue might be? Thanks.
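
As a small aid for reproducing this, here is a sketch using the standard Hadoop 
FileSystem API and the paths quoted in this report (it is not part of any fix) 
that lists what is actually present under the state store directory:

{noformat}
// Sketch: list the state store directory reported above to confirm which
// delta/snapshot files exist on HDFS.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListStateDir extends App {
  val fs = FileSystem.get(new Configuration()) // picks up HDFS settings from the classpath
  val stateDir = new Path("/checkpointDir/state/0/0")
  if (fs.exists(stateDir)) {
    fs.listStatus(stateDir).foreach(s => println(s"${s.getLen} bytes  ${s.getPath.getName}"))
  } else {
    println(s"$stateDir does not exist")
  }
}
{noformat}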

> FileNotFoundException while reading from Kafka
> --
>
> Key: SPARK-22752
> URL: https://issues.apache.org/jira/browse/SPARK-22752
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> We are running a stateful structured streaming job which reads from Kafka and 
> writes to HDFS. And we are hitting this exception:
> {noformat}
> 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): 
> java.lang.IllegalStateException: Error reading delta file 
> /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, 
> part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta 
> does not exist
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
> {noformat}
> Of course, the file doesn't exist in HDFS, and the {{state/0/0}} directory 
> contains no files at all, while there are some files in the commits and offsets 
> folders. I am not sure what causes this behavior. It seems to happen the second 
> time the job is started, after the first run failed, so it looks like task 
> failures can trigger it. Or it might be related to watermarks, since there were 
> problems with the incoming data in which the watermark was filtering out all 
> the incoming data.
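
For context, a rough sketch of the kind of job described above (broker address, 
topic, and output paths are invented, not the reporter's actual configuration; 
the Kafka source needs the spark-sql-kafka-0-10 artifact on the classpath). The 
per-partition state files such as {{1.delta}} in the stack trace live under 
{{<checkpointLocation>/state/<operator>/<partition>}}.

{noformat}
// Sketch of a stateful Kafka-to-HDFS structured streaming job
// (hypothetical broker, topic, and paths; Spark 2.2-era API).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

object KafkaToHdfs extends App {
  val spark = SparkSession.builder().appName("stateful-kafka-to-hdfs").getOrCreate()

  val events = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS value", "timestamp")

  // The windowed aggregation is what creates the state store whose delta
  // files are read back when the query restarts from the checkpoint.
  val counts = events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"), col("value"))
    .count()

  counts.writeStream
    .format("parquet")
    .option("path", "/output/event-counts")
    .option("checkpointLocation", "/checkpointDir")
    .outputMode("append")
    .start()
    .awaitTermination()
}
{noformat}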



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22749) df.na().fill(0) over an unnamed FIRST() fails the query

2017-12-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287251#comment-16287251
 ] 

Hyukjin Kwon commented on SPARK-22749:
--

ping [~efrat.s]

> df.na().fill(0) over an unnamed FIRST() fails the query
> ---
>
> Key: SPARK-22749
> URL: https://issues.apache.org/jira/browse/SPARK-22749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Efrat
>
> When I created a query with FIRST() and didn't rename the field I queried 
> over, applying .na().fill(0) generated syntax that failed the query. 
> Renaming the field using AS solved it, but I guess it's a bug in fillNA, 
> which doesn't support this case.
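
A minimal sketch of the reported pattern and workaround (the sample data and 
column names are invented, and whether the un-aliased form fails is as reported 
above, not verified here):

{noformat}
// Sketch of the report: na.fill(0) over an un-aliased first() vs. an aliased one.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

object FillNaOverFirst extends App {
  val spark = SparkSession.builder()
    .appName("spark-22749-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")

  // Reported to fail (uncomment to try): na.fill over the auto-generated
  // column name of first(v).
  // df.groupBy($"k").agg(first($"v")).na.fill(0).show()

  // Reported workaround: alias the aggregate before applying na.fill.
  df.groupBy($"k").agg(first($"v").as("first_v")).na.fill(0).show()

  spark.stop()
}
{noformat}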



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


