[jira] [Commented] (SPARK-21962) Distributed Tracing in Spark

2018-07-12 Thread Andrew Ash (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541265#comment-16541265
 ] 

Andrew Ash commented on SPARK-21962:


Note that HTrace is now being removed from Hadoop – HADOOP-15566 

> Distributed Tracing in Spark
> 
>
> Key: SPARK-21962
> URL: https://issues.apache.org/jira/browse/SPARK-21962
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Priority: Major
>
> Spark should support distributed tracing, the mechanism popularized by Google in 
> the [Dapper Paper|https://research.google.com/pubs/pub36356.html], in which network 
> requests carry additional metadata used to trace requests between services.
> This would be useful for me since I have OpenZipkin style tracing in my 
> distributed application up to the Spark driver, and from the executors out to 
> my other services, but the link is broken in Spark between driver and 
> executor since the Span IDs aren't propagated across that link.
> An initial implementation could instrument the most important network calls 
> with trace ids (like launching and finishing tasks), and incrementally add 
> more tracing to other calls (torrent block distribution, external shuffle 
> service, etc) as the feature matures.
> Search keywords: Dapper, Brave, OpenZipkin, HTrace
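
As an aside, a minimal sketch of how a trace ID could be carried from driver to executors today using job-local properties; this is not an existing Spark tracing API, and the property name below is made up. A real integration would instrument Spark's RPC calls instead.

{noformat}
import org.apache.spark.{SparkContext, TaskContext}

// Hypothetical helper: attach a Zipkin-style trace ID to all jobs run in `body`.
def withTrace[T](sc: SparkContext, traceId: String)(body: => T): T = {
  sc.setLocalProperty("zipkin.traceId", traceId)   // made-up property name
  try body finally sc.setLocalProperty("zipkin.traceId", null)
}

withTrace(sc, "abc123") {
  sc.parallelize(1 to 10).foreach { _ =>
    // Executor side: tasks can read the propagated ID and attach it to spans
    // for their own outgoing calls.
    val id = TaskContext.get().getLocalProperty("zipkin.traceId")
    println(s"executing under trace $id")
  }
}
{noformat}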






[jira] [Commented] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column

2018-01-31 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347480#comment-16347480
 ] 

Andrew Ash commented on SPARK-23274:


Many thanks for the fast fix [~smilegator]!

> ReplaceExceptWithFilter fails on dataframes filtered on same column
> ---
>
> Key: SPARK-23274
> URL: https://issues.apache.org/jira/browse/SPARK-23274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Onur Satici
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Currently affects:
> {code:java}
> $ git tag --contains 01f6ba0e7a
> v2.3.0-rc1
> v2.3.0-rc2
> {code}
> Steps to reproduce:
> {code:java}
> $ cat test.csv
> a,b
> 1,2
> 1,3
> 2,2
> 2,4
> {code}
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("test.csv")
> val df1 = df.filter($"a" === 1)
> val df2 = df.filter($"a" === 2)
> df1.select("b").except(df2.select("b")).show
> {code}
> results in:
> {code:java}
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.tree

[jira] [Commented] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column

2018-01-30 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345539#comment-16345539
 ] 

Andrew Ash commented on SPARK-23274:


Suspect this regression was introduced by 
[https://github.com/apache/spark/commit/01f6ba0e7a] 

> ReplaceExceptWithFilter fails on dataframes filtered on same column
> ---
>
> Key: SPARK-23274
> URL: https://issues.apache.org/jira/browse/SPARK-23274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Onur Satici
>Priority: Major
>
> Currently affects:
> {code:java}
> $ git tag --contains 01f6ba0e7a
> v2.3.0-rc1
> v2.3.0-rc2
> {code}
> Steps to reproduce:
> {code:java}
> $ cat test.csv
> a,b
> 1,2
> 1,3
> 2,2
> 2,4
> {code}
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("test.csv")
> val df1 = df.filter($"a" === 1)
> val df2 = df.filter($"a" === 2)
> df1.select("b").except(df2.select("b")).show
> {code}
> results in:
> {code:java}
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$

[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel

2018-01-18 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330943#comment-16330943
 ] 

Andrew Ash commented on SPARK-22982:


[~joshrosen] do you have some example stacktraces of what this bug can cause?  
Several of our clusters hit what I think is this problem earlier this month, 
see below for details.

 

For a few days in January (4th through 12th) on our AWS infra, we observed 
massively degraded disk read throughput (down to 33% of previous peaks).  
During this time, we also began observing intermittent exceptions from Spark 
when reading Parquet files that a previous Spark job had written.  
When the read throughput recovered on the 12th, we stopped observing the 
exceptions and haven't seen them since.

At first we observed this stacktrace when reading .snappy.parquet files:
{noformat}
java.lang.RuntimeException: java.io.IOException: could not read page Page 
[bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col 
[my_column] BINARY
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:493)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:486)
at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:96)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:486)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:157)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:229)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:398)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: could not read page Page [bytes.size=1048641, 
valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV1(VectorizedColumnReader.java:562)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$000(VectorizedColumnReader.java:47)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:490)
... 31 more
Caused by: java.io

[jira] [Commented] (SPARK-22725) df.select on a Stream is broken, vs a List

2017-12-06 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281354#comment-16281354
 ] 

Andrew Ash commented on SPARK-22725:


Demonstration of difference between {{.map}} on List vs Stream:

{noformat}
scala> Stream(1,2,3).map { println(_) }
1
res1: scala.collection.immutable.Stream[Unit] = Stream((), ?)

scala> List(1,2,3).map { println(_) }
1
2
3
res2: List[Unit] = List((), (), ())

scala>
{noformat}
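
For completeness, a sketch of the same session showing that forcing the Stream (as the {{seq.toList.map}} change in the description does) restores eager evaluation; the REPL result numbering is illustrative:

{noformat}
scala> Stream(1,2,3).toList.map { println(_) }
1
2
3
res3: List[Unit] = List((), (), ())
{noformat}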

> df.select on a Stream is broken, vs a List
> --
>
> Key: SPARK-22725
> URL: https://issues.apache.org/jira/browse/SPARK-22725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>
> See failing test at https://github.com/apache/spark/pull/19917
> Failing:
> {noformat}
>   test("SPARK-ABC123: support select with a splatted stream") {
> val df = spark.createDataFrame(sparkContext.emptyRDD[Row], 
> StructType(List("bar", "foo").map {
>   StructField(_, StringType, false)
> }))
> val allColumns = Stream(df.col("bar"), col("foo"))
> val result = df.select(allColumns : _*)
>   }
> {noformat}
> Succeeds:
> {noformat}
>   test("SPARK-ABC123: support select with a splatted stream") {
> val df = spark.createDataFrame(sparkContext.emptyRDD[Row], 
> StructType(List("bar", "foo").map {
>   StructField(_, StringType, false)
> }))
> val allColumns = Seq(df.col("bar"), col("foo"))
> val result = df.select(allColumns : _*)
>   }
> {noformat}
> After stepping through in a debugger, the difference manifests at 
> https://github.com/apache/spark/blob/8ae004b4602266d1f210e4c1564246d590412c06/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L120
> Changing {{seq.map}} to {{seq.toList.map}} causes the test to pass.
> I think there's a very subtle bug here where the {{Seq}} of column names 
> passed into {{select}} is expected to eagerly evaluate when {{.map}} is 
> called on it, even though that's not part of the {{Seq}} contract.






[jira] [Created] (SPARK-22725) df.select on a Stream is broken, vs a List

2017-12-06 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-22725:
--

 Summary: df.select on a Stream is broken, vs a List
 Key: SPARK-22725
 URL: https://issues.apache.org/jira/browse/SPARK-22725
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Andrew Ash


See failing test at https://github.com/apache/spark/pull/19917

Failing:

{noformat}
  test("SPARK-ABC123: support select with a splatted stream") {
val df = spark.createDataFrame(sparkContext.emptyRDD[Row], 
StructType(List("bar", "foo").map {
  StructField(_, StringType, false)
}))
val allColumns = Stream(df.col("bar"), col("foo"))
val result = df.select(allColumns : _*)
  }
{noformat}

Succeeds:

{noformat}
  test("SPARK-ABC123: support select with a splatted stream") {
val df = spark.createDataFrame(sparkContext.emptyRDD[Row], 
StructType(List("bar", "foo").map {
  StructField(_, StringType, false)
}))
val allColumns = Seq(df.col("bar"), col("foo"))
val result = df.select(allColumns : _*)
  }
{noformat}

After stepping through in a debugger, the difference manifests at 
https://github.com/apache/spark/blob/8ae004b4602266d1f210e4c1564246d590412c06/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L120

Changing {{seq.map}} to {{seq.toList.map}} causes the test to pass.

I think there's a very subtle bug here where the {{Seq}} of column names passed 
into {{select}} is expected to eagerly evaluate when {{.map}} is called on it, 
even though that's not part of the {{Seq}} contract.






[jira] [Commented] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials

2017-11-09 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246328#comment-16246328
 ] 

Andrew Ash commented on SPARK-22479:


Completely agree that credentials shouldn't be in the toString, since query 
plans are logged in many places.  This looks like it brings 
SaveIntoDataSourceCommand more in line with JdbcRelation, which already redacts 
credentials from its toString to keep them out of logs.
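
For illustration only (this is not the actual Spark patch), a minimal sketch of the kind of redaction being discussed, applied to an options map before it is rendered into a plan string:

{noformat}
// Hypothetical sketch: mask sensitive keys before a data source's options are logged.
def redactOptions(options: Map[String, String]): Map[String, String] = {
  val sensitiveKeys = Set("password", "secret", "token")
  options.map { case (k, v) =>
    if (sensitiveKeys.contains(k.toLowerCase)) (k, "*********(redacted)") else (k, v)
  }
}

// redactOptions(Map("url" -> "jdbc:postgresql:sparkdb", "password" -> "pass"))
// => Map(url -> jdbc:postgresql:sparkdb, password -> *********(redacted))
{noformat}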

> SaveIntoDataSourceCommand logs jdbc credentials
> ---
>
> Key: SPARK-22479
> URL: https://issues.apache.org/jira/browse/SPARK-22479
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Onur Satici
>
> JDBC credentials are not redacted in plans including a 
> 'SaveIntoDataSourceCommand'.
> Steps to reproduce:
> {code}
> spark-shell --packages org.postgresql:postgresql:42.1.1
> {code}
> {code}
> import org.apache.spark.sql.execution.QueryExecution
> import org.apache.spark.sql.util.QueryExecutionListener
> val listener = new QueryExecutionListener {
>   override def onFailure(funcName: String, qe: QueryExecution, exception: 
> Exception): Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, duration: 
> Long): Unit = {
> System.out.println(qe.toString())
>   }
> }
> spark.listenerManager.register(listener)
> spark.range(100).write.format("jdbc").option("url", 
> "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", 
> "org.postgresql.Driver").option("dbtable", "test").save()
> {code}
> The above will yield the following plan:
> {code}
> == Parsed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Analyzed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Optimized Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Physical Plan ==
> ExecutedCommand
>+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>  +- Range (0, 100, step=1, splits=Some(8))
> {code}






[jira] [Created] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing

2017-11-08 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-22470:
--

 Summary: Doc that functions.hash is also used internally for 
shuffle and bucketing
 Key: SPARK-22470
 URL: https://issues.apache.org/jira/browse/SPARK-22470
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SQL
Affects Versions: 2.2.0
Reporter: Andrew Ash


https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that 
appears to be the same hash function as what Spark uses internally for shuffle 
and bucketing.

One of my users would like to bake this assumption into code, but is unsure if 
it's a guarantee or a coincidence that they're the same function.  Would it be 
considered an API break if at some point the two functions were different, or 
if the implementation of both changed together?

We should add a line to the scaladoc to clarify.
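
For context, a small sketch of the usage pattern in question; whether the result is guaranteed to match the hash Spark uses internally for shuffle and bucketing is exactly what this ticket asks to document:

{noformat}
import org.apache.spark.sql.functions.{col, hash}

// Compute a bucket-like key with the public hash() function. The user would like
// to rely on this equalling Spark's internal shuffle/bucketing hash, which is the
// assumption this ticket asks to confirm (or deny) in the scaladoc.
val df = spark.range(10).toDF("id")
val withBucket = df.withColumn("bucket", hash(col("id")) % 8)
withBucket.show()
{noformat}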






[jira] [Commented] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided

2017-10-26 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220734#comment-16220734
 ] 

Andrew Ash commented on SPARK-22042:


Hi I'm seeing this problem as well, thanks for investigating and putting up a 
PR [~tejasp]!  Have you been running any of your clusters with a patched 
version of Spark including that change, and has it been behaving as expected?

The repro one of my users independently provided was this:
{noformat}
val rows = List(1, 2, 3, 4, 5, 6);
 
val df1 = sc.parallelize(rows).toDF("col").repartition(1);
val df2 = sc.parallelize(rows).toDF("col").repartition(2);
val df3 = sc.parallelize(rows).toDF("col").repartition(2);
 
val dd1 = df1.join(df2, df1.col("col").equalTo(df2.col("col"))).join(df3, 
df2.col("col").equalTo(df3.col("col")));
 
dd1.show;
{noformat}

> ReorderJoinPredicates can break when child's partitioning is not decided
> 
>
> Key: SPARK-22042
> URL: https://issues.apache.org/jira/browse/SPARK-22042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tejas Patil
>Priority: Minor
>
> When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its 
> children, the children may not be properly constructed as the child-subtree 
> has to still go through other planner rules.
> In this particular case, the child is `SortMergeJoinExec`. Since the required 
> `Exchange` operators are not in place (because `EnsureRequirements` runs 
> _after_ `ReorderJoinPredicates`), the join's children would not have 
> partitioning defined. This breaks while creating the `PartitioningCollection` 
> here: 
> https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69
> Small repro:
> {noformat}
> context.sql("SET spark.sql.autoBroadcastJoinThreshold=0")
> val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", 
> "k")
> df.write.format("parquet").saveAsTable("table1")
> df.write.format("parquet").saveAsTable("table2")
> df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table")
> sql("""
>   SELECT *
>   FROM (
> SELECT a.i, a.j, a.k
> FROM bucketed_table a
> JOIN table1 b
> ON a.i = b.i
>   ) c
>   JOIN table2
>   ON c.i = table2.i
> """).explain
> {noformat}
> This fails with :
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: 
> PartitioningCollection requires all of its partitionings have the same 
> numPartitions.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:324)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69)
>   at 
> org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution.stringOrErro

[jira] [Commented] (SPARK-21991) [LAUNCHER] LauncherServer acceptConnections thread sometime dies if machine has very high load

2017-10-25 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219139#comment-16219139
 ] 

Andrew Ash commented on SPARK-21991:


Thanks for the contribution to Spark [~nivox]!  I'll be testing this on some 
clusters of mine and will echo back here if I see any problems.  Cheers!

> [LAUNCHER] LauncherServer acceptConnections thread sometime dies if machine 
> has very high load
> --
>
> Key: SPARK-21991
> URL: https://issues.apache.org/jira/browse/SPARK-21991
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.2, 2.1.0, 2.1.1, 2.2.0
> Environment: Single node machine running Ubuntu 16.04.2 LTS 
> (4.4.0-79-generic)
> YARN 2.7.2
> Spark 2.0.2
>Reporter: Andrea Zito
>Assignee: Andrea Zito
>Priority: Minor
> Fix For: 2.0.3, 2.1.3, 2.2.1, 2.3.0
>
>
> The way the _LauncherServer_ _acceptConnections_ thread schedules client 
> timeouts causes (non-deterministically) the thread to die with the following 
> exception if the machine is under very high load:
> {noformat}
> Exception in thread "LauncherServer-1" java.lang.IllegalStateException: Task 
> already scheduled or cancelled
> at java.util.Timer.sched(Timer.java:401)
> at java.util.Timer.schedule(Timer.java:193)
> at 
> org.apache.spark.launcher.LauncherServer.acceptConnections(LauncherServer.java:249)
> at 
> org.apache.spark.launcher.LauncherServer.access$000(LauncherServer.java:80)
> at 
> org.apache.spark.launcher.LauncherServer$1.run(LauncherServer.java:143)
> {noformat}
> The issue is related to the ordering of actions that the _acceptConnections_ 
> thread uses to handle a client connection:
> # create timeout action
> # create client thread
> # start client thread
> # schedule timeout action
> Under normal conditions the scheduling of the timeout action happens before 
> the client thread has a chance to start; however, if the machine is under very 
> high load the client thread can receive CPU time before the timeout action 
> gets scheduled.
> If this condition happens, the client thread cancels the timeout action (which 
> has not yet been scheduled) and goes on, but as soon as the 
> _acceptConnections_ thread gets the CPU back, it will try to schedule the 
> timeout action (which has already been canceled), thus raising the exception.
> Changing the order in which the client thread gets started and the timeout 
> gets scheduled seems to be sufficient to fix this issue.
> As stated above, the issue is non-deterministic. I faced it multiple 
> times on a single-node machine submitting a high number of short jobs 
> sequentially, but I couldn't easily create a test reproducing the issue. 
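
The race boils down to {{java.util.Timer}} refusing to schedule a task that has already been cancelled. A distilled sketch (not the actual LauncherServer code) of the two orderings:

{noformat}
import java.util.{Timer, TimerTask}

val timer = new Timer("timeout-timer", true)
val timeout = new TimerTask { def run(): Unit = println("client timed out") }

// Buggy ordering: start the client thread first. Under high load it may run and
// cancel the not-yet-scheduled timeout before the parent thread calls schedule(),
// and Timer.schedule() then throws "Task already scheduled or cancelled".
val client = new Thread(new Runnable { def run(): Unit = timeout.cancel() })
client.start()
timer.schedule(timeout, 10000L)   // may throw IllegalStateException

// Fixed ordering, as described above: schedule the timeout first, then start the
// client thread, so cancel() only ever sees an already-scheduled task.
{noformat}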






[jira] [Commented] (SPARK-21991) [LAUNCHER] LauncherServer acceptConnections thread sometime dies if machine has very high load

2017-10-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216529#comment-16216529
 ] 

Andrew Ash commented on SPARK-21991:


Thanks for debugging and diagnosing this [~nivox]! I'm seeing the same issue 
right now on one of my Spark clusters so am interested in getting your fix in 
to mainline Spark for my users.

Have you deployed the change from your linked PR in a live setting, and has it 
fixed the issue for you?

> [LAUNCHER] LauncherServer acceptConnections thread sometime dies if machine 
> has very high load
> --
>
> Key: SPARK-21991
> URL: https://issues.apache.org/jira/browse/SPARK-21991
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.2, 2.1.0, 2.1.1, 2.2.0
> Environment: Single node machine running Ubuntu 16.04.2 LTS 
> (4.4.0-79-generic)
> YARN 2.7.2
> Spark 2.0.2
>Reporter: Andrea Zito
>Priority: Minor
>
> The way the _LauncherServer_ _acceptConnections_ thread schedules client 
> timeouts causes (non-deterministically) the thread to die with the following 
> exception if the machine is under very high load:
> {noformat}
> Exception in thread "LauncherServer-1" java.lang.IllegalStateException: Task 
> already scheduled or cancelled
> at java.util.Timer.sched(Timer.java:401)
> at java.util.Timer.schedule(Timer.java:193)
> at 
> org.apache.spark.launcher.LauncherServer.acceptConnections(LauncherServer.java:249)
> at 
> org.apache.spark.launcher.LauncherServer.access$000(LauncherServer.java:80)
> at 
> org.apache.spark.launcher.LauncherServer$1.run(LauncherServer.java:143)
> {noformat}
> The issue is related to the ordering of actions that the _acceptConnections_ 
> thread uses to handle a client connection:
> # create timeout action
> # create client thread
> # start client thread
> # schedule timeout action
> Under normal conditions the scheduling of the timeout action happens before 
> the client thread has a chance to start; however, if the machine is under very 
> high load the client thread can receive CPU time before the timeout action 
> gets scheduled.
> If this condition happens, the client thread cancels the timeout action (which 
> has not yet been scheduled) and goes on, but as soon as the 
> _acceptConnections_ thread gets the CPU back, it will try to schedule the 
> timeout action (which has already been canceled), thus raising the exception.
> Changing the order in which the client thread gets started and the timeout 
> gets scheduled seems to be sufficient to fix this issue.
> As stated above, the issue is non-deterministic. I faced it multiple 
> times on a single-node machine submitting a high number of short jobs 
> sequentially, but I couldn't easily create a test reproducing the issue. 






[jira] [Commented] (SPARK-22204) Explain output for SQL with commands shows no optimization

2017-10-17 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208676#comment-16208676
 ] 

Andrew Ash commented on SPARK-22204:


One way to work around this issue is to get the child of the command node and 
run explain on that, though this does run the query planning twice.

See also discussion at 
https://github.com/apache/spark/pull/19269#discussion_r139841435
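
A rough approximation of that workaround, explaining the embedded SELECT on its own (this is not an official API and it plans the query twice):

{noformat}
// The plan of the bare SELECT does show the analyzed -> optimized transition,
// unlike the plan of the surrounding CREATE TABLE command.
spark.sql("SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d")
  .explain(true)

// Then run the command itself without relying on its explain output.
spark.sql("CREATE TABLE IF NOT EXISTS tmptable AS SELECT a FROM " +
  "(SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d")
{noformat}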

> Explain output for SQL with commands shows no optimization
> --
>
> Key: SPARK-22204
> URL: https://issues.apache.org/jira/browse/SPARK-22204
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>
> When displaying the explain output for a basic SELECT query, the query plan 
> changes as expected from analyzed -> optimized stages.  But when putting that 
> same query into a command, for example {{CREATE TABLE}} it appears that the 
> optimization doesn't take place.
> In Spark shell:
> Explain output for a {{SELECT}} statement shows optimization:
> {noformat}
> scala> spark.sql("SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) 
> AS b) AS c) AS d").explain(true)
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'SubqueryAlias d
>+- 'Project ['a]
>   +- 'SubqueryAlias c
>  +- 'Project ['a]
> +- SubqueryAlias b
>+- Project [1 AS a#29]
>   +- OneRowRelation
> == Analyzed Logical Plan ==
> a: int
> Project [a#29]
> +- SubqueryAlias d
>+- Project [a#29]
>   +- SubqueryAlias c
>  +- Project [a#29]
> +- SubqueryAlias b
>+- Project [1 AS a#29]
>   +- OneRowRelation
> == Optimized Logical Plan ==
> Project [1 AS a#29]
> +- OneRowRelation
> == Physical Plan ==
> *Project [1 AS a#29]
> +- Scan OneRowRelation[]
> scala> 
> {noformat}
> But the same command run inside {{CREATE TABLE}} does not:
> {noformat}
> scala> spark.sql("CREATE TABLE IF NOT EXISTS tmptable AS SELECT a FROM 
> (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d").explain(true)
> == Parsed Logical Plan ==
> 'CreateTable `tmptable`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> Ignore
> +- 'Project ['a]
>+- 'SubqueryAlias d
>   +- 'Project ['a]
>  +- 'SubqueryAlias c
> +- 'Project ['a]
>+- SubqueryAlias b
>   +- Project [1 AS a#33]
>  +- OneRowRelation
> == Analyzed Logical Plan ==
> CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, 
> InsertIntoHiveTable]
>+- Project [a#33]
>   +- SubqueryAlias d
>  +- Project [a#33]
> +- SubqueryAlias c
>+- Project [a#33]
>   +- SubqueryAlias b
>  +- Project [1 AS a#33]
> +- OneRowRelation
> == Optimized Logical Plan ==
> CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, 
> InsertIntoHiveTable]
>+- Project [a#33]
>   +- SubqueryAlias d
>  +- Project [a#33]
> +- SubqueryAlias c
>+- Project [a#33]
>   +- SubqueryAlias b
>  +- Project [1 AS a#33]
> +- OneRowRelation
> == Physical Plan ==
> CreateHiveTableAsSelectCommand CreateHiveTableAsSelectCommand 
> [Database:default}, TableName: tmptable, InsertIntoHiveTable]
>+- Project [a#33]
>   +- SubqueryAlias d
>  +- Project [a#33]
> +- SubqueryAlias c
>+- Project [a#33]
>   +- SubqueryAlias b
>  +- Project [1 AS a#33]
> +- OneRowRelation
> scala>
> {noformat}
> Note that there is no change between the analyzed and optimized plans when 
> run in a command.
> This is misleading my users, as they think that there is no optimization 
> happening in the query!






[jira] [Commented] (SPARK-22269) Java style checks should be run in Jenkins

2017-10-12 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202795#comment-16202795
 ] 

Andrew Ash commented on SPARK-22269:


[~sowen] you closed this as a duplicate.  What issue is it a duplicate of?

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Minor
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than post-merge.






[jira] [Commented] (SPARK-22268) Fix java style errors

2017-10-12 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202793#comment-16202793
 ] 

Andrew Ash commented on SPARK-22268:


Any time {{./dev/run-tests}} is failing I consider that a bug.

> Fix java style errors
> -
>
> Key: SPARK-22268
> URL: https://issues.apache.org/jira/browse/SPARK-22268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Trivial
>
> {{./dev/lint-java}} fails on master right now with these exceptions:
> {noformat}
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176]
>  (sizes) LineLength: Line is longer than 100 characters (found 112).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177]
>  (sizes) LineLength: Line is longer than 100 characters (found 106).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178]
>  (sizes) LineLength: Line is longer than 100 characters (found 136).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520]
>  (sizes) LineLength: Line is longer than 100 characters (found 104).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524]
>  (sizes) LineLength: Line is longer than 100 characters (found 123).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533]
>  (sizes) LineLength: Line is longer than 100 characters (found 120).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535]
>  (sizes) LineLength: Line is longer than 100 characters (found 114).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182]
>  (sizes) LineLength: Line is longer than 100 characters (found 116).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8]
>  (imports) UnusedImports: Unused import - 
> org.apache.spark.sql.catalyst.expressions.Expression.
> {noformat}






[jira] [Created] (SPARK-22269) Java style checks should be run in Jenkins

2017-10-12 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-22269:
--

 Summary: Java style checks should be run in Jenkins
 Key: SPARK-22269
 URL: https://issues.apache.org/jira/browse/SPARK-22269
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.3.0
Reporter: Andrew Ash


A few times now I've gone to build the master branch and it's failed due to 
Java style errors, which I've sent in PRs to fix:

- https://issues.apache.org/jira/browse/SPARK-22268
- https://issues.apache.org/jira/browse/SPARK-21875

Digging through the history a bit, it looks like this check used to run on 
Jenkins and was previously enabled at 
https://github.com/apache/spark/pull/10763 but then reverted at 
https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5

We should work out what it takes to enable the Java check in Jenkins so these 
kinds of errors are caught in CI rather than post-merge.






[jira] [Created] (SPARK-22268) Fix java style errors

2017-10-12 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-22268:
--

 Summary: Fix java style errors
 Key: SPARK-22268
 URL: https://issues.apache.org/jira/browse/SPARK-22268
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Andrew Ash


{{./dev/lint-java}} fails on master right now with these exceptions:

{noformat}
[ERROR] 
src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176]
 (sizes) LineLength: Line is longer than 100 characters (found 112).
[ERROR] 
src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177]
 (sizes) LineLength: Line is longer than 100 characters (found 106).
[ERROR] 
src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178]
 (sizes) LineLength: Line is longer than 100 characters (found 136).
[ERROR] 
src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520]
 (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] 
src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524]
 (sizes) LineLength: Line is longer than 100 characters (found 123).
[ERROR] 
src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533]
 (sizes) LineLength: Line is longer than 100 characters (found 120).
[ERROR] 
src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535]
 (sizes) LineLength: Line is longer than 100 characters (found 114).
[ERROR] 
src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182]
 (sizes) LineLength: Line is longer than 100 characters (found 116).
[ERROR] 
src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8]
 (imports) UnusedImports: Unused import - 
org.apache.spark.sql.catalyst.expressions.Expression.
{noformat}






[jira] [Commented] (SPARK-18359) Let user specify locale in CSV parsing

2017-10-11 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200831#comment-16200831
 ] 

Andrew Ash commented on SPARK-18359:


I agree with Sean -- using the submitting JVM's locale is insufficient.  Some 
users will want to join CSVs with different locales, and that doesn't offer a 
means of using different locales on different files.  This really should be 
specified on a per-CSV-file basis, which is why I think the direction [~abicz] 
suggested supports this nicely:

{noformat}
spark.read.option("locale", "value").csv("filefromeurope.csv")
{noformat}

where the value is something we can use to get a {{java.util.Locale}}

Has anyone worked around this functionality gap in a useful way, or do we still 
need to work through these Locale issues in spark-csv?
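
One possible stopgap, sketched below purely as an assumption rather than an existing option: read the affected columns as strings and convert them with a locale-aware parser per file (the "amount" column name is hypothetical):

{noformat}
import java.text.NumberFormat
import java.util.Locale
import org.apache.spark.sql.functions.{col, udf}

// Parse decimals written with a French locale (comma as the decimal separator).
// The NumberFormat is created inside the UDF because instances are not thread-safe.
val parseFrDecimal = udf { s: String =>
  Option(s).map(v => NumberFormat.getInstance(Locale.FRANCE).parse(v.trim).doubleValue())
}

val df = spark.read.option("header", "true").csv("filefromeurope.csv")
  .withColumn("amount", parseFrDecimal(col("amount")))
{noformat}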

> Let user specify locale in CSV parsing
> --
>
> Key: SPARK-18359
> URL: https://issues.apache.org/jira/browse/SPARK-18359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: yannick Radji
>
> On the DataFrameReader object there no CSV-specific option to set decimal 
> delimiter on comma whereas dot like it use to be in France and Europe.






[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2017-10-05 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193355#comment-16193355
 ] 

Andrew Ash commented on SPARK-20055:


What I would find most useful is a list of available options and their 
behaviors, like https://github.com/databricks/spark-csv#features

Even though that project has been in-lined into Apache Spark, that GitHub page 
is still the best reference for CSV options.
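
For example, a short block like the following, using options that already exist in the CSV reader (the file path is illustrative), would cover most of what users ask about:

{noformat}
val df = spark.read
  .option("header", "true")        // first line contains column names
  .option("inferSchema", "true")   // sample the data to infer column types
  .option("sep", ",")              // field delimiter
  .option("nullValue", "")         // string to interpret as null
  .option("mode", "PERMISSIVE")    // how malformed records are handled
  .csv("path/to/data.csv")
{noformat}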

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> I guess the commonly used and important things are documented there, rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options, such as 
> {{wholeFile}}, are notable for both. Moreover, several options such as 
> {{inferSchema}} and {{header}} are important for CSV because they affect the 
> types and column names of the data.
> In that sense, I think we might better document CSV datasets with some 
> examples too, because I believe reading CSV is a very common use case.
> Also, I think we could also leave some pointers for options of API 
> documentations for both (rather than duplicating the documentation).
> So, my suggestion is,
> - Add CSV Datasets section.
> - Add links for options for both JSON and CSV that point each API 
> documentation
> - Fix trivial minor fixes together in both sections.






[jira] [Created] (SPARK-22204) Explain output for SQL with commands shows no optimization

2017-10-04 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-22204:
--

 Summary: Explain output for SQL with commands shows no optimization
 Key: SPARK-22204
 URL: https://issues.apache.org/jira/browse/SPARK-22204
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Andrew Ash


When displaying the explain output for a basic SELECT query, the query plan 
changes as expected from analyzed -> optimized stages.  But when putting that 
same query into a command, for example {{CREATE TABLE}} it appears that the 
optimization doesn't take place.

In Spark shell:

Explain output for a {{SELECT}} statement shows optimization:
{noformat}
scala> spark.sql("SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) 
AS b) AS c) AS d").explain(true)
== Parsed Logical Plan ==
'Project ['a]
+- 'SubqueryAlias d
   +- 'Project ['a]
  +- 'SubqueryAlias c
 +- 'Project ['a]
+- SubqueryAlias b
   +- Project [1 AS a#29]
  +- OneRowRelation

== Analyzed Logical Plan ==
a: int
Project [a#29]
+- SubqueryAlias d
   +- Project [a#29]
  +- SubqueryAlias c
 +- Project [a#29]
+- SubqueryAlias b
   +- Project [1 AS a#29]
  +- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS a#29]
+- OneRowRelation

== Physical Plan ==
*Project [1 AS a#29]
+- Scan OneRowRelation[]

scala> 
{noformat}

But the same command run inside {{CREATE TABLE}} does not:

{noformat}
scala> spark.sql("CREATE TABLE IF NOT EXISTS tmptable AS SELECT a FROM (SELECT 
a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d").explain(true)
== Parsed Logical Plan ==
'CreateTable `tmptable`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
Ignore
+- 'Project ['a]
   +- 'SubqueryAlias d
  +- 'Project ['a]
 +- 'SubqueryAlias c
+- 'Project ['a]
   +- SubqueryAlias b
  +- Project [1 AS a#33]
 +- OneRowRelation

== Analyzed Logical Plan ==
CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, 
InsertIntoHiveTable]
   +- Project [a#33]
  +- SubqueryAlias d
 +- Project [a#33]
+- SubqueryAlias c
   +- Project [a#33]
  +- SubqueryAlias b
 +- Project [1 AS a#33]
+- OneRowRelation

== Optimized Logical Plan ==
CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, 
InsertIntoHiveTable]
   +- Project [a#33]
  +- SubqueryAlias d
 +- Project [a#33]
+- SubqueryAlias c
   +- Project [a#33]
  +- SubqueryAlias b
 +- Project [1 AS a#33]
+- OneRowRelation

== Physical Plan ==
CreateHiveTableAsSelectCommand CreateHiveTableAsSelectCommand 
[Database:default}, TableName: tmptable, InsertIntoHiveTable]
   +- Project [a#33]
  +- SubqueryAlias d
 +- Project [a#33]
+- SubqueryAlias c
   +- Project [a#33]
  +- SubqueryAlias b
 +- Project [1 AS a#33]
+- OneRowRelation

scala>
{noformat}

Note that there is no change between the analyzed and optimized plans when run 
in a command.


This is misleading my users, as they think that there is no optimization 
happening in the query!






[jira] [Reopened] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-09-28 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash reopened SPARK-18016:


// reopening issue

One PR addressing this bug has been merged -- 
https://github.com/apache/spark/pull/18075 -- but the second PR hasn't gone in 
yet: https://github.com/apache/spark/pull/16648

I'm still observing the error message on a dataset with ~3000 columns:

{noformat}
java.lang.RuntimeException: Error while encoding: 
org.codehaus.janino.JaninoRuntimeException: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass2
 has grown past JVM limit of 0x
{noformat}

even when running with the first PR, and [~jamcon] reported similarly at 
https://issues.apache.org/jira/browse/SPARK-18016?focusedCommentId=16103853&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16103853
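
A very wide schema of the kind described can be generated for experimentation as below; this is only a sketch, and whether it actually overflows the constant pool depends on the Spark version and the real schema's shape:

{noformat}
import org.apache.spark.sql.functions.lit

// Build a DataFrame with ~3000 columns to exercise code generation on wide schemas.
val cols = (1 to 3000).map(i => lit(i).as(s"c$i"))
val wide = spark.range(10).select(cols: _*)
wide.count()
{noformat}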

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.code

[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-09-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178459#comment-16178459
 ] 

Andrew Ash commented on SPARK-19700:


There was a thread on the dev list recently about Apache Aurora: 
http://aurora.apache.org/

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continues to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their own particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.
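
For illustration only, the plugin model described above (an implementation JAR 
on the driver's classpath, selected by the user-supplied master URL) could be 
wired up roughly like the hypothetical sketch below -- the names here are made 
up and are not the proposed API:

{code}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical SPI: a cluster manager integration ships this in a JAR on the
// driver's classpath and is picked based on the user-supplied master URL.
trait SchedulerPlugin {
  def canHandle(masterUrl: String): Boolean   // e.g. "aurora://..." or "k8s://..."
  def start(masterUrl: String): Unit          // bring up the cluster-specific backend
}

object SchedulerPlugins {
  def forMaster(masterUrl: String): Option[SchedulerPlugin] =
    ServiceLoader.load(classOf[SchedulerPlugin]).asScala.find(_.canHandle(masterUrl))
}
{code}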



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22112) Add missing method to pyspark api: spark.read.csv(Dataset)

2017-09-24 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-22112:
--

 Summary: Add missing method to pyspark api: 
spark.read.csv(Dataset)
 Key: SPARK-22112
 URL: https://issues.apache.org/jira/browse/SPARK-22112
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Andrew Ash


https://issues.apache.org/jira/browse/SPARK-15463 added a method to the scala 
API without adding an equivalent in pyspark: {{spark.read.csv(Dataset)}}

I was writing some things with pyspark but had to switch to scala/java to use 
that method. Since API parity between python/java/scala is a Spark goal, we 
should make sure this functionality exists in all the supported languages.

https://github.com/apache/spark/pull/16854/files#diff-f70bda59304588cc3abfa3a9840653f4R408
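
For reference, the Scala usage that has no pyspark counterpart looks roughly 
like this (spark-shell style sketch):

{code}
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// SPARK-15463 added csv(Dataset[String]): parse lines that are already in a Dataset.
val csvLines: Dataset[String] = Seq("a,b", "1,2", "1,3").toDS()
val parsed = spark.read.option("header", "true").csv(csvLines)
parsed.show()
{code}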

cc [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21962) Distributed Tracing in Spark

2017-09-08 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21962:
--

 Summary: Distributed Tracing in Spark
 Key: SPARK-21962
 URL: https://issues.apache.org/jira/browse/SPARK-21962
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash


Spark should support distributed tracing, which is the mechanism, widely 
popularized by Google in the [Dapper 
Paper|https://research.google.com/pubs/pub36356.html], where network requests 
have additional metadata used for tracing requests between services.

This would be useful for me since I have OpenZipkin style tracing in my 
distributed application up to the Spark driver, and from the executors out to 
my other services, but the link is broken in Spark between driver and executor 
since the Span IDs aren't propagated across that link.

An initial implementation could instrument the most important network calls 
with trace ids (like launching and finishing tasks), and incrementally add more 
tracing to other calls (torrent block distribution, external shuffle service, 
etc) as the feature matures.

Search keywords: Dapper, Brave, OpenZipkin, HTrace
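
Until first-class support exists, a partial workaround is to thread a span id 
through Spark's local properties, which are shipped with each task launched from 
the submitting thread (a sketch only -- the property key is arbitrary, and this 
covers task launch but not Spark's other internal RPCs):

{code}
import org.apache.spark.{SparkContext, TaskContext}

// Driver side: attach the current span id to the submitting thread's local properties.
def withSpan[T](sc: SparkContext, spanId: String)(body: => T): T = {
  sc.setLocalProperty("zipkin.spanId", spanId)          // arbitrary key name
  try body finally sc.setLocalProperty("zipkin.spanId", null)
}

// Executor side (inside a task): recover the span id for outgoing calls.
def currentSpanId(): Option[String] =
  Option(TaskContext.get()).flatMap(tc => Option(tc.getLocalProperty("zipkin.spanId")))
{code}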



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21953) Show both memory and disk bytes spilled if either is present

2017-09-08 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21953:
--

 Summary: Show both memory and disk bytes spilled if either is 
present
 Key: SPARK-21953
 URL: https://issues.apache.org/jira/browse/SPARK-21953
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.2.0
Reporter: Andrew Ash
Priority: Minor


https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61
 should be {{||}} not {{&&}}

As written now, there must be both memory and disk bytes spilled to show either 
of them.  If there is only one of those types of spill recorded, it will be 
hidden.
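
In other words, the intended condition is roughly the following (a sketch -- 
variable names approximate the UI code):

{code}
// Show the spill columns when either kind of spill occurred, not only when both did.
def showSpillColumns(memoryBytesSpilled: Long, diskBytesSpilled: Long): Boolean =
  memoryBytesSpilled > 0 || diskBytesSpilled > 0
{code}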



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2017-09-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156622#comment-16156622
 ] 

Andrew Ash commented on SPARK-12449:


[~velvia] I'm not involved with CatalystSource or SAP HANA Vora, so I can't 
comment on the direction that project is going right now.

However, there is an effort underway to add a new Data Source V2 API at 
https://issues.apache.org/jira/browse/SPARK-15689 and on the dev mailing list 
right now, which could grow to encompass the goals of this issue.

[~stephank85] if you are able to comment on SPARK-15689 your input would be 
very valuable to that API design.

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 
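
For readers who haven't seen the talk, the shape of the proposed hook is roughly 
the following (a hypothetical sketch based on the slides, not the exact 
CatalystSource signatures):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// The source inspects an arbitrary logical plan and, if it can evaluate it,
// returns the result directly instead of Spark executing the plan itself.
trait WholePlanPushdownSource {
  def supportsLogicalPlan(plan: LogicalPlan): Boolean
  def execute(plan: LogicalPlan): RDD[Row]
}
{code}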



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21941) Stop storing unused attemptId in SQLTaskMetrics

2017-09-07 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21941:
--

 Summary: Stop storing unused attemptId in SQLTaskMetrics
 Key: SPARK-21941
 URL: https://issues.apache.org/jira/browse/SPARK-21941
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.2.0
Reporter: Andrew Ash


Currently SQLTaskMetrics has a long attemptId field on it that is unused, with 
a TODO saying to populate the value in the future.  We should save this memory 
by leaving the TODO but taking the unused field out of the class.

I have a driver that heap dumped on OOM and has 390,105 instances of 
SQLTaskMetric -- removing this 8-byte field will save roughly 390k*8 = 3.1MB 
of heap space.  It's not going to fix my OOM, but there's no reason to put this 
pressure on the GC if we don't get anything by storing it.

https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L485



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21807) The getAliasedConstraints function in LogicalPlan will take a long time when number of expressions is greater than 100

2017-08-31 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149600#comment-16149600
 ] 

Andrew Ash commented on SPARK-21807:


For reference, here's a stacktrace I'm seeing on a cluster before this change 
that I think this PR will improve:

{noformat}
"spark-task-4" #714 prio=5 os_prio=0 tid=0x7fa368031000 nid=0x4d91 runnable 
[0x7fa24e592000]
   java.lang.Thread.State: RUNNABLE
at 
org.apache.spark.sql.catalyst.expressions.AttributeReference.equals(namedExpressions.scala:220)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.EqualNullSafe.equals(predicates.scala:505)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:151)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
at 
scala.collection.mutable.FlatHashTable$class.growTable(FlatHashTable.scala:225)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:159)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
at 
scala.collection.mutable.FlatHashTable$class.addElem(FlatHashTable.scala:139)
at scala.collection.mutable.HashSet.addElem(HashSet.scala:40)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:59)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:40)
at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.AbstractSet.$plus$plus$eq(Set.scala:46)
at scala.collection.mutable.HashSet.clone(HashSet.scala:83)
at scala.collection.mutable.HashSet.clone(HashSet.scala:40)
at 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:65)
at 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:50)
at scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
at scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104)
at 
scala.collection.TraversableOnce$class.$div$colon(TraversableOn

[jira] [Commented] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java

2017-08-30 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147976#comment-16147976
 ] 

Andrew Ash commented on SPARK-21875:


I'd be interested in more details on why it can't be run in the PR builder -- I 
have the full `./dev/run-tests` running in CI and it catches things like this 
occasionally.

PR at https://github.com/apache/spark/pull/19088

> Jenkins passes Java code that violates ./dev/lint-java
> --
>
> Key: SPARK-21875
> URL: https://issues.apache.org/jira/browse/SPARK-21875
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Priority: Trivial
>
> Two recent PRs merged which caused lint-java errors:
> {noformat}
> 
> Running Java style checks
> 
> Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn
> Checkstyle checks failed at following occurrences:
> [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] 
> (sizes) LineLength: Line is longer than 100 characters (found 106).
> [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] 
> (sizes) LineLength: Line is longer than 100 characters (found 106).
> [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1
> {noformat}
> The first error is from https://github.com/apache/spark/pull/19025 and the 
> second is from https://github.com/apache/spark/pull/18488
> Should we be expecting Jenkins to enforce Java code style pre-commit?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java

2017-08-30 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21875:
--

 Summary: Jenkins passes Java code that violates ./dev/lint-java
 Key: SPARK-21875
 URL: https://issues.apache.org/jira/browse/SPARK-21875
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.0
Reporter: Andrew Ash


Two recent PRs merged which caused lint-java errors:

{noformat}

Running Java style checks

Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] 
(sizes) LineLength: Line is longer than 100 characters (found 106).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] 
(sizes) LineLength: Line is longer than 100 characters (found 106).
[error] running /home/ubuntu/spark/dev/lint-java ; received return code 1
{noformat}

The first error is from https://github.com/apache/spark/pull/19025 and the 
second is from https://github.com/apache/spark/pull/18488

Should we be expecting Jenkins to enforce Java code style pre-commit?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2017-08-23 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138509#comment-16138509
 ] 

Andrew Ash commented on SPARK-15689:


Can the authors of this document add a section contrasting the approach with 
the one from https://issues.apache.org/jira/browse/SPARK-12449 ?  In that 
approach the data source receives an entire arbitrary plan rather than just 
parts of the plan.

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers. The current data source API has a wide surface with dependency on 
> DataFrame/SQLContext, making the data source API compatibility depending on 
> the upper level API. The current data source API is also only row oriented 
> and has to go through an expensive external data type conversion to internal 
> data type.
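
Purely for illustration of points 1-3 above, the kind of narrow, 
columnar-capable reader surface being described might look like this (a 
hypothetical sketch, not the actual DataSourceV2 API):

{code}
import org.apache.spark.sql.sources.Filter

trait ColumnBatch   // stand-in for a well-defined column-batch type (point 2)

// Small surface (point 1) with filter pushdown (point 3) and batch reads.
trait V2Reader {
  def pushFilters(filters: Array[Filter]): Array[Filter]  // return filters the source cannot handle
  def planBatches(): Iterator[ColumnBatch]
}
{code}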



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2017-08-23 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138501#comment-16138501
 ] 

Andrew Ash commented on SPARK-12449:


Relevant slides: 
https://www.slideshare.net/SparkSummit/the-pushdown-of-everything-by-stephan-kessler-and-santiago-mola

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-08-21 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136371#comment-16136371
 ] 

Andrew Ash commented on SPARK-19552:


I didn't see anything other than the issue you just commented on at 
https://issues.apache.org/jira/browse/ARROW-292

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern but not one we'd be 
> exposed to with Spark "out of the box". Let's upgrade the version we use to 
> be on the safe side as the security fix I'm especially interested in is not 
> available in the 4.0.x release line. 
> We should move up anyway to take on a bunch of other bug fixes cited in the 
> release notes (and if anyone were to use Spark with netty and tcnative, they 
> shouldn't be exposed to the security problem) - we should be good citizens 
> and make this change.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. This JIRA and associated pull 
> request starts the process which I'll work on - and any help would be much 
> appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-08-20 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16134657#comment-16134657
 ] 

Andrew Ash commented on SPARK-19552:


Heads up the next time someone attempts this:

Upgrading to 4.1.x causes a few Apache Arrow-related integration test failures 
in Spark, because Arrow depends on a part of Netty 4.0.x that has changed in 
the 4.1.x series. I've been running with Netty 4.1.x on my fork for a few 
months, but because of recent Arrow changes I will now have to downgrade back 
to Netty 4.0.x. More details at 
https://github.com/palantir/spark/pull/247#issuecomment-323469174

So when Spark does go to 4.1.x we might need Arrow to bump its dependencies as 
well (or shade netty in Spark at the same time as Sean suggests, which I 
generally support).

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern but not one we'd be 
> exposed to with Spark "out of the box". Let's upgrade the version we use to 
> be on the safe side as the security fix I'm especially interested in is not 
> available in the 4.0.x release line. 
> We should move up anyway to take on a bunch of other bug fixes cited in the 
> release notes (and if anyone were to use Spark with netty and tcnative, they 
> shouldn't be exposed to the security problem) - we should be good citizens 
> and make this change.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. This JIRA and associated pull 
> request starts the process which I'll work on - and any help would be much 
> appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21757) Jobs page fails to load when executor removed event's reason contains single quote

2017-08-16 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-21757:
---
Description: 
At the following two places if the {{e.reason}} value contains a single quote 
character, then the rendered JSON is invalid.

https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala?utf8=%E2%9C%93#L158
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L127

A quick fix could be to do replacement of {{'}} with html entity 
{noformat}'{noformat} and a longer term fix could be to not assemble JSON 
via string concatenation, and instead use a real JSON serialization library in 
those two places.

  was:
At the following two places if the {{e.reason}} value contains a single quote 
character, then the rendered JSON is invalid.

https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala?utf8=%E2%9C%93#L158
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L127

A quick fix could be to do replacement of {{'}} with html entity {{'}} and 
a longer term fix could be to not assemble JSON via string concatenation, and 
instead use a real JSON serialization library in those two places.


> Jobs page fails to load when executor removed event's reason contains single 
> quote
> --
>
> Key: SPARK-21757
> URL: https://issues.apache.org/jira/browse/SPARK-21757
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> At the following two places if the {{e.reason}} value contains a single quote 
> character, then the rendered JSON is invalid.
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala?utf8=%E2%9C%93#L158
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L127
> A quick fix could be to do replacement of {{'}} with html entity 
> {noformat}'{noformat} and a longer term fix could be to not assemble JSON 
> via string concatenation, and instead use a real JSON serialization library 
> in those two places.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21757) Jobs page fails to load when executor removed event's reason contains single quote

2017-08-16 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21757:
--

 Summary: Jobs page fails to load when executor removed event's 
reason contains single quote
 Key: SPARK-21757
 URL: https://issues.apache.org/jira/browse/SPARK-21757
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.2.0
Reporter: Andrew Ash


At the following two places if the {{e.reason}} value contains a single quote 
character, then the rendered JSON is invalid.

https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala?utf8=%E2%9C%93#L158
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L127

A quick fix could be to do replacement of {{'}} with html entity {{'}} and 
a longer term fix could be to not assemble JSON via string concatenation, and 
instead use a real JSON serialization library in those two places.
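
For the longer-term fix, a minimal sketch using json4s (already on Spark's 
classpath) instead of string concatenation -- the field names here are 
illustrative:

{code}
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// A real serializer emits correctly quoted and escaped strings, so a single quote
// in e.reason can no longer break out of the surrounding markup the way
// hand-concatenated JSON can.
def executorRemovedJson(executorId: String, reason: String): String =
  compact(render(
    ("Event" -> "ExecutorRemoved") ~ ("Executor ID" -> executorId) ~ ("Reason" -> reason)))
{code}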



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21564) TaskDescription decoding failure should fail the task

2017-08-10 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122651#comment-16122651
 ] 

Andrew Ash commented on SPARK-21564:


[~irashid] a possible fix could look roughly like this:

{noformat}
diff --git 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
index a2f1aa22b0..06d72fe106 100644
--- 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
@@ -17,6 +17,7 @@

 package org.apache.spark.executor

+import java.io.{DataInputStream, NotSerializableException}
 import java.net.URL
 import java.nio.ByteBuffer
 import java.util.Locale
@@ -35,7 +36,7 @@ import org.apache.spark.rpc._
 import org.apache.spark.scheduler.{ExecutorLossReason, TaskDescription}
 import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages._
 import org.apache.spark.serializer.SerializerInstance
-import org.apache.spark.util.{ThreadUtils, Utils}
+import org.apache.spark.util.{ByteBufferInputStream, ThreadUtils, Utils}

 private[spark] class CoarseGrainedExecutorBackend(
 override val rpcEnv: RpcEnv,
@@ -93,9 +94,26 @@ private[spark] class CoarseGrainedExecutorBackend(
   if (executor == null) {
 exitExecutor(1, "Received LaunchTask command but executor was null")
   } else {
-val taskDesc = TaskDescription.decode(data.value)
-logInfo("Got assigned task " + taskDesc.taskId)
-executor.launchTask(this, taskDesc)
+try {
+  val taskDesc = TaskDescription.decode(data.value)
+  logInfo("Got assigned task " + taskDesc.taskId)
+  executor.launchTask(this, taskDesc)
+} catch {
+  case e: Exception =>
+val taskId = new DataInputStream(new ByteBufferInputStream(
+  ByteBuffer.wrap(data.value.array()))).readLong()
+val ser = env.closureSerializer.newInstance()
+val serializedTaskEndReason = {
+  try {
+ser.serialize(new ExceptionFailure(e, Nil))
+  } catch {
+case _: NotSerializableException =>
+  // e is not serializable so just send the stacktrace
+  ser.serialize(new ExceptionFailure(e, Nil, false))
+  }
+}
+statusUpdate(taskId, TaskState.FAILED, serializedTaskEndReason)
+}
   }

 case KillTask(taskId, _, interruptThread, reason) =>
{noformat}

The downside here though is that we're still making the assumption that the 
TaskDescription is well-formatted enough to be able to get the taskId out of it 
(the first long in the serialized bytes).

Any other thoughts on how to do this?

> TaskDescription decoding failure should fail the task
> -
>
> Key: SPARK-21564
> URL: https://issues.apache.org/jira/browse/SPARK-21564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing an issue where Spark was throwing this exception:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readUTF(DataInputStream.java:609)
> at java.io.DataInputStream.readUTF(DataInputStream.java:564)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
> at scala.collection.immutable.Range.foreach(Range.scala:160)
> at 
> org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> For details on the cause of that exception, see SPARK-21563
> We've since changed the application and have a proposed fix in Spark at the 
> ticket above, but it was troubling that decodi

[jira] [Commented] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-08-10 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122549#comment-16122549
 ] 

Andrew Ash commented on SPARK-21563:


Thanks for the thoughts [~irashid] -- I submitted a PR implementing this 
approach at https://github.com/apache/spark/pull/18913

> Race condition when serializing TaskDescriptions and adding jars
> 
>
> Key: SPARK-21563
> URL: https://issues.apache.org/jira/browse/SPARK-21563
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing this exception during some running Spark jobs:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readUTF(DataInputStream.java:609)
> at java.io.DataInputStream.readUTF(DataInputStream.java:564)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
> at scala.collection.immutable.Range.foreach(Range.scala:160)
> at 
> org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> After some debugging, we determined that this is due to a race condition in 
> task serde.  cc [~irashid] [~kayousterhout] who last touched that code in 
> SPARK-19796
> The race is between adding additional jars to the SparkContext and 
> serializing the TaskDescription.
> Consider this sequence of events:
> - TaskSetManager creates a TaskDescription using a reference to the 
> SparkContext's jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
> - TaskDescription starts serializing, and begins writing jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
> - the size of the jar map is written out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
> - _on another thread_: the application adds a jar to the SparkContext's jars 
> list
> - then the entries in the jars list are serialized out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64
> The problem now is that the jars list is serialized as having N entries, but 
> actually N+1 entries follow that count!
> This causes task deserialization to fail in the executor, with the stacktrace 
> above.
> The same issue also likely exists for files, though I haven't observed that 
> and our application does not stress that codepath the same way it did for jar 
> additions.
> One fix here is that TaskSetManager could make an immutable copy of the jars 
> list that it passes into the TaskDescription constructor, so that list 
> doesn't change mid-serialization.
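
A minimal sketch of that last fix (the map is assumed to be SparkContext's 
path -> timestamp jars map; the actual integration point would be 
TaskSetManager):

{code}
import scala.collection.mutable

// Snapshot the shared mutable jars/files map into an immutable Map before handing
// it to the TaskDescription, so a concurrent addJar()/addFile() cannot change the
// entry count between writing the size and writing the entries.
def snapshot(added: mutable.Map[String, Long]): Map[String, Long] =
  added.synchronized { added.toMap }
{code}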



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file

2017-08-09 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120643#comment-16120643
 ] 

Andrew Ash commented on SPARK-19116:


Ah yes, for files it seems like Spark currently uses the size of the parquet 
files on disk, rather than estimating in-memory size by multiplying the sum of 
the column type sizes by the row count.
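
From the Scala side the estimate can be inspected like this (spark-shell style; 
the accessor shown is the 2.0/2.1 one and has since moved to {{stats}}):

{code}
// For file-based sources the value tracks the bytes on disk, not
// defaultSize-per-column * rowCount, which is why 2318 != 400 above.
val df = spark.read.parquet("c:/scratch/hundred_integers.parquet")
println(df.queryExecution.optimizedPlan.statistics.sizeInBytes)
{code}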

> LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
> -
>
> Key: SPARK-19116
> URL: https://issues.apache.org/jira/browse/SPARK-19116
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.1, 2.0.2
> Environment: Python 3.5.x
> Windows 10
>Reporter: Shea Parkes
>
> We're having some modestly severe issues with broadcast join inference, and 
> I've been chasing them through the join heuristics in the catalyst engine.  
> I've made it as far as I can, and I've hit upon something that does not make 
> any sense to me.
> I thought that loading from parquet would be a RelationPlan, which would just 
> use the sum of default sizeInBytes for each column times the number of rows.  
> But this trivial example shows that I am not correct:
> {code}
> import pyspark.sql.functions as F
> df_range = session.range(100).select(F.col('id').cast('integer'))
> df_range.write.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet.explain(True)
> # Expected sizeInBytes
> integer_default_sizeinbytes = 4
> print(df_parquet.count() * integer_default_sizeinbytes)  # = 400
> # Inferred sizeInBytes
> print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes())  # = 2318
> # For posterity (Didn't really expect this to match anything above)
> print(df_range._jdf.logicalPlan().statistics().sizeInBytes())  # = 600
> {code}
> And here's the results of explain(True) on df_parquet:
> {code}
> In [456]: == Parsed Logical Plan ==
> Relation[id#794] parquet
> == Analyzed Logical Plan ==
> id: int
> Relation[id#794] parquet
> == Optimized Logical Plan ==
> Relation[id#794] parquet
> == Physical Plan ==
> *BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: 
> file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}
> So basically, I'm not understanding well how the size of the parquet file is 
> being estimated.  I don't expect it to be extremely accurate, but empirically 
> it's so inaccurate that we're having to mess with autoBroadcastJoinThreshold 
> way too much.  (It's not always too high like the example above, it's often 
> way too low.)
> Without deeper understanding, I'm considering a result of 2318 instead of 400 
> to be a bug.  My apologies if I'm missing something obvious.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file

2017-08-09 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash closed SPARK-19116.
--
Resolution: Not A Problem

> LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
> -
>
> Key: SPARK-19116
> URL: https://issues.apache.org/jira/browse/SPARK-19116
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.1, 2.0.2
> Environment: Python 3.5.x
> Windows 10
>Reporter: Shea Parkes
>
> We're having some modestly severe issues with broadcast join inference, and 
> I've been chasing them through the join heuristics in the catalyst engine.  
> I've made it as far as I can, and I've hit upon something that does not make 
> any sense to me.
> I thought that loading from parquet would be a RelationPlan, which would just 
> use the sum of default sizeInBytes for each column times the number of rows.  
> But this trivial example shows that I am not correct:
> {code}
> import pyspark.sql.functions as F
> df_range = session.range(100).select(F.col('id').cast('integer'))
> df_range.write.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet.explain(True)
> # Expected sizeInBytes
> integer_default_sizeinbytes = 4
> print(df_parquet.count() * integer_default_sizeinbytes)  # = 400
> # Inferred sizeInBytes
> print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes())  # = 2318
> # For posterity (Didn't really expect this to match anything above)
> print(df_range._jdf.logicalPlan().statistics().sizeInBytes())  # = 600
> {code}
> And here's the results of explain(True) on df_parquet:
> {code}
> In [456]: == Parsed Logical Plan ==
> Relation[id#794] parquet
> == Analyzed Logical Plan ==
> id: int
> Relation[id#794] parquet
> == Optimized Logical Plan ==
> Relation[id#794] parquet
> == Physical Plan ==
> *BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: 
> file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}
> So basically, I'm not understanding well how the size of the parquet file is 
> being estimated.  I don't expect it to be extremely accurate, but empirically 
> it's so inaccurate that we're having to mess with autoBroadcastJoinThreshold 
> way too much.  (It's not always too high like the example above, it's often 
> way too low.)
> Without deeper understanding, I'm considering a result of 2318 instead of 400 
> to be a bug.  My apologies if I'm missing something obvious.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file

2017-08-04 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114751#comment-16114751
 ] 

Andrew Ash commented on SPARK-19116:


[~shea.parkes] does this answer your question?

> LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
> -
>
> Key: SPARK-19116
> URL: https://issues.apache.org/jira/browse/SPARK-19116
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.1, 2.0.2
> Environment: Python 3.5.x
> Windows 10
>Reporter: Shea Parkes
>
> We're having some modestly severe issues with broadcast join inference, and 
> I've been chasing them through the join heuristics in the catalyst engine.  
> I've made it as far as I can, and I've hit upon something that does not make 
> any sense to me.
> I thought that loading from parquet would be a RelationPlan, which would just 
> use the sum of default sizeInBytes for each column times the number of rows.  
> But this trivial example shows that I am not correct:
> {code}
> import pyspark.sql.functions as F
> df_range = session.range(100).select(F.col('id').cast('integer'))
> df_range.write.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet.explain(True)
> # Expected sizeInBytes
> integer_default_sizeinbytes = 4
> print(df_parquet.count() * integer_default_sizeinbytes)  # = 400
> # Inferred sizeInBytes
> print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes())  # = 2318
> # For posterity (Didn't really expect this to match anything above)
> print(df_range._jdf.logicalPlan().statistics().sizeInBytes())  # = 600
> {code}
> And here's the results of explain(True) on df_parquet:
> {code}
> In [456]: == Parsed Logical Plan ==
> Relation[id#794] parquet
> == Analyzed Logical Plan ==
> id: int
> Relation[id#794] parquet
> == Optimized Logical Plan ==
> Relation[id#794] parquet
> == Physical Plan ==
> *BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: 
> file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}
> So basically, I'm not understanding well how the size of the parquet file is 
> being estimated.  I don't expect it to be extremely accurate, but empirically 
> it's so inaccurate that we're having to mess with autoBroadcastJoinThreshold 
> way too much.  (It's not always too high like the example above, it's often 
> way too low.)
> Without deeper understanding, I'm considering a result of 2318 instead of 400 
> to be a bug.  My apologies if I'm missing something obvious.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20433) Update jackson-databind to 2.6.7.1

2017-07-31 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108117#comment-16108117
 ] 

Andrew Ash commented on SPARK-20433:


Sorry about not updating the ticket description -- the 2.6.7.1 release came 
after the initial batch of patches for projects still on older versions of 
Jackson.

I've now opened PR https://github.com/apache/spark/pull/18789 where we can 
further discuss the risk of the version bump.

> Update jackson-databind to 2.6.7.1
> --
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>Priority: Minor
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respectful 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.  UPDATE: now the 2.6.X line has a 
> patch as well: 2.6.7.1 as mentioned at 
> https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this 
> library for the next Spark release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20433) Update jackson-databind to 2.6.7.1

2017-07-31 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-20433:
---
Description: 
There was a security vulnerability recently reported to the upstream 
jackson-databind project at 
https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
released.

From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
first fixed versions in their respective 2.X branches, and versions in the 
2.6.X line and earlier remain vulnerable.  UPDATE: now the 2.6.X line has a 
patch as well: 2.6.7.1 as mentioned at 
https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340

Right now Spark master branch is on 2.6.5: 
https://github.com/apache/spark/blob/master/pom.xml#L164

and Hadoop branch-2.7 is on 2.2.3: 
https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71

and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74

We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this 
library for the next Spark release.

  was:
There was a security vulnerability recently reported to the upstream 
jackson-databind project at 
https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
released.

From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
first fixed versions in their respective 2.X branches, and versions in the 
2.6.X line and earlier remain vulnerable.

Right now Spark master branch is on 2.6.5: 
https://github.com/apache/spark/blob/master/pom.xml#L164

and Hadoop branch-2.7 is on 2.2.3: 
https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71

and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74

We should try to find a way to get on a patched version of jackson-databind 
for the Spark 2.2.0 release.


> Update jackson-databind to 2.6.7.1
> --
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>Priority: Minor
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.  UPDATE: now the 2.6.X line has a 
> patch as well: 2.6.7.1 as mentioned at 
> https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this 
> library for the next Spark release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20433) Update jackson-databind to 2.6.7.1

2017-07-31 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-20433:
---
Summary: Update jackson-databind to 2.6.7.1  (was: Update jackson-databind 
to 2.6.7)

> Update jackson-databind to 2.6.7.1
> --
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>Priority: Minor
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get on a patched version of 
> jackson-databind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20433) Update jackson-databind to 2.6.7

2017-07-31 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash reopened SPARK-20433:


> Update jackson-databind to 2.6.7
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>Priority: Minor
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get on a patched version of 
> jackson-databind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20433) Security issue with jackson-databind

2017-07-28 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105951#comment-16105951
 ] 

Andrew Ash commented on SPARK-20433:


As I wrote in that PR, it's 2.6.7.1 of jackson-databind that has the fix, and 
the Jackson project did not publish a corresponding 2.6.7.1 of the other 
components of Jackson.

This affects Spark because a known vulnerable library is on the classpath at 
runtime. So you can only guarantee that Spark isn't vulnerable by removing the 
vulnerable code from the runtime classpath.

Anyway, a Jackson bump to a fixed version will likely be picked up by Apache 
Spark the next time Jackson is upgraded, so I trust this will get fixed 
eventually, regardless of whether Apache takes the hotfix version now or a 
regular release in the future.
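
For anyone who wants to confirm what actually ends up on the runtime classpath 
(as opposed to what the pom declares), a quick spark-shell check is sketched 
below; this is only illustrative and not part of any proposed patch:

{code:scala}
// Print the jackson-databind version that is loaded at runtime, and the jar it
// came from, to verify whether the patched 2.6.7.1 artifact is actually in use.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.cfg.PackageVersion

val version = PackageVersion.VERSION
val jarLocation = classOf[ObjectMapper].getProtectionDomain.getCodeSource.getLocation

println(s"jackson-databind version: $version")
println(s"loaded from: $jarLocation")
{code}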

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get on a patched version of 
> jackson-databind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-07-28 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105772#comment-16105772
 ] 

Andrew Ash commented on SPARK-21563:


And for reference, I added this additional logging to assist in debugging: 
https://github.com/palantir/spark/pull/238

> Race condition when serializing TaskDescriptions and adding jars
> 
>
> Key: SPARK-21563
> URL: https://issues.apache.org/jira/browse/SPARK-21563
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing this exception during some running Spark jobs:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readUTF(DataInputStream.java:609)
> at java.io.DataInputStream.readUTF(DataInputStream.java:564)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
> at scala.collection.immutable.Range.foreach(Range.scala:160)
> at 
> org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> After some debugging, we determined that this is due to a race condition in 
> task serde.  cc [~irashid] [~kayousterhout] who last touched that code in 
> SPARK-19796
> The race is between adding additional jars to the SparkContext and 
> serializing the TaskDescription.
> Consider this sequence of events:
> - TaskSetManager creates a TaskDescription using a reference to the 
> SparkContext's jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
> - TaskDescription starts serializing, and begins writing jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
> - the size of the jar map is written out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
> - _on another thread_: the application adds a jar to the SparkContext's jars 
> list
> - then the entries in the jars list are serialized out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64
> The problem now is that the jars list is serialized as having N entries, but 
> actually N+1 entries follow that count!
> This causes task deserialization to fail in the executor, with the stacktrace 
> above.
> The same issue also likely exists for files, though I haven't observed that 
> and our application does not stress that codepath the same way it did for jar 
> additions.
> One fix here is that TaskSetManager could make an immutable copy of the jars 
> list that it passes into the TaskDescription constructor, so that list 
> doesn't change mid-serialization.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20433) Security issue with jackson-databind

2017-07-28 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105751#comment-16105751
 ] 

Andrew Ash commented on SPARK-20433:


Here's the patch I put in my fork of Spark: 
https://github.com/palantir/spark/pull/241

It addresses CVE-2017-7525 -- http://www.securityfocus.com/bid/99623

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get on a patched version of 
> jackson-databind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21564) TaskDescription decoding failure should fail the task

2017-07-28 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-21564:
---
Description: 
cc [~robert3005]

I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]

  was:
I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]


> TaskDescription decoding failure should fail the task
> -
>
> Key: SPARK-21564
> URL: https://issues.apache.org/jira/browse/SPARK-21564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing an issue where Spark was throwing this exception:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring

[jira] [Updated] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-07-28 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-21563:
---
Description: 
cc [~robert3005]

I was seeing this exception during some running Spark jobs:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

After some debugging, we determined that this is due to a race condition in 
task serde.  cc [~irashid] [~kayousterhout] who last touched that code in 
SPARK-19796

The race is between adding additional jars to the SparkContext and serializing 
the TaskDescription.

Consider this sequence of events:

- TaskSetManager creates a TaskDescription using a reference to the 
SparkContext's jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
- TaskDescription starts serializing, and begins writing jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
- the size of the jar map is written out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
- _on another thread_: the application adds a jar to the SparkContext's jars 
list
- then the entries in the jars list are serialized out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64

The problem now is that the jars list is serialized as having N entries, but 
actually N+1 entries follow that count!

This causes task deserialization to fail in the executor, with the stacktrace 
above.

The same issue also likely exists for files, though I haven't observed that and 
our application does not stress that codepath the same way it did for jar 
additions.

One fix here is that TaskSetManager could make an immutable copy of the jars 
list that it passes into the TaskDescription constructor, so that list doesn't 
change mid-serialization.

  was:
cc [~robert3005]

I was seeing this exception during some running Spark jobs:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

After some debugging, we determined that this is due to a race condition in 
task serde introduced in SPARK-19796.  cc [~irashid] [~kayousterhout]

The race is between adding additional jars to the SparkContext and serializing 
the Tas

[jira] [Created] (SPARK-21564) TaskDescription decoding failure should fail the task

2017-07-28 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21564:
--

 Summary: TaskDescription decoding failure should fail the task
 Key: SPARK-21564
 URL: https://issues.apache.org/jira/browse/SPARK-21564
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash


I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]
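
A rough sketch of what that change could look like is below.  The surrounding 
names ({{LaunchTask}}, {{exitExecutor}}, {{TaskDescription.decode}}) are from 
CoarseGrainedExecutorBackend in 2.2.0; the error handling itself is only a 
proposal, not existing Spark behavior:

{code:scala}
// Hypothetical change inside CoarseGrainedExecutorBackend.receive: if decoding
// the TaskDescription fails, surface the failure instead of letting the message
// loop swallow it, so the driver doesn't wait forever on a task that never started.
case LaunchTask(data) =>
  if (executor == null) {
    exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
    try {
      val taskDesc = TaskDescription.decode(data.value)
      logInfo("Got assigned task " + taskDesc.taskId)
      executor.launchTask(this, taskDesc)
    } catch {
      case scala.util.control.NonFatal(e) =>
        // One possible policy: exit the executor so the scheduler notices and reschedules
        exitExecutor(1, s"Could not decode TaskDescription: ${e.getMessage}", e)
    }
  }
{code}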



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-07-28 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21563:
--

 Summary: Race condition when serializing TaskDescriptions and 
adding jars
 Key: SPARK-21563
 URL: https://issues.apache.org/jira/browse/SPARK-21563
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash


cc [~robert3005]

I was seeing this exception during some running Spark jobs:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

After some debugging, we determined that this is due to a race condition in 
task serde introduced in SPARK-19796.  cc [~irashid] [~kayousterhout]

The race is between adding additional jars to the SparkContext and serializing 
the TaskDescription.

Consider this sequence of events:

- TaskSetManager creates a TaskDescription using a reference to the 
SparkContext's jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
- TaskDescription starts serializing, and begins writing jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
- the size of the jar map is written out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
- _on another thread_: the application adds a jar to the SparkContext's jars 
list
- then the entries in the jars list are serialized out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64

The problem now is that the jars list is serialized as having N entries, but 
actually N+1 entries follow that count!

This causes task deserialization to fail in the executor, with the stacktrace 
above.

The same issue also likely exists for files, though I haven't observed that and 
our application does not stress that codepath the same way it did for jar 
additions.

One fix here is that TaskSetManager could make an immutable copy of the jars 
list that it passes into the TaskDescription constructor, so that list doesn't 
change mid-serialization.
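
A minimal sketch of that fix is below.  The constructor arguments follow the 
TaskSetManager code linked above; the {{.toMap}} snapshots are the proposed 
change itself:

{code:scala}
// Sketch only: snapshot the mutable jar/file maps before handing them to the
// TaskDescription, so a concurrent sc.addJar() can no longer change the map
// between writing its size and writing its entries during serialization.
val filesSnapshot = sched.sc.addedFiles.toMap   // immutable copy
val jarsSnapshot  = sched.sc.addedJars.toMap    // immutable copy

new TaskDescription(
  taskId,
  attemptNum,
  execId,
  taskName,
  index,
  filesSnapshot,
  jarsSnapshot,
  task.localProperties,
  serializedTask)
{code}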



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14887) Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits

2017-07-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099584#comment-16099584
 ] 

Andrew Ash commented on SPARK-14887:


[~fang fang chen] have you seen this in the latest version of Spark?  There 
have been many fixes since the "affects version" Spark 1.5.2 you set on the bug.

Can you please retry with the latest Spark and let us know if the problem persists?

> Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits
> ---
>
> Key: SPARK-14887
> URL: https://issues.apache.org/jira/browse/SPARK-14887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: fang fang chen
>Assignee: Davies Liu
>
> Similar issue with SPARK-14138 and SPARK-8443:
> With a large SQL statement (673K), the following error happened:
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters

2017-07-10 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081136#comment-16081136
 ] 

Andrew Ash commented on SPARK-21289:


Looks like this will fix SPARK-17227 also

> Text and CSV formats do not support custom end-of-line delimiters
> -
>
> Key: SPARK-21289
> URL: https://issues.apache.org/jira/browse/SPARK-21289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Yevgen Galchenko
>Priority: Minor
>
> Spark csv and text readers always use default CR, LF or CRLF line terminators 
> without an option to configure a custom delimiter.
> Option "textinputformat.record.delimiter" is not being used to set delimiter 
> in HadoopFileLinesReader and can only be set for Hadoop RDD when textFile() 
> is used to read file.
> Possible solution would be to change HadoopFileLinesReader and create 
> LineRecordReader with delimiters specified in configuration. LineRecordReader 
> already supports passing recordDelimiter in its constructor.
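
For reference, the textFile()-style workaround mentioned above looks roughly 
like the sketch below; the delimiter and path are placeholders:

{code:scala}
// Workaround sketch: read records split on a custom delimiter via the Hadoop
// input format, since the csv/text readers ignore textinputformat.record.delimiter.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\u0001")  // placeholder delimiter

val records = sc.newAPIHadoopFile(
    "path/to/file",                                     // placeholder path
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, line) => line.toString }  // copy out of Hadoop's reused Text object
{code}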



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15226) CSV file data-line with newline at first line load error

2017-07-10 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash closed SPARK-15226.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Fixed by https://issues.apache.org/jira/browse/SPARK-19610
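
For anyone landing here later: the SPARK-19610 work means files like the one in 
the description parse correctly once multi-line parsing is enabled.  A small 
sketch (option name assumed to be {{multiLine}} as of 2.2.0; the path is a 
placeholder):

{code:scala}
// With Spark 2.2.0+, quoted fields that span lines should parse correctly when
// the multiLine option is enabled.
val df = spark.read
  .option("multiLine", "true")
  .csv("path/to/csvfile")   // placeholder path

df.show()  // expect 2 rows and 5 columns for the example file in the description
{code}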

> CSV file data-line with newline at first line load error
> 
>
> Key: SPARK-15226
> URL: https://issues.apache.org/jira/browse/SPARK-15226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Weichen Xu
> Fix For: 2.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> CSV file such as:
> ---
> v1,v2,"v
> 3",v4,v5
> a,b,c,d,e
> ---
> it contains two rows, first row:
> v1, v2, v\n3, v4, v5 (the value v\n3 contains a newline character, which is 
> legal)
> second row:
> a,b,c,d,e
> then in spark-shell run commands like:
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> var reader = sqlContext.read
> var df = reader.csv("path/to/csvfile")
> df.collect
> then we find the loaded data is wrong:
> the loaded data has only 3 columns, but in fact it has 5 columns.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15226) CSV file data-line with newline at first line load error

2017-07-10 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081154#comment-16081154
 ] 

Andrew Ash edited comment on SPARK-15226 at 7/10/17 9:07 PM:
-

Fixed by https://issues.apache.org/jira/browse/SPARK-19610


was (Author: aash):
Fixed by Fixed by https://issues.apache.org/jira/browse/SPARK-19610

> CSV file data-line with newline at first line load error
> 
>
> Key: SPARK-15226
> URL: https://issues.apache.org/jira/browse/SPARK-15226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Weichen Xu
> Fix For: 2.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> CSV file such as:
> ---
> v1,v2,"v
> 3",v4,v5
> a,b,c,d,e
> ---
> it contains two rows, first row:
> v1, v2, v\n3, v4, v5 (the value v\n3 contains a newline character, which is 
> legal)
> second row:
> a,b,c,d,e
> then in spark-shell run commands like:
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> var reader = sqlContext.read
> var df = reader.csv("path/to/csvfile")
> df.collect
> then we find the loaded data is wrong:
> the loaded data has only 3 columns, but in fact it has 5 columns.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21220) Use outputPartitioning's bucketing if possible on write

2017-06-26 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21220:
--

 Summary: Use outputPartitioning's bucketing if possible on write
 Key: SPARK-21220
 URL: https://issues.apache.org/jira/browse/SPARK-21220
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash


When reading a bucketed dataset and writing it back with no transformations (a 
copy), the bucketing information is lost and the user is required to re-specify 
the bucketing information on write.  This negatively affects read performance 
on the copied dataset since the bucketing information enables significant 
optimizations that aren't possible on the un-bucketed copied table.

Spark should propagate this bucketing information for copied datasets, and more 
generally could support inferring bucket information based on the known 
partitioning of the final RDD at save time when that partitioning is a 
{{HashPartitioning}}.

https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L118

In the above linked {{bucketIdExpression}}, we could {{.orElse}} a bucket 
expression based on outputPartitionings that are HashPartitioning.

This preserves bucket information for bucketed datasets, and also supports 
saving this metadata at write time for datasets with a known partitioning.  
Both of these cases should improve performance at read time of the 
newly-written dataset.
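
A rough sketch of that {{.orElse}} idea is below.  The existing bucketSpec branch 
mirrors the FileFormatWriter code linked above; the HashPartitioning fallback is 
the proposal, and the way the output partitioning is looked up here is an 
assumption:

{code:scala}
// Sketch: keep the existing bucketSpec-driven expression, and fall back to the
// plan's known HashPartitioning when the user supplied no bucket spec.
val bucketIdExpression = bucketSpec.map { spec =>
  HashPartitioning(bucketColumns, spec.numBuckets).partitionIdExpression
}.orElse {
  plan.outputPartitioning match {
    case hp: HashPartitioning => Some(hp.partitionIdExpression)
    case _ => None
  }
}
{code}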



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-06-22 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058930#comment-16058930
 ] 

Andrew Ash edited comment on SPARK-19700 at 6/22/17 7:47 AM:
-

Public implementation that's been around a while (Hamel is familiar with this): 
Two Sigma's integration with their Cook scheduler, recent diff at 
https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook


was (Author: aash):
Public implementation that's been around a while: Two Sigma's integration with 
their Cook scheduler, recent diff at 
https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continue to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their other particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-06-22 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058930#comment-16058930
 ] 

Andrew Ash commented on SPARK-19700:


Public implementation that's been around a while: Two Sigma's integration with 
their Cook scheduler, recent diff at 
https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continue to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their other particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-06-22 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058917#comment-16058917
 ] 

Andrew Ash commented on SPARK-19700:


Found another potential implementation: Facebook's in-house scheduler mentioned 
by [~tejasp] at 
https://github.com/apache/spark/pull/18209#issuecomment-307538061

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continue to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their other particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-06-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041803#comment-16041803
 ] 

Andrew Ash commented on SPARK-19700:


Found another potential implementation: Nomad by [~barnardb] at SPARK-20992

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continue to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their other particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-02 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035171#comment-16035171
 ] 

Andrew Ash commented on SPARK-20952:


For the localProperties on SparkContext, I can see two things it does to 
improve safety:

- first, it clones the properties for new threads so changes in the parent 
thread don't unintentionally affect a child thread: 
https://github.com/apache/spark/blob/v2.2.0-rc2/core/src/main/scala/org/apache/spark/SparkContext.scala#L330
- second, it clears the properties when they're no longer being used: 
https://github.com/apache/spark/blob/v2.2.0-rc2/core/src/main/scala/org/apache/spark/SparkContext.scala#L1942

Do we need to do either the defensive cloning or the proactive clearing of 
taskInfos in executors, like is done in the driver?
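
For concreteness, the driver-side pattern being referenced is roughly the 
following (a sketch of what SparkContext.localProperties does, per the first 
link above):

{code:scala}
// Defensive-cloning pattern: child threads get a *copy* of the parent's
// properties, so later mutations in the parent don't leak into children.
import java.util.Properties
import org.apache.commons.lang3.SerializationUtils

val localProperties = new InheritableThreadLocal[Properties] {
  override protected def childValue(parent: Properties): Properties =
    SerializationUtils.clone(parent)          // copy instead of sharing the reference
  override protected def initialValue(): Properties = new Properties()
}
{code}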

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20815) NullPointerException in RPackageUtils#checkManifestForR

2017-05-19 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-20815:
--

 Summary: NullPointerException in RPackageUtils#checkManifestForR
 Key: SPARK-20815
 URL: https://issues.apache.org/jira/browse/SPARK-20815
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.1
Reporter: Andrew Ash


Some jars don't have manifest files in them, such as, in my case, 
javax.inject-1.jar and value-2.2.1-annotations.jar.

This causes the below NPE:

{noformat}
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.deploy.RPackageUtils$.checkManifestForR(RPackageUtils.scala:95)
at 
org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply$mcV$sp(RPackageUtils.scala:180)
at 
org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180)
at 
org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1322)
at 
org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:202)
at 
org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:175)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
org.apache.spark.deploy.RPackageUtils$.checkAndBuildRPackage(RPackageUtils.scala:175)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:311)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:152)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}

due to RPackageUtils#checkManifestForR assuming {{jar.getManifest}} is non-null.

However per the JDK spec it can be null:

{noformat}
/**
 * Returns the jar file manifest, or null if none.
 *
 * @return the jar file manifest, or null if none
 *
 * @throws IllegalStateException
 * may be thrown if the jar file has been closed
 * @throws IOException  if an I/O error has occurred
 */
public Manifest getManifest() throws IOException {
return getManifestFromReference();
}
{noformat}

This method should do a null check and return false if the manifest is null 
(meaning there is no R code in that jar).
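
A minimal sketch of that null guard is below.  The manifest attribute name used 
here is an assumption for illustration; the point is only the null check:

{code:scala}
import java.util.jar.JarFile

// Sketch: treat "no manifest" as "no R package" instead of dereferencing null.
def jarHasRPackage(jar: JarFile): Boolean = {
  val manifest = jar.getManifest        // may be null per the JarFile javadoc above
  if (manifest == null) {
    false
  } else {
    // "Spark-HasRPackage" is assumed to be the marker attribute spark-submit looks for
    Option(manifest.getMainAttributes.getValue("Spark-HasRPackage")).isDefined
  }
}
{code}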



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20683) Make table uncache chaining optional

2017-05-19 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018008#comment-16018008
 ] 

Andrew Ash commented on SPARK-20683:


Thanks for that diff [~shea.parkes] -- we're planning on trying it in our fork 
too: https://github.com/palantir/spark/pull/188

> Make table uncache chaining optional
> 
>
> Key: SPARK-20683
> URL: https://issues.apache.org/jira/browse/SPARK-20683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Not particularly environment sensitive.  
> Encountered/tested on Linux and Windows.
>Reporter: Shea Parkes
>
> A recent change was made in SPARK-19765 that causes table uncaching to chain. 
>  That is, if table B is a child of table A, and they are both cached, now 
> uncaching table A will automatically uncache table B.
> At first I did not understand the need for this, but when reading the unit 
> tests, I see that it is likely that many people do not keep named references 
> to the child table (e.g. B).  Perhaps B is just made and cached as some part 
> of data exploration.  In that situation, it makes sense for B to 
> automatically be uncached when you are finished with A.
> However, we commonly utilize a different design pattern that is now harmed by 
> this automatic uncaching.  It is common for us to cache table A to then make 
> two, independent children tables (e.g. B and C).  Once those two child tables 
> are realized and cached, we'd then uncache table A (as it was no longer 
> needed and could be quite large).  After this change now, when we uncache 
> table A, we suddenly lose our cached status on both table B and C (which is 
> quite frustrating).  All of these tables are often quite large, and we view 
> what we're doing as mindful memory management.  We are maintaining named 
> references to B and C at all times, so we can always uncache them ourselves 
> when it makes sense.
> Would it be acceptable/feasible to make this table uncache chaining optional? 
>  I would be fine if the default is for the chaining to happen, as long as we 
> can turn it off via parameters.
> If acceptable, I can try to work towards making the required changes.  I am 
> most comfortable in Python (and would want the optional parameter surfaced in 
> Python), but have found the places required to make this change in Scala 
> (since I reverted the functionality in a private fork already).  Any help 
> would be greatly appreciated however.
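
For clarity, the usage pattern described above looks something like the sketch 
below (paths and column names are made up):

{code:scala}
// Memory-management pattern from the description: cache the parent, derive and
// cache two children, then release the parent once the children are materialized.
import spark.implicits._

val a = spark.read.parquet("path/to/big/table").cache()

val b = a.filter($"kind" === "b").cache()
val c = a.filter($"kind" === "c").cache()
b.count()   // materialize b while a is still cached
c.count()   // materialize c while a is still cached

a.unpersist()   // after SPARK-19765 this also drops the cached data for b and c
{code}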



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-05-16 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012011#comment-16012011
 ] 

Andrew Ash commented on SPARK-19700:


Found another potential implementation: Eagle cluster manager from 
https://github.com/epfl-labos/eagle with PR at 
https://github.com/apache/spark/pull/17974

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continues to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their own particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.
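
To make the shape of that proposal concrete, here's a hypothetical sketch of what such a plugin interface could look like, keyed off the master URL. The names are placeholders invented for illustration, not Spark's internal (private[spark]) scheduler classes:

{code}
// Hypothetical sketch of a pluggable scheduler SPI; all names here are placeholders.
trait PluggableClusterManager {
  /** Return true if this plugin understands the given master URL, e.g. "eagle://host:port". */
  def canCreate(masterURL: String): Boolean

  /** Build the cluster-manager-specific backend for this application. */
  def createBackend(masterURL: String, appName: String): PluggableSchedulerBackend
}

trait PluggableSchedulerBackend {
  def start(): Unit
  def requestTotalExecutors(total: Int): Boolean
  def killExecutors(executorIds: Seq[String]): Boolean
  def stop(): Unit
}

// The driver would discover implementations from JARs on its classpath (for example via
// java.util.ServiceLoader) and pick the one whose canCreate matches the user's master URL.
{code}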



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20433) Security issue with jackson-databind

2017-04-21 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979630#comment-15979630
 ] 

Andrew Ash commented on SPARK-20433:


It's unclear if Spark is affected; I wanted to open this ticket to start the 
discussion.

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>Priority: Critical
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get on a patched version of 
> jackson-databind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20433) Security issue with jackson-databind

2017-04-21 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-20433:
---
Priority: Major  (was: Blocker)

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get on a patched version of 
> jackson-databind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20433) Security issue with jackson-databind

2017-04-21 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-20433:
---
Priority: Critical  (was: Major)

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>Priority: Critical
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get on a patched version of 
> jackson-databind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20433) Security issue with jackson-databind

2017-04-21 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-20433:
--

 Summary: Security issue with jackson-databind
 Key: SPARK-20433
 URL: https://issues.apache.org/jira/browse/SPARK-20433
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Andrew Ash
Priority: Blocker


There was a security vulnerability recently reported to the upstream 
jackson-databind project at 
https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
released.

From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
first fixed versions in their respective 2.X branches, and versions in the 
2.6.X line and earlier remain vulnerable.

Right now Spark master branch is on 2.6.5: 
https://github.com/apache/spark/blob/master/pom.xml#L164

and Hadoop branch-2.7 is on 2.2.3: 
https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71

and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74

We should try to find a way to get on a patched version of jackson-databind 
for the Spark 2.2.0 release.
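
For downstream builds that want a patched jackson-databind before Spark itself upgrades, a hedged sketch of an override (sbt syntax used for brevity; Spark's own build manages this through its Maven pom, and whichever fixed version is chosen still needs compatibility testing against the Jackson version Spark was built with):

{code}
// Illustrative only: pin jackson-databind to one of the fixed releases named above
// in a downstream sbt build. Compatibility with Spark's bundled Jackson is an
// assumption that must be verified.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.8.1"
{code}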



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results

2017-04-18 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973763#comment-15973763
 ] 

Andrew Ash commented on SPARK-20364:


Thanks for the investigation [~hyukjin.kwon]!

The proposal you put up (sorry Rob, wasn't me) looks like a good direction.  
I'm not sure it will work as written though, since I'd expect the 
IntColumn/LongColumn/etc classes from Parquet to be expected elsewhere (in an 
instanceof check, for example).  As an alternate suggestion, I'm wondering if we 
could call the package-protected constructors of the {{IntColumn}} class from a 
Spark-owned class in that package, passing the result of {{ColumnPath.get()}} 
rather than {{.fromDotString()}}.  Then Spark can use its 
better knowledge of whether the field is a column-with-dot or a field-in-struct 
and assemble the ColumnPath correctly.

And I do think that this should eventually live in Parquet.  Maybe there's a 
ticket already open there for this?

Does that make sense?
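
To illustrate that alternate suggestion, here's a rough, unverified sketch. Whether the {{IntColumn}} constructor is actually reachable from a class compiled into the {{org.apache.parquet.filter2.predicate}} package is an assumption that would need to be confirmed against parquet-mr:

{code}
// Rough sketch only: assumes Operators.IntColumn has package-level visibility that a
// Spark-owned class placed in the same package could use. Not verified against parquet-mr.
package org.apache.parquet.filter2.predicate

import org.apache.parquet.filter2.predicate.Operators.IntColumn
import org.apache.parquet.hadoop.metadata.ColumnPath

object SparkFilterApi {
  // ColumnPath.get treats the whole string as a single path element, so a name like
  // "col.dots" stays one column instead of being split the way fromDotString splits it.
  def intColumn(name: String): IntColumn = new IntColumn(ColumnPath.get(name))
}
{code}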

> Parquet predicate pushdown on columns with dots return empty results
> 
>
> Key: SPARK-20364
> URL: https://issues.apache.org/jira/browse/SPARK-20364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Currently, if there are dots in the column name, predicate pushdown seems to 
> fail in Parquet.
> **With dots**
> {code}
> val path = "/tmp/abcde"
> Seq(Some(1), None).toDF("col.dots").write.parquet(path)
> spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
> {code}
> {code}
> ++
> |col.dots|
> ++
> ++
> {code}
> **Without dots**
> {code}
> val path = "/tmp/abcde2"
> Seq(Some(1), None).toDF("coldots").write.parquet(path)
> spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
> {code}
> {code}
> +---+
> |coldots|
> +---+
> |  1|
> +---+
> {code}
> It seems a dot in the column name makes {{FilterApi}} split the field name on 
> the dot (producing a {{ColumnPath}} with multiple path segments) whereas the 
> actual column name is {{col.dots}}. (See [FilterApi.java#L71 
> |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71],
>  which calls 
> [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44].)
> I just tried to come up with ways to resolve it and I came up with two as 
> below:
> One is simply to not push down filters when there are dots in column names, 
> so that Parquet reads everything and the filtering happens on the Spark side.
> The other is to create Spark's own version of {{FilterApi}} for those columns 
> (the Parquet one seems final) so that a single column path is always used on 
> the Spark side (this seems hacky), as we are not pushing down nested columns 
> currently. So it looks like we can build the field name via {{ColumnPath.get}} 
> rather than {{ColumnPath.fromDotString}} in this way.
> I just made a rough version of the latter. 
> {code}
> private[parquet] object ParquetColumns {
>   def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = {
> new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with 
> SupportsLtGt
>   }
>   def longColumn(columnPath: String): Column[java.lang.Long] with 
> SupportsLtGt = {
> new Column[java.lang.Long] (
>   ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt
>   }
>   def floatColumn(columnPath: String): Column[java.lang.Float] with 
> SupportsLtGt = {
> new Column[java.lang.Float] (
>   ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt
>   }
>   def doubleColumn(columnPath: String): Column[java.lang.Double] with 
> SupportsLtGt = {
> new Column[java.lang.Double] (
>   ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt
>   }
>   def booleanColumn(columnPath: String): Column[java.lang.Boolean] with 
> SupportsEqNotEq = {
> new Column[java.lang.Boolean] (
>   ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with 
> SupportsEqNotEq
>   }
>   def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = {
> new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with 
> SupportsLtGt
>   }
> }
> {code}
> Neither sounds ideal. Please let me know if anyone has a better idea.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1809) Mesos backend doesn't respect HADOOP_CONF_DIR

2017-04-12 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966782#comment-15966782
 ] 

Andrew Ash edited comment on SPARK-1809 at 4/12/17 11:00 PM:
-

I'm not using Mesos anymore, so closing


was (Author: aash):
Not using Mesos anymore, so closing

> Mesos backend doesn't respect HADOOP_CONF_DIR
> -
>
> Key: SPARK-1809
> URL: https://issues.apache.org/jira/browse/SPARK-1809
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>
> In order to use HDFS paths without the server component, standalone mode 
> reads spark-env.sh and scans the HADOOP_CONF_DIR to open core-site.xml and 
> get the fs.default.name parameter.
> This lets you use HDFS paths like:
> - hdfs:///tmp/myfile.txt
> instead of
> - hdfs://myserver.mydomain.com:8020/tmp/myfile.txt
> However as of recent 1.0.0 pre-release (hash 756c96) I had to specify HDFS 
> paths with the full server even though I have HADOOP_CONF_DIR still set in 
> spark-env.sh.  The HDFS, Spark, and Mesos nodes are all co-located and 
> non-domain HDFS paths work fine when using the standalone mode.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1809) Mesos backend doesn't respect HADOOP_CONF_DIR

2017-04-12 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash closed SPARK-1809.
-
Resolution: Unresolved

Not using Mesos anymore, so closing

> Mesos backend doesn't respect HADOOP_CONF_DIR
> -
>
> Key: SPARK-1809
> URL: https://issues.apache.org/jira/browse/SPARK-1809
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>
> In order to use HDFS paths without the server component, standalone mode 
> reads spark-env.sh and scans the HADOOP_CONF_DIR to open core-site.xml and 
> get the fs.default.name parameter.
> This lets you use HDFS paths like:
> - hdfs:///tmp/myfile.txt
> instead of
> - hdfs://myserver.mydomain.com:8020/tmp/myfile.txt
> However as of recent 1.0.0 pre-release (hash 756c96) I had to specify HDFS 
> paths with the full server even though I have HADOOP_CONF_DIR still set in 
> spark-env.sh.  The HDFS, Spark, and Mesos nodes are all co-located and 
> non-domain HDFS paths work fine when using the standalone mode.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2017-04-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961170#comment-15961170
 ] 

Andrew Ash commented on SPARK-20144:


This is a regression from 1.6 to the 2.x line.  [~marmbrus] recommended 
modifying {{spark.sql.files.openCostInBytes}} as a workaround in this post:

http://apache-spark-developers-list.1001551.n3.nabble.com/Sorting-within-partitions-is-not-maintained-in-parquet-td18618.html#a18627
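
For anyone hitting this, a hedged sketch of that workaround (the value is workload dependent and only illustrative; the idea is to make each file look expensive to open so the planner stops packing many parquet files, and thus reordering rows, into one read partition):

{code}
// Illustrative workaround per the thread linked above; tune the value to your file sizes.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("preserve-file-order-workaround")
  .config("spark.sql.files.openCostInBytes", (1024L * 1024 * 1024).toString)  // treat each file as ~1 GB to open
  .getOrCreate()

val df = spark.read.parquet("/path/to/parquet")  // placeholder path
{code}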

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-03-23 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939072#comment-15939072
 ] 

Andrew Ash commented on SPARK-19372:


I've seen this as well on parquet files.

> Code generation for Filter predicate including many OR conditions exceeds JVM 
> method size limit 
> 
>
> Key: SPARK-19372
> URL: https://issues.apache.org/jira/browse/SPARK-19372
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: wide400cols.csv
>
>
> For the attached csv file, the code below causes the exception 
> "org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
> grows beyond 64 KB"
> Code:
> {code:borderStyle=solid}
>   val conf = new SparkConf().setMaster("local[1]")
>   val sqlContext = 
> SparkSession.builder().config(conf).getOrCreate().sqlContext
>   val dataframe =
> sqlContext
>   .read
>   .format("com.databricks.spark.csv")
>   .load("wide400cols.csv")
>   val filter = (0 to 399)
> .foldLeft(lit(false))((e, index) => 
> e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))
>   val filtered = dataframe.filter(filter)
>   filtered.show(100)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19528) external shuffle service would close while still have request from executor when dynamic allocation is enabled

2017-03-18 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-19528:
---
Description: 
When dynamic allocation is enabled, the external shuffle service is used to 
maintain the unfinished state between executors. So the external shuffle 
service should not close before the executor while there are still outstanding 
requests from the executor.

container's log:

{noformat}
17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
driver: spark://CoarseGrainedScheduler@192.168.1.1:41867
17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Successfully 
registered with driver
17/02/09 08:30:46 INFO executor.Executor: Starting executor ID 75 on host 
hsx-node8
17/02/09 08:30:46 INFO util.Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 40374.
17/02/09 08:30:46 INFO netty.NettyBlockTransferService: Server created on 40374
17/02/09 08:30:46 INFO storage.BlockManager: external shuffle service port = 
7337
17/02/09 08:30:46 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
17/02/09 08:30:46 INFO storage.BlockManagerMaster: Registered BlockManager
17/02/09 08:30:46 INFO storage.BlockManager: Registering executor with local 
external shuffle service.
17/02/09 08:30:51 ERROR client.TransportResponseHandler: Still have 1 requests 
outstanding when connection from hsx-node8/192.168.1.8:7337 is closed
17/02/09 08:30:51 ERROR storage.BlockManager: Failed to connect to external 
shuffle server, will retry 2 more times after waiting 5 seconds...
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
waiting for task.
at 
org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
at 
org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:144)
at 
org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:215)
at 
org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:201)
at org.apache.spark.executor.Executor.(Executor.scala:86)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
at 
org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
at 
org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:274)
... 14 more
17/02/09 08:31:01 ERROR storage.BlockManager: Failed to connect to external 
shuffle server, will retry 1 more times after waiting 5 seconds...
{noformat}

nodemanager's log:

{noformat}
2017-02-09 08:30:48,836 INFO 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
completed containers from NM context: [container_1486564603520_0097_01_05]
2017-02-09 08:31:12,122 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
from container container_1486564603520_0096_01_71 is : 1
2017-02-09 08:31:12,122 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception 
from container-launch with container ID: container_1486564603520_0096_01_71 
and exit code: 1
ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.ut

[jira] [Updated] (SPARK-20001) Support PythonRunner executing inside a Conda env

2017-03-17 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-20001:
---
Description: 
Similar to SPARK-13587, I'm trying to allow the user to configure a Conda 
environment that PythonRunner will run from. 
This change remembers the conda environment found on the driver and installs the 
same packages on the executor side, only once per PythonWorkerFactory. The list 
of requested conda packages is added to the PythonWorkerFactory cache, so two 
collects using the same environment (incl packages) can re-use the same running 
executors.

You have to specify outright what packages and channels to "bootstrap" the 
environment with. 

However, SparkContext (as well as JavaSparkContext & the pyspark version) are 
expanded to support addCondaPackage and addCondaChannel.
Rationale is:
* you might want to add more packages once you're already running in the driver
* you might want to add a channel which requires some token for authentication, 
which you don't yet have access to until the module is already running

This issue requires that the conda binary is already available on the driver as 
well as executors, you just have to specify where it can be found.

Please see the attached pull request on palantir/spark for additional details: 
https://github.com/palantir/spark/pull/115

As for tests, there is a local python test, as well as yarn client & 
cluster-mode tests, which ensure that a newly installed package is visible from 
both the driver and the executor.

  was:
Similar to SPARK-13587, I'm trying to allow the user to configure a Conda 
environment that PythonRunner will run from. 
This change remembers the conda environment found on the driver and installs the 
same packages on the executor side, only once per PythonWorkerFactory. The list 
of requested conda packages is added to the PythonWorkerFactory cache, so two 
collects using the same environment (incl packages) can re-use the same running 
executors.

You have to specify outright what packages and channels to "bootstrap" the 
environment with. 

However, SparkContext (as well as JavaSparkContext & the pyspark version) are 
expanded to support addCondaPackage and addCondaChannel.
Rationale is:
* you might want to add more packages once you're already running in the driver
* you might want to add a channel which requires some token for authentication, 
which you don't yet have access to until the module is already running

This issue requires that the conda binary is already available on the driver as 
well as executors, you just have to specify where it can be found.

Please see the attached issue on palantir/spark for additional details: 
https://github.com/palantir/spark/pull/115

As for tests, there is a local python test, as well as yarn client & 
cluster-mode tests, which ensure that a newly installed package is visible from 
both the driver and the executor.


> Support PythonRunner executing inside a Conda env
> -
>
> Key: SPARK-20001
> URL: https://issues.apache.org/jira/browse/SPARK-20001
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Dan Sanduleac
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Similar to SPARK-13587, I'm trying to allow the user to configure a Conda 
> environment that PythonRunner will run from. 
> This change remembers the conda environment found on the driver and installs 
> the same packages on the executor side, only once per PythonWorkerFactory. 
> The list of requested conda packages is added to the PythonWorkerFactory 
> cache, so two collects using the same environment (incl packages) can re-use 
> the same running executors.
> You have to specify outright what packages and channels to "bootstrap" the 
> environment with. 
> However, SparkContext (as well as JavaSparkContext & the pyspark version) are 
> expanded to support addCondaPackage and addCondaChannel.
> Rationale is:
> * you might want to add more packages once you're already running in the 
> driver
> * you might want to add a channel which requires some token for 
> authentication, which you don't yet have access to until the module is 
> already running
> This issue requires that the conda binary is already available on the driver 
> as well as executors, you just have to specify where it can be found.
> Please see the attached pull request on palantir/spark for additional 
> details: https://github.com/palantir/spark/pull/115
> As for tests, there is a local python test, as well as yarn client & 
> cluster-mode tests, which ensure that a newly installed package is visible 
> from both the driver and the executor.
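
A hedged usage sketch of the API described above; {{addCondaPackage}} and {{addCondaChannel}} exist only in the linked palantir/spark pull request, not in upstream Apache Spark, and the exact signatures are assumptions:

{code}
// Usage sketch assuming the SparkContext additions from the linked palantir/spark PR are
// on the classpath; these methods are not part of upstream Apache Spark.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("conda-env-example"))
sc.addCondaChannel("https://conda.example.internal/channel")  // hypothetical channel URL
sc.addCondaPackage("pyarrow")  // installed once per PythonWorkerFactory on each executor
{code}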



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

--

[jira] [Updated] (SPARK-20001) Support PythonRunner executing inside a Conda env

2017-03-17 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-20001:
---
Description: 
Similar to SPARK-13587, I'm trying to allow the user to configure a Conda 
environment that PythonRunner will run from. 
This change remembers the conda environment found on the driver and installs the 
same packages on the executor side, only once per PythonWorkerFactory. The list 
of requested conda packages is added to the PythonWorkerFactory cache, so two 
collects using the same environment (incl packages) can re-use the same running 
executors.

You have to specify outright what packages and channels to "bootstrap" the 
environment with. 

However, SparkContext (as well as JavaSparkContext & the pyspark version) are 
expanded to support addCondaPackage and addCondaChannel.
Rationale is:
* you might want to add more packages once you're already running in the driver
* you might want to add a channel which requires some token for authentication, 
which you don't yet have access to until the module is already running

This issue requires that the conda binary is already available on the driver as 
well as executors, you just have to specify where it can be found.

Please see the attached issue on palantir/spark for additional details: 
https://github.com/palantir/spark/pull/115

As for tests, there is a local python test, as well as yarn client & 
cluster-mode tests, which ensure that a newly installed package is visible from 
both the driver and the executor.

  was:
Similar to SPARK-13587, I'm trying to allow the user to configure a Conda 
environment that PythonRunner will run from. 
This change remembers the conda environment found on the driver and installs the 
same packages on the executor side, only once per PythonWorkerFactory. The list 
of requested conda packages is added to the PythonWorkerFactory cache, so two 
collects using the same environment (incl packages) can re-use the same running 
executors.

You have to specify outright what packages and channels to "bootstrap" the 
environment with. 

However, SparkContext (as well as JavaSparkContext & the pyspark version) are 
expanded to support addCondaPackage and addCondaChannel.
Rationale is:
* you might want to add more packages once you're already running in the driver
* you might want to add a channel which requires some token for authentication, 
which you don't yet have access to until the module is already running

This issue requires that the conda binary is already available on the driver as 
well as executors, you just have to specify where it can be found.

Please see the attached issue on palantir/spark for additional details.

As for tests, there is a local python test, as well as yarn client & 
cluster-mode tests, which ensure that a newly installed package is visible from 
both the driver and the executor.


> Support PythonRunner executing inside a Conda env
> -
>
> Key: SPARK-20001
> URL: https://issues.apache.org/jira/browse/SPARK-20001
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Dan Sanduleac
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Similar to SPARK-13587, I'm trying to allow the user to configure a Conda 
> environment that PythonRunner will run from. 
> This change remembers the conda environment found on the driver and installs 
> the same packages on the executor side, only once per PythonWorkerFactory. 
> The list of requested conda packages is added to the PythonWorkerFactory 
> cache, so two collects using the same environment (incl packages) can re-use 
> the same running executors.
> You have to specify outright what packages and channels to "bootstrap" the 
> environment with. 
> However, SparkContext (as well as JavaSparkContext & the pyspark version) are 
> expanded to support addCondaPackage and addCondaChannel.
> Rationale is:
> * you might want to add more packages once you're already running in the 
> driver
> * you might want to add a channel which requires some token for 
> authentication, which you don't yet have access to until the module is 
> already running
> This issue requires that the conda binary is already available on the driver 
> as well as executors, you just have to specify where it can be found.
> Please see the attached issue on palantir/spark for additional details: 
> https://github.com/palantir/spark/pull/115
> As for tests, there is a local python test, as well as yarn client & 
> cluster-mode tests, which ensure that a newly installed package is visible 
> from both the driver and the executor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...

[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2017-03-17 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929591#comment-15929591
 ] 

Andrew Ash commented on SPARK-18278:


As an update on this ticket:

For those not already aware, work on native Spark integration with Kubernetes 
has been proceeding for the past several months in this repo 
https://github.com/apache-spark-on-k8s/spark in the {{branch-2.1-kubernetes}} 
branch, based off the 2.1.0 Apache release.

We have an active core of about a half dozen contributors to the project with a 
wider group of about another dozen observing.  Communication happens through 
the issues on the GitHub repo, a dedicated room in the Kubernetes Slack, and 
weekly video conferences hosted by the Kubernetes Big Data SIG.

The full patch set is currently about 5500 lines, with about 500 of that as 
user/dev documentation.  Infrastructure-wise, we have a cloud-hosted CI Jenkins 
instance set up donated by project members, which is running both unit tests 
and Kubernetes integration tests over the code.

We recently entered a code freeze for our release branch and are preparing a 
first release to the wider community, which we plan to announce on the general 
Spark users list.  It includes the completed "phase one" portion of the design 
doc shared a few months ago 
(https://docs.google.com/document/d/1_bBzOZ8rKiOSjQg78DXOA3ZBIo_KkDJjqxVuq0yXdew/edit#heading=h.fua3ml5mcolt),
 featuring cluster mode with static allocation of executors, submission of 
local resources, SSL throughout, and support for JVM languages (Java/Scala).

After that release we'll be continuing to stabilize and improve the phase one 
feature set and move into a second phase of kubernetes work.  It will likely be 
focused on support for dynamic allocation, though we haven't finalized planning 
for phase two yet.  Working on the pluggable scheduler in SPARK-19700 may be 
included as well.

Interested parties are of course welcome to watch the repo, join the weekly 
video conferences, give the code a shot, and contribute to the project!

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executors lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-03-02 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893222#comment-15893222
 ] 

Andrew Ash commented on SPARK-18113:


We discovered another bug related to committing that causes a task deadloop, and 
work is being done in SPARK-19631 to fix it.

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> The executor's *AskPermissionToCommitOutput* to the driver fails, and the executor 
> retries sending it. The driver receives 2 AskPermissionToCommitOutput messages and 
> handles them. But the executor ignores the first response (true) and receives the 
> second response (false). The TaskAttemptNumber for this partition in 
> authorizedCommittersByStage is locked forever. The driver enters an infinite 
> loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> seconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> at 
> scala.concurrent.BlockContext$Def

[jira] [Commented] (SPARK-7354) Flaky test: o.a.s.deploy.SparkSubmitSuite --jars

2017-02-23 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881593#comment-15881593
 ] 

Andrew Ash commented on SPARK-7354:
---

We saw a flake for this test in the k8s repo's Travis builds too: 
https://github.com/apache-spark-on-k8s/spark/issues/110#issuecomment-281837162

> Flaky test: o.a.s.deploy.SparkSubmitSuite --jars
> 
>
> Key: SPARK-7354
> URL: https://issues.apache.org/jira/browse/SPARK-7354
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2271/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-14 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865778#comment-15865778
 ] 

Andrew Ash commented on SPARK-18113:


[~xukun] the scenario you describe should be accommodated by the newly-added 
{{spark.scheduler.outputcommitcoordinator.maxwaittime}} in that PR, which sets 
a timeout for how long an executor can hold the commit lock for a task.  If the 
executor fails while holding the lock, then after that period of time the lock 
will be released and a subsequent executor will be able to commit the task.  By 
default right now it is 2min.
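
For reference, a sketch of setting that timeout; the key and its default come from the linked PR rather than an upstream Spark release, so treat them as assumptions:

{code}
// Illustrative only: configuration introduced by the linked PR, not upstream Apache Spark.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("commit-coordinator-timeout-example")
  .set("spark.scheduler.outputcommitcoordinator.maxwaittime", "120s")  // described default: 2min
{code}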

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> The executor's *AskPermissionToCommitOutput* to the driver fails, and the executor 
> retries sending it. The driver receives 2 AskPermissionToCommitOutput messages and 
> handles them. But the executor ignores the first response (true) and receives the 
> second response (false). The TaskAttemptNumber for this partition in 
> authorizedCommittersByStage is locked forever. The driver enters an infinite 
> loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> second

[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-14 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865536#comment-15865536
 ] 

Andrew Ash commented on SPARK-18113:


Thanks for the updates you both.  I've been working with a coworker who's 
seeing this on a cluster of his and we think the deadloop is caused by an 
interaction between commit authorization and preemption (his cluster has lots 
of preemption occurring).

You can see the in-progress patch we've been working on at 
https://github.com/palantir/spark/pull/94 which adds a two-phase commit 
sequence to the OutputCommitCoordinator.  We're not done testing it yet but it 
might provide a useful example for discussion here.

Maybe we should also file another ticket for the deadloop-under-preemption 
scenario, which is a little different from the deadloop-from-lost-message 
scenario (which I think the already-merged PR fixes).

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> The executor's *AskPermissionToCommitOutput* to the driver fails, and the executor 
> retries sending it. The driver receives 2 AskPermissionToCommitOutput messages and 
> handles them. But the executor ignores the first response (true) and receives the 
> second response (false). The TaskAttemptNumber for this partition in 
> authorizedCommittersByStage is locked forever. The driver enters an infinite 
> loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
> at 
> java.util.concurrent.Thre

[jira] [Commented] (SPARK-19493) Remove Java 7 support

2017-02-10 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861576#comment-15861576
 ] 

Andrew Ash commented on SPARK-19493:


+1 -- we're removing Java 7 compatibility from the core internal libraries we run 
in Spark, and we rarely encounter clusters running Java 7 anymore.

> Remove Java 7 support
> -
>
> Key: SPARK-19493
> URL: https://issues.apache.org/jira/browse/SPARK-19493
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Spark deprecated Java 7 support in 2.0, and the goal of the ticket is to 
> officially remove Java 7 support in 2.2 or 2.3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11471) Improve the way that we plan shuffled join

2017-01-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836547#comment-15836547
 ] 

Andrew Ash commented on SPARK-11471:


[~yhuai] I'm interested in helping make progress on this -- it's causing pretty 
significant slowness on a SQL query I have.  Have you had any more thoughts on 
this since you first filed?

For my slow query, running the same SQL on another system (Teradata) completes 
in a few minutes whereas in Spark SQL it eventually fails after 20h+ of 
computation.  By breaking up the query into subqueries (forcing joins to happen 
in a different order) we can get the Spark SQL execution down to about 15min.

The query joins 8 tables together with a number of where clauses, and the join 
tree ends up doing a large number of seemingly "extra" shuffles.  At the bottom 
of the join tree, tables A and B are both shuffled to the same distribution and 
then joined.  Then when table C comes in (the next table) it's shuffled to a 
new distribution and then the result of (A+B) is reshuffled to match the new 
table C distribution.  Potentially table C could instead be shuffled to match 
the distribution of (A+B) so that (A+B) doesn't have to be reshuffled.

The general pattern here seems to be that N-Way joins require N + N-1 = 2N-1 
shuffles.  Potentially some of those shuffles could be eliminated with more 
intelligent join ordering and/or distribution selection.

What do you think?
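
To make the shuffle pattern concrete, here's a small made-up sketch (table and key names are invented; the per-step comments mirror the behavior described above):

{code}
// Made-up sketch of the per-join re-shuffling described above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-count-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val a = Seq((1, 10, "a")).toDF("k1", "k2", "va")
val b = Seq((1, "b")).toDF("k1", "vb")
val c = Seq((10, "c")).toDF("k2", "vc")

val ab  = a.join(b, Seq("k1"))   // a and b are each shuffled to a k1 distribution
val abc = ab.join(c, Seq("k2"))  // c is shuffled to a k2 distribution and the (a join b)
                                 // result is re-shuffled to match it
// Repeating that for every additional table is what drives the N + (N-1) = 2N-1 count
// above; the Exchange nodes in the physical plan mark each shuffle.
abc.explain()
{code}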

> Improve the way that we plan shuffled join
> --
>
> Key: SPARK-11471
> URL: https://issues.apache.org/jira/browse/SPARK-11471
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, when adaptive query execution is enabled, in most of cases, we 
> will shuffle input tables for every join. However, once we finish our work of 
> https://issues.apache.org/jira/browse/SPARK-10665, we will be able to have a 
> global view on the input datasets of a stage. Then, we should be able to add 
> exchange coordinators after we get the entire physical plan (after the phase 
> that we add Exchanges).
> I will try to fill in more information later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19213) FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the active session at execution time

2017-01-13 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-19213:
---
Summary: FileSourceScanExec uses SparkSession from HadoopFsRelation 
creation time instead of the active session at execution time  (was: 
FileSourceScanExec usese sparksession from hadoopfsrelation creation time 
instead of the one active at time of execution)

> FileSourceScanExec uses SparkSession from HadoopFsRelation creation time 
> instead of the active session at execution time
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the SparkSession used for execution is the one that was 
> captured from the logical plan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures the active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code, it looks like we should be using the 
> SparkSession that is currently active, hence take the one from the SparkPlan. 
> However, if you want to share Datasets across SparkSessions, that is not 
> enough, since as soon as the dataset is executed the QueryExecution will have 
> captured the Spark session at that point. If we want to share datasets across 
> users we need to make configurations not fixed upon first execution. I 
> consider the 1st part (using the SparkSession from the logical plan) a bug, while the 
> second (using the SparkSession active at runtime) an enhancement so that sharing 
> across sessions is made easier.
> For example:
> {code}
> val df = spark.read.parquet(...)
> df.count()
> val newSession = spark.newSession()
> SparkSession.setActiveSession(newSession)
> // change some configuration on newSession here (the simplest one to try is 
> disabling vectorized reads)
> val df2 = Dataset.ofRows(newSession, df.logicalPlan) // logical plan still 
> holds reference to original sparksession and changes don't take effect
> {code}
> I suggest that it shouldn't be necessary to create a new dataset for changes 
> to take effect. For most plans, going through Dataset.ofRows works, but this is 
> not the case for HadoopFsRelation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-01-09 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15812819#comment-15812819
 ] 

Andrew Ash commented on SPARK-18113:


I've done some more diagnosis on an example I saw, and think there's a failure 
mode that https://issues.apache.org/jira/browse/SPARK-8029 didn't consider.  
Here's the sequence of steps I think causes this issue:

- executor requests a task commit
- coordinator approves commit and sends response message
- executor is preempted by YARN!
- response message goes nowhere
- the OutputCommitCoordinator (OCC) now has attempt=0 pinned as the authorized 
committer for that stage/partition, so no other attempt will succeed :( (see the 
sketch below)

Are you running in YARN with preemption enabled? Is there preemption activity 
around the time of the task deadloop?
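
To illustrate the failure mode, here's a simplified, self-contained model of the 
coordinator's bookkeeping -- not Spark's actual OutputCommitCoordinator code, just 
a sketch of the behavior described above.  Once attempt 0 is recorded as the 
authorized committer, losing the approval response leaves every later attempt 
permanently denied:

{code}
import scala.collection.mutable

// (stage, partition) -> authorized attempt number
val authorized = mutable.Map.empty[(Int, Int), Int]

def canCommit(stage: Int, partition: Int, attempt: Int): Boolean =
  authorized.get((stage, partition)) match {
    case Some(winner) => winner == attempt  // only the first authorized attempt may commit
    case None =>
      authorized((stage, partition)) = attempt
      true
  }

canCommit(2, 24, 0)  // true -- but suppose this response never reaches the executor
canCommit(2, 24, 1)  // false
canCommit(2, 24, 2)  // false -- and so on for every retry: the task deadloop
{code}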

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>
> The executor's *AskPermissionToCommitOutput* message to the driver fails, so the 
> executor retries the send. The driver receives 2 AskPermissionToCommitOutput messages and 
> handles both, but the executor ignores the first response (true) and only sees the 
> second response (false). The TaskAttemptNumber for this partition in 
> authorizedCommittersByStage is then locked forever, and the driver enters an infinite 
> loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
>

[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-01-09 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15811041#comment-15811041
 ] 

Andrew Ash commented on SPARK-18113:


Thanks for sending in that PR [~jinxing6...@126.com]!  It's very similar to the 
one I've also been testing -- see https://github.com/palantir/spark/pull/79 for 
that diff (I haven't sent a PR into Apache yet).

Has that fixed the problem for you?

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>
> The executor's *AskPermissionToCommitOutput* message to the driver fails, so the 
> executor retries the send. The driver receives 2 AskPermissionToCommitOutput messages and 
> handles both, but the executor ignores the first response (true) and only sees the 
> second response (false). The TaskAttemptNumber for this partition in 
> authorizedCommittersByStage is then locked forever, and the driver enters an infinite 
> loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> seconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.s

[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-01-05 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802102#comment-15802102
 ] 

Andrew Ash commented on SPARK-18113:


[~xq2005] can you please send a PR to https://github.com/apache/spark with your 
proposed change?

I think I'm seeing the same issue and would like to see the code diff you're 
suggesting so I can test the fix.

Thanks!

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>
> The executor's *AskPermissionToCommitOutput* message to the driver fails, so the 
> executor retries the send. The driver receives 2 AskPermissionToCommitOutput messages and 
> handles both, but the executor ignores the first response (true) and only sees the 
> second response (false). The TaskAttemptNumber for this partition in 
> authorizedCommittersByStage is then locked forever, and the driver enters an infinite 
> loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> seconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> at 
> sc

[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752907#comment-15752907
 ] 

Andrew Ash commented on SPARK-18278:


There are definitely challenges in building features that take longer than a 
release cycle (quarterly for Spark).

We could maintain a long-running feature branch for spark-k8s that lasts 
several months and then gets merged into Spark in a big-bang merge, with that 
feature branch living either on apache/spark or in some other 
community-accessible repo.  I don't think there are many practical differences 
between hosting the source in apache/spark vs a different repo if neither is 
included in Apache releases.

Or we could merge many smaller commits for spark-k8s into the apache/spark 
master branch along the way and release as an experimental feature when release 
time comes.  This enables more continuous code review but has the risk of 
destabilizing the master branch if code reviews miss things.

Looking to past instances of large features spanning multiple release cycles 
(like SparkSQL and YARN integration), both of those had work happening 
primarily in-repo from what I can tell, and releases included large disclaimers 
in release notes for those experimental features.  That precedent seems to 
suggest Kubernetes integration should follow a similar path.

Personally I lean towards the approach of more, smaller commits into master 
rather than a long-running feature branch.  By reviewing PRs into the main 
repo as we go, the feature will be easier to review and will also get wider 
feedback as an experimental feature than a side branch or side repo would get.  
This also serves to include Apache committers from the start in understanding 
the codebase, rather than foisting a foreign codebase onto the project and hoping 
committers grok it well enough to hold the line on high-quality code reviews.  
Looking to the future where Kubernetes integration is potentially included in 
the mainline Apache release (like Mesos and YARN), it's best to work as 
contributor + committer together from the start for shared understanding.

Making an API for third-party cluster managers sounds great and is the easy, 
clean choice from a software engineering point of view, but I wonder how much 
practical value a pluggable cluster manager actually brings the Apache 
project.  It seems like both Two Sigma and IBM have been able to 
maintain their proprietary schedulers without the benefits of the API we're 
considering building.  Who / what workflows are we aiming to support with an 
API?

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executors lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17119) Add configuration property to allow the history server to delete .inprogress files

2016-12-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752439#comment-15752439
 ] 

Andrew Ash commented on SPARK-17119:


+1 I would use this feature

> Add configuration property to allow the history server to delete .inprogress 
> files
> --
>
> Key: SPARK-17119
> URL: https://issues.apache.org/jira/browse/SPARK-17119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bjorn Jonsson
>Priority: Minor
>  Labels: historyserver
>
> The History Server (HS) currently only considers completed applications when 
> deleting event logs from spark.history.fs.logDirectory (since SPARK-6879). 
> This means that over time, .inprogress files (from failed jobs, jobs where 
> the SparkContext is not closed, spark-shell exits etc...) can accumulate and 
> impact the HS.
> Instead of having to manually delete these files, maybe users could have the 
> option of telling the HS to delete all files where (now - 
> attempt.lastUpdated) > spark.history.fs.cleaner.maxAge, or just delete 
> .inprogress files with lastUpdated older than 7d?
> https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L467
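
As a rough sketch of the cleanup condition proposed above (the case class and 
parameter names are hypothetical, not FsHistoryProvider's actual internals):

{code}
case class AttemptInfo(logPath: String, lastUpdated: Long, completed: Boolean)

// Delete any attempt, .inprogress or not, whose last update is older than the max age.
def shouldDelete(a: AttemptInfo, nowMs: Long, maxAgeMs: Long): Boolean =
  nowMs - a.lastUpdated > maxAgeMs

shouldDelete(AttemptInfo("app-1.inprogress", lastUpdated = 0L, completed = false),
  nowMs = 8L * 24 * 60 * 60 * 1000, maxAgeMs = 7L * 24 * 60 * 60 * 1000)  // true
{code}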



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2016-12-12 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-18113:
---
Description: 
The executor's *AskPermissionToCommitOutput* message to the driver fails, so the 
executor retries the send. The driver receives 2 AskPermissionToCommitOutput messages and 
handles both, but the executor ignores the first response (true) and only sees the 
second response (false). The TaskAttemptNumber for this partition in 
authorizedCommittersByStage is then locked forever, and the driver enters an infinite loop.

h4. Driver Log:

{noformat}
16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
partition: 24, attemptNumber: 0
...
16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
stage: 2, partition: 24, attempt: 0
...
16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
partition: 24, attemptNumber: 1
16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
stage: 2, partition: 24, attempt: 1
...
16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 (TID 
28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
{noformat}

h4. Executor Log:

{noformat}
...
16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
...
16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
AskPermissionToCommitOutput(2,24,0)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.rpc.askTimeout
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at 
org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
at 
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
at 
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 13 more
...
16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119)
...
16/10/25 05:39:24 INFO SparkHadoopMapRedUtil: 
attempt_201610250536_0002_m_24_119: Not committed because the driver did 
not authorize commit
...
{noformat}


  was:
Executor sends *AskPermissionToCommitOutput* to driver failed, and retry 
another sending. Driver receives 2 AskPermissionToCommitOutput messages and 
handles them. But executor ignores the first response(true) and receives the 
second response(false). The TaskAttemptNumber for this partition in 
authorizedCommittersByStage is locked forever. Driver enters into infinite loop.

h4. Driver Log:
16/10/25 05:38:28 INFO TaskSetManager: Starting task 24

[jira] [Updated] (SPARK-17664) Failed to saveAsHadoop when speculate is enabled

2016-12-12 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-17664:
---
Description: 
From the following logs, task 22 failed 4 times because of "the driver did not 
authorize commit". But the strange thing is that I couldn't find task 22.1. 
Why? Maybe some synchronization error?

{noformat}
16/09/26 02:14:18 INFO TaskSetManager: Lost task 22.0 in stage 1856.0 (TID 
953902) on executor 10.196.131.13: java.security.PrivilegedActionException 
(null) [duplicate 4]
16/09/26 02:14:18 INFO TaskSetManager: Marking task 22 in stage 1856.0 (on 
10.196.131.13) as speculatable because it ran more than 5601 ms
16/09/26 02:14:18 INFO TaskSetManager: Starting task 22.2 in stage 1856.0 (TID 
954074, 10.215.143.14, partition 22,PROCESS_LOCAL, 2163 bytes)
16/09/26 02:14:18 INFO TaskSetManager: Lost task 22.2 in stage 1856.0 (TID 
954074) on executor 10.215.143.14: java.security.PrivilegedActionException 
(null) [duplicate 5]
16/09/26 02:14:18 INFO TaskSetManager: Marking task 22 in stage 1856.0 (on 
10.196.131.13) as speculatable because it ran more than 5601 ms
16/09/26 02:14:18 INFO TaskSetManager: Starting task 22.3 in stage 1856.0 (TID 
954075, 10.196.131.28, partition 22,PROCESS_LOCAL, 2163 bytes)
16/09/26 02:14:19 INFO TaskSetManager: Lost task 22.3 in stage 1856.0 (TID 
954075) on executor 10.196.131.28: java.security.PrivilegedActionException 
(null) [duplicate 6]
16/09/26 02:14:19 INFO TaskSetManager: Marking task 22 in stage 1856.0 (on 
10.196.131.13) as speculatable because it ran more than 5601 ms
16/09/26 02:14:19 INFO TaskSetManager: Starting task 22.4 in stage 1856.0 (TID 
954076, 10.215.153.225, partition 22,PROCESS_LOCAL, 2163 bytes)
16/09/26 02:14:19 INFO TaskSetManager: Lost task 22.4 in stage 1856.0 (TID 
954076) on executor 10.215.153.225: java.security.PrivilegedActionException 
(null) [duplicate 7]
16/09/26 02:14:19 ERROR TaskSetManager: Task 22 in stage 1856.0 failed 4 times; 
aborting job
16/09/26 02:14:19 INFO YarnClusterScheduler: Cancelling stage 1856
16/09/26 02:14:19 INFO YarnClusterScheduler: Stage 1856 was cancelled
16/09/26 02:14:19 INFO DAGScheduler: ResultStage 1856 (saveAsHadoopFile at 
TDWProvider.scala:514) failed in 23.049 s
16/09/26 02:14:19 INFO DAGScheduler: Job 76 failed: saveAsHadoopFile at 
TDWProvider.scala:514, took 69.865181 s
16/09/26 02:14:19 ERROR ApplicationMaster: User class threw exception: 
java.security.PrivilegedActionException: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 22 in stage 1856.0 failed 4 times, most 
recent failure: Lost task 22.4 in stage 1856.0 (TID 954076, 10.215.153.225): 
java.security.PrivilegedActionException: 
org.apache.spark.executor.CommitDeniedException: 
attempt_201609260213_1856_m_22_954076: Not committed because the driver did 
not authorize commit
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1723)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1284)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.spark.executor.CommitDeniedException: 
attempt_201609260213_1856_m_22_954076: Not committed because the driver did 
not authorize commit
at 
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:135)
at 
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:142)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anon$4.run(PairRDDFunctions.scala:1311)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anon$4.run(PairRDDFunctions.scala:1284)
... 11 more
{noformat}

  was:
From the following logs, task 22 failed 4 times because of "the driver did not 
authorize commit". But the strange thing is that I couldn't find task 22.1. 
Why? Maybe some synchronization error?


16/09/26 02:14:18 INFO TaskSetManager: Lost task 22.0 in stage 1856.0 (TID 
953902) on executor 10.196.131.13: java.security.PrivilegedActionException 
(null) [duplicate 4]
16/09/26 02:14:18 INFO TaskSetMana

[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-08 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731562#comment-15731562
 ] 

Andrew Ash commented on SPARK-18278:


[~rxin] is it a problem for ASF projects to publish docker images?  Have other 
Apache projects done this before, or would Spark be the first one?

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executors lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18499) Add back support for custom Spark SQL dialects

2016-11-17 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675714#comment-15675714
 ] 

Andrew Ash commented on SPARK-18499:


Specifically what I'm most interested in is a strict ANSI SQL dialect, not 
bending Spark SQL to support a proprietary dialect.
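
To make point 5 below concrete, here's a minimal, self-contained sketch of the 
"delegate to an underlying parser" idea.  The SqlParser trait and both classes are 
hypothetical stand-ins, not Spark's actual parser API:

{code}
trait SqlParser { def parsePlan(sql: String): String }

class DefaultParser extends SqlParser {
  def parsePlan(sql: String): String = s"default-plan($sql)"
}

// Handles one custom construct and falls through to the underlying parser otherwise.
class CustomDialectParser(delegate: SqlParser) extends SqlParser {
  def parsePlan(sql: String): String =
    if (sql.trim.toUpperCase.startsWith("MY CUSTOM ")) s"custom-plan($sql)"
    else delegate.parsePlan(sql)
}

new CustomDialectParser(new DefaultParser).parsePlan("SELECT 1")  // default-plan(SELECT 1)
{code}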

> Add back support for custom Spark SQL dialects
> --
>
> Key: SPARK-18499
> URL: https://issues.apache.org/jira/browse/SPARK-18499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Ash
>
> Point 5 from the parent task:
> {quote}
> 5. I want to be able to use my own customized SQL constructs. An example of 
> this would supporting my own dialect, or be able to add constructs to the 
> current SQL language. I should not have to implement a complete parse, and 
> should be able to delegate to an underlying parser.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18499) Add back support for custom Spark SQL dialects

2016-11-17 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-18499:
--

 Summary: Add back support for custom Spark SQL dialects
 Key: SPARK-18499
 URL: https://issues.apache.org/jira/browse/SPARK-18499
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Andrew Ash


Point 5 from the parent task:

{quote}
5. I want to be able to use my own customized SQL constructs. An example of 
this would supporting my own dialect, or be able to add constructs to the 
current SQL language. I should not have to implement a complete parse, and 
should be able to delegate to an underlying parser.
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18448) SparkSession should implement java.lang.AutoCloseable like JavaSparkContext

2016-11-15 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-18448:
--

 Summary: SparkSession should implement java.lang.AutoCloseable 
like JavaSparkContext
 Key: SPARK-18448
 URL: https://issues.apache.org/jira/browse/SPARK-18448
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.2
Reporter: Andrew Ash


https://docs.oracle.com/javase/8/docs/api/java/lang/AutoCloseable.html

This makes cleaning up SparkSessions in Java easier, but may introduce a 
Java 8 dependency if applied directly.  JavaSparkContext uses java.io.Closeable, 
I think to avoid this Java 8 dependency.
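
As a small sketch of the ergonomics in question (FakeSession is a stand-in here, 
not SparkSession, and the loan helper is just the pattern AutoCloseable enables):

{code}
def withResource[R <: AutoCloseable, T](resource: R)(f: R => T): T =
  try f(resource) finally resource.close()

final class FakeSession extends AutoCloseable {
  def sql(query: String): String = s"result of: $query"
  override def close(): Unit = println("session closed")
}

// The session is closed automatically, even if the body throws.
withResource(new FakeSession) { s => println(s.sql("select 1")) }
{code}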



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17874) Enabling SSL on HistoryServer should only open one port not two

2016-10-11 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-17874:
--

 Summary: Enabling SSL on HistoryServer should only open one port 
not two
 Key: SPARK-17874
 URL: https://issues.apache.org/jira/browse/SPARK-17874
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.0.1
Reporter: Andrew Ash


When turning on SSL on the HistoryServer with 
{{spark.ssl.historyServer.enabled=true}} this opens up a second port, at the 
[hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262]
 result of calculating {{spark.history.ui.port + 400}}, and sets up a redirect 
from the original (http) port to the new (https) port.

{noformat}
$ netstat -nlp | grep 23714
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp0  0 :::18080:::*
LISTEN  23714/java
tcp0  0 :::18480:::*
LISTEN  23714/java
{noformat}

By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one 
open port to change protocol from http to https, not to have 1) additional 
ports open 2) the http port remain open 3) the additional port at a value I 
didn't specify.

To fix this could take one of two approaches:

Approach 1:
- one port always, which is configured with {{spark.history.ui.port}}
- the protocol on that port is http by default
- or if {{spark.ssl.historyServer.enabled=true}} then it's https

Approach 2:
- add a new configuration item {{spark.history.ui.sslPort}} which configures 
the second port that starts up

In approach 1 we probably need a way to specify to Spark jobs whether the 
history server has ssl or not, based on SPARK-16988

That makes me think we should go with approach 2.
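
For illustration, approach 2 might look something like this in spark-defaults.conf 
(the {{spark.history.ui.sslPort}} property is the proposal here, not an existing 
setting; the other two exist today):

{noformat}
spark.ssl.historyServer.enabled  true
# existing http port
spark.history.ui.port            18080
# proposed: explicit https port instead of the hardcoded port + 400
spark.history.ui.sslPort         18443
{noformat}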



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17874) Additional SSL port on HistoryServer should be configurable

2016-10-11 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-17874:
---
Summary: Additional SSL port on HistoryServer should be configurable  (was: 
Enabling SSL on HistoryServer should only open one port not two)

> Additional SSL port on HistoryServer should be configurable
> ---
>
> Key: SPARK-17874
> URL: https://issues.apache.org/jira/browse/SPARK-17874
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.1
>Reporter: Andrew Ash
>
> When turning on SSL on the HistoryServer with 
> {{spark.ssl.historyServer.enabled=true}} this opens up a second port, at the 
> [hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262]
>  result of calculating {{spark.history.ui.port + 400}}, and sets up a 
> redirect from the original (http) port to the new (https) port.
> {noformat}
> $ netstat -nlp | grep 23714
> (Not all processes could be identified, non-owned process info
>  will not be shown, you would have to be root to see it all.)
> tcp0  0 :::18080:::*
> LISTEN  23714/java
> tcp0  0 :::18480:::*
> LISTEN  23714/java
> {noformat}
> By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one 
> open port to change protocol from http to https, not to have 1) additional 
> ports open 2) the http port remain open 3) the additional port at a value I 
> didn't specify.
> To fix this could take one of two approaches:
> Approach 1:
> - one port always, which is configured with {{spark.history.ui.port}}
> - the protocol on that port is http by default
> - or if {{spark.ssl.historyServer.enabled=true}} then it's https
> Approach 2:
> - add a new configuration item {{spark.history.ui.sslPort}} which configures 
> the second port that starts up
> In approach 1 we probably need a way to specify to Spark jobs whether the 
> history server has ssl or not, based on SPARK-16988
> That makes me think we should go with approach 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv

2016-08-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435539#comment-15435539
 ] 

Andrew Ash commented on SPARK-17227:


Rob and I work together, and we've seen datasets in mostly-CSV format that have 
non-standard record delimiters ('\0' character for instance).

For some broader context, we've created our own CSV text parser and use that in 
all our various internal products that use Spark, but would like to contribute 
this additional flexibility back to the Spark community at large and in the 
process eliminate the need for our internal CSV datasource.

Here are the tickets Rob just opened that we would require to eliminate our 
internal CSV datasource:

SPARK-17222
SPARK-17224
SPARK-17225
SPARK-17226
SPARK-17227

The basic question, then, is whether the Spark community would accept patches that 
extend Spark's CSV parser to cover these features.  We're willing to write the 
code and get the patches through code review, but would rather know up front if 
these changes would never be accepted into mainline Spark due to philosophical 
disagreements around what Spark's CSV datasource should be.
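
For reference, here's a sketch of what the user-facing knob could look like in 
spark-shell.  The "recordDelimiter" option name and the file path are hypothetical 
placeholders -- the whole point of SPARK-17227 is that the delimiter is currently 
hard coded to "\n":

{code}
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("recordDelimiter", "\u0000")  // e.g. NUL-delimited records
  .load("records.csv")                  // placeholder path
{code}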

> Allow configuring record delimiter in csv
> -
>
> Key: SPARK-17227
> URL: https://issues.apache.org/jira/browse/SPARK-17227
> Project: Spark
>  Issue Type: Improvement
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Instead of hard coded "\n"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself

2016-08-11 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418149#comment-15418149
 ] 

Andrew Ash commented on SPARK-17029:


Note RDD form usage from https://issues.apache.org/jira/browse/SPARK-10705

> Dataset toJSON goes through RDD form instead of transforming dataset itself
> ---
>
> Key: SPARK-17029
> URL: https://issues.apache.org/jira/browse/SPARK-17029
> Project: Spark
>  Issue Type: Bug
>Reporter: Robert Kruszewski
>
> No longer necessary and can be optimized with datasets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15104) Bad spacing in log line

2016-05-03 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-15104:
--

 Summary: Bad spacing in log line
 Key: SPARK-15104
 URL: https://issues.apache.org/jira/browse/SPARK-15104
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Andrew Ash
Priority: Minor


{noformat}INFO  [2016-05-03 21:18:51,477] 
org.apache.spark.scheduler.TaskSetManager: Starting task 0.0 in stage 101.0 
(TID 7029, localhost, partition 0,NODE_LOCAL, 1894 bytes){noformat}

Should have a space before "NODE_LOCAL"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


