[GitHub] spark issue #20239: [SPARK-23047][PYTHON][SQL] Change MapVector to NullableM...

2018-01-11 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20239
  
Btw, I don't mean to block this PR, but why does only `MapVector` have a 
`Nullable` version? Just out of curiosity.


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161157993
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala ---
@@ -52,6 +54,10 @@ private[spark] class BroadcastManager(
 
   private val nextBroadcastId = new AtomicLong(0)
 
+  private[broadcast] val cachedValues = {
+new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)
--- End diff --

ohh, I see. Thanks!


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161157178
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala ---
@@ -52,6 +54,10 @@ private[spark] class BroadcastManager(
 
   private val nextBroadcastId = new AtomicLong(0)
 
+  private[broadcast] val cachedValues = {
+new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)
--- End diff --

`can remove mappings...` That's why I say entries can be removed 
automatically.


---




[GitHub] spark pull request #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage...

2018-01-11 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20204#discussion_r161155869
  
--- Diff: python/run-tests-with-coverage ---
@@ -0,0 +1,69 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -o pipefail
+set -e
+
+# This variable indicates which coverage executable to run to combine 
coverages
+# and generate HTMLs, for example, 'coverage3' in Python 3.
+COV_EXEC="${COV_EXEC:-coverage}"
+FWDIR="$(cd "`dirname $0`"; pwd)"
+pushd "$FWDIR" > /dev/null
+
+# Ensure that coverage executable is installed.
+if ! hash $COV_EXEC 2>/dev/null; then
+  echo "Missing coverage executable in your path, skipping PySpark 
coverage"
+  exit 1
+fi
+
+# Set up the directories for coverage results.
+export COVERAGE_DIR="$FWDIR/test_coverage"
+rm -fr "$COVERAGE_DIR/coverage_data"
+rm -fr "$COVERAGE_DIR/htmlcov"
+mkdir -p "$COVERAGE_DIR/coverage_data"
+
+# Current directory are added in the python path so that it doesn't refer 
our built
+# pyspark zip library first.
+export PYTHONPATH="$FWDIR:$PYTHONPATH"
+# Also, our sitecustomize.py and coverage_daemon.py are included in the 
path.
+export PYTHONPATH="$COVERAGE_DIR:$PYTHONPATH"
+
+# We use 'spark.python.daemon.module' configuration to insert the coverage 
supported workers.
+export SPARK_CONF_DIR="$COVERAGE_DIR/conf"
+
+# This environment variable enables the coverage.
+export COVERAGE_PROCESS_START="$FWDIR/.coveragerc"
+
+# If you'd like to run a specific unittest class, you could do such as
+# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
+./run-tests $@
--- End diff --

nit: `"$@"` instead of `$@`, just in case.


---




[GitHub] spark pull request #20214: [SPARK-23023][SQL] Cast field data to strings in ...

2018-01-11 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20214#discussion_r161156501
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -237,13 +237,18 @@ class Dataset[T] private[sql](
   private[sql] def showString(
   _numRows: Int, truncate: Int = 20, vertical: Boolean = false): 
String = {
 val numRows = _numRows.max(0).min(Int.MaxValue - 1)
-val takeResult = toDF().take(numRows + 1)
+val newDf = toDF()
+val castExprs = newDf.schema.map { f => f.dataType match {
+  // Since binary types in top-level schema fields have a specific 
format to print,
+  // so we do not cast them to strings here.
+  case BinaryType => s"`${f.name}`"
+  case _: UserDefinedType[_] => s"`${f.name}`"
--- End diff --

Oh, yeah. I missed that. Thanks, I'll make a separate PR.
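
For reference, a minimal standalone sketch of the casting approach in this 
diff, written against the public DataFrame API (the `castForShow` name is 
hypothetical, and the UDT branch from the diff is omitted):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{BinaryType, StringType}

// Cast every top-level column to string for display, except binary
// columns, which have their own print format (as the diff's comment notes).
def castForShow(df: DataFrame): DataFrame = {
  val cols = df.schema.fields.map { f =>
    f.dataType match {
      case BinaryType => col(f.name)
      case _          => col(f.name).cast(StringType)
    }
  }
  df.select(cols: _*)
}
```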


---




[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20245
  
**[Test build #86024 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86024/testReport)**
 for PR 20245 at commit 
[`473b37a`](https://github.com/apache/spark/commit/473b37a1d2abd74f66d29883a1e8bb488cf6bb94).


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161156210
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala ---
@@ -52,6 +54,10 @@ private[spark] class BroadcastManager(
 
   private val nextBroadcastId = new AtomicLong(0)
 
+  private[broadcast] val cachedValues = {
+new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)
--- End diff --

BTW - welcome back @jerryshao ! long time no see!


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161156002
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala ---
@@ -52,6 +54,10 @@ private[spark] class BroadcastManager(
 
   private val nextBroadcastId = new AtomicLong(0)
 
+  private[broadcast] val cachedValues = {
+new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)
--- End diff --

That should be fine even if the hard references are not removed, since the 
memory consumption should be quite minor.


---




[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...

2018-01-11 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20245
  
retest this please


---




[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20245
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20245
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86019/
Test FAILed.


---




[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20245
  
**[Test build #86019 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86019/testReport)**
 for PR 20245 at commit 
[`473b37a`](https://github.com/apache/spark/commit/473b37a1d2abd74f66d29883a1e8bb488cf6bb94).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20183


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161154730
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala ---
@@ -52,6 +54,10 @@ private[spark] class BroadcastManager(
 
   private val nextBroadcastId = new AtomicLong(0)
 
+  private[broadcast] val cachedValues = {
+new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)
--- End diff --

This is what I found from the doc:

>Hash-based Map implementation that allows mappings to be removed by the 
garbage collector.
When you construct a ReferenceMap, you can specify what kind of references 
are used to store the map's keys and values. If non-hard references are used, 
then the garbage collector can remove mappings if a key or value becomes 
unreachable, or if the JVM's memory is running low. For information on how the 
different reference types behave, see Reference.

It only mentions that non-hard references can be removed by GC; please 
correct me if I'm wrong.
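
As an illustration of these semantics, a minimal standalone sketch with hard 
keys and weak values (not Spark code; `System.gc()` is only a request, so 
the final print is indicative):

```scala
import org.apache.commons.collections.map.{AbstractReferenceMap, ReferenceMap}

object ReferenceMapDemo extends App {
  // Hard keys, weak values: an entry survives only while its value is
  // strongly reachable somewhere else.
  val map = new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)
  var value: AnyRef = new Array[Byte](1 << 20)
  map.put("broadcast_0", value)
  println(map.get("broadcast_0") != null) // true: value still strongly held

  value = null // drop the last strong reference to the value
  System.gc()  // ask (not force) the JVM to collect
  Thread.sleep(100)
  println(map.get("broadcast_0")) // typically null: weak value reclaimed
}
```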


---




[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...

2018-01-11 Thread ivoson
Github user ivoson commented on the issue:

https://github.com/apache/spark/pull/20244
  
This is the stack trace of the Exception.

```
java.lang.ClassCastException: org.apache.spark.rdd.CheckpointRDDPartition 
cannot be cast to org.apache.spark.streaming.rdd.MapWithStateRDDPartition
at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:152)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
```


---




[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...

2018-01-11 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20183
  
thanks, merging to master/2.3!


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161154173
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala ---
@@ -52,6 +54,10 @@ private[spark] class BroadcastManager(
 
   private val nextBroadcastId = new AtomicLong(0)
 
+  private[broadcast] val cachedValues = {
+new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)
--- End diff --

according to the document of `ReferenceMap`, if key or value is eligible 
for GC, the entry will be removed.


---




[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...

2018-01-11 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/20244
  
@ivoson Tengfei, please post the full stack trace of the 
`ClassCastException`.


---




[GitHub] spark pull request #20229: [SPARK-23045][ML][SparkR] Update RFormula to use ...

2018-01-11 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/20229#discussion_r161153997
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala 
---
@@ -230,16 +231,17 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") 
override val uid: String)
 val encodedTerms = resolvedFormula.terms.map {
   case Seq(term) if dataset.schema(term).dataType == StringType =>
 val encodedCol = tmpColumn("onehot")
-var encoder = new OneHotEncoder()
-  .setInputCol(indexed(term))
-  .setOutputCol(encodedCol)
 // Formula w/o intercept, one of the categories in the first 
category feature is
 // being used as reference category, we will not drop any category 
for that feature.
 if (!hasIntercept && !keepReferenceCategory) {
-  encoder = encoder.setDropLast(false)
+  encoderStages += new OneHotEncoderEstimator(uid)
+.setInputCols(Array(indexed(term)))
+.setOutputCols(Array(encodedCol))
+.setDropLast(false)
--- End diff --

There is at most one encoder with `dropLast(false)`; the next line sets 
`keepReferenceCategory = true` to ensure we won't take this code path for the 
remaining columns.
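
For context, a hedged usage sketch of the estimator configured as above 
(assuming a DataFrame `df` that already has an indexed column `termIdx`; 
the column names are made up):

```scala
import org.apache.spark.ml.feature.OneHotEncoderEstimator

// With dropLast(false), the last category is kept, so no category acts as
// an implicit reference category; this is the no-intercept case above.
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("termIdx"))
  .setOutputCols(Array("termVec"))
  .setDropLast(false)
val encoded = encoder.fit(df).transform(df)
```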


---




[GitHub] spark issue #20242: [MINOR][BUILD] Fix Java linter errors

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20242
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86015/
Test FAILed.


---




[GitHub] spark issue #20242: [MINOR][BUILD] Fix Java linter errors

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20242
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20183
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20183
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86016/
Test PASSed.


---




[GitHub] spark issue #20242: [MINOR][BUILD] Fix Java linter errors

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20242
  
**[Test build #86015 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86015/testReport)**
 for PR 20242 at commit 
[`752daa3`](https://github.com/apache/spark/commit/752daa38736d9f620efc47d65fe9277315a397e9).
 * This patch **fails from timeout after a configured wait of `250m`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20183
  
**[Test build #86016 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86016/testReport)**
 for PR 20183 at commit 
[`a6e5e81`](https://github.com/apache/spark/commit/a6e5e810402d0fc807d86dba9f699d2851be1f3a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20214: [SPARK-23023][SQL] Cast field data to strings in ...

2018-01-11 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20214#discussion_r161153123
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -237,13 +237,18 @@ class Dataset[T] private[sql](
   private[sql] def showString(
   _numRows: Int, truncate: Int = 20, vertical: Boolean = false): 
String = {
 val numRows = _numRows.max(0).min(Int.MaxValue - 1)
-val takeResult = toDF().take(numRows + 1)
+val newDf = toDF()
+val castExprs = newDf.schema.map { f => f.dataType match {
+  // Since binary types in top-level schema fields have a specific 
format to print,
+  // so we do not cast them to strings here.
+  case BinaryType => s"`${f.name}`"
+  case _: UserDefinedType[_] => s"`${f.name}`"
--- End diff --

How about something like:

```scala
  case udt: UserDefinedType[_] =>
(c, evPrim, evNull) => {
  val udtTerm = ctx.addReferenceObj("udt", udt)
  s"$evPrim = 
UTF8String.fromString($udtTerm.deserialize($c).toString());"
}
```
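
(For context: `CodegenContext.addReferenceObj` registers the UDT instance 
with the codegen context and returns the Java expression for accessing it, 
so the generated code can call `deserialize`/`toString` on the actual object 
at runtime.)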



---




[GitHub] spark issue #20240: [SPARK-23049][SQL] `spark.sql.files.ignoreCorruptFiles` ...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20240
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20240: [SPARK-23049][SQL] `spark.sql.files.ignoreCorruptFiles` ...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20240
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86014/
Test FAILed.


---




[GitHub] spark issue #20240: [SPARK-23049][SQL] `spark.sql.files.ignoreCorruptFiles` ...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20240
  
**[Test build #86014 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86014/testReport)**
 for PR 20240 at commit 
[`33ae3ca`](https://github.com/apache/spark/commit/33ae3ca34aa237c630927c96d9421ea53ed6a775).
 * This patch **fails from timeout after a configured wait of `250m`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark

2018-01-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20217
  
@cloud-fan, BTW, would you prefer to have a new API, just to be clear?


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161151338
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala ---
@@ -52,6 +54,10 @@ private[spark] class BroadcastManager(
 
   private val nextBroadcastId = new AtomicLong(0)
 
+  private[broadcast] val cachedValues = {
+new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)
--- End diff --

If the key is a hard reference, does it mean that this key will never be 
cleaned from the map automatically by GC?


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161149468
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -206,36 +206,50 @@ private[spark] class TorrentBroadcast[T: 
ClassTag](obj: T, id: Long)
 
   private def readBroadcastBlock(): T = Utils.tryOrIOException {
 TorrentBroadcast.synchronized {
-  setConf(SparkEnv.get.conf)
-  val blockManager = SparkEnv.get.blockManager
-  blockManager.getLocalValues(broadcastId) match {
-case Some(blockResult) =>
-  if (blockResult.data.hasNext) {
-val x = blockResult.data.next().asInstanceOf[T]
-releaseLock(broadcastId)
-x
-  } else {
-throw new SparkException(s"Failed to get locally stored 
broadcast data: $broadcastId")
-  }
-case None =>
-  logInfo("Started reading broadcast variable " + id)
-  val startTimeMs = System.currentTimeMillis()
-  val blocks = readBlocks()
-  logInfo("Reading broadcast variable " + id + " took" + 
Utils.getUsedTimeMs(startTimeMs))
-
-  try {
-val obj = TorrentBroadcast.unBlockifyObject[T](
-  blocks.map(_.toInputStream()), SparkEnv.get.serializer, 
compressionCodec)
-// Store the merged copy in BlockManager so other tasks on 
this executor don't
-// need to re-fetch it.
-val storageLevel = StorageLevel.MEMORY_AND_DISK
-if (!blockManager.putSingle(broadcastId, obj, storageLevel, 
tellMaster = false)) {
-  throw new SparkException(s"Failed to store $broadcastId in 
BlockManager")
+  val broadcastCache = SparkEnv.get.broadcastManager.cachedValues
+
+  
Option(broadcastCache.get(broadcastId)).map(_.asInstanceOf[T]).getOrElse {
--- End diff --

I see, thanks for pointing out.


---




[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20183
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86010/
Test PASSed.


---




[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20183
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20163
  
**[Test build #86023 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86023/testReport)**
 for PR 20163 at commit 
[`d307cee`](https://github.com/apache/spark/commit/d307ceeaa581dc219ae744482a6d3c374b3a3025).


---




[GitHub] spark issue #20242: [MINOR][BUILD] Fix Java linter errors

2018-01-11 Thread sameeragarwal
Github user sameeragarwal commented on the issue:

https://github.com/apache/spark/pull/20242
  
Thanks @dongwang218, LGTM

It seems like the Java linter checks are not included in 
https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-lint/. I'll update 
the scripts so that they're automatically checked.


---




[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20183
  
**[Test build #86010 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86010/testReport)**
 for PR 20183 at commit 
[`8e13585`](https://github.com/apache/spark/commit/8e13585f563162362d4e20f473438a0d0f9ce3d3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20226
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20226
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86022/
Test FAILed.


---




[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20226
  
**[Test build #86022 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86022/testReport)**
 for PR 20226 at commit 
[`9bcd905`](https://github.com/apache/spark/commit/9bcd9052416d7d6127b560f9d7ce77965a0005ce).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...

2018-01-11 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20163
  
jenkins retest this please


---




[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-11 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/20224
  
As a high-level comment, adding IDs helps performance/error diagnosis in 
production environments. I strongly support always enabling this.
Let me look at the technical details later.


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread ho3rexqj
Github user ho3rexqj commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161148920
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -206,36 +206,50 @@ private[spark] class TorrentBroadcast[T: 
ClassTag](obj: T, id: Long)
 
   private def readBroadcastBlock(): T = Utils.tryOrIOException {
 TorrentBroadcast.synchronized {
-  setConf(SparkEnv.get.conf)
-  val blockManager = SparkEnv.get.blockManager
-  blockManager.getLocalValues(broadcastId) match {
-case Some(blockResult) =>
-  if (blockResult.data.hasNext) {
-val x = blockResult.data.next().asInstanceOf[T]
-releaseLock(broadcastId)
-x
-  } else {
-throw new SparkException(s"Failed to get locally stored 
broadcast data: $broadcastId")
-  }
-case None =>
-  logInfo("Started reading broadcast variable " + id)
-  val startTimeMs = System.currentTimeMillis()
-  val blocks = readBlocks()
-  logInfo("Reading broadcast variable " + id + " took" + 
Utils.getUsedTimeMs(startTimeMs))
-
-  try {
-val obj = TorrentBroadcast.unBlockifyObject[T](
-  blocks.map(_.toInputStream()), SparkEnv.get.serializer, 
compressionCodec)
-// Store the merged copy in BlockManager so other tasks on 
this executor don't
-// need to re-fetch it.
-val storageLevel = StorageLevel.MEMORY_AND_DISK
-if (!blockManager.putSingle(broadcastId, obj, storageLevel, 
tellMaster = false)) {
-  throw new SparkException(s"Failed to store $broadcastId in 
BlockManager")
+  val broadcastCache = SparkEnv.get.broadcastManager.cachedValues
+
+  
Option(broadcastCache.get(broadcastId)).map(_.asInstanceOf[T]).getOrElse {
--- End diff --

No, `ReferenceMap` is not thread safe; however, all operations on 
`broadcastCache` occur within the context of a synchronized block 
(TorrentBroadcast.scala lines 208-254).
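
A condensed sketch of that pattern (hypothetical names; the real code wraps 
the whole read path in `TorrentBroadcast.synchronized`):

```scala
import org.apache.commons.collections.map.ReferenceMap

object CacheLock // stand-in for the TorrentBroadcast companion object

// All reads and writes to the non-thread-safe map go through one lock.
def getOrElseCompute[T](cache: ReferenceMap, id: AnyRef)(compute: => T): T =
  CacheLock.synchronized {
    Option(cache.get(id)).map(_.asInstanceOf[T]).getOrElse {
      val value = compute
      cache.put(id, value.asInstanceOf[AnyRef])
      value
    }
  }
```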


---




[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19001
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86013/
Test FAILed.


---




[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19001
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19001
  
**[Test build #86013 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86013/testReport)**
 for PR 19001 at commit 
[`7b8a072`](https://github.com/apache/spark/commit/7b8a0729b38ba2fbdc1c4359fcb82a1b6cde5b5c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `throw new IOException("Cannot find class " + inputFormatClassName, e);`
  * `throw new IOException("Unable to find the InputFormat class " + inputFormatClassName, e);`


---




[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161147892
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -206,36 +206,50 @@ private[spark] class TorrentBroadcast[T: 
ClassTag](obj: T, id: Long)
 
   private def readBroadcastBlock(): T = Utils.tryOrIOException {
 TorrentBroadcast.synchronized {
-  setConf(SparkEnv.get.conf)
-  val blockManager = SparkEnv.get.blockManager
-  blockManager.getLocalValues(broadcastId) match {
-case Some(blockResult) =>
-  if (blockResult.data.hasNext) {
-val x = blockResult.data.next().asInstanceOf[T]
-releaseLock(broadcastId)
-x
-  } else {
-throw new SparkException(s"Failed to get locally stored 
broadcast data: $broadcastId")
-  }
-case None =>
-  logInfo("Started reading broadcast variable " + id)
-  val startTimeMs = System.currentTimeMillis()
-  val blocks = readBlocks()
-  logInfo("Reading broadcast variable " + id + " took" + 
Utils.getUsedTimeMs(startTimeMs))
-
-  try {
-val obj = TorrentBroadcast.unBlockifyObject[T](
-  blocks.map(_.toInputStream()), SparkEnv.get.serializer, 
compressionCodec)
-// Store the merged copy in BlockManager so other tasks on 
this executor don't
-// need to re-fetch it.
-val storageLevel = StorageLevel.MEMORY_AND_DISK
-if (!blockManager.putSingle(broadcastId, obj, storageLevel, 
tellMaster = false)) {
-  throw new SparkException(s"Failed to store $broadcastId in 
BlockManager")
+  val broadcastCache = SparkEnv.get.broadcastManager.cachedValues
+
+  
Option(broadcastCache.get(broadcastId)).map(_.asInstanceOf[T]).getOrElse {
--- End diff --

Is this `ReferenceMap` thread safe?


---




[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20243
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86017/
Test FAILed.


---




[GitHub] spark pull request #20226: [SPARK-23034][SQL][UI] Display tablename for `Hiv...

2018-01-11 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/20226#discussion_r161147457
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala
 ---
@@ -62,6 +62,8 @@ case class HiveTableScanExec(
 
   override def conf: SQLConf = sparkSession.sessionState.conf
 
+  override def nodeName: String = s"${super.nodeName} 
(${relation.tableMeta.qualifiedName})"
--- End diff --

@gatorsmile : I have updated the PR after going through all the *ScanExec 
implementations

Changes introduced in this PR:

Scan impl | overridden `nodeName`
--- | ---
DataSourceV2ScanExec | `Scan DataSourceV2 [output_attribute1, output_attribute2, ..]`
ExternalRDDScanExec | `Scan ExternalRDD [output_attribute1, output_attribute2, ..]`
FileSourceScanExec | `Scan FileSource ${tableIdentifier.map(_.unquotedString).getOrElse(relation.location)}`
HiveTableScanExec | `Scan HiveTable relation.tableMeta.qualifiedName`
InMemoryTableScanExec | `Scan In-memory relation.tableName`
LocalTableScanExec | `Scan LocalTable [output_attribute1, output_attribute2, ..]`
RDDScanExec | `Scan RDD name [output_attribute1, output_attribute2, ..]`
RowDataSourceScanExec | `Scan FileSource ${tableIdentifier.map(_.unquotedString).getOrElse(relation)}`


Things not affected:
- DataSourceScanExec : already uses `Scan relation 
tableIdentifier.unquotedString`
- RDDScanExec forces clients to specify the `nodeName`


---




[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20226
  
**[Test build #86022 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86022/testReport)**
 for PR 20226 at commit 
[`9bcd905`](https://github.com/apache/spark/commit/9bcd9052416d7d6127b560f9d7ce77965a0005ce).


---




[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20243
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20243
  
**[Test build #86017 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86017/testReport)**
 for PR 20243 at commit 
[`71cc6e4`](https://github.com/apache/spark/commit/71cc6e41cc19af8e672c67624ca16f330804ccc8).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...

2018-01-11 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20211#discussion_r161146793
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -457,13 +458,26 @@ class RelationalGroupedDataset protected[sql](
 
 val groupingNamedExpressions = groupingExprs.map {
   case ne: NamedExpression => ne
-  case other => Alias(other, other.toString)()
+  case other => Alias(other, toPrettySQL(other))()
 }
 val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
 val child = df.logicalPlan
 val project = Project(groupingNamedExpressions ++ child.output, child)
-val output = expr.dataType.asInstanceOf[StructType].toAttributes
-val plan = FlatMapGroupsInPandas(groupingAttributes, expr, output, 
project)
+val udfOutput: Seq[Attribute] = 
expr.dataType.asInstanceOf[StructType].toAttributes
+val additionalGroupingAttributes = mutable.ArrayBuffer[Attribute]()
+
+for (attribute <- groupingAttributes) {
+  if (!udfOutput.map(_.name).contains(attribute.name)) {
--- End diff --

Maybe this relates to the discussion above 
(https://github.com/apache/spark/pull/20211#discussion_r160524679).
Let's wait and see for now.


---




[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20214
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20214
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86012/
Test PASSed.


---




[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20214
  
**[Test build #86012 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86012/testReport)**
 for PR 20214 at commit 
[`afe0af5`](https://github.com/apache/spark/commit/afe0af504b8a799dadfcd8c18ade339432f889b0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...

2018-01-11 Thread ivoson
Github user ivoson commented on a diff in the pull request:

https://github.com/apache/spark/pull/20244#discussion_r161145542
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2399,6 +2417,93 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 }
   }
 
+  /**
+   * In this test, we simply simulate the scene in concurrent jobs using 
the same
+   * rdd which is marked to do checkpoint:
+   * Job one has already finished the spark job, and start the process of 
doCheckpoint;
+   * Job two is submitted, and submitMissingTasks is called.
+   * In submitMissingTasks, if taskSerialization is called before 
doCheckpoint is done,
+   * while part calculates from stage.rdd.partitions is called after 
doCheckpoint is done,
+   * we may get a ClassCastException when execute the task because of some 
rdd will do
+   * Partition cast.
+   *
+   * With this test case, just want to indicate that we should do 
taskSerialization and
+   * part calculate in submitMissingTasks with the same rdd checkpoint 
status.
+   */
+  test("task part misType with checkpoint rdd in concurrent execution 
scenes") {
+// set checkpointDir.
+val tempDir = Utils.createTempDir()
+val checkpointDir = File.createTempFile("temp", "", tempDir)
+checkpointDir.delete()
+sc.setCheckpointDir(checkpointDir.toString)
+
+val latch = new CountDownLatch(2)
+val semaphore1 = new Semaphore(0)
+val semaphore2 = new Semaphore(0)
+
+val rdd = new WrappedRDD(sc.makeRDD(1 to 100, 4))
+rdd.checkpoint()
+
+val checkpointRunnable = new Runnable {
+  override def run() = {
+// Simply simulate what RDD.doCheckpoint() do here.
+rdd.doCheckpointCalled = true
+val checkpointData = rdd.checkpointData.get
+RDDCheckpointData.synchronized {
+  if (checkpointData.cpState == CheckpointState.Initialized) {
+checkpointData.cpState = 
CheckpointState.CheckpointingInProgress
+  }
+}
+
+val newRDD = checkpointData.doCheckpoint()
+
+// Release semaphore1 after job triggered in checkpoint finished.
+semaphore1.release()
+semaphore2.acquire()
+// Update our state and truncate the RDD lineage.
+RDDCheckpointData.synchronized {
+  checkpointData.cpRDD = Some(newRDD)
+  checkpointData.cpState = CheckpointState.Checkpointed
+  rdd.markCheckpointed()
+}
+semaphore1.release()
+
+latch.countDown()
+  }
+}
+
+val submitMissingTasksRunnable = new Runnable {
+  override def run() = {
+// Simply simulate the process of submitMissingTasks.
+val ser = SparkEnv.get.closureSerializer.newInstance()
+semaphore1.acquire()
+// Simulate task serialization while submitMissingTasks.
+// Task serialized with rdd checkpoint not finished.
+val cleanedFunc = sc.clean(Utils.getIteratorSize _)
+val func = (ctx: TaskContext, it: Iterator[Int]) => cleanedFunc(it)
+val taskBinaryBytes = JavaUtils.bufferToArray(
+  ser.serialize((rdd, func): AnyRef))
+semaphore2.release()
+semaphore1.acquire()
+// Part calculated with rdd checkpoint already finished.
+val (taskRdd, taskFunc) = ser.deserialize[(RDD[Int], (TaskContext, 
Iterator[Int]) => Unit)](
+  ByteBuffer.wrap(taskBinaryBytes), 
Thread.currentThread.getContextClassLoader)
+val part = rdd.partitions(0)
+intercept[ClassCastException] {
--- End diff --

It is a reproduction case; I will fix this.


---




[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...

2018-01-11 Thread ivoson
Github user ivoson commented on a diff in the pull request:

https://github.com/apache/spark/pull/20244#discussion_r161145538
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2399,6 +2417,93 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 }
   }
 
+  /**
+   * In this test, we simply simulate the scene in concurrent jobs using 
the same
+   * rdd which is marked to do checkpoint:
+   * Job one has already finished the spark job, and start the process of 
doCheckpoint;
+   * Job two is submitted, and submitMissingTasks is called.
+   * In submitMissingTasks, if taskSerialization is called before 
doCheckpoint is done,
+   * while part calculates from stage.rdd.partitions is called after 
doCheckpoint is done,
+   * we may get a ClassCastException when execute the task because of some 
rdd will do
+   * Partition cast.
+   *
+   * With this test case, just want to indicate that we should do 
taskSerialization and
+   * part calculate in submitMissingTasks with the same rdd checkpoint 
status.
+   */
+  test("task part misType with checkpoint rdd in concurrent execution 
scenes") {
--- End diff --

Thanks for the suggestion.


---




[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...

2018-01-11 Thread ivoson
Github user ivoson commented on a diff in the pull request:

https://github.com/apache/spark/pull/20244#discussion_r161145547
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -96,6 +98,22 @@ class MyRDD(
   override def toString: String = "DAGSchedulerSuiteRDD " + id
 }
 
+/** Wrapped rdd partition. */
+class WrappedPartition(val partition: Partition) extends Partition {
+  def index: Int = partition.index
+}
+
+/** Wrapped rdd with WrappedPartition. */
+class WrappedRDD(parent: RDD[Int]) extends RDD[Int](parent) {
+  protected def getPartitions: Array[Partition] = {
+parent.partitions.map(p => new WrappedPartition(p))
+  }
+
+  def compute(split: Partition, context: TaskContext): Iterator[Int] = {
+parent.compute(split.asInstanceOf[WrappedPartition].partition, context)
--- End diff --

Thanks for the comment, I will work on this.


---




[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...

2018-01-11 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/20244
  
ok to test


---




[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20222
  
**[Test build #86021 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86021/testReport)**
 for PR 20222 at commit 
[`440d11f`](https://github.com/apache/spark/commit/440d11f53a6f4b7531d96659c88a7e7961021778).


---




[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...

2018-01-11 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/20244
  
test this please


---




[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20222
  
retest this please


---




[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...

2018-01-11 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/20244#discussion_r161141879
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2399,6 +2417,93 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 }
   }
 
+  /**
+   * In this test, we simply simulate the scene in concurrent jobs using 
the same
+   * rdd which is marked to do checkpoint:
+   * Job one has already finished the spark job, and start the process of 
doCheckpoint;
+   * Job two is submitted, and submitMissingTasks is called.
+   * In submitMissingTasks, if taskSerialization is called before 
doCheckpoint is done,
+   * while part calculates from stage.rdd.partitions is called after 
doCheckpoint is done,
+   * we may get a ClassCastException when execute the task because of some 
rdd will do
+   * Partition cast.
+   *
+   * With this test case, just want to indicate that we should do 
taskSerialization and
+   * part calculate in submitMissingTasks with the same rdd checkpoint 
status.
+   */
+  test("task part misType with checkpoint rdd in concurrent execution 
scenes") {
--- End diff --

maybe "SPARK-23053: avoid CastException in concurrent execution with 
checkpoint" better?


---




[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...

2018-01-11 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/20244#discussion_r161141499
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -96,6 +98,22 @@ class MyRDD(
   override def toString: String = "DAGSchedulerSuiteRDD " + id
 }
 
+/** Wrapped rdd partition. */
+class WrappedPartition(val partition: Partition) extends Partition {
+  def index: Int = partition.index
+}
+
+/** Wrapped rdd with WrappedPartition. */
+class WrappedRDD(parent: RDD[Int]) extends RDD[Int](parent) {
+  protected def getPartitions: Array[Partition] = {
+parent.partitions.map(p => new WrappedPartition(p))
+  }
+
+  def compute(split: Partition, context: TaskContext): Iterator[Int] = {
+parent.compute(split.asInstanceOf[WrappedPartition].partition, context)
--- End diff --

I think this line is the key point of `WrappedPartition` and `WrappedRDD`; 
please add comments explaining your intention.
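
For example, the requested comment could read something like this (a 
suggestion, not the PR's text):

```scala
def compute(split: Partition, context: TaskContext): Iterator[Int] = {
  // Cast the generic Partition back to the WrappedPartition produced by
  // getPartitions. If the RDD is checkpointed between task serialization
  // and partition calculation, the task receives a CheckpointRDDPartition
  // instead, and this cast throws the ClassCastException that SPARK-23053
  // reproduces.
  parent.compute(split.asInstanceOf[WrappedPartition].partition, context)
}
```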


---




[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...

2018-01-11 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/20244#discussion_r161144809
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2399,6 +2417,93 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     }
   }
 
+  /**
+   * In this test, we simply simulate the scene of concurrent jobs using the same
+   * rdd which is marked to do checkpoint:
+   * Job one has already finished the spark job, and starts the process of doCheckpoint;
+   * Job two is submitted, and submitMissingTasks is called.
+   * In submitMissingTasks, if taskSerialization is done before doCheckpoint is done,
+   * while the part calculation from stage.rdd.partitions happens after doCheckpoint is done,
+   * we may get a ClassCastException when executing the task because some rdds will do a
+   * Partition cast.
+   *
+   * With this test case, we just want to indicate that we should do taskSerialization and
+   * the part calculation in submitMissingTasks with the same rdd checkpoint status.
+   */
+  test("task part misType with checkpoint rdd in concurrent execution scenes") {
+    // set checkpointDir.
+    val tempDir = Utils.createTempDir()
+    val checkpointDir = File.createTempFile("temp", "", tempDir)
+    checkpointDir.delete()
+    sc.setCheckpointDir(checkpointDir.toString)
+
+    val latch = new CountDownLatch(2)
+    val semaphore1 = new Semaphore(0)
+    val semaphore2 = new Semaphore(0)
+
+    val rdd = new WrappedRDD(sc.makeRDD(1 to 100, 4))
+    rdd.checkpoint()
+
+    val checkpointRunnable = new Runnable {
+      override def run() = {
+        // Simply simulate what RDD.doCheckpoint() does here.
+        rdd.doCheckpointCalled = true
+        val checkpointData = rdd.checkpointData.get
+        RDDCheckpointData.synchronized {
+          if (checkpointData.cpState == CheckpointState.Initialized) {
+            checkpointData.cpState = CheckpointState.CheckpointingInProgress
+          }
+        }
+
+        val newRDD = checkpointData.doCheckpoint()
+
+        // Release semaphore1 after the job triggered by the checkpoint finished.
+        semaphore1.release()
+        semaphore2.acquire()
+        // Update our state and truncate the RDD lineage.
+        RDDCheckpointData.synchronized {
+          checkpointData.cpRDD = Some(newRDD)
+          checkpointData.cpState = CheckpointState.Checkpointed
+          rdd.markCheckpointed()
+        }
+        semaphore1.release()
+
+        latch.countDown()
+      }
+    }
+
+    val submitMissingTasksRunnable = new Runnable {
+      override def run() = {
+        // Simply simulate the process of submitMissingTasks.
+        val ser = SparkEnv.get.closureSerializer.newInstance()
+        semaphore1.acquire()
+        // Simulate task serialization while submitMissingTasks.
+        // The task is serialized with the rdd checkpoint not yet finished.
+        val cleanedFunc = sc.clean(Utils.getIteratorSize _)
+        val func = (ctx: TaskContext, it: Iterator[Int]) => cleanedFunc(it)
+        val taskBinaryBytes = JavaUtils.bufferToArray(
+          ser.serialize((rdd, func): AnyRef))
+        semaphore2.release()
+        semaphore1.acquire()
+        // The part is calculated with the rdd checkpoint already finished.
+        val (taskRdd, taskFunc) = ser.deserialize[(RDD[Int], (TaskContext, Iterator[Int]) => Unit)](
+          ByteBuffer.wrap(taskBinaryBytes), Thread.currentThread.getContextClassLoader)
+        val part = rdd.partitions(0)
+        intercept[ClassCastException] {
--- End diff --

I think this is not a "test"; it is just a "reproduce" of the problem you want 
to fix. We should prove that the code you added in `DAGScheduler.scala` fixes the 
problem, and that with the original code base a `ClassCastException` is raised.
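
For example, something along these lines, reusing the variables from the quoted test (a sketch only, not the actual suite change): with the fix in place, the serialized task and the partition come from the same checkpoint state, so the types should line up and no exception is expected.

```scala
// Sketch (assumes the DAGScheduler fix): both sides now see a consistent
// checkpoint state, so the partition types match and compute() can cast safely.
assert(taskRdd.partitions(0).getClass === part.getClass)
```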


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20222
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86008/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20222
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20222
  
**[Test build #86008 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86008/testReport)**
 for PR 20222 at commit 
[`440d11f`](https://github.com/apache/spark/commit/440d11f53a6f4b7531d96659c88a7e7961021778).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20242: [MINOR][BUILD] Fix Java linter errors

2018-01-11 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/20242
  
LGTM. @dongjoon-hyun, do the current changes include all the lint issues, or 
do you still have further changes?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OOM when ...

2018-01-11 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/20184
  
@liutang123 , can you please tell us how to reproduce your issue easily?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20216: [SPARK-23024][WEB-UI]Spark ui about the contents of the ...

2018-01-11 Thread guoxiaolongzte
Github user guoxiaolongzte commented on the issue:

https://github.com/apache/spark/pull/20216
  
@ajbozarth @srowen 
Fixed the code: added the sort arrow to the table pages to keep the feature 
consistent.

after fix:

![4](https://user-images.githubusercontent.com/26266482/34861201-29dcd1e4-f79e-11e7-8015-28c320e4b4bc.png)

![5](https://user-images.githubusercontent.com/26266482/34861202-2a114334-f79e-11e7-8deb-428836770bef.png)

![6](https://user-images.githubusercontent.com/26266482/34861203-2a3f6a3e-f79e-11e7-9c2a-924ea67b12c7.png)

![7](https://user-images.githubusercontent.com/26266482/34861204-2a70296c-f79e-11e7-915a-fc6a64e07108.png)

![8](https://user-images.githubusercontent.com/26266482/34861205-2aa42262-f79e-11e7-8c97-828ec24e0f71.png)




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20225
  
**[Test build #86020 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86020/testReport)**
 for PR 20225 at commit 
[`49f1eb6`](https://github.com/apache/spark/commit/49f1eb63840b80984542553ff1037f61e7149a83).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20239: [SPARK-23047][PYTHON][SQL] Change MapVector to NullableM...

2018-01-11 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20239
  
I'm not sure we can change to `NullableMapVector`, and I'm just worried about 
whether a plain `MapVector` can ever appear here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...

2018-01-11 Thread jose-torres
Github user jose-torres commented on the issue:

https://github.com/apache/spark/pull/20225
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...

2018-01-11 Thread jose-torres
Github user jose-torres commented on the issue:

https://github.com/apache/spark/pull/20225
  
The most recent test build failure is from an earlier commit, which I think 
is obsolete. I think #86007 is correct, but we should retest to confirm.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20222: [SPARK-23028] Bump master branch version to 2.4.0...

2018-01-11 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20222#discussion_r161140353
  
--- Diff: dev/run-tests-jenkins.py ---
@@ -181,8 +181,8 @@ def main():
     short_commit_hash = ghprb_actual_commit[0:7]
 
     # format: http://linux.die.net/man/1/timeout
-    # must be less than the timeout configured on Jenkins (currently 300m)
-    tests_timeout = "250m"
+    # must be less than the timeout configured on Jenkins (currently 350m)
+    tests_timeout = "300m"
--- End diff --

Ah, this was the root cause. Thanks, @HyukjinKwon .
Sorry, @shaneknapp . I sent an annoying email before seeing this here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...

2018-01-11 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20245
  
Nitpicking, but could you take a look? @gatorsmile @wzhfy @mbasmanova


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20245
  
**[Test build #86019 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86019/testReport)**
 for PR 20245 at commit 
[`473b37a`](https://github.com/apache/spark/commit/473b37a1d2abd74f66d29883a1e8bb488cf6bb94).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types...

2018-01-11 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/20245

[SPARK-21213][SQL][FOLLOWUP] Use compatible types for comparisons in 
compareAndGetNewStats

## What changes were proposed in this pull request?
This PR fixes the code to compare values using compatible types in `compareAndGetNewStats`.
The test below fails in the current master:
```
val oldStats2 = CatalogStatistics(sizeInBytes = BigInt(Long.MaxValue) * 2)
val newStats5 = CommandUtils.compareAndGetNewStats(
  Some(oldStats2), newTotalSize = BigInt(Long.MaxValue) * 2, None)
assert(newStats5.isEmpty)
```
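
Presumably the change is along these lines (a hedged sketch, not the exact patch; `oldStats` is the method's parameter): keep the old size as a `BigInt` instead of narrowing it to `Long`, so sizes above `Long.MaxValue` compare correctly.

```scala
// Sketch: the narrowing conversion is assumed to be the bug.
// Before (lossy when sizeInBytes > Long.MaxValue):
//   val oldTotalSize = oldStats.map(_.sizeInBytes.toLong).getOrElse(-1L)
// After (compare BigInt to BigInt, no overflow):
val oldTotalSize = oldStats.map(_.sizeInBytes).getOrElse(BigInt(-1))
```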

## How was this patch tested?
Added some tests in `CommandUtilsSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark SPARK-21213-FOLLOWUP

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20245


commit 473b37a1d2abd74f66d29883a1e8bb488cf6bb94
Author: Takeshi Yamamuro 
Date:   2018-01-12T04:57:32Z

Use compatible types for comparisons in compareAndGetNewStats




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...

2018-01-11 Thread ivoson
GitHub user ivoson reopened a pull request:

https://github.com/apache/spark/pull/20244

[SPARK-23053][CORE] taskBinarySerialization and task partitions calculate 
in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

…d is the same when calculate taskSerialization and task partitions

Change-Id: Ib9839ca552653343d264135c116742effa6feb60

## What changes were proposed in this pull request?

When we run concurrent jobs using the same rdd which is marked to do 
checkpoint, one job may have finished its spark job and started the process of 
RDD.doCheckpoint while another job is submitted, in which case submitStage and 
submitMissingTasks are called. 
[submitMissingTasks](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961)
 serializes taskBinaryBytes and calculates the task partitions, both of which are 
affected by the checkpoint status. If the former is calculated before 
doCheckpoint has finished while the latter is calculated after it has finished, 
then when the task runs, rdd.compute is called, and for some rdds with a 
particular partition type, such as 
[MapWithStateRDD](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala)
 which cast the partition type, we get a ClassCastException because the 
part param is actually a CheckpointRDDPartition.
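
A minimal sketch of the idea (assuming the patch takes roughly this shape inside `DAGScheduler.submitMissingTasks`; `stage`, `func`, and `closureSerializer` come from that method's scope): take both values under the lock that RDD checkpointing uses, so they observe one consistent checkpoint state.

```scala
// Sketch: serialize the task binary and read the partitions atomically with
// respect to RDDCheckpointData, so a concurrent doCheckpoint() cannot change
// the RDD's lineage between the two reads.
var taskBinaryBytes: Array[Byte] = null
var partitions: Array[Partition] = null
RDDCheckpointData.synchronized {
  taskBinaryBytes = JavaUtils.bufferToArray(
    closureSerializer.serialize((stage.rdd, func): AnyRef))
  partitions = stage.rdd.partitions
}
```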

## How was this patch tested?

The existing UTs, plus a new test case added in DAGSchedulerSuite to show the 
exception case.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ivoson/spark branch-taskpart-mistype

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20244


commit 0dea573e9e724d591803b73f678e14f94e0af447
Author: huangtengfei 
Date:   2018-01-12T02:53:29Z

submitMissingTasks should make sure the checkpoint status of stage.rdd is 
the same when calculate taskSerialization and task partitions

Change-Id: Ib9839ca552653343d264135c116742effa6feb60




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...

2018-01-11 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/20244
  
reopen this...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...

2018-01-11 Thread ivoson
Github user ivoson commented on the issue:

https://github.com/apache/spark/pull/20244
  
@xuanyuanking could you review this please?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20244
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...

2018-01-11 Thread ivoson
Github user ivoson closed the pull request at:

https://github.com/apache/spark/pull/20244


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20225
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86007/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...

2018-01-11 Thread ivoson
GitHub user ivoson opened a pull request:

https://github.com/apache/spark/pull/20244

[SPARK-23053][CORE] taskBinarySerialization and task partitions calculate 
in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

…d is the same when calculate taskSerialization and task partitions

Change-Id: Ib9839ca552653343d264135c116742effa6feb60

## What changes were proposed in this pull request?

When we run concurrent jobs using the same rdd which is marked to do 
checkpoint, one job may have finished its spark job and started the process of 
RDD.doCheckpoint while another job is submitted, in which case submitStage and 
submitMissingTasks are called. 
[submitMissingTasks](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961)
 serializes taskBinaryBytes and calculates the task partitions, both of which are 
affected by the checkpoint status. If the former is calculated before 
doCheckpoint has finished while the latter is calculated after it has finished, 
then when the task runs, rdd.compute is called, and for some rdds with a 
particular partition type, such as 
[MapWithStateRDD](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala)
 which cast the partition type, we get a ClassCastException because the 
part param is actually a CheckpointRDDPartition.

## How was this patch tested?

The existing UTs, plus a new test case added in DAGSchedulerSuite to show the 
exception case.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ivoson/spark branch-taskpart-mistype

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20244


commit 0dea573e9e724d591803b73f678e14f94e0af447
Author: huangtengfei 
Date:   2018-01-12T02:53:29Z

submitMissingTasks should make sure the checkpoint status of stage.rdd is 
the same when calculate taskSerialization and task partitions

Change-Id: Ib9839ca552653343d264135c116742effa6feb60




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20225
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20163
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20163
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86003/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20163
  
**[Test build #86003 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86003/testReport)**
 for PR 20163 at commit 
[`d307cee`](https://github.com/apache/spark/commit/d307ceeaa581dc219ae744482a6d3c374b3a3025).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20232
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86002/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20232
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20232
  
**[Test build #86002 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86002/testReport)**
 for PR 20232 at commit 
[`b5a7ad2`](https://github.com/apache/spark/commit/b5a7ad28eceb01d32ce15e49ddbb1710c6973d34).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20209
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20209
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86001/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20209
  
**[Test build #86001 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86001/testReport)**
 for PR 20209 at commit 
[`c7db798`](https://github.com/apache/spark/commit/c7db798d7faf343a27c778fd48f2cdd77d732107).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20225
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85997/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...

2018-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20225
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...

2018-01-11 Thread ho3rexqj
Github user ho3rexqj commented on a diff in the pull request:

https://github.com/apache/spark/pull/20183#discussion_r161135870
  
--- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -206,37 +206,51 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long)
 
   private def readBroadcastBlock(): T = Utils.tryOrIOException {
     TorrentBroadcast.synchronized {
-      setConf(SparkEnv.get.conf)
-      val blockManager = SparkEnv.get.blockManager
-      blockManager.getLocalValues(broadcastId) match {
-        case Some(blockResult) =>
-          if (blockResult.data.hasNext) {
-            val x = blockResult.data.next().asInstanceOf[T]
-            releaseLock(broadcastId)
-            x
-          } else {
-            throw new SparkException(s"Failed to get locally stored broadcast data: $broadcastId")
-          }
-        case None =>
-          logInfo("Started reading broadcast variable " + id)
-          val startTimeMs = System.currentTimeMillis()
-          val blocks = readBlocks()
-          logInfo("Reading broadcast variable " + id + " took" + Utils.getUsedTimeMs(startTimeMs))
-
-          try {
-            val obj = TorrentBroadcast.unBlockifyObject[T](
-              blocks.map(_.toInputStream()), SparkEnv.get.serializer, compressionCodec)
-            // Store the merged copy in BlockManager so other tasks on this executor don't
-            // need to re-fetch it.
-            val storageLevel = StorageLevel.MEMORY_AND_DISK
-            if (!blockManager.putSingle(broadcastId, obj, storageLevel, tellMaster = false)) {
-              throw new SparkException(s"Failed to store $broadcastId in BlockManager")
+      val broadcastCache = SparkEnv.get.broadcastManager.cachedValues
+
+      Option(broadcastCache.get(broadcastId)).map(_.asInstanceOf[T]).getOrElse({
+        setConf(SparkEnv.get.conf)
--- End diff --

No, sorry - the cache update takes place within that block.  With the 
exception of those blocks (lines 220-222 and lines 244-246), yes.
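
For readers following along, the lookup-then-populate pattern under discussion is roughly the following (a sketch only, assuming a commons-collections `ReferenceMap` with hard keys and weak values; `getOrCompute` is a hypothetical helper, not part of the patch):

```scala
import org.apache.commons.collections.map.{AbstractReferenceMap, ReferenceMap}

val cache = new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)

def getOrCompute[T](key: AnyRef)(compute: => T): T = cache.synchronized {
  // Hit: the weak reference is still alive, so reuse the instantiated value.
  Option(cache.get(key)).map(_.asInstanceOf[T]).getOrElse {
    val value = compute
    // Miss: populate the cache; since the value is held weakly, the GC may
    // reclaim it once no task holds a strong reference to it.
    cache.put(key, value.asInstanceOf[AnyRef])
    value
  }
}
```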


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


