[GitHub] spark issue #21260: [SPARK-23529][K8s] Support mounting volumes

2018-07-07 Thread liyinan926
Github user liyinan926 commented on the issue:

https://github.com/apache/spark/pull/21260
  
@felixcheung This feature was discussed and this PR was started before 
https://issues.apache.org/jira/browse/SPARK-24434 was even brought up. Being 
able to mount commonly used types of volumes seems super useful for some users, 
so it might make sense to accept it while 
https://issues.apache.org/jira/browse/SPARK-24434 is still going through design 
review.


---




[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-07-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21305#discussion_r200828393
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -344,6 +344,36 @@ case class Join(
   }
 }
 
+/**
+ * Append data to an existing DataSourceV2 table.
+ */
+case class AppendData(
+table: LogicalPlan,
--- End diff --

Then it seems that the above code comment can be updated?


---




[GitHub] spark pull request #21664: [SPARK-24678][CORE] NoClassDefFoundError will not...

2018-07-07 Thread caneGuy
Github user caneGuy commented on a diff in the pull request:

https://github.com/apache/spark/pull/21664#discussion_r200828384
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1049,6 +1049,13 @@ class DAGScheduler(
 abortStage(stage, s"Task serialization failed: 
$e\n${Utils.exceptionString(e)}", Some(e))
 runningStages -= stage
 return
+
+  case e: NoClassDefFoundError =>
--- End diff --

Actually, it will cause the job to hang since the state is never updated.
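
For context, a minimal sketch of the branch under discussion, assuming the surrounding DAGScheduler scope (`stage`, `runningStages`, `abortStage`); the message text is an assumption:

```scala
case e: NoClassDefFoundError =>
  // Abort the stage and clean up scheduler state, mirroring the
  // serialization-failure branch above, so the job fails fast instead of
  // hanging with stale state.
  abortStage(stage, s"Task failed: $e\n${Utils.exceptionString(e)}", Some(e))
  runningStages -= stage
  return
```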


---




[GitHub] spark issue #21260: [SPARK-23529][K8s] Support mounting volumes

2018-07-07 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/21260
  
@skonto is it better to generalize the approach to match the one in 
https://issues.apache.org/jira/browse/SPARK-24435?

Not sure if @mccheah @foxish @erikerlandson have any last thoughts.


---




[GitHub] spark issue #21552: [SPARK-24544][SQL] Print actual failure cause when look ...

2018-07-07 Thread caneGuy
Github user caneGuy commented on the issue:

https://github.com/apache/spark/pull/21552
  
@maropu Maybe I will do this check, as @cloud-fan mentioned?


---




[GitHub] spark pull request #21552: [SPARK-24544][SQL] Print actual failure cause whe...

2018-07-07 Thread caneGuy
Github user caneGuy commented on a diff in the pull request:

https://github.com/apache/spark/pull/21552#discussion_r200828333
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala ---
@@ -131,6 +132,8 @@ private[sql] class HiveSessionCatalog(
 Try(super.lookupFunction(funcName, children)) match {
   case Success(expr) => expr
   case Failure(error) =>
+logWarning(s"Encounter a failure during looking up function:" +
+  s" ${Utils.exceptionString(error)}")
 if (functionRegistry.functionExists(funcName)) {
--- End diff --

@viirya Thanks, I will set the cause for `NoSuchFunctionException` later.
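
A hypothetical sketch of attaching the cause (the `db` value and the use of `initCause` are assumptions; `initCause` only works when the exception is created without a cause):

```scala
case Failure(error) =>
  logWarning(s"Encountered a failure during function lookup:" +
    s" ${Utils.exceptionString(error)}")
  // Hypothetical: surface the original failure as the cause instead of
  // swallowing it when the function is not registered anywhere.
  val notFound = new NoSuchFunctionException(db, funcName.funcName)
  notFound.initCause(error)
  throw notFound
```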


---




[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21608
  
**[Test build #92715 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92715/testReport)**
 for PR 21608 at commit 
[`06a275b`](https://github.com/apache/spark/commit/06a275b92646f3ccdfa8dbc29af5cfd82f518007).


---




[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21608
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92713/
Test PASSed.


---




[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21608
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21608
  
**[Test build #92713 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92713/testReport)**
 for PR 21608 at commit 
[`f9b382d`](https://github.com/apache/spark/commit/f9b382d9bb3d9d722a6afe7b36a44d9764b0145a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-07 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/21698
  
@jiangxb1987 Any closure sensitive to iteration order [1] is affected by 
this, under the right set of circumstances.
If we cannot solve it in a principled manner (making the shuffle repeatable, 
which I believe you have investigated and found to be difficult?), the next 
best thing until we have a performant solution would be to expose it to users 
and have them deal with it (which is what I did, for example), with hints on 
how to accomplish it.

The proposed solution will cause cascading failures for non-trivial 
applications (chains of shuffles), introduce high cost, and can unfortunately 
cause application failures and unpredictable SLAs.


[1] I gave the examples of zip* and sampling, but really any user-defined 
closure is affected, and we cannot special-case all of them.
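
To make the footnote concrete, a hedged illustration (assuming a local SparkContext `sc`) of an order-sensitive closure after a repartition:

```scala
// zipWithIndex derives each index from the within-partition iteration order.
// If a fetch failure forces some shuffle partitions to be recomputed, and the
// recomputed iteration order differs from the first run, the same element can
// receive a different index, silently corrupting downstream results.
val indexed = sc.parallelize(1 to 1000000)
  .repartition(100) // round-robin shuffle whose output order is not repeatable
  .zipWithIndex()
```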


---




[GitHub] spark issue #21728: [SPARK-24759] [SQL] No reordering keys for broadcast has...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21728
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92714/
Test FAILed.


---




[GitHub] spark issue #21728: [SPARK-24759] [SQL] No reordering keys for broadcast has...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21728
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21728: [SPARK-24759] [SQL] No reordering keys for broadcast has...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21728
  
**[Test build #92714 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92714/testReport)**
 for PR 21728 at commit 
[`194991b`](https://github.com/apache/spark/commit/194991b0e8f6375ede6b615813974bbcf75ef036).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21707: Update for spark 2.2.2 release

2018-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21707#discussion_r200826322
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 ---
@@ -160,7 +160,7 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 
 object PROCESS_TABLES extends QueryTest with SQLTestUtils {
   // Tests the latest version of every release line.
-  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.1")
+  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.2")
--- End diff --

@tgravescs . Could you replace 2.1.2 with 2.1.3, too?

cc @vanzin (2.1.3 release manager).


---




[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-07-07 Thread edwinalu
Github user edwinalu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21221#discussion_r200826235
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class 
SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo) extends
  * Periodic updates from executors.
  * @param execId executor id
  * @param accumUpdates sequence of (taskId, stageId, stageAttemptId, 
accumUpdates)
+ * @param executorUpdates executor level metrics updates
  */
 @DeveloperApi
 case class SparkListenerExecutorMetricsUpdate(
 execId: String,
-accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])])
+accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])],
+executorUpdates: Option[Array[Long]] = None)
+  extends SparkListenerEvent
+
+/**
+ * Peak metric values for the executor for the stage, written to the 
history log at stage
+ * completion.
+ * @param execId executor id
+ * @param stageId stage id
+ * @param stageAttemptId stage attempt
+ * @param executorMetrics executor level metrics, indexed by 
MetricGetter.values
+ */
+@DeveloperApi
+case class SparkListenerStageExecutorMetrics(
+execId: String,
+stageId: Int,
+stageAttemptId: Int,
+executorMetrics: Array[Long])
--- End diff --

We can change back to using an ExecutorMetrics class in this case.

The plan was for any new metrics to be added at the end, so that there 
wouldn't be any change in ordering, and executorMetrics could be changed to 
an immutable Seq[Long]; but there would still be the issue of having to 
reference MetricGetter to find out how the metrics are indexed.
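
A sketch of the wrapper being considered; the `metricIndex` map is a stand-in for however MetricGetter exposes its ordering:

```scala
// Wrap the raw Array[Long] so consumers look up metrics by name instead of
// hard-coding the MetricGetter index of each value.
class ExecutorMetrics(
    private val values: Array[Long],
    metricIndex: Map[String, Int]) {
  def getMetricValue(metricName: String): Long = values(metricIndex(metricName))
}
```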


---




[GitHub] spark issue #21584: [SPARK-24433][K8S][WIP] Initial R Bindings for SparkR on...

2018-07-07 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/21584
  
IMO it's fine to have one version supported in the image and stick with 
that.
The tricky thing is getting maintainers to keep updating/testing the newer 
versions in the images (we have a history of not being able to keep up).

Would it be possible for the integration test to build the image by running 
`docker-image-tool.sh -m -t testing build`, and then run the integration 
test with it?


---




[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-07-07 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21305#discussion_r200825129
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -240,21 +238,27 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
 
 val cls = DataSource.lookupDataSource(source, 
df.sparkSession.sessionState.conf)
 if (classOf[DataSourceV2].isAssignableFrom(cls)) {
-  val ds = cls.newInstance()
-  ds match {
+  val source = cls.newInstance().asInstanceOf[DataSourceV2]
+  source match {
 case ws: WriteSupport =>
-  val options = new DataSourceOptions((extraOptions ++
-DataSourceV2Utils.extractSessionConfigs(
-  ds = ds.asInstanceOf[DataSourceV2],
-  conf = df.sparkSession.sessionState.conf)).asJava)
-  // Using a timestamp and a random UUID to distinguish different 
writing jobs. This is good
-  // enough as there won't be tons of writing jobs created at the 
same second.
-  val jobId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
-.format(new Date()) + "-" + UUID.randomUUID()
-  val writer = ws.createWriter(jobId, df.logicalPlan.schema, mode, 
options)
-  if (writer.isPresent) {
+  val options = extraOptions ++
+  DataSourceV2Utils.extractSessionConfigs(source, 
df.sparkSession.sessionState.conf)
+
+  val relation = DataSourceV2Relation.create(source, options.toMap)
+  if (mode == SaveMode.Append) {
 runCommand(df.sparkSession, "save") {
-  WriteToDataSourceV2(writer.get(), df.logicalPlan)
+  AppendData.byName(relation, df.logicalPlan)
+}
+
+  } else {
+val writer = ws.createWriter(
+  UUID.randomUUID.toString, 
df.logicalPlan.output.toStructType, mode,
--- End diff --

How would random UUIDs conflict?


---




[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-07-07 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21305#discussion_r200824906
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java ---
@@ -38,15 +38,16 @@
* If this method fails (by throwing an exception), the action will fail 
and no Spark job will be
* submitted.
*
-   * @param jobId A unique string for the writing job. It's possible that 
there are many writing
-   *  jobs running at the same time, and the returned {@link 
DataSourceWriter} can
-   *  use this job id to distinguish itself from other jobs.
+   * @param writeUUID A unique string for the writing job. It's possible 
that there are many writing
--- End diff --

This is not the ID of the Spark job that is writing. I think the UUID name 
is clearer about what is actually passed: a unique string that identifies 
the write. There's also no need to make the string more complicated than a 
UUID since there are no guarantees about it.


---




[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-07-07 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21305#discussion_r200824639
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -2120,6 +2122,99 @@ class Analyzer(
 }
   }
 
+  /**
+   * Resolves columns of an output table from the data in a logical plan. 
This rule will:
+   *
+   * - Reorder columns when the write is by name
+   * - Insert safe casts when data types do not match
+   * - Insert aliases when column names do not match
+   * - Detect plans that are not compatible with the output table and 
throw AnalysisException
+   */
+  object ResolveOutputRelation extends Rule[LogicalPlan] {
+override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  case append @ AppendData(table: NamedRelation, query, isByName)
+  if table.resolved && query.resolved && !append.resolved =>
+val projection = resolveOutputColumns(table.name, table.output, 
query, isByName)
+
+if (projection != query) {
+  append.copy(query = projection)
+} else {
+  append
+}
+}
+
+def resolveOutputColumns(
+tableName: String,
+expected: Seq[Attribute],
+query: LogicalPlan,
+byName: Boolean): LogicalPlan = {
+
+  if (expected.size < query.output.size) {
+throw new AnalysisException(
+  s"""Cannot write to '$tableName', too many data columns:
+ |Table columns: ${expected.map(_.name).mkString(", ")}
+ |Data columns: ${query.output.map(_.name).mkString(", 
")}""".stripMargin)
+  }
+
+  val errors = new mutable.ArrayBuffer[String]()
+  val resolved: Seq[NamedExpression] = if (byName) {
+expected.flatMap { outAttr =>
+  query.resolveQuoted(outAttr.name, resolver) match {
+case Some(inAttr) if inAttr.nullable && !outAttr.nullable =>
+  errors += s"Cannot write nullable values to non-null column 
'${outAttr.name}'"
+  None
+
+case Some(inAttr) if 
!outAttr.dataType.sameType(inAttr.dataType) =>
+  Some(upcast(inAttr, outAttr))
+
+case Some(inAttr) =>
+  Some(inAttr) // matches nullability, datatype, and name
+
+case _ =>
+  errors += s"Cannot find data for output column 
'${outAttr.name}'"
+  None
+  }
+}
+
+  } else {
+if (expected.size > query.output.size) {
--- End diff --

That check is the other direction: not enough columns.

When matching by position, we need to have the same number of columns, so we 
add this check (we already know that there aren't too many columns, so this 
checks for too few). When matching by name, we can call out the specific 
columns that are missing, which is why we do the validation differently for 
the two cases.
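
A sketch of the by-position arm implied by that explanation (the message wording is an assumption):

```scala
// By position: the earlier check already rejected too many data columns, so
// rejecting expected.size > query.output.size leaves exactly matching arity.
if (expected.size > query.output.size) {
  throw new AnalysisException(
    s"""Cannot write to '$tableName', not enough data columns:
       |Table columns: ${expected.map(_.name).mkString(", ")}
       |Data columns: ${query.output.map(_.name).mkString(", ")}""".stripMargin)
}
```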


---




[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-07-07 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21305#discussion_r200824599
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -2120,6 +2122,99 @@ class Analyzer(
 }
   }
 
+  /**
+   * Resolves columns of an output table from the data in a logical plan. 
This rule will:
+   *
+   * - Reorder columns when the write is by name
+   * - Insert safe casts when data types do not match
+   * - Insert aliases when column names do not match
+   * - Detect plans that are not compatible with the output table and 
throw AnalysisException
+   */
+  object ResolveOutputRelation extends Rule[LogicalPlan] {
+override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  case append @ AppendData(table: NamedRelation, query, isByName)
+  if table.resolved && query.resolved && !append.resolved =>
+val projection = resolveOutputColumns(table.name, table.output, 
query, isByName)
+
+if (projection != query) {
+  append.copy(query = projection)
+} else {
+  append
+}
+}
+
+def resolveOutputColumns(
+tableName: String,
+expected: Seq[Attribute],
+query: LogicalPlan,
+byName: Boolean): LogicalPlan = {
+
+  if (expected.size < query.output.size) {
+throw new AnalysisException(
+  s"""Cannot write to '$tableName', too many data columns:
+ |Table columns: ${expected.map(_.name).mkString(", ")}
+ |Data columns: ${query.output.map(_.name).mkString(", 
")}""".stripMargin)
+  }
+
+  val errors = new mutable.ArrayBuffer[String]()
+  val resolved: Seq[NamedExpression] = if (byName) {
+expected.flatMap { outAttr =>
+  query.resolveQuoted(outAttr.name, resolver) match {
+case Some(inAttr) if inAttr.nullable && !outAttr.nullable =>
+  errors += s"Cannot write nullable values to non-null column 
'${outAttr.name}'"
--- End diff --

I would much rather have a job fail fast and give a clear error message 
than to fail during a write. I can see how adding such an assertion to the plan 
could be useful, so I'd consider it if someone wanted to add that feature 
later. Right now, though, I think this is good.


---




[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-07-07 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21305#discussion_r200824602
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -2120,6 +2122,99 @@ class Analyzer(
 }
   }
 
+  /**
+   * Resolves columns of an output table from the data in a logical plan. 
This rule will:
+   *
+   * - Reorder columns when the write is by name
+   * - Insert safe casts when data types do not match
+   * - Insert aliases when column names do not match
+   * - Detect plans that are not compatible with the output table and 
throw AnalysisException
+   */
+  object ResolveOutputRelation extends Rule[LogicalPlan] {
+override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  case append @ AppendData(table: NamedRelation, query, isByName)
+  if table.resolved && query.resolved && !append.resolved =>
+val projection = resolveOutputColumns(table.name, table.output, 
query, isByName)
+
+if (projection != query) {
+  append.copy(query = projection)
+} else {
+  append
+}
+}
+
+def resolveOutputColumns(
+tableName: String,
+expected: Seq[Attribute],
+query: LogicalPlan,
+byName: Boolean): LogicalPlan = {
+
+  if (expected.size < query.output.size) {
+throw new AnalysisException(
+  s"""Cannot write to '$tableName', too many data columns:
+ |Table columns: ${expected.map(_.name).mkString(", ")}
+ |Data columns: ${query.output.map(_.name).mkString(", 
")}""".stripMargin)
+  }
+
+  val errors = new mutable.ArrayBuffer[String]()
+  val resolved: Seq[NamedExpression] = if (byName) {
+expected.flatMap { outAttr =>
+  query.resolveQuoted(outAttr.name, resolver) match {
+case Some(inAttr) if inAttr.nullable && !outAttr.nullable =>
+  errors += s"Cannot write nullable values to non-null column 
'${outAttr.name}'"
+  None
+
+case Some(inAttr) if 
!outAttr.dataType.sameType(inAttr.dataType) =>
--- End diff --

Yes, I'll update to check nested fields.
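
A hedged sketch of such a nested check, assuming it lives alongside the rule above (inside Catalyst, where `sameType` is accessible):

```scala
import org.apache.spark.sql.types._

// Recurse into structs and arrays so nullability is enforced on nested
// fields too, not just on top-level columns.
def canWrite(in: DataType, out: DataType): Boolean = (in, out) match {
  case (StructType(inFields), StructType(outFields)) =>
    inFields.length == outFields.length &&
      inFields.zip(outFields).forall { case (i, o) =>
        i.name == o.name && (!i.nullable || o.nullable) && canWrite(i.dataType, o.dataType)
      }
  case (ArrayType(inElem, inHasNull), ArrayType(outElem, outHasNull)) =>
    (!inHasNull || outHasNull) && canWrite(inElem, outElem)
  case _ => in.sameType(out)
}
```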


---




[GitHub] spark issue #21728: [SPARK-24759] [SQL] No reordering keys for broadcast has...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21728
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21728: [SPARK-24759] [SQL] No reordering keys for broadcast has...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21728
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/750/
Test PASSed.


---




[GitHub] spark issue #21728: [SPARK-24759] [SQL] No reordering keys for broadcast has...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21728
  
**[Test build #92714 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92714/testReport)**
 for PR 21728 at commit 
[`194991b`](https://github.com/apache/spark/commit/194991b0e8f6375ede6b615813974bbcf75ef036).


---




[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-07-07 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21305#discussion_r200824532
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -2120,6 +2122,99 @@ class Analyzer(
 }
   }
 
+  /**
+   * Resolves columns of an output table from the data in a logical plan. 
This rule will:
+   *
+   * - Reorder columns when the write is by name
+   * - Insert safe casts when data types do not match
+   * - Insert aliases when column names do not match
+   * - Detect plans that are not compatible with the output table and 
throw AnalysisException
+   */
+  object ResolveOutputRelation extends Rule[LogicalPlan] {
+override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  case append @ AppendData(table: NamedRelation, query, isByName)
--- End diff --

Yes, I agree.


---




[GitHub] spark pull request #21728: [SPARK-24759] [SQL] No reordering keys for broadc...

2018-07-07 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/21728

[SPARK-24759] [SQL] No reordering keys for broadcast hash join

## What changes were proposed in this pull request?

As the implementation of the broadcast hash join is independent of the 
input hash partitioning, reordering keys is not necessary. Thus, we solve this 
issue by simply removing the broadcast hash join from the reordering rule in 
EnsureRequirements.
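
For illustration, a sketch approximating the rule after this change (class and helper names approximate the real code in EnsureRequirements):

```scala
private def reorderJoinPredicates(plan: SparkPlan): SparkPlan = plan match {
  // Shuffled hash join and sort-merge join depend on the children's hash
  // partitioning, so their keys are still reordered.
  case ShuffledHashJoinExec(leftKeys, rightKeys, joinType, buildSide, condition, left, right) =>
    val (l, r) = reorderJoinKeys(leftKeys, rightKeys, left.outputPartitioning, right.outputPartitioning)
    ShuffledHashJoinExec(l, r, joinType, buildSide, condition, left, right)
  case SortMergeJoinExec(leftKeys, rightKeys, joinType, condition, left, right) =>
    val (l, r) = reorderJoinKeys(leftKeys, rightKeys, left.outputPartitioning, right.outputPartitioning)
    SortMergeJoinExec(l, r, joinType, condition, left, right)
  // BroadcastHashJoinExec is intentionally no longer matched: broadcast joins
  // ignore input hash partitioning, so key order does not matter.
  case other => other
}
```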

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark cleanER

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21728.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21728


commit 194991b0e8f6375ede6b615813974bbcf75ef036
Author: Xiao Li 
Date:   2018-07-07T23:06:39Z

remove BroadcastHashJoinExec from reorderJoinPredicates




---




[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for ca...

2018-07-07 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21306#discussion_r200824504
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/catalog/TableCatalog.java
 ---
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.catalog;
+
+import org.apache.spark.sql.catalyst.TableIdentifier;
+import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
+import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException;
+import org.apache.spark.sql.catalyst.expressions.Expression;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+
+public interface TableCatalog {
+  /**
+   * Load table metadata by {@link TableIdentifier identifier} from the 
catalog.
+   *
+   * @param ident a table identifier
+   * @return the table's metadata
+   * @throws NoSuchTableException If the table doesn't exist.
+   */
+  Table loadTable(TableIdentifier ident) throws NoSuchTableException;
+
+  /**
+   * Create a table in the catalog.
+   *
+   * @param ident a table identifier
+   * @param schema the schema of the new table, as a struct type
+   * @return metadata for the new table
+   * @throws TableAlreadyExistsException If a table already exists for the 
identifier
+   */
+  default Table createTable(TableIdentifier ident,
+StructType schema) throws 
TableAlreadyExistsException {
+return createTable(ident, schema, Collections.emptyList(), 
Collections.emptyMap());
+  }
+
+  /**
+   * Create a table in the catalog.
+   *
+   * @param ident a table identifier
+   * @param schema the schema of the new table, as a struct type
+   * @param properties a string map of table properties
+   * @return metadata for the new table
+   * @throws TableAlreadyExistsException If a table already exists for the 
identifier
+   */
+  default Table createTable(TableIdentifier ident,
+StructType schema,
+Map&lt;String, String&gt; properties) throws 
TableAlreadyExistsException {
+return createTable(ident, schema, Collections.emptyList(), properties);
+  }
+
+  /**
+   * Create a table in the catalog.
+   *
+   * @param ident a table identifier
+   * @param schema the schema of the new table, as a struct type
+   * @param partitions a list of expressions to use for partitioning data 
in the table
+   * @param properties a string map of table properties
+   * @return metadata for the new table
+   * @throws TableAlreadyExistsException If a table already exists for the 
identifier
+   */
+  Table createTable(TableIdentifier ident,
+StructType schema,
+List&lt;Expression&gt; partitions,
--- End diff --

I wouldn't say this way of passing partitioning is a new feature. It's just 
a generalization of the existing partitioning that allows us to pass any type 
of partition, whether it is bucketing or column-based.

As for open discussion, this was proposed in the SPIP that was fairly 
widely read and commented on. That SPIP was posted to the dev list a few times, 
too. I do appreciate you wanting to make sure there's been a chance for the 
community to discuss it, but there has been plenty of opportunity to comment. 
At this point, I think it's reasonable to move forward with the implementation.
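
For illustration, a hypothetical use of the proposed interface (the `catalog` instance, table name, and columns are assumptions):

```scala
import java.util.Collections
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types._

// Create an unpartitioned table with a single table property.
val table = catalog.createTable(
  TableIdentifier("events", Some("db")),
  new StructType().add("ts", TimestampType).add("level", StringType),
  Collections.emptyList[Expression](),
  Collections.singletonMap("provider", "parquet"))
```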


---




[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21608
  
**[Test build #92713 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92713/testReport)**
 for PR 21608 at commit 
[`f9b382d`](https://github.com/apache/spark/commit/f9b382d9bb3d9d722a6afe7b36a44d9764b0145a).


---




[GitHub] spark pull request #21608: [SPARK-24626] [SQL] Improve location size calcula...

2018-07-07 Thread Achuth17
Github user Achuth17 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21608#discussion_r200824025
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
 ---
@@ -47,15 +48,26 @@ object CommandUtils extends Logging {
 }
   }
 
-  def calculateTotalSize(sessionState: SessionState, catalogTable: 
CatalogTable): BigInt = {
+def calculateTotalSize(spark: SparkSession, catalogTable: 
CatalogTable): BigInt = {
+
+val sessionState = spark.sessionState
+val stagingDir = 
sessionState.conf.getConfString("hive.exec.stagingdir", ".hive-staging")
+
 if (catalogTable.partitionColumnNames.isEmpty) {
-  calculateLocationSize(sessionState, catalogTable.identifier, 
catalogTable.storage.locationUri)
+  calculateLocationSize(sessionState, catalogTable.identifier,
+  catalogTable.storage.locationUri)
 } else {
   // Calculate table size as a sum of the visible partitions. See 
SPARK-21079
   val partitions = 
sessionState.catalog.listPartitions(catalogTable.identifier)
-  partitions.map { p =>
-calculateLocationSize(sessionState, catalogTable.identifier, 
p.storage.locationUri)
-  }.sum
+  val paths = partitions.map(x => new 
Path(x.storage.locationUri.get.getPath))
+  val pathFilter = new PathFilter {
+override def accept(path: Path): Boolean = {
+  !path.getName.startsWith(stagingDir)
+}
+  }
+  val fileStatusSeq = InMemoryFileIndex.bulkListLeafFiles(paths,
+sessionState.newHadoopConf(), pathFilter, spark).flatMap(x => x._2)
--- End diff --

Thank you, I have made the changes. Can you review this? 


---




[GitHub] spark pull request #21727: [SPARK-24757][SQL] Improving the error message fo...

2018-07-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21727


---




[GitHub] spark issue #21727: [SPARK-24757][SQL] Improving the error message for broad...

2018-07-07 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/21727
  
Merging to master. Thanks!


---




[GitHub] spark issue #21727: [SPARK-24757][SQL] Improving the error message for broad...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21727
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92712/
Test PASSed.


---




[GitHub] spark issue #21727: [SPARK-24757][SQL] Improving the error message for broad...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21727
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21727: [SPARK-24757][SQL] Improving the error message for broad...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21727
  
**[Test build #92712 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92712/testReport)**
 for PR 21727 at commit 
[`86587ed`](https://github.com/apache/spark/commit/86587edca7c1345b8a3687877b598d8389fbd56b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21667
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21667
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92711/
Test PASSed.


---




[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21667
  
**[Test build #92711 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92711/testReport)**
 for PR 21667 at commit 
[`7266611`](https://github.com/apache/spark/commit/7266611b243000868c81f4538dd850c394fe7c20).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-07-07 Thread tengpeng
Github user tengpeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/20933#discussion_r200819285
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -241,39 +240,47 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
 val cls = DataSource.lookupDataSource(source, 
df.sparkSession.sessionState.conf)
 if (classOf[DataSourceV2].isAssignableFrom(cls)) {
   val ds = cls.newInstance()
-  ds match {
-case ws: WriteSupport =>
-  val options = new DataSourceOptions((extraOptions ++
-DataSourceV2Utils.extractSessionConfigs(
-  ds = ds.asInstanceOf[DataSourceV2],
-  conf = df.sparkSession.sessionState.conf)).asJava)
-  // Using a timestamp and a random UUID to distinguish different 
writing jobs. This is good
-  // enough as there won't be tons of writing jobs created at the 
same second.
-  val jobId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
-.format(new Date()) + "-" + UUID.randomUUID()
-  val writer = ws.createWriter(jobId, df.logicalPlan.schema, mode, 
options)
-  if (writer.isPresent) {
-runCommand(df.sparkSession, "save") {
-  WriteToDataSourceV2(writer.get(), df.logicalPlan)
-}
-  }
+  val (needToFallBackFileDataSourceV2, fallBackFileFormat) = ds match {
+case f: FileDataSourceV2 =>
+  val disabledV2Readers =
+
df.sparkSession.sessionState.conf.disabledV2FileDataSourceWriter.split(",")
+  (disabledV2Readers.contains(f.shortName), 
f.fallBackFileFormat.getCanonicalName)
+case _ => (false, source)
+  }
 
-// Streaming also uses the data source V2 API. So it may be that 
the data source implements
-// v2, but has no v2 implementation for batch writes. In that 
case, we fall back to saving
-// as though it's a V1 source.
-case _ => saveToV1Source()
+  if (ds.isInstanceOf[WriteSupport] && 
!needToFallBackFileDataSourceV2) {
+val options = new DataSourceOptions((extraOptions ++
+  DataSourceV2Utils.extractSessionConfigs(
+ds = ds.asInstanceOf[DataSourceV2],
+conf = df.sparkSession.sessionState.conf)).asJava)
+// Using a timestamp and a random UUID to distinguish different 
writing jobs. This is good
+// enough as there won't be tons of writing jobs created at the 
same second.
+val jobId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
+  .format(new Date()) + "-" + UUID.randomUUID()
+val writer = ds.asInstanceOf[WriteSupport]
+  .createWriter(jobId, df.logicalPlan.schema, mode, options)
--- End diff --

I am not sure I understand this: why do we use `.createWriter` here, but do 
not use `.createReader` in `DataFrameReader`? It seems "asymmetrical" to me.


---




[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21556
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92710/
Test PASSed.


---




[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21556
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21556
  
**[Test build #92710 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92710/testReport)**
 for PR 21556 at commit 
[`7128539`](https://github.com/apache/spark/commit/712853999442a611ce7b97db8dad43039268573e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21439
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21439
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92708/
Test PASSed.


---




[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21439
  
**[Test build #92708 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92708/testReport)**
 for PR 21439 at commit 
[`f3efb1b`](https://github.com/apache/spark/commit/f3efb1b97f9366839eacbc2611e39013f8c1fcfc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-07 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/21698
  
Thank you for your comments @mridulm !
We focus on resolving the RDD.repartition() correctness issue here in this 
PR, because it is most commonly used, and that we can still address the 
RDD.zip* issue using the similar approach. At first I was worried that the 
changes may be huge and trying to address the correctness issue for multiple 
operations may make it difficult to backport the PR. But now it turns out that 
the PR didn't change that much code, so maybe I can consider include the 
RDD.zip* fix in this PR too.

Since you are also deeply involved in the related discussion on the 
correctness issue caused by non-deterministic input for shuffle, you may also 
agree that there is actually no easy way to both guarantee correctness and 
don't cause serious performance drop-off. I have to insist that correctness 
always goes beyond performance concerns, and that we shall not expect users to 
always remember they may hit a known correctness bug in case of some use 
patterns.

As for the proposed solution, there are actually two ways to follow: Either 
you insert a local sort before a shuffle repartition (that's how we fixed the 
DataFrame.repartition()), or you always retry the whole stage with repartition 
on FetchFailure. The problem with the local-sort solution is that, it can't fix 
all the problems for RDD (since the data type of an RDD can be not sortable, 
and it's hard to construct a sorting for a generic type), also it can make the 
time consumption of repartition() increases by 3X ~ 5X. By applying the 
approach proposed in this PR, the performance shall keep the same in case no 
FetchFailure happens, and it shall works well for DataFrames as well as for 
RDDs.
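
For reference, a hedged sketch of the local-sort alternative mentioned above (a sketch, not this PR's approach, and it only compiles when the element type has an Ordering, which is exactly the limitation described):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sort each map partition before the shuffle so the repartition input is
// deterministic; correctness is restored at the cost of an extra sort.
def deterministicRepartition[T: Ordering : ClassTag](rdd: RDD[T], numPartitions: Int): RDD[T] =
  rdd.mapPartitions(iter => iter.toArray.sorted.iterator, preservesPartitioning = true)
    .repartition(numPartitions)
```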

I have to admit that if you have a big query running on a huge cluster and 
the tasks can easily hit FetchFailure, then you may see the job take more 
time to finish (or even fail due to reaching the max consecutive stage 
failure limit). But again, your big query may be producing wrong results 
without a patch, and I have to say that is even more unacceptable :( . As for 
the cascading cost, you are right, it makes things worse, and I don't have 
good advice for that issue.


---




[GitHub] spark issue #21727: [SPARK-24757][SQL] Improving the error message for broad...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21727
  
**[Test build #92712 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92712/testReport)**
 for PR 21727 at commit 
[`86587ed`](https://github.com/apache/spark/commit/86587edca7c1345b8a3687877b598d8389fbd56b).


---




[GitHub] spark issue #21727: [SPARK-24757][SQL] Improving the error message for broad...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21727
  
Can one of the admins verify this patch?


---





[GitHub] spark issue #21727: [SPARK-24757][SQL] Improving the error message for broad...

2018-07-07 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/21727
  
@hvanhovell Please have a look at the PR.


---




[GitHub] spark pull request #21727: [SPARK-24757][SQL] Improving the error message fo...

2018-07-07 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/21727

[SPARK-24757][SQL] Improving the error message for broadcast timeouts

## What changes were proposed in this pull request?

In the PR, I propose to provide a tip to the user on how to resolve the issue 
of timeout expiration for broadcast joins. In particular, they can increase 
the timeout via **spark.sql.broadcastTimeout** or disable broadcasting 
entirely by setting **spark.sql.autoBroadcastJoinThreshold** to `-1`.
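
For illustration, the two settings can be adjusted like this (assuming an active SparkSession `spark`; the values are examples only):

```scala
// Raise the broadcast timeout (in seconds), or disable broadcast joins
// entirely so Spark falls back to a shuffle-based join.
spark.conf.set("spark.sql.broadcastTimeout", "1200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```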

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 broadcast-timeout-error

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21727.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21727


commit 0d0b3f34a90469ba894a207639456b4b815a90e8
Author: Maxim Gekk 
Date:   2018-07-07T14:46:21Z

Improving the error message for broadcast timeouts

I added a recommendation for increasing broadcast timeout. This sentence is 
added to existing error message:

```
You can increase the timeout for broadcasts via 
${SQLConf.BROADCAST_TIMEOUT.key}
```

Author: Maxim Gekk 

Closes #2801 from MaxGekk/broadcast-error-message.

commit 86587edca7c1345b8a3687877b598d8389fbd56b
Author: Maxim Gekk 
Date:   2018-07-07T15:57:36Z

Remove empty line in imports




---




[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21667
  
**[Test build #92711 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92711/testReport)**
 for PR 21667 at commit 
[`7266611`](https://github.com/apache/spark/commit/7266611b243000868c81f4538dd850c394fe7c20).


---




[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21667
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21667
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/749/
Test PASSed.


---




[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

2018-07-07 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/21667
  
retest this please.


---




[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21192
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92707/
Test PASSed.


---




[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21192
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21192
  
**[Test build #92707 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92707/testReport)**
 for PR 21192 at commit 
[`5384c07`](https://github.com/apache/spark/commit/5384c073a0761dbe24ec52e9474d618535ad8f69).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21669
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21669
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92705/
Test PASSed.


---




[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21669
  
**[Test build #92705 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92705/testReport)**
 for PR 21669 at commit 
[`13b3adc`](https://github.com/apache/spark/commit/13b3adc5ffb55fbfd6572089b1f54e8bca393494).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21657
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92706/
Test PASSed.


---




[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21657
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21657
  
**[Test build #92706 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92706/testReport)**
 for PR 21657 at commit 
[`d5921f0`](https://github.com/apache/spark/commit/d5921f08d8efa00f64f01d005e843291568c1e80).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

2018-07-07 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/21556#discussion_r200813391
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
 ---
@@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: 
Boolean, pushDownStartWith:
   (n: String, v: Any) => FilterApi.eq(
 intColumn(n),
 Option(v).map(date => 
dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
+
+case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
--- End diff --

Add a check method, call it from `canMakeFilterOn`, and add a test case:
```scala
val decimal = new JBigDecimal(10).setScale(scale)
assert(decimal.scale() === scale)
assertResult(Some(lt(intColumn("cdecimal1"), 1000: Integer))) {
  parquetFilters.createFilter(parquetSchema, sources.LessThan("cdecimal1", decimal))
}

val decimal1 = new JBigDecimal(10).setScale(scale + 1)
assert(decimal1.scale() === scale + 1)

assertResult(None) {
  parquetFilters.createFilter(parquetSchema, sources.LessThan("cdecimal1", decimal1))
}
```
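
For context, a minimal sketch of the kind of scale guard that could back `canMakeFilterOn` and make the second assertion above return `None` (the helper name is illustrative, not necessarily what this PR ends up with):
```scala
import java.math.{BigDecimal => JBigDecimal}
import org.apache.parquet.schema.DecimalMetadata

// Only push a decimal predicate down to Parquet when the literal's scale
// matches the column's declared scale; otherwise the unscaled values are not
// comparable and createFilter should return None.
private def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean =
  value match {
    case decimal: JBigDecimal => decimal.scale == decimalMeta.getScale
    case _ => false
  }
```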


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21556
  
**[Test build #92710 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92710/testReport)**
 for PR 21556 at commit 
[`7128539`](https://github.com/apache/spark/commit/712853999442a611ce7b97db8dad43039268573e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21556
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/748/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21556
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21690: [SPARK-24713]AppMatser of spark streaming kafka OOM if t...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21690
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92709/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21690: [SPARK-24713]AppMatser of spark streaming kafka OOM if t...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21690
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21690: [SPARK-24713]AppMatser of spark streaming kafka OOM if t...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21690
  
**[Test build #92709 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92709/testReport)**
 for PR 21690 at commit 
[`d1a8c60`](https://github.com/apache/spark/commit/d1a8c605e163bc09d1329cbd90560cc5165de555).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21146: [SPARK-23654][BUILD] remove jets3t as a dependenc...

2018-07-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/21146#discussion_r200812101
  
--- Diff: dev/deps/spark-deps-hadoop-2.6 ---
@@ -21,8 +21,6 @@ automaton-1.11-8.jar
 avro-1.7.7.jar
 avro-ipc-1.7.7.jar
 avro-mapred-1.7.7-hadoop2.jar
-base64-2.3.8.jar
-bcprov-jdk15on-1.58.jar
--- End diff --

Sounds like it's already been removed, so the other 'remnants' should go.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21690: [SPARK-24713]AppMatser of spark streaming kafka OOM if t...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21690
  
**[Test build #92709 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92709/testReport)**
 for PR 21690 at commit 
[`d1a8c60`](https://github.com/apache/spark/commit/d1a8c605e163bc09d1329cbd90560cc5165de555).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21439
  
**[Test build #92708 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92708/testReport)**
 for PR 21439 at commit 
[`f3efb1b`](https://github.com/apache/spark/commit/f3efb1b97f9366839eacbc2611e39013f8c1fcfc).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21192
  
**[Test build #92707 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92707/testReport)**
 for PR 21192 at commit 
[`5384c07`](https://github.com/apache/spark/commit/5384c073a0761dbe24ec52e9474d618535ad8f69).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21657
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/747/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21657
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21657
  
**[Test build #92706 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92706/testReport)**
 for PR 21657 at commit 
[`d5921f0`](https://github.com/apache/spark/commit/d5921f08d8efa00f64f01d005e843291568c1e80).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21657
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21669
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/746/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21669
  
Kubernetes integration test status success
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/746/



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21669
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21669
  
Kubernetes integration test starting
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/746/



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21657
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92704/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21657
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21657
  
**[Test build #92704 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92704/testReport)**
 for PR 21657 at commit 
[`d5921f0`](https://github.com/apache/spark/commit/d5921f08d8efa00f64f01d005e843291568c1e80).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21669
  
**[Test build #92705 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92705/testReport)**
 for PR 21669 at commit 
[`13b3adc`](https://github.com/apache/spark/commit/13b3adc5ffb55fbfd6572089b1f54e8bca393494).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21690: [SPARK-24713]AppMatser of spark streaming kafka OOM if t...

2018-07-07 Thread yuanboliu
Github user yuanboliu commented on the issue:

https://github.com/apache/spark/pull/21690
  
@koeninger Thanks for your reply. I agree with you; there is no need to 
call pause repeatedly. 
This is my test run without any pause, and the app master was stuck for a long 
time without making any progress:

![wechatworkscreenshot_abb443bd-97db-48f9-88a2-e45a65617f80](https://user-images.githubusercontent.com/5643344/42409693-b324d45e-8210-11e8-96eb-39fc359b1b42.png)

I will update my patch shortly.
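
As a side note for readers, a minimal sketch of the idea above - pause each 
assigned partition once instead of re-pausing on every poll (the helper is 
illustrative and not the actual patch):
```scala
import java.{util => ju}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Pause all currently assigned partitions only if some are not paused yet,
// rather than issuing consumer.pause(...) on every poll cycle.
def pauseOnce[K, V](consumer: KafkaConsumer[K, V]): Unit = {
  val assigned: ju.Set[TopicPartition] = consumer.assignment()
  if (!consumer.paused().containsAll(assigned)) {
    consumer.pause(assigned)
  }
}
```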



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21260: [SPARK-23529][K8s] Support mounting volumes

2018-07-07 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21260
  
Nice work. One thing I would consider: include a StorageClass name 
option for the PersistentVolumeClaim volume type, defaulting to an empty 
string. Without it, the PVC will always use the default StorageClass, which may 
not exist in all clusters; the pod would then remain in the pending state 
indefinitely.
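
To illustrate the suggestion, a minimal sketch with the fabric8 client - the 
option plumbing and names here are hypothetical, not part of this PR as written:
```scala
import io.fabric8.kubernetes.api.model.{PersistentVolumeClaim, PersistentVolumeClaimBuilder, Quantity}

// Build a PVC, setting storageClassName only when the user configured one;
// leaving it unset lets the cluster's default StorageClass (if any) apply.
def buildClaim(
    claimName: String,
    size: String,
    storageClass: Option[String]): PersistentVolumeClaim = {
  val spec = new PersistentVolumeClaimBuilder()
    .withNewMetadata().withName(claimName).endMetadata()
    .withNewSpec()
    .withAccessModes("ReadWriteOnce")
    .withNewResources().addToRequests("storage", new Quantity(size)).endResources()
  storageClass.fold(spec)(sc => spec.withStorageClassName(sc))
    .endSpec()
    .build()
}
```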


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-07 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/21698
  
I did not go over the PR itself in detail, but the proposal sounds very 
expensive - particularly given the cascading costs involved.

Also, I am not sure why we are special-casing only coalesce/repartition 
here: any closure that depends on the ordering of tuples is bound to fail - 
for example, the RDD.zip* variants, sampling in ML, etc. will suffer from the same issue.

Either we fix shuffle itself to become deterministic (which I am not sure 
we can do efficiently), or we simply document this issue for 
coalesce/other relevant APIs - so that users do a sort when applicable: when 
they deem the additional cost worth bearing.
Note that in a lot of cases this is not an issue - for example when 
reading from external data stores, checkpointed data, persisted data, etc., 
which are typically the reasons coalesce gets used a lot (to minimize the number of 
partitions).
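
For readers, a minimal sketch of the documented-workaround option mentioned 
above - an explicit sort to pin a deterministic order before repartition, at 
the cost of an extra shuffle (data and partition counts are illustrative):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("deterministic-repartition").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 100)

// repartition() distributes rows round-robin; if a map task is re-executed
// and its input iterates in a different order, rows can land in different
// output partitions than on the first attempt, duplicating or losing records.
val risky = rdd.repartition(10)

// Sorting first makes the input order deterministic across retries.
val safe = rdd.sortBy(identity).repartition(10)
```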


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21583: [SPARK-23984][K8S][Test] Added Integration Tests for PyS...

2018-07-07 Thread ifilonenko
Github user ifilonenko commented on the issue:

https://github.com/apache/spark/pull/21583
  
@foxish The problem is the same as described here: 
https://github.com/moby/moby/issues/32457, which should have been solved in 
`Docker 17.05`. As such, this is caused by a deprecated version of Docker, and I 
am waiting for an update to happen on the Jenkins nodes (locally this works 
perfectly fine with the newest version of Docker). 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-07-07 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/21221#discussion_r200805499
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class 
SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo) extends
  * Periodic updates from executors.
  * @param execId executor id
  * @param accumUpdates sequence of (taskId, stageId, stageAttemptId, 
accumUpdates)
+ * @param executorUpdates executor level metrics updates
  */
 @DeveloperApi
 case class SparkListenerExecutorMetricsUpdate(
 execId: String,
-accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])])
+accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])],
+executorUpdates: Option[Array[Long]] = None)
+  extends SparkListenerEvent
+
+/**
+ * Peak metric values for the executor for the stage, written to the 
history log at stage
+ * completion.
+ * @param execId executor id
+ * @param stageId stage id
+ * @param stageAttemptId stage attempt
+ * @param executorMetrics executor level metrics, indexed by 
MetricGetter.values
+ */
+@DeveloperApi
+case class SparkListenerStageExecutorMetrics(
+execId: String,
+stageId: Int,
+stageAttemptId: Int,
+executorMetrics: Array[Long])
--- End diff --

We cannot expose an array of longs in a developer API (mutability).
In addition, we cannot have users needing to reference private Spark APIs 
to understand what each slot means - particularly when the ordering can be subject 
to change in subsequent versions of Spark.
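
To illustrate the concern, a minimal sketch of an immutable, self-describing 
shape (the names here are hypothetical, not what this PR proposes):
```scala
// Key metrics by name instead of by a private index order, and hold them in
// an immutable Map so event consumers cannot mutate the payload.
case class StageExecutorMetricsPayload(metrics: Map[String, Long]) {
  def metricValue(name: String): Option[Long] = metrics.get(name)
}
```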



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21653: [SPARK-13343] speculative tasks that didn't commi...

2018-07-07 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/21653#discussion_r200805005
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -723,6 +723,13 @@ private[spark] class TaskSetManager(
   def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = 
{
 val info = taskInfos(tid)
 val index = info.index
+// Check if any other attempt succeeded before this and this attempt 
has not been handled
+if (successful(index) && killedByOtherAttempt(index)) {
--- End diff --

For completeness, we will also need to 'undo' the changes made in 
`enqueueSuccessfulTask`: i.e. reverse the counters updated in `canFetchMoreResults`.


(Orthogonal to this PR): looking at the use of `killedByOtherAttempt`, I see 
that there is a bug in `executorLost` w.r.t. how it is updated - hopefully a fix 
for SPARK-24755 won't cause issues here.
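
A minimal sketch of the 'undo' idea, with hypothetical naming - the fields 
mirror the accounting that `canFetchMoreResults` performs, but this is not 
code from the PR:
```scala
// If a duplicate (speculative) result is ultimately discarded, give back its
// contribution to the result-size accounting done in canFetchMoreResults().
private def releaseResultSize(resultSize: Long): Unit = sched.synchronized {
  totalResultSize -= resultSize
  calculatedTasks -= 1
}
```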


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21657
  
**[Test build #92704 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92704/testReport)**
 for PR 21657 at commit 
[`d5921f0`](https://github.com/apache/spark/commit/d5921f08d8efa00f64f01d005e843291568c1e80).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21657
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/745/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21657
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21657
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21657
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21657
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92703/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...

2018-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21657
  
**[Test build #92703 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92703/testReport)**
 for PR 21657 at commit 
[`d5921f0`](https://github.com/apache/spark/commit/d5921f08d8efa00f64f01d005e843291568c1e80).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


