[GitHub] spark pull request #14638: [SPARK-11374][SQL] Support `skip.header.line.coun...

2016-11-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14638#discussion_r90098551
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -122,10 +126,20 @@ class HadoopTableReader(
 val attrsWithIndex = attributes.zipWithIndex
 val mutableRow = new SpecificInternalRow(attributes.map(_.dataType))
 
-val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
+val deserializedHadoopRDD = hadoopRDD.mapPartitionsWithIndex { (index, iter) =>
   val hconf = broadcastedHadoopConf.value.value
   val deserializer = deserializerClass.newInstance()
   deserializer.initialize(hconf, tableDesc.getProperties)
+  if (skipHeaderLineCount > 0 && isTextInputFormatTable) {
+val partition = hadoopRDD.partitions(index).asInstanceOf[HadoopPartition]
+if (partition.inputSplit.t.asInstanceOf[FileSplit].getStart() == 0) {
--- End diff --

is `partition.inputSplit.t.asInstanceOf[FileSplit].getStart() != 0` tested 
(for a split that does not start from the start of the file, we do not skip)?
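
For reference, a minimal sketch of the skipping rule under discussion (an illustration, not the PR's code): only the split that starts at byte offset 0 of a file can contain the header, so only that split should drop the first `skip.header.line.count` records.

```scala
import org.apache.hadoop.mapred.FileSplit

// Sketch: drop the header only for the split that begins the file; later
// splits of the same file must not skip anything.
def maybeSkipHeader[T](iter: Iterator[T], split: FileSplit, skipHeaderLineCount: Int): Iterator[T] = {
  if (skipHeaderLineCount > 0 && split.getStart == 0) iter.drop(skipHeaderLineCount)
  else iter
}
```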





[GitHub] spark issue #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regression sho...

2016-11-29 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15910
  
@yanboliang @felixcheung I am back from vacation and made changes according 
to your comments.

Thanks!





[GitHub] spark pull request #14638: [SPARK-11374][SQL] Support `skip.header.line.coun...

2016-11-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14638#discussion_r90098793
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -122,10 +126,20 @@ class HadoopTableReader(
 val attrsWithIndex = attributes.zipWithIndex
 val mutableRow = new SpecificInternalRow(attributes.map(_.dataType))
 
-val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
+val deserializedHadoopRDD = hadoopRDD.mapPartitionsWithIndex { (index, iter) =>
   val hconf = broadcastedHadoopConf.value.value
   val deserializer = deserializerClass.newInstance()
   deserializer.initialize(hconf, tableDesc.getProperties)
+  if (skipHeaderLineCount > 0 && isTextInputFormatTable) {
+val partition = hadoopRDD.partitions(index).asInstanceOf[HadoopPartition]
--- End diff --

I am +1 on adding the check.





[GitHub] spark pull request #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching...

2016-11-29 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15975#discussion_r90098792
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
 ---
@@ -76,9 +76,6 @@ class JDBCOptions(
 
   // the number of partitions
   val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt)
--- End diff --

Not yet. : ) Will try to document it in the jdbc API of 
DataFrameReader.scala





[GitHub] spark issue #16064: [SPARK-18633][ML][Example]: Add multiclass logistic regr...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16064
  
**[Test build #69344 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69344/consoleFull)**
 for PR 16064 at commit 
[`7040089`](https://github.com/apache/spark/commit/70400890ca83b91ea44b0d34bf53b753f07ba46b).





[GitHub] spark pull request #16044: [Spark-18614][SQL] Incorrect predicate pushdown f...

2016-11-29 Thread nsyca
Github user nsyca commented on a diff in the pull request:

https://github.com/apache/spark/pull/16044#discussion_r90100457
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -932,7 +932,7 @@ object PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper {
 
split(joinCondition.map(splitConjunctivePredicates).getOrElse(Nil), left, right)
 
   joinType match {
-case _: InnerLike |  LeftSemi | ExistenceJoin(_) =>
+case _: InnerLike |  LeftSemi =>
--- End diff --

I will make that change as part of my next PR.





[GitHub] spark pull request #14638: [SPARK-11374][SQL] Support `skip.header.line.coun...

2016-11-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14638#discussion_r90101517
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -122,10 +126,20 @@ class HadoopTableReader(
 val attrsWithIndex = attributes.zipWithIndex
 val mutableRow = new SpecificInternalRow(attributes.map(_.dataType))
 
-val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
+val deserializedHadoopRDD = hadoopRDD.mapPartitionsWithIndex { (index, iter) =>
   val hconf = broadcastedHadoopConf.value.value
   val deserializer = deserializerClass.newInstance()
   deserializer.initialize(hconf, tableDesc.getProperties)
+  if (skipHeaderLineCount > 0 && isTextInputFormatTable) {
+val partition = hadoopRDD.partitions(index).asInstanceOf[HadoopPartition]
+if (partition.inputSplit.t.asInstanceOf[FileSplit].getStart() == 0) {
--- End diff --

Yep. I'll add that, too.





[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-11-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15979
  
FWIW I don't think we should call it nonflat.





[GitHub] spark issue #16048: [DO_NOT_MERGE]Test kafka deletion

2016-11-29 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/16048
  
retest this please





[GitHub] spark issue #16048: [DO_NOT_MERGE]Test kafka deletion

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16048
  
**[Test build #69345 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69345/consoleFull)**
 for PR 16048 at commit 
[`9ff2ed4`](https://github.com/apache/spark/commit/9ff2ed48062fbd7d9c92749ead54b62bcc9ee4ce).





[GitHub] spark issue #15982: [SPARK-18546][core] Fix merging shuffle spills when usin...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15982
  
**[Test build #69337 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69337/consoleFull)**
 for PR 15982 at commit 
[`2e03ee6`](https://github.com/apache/spark/commit/2e03ee6faec0984eeab0ffe2699ec7cb59bf0c43).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15982: [SPARK-18546][core] Fix merging shuffle spills when usin...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15982
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69337/
Test PASSed.





[GitHub] spark issue #15982: [SPARK-18546][core] Fix merging shuffle spills when usin...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15982
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16064: [SPARK-18633][ML][Example]: Add multiclass logistic regr...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16064
  
**[Test build #69344 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69344/consoleFull)**
 for PR 16064 at commit 
[`7040089`](https://github.com/apache/spark/commit/70400890ca83b91ea44b0d34bf53b753f07ba46b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16064: [SPARK-18633][ML][Example]: Add multiclass logistic regr...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16064
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16064: [SPARK-18633][ML][Example]: Add multiclass logistic regr...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16064
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69344/
Test PASSed.





[GitHub] spark pull request #15924: [SPARK-18498] [SQL] Revise HDFSMetadataLog API fo...

2016-11-29 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/15924#discussion_r90090753
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
 ---
@@ -129,48 +129,18 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path:
 }
   }
 
-  /**
-   * Write a batch to a temp file then rename it to the batch file.
-   *
-   * There may be multiple [[HDFSMetadataLog]] using the same metadata path. Although it is not a
-   * valid behavior, we still need to prevent it from destroying the files.
-   */
-  private def writeBatch(batchId: Long, metadata: T, writer: (T, OutputStream) => Unit): Unit = {
-    // Use nextId to create a temp file
+  def writeTempBatch(metadata: T, writer: (T, OutputStream) => Unit = serialize): Option[Path] = {
     var nextId = 0
-    while (true) {
+    while(true) {
--- End diff --

Nit: Space after `while`





[GitHub] spark pull request #15954: [WIP][SPARK-18516][SQL] Split state and progress ...

2016-11-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r90108129
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala ---
@@ -669,55 +658,48 @@ trait StreamTest extends QueryTest with SharedSQLContext with Timeouts {
 }
   }
 
-
-  class QueryStatusCollector extends StreamingQueryListener {
+  /** Collects events from the StreamingQueryListener for testing */
+  class EventCollector extends StreamingQueryListener {
--- End diff --

Ah yes, it was being used in multiple files at some point, so I put it here, but not any more. I will put it only in StreamingQueryListener.





[GitHub] spark issue #15954: [WIP][SPARK-18516][SQL] Split state and progress in stre...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15954
  
**[Test build #69346 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69346/consoleFull)**
 for PR 15954 at commit 
[`d9d8f82`](https://github.com/apache/spark/commit/d9d8f82e0adfb23223e6d445f0f832824b08cf9b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/16038
  
Without understanding the specifics of the ML part here - won't the actual 
impact of a large dense vector on Task 'bytes' be minimal at best?
We do compress the task binary, and 1B zeros should have a fairly low 
compressed footprint, no?
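
As a rough illustration of that point (an assumption-level sketch, not Spark's actual broadcast path, which uses the codec configured by `spark.io.compression.codec`): a buffer of all-zero bytes compresses to a tiny fraction of its raw size.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

object ZeroCompressionCheck {
  def main(args: Array[String]): Unit = {
    // 8 MB of zero bytes, standing in for the serialized form of a large all-zero dense vector.
    val raw = new Array[Byte](8 * 1024 * 1024)
    val out = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(out)
    gzip.write(raw)
    gzip.close()
    println(s"raw = ${raw.length} bytes, gzip = ${out.size()} bytes")
  }
}
```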





[GitHub] spark pull request #16065: [SPARK-17064][SQL] Changed ExchangeCoordinator re...

2016-11-29 Thread markhamstra
GitHub user markhamstra opened a pull request:

https://github.com/apache/spark/pull/16065

[SPARK-17064][SQL] Changed ExchangeCoordinator re-partitioning to avoid 
additional data …

## What changes were proposed in this pull request?

The re-partitioning logic in ExchangeCoordinator was changed so that another pre-shuffle partition is not added to the current post-shuffle partition if doing so would cause the post-shuffle partition's size to exceed the target partition size.
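
A simplified sketch of that rule (illustrative only; the names and structure below are not taken from the actual ExchangeCoordinator code): pre-shuffle partitions are grouped greedily, and a partition is not added to the current group if that would push the group past the target size.

```scala
// Sketch: greedily group pre-shuffle partitions (by index) into post-shuffle
// partitions without letting a group grow past targetSize, except when a
// single partition is already larger than the target.
def coalescePartitions(preShuffleSizes: Seq[Long], targetSize: Long): Seq[Seq[Int]] = {
  val groups = scala.collection.mutable.ArrayBuffer.empty[Seq[Int]]
  var current = Vector.empty[Int]
  var currentSize = 0L
  preShuffleSizes.zipWithIndex.foreach { case (size, i) =>
    if (current.nonEmpty && currentSize + size > targetSize) {
      groups += current
      current = Vector(i)
      currentSize = size
    } else {
      current = current :+ i
      currentSize += size
    }
  }
  if (current.nonEmpty) groups += current
  groups.toSeq
}
```

For example, pre-shuffle sizes of (10, 10, 10, 40) bytes with a 30-byte target yield the groups (0, 1, 2) and (3).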

## How was this patch tested?

Existing tests updated to reflect new expectations.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markhamstra/spark SPARK-17064

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16065.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16065


commit 561fcf67bd3c1541352b00f33981a44fa58a6ccc
Author: Mark Hamstra 
Date:   2016-11-29T20:34:03Z

Changed ExchangeCoordinator re-partitioning to avoid additional data skew







[GitHub] spark pull request #15924: [SPARK-18498] [SQL] Revise HDFSMetadataLog API fo...

2016-11-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15924





[GitHub] spark issue #16065: [SPARK-17064][SQL] Changed ExchangeCoordinator re-partit...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16065
  
**[Test build #69347 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69347/consoleFull)**
 for PR 16065 at commit 
[`561fcf6`](https://github.com/apache/spark/commit/561fcf67bd3c1541352b00f33981a44fa58a6ccc).





[GitHub] spark pull request #16063: [SPARK-18622][SQL] Remove TypeCoercion rules for ...

2016-11-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/16063#discussion_r90110048
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -482,21 +482,6 @@ object TypeCoercion {
 
 CreateMap(newKeys.zip(newValues).flatMap { case (k, v) => Seq(k, v) })
 
-  // Promote SUM, SUM DISTINCT and AVERAGE to largest types to prevent overflows.
-  case s @ Sum(e @ DecimalType()) => s // Decimal is already the biggest.
-  case Sum(e @ IntegralType()) if e.dataType != LongType => Sum(Cast(e, LongType))
-  case Sum(e @ FractionalType()) if e.dataType != DoubleType => Sum(Cast(e, DoubleType))
-
-  case s @ Average(e @ DecimalType()) => s // Decimal is already the biggest.
-  case Average(e @ IntegralType()) if e.dataType != LongType =>
-    Average(Cast(e, LongType))
-  case Average(e @ FractionalType()) if e.dataType != DoubleType =>
-    Average(Cast(e, DoubleType))
-
-  // Hive lets you do aggregation of timestamps... for some reason
-  case Sum(e @ TimestampType()) => Sum(Cast(e, DoubleType))
-  case Average(e @ TimestampType()) => Average(Cast(e, DoubleType))
--- End diff --

Is this safe in terms of backward-compatibility for `Timestamp`?





[GitHub] spark issue #16065: [SPARK-17064][SQL] Changed ExchangeCoordinator re-partit...

2016-11-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16065
  
Wrong JIRA ticket?





[GitHub] spark issue #16065: [SPARK-18631][SQL] Changed ExchangeCoordinator re-partit...

2016-11-29 Thread markhamstra
Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/16065
  
@rxin fixed it





[GitHub] spark issue #15505: [SPARK-17931][CORE] taskScheduler has some unneeded seri...

2016-11-29 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15505
  
I agree with Kay that putting in a smaller change first is better, assuming 
it still has the performance gains.  That doesn't preclude any further 
optimizations that are bigger changes.

I'm a little surprised that serializing tasks has much of an impact, 
given how little data is getting serialized.  But if it really does, I feel like 
there is a much bigger optimization we're completely missing.  Why are we 
repeating the work of serialization for each task in a taskset?  The serialized 
data is almost exactly the same for *every* task.  They only differ in the 
partition id (an int) and the preferred locations (which aren't even used by 
the executor at all).

Task serialization already leverages the idea of sharing info across all the 
tasks via the Broadcast for the task binary.  We just need to use that same idea 
for all the rest of the task data that is sent to the executor.  Then the only 
difference between the serialized task data sent to executors is the int for 
the partitionId.  You'd serialize into a ByteBuffer once, and then your 
per-task "serialization" becomes copying the buffer and modifying that int 
directly.
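
A rough sketch of that last idea (assumptions only; the offset and names here are made up, not Spark's task wire format): serialize the common payload once, then stamp each task's partition id into a copy of the buffer.

```scala
import java.nio.ByteBuffer
import java.util.Arrays

// Sketch: commonBytes is the serialized task data shared by every task in the
// task set; partitionIdOffset is the (hypothetical) byte offset where the
// 4-byte partition id lives. Per-task "serialization" is then a copy + putInt.
def perTaskPayload(commonBytes: Array[Byte], partitionIdOffset: Int, partitionId: Int): ByteBuffer = {
  val copy = Arrays.copyOf(commonBytes, commonBytes.length)
  val buf = ByteBuffer.wrap(copy)
  buf.putInt(partitionIdOffset, partitionId)
  buf
}
```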





[GitHub] spark issue #15982: [SPARK-18546][core] Fix merging shuffle spills when usin...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15982
  
**[Test build #69348 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69348/consoleFull)**
 for PR 15982 at commit 
[`8ac9276`](https://github.com/apache/spark/commit/8ac927623c5d7809208b766001f46ea2ad576af9).





[GitHub] spark pull request #15954: [WIP][SPARK-18516][SQL] Split state and progress ...

2016-11-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r90112136
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala ---
@@ -38,11 +40,11 @@ trait StreamingQuery {
   def name: String
 
   /**
-   * Returns the unique id of this query. This id is automatically generated and is unique across
-   * all queries that have been started in the current process.
-   * @since 2.0.0
+   * Returns the unique id of this query.  An id is tied to the checkpoint location and will
+   * be the same across restarts of a given streaming query.
--- End diff --

removed.





[GitHub] spark pull request #14638: [SPARK-11374][SQL] Support `skip.header.line.coun...

2016-11-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14638#discussion_r90112125
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -113,6 +113,10 @@ class HadoopTableReader(
 
 val tablePath = hiveTable.getPath
 val inputPathStr = applyFilterIfNeeded(tablePath, filterOpt)
+val skipHeaderLineCount =
+  tableDesc.getProperties.getProperty("skip.header.line.count", "0").toInt
+val isTextInputFormatTable =
--- End diff --

Oh right, I guess I mean ... `hiveTable.getInputFormatClass == 
classOf[TextInputFormat]`?
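
One way that check might look (an assumption, not the merged change), using `isAssignableFrom` so subclasses of `TextInputFormat` are also treated as text tables:

```scala
import org.apache.hadoop.mapred.{InputFormat, TextInputFormat}

// Hypothetical helper: true when the table's configured input format is
// TextInputFormat (or a subclass), the only case where a header line can exist.
def isTextInputFormat(inputFormatClass: Class[_ <: InputFormat[_, _]]): Boolean =
  classOf[TextInputFormat].isAssignableFrom(inputFormatClass)
```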





[GitHub] spark pull request #16066: [SPARK-18632][SQL] AggregateFunction should not i...

2016-11-29 Thread hvanhovell
GitHub user hvanhovell opened a pull request:

https://github.com/apache/spark/pull/16066

[SPARK-18632][SQL] AggregateFunction should not implement 
ImplicitCastInputTypes

## What changes were proposed in this pull request?
`AggregateFunction` currently implements `ImplicitCastInputTypes` (which 
enables implicit input type casting). There are actually quite a few situations 
in which we don't need this, or require more control over our input. A recent 
example is the aggregate for `CountMinSketch` which should only take string, 
binary or integral types inputs.

This PR removes `ImplicitCastInputTypes` from the `AggregateFunction` and 
makes a case-by-case decision on what kind of input validation we should use.

## How was this patch tested?
Refactoring only. Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hvanhovell/spark SPARK-18632

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16066.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16066


commit 9a722cf3d48850ab6579db856876bada8749330c
Author: Herman van Hovell 
Date:   2016-11-29T20:56:53Z

AggregateFunction should not implement ImplicitCastInputTypes.







[GitHub] spark issue #16066: [SPARK-18632][SQL] AggregateFunction should not implemen...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16066
  
**[Test build #69349 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69349/consoleFull)**
 for PR 16066 at commit 
[`9a722cf`](https://github.com/apache/spark/commit/9a722cf3d48850ab6579db856876bada8749330c).





[GitHub] spark issue #16066: [SPARK-18632][SQL] AggregateFunction should not implemen...

2016-11-29 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/16066
  
cc @rxin





[GitHub] spark issue #16065: [SPARK-18631][SQL] Changed ExchangeCoordinator re-partit...

2016-11-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16065
  
cc @yhuai 





[GitHub] spark pull request #16066: [SPARK-18632][SQL] AggregateFunction should not i...

2016-11-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16066#discussion_r90114430
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
 ---
@@ -56,6 +52,20 @@ case class Last(child: Expression, ignoreNullsExpr: Expression) extends Declarat
   // Expected input data type.
   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType, BooleanType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else if (!ignoreNullsExpr.foldable) {
+      TypeCheckFailure(
+        s"The second argument of Last must be a boolean literal, but got: $ignoreNullsExpr")
--- End diff --

ignoreNullsExpr.toSQL?





[GitHub] spark issue #16066: [SPARK-18632][SQL] AggregateFunction should not implemen...

2016-11-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16066
  
LGTM other than that tiny comment.






[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...

2016-11-29 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/15998#discussion_r90114390
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala
 ---
@@ -408,14 +411,18 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
 
   test("show partitions - datasource") {
 withTable("part_datasrc") {
-  val df = (1 to 3).map(i => (i, s"val_$i", i * 2)).toDF("a", "b", "c")
+  val df = (1 to 3).map(i => (i, s"val_$i", i * 2)).toDF("A", "b", "c")
   df.write
-.partitionBy("a")
+.partitionBy("A")
 .format("parquet")
 .mode(SaveMode.Overwrite)
 .saveAsTable("part_datasrc")
 
-  assert(sql("SHOW PARTITIONS part_datasrc").count() == 3)
+  checkAnswer(
+sql("SHOW PARTITIONS part_datasrc"),
+Row("A=1") ::
+  Row("A=2") ::
--- End diff --

nit: indentation seems a bit funny here, could inline the rows





[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...

2016-11-29 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/15998#discussion_r90092813
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -922,6 +923,29 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   /**
* Returns the partition names from hive metastore for a given table in 
a database.
*/
+  override def listPartitionNames(
+      db: String,
+      table: String,
+      partialSpec: Option[TablePartitionSpec] = None): Seq[String] = withClient {
+    val actualPartColNames = getTable(db, table).partitionColumnNames
+    val clientPartitionNames =
+      client.getPartitionNames(db, table, partialSpec.map(lowerCasePartitionSpec))
+
+    if (actualPartColNames.exists(partColName => partColName != partColName.toLowerCase)) {
+      clientPartitionNames.map { partName =>
+        val partSpec = PartitioningUtils.parsePathFragmentAsSeq(partName)
+        partSpec.map { case (partName, partValue) =>
+          actualPartColNames.find(_.equalsIgnoreCase(partName)).get + "=" + partValue
+        }.mkString("/")
+      }
+    } else {
+      clientPartitionNames
--- End diff --

Since this is not in the data path and the times are comparable, I think it 
would be better to drop the short circuit.





[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...

2016-11-29 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/15998#discussion_r90103384
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
 ---
@@ -189,11 +189,28 @@ abstract class ExternalCatalog {
   spec: TablePartitionSpec): Option[CatalogTablePartition]
 
   /**
+   * List the names of all partitions that belong to the specified table, assuming it exists.
--- End diff --

Could you specify the return format of partitions (e.g. `/a=v1,b=v2`), that the 
values are escaped, and that callers can decode them using the functions 
found in PartitioningUtils?





[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...

2016-11-29 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/15998#discussion_r90092477
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -922,6 +923,29 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   /**
* Returns the partition names from hive metastore for a given table in 
a database.
*/
+  override def listPartitionNames(
+      db: String,
+      table: String,
+      partialSpec: Option[TablePartitionSpec] = None): Seq[String] = withClient {
+    val actualPartColNames = getTable(db, table).partitionColumnNames
+    val clientPartitionNames =
+      client.getPartitionNames(db, table, partialSpec.map(lowerCasePartitionSpec))
+
+    if (actualPartColNames.exists(partColName => partColName != partColName.toLowerCase)) {
+      clientPartitionNames.map { partName =>
+        val partSpec = PartitioningUtils.parsePathFragmentAsSeq(partName)
--- End diff --

That's correct. As of 2.1 we go through some extremes to preserve the case 
of partition names by saving extra metadata inside the table storage 
properties. When the table is resolved in `HiveExternalCatalog`, the lowercase 
schema is reconciled with the saved original names.





[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...

2016-11-29 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/15998#discussion_r90097773
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala
 ---
@@ -482,6 +482,19 @@ class InMemoryCatalog(
 }
   }
 
+  override def listPartitionNames(
+      db: String,
+      table: String,
+      partialSpec: Option[TablePartitionSpec] = None): Seq[String] = synchronized {
+    val partitionColumnNames = getTable(db, table).partitionColumnNames
+
+    listPartitions(db, table, partialSpec).map { partition =>
+      partitionColumnNames.map { name =>
+        name + "=" + partition.spec(name)
--- End diff --

Based on manual testing it seems that hive returns escaped values here. So 
for example, if you had a partition with column `B` and value ``, 
then the hive client returns  `b=%3D%3D%3D%3D%3D%3D%3D%3D%3D%3D%3D%3D`. So this 
code should probably call `getPathFragment` to have the same behavior.
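
To illustrate the escaping in question (a self-contained toy, not Spark's `PartitioningUtils.getPathFragment`; the `escape` helper below is hypothetical):

```scala
object PartitionPathEscaping {
  // Toy Hive-style path escaping: percent-encode characters that are special
  // in a partition path fragment. The real rules live in Spark/Hive utilities.
  private def escape(s: String): String = s.flatMap {
    case c if c == '=' || c == '/' || c == '%' => f"%%${c.toInt}%02X"
    case c => c.toString
  }

  def pathFragment(spec: Seq[(String, String)]): String =
    spec.map { case (k, v) => s"${escape(k)}=${escape(v)}" }.mkString("/")

  def main(args: Array[String]): Unit = {
    // An '=' in a value comes back percent-encoded, e.g. "%3D".
    println(pathFragment(Seq("b" -> "==", "c" -> "v1")))  // prints b=%3D%3D/c=v1
  }
}
```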





[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...

2016-11-29 Thread ericl
Github user ericl commented on the issue:

https://github.com/apache/spark/pull/15998
  
* looks good once InMemoryCatalog is fixed





[GitHub] spark issue #15877: [SPARK-18429] [SQL] implement a new Aggregate for CountM...

2016-11-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15877
  
Thanks - I'm going to merge this in master. I will submit a follow-up PR to 
simplify this a little bit, and remove the handling of float/double/decimal 
types and require explicit user action on how to turn that into long.






[GitHub] spark pull request #15877: [SPARK-18429] [SQL] implement a new Aggregate for...

2016-11-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15877





[GitHub] spark issue #16063: [SPARK-18622][SQL] Remove TypeCoercion rules for Average...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16063
  
**[Test build #69343 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69343/consoleFull)**
 for PR 16063 at commit 
[`7596b5a`](https://github.com/apache/spark/commit/7596b5aaffa39b5ded5e2ea5a90e0d7d751973b2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16063: [SPARK-18622][SQL] Remove TypeCoercion rules for Average...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16063
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69343/
Test FAILed.





[GitHub] spark issue #16063: [SPARK-18622][SQL] Remove TypeCoercion rules for Average...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16063
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16062: [SPARK-18629][SQL] Fix numPartition of JDBCSuite Testcas...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16062
  
**[Test build #69340 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69340/consoleFull)**
 for PR 16062 at commit 
[`30c5d6f`](https://github.com/apache/spark/commit/30c5d6f450a8958c2133d1014533378de6966a1f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16062: [SPARK-18629][SQL] Fix numPartition of JDBCSuite Testcas...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16062
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16062: [SPARK-18629][SQL] Fix numPartition of JDBCSuite Testcas...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16062
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69340/
Test PASSed.





[GitHub] spark pull request #16067: [SPARK-17897] [SQL] Fixed IsNotNull Inference Rul...

2016-11-29 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/16067

[SPARK-17897] [SQL] Fixed IsNotNull Inference Rule

### What changes were proposed in this pull request?
The `constraints` of an operator are the expressions that evaluate to `true` 
for all the rows it produces. That means the expression result should be neither 
`false` nor `unknown` (NULL). Thus, we can conclude that `IsNotNull` holds on all the 
constraints, whether they are generated by the operator's own predicates or propagated from 
its children. A constraint can be a complex expression. To make better use of these 
constraints, we try to push `IsNotNull` down to the lowest-level expressions 
(i.e., `Attribute`). `IsNotNull` can be pushed through an expression when the expression is 
null-intolerant. (When the input is NULL, a null-intolerant expression always 
evaluates to NULL.)

Below is the code we have for `IsNotNull` pushdown.
```Scala
  private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
case a: Attribute => Seq(a)
case _: NullIntolerant | IsNotNull(_: NullIntolerant) =>
  expr.children.flatMap(scanNullIntolerantExpr)
case _ => Seq.empty[Attribute]
  }
```

**`IsNotNull` itself is not null-intolerant.** It converts `null` to 
`false`. If the expression does not include any `Not`-like expression, the function above 
works; otherwise, it can generate a wrong result. This PR fixes the above 
function by removing `IsNotNull` from the inference. After the fix, when a 
constraint contains `IsNotNull`, we only infer attribute-level `IsNotNull` if it 
appears at the root.

Without the fix, the following test case will return empty.
```Scala
val data = Seq[java.lang.Integer](1, null).toDF("key")
data.filter("not key is not null").show()
```
Before the fix, the optimized plan is like
```
== Optimized Logical Plan ==
Project [value#1 AS key#3]
+- Filter (isnotnull(value#1) && NOT isnotnull(value#1))
   +- LocalRelation [value#1]
```

After the fix, the optimized plan is like
```
== Optimized Logical Plan ==
Project [value#1 AS key#3]
+- Filter NOT isnotnull(value#1)
   +- LocalRelation [value#1]
```

### How was this patch tested?
Added a test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark isNotNull2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16067.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16067


commit 33c10a0994c9802df901f211e1f28c52e34df27f
Author: gatorsmile 
Date:   2016-11-29T08:00:55Z

fix.

commit 025632a6897abd4901254688a049079ed7358e93
Author: gatorsmile 
Date:   2016-11-29T21:26:02Z

fix.







[GitHub] spark issue #16048: [DO_NOT_MERGE]Test kafka deletion

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16048
  
**[Test build #69351 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69351/consoleFull)**
 for PR 16048 at commit 
[`27102eb`](https://github.com/apache/spark/commit/27102eb283e25837646f78943c43a31403a9ce0b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16067: [SPARK-17897] [SQL] Fixed IsNotNull Constraint Inference...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16067
  
**[Test build #69350 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69350/consoleFull)**
 for PR 16067 at commit 
[`0722ae5`](https://github.com/apache/spark/commit/0722ae52d4b4031b4ff2751d22c787b070547fa0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16044: [Spark-18614][SQL] Incorrect predicate pushdown from Exi...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16044
  
**[Test build #69341 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69341/consoleFull)**
 for PR 16044 at commit 
[`d4002c7`](https://github.com/apache/spark/commit/d4002c741c3917c347c7be78b7ddfe053d25e5b6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16009: [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scal...

2016-11-29 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16009#discussion_r90120095
  
--- Diff: docs/ml-features.md ---
@@ -1188,7 +1188,9 @@ categorical features. The number of bins is set by 
the `numBuckets` parameter. I
 that the number of buckets used will be smaller than this value, for 
example, if there are too few
 distinct values of the input to create enough distinct quantiles.
 
-NaN values: Note also that QuantileDiscretizer
+NaN values:
+NaN values will be removed from the column when `QuantileDiscretizer` 
fitting. This will produce
--- End diff --

"from the column when" -> "from the column during"
"model for making prediction and transformation" -> "model for making 
predictions"



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16044: [Spark-18614][SQL] Incorrect predicate pushdown from Exi...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16044
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69341/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16044: [Spark-18614][SQL] Incorrect predicate pushdown from Exi...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16044
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15877: [SPARK-18429] [SQL] implement a new Aggregate for CountM...

2016-11-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15877
  
Hey guys - after looking at the PR more, I'm afraid we have gone overboard 
with testing here. Most of the test cases written just repeat each other and 
do exactly the same thing. For testing something like this I'd probably just 
have a simple end-to-end test and be done with it, because most of the 
complicated logic is isolated in the actual CountMinSketch implementation 
itself, which already has good test coverage.
 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15954: [SPARK-18516][SQL] Split state and progress in streaming

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15954
  
**[Test build #69352 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69352/consoleFull)**
 for PR 15954 at commit 
[`aa8af9c`](https://github.com/apache/spark/commit/aa8af9ca980987a568c6fe65f47456b144168d76).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15877: [SPARK-18429] [SQL] implement a new Aggregate for...

2016-11-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15877#discussion_r90122127
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAggSuite.scala
 ---
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.io.ByteArrayInputStream
+import java.nio.charset.StandardCharsets
+
+import scala.reflect.ClassTag
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.TypeCheckFailure
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
BoundReference, Cast, GenericInternalRow, Literal}
+import org.apache.spark.sql.types.{DecimalType, _}
+import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.util.sketch.CountMinSketch
+
+class CountMinSketchAggSuite extends SparkFunSuite {
+  private val childExpression = BoundReference(0, IntegerType, nullable = 
true)
+  private val epsOfTotalCount = 0.0001
+  private val confidence = 0.99
+  private val seed = 42
+
+  test("serialize and de-serialize") {
+// Check empty serialize and de-serialize
+val agg = new CountMinSketchAgg(childExpression, 
Literal(epsOfTotalCount), Literal(confidence),
+  Literal(seed))
+val buffer = CountMinSketch.create(epsOfTotalCount, confidence, seed)
+assert(buffer.equals(agg.deserialize(agg.serialize(buffer))))
+
+// Check non-empty serialize and de-serialize
+val random = new Random(31)
+(0 until 1).map(_ => random.nextInt(100)).foreach { value =>
+  buffer.add(value)
+}
+assert(buffer.equals(agg.deserialize(agg.serialize(buffer))))
+  }
+
+  def testHighLevelInterface[T: ClassTag](
--- End diff --

@wzhfy can you comment on why we need to test both the high level interface 
and the low level interface?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14638: [SPARK-11374][SQL] Support `skip.header.line.coun...

2016-11-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14638#discussion_r90123854
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -113,6 +113,10 @@ class HadoopTableReader(
 
 val tablePath = hiveTable.getPath
 val inputPathStr = applyFilterIfNeeded(tablePath, filterOpt)
+val skipHeaderLineCount =
+  tableDesc.getProperties.getProperty("skip.header.line.count", 
"0").toInt
+val isTextInputFormatTable =
--- End diff --

Oh, right. I see.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15946: [SPARK-18513][Structured Streaming] Record and recover w...

2016-11-29 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15946
  
@lw-lin could you close this one please? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15954: [SPARK-18516][SQL] Split state and progress in streaming

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15954
  
**[Test build #69353 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69353/consoleFull)**
 for PR 15954 at commit 
[`c11d2e5`](https://github.com/apache/spark/commit/c11d2e51dd1bbbcededeb48db83dd8e060f9c0ae).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15780: [SPARK-18284][SQL] Make ExpressionEncoder.serializer.nul...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15780
  
**[Test build #69342 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69342/consoleFull)**
 for PR 15780 at commit 
[`39e4930`](https://github.com/apache/spark/commit/39e4930cc7dc1ebf36b8d517a030072ebf0a7df3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15780: [SPARK-18284][SQL] Make ExpressionEncoder.serializer.nul...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15780
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69342/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15780: [SPARK-18284][SQL] Make ExpressionEncoder.serializer.nul...

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15780
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15954: [SPARK-18516][SQL] Split state and progress in streaming

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15954
  
**[Test build #69352 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69352/consoleFull)**
 for PR 15954 at commit 
[`aa8af9c`](https://github.com/apache/spark/commit/aa8af9ca980987a568c6fe65f47456b144168d76).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15954: [SPARK-18516][SQL] Split state and progress in streaming

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15954
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69352/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15954: [SPARK-18516][SQL] Split state and progress in streaming

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15954
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15982: [SPARK-18546][core] Fix merging shuffle spills wh...

2016-11-29 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/15982#discussion_r90127514
  
--- Diff: 
core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
 ---
@@ -75,13 +75,6 @@
   @Mock(answer = RETURNS_SMART_NULLS) BlockManager blockManager;
   @Mock(answer = RETURNS_SMART_NULLS) DiskBlockManager diskBlockManager;
 
-  private static final class WrapStream extends 
AbstractFunction1<OutputStream, OutputStream> {
--- End diff --

You can eliminate the imports of `AbstractFunction1` and `OutputStream` 
after this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15982: [SPARK-18546][core] Fix merging shuffle spills wh...

2016-11-29 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/15982#discussion_r90126726
  
--- Diff: 
core/src/test/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriterSuite.java 
---
@@ -40,9 +41,11 @@
 import org.mockito.stubbing.Answer;
 
 import org.apache.spark.HashPartitioner;
+import org.apache.spark.SecurityManager;
 import org.apache.spark.ShuffleDependency;
 import org.apache.spark.SparkConf;
 import org.apache.spark.TaskContext;
+import org.apache.spark.deploy.SparkHadoopUtil;
--- End diff --

Other than `CryptoStreamUtils`, the added imports look unused. It also looks 
like you can eliminate `AbstractFunction1` and `ByteStreams` since you are no 
longer using them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15982: [SPARK-18546][core] Fix merging shuffle spills wh...

2016-11-29 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/15982#discussion_r90121766
  
--- Diff: 
core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java ---
@@ -337,42 +340,47 @@ void forceSorterToSpill() throws IOException {
 final int numPartitions = partitioner.numPartitions();
 final long[] partitionLengths = new long[numPartitions];
 final InputStream[] spillInputStreams = new 
FileInputStream[spills.length];
-OutputStream mergedFileOutputStream = null;
+
+// Use a counting output stream to avoid having to close the 
underlying file and ask
+// the file system for its size after each partition is written.
+final CountingOutputStream mergedFileOutputStream = new 
CountingOutputStream(
+  new FileOutputStream(outputFile));
 
 boolean threwException = true;
 try {
   for (int i = 0; i < spills.length; i++) {
 spillInputStreams[i] = new FileInputStream(spills[i].file);
   }
   for (int partition = 0; partition < numPartitions; partition++) {
-final long initialFileLength = outputFile.length();
-mergedFileOutputStream =
-  new TimeTrackingOutputStream(writeMetrics, new 
FileOutputStream(outputFile, true));
+final long initialFileLength = 
mergedFileOutputStream.getByteCount();
+// Shield the underlying output stream from close() calls, so that 
we can close the higher
+// level streams to make sure all data is really flushed and 
internal state is cleaned.
+OutputStream partitionOutput = new 
CloseShieldOutputStream(mergedFileOutputStream);
+partitionOutput = 
blockManager.serializerManager().wrapForEncryption(partitionOutput);
 if (compressionCodec != null) {
-  mergedFileOutputStream = 
compressionCodec.compressedOutputStream(mergedFileOutputStream);
+  partitionOutput = 
compressionCodec.compressedOutputStream(partitionOutput);
 }
-
+partitionOutput = new TimeTrackingOutputStream(writeMetrics, 
partitionOutput);
--- End diff --

Another change here is that `TimeTrackingOutputStream` now goes around the 
compression codec. I think that is the right change, but it's at least worth 
mentioning in the commit message.

I'm also wondering whether it's worth having a separate JIRA for this, since 
it will affect metrics for all users.
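
To make the ordering concrete, a rough sketch of the old vs. new wrapping 
(stream and variable names follow the diff above; this only illustrates the 
nesting order, written in Scala for brevity, not the exact Java code):
```Scala
// Old ordering: the timer wrapped only the raw file stream, so time spent in
// the compression codec was not counted as shuffle write time.
val oldOut = compressionCodec.compressedOutputStream(
  new TimeTrackingOutputStream(writeMetrics, new FileOutputStream(outputFile, true)))

// New ordering: the timer wraps the compressed (and close-shielded/encrypted)
// stream, so codec time is now attributed to the write metrics.
val newOut = new TimeTrackingOutputStream(writeMetrics,
  compressionCodec.compressedOutputStream(partitionOutput))
```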


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15982: [SPARK-18546][core] Fix merging shuffle spills wh...

2016-11-29 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/15982#discussion_r90127615
  
--- Diff: 
core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java
 ---
@@ -86,14 +88,7 @@ public int compare(
 
   protected boolean shouldUseRadixSort() { return false; }
 
-  private final long pageSizeBytes = new 
SparkConf().getSizeAsBytes("spark.buffer.pageSize", "4m");
-
-  private static final class WrapStream extends 
AbstractFunction1<OutputStream, OutputStream> {
--- End diff --

same on trimming imports


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16065: [SPARK-18631][SQL] Changed ExchangeCoordinator re-partit...

2016-11-29 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16065
  
lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15954: [SPARK-18516][SQL] Split state and progress in streaming

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15954
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15954: [SPARK-18516][SQL] Split state and progress in streaming

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15954
  
**[Test build #69353 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69353/consoleFull)**
 for PR 15954 at commit 
[`c11d2e5`](https://github.com/apache/spark/commit/c11d2e51dd1bbbcededeb48db83dd8e060f9c0ae).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15954: [SPARK-18516][SQL] Split state and progress in streaming

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15954
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69353/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16067: [SPARK-17897] [SQL] Fixed IsNotNull Constraint Inference...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16067
  
**[Test build #69354 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69354/consoleFull)**
 for PR 16067 at commit 
[`f693040`](https://github.com/apache/spark/commit/f693040d8bd1bfcf7ddeda7a6eabfce1de08c62a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r90129155
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -1089,66 +1064,10 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
* MapReduce job.
*/
   def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
-// Rename this as hadoopConf internally to avoid shadowing (see 
SPARK-2038).
-val hadoopConf = conf
-val outputFormatInstance = hadoopConf.getOutputFormat
-val keyClass = hadoopConf.getOutputKeyClass
-val valueClass = hadoopConf.getOutputValueClass
-if (outputFormatInstance == null) {
-  throw new SparkException("Output format class not set")
-}
-if (keyClass == null) {
-  throw new SparkException("Output key class not set")
-}
-if (valueClass == null) {
-  throw new SparkException("Output value class not set")
-}
-SparkHadoopUtil.get.addCredentials(hadoopConf)
-
-logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + 
", " +
-  valueClass.getSimpleName + ")")
-
-if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(self.conf)) {
-  // FileOutputFormat ignores the filesystem parameter
-  val ignoredFs = FileSystem.get(hadoopConf)
-  hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf)
-}
--- End diff --

These validations should go into HadoopMapReduceWriteConfigUtil


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r90116879
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -1016,11 +1013,6 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   /**
* Output the RDD to any Hadoop-supported file system, using a Hadoop 
`OutputFormat` class
* supporting the key and value types K and V in this RDD.
-   *
-   * @note We should make sure our tasks are idempotent when speculation 
is enabled, i.e. do
-   * not use output committer that writes data directly.
-   * There is an example in 
https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
-   * result of using direct output committer with speculation enabled.
--- End diff --

Why was this removed? It is still relevant even if the check now happens in 
a different method invoked from here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r88075283
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapRedCommitProtocol.scala
 ---
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import org.apache.hadoop.mapred._
+import org.apache.hadoop.mapreduce.{TaskAttemptContext => 
NewTaskAttemptContext}
+
+/**
+ * An [[FileCommitProtocol]] implementation backed by an underlying Hadoop 
OutputCommitter
+ * (from the old mapred API).
+ *
+ * Unlike Hadoop's OutputCommitter, this implementation is serializable.
+ */
+class HadoopMapRedCommitProtocol(jobId: String, path: String)
+  extends HadoopMapReduceCommitProtocol(jobId, path) {
+
+  override def setupCommitter(context: NewTaskAttemptContext): 
OutputCommitter = {
+val config = context.getConfiguration.asInstanceOf[JobConf]
+config.getOutputCommitter
--- End diff --

Do we need a setupJob on the committer here?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r90122251
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -0,0 +1,408 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import java.text.NumberFormat
+import java.util.{Date, Locale}
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.{Configurable, Configuration}
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.mapred._
+import org.apache.hadoop.mapreduce.{JobContext => NewJobContext, 
OutputFormat => NewOutputFormat, RecordWriter => NewRecordWriter, 
TaskAttemptContext => NewTaskAttemptContext, TaskAttemptID => NewTaskAttemptID, 
TaskType}
+import org.apache.hadoop.mapreduce.task.{TaskAttemptContextImpl => 
NewTaskAttemptContextImpl}
+
+import org.apache.spark.{SerializableWritable, SparkException, TaskContext}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.executor.OutputMetrics
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
+import org.apache.spark.rdd.{HadoopRDD, RDD}
+import org.apache.spark.util.{SerializableConfiguration, 
SerializableJobConf, Utils}
+
+/**
+ * A helper object that saves an RDD using a Hadoop OutputFormat
+ * (from the old mapred API).
+ */
+private[spark]
+object SparkHadoopWriter extends Logging {
+  import SparkHadoopWriterUtils._
+
+  /**
+   * Basic work flow of this command is:
+   * 1. Driver side setup, prepare the data source and hadoop 
configuration for the write job to
+   *be issued.
+   * 2. Issues a write job consists of one or more executor side tasks, 
each of which writes all
+   *rows within an RDD partition.
+   * 3. If no exception is thrown in a task, commits that task, otherwise 
aborts that task;  If any
+   *exception is thrown during task commitment, also aborts that task.
+   * 4. If all tasks are committed, commit the job, otherwise aborts the 
job;  If any exception is
+   *thrown during job commitment, also aborts the job.
+   */
+  def write[K, V: ClassTag](
+  rdd: RDD[(K, V)],
+  config: HadoopWriteConfigUtil[K, V]): Unit = {
+// Extract context and configuration from RDD.
+val sparkContext = rdd.context
+val stageId = rdd.id
+val sparkConf = rdd.conf
+
+// Set up a job.
+val jobTrackerId = createJobTrackerID(new Date())
+val jobContext = config.createJobContext(jobTrackerId, stageId)
+config.initOutputFormat(jobContext)
+
+// Assert the output format/key/value class is set in JobConf.
+config.assertConf()
+
+if (isOutputSpecValidationEnabled(sparkConf)) {
+  // FileOutputFormat ignores the filesystem parameter
+  config.checkOutputSpecs(jobContext)
+}
+
+val committer = config.createCommitter(stageId)
+committer.setupJob(jobContext)
+
+// When speculation is on and output committer class name contains 
"Direct", we should warn
+// users that they may loss data if they are using a direct output 
committer.
+// There is an example in 
https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
+// result of using direct output committer with speculation enabled.
+if (isSpeculationEnabled(sparkConf) && committer.isDirectOutput) {
+  val warningMessage =
+s"$committer may be an output committer that writes data directly 
to " +
+  "the final location. Because speculation is enabled, this output 
committer may " +
+  "cause data loss (see the case in SPARK-10063). If possible, 
please use an output " +
+  "committer that does not have this behavior (e.g. 
FileOutputCommitter)."
+  logWarning(warningMessage)
+}
+
+// Try to write a

[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r90121527
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -0,0 +1,408 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import java.text.NumberFormat
+import java.util.{Date, Locale}
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.{Configurable, Configuration}
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.mapred._
+import org.apache.hadoop.mapreduce.{JobContext => NewJobContext, 
OutputFormat => NewOutputFormat, RecordWriter => NewRecordWriter, 
TaskAttemptContext => NewTaskAttemptContext, TaskAttemptID => NewTaskAttemptID, 
TaskType}
+import org.apache.hadoop.mapreduce.task.{TaskAttemptContextImpl => 
NewTaskAttemptContextImpl}
+
+import org.apache.spark.{SerializableWritable, SparkException, TaskContext}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.executor.OutputMetrics
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
+import org.apache.spark.rdd.{HadoopRDD, RDD}
+import org.apache.spark.util.{SerializableConfiguration, 
SerializableJobConf, Utils}
+
+/**
+ * A helper object that saves an RDD using a Hadoop OutputFormat
+ * (from the old mapred API).
+ */
+private[spark]
+object SparkHadoopWriter extends Logging {
+  import SparkHadoopWriterUtils._
+
+  /**
+   * Basic work flow of this command is:
+   * 1. Driver side setup, prepare the data source and hadoop 
configuration for the write job to
+   *be issued.
+   * 2. Issues a write job consists of one or more executor side tasks, 
each of which writes all
+   *rows within an RDD partition.
+   * 3. If no exception is thrown in a task, commits that task, otherwise 
aborts that task;  If any
+   *exception is thrown during task commitment, also aborts that task.
+   * 4. If all tasks are committed, commit the job, otherwise aborts the 
job;  If any exception is
+   *thrown during job commitment, also aborts the job.
+   */
+  def write[K, V: ClassTag](
+  rdd: RDD[(K, V)],
+  config: SparkHadoopWriterConfig[K, V]): Unit = {
+// Extract context and configuration from RDD.
+val sparkContext = rdd.context
+val stageId = rdd.id
+val sparkConf = rdd.conf
+
+// Set up a job.
+val jobTrackerId = createJobTrackerID(new Date())
+val jobContext = config.createJobContext(jobTrackerId, stageId)
+config.initOutputFormat(jobContext)
+
+// Assert the output format/key/value class is set in JobConf.
+config.assertConf()
+
+if (isOutputSpecValidationEnabled(sparkConf)) {
+  // FileOutputFormat ignores the filesystem parameter
+  config.checkOutputSpecs(jobContext)
+}
+
+val committer = config.createCommitter(stageId)
+committer.setupJob(jobContext)
+
+// When speculation is on and output committer class name contains 
"Direct", we should warn
+// users that they may loss data if they are using a direct output 
committer.
+// There is an example in 
https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
+// result of using direct output committer with speculation enabled.
+if (isSpeculationEnabled(sparkConf) && committer.isDirectOutput) {
+  val warningMessage =
+s"$committer may be an output committer that writes data directly 
to " +
+  "the final location. Because speculation is enabled, this output 
committer may " +
+  "cause data loss (see the case in SPARK-10063). If possible, 
please use an output " +
+  "committer that does not have this behavior (e.g. 
FileOutputCommitter)."
+  logWarning(warningMessage)
+}
+
+// Try to write

[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r90120144
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -0,0 +1,408 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import java.text.NumberFormat
+import java.util.{Date, Locale}
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.{Configurable, Configuration}
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.mapred._
+import org.apache.hadoop.mapreduce.{JobContext => NewJobContext, 
OutputFormat => NewOutputFormat, RecordWriter => NewRecordWriter, 
TaskAttemptContext => NewTaskAttemptContext, TaskAttemptID => NewTaskAttemptID, 
TaskType}
+import org.apache.hadoop.mapreduce.task.{TaskAttemptContextImpl => 
NewTaskAttemptContextImpl}
+
+import org.apache.spark.{SerializableWritable, SparkException, TaskContext}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.executor.OutputMetrics
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
+import org.apache.spark.rdd.{HadoopRDD, RDD}
+import org.apache.spark.util.{SerializableConfiguration, 
SerializableJobConf, Utils}
+
+/**
+ * A helper object that saves an RDD using a Hadoop OutputFormat
+ * (from the old mapred API).
+ */
+private[spark]
+object SparkHadoopWriter extends Logging {
+  import SparkHadoopWriterUtils._
+
+  /**
+   * Basic work flow of this command is:
+   * 1. Driver side setup, prepare the data source and hadoop 
configuration for the write job to
+   *be issued.
+   * 2. Issues a write job consists of one or more executor side tasks, 
each of which writes all
+   *rows within an RDD partition.
+   * 3. If no exception is thrown in a task, commits that task, otherwise 
aborts that task;  If any
+   *exception is thrown during task commitment, also aborts that task.
+   * 4. If all tasks are committed, commit the job, otherwise aborts the 
job;  If any exception is
+   *thrown during job commitment, also aborts the job.
+   */
+  def write[K, V: ClassTag](
+  rdd: RDD[(K, V)],
+  config: HadoopWriteConfigUtil[K, V]): Unit = {
+// Extract context and configuration from RDD.
+val sparkContext = rdd.context
+val stageId = rdd.id
+val sparkConf = rdd.conf
+
+// Set up a job.
+val jobTrackerId = createJobTrackerID(new Date())
+val jobContext = config.createJobContext(jobTrackerId, stageId)
+config.initOutputFormat(jobContext)
+
+// Assert the output format/key/value class is set in JobConf.
+config.assertConf()
+
+if (isOutputSpecValidationEnabled(sparkConf)) {
+  // FileOutputFormat ignores the filesystem parameter
+  config.checkOutputSpecs(jobContext)
+}
+
+val committer = config.createCommitter(stageId)
+committer.setupJob(jobContext)
+
+// When speculation is on and output committer class name contains 
"Direct", we should warn
+// users that they may loss data if they are using a direct output 
committer.
+// There is an example in 
https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
+// result of using direct output committer with speculation enabled.
+if (isSpeculationEnabled(sparkConf) && committer.isDirectOutput) {
+  val warningMessage =
+s"$committer may be an output committer that writes data directly 
to " +
+  "the final location. Because speculation is enabled, this output 
committer may " +
+  "cause data loss (see the case in SPARK-10063). If possible, 
please use an output " +
+  "committer that does not have this behavior (e.g. 
FileOutputCommitter)."
+  logWarning(warningMessage)
+}
+
+// Try to write a

[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r90127987
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -0,0 +1,408 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import java.text.NumberFormat
+import java.util.{Date, Locale}
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.{Configurable, Configuration}
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.mapred._
+import org.apache.hadoop.mapreduce.{JobContext => NewJobContext, 
OutputFormat => NewOutputFormat, RecordWriter => NewRecordWriter, 
TaskAttemptContext => NewTaskAttemptContext, TaskAttemptID => NewTaskAttemptID, 
TaskType}
+import org.apache.hadoop.mapreduce.task.{TaskAttemptContextImpl => 
NewTaskAttemptContextImpl}
+
+import org.apache.spark.{SerializableWritable, SparkException, TaskContext}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.executor.OutputMetrics
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
+import org.apache.spark.rdd.{HadoopRDD, RDD}
+import org.apache.spark.util.{SerializableConfiguration, 
SerializableJobConf, Utils}
+
+/**
+ * A helper object that saves an RDD using a Hadoop OutputFormat
+ * (from the old mapred API).
+ */
+private[spark]
+object SparkHadoopWriter extends Logging {
+  import SparkHadoopWriterUtils._
+
+  /**
+   * Basic work flow of this command is:
+   * 1. Driver side setup, prepare the data source and hadoop 
configuration for the write job to
+   *be issued.
+   * 2. Issues a write job consists of one or more executor side tasks, 
each of which writes all
+   *rows within an RDD partition.
+   * 3. If no exception is thrown in a task, commits that task, otherwise 
aborts that task;  If any
+   *exception is thrown during task commitment, also aborts that task.
+   * 4. If all tasks are committed, commit the job, otherwise aborts the 
job;  If any exception is
+   *thrown during job commitment, also aborts the job.
+   */
+  def write[K, V: ClassTag](
+  rdd: RDD[(K, V)],
+  config: HadoopWriteConfigUtil[K, V]): Unit = {
+// Extract context and configuration from RDD.
+val sparkContext = rdd.context
+val stageId = rdd.id
+val sparkConf = rdd.conf
+
+// Set up a job.
+val jobTrackerId = createJobTrackerID(new Date())
+val jobContext = config.createJobContext(jobTrackerId, stageId)
+config.initOutputFormat(jobContext)
+
+// Assert the output format/key/value class is set in JobConf.
+config.assertConf()
+
+if (isOutputSpecValidationEnabled(sparkConf)) {
+  // FileOutputFormat ignores the filesystem parameter
+  config.checkOutputSpecs(jobContext)
+}
+
+val committer = config.createCommitter(stageId)
+committer.setupJob(jobContext)
+
+// When speculation is on and output committer class name contains 
"Direct", we should warn
+// users that they may loss data if they are using a direct output 
committer.
+// There is an example in 
https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
+// result of using direct output committer with speculation enabled.
+if (isSpeculationEnabled(sparkConf) && committer.isDirectOutput) {
+  val warningMessage =
+s"$committer may be an output committer that writes data directly 
to " +
+  "the final location. Because speculation is enabled, this output 
committer may " +
+  "cause data loss (see the case in SPARK-10063). If possible, 
please use an output " +
+  "committer that does not have this behavior (e.g. 
FileOutputCommitter)."
+  logWarning(warningMessage)
+}
+
+// Try to write a

[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r90129259
  
--- Diff: 
core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala ---
@@ -561,7 +561,7 @@ class PairRDDFunctionsSuite extends SparkFunSuite with 
SharedSparkContext {
   pairs.saveAsHadoopFile(
 "ignored", pairs.keyClass, pairs.valueClass, 
classOf[FakeFormatWithCallback], conf)
 }
-assert(e.getMessage contains "failed to write")
+assert(e.getCause.getMessage contains "failed to write")
--- End diff --

Curious, how/why did this change ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r88077635
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -0,0 +1,408 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import java.text.NumberFormat
+import java.util.{Date, Locale}
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.{Configurable, Configuration}
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.mapred._
+import org.apache.hadoop.mapreduce.{JobContext => NewJobContext, 
OutputFormat => NewOutputFormat, RecordWriter => NewRecordWriter, 
TaskAttemptContext => NewTaskAttemptContext, TaskAttemptID => NewTaskAttemptID, 
TaskType}
+import org.apache.hadoop.mapreduce.task.{TaskAttemptContextImpl => 
NewTaskAttemptContextImpl}
+
+import org.apache.spark.{SerializableWritable, SparkException, TaskContext}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.executor.OutputMetrics
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
+import org.apache.spark.rdd.{HadoopRDD, RDD}
+import org.apache.spark.util.{SerializableConfiguration, 
SerializableJobConf, Utils}
+
+/**
+ * A helper object that saves an RDD using a Hadoop OutputFormat
+ * (from the old mapred API).
+ */
+private[spark]
+object SparkHadoopWriter extends Logging {
+  import SparkHadoopWriterUtils._
+
+  /**
+   * Basic work flow of this command is:
+   * 1. Driver side setup, prepare the data source and hadoop 
configuration for the write job to
+   *be issued.
+   * 2. Issues a write job consists of one or more executor side tasks, 
each of which writes all
+   *rows within an RDD partition.
+   * 3. If no exception is thrown in a task, commits that task, otherwise 
aborts that task;  If any
+   *exception is thrown during task commitment, also aborts that task.
+   * 4. If all tasks are committed, commit the job, otherwise aborts the 
job;  If any exception is
+   *thrown during job commitment, also aborts the job.
+   */
+  def write[K, V: ClassTag](
+  rdd: RDD[(K, V)],
+  config: SparkHadoopWriterConfig[K, V]): Unit = {
+// Extract context and configuration from RDD.
+val sparkContext = rdd.context
+val stageId = rdd.id
+val sparkConf = rdd.conf
+
+// Set up a job.
+val jobTrackerId = createJobTrackerID(new Date())
+val jobContext = config.createJobContext(jobTrackerId, stageId)
+config.initOutputFormat(jobContext)
+
+// Assert the output format/key/value class is set in JobConf.
+config.assertConf()
+
+if (isOutputSpecValidationEnabled(sparkConf)) {
+  // FileOutputFormat ignores the filesystem parameter
+  config.checkOutputSpecs(jobContext)
+}
+
+val committer = config.createCommitter(stageId)
+committer.setupJob(jobContext)
+
+// When speculation is on and output committer class name contains 
"Direct", we should warn
+// users that they may loss data if they are using a direct output 
committer.
+// There is an example in 
https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
+// result of using direct output committer with speculation enabled.
+if (isSpeculationEnabled(sparkConf) && committer.isDirectOutput) {
+  val warningMessage =
+s"$committer may be an output committer that writes data directly 
to " +
+  "the final location. Because speculation is enabled, this output 
committer may " +
+  "cause data loss (see the case in SPARK-10063). If possible, 
please use an output " +
+  "committer that does not have this behavior (e.g. 
FileOutputCommitter)."
+  logWarning(warningMessage)
+}
+
+// Try to write
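
For context, a hedged sketch of the task/job commit flow that the scaladoc above
describes, written against a hypothetical Committer trait; the real
FileCommitProtocol and writer-config APIs differ in their exact signatures:

    object CommitFlowSketch {
      // Hypothetical, simplified committer; names are illustrative only.
      trait Committer {
        def setupJob(): Unit
        def setupTask(): Unit
        def commitTask(): String            // the real protocol returns a TaskCommitMessage
        def abortTask(): Unit
        def commitJob(msgs: Seq[String]): Unit
        def abortJob(): Unit
      }

      def writeAll[T](partitions: Seq[Seq[T]], c: Committer)(writeRow: T => Unit): Unit = {
        c.setupJob()                                                // 1. driver-side setup
        try {
          val msgs = partitions.map { part =>
            c.setupTask()
            try { part.foreach(writeRow); c.commitTask() }          // 2./3. write all rows, then commit the task
            catch { case t: Throwable => c.abortTask(); throw t }   //       abort the task on any failure
          }
          c.commitJob(msgs)                                         // 4. commit the job once every task committed
        } catch { case t: Throwable => c.abortJob(); throw t }      //    otherwise abort the job
      }
    }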

[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r90121670
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -0,0 +1,408 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import java.text.NumberFormat
+import java.util.{Date, Locale}
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.{Configurable, Configuration}
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.mapred._
+import org.apache.hadoop.mapreduce.{JobContext => NewJobContext, 
OutputFormat => NewOutputFormat, RecordWriter => NewRecordWriter, 
TaskAttemptContext => NewTaskAttemptContext, TaskAttemptID => NewTaskAttemptID, 
TaskType}
+import org.apache.hadoop.mapreduce.task.{TaskAttemptContextImpl => 
NewTaskAttemptContextImpl}
+
+import org.apache.spark.{SerializableWritable, SparkException, TaskContext}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.executor.OutputMetrics
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
+import org.apache.spark.rdd.{HadoopRDD, RDD}
+import org.apache.spark.util.{SerializableConfiguration, 
SerializableJobConf, Utils}
+
+/**
+ * A helper object that saves an RDD using a Hadoop OutputFormat
+ * (from the old mapred API).
+ */
+private[spark]
+object SparkHadoopWriter extends Logging {
+  import SparkHadoopWriterUtils._
+
+  /**
+   * Basic work flow of this command is:
+   * 1. Driver side setup, prepare the data source and hadoop 
configuration for the write job to
+   *be issued.
+   * 2. Issues a write job consists of one or more executor side tasks, 
each of which writes all
+   *rows within an RDD partition.
+   * 3. If no exception is thrown in a task, commits that task, otherwise 
aborts that task;  If any
+   *exception is thrown during task commitment, also aborts that task.
+   * 4. If all tasks are committed, commit the job, otherwise aborts the 
job;  If any exception is
+   *thrown during job commitment, also aborts the job.
+   */
+  def write[K, V: ClassTag](
+  rdd: RDD[(K, V)],
+  config: HadoopWriteConfigUtil[K, V]): Unit = {
+// Extract context and configuration from RDD.
+val sparkContext = rdd.context
+val stageId = rdd.id
+val sparkConf = rdd.conf
+
+// Set up a job.
+val jobTrackerId = createJobTrackerID(new Date())
+val jobContext = config.createJobContext(jobTrackerId, stageId)
+config.initOutputFormat(jobContext)
+
+// Assert the output format/key/value class is set in JobConf.
+config.assertConf()
+
+if (isOutputSpecValidationEnabled(sparkConf)) {
+  // FileOutputFormat ignores the filesystem parameter
+  config.checkOutputSpecs(jobContext)
+}
+
+val committer = config.createCommitter(stageId)
+committer.setupJob(jobContext)
+
+// When speculation is on and output committer class name contains 
"Direct", we should warn
+// users that they may loss data if they are using a direct output 
committer.
+// There is an example in 
https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
+// result of using direct output committer with speculation enabled.
+if (isSpeculationEnabled(sparkConf) && committer.isDirectOutput) {
+  val warningMessage =
+s"$committer may be an output committer that writes data directly 
to " +
+  "the final location. Because speculation is enabled, this output 
committer may " +
+  "cause data loss (see the case in SPARK-10063). If possible, 
please use an output " +
+  "committer that does not have this behavior (e.g. 
FileOutputCommitter)."
+  logWarning(warningMessage)
+}
+
+// Try to write a

[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r87708046
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -0,0 +1,408 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import java.text.NumberFormat
+import java.util.{Date, Locale}
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.{Configurable, Configuration}
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.mapred._
+import org.apache.hadoop.mapreduce.{JobContext => NewJobContext, 
OutputFormat => NewOutputFormat, RecordWriter => NewRecordWriter, 
TaskAttemptContext => NewTaskAttemptContext, TaskAttemptID => NewTaskAttemptID, 
TaskType}
--- End diff --

Split into multiple lines
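
One way the renamed imports could be split, shown purely as an illustrative
sketch (the formatting actually adopted in the PR may differ):

    import org.apache.hadoop.mapreduce.{JobContext => NewJobContext}
    import org.apache.hadoop.mapreduce.{OutputFormat => NewOutputFormat}
    import org.apache.hadoop.mapreduce.{RecordWriter => NewRecordWriter}
    import org.apache.hadoop.mapreduce.{TaskAttemptContext => NewTaskAttemptContext}
    import org.apache.hadoop.mapreduce.{TaskAttemptID => NewTaskAttemptID}
    import org.apache.hadoop.mapreduce.TaskType

Keeping one renamed import per line keeps each line well under the length limit
while leaving the aliases unchanged.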



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r90119536
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -0,0 +1,408 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import java.text.NumberFormat
+import java.util.{Date, Locale}
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.{Configurable, Configuration}
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.mapred._
+import org.apache.hadoop.mapreduce.{JobContext => NewJobContext, 
OutputFormat => NewOutputFormat, RecordWriter => NewRecordWriter, 
TaskAttemptContext => NewTaskAttemptContext, TaskAttemptID => NewTaskAttemptID, 
TaskType}
+import org.apache.hadoop.mapreduce.task.{TaskAttemptContextImpl => 
NewTaskAttemptContextImpl}
+
+import org.apache.spark.{SerializableWritable, SparkException, TaskContext}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.executor.OutputMetrics
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
+import org.apache.spark.rdd.{HadoopRDD, RDD}
+import org.apache.spark.util.{SerializableConfiguration, 
SerializableJobConf, Utils}
+
+/**
+ * A helper object that saves an RDD using a Hadoop OutputFormat
+ * (from the old mapred API).
+ */
+private[spark]
+object SparkHadoopWriter extends Logging {
+  import SparkHadoopWriterUtils._
+
+  /**
+   * Basic work flow of this command is:
+   * 1. Driver side setup, prepare the data source and hadoop 
configuration for the write job to
+   *be issued.
+   * 2. Issues a write job consists of one or more executor side tasks, 
each of which writes all
+   *rows within an RDD partition.
+   * 3. If no exception is thrown in a task, commits that task, otherwise 
aborts that task;  If any
+   *exception is thrown during task commitment, also aborts that task.
+   * 4. If all tasks are committed, commit the job, otherwise aborts the 
job;  If any exception is
+   *thrown during job commitment, also aborts the job.
+   */
+  def write[K, V: ClassTag](
+  rdd: RDD[(K, V)],
+  config: HadoopWriteConfigUtil[K, V]): Unit = {
+// Extract context and configuration from RDD.
+val sparkContext = rdd.context
+val stageId = rdd.id
+val sparkConf = rdd.conf
+
+// Set up a job.
+val jobTrackerId = createJobTrackerID(new Date())
+val jobContext = config.createJobContext(jobTrackerId, stageId)
+config.initOutputFormat(jobContext)
+
+// Assert the output format/key/value class is set in JobConf.
+config.assertConf()
+
+if (isOutputSpecValidationEnabled(sparkConf)) {
+  // FileOutputFormat ignores the filesystem parameter
+  config.checkOutputSpecs(jobContext)
+}
+
+val committer = config.createCommitter(stageId)
+committer.setupJob(jobContext)
+
+// When speculation is on and output committer class name contains 
"Direct", we should warn
+// users that they may loss data if they are using a direct output 
committer.
+// There is an example in 
https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
+// result of using direct output committer with speculation enabled.
+if (isSpeculationEnabled(sparkConf) && committer.isDirectOutput) {
+  val warningMessage =
+s"$committer may be an output committer that writes data directly 
to " +
+  "the final location. Because speculation is enabled, this output 
committer may " +
+  "cause data loss (see the case in SPARK-10063). If possible, 
please use an output " +
+  "committer that does not have this behavior (e.g. 
FileOutputCommitter)."
+  logWarning(warningMessage)
+}
+
+// Try to write a

[GitHub] spark pull request #15861: [SPARK-18294][CORE] Implement commit protocol to ...

2016-11-29 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/15861#discussion_r90124359
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -0,0 +1,408 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import java.text.NumberFormat
+import java.util.{Date, Locale}
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.{Configurable, Configuration}
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.mapred._
+import org.apache.hadoop.mapreduce.{JobContext => NewJobContext, 
OutputFormat => NewOutputFormat, RecordWriter => NewRecordWriter, 
TaskAttemptContext => NewTaskAttemptContext, TaskAttemptID => NewTaskAttemptID, 
TaskType}
+import org.apache.hadoop.mapreduce.task.{TaskAttemptContextImpl => 
NewTaskAttemptContextImpl}
+
+import org.apache.spark.{SerializableWritable, SparkException, TaskContext}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.executor.OutputMetrics
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
+import org.apache.spark.rdd.{HadoopRDD, RDD}
+import org.apache.spark.util.{SerializableConfiguration, 
SerializableJobConf, Utils}
+
+/**
+ * A helper object that saves an RDD using a Hadoop OutputFormat
+ * (from the old mapred API).
+ */
+private[spark]
+object SparkHadoopWriter extends Logging {
+  import SparkHadoopWriterUtils._
+
+  /**
+   * Basic work flow of this command is:
+   * 1. Driver side setup, prepare the data source and hadoop 
configuration for the write job to
+   *be issued.
+   * 2. Issues a write job consists of one or more executor side tasks, 
each of which writes all
+   *rows within an RDD partition.
+   * 3. If no exception is thrown in a task, commits that task, otherwise 
aborts that task;  If any
+   *exception is thrown during task commitment, also aborts that task.
+   * 4. If all tasks are committed, commit the job, otherwise aborts the 
job;  If any exception is
+   *thrown during job commitment, also aborts the job.
+   */
+  def write[K, V: ClassTag](
+  rdd: RDD[(K, V)],
+  config: HadoopWriteConfigUtil[K, V]): Unit = {
+// Extract context and configuration from RDD.
+val sparkContext = rdd.context
+val stageId = rdd.id
+val sparkConf = rdd.conf
+
+// Set up a job.
+val jobTrackerId = createJobTrackerID(new Date())
+val jobContext = config.createJobContext(jobTrackerId, stageId)
+config.initOutputFormat(jobContext)
+
+// Assert the output format/key/value class is set in JobConf.
+config.assertConf()
+
+if (isOutputSpecValidationEnabled(sparkConf)) {
+  // FileOutputFormat ignores the filesystem parameter
+  config.checkOutputSpecs(jobContext)
+}
+
+val committer = config.createCommitter(stageId)
+committer.setupJob(jobContext)
+
+// When speculation is on and output committer class name contains 
"Direct", we should warn
+// users that they may loss data if they are using a direct output 
committer.
+// There is an example in 
https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
+// result of using direct output committer with speculation enabled.
+if (isSpeculationEnabled(sparkConf) && committer.isDirectOutput) {
+  val warningMessage =
+s"$committer may be an output committer that writes data directly 
to " +
+  "the final location. Because speculation is enabled, this output 
committer may " +
+  "cause data loss (see the case in SPARK-10063). If possible, 
please use an output " +
+  "committer that does not have this behavior (e.g. 
FileOutputCommitter)."
+  logWarning(warningMessage)
+}
+
+// Try to write a

[GitHub] spark issue #15861: [SPARK-18294][CORE] Implement commit protocol to support...

2016-11-29 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15861
  
@jiangxb1987 I did a single-pass review - particularly given the 
similarities in both the codepaths and the classnames, I will need to go over 
it again to ensure we don't miss anything.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15954: [SPARK-18516][SQL] Split state and progress in streaming

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15954
  
**[Test build #69355 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69355/consoleFull)**
 for PR 15954 at commit 
[`69d9b4a`](https://github.com/apache/spark/commit/69d9b4a1de6c7f07bb2153b02d3ffabbb87eaac1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15954: [SPARK-18516][SQL] Split state and progress in st...

2016-11-29 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r90130320
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala ---
@@ -64,23 +68,26 @@ trait StreamingQuery {
 
   /**
* Returns the current status of the query.
+   *
* @since 2.0.2
*/
   def status: StreamingQueryStatus
 
   /**
-   * Returns current status of all the sources.
-   * @since 2.0.0
+   * Returns an array of the most recent [[StreamingQueryProgress]] 
updates for this query.
+   * The number of progress updates retained for each stream is configured 
by Spark session
+   * configuration `spark.sql.streaming.numRecentProgresses`.
+   *
+   * @since 2.1.0
*/
-  @deprecated("use status.sourceStatuses", "2.0.2")
-  def sourceStatuses: Array[SourceStatus]
+  def recentProgresses: Array[StreamingQueryProgress]
--- End diff --

Are these for the last `n` triggers? Or are they the last `n` instantaneous 
progress updates, e.g. finished reading from a source, etc.?
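
A hedged, spark-shell-style usage sketch of the API as proposed here (it assumes
a Spark build where the `rate` source exists and where both the config key and
the `recentProgresses` name survived review unchanged):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("recent-progress-sketch")
      .getOrCreate()

    // Config key quoted from the scaladoc above; it bounds how many updates are retained per query.
    spark.conf.set("spark.sql.streaming.numRecentProgresses", "50")

    val query = spark.readStream.format("rate").load()
      .writeStream.format("console").start()

    Thread.sleep(5000)
    // Each retained entry is expected to describe one completed trigger.
    query.recentProgresses.foreach(p => println(p))
    query.stop()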


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16066: [SPARK-18632][SQL] AggregateFunction should not implemen...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16066
  
**[Test build #69356 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69356/consoleFull)**
 for PR 16066 at commit 
[`1246792`](https://github.com/apache/spark/commit/1246792cdcf96a4eb1ecfa158aaf6861269735a8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15255: [SPARK-17680] [SQL] [TEST] Added a Testcase for Verifyin...

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15255
  
**[Test build #69357 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69357/consoleFull)**
 for PR 15255 at commit 
[`57817a1`](https://github.com/apache/spark/commit/57817a1c96c9577725ee8766834b20b06adfe521).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16048: [DO_NOT_MERGE]Test kafka deletion

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16048
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16048: [DO_NOT_MERGE]Test kafka deletion

2016-11-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16048
  
**[Test build #69345 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69345/consoleFull)**
 for PR 16048 at commit 
[`9ff2ed4`](https://github.com/apache/spark/commit/9ff2ed48062fbd7d9c92749ead54b62bcc9ee4ce).
 * This patch **fails executing the `dev/run-tests` script**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16048: [DO_NOT_MERGE]Test kafka deletion

2016-11-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16048
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69345/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15954: [SPARK-18516][SQL] Split state and progress in st...

2016-11-29 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r90132677
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
 ---
@@ -59,13 +62,20 @@ class StreamingQueryManager private[sql] (sparkSession: 
SparkSession) {
   /**
* Returns the query if there is an active query with the given id, or 
null.
*
-   * @since 2.0.0
+   * @since 2.1.0
*/
-  def get(id: Long): StreamingQuery = activeQueriesLock.synchronized {
+  def get(id: UUID): StreamingQuery = activeQueriesLock.synchronized {
 activeQueries.get(id).orNull
   }
 
   /**
+   * Returns the query if there is an active query with the given id, or 
null.
+   *
+   * @since 2.1.0
+   */
+  def get(id: String): StreamingQuery = get(UUID.fromString(id))
--- End diff --

with this I guess we can't provide APIs for `get(name: String)`
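
A minimal sketch of the clash behind that remark: overloads are distinguished by
parameter types only, not parameter names, so a name-based variant cannot sit
next to this one on the same class:

    class StreamingQueryManagerSketch {
      def get(id: String): Int = 0
      // A name-based lookup with the same parameter type cannot be added:
      // def get(name: String): Int = 1   // error: method get is defined twice
    }

A differently named method (say, a hypothetical `getByName(name: String)`) would
be one way to offer both lookups.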


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


