[GitHub] spark pull request #16979: [SPARK-19617][SS]Fix the race condition when star...

2017-02-21 Thread zsxwing
Github user zsxwing closed the pull request at:

https://github.com/apache/spark/pull/16979





[GitHub] spark issue #16979: [SPARK-19617][SS]Fix the race condition when starting an...

2017-02-21 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/16979
  
Thanks. Merging to 2.1.





[GitHub] spark pull request #17023: [SPARK-19695][SQL] Throw an exception if a `colum...

2017-02-21 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/17023

[SPARK-19695][SQL] Throw an exception if a `columnNameOfCorruptRecord` field violates requirements

## What changes were proposed in this pull request?
This PR follows up on #16928 and fixes the JSON behaviour along the same lines as the CSV one.

## How was this patch tested?
Added tests in `JsonSuite`.
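For context, a minimal sketch of the requirement this PR enforces (the schema, session, and path below are hypothetical):

```scala
// Assumes a SparkSession `spark` and an input path; illustrative only.
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("a", IntegerType)
  .add("_corrupt_record", StringType, nullable = true) // must be a nullable string field

val df = spark.read
  .schema(schema)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/path/to/input.json")
// With this PR, declaring "_corrupt_record" as e.g. IntegerType makes the read
// throw an exception up front instead of misbehaving later.
```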

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark SPARK-19695

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17023.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17023









[GitHub] spark pull request #17022: Aqp 271

2017-02-21 Thread ahshahid
GitHub user ahshahid opened a pull request:

https://github.com/apache/spark/pull/17022

Aqp 271

Looks like in Spark 2.0 the optimization that represents repeated aggregates with a single aggregate was broken by the passing of a `resultId: ExprId` in the constructor of `AggregateExpression`. For a query such as

    select avg(x), y from tab group by y having avg(x) > 0

ideally only one aggregate should be evaluated. But because the `resultId` passed in during the analysis phase differs for each occurrence, the `distinct` on the aggregate expressions does not collapse them into one aggregate.
The change implements the `equals`/`hashCode` methods of the aggregate expression instead of relying on the Scala case-class-generated methods, and ensures that any dependency on the aggregate that gets removed is correctly rewritten. The test for this bug is present in `CommonBugTest` of AQP.
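A toy, self-contained sketch of the failure mode (the classes below are illustrative stand-ins, not the actual `AggregateExpression`):

```scala
// Stand-in for an aggregate whose case-class equals/hashCode include resultId.
case class Agg(func: String, column: String, resultId: Long)

// Fix sketch: compare only the semantic fields, ignoring the per-instance id.
case class AggFixed(func: String, column: String, resultId: Long) {
  override def equals(o: Any): Boolean = o match {
    case a: AggFixed => func == a.func && column == a.column
    case _ => false
  }
  override def hashCode(): Int = (func, column).hashCode()
}

object DistinctDemo extends App {
  private var counter = 0L
  private def freshId(): Long = { counter += 1; counter }

  // Two occurrences of avg(x), each analyzed with a fresh resultId.
  println(Seq(Agg("avg", "x", freshId()), Agg("avg", "x", freshId())).distinct.size)           // 2
  println(Seq(AggFixed("avg", "x", freshId()), AggFixed("avg", "x", freshId())).distinct.size) // 1
}
```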





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/SnappyDataInc/spark AQP-271

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17022.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17022









[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...

2017-02-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/16970#discussion_r102380326
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/DeduplicationSuite.scala ---
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.sql.catalyst.streaming.InternalOutputModes._
+import org.apache.spark.sql.execution.streaming.MemoryStream
+import org.apache.spark.sql.execution.streaming.state.StateStore
+import org.apache.spark.sql.functions._
+
+class DeduplicationSuite extends StateStoreMetricsTest with BeforeAndAfterAll {
+
+  import testImplicits._
+
+  override def afterAll(): Unit = {
+    super.afterAll()
+    StateStore.stop()
+  }
+
+  test("deduplication with all columns") {
+    val inputData = MemoryStream[String]
+    val result = inputData.toDS().dropDuplicates()
+
+    testStream(result, Append)(
+      AddData(inputData, "a"),
+      CheckLastBatch("a"),
+      assertNumStateRows(total = 1, updated = 1),
+      AddData(inputData, "a"),
+      CheckLastBatch(),
+      assertNumStateRows(total = 1, updated = 0),
+      AddData(inputData, "b"),
+      CheckLastBatch("b"),
+      assertNumStateRows(total = 2, updated = 1)
+    )
+  }
+
+  test("deduplication with some columns") {
+    val inputData = MemoryStream[(String, Int)]
+    val result = inputData.toDS().dropDuplicates("_1")
+
+    testStream(result, Append)(
+      AddData(inputData, "a" -> 1),
+      CheckLastBatch("a" -> 1),
+      assertNumStateRows(total = 1, updated = 1),
+      AddData(inputData, "a" -> 2), // Dropped
+      CheckLastBatch(),
+      assertNumStateRows(total = 1, updated = 0),
+      AddData(inputData, "b" -> 1),
+      CheckLastBatch("b" -> 1),
+      assertNumStateRows(total = 2, updated = 1)
+    )
+  }
+
+  test("multiple deduplications") {
+    val inputData = MemoryStream[(String, Int)]
+    val result = inputData.toDS().dropDuplicates().dropDuplicates("_1")
+
+    testStream(result, Append)(
+      AddData(inputData, "a" -> 1),
+      CheckLastBatch("a" -> 1),
+      assertNumStateRows(total = Seq(1L, 1L), updated = Seq(1L, 1L)),
+
+      AddData(inputData, "a" -> 2), // Dropped from the second `dropDuplicates`
+      CheckLastBatch(),
+      assertNumStateRows(total = Seq(1L, 2L), updated = Seq(0L, 1L)),
+
+      AddData(inputData, "b" -> 1),
+      CheckLastBatch("b" -> 1),
+      assertNumStateRows(total = Seq(2L, 3L), updated = Seq(1L, 1L))
+    )
+  }
+
+  test("deduplication with watermark") {
+    val inputData = MemoryStream[Int]
+    val result = inputData.toDS()
+      .withColumn("eventTime", $"value".cast("timestamp"))
+      .withWatermark("eventTime", "10 seconds")
+      .dropDuplicates()
+      .select($"eventTime".cast("long").as[Long])
+
+    testStream(result, Append)(
+      AddData(inputData, (1 to 5).flatMap(_ => (10 to 15)): _*),
+      CheckLastBatch(10 to 15: _*),
+      assertNumStateRows(total = 6, updated = 6),
+
+      AddData(inputData, 25), // Advance watermark to 15 seconds
+      CheckLastBatch(25),
+      assertNumStateRows(total = 7, updated = 1),
+
+      AddData(inputData, 25), // Drop states less than watermark
+      CheckLastBatch(),
+      assertNumStateRows(total = 1, updated = 0),
+
+      AddData(inputData, 10), // Should not emit anything as data less than watermark
+      CheckLastBatch(),
+      assertNumStateRows(total = 1, updated = 0),
+
+      AddData(inputData, 45), // Advance watermark to 35 seconds
+      CheckLastBatch(45),
+      assertNumStateRows(total = 2, updated = 1),
+
+      AddData(inputData, 45), // Drop states less than watermark
+

[GitHub] spark issue #17013: [SPARK-19666][SQL] Skip a property without getter in Jav...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17013
  
**[Test build #73258 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73258/testReport)**
 for PR 17013 at commit 
[`ed686fa`](https://github.com/apache/spark/commit/ed686fae82fcd8984817615955ff5b2caf24ea08).





[GitHub] spark issue #17013: [SPARK-19666][SQL] Skip a property without getter in Jav...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17013
  
**[Test build #73257 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73257/testReport)**
 for PR 17013 at commit 
[`5808d71`](https://github.com/apache/spark/commit/5808d71c5284ce9fcaaaffb7ada6d91e34e0b29e).





[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...

2017-02-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/16970#discussion_r102379272
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala ---
@@ -321,3 +327,66 @@ case class MapGroupsWithStateExec(
     }
   }
 }
+
+
+/** Physical operator for executing streaming Deduplication. */
+case class DeduplicationExec(
+    keyExpressions: Seq[Attribute],
+    child: SparkPlan,
+    stateId: Option[OperatorStateId] = None,
+    eventTimeWatermark: Option[Long] = None)
+  extends UnaryExecNode with StateStoreWriter with WatermarkSupport {
+
+  /** Distribute by grouping attributes */
+  override def requiredChildDistribution: Seq[Distribution] =
+    ClusteredDistribution(keyExpressions) :: Nil
+
+  override protected def doExecute(): RDD[InternalRow] = {
+    metrics // force lazy init at driver
+
+    child.execute().mapPartitionsWithStateStore(
+      getStateId.checkpointLocation,
+      operatorId = getStateId.operatorId,
+      storeVersion = getStateId.batchId,
+      keyExpressions.toStructType,
+      child.output.toStructType,
+      sqlContext.sessionState,
+      Some(sqlContext.streams.stateStoreCoordinator)) { (store, iter) =>
+      val getKey = GenerateUnsafeProjection.generate(keyExpressions, child.output)
+      val numOutputRows = longMetric("numOutputRows")
+      val numTotalStateRows = longMetric("numTotalStateRows")
+      val numUpdatedStateRows = longMetric("numUpdatedStateRows")
+
+
+      val baseIterator = watermarkPredicate match {
+        case Some(predicate) => iter.filter((row: InternalRow) => !predicate.eval(row))
+        case None => iter
+      }
+
+      while (baseIterator.hasNext) {
+        val row = baseIterator.next().asInstanceOf[UnsafeRow]
+        val key = getKey(row)
+        val value = store.get(key)
+        if (value.isEmpty) {
+          store.put(key.copy(), row.copy())
--- End diff --

How about `UnsafeRow.createFromByteArray(1, 1)` (a 1-byte array) rather than copying it completely?
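A sketch of how that suggestion could slot into the loop above (hedged; it uses the Spark-internal APIs from the diff, and the final patch may differ):

```scala
// Deduplication only needs key presence, so the stored value can be a shared
// 1-byte dummy row instead of a full copy of the input row.
val EMPTY_ROW = UnsafeRow.createFromByteArray(1, 1)

while (baseIterator.hasNext) {
  val row = baseIterator.next().asInstanceOf[UnsafeRow]
  val key = getKey(row)
  if (store.get(key).isEmpty) {
    store.put(key.copy(), EMPTY_ROW) // no row.copy() for the value
    numUpdatedStateRows += 1
    // ... emit `row` downstream ...
  }
}
```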





[GitHub] spark issue #17013: [SPARK-19666][SQL] Improve error message for JavaBean wi...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17013
  
**[Test build #73256 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73256/testReport)**
 for PR 17013 at commit 
[`91cee26`](https://github.com/apache/spark/commit/91cee264e936421e33ecfb0cd5d3c3b4a474d4f2).





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102378915
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
           false
         }
       } else {
-        memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
+        val memoryMode = level.memoryMode
+        memoryStore.putBytes(blockId, size, memoryMode, () => {
+          if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

But I am still wondering if we need to copy like this here. Right, it is defensive, but since `BlockManager` is internal to Spark, if none of its callers modify/release the byte buffer passed in, doesn't this defensive copy only cause a performance regression?







[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...

2017-02-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/16970#discussion_r102378877
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1996,7 +1996,7 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.flatMap { colName =>
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>
--- End diff --

was this a bug with batch queries as well? and what would the result be 
without this fix?





[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...

2017-02-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/16970#discussion_r102378594
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala ---
@@ -129,6 +156,33 @@ class UnsupportedOperationsSuite extends SparkFunSuite {
     outputMode = Complete,
     expectedMsgs = Seq("(map/flatMap)GroupsWithState"))
 
+  assertSupportedInStreamingPlan(
+    "mapGroupsWithState - mapGroupsWithState on batch relation inside streaming relation",
+    MapGroupsWithState(null, att, att, Seq(att), Seq(att), att, att, Seq(att), batchRelation),
+    outputMode = Append
+  )
+
+  // Deduplication: Not supported after a streaming aggregation
+  assertSupportedInStreamingPlan(
+    "Deduplication - Deduplication on streaming relation before aggregation",
+    Aggregate(
+      Seq(attributeWithWatermark),
+      aggExprs("c"),
+      Deduplication(Seq(att), streamRelation, streaming = true)),
+    outputMode = Append)
+
+  assertNotSupportedInStreamingPlan(
+    "Deduplication - Deduplication on streaming relation after aggregation",
+    Deduplication(Seq(att), Aggregate(Nil, aggExprs("c"), streamRelation), streaming = true),
+    outputMode = Complete,
+    expectedMsgs = Seq("dropDuplicates"))
+
+  assertSupportedInStreamingPlan(
+    "Deduplication - Deduplication on batch relation inside streaming relation",
--- End diff --

nit: say "inside a streaming query"; it sounds weird otherwise.





[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...

2017-02-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/16970#discussion_r102378522
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala ---
@@ -129,6 +156,33 @@ class UnsupportedOperationsSuite extends SparkFunSuite {
     outputMode = Complete,
     expectedMsgs = Seq("(map/flatMap)GroupsWithState"))
 
+  assertSupportedInStreamingPlan(
+    "mapGroupsWithState - mapGroupsWithState on batch relation inside streaming relation",
+    MapGroupsWithState(null, att, att, Seq(att), Seq(att), att, att, Seq(att), batchRelation),
+    outputMode = Append
+  )
+
+  // Deduplication: Not supported after a streaming aggregation
--- End diff --

nit: Change this comment to just `// Deduplication` to reflect the whole 
subsection 





[GitHub] spark issue #17021: [SPARK-19694][ML] Add missing 'setTopicDistributionCol' ...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17021
  
**[Test build #73255 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73255/testReport)**
 for PR 17021 at commit 
[`367a681`](https://github.com/apache/spark/commit/367a681ffa457f906a1f54cced4b8f8219e2f888).





[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...

2017-02-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/16970#discussion_r102378336
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1143,6 +1144,24 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
 }
 
 /**
+ * Replaces logical [[Deduplication]] operator with an [[Aggregate]] operator.
+ */
+object ReplaceDeduplicationWithAggregate extends Rule[LogicalPlan] {
--- End diff --

`ReplaceDeduplicateWithAggregate`. see comment below.
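For readers following along, a hedged sketch of what such a rule does (a simplified reconstruction, not the exact patch): deduplication over a set of keys is rewritten as a group-by over those keys, taking `first(...)` of every non-key column.

```scala
// Sketch; assumes the usual catalyst imports (Rule, LogicalPlan, Aggregate,
// Alias, First) and the `Deduplication(keys, child, streaming)` node from this PR.
object ReplaceDeduplicationWithAggregateSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Only the batch case is rewritten; streaming keeps a stateful operator.
    case Deduplication(keys, child, streaming) if !streaming =>
      val keyExprIds = keys.map(_.exprId)
      val aggCols = child.output.map { attr =>
        if (keyExprIds.contains(attr.exprId)) {
          attr // key columns pass through as grouping columns
        } else {
          Alias(new First(attr).toAggregateExpression(), attr.name)(attr.exprId)
        }
      }
      Aggregate(keys, aggCols, child)
  }
}
```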





[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...

2017-02-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/16970#discussion_r102378293
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
@@ -869,3 +869,12 @@ case object OneRowRelation extends LeafNode {
   override def output: Seq[Attribute] = Nil
   override def computeStats(conf: CatalystConf): Statistics = Statistics(sizeInBytes = 1)
 }
+
+/** A logical plan for `dropDuplicates`. */
+case class Deduplication(
--- End diff --

Most names are "verbs": aggregate, project, intersect. I think it's best to name this "Deduplicate".





[GitHub] spark issue #16979: [SPARK-19617][SS]Fix the race condition when starting an...

2017-02-21 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/16979
  
LGTM.





[GitHub] spark pull request #17021: [SPARK-19694][ML] Add missing 'setTopicDistributi...

2017-02-21 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/17021

[SPARK-19694][ML] Add missing 'setTopicDistributionCol' for LDAModel

## What changes were proposed in this pull request?
Add missing 'setTopicDistributionCol' for LDAModel
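A hedged usage sketch of the added setter (`dataset` is a hypothetical DataFrame with a "features" vector column):

```scala
import org.apache.spark.ml.clustering.LDA

val model = new LDA().setK(3).setMaxIter(10).fit(dataset)

// The setter this PR adds to LDAModel (previously available only on the estimator):
model.setTopicDistributionCol("myTopicDistribution")
val transformed = model.transform(dataset) // topic distributions in "myTopicDistribution"
```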
## How was this patch tested?
existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark lda_outputCol

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17021.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17021


commit 367a681ffa457f906a1f54cced4b8f8219e2f888
Author: Zheng RuiFeng 
Date:   2017-02-22T03:39:10Z

create pr







[GitHub] spark pull request #17004: [SPARK-19670] [SQL] [TEST] Enable Bucketed Table ...

2017-02-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17004





[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16928
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16928
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73250/
Test FAILed.





[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16928
  
**[Test build #73250 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73250/testReport)**
 for PR 16928 at commit 
[`448e6fe`](https://github.com/apache/spark/commit/448e6fe9f20f11c1171bcaeebf27620fd2f93ac3).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17004: [SPARK-19670] [SQL] [TEST] Enable Bucketed Table Reading...

2017-02-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17004
  
Thanks! Merging to master.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102377114
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
           false
         }
       } else {
-        memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
+        val memoryMode = level.memoryMode
+        memoryStore.putBytes(blockId, size, memoryMode, () => {
+          if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

Oh. nvm. That `duplicate` doesn't actually copy buffer content...
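A quick, self-contained demonstration of that point: `duplicate()` gives independent position/limit/mark but shares the backing memory, so no content is copied.

```scala
import java.nio.ByteBuffer

val buf = ByteBuffer.allocateDirect(4)
val dup = buf.duplicate()       // new buffer object, same backing memory
buf.put(0, 42.toByte)           // absolute write through the original
assert(dup.get(0) == 42.toByte) // visible through the "duplicate"
```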





[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16928
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73249/
Test FAILed.





[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16928
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17020: [SPARK-19693][SQL] Make the SET mapreduce.job.reduces au...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17020
  
**[Test build #73254 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73254/testReport)**
 for PR 17020 at commit 
[`7948466`](https://github.com/apache/spark/commit/79484664490cb24ae3cf51667758902edfb6b896).





[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16928
  
**[Test build #73249 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73249/testReport)**
 for PR 16928 at commit 
[`8e83522`](https://github.com/apache/spark/commit/8e8352259d9eac1eb4ac973811379af7dfd6ed83).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102376897
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
           false
         }
       } else {
-        memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
+        val memoryMode = level.memoryMode
+        memoryStore.putBytes(blockId, size, memoryMode, () => {
+          if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

No. I meant: do we actually need the defensive copy here? All usages of `putBytes` across Spark duplicate the byte buffer before passing it in. Is there any missing case where we should do this defensive copy?





[GitHub] spark pull request #17020: [SPARK-19693][SQL] Make the SET mapreduce.job.red...

2017-02-21 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/17020

[SPARK-19693][SQL] Make the SET mapreduce.job.reduces automatically 
converted to spark.sql.shuffle.partitions

## What changes were proposed in this pull request?
Make `SET mapreduce.job.reduces` automatically convert to `spark.sql.shuffle.partitions`; this mirrors the existing handling of `SET mapred.reduce.tasks`.
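A hedged sketch of the intended behaviour (mirroring how `mapred.reduce.tasks` is already handled; exact log output may differ):

```scala
// Assumes a SparkSession `spark`.
spark.sql("SET mapreduce.job.reduces=10") // with this PR: treated as an alias
println(spark.conf.get("spark.sql.shuffle.partitions")) // expected: 10
```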

## How was this patch tested?

unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-19693

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17020.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17020


commit c29ca8a5158a5caab72a866fb2043d88683fb44a
Author: Yuming Wang 
Date:   2017-02-20T10:57:05Z

SET mapreduce.job.reduces automatically converted to 
spark.sql.shuffle.partitions

commit 79484664490cb24ae3cf51667758902edfb6b896
Author: Yuming Wang 
Date:   2017-02-22T03:23:12Z

Make test case for better readability







[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102376170
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
       try {
         replicate(blockId, bytesToReplicate, level, remoteClassTag)
       } finally {
-        bytesToReplicate.dispose()
+        if (!level.useOffHeap) {
--- End diff --

Sounds good.





[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17014
  
**[Test build #73253 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73253/testReport)**
 for PR 17014 at commit 
[`a3f3bb6`](https://github.com/apache/spark/commit/a3f3bb66acbbd670369eb9939f73a2968ecc7649).





[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16970
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16970
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73247/
Test PASSed.





[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16970
  
**[Test build #73247 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73247/testReport)**
 for PR 16970 at commit 
[`78dfdfe`](https://github.com/apache/spark/commit/78dfdfe20b6c7f788e5d289ecc63c325679ccd44).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class Deduplication(`





[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16744
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16744
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73244/
Test PASSed.





[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16744
  
**[Test build #73244 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73244/testReport)**
 for PR 16744 at commit 
[`d15affb`](https://github.com/apache/spark/commit/d15affb9d263c6201370f40b463b16ffd63c9ca2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17014
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73252/
Test FAILed.





[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17014
  
**[Test build #73252 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73252/testReport)**
 for PR 17014 at commit 
[`b81eeb7`](https://github.com/apache/spark/commit/b81eeb7c955fbba941237efa7760e41e190ef9ea).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17014
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #17013: [SPARK-19666][SQL] Improve error message for Java...

2017-02-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17013#discussion_r102374176
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala ---
@@ -123,7 +123,11 @@ object JavaTypeInference {
     val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
     val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
--- End diff --

Ah, sure. Then let me ignore a property without a getter in `JavaTypeInference.inferDataType`, and allow empty properties in `JavaTypeInference.serializerFor`/`deserializerFor`.
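Roughly, the filtering being discussed could look like this in `inferDataType` (a sketch under that assumption, not the final patch):

```scala
// Skip write-only properties when inferring a schema, instead of failing
// later on a missing read method. `beanInfo` as in the diff above.
val properties = beanInfo.getPropertyDescriptors
  .filterNot(_.getName == "class")
  .filter(_.getReadMethod != null) // ignore properties that have no getter
```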





[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16744
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16744
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73245/
Test FAILed.





[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17014
  
**[Test build #73252 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73252/testReport)**
 for PR 17014 at commit 
[`b81eeb7`](https://github.com/apache/spark/commit/b81eeb7c955fbba941237efa7760e41e190ef9ea).





[GitHub] spark issue #16946: [SPARK-19554][UI,YARN] Allow SHS URL to be used for trac...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16946
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73242/
Test PASSed.





[GitHub] spark issue #16946: [SPARK-19554][UI,YARN] Allow SHS URL to be used for trac...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16946
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16744
  
**[Test build #73245 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73245/testReport)**
 for PR 16744 at commit 
[`b4bf3a8`](https://github.com/apache/spark/commit/b4bf3a8a74eaeb67aee60c73e24c8a69b145006a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16946: [SPARK-19554][UI,YARN] Allow SHS URL to be used for trac...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16946
  
**[Test build #73242 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73242/testReport)**
 for PR 16946 at commit 
[`5aef8eb`](https://github.com/apache/spark/commit/5aef8eb569a5422be0c3c18906a04e173782c2ba).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16290: [SPARK-18817] [SPARKR] [SQL] Set default warehous...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16290#discussion_r102373053
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -376,6 +377,12 @@ sparkR.session <- function(
     overrideEnvs(sparkConfigMap, paramMap)
   }
 
+  # NOTE(shivaram): Set default warehouse dir to tmpdir to meet CRAN requirements
+  # See SPARK-18817 for more details
+  if (!exists("spark.sql.default.warehouse.dir", envir = sparkConfigMap)) {
--- End diff --

actually we can, `SessionState.conf.settings` contains all the user-set entries.





[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102372757
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -45,24 +46,41 @@ private[csv] class UnivocityParser(
   // A `ValueConverter` is responsible for converting the given value to a desired type.
   private type ValueConverter = String => Any
 
+  private val corruptFieldIndex = schema.getFieldIndex(options.columnNameOfCorruptRecord)
+  corruptFieldIndex.foreach { corrFieldIndex =>
+    require(schema(corrFieldIndex).dataType == StringType,
+      "A field for corrupt records must have a string type")
+    require(schema(corrFieldIndex).nullable, "A field for corrupt must be nullable")
+  }
+
+  private val inputSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord))
+  CSVUtils.verifySchema(inputSchema)
+
   private val valueConverters =
-    schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
+    inputSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
 
   private val parser = new CsvParser(options.asParserSettings)
 
   private var numMalformedRecords = 0
 
   private val row = new GenericInternalRow(requiredSchema.length)
 
-  private val indexArr: Array[Int] = {
+  private val indexArr: Array[(Int, Int)] = {
--- End diff --

@HyukjinKwon Sorry, my bad. We use the second index here:
https://github.com/apache/spark/pull/16928/files#diff-d19881aceddcaa5c60620fdcda99b4c4R202.
I added a test for this case in the latest commit.





[GitHub] spark issue #16988: [SPARK-19658] [SQL] Set NumPartitions of RepartitionByEx...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16988
  
How about we set the `numPartitions` when we build `RepartitionByExpression`? The parser can also access the SQLConf.
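
A hedged sketch of that suggestion; the constructor shape is an assumption about the logical plan node, and `conf`, `partitionExpressions`, and `child` are stand-ins:

```scala
// Resolve the partition count from the session conf at construction time,
// so downstream planning never sees an unset value.
val numPartitions = conf.numShufflePartitions
RepartitionByExpression(partitionExpressions, child, numPartitions)
```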



[GitHub] spark pull request #17013: [SPARK-19666][SQL] Improve error message for Java...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17013#discussion_r102372119
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
 ---
@@ -123,7 +123,11 @@ object JavaTypeInference {
 val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
 val properties = 
beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
--- End diff --

I mean `JavaTypeInference`. Why do we throw an exception for a bean property without a getter, instead of simply not treating it as a property?



[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16928
  
**[Test build #73251 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73251/testReport)**
 for PR 16928 at commit 
[`619094a`](https://github.com/apache/spark/commit/619094a4dbb0e400daac0d94905b40df07b650b6).



[GitHub] spark pull request #16979: [SPARK-19617][SS]Fix the race condition when star...

2017-02-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/16979#discussion_r102371334
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
 ---
@@ -63,8 +63,34 @@ class HDFSMetadataLog[T <: AnyRef : 
ClassTag](sparkSession: SparkSession, path:
   val metadataPath = new Path(path)
   protected val fileManager = createFileManager()
 
-  if (!fileManager.exists(metadataPath)) {
-fileManager.mkdirs(metadataPath)
+  runUninterruptiblyIfLocal {
+if (!fileManager.exists(metadataPath)) {
+  fileManager.mkdirs(metadataPath)
+}
+  }
+
+  private def runUninterruptiblyIfLocal[T](body: => T): T = {
+if (fileManager.isLocalFileSystem && 
Thread.currentThread.isInstanceOf[UninterruptibleThread]) {
--- End diff --

So we are changing this to a best-effort attempt, rather than the 
try-and-explicitly-fail attempt, in the case of a local file system... right?



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r102371290
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -690,10 +696,10 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   "different from the schema when this table was created by Spark 
SQL" +
   s"(${schemaFromTableProps.simpleString}). We have to fall back 
to the table schema " +
   "from Hive metastore which is not case preserving.")
-hiveTable
+hiveTable.copy(schemaPreservesCase = false)
--- End diff --

This is not only about case preservation; maybe we should leave it unchanged.



[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...

2017-02-21 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16989
  
@squito 
Thanks a lot for your comments : )
Yes, there should be a design doc for discussion. I will prepare one and post a PDF to the JIRA.



[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment

2017-02-21 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/16949
  
cc @srowen and @vanzin also.



[GitHub] spark pull request #17011: [SPARK-19676][CORE] Flaky test: FsHistoryProvider...

2017-02-21 Thread uncleGen
Github user uncleGen closed the pull request at:

https://github.com/apache/spark/pull/17011



[GitHub] spark pull request #17013: [SPARK-19666][SQL] Improve error message for Java...

2017-02-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17013#discussion_r102370203
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
 ---
@@ -123,7 +123,11 @@ object JavaTypeInference {
 val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
 val properties = 
beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
--- End diff --

I am sorry for the long comment; it may be a bit messy. So, there are two code paths:

- `Encoders.bean(...)`: a non-empty bean is required & both getter and setter are required

- `JavaTypeInference.inferDataType`: an empty one is OK & a getter alone is OK

Could I please ask which case you mean, just to double-check?




[GitHub] spark issue #16946: [SPARK-19554][UI,YARN] Allow SHS URL to be used for trac...

2017-02-21 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/16946
  
On vacation; back next Monday and will review.



[GitHub] spark issue #17011: [SPARK-19676][CORE] Flaky test: FsHistoryProviderSuite.S...

2017-02-21 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/17011
  
Don't run things as root. Please close this PR.



[GitHub] spark issue #17019: [SPARK-19652][UI] Do auth checks for REST API access (br...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17019
  
Merged build finished. Test FAILed.



[GitHub] spark issue #17019: [SPARK-19652][UI] Do auth checks for REST API access (br...

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17019
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73243/
Test FAILed.



[GitHub] spark issue #17019: [SPARK-19652][UI] Do auth checks for REST API access (br...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17019
  
**[Test build #73243 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73243/testReport)**
 for PR 17019 at commit 
[`21006ff`](https://github.com/apache/spark/commit/21006ff4a9d01c6da02b48d6fadac9b3d9a0273a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16928
  
**[Test build #73250 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73250/testReport)**
 for PR 16928 at commit 
[`448e6fe`](https://github.com/apache/spark/commit/448e6fe9f20f11c1171bcaeebf27620fd2f93ac3).



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r102369921
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ---
@@ -181,7 +186,8 @@ case class CatalogTable(
 viewText: Option[String] = None,
 comment: Option[String] = None,
 unsupportedFeatures: Seq[String] = Seq.empty,
-tracksPartitionsInCatalog: Boolean = false) {
+tracksPartitionsInCatalog: Boolean = false,
+schemaPreservesCase: Boolean = true) {
--- End diff --

and when we fix the schema and try to write it back, remember to remove 
this property first.



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r102369793
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ---
@@ -181,7 +186,8 @@ case class CatalogTable(
 viewText: Option[String] = None,
 comment: Option[String] = None,
 unsupportedFeatures: Seq[String] = Seq.empty,
-tracksPartitionsInCatalog: Boolean = false) {
+tracksPartitionsInCatalog: Boolean = false,
+schemaPreservesCase: Boolean = true) {
--- End diff --

Shall we create a special table property? `object CatalogTable` defines some specific properties for views, and we can follow that. If we keep adding more parameters, we may blow up `CatalogTable` one day...
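
A minimal sketch of the property-based alternative, assuming a hypothetical key in the style of the view-property constants that `object CatalogTable` already defines:

```scala
// Hypothetical property key; the real name would live in object CatalogTable.
val SCHEMA_PRESERVES_CASE = "spark.sql.schemaPreservesCase"

// Record the flag in the property map instead of adding a constructor parameter.
val updated = hiveTable.copy(
  properties = hiveTable.properties + (SCHEMA_PRESERVES_CASE -> "false"))
```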



[GitHub] spark issue #17015: [SPARK-19678][SQL] remove MetastoreRelation

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17015
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73246/
Test FAILed.



[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102369500
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -96,31 +96,44 @@ class CSVFileFormat extends TextBasedFileFormat with 
DataSourceRegister {
   filters: Seq[Filter],
   options: Map[String, String],
   hadoopConf: Configuration): (PartitionedFile) => 
Iterator[InternalRow] = {
-val csvOptions = new CSVOptions(options, 
sparkSession.sessionState.conf.sessionLocalTimeZone)
-
+CSVUtils.verifySchema(dataSchema)
 val broadcastedHadoopConf =
   sparkSession.sparkContext.broadcast(new 
SerializableConfiguration(hadoopConf))
 
+val parsedOptions = new CSVOptions(
+  options,
+  sparkSession.sessionState.conf.sessionLocalTimeZone,
+  sparkSession.sessionState.conf.columnNameOfCorruptRecord)
+
+// Check a field requirement for corrupt records here to throw an 
exception in a driver side
+dataSchema.getFieldIndex(parsedOptions.columnNameOfCorruptRecord).map 
{ corruptFieldIndex =>
+  val f = dataSchema(corruptFieldIndex)
+  if (f.dataType != StringType || !f.nullable) {
--- End diff --

I removed the entry in `UnivocityParser`.



[GitHub] spark issue #17015: [SPARK-19678][SQL] remove MetastoreRelation

2017-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17015
  
Merged build finished. Test FAILed.



[GitHub] spark issue #17015: [SPARK-19678][SQL] remove MetastoreRelation

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17015
  
**[Test build #73246 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73246/testReport)**
 for PR 17015 at commit 
[`b61910e`](https://github.com/apache/spark/commit/b61910e8767a20cfe046bbb0fbf27305fceb7f32).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16928
  
**[Test build #73249 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73249/testReport)**
 for PR 16928 at commit 
[`8e83522`](https://github.com/apache/spark/commit/8e8352259d9eac1eb4ac973811379af7dfd6ed83).



[GitHub] spark issue #17011: [SPARK-19676][CORE] Flaky test: FsHistoryProviderSuite.S...

2017-02-21 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17011
  
@srowen @vanzin I think the root cause is that I tested it as the root user, so the file is always readable no matter what the access permissions are. IMHO, it is OK to add one extra access-permission check, since the enclosing code scope is catching the AccessControlException. Maybe I need to make the comment clearer. What's your opinion?



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r102369195
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -296,6 +296,21 @@ object SQLConf {
   .longConf
   .createWithDefault(250 * 1024 * 1024)
 
+  object HiveCaseSensitiveInferenceMode extends Enumeration {
--- End diff --

we can follow `PARQUET_COMPRESSION` and write the string literal directly.
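
A hedged sketch of that style; the builder chain follows the `PARQUET_COMPRESSION` pattern, while the key and the value set are assumptions based on the enumeration in the diff above:

```scala
// Validate the mode as a plain string conf instead of an Enumeration.
val HIVE_CASE_SENSITIVE_INFERENCE = buildConf("spark.sql.hive.caseSensitiveInferenceMode")
  .stringConf
  .transform(_.toUpperCase)
  .checkValues(Set("INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER"))
  .createWithDefault("INFER_AND_SAVE")
```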



[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...

2017-02-21 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/17014
  
@srowen @hhbyyh You are right. I will update this without breaking `train`. 
Thanks for pointing it out!



[GitHub] spark pull request #17013: [SPARK-19666][SQL] Improve error message for Java...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17013#discussion_r102368765
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
 ---
@@ -123,7 +123,11 @@ object JavaTypeInference {
 val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
 val properties = 
beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
--- End diff --

Ideally a property without a getter is not a bean property; let's fix the empty-property problem.
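
A minimal sketch of that fix, assuming the change is just to filter out descriptors with no read method (`beanClass` stands in for `typeToken.getRawType`):

```scala
import java.beans.Introspector

// A descriptor without a getter is simply not treated as a bean property,
// so no exception is needed for it.
val beanInfo = Introspector.getBeanInfo(beanClass)
val properties = beanInfo.getPropertyDescriptors
  .filterNot(_.getName == "class")
  .filter(_.getReadMethod != null)
```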



[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102367846
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -147,8 +165,6 @@ private[csv] class UnivocityParser(
 
 case udt: UserDefinedType[_] => (datum: String) =>
   makeConverter(name, udt.sqlType, nullable, options)
-
-case _ => throw new RuntimeException(s"Unsupported type: 
${dataType.typeName}")
--- End diff --

okay



[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102367607
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -189,9 +205,10 @@ private[csv] class UnivocityParser(
 }
   }
 
-  private def convertWithParseMode(
-  tokens: Array[String])(convert: Array[String] => InternalRow): 
Option[InternalRow] = {
-if (options.dropMalformed && schema.length != tokens.length) {
+  private def parseWithMode(input: String)(convert: Array[String] => 
InternalRow)
+: Option[InternalRow] = {
--- End diff --

Oh, one more thing: it seems this fits within the 100-character line length limit?

```scala
private def convertWithParseMode(
    input: String)(convert: Array[String] => InternalRow): Option[InternalRow] = {
```



[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102367601
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -147,8 +165,6 @@ private[csv] class UnivocityParser(
 
 case udt: UserDefinedType[_] => (datum: String) =>
   makeConverter(name, udt.sqlType, nullable, options)
-
-case _ => throw new RuntimeException(s"Unsupported type: 
${dataType.typeName}")
--- End diff --

I think we should keep it, although theoretically we won't hit it.
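
A tiny illustration of why: with the default removed, an unexpected type surfaces as a bare `scala.MatchError` instead of a descriptive error (the match below is a toy, not the real converter):

```scala
import org.apache.spark.sql.types._

def makeConverter(dataType: DataType): String => Any = dataType match {
  case StringType => identity
  // Theoretically unreachable, but keeps the failure mode descriptive.
  case _ => throw new RuntimeException(s"Unsupported type: ${dataType.typeName}")
}
```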



[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102366629
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -45,24 +46,41 @@ private[csv] class UnivocityParser(
   // A `ValueConverter` is responsible for converting the given value to a 
desired type.
   private type ValueConverter = String => Any
 
+  private val corruptFieldIndex = 
schema.getFieldIndex(options.columnNameOfCorruptRecord)
+  corruptFieldIndex.foreach { corrFieldIndex =>
+require(schema(corrFieldIndex).dataType == StringType,
+  "A field for corrupt records must have a string type")
+require(schema(corrFieldIndex).nullable, "A field for corrupt must be 
nullable")
+  }
+
+  private val inputSchema = StructType(schema.filter(_.name != 
options.columnNameOfCorruptRecord))
+  CSVUtils.verifySchema(inputSchema)
+
   private val valueConverters =
-schema.map(f => makeConverter(f.name, f.dataType, f.nullable, 
options)).toArray
+inputSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, 
options)).toArray
 
   private val parser = new CsvParser(options.asParserSettings)
 
   private var numMalformedRecords = 0
 
   private val row = new GenericInternalRow(requiredSchema.length)
 
-  private val indexArr: Array[Int] = {
+  private val indexArr: Array[(Int, Int)] = {
--- End diff --

Actually, it seems we don't need both? The second one seems to be unused.



[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102366771
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -45,24 +46,41 @@ private[csv] class UnivocityParser(
   // A `ValueConverter` is responsible for converting the given value to a 
desired type.
   private type ValueConverter = String => Any
 
+  private val corruptFieldIndex = 
schema.getFieldIndex(options.columnNameOfCorruptRecord)
+  corruptFieldIndex.foreach { corrFieldIndex =>
+require(schema(corrFieldIndex).dataType == StringType,
+  "A field for corrupt records must have a string type")
+require(schema(corrFieldIndex).nullable, "A field for corrupt must be 
nullable")
+  }
+
+  private val inputSchema = StructType(schema.filter(_.name != 
options.columnNameOfCorruptRecord))
+  CSVUtils.verifySchema(inputSchema)
--- End diff --

It seems `verifySchema` is being called twice
(https://github.com/apache/spark/pull/16928/files/4eed4a4bea927c6365dac2e2734c895a0a1a0026#diff-a549ac2e19ee7486911e2e6403444d9dR99).



[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102366908
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -96,31 +96,44 @@ class CSVFileFormat extends TextBasedFileFormat with 
DataSourceRegister {
   filters: Seq[Filter],
   options: Map[String, String],
   hadoopConf: Configuration): (PartitionedFile) => 
Iterator[InternalRow] = {
-val csvOptions = new CSVOptions(options, 
sparkSession.sessionState.conf.sessionLocalTimeZone)
-
+CSVUtils.verifySchema(dataSchema)
 val broadcastedHadoopConf =
   sparkSession.sparkContext.broadcast(new 
SerializableConfiguration(hadoopConf))
 
+val parsedOptions = new CSVOptions(
+  options,
+  sparkSession.sessionState.conf.sessionLocalTimeZone,
+  sparkSession.sessionState.conf.columnNameOfCorruptRecord)
+
+// Check a field requirement for corrupt records here to throw an 
exception in a driver side
+dataSchema.getFieldIndex(parsedOptions.columnNameOfCorruptRecord).map 
{ corruptFieldIndex =>
+  val f = dataSchema(corruptFieldIndex)
+  if (f.dataType != StringType || !f.nullable) {
--- End diff --

This check seems duplicated with 
https://github.com/apache/spark/pull/16928/files/4eed4a4bea927c6365dac2e2734c895a0a1a0026#diff-d19881aceddcaa5c60620fdcda99b4c4R51



[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102366449
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -147,8 +165,6 @@ private[csv] class UnivocityParser(
 
 case udt: UserDefinedType[_] => (datum: String) =>
   makeConverter(name, udt.sqlType, nullable, options)
-
-case _ => throw new RuntimeException(s"Unsupported type: 
${dataType.typeName}")
--- End diff --

Hm, wouldn't removing it cause `MatchError`?



[GitHub] spark issue #17001: [SPARK-19667][SQL]create table with hiveenabled in defau...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17001
  
I'd like to treat this as a workaround; the location of the default database is still invalid in cluster B.

We can make this logic clearer and more consistent: the default database should not have a location, and when we try to get the location of the default DB, we should use the warehouse path.
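
A hedged sketch of that rule (the function and parameter names are illustrative): the default database's location is derived from the current warehouse path rather than read back from the metastore:

```scala
// The stored location is ignored for the default database; every other
// database keeps the location persisted in the catalog.
def resolveDbLocation(dbName: String, storedLocation: String, warehousePath: String): String =
  if (dbName == "default") warehousePath else storedLocation
```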



[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...

2017-02-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16744
  
**[Test build #73248 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73248/testReport)**
 for PR 16744 at commit 
[`da18da0`](https://github.com/apache/spark/commit/da18da0d98d1f4e433480de8df6f6c34b1e0fb39).



[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

2017-02-21 Thread Gauravshah
Github user Gauravshah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16842#discussion_r102366702
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala
 ---
@@ -204,10 +208,11 @@ class KinesisSequenceRangeIterator(
* to get records from Kinesis), and get the next shard iterator for 
next consumption.
*/
   private def getRecordsAndNextKinesisIterator(
-  shardIterator: String): (Iterator[Record], String) = {
+  shardIterator: String, recordCount: Int): (Iterator[Record], String) 
= {
 val getRecordsRequest = new GetRecordsRequest
 getRecordsRequest.setRequestCredentials(credentials)
 getRecordsRequest.setShardIterator(shardIterator)
+getRecordsRequest.setLimit(recordCount)
--- End diff --

👍 



[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

2017-02-21 Thread Gauravshah
Github user Gauravshah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16842#discussion_r102366680
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala
 ---
@@ -36,7 +36,8 @@ import org.apache.spark.util.NextIterator
 /** Class representing a range of Kinesis sequence numbers. Both sequence 
numbers are inclusive. */
 private[kinesis]
 case class SequenceNumberRange(
-streamName: String, shardId: String, fromSeqNumber: String, 
toSeqNumber: String)
+streamName: String, shardId: String, fromSeqNumber: String, 
toSeqNumber: String,
+recordCount: Int)
--- End diff --

Not sure about upgrading, since for a code upgrade we need to delete the checkpoint directory and start afresh. I did run this patch and was able to serialize the limit into the checkpoint (not a Scala pro, though).



[GitHub] spark pull request #16744: [SPARK-19405][STREAMING] Support for cross-accoun...

2017-02-21 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/16744#discussion_r102366429
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SerializableCredentialsProvider.scala
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.kinesis
+
+import scala.collection.JavaConverters._
+
+import com.amazonaws.auth._
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Serializable interface providing a method executors can call to obtain 
an
+ * AWSCredentialsProvider instance for authenticating to AWS services.
+ */
+private[kinesis] sealed trait SerializableCredentialsProvider extends 
Serializable {
+  /**
+   * Return an AWSCredentialProvider instance that can be used by the 
Kinesis Client
+   * Library to authenticate to AWS services (Kinesis, CloudWatch and 
DynamoDB).
+   */
+  def provider: AWSCredentialsProvider
+}
+
+/** Returns DefaultAWSCredentialsProviderChain for authentication. */
+private[kinesis] final case object DefaultCredentialsProvider
+  extends SerializableCredentialsProvider {
+
+  def provider: AWSCredentialsProvider = new 
DefaultAWSCredentialsProviderChain
+}
+
+/**
+ * Returns AWSStaticCredentialsProvider constructed using basic AWS 
keypair. Falls back to using
+ * DefaultAWSCredentialsProviderChain if unable to construct a 
AWSCredentialsProviderChain
+ * instance with the provided arguments (e.g. if they are null).
+ */
+private[kinesis] final case class BasicCredentialsProvider(
+awsAccessKeyId: String,
+awsSecretKey: String) extends SerializableCredentialsProvider with 
Logging {
+
+  def provider: AWSCredentialsProvider = try {
+new AWSStaticCredentialsProvider(new 
BasicAWSCredentials(awsAccessKeyId, awsSecretKey))
+  } catch {
+case e: IllegalArgumentException =>
+  logWarning("Unable to construct AWSStaticCredentialsProvider with 
provided keypair; " +
+s"falling back to DefaultAWSCredentialsProviderChain: $e")
--- End diff --

I went ahead and updated it.



[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-21 Thread wesm
Github user wesm commented on the issue:

https://github.com/apache/spark/pull/15821
  
The 0.2 Maven artifacts have been posted. I'll try to update the 
conda-forge packages this week -- if anyone can help with conda-forge 
maintenance that would be a big help.

Thanks!



[GitHub] spark pull request #16744: [SPARK-19405][STREAMING] Support for cross-accoun...

2017-02-21 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/16744#discussion_r102366089
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SerializableCredentialsProvider.scala
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.kinesis
+
+import scala.collection.JavaConverters._
+
+import com.amazonaws.auth._
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Serializable interface providing a method executors can call to obtain 
an
+ * AWSCredentialsProvider instance for authenticating to AWS services.
+ */
+private[kinesis] sealed trait SerializableCredentialsProvider extends 
Serializable {
+  /**
+   * Return an AWSCredentialProvider instance that can be used by the 
Kinesis Client
+   * Library to authenticate to AWS services (Kinesis, CloudWatch and 
DynamoDB).
+   */
+  def provider: AWSCredentialsProvider
+}
+
+/** Returns DefaultAWSCredentialsProviderChain for authentication. */
+private[kinesis] final case object DefaultCredentialsProvider
+  extends SerializableCredentialsProvider {
+
+  def provider: AWSCredentialsProvider = new 
DefaultAWSCredentialsProviderChain
+}
+
+/**
+ * Returns AWSStaticCredentialsProvider constructed using basic AWS 
keypair. Falls back to using
+ * DefaultAWSCredentialsProviderChain if unable to construct a 
AWSCredentialsProviderChain
+ * instance with the provided arguments (e.g. if they are null).
+ */
+private[kinesis] final case class BasicCredentialsProvider(
+awsAccessKeyId: String,
+awsSecretKey: String) extends SerializableCredentialsProvider with 
Logging {
+
+  def provider: AWSCredentialsProvider = try {
+new AWSStaticCredentialsProvider(new 
BasicAWSCredentials(awsAccessKeyId, awsSecretKey))
+  } catch {
+case e: IllegalArgumentException =>
+  logWarning("Unable to construct AWSStaticCredentialsProvider with 
provided keypair; " +
+s"falling back to DefaultAWSCredentialsProviderChain: $e")
--- End diff --

can do it in a separate PR if you like



[GitHub] spark pull request #16744: [SPARK-19405][STREAMING] Support for cross-accoun...

2017-02-21 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/16744#discussion_r102366055
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SerializableCredentialsProvider.scala
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.kinesis
+
+import scala.collection.JavaConverters._
+
+import com.amazonaws.auth._
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Serializable interface providing a method executors can call to obtain 
an
+ * AWSCredentialsProvider instance for authenticating to AWS services.
+ */
+private[kinesis] sealed trait SerializableCredentialsProvider extends 
Serializable {
+  /**
+   * Return an AWSCredentialProvider instance that can be used by the 
Kinesis Client
+   * Library to authenticate to AWS services (Kinesis, CloudWatch and 
DynamoDB).
+   */
+  def provider: AWSCredentialsProvider
+}
+
+/** Returns DefaultAWSCredentialsProviderChain for authentication. */
+private[kinesis] final case object DefaultCredentialsProvider
+  extends SerializableCredentialsProvider {
+
+  def provider: AWSCredentialsProvider = new 
DefaultAWSCredentialsProviderChain
+}
+
+/**
+ * Returns AWSStaticCredentialsProvider constructed using basic AWS 
keypair. Falls back to using
+ * DefaultAWSCredentialsProviderChain if unable to construct a 
AWSCredentialsProviderChain
+ * instance with the provided arguments (e.g. if they are null).
+ */
+private[kinesis] final case class BasicCredentialsProvider(
+awsAccessKeyId: String,
+awsSecretKey: String) extends SerializableCredentialsProvider with 
Logging {
+
+  def provider: AWSCredentialsProvider = try {
+new AWSStaticCredentialsProvider(new 
BasicAWSCredentials(awsAccessKeyId, awsSecretKey))
+  } catch {
+case e: IllegalArgumentException =>
+  logWarning("Unable to construct AWSStaticCredentialsProvider with 
provided keypair; " +
+s"falling back to DefaultAWSCredentialsProviderChain: $e")
--- End diff --

Use `"falling back to DefaultAWSCredentialsProviderChain.", e)` instead.



[GitHub] spark pull request #16819: [SPARK-16441][YARN] Set maxNumExecutor depends on...

2017-02-21 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/16819#discussion_r102365927
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala 
---
@@ -1193,6 +1189,37 @@ private[spark] class Client(
   }
   }
 
+  def init(): Unit = {
+launcherBackend.connect()
+// Setup the credentials before doing anything else,
+// so we have don't have issues at any point.
+setupCredentials()
+yarnClient.init(yarnConf)
+yarnClient.start()
+
+setMaxNumExecutors()
+  }
+
+  /**
+   * If using dynamic allocation and user doesn't set 
spark.dynamicAllocation.maxExecutors
+   * then set the max number of executors depends on yarn cluster VCores 
Total.
+   * If not using dynamic allocation don't set it.
+   */
+  private def setMaxNumExecutors(): Unit = {
+if (Utils.isDynamicAllocationEnabled(sparkConf)) {
+
+  val defaultMaxNumExecutors = 
DYN_ALLOCATION_MAX_EXECUTORS.defaultValue.get
+  if (defaultMaxNumExecutors == 
sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)) {
+val executorCores = sparkConf.getInt("spark.executor.cores", 1)
+val maxNumExecutors = yarnClient.getNodeReports().asScala.
--- End diff --

Good suggestion. I will try the API first. Pseudo code:
```
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Create and start a YARN client, then ask for the root queue information.
val yarnConf = new YarnConfiguration()
val yarnClient = YarnClient.createYarnClient
yarnClient.init(yarnConf)
yarnClient.start()
val rootQueues = yarnClient.getRootQueueInfos
```
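
A hedged continuation of that sketch, deriving a dynamic-allocation cap from the cluster's total VCores as the diff above intends (the `NodeState.RUNNING` filter and the hard-coded executor cores are assumptions):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.records.NodeState
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnClient = YarnClient.createYarnClient
yarnClient.init(new YarnConfiguration())
yarnClient.start()

// Sum the VCores of all running nodes, then divide by cores per executor
// to bound spark.dynamicAllocation.maxExecutors.
val totalVCores = yarnClient.getNodeReports(NodeState.RUNNING).asScala
  .map(_.getCapability.getVirtualCores)
  .sum
val executorCores = 1  // stand-in for sparkConf.getInt("spark.executor.cores", 1)
val maxNumExecutors = totalVCores / executorCores
```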



[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...

2017-02-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16785#discussion_r102364604
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -314,7 +314,17 @@ abstract class UnaryNode extends LogicalPlan {
* expressions with the corresponding alias
*/
   protected def getAliasedConstraints(projectList: Seq[NamedExpression]): 
Set[Expression] = {
-var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
+val relativeReferences = AttributeSet(projectList.collect {
+  case a: Alias => a
+}.flatMap(_.references)) ++ outputSet
+
+// We only care about the constraints which refer to attributes in 
output and aliases.
+// For example, for a constraint 'a > b', if 'a' is aliased to 'c', we 
need to get aliased
+// constraint 'c > b' only if 'b' is in output.
+var allConstraints = child.constraints.filter { constraint =>
+  constraint.references.subsetOf(relativeReferences)
--- End diff --

Yes. You can see the benchmark in the PR description. With these attributes pruned, the running time is cut in half.

If I understand your comment correctly, pruning them later in `QueryPlan` means we prune constraints which don't refer to attributes in `outputSet`.

But the pruning here happens before the pruning you pointed out; we need to reduce the set of constraints considered when transforming aliased attributes, to lower the computation cost.



[GitHub] spark issue #16938: [SPARK-19583][SQL]CTAS for data source table with a crea...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16938
  
> CREATE TABLE ... (PARTITIONED BY ...) LOCATION path

I think hive's behavior makes more sense. Users may wanna insert data to 
this table and put the data in a specified location, even it doesn't exist at 
the beginning.

> CREATE TABLE ...(PARTITIONED BY ...) LOCATION path AS SELECT ...

The reason applies here too.

> CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...

When users don't specify the location, they mostly expect a fresh table, so the table path should not already exist.
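
For concreteness, the three shapes under discussion as they might look in spark-shell (paths and table names are made up; whether each succeeds when the path already exists is exactly the question here):
```
// Case 1: explicit location; under Hive-like semantics the path may be
// created lazily on first insert.
spark.sql("CREATE TABLE t1 (a INT, b STRING) USING parquet PARTITIONED BY (b) " +
  "LOCATION '/tmp/custom/t1'")

// Case 2: CTAS with an explicit location; the same reasoning applies.
spark.sql("CREATE TABLE t2 USING parquet LOCATION '/tmp/custom/t2' " +
  "AS SELECT 1 AS a, 'x' AS b")

// Case 3: CTAS without a location; a fresh table whose derived default path
// should not already exist.
spark.sql("CREATE TABLE t3 USING parquet AS SELECT 1 AS a, 'x' AS b")
```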








[GitHub] spark issue #16990: [SPARK-19660][CORE][SQL] Replace the configuration prope...

2017-02-21 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/16990
  
@srowen @felixcheung 
The golden answer file name is derived from the MD5 of the SQL query; see:

https://github.com/apache/spark/blob/v2.1.0/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L314

e.g., `set mapred.reduce.tasks=31`'s MD5 is `83c59d378571a6e487aa20217bd87817`, and `set mapreduce.job.reduces=31`'s MD5 is `be2c0b32a02a1154bfdee1a52530f387`.
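
For reference, a sketch of how such a digest could be computed, assuming the helper at the linked line applies MD5 to the query text and renders it via `BigInteger.toString(16)`, which drops leading zeros and would explain why `2b9ccaa793eae0e73bf76335d3d6880` above has only 31 characters:
```
import java.math.BigInteger
import java.security.MessageDigest

// MD5 of a query string, rendered the way the golden file suffixes look.
def queryDigest(query: String): String = {
  val md5 = MessageDigest.getInstance("MD5")
  md5.update(query.getBytes("utf-8"))
  new BigInteger(1, md5.digest()).toString(16)
}

queryDigest("set mapred.reduce.tasks=31")    // 83c59d378571a6e487aa20217bd87817, per the hashes above
queryDigest("set mapreduce.job.reduces=31")  // be2c0b32a02a1154bfdee1a52530f387
```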

So I changed the following file names:
```
mv input12_hadoop20-0-db1cd54a4cb36de2087605f32e41824f input12_hadoop20-0-2b9ccaa793eae0e73bf76335d3d6880
mv join14_hadoop20-1-db1cd54a4cb36de2087605f32e41824f join14_hadoop20-1-2b9ccaa793eae0e73bf76335d3d6880
mv auto_join14_hadoop20-2-db1cd54a4cb36de2087605f32e41824f auto_join14_hadoop20-2-2b9ccaa793eae0e73bf76335d3d6880
mv groupby4_noskew-2-83c59d378571a6e487aa20217bd87817 groupby4_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby4_map-2-83c59d378571a6e487aa20217bd87817 groupby4_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby4_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby4_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby7_map-3-83c59d378571a6e487aa20217bd87817 groupby7_map-3-be2c0b32a02a1154bfdee1a52530f387
mv groupby2_limit-0-83c59d378571a6e487aa20217bd87817 groupby2_limit-0-be2c0b32a02a1154bfdee1a52530f387
mv groupby6_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby6_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby5_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby5_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby5_noskew-2-83c59d378571a6e487aa20217bd87817 groupby5_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby2_map-2-83c59d378571a6e487aa20217bd87817 groupby2_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby7_noskew-3-83c59d378571a6e487aa20217bd87817 groupby7_noskew-3-be2c0b32a02a1154bfdee1a52530f387
mv groupby1_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby1_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby8_map-2-83c59d378571a6e487aa20217bd87817 groupby8_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby6_noskew-2-83c59d378571a6e487aa20217bd87817 groupby6_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby7_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby7_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby7_noskew_multi_single_reducer-2-83c59d378571a6e487aa20217bd87817 groupby7_noskew_multi_single_reducer-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby_map_ppr-2-83c59d378571a6e487aa20217bd87817 groupby_map_ppr-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby8_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby8_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby6_map-2-83c59d378571a6e487aa20217bd87817 groupby6_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby1_noskew-2-83c59d378571a6e487aa20217bd87817 groupby1_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby1_limit-0-83c59d378571a6e487aa20217bd87817 groupby1_limit-0-be2c0b32a02a1154bfdee1a52530f387
mv groupby5_map-2-83c59d378571a6e487aa20217bd87817 groupby5_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby7_map_multi_single_reducer-2-83c59d378571a6e487aa20217bd87817 groupby7_map_multi_single_reducer-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby1_map-2-83c59d378571a6e487aa20217bd87817 groupby1_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby2_noskew-2-83c59d378571a6e487aa20217bd87817 groupby2_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby8_noskew-2-83c59d378571a6e487aa20217bd87817 groupby8_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby2_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby2_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv
```

[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

2017-02-21 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16998
  
@sameeragarwal That's correct.

> By the way, as an aside, we should probably allow constraint inference/propagation to be turned off via a conf flag, to provide a quick workaround for these kinds of problems.

Since we use constraints in optimization, wouldn't turning off constraint inference/propagation miss optimization chances for query plans?






[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102364205
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -45,24 +46,41 @@ private[csv] class UnivocityParser(
   // A `ValueConverter` is responsible for converting the given value to a desired type.
   private type ValueConverter = String => Any
 
+  private val corruptFieldIndex = schema.getFieldIndex(options.columnNameOfCorruptRecord)
+  corruptFieldIndex.foreach { corrFieldIndex =>
+    require(schema(corrFieldIndex).dataType == StringType,
+      "A field for corrupt records must have a string type")
+    require(schema(corrFieldIndex).nullable, "A field for corrupt must be nullable")
+  }
+
+  private val inputSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord))
+  CSVUtils.verifySchema(inputSchema)
+
   private val valueConverters =
-    schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
+    inputSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
 
   private val parser = new CsvParser(options.asParserSettings)
 
   private var numMalformedRecords = 0
 
   private val row = new GenericInternalRow(requiredSchema.length)
 
-  private val indexArr: Array[Int] = {
+  private val indexArr: Array[(Int, Int)] = {
--- End diff --

add comment to explain these 2 ints.
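
For illustration, a toy version of what such a comment might capture, under the unconfirmed assumption that the pair is (index among the parsed tokens, index in the output row):
```
// Toy schemas: tokens are parsed in inputSchema order, the row is built in
// requiredSchema order.
val inputSchema = Seq("a", "b", "c")
val requiredSchema = Seq("c", "a")

// For each required field: (position among the parsed tokens, position in the row).
val indexArr: Array[(Int, Int)] =
  requiredSchema.zipWithIndex.map { case (name, rowIdx) =>
    (inputSchema.indexOf(name), rowIdx)
  }.toArray
// indexArr: Array((2,0), (0,1))
```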





[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...

2017-02-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16928#discussion_r102364069
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -45,24 +46,41 @@ private[csv] class UnivocityParser(
   // A `ValueConverter` is responsible for converting the given value to a desired type.
   private type ValueConverter = String => Any
 
+  private val corruptFieldIndex = schema.getFieldIndex(options.columnNameOfCorruptRecord)
+  corruptFieldIndex.foreach { corrFieldIndex =>
+    require(schema(corrFieldIndex).dataType == StringType,
+      "A field for corrupt records must have a string type")
+    require(schema(corrFieldIndex).nullable, "A field for corrupt must be nullable")
+  }
+
+  private val inputSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord))
+  CSVUtils.verifySchema(inputSchema)
+
   private val valueConverters =
-    schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
+    inputSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
 
   private val parser = new CsvParser(options.asParserSettings)
 
   private var numMalformedRecords = 0
 
   private val row = new GenericInternalRow(requiredSchema.length)
 
-  private val indexArr: Array[Int] = {
+  private val indexArr: Array[(Int, Int)] = {
     val fields = if (options.dropMalformed) {
       // If `dropMalformed` is enabled, then it needs to parse all the values
       // so that we can decide which row is malformed.
       requiredSchema ++ schema.filterNot(requiredSchema.contains(_))
     } else {
       requiredSchema
     }
-    fields.map(schema.indexOf(_: StructField)).toArray
+    val fieldsWithIndexes = fields.zipWithIndex
--- End diff --

isn't it just `fields.map(inputSchema.indexOf(_: StructField)).toArray`?





[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

2017-02-21 Thread sameeragarwal
Github user sameeragarwal commented on the issue:

https://github.com/apache/spark/pull/16998
  
By the way, as an aside, we should probably allow constraint inference/propagation to be turned off via a conf flag, to provide a quick workaround for these kinds of problems.
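
A minimal sketch of what such a flag could look like in `SQLConf`; the key name, doc text, and default here are assumptions, not something this thread has settled:
```
// Hypothetical entry in org.apache.spark.sql.internal.SQLConf; illustrative only.
val CONSTRAINT_PROPAGATION_ENABLED =
  SQLConfigBuilder("spark.sql.constraintPropagation.enabled")
    .doc("When true, the optimizer infers and propagates constraints across the " +
      "query plan. Disabling it trades optimization opportunities for shorter " +
      "planning time.")
    .booleanConf
    .createWithDefault(true)
```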




