Re: Support SqlStreaming in spark

2019-03-28 Thread uncleGen
Hi all, 

I have rewritten the design doc based on the previous discussion:
https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0

Would be interested to hear what others think.

Regards,
Genmao Yu 




[GitHub] spark pull request #22575: [SPARK-24630][SS] Support SQLStreaming in Spark

2018-11-28 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22575#discussion_r237372804
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StreamTableDDLCommandSuite.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.test.SQLTestUtils
+
+class StreamTableDDLCommandSuite extends SQLTestUtils with TestHiveSingleton {
+  private val catalog = spark.sessionState.catalog
+
+  test("CTAS: create data source stream table") {
+    withTempPath { dir =>
+      withTable("t") {
+        sql(
+          s"""CREATE TABLE t USING PARQUET
+             |OPTIONS (
+             |PATH = '${dir.toURI}',
+             |location = '${dir.toURI}',
+             |isStreaming = 'true')
+             |AS SELECT 1 AS a, 2 AS b, 3 AS c
+           """.stripMargin)
--- End diff --

At https://github.com/apache/spark/pull/22575/files#diff-fa4547f0c6dd7810576cd4262a2dfb46R78, is the `child` logical plan not a streaming logical plan?





[GitHub] spark issue #18099: [SPARK-18406][CORE][Backport-2.1] Race between end-of-ta...

2018-08-27 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/18099
  
Same issue in Spark 2.2.1.





[GitHub] kafka pull request #3897: KAFKA-5929: Save pre-assignment to file to avoid t...

2017-09-26 Thread uncleGen
GitHub user uncleGen reopened a pull request:

https://github.com/apache/kafka/pull/3897

KAFKA-5929: Save pre-assignment to file to avoid too long text to display 
when do topic partition reassign

When doing a partition reassignment:
- Before this PR: the pre-assignment is printed directly, which is unwieldy when the text is too long.
- After this PR: the pre-assignment is still printed directly, but it is also saved to a file at the same time, named by appending the suffix ".rollback" to the "reassignment-json-file". For example:

```
./kafka-reassign-partitions.sh --reassignment-json-file test.json ...
```
then we get a file **test.json.rollback**.
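
For illustration, a minimal sketch of the rollback-file idea; `saveRollback` is a hypothetical helper, not the actual code wired into the reassignment tool:

```scala
import java.nio.file.{Files, Paths, StandardOpenOption}

// Hypothetical helper: persist the current (pre-reassignment) assignment JSON
// next to the user-supplied reassignment file, with a ".rollback" suffix.
def saveRollback(reassignmentJsonFile: String, currentAssignmentJson: String): Unit = {
  val rollbackPath = Paths.get(reassignmentJsonFile + ".rollback")
  Files.write(rollbackPath, currentAssignmentJson.getBytes("UTF-8"),
    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
}

// saveRollback("test.json", currentJson) writes test.json.rollback, which can
// later be fed back to the tool to roll the reassignment back.
```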


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/kafka KAFKA-5929

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3897.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3897


commit b5806c9f09459e473d6d1eb66c46a546c526d9c7
Author: uncleGen <husty...@gmail.com>
Date:   2017-09-19T07:40:09Z

Save pre-assignment to file to avoid too long text to display when do topic 
partition reassign

commit 19d3f84a237c3a685d805f0974282734fcc8e655
Author: uncleGen <husty...@gmail.com>
Date:   2017-09-19T08:23:50Z

findbugs fix






[GitHub] kafka pull request #3894: KAFKA-5928: Avoid redundant requests to zookeeper ...

2017-09-26 Thread uncleGen
GitHub user uncleGen reopened a pull request:

https://github.com/apache/kafka/pull/3894

KAFKA-5928: Avoid redundant requests to zookeeper when reassign topic 
partition

We mistakenly request topic-level information once per partition listed in the assignment JSON file. For example, at https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/admin/ReassignPartitionsCommand.scala#L550:
```
val validPartitions = proposedPartitionAssignment.filter { case (p, _) => validatePartition(zkUtils, p.topic, p.partition) }
```
If we reassign 1000 partitions across 10 topics, we request ZooKeeper 1000 times here, when just 10 requests (one per topic) would suffice. We tested a large-scale assignment of about 10K partitions: it took tens of minutes, and after this optimization it drops to under one minute.
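
A sketch of the batching idea, under stated assumptions: `TopicAndPartition` stands in for the Kafka admin type, and `fetchPartitionCount` is a hypothetical stand-in for the single ZooKeeper read per topic:

```scala
case class TopicAndPartition(topic: String, partition: Int)

// One ZooKeeper round trip per *topic* instead of per *partition*: fetch each
// topic's partition count once, then validate every proposed partition
// against the cached counts.
def validPartitions(
    proposed: Map[TopicAndPartition, Seq[Int]],
    fetchPartitionCount: String => Int): Map[TopicAndPartition, Seq[Int]] = {
  val counts: Map[String, Int] =
    proposed.keys.map(_.topic).toSeq.distinct.map(t => t -> fetchPartitionCount(t)).toMap
  proposed.filter { case (tp, _) => tp.partition < counts(tp.topic) }
}
```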

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/kafka KAFKA-5928

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3894.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3894


commit b8ebcbe00b4b10cdda023efceed2789e39f75781
Author: uncleGen <husty...@gmail.com>
Date:   2017-09-19T03:01:20Z

Avoid redundant requests to zookeeper when reassign topic partition






[GitHub] kafka pull request #3894: KAFKA-5928: Avoid redundant requests to zookeeper ...

2017-09-24 Thread uncleGen
Github user uncleGen closed the pull request at:

https://github.com/apache/kafka/pull/3894




[GitHub] kafka pull request #3897: KAFKA-5929: Save pre-assignment to file to avoid t...

2017-09-19 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/kafka/pull/3897

KAFKA-5929: Save pre-assignment to file to avoid too long text to display 
when do topic partition reassign



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/kafka KAFKA-5929

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3897.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3897


commit 9a78be366c3ad99f4418c6dab1c9701739b26c07
Author: 木艮 <genmao@alibaba-inc.com>
Date:   2017-09-19T07:40:09Z

Save pre-assignment to file to avoid too long text to display when do topic 
partition reassign






[GitHub] spark pull request #16656: [SPARK-18116][DStream] Report stream input inform...

2017-09-18 Thread uncleGen
Github user uncleGen closed the pull request at:

https://github.com/apache/spark/pull/16656





[GitHub] kafka pull request #3894: KAFKA-5928: Avoid redundant requests to zookeeper ...

2017-09-18 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/kafka/pull/3894

KAFKA-5928: Avoid redundant requests to zookeeper when reassign topic 
partition

We mistakenly request topic-level information once per partition listed in the assignment JSON file. For example, at https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/admin/ReassignPartitionsCommand.scala#L550:
```
val validPartitions = proposedPartitionAssignment.filter { case (p, _) => validatePartition(zkUtils, p.topic, p.partition) }
```
If we reassign 1000 partitions across 10 topics, we request ZooKeeper 1000 times here, when just 10 requests (one per topic) would suffice. We tested a large-scale assignment of about 10K partitions: it took tens of minutes, and after this optimization it drops to under one minute.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/kafka KAFKA-5928

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3894.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3894


commit f6c30e81c7110f72e254bb9dfa81a25f951b70a1
Author: 木艮 <genmao@alibaba-inc.com>
Date:   2017-09-19T03:01:20Z

Avoid redundant requests to zookeeper when reassign topic partition






[GitHub] spark pull request #17395: [SPARK-20065][SS][WIP] Avoid to output empty parq...

2017-06-19 Thread uncleGen
Github user uncleGen closed the pull request at:

https://github.com/apache/spark/pull/17395





[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...

2017-05-11 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17052
  
@HyukjinKwon Sorry! I have been busy for a while. Let me resolve this conflict.





[GitHub] spark issue #17913: [SPARK-20672][SS] Keep the `isStreaming` property in tri...

2017-05-10 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17913
  
@zsxwing Great! I will close this PR then.





[GitHub] spark pull request #17913: [SPARK-20672][SS] Keep the `isStreaming` property...

2017-05-10 Thread uncleGen
Github user uncleGen closed the pull request at:

https://github.com/apache/spark/pull/17913





[GitHub] spark pull request #17917: [SPARK-20600][SS] KafkaRelation should be pretty ...

2017-05-10 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17917#discussion_r115659920
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaRelation.scala ---
@@ -143,4 +143,6 @@ private[kafka010] class KafkaRelation(
 validateTopicPartitions(partitions, partitionOffsets)
 }
   }
+
+  override def toString: String = "kafka"
 }
--- End diff --

How about including some more information about the Kafka source, like topic and partition? See https://github.com/jaceklaskowski/spark/blob/2ffe4476553cfe50eb6392d8e573545a92fef737/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L140
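
For example, something along these lines (a sketch only; the actual `KafkaRelation` member names, such as `strategy` and the offset limits, are assumptions here):

```scala
// Hypothetical members standing in for KafkaRelation's real fields.
override def toString: String =
  s"KafkaRelation(strategy=$strategy, start=$startingOffsets, end=$endingOffsets)"
```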





[GitHub] spark pull request #17913: [SPARK-20672][SS] Keep the `isStreaming` property...

2017-05-10 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17913#discussion_r115659132
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala ---
@@ -48,7 +48,7 @@ case class StreamingRelation(dataSource: DataSource, sourceName: String, output:
  * Used to link a streaming [[Source]] of data into a
  * [[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]].
  */
-case class StreamingExecutionRelation(source: Source, output: Seq[Attribute]) extends LeafNode {
+case class StreamingSourceRelation(source: Source, output: Seq[Attribute]) extends LeafNode {
   override def isStreaming: Boolean = true
--- End diff --

Just a rename.





[GitHub] spark issue #17913: [SPARK-20672][SS] Keep the `isStreaming` property in tri...

2017-05-10 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17913
  
cc @zsxwing 





[GitHub] spark issue #17913: [SPARK-20672][SS] Keep the `isStreaming` property in tri...

2017-05-09 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17913
  
retest this please.





[GitHub] spark pull request #17896: [SPARK-20373][SQL][SS] Batch queries with 'Datase...

2017-05-09 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17896#discussion_r115415803
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -2457,6 +2457,19 @@ object CleanupAliases extends Rule[LogicalPlan] {
 }
 
 /**
+ * Ignore event time watermark in batch query, which is only supported in Structured Streaming.
+ * TODO: add this rule into analyzer rule list.
+ */
+object CheckEventTimeWatermark extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case EventTimeWatermark(_, _, child) if !child.isStreaming =>
+      logWarning("EventTime watermark is only supported in Structured Streaming but found " +
--- End diff --

Got it.





[GitHub] spark pull request #17896: [SPARK-20373][SQL][SS] Batch queries with 'Datase...

2017-05-09 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17896#discussion_r115415668
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -2457,6 +2457,19 @@ object CleanupAliases extends Rule[LogicalPlan] {
 }
 
 /**
+ * Ignore event time watermark in batch query, which is only supported in Structured Streaming.
+ * TODO: add this rule into analyzer rule list.
+ */
+object CheckEventTimeWatermark extends Rule[LogicalPlan] {
--- End diff --

@zsxwing This PR does some preparatory work before we add `EliminateEventTimeWatermark` into `Analyzer.batches`. Could you please take a look?





[GitHub] spark pull request #17913: [SPARK-20672][SS] Keep the `isStreaming` property...

2017-05-09 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17913#discussion_r115415483
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala ---
@@ -64,8 +64,20 @@ case class StreamingRelationExec(sourceName: String, output: Seq[Attribute]) ext
   }
 }
 
-object StreamingExecutionRelation {
-  def apply(source: Source): StreamingExecutionRelation = {
-    StreamingExecutionRelation(source, source.schema.toAttributes)
+case class StreamingRelationWrapper(child: LogicalPlan) extends UnaryNode {
+  override def isStreaming: Boolean = true
+  override def output: Seq[Attribute] = child.output
+}
+
--- End diff --

Add a new `StreamingRelationWrapper` relation to wrap the internal relation 
in each trigger. It keeps the `isStreaming` property.





[GitHub] spark issue #17896: [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataF...

2017-05-09 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17896
  
Depends upon: 
[SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672)





[GitHub] spark pull request #17913: [SPARK-20672][SS] Keep the `isStreaming` property...

2017-05-09 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17913

[SPARK-20672][SS] Keep the `isStreaming` property in triggerLogicalPlan in 
Structured Streaming

## What changes were proposed in this pull request?

In Structured Streaming, the "isStreaming" property is eliminated in each triggerLogicalPlan, so some rules are mistakenly applied to that plan. We should refactor the existing code to better distinguish batch queries from Structured Streaming queries.

## How was this patch tested?

Existing unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-20672

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17913.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17913


commit d1c4cbf0fa369db993855ef3f63b05561cf6662a
Author: uncleGen <husty...@gmail.com>
Date:   2017-05-09T06:01:51Z

Keep the `streaming` property in triggerLogicalPlan in Structured Streaming

commit 20648d99b1b95ea074be56708f13901bba2ee10d
Author: uncleGen <husty...@gmail.com>
Date:   2017-05-09T06:18:50Z

update







[GitHub] spark issue #17896: [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataF...

2017-05-08 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17896
  
retest this please.





[GitHub] spark issue #17896: [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataF...

2017-05-08 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17896
  
retest this please.





[GitHub] spark issue #17395: [SPARK-20065][SS][WIP] Avoid to output empty parquet fil...

2017-05-07 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17395
  
@HyukjinKwon Sorry for the long absence. I will stay online for the next while; please give me some time.





[GitHub] spark issue #17896: [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataF...

2017-05-07 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17896
  
cc @zsxwing and @tdas 





[GitHub] spark pull request #17896: [SPARK-20373][SQL][SS] Batch queries with 'Datase...

2017-05-07 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17896

[SPARK-20373][SQL][SS] Batch queries with 
'Dataset/DataFrame.withWatermark()` does not execute

## What changes were proposed in this pull request?

Any Dataset/DataFrame batch query with the operation `withWatermark` does 
not execute because the batch planner does not have any rule to explicitly 
handle the EventTimeWatermark logical plan. 
The right solution is to simply remove the plan node, as the watermark 
should not affect any batch query in any way.

Changes:
- This PR adds a new rule, `CheckEventTimeWatermark`, to check whether the event time watermark should be ignored. The watermark is ignored in any batch query.

Followups:
- Add `CheckEventTimeWatermark` to the analyzer rule list. We cannot add it to the analyzer directly yet, because a streaming query is copied into an internal batch query on every trigger, and the rule would mistakenly be applied to that internal batch query. IIUC, we should refactor the related code to better define whether a query is batch or streaming. Right?

Others:
- A typo fix in example.
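
For context, a minimal sketch of the elimination rule this prepares for (modeled on the `EliminateEventTimeWatermark` shape discussed above; treat it as an illustration rather than the exact committed code):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{EventTimeWatermark, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Drop the watermark node whenever its child is a batch (non-streaming) plan,
// since a watermark should not affect batch execution in any way.
object EliminateEventTimeWatermark extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case EventTimeWatermark(_, _, child) if !child.isStreaming => child
  }
}
```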

## How was this patch tested?

Added a new unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-20373

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17896.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17896


commit 563721241851751c2bb1736161febe73b8abba3b
Author: uncleGen <husty...@gmail.com>
Date:   2017-05-08T03:19:35Z

Ignore event time watermark in batch query.







[GitHub] spark pull request #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apac...

2017-04-23 Thread uncleGen
Github user uncleGen closed the pull request at:

https://github.com/apache/spark/pull/17463





[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...

2017-04-01 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17463
  
@srowen It's hard to say whether shutting down the SparkContext is the slow part, but we can improve this case by avoiding stopping the SparkContext in a separate thread.
cc @zsxwing





[GitHub] spark pull request #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apac...

2017-03-28 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17463

[SPARK-20131][DStream][Test] Flaky Test: 
org.apache.spark.streaming.StreamingContextSuite

## What changes were proposed in this pull request?

Do not stop the `SparkContext` in a separate thread.

## How was this patch tested?

Jenkins.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-20131

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17463.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17463


commit 89d1d35562bdb47c54464f31adeddadbe3a3ec1b
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-29T04:43:49Z

Flaky Test: org.apache.spark.streaming.StreamingContextSuite







[GitHub] spark issue #17395: [SPARK-20065][SS] Avoid to output empty parquet files

2017-03-23 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17395
  
Let me change this PR into WIP based on the discussion with @HyukjinKwon.





[GitHub] spark pull request #17395: [SPARK-20065][SS] Avoid to output empty parquet f...

2017-03-23 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17395#discussion_r107821138
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
@@ -292,7 +292,10 @@ object FileFormatWriter extends Logging {
 override def execute(iter: Iterator[InternalRow]): Set[String] = {
   var fileCounter = 0
   var recordsInFile: Long = 0L
-  newOutputWriter(fileCounter)
+  // Skip the empty partition to avoid creating a mass of 'empty' files.
+  if (iter.hasNext) {
+    newOutputWriter(fileCounter)
--- End diff --

Thanks for the prompt reply. How about leaving just one empty file containing the metadata when the DataFrame has an empty partition? Furthermore, we might keep just one metadata file.





[GitHub] spark pull request #17395: [SPARK-20065][SS] Avoid to output empty parquet f...

2017-03-23 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17395#discussion_r107650070
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
@@ -292,7 +292,10 @@ object FileFormatWriter extends Logging {
 override def execute(iter: Iterator[InternalRow]): Set[String] = {
   var fileCounter = 0
   var recordsInFile: Long = 0L
-  newOutputWriter(fileCounter)
+  // Skip the empty partition to avoid creating a mass of 'empty' files.
+  if (iter.hasNext) {
+    newOutputWriter(fileCounter)
--- End diff --

@HyukjinKwon IIUC, this case should fail as expected, as there is no output. Am I missing something?

```
spark.range(100).filter("id > 100").write.parquet("/tmp/abc")
spark.read.parquet("/tmp/abc").show()
```







[GitHub] spark pull request #17395: [SPARK-20065][SS] Avoid to output empty parquet f...

2017-03-23 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17395#discussion_r107637615
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
@@ -292,7 +292,10 @@ object FileFormatWriter extends Logging {
 override def execute(iter: Iterator[InternalRow]): Set[String] = {
   var fileCounter = 0
   var recordsInFile: Long = 0L
-  newOutputWriter(fileCounter)
+  // Skip the empty partition to avoid creating a mass of 'empty' files.
+  if (iter.hasNext) {
+    newOutputWriter(fileCounter)
--- End diff --

Let me see how to cover this case.





[GitHub] spark issue #16972: [SPARK-19556][CORE][WIP] Broadcast data is not encrypted...

2017-03-23 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/16972
  
Closing it until I have a better solution.





[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...

2017-03-23 Thread uncleGen
Github user uncleGen closed the pull request at:

https://github.com/apache/spark/pull/16972





[GitHub] spark pull request #17395: [SPARK-20065][SS] Avoid to output empty parquet f...

2017-03-23 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17395

[SPARK-20065][SS] Avoid to output empty parquet files

## Problem Description

Reported by Silvio Fiorito

I've got a Kafka topic which I'm querying, running a windowed aggregation, 
with a 30 second watermark, 10 second trigger, writing out to Parquet with 
append output mode.

Every 10 second trigger generates a file, regardless of whether there was 
any data for that trigger, or whether any records were actually finalized by 
the watermark.

Is this expected behavior or should it not write out these empty files?

```
val df = spark.readStream.format("kafka")

val query = df
  .withWatermark("timestamp", "30 seconds")
  .groupBy(window($"timestamp", "10 seconds"))
  .count()
  .select(date_format($"window.start", "HH:mm:ss").as("time"), $"count")

query
  .writeStream
  .format("parquet")
  .option("checkpointLocation", aggChk)
  .trigger(ProcessingTime("10 seconds"))
  .outputMode("append")
  .start(aggPath)
```

As the query executes, do a file listing on "aggPath" and you'll see 339 
byte files at a minimum until we arrive at the first watermark and the initial 
batch is finalized. Even after that though, as there are empty batches it'll 
keep generating empty files every trigger.

## What changes were proposed in this pull request?

Check whether the partition is empty, and skip empty partitions to avoid writing empty files.

## How was this patch tested?

Jenkins


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-20065

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17395


commit 86a7d2fa96e3134c1e64864eba81a3bebdedceea
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-23T08:10:31Z

avoid to output empty parquet files







[GitHub] spark issue #17371: [SPARK-19903][PYSPARK][SS] window operator miss the `wat...

2017-03-21 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17371
  
@viirya Great, you gave a clearer explanation.

> I am thinking, should we create new expression id for the watermarking column with withWatermark? So we must write the query like:

It would indeed fix this problem, but it is not very user-friendly.





[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...

2017-03-21 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17052
  
 retest this please.





[GitHub] spark pull request #17371: [SPARK-19903][PYSPARK][SS] window operator miss t...

2017-03-21 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17371#discussion_r107101883
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1163,7 +1163,10 @@ def check_string_field(field, fieldName):
             raise TypeError("%s should be provided as a string" % fieldName)
 
     sc = SparkContext._active_spark_context
-    time_col = _to_java_column(timeColumn)
+    if isinstance(timeColumn, Column):
--- End diff --

@viirya Sounds reasonable. I pushed an update; please take a look.





[GitHub] spark pull request #17371: [SPARK-19903][PYSPARK][SS] window operator miss t...

2017-03-21 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17371#discussion_r107095629
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1163,7 +1163,10 @@ def check_string_field(field, fieldName):
             raise TypeError("%s should be provided as a string" % fieldName)
 
     sc = SparkContext._active_spark_context
-    time_col = _to_java_column(timeColumn)
+    if isinstance(timeColumn, Column):
--- End diff --

IIUC, it is OK for the current codebase. Am I missing something?





[GitHub] spark pull request #17371: [SPARK-19903][PYSPARK][SS] window operator miss t...

2017-03-21 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17371

[SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of 
time column

## What changes were proposed in this pull request?

reproduce code:

```
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

bootstrapServers = sys.argv[1]
subscribeType = sys.argv[2]
topics = sys.argv[3]

spark = SparkSession\
  .builder\
  .appName("StructuredKafkaWordCount")\
  .getOrCreate()

lines = spark\
  .readStream\
  .format("kafka")\
  .option("kafka.bootstrap.servers", bootstrapServers)\
  .option(subscribeType, topics)\
  .load()\
  .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")

words = lines.select(explode(split(lines.value, ' ')).alias('word'), lines.timestamp)

windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
    window(words.timestamp, "30 seconds", "30 seconds"), words.word
).count()

query = windowedCounts\
  .writeStream\
  .outputMode('append')\
  .format('console')\
  .option("truncate", "false")\
  .start()

query.awaitTermination()
```

An exception was thrown:

```
pyspark.sql.utils.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Aggregate [window#32, word#21], [window#32 AS window#26, word#21, count(1) AS count#31L]
+- Filter ((timestamp#16 >= window#32.start) && (timestamp#16 < window#32.end))
   +- Expand [ArrayBuffer(named_struct(start, ...]
      +- EventTimeWatermark timestamp#16: timestamp, interval 10 seconds
         +- Project [word#21, timestamp#16]
            +- Generate explode(split(value#15,  )), true, false, [word#21]
               +- Project [cast(value#1 as string) AS value#15, cast(timestamp#5 as timestamp) AS timestamp#16]
                  +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession ...]
```

IIUC, the root cause is that `words.withWatermark("timestamp", "30 seconds")` adds the watermark metadata to the time column, but in `groupBy(window(words.timestamp, "30 seconds", "30 seconds"), words.word)` the `words.timestamp` column is missing that metadata. As a result, it fails to pass this check:

```
if (watermarkAttributes.isEmpty) {
  throwError(
    s"$outputMode output mode not supported when there are streaming aggregations on " +
    s"streaming DataFrames/DataSets without watermark")(plan)
}
```
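
For reference, a small sketch of where that metadata lives; the `spark.watermarkDelayMs` key is assumed to mirror `EventTimeWatermark.delayKey` in catalyst, so treat the exact key as an assumption:

```scala
import org.apache.spark.sql.types.MetadataBuilder

// The watermark is recorded as column metadata on the event-time attribute;
// a column rebuilt without this metadata fails the check above.
val delayKey = "spark.watermarkDelayMs"  // assumption: mirrors EventTimeWatermark.delayKey
val withDelay = new MetadataBuilder().putLong(delayKey, 30 * 1000L).build()
println(withDelay.contains(delayKey))  // true: this column "carries" a watermark
```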

After this PR, it runs successfully.

## How was this patch tested?

Jenkins



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark python-window

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17371.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17371


commit 654c5121fd26a85036787882d3d2c3b56360b686
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-21T07:49:11Z

bug fix: window operator miss the `watermark` metadata of time column







[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...

2017-03-19 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17052
  
retest this please.





[GitHub] spark issue #17352: [SPARK-20021][PySpark] Miss backslash in python code

2017-03-19 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17352
  
@felixcheung 





[GitHub] spark pull request #17352: Miss backslash in python code

2017-03-19 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17352

Miss backslash in python code

## What changes were proposed in this pull request?

Add backslash for line continuation in python code.

## How was this patch tested?

Jenkins.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark python-example-doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17352.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17352


commit d0a1c9f15288a4af4b3a3e12a89aff94d7104f7a
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-13T02:58:23Z

fix python example in doc

commit 965dce3d8707cadadf59594dc88310e2224ffeef
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-20T02:00:06Z

Merge branch 'master' into python-example-doc







[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-16 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17216
  
Does this PR accidentally mix in some test files?





[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...

2017-03-16 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17267
  
@srowen Could you please take a look and help merge this?





[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...

2017-03-14 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17267
  
ping @viirya





[GitHub] spark pull request #17267: [SPARK-19926][PYSPARK] Make pyspark exception mor...

2017-03-13 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17267#discussion_r105827541
  
--- Diff: python/pyspark/sql/utils.py ---
@@ -24,7 +24,7 @@ def __init__(self, desc, stackTrace):
 self.stackTrace = stackTrace
 
 def __str__(self):
-return repr(self.desc)
+return str(self.desc)
--- End diff --

based on latest commit:

```
>>> df.select("아")
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 75, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException
: cannot resolve '`아`' given input columns: [age, name];;
'Project ['아]
+- Relation[age#0L,name#1] json





[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...

2017-03-13 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17267
  
Thanks @HyukjinKwon, good catch! I missed that case. Thanks @viirya for your suggestion.





[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...

2017-03-13 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17267
  
IMHO, yes. And @viirya is the original author.





[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...

2017-03-12 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17267
  
@viirya Thanks for your review.
cc @srowen 





[GitHub] spark pull request #17209: [SPARK-19853][SS] uppercase kafka topics fail whe...

2017-03-12 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17209#discussion_r105575677
  
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala ---
@@ -606,6 +607,24 @@ class KafkaSourceSuite extends KafkaSourceTest {
     assert(query.exception.isEmpty)
   }
 
+  test("test to get offsets from case insensitive parameters") {
--- End diff --

Thanks!





[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...

2017-03-12 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17267
  
Maybe @viirya can give some suggestions.





[GitHub] spark pull request #17267: [SPARK-19926][PYSPARK] Make pyspark exception mor...

2017-03-12 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17267

[SPARK-19926][PYSPARK] Make pyspark exception more readable

## What changes were proposed in this pull request?

Exceptions in pyspark are a little difficult to read.

Before this PR, an exception looked like:

```
Traceback (most recent call last):
  File "", line 5, in 
  File "/root/dev/spark/dist/python/pyspark/sql/streaming.py", line 853, in 
start
return self._sq(self._jwrite.start())
  File 
"/root/dev/spark/dist/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/root/dev/spark/dist/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Append output mode not supported 
when there are streaming aggregations on streaming DataFrames/DataSets without 
watermark;;\nAggregate [window#17, word#5], [window#17 AS window#11, word#5, 
count(1) AS count#16L]\n+- Filter ((t#6 >= window#17.start) && (t#6 < 
window#17.end))\n   +- Expand [ArrayBuffer(named_struct(start, 
CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, 
(CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 
3000)), word#5, t#6-T3ms), ArrayBuffer(named_struct(start, 
CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, 
(CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
double))) + cast(1 as bigint)) - cast(1 as bigint))
  * 3000) + 0) + 3000)), word#5, t#6-T3ms)], [window#17, word#5, 
t#6-T3ms]\n  +- EventTimeWatermark t#6: timestamp, interval 30 
seconds\n +- Project [cast(word#0 as string) AS word#5, cast(t#1 as 
timestamp) AS t#6]\n+- StreamingRelation 
DataSource(org.apache.spark.sql.SparkSession@c4079ca,csv,List(),Some(StructType(StructField(word,StringType,true),
 StructField(t,IntegerType,true))),List(),None,Map(sep -> ;, path -> 
/tmp/data),None), FileSource[/tmp/data], [word#0, t#1]\n'
```

After this PR:

```
Traceback (most recent call last):
  File "", line 5, in 
  File "/root/dev/spark/dist/python/pyspark/sql/streaming.py", line 853, in 
start
return self._sq(self._jwrite.start())
  File 
"/root/dev/spark/dist/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/root/dev/spark/dist/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: Append output mode not supported when 
there are streaming aggregations on streaming DataFrames/DataSets without 
watermark;;
Aggregate [window#17, word#5], [window#17 AS window#11, word#5, count(1) AS 
count#16L]
+- Filter ((t#6 >= window#17.start) && (t#6 < window#17.end))
   +- Expand [ArrayBuffer(named_struct(start, 
CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, 
(CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 
3000)), word#5, t#6-T3ms), ArrayBuffer(named_struct(start, 
CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, 
(CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 
3000)), word#5, t#6-T3ms)], [window#17, word#5, t#6-T3ms]
  +- EventTimeWatermark t#6: timestamp, interval 30 seconds
 +- Project [cast(word#0 as string) AS word#5, cast(t#1 as 
timestamp) AS t#6]
+- StreamingRelation 
DataSource(org.apache.spark.sql.SparkSession@5265083b,csv,List(),Some(StructType(StructField(word,StringType,true),
 StructField(t,IntegerType,true))),List(),None,Map(sep -> ;, path -> 
/tmp/data),None), FileSource[/tmp/data], [word#0, t#1]
```

## How was this patch tested?

    Jenkins


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-19926

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17267.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17267


commit 273c1bc8d719158dd074cb806d5db

[GitHub] spark pull request #17209: [SPARK-19853][SS] uppercase kafka topics fail whe...

2017-03-12 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17209#discussion_r105553128
  
--- Diff: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala
 ---
@@ -128,18 +123,18 @@ private[kafka010] class KafkaSourceProvider extends 
DataSourceRegister
 .map { k => k.drop(6).toString -> parameters(k) }
 .toMap
 
-val startingRelationOffsets =
-  
caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) 
match {
-case Some("earliest") => EarliestOffsetRangeLimit
-case Some(json) => 
SpecificOffsetRangeLimit(JsonUtils.partitionOffsets(json))
-case None => EarliestOffsetRangeLimit
+val startingRelationOffsets = 
KafkaSourceProvider.getKafkaOffsetRangeLimit(
+  caseInsensitiveParams, STARTING_OFFSETS_OPTION_KEY, 
EarliestOffsetRangeLimit) match {
+case earliest @ EarliestOffsetRangeLimit => earliest
--- End diff --

👍  much simpler





[GitHub] spark pull request #17257: [DOCS][SS] fix structured streaming python exampl...

2017-03-11 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17257

[DOCS][SS] fix structured streaming python example

## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issues.

## How was this patch tested?

Jenkins.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark docs-ss-python

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17257


commit cd8269022e3066d07c6fc480305ccef89efd0993
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-11T09:27:40Z

fix structured streaming python example code







[GitHub] spark pull request #17209: [SPARK-19853][SS] uppercase kafka topics fail whe...

2017-03-10 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17209#discussion_r105528025
  
--- Diff: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala
 ---
@@ -450,10 +445,22 @@ private[kafka010] class KafkaSourceProvider extends 
DataSourceRegister
 
 private[kafka010] object KafkaSourceProvider {
   private val STRATEGY_OPTION_KEYS = Set("subscribe", "subscribepattern", 
"assign")
-  private val STARTING_OFFSETS_OPTION_KEY = "startingoffsets"
-  private val ENDING_OFFSETS_OPTION_KEY = "endingoffsets"
+  private[kafka010] val STARTING_OFFSETS_OPTION_KEY = "startingoffsets"
+  private[kafka010] val ENDING_OFFSETS_OPTION_KEY = "endingoffsets"
   private val FAIL_ON_DATA_LOSS_OPTION_KEY = "failondataloss"
--- End diff --

Changed the visibility for unit tests.





[GitHub] spark issue #17209: [SPARK-19853][SS] uppercase kafka topics fail when start...

2017-03-10 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17209
  
@zsxwing done, but I forgot to push. I will update it as soon as I can 
connect to the internet.





[GitHub] spark issue #17202: [SPARK-19861][SS] watermark should not be a negative tim...

2017-03-09 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17202
  
cc @srowen and @zsxwing 





[GitHub] spark issue #17221: [SPARK-19859][SS][Follow-up] The new watermark should ov...

2017-03-08 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17221
  
cc @zsxwing 





[GitHub] spark issue #17209: [SPARK-19853][SS] uppercase kafka topics fail when start...

2017-03-08 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17209
  
cc @zsxwing 





[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...

2017-03-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17202#discussion_r105075392
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -576,6 +576,8 @@ class Dataset[T] private[sql](
 val parsedDelay =
   Option(CalendarInterval.fromString("interval " + delayThreshold))
 .getOrElse(throw new AnalysisException(s"Unable to parse time 
delay '$delayThreshold'"))
+require(parsedDelay.milliseconds >= 0 && parsedDelay.months >= 0,
+  s"delay threshold ($delayThreshold) should not be negative.")
--- End diff --

use `require(parsedDelay.milliseconds >= 0 && parsedDelay.months >= 0)` to 
ensure `delayThreshold` is meaningful, not merely parseable.





[GitHub] spark pull request #17221: [SPARK-19859][SS][Follow-up] The new watermark sh...

2017-03-08 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17221

[SPARK-19859][SS][Follow-up] The new watermark should override the old one.

## What changes were proposed in this pull request?

A follow up to SPARK-19859:

- extract the calculation of `delayMs` and reuse it.
- update EventTimeWatermarkExec
- use the correct `delayMs` in EventTimeWatermark

## How was this patch tested?

Jenkins.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-19859

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17221.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17221


commit 2c2c8062921e83b4c8ae19afa2a0e199cd7a4814
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-09T02:26:35Z

follow-up to SPARK-19859







[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...

2017-03-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17202#discussion_r105069930
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -576,6 +576,11 @@ class Dataset[T] private[sql](
 val parsedDelay =
   Option(CalendarInterval.fromString("interval " + delayThreshold))
 .getOrElse(throw new AnalysisException(s"Unable to parse time 
delay '$delayThreshold'"))
+val delayMs = {
--- End diff --

sure





[GitHub] spark pull request #17216: [SPARK-19873][SS] Record num shuffle partitions i...

2017-03-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17216#discussion_r105069281
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -380,7 +382,20 @@ class StreamExecution(
 logInfo(s"Resuming streaming query, starting with batch $batchId")
 currentBatchId = batchId
 availableOffsets = nextOffsets.toStreamProgress(sources)
-offsetSeqMetadata = 
nextOffsets.metadata.getOrElse(OffsetSeqMetadata())
+val numShufflePartitionsFromConf = 
sparkSession.conf.get(SQLConf.SHUFFLE_PARTITIONS)
+offsetSeqMetadata = nextOffsets
+  .metadata
+  .getOrElse(OffsetSeqMetadata(0, 0, numShufflePartitionsFromConf))
+
+/*
+ * For backwards compatibility, if # partitions was not recorded 
in the offset log, then
+ * ensure it is non-zero. The new value is picked up from the conf.
+ */
--- End diff --

For inline comments within code, use `//` rather than `/* .. */`.





[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...

2017-03-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17202#discussion_r105067664
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -576,6 +576,11 @@ class Dataset[T] private[sql](
 val parsedDelay =
   Option(CalendarInterval.fromString("interval " + delayThreshold))
 .getOrElse(throw new AnalysisException(s"Unable to parse time 
delay '$delayThreshold'"))
+val delayMs = {
+  val millisPerMonth = CalendarInterval.MICROS_PER_DAY / 1000 * 31
+  parsedDelay.milliseconds + parsedDelay.months * millisPerMonth
+}
+assert(delayMs >= 0, s"delay threshold should not be a negative time: 
$delayThreshold")
--- End diff --

@srowen +1 to your comments. We should make it meaningful, not just 
valid.





[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...

2017-03-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17202#discussion_r104930545
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -576,6 +576,11 @@ class Dataset[T] private[sql](
 val parsedDelay =
   Option(CalendarInterval.fromString("interval " + delayThreshold))
 .getOrElse(throw new AnalysisException(s"Unable to parse time 
delay '$delayThreshold'"))
+val delayMs = {
+  val millisPerMonth = CalendarInterval.MICROS_PER_DAY / 1000 * 31
+  parsedDelay.milliseconds + parsedDelay.months * millisPerMonth
+}
+assert(delayMs >= 0, s"delay threshold should not be a negative time: 
$delayThreshold")
--- End diff --

Maybe you misunderstood my cases above. The first two cases below are 
invalid, i.e. their `parsedDelay` is negative:

| cases | validity |
|---|---|
| inputData.withWatermark("value", "1 month -40 days") | invalid |
| inputData.withWatermark("value", "-10 seconds") | invalid |
| inputData.withWatermark("value", "10 seconds") | valid |
| inputData.withWatermark("value", "1 day -10 seconds") | valid |





[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...

2017-03-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17202#discussion_r104904202
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -576,6 +576,11 @@ class Dataset[T] private[sql](
 val parsedDelay =
   Option(CalendarInterval.fromString("interval " + delayThreshold))
 .getOrElse(throw new AnalysisException(s"Unable to parse time 
delay '$delayThreshold'"))
+val delayMs = {
+  val millisPerMonth = CalendarInterval.MICROS_PER_DAY / 1000 * 31
+  parsedDelay.milliseconds + parsedDelay.months * millisPerMonth
+}
+assert(delayMs >= 0, s"delay threshold should not be a negative time: 
$delayThreshold")
--- End diff --

`delayThreshold: String` cannot be asserted on directly as a string; consider inputs like:
```
inputData.withWatermark("value", "1 month -40 days")
inputData.withWatermark("value", "-10 seconds")
```





[GitHub] spark pull request #17209: [SPARK-19853][SS] uppercase kafka topics fail whe...

2017-03-08 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17209

[SPARK-19853][SS] uppercase kafka topics fail when startingOffsets are 
SpecificOffsets

## What changes were proposed in this pull request?

When using the KafkaSource with Structured Streaming, consumer assignments 
are not what the user expects if startingOffsets is set to an explicit set of 
topics/partitions in JSON where the topic(s) happen to have uppercase 
characters. When StartingOffsets is constructed, the original string value from 
options is transformed toLowerCase to make matching on "earliest" and "latest" 
case insensitive. However, the toLowerCase JSON is passed to SpecificOffsets 
for the terminal condition, so topic names may not be what the user intended by 
the time assignments are made with the underlying KafkaConsumer.

KafkaSourceProvider.scala:
```
val startingOffsets =
  caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match {
    case Some("latest") => LatestOffsets
    case Some("earliest") => EarliestOffsets
    case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
    case None => LatestOffsets
  }
```
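A sketch of a case-preserving variant (an assumed shape, not necessarily the merged fix): compare the keywords case-insensitively but hand the original JSON through untouched, so mixed-case topic names reach the consumer exactly as the user wrote them.

```scala
// Only the keyword comparison is case-insensitive; the JSON keeps its case.
val startingOffsets =
  caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim) match {
    case Some(offsets) if offsets.equalsIgnoreCase("latest") => LatestOffsets
    case Some(offsets) if offsets.equalsIgnoreCase("earliest") => EarliestOffsets
    case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
    case None => LatestOffsets
  }
```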

## How was this patch tested?

Jenkins


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-19853

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17209


commit e2a26bf8fb8554fb030e7f5bd2197befb9ed55e2
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-08T11:59:17Z

Uppercase Kafka topics fail when startingOffsets are SpecificOffsets







[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...

2017-03-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17202#discussion_r104899893
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -576,6 +576,11 @@ class Dataset[T] private[sql](
 val parsedDelay =
   Option(CalendarInterval.fromString("interval " + delayThreshold))
 .getOrElse(throw new AnalysisException(s"Unable to parse time 
delay '$delayThreshold'"))
+val delayMs = {
+  val millisPerMonth = CalendarInterval.MICROS_PER_DAY / 1000 * 31
+  parsedDelay.milliseconds + parsedDelay.months * millisPerMonth
+}
+assert(delayMs >= 0, s"delay threshold should not be a negative time: 
$delayThreshold")
--- End diff --

@srowen Thanks for your review!
> Why compute all this -- don't you just mean to assert about 
delayThreshold?

I do mean to check the `delayThreshold`. `delayThreshold` is converted from 
`String` to `CalendarInterval`, which splits it into two parts: months 
(covering years and months) and the microseconds of the rest. 
(https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java#L86)

(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/EventTimeWatermarkExec.scala#L87)

>  this derived value can only be negative if the input is right?

Sorry, I don't get what you mean.





[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...

2017-03-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17202#discussion_r104888633
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -576,6 +576,8 @@ class Dataset[T] private[sql](
 val parsedDelay =
   Option(CalendarInterval.fromString("interval " + delayThreshold))
 .getOrElse(throw new AnalysisException(s"Unable to parse time 
delay '$delayThreshold'"))
+assert(parsedDelay.microseconds >= 0,
--- End diff --

Setting it to 0 means the event time should not be less than the max event time of the last batch.





[GitHub] spark issue #17202: [SPARK-19861][SS] watermark should not be a negative tim...

2017-03-07 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17202
  
cc @zsxwing 





[GitHub] spark pull request #17202: [SPARK-19861][SS] watermark should not be a negat...

2017-03-07 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17202

[SPARK-19861][SS] watermark should not be a negative time.

## What changes were proposed in this pull request?

watermark should not be a negative time.

## How was this patch tested?

add new unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-19861

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17202.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17202


commit dcc77eda3d88f8cd5c66b60730c0ada5ae717cc3
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-08T03:28:29Z

watermark should not be a negative time.

commit 00bbadc063e515653af86ab85fc95833bccf727a
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-08T03:30:13Z

update







[GitHub] spark pull request #17052: [SPARK-19690][SS] Join a streaming DataFrame with...

2017-03-07 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17052#discussion_r104651490
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala 
---
@@ -57,10 +60,31 @@ trait Source  {
   def getBatch(start: Option[Offset], end: Offset): DataFrame
 
   /**
+   * In a streaming query, stream relation will be cut into a series of 
batch relations.
+   * We need to mark the batch relation as streaming, i.e. data coming 
from a stream source,
+   * so we can apply those streaming strategies to it.
+   */
+  def markAsStreaming(df: DataFrame): DataFrame = {
+val markAsStreaming = df.logicalPlan transform {
+  case logicalRDD @ LogicalRDD(_, _, _, _, false) =>
+logicalRDD.dataFromStreaming = true
+logicalRDD
+  case logicalRelation @ LogicalRelation(_, _, _, false) =>
+logicalRelation.dataFromStreaming = true
+logicalRelation
+  case localRelation @ LocalRelation(_, _, false) =>
+localRelation.dataFromStreaming = true
+localRelation
+}
+
--- End diff --

Add a new parameter `dataFromStreaming` to the constructors of 
LogicalRelation, LogicalRDD and LocalRelation. `dataFromStreaming` indicates whether 
the relation comes from a streaming source. In a streaming query, the stream 
relation will be cut into a series of batch relations.
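
For illustration only, a toy version of that constructor-flag idea (hypothetical names, not Spark's actual classes):

```scala
// Hypothetical sketch: a relation carries a mutable flag recording whether
// its data was cut from a streaming source; ordinary batch relations keep
// the default of false.
case class LocalRelation(rows: Seq[String], var dataFromStreaming: Boolean = false)

val batch = LocalRelation(Seq("a", "b"))
batch.dataFromStreaming = true // set when a stream source hands back this slice
```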





[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...

2017-03-07 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17144#discussion_r104635754
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala 
---
@@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends 
BlockManagerReplicationBehav
 
 val newLocations = master.getLocations(blockId).toSet
 logInfo(s"New locations : $newLocations")
-assert(newLocations.size === replicationFactor)
+eventually(timeout(5 seconds), interval(10 millis)) {
+  assert(newLocations.size === replicationFactor)
--- End diff --

Ahhh, got it





[GitHub] spark issue #17141: [SPARK-19800][SS][WIP] Implement one kind of streaming s...

2017-03-07 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17141
  
ping @tdas and @zsxwing 





[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...

2017-03-07 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17144#discussion_r104630604
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala 
---
@@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends 
BlockManagerReplicationBehav
 
 val newLocations = master.getLocations(blockId).toSet
 logInfo(s"New locations : $newLocations")
-assert(newLocations.size === replicationFactor)
+eventually(timeout(5 seconds), interval(10 millis)) {
+  assert(newLocations.size === replicationFactor)
--- End diff --

@srowen Please view the discussion here. Maybe we should keep the first 
sleep.





[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...

2017-03-07 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17144#discussion_r104619252
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala 
---
@@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends 
BlockManagerReplicationBehav
 
 val newLocations = master.getLocations(blockId).toSet
 logInfo(s"New locations : $newLocations")
-assert(newLocations.size === replicationFactor)
+eventually(timeout(5 seconds), interval(10 millis)) {
+  assert(newLocations.size === replicationFactor)
--- End diff --

IMHO, we cannot remove the first sleep. For example, suppose there are three 
block managers A, B, and C. When we start to remove BM-A, all blocks in BM-A will be 
replicated to BM-B and BM-C. We cannot remove BM-B immediately or too fast, as 
there may not be enough time to finish the replication, and the new block info may 
not be registered with the master properly. So we should instead sleep a little 
longer, just like my first fix. But it is OK to remove the second sleep. 
@kayousterhout Tell me if I am missing something.
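
For comparison, re-reading the locations inside the `eventually` block lets each retry observe fresh replication state instead of a snapshot taken before the wait (a sketch, assuming the suite's existing ScalaTest `Eventually` imports):

```scala
// Fetch inside the retry loop so every attempt sees the current state:
eventually(timeout(5 seconds), interval(10 millis)) {
  val newLocations = master.getLocations(blockId).toSet
  assert(newLocations.size === replicationFactor)
}
```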





[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...

2017-03-06 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17144
  
cc @srowen 





[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...

2017-03-06 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17144#discussion_r104570057
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala 
---
@@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends 
BlockManagerReplicationBehav
 
 val newLocations = master.getLocations(blockId).toSet
 logInfo(s"New locations : $newLocations")
-assert(newLocations.size === replicationFactor)
+eventually(timeout(5 seconds), interval(10 millis)) {
+  assert(newLocations.size === replicationFactor)
+}
 // there should only be one common block manager between initial and 
new locations
--- End diff --

It continually checks the condition and times out after 5 seconds.





[GitHub] spark pull request #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpoin...

2017-03-05 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17167#discussion_r104310809
  
--- Diff: 
streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala ---
@@ -152,11 +152,9 @@ trait DStreamCheckpointTester { self: SparkFunSuite =>
   stopSparkContext: Boolean
 ): Seq[Seq[V]] = {
 try {
-  val batchDuration = ssc.graph.batchDuration
--- End diff --

@srowen Yes





[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...

2017-03-04 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17145
  
cc @srowen 





[GitHub] spark issue #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOper...

2017-03-04 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17167
  
cc @zsxwing 





[GitHub] spark issue #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOper...

2017-03-04 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17167
  
cc @srowen 





[GitHub] spark issue #16656: [SPARK-18116][DStream] Report stream input information a...

2017-03-04 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/16656
  
ping @zsxwing 





[GitHub] spark pull request #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpoin...

2017-03-04 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17167

[SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should not 
check checkpointFilesOfLatestTime by the PATH string.

## What changes were proposed in this pull request?


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73800/testReport/

```
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed 
to eventually never 
returned normally. Attempted 617 times over 10.003740484 seconds. Last 
failure message: 8 did 
not equal 2.
```

the check condition is:

```
val checkpointFilesOfLatestTime =
  Checkpoint.getCheckpointFiles(checkpointDir).filter {
    _.toString.contains(clock.getTimeMillis.toString)
  }
// Checkpoint files are written twice for every batch interval,
// so assert that both of them have been written.
assert(checkpointFilesOfLatestTime.size === 2)
```

the path string may contain the `clock.getTimeMillis.toString`, like:

```

file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-500

file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1000

file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1500

file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2000

file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2500

file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3000

file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500.bk

file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500

------
```

so we should only check the filename, not the whole path.
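
A minimal sketch of the filename-only check (assuming Hadoop's `Path` is available in the suite):

```scala
import org.apache.hadoop.fs.Path

// Compare only the file name (e.g. "checkpoint-3000"), so a timestamp that
// happens to appear in the directory name cannot produce false matches.
val checkpointFilesOfLatestTime =
  Checkpoint.getCheckpointFiles(checkpointDir).filter { file =>
    new Path(file.toString).getName.contains(clock.getTimeMillis.toString)
  }
assert(checkpointFilesOfLatestTime.size === 2)
```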

## How was this patch tested?

Jenkins.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark flaky-CheckpointSuite

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17167.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17167


commit 72f1963a36f9f1abfe8ca10d30b01f52c2281d82
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-03T10:11:52Z

flaky CheckpointSuite test failure




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...

2017-03-04 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17144
  
@kayousterhout sure, I have been working on that flaky test.





[GitHub] spark pull request #17145: [SPARK-19805][TEST] Log the row type when query r...

2017-03-04 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17145#discussion_r104304108
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala ---
@@ -312,13 +312,23 @@ object QueryTest {
   sparkAnswer: Seq[Row],
   isSorted: Boolean = false): Option[String] = {
 if (prepareAnswer(expectedAnswer, isSorted) != 
prepareAnswer(sparkAnswer, isSorted)) {
+  val getRowType: Option[Row] => String = row =>
+"RowType" + row.map(row =>
--- End diff --

@hvanhovell After using `schema.catalogString`:

```
!== Correct Answer - 1 ==  == Spark Answer - 1 ==
!struct<_1:string,_2:string>   struct<_1:int,_2:string>
![1,a] [1,a]
```
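
For reference, a self-contained way to reproduce that `catalogString` output (assumes only Spark on the classpath; the local-mode session is illustrative):

```scala
import org.apache.spark.sql.SparkSession

object CatalogStringDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
    import spark.implicits._

    // A one-row DataFrame with an int and a string column:
    val df = Seq((1, "a")).toDF()
    println(df.schema.catalogString) // struct<_1:int,_2:string>

    spark.stop()
  }
}
```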





[GitHub] spark pull request #17145: [SPARK-19805][TEST] Log the row type when query r...

2017-03-03 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17145#discussion_r104157326
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala ---
@@ -312,13 +312,23 @@ object QueryTest {
   sparkAnswer: Seq[Row],
   isSorted: Boolean = false): Option[String] = {
 if (prepareAnswer(expectedAnswer, isSorted) != 
prepareAnswer(sparkAnswer, isSorted)) {
+  val getRowType: Option[Row] => String = row =>
+"RowType" + row.map(row =>
--- End diff --

OK, let me test it.





[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...

2017-03-03 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17145
  
cc @srowen 





[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...

2017-03-03 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17144
  
cc @kayousterhout 





[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...

2017-03-03 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17144
  
test crash. retest this please.





[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...

2017-03-03 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17145
  
retest this please.





[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query type dos...

2017-03-02 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17145
  
unrelated failure: ` 
org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
 test for failOnDataLoss=false`. retest this please.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-03-02 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/14731
  
@srowen Waiting for your final OK





[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...

2017-03-02 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17144
  
One more flaky test? `org.apache.spark.streaming.CheckpointSuite.recovery 
with map and reduceByKey operations`. I will check it later. retest this please.





[GitHub] spark issue #17080: [SPARK-19739][CORE] propagate S3 session token to cluser

2017-03-02 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17080
  
cc @vanzin 





[GitHub] spark pull request #17145: [SPARK-19805][TEST] Log the row type when query t...

2017-03-02 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17145

[SPARK-19805][TEST] Log the row type when query type does not match

## What changes were proposed in this pull request?

Before this PR:

```
== Results ==
!== Correct Answer - 3 ==   == Spark Answer - 3 ==
 [1][1]
 [2][2]
 [3][3]

```

After this PR:

```
== Results ==
!== Correct Answer - 3 ==   == Spark Answer - 3 ==
!RowType[string]RowType[integer]
 [1][1]
 [2][2]
 [3][3]
```

## How was this patch tested?

Jenkins


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark improve-test-result

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17145.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17145


commit de98dfd60feedb74d4ff9e9cecbd61ceb7582530
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-03T03:52:06Z

Log the row type when query type does not match






