[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-28 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213267299
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -266,7 +266,9 @@ object SQLConf {
     .createWithDefault(Long.MaxValue)

   val SHUFFLE_PARTITIONS = buildConf("spark.sql.shuffle.partitions")
-    .doc("The default number of partitions to use when shuffling data for joins or aggregations.")
+    .doc("The default number of partitions to use when shuffling data for joins or aggregations. " +
+      "Note: For structured streaming, this configuration cannot be changed between query " +
--- End diff --

The sentence is borrowed from the existing one: 
https://github.com/apache/spark/blob/8198ea50192cad615071beb5510c73aa9e9178f4/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L978-L979


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-28 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213267329
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -868,7 +870,9 @@ object SQLConf {
       .internal()
       .doc(
         "The class used to manage state data in stateful streaming queries. This class must " +
-          "be a subclass of StateStoreProvider, and must have a zero-arg constructor.")
+          "be a subclass of StateStoreProvider, and must have a zero-arg constructor. " +
+          "Note: For structured streaming, this configuration cannot be changed between query " +
--- End diff --

Same here.
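As background for the quoted doc string: the zero-arg constructor requirement exists because the provider is instantiated reflectively from the configured class name. A minimal standalone Python analogy (the `load_provider` helper and the example class path are hypothetical, purely for illustration — Spark's actual mechanism lives in `StateStoreProvider.create`):

```python
import importlib

# Hypothetical analogue of instantiating a provider from a config string:
# the class is looked up by name and constructed with no arguments, which
# is why a zero-arg constructor is required.
def load_provider(class_path: str):
    module_name, _, class_name = class_path.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()  # fails unless the class accepts zero args

provider = load_provider("collections.OrderedDict")
print(type(provider).__name__)  # → OrderedDict
```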


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-28 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213264912
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -868,7 +870,9 @@ object SQLConf {
       .internal()
       .doc(
         "The class used to manage state data in stateful streaming queries. This class must " +
-          "be a subclass of StateStoreProvider, and must have a zero-arg constructor.")
+          "be a subclass of StateStoreProvider, and must have a zero-arg constructor. " +
+          "Note: For structured streaming, this configuration cannot be changed between query " +
--- End diff --

s/cannot be/must not be/


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-28 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213264786
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -266,7 +266,9 @@ object SQLConf {
     .createWithDefault(Long.MaxValue)

   val SHUFFLE_PARTITIONS = buildConf("spark.sql.shuffle.partitions")
-    .doc("The default number of partitions to use when shuffling data for joins or aggregations.")
+    .doc("The default number of partitions to use when shuffling data for joins or aggregations. " +
+      "Note: For structured streaming, this configuration cannot be changed between query " +
--- End diff --

s/cannot be/must not be/


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213181165
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,19 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

+**Notes**
+
+- There are a couple of configurations that cannot be modified once you run the query. If you really want to change these configurations, you have to discard the checkpoint and start a new query.
+  - `spark.sql.shuffle.partitions`
+    - This is due to the physical partitioning of state: state is partitioned by applying a hash function to the key, hence the number of partitions for state should remain unchanged.
+    - If you want to run fewer tasks for stateful operations, `coalesce` helps avoid unnecessary repartitioning.
+      - e.g. `df.groupBy("time").count().coalesce(10)` reduces the number of tasks to 10, whereas `spark.sql.shuffle.partitions` may be bigger.
+      - After `coalesce`, the reduced number of tasks will be kept unless another shuffle happens.
+  - `spark.sql.streaming.stateStore.providerClass`
--- End diff --

Ah, okay, so there are more instances to describe here. If so, I'm okay.
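To make the hash-partitioning rationale quoted in the diff above concrete, here is a small standalone Python sketch (plain `key % n` stands in for Spark's actual hash partitioner, so this illustrates the idea rather than Spark's implementation):

```python
# Simplified model of hash-partitioned state: each key's state lives in
# partition hash(key) % numPartitions. Plain modulo stands in for the hash.
def state_partition(key: int, num_partitions: int) -> int:
    return key % num_partitions

keys = [3, 151, 199, 260, 599]

# State written while spark.sql.shuffle.partitions = 200 ...
written = {k: state_partition(k, 200) for k in keys}
# ... then the query restarts with the config changed to 150:
reread = {k: state_partition(k, 150) for k in keys}

# Most keys would now be looked up in the wrong partition, so previously
# stored state could not be found -- hence the config must stay fixed.
moved = [k for k in keys if written[k] != reread[k]]
print(moved)  # → [151, 199, 260, 599]
```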


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213181040
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,19 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f
 
 # Additional Information
 
+**Notes**
--- End diff --

I was thinking of adding this information somewhere in the API or configuration docs only. For instance, notes like https://github.com/apache/spark/pull/19617. 

> lots of wondering around SO and user mailing list,

I don't object to noting these things, but usually the site has only key points for some features or configurations. 

If there are more instances to describe specifically for structured streaming (where the same SQL configurations could lead to some confusion), I am fine with adding this. If not, or we're less sure for now, I would add them to the API's doc or configuration's doc.


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213180977
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,7 +2812,18 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

-**Further Reading**
+## Configuration Options For Structured Streaming
+
+This section is for configurations that are only available for structured streaming, or that behave differently from batch queries.
+
+- spark.sql.shuffle.partitions
--- End diff --

I was thinking of adding this information somewhere in the API or configuration docs only. For instance, notes like https://github.com/apache/spark/pull/19617. 

> lots of wondering around SO and user mailing list,

I don't object to noting these things, but usually the site has only key points for some features or configurations. 

If there are more instances to describe specifically for structured streaming (where the same SQL configurations could lead to some confusion), I am fine with adding this. If not, or we're less sure for now, I would add them to the API's doc or configuration's doc.


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213179259
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,7 +2812,18 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

-**Further Reading**
+## Configuration Options For Structured Streaming
+
+This section is for configurations that are only available for structured streaming, or that behave differently from batch queries.
+
+- spark.sql.shuffle.partitions
--- End diff --

I can revert adding a new section if you meant adding `##` on it. While `Gotchas` sounds a bit funny, I will change it to `**Notes**`.

The rationale for adding this to the doc is that this restriction has caused lots of confusion on SO and the user mailing list, and even prompted a patch trying to "fix" it. So it would be nice for all end users of structured streaming to see it at least once, even if they only skim the doc, so that they can remember it and revisit the doc once they get stuck on this.


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213178020
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,7 +2812,18 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

-**Further Reading**
+## Configuration Options For Structured Streaming
+
+This section is for configurations that are only available for structured streaming, or that behave differently from batch queries.
+
+- spark.sql.shuffle.partitions
--- End diff --

What I am worried about is adding a new section, which is quite unusual. Usually we go for it when multiple instances are detected later.

Are there more instances to describe here specifically for structured streaming?


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213177623
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,7 +2812,18 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

-**Further Reading**
+## Configuration Options For Structured Streaming
+
+This section is for configurations that are only available for structured streaming, or that behave differently from batch queries.
+
+- spark.sql.shuffle.partitions
--- End diff --

IMHO, if something goes wrong with structured streaming, end users would try to review the structured streaming guide doc rather than the SQL programming guide doc. Could we wait to hear more voices on this?


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213176029
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,7 +2812,18 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

-**Further Reading**
+## Configuration Options For Structured Streaming
+
+This section is for configurations that are only available for structured streaming, or that behave differently from batch queries.
+
+- spark.sql.shuffle.partitions
--- End diff --

We do have it in `sql-programming-guide.md`. Shall we add some info there 
for now?


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213129120
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

+**Gotchas**
+
+- For structured streaming, modifying "spark.sql.shuffle.partitions" is restricted once you run the query.
+  - This is because state is partitioned by key, hence the number of partitions for state should remain unchanged.
+  - If you want to run fewer tasks for stateful operations, `coalesce` helps avoid unnecessary repartitioning. Please note that it will also affect downstream operators.
--- End diff --

It just means that the number of partitions in the stateful operation's output will be the same as the parameter of `coalesce`, and that number of partitions will be kept unless another shuffle happens. Without `coalesce` it is implicitly the same as `spark.sql.shuffle.partitions`, whose default value is 200.

I'll add the code, but I'm not sure we need the code per language in Scala / Java / Python tabs, since it would be the same.
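The interaction described above can be sketched in a few lines of standalone Python (the contiguous-packing rule below is a simplification of how Spark's `coalesce` groups parent partitions, used here only to illustrate the idea):

```python
# 200 shuffle partitions (the spark.sql.shuffle.partitions default) packed
# into 10 output tasks by coalesce(10); the state layout itself is untouched.
NUM_SHUFFLE_PARTITIONS = 200
COALESCED = 10

def shuffle_partition(key: int) -> int:
    return key % NUM_SHUFFLE_PARTITIONS

def coalesced_task(partition: int) -> int:
    # Simplification: each task takes a contiguous run of 20 parent partitions.
    return partition // (NUM_SHUFFLE_PARTITIONS // COALESCED)

partitions = {shuffle_partition(k) for k in range(1000)}  # all 200 used
tasks = {coalesced_task(p) for p in partitions}           # only 10 tasks run

print(len(partitions), len(tasks))  # → 200 10
```

The point: there are still 200 state partitions on disk, but only 10 tasks execute them, and downstream stages keep running 10 tasks until another shuffle occurs.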


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213123711
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

+**Gotchas**
--- End diff --

I was going to add the explanation to `doc()` of `spark.sql.shuffle.partitions`, but it looks like what we explain in `doc()` would not be published automatically. (Please correct me if I'm missing something here.) SQLConf is not even exposed to scaladoc. That's why I'm adding this to the structured streaming guide doc. Actually, I think most end users only look at this doc for structured streaming, and we can't (and shouldn't) expect end users to look at the source code to find it.

Also, I didn't notice that `spark.sql.shuffle.partitions` is explained in `sql-programming-guide.md`, but I still think we need to explain all configs here if they work differently from batch queries. `spark.sql.shuffle.partitions` is such a case. 

Btw, `Gotchas` does sound funny. Maybe having a section would be better, like `## Other Configuration Options` in `sql-programming-guide.md`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213063267
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

+**Gotchas**
+
+- For structured streaming, modifying "spark.sql.shuffle.partitions" is restricted once you run the query.
+  - This is because state is partitioned by key, hence the number of partitions for state should remain unchanged.
+  - If you want to run fewer tasks for stateful operations, `coalesce` helps avoid unnecessary repartitioning. Please note that it will also affect downstream operators.
--- End diff --

An example of how to use the `coalesce` operator with a stateful streaming query would be superb.

I'd also appreciate it if you added what types of downstream operators are affected and how.


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread arunmahadevan
Github user arunmahadevan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213049895
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

+**Gotchas**
--- End diff --

IMO, it would be better to keep it here as well as in the code; we may not be able to surface it in the right API docs, and there's a chance users would ignore it.

@HeartSaVioR, maybe add an example here to illustrate how to use `coalesce`?



---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213005430
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

 # Additional Information

+**Gotchas**
--- End diff --

hm .. @HeartSaVioR how about leaving them in the code or API docs somewhere as a `note`?


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-26 Thread HeartSaVioR
GitHub user HeartSaVioR opened a pull request:

https://github.com/apache/spark/pull/22238

[SPARK-25245][DOCS][SS] Explain regarding limiting modification on 
"spark.sql.shuffle.partitions" for structured streaming

## What changes were proposed in this pull request?

This patch adds an explanation of why `spark.sql.shuffle.partitions` must stay unchanged in structured streaming, which a couple of users have already wondered about; some of them even thought it was a bug.

This patch should help other end users learn about this behavior before they discover it by themselves and are left wondering.

## How was this patch tested?

No tests are needed, because this is a simple addition to the guide doc, checked with a markdown editor.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HeartSaVioR/spark SPARK-25245

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22238.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22238


commit ec24f2941021d34058d7af3d3f103094582fd79d
Author: Jungtaek Lim 
Date:   2018-08-26T21:08:04Z

SPARK-25245 Explain regarding limiting modification on 
"spark.sql.shuffle.partitions" for structured streaming




---
