[GitHub] spark pull request #14183: [SPARK-16114] [SQL] updated structured streaming ...

2016-07-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14183





[GitHub] spark pull request #14183: [SPARK-16114] [SQL] updated structured streaming ...

2016-07-13 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/14183#discussion_r70700875
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -626,52 +626,48 @@ The result tables would look something like the following.
 
 ![Window Operations](img/structured-streaming-window.png)
 
-Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations.
+Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations. You can see the full code for the below examples in
+[Scala]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala)/
+[Java]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java)/
+[Python]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py).
 
 
 
 
 {% highlight scala %}
-// Number of events in every 1 minute time windows
-df.groupBy(window(df.col("time"), "1 minute"))
-  .count()
+import spark.implicits._
 
+val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 
-// Average number of events for each device type in every 1 minute time windows
-df.groupBy(
- df.col("type"),
- window(df.col("time"), "1 minute"))
-  .avg("signal")
+// Group the data by window and word and compute the count of each group
+val windowedCounts = words.groupBy(
+  window($"timestamp", "10 minutes", "5 minutes"), $"word"
+).count().orderBy("window")
 {% endhighlight %}
 
 
 
 
 {% highlight java %}
-import static org.apache.spark.sql.functions.window;
-
-// Number of events in every 1 minute time windows
-df.groupBy(window(df.col("time"), "1 minute"))
-  .count();
-
-// Average number of events for each device type in every 1 minute time windows
-df.groupBy(
- df.col("type"),
- window(df.col("time"), "1 minute"))
-  .avg("signal");
+Dataset<Row> words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 
+// Group the data by window and word and compute the count of each group
+Dataset<Row> windowedCounts = words.groupBy(
+  functions.window(words.col("timestamp"), "10 minutes", "5 minutes"),
+  words.col("word")
+).count().orderBy("window");
 {% endhighlight %}
 
 
 
 {% highlight python %}
-from pyspark.sql.functions import window
-
-# Number of events in every 1 minute time windows
-df.groupBy(window("time", "1 minute")).count()
+words = ... # streaming DataFrame of schema { timestamp: Timestamp, word: String }
 
-# Average number of events for each device type in every 1 minute time windows
-df.groupBy("type", window("time", "1 minute")).avg("signal")
+# Group the data by window and word and compute the count of each group
+windowedCounts = words.groupBy(
+    window(words.timestamp, '10 minutes', '5 minutes'),
+    words.word
+).count().orderBy('window')
--- End diff --

orderBy not important.





[GitHub] spark pull request #14183: [SPARK-16114] [SQL] updated structured streaming ...

2016-07-13 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/14183#discussion_r70700829
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -626,52 +626,48 @@ The result tables would look something like the following.
 
 ![Window Operations](img/structured-streaming-window.png)
 
-Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations.
+Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations. You can see the full code for the below examples in
+[Scala]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala)/
+[Java]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java)/
+[Python]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py).
 
 
 
 
 {% highlight scala %}
-// Number of events in every 1 minute time windows
-df.groupBy(window(df.col("time"), "1 minute"))
-  .count()
+import spark.implicits._
 
+val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 
-// Average number of events for each device type in every 1 minute time windows
-df.groupBy(
- df.col("type"),
- window(df.col("time"), "1 minute"))
-  .avg("signal")
+// Group the data by window and word and compute the count of each group
+val windowedCounts = words.groupBy(
+  window($"timestamp", "10 minutes", "5 minutes"), $"word"
+).count().orderBy("window")
--- End diff --

orderBy("window") is not essential. it was only for pretty printing in the 
example.
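
A minimal sketch, assuming the `words` streaming DataFrame from the diff above, of what the Scala snippet looks like with the non-essential orderBy dropped:

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._  // assumes a SparkSession in scope named `spark`

// words: streaming DataFrame of schema { timestamp: Timestamp, word: String }
// Count words over 10-minute windows sliding every 5 minutes;
// the orderBy("window") from the diff is dropped -- it was only for pretty-printing
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
```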





[GitHub] spark pull request #14183: [SPARK-16114] [SQL] updated structured streaming ...

2016-07-13 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/14183#discussion_r70700857
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -626,52 +626,48 @@ The result tables would look something like the following.
 
 ![Window Operations](img/structured-streaming-window.png)
 
-Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations.
+Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations. You can see the full code for the below examples in
+[Scala]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala)/
+[Java]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java)/
+[Python]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py).
 
 
 
 
 {% highlight scala %}
-// Number of events in every 1 minute time windows
-df.groupBy(window(df.col("time"), "1 minute"))
-  .count()
+import spark.implicits._
 
+val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 
-// Average number of events for each device type in every 1 minute time windows
-df.groupBy(
- df.col("type"),
- window(df.col("time"), "1 minute"))
-  .avg("signal")
+// Group the data by window and word and compute the count of each group
+val windowedCounts = words.groupBy(
+  window($"timestamp", "10 minutes", "5 minutes"), $"word"
+).count().orderBy("window")
 {% endhighlight %}
 
 
 
 
 {% highlight java %}
-import static org.apache.spark.sql.functions.window;
-
-// Number of events in every 1 minute time windows
-df.groupBy(window(df.col("time"), "1 minute"))
-  .count();
-
-// Average number of events for each device type in every 1 minute time windows
-df.groupBy(
- df.col("type"),
- window(df.col("time"), "1 minute"))
-  .avg("signal");
+Dataset<Row> words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 
+// Group the data by window and word and compute the count of each group
+Dataset<Row> windowedCounts = words.groupBy(
+  functions.window(words.col("timestamp"), "10 minutes", "5 minutes"),
+  words.col("word")
+).count().orderBy("window");
--- End diff --

orderBy not important.





[GitHub] spark pull request #14183: [SPARK-16114] [SQL] updated structured streaming ...

2016-07-13 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/14183#discussion_r70700752
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -626,52 +626,48 @@ The result tables would look something like the following.
 
 ![Window Operations](img/structured-streaming-window.png)
 
-Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations.
+Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations. You can see the full code for the below examples in
+[Scala]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala)/
+[Java]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java)/
+[Python]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py).
 
 
 
 
 {% highlight scala %}
-// Number of events in every 1 minute time windows
-df.groupBy(window(df.col("time"), "1 minute"))
-  .count()
+import spark.implicits._
 
+val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 
-// Average number of events for each device type in every 1 minute time windows
-df.groupBy(
- df.col("type"),
- window(df.col("time"), "1 minute"))
-  .avg("signal")
+// Group the data by window and word and compute the count of each group
+val windowedCounts = words.groupBy(
+  window($"timestamp", "10 minutes", "5 minutes"), $"word"
--- End diff --

Put word on the next line, to be consistent with the other examples.
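
For illustration, the requested layout (a sketch, assuming the imports and `words` DataFrame from the snippet above):

```scala
// $"word" on its own line, matching the layout of the Java and Python snippets
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count().orderBy("window")
```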





[GitHub] spark pull request #14183: [SPARK-16114] [SQL] updated structured streaming ...

2016-07-13 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/14183#discussion_r70687523
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -626,52 +626,95 @@ The result tables would look something like the following.
 
 ![Window Operations](img/structured-streaming-window.png)
 
-Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations.
+Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations. You can see the full code for the below examples in
+[Scala]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala)/
+[Java]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java)/
+[Python]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py).
--- End diff --

these changes are good. do not revert this based on what i have said below.





[GitHub] spark pull request #14183: [SPARK-16114] [SQL] updated structured streaming ...

2016-07-13 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/14183#discussion_r70687380
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -626,52 +626,95 @@ The result tables would look something like the 
following.
 
 ![Window Operations](img/structured-streaming-window.png)
 
-Since this windowing is similar to grouping, in code, you can use 
`groupBy()` and `window()` operations to express windowed aggregations.
+Since this windowing is similar to grouping, in code, you can use 
`groupBy()` and `window()` operations to express windowed aggregations. You can 
see the full code for the below examples in

+[Scala]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala)/

+[Java]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java)/

+[Python]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py).
 
 
 
 
 {% highlight scala %}
-// Number of events in every 1 minute time windows
-df.groupBy(window(df.col("time"), "1 minute"))
-  .count()
-
+import spark.implicits._
 
-// Average number of events for each device type in every 1 minute time windows
-df.groupBy(
- df.col("type"),
- window(df.col("time"), "1 minute"))
-  .avg("signal")
+// Create DataFrame representing the stream of input lines from connection to host:port
+val lines = spark.readStream
+  .format("socket")
+  .option("host", "localhost")
+  .option("port", )
+  .option("includeTimestamp", true)
+  .load().as[(String, Timestamp)]
+
+// Split the lines into words, retaining timestamps
+val words = lines.flatMap(line =>
+  line._1.split(" ").map(word => (word, line._2))
+).toDF("word", "timestamp")
+
+// Group the data by window and word and compute the count of each group
+val windowedCounts = words.groupBy(
--- End diff --

I took a look at the built doc again and imagined what it would look like. 
This would be very verbose. Since the nearest example in the doc 
(Basic Operations - Selection, Projection, Aggregation) uses device data and 
already has all the boilerplate code to define the DeviceData class, etc., let's 
not change the code snippet to the exact one in the example.

Can you revert all the code snippet changes and make just two changes to the 
Scala snippet (roughly as sketched below):
- Change df.col("..") to use $"..."
- Add import spark.implicits._
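
A sketch of what the reverted Scala snippet would look like with just those two changes applied (assuming the original `df` with time, type, and signal columns from the pre-PR guide):

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._  // assumes a SparkSession in scope named `spark`

// Number of events in every 1 minute time windows
df.groupBy(window($"time", "1 minute"))
  .count()

// Average signal value for each device type in every 1 minute time windows
df.groupBy(
  $"type",
  window($"time", "1 minute"))
  .avg("signal")
```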




