Repository: spark
Updated Branches:
  refs/heads/branch-2.0 7de183d97 -> 86adc5cfb
[SPARK-16114][SQL] updated structured streaming guide

## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoetho...@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.

(cherry picked from commit 51a6706b1339bb761602e33276a469f71be2cd90)
Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/86adc5cf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/86adc5cf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/86adc5cf

Branch: refs/heads/branch-2.0
Commit: 86adc5cfbe286eb4d6071ec9ee09b6d0960a8509
Parents: 7de183d
Author: James Thomas <jamesjoetho...@gmail.com>
Authored: Wed Jul 13 13:26:23 2016 -0700
Committer: Tathagata Das <tathagata.das1...@gmail.com>
Committed: Wed Jul 13 13:26:34 2016 -0700

----------------------------------------------------------------------
 docs/structured-streaming-programming-guide.md | 49 ++++++++++-----------
 1 file changed, 23 insertions(+), 26 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/86adc5cf/docs/structured-streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 7949396..3ef39e4 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -626,52 +626,49 @@ The result tables would look something like the following.
 
 ![Window Operations](img/structured-streaming-window.png)
 
-Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations.
+Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations. You can see the full code for the below examples in
+[Scala]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala)/
+[Java]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java)/
+[Python]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py).
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
 {% highlight scala %}
-// Number of events in every 1 minute time windows
-df.groupBy(window(df.col("time"), "1 minute"))
-  .count()
+import spark.implicits._
+val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 
-// Average number of events for each device type in every 1 minute time windows
-df.groupBy(
-  df.col("type"),
-  window(df.col("time"), "1 minute"))
-  .avg("signal")
+// Group the data by window and word and compute the count of each group
+val windowedCounts = words.groupBy(
+  window($"timestamp", "10 minutes", "5 minutes"),
+  $"word"
+).count()
 {% endhighlight %}
 
 </div>
 <div data-lang="java" markdown="1">
 
 {% highlight java %}
-import static org.apache.spark.sql.functions.window;
-
-// Number of events in every 1 minute time windows
-df.groupBy(window(df.col("time"), "1 minute"))
-  .count();
-
-// Average number of events for each device type in every 1 minute time windows
-df.groupBy(
-  df.col("type"),
-  window(df.col("time"), "1 minute"))
-  .avg("signal");
+Dataset<Row> words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
+// Group the data by window and word and compute the count of each group
+Dataset<Row> windowedCounts = words.groupBy(
+  functions.window(words.col("timestamp"), "10 minutes", "5 minutes"),
+  words.col("word")
+).count();
 {% endhighlight %}
 
 </div>
 <div data-lang="python" markdown="1">
 
 {% highlight python %}
-from pyspark.sql.functions import window
-
-# Number of events in every 1 minute time windows
-df.groupBy(window("time", "1 minute")).count()
+words = ... # streaming DataFrame of schema { timestamp: Timestamp, word: String }
 
-# Average number of events for each device type in every 1 minute time windows
-df.groupBy("type", window("time", "1 minute")).avg("signal")
+# Group the data by window and word and compute the count of each group
+windowedCounts = words.groupBy(
+    window(words.timestamp, '10 minutes', '5 minutes'),
+    words.word
+).count()
 {% endhighlight %}
 
 </div>
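For readers of this commit: the new examples all call `window(timestamp, "10 minutes", "5 minutes")`, a 10-minute window sliding every 5 minutes, so each event is counted in two overlapping windows (window length / slide interval). A rough, Spark-free sketch of that window-assignment semantics in plain Python (the helper names `sliding_windows` and `windowed_counts` are illustrative, not Spark APIs, and epoch alignment is an assumption):

```python
from collections import Counter
from datetime import datetime, timedelta

def sliding_windows(ts, length_min=10, slide_min=5):
    """Return the (start, end) windows that an event at `ts` falls into,
    approximating window(col, "10 minutes", "5 minutes") semantics.
    Window starts are assumed to be aligned to the Unix epoch."""
    length = timedelta(minutes=length_min)
    slide = timedelta(minutes=slide_min)
    epoch = datetime(1970, 1, 1)
    # Latest window start at or before ts, aligned to the slide interval
    slots = (ts - epoch) // slide
    start = epoch + slots * slide
    windows = []
    # Walk backwards while the window still covers ts
    while start > ts - length:
        windows.append((start, start + length))
        start -= slide
    return sorted(windows)

def windowed_counts(events, length_min=10, slide_min=5):
    """events: iterable of (timestamp, word) pairs.
    Returns a Counter keyed by ((window_start, window_end), word),
    like the groupBy(window(...), word).count() in the examples."""
    counts = Counter()
    for ts, word in events:
        for win in sliding_windows(ts, length_min, slide_min):
            counts[(win, word)] += 1
    return counts
```

For example, an event at 12:07 lands in both the 12:00-12:10 and 12:05-12:15 windows, which is why a word can contribute to more than one row of the result table.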