[
https://issues.apache.org/jira/browse/MRQL-63?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Leonidas Fegaras resolved MRQL-63.
----------------------------------
Resolution: Implemented
The patch was applied to GIT master.
> Add support for MRQL streaming in spark streaming mode
> ------------------------------------------------------
>
> Key: MRQL-63
> URL: https://issues.apache.org/jira/browse/MRQL-63
> Project: MRQL
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 0.9.4
> Reporter: Leonidas Fegaras
> Assignee: Leonidas Fegaras
> Attachments: MRQL-63.patch
>
>
> This patch introduces a major extension to MRQL, called MRQL streaming.
> We can now run continuous MRQL queries on streams of data.
> Currently, it works on Spark Streaming only, but we may add support for Flink
> Streaming and/or Storm in the future.
> It has been tested in Spark local mode and in Spark distributed mode on a
> Yarn cluster.
> MRQL now supports window-based streaming, using a sliding window over a
> fixed time interval. To enable MRQL streaming, add the
> parameter "-stream t" to the mrql command, where t is the time interval in
> milliseconds. MRQL will then process the new batch of data in the input
> streams every t milliseconds.
> A stream source in MRQL takes the form stream(...), which has the same
> parameters as the source(...) form. For example:
> {code:SQL}
> select (k,avg(p.Y))
> from p in stream(binary,"tmp/points.bin")
> group by k: p.X;
> {code}
> This query processes all sequence files in the directory tmp/points.bin and
> then checks this directory every t milliseconds for new files. When a new
> file is inserted in the directory (or if the modification time of an existing
> file changes), it processes the new or modified files. A query may read
> multiple files and may contain both stream and regular data sources. If there
> is at least one stream source, the query becomes continuous (never stops). One
> may dump the output stream to binary or CSV files using the existing MRQL syntax:
> {code:SQL}
> store "tmp/out" from e
> {code}
> This dumps the output of the continuous query e into tmp/out/f1, tmp/out/f2,
> and so on.
> Example for testing:
> First create data:
> {quote}
> mrql.spark -local queries/points.mrql 100
> {quote}
> Then run the continuous query:
> {quote}
> mrql.spark -local -stream 1000 queries/streaming.mrql
> {quote}
> In a separate terminal, you can type:
> {quote}
> touch tmp/points.bin/part-00000
> {quote}
> to process a new batch of data.
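
The directory-polling behavior described in the quoted issue (rescan the stream directory at each interval and pick up files that are new or whose modification time has changed) can be sketched roughly as follows. This is only an illustration in Python of why a `touch` triggers a new batch. It is not MRQL's or Spark Streaming's actual implementation, and the names `poll_once`, `seen`, and `demo` are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def poll_once(directory, seen):
    """Return names of files that are new or modified since the last poll.

    `seen` maps file name -> last observed modification time (in ns)
    and is updated in place. Hypothetical helper, for illustration only.
    """
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        mtime = path.stat().st_mtime_ns
        if seen.get(path.name) != mtime:
            seen[path.name] = mtime
            changed.append(path.name)
    return changed


def demo():
    """Simulate three polling rounds over a temporary stream directory."""
    with tempfile.TemporaryDirectory() as d:
        part = Path(d) / "part-00000"
        part.write_bytes(b"points")
        seen = {}
        first = poll_once(d, seen)   # initial scan picks up the file
        second = poll_once(d, seen)  # nothing changed, so no new batch
        # Equivalent of `touch`: move the modification time forward.
        st = part.stat()
        os.utime(part, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
        third = poll_once(d, seen)   # the touched file is picked up again
        return first, second, third
```

A continuous query would correspond to calling something like `poll_once` in a loop, sleeping t milliseconds between rounds and processing whatever it returns as the next batch.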
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)