[ https://issues.apache.org/jira/browse/MRQL-63?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leonidas Fegaras updated MRQL-63:
---------------------------------
    Description: 
This patch introduces a major extension to MRQL, called MRQL streaming.
We can now run continuous MRQL queries on streams of data.
Currently, it works on Spark Streaming only, but we may add support for Flink 
Streaming and/or Storm in the future.
It has been tested in Spark local mode and in Spark distributed mode on a Yarn 
cluster.

MRQL now supports window-based streaming, using a sliding window over a fixed 
time interval. To enable MRQL streaming, add the parameter "-stream t" to the 
mrql command, where t is the time interval in milliseconds. MRQL will then 
process the new batch of data in the input streams every t milliseconds.
A stream source in MRQL takes the form stream(...), which has the same 
parameters as the source(...) form. For example:
{code:SQL}
select (k,avg(p.Y))
from p in stream(binary,"tmp/points.bin")
group by k: p.X;
{code}
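To make the result of this query concrete, here is a minimal Python sketch of what it computes for a single batch: group the points by X and average Y within each group. The tuple representation of a point and the function name are assumptions made for illustration only. The sketch also assumes each batch is evaluated on its own, which this description does not spell out.

```python
from collections import defaultdict


def average_y_by_x(points):
    """Group (x, y) points by x and return {x: mean of y} for one batch."""
    groups = defaultdict(list)
    for x, y in points:
        groups[x].append(y)
    # One output pair per distinct x, like (k, avg(p.Y)) in the MRQL query.
    return {x: sum(ys) / len(ys) for x, ys in groups.items()}
```

For example, average_y_by_x([(1, 2.0), (1, 4.0), (2, 10.0)]) returns {1: 3.0, 2: 10.0}.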
The MRQL query above processes all sequence files in the directory 
tmp/points.bin and then checks this directory every t milliseconds for new 
files. When a new file is added to the directory (or the modification time of 
an existing file changes), the query processes the new files. A query may read 
from multiple files and may contain both stream and regular data sources. If 
there is at least one stream source, the query becomes continuous (it never 
stops). The output stream can be dumped to binary or CSV files using the 
existing MRQL syntax:
{code:SQL}
store "tmp/out" from e
{code}
This dumps the output of the continuous query e into tmp/out/f1, tmp/out/f2, 
and so on.
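
The directory check described earlier (pick up files that are new or whose modification time has changed, once per interval) can be sketched in plain Python as follows. This is an illustrative sketch only, not how MRQL or Spark Streaming implements it. The function names, the `seen` dictionary and the `rounds` parameter are hypothetical.

```python
import os
import time


def changed_files(directory, seen):
    """Return files that are new or whose modification time has changed.

    `seen` maps file path -> last observed mtime and is updated in place.
    """
    changed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        mtime = os.stat(path).st_mtime_ns
        if seen.get(path) != mtime:
            seen[path] = mtime
            changed.append(path)
    return changed


def poll(directory, interval_ms, process, rounds):
    """Check `directory` every `interval_ms` milliseconds, `rounds` times.

    A continuous query would loop forever; `rounds` keeps the sketch finite.
    """
    seen = {}
    for _ in range(rounds):
        batch = changed_files(directory, seen)
        if batch:
            process(batch)
        time.sleep(interval_ms / 1000.0)
```

With this model, the first check returns every file in the directory, later checks return nothing until a file is added or its modification time changes, and running touch on an existing file makes it appear in the next batch.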

Example for testing:
First create data:
{quote}
mrql.spark -local queries/points.mrql 100
{quote}
Then run the continuous query:
{quote}
mrql.spark -local -stream 1000 queries/streaming.mrql
{quote}
In a separate terminal, you can type:
{quote}
touch tmp/points.bin/part-00000
{quote}
to process a new batch of data.


  was:
This patch introduces a major extension to MRQL, called MRQL streaming.
We can now run continuous MRQL queries on streams of data.
Currently, it works on Spark Streaming only but we may add support for Flink 
Streaming and/or Storm in the future.
It has been tested in Spark local mode and in Spark distributed mode on a Yarn 
cluster.

MRQL now supports window-based streaming based on a sliding window during a 
certain time interval. To support MRQL streaming, you need to add the parameter 
"-stream t" to the mrql command, where t is the time interval in milliseconds. 
Then MRQL will processes the new batch of data in the input streams every t 
milliseconds.
A stream source in MRQL takes the form stream(...), which has the same 
parameters as the source(...) form. For example:
{code:SQL}
select (k,avg(p.Y))
from p in stream(binary,"tmp/points.bin")
group by k: p.X;
{code:SQL}
This query process all sequence files in the directory tmp/points.bin and then 
checks this directory every t milliseconds for new files. When a new file is 
inserted in the directory (or if the modification time of an existing file 
changes), it processes the new files. One may work on multiple files and the 
query may contain both stream and regular data sources. If there is at least 
one stream source, the query becomes continuous (never stops). One may dump the 
output stream to binary or CVS files using the existing MRQL syntax:
{code:SQL}
store "tmp/out" from e
{code:SQL}
This dumps the output of the continuous query e into tmp/out/f1, tmp/out/f2, 
... etc.

Example for testing:
First create data:
{quote}
mrql.spark -local queries/points.mrql 100
{quote}
Then run the continuous query:
{quote}
mrql.spark -local -stream 1000 queries/streaming.mrql
{quote}
On a separate terminal, you can type:
{quote}
touch tmp/points.bin/part-00000
{quote}
to process a new batch of data.



> Add support for MRQL streaming in spark streaming mode
> ------------------------------------------------------
>
>                 Key: MRQL-63
>                 URL: https://issues.apache.org/jira/browse/MRQL-63
>             Project: MRQL
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 0.9.4
>            Reporter: Leonidas Fegaras
>            Assignee: Leonidas Fegaras



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
