Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Eugene Kirpichov
...Final note: performance when executing queries "limit A, B" and "limit
C, D" in sequence may be completely different from when executing them in
parallel. In particular, if they are run in parallel, much less caching is
likely to happen. Make sure your benchmarks account for this too.

On Tue, Jun 13, 2017 at 4:40 PM Eugene Kirpichov 
wrote:

> Most likely the identical performance you observed for the "limit" clause is
> because you are not sorting the rows. Without sorting, a "limit" query is
> meaningless: the database is technically allowed to return exactly the same
> result for "limit 0, 10" and "limit 10, 20", because both of these queries
> are in essence asking to give you 10 arbitrary rows from the table.
>
> On Tue, Jun 13, 2017 at 4:38 PM Eugene Kirpichov 
> wrote:
>
>> Thanks Madhusudan. Please note that in your case the time was likely
>> dominated by shipping the rows over the network rather than by executing
>> the query. Please make sure to include benchmarks where the query itself is
>> expensive to evaluate (e.g. "select count(*) from query" takes time
>> comparable to "select * from query"), and make sure that its results are
>> large enough to be non-cacheable.
>>
>> Also let me know your thoughts on ensuring *semantic* correctness - i.e.
>> the fact that "select * limit n, m" is non-deterministic, and requires an
>> ORDER BY clause to provide any kind of determinism - and that determinism,
>> too, is lost if the table may be concurrently modified.
>> E.g. you have a table with values in column "id": 0, 10, 20, 30, 40, 50.
>> You split it into 2 parts - "order by id limit 0, 3" and "order by id limit
>> 3, 6".
>> While you read the "limit 0, 3" part, the value 5 is added, and this part
>> gets values 0, 10, 5.
>> When you read the "limit 3, 6" part, the value 7 is added, and this part
>> gets values 10, 20, 30.
>> As a result, you get the value 10 twice, and you completely miss the
>> values 7, 40 and 50.
>>
>> The only way to address this is if the database supports snapshot
>> (point-in-time) reads. If Aurora supports this and if you plan to use them,
>> then make sure that you use this feature in all of your benchmarks - it
>> might be less optimized by Aurora internally.
>>
>> On Tue, Jun 13, 2017 at 4:25 PM Madhusudan Borkar 
>> wrote:
>>
>>> Hi,
>>> Appreciate your questions.
>>> One thing I believe: even though AWS Aurora is based on MySQL, it is not
>>> MySQL. AWS has developed this RDS database service from the ground up and
>>> has improved or completely changed its implementation. That being said,
>>> some of the things one may have experienced with a standard MySQL
>>> implementation will not hold. They have changed the way query results are
>>> cached, using a new technique. RDS has multiple read replicas which are
>>> load balanced and separated from the writer. There are many more
>>> improvements.
>>> The query to read m rows from row n works like
>>> select * from table limit n,m;
>>> My research with Aurora shows that this doesn't read up to n-1 rows first
>>> and then discard that data before picking up the next m rows. It looks
>>> like it picks up from the nth row. My table with 2M rows returned data for
>>> 200K rows in under 15 secs. If I repeat the query, it takes less than 5
>>> secs. I am not sorting the rows.
>>> We can get the average size of a table row by querying the table's
>>> metadata in information_schema.tables; it doesn't even have to access the
>>> actual table.
>>> Please give us more time to provide more on benchmarking.
>>>
>>>
>>> Madhu Borkar
>>>
>>> On Sat, Jun 10, 2017 at 10:51 PM, Eugene Kirpichov <
>>> kirpic...@google.com.invalid> wrote:
>>>
>>> > To elaborate a bit on what JB said:
>>> >
>>> > Suppose the table has 1,000,000 rows, and suppose you split it into
>>> 1000
>>> > bundles, 1000 rows per bundle.
>>> >
>>> > Does Aurora provide an API that allows you to efficiently read the bundle
>>> > containing rows 999,000-1,000,000 without reading and throwing away the
>>> > first 999,000 rows?
>>> > Because if the answer to this is "no", then reading these 1000 bundles
>>> would
>>> > involve scanning a total of 1000+2000+...+1,000,000 = 500,500,000 rows
>>> > (and throwing away 499,500,000 of them) instead of 1,000,000.
>>> >
>>> > For a typical relational database, the answer is "no" - that's why
>>> JdbcIO
>>> > does not provide splitting. Instead it reads the rows sequentially, but
>>> > uses a reshuffle-like transform to make sure that they will be
>>> *processed*
>>> > in parallel by downstream transforms.
>>> >
>>> > There are also a couple more questions that make this proposal much
>>> harder
>>> > than it may seem at first sight:
>>> > - In order to make sure the bundles cover each row exactly once, you
>>> need
>>> > to scan the table in a particular fixed order - otherwise "rows number
>>> X
>>> > through Y" is meaningless - this adds the overhead of an ORDE

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Eugene Kirpichov
Most likely the identical performance you observed for the "limit" clause is
because you are not sorting the rows. Without sorting, a "limit" query is
meaningless: the database is technically allowed to return exactly the same
result for "limit 0, 10" and "limit 10, 20", because both of these queries
are in essence asking to give you 10 arbitrary rows from the table.
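
As a minimal sketch (assuming a hypothetical table "orders" with a unique
primary key "id"), the first pair of queries below may return overlapping or
missing rows, while the second pair partitions the result deterministically
as long as the table is not modified concurrently:

  -- Non-deterministic: the server may return any rows for each range.
  SELECT * FROM orders LIMIT 0, 10;
  SELECT * FROM orders LIMIT 10, 10;

  -- Deterministic on an unchanging table: order by a unique key first.
  SELECT * FROM orders ORDER BY id LIMIT 0, 10;
  SELECT * FROM orders ORDER BY id LIMIT 10, 10;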

On Tue, Jun 13, 2017 at 4:38 PM Eugene Kirpichov 
wrote:

> Thanks Madhusudan. Please note that in your case the time was likely
> dominated by shipping the rows over the network rather than by executing the
> query. Please make sure to include benchmarks where the query itself is
> expensive to evaluate (e.g. "select count(*) from query" takes time
> comparable to "select * from query"), and make sure that its results are
> large enough to be non-cacheable.
>
> Also let me know your thoughts on ensuring *semantic* correctness - i.e.
> the fact that "select * limit n, m" is non-deterministic, and requires an
> ORDER BY clause to provide any kind of determinism - and that determinism,
> too, is lost if the table may be concurrently modified.
> E.g. you have a table with values in column "id": 0, 10, 20, 30, 40, 50.
> You split it into 2 parts - "order by id limit 0, 3" and "order by id limit
> 3, 6".
> While you read the "limit 0, 3" part, the value 5 is added, and this part
> gets values 0, 10, 5.
> When you read the "limit 3, 6" part, the value 7 is added, and this part
> gets values 10, 20, 30.
> As a result, you get the value 10 twice, and you completely miss the
> values 7, 40 and 50.
>
> The only way to address this is if the database supports snapshot
> (point-in-time) reads. If Aurora supports this and if you plan to use them,
> then make sure that you use this feature in all of your benchmarks - it
> might be less optimized by Aurora internally.
>
> On Tue, Jun 13, 2017 at 4:25 PM Madhusudan Borkar 
> wrote:
>
>> Hi,
>> Appreciate your questions.
>> One thing I believe: even though AWS Aurora is based on MySQL, it is not
>> MySQL. AWS has developed this RDS database service from the ground up and
>> has improved or completely changed its implementation. That being said,
>> some of the things one may have experienced with a standard MySQL
>> implementation will not hold. They have changed the way query results are
>> cached, using a new technique. RDS has multiple read replicas which are
>> load balanced and separated from the writer. There are many more
>> improvements.
>> The query to read m rows from row n works like
>> select * from table limit n,m;
>> My research with Aurora shows that this doesn't read up to n-1 rows first
>> and then discard that data before picking up the next m rows. It looks
>> like it picks up from the nth row. My table with 2M rows returned data for
>> 200K rows in under 15 secs. If I repeat the query, it takes less than 5
>> secs. I am not sorting the rows.
>> We can get the average size of a table row by querying the table's
>> metadata in information_schema.tables; it doesn't even have to access the
>> actual table.
>> Please give us more time to provide more on benchmarking.
>>
>>
>> Madhu Borkar
>>
>> On Sat, Jun 10, 2017 at 10:51 PM, Eugene Kirpichov <
>> kirpic...@google.com.invalid> wrote:
>>
>> > To elaborate a bit on what JB said:
>> >
>> > Suppose the table has 1,000,000 rows, and suppose you split it into 1000
>> > bundles, 1000 rows per bundle.
>> >
>> > Does Aurora provide an API that allows you to efficiently read the bundle
>> > containing rows 999,000-1,000,000 without reading and throwing away the
>> > first 999,000 rows?
>> > Because if the answer to this is "no", then reading these 1000 bundles
>> would
>> > involve scanning a total of 1000+2000+...+1,000,000 = 500,500,000 rows
>> > (and throwing away 499,500,000 of them) instead of 1,000,000.
>> >
>> > For a typical relational database, the answer is "no" - that's why
>> JdbcIO
>> > does not provide splitting. Instead it reads the rows sequentially, but
>> > uses a reshuffle-like transform to make sure that they will be
>> *processed*
>> > in parallel by downstream transforms.
>> >
>> > There are also a couple more questions that make this proposal much harder
>> > than it may seem at first sight:
>> > - In order to make sure the bundles cover each row exactly once, you
>> need
>> > to scan the table in a particular fixed order - otherwise "rows number X
>> > through Y" is meaningless - this adds the overhead of an ORDER BY clause
>> > (though for a table with a primary key it's probably negligible).
>> > - If the table is changed, and rows are inserted and deleted while you
>> read
>> > it, then again "rows number X through Y" is a meaningless concept,
>> because
>> > what is "row number X" at one moment may be completely different at
>> another
>> > moment, and from reading 1000 bundles in parallel you might get
>> duplicate
>> > rows, lost rows, or both.
>> > - You mention the "size of a single row"

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Eugene Kirpichov
Thanks Madhusudan. Please note that in your case the time was likely
dominated by shipping the rows over the network rather than by executing the
query. Please make sure to include benchmarks where the query itself is
expensive to evaluate (e.g. "select count(*) from query" takes time
comparable to "select * from query"), and make sure that its results are
large enough to be non-cacheable.
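
As a hypothetical illustration of such a benchmark query on a MySQL-5.6-
compatible engine (table and column names are made up, and whether Aurora's
caching layer honors SQL_NO_CACHE is an assumption to verify):

  -- Expensive to evaluate: aggregates over the whole table instead of
  -- streaming rows back, and asks the engine to bypass the query cache.
  SELECT SQL_NO_CACHE category, COUNT(*), SUM(amount)
  FROM orders
  GROUP BY category;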

Also let me know your thoughts on ensuring *semantic* correctness - i.e.
the fact that "select * limit n, m" is non-deterministic, and requires an
ORDER BY clause to provide any kind of determinism - and that determinism,
too, is lost if the table may be concurrently modified.
E.g. you have a table with values in column "id": 0, 10, 20, 30, 40, 50.
You split it into 2 parts - "order by id limit 0, 3" and "order by id limit
3, 6".
While you read the "limit 0, 3" part, the value 5 is added, and this part
gets values 0, 10, 5.
When you read the "limit 3, 6" part, the value 7 is added, and this part
gets values 10, 20, 30.
As a result, you get the value 10 twice, and you completely miss the values
7, 40 and 50.
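
(One way to at least avoid duplicating or dropping pre-existing rows - a
sketch using this example's "id" values, offered as an assumption rather than
part of the original proposal - is to split on key ranges instead of row
offsets, so a bundle's membership does not depend on how many rows precede
it. Concurrently inserted rows may or may not be seen, but every pre-existing
row falls into exactly one range. It still does not give a consistent
point-in-time view of the table.)

  -- Bundle 1: keys in [0, 30)
  SELECT * FROM t WHERE id >= 0 AND id < 30 ORDER BY id;
  -- Bundle 2: keys in [30, 60)
  SELECT * FROM t WHERE id >= 30 AND id < 60 ORDER BY id;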

The only way to address this is if the database supports snapshot
(point-in-time) reads. If Aurora supports this and if you plan to use them,
then make sure that you use this feature in all of your benchmarks - it
might be less optimized by Aurora internally.
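
On a MySQL-5.6-compatible engine a per-connection snapshot read would
normally be expressed as below; whether Aurora preserves InnoDB's
consistent-read semantics, and at what cost, is exactly what the benchmarks
should check, so treat this sketch as an assumption rather than a statement
about Aurora. Note also that each connection gets its own snapshot, so
coordinating a single point in time across many parallel workers is an
additional problem.

  SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
  START TRANSACTION WITH CONSISTENT SNAPSHOT;
  -- All reads inside this transaction see one point in time.
  SELECT * FROM t ORDER BY id LIMIT 0, 3;
  COMMIT;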

On Tue, Jun 13, 2017 at 4:25 PM Madhusudan Borkar 
wrote:

> Hi,
> Appreciate your questions.
> One thing I believe: even though AWS Aurora is based on MySQL, it is not
> MySQL. AWS has developed this RDS database service from the ground up and
> has improved or completely changed its implementation. That being said,
> some of the things one may have experienced with a standard MySQL
> implementation will not hold. They have changed the way query results are
> cached, using a new technique. RDS has multiple read replicas which are
> load balanced and separated from the writer. There are many more
> improvements.
> The query to read m rows from row n works like
> select * from table limit n,m;
> My research with Aurora shows that this doesn't read up to n-1 rows first
> and then discard that data before picking up the next m rows. It looks
> like it picks up from the nth row. My table with 2M rows returned data for
> 200K rows in under 15 secs. If I repeat the query, it takes less than 5
> secs. I am not sorting the rows.
> We can get the average size of a table row by querying the table's
> metadata in information_schema.tables; it doesn't even have to access the
> actual table.
> Please give us more time to provide more on benchmarking.
>
>
> Madhu Borkar
>
> On Sat, Jun 10, 2017 at 10:51 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > To elaborate a bit on what JB said:
> >
> > Suppose the table has 1,000,000 rows, and suppose you split it into 1000
> > bundles, 1000 rows per bundle.
> >
> > Does Aurora provide an API that allows you to efficiently read the bundle
> > containing rows 999,000-1,000,000 without reading and throwing away the
> > first 999,000 rows?
> > Because if the answer to this is "no", then reading these 1000 bundles
> would
> > involve scanning a total of 1000+2000+...+1,000,000 = 500,500,000 rows
> > (and throwing away 499,500,000 of them) instead of 1,000,000.
> >
> > For a typical relational database, the answer is "no" - that's why JdbcIO
> > does not provide splitting. Instead it reads the rows sequentially, but
> > uses a reshuffle-like transform to make sure that they will be
> *processed*
> > in parallel by downstream transforms.
> >
> > There are also a couple more questions that make this proposal much harder
> > than it may seem at first sight:
> > - In order to make sure the bundles cover each row exactly once, you need
> > to scan the table in a particular fixed order - otherwise "rows number X
> > through Y" is meaningless - this adds the overhead of an ORDER BY clause
> > (though for a table with a primary key it's probably negligible).
> > - If the table is changed, and rows are inserted and deleted while you
> read
> > it, then again "rows number X through Y" is a meaningless concept,
> because
> > what is "row number X" at one moment may be completely different at
> another
> > moment, and from reading 1000 bundles in parallel you might get duplicate
> > rows, lost rows, or both.
> > - You mention the "size of a single row" - I suppose you're referring to
> > the arithmetic mean of the sizes of all rows in the database. Does Aurora
> > provide a way to efficiently query for that? (without reading the whole
> > database and computing the size of each row)
> >
> > On Sat, Jun 10, 2017 at 10:36 PM Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi,
> > >
> > > I created a Jira to add custom splitting to JdbcIO (but it's not so
> > > trivial, depending on the backend).
> > >
> > > Regarding your proposal it sounds interesting, but do you think we will
> > > have really "parallel" reads of the splits?

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Madhusudan Borkar
Hi,
Appreciate your questions.
One thing I believe: even though AWS Aurora is based on MySQL, it is not
MySQL. AWS has developed this RDS database service from the ground up and has
improved or completely changed its implementation. That being said, some of
the things one may have experienced with a standard MySQL implementation will
not hold. They have changed the way query results are cached, using a new
technique. RDS has multiple read replicas which are load balanced and
separated from the writer. There are many more improvements.
The query to read m rows from row n works like
select * from table limit n,m;
My research with Aurora shows that this doesn't read up to n-1 rows first
and then discard that data before picking up the next m rows. It looks like
it picks up from the nth row. My table with 2M rows returned data for 200K
rows in under 15 secs. If I repeat the query, it takes less than 5 secs. I am
not sorting the rows.
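
One way to verify whether the engine really skips the first n rows, rather
than scanning and discarding them quickly from a warm cache, is to look at
how many rows it examines instead of wall-clock time alone. A sketch against
a hypothetical 2M-row table "t", using standard MySQL tooling (how Aurora's
storage layer reports these counters is worth confirming):

  -- The 'rows' column estimates how many rows the engine expects to examine.
  EXPLAIN SELECT * FROM t LIMIT 1800000, 200000;

  -- Comparing these counters before and after the query shows how many rows
  -- were actually read, independent of caching effects on response time.
  SHOW SESSION STATUS LIKE 'Handler_read%';
  SELECT * FROM t LIMIT 1800000, 200000;
  SHOW SESSION STATUS LIKE 'Handler_read%';
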
We can get the average size of a table row by querying the table's metadata
in information_schema.tables; it doesn't even have to access the actual
table.
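
For reference, a sketch of that metadata query (schema and table names are
placeholders); AVG_ROW_LENGTH and TABLE_ROWS are estimates maintained by the
engine rather than exact values, which is usually good enough for sizing
splits:

  SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH
  FROM information_schema.TABLES
  WHERE TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'mytable';
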
Please give us more time to provide more on benchmarking.


Madhu Borkar

On Sat, Jun 10, 2017 at 10:51 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> To elaborate a bit on what JB said:
>
> Suppose the table has 1,000,000 rows, and suppose you split it into 1000
> bundles, 1000 rows per bundle.
>
> Does Aurora provide an API that allows you to efficiently read the bundle
> containing rows 999,000-1,000,000 without reading and throwing away the
> first 999,000 rows?
> Because if the answer to this is "no", then reading these 1000 bundles would
> involve scanning a total of 1000+2000+...+1,000,000 = 500,500,000 rows (and
> throwing away 499,500,000 of them) instead of 1,000,000.
>
> For a typical relational database, the answer is "no" - that's why JdbcIO
> does not provide splitting. Instead it reads the rows sequentially, but
> uses a reshuffle-like transform to make sure that they will be *processed*
> in parallel by downstream transforms.
>
> There are also a couple more questions that make this proposal much harder
> than it may seem at first sight:
> - In order to make sure the bundles cover each row exactly once, you need
> to scan the table in a particular fixed order - otherwise "rows number X
> through Y" is meaningless - this adds the overhead of an ORDER BY clause
> (though for a table with a primary key it's probably negligible).
> - If the table is changed, and rows are inserted and deleted while you read
> it, then again "rows number X through Y" is a meaningless concept, because
> what is "row number X" at one moment may be completely different at another
> moment, and from reading 1000 bundles in parallel you might get duplicate
> rows, lost rows, or both.
> - You mention the "size of a single row" - I suppose you're referring to
> the arithmetic mean of the sizes of all rows in the database. Does Aurora
> provide a way to efficiently query for that? (without reading the whole
> database and computing the size of each row)
>
> On Sat, Jun 10, 2017 at 10:36 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi,
> >
> > I created a Jira to add custom splitting to JdbcIO (but it's not so
> > trivial, depending on the backend).
> >
> > Regarding your proposal it sounds interesting, but do you think we will
> > really have "parallel" reads of the splits? I think splitting makes sense
> > if we can do parallel reads: if we split to read on a unique backend, it
> > doesn't bring a lot of improvement.
> >
> > Regards
> > JB
> >
> > On 06/10/2017 09:28 PM, Madhusudan Borkar wrote:
> > > Hi,
> > > We are proposing to develop a connector for AWS Aurora. Aurora, being a
> > > cluster for a relational database (MySQL), has no Java API for
> > > reading/writing other than the JDBC client. Although there is a JdbcIO
> > > available, it looks like it doesn't work in parallel. The proposal is to
> > > provide split functionality and then use a transform to parallelize the
> > > operation. As mentioned above, this is a typical SQL-based database and
> > > not comparable with the likes of Hive. The Hive implementation is based
> > > on an abstraction over Hadoop's HDFS file system, which provides splits.
> > > Here none of these are applicable.
> > > During implementation of the Hive connector there was a lot of discussion
> > > about how to implement a connector while strictly following Beam design
> > > principles using a bounded source. I am not sure how the Aurora connector
> > > will fit into these design principles.
> > > Here is our proposal.
> > > 1. Split functionality: If the table contains 'x' rows, it will be split
> > > into 'n' bundles in the split method. This would be done as follows:
> > > noOfSplits = 'x' * size of a single row / bundleSize hint from the runner.
> > > 2. Then each of these 'pseudo' splits would be read in parallel
> > > 3. Each of these reads will use a db connection from the connection pool.

Re: DataWorks Summit San Jose 2017

2017-06-13 Thread Davor Bonaci
A quick reminder that Beam talks are tomorrow. And, of course, stickers
will make an appearance!

If you'd like to chat about all-things-Beam, please stay in the room after
any of these sessions, or stop me if you see me around.

I hope to see many of you there!

On Mon, Jun 5, 2017 at 4:22 PM, Davor Bonaci  wrote:

> Apache Beam will be featured at the DataWorks Summit / Hadoop Summit next
> week in San Jose, CA [1].
>
> Scheduled events:
> * Realizing the promise of portable data processing with Apache Beam [2]
>   Time: Wednesday, 6/14, 11:30 am
>   Speaker: Davor Bonaci
>
> * Stateful processing of massive out-of-order streams with Apache Beam [3]
>   Time: Wednesday, 6/14, 3:00 pm
>   Speaker: Kenneth Knowles
>
> * Birds-of-a-feather: IoT, Streaming and Data Flow [4]
>   Time: Thursday, 6/15, 5:00 pm
>   Panel: Yolanda Davis, Davor Bonaci, P. Taylor Goetz, Sriharsha
> Chintalapani, Joseph Niemiec
>
> Everybody is welcome -- users and contributors alike! Feel free to use
> code ASF25 to get 25% off the standard All-Access Pass.
>
> I hope to see many of you there!
>
> Davor
>
> [1] https://dataworkssummit.com/
> [2] https://dataworkssummit.com/san-jose-2017/sessions/
> realizing-the-promise-of-portable-data-processing-with-apache-beam/
> [3] https://dataworkssummit.com/san-jose-2017/sessions/
> stateful-processing-of-massive-out-of-order-streams-with-apache-beam/
> [4] https://dataworkssummit.com/san-jose-2017/birds-of-a-
> feather/iot-streaming-and-data-flow/
>


Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Sourabh Bajaj
+1 for S3 being more of a FS

@Madhusudan can you point to some documentation on how to do row-range
queries in Aurora? From a quick scan it follows the MySQL 5.6 syntax, so you
will still need an ORDER BY for the IO to do exactly-once reads. I wanted to
learn more about how the questions raised by Eugene are handled.

Thanks
Sourabh

On Mon, Jun 12, 2017 at 9:32 PM Jean-Baptiste Onofré 
wrote:

> Hi,
>
> I think it's a mix of filesystem and IO. For S3, I see it more as a Beam
> filesystem than a pure IO.
>
> WDYT ?
>
> Regards
> JB
>
> On 06/13/2017 02:43 AM, tarush grover wrote:
> > Hi All,
> >
> > I think this can be added under java --> io --> aws-cloud-platform, where
> > more IO connectors, e.g. S3, can be added as well.
> >
> > Regards,
> > Tarush
> >
> > On Mon, Jun 12, 2017 at 4:03 AM, Madhusudan Borkar 
> > wrote:
> >
> >> Yes, I believe so. Thanks for the Jira.
> >>
> >> Madhu Borkar
> >>
> >> On Sat, Jun 10, 2017 at 10:36 PM, Jean-Baptiste Onofré  >
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> I created a Jira to add custom splitting to JdbcIO (but it's not so
> >>> trivial, depending on the backend).
> >>>
> >>> Regarding your proposal it sounds interesting, but do you think we will
> >>> have really "parallel" reads of the splits? I think splitting makes
> >>> sense if we can do parallel reads: if we split to read on a unique
> >>> backend, it doesn't bring a lot of improvement.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 06/10/2017 09:28 PM, Madhusudan Borkar wrote:
> >>>
>  Hi,
>  We are proposing to develop a connector for AWS Aurora. Aurora, being a
>  cluster for a relational database (MySQL), has no Java API for
>  reading/writing other than the JDBC client. Although there is a JdbcIO
>  available, it looks like it doesn't work in parallel. The proposal is to
>  provide split functionality and then use a transform to parallelize the
>  operation. As mentioned above, this is a typical SQL-based database and
>  not comparable with the likes of Hive. The Hive implementation is based on
>  an abstraction over Hadoop's HDFS file system, which provides splits. Here
>  none of these are applicable.
>  During implementation of the Hive connector there was a lot of discussion
>  about how to implement a connector while strictly following Beam design
>  principles using a bounded source. I am not sure how the Aurora connector
>  will fit into these design principles.
>  Here is our proposal.
>  1. Split functionality: If the table contains 'x' rows, it will be split
>  into 'n' bundles in the split method. This would be done as follows:
>  noOfSplits = 'x' * size of a single row / bundleSize hint from the runner.
>  2. Then each of these 'pseudo' splits would be read in parallel
>  3. Each of these reads will use a db connection from the connection pool.
>  This will provide better benchmarking. Please let us know your views.
> 
>  Thanks
>  Madhu Borkar
> 
> 
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>>
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: BeamSQL status and merge to master

2017-06-13 Thread Lukasz Cwik
Nevermind, I merged it into #2 about usability.

On Tue, Jun 13, 2017 at 8:50 AM, Lukasz Cwik  wrote:

> I added a section about maven module structure/packaging (#6).
>
> On Tue, Jun 13, 2017 at 8:30 AM, Tyler Akidau 
> wrote:
>
>> Thanks Mingmin. I've copied your list into a doc[1] to make it easier to
>> collaborate on comments and edits.
>>
>> [1] https://s.apache.org/beam-dsl-sql-burndown
>>
>> -Tyler
>>
>>
>> On Mon, Jun 12, 2017 at 10:09 PM Jean-Baptiste Onofré 
>> wrote:
>>
>> > Hi Mingmin
>> >
>> > Sorry, the meeting was in the middle of the night for me and I wasn't
>> able
>> > to
>> > make it.
>> >
>> > The timing and checklist look good to me.
>> >
>> > We plan to do a Beam release end of June, so, merging in July means we
>> can
>> > include it in the next release.
>> >
>> > Thanks !
>> > Regards
>> > JB
>> >
>> > On 06/13/2017 03:06 AM, Mingmin Xu wrote:
>> > > Hi all,
>> > >
>> > > Thanks for joining the meeting. As discussed, we're planning to merge
>> DSL_SQL
>> > > branch back to master, targeted in the middle of July. A tag
>> > > 'dsl_sql_merge'[1] is created to track all todo tasks.
>> > >
>> > > *What's added in Beam SQL?*
>> > > BeamSQL provides the capability to execute SQL queries with Beam Java
>> > SDK,
>> > > either by translating SQL to a PTransform, or run with a standalone
>> CLI
>> > > client.
>> > >
>> > > *Checklist for merge:*
>> > > 1. functionality
>> > >1.1. SQL grammar:
>> > >  1.1.1. basic query with SELECT/FILTER/PROJECT;
>> > >  1.1.2. AGGREGATION with global window;
>> > >  1.1.3. AGGREGATION with FIX_TIME/SLIDING_TIME/SESSION window;
>> > >  1.1.4. JOIN
>> > >1.2. UDF/UDAF support;
>> > >1.3. support predefined String/Math/Date functions, see[2];
>> > >
>> > > 2. DSL interface to convert SQL as PTransform;
>> > >
>> > > 3. junit test;
>> > >
>> > > 4. Java document;
>> > >
>> > > 5. Document of SQL feature in website;
>> > >
>> > > Any comments/suggestions are very welcome.
>> > >
>> > > Note:
>> > > [1].
>> > >
>> > https://issues.apache.org/jira/browse/BEAM-2436?jql=labels%
>> 20%3D%20dsl_sql_merge
>> > >
>> > > [2]. https://calcite.apache.org/docs/reference.html
>> > >
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>> >
>>
>
>


Re: BeamSQL status and merge to master

2017-06-13 Thread Lukasz Cwik
I added a section about maven module structure/packaging (#6).

On Tue, Jun 13, 2017 at 8:30 AM, Tyler Akidau 
wrote:

> Thanks Mingmin. I've copied your list into a doc[1] to make it easier to
> collaborate on comments and edits.
>
> [1] https://s.apache.org/beam-dsl-sql-burndown
>
> -Tyler
>
>
> On Mon, Jun 12, 2017 at 10:09 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Mingmin
> >
> > Sorry, the meeting was in the middle of the night for me and I wasn't
> able
> > to
> > make it.
> >
> > The timing and checklist look good to me.
> >
> > We plan to do a Beam release end of June, so, merging in July means we
> can
> > include it in the next release.
> >
> > Thanks !
> > Regards
> > JB
> >
> > On 06/13/2017 03:06 AM, Mingmin Xu wrote:
> > > Hi all,
> > >
> > > Thanks for joining the meeting. As discussed, we're planning to merge
> DSL_SQL
> > > branch back to master, targeted in the middle of July. A tag
> > > 'dsl_sql_merge'[1] is created to track all todo tasks.
> > >
> > > *What's added in Beam SQL?*
> > > BeamSQL provides the capability to execute SQL queries with Beam Java
> > SDK,
> > > either by translating SQL to a PTransform, or run with a standalone CLI
> > > client.
> > >
> > > *Checklist for merge:*
> > > 1. functionality
> > >1.1. SQL grammar:
> > >  1.1.1. basic query with SELECT/FILTER/PROJECT;
> > >  1.1.2. AGGREGATION with global window;
> > >  1.1.3. AGGREGATION with FIX_TIME/SLIDING_TIME/SESSION window;
> > >  1.1.4. JOIN
> > >1.2. UDF/UDAF support;
> > >1.3. support predefined String/Math/Date functions, see[2];
> > >
> > > 2. DSL interface to convert SQL as PTransform;
> > >
> > > 3. junit test;
> > >
> > > 4. Java document;
> > >
> > > 5. Document of SQL feature in website;
> > >
> > > Any comments/suggestions are very welcome.
> > >
> > > Note:
> > > [1].
> > >
> > https://issues.apache.org/jira/browse/BEAM-2436?jql=
> labels%20%3D%20dsl_sql_merge
> > >
> > > [2]. https://calcite.apache.org/docs/reference.html
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: BeamSQL status and merge to master

2017-06-13 Thread Tyler Akidau
Thanks Mingmin. I've copied your list into a doc[1] to make it easier to
collaborate on comments and edits.

[1] https://s.apache.org/beam-dsl-sql-burndown

-Tyler


On Mon, Jun 12, 2017 at 10:09 PM Jean-Baptiste Onofré 
wrote:

> Hi Mingmin
>
> Sorry, the meeting was in the middle of the night for me and I wasn't able
> to
> make it.
>
> The timing and checklist look good to me.
>
> We plan to do a Beam release end of June, so, merging in July means we can
> include it in the next release.
>
> Thanks !
> Regards
> JB
>
> On 06/13/2017 03:06 AM, Mingmin Xu wrote:
> > Hi all,
> >
> > Thanks for joining the meeting. As discussed, we're planning to merge DSL_SQL
> > branch back to master, targeted in the middle of July. A tag
> > 'dsl_sql_merge'[1] is created to track all todo tasks.
> >
> > *What's added in Beam SQL?*
> > BeamSQL provides the capability to execute SQL queries with Beam Java
> SDK,
> > either by translating SQL to a PTransform, or run with a standalone CLI
> > client.
> >
> > *Checklist for merge:*
> > 1. functionality
> >1.1. SQL grammar:
> >  1.1.1. basic query with SELECT/FILTER/PROJECT;
> >  1.1.2. AGGREGATION with global window;
> >  1.1.3. AGGREGATION with FIX_TIME/SLIDING_TIME/SESSION window;
> >  1.1.4. JOIN
> >1.2. UDF/UDAF support;
> >1.3. support predefined String/Math/Date functions, see[2];
> >
> > 2. DSL interface to convert SQL as PTransform;
> >
> > 3. junit test;
> >
> > 4. Java document;
> >
> > 5. Document of SQL feature in website;
> >
> > Any comments/suggestions are very welcome.
> >
> > Note:
> > [1].
> >
> https://issues.apache.org/jira/browse/BEAM-2436?jql=labels%20%3D%20dsl_sql_merge
> >
> > [2]. https://calcite.apache.org/docs/reference.html
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Report to the Board, June 2017 edition

2017-06-13 Thread Davor Bonaci
We are expected to submit a project report to the ASF Board of Directors
ahead of its next meeting. The report is due on Wednesday, 6/14.

If interested, please take a look at the draft [1], and comment or
contribute content, as appropriate. I'll submit the report sometime on
Wednesday.

Thanks!

Davor

[1]
https://docs.google.com/document/d/1tgJ_2WEInGa7Wg2RXSWZI7ot3bZ37ZuApalVAJOQscI/


Re: Beam Proposal: Pipeline Drain

2017-06-13 Thread Reuven Lax
Thanks Ismaël,

I think the SDK portions of the Drain proposal are completely runner
independent. Some parts of Drain (e.g. advancing watermarks) will have to
be done by the runners of course.

I'm working on the snapshot and update proposal. I hope to have time to
send it out soon!

Reuven

On Mon, Jun 12, 2017 at 11:49 PM, Ismaël Mejía  wrote:

> Hello Reuven,
>
> I finally took the time to read the Drain proposal, thanks a lot for
> bringing this. It looks like a nice fit with the current APIs, and it
> would be great if this could be implemented as much as possible in a
> Runner independent way.
>
> I am eager now to see the snapshot and update proposal.
> Thanks again,
> Ismaël
>
> On Tue, Jun 6, 2017 at 10:03 PM, Reuven Lax 
> wrote:
> >
> > I believe so, but it looks like the dispenser is an interface to the user
> > for such features. We still need to define what the semantics of features
> > like Drain are, and how they affect the pipeline.
> >
> > On Tue, Jun 6, 2017 at 12:06 PM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi Reuven,
> > >
> > > In the "Apache Beam: Technical vision" document (dating from the
> > > incubation) (https://docs.google.com/document/d/
> 1UyAeugHxZmVlQ5cEWo_eOPg
> > > XNQA1oD-rGooWOSwAqh8/edit?usp=sharing), I added a section named "Beam
> > > Pipelines Dispenser".
> > >
> > > The idea is to be able to bootstrap, run and control pipelines (and the
> > > runners).
> > >
> > > I think it's somehow related. WDYT ?
> > >
> > > Regards
> > > JB
> > >
> > > On 06/06/2017 07:43 PM, Reuven Lax wrote:
> > >
> > >> Hi all,
> > >>
> > >> Beam is a great programming model, but in order to really run pipelines
> > >> (especially streaming pipelines which are "always on") in a production
> > >> setting, there is a set of features necessary. Dataflow has a couple
> of
> > >> those features built in (Drain and Update), and inspired by those
> I'll be
> > >> sending out a few proposals for similar features in Beam.
> > >>
> > >> Please note that my intention here is _not_ to simply forklift the
> > >> Dataflow
> > >> features to Beam. The Dataflow features are being used as
> inspiration, and
> > >> we have two years of experience with how real users have used these
> > >> features (and have also seen when users found these features limited and
> > >> frustrating). In every case my Beam proposals are different -
> hopefully
> > >> better! - than the actual Dataflow feature that exists today.
> > >>
> > >> I think all runners would greatly benefit from production-control
> features
> > >> like this, and I would love to see community input. The first
> proposal is
> > >> for a way of draining a streaming pipeline before stopping it, and
> here it
> > >> is
> > >>  > >> GhDPmm3cllSN8IMmWci8/edit>
> > >> .
> > >>
> > >> Reuven
> > >>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
>