Re: [PROPOSAL] for AWS Aurora relational database connector

Eugene Kirpichov Tue, 13 Jun 2017 16:39:41 -0700

Thanks Madhusudan. Please note that in your case, likely, the time was
dominated by shipping the rows over the network, rather than executing the
query. Please make sure to include benchmarks where the query itself is
expensive to evaluate (e.g. "select count(*) from query" takes time
comparable to "select * from query"), and make sure that its results are
large enough to be non-cacheable.


Also let me know your thoughts on ensuring *semantic* correctness - i.e.
the fact that "select * limit n, m" is non-deterministic, and requires an
ORDER BY clause to provide any kind of determinism - and that determinism,
too, is lost, if the table may be concurrently modified.
E.g. you have a table with values in column "id": 0, 10, 20, 30, 40, 50.
You split it into 2 parts - "order by id limit 0, 3" and "order by id limit
3, 6".
While you read the "limit 0, 3" part, the value 5 is added, and this part
gets values 0, 10, 5.
When you read the "limit 3, 6" part, the value 7 is added, and this part
gets values 10, 20, 30.
As a result, you get the value 10 twice, and you completely miss the values
7, 40 and 50.

The only way to address this is if the database supports snapshot
(point-in-time) reads. If Aurora supports this and if you plan to use them,
then make sure that you use this feature in all of your benchmarks - it
might be less optimized by Aurora internally.

On Tue, Jun 13, 2017 at 4:25 PM Madhusudan Borkar <[email protected]>
wrote:

> Hi,
> Appreciate your questions.
> One thing I believe, AWS Aurora even though it is based on MySQL, it is no
> MySQL. The reason being, AWS has developed this database service RDS ground
> up and has improved or completely changed its implementation. That being
> said some of things that one may have experienced with MySQL standard
> implementation will not hold. They have changed the way query results are
> cached, new technique. RDS has multiple read replicas which are load
> balanced and separated from writer. There are many more improvements.
> The query to read m rows from row n works like
> select * from table limit n,m;
> My research with Aurora shows that this doesn't read up-to n-1 rows first
> and then discard that data before picking up next m rows. It looks like
> that it picks up from nth row. My table with 2M rows has returned data for
> 200K rows under 15 secs. If I repeat the query then it will only take less
> than 5 secs. I am not sorting the rows.
> We can get avg size of the table row by querying the metadata for the table
> from its metadata table information_schema.tables. It doesn't have to even
> access the actual table.
> Please, give us more time to provide more on bench marking.
>
>
> Madhu Borkar
>
> On Sat, Jun 10, 2017 at 10:51 PM, Eugene Kirpichov <
> [email protected]> wrote:
>
> > To elaborate a bit on what JB said:
> >
> > Suppose the table has 1,000,000 rows, and suppose you split it into 1000
> > bundles, 1000 rows per bundle.
> >
> > Does Aurora provide an API that allows to efficiently read the bundle
> > containing rows 999,000-1,000,000, that does not involve reading and
> > throwing away the first 999,000 rows?
> > Cause if the answer to this is "no", then reading these 1000 bundles
> would
> > involve scanning a total of 1000+2000+...+1,000,000 = 499,500,000 rows
> (and
> > throwing away 498,500,000 of them) instead of 1,000,000.
> >
> > For a typical relational database, the answer is "no" - that's why JdbcIO
> > does not provide splitting. Instead it reads the rows sequentially, but
> > uses a reshuffle-like transform to make sure that they will be
> *processed*
> > in parallel by downstream transforms.
> >
> > There's also a couple more questions that make this proposal much harder
> > than it may seem at first sight:
> > - In order to make sure the bundles cover each row exactly once, you need
> > to scan the table in a particular fixed order - otherwise "rows number X
> > through Y" is meaningless - this adds the overhead of an ORDER BY clause
> > (though for a table with a primary key it's probably negligible).
> > - If the table is changed, and rows are inserted and deleted while you
> read
> > it, then again "rows number X through Y" is a meaningless concept,
> because
> > what is "row number X" at one moment may be completely different at
> another
> > moment, and from reading 1000 bundles in parallel you might get duplicate
> > rows, lost rows, or both.
> > - You mention the "size of a single row" - I suppose you're referring to
> > the arithmetic mean of the sizes of all rows in the database. Does Aurora
> > provide a way to efficiently query for that? (without reading the whole
> > database and computing the size of each row)
> >
> > On Sat, Jun 10, 2017 at 10:36 PM Jean-Baptiste Onofré <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I created a Jira to add custom splitting to JdbcIO (but it's not so
> > trivial
> > > depending of the backends.
> > >
> > > Regarding your proposal it sounds interesting, but do you think we will
> > > have
> > > really "parallel" read of the split ? I think splitting makes sense if
> we
> > > can do
> > > parallel read: if we split to read on an unique backend, it doesn't
> bring
> > > lot of
> > > improvement.
> > >
> > > Regards
> > > JB
> > >
> > > On 06/10/2017 09:28 PM, Madhusudan Borkar wrote:
> > > > Hi,
> > > > We are proposing to develop connector for AWS Aurora. Aurora being
> > > cluster
> > > > for relational database (MySQL) has no Java api for reading/writing
> > other
> > > > than jdbc client. Although there is a JdbcIO available, it looks like
> > it
> > > > doesn't work in parallel. The proposal is to provide split
> > functionality
> > > > and then use transform to parallelize the operation. As mentioned
> > above,
> > > > this is typical sql based database and not comparable with likes of
> > Hive.
> > > > Hive implementation is based on abstraction over Hdfs file system of
> > > > Hadoop, which provides splits. Here none of these are applicable.
> > > > During implementation of Hive connector there was lot of discussion
> as
> > > how
> > > > to implement connector while strictly following Beam design principal
> > > using
> > > > Bounded source. I am not sure how Aurora connector will fit into
> these
> > > > design principals.
> > > > Here is our proposal.
> > > > 1. Split functionality: If the table contains 'x' rows, it will be
> > split
> > > > into 'n' bundles in the split method. This would be done like
> follows :
> > > > noOfSplits = 'x' * size of a single row / bundleSize hint from
> runner.
> > > > 2. Then each of these 'pseudo' splits would be read in parallel
> > > > 3. Each of these reads will use db connection from connection pool.
> > > > This will provide better bench marking. Please, let know your views.
> > > >
> > > > Thanks
> > > > Madhu Borkar
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > [email protected]
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>

Re: [PROPOSAL] for AWS Aurora relational database connector

Reply via email to