Hi,
Appreciate your questions.
One thing I believe: even though AWS Aurora is based on MySQL, it is not
MySQL. AWS built this database service from the ground up and has improved
or completely changed much of its implementation. That being said, some
things one may have experienced with the standard MySQL implementation
will not hold. They have changed the way query results are cached, using a
new technique. Aurora has multiple read replicas which are load balanced
and separated from the writer. There are many more improvements.
The query to read m rows starting at offset n works like
select * from table limit n,m;
My research with Aurora suggests that this doesn't read up to n-1 rows
first and then discard that data before picking up the next m rows. It
appears to pick up from the nth row. My table with 2M rows returned data
for 200K rows in under 15 secs. If I repeat the query, it takes less than
5 secs. I am not sorting the rows.
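To make the bundle-to-query mapping concrete, here is a small sketch (my
own illustration, not committed connector code; the table name is a
placeholder) of how a 0-based bundle index could be turned into a
LIMIT offset, count query:

```java
public class BundleQuery {
    // Build the "LIMIT offset, count" query for bundle k (0-based),
    // given a fixed bundle size. The table name is a placeholder.
    static String queryForBundle(String table, int bundleIndex, int bundleSize) {
        long offset = (long) bundleIndex * bundleSize;
        return "SELECT * FROM " + table + " LIMIT " + offset + ", " + bundleSize;
    }

    public static void main(String[] args) {
        // Bundle 3 of size 1000 starts at offset 3000.
        System.out.println(queryForBundle("my_table", 3, 1000));
    }
}
```

Note this only shows how the query string is derived; whether the engine
can serve a large offset without scanning the skipped rows is exactly the
question under discussion below.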
We can get the average size of a table row by querying the table's
metadata in information_schema.tables. It doesn't even have to access the
actual table.
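For reference, on MySQL-compatible engines the metadata query could look
like the following (my sketch; 'my_table' is a placeholder, and
AVG_ROW_LENGTH / TABLE_ROWS are estimates maintained by the engine, not
exact counts):

```sql
-- Estimated average row size and row count, without touching table data.
SELECT AVG_ROW_LENGTH, TABLE_ROWS
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'my_table';
```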
Please give us more time to provide more on benchmarking.


Madhu Borkar

On Sat, Jun 10, 2017 at 10:51 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> To elaborate a bit on what JB said:
>
> Suppose the table has 1,000,000 rows, and suppose you split it into 1000
> bundles, 1000 rows per bundle.
>
> Does Aurora provide an API that allows efficiently reading the bundle
> containing rows 999,001-1,000,000, without reading and throwing away the
> first 999,000 rows?
> Cause if the answer to this is "no", then reading these 1000 bundles would
> involve scanning a total of 1,000 + 2,000 + ... + 1,000,000 = 500,500,000
> rows (and throwing away 499,500,000 of them) instead of 1,000,000.
>
> For a typical relational database, the answer is "no" - that's why JdbcIO
> does not provide splitting. Instead it reads the rows sequentially, but
> uses a reshuffle-like transform to make sure that they will be *processed*
> in parallel by downstream transforms.
>
> There's also a couple more questions that make this proposal much harder
> than it may seem at first sight:
> - In order to make sure the bundles cover each row exactly once, you need
> to scan the table in a particular fixed order - otherwise "rows number X
> through Y" is meaningless - this adds the overhead of an ORDER BY clause
> (though for a table with a primary key it's probably negligible).
> - If the table is changed, and rows are inserted and deleted while you read
> it, then again "rows number X through Y" is a meaningless concept, because
> what is "row number X" at one moment may be completely different at another
> moment, and from reading 1000 bundles in parallel you might get duplicate
> rows, lost rows, or both.
> - You mention the "size of a single row" - I suppose you're referring to
> the arithmetic mean of the sizes of all rows in the database. Does Aurora
> provide a way to efficiently query for that? (without reading the whole
> database and computing the size of each row)
>
> On Sat, Jun 10, 2017 at 10:36 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Hi,
> >
> > I created a Jira to add custom splitting to JdbcIO (but it's not so
> > trivial depending on the backend).
> >
> > Regarding your proposal, it sounds interesting, but do you think we
> > will have really "parallel" reads of the splits? I think splitting
> > makes sense if we can do parallel reads: if we split reads on a single
> > backend, it doesn't bring a lot of improvement.
> >
> > Regards
> > JB
> >
> > On 06/10/2017 09:28 PM, Madhusudan Borkar wrote:
> > > Hi,
> > > We are proposing to develop a connector for AWS Aurora. Aurora, being
> > > a cluster for a relational database (MySQL), has no Java API for
> > > reading/writing other than a JDBC client. Although there is a JdbcIO
> > > available, it looks like it doesn't read in parallel. The proposal is
> > > to provide split functionality and then use a transform to parallelize
> > > the operation. As mentioned above, this is a typical SQL-based
> > > database and not comparable with the likes of Hive. The Hive
> > > implementation is based on an abstraction over the HDFS file system of
> > > Hadoop, which provides splits. Here none of these are applicable.
> > > During implementation of the Hive connector there was a lot of
> > > discussion about how to implement a connector while strictly following
> > > Beam design principles using BoundedSource. I am not sure how an
> > > Aurora connector will fit into these design principles.
> > > Here is our proposal.
> > > 1. Split functionality: If the table contains 'x' rows, it will be
> > > split into 'n' bundles in the split method. This would be done as
> > > follows:
> > > noOfSplits = 'x' * size of a single row / bundleSize hint from runner.
> > > 2. Then each of these 'pseudo' splits would be read in parallel.
> > > 3. Each of these reads will use a db connection from a connection
> > > pool.
> > > This will provide better benchmarking. Please let us know your views.
> > >
> > > Thanks
> > > Madhu Borkar
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
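As a sanity check on the scan-cost arithmetic in the quoted reply, a short
sketch (illustration only): reading bundle k via a LIMIT with offset
(k-1)*1000 forces a scan of k*1000 rows when the engine cannot skip the
offset efficiently.

```java
public class ScanCost {
    public static void main(String[] args) {
        long bundles = 1000, bundleSize = 1000;
        long totalRows = bundles * bundleSize; // 1,000,000
        long scanned = 0;
        // Bundle k is read with "LIMIT (k-1)*1000, 1000",
        // which scans k*1000 rows if the offset cannot be skipped.
        for (long k = 1; k <= bundles; k++) {
            scanned += k * bundleSize;
        }
        System.out.println("scanned = " + scanned);                  // 500500000
        System.out.println("discarded = " + (scanned - totalRows));  // 499500000
    }
}
```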
