To elaborate a bit on what JB said:

Suppose the table has 1,000,000 rows, and suppose you split it into 1000
bundles, 1000 rows per bundle.

Does Aurora provide an API that lets you efficiently read the bundle
containing rows 999,001-1,000,000, without reading and throwing away the
first 999,000 rows?
Because if the answer to this is "no", then reading these 1000 bundles would
involve scanning a total of 1000 + 2000 + ... + 1,000,000 = 1000 * (1000 *
1001 / 2) = 500,500,000 rows (and throwing away 499,500,000 of them) instead
of 1,000,000.

For a typical relational database, the answer is "no" - that's why JdbcIO
does not provide splitting. Instead, it reads the rows sequentially but
uses a reshuffle-like transform to make sure they are *processed* in
parallel by downstream transforms.
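To make that pattern concrete, here is a rough sketch (the endpoint, table,
columns, and credentials are placeholders, not anything Aurora-specific).
JdbcIO already applies a reshuffle of this kind internally; the explicit
Reshuffle below is only there to illustrate the idea:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class SequentialReadParallelProcess {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // The read itself is sequential: one worker runs one query and
        // iterates over the ResultSet.
        PCollection<KV<String, String>> rows =
            p.apply("ReadFromAurora",
                JdbcIO.<KV<String, String>>read()
                    .withDataSourceConfiguration(
                        JdbcIO.DataSourceConfiguration.create(
                                "com.mysql.jdbc.Driver",
                                "jdbc:mysql://my-aurora-endpoint:3306/mydb")
                            .withUsername("user")
                            .withPassword("secret"))
                    .withQuery("SELECT id, payload FROM my_table")
                    .withRowMapper((JdbcIO.RowMapper<KV<String, String>>)
                        rs -> KV.of(rs.getString("id"), rs.getString("payload")))
                    .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));

        // Redistributing the rows breaks fusion with the read, so the
        // downstream transform runs in parallel across workers.
        rows.apply("Redistribute", Reshuffle.of())
            .apply("ProcessInParallel",
                MapElements.into(TypeDescriptors.strings())
                    .via(kv -> kv.getKey() + ":" + kv.getValue()));

        p.run().waitUntilFinish();
      }
    }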

There are also a couple more questions that make this proposal much harder
than it may seem at first sight:
- To make sure the bundles cover each row exactly once, you need to scan
the table in a particular fixed order - otherwise "rows number X through Y"
is meaningless. This adds the overhead of an ORDER BY clause (though for a
table with a primary key it's probably negligible); see the range-read
sketch after this list.
- If the table changes - rows are inserted or deleted while you read it -
then "rows number X through Y" is again a meaningless concept: what is "row
number X" at one moment may be a completely different row at another, so
reading 1000 bundles in parallel can give you duplicate rows, lost rows, or
both.
- You mention the "size of a single row" - I suppose you're referring to
the arithmetic mean of the sizes of all rows in the table. Does Aurora
provide a way to efficiently query for that, without reading the whole
table and computing the size of each row? (One possible approximation is
sketched after this list.)
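On the ordering point: one way to sidestep OFFSET entirely is to define a
bundle as a half-open range over an indexed primary key instead of "rows X
through Y". Here is a minimal JDBC sketch of reading one such bundle; the
endpoint, table, and columns are made up, it assumes a numeric primary key
`id`, and it only avoids the rescanning problem - it does not solve the
concurrent-modification problem above, and key ranges only approximate row
counts when the key space is reasonably dense:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RangeReadSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical Aurora MySQL endpoint, table, and columns.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://my-aurora-endpoint:3306/mydb", "user", "secret");
            // A half-open range on the indexed primary key lets the database
            // seek directly to the bundle instead of scanning and discarding
            // everything before it (as LIMIT ... OFFSET ... would).
            PreparedStatement stmt = conn.prepareStatement(
                "SELECT id, payload FROM my_table "
                    + "WHERE id >= ? AND id < ? ORDER BY id")) {
          stmt.setLong(1, 999_000L);   // lower bound of this bundle (inclusive)
          stmt.setLong(2, 1_000_000L); // upper bound of this bundle (exclusive)
          try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
              // Process one row of the bundle.
              System.out.println(rs.getLong("id") + ": " + rs.getString("payload"));
            }
          }
        }
      }
    }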
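And on the row-size point, a sketch of one possible approximation: MySQL
(and therefore Aurora MySQL) exposes per-table statistics in
information_schema, where AVG_ROW_LENGTH and TABLE_ROWS are estimates
maintained by the engine, not exact values. Schema and table names below
are placeholders, and the bundle-size figure is arbitrary:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RowSizeEstimateSketch {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://my-aurora-endpoint:3306/mydb", "user", "secret");
            PreparedStatement stmt = conn.prepareStatement(
                "SELECT AVG_ROW_LENGTH, TABLE_ROWS FROM information_schema.TABLES "
                    + "WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ?")) {
          stmt.setString(1, "mydb");
          stmt.setString(2, "my_table");
          try (ResultSet rs = stmt.executeQuery()) {
            if (rs.next()) {
              long avgRowLength = rs.getLong("AVG_ROW_LENGTH"); // estimated bytes per row
              long approxRows = rs.getLong("TABLE_ROWS");       // estimated row count
              long bundleSizeBytes = 64L * 1024 * 1024;         // hypothetical bundle size hint
              long numSplits =
                  Math.max(1, approxRows * avgRowLength / bundleSizeBytes);
              System.out.println("~" + avgRowLength + " bytes/row, ~" + approxRows
                  + " rows, ~" + numSplits + " splits");
            }
          }
        }
      }
    }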

On Sat, Jun 10, 2017 at 10:36 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi,
>
> I created a Jira to add custom splitting to JdbcIO (though it's not so
> trivial, depending on the backend).
>
> Your proposal sounds interesting, but do you think we will really get
> parallel reads of the splits? I think splitting makes sense only if we can
> read in parallel: if we split but still read from a single backend, it
> doesn't bring much improvement.
>
> Regards
> JB
>
> On 06/10/2017 09:28 PM, Madhusudan Borkar wrote:
> > Hi,
> > We are proposing to develop a connector for AWS Aurora. Aurora, being a
> > cluster for a relational database (MySQL), has no Java API for
> > reading/writing other than the JDBC client. Although there is a JdbcIO
> > available, it looks like it doesn't read in parallel. The proposal is to
> > provide split functionality and then use a transform to parallelize the
> > operation. As mentioned above, this is a typical SQL-based database, not
> > comparable with the likes of Hive. The Hive implementation is based on an
> > abstraction over Hadoop's HDFS file system, which provides splits. None
> > of that is applicable here.
> > During implementation of the Hive connector there was a lot of discussion
> > about how to implement a connector while strictly following Beam design
> > principles using a bounded source. I am not sure how an Aurora connector
> > will fit into these design principles.
> > Here is our proposal:
> > 1. Split functionality: if the table contains 'x' rows, it will be split
> > into 'n' bundles in the split method, computed as
> > noOfSplits = 'x' * size of a single row / bundleSize hint from runner.
> > 2. Each of these 'pseudo' splits would then be read in parallel.
> > 3. Each of these reads will use a DB connection from a connection pool.
> > This should provide better benchmarking. Please let us know your views.
> >
> > Thanks
> > Madhu Borkar
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
