> I don't know of any benchmarks vs. HBase bulk loader. Would be interesting
> if you could come up with an apples-to-apples test.

I did some testing to get an apples-to-apples comparison between the two
options. For 10 million rows (primary key is a 3-column composite key, with
3 column qualifiers):

JDBC bulk-loading: 430 sec (after applying the PHOENIX-1711 patch)
Direct Phoenix encoding: 112 sec

The direct encoding path executes in about a quarter of the JDBC time, and
I think the difference is significant enough to provide APIs for direct
Phoenix encoding in the bulk-loader.

Thanks

On Thu, Mar 5, 2015 at 2:13 PM, Nick Dimiduk <[email protected]> wrote:

> I don't know of any benchmarks vs. HBase bulk loader. Would be interesting
> if you could come up with an apples-to-apples test.
>
> 100TB binary file cannot be partitioned at all? You're always bound to a
> single process. Bummer. I guess plan B could be pre-processing the binary
> file into something splittable. You'll cover the data twice, but if Phoenix
> encoding really is the current bottleneck, as your mail indicates, then
> separating the decoding of the binary file from the encoding of the Phoenix
> output should allow for parallelizing the second step and improve the state
> of things.
>
> Meantime, it would be good to look at perf improvements of the Phoenix
> encoding step. Any volunteers lurking about?
>
> -n
>
> On Thu, Mar 5, 2015 at 1:08 PM, Tulasi Paradarami
> <[email protected]> wrote:
>
> > Gabriel, Nick, thanks for your inputs. My comments below.
> >
> > > Although it may look as though data is being written over the wire to
> > > Phoenix, the execution of an upsert executor and retrieval of the
> > > uncommitted KeyValues is all local (in memory). The code is implemented
> > > in this way because JDBC is the general API used within Phoenix --
> > > there isn't a direct "convert fields to Phoenix encoding" API, although
> > > this is doing the equivalent operation.
> >
> > I understand that the data processing is in memory, but performance can
> > be improved if there is a direct conversion to Phoenix encoding.
> > Are there any performance comparison results between the Phoenix and
> > HBase bulk-loaders?
> >
> > > Could you give some more information on your performance numbers? For
> > > example, is this the throughput that you're getting in a single
> > > process, or over a number of processes? If so, how many processes?
> >
> > It's currently running as a single mapper processing a binary file
> > (un-splittable). Disk throughput doesn't look to be an issue here.
> > Production has machines of the same processing capability, but obviously
> > more nodes and input files.
> >
> > > Also, how many columns are in the records that you're loading?
> >
> > The row size is small: 3 integers for the PK, 2 short qualifiers, 1
> > varchar qualifier.
> >
> > > What is the current (projected) time required to load the data?
> >
> > About 20-25 days.
> >
> > > What is the minimum allowable ingest speed to be considered
> > > satisfactory?
> >
> > We would like to finish the load in less than 10-12 days.
> >
> > > You can make things go faster by increasing the number of mappers.
> >
> > The input file (binary) is not splittable; a mapper is tied to the
> > specific file.
> >
> > > What changes did you make to the map() method? Increased logging,
> > > performance enhancements, plugging in custom logic, something else?
> >
> > I added custom logic to the map() method.
> >
> > On Thu, Mar 5, 2015 at 7:53 AM, Nick Dimiduk <[email protected]> wrote:
> >
> > > Also: how large is your cluster? You can make things go faster by
> > > increasing the number of mappers. What changes did you make to the
> > > map() method? Increased logging, performance enhancements, plugging in
> > > custom logic, something else?
> > >
> > > On Thursday, March 5, 2015, Gabriel Reid <[email protected]> wrote:
> > >
> > > > Hi Tulasi,
> > > >
> > > > Answers (and questions) inlined below:
> > > >
> > > > On Thu, Mar 5, 2015 at 2:41 AM Tulasi Paradarami
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Here are the details of our environment:
> > > > > Phoenix 4.3
> > > > > HBase 0.98.6
> > > > >
> > > > > I'm loading data to a Phoenix table using the CSV bulk-loader
> > > > > (after making some changes to the map(...) method) and it is
> > > > > processing about 16,000 - 20,000 rows/sec. I noticed that the
> > > > > bulk-loader spends up to 40% of the execution time in the
> > > > > following steps.
> > > > >
> > > > > //...
> > > > > csvRecord = csvLineParser.parse(value.toString());
> > > > > csvUpsertExecutor.execute(ImmutableList.of(csvRecord));
> > > > > Iterator<Pair<byte[], List<KeyValue>>> uncommittedDataIterator =
> > > > >     PhoenixRuntime.getUncommittedDataIterator(conn, true);
> > > > > //...
> > > >
> > > > The non-code translation of those steps is:
> > > > 1. Parse the CSV record
> > > > 2. Convert the contents of the CSV record into KeyValues
> > > >
> > > > Although it may look as though data is being written over the wire
> > > > to Phoenix, the execution of an upsert executor and retrieval of the
> > > > uncommitted KeyValues is all local (in memory). The code is
> > > > implemented in this way because JDBC is the general API used within
> > > > Phoenix -- there isn't a direct "convert fields to Phoenix encoding"
> > > > API, although this is doing the equivalent operation.
> > > >
> > > > Could you give some more information on your performance numbers?
> > > > For example, is this the throughput that you're getting in a single
> > > > process, or over a number of processes? If so, how many processes?
> > > > Also, how many columns are in the records that you're loading?
> > > >
> > > > > We plan to load up to 100TB of data and the overall performance of
> > > > > the bulk-loader is not satisfactory.
> > > >
> > > > How many records are in that 100TB? What is the current (projected)
> > > > time required to load the data? What is the minimum allowable ingest
> > > > speed to be considered satisfactory?
> > > >
> > > > - Gabriel
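
For anyone comparing the two paths benchmarked at the top of this thread, a
minimal side-by-side sketch follows. It is not code from the PHOENIX-1711
patch: the table MY_TABLE, its columns, and both method names are invented
for illustration, and it assumes Phoenix 4.3's
org.apache.phoenix.schema.types classes and the default ("0") column family.
A real direct-encoding path would have to mirror everything the JDBC path
does (salt bytes, separators for variable-length key columns, the empty
key-value cell), which is exactly why the thread asks for a supported API.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.phoenix.schema.types.PInteger;
import org.apache.phoenix.schema.types.PSmallint;
import org.apache.phoenix.schema.types.PVarchar;
import org.apache.phoenix.util.ByteUtil;
import org.apache.phoenix.util.PhoenixRuntime;

public class EncodingPathsSketch {

    /** Path 1: the JDBC route the CSV bulk-loader uses today. The upsert
     *  runs against a local, uncommitted connection; the resulting
     *  KeyValues are pulled back out in memory, never over the wire. */
    static void jdbcEncode(Connection conn, int k1, int k2, int k3,
                           short q1, short q2, String q3) throws Exception {
        PreparedStatement stmt = conn.prepareStatement(
            "UPSERT INTO MY_TABLE (PK1, PK2, PK3, Q1, Q2, Q3) "
                + "VALUES (?, ?, ?, ?, ?, ?)");
        stmt.setInt(1, k1);
        stmt.setInt(2, k2);
        stmt.setInt(3, k3);
        stmt.setShort(4, q1);
        stmt.setShort(5, q2);
        stmt.setString(6, q3);
        stmt.execute();

        // Same call the bulk-loader's mapper makes: drain the uncommitted
        // mutations as KeyValues, then discard them with a rollback.
        Iterator<Pair<byte[], List<KeyValue>>> it =
            PhoenixRuntime.getUncommittedDataIterator(conn, true);
        while (it.hasNext()) {
            for (KeyValue kv : it.next().getSecond()) {
                // ... hand kv to the HFile output format ...
            }
        }
        conn.rollback();
    }

    /** Path 2: "direct Phoenix encoding" -- serialize values with the same
     *  PDataType codecs Phoenix uses and build KeyValues by hand, skipping
     *  the JDBC statement machinery entirely. */
    static void directEncode(int k1, int k2, int k3,
                             short q1, short q2, String q3) {
        // Composite row key: three fixed-width INTEGERs concatenated.
        // PInteger.toBytes() applies Phoenix's sign-bit flip so byte-wise
        // sort order matches numeric order.
        byte[] rowKey = ByteUtil.concat(
            PInteger.INSTANCE.toBytes(k1),
            PInteger.INSTANCE.toBytes(k2),
            PInteger.INSTANCE.toBytes(k3));

        byte[] family = Bytes.toBytes("0"); // Phoenix default column family
        long ts = System.currentTimeMillis();

        KeyValue kv1 = new KeyValue(rowKey, family, Bytes.toBytes("Q1"), ts,
            PSmallint.INSTANCE.toBytes(q1));
        KeyValue kv2 = new KeyValue(rowKey, family, Bytes.toBytes("Q2"), ts,
            PSmallint.INSTANCE.toBytes(q2));
        KeyValue kv3 = new KeyValue(rowKey, family, Bytes.toBytes("Q3"), ts,
            PVarchar.INSTANCE.toBytes(q3));
        // ... hand kv1..kv3 to the HFile output format ...
        // Caveat: a complete implementation would also emit Phoenix's empty
        // key-value cell per row, plus salt bytes and separators where the
        // table definition requires them.
    }
}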
