> I don't know of any benchmarks vs. HBase bulk loader. Would be interesting
> if you could come up with an apples-to-apples test.

I did some testing to get an apples-to-apples comparison between the two
options. For 10 million rows (primary key is a 3-column composite key, with
3 column qualifiers):

JDBC bulk-loading: 430 sec (after applying the PHOENIX-1711 patch)
Direct Phoenix encoding: 112 sec

The direct encoding path executes in about a quarter of the JDBC time, and
I think the difference is significant enough to provide APIs for direct
Phoenix encoding in the bulk-loader.

Thanks

On Thu, Mar 5, 2015 at 2:13 PM, Nick Dimiduk <[email protected]> wrote:

> I don't know of any benchmarks vs. HBase bulk loader. Would be interesting
> if you could come up with an apples-to-apples test.
>
> 100TB binary file cannot be partitioned at all? You're always bound to a
> single process. Bummer. I guess plan B could be pre-processing the binary
> file into something splittable. You'll cover the data twice, but if Phoenix
> encoding really is the current bottleneck, as your mail indicates, then
> separating the decoding of the binary file from the encoding of the Phoenix
> output should allow for parallelizing the second step and improve the state
> of things.
>
> Meantime, it would be good to look at perf improvements of the Phoenix
> encoding step. Any volunteers lurking about?
>
> -n
>
> On Thu, Mar 5, 2015 at 1:08 PM, Tulasi Paradarami
> <[email protected]> wrote:
>
> > Gabriel, Nick, thanks for your inputs. My comments below.
> >
> > > Although it may look as though data is being written over the wire to
> > > Phoenix, the execution of an upsert executor and retrieval of the
> > > uncommitted KeyValues is all local (in memory). The code is implemented
> > > in this way because JDBC is the general API used within Phoenix --
> > > there isn't a direct "convert fields to Phoenix encoding" API, although
> > > this is doing the equivalent operation.
> >
> > I understand that the data processing is in memory, but performance can
> > be improved if there is a direct conversion to Phoenix encoding.
> > Are there any performance comparison results between the Phoenix and
> > HBase bulk-loaders?
> >
> > > Could you give some more information on your performance numbers? For
> > > example, is this the throughput that you're getting in a single
> > > process, or over a number of processes? If so, how many processes?
> >
> > It's currently running as a single mapper processing a binary file
> > (un-splittable). Disk throughput doesn't look to be an issue here.
> > Production has machines of the same processing capability, but obviously
> > more nodes and input files.
> >
> > > Also, how many columns are in the records that you're loading?
> >
> > The row size is small: 3 integers for the PK, 2 short qualifiers, 1
> > varchar qualifier.
> >
> > > What is the current (projected) time required to load the data?
> >
> > About 20-25 days.
> >
> > > What is the minimum allowable ingest speed to be considered
> > > satisfactory?
> >
> > We would like to finish the load in less than 10-12 days.
> >
> > > You can make things go faster by increasing the number of mappers.
> >
> > The input file (binary) is not splittable; a mapper is tied to the
> > specific file.
> >
> > > What changes did you make to the map() method? Increased logging,
> > > performance enhancements, plugging in custom logic, something else?
> >
> > I added custom logic to the map() method.
> >
> > On Thu, Mar 5, 2015 at 7:53 AM, Nick Dimiduk <[email protected]> wrote:
> >
> > > Also: how large is your cluster? You can make things go faster by
> > > increasing the number of mappers. What changes did you make to the
> > > map() method? Increased logging, performance enhancements, plugging in
> > > custom logic, something else?
> > >
> > > On Thursday, March 5, 2015, Gabriel Reid <[email protected]> wrote:
> > >
> > > > Hi Tulasi,
> > > >
> > > > Answers (and questions) inlined below:
> > > >
> > > > On Thu, Mar 5, 2015 at 2:41 AM Tulasi Paradarami
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Here are the details of our environment:
> > > > > Phoenix 4.3
> > > > > HBase 0.98.6
> > > > >
> > > > > I'm loading data to a Phoenix table using the CSV bulk-loader
> > > > > (after making some changes to the map(...) method) and it is
> > > > > processing about 16,000 - 20,000 rows/sec. I noticed that the
> > > > > bulk-loader spends up to 40% of the execution time in the
> > > > > following steps.
> > > > >
> > > > > //...
> > > > > csvRecord = csvLineParser.parse(value.toString());
> > > > > csvUpsertExecutor.execute(ImmutableList.of(csvRecord));
> > > > > Iterator<Pair<byte[], List<KeyValue>>> uncommittedDataIterator =
> > > > >     PhoenixRuntime.getUncommittedDataIterator(conn, true);
> > > > > //...
> > > >
> > > > The non-code translation of those steps is:
> > > > 1. Parse the CSV record
> > > > 2. Convert the contents of the CSV record into KeyValues
> > > >
> > > > Although it may look as though data is being written over the wire
> > > > to Phoenix, the execution of an upsert executor and retrieval of the
> > > > uncommitted KeyValues is all local (in memory). The code is
> > > > implemented in this way because JDBC is the general API used within
> > > > Phoenix -- there isn't a direct "convert fields to Phoenix encoding"
> > > > API, although this is doing the equivalent operation.
> > > >
> > > > Could you give some more information on your performance numbers?
> > > > For example, is this the throughput that you're getting in a single
> > > > process, or over a number of processes? If so, how many processes?
> > > > Also, how many columns are in the records that you're loading?
> > > >
> > > > > We plan to load up to 100TB of data and the overall performance of
> > > > > the bulk-loader is not satisfactory.
> > > >
> > > > How many records are in that 100TB? What is the current (projected)
> > > > time required to load the data? What is the minimum allowable ingest
> > > > speed to be considered satisfactory?
> > > >
> > > > - Gabriel
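
For anyone comparing the two paths benchmarked at the top of this thread, a
minimal side-by-side sketch follows. It is not code from the PHOENIX-1711
patch: the table MY_TABLE, its columns, and both method names are invented
for illustration, and it assumes Phoenix 4.3's
org.apache.phoenix.schema.types classes and the default ("0") column family.
A real direct-encoding path would have to mirror everything the JDBC path
does (salt bytes, separators for variable-length key columns, the empty
key-value cell), which is exactly why the thread asks for a supported API.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.phoenix.schema.types.PInteger;
import org.apache.phoenix.schema.types.PSmallint;
import org.apache.phoenix.schema.types.PVarchar;
import org.apache.phoenix.util.ByteUtil;
import org.apache.phoenix.util.PhoenixRuntime;

public class EncodingPathsSketch {

    /** Path 1: the JDBC route the CSV bulk-loader uses today. The upsert
     *  runs against a local, uncommitted connection; the resulting
     *  KeyValues are pulled back out in memory, never over the wire. */
    static void jdbcEncode(Connection conn, int k1, int k2, int k3,
                           short q1, short q2, String q3) throws Exception {
        PreparedStatement stmt = conn.prepareStatement(
            "UPSERT INTO MY_TABLE (PK1, PK2, PK3, Q1, Q2, Q3) "
                + "VALUES (?, ?, ?, ?, ?, ?)");
        stmt.setInt(1, k1);
        stmt.setInt(2, k2);
        stmt.setInt(3, k3);
        stmt.setShort(4, q1);
        stmt.setShort(5, q2);
        stmt.setString(6, q3);
        stmt.execute();

        // Same call the bulk-loader's mapper makes: drain the uncommitted
        // mutations as KeyValues, then discard them with a rollback.
        Iterator<Pair<byte[], List<KeyValue>>> it =
            PhoenixRuntime.getUncommittedDataIterator(conn, true);
        while (it.hasNext()) {
            for (KeyValue kv : it.next().getSecond()) {
                // ... hand kv to the HFile output format ...
            }
        }
        conn.rollback();
    }

    /** Path 2: "direct Phoenix encoding" -- serialize values with the same
     *  PDataType codecs Phoenix uses and build KeyValues by hand, skipping
     *  the JDBC statement machinery entirely. */
    static void directEncode(int k1, int k2, int k3,
                             short q1, short q2, String q3) {
        // Composite row key: three fixed-width INTEGERs concatenated.
        // PInteger.toBytes() applies Phoenix's sign-bit flip so byte-wise
        // sort order matches numeric order.
        byte[] rowKey = ByteUtil.concat(
            PInteger.INSTANCE.toBytes(k1),
            PInteger.INSTANCE.toBytes(k2),
            PInteger.INSTANCE.toBytes(k3));

        byte[] family = Bytes.toBytes("0"); // Phoenix default column family
        long ts = System.currentTimeMillis();

        KeyValue kv1 = new KeyValue(rowKey, family, Bytes.toBytes("Q1"), ts,
            PSmallint.INSTANCE.toBytes(q1));
        KeyValue kv2 = new KeyValue(rowKey, family, Bytes.toBytes("Q2"), ts,
            PSmallint.INSTANCE.toBytes(q2));
        KeyValue kv3 = new KeyValue(rowKey, family, Bytes.toBytes("Q3"), ts,
            PVarchar.INSTANCE.toBytes(q3));
        // ... hand kv1..kv3 to the HFile output format ...
        // Caveat: a complete implementation would also emit Phoenix's empty
        // key-value cell per row, plus salt bytes and separators where the
        // table definition requires them.
    }
}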
