That sounds about right for loading CSV directly on a 5-8 node cluster. As Gabriel/James mentioned in another thread, CSVBulkLoadTool with a pre-split table might offer significantly better performance for large datasets.
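For reference, the shape of that approach is roughly the following. This is only a sketch: the ZooKeeper quorum, client jar name, input path, and salt bucket count are placeholders, and the CsvBulkLoadTool options are worth re-checking against https://phoenix.apache.org/bulk_dataload.html before relying on them.

# Pre-split the table up front, e.g. by salting it (bucket count here is a guess; tune per cluster).
cat > lineitem.sql <<'EOF'
CREATE TABLE IF NOT EXISTS LINEITEM_MULTICF (
  L_ORDERKEY INTEGER NOT NULL, A.L_PARTKEY INTEGER, B.L_SUPPKEY INTEGER,
  L_LINENUMBER INTEGER NOT NULL, B.L_QUANTITY DECIMAL(15,2),
  B.L_EXTENDEDPRICE DECIMAL(15,2), C.L_DISCOUNT DECIMAL(15,2),
  C.L_TAX DECIMAL(15,2), C.L_RETURNFLAG CHAR(1), C.L_LINESTATUS CHAR(1),
  C.L_SHIPDATE DATE, C.L_COMMITDATE DATE, C.L_RECEIPTDATE DATE,
  C.L_SHIPINSTRUCT CHAR(25), C.L_SHIPMODE CHAR(10), C.L_COMMENT VARCHAR(44)
  CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)
) SALT_BUCKETS = 16;
EOF
./bin/psql.py zk-host:2181 lineitem.sql

# MapReduce bulk load: builds HFiles and hands them to HBase instead of going
# through the region server write path, so sequential keys hurt much less.
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table LINEITEM_MULTICF \
  --input /data/tpch/lineitem \
  --zookeeper zk-host:2181 \
  --delimiter '|'

An alternative to salting is explicit SPLIT ON (...) points in the DDL if you want to keep the natural key order.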
On Tue, Aug 23, 2016 at 2:17 PM, John Leach <jle...@splicemachine.com> wrote:

> So to load a TB of data would take around 2 days? Does that seem right to
> you?
>
> Regards,
> John
>
> On Aug 23, 2016, at 3:07 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>
> > Since the 600M rows are split across 100 files, 5 separate psql scripts
> > running in parallel on a single machine loaded data from files 1-20,
> > 21-40, 41-60, 61-80, 81-100. Performance gets affected because keys are
> > in sequence within these files, which leads to hot-spotting of a region
> > server; for this you should also try out a Salted
> > <https://phoenix.apache.org/salted.html> Phoenix table to get the best
> > write performance. A higher batch size might offer better performance too.
> >
> > I've also just started a data load from a single psql client, since
> > loading from a single machine can get bottlenecked on network I/O. I'll
> > update with its result tomorrow; I don't think its time will be too far
> > off from loading with multiple clients.
> >
> > On Tue, Aug 23, 2016 at 12:49 PM, John Leach <jle...@splicemachine.com> wrote:
> >
> >> Mujtaba,
> >>
> >> Not following the import process.
> >>
> >> Do the 5 parallel psql clients mean that you manually split the data
> >> into 5 buckets/files/directories and then run 5 import scripts
> >> simultaneously?
> >>
> >> If we wanted to benchmark import performance, what would be the right
> >> model for that?
> >>
> >> Thanks, this is very helpful...
> >>
> >> Regards,
> >> John
> >>
> >>> On Aug 23, 2016, at 2:28 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
> >>>
> >>> FYI re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8 node cluster with
> >>> 64G total/12G HBase heap.
> >>>
> >>> *Data Load*
> >>> * 5.5 hours for 600M rows
> >>> * Method: direct CSV load using the psql.py script
> >>> * # client machines: 1
> >>> * Batch size: 1K
> >>> * Key order: *Sequential*
> >>> * 5 parallel psql clients
> >>> * No missing rows after data load
> >>>
> >>> *Schema*
> >>> CREATE TABLE LINEITEM_MULTICF (L_ORDERKEY INTEGER not null, A.L_PARTKEY
> >>> INTEGER, B.L_SUPPKEY INTEGER, L_LINENUMBER INTEGER not null,
> >>> B.L_QUANTITY DECIMAL(15,2), B.L_EXTENDEDPRICE DECIMAL(15,2), C.L_DISCOUNT
> >>> DECIMAL(15,2), C.L_TAX DECIMAL(15,2), C.L_RETURNFLAG CHAR(1),
> >>> C.L_LINESTATUS CHAR(1), C.L_SHIPDATE DATE, C.L_COMMITDATE DATE,
> >>> C.L_RECEIPTDATE DATE, C.L_SHIPINSTRUCT CHAR(25), C.L_SHIPMODE CHAR(10),
> >>> C.L_COMMENT VARCHAR(44) constraint pk primary key(l_orderkey,
> >>> l_linenumber));
> >>>
> >>> * 100M guidepost width
> >>> * ~42GB data
> >>> * Uncompressed/default fast-diff encoded
> >>>
> >>> *Performance with default 128 threadpool*
> >>> select count(*) from lineitem_multicf;
> >>> +------------+
> >>> |  COUNT(1)  |
> >>> +------------+
> >>> | 600037902  |
> >>> +------------+
> >>> 1 row selected (*24.1 seconds*)
> >>>
> >>> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
> >>> sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 -
> >>> l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) *
> >>> (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty,
> >>> avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*)
> >>> as count_order from lineitem where l_shipdate <= current_date() - 90
> >>> group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;
> >>> 4 rows selected (*185.2 seconds*)
> >>>
> >>> *Data*
> >>> do curl -O http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz ; done
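To make that load step reproducible end to end, it could be scripted along these lines. A sketch only: the ZooKeeper quorum is a placeholder, the rename relies on psql.py treating *.csv files as data, and the exact form of its delimiter flag is an assumption worth checking against bin/psql.py's usage output first.

ZK=zk-host:2181
for i in $(seq 1 100); do
  curl -sO http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz
  gunzip -f lineitem.tbl.$i.gz
  sed -i 's/|$//' lineitem.tbl.$i            # TPC-H rows carry a trailing '|'; drop it
  mv lineitem.tbl.$i lineitem.$i.csv         # psql.py loads files with a .csv extension
  ./bin/psql.py -t LINEITEM_MULTICF -d '|' "$ZK" lineitem.$i.csv   # -d: field delimiter (assumed flag form)
done

The 5-client variant used in the thread is just five copies of this loop over disjoint ranges of $i.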
> >>> On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com> wrote:
> >>>
> >>>> It looks like you guys already have most of the TPC-H queries running
> >>>> based on Enis's talk in Ireland this year. Very cool.
> >>>>
> >>>> (Slide 20: Phoenix can execute most of the TPC-H queries!)
> >>>>
> >>>> Regards,
> >>>> John Leach
> >>>>
> >>>>> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> >>>>>
> >>>>> It's TPC-DS, not -H, but this is what I was using way back when to run
> >>>>> perf tests over Phoenix and the query server while I was developing on
> >>>>> it. The first project generates and loads the data via MapReduce, and
> >>>>> the second tool wraps up use of jmeter to run queries in parallel.
> >>>>>
> >>>>> https://github.com/ndimiduk/tpcds-gen
> >>>>> https://github.com/ndimiduk/phoenix-performance
> >>>>>
> >>>>> Probably there's dust and bit-rot to brush off of both projects, but
> >>>>> maybe it'll help someone looking for a starting point?
> >>>>>
> >>>>> Apologies, but I haven't had time to see what the speakers have shared
> >>>>> about their setup.
> >>>>>
> >>>>> -n
> >>>>>
> >>>>> On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org> wrote:
> >>>>>
> >>>>>> > Maybe there's such a test harness that already exists for TPC?
> >>>>>>
> >>>>>> TPC provides tooling but it's all proprietary. The generated data can
> >>>>>> be kept separately (Druid does it at least -
> >>>>>> http://druid.io/blog/2014/03/17/benchmarking-druid.html).
> >>>>>>
> >>>>>> I'd say there would be a one-time setup: generation of data sets of
> >>>>>> various sizes, conversion to compressed CSV, and upload to somewhere
> >>>>>> public (S3?). Not strictly necessary, but it would save everyone a lot
> >>>>>> of time and hassle to not have to download the TPC data generators and
> >>>>>> munge the output every time. For this one could use the TPC tools.
> >>>>>>
> >>>>>> Then, the most sensible avenue I think would be implementation of new
> >>>>>> Phoenix integration tests that consume that data and run uniquely
> >>>>>> tweaked queries (yeah - every datastore vendor must do that with TPC).
> >>>>>> Phoenix can use hbase-it and get the cluster and chaos tooling such as
> >>>>>> it is for free, but the upsert/initialization/bulk load and query
> >>>>>> tooling would be all Phoenix based: the CSV loader, the JDBC driver.
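For that one-time setup, the generate/compress/upload step might look something like this. Again a sketch: it assumes the stock TPC-H dbgen binary (flags per its README: -s scale, -T L for lineitem only, -C/-S for chunked generation) and a placeholder S3 bucket; output file naming can vary between dbgen versions.

SCALE=100      # scale factor, roughly GB of raw data
CHUNKS=100
for i in $(seq 1 $CHUNKS); do
  ./dbgen -f -s $SCALE -T L -C $CHUNKS -S $i    # emits lineitem.tbl.$i for this chunk
  gzip -f lineitem.tbl.$i
  aws s3 cp lineitem.tbl.$i.gz s3://my-tpc-data/tpch/$SCALE/lineitem.tbl.$i.gz
done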
> >>>>>> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <jamestay...@apache.org> wrote:
> >>>>>>
> >>>>>>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <apurt...@apache.org> wrote:
> >>>>>>>
> >>>>>>>>> I have a long interest in 'canned' loadings. Interesting ones are
> >>>>>>>>> hard to come by. If Phoenix ran any or a subset of TPCs, I'd like
> >>>>>>>>> to try it.
> >>>>>>>>
> >>>>>>>> Likewise
> >>>>>>>>
> >>>>>>>>> But I don't want to be the first to try it. I am not a Phoenix expert.
> >>>>>>>>
> >>>>>>>> Same here, I'd just email dev@phoenix with a report that TPC query
> >>>>>>>> XYZ didn't work and that would be as far as I could get.
> >>>>>>>
> >>>>>>> I don't think the first phase would require Phoenix experience. It's
> >>>>>>> more around the automation for running each TPC benchmark so the
> >>>>>>> process is repeatable:
> >>>>>>> - pulling in the data
> >>>>>>> - scripting the jobs
> >>>>>>> - having a test harness they run inside
> >>>>>>> - identifying the queries that don't work (ideally you wouldn't stop
> >>>>>>>   at the first error)
> >>>>>>> - filing JIRAs for these
> >>>>>>>
> >>>>>>> The entire framework could be built and tested using standard JDBC
> >>>>>>> APIs, and then initially run using MySQL or some other RDBMS before
> >>>>>>> trying it with Phoenix. Maybe there's such a test harness that
> >>>>>>> already exists for TPC?
> >>>>>>>
> >>>>>>> Then I think the next phase would require more Phoenix & HBase
> >>>>>>> experience:
> >>>>>>> - tweaking queries where possible given any limitations in Phoenix
> >>>>>>> - adding missing syntax (or potentially using the calcite branch
> >>>>>>>   which supports more)
> >>>>>>> - tweaking Phoenix schema declarations to optimize
> >>>>>>> - tweaking Phoenix & HBase configs to optimize
> >>>>>>> - determining which secondary indexes to add (though I think there's
> >>>>>>>   an academic paper on this, I can't seem to find it)
> >>>>>>>
> >>>>>>> Both phases would require a significant amount of time and effort.
> >>>>>>> Each benchmark would likely require unique tweaks.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> James
> >>>>>>
> >>>>>> --
> >>>>>> Best regards,
> >>>>>>
> >>>>>>    - Andy
> >>>>>>
> >>>>>> Problems worthy of attack prove their worth by hitting back. - Piet
> >>>>>> Hein (via Tom White)
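To make that first phase concrete, the "don't stop at the first error" part could start as small as a driver that feeds each query file to sqlline.py (itself just a thin JDBC client) and records failures for JIRA filing. A sketch under obvious assumptions: the query file layout, ZooKeeper quorum, and grep-based error detection are placeholders, and a real harness would use the JDBC APIs directly so the same code can be pointed at MySQL first, as suggested above.

ZK=zk-host:2181
mkdir -p results
for q in queries/q*.sql; do                     # one TPC query per file (assumed layout)
  name=$(basename "$q" .sql)
  ./bin/sqlline.py "$ZK" "$q" > "results/$name.out" 2>&1
  # don't rely on the exit code alone; scan the captured output for errors too
  if grep -qiE 'error|exception' "results/$name.out"; then
    echo "FAIL $name (see results/$name.out)"   # candidate JIRA
  else
    echo "PASS $name"
  fi
done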