So loading a TB of data would take around 2 days? Does that seem right to you?
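
A quick back-of-envelope check, assuming load time scales linearly from the 42 GB / 5.5 hour figure reported below (a rough assumption; the 42 GB is post-load on-disk size, and sequential-key hot-spotting may not scale linearly):

```shell
# Linear extrapolation from 42 GB in 5.5 hours to 1 TB (1024 GB).
awk 'BEGIN { printf "%.1f days\n", 5.5 * (1024 / 42) / 24 }'
# → 5.6 days
```

The raw CSV input is larger than the 42 GB on-disk figure, so the wall-clock estimate depends on which size you extrapolate from.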

Regards,
John

> On Aug 23, 2016, at 3:07 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
> 
> The 600M rows are split across 100 files, so 5 separate psql scripts were
> run in parallel on a single machine, loading files 1-20, 21-40, 41-60,
> 61-80, and 81-100. Performance is affected because keys within these
> files are sequential, which leads to region server hot-spotting; for the
> best write performance you should also try a Salted
> <https://phoenix.apache.org/salted.html> Phoenix table. A higher batch
> size might offer better performance too.
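
A sketch of that 5-way split, shown as a dry run that just prints each client's command (the psql.py invocation and file paths are hypothetical; actual options depend on your Phoenix install):

```shell
# Dry run: print the command each of the 5 parallel clients would run.
# Each client loads 20 of the 100 sequential CSV files (paths hypothetical).
for start in 1 21 41 61 81; do
  end=$((start + 19))
  echo "psql.py -t LINEITEM_MULTICF zkhost lineitem.tbl.{${start}..${end}}.csv &"
done
```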
> 
> I've also just started a data load from a single psql client, since
> loading from a single machine can get bottlenecked on network I/O. I'll
> post the result tomorrow; I don't expect its time to be far off from
> loading with multiple clients.
> 
> 
> On Tue, Aug 23, 2016 at 12:49 PM, John Leach <jle...@splicemachine.com>
> wrote:
> 
>> Mujtaba,
>> 
>> Not following the import process.
>> 
>> Do the 5 parallel psql clients mean that you manually split the data into
>> 5 buckets/files/directories and then ran 5 import scripts simultaneously?
>> 
>> If we wanted to benchmark import performance, what would be the right
>> model for that?
>> 
>> Thanks, this is very helpful.
>> 
>> Regards,
>> John
>> 
>> 
>> 
>>> On Aug 23, 2016, at 2:28 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>> 
>>> FYI: re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8-node cluster with
>>> 64G total/12G HBase heap.
>>> 
>>> *Data Load*
>>> * 5.5 hours for 600M rows
>>> * Method: Direct CSV load using psql.py script
>>> * # client machines: 1
>>> * Batch size 1K
>>> * Key order: *Sequential*
>>> * 5 parallel psql clients
>>> * No missing rows due to data load
>>> 
>>> *Schema*
>>> CREATE TABLE LINEITEM_MULTICF (
>>>     L_ORDERKEY        INTEGER NOT NULL,
>>>     A.L_PARTKEY       INTEGER,
>>>     B.L_SUPPKEY       INTEGER,
>>>     L_LINENUMBER      INTEGER NOT NULL,
>>>     B.L_QUANTITY      DECIMAL(15,2),
>>>     B.L_EXTENDEDPRICE DECIMAL(15,2),
>>>     C.L_DISCOUNT      DECIMAL(15,2),
>>>     C.L_TAX           DECIMAL(15,2),
>>>     C.L_RETURNFLAG    CHAR(1),
>>>     C.L_LINESTATUS    CHAR(1),
>>>     C.L_SHIPDATE      DATE,
>>>     C.L_COMMITDATE    DATE,
>>>     C.L_RECEIPTDATE   DATE,
>>>     C.L_SHIPINSTRUCT  CHAR(25),
>>>     C.L_SHIPMODE      CHAR(10),
>>>     C.L_COMMENT       VARCHAR(44)
>>>     CONSTRAINT pk PRIMARY KEY (l_orderkey, l_linenumber));
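
For comparison, a salted variant (a sketch, not the schema actually benchmarked; table name and bucket count are made up, with the bucket count typically set near the number of region servers) would differ only by the SALT_BUCKETS property, which prepends a salt byte to spread sequential keys across region servers:

```sql
-- Hypothetical salted variant of the schema above (abbreviated columns).
-- SALT_BUCKETS = 8 is a guess matching the 8-node cluster.
CREATE TABLE LINEITEM_SALTED (
    L_ORDERKEY   INTEGER NOT NULL,
    L_LINENUMBER INTEGER NOT NULL,
    B.L_QUANTITY DECIMAL(15,2)
    CONSTRAINT pk PRIMARY KEY (l_orderkey, l_linenumber)
) SALT_BUCKETS = 8;
```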
>>> 
>>> * 100M guidepost width
>>> * ~42GB data
>>> * Uncompressed/default fast-diff encoded
>>> 
>>> *Performance with default 128-thread pool*
>>> select count(*) from lineitem_multicf;
>>> +------------+
>>> |  COUNT(1)  |
>>> +------------+
>>> | 600037902  |
>>> +------------+
>>> 1 row selected (*24.1** seconds*)
>>> 
>>> select l_returnflag, l_linestatus,
>>>        sum(l_quantity) as sum_qty,
>>>        sum(l_extendedprice) as sum_base_price,
>>>        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>>>        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>>>        avg(l_quantity) as avg_qty,
>>>        avg(l_extendedprice) as avg_price,
>>>        avg(l_discount) as avg_disc,
>>>        count(*) as count_order
>>> from lineitem
>>> where l_shipdate <= current_date() - 90
>>> group by l_returnflag, l_linestatus
>>> order by l_returnflag, l_linestatus;
>>> 4 rows selected (*185.2 seconds*)
>>> 
>>> *Data*
>>> for i in $(seq 1 100); do curl -O
>>> http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz; done
>>> 
>>> 
>>> 
>>> On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com>
>>> wrote:
>>> 
>>>> It looks like you guys already have most of the TPC-H queries running
>>>> based on Enis’s talk in Ireland this year.  Very cool.
>>>> 
>>>> (Slide 20:  Phoenix can execute most of the TPC-H queries!)
>>>> 
>>>> Regards,
>>>> John Leach
>>>> 
>>>> 
>>>>> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>> 
>>>>> It's TPC-DS, not -H, but this is what I was using way back when to run
>>>>> perf tests over Phoenix and the query server while I was developing on
>>>>> it. The first project generates and loads the data via MapReduce, and
>>>>> the second wraps up JMeter to run queries in parallel.
>>>>> 
>>>>> https://github.com/ndimiduk/tpcds-gen
>>>>> https://github.com/ndimiduk/phoenix-performance
>>>>> 
>>>>> Probably there's dust and bit-rot to brush off of both projects, but
>>>> maybe
>>>>> it'll help someone looking for a starting point?
>>>>> 
>>>>> Apologies, but I haven't had time to see what the speakers have shared
>>>>> about their setup.
>>>>> 
>>>>> -n
>>>>> 
>>>>> On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org>
>> wrote:
>>>>> 
>>>>>>> Maybe there's such a test harness that already exists for TPC?
>>>>>> 
>>>>>> TPC provides tooling but it's all proprietary. The generated data can
>> be
>>>>>> kept separately (Druid does it at least -
>>>>>> http://druid.io/blog/2014/03/17/benchmarking-druid.html
>>>>>> ).
>>>>>> 
>>>>>> I'd say there would be a one-time setup: generation of data sets of
>>>> various
>>>>>> sizes, conversion to compressed CSV, and upload to somewhere public
>>>> (S3?).
>>>>>> Not strictly necessary, but it would save everyone a lot of time and
>>>> hassle
>>>>>> to not have to download the TPC data generators and munge the output
>>>> every
>>>>>> time. For this one could use the TPC tools.
>>>>>> 
>>>>>> Then, the most sensible avenue I think would be implementation of new
>>>>>> Phoenix integration tests that consume that data and run uniquely
>>>> tweaked
>>>>>> queries (yeah - every datastore vendor must do that with TPC). Phoenix
>>>> can
>>>>>> use hbase-it and get the cluster and chaos tooling such as it is for
>>>> free,
>>>>>> but the upsert/initialization/bulk load and query tooling would be all
>>>>>> Phoenix based: the CSV loader, the JDBC driver.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <jamestay...@apache.org
>>>>>> <javascript:;>>
>>>>>> wrote:
>>>>>> 
>>>>>>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <apurt...@apache.org
>>>>>> <javascript:;>>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>>> I have a long interest in 'canned' loadings. Interesting ones are
>>>>>> hard
>>>>>>> to
>>>>>>>>> come by. If Phoenix ran any or a subset of TPCs, I'd like to try
>> it.
>>>>>>>> 
>>>>>>>> Likewise
>>>>>>>> 
>>>>>>>>> But I don't want to be the first to try it. I am not a Phoenix
>>>>>> expert.
>>>>>>>> 
>>>>>>>> Same here, I'd just email dev@phoenix with a report that TPC query
>>>> XYZ
>>>>>>>> didn't work and that would be as far as I could get.
>>>>>>>> 
>>>>>>>> I don't think the first phase would require Phoenix experience. It's
>>>>>> more
>>>>>>> around the automation for running each TPC benchmark so the process
>> is
>>>>>>> repeatable:
>>>>>>> - pulling in the data
>>>>>>> - scripting the jobs
>>>>>>> - having a test harness they run inside
>>>>>>> - identifying the queries that don't work (ideally you wouldn't stop
>> at
>>>>>> the
>>>>>>> first error)
>>>>>>> - filing JIRAs for these
>>>>>>> 
>>>>>>> The entire framework could be built and tested using standard JDBC
>>>> APIs,
>>>>>>> and then initially run using MySQL or some other RDBMS before trying
>> it
>>>>>>> with Phoenix. Maybe there's such a test harness that already exists
>> for
>>>>>>> TPC?
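
As a minimal illustration of the "don't stop at the first error" harness described above (sketched with the sqlite3 CLI standing in for the eventual JDBC target; the table and queries are made up):

```shell
# Run each benchmark query against a stand-in database, log failures,
# and continue past errors (each FAILED line would become a JIRA).
db=$(mktemp).db
sqlite3 "$db" "CREATE TABLE lineitem (l_orderkey INTEGER, l_quantity REAL);"
queries=(
  "SELECT count(*) FROM lineitem;"
  "SELECT no_such_column FROM lineitem;"   # deliberately broken
)
for q in "${queries[@]}"; do
  if sqlite3 "$db" "$q" > /dev/null 2>&1; then
    echo "OK:     $q"
  else
    echo "FAILED: $q"
  fi
done
```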
>>>>>>> 
>>>>>>> Then I think the next phase would require more Phoenix & HBase
>>>>>> experience:
>>>>>>> - tweaking queries where possible given any limitations in Phoenix
>>>>>>> - adding missing syntax (or potentially using the calcite branch
>> which
>>>>>>> supports more)
>>>>>>> - tweaking Phoenix schema declarations to optimize
>>>>>>> - tweaking Phoenix & HBase configs to optimize
>>>>>>> - determining which secondary indexes to add (though I think there's
>> an
>>>>>>> academic paper on this, I can't seem to find it)
>>>>>>> 
>>>>>>> Both phases would require a significant amount of time and effort.
>> Each
>>>>>>> benchmark would likely require unique tweaks.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> James
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Best regards,
>>>>>> 
>>>>>> - Andy
>>>>>> 
>>>>>> Problems worthy of attack prove their worth by hitting back. - Piet
>> Hein
>>>>>> (via Tom White)
>>>>>> 
>>>> 
>>>> 
>> 
>> 
