Thanks Mujtaba…

Regards,
John
> On Aug 24, 2016, at 2:29 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>
> That sounds about right for loading CSV directly on a 5-8 node cluster.
> As Gabriel/James mentioned in another thread, CSVBulkLoadTool with a
> pre-split table might offer significantly better performance for large
> datasets.
>
> On Tue, Aug 23, 2016 at 2:17 PM, John Leach <jle...@splicemachine.com>
> wrote:
>
>> So to load a TB of data would take around 2 days? Does that seem right
>> to you?
>>
>> Regards,
>> John
>>
>>> On Aug 23, 2016, at 3:07 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>>
>>> The 600M rows are split across 100 files, so 5 separate psql scripts
>>> ran in parallel on a single machine, loading files 1-20, 21-40, 41-60,
>>> 61-80, and 81-100. Performance suffers because the keys in these files
>>> are in sequence, which leads to hot-spotting of a region server; to
>>> avoid that, you should also try a salted
>>> <https://phoenix.apache.org/salted.html> Phoenix table to get the best
>>> write performance. A higher batch size might offer better performance
>>> too.
>>>
>>> I've also just started a data load from a single psql client, since
>>> loading from a single machine can get bottlenecked on network I/O.
>>> I'll update with its result tomorrow; I don't think its time will be
>>> too far off from loading with multiple clients.
>>>
>>> On Tue, Aug 23, 2016 at 12:49 PM, John Leach <jle...@splicemachine.com>
>>> wrote:
>>>
>>>> Mujtaba,
>>>>
>>>> I'm not following the import process.
>>>>
>>>> Do the 5 parallel psql clients mean that you manually split the data
>>>> into 5 buckets/files/directories and then ran 5 import scripts
>>>> simultaneously?
>>>>
>>>> If we wanted to benchmark import performance, what would be the right
>>>> model for that?
>>>>
>>>> Thanks, this is very helpful...
>>>>
>>>> Regards,
>>>> John
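[For illustration, a minimal sketch of Mujtaba's two suggestions above
combined: a salted (and therefore pre-split) table created over JDBC,
followed by a CsvBulkLoadTool run. The connection string "zk-host", the
bucket count, the table name, and the HDFS input path are placeholders,
and the schema is trimmed to two PK columns plus one value column; this
is not the exact setup behind the numbers below.]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

    public class SaltedBulkLoad {
        public static void main(String[] args) throws Exception {
            // SALT_BUCKETS pre-splits the table and prefixes each row key
            // with a hash byte, so sequential keys no longer hot-spot a
            // single region server.
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zk-host");
                 Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS LINEITEM_SALTED ("
                    + "L_ORDERKEY INTEGER NOT NULL, "
                    + "L_LINENUMBER INTEGER NOT NULL, "
                    + "L_QUANTITY DECIMAL(15,2) "
                    + "CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)) "
                    + "SALT_BUCKETS = 16");
            }
            // CsvBulkLoadTool writes HFiles via MapReduce instead of going
            // through the region server write path, which is where the win
            // comes from for TB-scale loads.
            int rc = ToolRunner.run(new Configuration(), new CsvBulkLoadTool(),
                new String[] {"--table", "LINEITEM_SALTED",
                              "--input", "/data/lineitem",
                              "--zookeeper", "zk-host"});
            System.exit(rc);
        }
    }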
>>>>> On Aug 23, 2016, at 2:28 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>>>>
>>>>> FYI, re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8-node cluster
>>>>> with 64G total/12G HBase heap.
>>>>>
>>>>> *Data Load*
>>>>> * 5.5 hours for 600M rows
>>>>> * Method: direct CSV load using the psql.py script
>>>>> * # client machines: 1
>>>>> * Batch size: 1K
>>>>> * Key order: *sequential*
>>>>> * 5 parallel psql clients
>>>>> * No rows missing after the load
>>>>>
>>>>> *Schema*
>>>>> CREATE TABLE LINEITEM_MULTICF (L_ORDERKEY INTEGER NOT NULL,
>>>>> A.L_PARTKEY INTEGER, B.L_SUPPKEY INTEGER, L_LINENUMBER INTEGER NOT
>>>>> NULL, B.L_QUANTITY DECIMAL(15,2), B.L_EXTENDEDPRICE DECIMAL(15,2),
>>>>> C.L_DISCOUNT DECIMAL(15,2), C.L_TAX DECIMAL(15,2), C.L_RETURNFLAG
>>>>> CHAR(1), C.L_LINESTATUS CHAR(1), C.L_SHIPDATE DATE, C.L_COMMITDATE
>>>>> DATE, C.L_RECEIPTDATE DATE, C.L_SHIPINSTRUCT CHAR(25), C.L_SHIPMODE
>>>>> CHAR(10), C.L_COMMENT VARCHAR(44)
>>>>> CONSTRAINT pk PRIMARY KEY (l_orderkey, l_linenumber));
>>>>>
>>>>> * 100M guidepost width
>>>>> * ~42GB data
>>>>> * Uncompressed/default fast-diff encoded
>>>>>
>>>>> *Performance with default 128 thread pool*
>>>>> select count(*) from lineitem_multicf;
>>>>> +------------+
>>>>> |  COUNT(1)  |
>>>>> +------------+
>>>>> | 600037902  |
>>>>> +------------+
>>>>> 1 row selected (*24.1 seconds*)
>>>>>
>>>>> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
>>>>> sum(l_extendedprice) as sum_base_price,
>>>>> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>>>>> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>>>>> avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
>>>>> avg(l_discount) as avg_disc, count(*) as count_order
>>>>> from lineitem_multicf where l_shipdate <= current_date() - 90
>>>>> group by l_returnflag, l_linestatus
>>>>> order by l_returnflag, l_linestatus;
>>>>> 4 rows selected (*185.2 seconds*)
>>>>>
>>>>> *Data*
>>>>> for i in {1..100}; do curl -O
>>>>> http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz;
>>>>> done
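[For reference, a rough JDBC-only sketch of what the load above amounts
to: UPSERTs with auto-commit off and a commit every 1,000 rows, one
loader thread per file, mimicking the 5-client psql.py setup. The
connection string "zk-host", the two-column subset, and the pipe
delimiter are assumptions; the actual load used psql.py.]

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // One loader per file; start several threads for a parallel load.
    public class UpsertLoader implements Runnable {
        private final String file;
        UpsertLoader(String file) { this.file = file; }

        @Override public void run() {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zk-host");
                 BufferedReader in = Files.newBufferedReader(Paths.get(file));
                 PreparedStatement ps = conn.prepareStatement(
                     "UPSERT INTO LINEITEM_MULTICF (L_ORDERKEY, L_LINENUMBER)"
                         + " VALUES (?, ?)")) {
                conn.setAutoCommit(false);  // buffer mutations client-side
                long rows = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\\|");  // dbgen output is pipe-delimited
                    ps.setInt(1, Integer.parseInt(f[0]));  // L_ORDERKEY
                    ps.setInt(2, Integer.parseInt(f[3]));  // L_LINENUMBER
                    ps.executeUpdate();
                    if (++rows % 1000 == 0) conn.commit();  // 1K batch size
                }
                conn.commit();  // flush the final partial batch
            } catch (Exception e) {
                throw new RuntimeException(file, e);
            }
        }

        public static void main(String[] args) {
            for (String f : args) new Thread(new UpsertLoader(f)).start();
        }
    }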
>>>>>
>>>>> On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com>
>>>>> wrote:
>>>>>
>>>>>> It looks like you guys already have most of the TPC-H queries
>>>>>> running, based on Enis’s talk in Ireland this year. Very cool.
>>>>>>
>>>>>> (Slide 20: Phoenix can execute most of the TPC-H queries!)
>>>>>>
>>>>>> Regards,
>>>>>> John Leach
>>>>>>
>>>>>>> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>>
>>>>>>> It's TPC-DS, not -H, but this is what I was using way back when to
>>>>>>> run perf tests over Phoenix and the query server while I was
>>>>>>> developing on it. The first project generates and loads the data
>>>>>>> via MapReduce; the second wraps jmeter to run queries in parallel.
>>>>>>>
>>>>>>> https://github.com/ndimiduk/tpcds-gen
>>>>>>> https://github.com/ndimiduk/phoenix-performance
>>>>>>>
>>>>>>> There's probably dust and bit-rot to brush off both projects, but
>>>>>>> maybe they'll help someone looking for a starting point?
>>>>>>>
>>>>>>> Apologies, but I haven't had time to see what the speakers have
>>>>>>> shared about their setup.
>>>>>>>
>>>>>>> -n
>>>>>>>
>>>>>>> On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>> Maybe there's such a test harness that already exists for TPC?
>>>>>>>>
>>>>>>>> TPC provides tooling, but it's all proprietary. The generated data
>>>>>>>> can be kept separately (Druid does this, at least:
>>>>>>>> http://druid.io/blog/2014/03/17/benchmarking-druid.html).
>>>>>>>>
>>>>>>>> I'd say there would be a one-time setup: generation of data sets
>>>>>>>> of various sizes, conversion to compressed CSV, and upload to
>>>>>>>> somewhere public (S3?). Not strictly necessary, but it would save
>>>>>>>> everyone a lot of time and hassle not to have to download the TPC
>>>>>>>> data generators and munge the output every time. For this one
>>>>>>>> could use the TPC tools.
>>>>>>>>
>>>>>>>> Then the most sensible avenue, I think, would be implementing new
>>>>>>>> Phoenix integration tests that consume that data and run uniquely
>>>>>>>> tweaked queries (yeah - every datastore vendor must do that with
>>>>>>>> TPC). Phoenix can use hbase-it and get the cluster and chaos
>>>>>>>> tooling, such as it is, for free, but the
>>>>>>>> upsert/initialization/bulk load and query tooling would be all
>>>>>>>> Phoenix based: the CSV loader, the JDBC driver.
>>>>>>>>
>>>>>>>> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor
>>>>>>>> <jamestay...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell
>>>>>>>>> <apurt...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>>> I have a long interest in 'canned' loadings. Interesting ones
>>>>>>>>>>> are hard to come by. If Phoenix ran any or a subset of the
>>>>>>>>>>> TPCs, I'd like to try it.
>>>>>>>>>>
>>>>>>>>>> Likewise.
>>>>>>>>>>
>>>>>>>>>>> But I don't want to be the first to try it. I am not a Phoenix
>>>>>>>>>>> expert.
>>>>>>>>>>
>>>>>>>>>> Same here. I'd just email dev@phoenix with a report that TPC
>>>>>>>>>> query XYZ didn't work, and that would be as far as I could get.
>>>>>>>>>
>>>>>>>>> I don't think the first phase would require Phoenix experience.
>>>>>>>>> It's more about the automation for running each TPC benchmark so
>>>>>>>>> the process is repeatable:
>>>>>>>>> - pulling in the data
>>>>>>>>> - scripting the jobs
>>>>>>>>> - having a test harness they run inside
>>>>>>>>> - identifying the queries that don't work (ideally you wouldn't
>>>>>>>>>   stop at the first error)
>>>>>>>>> - filing JIRAs for these
>>>>>>>>>
>>>>>>>>> The entire framework could be built and tested using standard
>>>>>>>>> JDBC APIs, and then initially run against MySQL or some other
>>>>>>>>> RDBMS before trying it with Phoenix. Maybe such a test harness
>>>>>>>>> already exists for TPC?
>>>>>>>>>
>>>>>>>>> Then I think the next phase would require more Phoenix & HBase
>>>>>>>>> experience:
>>>>>>>>> - tweaking queries where possible, given any limitations in
>>>>>>>>>   Phoenix
>>>>>>>>> - adding missing syntax (or potentially using the Calcite branch,
>>>>>>>>>   which supports more)
>>>>>>>>> - tweaking Phoenix schema declarations to optimize
>>>>>>>>> - tweaking Phoenix & HBase configs to optimize
>>>>>>>>> - determining which secondary indexes to add (though I think
>>>>>>>>>   there's an academic paper on this, I can't seem to find it)
>>>>>>>>>
>>>>>>>>> Both phases would require a significant amount of time and
>>>>>>>>> effort. Each benchmark would likely require unique tweaks.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> James
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>>   - Andy
>>>>>>>>
>>>>>>>> Problems worthy of attack prove their worth by hitting back.
>>>>>>>> - Piet Hein (via Tom White)
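[A minimal sketch, assuming one query per line in a text file, of the
JDBC-only harness James describes: it takes any JDBC URL (MySQL first,
then Phoenix), runs each query in turn, times it, and records failures
instead of stopping at the first error, so the failure list can be
turned into JIRAs. The class name and file layout are made up for
illustration.]

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.List;

    public class TpcQueryRunner {
        public static void main(String[] args) throws Exception {
            String url = args[0];  // e.g. jdbc:mysql://... or jdbc:phoenix:zk-host
            List<String> queries = Files.readAllLines(Paths.get(args[1]));
            try (Connection conn = DriverManager.getConnection(url)) {
                int n = 0;
                for (String sql : queries) {
                    if (sql.trim().isEmpty()) continue;  // skip blank lines
                    n++;
                    long start = System.currentTimeMillis();
                    try (Statement stmt = conn.createStatement();
                         ResultSet rs = stmt.executeQuery(sql)) {
                        long rows = 0;
                        while (rs.next()) rows++;  // drain the result set
                        System.out.printf("query %d: %d rows, %d ms%n",
                            n, rows, System.currentTimeMillis() - start);
                    } catch (SQLException e) {
                        // Don't stop at the first error; log the JIRA
                        // candidate and keep going.
                        System.err.printf("query %d FAILED: %s%n",
                            n, e.getMessage());
                    }
                }
            }
        }
    }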