Mujtaba,

I'm not following the import process. Does "5 parallel psql clients" mean
that you manually split the data into 5 buckets/files/directories and then
run 5 import scripts simultaneously? If we wanted to benchmark import
performance, what would be the right model for that?

Thanks, this is very helpful.

Regards,
John
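(For concreteness, here is a minimal sketch of the split-and-load model being
asked about, using the LINEITEM_MULTICF table from the report quoted below.
The input file name, the use of GNU split, and the ZooKeeper quorum
"localhost" are illustrative assumptions, not details from the thread.)

    # Sketch only: input file, chunk count, and ZK quorum (localhost) are
    # assumptions. GNU split's -n l/5 makes 5 chunks without splitting lines;
    # psql.py needs a .csv suffix to treat a file as CSV data.
    split -n l/5 -d --additional-suffix=.csv lineitem.csv chunk.
    for f in chunk.0*.csv; do
        ./bin/psql.py -t LINEITEM_MULTICF localhost "$f" &   # one client per chunk
    done
    wait   # all 5 imports run simultaneously

Each backgrounded psql.py process is an independent JDBC client, so the
cluster sees 5 concurrent upsert streams.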
> On Aug 23, 2016, at 2:28 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>
> FYI re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8 node cluster with
> 64G total/12G HBase heap.
>
> *Data Load*
> * 5.5 hours for 600M rows
> * Method: Direct CSV load using psql.py script
> * # client machines: 1
> * Batch size 1K
> * Key order: *Sequential*
> * 5 parallel psql clients
> * No missing rows due to data load
>
> *Schema*
> CREATE TABLE LINEITEM_MULTICF (L_ORDERKEY INTEGER not null,
>     A.L_PARTKEY INTEGER, B.L_SUPPKEY INTEGER,
>     L_LINENUMBER INTEGER not null, B.L_QUANTITY DECIMAL(15,2),
>     B.L_EXTENDEDPRICE DECIMAL(15,2), C.L_DISCOUNT DECIMAL(15,2),
>     C.L_TAX DECIMAL(15,2), C.L_RETURNFLAG CHAR(1),
>     C.L_LINESTATUS CHAR(1), C.L_SHIPDATE DATE, C.L_COMMITDATE DATE,
>     C.L_RECEIPTDATE DATE, C.L_SHIPINSTRUCT CHAR(25),
>     C.L_SHIPMODE CHAR(10), C.L_COMMENT VARCHAR(44)
>     constraint pk primary key(l_orderkey, l_linenumber));
>
> * 100M guidepost width
> * ~42GB data
> * Uncompressed/default fast-diff encoded
>
> *Performance with default 128 threadpool*
> select count(*) from lineitem_multicf;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 600037902  |
> +------------+
> 1 row selected (*24.1 seconds*)
>
> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
>     sum(l_extendedprice) as sum_base_price,
>     sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>     avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
>     avg(l_discount) as avg_disc, count(*) as count_order
> from lineitem
> where l_shipdate <= current_date() - 90
> group by l_returnflag, l_linestatus
> order by l_returnflag, l_linestatus;
> 4 rows selected (*185.2 seconds*)
>
> *Data*
> do curl -O http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz ; done
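(The *Data* fragment above lost its "for" header in transit. A plausible
reconstruction follows; only the URL pattern comes from the message, and the
part count is an assumption since the original doesn't say how many split
files the Druid-hosted dataset has.)

    # Reconstruction sketch: only the URL pattern is from the original message.
    N=8   # assumed part count; the original message doesn't say
    for i in $(seq 1 "$N"); do
        curl -O "http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz"
    done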
>
> On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com>
> wrote:
>
>> It looks like you guys already have most of the TPCH queries running based
>> on Enis's talk in Ireland this year. Very cool.
>>
>> (Slide 20: Phoenix can execute most of the TPC-H queries!)
>>
>> Regards,
>> John Leach
>>
>>> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>
>>> It's TPC-DS, not -H, but this is what I was using way back when to run
>>> perf tests over Phoenix and the query server while I was developing on
>>> it. The first project generates and loads the data via mapreduce, and
>>> the second tool wraps up use of jmeter to run queries in parallel.
>>>
>>> https://github.com/ndimiduk/tpcds-gen
>>> https://github.com/ndimiduk/phoenix-performance
>>>
>>> Probably there's dust and bit-rot to brush off of both projects, but
>>> maybe it'll help someone looking for a starting point?
>>>
>>> Apologies, but I haven't had time to see what the speakers have shared
>>> about their setup.
>>>
>>> -n
>>>
>>> On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org> wrote:
>>>
>>>>> Maybe there's such a test harness that already exists for TPC?
>>>>
>>>> TPC provides tooling but it's all proprietary. The generated data can
>>>> be kept separately (Druid does it at least -
>>>> http://druid.io/blog/2014/03/17/benchmarking-druid.html).
>>>>
>>>> I'd say there would be one-time setup: generation of data sets of
>>>> various sizes, conversion to compressed CSV, and upload to somewhere
>>>> public (S3?). Not strictly necessary, but it would save everyone a lot
>>>> of time and hassle to not have to download the TPC data generators and
>>>> munge the output every time. For this one could use the TPC tools.
>>>>
>>>> Then, the most sensible avenue I think would be implementation of new
>>>> Phoenix integration tests that consume that data and run uniquely
>>>> tweaked queries (yeah - every datastore vendor must do that with TPC).
>>>> Phoenix can use hbase-it and get the cluster and chaos tooling such as
>>>> it is for free, but the upsert/initialization/bulk load and query
>>>> tooling would be all Phoenix based: the CSV loader, the JDBC driver.
>>>>
>>>> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <jamestay...@apache.org>
>>>> wrote:
>>>>
>>>>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <apurt...@apache.org>
>>>>> wrote:
>>>>>
>>>>>>> I have a long interest in 'canned' loadings. Interesting ones are
>>>>>>> hard to come by. If Phoenix ran any or a subset of TPCs, I'd like
>>>>>>> to try it.
>>>>>>
>>>>>> Likewise
>>>>>>
>>>>>>> But I don't want to be the first to try it. I am not a Phoenix
>>>>>>> expert.
>>>>>>
>>>>>> Same here, I'd just email dev@phoenix with a report that TPC query
>>>>>> XYZ didn't work and that would be as far as I could get.
>>>>>
>>>>> I don't think the first phase would require Phoenix experience. It's
>>>>> more around the automation for running each TPC benchmark so the
>>>>> process is repeatable:
>>>>> - pulling in the data
>>>>> - scripting the jobs
>>>>> - having a test harness they run inside (see the sketch at the end of
>>>>>   this thread)
>>>>> - identifying the queries that don't work (ideally you wouldn't stop
>>>>>   at the first error)
>>>>> - filing JIRAs for these
>>>>>
>>>>> The entire framework could be built and tested using standard JDBC
>>>>> APIs, and then initially run using MySQL or some other RDBMS before
>>>>> trying it with Phoenix. Maybe there's such a test harness that
>>>>> already exists for TPC?
>>>>>
>>>>> Then I think the next phase would require more Phoenix & HBase
>>>>> experience:
>>>>> - tweaking queries where possible given any limitations in Phoenix
>>>>> - adding missing syntax (or potentially using the calcite branch
>>>>>   which supports more)
>>>>> - tweaking Phoenix schema declarations to optimize
>>>>> - tweaking Phoenix & HBase configs to optimize
>>>>> - determining which secondary indexes to add (though I think there's
>>>>>   an academic paper on this, I can't seem to find it)
>>>>>
>>>>> Both phases would require a significant amount of time and effort.
>>>>> Each benchmark would likely require unique tweaks.
>>>>>
>>>>> Thanks,
>>>>> James
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>>    - Andy
>>>>
>>>> Problems worthy of attack prove their worth by hitting back. - Piet
>>>> Hein (via Tom White)
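(Closing note: a minimal sketch of the run-everything harness James describes
above, executing each query file, continuing past failures, and collecting
candidates for JIRAs. The queries/ layout, log paths, and use of sqlline.py
against a "localhost" quorum are assumptions, not anything specified in the
thread.)

    # Sketch: paths, quorum, and one-query-per-file layout are assumptions.
    mkdir -p logs
    failed=()
    for q in queries/*.sql; do
        if ! ./bin/sqlline.py localhost "$q" > "logs/$(basename "$q").out" 2>&1; then
            failed+=("$q")    # don't stop at the first error
        fi
    done
    # report candidates for JIRAs
    ((${#failed[@]})) && printf 'needs a JIRA: %s\n' "${failed[@]}"

Note that sqlline's exit status may not reflect SQL-level errors, so grepping
the per-query logs for exceptions is a reasonable extra check.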