It looks like you guys already have most of the TPC-H queries running, based on Enis's talk in Ireland this year. Very cool.
(Slide 20: Phoenix can execute most of the TPC-H queries!)

Regards,
John Leach

> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>
> It's TPC-DS, not -H, but this is what I was using way back when to run
> perf tests over Phoenix and the query server while I was developing on
> it. The first project generates and loads the data via MapReduce, and
> the second tool wraps JMeter to run queries in parallel.
>
> https://github.com/ndimiduk/tpcds-gen
> https://github.com/ndimiduk/phoenix-performance
>
> There's probably dust and bit-rot to brush off both projects, but maybe
> they'll help someone looking for a starting point?
>
> Apologies, but I haven't had time to see what the speakers have shared
> about their setup.
>
> -n
>
> On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org> wrote:
>
>>> Maybe there's such a test harness that already exists for TPC?
>>
>> TPC provides tooling, but it's all proprietary. The generated data can
>> be kept separately (Druid does this, at least:
>> http://druid.io/blog/2014/03/17/benchmarking-druid.html).
>>
>> I'd say there would be a one-time setup: generation of data sets of
>> various sizes, conversion to compressed CSV, and upload to somewhere
>> public (S3?). Not strictly necessary, but it would save everyone a lot
>> of time and hassle not to have to download the TPC data generators and
>> munge the output every time. For this one could use the TPC tools.
>>
>> Then, the most sensible avenue, I think, would be implementing new
>> Phoenix integration tests that consume that data and run uniquely
>> tweaked queries (yeah - every datastore vendor must do that with TPC).
>> Phoenix can use hbase-it and get the cluster and chaos tooling, such as
>> it is, for free, but the upsert/initialization/bulk load and query
>> tooling would be all Phoenix-based: the CSV loader, the JDBC driver.
>>
>> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <jamestay...@apache.org>
>> wrote:
>>
>>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <apurt...@apache.org>
>>> wrote:
>>>
>>>>> I have a long interest in 'canned' loadings. Interesting ones are
>>>>> hard to come by. If Phoenix ran any or a subset of the TPCs, I'd
>>>>> like to try it.
>>>>
>>>> Likewise
>>>>
>>>>> But I don't want to be the first to try it. I am not a Phoenix
>>>>> expert.
>>>>
>>>> Same here. I'd just email dev@phoenix with a report that TPC query
>>>> XYZ didn't work, and that would be as far as I could get.
>>>
>>> I don't think the first phase would require Phoenix experience. It's
>>> more about the automation for running each TPC benchmark so the
>>> process is repeatable:
>>> - pulling in the data
>>> - scripting the jobs
>>> - having a test harness they run inside
>>> - identifying the queries that don't work (ideally you wouldn't stop
>>>   at the first error)
>>> - filing JIRAs for these
>>>
>>> The entire framework could be built and tested using standard JDBC
>>> APIs, and then initially run against MySQL or some other RDBMS before
>>> trying it with Phoenix. Maybe there's such a test harness that already
>>> exists for TPC?
>>>
>>> Then I think the next phase would require more Phoenix & HBase
>>> experience:
>>> - tweaking queries where possible given any limitations in Phoenix
>>> - adding missing syntax (or potentially using the calcite branch,
>>>   which supports more)
>>> - tweaking Phoenix schema declarations to optimize
>>> - tweaking Phoenix & HBase configs to optimize
>>> - determining which secondary indexes to add (I think there's an
>>>   academic paper on this, but I can't seem to find it)
>>>
>>> Both phases would require a significant amount of time and effort.
>>> Each benchmark would likely require unique tweaks.
>>>
>>> Thanks,
>>> James
>>
>>
>> --
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
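[Editor's note: the core of the first-phase harness James outlines - run every query, keep going past failures, and collect the ones that need JIRAs - can be sketched with standard DB-API calls. sqlite3 stands in here for a JDBC connection to MySQL or Phoenix; the query names and tables are hypothetical, and one query deliberately uses a function sqlite lacks so the error path is exercised:]

```python
# Sketch of the "don't stop at the first error" harness. sqlite3 is a
# stand-in for JDBC (MySQL first, then Phoenix); the query list is a
# made-up stand-in for the real TPC query set.
import sqlite3

def run_suite(conn, queries):
    """Run every (name, sql) pair, recording failures instead of aborting."""
    passed, failed = [], []
    for name, sql in queries:
        try:
            conn.execute(sql).fetchall()
            passed.append(name)
        except sqlite3.Error as exc:
            # In the real harness, this is where you'd file a JIRA
            # for the unsupported query instead of just recording it.
            failed.append((name, str(exc)))
    return passed, failed

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (o_totalprice REAL)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(10.0,), (25.5,)])
    queries = [
        ("q1", "SELECT SUM(o_totalprice) FROM orders"),
        ("q2", "SELECT MEDIAN(o_totalprice) FROM orders"),  # unsupported here
        ("q3", "SELECT COUNT(*) FROM orders"),
    ]
    passed, failed = run_suite(conn, queries)
    print(f"{len(passed)} passed, {len(failed)} failed")
    for name, err in failed:
        print(f"  {name}: {err}")
```

[Because only the connection setup is driver-specific, the same loop runs unchanged against MySQL or Phoenix once pointed at a JDBC-backed connection, which is the portability James's plan relies on.]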