> Maybe there's such a test harness that already exists for TPC?

TPC provides tooling, but it's all proprietary. The generated data can be
kept separately (Druid does this, at least:
http://druid.io/blog/2014/03/17/benchmarking-druid.html).

I'd say there would be a one-time setup: generation of data sets of various
sizes, conversion to compressed CSV, and upload to somewhere public (S3?).
Not strictly necessary, but it would save everyone a lot of time and hassle
not to have to download the TPC data generators and munge the output every
time. For this one could use the TPC tools.
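As a rough sketch of that one-time conversion step, something like the following could take generator output and write a compressed CSV ready for upload. The rows here are hypothetical in-memory stand-ins; a real run would stream and parse the pipe-delimited .tbl files the TPC generators emit:

```python
import csv
import gzip

def write_compressed_csv(rows, path):
    """Write rows (iterables of fields) to a gzip-compressed CSV file."""
    # Text mode with newline="" as the csv module expects.
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(rows)

# Stand-in rows; a real run would feed output from the TPC data generators.
rows = [
    (1, "Customer#000000001", 711.56),
    (2, "Customer#000000002", 121.65),
]
write_compressed_csv(rows, "customer.csv.gz")
```

The resulting .csv.gz files could then be pushed to a public bucket once and reused by everyone.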

Then the most sensible avenue, I think, would be implementing new Phoenix
integration tests that consume that data and run uniquely tweaked queries
(yes, every datastore vendor must do that with TPC). Phoenix can use
hbase-it and get the cluster and chaos tooling, such as it is, for free,
but the upsert/initialization/bulk-load and query tooling would be all
Phoenix-based: the CSV loader, the JDBC driver.
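To make the "run the query set, don't stop at the first failure" idea concrete, here is a minimal harness sketch. It uses Python's stdlib sqlite3 as a stand-in connection, in the spirit of James's suggestion below that the framework could be built against standard database APIs and pointed at any backend first; a real run would go through Phoenix's JDBC driver, and the table and query names here are hypothetical:

```python
import sqlite3

def run_queries(conn, queries):
    """Run each (name, sql) pair, collecting failures instead of
    aborting on the first error."""
    failures = []
    for name, sql in queries:
        try:
            conn.execute(sql).fetchall()
        except sqlite3.Error as e:
            failures.append((name, str(e)))  # each one is a JIRA candidate
    return failures

# Stand-in schema and queries; a real harness would load the TPC query set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_quantity REAL, l_extendedprice REAL)")
queries = [
    ("q_ok", "SELECT SUM(l_extendedprice) FROM lineitem"),
    ("q_bad", "SELECT * FROM no_such_table"),
]
failed = run_queries(conn, queries)
```

The point of the pattern is the failure list at the end: one pass over the whole benchmark yields the complete set of queries to file issues for, rather than one issue per run.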



On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <[email protected]>
wrote:

> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <[email protected]>
> wrote:
>
> > > I have a long interest in 'canned' loadings. Interesting ones are hard
> to
> > > come by. If Phoenix ran any or a subset of TPCs, I'd like to try it.
> >
> > Likewise
> >
> > > But I don't want to be the first to try it. I am not a Phoenix expert.
> >
> > Same here, I'd just email dev@phoenix with a report that TPC query XYZ
> > didn't work and that would be as far as I could get.
> >
> I don't think the first phase would require Phoenix experience. It's more
> around the automation for running each TPC benchmark so the process is
> repeatable:
> - pulling in the data
> - scripting the jobs
> - having a test harness they run inside
> - identifying the queries that don't work (ideally you wouldn't stop at the
> first error)
> - filing JIRAs for these
>
> The entire framework could be built and tested using standard JDBC APIs,
> and then initially run using MySQL or some other RDBMS before trying it
> with Phoenix. Maybe there's such a test harness that already exists for
> TPC?
>
> Then I think the next phase would require more Phoenix & HBase experience:
> - tweaking queries where possible given any limitations in Phoenix
> - adding missing syntax (or potentially using the calcite branch which
> supports more)
> - tweaking Phoenix schema declarations to optimize
> - tweaking Phoenix & HBase configs to optimize
> - determining which secondary indexes to add (though I think there's an
> academic paper on this, I can't seem to find it)
>
> Both phases would require a significant amount of time and effort. Each
> benchmark would likely require unique tweaks.
>
> Thanks,
> James
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
