Mujtaba,

I'm not following the import process. Does "5 parallel psql clients" mean
that you manually split the data into 5 buckets/files/directories and then
run 5 import scripts simultaneously? If we wanted to benchmark import
performance, what would be the right model for that?

Thanks, this is very helpful.

Regards,
John
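(For concreteness, here is a minimal sketch of the split-and-load model being
asked about, using the LINEITEM_MULTICF table from the report quoted below.
The input file name, the use of GNU split, and the ZooKeeper quorum
"localhost" are illustrative assumptions, not details from the thread.)

    # Sketch only: input file, chunk count, and ZK quorum (localhost) are
    # assumptions. GNU split's -n l/5 makes 5 chunks without splitting lines;
    # psql.py needs a .csv suffix to treat a file as CSV data.
    split -n l/5 -d --additional-suffix=.csv lineitem.csv chunk.
    for f in chunk.0*.csv; do
        ./bin/psql.py -t LINEITEM_MULTICF localhost "$f" &   # one client per chunk
    done
    wait   # all 5 imports run simultaneously

Each backgrounded psql.py process is an independent JDBC client, so the
cluster sees 5 concurrent upsert streams.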
> On Aug 23, 2016, at 2:28 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>
> FYI re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8 node cluster with
> 64G total/12G HBase heap.
>
> *Data Load*
> * 5.5 hours for 600M rows
> * Method: Direct CSV load using psql.py script
> * # client machines: 1
> * Batch size 1K
> * Key order: *Sequential*
> * 5 parallel psql clients
> * No missing rows due to data load
>
> *Schema*
> CREATE TABLE LINEITEM_MULTICF (L_ORDERKEY INTEGER not null,
>     A.L_PARTKEY INTEGER, B.L_SUPPKEY INTEGER,
>     L_LINENUMBER INTEGER not null, B.L_QUANTITY DECIMAL(15,2),
>     B.L_EXTENDEDPRICE DECIMAL(15,2), C.L_DISCOUNT DECIMAL(15,2),
>     C.L_TAX DECIMAL(15,2), C.L_RETURNFLAG CHAR(1),
>     C.L_LINESTATUS CHAR(1), C.L_SHIPDATE DATE, C.L_COMMITDATE DATE,
>     C.L_RECEIPTDATE DATE, C.L_SHIPINSTRUCT CHAR(25),
>     C.L_SHIPMODE CHAR(10), C.L_COMMENT VARCHAR(44)
>     constraint pk primary key(l_orderkey, l_linenumber));
>
> * 100M guidepost width
> * ~42GB data
> * Uncompressed/default fast-diff encoded
>
> *Performance with default 128 threadpool*
> select count(*) from lineitem_multicf;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 600037902  |
> +------------+
> 1 row selected (*24.1 seconds*)
>
> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
>     sum(l_extendedprice) as sum_base_price,
>     sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>     avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
>     avg(l_discount) as avg_disc, count(*) as count_order
> from lineitem
> where l_shipdate <= current_date() - 90
> group by l_returnflag, l_linestatus
> order by l_returnflag, l_linestatus;
> 4 rows selected (*185.2 seconds*)
>
> *Data*
> do curl -O http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz ; done
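(The *Data* fragment above lost its "for" header in transit. A plausible
reconstruction follows; only the URL pattern comes from the message, and the
part count is an assumption since the original doesn't say how many split
files the Druid-hosted dataset has.)

    # Reconstruction sketch: only the URL pattern is from the original message.
    N=8   # assumed part count; the original message doesn't say
    for i in $(seq 1 "$N"); do
        curl -O "http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz"
    done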
>
> On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com>
> wrote:
>
>> It looks like you guys already have most of the TPCH queries running based
>> on Enis's talk in Ireland this year. Very cool.
>>
>> (Slide 20: Phoenix can execute most of the TPC-H queries!)
>>
>> Regards,
>> John Leach
>>
>>> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>
>>> It's TPC-DS, not -H, but this is what I was using way back when to run
>>> perf tests over Phoenix and the query server while I was developing on
>>> it. The first project generates and loads the data via mapreduce, and
>>> the second tool wraps up use of jmeter to run queries in parallel.
>>>
>>> https://github.com/ndimiduk/tpcds-gen
>>> https://github.com/ndimiduk/phoenix-performance
>>>
>>> Probably there's dust and bit-rot to brush off of both projects, but
>>> maybe it'll help someone looking for a starting point?
>>>
>>> Apologies, but I haven't had time to see what the speakers have shared
>>> about their setup.
>>>
>>> -n
>>>
>>> On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org> wrote:
>>>
>>>>> Maybe there's such a test harness that already exists for TPC?
>>>>
>>>> TPC provides tooling but it's all proprietary. The generated data can
>>>> be kept separately (Druid does it at least -
>>>> http://druid.io/blog/2014/03/17/benchmarking-druid.html).
>>>>
>>>> I'd say there would be one-time setup: generation of data sets of
>>>> various sizes, conversion to compressed CSV, and upload to somewhere
>>>> public (S3?). Not strictly necessary, but it would save everyone a lot
>>>> of time and hassle to not have to download the TPC data generators and
>>>> munge the output every time. For this one could use the TPC tools.
>>>>
>>>> Then, the most sensible avenue I think would be implementation of new
>>>> Phoenix integration tests that consume that data and run uniquely
>>>> tweaked queries (yeah - every datastore vendor must do that with TPC).
>>>> Phoenix can use hbase-it and get the cluster and chaos tooling such as
>>>> it is for free, but the upsert/initialization/bulk load and query
>>>> tooling would be all Phoenix based: the CSV loader, the JDBC driver.
>>>>
>>>> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <jamestay...@apache.org>
>>>> wrote:
>>>>
>>>>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <apurt...@apache.org>
>>>>> wrote:
>>>>>
>>>>>>> I have a long interest in 'canned' loadings. Interesting ones are
>>>>>>> hard to come by. If Phoenix ran any or a subset of TPCs, I'd like
>>>>>>> to try it.
>>>>>>
>>>>>> Likewise
>>>>>>
>>>>>>> But I don't want to be the first to try it. I am not a Phoenix
>>>>>>> expert.
>>>>>>
>>>>>> Same here, I'd just email dev@phoenix with a report that TPC query
>>>>>> XYZ didn't work and that would be as far as I could get.
>>>>>
>>>>> I don't think the first phase would require Phoenix experience. It's
>>>>> more around the automation for running each TPC benchmark so the
>>>>> process is repeatable:
>>>>> - pulling in the data
>>>>> - scripting the jobs
>>>>> - having a test harness they run inside (see the sketch at the end of
>>>>>   this thread)
>>>>> - identifying the queries that don't work (ideally you wouldn't stop
>>>>>   at the first error)
>>>>> - filing JIRAs for these
>>>>>
>>>>> The entire framework could be built and tested using standard JDBC
>>>>> APIs, and then initially run using MySQL or some other RDBMS before
>>>>> trying it with Phoenix. Maybe there's such a test harness that
>>>>> already exists for TPC?
>>>>>
>>>>> Then I think the next phase would require more Phoenix & HBase
>>>>> experience:
>>>>> - tweaking queries where possible given any limitations in Phoenix
>>>>> - adding missing syntax (or potentially using the calcite branch
>>>>>   which supports more)
>>>>> - tweaking Phoenix schema declarations to optimize
>>>>> - tweaking Phoenix & HBase configs to optimize
>>>>> - determining which secondary indexes to add (though I think there's
>>>>>   an academic paper on this, I can't seem to find it)
>>>>>
>>>>> Both phases would require a significant amount of time and effort.
>>>>> Each benchmark would likely require unique tweaks.
>>>>>
>>>>> Thanks,
>>>>> James
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>>    - Andy
>>>>
>>>> Problems worthy of attack prove their worth by hitting back. - Piet
>>>> Hein (via Tom White)
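(Closing note: a minimal sketch of the run-everything harness James describes
above, executing each query file, continuing past failures, and collecting
candidates for JIRAs. The queries/ layout, log paths, and use of sqlline.py
against a "localhost" quorum are assumptions, not anything specified in the
thread.)

    # Sketch: paths, quorum, and one-query-per-file layout are assumptions.
    mkdir -p logs
    failed=()
    for q in queries/*.sql; do
        if ! ./bin/sqlline.py localhost "$q" > "logs/$(basename "$q").out" 2>&1; then
            failed+=("$q")    # don't stop at the first error
        fi
    done
    # report candidates for JIRAs
    ((${#failed[@]})) && printf 'needs a JIRA: %s\n' "${failed[@]}"

Note that sqlline's exit status may not reflect SQL-level errors, so grepping
the per-query logs for exceptions is a reasonable extra check.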