Thanks Mujtaba…

Regards,
John
> On Aug 24, 2016, at 2:29 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>
> That sounds about right for loading CSV directly on a 5-8 node cluster.
> As Gabriel/James mentioned in another thread, CSVBulkLoadTool with a
> pre-split table might offer significantly better performance for large
> datasets.
>
> On Tue, Aug 23, 2016 at 2:17 PM, John Leach <jle...@splicemachine.com>
> wrote:
>
>> So to load a TB of data would take around 2 days? Does that seem right
>> to you?
>>
>> Regards,
>> John
>>
>>> On Aug 23, 2016, at 3:07 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>>
>>> The 600M rows are split across 100 files, so 5 separate psql scripts
>>> ran in parallel on a single machine, loading files 1-20, 21-40, 41-60,
>>> 61-80, and 81-100. Performance suffers because the keys in these files
>>> are in sequence, which leads to hot-spotting of a region server; to
>>> avoid that, you should also try a salted
>>> <https://phoenix.apache.org/salted.html> Phoenix table to get the best
>>> write performance. A higher batch size might offer better performance
>>> too.
>>>
>>> I've also just started a data load from a single psql client, since
>>> loading from a single machine can get bottlenecked on network I/O.
>>> I'll update with its result tomorrow; I don't think its time will be
>>> too far off from loading with multiple clients.
>>>
>>> On Tue, Aug 23, 2016 at 12:49 PM, John Leach <jle...@splicemachine.com>
>>> wrote:
>>>
>>>> Mujtaba,
>>>>
>>>> I'm not following the import process.
>>>>
>>>> Do the 5 parallel psql clients mean that you manually split the data
>>>> into 5 buckets/files/directories and then ran 5 import scripts
>>>> simultaneously?
>>>>
>>>> If we wanted to benchmark import performance, what would be the right
>>>> model for that?
>>>>
>>>> Thanks, this is very helpful...
>>>>
>>>> Regards,
>>>> John
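[For illustration, a minimal sketch of Mujtaba's two suggestions above
combined: a salted (and therefore pre-split) table created over JDBC,
followed by a CsvBulkLoadTool run. The connection string "zk-host", the
bucket count, the table name, and the HDFS input path are placeholders,
and the schema is trimmed to two PK columns plus one value column; this
is not the exact setup behind the numbers below.]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

    public class SaltedBulkLoad {
        public static void main(String[] args) throws Exception {
            // SALT_BUCKETS pre-splits the table and prefixes each row key
            // with a hash byte, so sequential keys no longer hot-spot a
            // single region server.
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zk-host");
                 Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS LINEITEM_SALTED ("
                    + "L_ORDERKEY INTEGER NOT NULL, "
                    + "L_LINENUMBER INTEGER NOT NULL, "
                    + "L_QUANTITY DECIMAL(15,2) "
                    + "CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)) "
                    + "SALT_BUCKETS = 16");
            }
            // CsvBulkLoadTool writes HFiles via MapReduce instead of going
            // through the region server write path, which is where the win
            // comes from for TB-scale loads.
            int rc = ToolRunner.run(new Configuration(), new CsvBulkLoadTool(),
                new String[] {"--table", "LINEITEM_SALTED",
                              "--input", "/data/lineitem",
                              "--zookeeper", "zk-host"});
            System.exit(rc);
        }
    }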
>>>>> On Aug 23, 2016, at 2:28 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>>>>
>>>>> FYI, re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8-node cluster
>>>>> with 64G total/12G HBase heap.
>>>>>
>>>>> *Data Load*
>>>>> * 5.5 hours for 600M rows
>>>>> * Method: direct CSV load using the psql.py script
>>>>> * # client machines: 1
>>>>> * Batch size: 1K
>>>>> * Key order: *sequential*
>>>>> * 5 parallel psql clients
>>>>> * No rows missing after the load
>>>>>
>>>>> *Schema*
>>>>> CREATE TABLE LINEITEM_MULTICF (L_ORDERKEY INTEGER NOT NULL,
>>>>> A.L_PARTKEY INTEGER, B.L_SUPPKEY INTEGER, L_LINENUMBER INTEGER NOT
>>>>> NULL, B.L_QUANTITY DECIMAL(15,2), B.L_EXTENDEDPRICE DECIMAL(15,2),
>>>>> C.L_DISCOUNT DECIMAL(15,2), C.L_TAX DECIMAL(15,2), C.L_RETURNFLAG
>>>>> CHAR(1), C.L_LINESTATUS CHAR(1), C.L_SHIPDATE DATE, C.L_COMMITDATE
>>>>> DATE, C.L_RECEIPTDATE DATE, C.L_SHIPINSTRUCT CHAR(25), C.L_SHIPMODE
>>>>> CHAR(10), C.L_COMMENT VARCHAR(44)
>>>>> CONSTRAINT pk PRIMARY KEY (l_orderkey, l_linenumber));
>>>>>
>>>>> * 100M guidepost width
>>>>> * ~42GB data
>>>>> * Uncompressed/default fast-diff encoded
>>>>>
>>>>> *Performance with default 128 thread pool*
>>>>> select count(*) from lineitem_multicf;
>>>>> +------------+
>>>>> |  COUNT(1)  |
>>>>> +------------+
>>>>> | 600037902  |
>>>>> +------------+
>>>>> 1 row selected (*24.1 seconds*)
>>>>>
>>>>> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
>>>>> sum(l_extendedprice) as sum_base_price,
>>>>> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>>>>> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>>>>> avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
>>>>> avg(l_discount) as avg_disc, count(*) as count_order
>>>>> from lineitem_multicf where l_shipdate <= current_date() - 90
>>>>> group by l_returnflag, l_linestatus
>>>>> order by l_returnflag, l_linestatus;
>>>>> 4 rows selected (*185.2 seconds*)
>>>>>
>>>>> *Data*
>>>>> for i in {1..100}; do curl -O
>>>>> http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz;
>>>>> done
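[For reference, a rough JDBC-only sketch of what the load above amounts
to: UPSERTs with auto-commit off and a commit every 1,000 rows, one
loader thread per file, mimicking the 5-client psql.py setup. The
connection string "zk-host", the two-column subset, and the pipe
delimiter are assumptions; the actual load used psql.py.]

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // One loader per file; start several threads for a parallel load.
    public class UpsertLoader implements Runnable {
        private final String file;
        UpsertLoader(String file) { this.file = file; }

        @Override public void run() {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zk-host");
                 BufferedReader in = Files.newBufferedReader(Paths.get(file));
                 PreparedStatement ps = conn.prepareStatement(
                     "UPSERT INTO LINEITEM_MULTICF (L_ORDERKEY, L_LINENUMBER)"
                         + " VALUES (?, ?)")) {
                conn.setAutoCommit(false);  // buffer mutations client-side
                long rows = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\\|");  // dbgen output is pipe-delimited
                    ps.setInt(1, Integer.parseInt(f[0]));  // L_ORDERKEY
                    ps.setInt(2, Integer.parseInt(f[3]));  // L_LINENUMBER
                    ps.executeUpdate();
                    if (++rows % 1000 == 0) conn.commit();  // 1K batch size
                }
                conn.commit();  // flush the final partial batch
            } catch (Exception e) {
                throw new RuntimeException(file, e);
            }
        }

        public static void main(String[] args) {
            for (String f : args) new Thread(new UpsertLoader(f)).start();
        }
    }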
>>>>>
>>>>> On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com>
>>>>> wrote:
>>>>>
>>>>>> It looks like you guys already have most of the TPC-H queries
>>>>>> running, based on Enis’s talk in Ireland this year. Very cool.
>>>>>>
>>>>>> (Slide 20: Phoenix can execute most of the TPC-H queries!)
>>>>>>
>>>>>> Regards,
>>>>>> John Leach
>>>>>>
>>>>>>> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>>
>>>>>>> It's TPC-DS, not -H, but this is what I was using way back when to
>>>>>>> run perf tests over Phoenix and the query server while I was
>>>>>>> developing on it. The first project generates and loads the data
>>>>>>> via MapReduce; the second wraps jmeter to run queries in parallel.
>>>>>>>
>>>>>>> https://github.com/ndimiduk/tpcds-gen
>>>>>>> https://github.com/ndimiduk/phoenix-performance
>>>>>>>
>>>>>>> There's probably dust and bit-rot to brush off both projects, but
>>>>>>> maybe they'll help someone looking for a starting point?
>>>>>>>
>>>>>>> Apologies, but I haven't had time to see what the speakers have
>>>>>>> shared about their setup.
>>>>>>>
>>>>>>> -n
>>>>>>>
>>>>>>> On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>> Maybe there's such a test harness that already exists for TPC?
>>>>>>>>
>>>>>>>> TPC provides tooling, but it's all proprietary. The generated data
>>>>>>>> can be kept separately (Druid does this, at least:
>>>>>>>> http://druid.io/blog/2014/03/17/benchmarking-druid.html).
>>>>>>>>
>>>>>>>> I'd say there would be a one-time setup: generation of data sets
>>>>>>>> of various sizes, conversion to compressed CSV, and upload to
>>>>>>>> somewhere public (S3?). Not strictly necessary, but it would save
>>>>>>>> everyone a lot of time and hassle not to have to download the TPC
>>>>>>>> data generators and munge the output every time. For this one
>>>>>>>> could use the TPC tools.
>>>>>>>>
>>>>>>>> Then the most sensible avenue, I think, would be implementing new
>>>>>>>> Phoenix integration tests that consume that data and run uniquely
>>>>>>>> tweaked queries (yeah - every datastore vendor must do that with
>>>>>>>> TPC). Phoenix can use hbase-it and get the cluster and chaos
>>>>>>>> tooling, such as it is, for free, but the
>>>>>>>> upsert/initialization/bulk load and query tooling would be all
>>>>>>>> Phoenix based: the CSV loader, the JDBC driver.
>>>>>>>>
>>>>>>>> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor
>>>>>>>> <jamestay...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell
>>>>>>>>> <apurt...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>>> I have a long interest in 'canned' loadings. Interesting ones
>>>>>>>>>>> are hard to come by. If Phoenix ran any or a subset of the
>>>>>>>>>>> TPCs, I'd like to try it.
>>>>>>>>>>
>>>>>>>>>> Likewise.
>>>>>>>>>>
>>>>>>>>>>> But I don't want to be the first to try it. I am not a Phoenix
>>>>>>>>>>> expert.
>>>>>>>>>>
>>>>>>>>>> Same here. I'd just email dev@phoenix with a report that TPC
>>>>>>>>>> query XYZ didn't work, and that would be as far as I could get.
>>>>>>>>>
>>>>>>>>> I don't think the first phase would require Phoenix experience.
>>>>>>>>> It's more about the automation for running each TPC benchmark so
>>>>>>>>> the process is repeatable:
>>>>>>>>> - pulling in the data
>>>>>>>>> - scripting the jobs
>>>>>>>>> - having a test harness they run inside
>>>>>>>>> - identifying the queries that don't work (ideally you wouldn't
>>>>>>>>>   stop at the first error)
>>>>>>>>> - filing JIRAs for these
>>>>>>>>>
>>>>>>>>> The entire framework could be built and tested using standard
>>>>>>>>> JDBC APIs, and then initially run against MySQL or some other
>>>>>>>>> RDBMS before trying it with Phoenix. Maybe such a test harness
>>>>>>>>> already exists for TPC?
>>>>>>>>>
>>>>>>>>> Then I think the next phase would require more Phoenix & HBase
>>>>>>>>> experience:
>>>>>>>>> - tweaking queries where possible, given any limitations in
>>>>>>>>>   Phoenix
>>>>>>>>> - adding missing syntax (or potentially using the Calcite branch,
>>>>>>>>>   which supports more)
>>>>>>>>> - tweaking Phoenix schema declarations to optimize
>>>>>>>>> - tweaking Phoenix & HBase configs to optimize
>>>>>>>>> - determining which secondary indexes to add (though I think
>>>>>>>>>   there's an academic paper on this, I can't seem to find it)
>>>>>>>>>
>>>>>>>>> Both phases would require a significant amount of time and
>>>>>>>>> effort. Each benchmark would likely require unique tweaks.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> James
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>>   - Andy
>>>>>>>>
>>>>>>>> Problems worthy of attack prove their worth by hitting back.
>>>>>>>> - Piet Hein (via Tom White)
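[A minimal sketch, assuming one query per line in a text file, of the
JDBC-only harness James describes: it takes any JDBC URL (MySQL first,
then Phoenix), runs each query in turn, times it, and records failures
instead of stopping at the first error, so the failure list can be
turned into JIRAs. The class name and file layout are made up for
illustration.]

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.List;

    public class TpcQueryRunner {
        public static void main(String[] args) throws Exception {
            String url = args[0];  // e.g. jdbc:mysql://... or jdbc:phoenix:zk-host
            List<String> queries = Files.readAllLines(Paths.get(args[1]));
            try (Connection conn = DriverManager.getConnection(url)) {
                int n = 0;
                for (String sql : queries) {
                    if (sql.trim().isEmpty()) continue;  // skip blank lines
                    n++;
                    long start = System.currentTimeMillis();
                    try (Statement stmt = conn.createStatement();
                         ResultSet rs = stmt.executeQuery(sql)) {
                        long rows = 0;
                        while (rs.next()) rows++;  // drain the result set
                        System.out.printf("query %d: %d rows, %d ms%n",
                            n, rows, System.currentTimeMillis() - start);
                    } catch (SQLException e) {
                        // Don't stop at the first error; log the JIRA
                        // candidate and keep going.
                        System.err.printf("query %d FAILED: %s%n",
                            n, e.getMessage());
                    }
                }
            }
        }
    }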