So to load a TB of data would take around 2 days? Does that seem right to you?
Regards,
John

> On Aug 23, 2016, at 3:07 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>
> Since the 600M rows are split across 100 files, 5 separate psql scripts
> ran in parallel on a single machine, loading files 1-20, 21-40, 41-60,
> 61-80, and 81-100. Performance does get affected because the keys in
> these files are in sequence, which leads to hot-spotting of region
> servers; for this you should also try out a Salted
> <https://phoenix.apache.org/salted.html> Phoenix table to get the best
> write performance. A higher batch size might offer better performance too.
>
> I've also just started a data load from a single psql client, since
> loading from a single machine can get bottlenecked on network I/O. I'll
> update with its result tomorrow; I don't think its time will be too far
> off from loading with multiple clients.
>
>
> On Tue, Aug 23, 2016 at 12:49 PM, John Leach <jle...@splicemachine.com>
> wrote:
>
>> Mujtaba,
>>
>> Not following the import process.
>>
>> Do the 5 parallel psql clients mean that you manually split the data
>> into 5 buckets/files/directories and then ran 5 import scripts
>> simultaneously?
>>
>> If we wanted to benchmark import performance, what would be the right
>> model for that?
>>
>> Thanks, this is very helpful...
>>
>> Regards,
>> John
>>
>>
>>> On Aug 23, 2016, at 2:28 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>>
>>> FYI, re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8-node cluster with
>>> 64G total/12G HBase heap.
>>>
>>> *Data Load*
>>> * 5.5 hours for 600M rows
>>> * Method: Direct CSV load using psql.py script
>>> * # client machines: 1
>>> * Batch size: 1K
>>> * Key order: *Sequential*
>>> * 5 parallel psql clients
>>> * No missing rows due to data load
>>>
>>> *Schema*
>>> CREATE TABLE LINEITEM_MULTICF (
>>>     L_ORDERKEY INTEGER NOT NULL,
>>>     A.L_PARTKEY INTEGER,
>>>     B.L_SUPPKEY INTEGER,
>>>     L_LINENUMBER INTEGER NOT NULL,
>>>     B.L_QUANTITY DECIMAL(15,2),
>>>     B.L_EXTENDEDPRICE DECIMAL(15,2),
>>>     C.L_DISCOUNT DECIMAL(15,2),
>>>     C.L_TAX DECIMAL(15,2),
>>>     C.L_RETURNFLAG CHAR(1),
>>>     C.L_LINESTATUS CHAR(1),
>>>     C.L_SHIPDATE DATE,
>>>     C.L_COMMITDATE DATE,
>>>     C.L_RECEIPTDATE DATE,
>>>     C.L_SHIPINSTRUCT CHAR(25),
>>>     C.L_SHIPMODE CHAR(10),
>>>     C.L_COMMENT VARCHAR(44)
>>>     CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER));
>>>
>>> * 100M guidepost width
>>> * ~42GB data
>>> * Uncompressed/default fast-diff encoded
>>>
>>> *Performance with default 128 threadpool*
>>> select count(*) from lineitem_multicf;
>>> +------------+
>>> |  COUNT(1)  |
>>> +------------+
>>> | 600037902  |
>>> +------------+
>>> 1 row selected (*24.1* seconds)
>>>
>>> select l_returnflag, l_linestatus,
>>>        sum(l_quantity) as sum_qty,
>>>        sum(l_extendedprice) as sum_base_price,
>>>        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>>>        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>>>        avg(l_quantity) as avg_qty,
>>>        avg(l_extendedprice) as avg_price,
>>>        avg(l_discount) as avg_disc,
>>>        count(*) as count_order
>>> from lineitem
>>> where l_shipdate <= current_date() - 90
>>> group by l_returnflag, l_linestatus
>>> order by l_returnflag, l_linestatus;
>>> 4 rows selected (*185.2* seconds)
>>>
>>> *Data*
>>> for i in {1..100}; do
>>>   curl -O http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz
>>> done
>>>
>>>
>>> On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com>
>>> wrote:
>>>
>>>> It looks like you guys already have most of the TPC-H queries running,
>>>> based on Enis's talk in Ireland this year. Very cool.
>>>>
>>>> (Slide 20: Phoenix can execute most of the TPC-H queries!)
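A rough, untested sketch of the client-side pattern behind the salting and
1K batch-size suggestions above. The LINEITEM_SALTED table, the 16-bucket
count, the localhost ZooKeeper quorum, and the generated rows are
illustrative assumptions, not part of the benchmark setup:

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class SaltedBatchLoad {
        public static void main(String[] args) throws Exception {
            // Phoenix JDBC URL: jdbc:phoenix:<zookeeper quorum>
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:localhost")) {
                conn.setAutoCommit(false);

                // Salted variant of the lineitem table (reduced columns for
                // brevity); 16 buckets is only a guess, the right count
                // depends on the number of region servers.
                try (Statement ddl = conn.createStatement()) {
                    ddl.execute(
                        "CREATE TABLE IF NOT EXISTS LINEITEM_SALTED ("
                      + " L_ORDERKEY INTEGER NOT NULL,"
                      + " L_LINENUMBER INTEGER NOT NULL,"
                      + " L_QUANTITY DECIMAL(15,2)"
                      + " CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER))"
                      + " SALT_BUCKETS = 16");
                }

                // Phoenix buffers mutations client-side until commit, so
                // committing every 1000 rows mirrors the 1K batch size above.
                PreparedStatement ps = conn.prepareStatement(
                    "UPSERT INTO LINEITEM_SALTED"
                  + " (L_ORDERKEY, L_LINENUMBER, L_QUANTITY) VALUES (?, ?, ?)");
                int batchSize = 1000;
                for (int row = 1; row <= 100_000; row++) { // stand-in for CSV rows
                    ps.setInt(1, row);
                    ps.setInt(2, 1);
                    ps.setBigDecimal(3, BigDecimal.valueOf(row % 50));
                    ps.executeUpdate();
                    if (row % batchSize == 0) {
                        conn.commit();
                    }
                }
                conn.commit();
            }
        }
    }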
>>>> Regards,
>>>> John Leach
>>>>
>>>>
>>>>> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>
>>>>> It's TPC-DS, not -H, but this is what I was using way back when to run
>>>>> perf tests over Phoenix and the query server while I was developing on
>>>>> it. The first project generates and loads the data via mapreduce, and
>>>>> the second tool wraps up use of jmeter to run queries in parallel.
>>>>>
>>>>> https://github.com/ndimiduk/tpcds-gen
>>>>> https://github.com/ndimiduk/phoenix-performance
>>>>>
>>>>> Probably there's dust and bit-rot to brush off of both projects, but
>>>>> maybe it'll help someone looking for a starting point?
>>>>>
>>>>> Apologies, but I haven't had time to see what the speakers have shared
>>>>> about their setup.
>>>>>
>>>>> -n
>>>>>
>>>>> On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org> wrote:
>>>>>
>>>>>>> Maybe there's such a test harness that already exists for TPC?
>>>>>>
>>>>>> TPC provides tooling but it's all proprietary. The generated data can
>>>>>> be kept separately (Druid does it at least -
>>>>>> http://druid.io/blog/2014/03/17/benchmarking-druid.html).
>>>>>>
>>>>>> I'd say there would be a one-time setup: generation of data sets of
>>>>>> various sizes, conversion to compressed CSV, and upload to somewhere
>>>>>> public (S3?). Not strictly necessary, but it would save everyone a lot
>>>>>> of time and hassle to not have to download the TPC data generators and
>>>>>> munge the output every time. For this one could use the TPC tools.
>>>>>>
>>>>>> Then, the most sensible avenue I think would be implementation of new
>>>>>> Phoenix integration tests that consume that data and run uniquely
>>>>>> tweaked queries (yeah - every datastore vendor must do that with TPC).
>>>>>> Phoenix can use hbase-it and get the cluster and chaos tooling such as
>>>>>> it is for free, but the upsert/initialization/bulk load and query
>>>>>> tooling would be all Phoenix based: the CSV loader, the JDBC driver.
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <jamestay...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <apurt...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>> I have a long interest in 'canned' loadings. Interesting ones are
>>>>>>>>> hard to come by. If Phoenix ran any or a subset of the TPCs, I'd
>>>>>>>>> like to try it.
>>>>>>>>
>>>>>>>> Likewise
>>>>>>>>
>>>>>>>>> But I don't want to be the first to try it. I am not a Phoenix
>>>>>>>>> expert.
>>>>>>>>
>>>>>>>> Same here. I'd just email dev@phoenix with a report that TPC query
>>>>>>>> XYZ didn't work, and that would be as far as I could get.
>>>>>>>
>>>>>>> I don't think the first phase would require Phoenix experience. It's
>>>>>>> more around the automation for running each TPC benchmark so the
>>>>>>> process is repeatable:
>>>>>>> - pulling in the data
>>>>>>> - scripting the jobs
>>>>>>> - having a test harness they run inside
>>>>>>> - identifying the queries that don't work (ideally you wouldn't stop
>>>>>>>   at the first error)
>>>>>>> - filing JIRAs for these
>>>>>>>
>>>>>>> The entire framework could be built and tested using standard JDBC
>>>>>>> APIs, and then initially run using MySQL or some other RDBMS before
>>>>>>> trying it with Phoenix. Maybe there's such a test harness that
>>>>>>> already exists for TPC?
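One possible shape for the phase-one harness described above: plain JDBC,
nothing Phoenix-specific, so it could be run against MySQL first as
suggested. The URL, the hard-coded query list, and the class name are
placeholders, not an agreed design:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class TpcQueryHarness {
        public static void main(String[] args) throws Exception {
            // Any JDBC URL works: MySQL first, Phoenix later.
            String url = args.length > 0 ? args[0] : "jdbc:phoenix:localhost";
            List<String> queries = Arrays.asList(          // placeholder queries
                "select count(*) from lineitem",
                "select l_returnflag from lineitem limit 10");

            Map<String, SQLException> failures = new LinkedHashMap<>();
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement()) {
                for (String sql : queries) {
                    try {
                        long start = System.currentTimeMillis();
                        stmt.execute(sql);                  // don't stop at the first error
                        System.out.printf("OK   (%d ms): %s%n",
                            System.currentTimeMillis() - start, sql);
                    } catch (SQLException e) {
                        failures.put(sql, e);               // candidates for JIRAs
                        System.out.printf("FAIL: %s -> %s%n", sql, e.getMessage());
                    }
                }
            }
            System.out.printf("%d of %d queries failed%n",
                failures.size(), queries.size());
        }
    }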
>>>>>>> Then I think the next phase would require more Phoenix & HBase
>>>>>>> experience:
>>>>>>> - tweaking queries where possible given any limitations in Phoenix
>>>>>>> - adding missing syntax (or potentially using the calcite branch,
>>>>>>>   which supports more)
>>>>>>> - tweaking Phoenix schema declarations to optimize
>>>>>>> - tweaking Phoenix & HBase configs to optimize
>>>>>>> - determining which secondary indexes to add (though I think there's
>>>>>>>   an academic paper on this, I can't seem to find it)
>>>>>>>
>>>>>>> Both phases would require a significant amount of time and effort.
>>>>>>> Each benchmark would likely require unique tweaks.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> James
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>>
>>>>>>   - Andy
>>>>>>
>>>>>> Problems worthy of attack prove their worth by hitting back. - Piet
>>>>>> Hein (via Tom White)
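On the secondary-index bullet in James's second phase, a hedged sketch of
what a covered index plus an EXPLAIN check might look like for the
shipdate-filtered aggregate earlier in the thread. The index name, the
included columns, and the localhost quorum are guesses for illustration
against the LINEITEM_MULTICF schema above, not a tested recommendation:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class IndexTuningSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:localhost");
                 Statement stmt = conn.createStatement()) {
                // Hypothetical covered index for the l_shipdate filter;
                // the column choices are a guess, not a recommendation.
                stmt.execute(
                    "CREATE INDEX IF NOT EXISTS IDX_LINEITEM_SHIPDATE"
                  + " ON LINEITEM_MULTICF (C.L_SHIPDATE)"
                  + " INCLUDE (C.L_RETURNFLAG, C.L_LINESTATUS, B.L_QUANTITY,"
                  + "          B.L_EXTENDEDPRICE, C.L_DISCOUNT, C.L_TAX)");

                // EXPLAIN shows whether the optimizer actually picks the index.
                try (ResultSet rs = stmt.executeQuery(
                        "EXPLAIN select count(*) from LINEITEM_MULTICF"
                      + " where C.L_SHIPDATE <= current_date() - 90")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }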