That sounds about right for loading CSV directly on a 5-8 node cluster. As Gabriel/James mentioned in another thread, CSVBulkLoadTool with a pre-split table might offer significantly better performance for large datasets.
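For reference, the shape of that approach is roughly the following. This is only a sketch: the ZooKeeper quorum, client jar name, input path, and salt bucket count are placeholders, and the CsvBulkLoadTool options are worth re-checking against https://phoenix.apache.org/bulk_dataload.html before relying on them.

# Pre-split the table up front, e.g. by salting it (bucket count here is a guess; tune per cluster).
cat > lineitem.sql <<'EOF'
CREATE TABLE IF NOT EXISTS LINEITEM_MULTICF (
  L_ORDERKEY INTEGER NOT NULL, A.L_PARTKEY INTEGER, B.L_SUPPKEY INTEGER,
  L_LINENUMBER INTEGER NOT NULL, B.L_QUANTITY DECIMAL(15,2),
  B.L_EXTENDEDPRICE DECIMAL(15,2), C.L_DISCOUNT DECIMAL(15,2),
  C.L_TAX DECIMAL(15,2), C.L_RETURNFLAG CHAR(1), C.L_LINESTATUS CHAR(1),
  C.L_SHIPDATE DATE, C.L_COMMITDATE DATE, C.L_RECEIPTDATE DATE,
  C.L_SHIPINSTRUCT CHAR(25), C.L_SHIPMODE CHAR(10), C.L_COMMENT VARCHAR(44)
  CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)
) SALT_BUCKETS = 16;
EOF
./bin/psql.py zk-host:2181 lineitem.sql

# MapReduce bulk load: builds HFiles and hands them to HBase instead of going
# through the region server write path, so sequential keys hurt much less.
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table LINEITEM_MULTICF \
  --input /data/tpch/lineitem \
  --zookeeper zk-host:2181 \
  --delimiter '|'

An alternative to salting is explicit SPLIT ON (...) points in the DDL if you want to keep the natural key order.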
On Tue, Aug 23, 2016 at 2:17 PM, John Leach <jle...@splicemachine.com> wrote:

> So to load a TB of data would take around 2 days? Does that seem right to
> you?
>
> Regards,
> John
>
> On Aug 23, 2016, at 3:07 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>
> > Since the 600M rows are split across 100 files, 5 separate psql scripts
> > running in parallel on a single machine loaded data from files 1-20,
> > 21-40, 41-60, 61-80, 81-100. Performance gets affected because keys are
> > in sequence within these files, which leads to hot-spotting of a region
> > server; for this you should also try out a Salted
> > <https://phoenix.apache.org/salted.html> Phoenix table to get the best
> > write performance. A higher batch size might offer better performance too.
> >
> > I've also just started a data load from a single psql client, since
> > loading from a single machine can get bottlenecked on network I/O. I'll
> > update with its result tomorrow; I don't think its time will be too far
> > off from loading with multiple clients.
> >
> > On Tue, Aug 23, 2016 at 12:49 PM, John Leach <jle...@splicemachine.com> wrote:
> >
> >> Mujtaba,
> >>
> >> Not following the import process.
> >>
> >> Do the 5 parallel psql clients mean that you manually split the data
> >> into 5 buckets/files/directories and then run 5 import scripts
> >> simultaneously?
> >>
> >> If we wanted to benchmark import performance, what would be the right
> >> model for that?
> >>
> >> Thanks, this is very helpful...
> >>
> >> Regards,
> >> John
> >>
> >>> On Aug 23, 2016, at 2:28 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
> >>>
> >>> FYI re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8 node cluster with
> >>> 64G total/12G HBase heap.
> >>>
> >>> *Data Load*
> >>> * 5.5 hours for 600M rows
> >>> * Method: direct CSV load using the psql.py script
> >>> * # client machines: 1
> >>> * Batch size: 1K
> >>> * Key order: *Sequential*
> >>> * 5 parallel psql clients
> >>> * No missing rows after data load
> >>>
> >>> *Schema*
> >>> CREATE TABLE LINEITEM_MULTICF (L_ORDERKEY INTEGER not null, A.L_PARTKEY
> >>> INTEGER, B.L_SUPPKEY INTEGER, L_LINENUMBER INTEGER not null,
> >>> B.L_QUANTITY DECIMAL(15,2), B.L_EXTENDEDPRICE DECIMAL(15,2), C.L_DISCOUNT
> >>> DECIMAL(15,2), C.L_TAX DECIMAL(15,2), C.L_RETURNFLAG CHAR(1),
> >>> C.L_LINESTATUS CHAR(1), C.L_SHIPDATE DATE, C.L_COMMITDATE DATE,
> >>> C.L_RECEIPTDATE DATE, C.L_SHIPINSTRUCT CHAR(25), C.L_SHIPMODE CHAR(10),
> >>> C.L_COMMENT VARCHAR(44) constraint pk primary key(l_orderkey,
> >>> l_linenumber));
> >>>
> >>> * 100M guidepost width
> >>> * ~42GB data
> >>> * Uncompressed/default fast-diff encoded
> >>>
> >>> *Performance with default 128 threadpool*
> >>> select count(*) from lineitem_multicf;
> >>> +------------+
> >>> |  COUNT(1)  |
> >>> +------------+
> >>> | 600037902  |
> >>> +------------+
> >>> 1 row selected (*24.1 seconds*)
> >>>
> >>> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
> >>> sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 -
> >>> l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) *
> >>> (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty,
> >>> avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*)
> >>> as count_order from lineitem where l_shipdate <= current_date() - 90
> >>> group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;
> >>> 4 rows selected (*185.2 seconds*)
> >>>
> >>> *Data*
> >>> do curl -O http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz ; done
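To make that load step reproducible end to end, it could be scripted along these lines. A sketch only: the ZooKeeper quorum is a placeholder, the rename relies on psql.py treating *.csv files as data, and the exact form of its delimiter flag is an assumption worth checking against bin/psql.py's usage output first.

ZK=zk-host:2181
for i in $(seq 1 100); do
  curl -sO http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz
  gunzip -f lineitem.tbl.$i.gz
  sed -i 's/|$//' lineitem.tbl.$i            # TPC-H rows carry a trailing '|'; drop it
  mv lineitem.tbl.$i lineitem.$i.csv         # psql.py loads files with a .csv extension
  ./bin/psql.py -t LINEITEM_MULTICF -d '|' "$ZK" lineitem.$i.csv   # -d: field delimiter (assumed flag form)
done

The 5-client variant used in the thread is just five copies of this loop over disjoint ranges of $i.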
> >>> On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com> wrote:
> >>>
> >>>> It looks like you guys already have most of the TPC-H queries running
> >>>> based on Enis's talk in Ireland this year. Very cool.
> >>>>
> >>>> (Slide 20: Phoenix can execute most of the TPC-H queries!)
> >>>>
> >>>> Regards,
> >>>> John Leach
> >>>>
> >>>>> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> >>>>>
> >>>>> It's TPC-DS, not -H, but this is what I was using way back when to run
> >>>>> perf tests over Phoenix and the query server while I was developing on
> >>>>> it. The first project generates and loads the data via MapReduce, and
> >>>>> the second tool wraps up use of jmeter to run queries in parallel.
> >>>>>
> >>>>> https://github.com/ndimiduk/tpcds-gen
> >>>>> https://github.com/ndimiduk/phoenix-performance
> >>>>>
> >>>>> Probably there's dust and bit-rot to brush off of both projects, but
> >>>>> maybe it'll help someone looking for a starting point?
> >>>>>
> >>>>> Apologies, but I haven't had time to see what the speakers have shared
> >>>>> about their setup.
> >>>>>
> >>>>> -n
> >>>>>
> >>>>> On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org> wrote:
> >>>>>
> >>>>>> > Maybe there's such a test harness that already exists for TPC?
> >>>>>>
> >>>>>> TPC provides tooling but it's all proprietary. The generated data can
> >>>>>> be kept separately (Druid does it at least -
> >>>>>> http://druid.io/blog/2014/03/17/benchmarking-druid.html).
> >>>>>>
> >>>>>> I'd say there would be a one-time setup: generation of data sets of
> >>>>>> various sizes, conversion to compressed CSV, and upload to somewhere
> >>>>>> public (S3?). Not strictly necessary, but it would save everyone a lot
> >>>>>> of time and hassle to not have to download the TPC data generators and
> >>>>>> munge the output every time. For this one could use the TPC tools.
> >>>>>>
> >>>>>> Then, the most sensible avenue I think would be implementation of new
> >>>>>> Phoenix integration tests that consume that data and run uniquely
> >>>>>> tweaked queries (yeah - every datastore vendor must do that with TPC).
> >>>>>> Phoenix can use hbase-it and get the cluster and chaos tooling such as
> >>>>>> it is for free, but the upsert/initialization/bulk load and query
> >>>>>> tooling would be all Phoenix based: the CSV loader, the JDBC driver.
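For that one-time setup, the generate/compress/upload step might look something like this. Again a sketch: it assumes the stock TPC-H dbgen binary (flags per its README: -s scale, -T L for lineitem only, -C/-S for chunked generation) and a placeholder S3 bucket; output file naming can vary between dbgen versions.

SCALE=100      # scale factor, roughly GB of raw data
CHUNKS=100
for i in $(seq 1 $CHUNKS); do
  ./dbgen -f -s $SCALE -T L -C $CHUNKS -S $i    # emits lineitem.tbl.$i for this chunk
  gzip -f lineitem.tbl.$i
  aws s3 cp lineitem.tbl.$i.gz s3://my-tpc-data/tpch/$SCALE/lineitem.tbl.$i.gz
done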
> >>>>>> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <jamestay...@apache.org> wrote:
> >>>>>>
> >>>>>>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <apurt...@apache.org> wrote:
> >>>>>>>
> >>>>>>>>> I have a long interest in 'canned' loadings. Interesting ones are
> >>>>>>>>> hard to come by. If Phoenix ran any or a subset of TPCs, I'd like
> >>>>>>>>> to try it.
> >>>>>>>>
> >>>>>>>> Likewise
> >>>>>>>>
> >>>>>>>>> But I don't want to be the first to try it. I am not a Phoenix expert.
> >>>>>>>>
> >>>>>>>> Same here, I'd just email dev@phoenix with a report that TPC query
> >>>>>>>> XYZ didn't work and that would be as far as I could get.
> >>>>>>>
> >>>>>>> I don't think the first phase would require Phoenix experience. It's
> >>>>>>> more around the automation for running each TPC benchmark so the
> >>>>>>> process is repeatable:
> >>>>>>> - pulling in the data
> >>>>>>> - scripting the jobs
> >>>>>>> - having a test harness they run inside
> >>>>>>> - identifying the queries that don't work (ideally you wouldn't stop
> >>>>>>>   at the first error)
> >>>>>>> - filing JIRAs for these
> >>>>>>>
> >>>>>>> The entire framework could be built and tested using standard JDBC
> >>>>>>> APIs, and then initially run using MySQL or some other RDBMS before
> >>>>>>> trying it with Phoenix. Maybe there's such a test harness that
> >>>>>>> already exists for TPC?
> >>>>>>>
> >>>>>>> Then I think the next phase would require more Phoenix & HBase
> >>>>>>> experience:
> >>>>>>> - tweaking queries where possible given any limitations in Phoenix
> >>>>>>> - adding missing syntax (or potentially using the calcite branch
> >>>>>>>   which supports more)
> >>>>>>> - tweaking Phoenix schema declarations to optimize
> >>>>>>> - tweaking Phoenix & HBase configs to optimize
> >>>>>>> - determining which secondary indexes to add (though I think there's
> >>>>>>>   an academic paper on this, I can't seem to find it)
> >>>>>>>
> >>>>>>> Both phases would require a significant amount of time and effort.
> >>>>>>> Each benchmark would likely require unique tweaks.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> James
> >>>>>>
> >>>>>> --
> >>>>>> Best regards,
> >>>>>>
> >>>>>>    - Andy
> >>>>>>
> >>>>>> Problems worthy of attack prove their worth by hitting back. - Piet
> >>>>>> Hein (via Tom White)
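To make that first phase concrete, the "don't stop at the first error" part could start as small as a driver that feeds each query file to sqlline.py (itself just a thin JDBC client) and records failures for JIRA filing. A sketch under obvious assumptions: the query file layout, ZooKeeper quorum, and grep-based error detection are placeholders, and a real harness would use the JDBC APIs directly so the same code can be pointed at MySQL first, as suggested above.

ZK=zk-host:2181
mkdir -p results
for q in queries/q*.sql; do                     # one TPC query per file (assumed layout)
  name=$(basename "$q" .sql)
  ./bin/sqlline.py "$ZK" "$q" > "results/$name.out" 2>&1
  # don't rely on the exit code alone; scan the captured output for errors too
  if grep -qiE 'error|exception' "results/$name.out"; then
    echo "FAIL $name (see results/$name.out)"   # candidate JIRA
  else
    echo "PASS $name"
  fi
done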