Hi Ralph,

Regarding the upsert query in the logs: it should be *Phoenix Custom Upsert Statement:* in your case, since you have explicitly specified the fields in STORE. Is it possible to give it a try with a smaller set of records, say 8k, to see the behavior?
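One quick way to run that 8k test is to LIMIT the relation before the STORE. A minimal sketch, reusing the relation, parameter, and column names from Ralph's script below (the column list is abbreviated here; the full list from the original STORE would be used):

```pig
-- Take only the first 8,000 records and store those, so the Phoenix
-- row count can be checked against a small, known input size.
Z_small = LIMIT Z 8000;

STORE Z_small INTO 'hbase://$table_name/FILE_NAME,REC_NUM,...,CATEGORY'
    USING org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
```

If the count in Phoenix matches 8,000 exactly, the loss is likely volume-related (e.g. batching or key collisions at scale) rather than a column-mapping problem.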
Regards,
Ravi

On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <[email protected]> wrote:

> Thanks for the quick response. Here is what I have below:
>
> ========================================
> Pig script:
> -------------------------------
> register $phoenix_jar;
>
> Z = load '$data' USING PigStorage(',') as (
> file_name,
> rec_num,
> epoch_time,
> timet,
> site,
> proto,
> saddr,
> daddr,
> sport,
> dport,
> mf,
> cf,
> dur,
> sdata,
> ddata,
> sbyte,
> dbyte,
> spkt,
> dpkt,
> siopt,
> diopt,
> stopt,
> dtopt,
> sflags,
> dflags,
> flags,
> sfseq,
> dfseq,
> slseq,
> dlseq,
> category);
>
> STORE Z into
> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
>
> =========================
>
> I cannot find the upsert statement you are referring to in either the MR
> logs or the Pig output, but I do have the output below – Pig reports that
> it stored the expected number of records:
>
> Input(s):
> Successfully read 42871627 records (1479463169 bytes) from:
> "/data/incoming/201501124931/SAMPLE"
>
> Output(s):
> Successfully stored 42871627 records in:
> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>
> Count command:
> select count(1) from TEST;
>
> __________________________________________________
> *Ralph Perko*
> Pacific Northwest National Laboratory
> (509) 375-2272
> [email protected]
>
> From: Ravi Kiran <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, February 2, 2015 at 11:01 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> That's definitely a cause for worry.
> Can you please share the UPSERT query being built by Phoenix? You should
> see it in the logs with an entry "*Phoenix Generic Upsert Statement: *...".
> Also, what do the MapReduce counters say for the job? If possible, can
> you share the Pig script, as sometimes the order of columns in the STORE
> command impacts the result.
>
> Regards,
> Ravi
>
>
> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <[email protected]>
> wrote:
>
>> Hi, I’ve run into a peculiar issue between loading data using Pig vs.
>> the CsvBulkLoadTool. I have 42M CSV records to load, and I am comparing
>> the performance.
>>
>> In both cases the MR jobs are successful, and there are no errors.
>> In both cases the MR job counters state there are 42M map input and
>> output records.
>>
>> However, when I run a count on the table after the jobs are complete,
>> something is terribly off.
>> After the bulk load, select count shows all 42M recs in Phoenix, as
>> expected.
>> After the Pig load there are only 3M recs in Phoenix – not even close.
>>
>> I have no errors to send. I have run the same test multiple times and
>> gotten the same results. The Pig script is not doing any
>> transformations. It is a simple LOAD and STORE.
>> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.
>> 4.2.3-SNAPSHOT is running on the region servers.
>>
>> Thanks,
>> Ralph
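For readers comparing the two paths: the bulk-load side Ralph describes is typically launched along these lines (a sketch only; the client jar name, input path, and ZooKeeper quorum host are assumptions to adapt to your cluster):

```shell
# Run the Phoenix CSV bulk loader over the same input directory.
# It writes HFiles via MapReduce and hands them to HBase directly,
# bypassing the UPSERT path that PhoenixHBaseStorage uses.
HADOOP_CLASSPATH=$(hbase classpath) hadoop jar phoenix-4.2.2-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table TEST \
    --input /data/incoming/201501124931/SAMPLE \
    --zookeeper zk1.example.com:2181
```

Because the bulk loader produces the expected 42M rows while the Pig load does not, comparing the row keys emitted by each path on a small sample can reveal whether rows are colliding on the primary key.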
