Hi Ralph,

Is it possible to share the CREATE TABLE command? I would like to reproduce
the error on my side with a sample dataset that uses your specific data types.
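[Editor's note: to help anyone reproducing this, here is a minimal sketch for generating a sample CSV in the shape Ralph's data appears to have. The column names are taken from the Pig script quoted later in this thread; every value is invented placeholder data, and the real column types are unknown here.]

```python
import csv
import io

# Column list copied from the Pig LOAD statement quoted below in this thread.
COLUMNS = ["file_name", "rec_num", "epoch_time", "timet", "site", "proto",
           "saddr", "daddr", "sport", "dport", "mf", "cf", "dur", "sdata",
           "ddata", "sbyte", "dbyte", "spkt", "dpkt", "siopt", "diopt",
           "stopt", "dtopt", "sflags", "dflags", "flags", "sfseq", "dfseq",
           "slseq", "dlseq", "category"]

def make_sample_csv(n_rows):
    """Return CSV text with n_rows synthetic records (all values invented)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i in range(n_rows):
        # Fill every column with a distinct placeholder; the real data types
        # are exactly what this sketch cannot know, hence the CREATE TABLE ask.
        writer.writerow(["sample_%d_%s" % (i, c) for c in COLUMNS])
    return buf.getvalue()

print(len(make_sample_csv(100).splitlines()))  # 100 data rows, no header
```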
Regards,
Ravi

On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J <[email protected]> wrote:

> Ravi,
>
> Thanks for the help - I am sorry, but I cannot find the upsert statement.
> Attached are the logs and output. I specify the columns because I get
> errors if I do not.
>
> I ran a test on 10K records. Pig states it processed 10K records.
> Select count() says 9030. I analyzed the 10K data in Excel and there are
> no duplicates.
>
> Thanks!
> Ralph
>
> __________________________________________________
> *Ralph Perko*
> Pacific Northwest National Laboratory
> (509) 375-2272
> [email protected]
>
> From: Ravi Kiran <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, February 2, 2015 at 12:23 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> Regarding the upsert query in the logs, it should read *Phoenix Custom
> Upsert Statement:* since you have explicitly specified the fields in
> STORE. Is it possible to give it a try with a smaller set of records,
> say 8K, to see the behavior?
>
> Regards,
> Ravi
>
> On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <[email protected]>
> wrote:
>
>> Thanks for the quick response.
>> Here is what I have below:
>>
>> ========================================
>> Pig script:
>> -------------------------------
>> register $phoenix_jar;
>>
>> Z = load '$data' USING PigStorage(',') as (
>>     file_name,
>>     rec_num,
>>     epoch_time,
>>     timet,
>>     site,
>>     proto,
>>     saddr,
>>     daddr,
>>     sport,
>>     dport,
>>     mf,
>>     cf,
>>     dur,
>>     sdata,
>>     ddata,
>>     sbyte,
>>     dbyte,
>>     spkt,
>>     dpkt,
>>     siopt,
>>     diopt,
>>     stopt,
>>     dtopt,
>>     sflags,
>>     dflags,
>>     flags,
>>     sfseq,
>>     dfseq,
>>     slseq,
>>     dlseq,
>>     category);
>>
>> STORE Z into
>> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
>> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
>>
>> =========================
>>
>> I cannot find the upsert statement you are referring to in either the
>> MR logs or the Pig output, but I do have this below - Pig thinks it
>> output the correct number of records:
>>
>> Input(s):
>> Successfully read 42871627 records (1479463169 bytes) from:
>> "/data/incoming/201501124931/SAMPLE"
>>
>> Output(s):
>> Successfully stored 42871627 records in:
>> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>>
>> Count command:
>> select count(1) from TEST;
>>
>> __________________________________________________
>> *Ralph Perko*
>> Pacific Northwest National Laboratory
>> (509) 375-2272
>> [email protected]
>>
>> From: Ravi Kiran <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday, February 2, 2015 at 11:01 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Hi Ralph,
>>
>> That's definitely a cause for worry.
>> Can you please share the UPSERT
>> query being built by Phoenix? You should see it in the logs with an
>> entry "*Phoenix Generic Upsert Statement:* ..."
>> Also, what do the MapReduce counters say for the job? If possible, can
>> you share the Pig script, as sometimes the order of columns in the
>> STORE command has an impact.
>>
>> Regards,
>> Ravi
>>
>>
>> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <[email protected]>
>> wrote:
>>
>>> Hi, I've run into a peculiar issue loading data using Pig vs. the
>>> CsvBulkLoadTool. I have 42M CSV records to load and I am comparing
>>> the performance.
>>>
>>> In both cases the MR jobs are successful, and there are no errors.
>>> In both cases the MR job counters state there are 42M map input and
>>> output records.
>>>
>>> However, when I run a count on the table after the jobs are complete,
>>> something is terribly off.
>>> After the bulk load, select count shows all 42M recs in Phoenix, as
>>> expected.
>>> After the Pig load there are only 3M recs in Phoenix - not even close.
>>>
>>> I have no errors to send. I have run the same test multiple times and
>>> gotten the same results. The Pig script is not doing any
>>> transformations; it is a simple LOAD and STORE.
>>> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.
>>> 4.2.3-SNAPSHOT is running on the region servers.
>>>
>>> Thanks,
>>> Ralph
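[Editor's note: one mechanism that can produce exactly this symptom, and the reason the thread keeps circling back to duplicates, is that rows written via UPSERT overwrite any earlier row with the same primary key, so the table ends up with one row per distinct key even though the writer reports every input record as "stored". The sketch below is a toy model of that effect, not a diagnosis of Ralph's job; which CSV columns actually form the primary key of his table is unknown here, so the key function is an assumption.]

```python
def upsert_all(records, key_fn):
    """Toy model of upsert semantics: a later record with the same key
    overwrites an earlier one, much as repeated UPSERTs against a table
    keyed on key_fn(record) would."""
    table = {}
    for rec in records:
        table[key_fn(rec)] = rec  # same key -> silent overwrite, no error
    return table

# 6 input records, but only 4 distinct (file, rec_num) keys -> the table
# ends with 4 rows even though the writer "successfully stored" 6 records.
records = [("f1", 1, "a"), ("f1", 2, "b"), ("f1", 1, "c"),
           ("f2", 1, "d"), ("f2", 2, "e"), ("f2", 2, "f")]
table = upsert_all(records, key_fn=lambda r: (r[0], r[1]))
print(len(records), len(table))  # 6 4
```

Note the check has to be made on the key columns only: a dataset can be free of fully duplicated rows (as an Excel scan would show) and still collide on the subset of columns that make up the primary key.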
