Glad to hear it, Ralph. Still sounds like there's a bug here (or at a
minimum a usability issue), but not a showstopper for the 4.3 release.
Would you mind filing a JIRA for it?

Thanks,
James
On Tue, Feb 3, 2015 at 4:31 PM, Ravi Kiran <[email protected]> wrote:
> Hi Ralph,
>
> Glad it is working!!
>
> Regards
> Ravi
>
> On Tue, Feb 3, 2015 at 3:29 PM, Perko, Ralph J <[email protected]> wrote:
>>
>> I have solved the problem. This was a mystery because the same data
>> loaded into the same schema gave conflicting counts depending on the
>> load technique. While the data itself had no duplicate keys, the
>> behavior suggested something was wrong with the keys (for instance, the
>> MR input/output counters showed the correct record count for both load
>> techniques). I confirmed this by creating a Pig UDF that generated a
>> UUID for each row as the PK. With that in place, every row appeared as
>> expected and I got the correct count. But I couldn't figure out why the
>> data itself would behave differently, because it was also unique. My
>> Pig script could hardly be simpler, with no transformations; it is a
>> simple LOAD and STORE. This ended up being the issue!
>>
>> Solution:
>> Assign the correct Pig data type to the PK values rather than letting
>> Pig figure it out. I am not sure what the exact underlying issue is,
>> but this fixed it (perhaps when Pig coerced the values to the data type
>> it thought best, it munged them somehow).
>>
>> Changes to the Pig script from below:
>>
>> Z = load '$data' USING PigStorage(',') as (
>>     file_name:chararray,
>>     rec_num:int,
>>
>> Thanks for the help
>> Ralph
>>
>> From: <Ciureanu>, "Constantin (GfK)" <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, February 3, 2015 at 1:52 AM
>> To: "[email protected]" <[email protected]>
>> Subject: RE: Pig vs Bulk Load record count
>>
>> Hello Ralph,
>>
>> Try to check whether the Pig script produces keys that overlap (that
>> would explain the reduced number of rows).
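[A fuller sketch of Ralph's fix, for reference. Only file_name:chararray
and rec_num:int are confirmed in the thread; the types shown for the
other fields are assumptions for illustration and would need to match the
Phoenix table's column types.]

```pig
-- Declare explicit types in the LOAD schema so Pig does not coerce the
-- PK fields to a type of its own choosing before handing them to Phoenix.
Z = load '$data' USING PigStorage(',') as (
    file_name:chararray,   -- confirmed in the thread
    rec_num:int,           -- confirmed in the thread
    epoch_time:long,       -- assumed
    timet:chararray,       -- assumed
    site:chararray,        -- assumed
    -- ... remaining fields typed to match the Phoenix table ...
    category:chararray);   -- assumed
```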
>>
>> Good luck,
>> Constantin
>>
>> From: Ravi Kiran [mailto:[email protected]]
>> Sent: Tuesday, February 03, 2015 2:42 AM
>> To: [email protected]
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Thanks Ralph. I will try to reproduce this on my end with a sample data
>> set and get back to you.
>>
>> Regards
>> Ravi
>>
>> On Mon, Feb 2, 2015 at 5:27 PM, Perko, Ralph J <[email protected]>
>> wrote:
>>
>> Ravi,
>>
>> The create statement is attached. You will see some additional fields I
>> excluded from the first email.
>>
>> Thanks!
>> Ralph
>>
>> ________________________________
>> From: Ravi Kiran [[email protected]]
>> Sent: Monday, February 02, 2015 5:03 PM
>> To: [email protected]
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Hi Ralph,
>> Is it possible to share the CREATE TABLE command? I would like to
>> reproduce the error on my side with a sample dataset using your
>> specific data types.
>>
>> Regards
>> Ravi
>>
>> On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J <[email protected]>
>> wrote:
>>
>> Ravi,
>>
>> Thanks for the help - I am sorry, I am not finding the upsert
>> statement. Attached are the logs and output. I specify the columns
>> because I get errors if I do not.
>>
>> I ran a test on 10K records. Pig states it processed 10K records, but
>> select count(1) says 9030. I analyzed the 10K records in Excel and
>> there are no duplicates.
>>
>> Thanks!
>> Ralph
>>
>> __________________________________________________
>> Ralph Perko
>> Pacific Northwest National Laboratory
>> (509) 375-2272
>> [email protected]
>>
>> From: Ravi Kiran <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday, February 2, 2015 at 12:23 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Hi Ralph,
>> Regarding the upsert query in the logs, it should be "Phoenix Custom
>> Upsert Statement:", since you have explicitly specified the fields in
>> STORE. Is it possible to give it a try with a smaller set of records,
>> say 8K, to see the behavior?
>>
>> Regards
>> Ravi
>>
>> On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <[email protected]>
>> wrote:
>>
>> Thanks for the quick response. Here is what I have below:
>>
>> ========================================
>> Pig script:
>> -------------------------------
>> register $phoenix_jar;
>>
>> Z = load '$data' USING PigStorage(',') as (
>>     file_name,
>>     rec_num,
>>     epoch_time,
>>     timet,
>>     site,
>>     proto,
>>     saddr,
>>     daddr,
>>     sport,
>>     dport,
>>     mf,
>>     cf,
>>     dur,
>>     sdata,
>>     ddata,
>>     sbyte,
>>     dbyte,
>>     spkt,
>>     dpkt,
>>     siopt,
>>     diopt,
>>     stopt,
>>     dtopt,
>>     sflags,
>>     dflags,
>>     flags,
>>     sfseq,
>>     dfseq,
>>     slseq,
>>     dlseq,
>>     category);
>>
>> STORE Z into
>> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
>> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper','-batchSize
>> 5000');
>>
>> =========================
>>
>> I cannot find the upsert statement you are referring to in either the
>> MR logs or the Pig output, but I do have this below. Pig thinks it output
>> the correct number of records:
>>
>> Input(s):
>> Successfully read 42871627 records (1479463169 bytes) from:
>> "/data/incoming/201501124931/SAMPLE"
>>
>> Output(s):
>> Successfully stored 42871627 records in:
>> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>>
>> Count command:
>> select count(1) from TEST;
>>
>> __________________________________________________
>> Ralph Perko
>> Pacific Northwest National Laboratory
>> (509) 375-2272
>> [email protected]
>>
>> From: Ravi Kiran <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday, February 2, 2015 at 11:01 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Hi Ralph,
>> That's definitely a cause for worry. Can you please share the UPSERT
>> query being built by Phoenix? You should see it in the logs with an
>> entry "Phoenix Generic Upsert Statement: ..."
>> Also, what do the MapReduce counters say for the job? If possible, can
>> you share the Pig script, as sometimes the order of columns in the
>> STORE command has an impact.
>>
>> Regards
>> Ravi
>>
>> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <[email protected]>
>> wrote:
>>
>> Hi, I've run into a peculiar issue between loading data using Pig vs.
>> the CsvBulkLoadTool. I have 42M CSV records to load and I am comparing
>> the performance.
>>
>> In both cases the MR jobs are successful, and there are no errors.
>> In both cases the MR job counters state there are 42M map input and
>> output records.
>>
>> However, when I run a count on the table when the jobs are complete,
>> something is terribly off.
>> After the bulk load, select count shows all 42M recs in Phoenix, as
>> expected.
>> After the Pig load there are only 3M recs in Phoenix, not even close.
>>
>> I have no errors to send. I have run the same test multiple times and
>> gotten the same results. The Pig script is not doing any
>> transformations; it is a simple LOAD and STORE.
>> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.
>> 4.2.3-SNAPSHOT is running on the region servers.
>>
>> Thanks,
>> Ralph
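[For reference, Constantin's suggestion earlier in the thread, checking
whether the script produces overlapping keys, can be sketched in Pig as
below. This is an illustration, not what was actually run; the key
columns are taken from the schema in the thread, and the remaining
columns are omitted for brevity.]

```pig
-- Count how many rows share each candidate primary key; any group with
-- a count above 1 would be collapsed into a single row by Phoenix UPSERT.
Z = load '$data' USING PigStorage(',') as (
    file_name:chararray,
    rec_num:int);
keyed   = GROUP Z BY (file_name, rec_num);
counted = FOREACH keyed GENERATE group, COUNT(Z) AS n;
dups    = FILTER counted BY n > 1;
DUMP dups;   -- empty output means no overlapping keys
```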
