Glad to hear it, Ralph. Still sounds like there's a bug here (or at a
minimum a usability issue), but not a showstopper for the 4.3 release.
Would you mind filing a JIRA for it?

Thanks,
James
On Tue, Feb 3, 2015 at 4:31 PM, Ravi Kiran <[email protected]> wrote:
> Hi Ralph,
>
> Glad it is working!!
>
> Regards
> Ravi
>
> On Tue, Feb 3, 2015 at 3:29 PM, Perko, Ralph J <[email protected]> wrote:
>>
>> I have solved the problem. This was a mystery because the same data
>> loaded into the same schema gave conflicting counts depending on the
>> load technique. While the data itself had no duplicate keys, the
>> behavior suggested something was wrong with the keys (for instance, the
>> MR input/output counters showed the correct record count for both load
>> techniques). I confirmed this by creating a Pig UDF that generated a
>> UUID for each row as the PK. With that in place, every row appeared as
>> expected and I got the correct count. But I couldn't figure out why the
>> data itself would behave differently, because it was also unique. My
>> Pig script could hardly be simpler, with no transformations; it is a
>> simple LOAD and STORE. This ended up being the issue!
>>
>> Solution:
>> Assign the correct Pig data type to the PK values rather than letting
>> Pig figure it out. I am not sure what the exact underlying issue is,
>> but this fixed it (perhaps when Pig coerced the values to the data type
>> it thought best, it munged them somehow).
>>
>> Changes to the Pig script from below:
>>
>> Z = load '$data' USING PigStorage(',') as (
>>     file_name:chararray,
>>     rec_num:int,
>>
>> Thanks for the help
>> Ralph
>>
>> From: <Ciureanu>, "Constantin (GfK)" <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, February 3, 2015 at 1:52 AM
>> To: "[email protected]" <[email protected]>
>> Subject: RE: Pig vs Bulk Load record count
>>
>> Hello Ralph,
>>
>> Try to check whether the Pig script produces keys that overlap (that
>> would explain the reduced number of rows).
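[A fuller sketch of Ralph's fix, for reference. Only file_name:chararray
and rec_num:int are confirmed in the thread; the types shown for the
other fields are assumptions for illustration and would need to match the
Phoenix table's column types.]

```pig
-- Declare explicit types in the LOAD schema so Pig does not coerce the
-- PK fields to a type of its own choosing before handing them to Phoenix.
Z = load '$data' USING PigStorage(',') as (
    file_name:chararray,   -- confirmed in the thread
    rec_num:int,           -- confirmed in the thread
    epoch_time:long,       -- assumed
    timet:chararray,       -- assumed
    site:chararray,        -- assumed
    -- ... remaining fields typed to match the Phoenix table ...
    category:chararray);   -- assumed
```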
>>
>> Good luck,
>> Constantin
>>
>> From: Ravi Kiran [mailto:[email protected]]
>> Sent: Tuesday, February 03, 2015 2:42 AM
>> To: [email protected]
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Thanks Ralph. I will try to reproduce this on my end with a sample data
>> set and get back to you.
>>
>> Regards
>> Ravi
>>
>> On Mon, Feb 2, 2015 at 5:27 PM, Perko, Ralph J <[email protected]>
>> wrote:
>>
>> Ravi,
>>
>> The create statement is attached. You will see some additional fields I
>> excluded from the first email.
>>
>> Thanks!
>> Ralph
>>
>> ________________________________
>> From: Ravi Kiran [[email protected]]
>> Sent: Monday, February 02, 2015 5:03 PM
>> To: [email protected]
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Hi Ralph,
>> Is it possible to share the CREATE TABLE command? I would like to
>> reproduce the error on my side with a sample dataset using your
>> specific data types.
>>
>> Regards
>> Ravi
>>
>> On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J <[email protected]>
>> wrote:
>>
>> Ravi,
>>
>> Thanks for the help - I am sorry, I am not finding the upsert
>> statement. Attached are the logs and output. I specify the columns
>> because I get errors if I do not.
>>
>> I ran a test on 10K records. Pig states it processed 10K records, but
>> select count(1) says 9030. I analyzed the 10K records in Excel and
>> there are no duplicates.
>>
>> Thanks!
>> Ralph
>>
>> __________________________________________________
>> Ralph Perko
>> Pacific Northwest National Laboratory
>> (509) 375-2272
>> [email protected]
>>
>> From: Ravi Kiran <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday, February 2, 2015 at 12:23 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Hi Ralph,
>> Regarding the upsert query in the logs, it should be "Phoenix Custom
>> Upsert Statement:", since you have explicitly specified the fields in
>> STORE. Is it possible to give it a try with a smaller set of records,
>> say 8K, to see the behavior?
>>
>> Regards
>> Ravi
>>
>> On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <[email protected]>
>> wrote:
>>
>> Thanks for the quick response. Here is what I have below:
>>
>> ========================================
>> Pig script:
>> -------------------------------
>> register $phoenix_jar;
>>
>> Z = load '$data' USING PigStorage(',') as (
>>     file_name,
>>     rec_num,
>>     epoch_time,
>>     timet,
>>     site,
>>     proto,
>>     saddr,
>>     daddr,
>>     sport,
>>     dport,
>>     mf,
>>     cf,
>>     dur,
>>     sdata,
>>     ddata,
>>     sbyte,
>>     dbyte,
>>     spkt,
>>     dpkt,
>>     siopt,
>>     diopt,
>>     stopt,
>>     dtopt,
>>     sflags,
>>     dflags,
>>     flags,
>>     sfseq,
>>     dfseq,
>>     slseq,
>>     dlseq,
>>     category);
>>
>> STORE Z into
>> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
>> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper','-batchSize
>> 5000');
>>
>> =========================
>>
>> I cannot find the upsert statement you are referring to in either the
>> MR logs or the Pig output, but I do have this below. Pig thinks it output
>> the correct number of records:
>>
>> Input(s):
>> Successfully read 42871627 records (1479463169 bytes) from:
>> "/data/incoming/201501124931/SAMPLE"
>>
>> Output(s):
>> Successfully stored 42871627 records in:
>> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>>
>> Count command:
>> select count(1) from TEST;
>>
>> __________________________________________________
>> Ralph Perko
>> Pacific Northwest National Laboratory
>> (509) 375-2272
>> [email protected]
>>
>> From: Ravi Kiran <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday, February 2, 2015 at 11:01 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Hi Ralph,
>> That's definitely a cause for worry. Can you please share the UPSERT
>> query being built by Phoenix? You should see it in the logs with an
>> entry "Phoenix Generic Upsert Statement: ..."
>> Also, what do the MapReduce counters say for the job? If possible, can
>> you share the Pig script, as sometimes the order of columns in the
>> STORE command has an impact.
>>
>> Regards
>> Ravi
>>
>> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <[email protected]>
>> wrote:
>>
>> Hi, I've run into a peculiar issue between loading data using Pig vs.
>> the CsvBulkLoadTool. I have 42M CSV records to load and I am comparing
>> the performance.
>>
>> In both cases the MR jobs are successful, and there are no errors.
>> In both cases the MR job counters state there are 42M map input and
>> output records.
>>
>> However, when I run a count on the table when the jobs are complete,
>> something is terribly off.
>> After the bulk load, select count shows all 42M recs in Phoenix, as
>> expected.
>> After the Pig load there are only 3M recs in Phoenix, not even close.
>>
>> I have no errors to send. I have run the same test multiple times and
>> gotten the same results. The Pig script is not doing any
>> transformations; it is a simple LOAD and STORE.
>> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.
>> 4.2.3-SNAPSHOT is running on the region servers.
>>
>> Thanks,
>> Ralph
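[For reference, Constantin's suggestion earlier in the thread, checking
whether the script produces overlapping keys, can be sketched in Pig as
below. This is an illustration, not what was actually run; the key
columns are taken from the schema in the thread, and the remaining
columns are omitted for brevity.]

```pig
-- Count how many rows share each candidate primary key; any group with
-- a count above 1 would be collapsed into a single row by Phoenix UPSERT.
Z = load '$data' USING PigStorage(',') as (
    file_name:chararray,
    rec_num:int);
keyed   = GROUP Z BY (file_name, rec_num);
counted = FOREACH keyed GENERATE group, COUNT(Z) AS n;
dups    = FILTER counted BY n > 1;
DUMP dups;   -- empty output means no overlapping keys
```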
