Hi Ralph,

Is it possible to share the CREATE TABLE command? I would like to reproduce
the error on my side with a sample dataset that uses your specific data types.
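[Editor's note: to help anyone reproducing this, here is a minimal sketch for generating a sample CSV in the shape Ralph's data appears to have. The column names are taken from the Pig script quoted later in this thread; every value is invented placeholder data, and the real column types are unknown here.]

```python
import csv
import io

# Column list copied from the Pig LOAD statement quoted below in this thread.
COLUMNS = ["file_name", "rec_num", "epoch_time", "timet", "site", "proto",
           "saddr", "daddr", "sport", "dport", "mf", "cf", "dur", "sdata",
           "ddata", "sbyte", "dbyte", "spkt", "dpkt", "siopt", "diopt",
           "stopt", "dtopt", "sflags", "dflags", "flags", "sfseq", "dfseq",
           "slseq", "dlseq", "category"]

def make_sample_csv(n_rows):
    """Return CSV text with n_rows synthetic records (all values invented)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i in range(n_rows):
        # Fill every column with a distinct placeholder; the real data types
        # are exactly what this sketch cannot know, hence the CREATE TABLE ask.
        writer.writerow(["sample_%d_%s" % (i, c) for c in COLUMNS])
    return buf.getvalue()

print(len(make_sample_csv(100).splitlines()))  # 100 data rows, no header
```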
Regards,
Ravi

On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J <[email protected]> wrote:

> Ravi,
>
> Thanks for the help - I am sorry, but I cannot find the upsert statement.
> Attached are the logs and output. I specify the columns because I get
> errors if I do not.
>
> I ran a test on 10K records. Pig states it processed 10K records.
> Select count() says 9030. I analyzed the 10K data in Excel and there are
> no duplicates.
>
> Thanks!
> Ralph
>
> __________________________________________________
> *Ralph Perko*
> Pacific Northwest National Laboratory
> (509) 375-2272
> [email protected]
>
> From: Ravi Kiran <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, February 2, 2015 at 12:23 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> Regarding the upsert query in the logs, it should read *Phoenix Custom
> Upsert Statement:* since you have explicitly specified the fields in
> STORE. Is it possible to give it a try with a smaller set of records,
> say 8K, to see the behavior?
>
> Regards,
> Ravi
>
> On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <[email protected]>
> wrote:
>
>> Thanks for the quick response.
>> Here is what I have below:
>>
>> ========================================
>> Pig script:
>> -------------------------------
>> register $phoenix_jar;
>>
>> Z = load '$data' USING PigStorage(',') as (
>>     file_name,
>>     rec_num,
>>     epoch_time,
>>     timet,
>>     site,
>>     proto,
>>     saddr,
>>     daddr,
>>     sport,
>>     dport,
>>     mf,
>>     cf,
>>     dur,
>>     sdata,
>>     ddata,
>>     sbyte,
>>     dbyte,
>>     spkt,
>>     dpkt,
>>     siopt,
>>     diopt,
>>     stopt,
>>     dtopt,
>>     sflags,
>>     dflags,
>>     flags,
>>     sfseq,
>>     dfseq,
>>     slseq,
>>     dlseq,
>>     category);
>>
>> STORE Z into
>> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
>> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
>>
>> =========================
>>
>> I cannot find the upsert statement you are referring to in either the
>> MR logs or the Pig output, but I do have this below - Pig thinks it
>> output the correct number of records:
>>
>> Input(s):
>> Successfully read 42871627 records (1479463169 bytes) from:
>> "/data/incoming/201501124931/SAMPLE"
>>
>> Output(s):
>> Successfully stored 42871627 records in:
>> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>>
>> Count command:
>> select count(1) from TEST;
>>
>> __________________________________________________
>> *Ralph Perko*
>> Pacific Northwest National Laboratory
>> (509) 375-2272
>> [email protected]
>>
>> From: Ravi Kiran <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday, February 2, 2015 at 11:01 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Hi Ralph,
>>
>> That's definitely a cause for worry.
>> Can you please share the UPSERT
>> query being built by Phoenix? You should see it in the logs with an
>> entry "*Phoenix Generic Upsert Statement:* ..."
>> Also, what do the MapReduce counters say for the job? If possible, can
>> you share the Pig script, as sometimes the order of columns in the
>> STORE command has an impact.
>>
>> Regards,
>> Ravi
>>
>>
>> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <[email protected]>
>> wrote:
>>
>>> Hi, I've run into a peculiar issue loading data using Pig vs. the
>>> CsvBulkLoadTool. I have 42M CSV records to load and I am comparing
>>> the performance.
>>>
>>> In both cases the MR jobs are successful, and there are no errors.
>>> In both cases the MR job counters state there are 42M map input and
>>> output records.
>>>
>>> However, when I run a count on the table after the jobs are complete,
>>> something is terribly off.
>>> After the bulk load, select count shows all 42M recs in Phoenix, as
>>> expected.
>>> After the Pig load there are only 3M recs in Phoenix - not even close.
>>>
>>> I have no errors to send. I have run the same test multiple times and
>>> gotten the same results. The Pig script is not doing any
>>> transformations; it is a simple LOAD and STORE.
>>> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.
>>> 4.2.3-SNAPSHOT is running on the region servers.
>>>
>>> Thanks,
>>> Ralph
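[Editor's note: one mechanism that can produce exactly this symptom, and the reason the thread keeps circling back to duplicates, is that rows written via UPSERT overwrite any earlier row with the same primary key, so the table ends up with one row per distinct key even though the writer reports every input record as "stored". The sketch below is a toy model of that effect, not a diagnosis of Ralph's job; which CSV columns actually form the primary key of his table is unknown here, so the key function is an assumption.]

```python
def upsert_all(records, key_fn):
    """Toy model of upsert semantics: a later record with the same key
    overwrites an earlier one, much as repeated UPSERTs against a table
    keyed on key_fn(record) would."""
    table = {}
    for rec in records:
        table[key_fn(rec)] = rec  # same key -> silent overwrite, no error
    return table

# 6 input records, but only 4 distinct (file, rec_num) keys -> the table
# ends with 4 rows even though the writer "successfully stored" 6 records.
records = [("f1", 1, "a"), ("f1", 2, "b"), ("f1", 1, "c"),
           ("f2", 1, "d"), ("f2", 2, "e"), ("f2", 2, "f")]
table = upsert_all(records, key_fn=lambda r: (r[0], r[1]))
print(len(records), len(table))  # 6 4
```

Note the check has to be made on the key columns only: a dataset can be free of fully duplicated rows (as an Excel scan would show) and still collide on the subset of columns that make up the primary key.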
