Hi Ralph,

Regarding the upsert query in the logs: it should be *Phoenix Custom Upsert Statement:* in your case, since you have explicitly specified the fields in STORE. Is it possible to give it a try with a smaller set of records, say 8k, to see the behavior?
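One quick way to run that 8k test is to LIMIT the relation before the STORE. A minimal sketch, reusing the relation, parameter, and column names from Ralph's script below (the column list is abbreviated here; the full list from the original STORE would be used):

```pig
-- Take only the first 8,000 records and store those, so the Phoenix
-- row count can be checked against a small, known input size.
Z_small = LIMIT Z 8000;

STORE Z_small INTO 'hbase://$table_name/FILE_NAME,REC_NUM,...,CATEGORY'
    USING org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
```

If the count in Phoenix matches 8,000 exactly, the loss is likely volume-related (e.g. batching or key collisions at scale) rather than a column-mapping problem.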
Regards,
Ravi

On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <[email protected]> wrote:

> Thanks for the quick response. Here is what I have below:
>
> ========================================
> Pig script:
> -------------------------------
> register $phoenix_jar;
>
> Z = load '$data' USING PigStorage(',') as (
> file_name,
> rec_num,
> epoch_time,
> timet,
> site,
> proto,
> saddr,
> daddr,
> sport,
> dport,
> mf,
> cf,
> dur,
> sdata,
> ddata,
> sbyte,
> dbyte,
> spkt,
> dpkt,
> siopt,
> diopt,
> stopt,
> dtopt,
> sflags,
> dflags,
> flags,
> sfseq,
> dfseq,
> slseq,
> dlseq,
> category);
>
> STORE Z into
> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
>
> =========================
>
> I cannot find the upsert statement you are referring to in either the MR
> logs or the Pig output, but I do have the output below – Pig reports that
> it stored the expected number of records:
>
> Input(s):
> Successfully read 42871627 records (1479463169 bytes) from:
> "/data/incoming/201501124931/SAMPLE"
>
> Output(s):
> Successfully stored 42871627 records in:
> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>
> Count command:
> select count(1) from TEST;
>
> __________________________________________________
> *Ralph Perko*
> Pacific Northwest National Laboratory
> (509) 375-2272
> [email protected]
>
> From: Ravi Kiran <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, February 2, 2015 at 11:01 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> That's definitely a cause for worry.
> Can you please share the UPSERT query being built by Phoenix? You should
> see it in the logs with an entry "*Phoenix Generic Upsert Statement: *...".
> Also, what do the MapReduce counters say for the job? If possible, can
> you share the Pig script, as sometimes the order of columns in the STORE
> command impacts the result.
>
> Regards,
> Ravi
>
>
> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <[email protected]>
> wrote:
>
>> Hi, I’ve run into a peculiar issue between loading data using Pig vs.
>> the CsvBulkLoadTool. I have 42M CSV records to load, and I am comparing
>> the performance.
>>
>> In both cases the MR jobs are successful, and there are no errors.
>> In both cases the MR job counters state there are 42M map input and
>> output records.
>>
>> However, when I run a count on the table after the jobs are complete,
>> something is terribly off.
>> After the bulk load, select count shows all 42M recs in Phoenix, as
>> expected.
>> After the Pig load there are only 3M recs in Phoenix – not even close.
>>
>> I have no errors to send. I have run the same test multiple times and
>> gotten the same results. The Pig script is not doing any
>> transformations. It is a simple LOAD and STORE.
>> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.
>> 4.2.3-SNAPSHOT is running on the region servers.
>>
>> Thanks,
>> Ralph
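For readers comparing the two paths: the bulk-load side Ralph describes is typically launched along these lines (a sketch only; the client jar name, input path, and ZooKeeper quorum host are assumptions to adapt to your cluster):

```shell
# Run the Phoenix CSV bulk loader over the same input directory.
# It writes HFiles via MapReduce and hands them to HBase directly,
# bypassing the UPSERT path that PhoenixHBaseStorage uses.
HADOOP_CLASSPATH=$(hbase classpath) hadoop jar phoenix-4.2.2-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table TEST \
    --input /data/incoming/201501124931/SAMPLE \
    --zookeeper zk1.example.com:2181
```

Because the bulk loader produces the expected 42M rows while the Pig load does not, comparing the row keys emitted by each path on a small sample can reveal whether rows are colliding on the primary key.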
