Re: Pig vs Bulk Load record count

Perko, Ralph J Tue, 03 Feb 2015 15:31:51 -0800

I have solved the problem.  It was a mystery because the same data loaded 
into the same schema gave conflicting counts depending on the load technique.  
Although the data itself had no duplicate keys, the behavior suggested 
something was wrong with the keys (for instance, the MR input/output record 
counts were correct for both load techniques).  I confirmed this by creating a 
Pig UDF that generated a UUID for each row as the PK.  With that change every 
row appeared as expected and I got the correct count.  But I couldn’t figure 
out why the original keys would behave differently, because they were also 
unique.  My Pig script could hardly be simpler: no transformations, just a 
simple load and store.  That simplicity ended up being the issue!
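
For readers who want to repeat the surrogate-key test outside of Pig, here is a 
minimal Python sketch of the same idea.  It is not the UDF used above; the 
function names and sample rows are hypothetical and only illustrate that a 
per-row UUID key keeps every row distinct, whatever the data contains.

```python
import uuid


def add_surrogate_keys(rows):
    """Prepend a random UUID string to each row to act as the primary key."""
    return [[str(uuid.uuid4())] + list(row) for row in rows]


def distinct_key_count(rows, key_width=1):
    """Count distinct keys, where the key is the first key_width columns."""
    return len({tuple(row[:key_width]) for row in rows})


# Hypothetical sample rows; the last two are identical on purpose.
rows = [["a.csv", "1", "x"], ["a.csv", "2", "y"], ["a.csv", "2", "y"]]
keyed = add_surrogate_keys(rows)

# With a UUID key, the distinct key count matches the row count.
print(len(keyed), distinct_key_count(keyed))
```

If the count is correct with surrogate keys but wrong with the real keys, the 
problem lies in how the real key values reach the table, not in the load path 
itself.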


Solution:
Assign the correct Pig data type to the PK values rather than letting Pig 
figure it out.  I am not sure what the exact underlying issue is, but this 
fixed it (perhaps when Pig coerced the values to whatever datatype it thought 
best, it munged them somehow).

Changes to pig script from below:

Z = load '$data' USING PigStorage(',') as (
  file_name:chararray,
  rec_num:int,

Thanks for the help
Ralph

From: <Ciureanu>, "Constantin (GfK)" 
<[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Date: Tuesday, February 3, 2015 at 1:52 AM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: RE: Pig vs Bulk Load record count

Hello Ralph,

Check whether the Pig script produces keys that overlap (that would explain 
the reduced number of rows).
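
A check like this can be run on the raw CSV before loading.  The following is 
a minimal Python sketch, assuming the key is made up of the first two columns 
(file_name and rec_num); the thread does not include the CREATE TABLE 
statement, so that column choice, the function name, and the sample data are 
all assumptions.

```python
import csv
import io
from collections import Counter


def find_overlapping_keys(lines, key_columns=(0, 1)):
    """Return {key: count} for every key that occurs more than once.

    key_columns is an assumption: it treats the first two CSV fields
    as the composite primary key.
    """
    reader = csv.reader(lines)
    counts = Counter(tuple(row[i] for i in key_columns) for row in reader if row)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample data; the second and fourth rows share a key.
sample = io.StringIO("a.csv,1,foo\na.csv,2,bar\nb.csv,1,baz\na.csv,2,qux\n")
print(find_overlapping_keys(sample))  # {('a.csv', '2'): 2}
```

An empty result means the source keys are unique as text, so any rows lost 
after the load must come from how the keys are converted on the way in.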

Good luck,
   Constantin

From: Ravi Kiran [mailto:[email protected]]
Sent: Tuesday, February 03, 2015 2:42 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Pig vs Bulk Load record count

Thanks Ralph. I will try to reproduce this on my end with a sample data set and 
get back to you.
Regards
Ravi

On Mon, Feb 2, 2015 at 5:27 PM, Perko, Ralph J 
<[email protected]<mailto:[email protected]>> wrote:
Ravi,

The create statement is attached.  You will see some additional fields I 
excluded from the first email.

Thanks!
Ralph

________________________________
From: Ravi Kiran [[email protected]<mailto:[email protected]>]
Sent: Monday, February 02, 2015 5:03 PM
To: [email protected]<mailto:[email protected]>

Subject: Re: Pig vs Bulk Load record count

Hi Ralph,
   Is it possible to share the CREATE TABLE command?  I would like to 
reproduce the error on my side with a sample dataset that uses your specific 
data types.
Regards
Ravi

On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J 
<[email protected]<mailto:[email protected]>> wrote:
Ravi,

Thanks for the help - I am sorry I am not finding the upsert statement.  
Attached are the logs and output.  I specify the columns because I get errors 
if I do not.

I ran a test on 10K records.  Pig states it processed 10K records.  Select 
count() says 9030.  I analyzed the 10K records in Excel and there are no 
duplicates.

Thanks!
Ralph

__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory
(509) 375-2272
[email protected]<mailto:[email protected]>

From: Ravi Kiran <[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Date: Monday, February 2, 2015 at 12:23 PM

To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: Pig vs Bulk Load record count

Hi Ralph,
   Regarding the upsert query in the logs, it should be "Phoenix Custom Upsert 
Statement:" since you have explicitly specified the fields in STORE.  Is it 
possible to give it a try with a smaller set of records, say 8k, to see the 
behavior?
Regards
Ravi

On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J 
<[email protected]<mailto:[email protected]>> wrote:
Thanks for the quick response.  Here is what I have below:

========================================
Pig script:
-------------------------------
register $phoenix_jar;

Z = load '$data' USING PigStorage(',') as (
  file_name,
  rec_num,
  epoch_time,
  timet,
  site,
  proto,
  saddr,
  daddr,
  sport,
  dport,
  mf,
  cf,
  dur,
  sdata,
  ddata,
  sbyte,
  dbyte,
  spkt,
  dpkt,
  siopt,
  diopt,
  stopt,
  dtopt,
  sflags,
  dflags,
  flags,
  sfseq,
  dfseq,
  slseq,
  dlseq,
  category);

STORE Z into 
'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
 using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper','-batchSize 5000');

=========================

I cannot find the upsert statement you are referring to in either the MR logs 
or the Pig output, but I do have the output below.  Pig thinks it wrote the 
correct number of records:

Input(s):
Successfully read 42871627 records (1479463169 bytes) from: 
"/data/incoming/201501124931/SAMPLE"

Output(s):
Successfully stored 42871627 records in: 
"hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"


Count command:
select count(1) from TEST;

__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory
(509) 375-2272
[email protected]<mailto:[email protected]>

From: Ravi Kiran <[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Date: Monday, February 2, 2015 at 11:01 AM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: Pig vs Bulk Load record count

Hi Ralph,
   That's definitely a cause for worry.  Can you please share the UPSERT query 
being built by Phoenix?  You should see it in the logs with an entry "Phoenix 
Generic Upsert Statement: ..".
Also, what do the MapReduce counters say for the job?  If possible, can you 
share the Pig script, as the order of columns in the STORE command sometimes 
matters.
Regards
Ravi


On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J 
<[email protected]<mailto:[email protected]>> wrote:
Hi, I’ve run into a peculiar issue between loading data using Pig vs the 
CsvBulkLoadTool.  I have 42M csv records to load and I am comparing the 
performance.

In both cases the MR jobs are successful, and there are no errors.
In both cases the MR job counters state there are 42M Map input and output 
records.

However, when I run count on the table when the jobs are complete something is 
terribly off.
After the bulk load, select count shows all 42M recs in Phoenix as is expected.
After the pig load there are only 3M recs in Phoenix – not even close.

I have no errors to send.  I have run the same test multiple times and gotten 
the same results.  The Pig script is not doing any transformations.  It is a 
simple LOAD and STORE.
I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.  
4.2.3-SNAPSHOT is running on the region servers.

Thanks,
Ralph
