Re: Strange CROSS behavior

2014-05-02 Thread Rohini Palaniswamy
This looks like a bug. Can you please file a jira with steps to reproduce?
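
A minimal sketch of how the record counts for such a report could be checked, assuming T3 is the relation from the original script in this thread. The aliases T3_all and T3_count are hypothetical names introduced only for this illustration:

```pig
-- Collapse T3 into a single group so the whole relation can be counted.
T3_all = GROUP T3 ALL;

-- COUNT_STAR counts every tuple, including those whose first field is null.
T3_count = FOREACH T3_all GENERATE COUNT_STAR(T3);

DUMP T3_count;
```

Running this once with the STOREs in place and once without them would give the two counts to compare in the report.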


On Fri, Apr 18, 2014 at 2:45 PM, Alex Rasmussen alex...@trifacta.com wrote:

 I'm using PigStorage(',') for all stores.

 I agree about the expensiveness of CROSS, but I'm still kind of confused as
 to why it would lose records in this case.

 --Alex



Strange CROSS behavior

2014-04-18 Thread Alex Rasmussen
I'm noticing some really strange behavior with a CROSS operation in one of
my scripts.

I'm CROSSing a table T1 with another table T2 to produce T3. T1 has one
row, and T2 has 2,982,035 rows.

If I STORE both T1 and T2 before CROSSing them together to get T3, like so:

-- ... Long script that, among other things, creates T1 and T2 ...
STORE T1 INTO 'hdfs://namenode/x/T1' USING PigStorage(',');
STORE T2 INTO 'hdfs://namenode/x/T2' USING PigStorage(',');
T3 = CROSS T2, T1;

then I get what I expect; T3 has 2,982,035 records.

However, if I omit the STOREs and run the CROSS directly, T3 only has 1,492,977
records.

I've run EXPLAIN on both the script with the STOREs and the script without,
and their query plans are identical.

I'm going to end up refactoring the script to get rid of the CROSS anyway
since it's expensive, but am curious as to whether I'm doing something
wrong or if there may be a subtle bug in CROSS.

I'm using Pig version 0.11.0-cdh4.5.0

Any insight you could give me here would be greatly appreciated.

Thanks,
--Alex


Re: Strange CROSS behavior

2014-04-18 Thread Pradeep Gollakota
What storage func are you using? My guess is that there is some
shared state in the storage func. Take a look at this SO answer, which deals
with shared state in stores:
http://stackoverflow.com/questions/20225842/apache-pig-append-one-dataset-to-another-one/20235592#20235592
The reason this doesn't occur when you STORE first is that PigStorage doesn't
have shared state. So, for T3, you're loading from text files instead of your
original store func.

CROSS is pretty expensive by nature. If one of your datasets is small
enough to load into memory, you can use a fragment-replicate join instead.
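
As a rough sketch of that suggestion, assuming T1 (one row) and T2 are the relations from the original script: a CROSS can be emulated by joining on a constant key. The key field k and the aliases T1k, T2k and T3j are hypothetical names introduced only for this example.

```pig
-- Add the same constant key to both relations so that every row of T2
-- matches the single row of T1.
T1k = FOREACH T1 GENERATE 1 AS k, *;
T2k = FOREACH T2 GENERATE 1 AS k, *;

-- In a replicated join the relation listed last is the one held in memory,
-- so the small one-row relation goes last.
T3j = JOIN T2k BY k, T1k BY k USING 'replicated';
```

The result carries the key twice (T2k::k and T1k::k), so a final FOREACH that projects only the wanted fields would be needed to match the shape of the CROSS output.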


On Fri, Apr 18, 2014 at 11:43 AM, Alex Rasmussen alex...@trifacta.com wrote:

 I'm noticing some really strange behavior with a CROSS operation in one of
 my scripts.

 I'm CROSSing a table T1 with another table T2 to produce T3. T1 has one
 row, and T2 has 2,982,035 rows.

 If I STORE both T1 and T2 before CROSSing them together to get T3, like so:

 -- ... Long script that, among other things, creates T1 and T2 ...
 STORE T1 INTO 'hdfs://namenode/x/T1' USING PigStorage(',');
 STORE T2 INTO 'hdfs://namenode/x/T2' USING PigStorage(',');
 T3 = CROSS T2, T1;

 then I get what I expect; T3 has 2,982,035 records.

 However, if I omit the STOREs and run the CROSS directly, T3 only has 1,492,977
 records.

 I've run EXPLAIN on both the script with the STOREs and the script without,
 and their query plans are identical.

 I'm going to end up refactoring the script to get rid of the CROSS anyway
 since it's expensive, but am curious as to whether I'm doing something
 wrong or if there may be a subtle bug in CROSS.

 I'm using Pig version 0.11.0-cdh4.5.0

 Any insight you could give me here would be greatly appreciated.

 Thanks,
 --Alex



Re: Strange CROSS behavior

2014-04-18 Thread Alex Rasmussen
I'm using PigStorage(',') for all stores.

I agree about the expensiveness of CROSS, but I'm still kind of confused as
to why it would lose records in this case.

--Alex


On Fri, Apr 18, 2014 at 2:28 PM, Pradeep Gollakota pradeep...@gmail.com wrote:

 What is the storage func you're using? My guess is that there is some
 shared state in the Storage func. Take a look at this SO that is dealing
 with shared state in Stores.

 http://stackoverflow.com/questions/20225842/apache-pig-append-one-dataset-to-another-one/20235592#20235592
 .
 The reason why this doesn't occur is because PigStorage doesn't have shared
 state. So, in T3, you're loading from text files instead of your original
 store func.

 CROSS is pretty expensive by nature. If one of your datasets is small
 enough to load into memory, you use a fragment replicate join instead.



Re: Strange CROSS behavior

2014-04-18 Thread Russell Jurney
STOREing and LOADing relations is often a workaround for these kinds of bugs.
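
A minimal sketch of that workaround, assuming the paths and PigStorage(',') settings from the original script. The aliases T1_loaded and T2_loaded are hypothetical, and no schema is declared on the LOADs, so fields would have to be referenced by position or given an AS clause:

```pig
-- Materialize both inputs to HDFS.
STORE T1 INTO 'hdfs://namenode/x/T1' USING PigStorage(',');
STORE T2 INTO 'hdfs://namenode/x/T2' USING PigStorage(',');

-- Read the materialized copies back and CROSS those instead of the
-- in-flight relations.
T1_loaded = LOAD 'hdfs://namenode/x/T1' USING PigStorage(',');
T2_loaded = LOAD 'hdfs://namenode/x/T2' USING PigStorage(',');
T3 = CROSS T2_loaded, T1_loaded;
```

This trades extra I/O for a break in the pipeline at the point where the records seem to go missing.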

On Friday, April 18, 2014, Alex Rasmussen alex...@trifacta.com wrote:

 I'm using PigStorage(',') for all stores.

 I agree about the expensiveness of CROSS, but I'm still kind of confused as
 to why it would lose records in this case.

 --Alex




-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com