Which storage func are you using? My guess is that there is some
shared state in it. Take a look at this SO answer, which deals
with shared state in store funcs:
http://stackoverflow.com/questions/20225842/apache-pig-append-one-dataset-to-another-one/20235592#20235592
The reason the problem doesn't occur when you STORE first is that
PigStorage doesn't have shared state. So, for T3, you're loading from
text files instead of going through your original storage func.

CROSS is pretty expensive by nature. If one of your datasets is small
enough to load into memory, you can use a fragment-replicate join instead.
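
Here is a rough sketch of what that could look like for your case. It
assumes T1 (the one-row relation) fits in memory. The constant key k and
the aliases T1k, T2k and T3_keyed are names I made up for illustration;
they are not from your script.

```pig
-- Add the same constant key to every row of both relations so that
-- joining on it pairs each T2 row with each T1 row, like CROSS does.
T1k = FOREACH T1 GENERATE 1 AS k, *;
T2k = FOREACH T2 GENERATE 1 AS k, *;

-- In a replicated join the large relation comes first; the relation
-- listed last is the one loaded into memory on each map task.
T3_keyed = JOIN T2k BY k, T1k BY k USING 'replicated';
```

T3_keyed still carries the two constant key columns (T2k::k and T1k::k),
so you would project those away in a following FOREACH once you know
which fields you need.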


On Fri, Apr 18, 2014 at 11:43 AM, Alex Rasmussen <alex...@trifacta.com> wrote:

> I'm noticing some really strange behavior with a CROSS operation in one of
> my scripts.
>
> I'm CROSSing a table T1 with another table T2 to produce T3. T1 has one
> row, and T2 has 2,982,035 rows.
>
> If I STORE both T1 and T2 before CROSSing them together to get T3, like so:
>
> -- ... Long script that, among other things, creates T1 and T2 ...
> STORE T1 INTO 'hdfs://namenode/x/T1' USING PigStorage(',');
> STORE T2 INTO 'hdfs://namenode/x/T2' USING PigStorage(',');
> T3 = CROSS T2, T1;
>
> then I get what I expect; T3 has 2,982,035 records.
>
> However, if I omit the STOREs and run the CROSS directly, T3 only has
> 1,492,977 records.
>
> I've run EXPLAIN on both the script with the STOREs and the script without,
> and their query plans are identical.
>
> I'm going to end up refactoring the script to get rid of the CROSS anyway
> since it's expensive, but am curious as to whether I'm doing something
> wrong or if there may be a subtle bug in CROSS.
>
> I'm using Pig version 0.11.0-cdh4.5.0
>
> Any insight you could give me here would be greatly appreciated.
>
> Thanks,
> --Alex
>
