Re: Strange CROSS behavior
This looks like a bug. Can you please file a jira with steps to reproduce?

On Fri, Apr 18, 2014 at 2:45 PM, Alex Rasmussen alex...@trifacta.com wrote:

> I'm using PigStorage(',') for all stores. I agree about the expensiveness of CROSS, but I'm still kind of confused as to why it would lose records in this case.
>
> --Alex
>
> On Fri, Apr 18, 2014 at 2:28 PM, Pradeep Gollakota pradeep...@gmail.com wrote:
>
>> What is the storage func you're using? My guess is that there is some shared state in the Storage func. Take a look at this SO that is dealing with shared state in Stores. http://stackoverflow.com/questions/20225842/apache-pig-append-one-dataset-to-another-one/20235592#20235592. The reason why this doesn't occur is because PigStorage doesn't have shared state. So, in T3, you're loading from text files instead of your original store func.
>>
>> CROSS is pretty expensive by nature. If one of your datasets is small enough to load into memory, you use a fragment replicate join instead.
>>
>> On Fri, Apr 18, 2014 at 11:43 AM, Alex Rasmussen alex...@trifacta.com wrote:
>>
>>> I'm noticing some really strange behavior with a CROSS operation in one of my scripts. I'm CROSSing a table T1 with another table T2 to produce T3. T1 has one row, and T2 has 2,982,035 rows.
>>>
>>> If I STORE both T1 and T2 before CROSSing them together to get T3, like so:
>>>
>>>     -- ... Long script that, among other things, creates T1 and T2 ...
>>>     STORE T1 INTO 'hdfs://namenode/x/T1' USING PigStorage(',');
>>>     STORE T2 INTO 'hdfs://namenode/x/T2' USING PigStorage(',');
>>>     T3 = CROSS T2, T1;
>>>
>>> then I get what I expect; T3 has 2,982,035 records. However, if I omit the STOREs and run the CROSS directly, T3 only has 1,492,977 records.
>>>
>>> I've run EXPLAIN on both the script with the STOREs and the script without, and their query plans are identical.
>>>
>>> I'm going to end up refactoring the script to get rid of the CROSS anyway since it's expensive, but am curious as to whether I'm doing something wrong or if there may be a subtle bug in CROSS.
>>>
>>> I'm using Pig version 0.11.0-cdh4.5.0
>>>
>>> Any insight you could give me here would be greatly appreciated.
>>>
>>> Thanks,
>>> --Alex
Strange CROSS behavior
I'm noticing some really strange behavior with a CROSS operation in one of my scripts. I'm CROSSing a table T1 with another table T2 to produce T3. T1 has one row, and T2 has 2,982,035 rows.

If I STORE both T1 and T2 before CROSSing them together to get T3, like so:

    -- ... Long script that, among other things, creates T1 and T2 ...
    STORE T1 INTO 'hdfs://namenode/x/T1' USING PigStorage(',');
    STORE T2 INTO 'hdfs://namenode/x/T2' USING PigStorage(',');
    T3 = CROSS T2, T1;

then I get what I expect; T3 has 2,982,035 records. However, if I omit the STOREs and run the CROSS directly, T3 only has 1,492,977 records.

I've run EXPLAIN on both the script with the STOREs and the script without, and their query plans are identical.

I'm going to end up refactoring the script to get rid of the CROSS anyway since it's expensive, but am curious as to whether I'm doing something wrong or if there may be a subtle bug in CROSS.

I'm using Pig version 0.11.0-cdh4.5.0

Any insight you could give me here would be greatly appreciated.

Thanks,
--Alex
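One way to compare the record counts described above is to count T3 inside the script itself. This is only a sketch: the aliases T3_all and T3_count are illustrative names that do not appear in the original script, and running it adds an extra job.

```pig
-- Sketch: count the records in T3 (alias names here are illustrative).
T3_all = GROUP T3 ALL;
-- COUNT_STAR counts every tuple in the bag, including those with null fields.
T3_count = FOREACH T3_all GENERATE COUNT_STAR(T3);
DUMP T3_count;
```

Running the same three statements against T2 in both variants of the script would show whether the records go missing before the CROSS or during it.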
Re: Strange CROSS behavior
What is the storage func you're using? My guess is that there is some shared state in the Storage func. Take a look at this SO answer, which deals with shared state in Stores: http://stackoverflow.com/questions/20225842/apache-pig-append-one-dataset-to-another-one/20235592#20235592. The reason this doesn't occur is that PigStorage doesn't have shared state. So, in T3, you're loading from text files instead of your original store func.

CROSS is pretty expensive by nature. If one of your datasets is small enough to load into memory, you can use a fragment-replicate join instead.
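The fragment-replicate join suggested above could be sketched roughly as follows. This is an assumption-laden illustration, not the poster's script: the constant join key k and the aliases T1k, T2k and T3_joined are made-up names, and the result carries the extra key columns, which would need to be projected away.

```pig
-- Sketch only: emulate CROSS with a fragment-replicate (map-side) join.
-- The constant key `k` and the aliases T1k, T2k, T3_joined are hypothetical.
T1k = FOREACH T1 GENERATE 1 AS k, *;
T2k = FOREACH T2 GENERATE 1 AS k, *;

-- In a replicated join the relation listed last is loaded into memory,
-- so the one-row relation goes on the right and the large one on the left.
T3_joined = JOIN T2k BY k, T1k BY k USING 'replicated';
```

Because every row in both relations carries the same key, each row of T2 should pair with the single row of T1, which is the same pairing CROSS is expected to produce here.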
Re: Strange CROSS behavior
I'm using PigStorage(',') for all stores. I agree about the expensiveness of CROSS, but I'm still kind of confused as to why it would lose records in this case.

--Alex
Re: Strange CROSS behavior
STORing and LOADing relations is often a workaround for these kinds of bugs.

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
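A minimal sketch of the STORE-then-LOAD workaround mentioned above, reusing the paths from the original post. The aliases T1_loaded, T2_loaded and T3_reloaded are illustrative, and the sketch assumes that Pig orders the LOADs after the STOREs to the same paths when both appear in one script, which is worth verifying on the version in use.

```pig
-- Sketch only: materialize both relations, then CROSS the reloaded copies.
STORE T1 INTO 'hdfs://namenode/x/T1' USING PigStorage(',');
STORE T2 INTO 'hdfs://namenode/x/T2' USING PigStorage(',');

-- Aliases below are hypothetical. Without an AS clause the reloaded
-- fields have no declared schema, so add one if later steps need types.
T1_loaded = LOAD 'hdfs://namenode/x/T1' USING PigStorage(',');
T2_loaded = LOAD 'hdfs://namenode/x/T2' USING PigStorage(',');

T3_reloaded = CROSS T2_loaded, T1_loaded;
```

This trades extra HDFS writes and reads for a break in the execution plan between the upstream logic and the CROSS.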