[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063633#comment-13063633 ]

Patrick Hunt commented on PIG-1890:
---

I tested PIG-1890-4.patch against trunk using the UNION example and it generated the expected (i.e. correct) results.

> Fix piggybank unit test TestAvroStorage
> ---
>
> Key: PIG-1890
> URL: https://issues.apache.org/jira/browse/PIG-1890
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.9.0
> Reporter: Daniel Dai
> Assignee: Jakob Homan
> Labels: patch
> Attachments: PIG-1890-1.patch, PIG-1890-2.patch, PIG-1890-3.patch, PIG-1890-4.patch, pig_setloc_avro.txt
>
>
> TestAvroStorage fails on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in the first test case, testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD: {PIG_WRAPPER: (ARRAY_ELEM: float)})". It seems PIG_WRAPPER is redundant. This issue was hidden until PIG-1188 was checked in.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated PIG-1890:
---

Attachment: pig_setloc_avro.txt

demonstrate setLocation calls on AvroStorage.
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060700#comment-13060700 ]

Patrick Hunt commented on PIG-1890:
---

@Dmitriy thanks.

bq. Patrick, can you show some debug output that has the sequence of calls?

Sure. I didn't save the original, so I re-ran it; see the attached pig_setloc_avro.txt for full details using the UNION example (this is with current trunk - notice that there are 6 tuples output rather than 2).

I misremembered one detail: setLocation is being called for the same job, with different files, but on _different_ AvroStorage objects (see the first two lines of the setLocation debug messages).

Why are 8 AvroStorage objects being created? Shouldn't there be just 2, one for loading each of the two input files?
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060213#comment-13060213 ]

Patrick Hunt commented on PIG-1890:
---

@ken (and @mads) thanks, I figured something like that. Could this possibly be an issue in Pig itself? I do see this:

{noformat}
LoadFunc.setLocation:
 * This method will be called in the backend multiple times. Implementations
 * should bear in mind that this method is called multiple times and should
 * ensure there are no inconsistent side effects due to the multiple calls.
{noformat}

But what I'm seeing in this UNION case is that setLocation is being called multiple times on the same AvroStorage instance, for the same job, with different files. With the current AvroStorage code and PIG-1890-2.patch applied, this results in the duplication: 2 files are added rather than one. My patch fixes this by taking only the most recent argument to setLocation, which is consistent with the existing loader funcs, whereas AvroStorage keeps adding.

If you check the debugging output you'll see this (I might have added a bit more debugging to setLocation to capture this event...)

Regards.
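The difference between a loader that keeps adding input paths and one that keeps only the most recent setLocation argument can be sketched in plain Java. This is only an illustration of the behaviour described in the comment above, not the AvroStorage code or the patch: FakeJob, setLocationAppending, and setLocationOverwriting are hypothetical names, and the sequence of calls is an assumed example of repeated setLocation calls against one job.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: FakeJob and both setLocation variants are hypothetical
// stand-ins, not Pig, Hadoop, or AvroStorage classes.
class SetLocationSketch {

    // Minimal model of the per-job input-path list that a loader configures.
    static final class FakeJob {
        final List<String> inputPaths = new ArrayList<>();
    }

    // Accumulating style: every call appends to whatever is already on the job.
    static void setLocationAppending(String location, FakeJob job) {
        job.inputPaths.add(location);
    }

    // Overwriting style: only the most recent location is kept.
    static void setLocationOverwriting(String location, FakeJob job) {
        job.inputPaths.clear();
        job.inputPaths.add(location);
    }

    public static void main(String[] args) {
        FakeJob appending = new FakeJob();
        FakeJob overwriting = new FakeJob();
        // Assumed call sequence: the same job sees setLocation several times.
        String[] calls = {"input_123.avro", "input_789.avro", "input_789.avro"};
        for (String location : calls) {
            setLocationAppending(location, appending);
            setLocationOverwriting(location, overwriting);
        }
        System.out.println("appending:   " + appending.inputPaths);
        System.out.println("overwriting: " + overwriting.inputPaths);
    }
}
```

In this toy model the appending job ends up with three entries, one of them a file that belongs to the other load, while the overwriting job holds only the last location it was given. The real duplication seen in the UNION case may have additional causes that this sketch does not model.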
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060114#comment-13060114 ]

Patrick Hunt commented on PIG-1890:
---

Hi, I'm seeing an issue with both versions of the attached patches when I run the following:

{noformat}
REGISTER avro-1.4.1.jar;
REGISTER json-simple-1.1.jar;
REGISTER piggybank.jar;

A = LOAD 'input_123.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
B = LOAD 'input_789.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
C = UNION A, B;
DUMP C;
{noformat}

where each file contains a single tuple; input_123.avro contains "1,2,3" (ints) and input_789.avro contains "7,8,9".

DUMP C should return 2 tuples: one tuple 1,2,3 and one tuple 7,8,9.

Without the patch I see 6 tuples output (three 1,2,3 and three 7,8,9).
With either of the proposed patches applied I see 4 tuples output (two 1,2,3 and two 7,8,9).

From looking at other Pig loader functions, it seems like the following would address the setLocation issue:

{noformat}
 public void setLocation(String location, Job job) throws IOException {
-    if(AvroStorageUtils.addInputPaths(location, job) && inputAvroSchema == null) {
-        inputAvroSchema = getAvroSchema(location, job);
-    }
+    FileInputFormat.setInputPaths(job, location);
+    inputAvroSchema = getAvroSchema(location, job);
 }
{noformat}

This does resolve the issue for the script I described. However, the "addInputPaths" functionality of AvroStorageUtils is lost. I'm wondering why it was added rather than just relying on the standard capabilities of LOAD (such as globbing)?

I'd be happy to package up my suggestion as a patch if there's interest.