[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060762#comment-13060762 ]

Ken Goodhope commented on PIG-1890:

A recent change in Pig causes setLocation to be called twice, and if setLocation isn't idempotent, you get twice the output. My suspicion is that UNION further exacerbates the problem, leading to the input being added 4X. Did you still see the problem with the last patch I added?

Fix piggybank unit test TestAvroStorage

Key: PIG-1890
URL: https://issues.apache.org/jira/browse/PIG-1890
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.9.0
Reporter: Daniel Dai
Assignee: Jakob Homan
Attachments: PIG-1890-1.patch, PIG-1890-2.patch, PIG-1890-3.patch, pig_setloc_avro.txt

TestAvroStorage fails on trunk. There are two reasons:
1. After PIG-1680, we call LoadFunc.setLocation one more time.
2. The schema for AvroStorage seems to be wrong. For example, in the first test case, testArrayDefault, the schema for in is set to PIG_WRAPPER: (FIELD: {PIG_WRAPPER: (ARRAY_ELEM: float)}). It seems PIG_WRAPPER is redundant. This issue was hidden until PIG-1188 was checked in.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060767#comment-13060767 ]

Mads Moeller commented on PIG-1890:

Hi Ken,

With the latest patch the UNION behaves as expected for me.

Thanks,
Mads
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13059958#comment-13059958 ]

Dmitriy V. Ryaboy commented on PIG-1890:

Marked PIG-2153 as a blocker to this. I have a feeling that ticket is also blocking EB issue 60: https://github.com/kevinweil/elephant-bird/issues/60
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060114#comment-13060114 ]

Patrick Hunt commented on PIG-1890:

Hi, I'm seeing an issue with both versions of the attached patches when I run the following:

{noformat}
REGISTER avro-1.4.1.jar;
REGISTER json-simple-1.1.jar;
REGISTER piggybank.jar;
A = LOAD 'input_123.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
B = LOAD 'input_789.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
C = UNION A, B;
DUMP C;
{noformat}

where each file contains a single tuple; input_123.avro contains 1,2,3 (ints) and input_789.avro contains 7,8,9.

DUMP C should return 2 tuples: one tuple 1,2,3 and one tuple 7,8,9.

Without the patch I see 6 tuples output (three copies of 1,2,3 and three of 7,8,9).

With either of the proposed patches applied I see 4 tuples output (two copies of 1,2,3 and two of 7,8,9).

From looking at other Pig loader functions, it seems like the following would address the setLocation issue:

{noformat}
 public void setLocation(String location, Job job) throws IOException {
-    if (AvroStorageUtils.addInputPaths(location, job) && inputAvroSchema == null) {
-        inputAvroSchema = getAvroSchema(location, job);
-    }
+    FileInputFormat.setInputPaths(job, location);
+    inputAvroSchema = getAvroSchema(location, job);
 }
{noformat}

This does resolve the issue for the script I described. However, the addInputPaths functionality of AvroStorageUtils is lost. I'm wondering why this was added rather than just relying on the standard capabilities of LOAD (such as globbing)?

I'd be happy to package up my suggestion as a patch if there's interest.
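The difference between the two setLocation behaviours discussed in this message can be illustrated with a small stand-alone sketch. It does not use Pig or Hadoop: `FakeJob`, `setLocationAppending`, and `setLocationOverwriting` are hypothetical names invented for the illustration, and the "job" is assumed to be nothing more than a list of input paths.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; FakeJob is a hypothetical stand-in, not Hadoop's Job. */
final class SetLocationSketch {

    /** Hypothetical job configuration: just the list of input paths. */
    static final class FakeJob {
        final List<String> inputPaths = new ArrayList<>();
    }

    /** Appending semantics: each call adds the location again, so it is not idempotent. */
    static void setLocationAppending(String location, FakeJob job) {
        job.inputPaths.add(location);
    }

    /** Overwriting semantics: the latest call replaces whatever was set before. */
    static void setLocationOverwriting(String location, FakeJob job) {
        job.inputPaths.clear();
        job.inputPaths.add(location);
    }

    public static void main(String[] args) {
        FakeJob appended = new FakeJob();
        setLocationAppending("input_123.avro", appended);
        setLocationAppending("input_123.avro", appended); // second call by the framework

        FakeJob overwritten = new FakeJob();
        setLocationOverwriting("input_123.avro", overwritten);
        setLocationOverwriting("input_123.avro", overwritten);

        System.out.println("appending:   " + appended.inputPaths);
        System.out.println("overwriting: " + overwritten.inputPaths);
    }
}
```

With appending semantics a repeated call leaves the same file registered twice, which in this toy model corresponds to each tuple being read twice; overwriting leaves a single entry however many times the method is called.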
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060165#comment-13060165 ]

Ken Goodhope commented on PIG-1890:

Hi Patrick, for our purposes we need setLocation to add all sub-directories, including directories more than 2 levels deep. A common use case for us is to have directories organized by time, /MM/dd/hh/mm. In that case, if you want to load all the data from a particular month, you need to add all the subdirs. You're right that a UNION can accomplish this, but it can be painful to add the directories that way. I will take a look at why this is still breaking in your case.
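As a rough illustration of what "add all sub-directories" means for a time-organized layout, here is a minimal sketch that walks a local directory tree with `java.nio.file`. It is an assumption-laden stand-in: the real loader works against a Hadoop file system, and `SubDirSketch`, `allSubDirs`, and `sampleTree` are hypothetical names that do not appear in AvroStorageUtils.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

/** Local-filesystem sketch; the piggybank code targets Hadoop paths instead. */
final class SubDirSketch {

    /** Returns root plus every directory below it, at any depth. */
    static Set<Path> allSubDirs(Path root) {
        Set<Path> dirs = new TreeSet<>();
        try (Stream<Path> walk = Files.walk(root)) {
            walk.filter(Files::isDirectory).forEach(dirs::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dirs;
    }

    /** Builds a throwaway tree shaped like MM/dd/hh/mm for the demo. */
    static Path sampleTree() {
        try {
            Path root = Files.createTempDirectory("avro-input");
            Files.createDirectories(root.resolve("07/08/13/45"));
            Files.createDirectories(root.resolve("07/09/00/00"));
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = sampleTree();
        for (Path dir : allSubDirs(root)) {
            System.out.println(dir);
        }
    }
}
```

Collecting into a set means a directory reached twice is only recorded once, which is one ingredient of keeping a repeated call harmless.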
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060190#comment-13060190 ]

Mads Moeller commented on PIG-1890:

Re-pasting addInputPaths.

{code}
/**
 * get input paths to job config
 */
public static boolean addInputPaths(String pathString, Job job) throws IOException {
    Set<Path> pathSet = new HashSet<Path>();
    if (addAllSubDirs(new Path(pathString), job, pathSet)) {
        Path[] paths = pathSet.toArray(new Path[pathSet.size()]);
        FileInputFormat.setInputPaths(job, paths);
        return true;
    }
    return false;
}
{code}
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060213#comment-13060213 ]

Patrick Hunt commented on PIG-1890:

@ken (and @mads) thanks, I figured something like that.

Could this possibly be an issue in Pig itself? I do see this:

{noformat}
LoadFunc.setLocation:
 * This method will be called in the backend multiple times. Implementations
 * should bear in mind that this method is called multiple times and should
 * ensure there are no inconsistent side effects due to the multiple calls.
{noformat}

But what I'm seeing in this UNION case is that setLocation is being called multiple times on the same AvroStorage instance, for the same job, with different files. With the current AvroStorage code and PIG-1890-2.patch applied, this results in the duplication: 2 files are added rather than one. My patch fixes this by only taking the most recent argument to setLocation, which is consistent with existing loader funcs, whereas AvroStorage keeps adding.

If you check the debugging output you'll see this (I might have added a bit more debugging to setLocation to capture this event...)

Regards.
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060218#comment-13060218 ]

Dmitriy V. Ryaboy commented on PIG-1890:

I've been a bit out of the loop on this -- you are doing your own directory traversal? You shouldn't need to do that in the Pig layer, this should be done in your InputFormat. I had to write a wrapper to emulate what MAPREDUCE-1501 does in Elephant-Bird, and I believe Pig does the same thing (but without caring about the mapred.input.dir.recursive config).

As for setLocation, yes. Making it idempotent is fun. I am curious about this business with calling it with different files for the same instance for the same job. Patrick, can you show some debug output that has the sequence of calls?
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060241#comment-13060241 ]

Ken Goodhope commented on PIG-1890:

Dmitriy, when I inherited the code it was already doing the traversal in setLocation, and I didn't consider doing it in the InputFormat. To be honest, I am not crazy about adding all the subdirs by default, since this is inconsistent with the way a standard map-reduce job works. But our users expect this behavior and have Pig jobs that depend on it. If the current patch works, I am inclined to leave it until I get time to do a better refactoring.
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060245#comment-13060245 ]

Dmitriy V. Ryaboy commented on PIG-1890:

Ken, adding all subdirs is how Hadoop + whatever patchset works, given the right value for mapred.input.dir.recursive. Now, what version of Hadoop, I have no idea, but it's in there somewhere :). And since that's what people decided on, it probably behooves us to respect it.

But fixing that issue is a separate concern from what this ticket tries to address. We should open a ticket, though.
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041915#comment-13041915 ]

Ken Goodhope commented on PIG-1890:

I need some clarification on the contract for POProject.getNext(Tuple). Right now, if it receives a tuple with a single element, it extracts that element and attempts to cast it to a tuple and return it. This breaks with any single-element tuple where the single element is not a tuple. The code could be modified to not extract non-tuple elements.
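The modification suggested in this comment ("not extract non-tuple elements") could look roughly like the following sketch. It is not Pig code: tuples are modelled as plain `java.util.List` values, and `UnwrapSketch` and `unwrapIfNested` are hypothetical names chosen for the illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Models tuples as Lists; this is an assumption, not Pig's Tuple type. */
final class UnwrapSketch {

    /**
     * Returns the inner tuple when the input holds exactly one element and
     * that element is itself a tuple; otherwise returns the input unchanged
     * instead of attempting a cast that would fail.
     */
    static List<?> unwrapIfNested(List<?> tuple) {
        if (tuple.size() == 1 && tuple.get(0) instanceof List) {
            return (List<?>) tuple.get(0);
        }
        return tuple;
    }

    public static void main(String[] args) {
        List<?> inner = Arrays.asList(1, 2, 3);
        List<?> nested = Arrays.asList(inner);   // single element that is a tuple
        List<?> scalar = Arrays.asList(1.5f);    // single element that is not a tuple

        System.out.println(unwrapIfNested(nested));
        System.out.println(unwrapIfNested(scalar));
    }
}
```

The `instanceof` check is the whole point of the sketch: the unconditional cast described above is replaced by a guarded one.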
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034925#comment-13034925 ]

Daniel Dai commented on PIG-1890:

It seems it should call POProject.getNext(DataBag) instead. Projecting one item assumes the item already has the correct type and does not need conversion. The issue is likely caused by plan generation, which produces a wrong result type for POProject.
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13030845#comment-13030845 ]

Jakob Homan commented on PIG-1890:

@Ken - any update now that we're in a new week?
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027855#comment-13027855 ]

Olga Natkovich commented on PIG-1890:

Hi Jakob,

Are you planning to address the additional issue for 0.9 or should we delay this?
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027862#comment-13027862 ]

Ken Goodhope commented on PIG-1890:

I have been working on some fixes to AvroStorage already. I should be able to make sure this issue gets addressed in those fixes as well. I will have it done sometime this week.
[jira] Commented: (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13004906#comment-13004906 ]

Daniel Dai commented on PIG-1890:

PIG-1890-1.patch fixes the first issue. I have temporarily commented out all test cases in TestAvroStorage.

Fix piggybank unit test TestAvroStorage

Key: PIG-1890
URL: https://issues.apache.org/jira/browse/PIG-1890
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.9.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.9.0
Attachments: PIG-1890-1.patch

TestAvroStorage fails on trunk. There are two reasons:
1. After PIG-1680, we call LoadFunc.setLocation one more time.
2. The schema for AvroStorage seems to be wrong. For example, in the first test case, testArrayDefault, the schema for in is set to PIG_WRAPPER: (FIELD: {PIG_WRAPPER: (ARRAY_ELEM: float)}). It seems PIG_WRAPPER is redundant. This issue was hidden until PIG-1188 was checked in.