[jira] [Commented] (PIG-2153) POProject throws an error with tuples containing a single non-tuple field
[ https://issues.apache.org/jira/browse/PIG-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060298#comment-13060298 ]

Pradeep Kamath commented on PIG-2153:
-------------------------------------

I don't have full context, and given that I have not actively looked at Pig code in quite a while, my comments should be taken with a grain of salt. I am assuming POProject.getNext(Tuple) is being called because the schema (of load?) says that a tuple field should be projected. If that is indeed the case, then shouldn't the LoadFunc be returning a Tuple (with the bag in it)? The outer tuple that the LoadFunc returns simply represents a record and does not count. The types of the fields inside the outer tuple are the ones that matter in the schema, and if the schema says there is one field of type Tuple, then POProject would expect a type Tuple, so I am wondering if the cast is correct as it is.

Again, I have been out of touch with Pig for a good 8 months now, so my thinking above could be completely wrong :). Hopefully the more active Pig committers can confirm/refute my hypothesis.

> POProject throws an error with tuples containing a single non-tuple field
> -------------------------------------------------------------------------
>
>                 Key: PIG-2153
>                 URL: https://issues.apache.org/jira/browse/PIG-2153
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>            Reporter: Ken Goodhope
>
> When POProject.getNext(tuple) processes a tuple with one field, the field is
> pulled out. If that field is not a tuple, a cast exception is thrown. This
> is happening in the following block of code at line 401.
>     if(columns.size() == 1) {
>         try{
>             ret = inpValue.get(columns.get(0));
>         ...
>         res.result = (Tuple)ret;
> I am seeing this error in a unit test that is loading an array of floats.
> The LoadFunc is converting the array to a bag, and wrapping the bag in a tuple.
>
>     ({(3.3),(1.2),(5.6)})
> This results in POProject attempting to cast the bag to a tuple. Looking at
> the code, it appears that if I wrapped the previous tuple in another tuple,
> then it would work.
>     (({(3.3),(1.2),(5.6)}))
> In this case it would work because POProject would extract the first inner
> tuple and return it. But this would require the LoadFunc to check for tuples
> with a single non-tuple field and only wrap those.
> This could be fixed by first checking that the tuple does actually wrap
> another tuple.
>     if(columns.size() == 1 && inpValue.getType(0) == DataType.TUPLE)
>     {...
> I don't know the original intent of this code well enough to say whether this is the
> appropriate fix or not. Hoping someone with more Pig experience can help
> here. Right now this is preventing the unit tests in AvroStorage from
> working. I can change the unit test, but I think in this case the unit test
> is catching a real bug.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
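The failing cast described in the issue can be reproduced outside Pig with a minimal sketch. Everything below is an assumption made for illustration: `SketchTuple`, `SketchBag` and `ProjectSketch` are hypothetical stand-ins and not Pig's `Tuple`, `DataBag` or `POProject`, and the sketch only shows the type check from the proposed guard, not what the real fallback branch does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Pig tuple: an ordered list of fields.
final class SketchTuple {
    final List<Object> fields;
    SketchTuple(Object... fields) { this.fields = Arrays.asList(fields); }
    Object get(int i) { return fields.get(i); }
}

// Hypothetical stand-in for a Pig bag: a collection of tuples.
final class SketchBag {
    final List<SketchTuple> tuples;
    SketchBag(SketchTuple... tuples) { this.tuples = Arrays.asList(tuples); }
}

final class ProjectSketch {
    // Mirrors the unguarded cast quoted in the issue description.
    static SketchTuple projectUnguarded(SketchTuple input, int column) {
        Object ret = input.get(column);
        return (SketchTuple) ret; // throws ClassCastException when ret is a bag
    }

    // Mirrors the type check in the proposed guard: is field 0 itself a tuple?
    static boolean wrapsTuple(SketchTuple input) {
        return input.get(0) instanceof SketchTuple;
    }
}
```

With a record shaped like `({(3.3),(1.2),(5.6)})`, `projectUnguarded` throws, while the doubly wrapped `(({(3.3),(1.2),(5.6)}))` unwraps to the inner tuple, which matches the behaviour the reporter describes.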
[jira] [Commented] (PIG-2153) POProject throws an error with tuples containing a single non-tuple field
[ https://issues.apache.org/jira/browse/PIG-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060251#comment-13060251 ]

Ken Goodhope commented on PIG-2153:
-----------------------------------

I am the first to admit this is ugly, and if someone has a better idea I would be thrilled. I am currently running unit tests with this possible fix.

    if(columns.size() == 1 &&
            ((!overloaded && inpValue.getType(0) == DataType.TUPLE) ||
             (overloaded && inpValue.getType(0) == DataType.BAG))) {
        ...

My current thinking is that the previous fix broke so many unit tests because single-element tuples containing a databag are acceptable if overloaded is set. I will post the results of the tests when complete. This might fix the issue in ElephantBird, but I haven't had time to investigate that.
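The condition in this comment is easier to check as a truth table. The sketch below is a stdlib-only illustration under stated assumptions: `FieldKind` and `OverloadGuardSketch` are hypothetical names, and the enum stands in for the `DataType` codes used in the comment.

```java
// Hypothetical stand-in for the Pig type codes mentioned in the comment.
enum FieldKind { TUPLE, BAG, OTHER }

final class OverloadGuardSketch {
    // True when the single-column branch would be taken under the proposed condition.
    static boolean takesSingleColumnBranch(int columnCount, boolean overloaded, FieldKind first) {
        return columnCount == 1
            && ((!overloaded && first == FieldKind.TUPLE)
                || (overloaded && first == FieldKind.BAG));
    }
}
```

Only two combinations pass: a tuple field when `overloaded` is false, and a bag field when `overloaded` is true. Every other combination falls through to whatever the surrounding code does next, which the thread does not show.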
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060245#comment-13060245 ]

Dmitriy V. Ryaboy commented on PIG-1890:
----------------------------------------

Ken, adding all subdirs is how Hadoop + whatever patchset works, given the right value for mapred.input.dir.recursive.

Now, what version of Hadoop, I have no idea, but it's in there somewhere :). And since that's what people decided on, it probably behooves us to respect it.

But fixing that issue is a separate concern from what this ticket tries to address. We should open a ticket, though.

> Fix piggybank unit test TestAvroStorage
> ---------------------------------------
>
>                 Key: PIG-1890
>                 URL: https://issues.apache.org/jira/browse/PIG-1890
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.9.0
>            Reporter: Daniel Dai
>            Assignee: Jakob Homan
>         Attachments: PIG-1890-1.patch, PIG-1890-2.patch, PIG-1890-3.patch
>
> TestAvroStorage fails on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in the first test
> case, testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD:
> {PIG_WRAPPER: (ARRAY_ELEM: float)})". It seems PIG_WRAPPER is redundant. This
> issue was hidden until PIG-1188 was checked in.
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060241#comment-13060241 ]

Ken Goodhope commented on PIG-1890:
-----------------------------------

Dmitriy, when I inherited the code it was already doing the traversal in setLocation, and I didn't consider doing it in the InputFormat. To be honest, I am not crazy about adding all the subdirs by default, since this is inconsistent with the way a standard map-reduce job works. But our users expect this behavior, and have Pig jobs that depend on it. If the current patch works, I am inclined to leave it until I get time to do a better refactoring.
[jira] [Updated] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Goodhope updated PIG-1890:
------------------------------

    Attachment: PIG-1890-3.patch

There are places where we use addInputDir as a true add, not a set. Otherwise your solution would work. I did incorporate the use of a set in addAllSubDirs. Since the method name was no longer descriptive, I changed it to getAllSubDirs. This new patch passed unit tests, but currently there isn't a test for UNION. Let me know if this works.
[jira] [Updated] (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan updated PIG-1748:
-----------------------------

    Assignee: lin guo  (was: Jakob Homan)

> Add load/store function AvroStorage for avro data
> -------------------------------------------------
>
>                 Key: PIG-1748
>                 URL: https://issues.apache.org/jira/browse/PIG-1748
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: lin guo
>            Assignee: lin guo
>             Fix For: 0.9.0
>
>         Attachments: PIG-1748-2.patch, PIG-1748-3.patch, avro_storage.patch,
> avro_test_files.tar.gz
>
> We want to use Pig to process arbitrary Avro data and store results as Avro
> files. AvroStorage() extends two PigFuncs: LoadFunc and StoreFunc.
> Due to discrepancies of Avro and Pig data models, AvroStorage has:
> 1. Limited support for "record": we do not support recursively defined record
> because the number of fields in such records is data dependent.
> 2. Limited support for "union": we only accept nullable union like ["null",
> "some-type"].
> For simplicity, we also make the following assumptions:
> If the input directory is a leaf directory, then we assume Avro data files in
> it have the same schema;
> If the input directory contains sub-directories, then we assume Avro data
> files in all sub-directories have the same schema.
> AvroStorage takes no input parameters when used as a LoadFunc (except for
> "debug [debug-level]").
> Users can provide parameters to AvroStorage when used as a StoreFunc. If they
> don't, Avro schema of output data is derived from its
> Pig schema.
> Detailed documentation can be found in
> http://linkedin.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060218#comment-13060218 ]

Dmitriy V. Ryaboy commented on PIG-1890:
----------------------------------------

I've been a bit out of the loop on this -- you are doing your own directory traversal? You shouldn't need to do that in the Pig layer; this should be done in your InputFormat. I had to write a wrapper to emulate what MAPREDUCE-1501 does in Elephant-Bird, and I believe Pig does the same thing (but without caring about the mapred.input.dir.recursive config).

As for setLocation, yes. Making it idempotent is "fun". I am curious about this business with calling it with different files for the same instance for the same job. Patrick, can you show some debug output that has the sequence of calls?
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060213#comment-13060213 ]

Patrick Hunt commented on PIG-1890:
-----------------------------------

@ken (and @mads) thanks, I figured something like that. Could this possibly be an issue in Pig itself? I do see this

{noformat}
LoadFunc.setLocation:
     * This method will be called in the backend multiple times. Implementations
     * should bear in mind that this method is called multiple times and should
     * ensure there are no inconsistent side effects due to the multiple calls.
{noformat}

But what I'm seeing in this UNION case is that setLocation is being called multiple times on the same AvroStorage instance, for the same job, with different files. With the current AvroStorage code and pig-1890-2.patch applied, this results in the duplication: 2 files are added rather than one. My patch fixes this by only taking the most recent argument to setLocation, which is consistent with existing loader funcs, whereas AvroStorage keeps adding.

If you check the debugging output you'll see this (I might have added a bit more debugging to setLocation to capture this event...)

Regards.
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060190#comment-13060190 ]

Mads Moeller commented on PIG-1890:
-----------------------------------

Re-pasting addInputPaths.

{code}
    /**
     * get input paths to job config
     */
    public static boolean addInputPaths(String pathString, Job job)
        throws IOException
    {
        Set<Path> pathSet = new HashSet<Path>();
        if (addAllSubDirs(new Path(pathString), job, pathSet)) {
            Path[] paths = pathSet.toArray(new Path[pathSet.size()]);
            FileInputFormat.setInputPaths(job, paths);
            return true;
        }
        return false;
    }
{code}
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060178#comment-13060178 ]

Mads Moeller commented on PIG-1890:
-----------------------------------

Hi Ken,

I have the same use case as you and am encountering the same behavior as Patrick. I made a few modifications to the methods "addInputPaths" and "addAllSubDirs" from your patch, which seem to solve the UNION issue.

{code}
    public static boolean addInputPaths(String pathString, Job job)
        throws IOException
    {
        Set<Path> pathSet = new HashSet<Path>();
        if (addAllSubDirs(new Path(pathString), job, pathSet)) {
            Path[] paths = pathSet.toArray(new Path[pathSet.size()]);
            return true;
        }
        return false;
    }

    /**
     * Adds all non-hidden directories and subdirectories to the paths set
     *
     * @throws IOException
     */
    private static boolean addAllSubDirs(Path path, Job job, Set<Path> paths)
        throws IOException
    {
        FileSystem fs = FileSystem.get(job.getConfiguration());
        if (PATH_FILTER.accept(path)) {
            try {
                FileStatus file = fs.getFileStatus(path);
                if (file.isDir()) {
                    for (FileStatus sub : fs.listStatus(path)) {
                        addAllSubDirs(sub.getPath(), job, paths);
                    }
                } else {
                    AvroStorageLog.details("Add input file:" + file);
                    paths.add(file.getPath());
                }
            } catch (FileNotFoundException e) {
                AvroStorageLog.details("Input path does not exist: " + path);
                return false;
            }
            return true;
        }
        return false;
    }
{code}
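The traversal idea in this comment, collecting every non-hidden file under a path into a set so that repeated calls cannot add the same file twice, can be sketched with only the JDK. This is an illustration under stated assumptions, not the patch itself: it uses `java.nio.file` instead of Hadoop's `FileSystem`, `SubDirSketch` is a hypothetical name, and the filter (skip names starting with `_` or `.`) is a guess, since the thread does not show how `PATH_FILTER` is defined.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

final class SubDirSketch {
    // Assumed filter: treat names starting with '_' or '.' as hidden.
    static boolean accept(Path p) {
        Path fileName = p.getFileName();
        if (fileName == null) {
            return true; // a filesystem root has no name component
        }
        String name = fileName.toString();
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Recursively adds every accepted regular file under 'path' to 'out'.
    // Returns false when the path is filtered out or does not exist.
    static boolean collectFiles(Path path, Set<Path> out) throws IOException {
        if (!accept(path) || !Files.exists(path)) {
            return false;
        }
        if (Files.isDirectory(path)) {
            try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                for (Path child : children) {
                    collectFiles(child, out); // child result is ignored, as in the pasted code
                }
            }
        } else {
            out.add(path);
        }
        return true;
    }
}
```

Because the results go into a `Set`, calling `collectFiles` twice on the same root leaves the set unchanged, which is the property that matters when setLocation is invoked more than once.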
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060165#comment-13060165 ]

Ken Goodhope commented on PIG-1890:
-----------------------------------

Hi Patrick, for our purposes we need setLocation to add all sub-directories, including directories more than 2 levels deep. A common use case for us is to have directories organized by time, /MM/dd/hh/mm. In that case, if you want to load all the data from a particular month, then you need to add all the subdirs. You're right that a UNION can accomplish this, but it can be painful to add the directories that way. I will take a look at why this is still breaking in your case.
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060114#comment-13060114 ]

Patrick Hunt commented on PIG-1890:
-----------------------------------

Hi, I'm seeing an issue with both versions of the attached patches when I run the following:

{noformat}
REGISTER avro-1.4.1.jar;
REGISTER json-simple-1.1.jar;
REGISTER piggybank.jar;
A = LOAD 'input_123.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
B = LOAD 'input_789.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
C = UNION A, B;
DUMP C;
{noformat}

where each file contains a single tuple; input_123.avro contains "1,2,3" (ints) and input_789.avro contains "7,8,9".

DUMP C should be returning 2 tuples: one tuple 1,2,3 and one tuple 7,8,9.

Without the patch I see 6 tuples output (three of 1,2,3 and three of 7,8,9).

With either of the proposed patches applied I see 4 tuples output (two of 1,2,3 and two of 7,8,9).

From looking at other Pig loader functions, it seems like the following would address the setLocation issue:

{noformat}
     public void setLocation(String location, Job job) throws IOException {
-        if(AvroStorageUtils.addInputPaths(location, job) && inputAvroSchema == null) {
-            inputAvroSchema = getAvroSchema(location, job);
-        }
+        FileInputFormat.setInputPaths(job, location);
+        inputAvroSchema = getAvroSchema(location, job);
     }
{noformat}

This does resolve the issue for the script I described. However, the "addInputPaths" functionality of AvroStorageUtils is lost. I'm wondering why this was added rather than just relying on the standard capabilities of LOAD (such as globbing)?

I'd be happy to package up my suggestion as a patch if there's interest.
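The difference between "add" and "set" semantics that this comment turns on can be shown with a small stdlib-only sketch. `FakeJob` and `LocationSketch` are hypothetical stand-ins invented for illustration. They model a job's input-path list as a plain `List<String>` and do not call Hadoop's `FileInputFormat`. The sketch also assumes, as the thread suggests but does not prove, that both calls reach the same job state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the job configuration's list of input paths.
final class FakeJob {
    final List<String> inputPaths = new ArrayList<>();
}

final class LocationSketch {
    // "Add" semantics: every call appends, so earlier locations linger.
    static void setLocationAdding(String location, FakeJob job) {
        job.inputPaths.add(location);
    }

    // "Set" semantics: every call replaces, so only the latest location remains.
    static void setLocationReplacing(String location, FakeJob job) {
        job.inputPaths.clear();
        job.inputPaths.add(location);
    }
}
```

Calling the adding variant with `input_123.avro` and then `input_789.avro` leaves both paths configured, so a loader meant to read one file would read two. The replacing variant leaves only the most recent path, which is the behaviour the suggested diff aims for.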
[jira] [Commented] (PIG-2153) POProject throws an error with tuples containing a single non-tuple field
[ https://issues.apache.org/jira/browse/PIG-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060062#comment-13060062 ]

Ken Goodhope commented on PIG-2153:
-----------------------------------

It looks like this code was last touched for PIG-1369, by Pradeep Kamath.
[jira] [Commented] (PIG-2153) POProject throws an error with tuples containing a single non-tuple field
[ https://issues.apache.org/jira/browse/PIG-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060034#comment-13060034 ]

Ken Goodhope commented on PIG-2153:
-----------------------------------

I ran unit tests with the change I recommended in the description. The good news is that several tests that failed before now pass. They are listed below.

org.apache.pig.test.TestBestFitCast
org.apache.pig.test.TestDataBagAccess
org.apache.pig.test.TestGrunt
org.apache.pig.test.TestImplicitSplit
org.apache.pig.test.TestMapSideCogroup
org.apache.pig.test.TestPigRunner
org.apache.pig.test.TestPigSplit
org.apache.pig.test.TestScriptUDF

The bad news is that several tests that were passing now fail.

org.apache.pig.test.TestBuiltin
org.apache.pig.test.TestCollectedGroup
org.apache.pig.test.TestCombiner
org.apache.pig.test.TestCommit
org.apache.pig.test.TestEvalPipeline2
org.apache.pig.test.TestEvalPipelineLocal
org.apache.pig.test.TestFRJoin2
org.apache.pig.test.TestFilter
org.apache.pig.test.TestForEach
org.apache.pig.test.TestForEachNestedPlanLocal
org.apache.pig.test.TestJoin
org.apache.pig.test.TestJoinSmoke
org.apache.pig.test.TestLimitAdjuster
org.apache.pig.test.TestLocalRearrange
org.apache.pig.test.TestNativeMapReduce
org.apache.pig.test.TestNewPlanImplicitSplit
org.apache.pig.test.TestProject
org.apache.pig.test.TestStore
org.apache.pig.test.TestStoreInstances
org.apache.pig.test.TestUnionOnSchema

Obviously, more tests break (20) than get fixed (8).
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059958#comment-13059958 ]

Dmitriy V. Ryaboy commented on PIG-1890:
----------------------------------------

Marked PIG-2153 as a blocker to this.

I have a feeling that ticket is also blocking EB issue 60: https://github.com/kevinweil/elephant-bird/issues/60