[jira] [Commented] (PIG-2153) POProject throws an error with tuples containing a single non-tuple field

2011-07-05 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060298#comment-13060298
 ] 

Pradeep Kamath commented on PIG-2153:
-

I don't have full context and given that I have not actively looked at Pig code 
in quite a while, my comments should be taken with a grain of salt. I am 
assuming POProject.getNext(Tuple) is being called because the schema (of load?) 
says that a tuple field should be projected. If that is indeed the case, then 
shouldn't the LoadFunc be returning a Tuple (with the bag in it)? The outer 
tuple that the LoadFunc returns simply represents a record and does not count; 
the types of the fields inside the outer tuple are the ones that matter in the 
schema. If the schema says there is one field of type Tuple, then POProject 
would expect a type Tuple, so I am wondering if the cast is correct as it is.

Again, I have been out of touch with Pig for a good 8 months now - so my 
thinking above could be completely wrong :) - hopefully the more active Pig 
committers can confirm/refute my hypothesis.
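
For readers following along, the two record shapes discussed in this issue can be reproduced with a minimal sketch. The Tuple and Bag classes below are hypothetical stand-ins built on plain Java collections, not Pig's own classes, and projectSingleColumn only imitates the unconditional cast quoted in the issue description; it is an illustration under those assumptions, not the POProject code.

```java
import java.util.Arrays;
import java.util.List;

final class CastSketch {
    /** Hypothetical stand-in for a Pig tuple: an ordered list of fields. */
    static final class Tuple {
        final List<Object> fields;
        Tuple(Object... fields) { this.fields = Arrays.asList(fields); }
        Object get(int i) { return fields.get(i); }
    }

    /** Hypothetical stand-in for a Pig bag: a collection of tuples. */
    static final class Bag {
        final List<Tuple> tuples;
        Bag(Tuple... tuples) { this.tuples = Arrays.asList(tuples); }
    }

    /** Imitates the unconditional cast from the issue description. */
    static Tuple projectSingleColumn(Tuple input) {
        Object ret = input.get(0);
        return (Tuple) ret; // ClassCastException when the field is a Bag
    }

    /** Returns true if projecting the single field succeeds. */
    static boolean projects(Tuple input) {
        try {
            projectSingleColumn(input);
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Bag floats = new Bag(new Tuple(3.3f), new Tuple(1.2f), new Tuple(5.6f));
        Tuple record = new Tuple(floats);             // ({(3.3),(1.2),(5.6)})
        Tuple wrapped = new Tuple(new Tuple(floats)); // (({(3.3),(1.2),(5.6)}))
        System.out.println(projects(record));  // false: the bag cannot be cast
        System.out.println(projects(wrapped)); // true: the inner tuple is returned
    }
}
```

Whether the loader or the cast should change is exactly the open question in this thread; the sketch only shows why the two shapes behave differently.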

> POProject throws an error with tuples containing a single non-tuple field
> -
>
> Key: PIG-2153
> URL: https://issues.apache.org/jira/browse/PIG-2153
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.1
> Reporter: Ken Goodhope
>
> When POProject.getNext(tuple) processes a tuple with one field, the field is 
> pulled out.  If that field is not a tuple, a cast exception is thrown.  This 
> is happening in the following block of code at line 401.
>     if(columns.size() == 1) {
>         try{
>             ret = inpValue.get(columns.get(0));
>         ...
>         res.result = (Tuple)ret;
> I am seeing this error in a unit test that is loading an array of floats.  
> The LoadFunc is converting the array to a bag, and wrapping the bag in a tuple.
>     ({(3.3),(1.2),(5.6)})
> This results in POProject attempting to cast the bag to a tuple.  Looking at 
> the code, it appears that if I wrapped the previous tuple in another tuple, 
> then it would work.
>     (({(3.3),(1.2),(5.6)}))
> In this case it would work because POProject would extract the first inner 
> tuple and return it.  But this would require the LoadFunc to check for tuples 
> with a single non-tuple field and only wrap those.
> This could be fixed by first checking that the tuple does actually wrap 
> another tuple.
>     if(columns.size() == 1 && inpValue.getType(0) == DataType.TUPLE) {...
> I don't know the original intent of this code well enough to say this is the 
> appropriate fix or not.  Hoping someone with more Pig experience can help 
> here.  Right now this is preventing the unit tests in AvroStorage from 
> working.  I can change the unit test, but I think in this case the unit test 
> is catching a real bug.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (PIG-2153) POProject throws an error with tuples containing a single non-tuple field

2011-07-05 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060251#comment-13060251
 ] 

Ken Goodhope commented on PIG-2153:
---

I am the first to admit this is ugly, and if someone has a better idea I would 
be thrilled.  I am currently running unit tests with this possible fix.

if(columns.size() == 1 && ((!overloaded && inpValue.getType(0) == 
DataType.TUPLE) || (overloaded && inpValue.getType(0) == DataType.BAG))) {
...

My current thinking is that the previous fix broke so many unit tests because 
single-element tuples containing a DataBag are acceptable if overloaded is set. 
 I will post the results of the tests when complete.

This might fix the issue in ElephantBird, but I haven't had time to investigate 
that.
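
The condition above can be read as a small truth table. The sketch below restates it as a standalone predicate so the cases are easy to check; FieldType is a hypothetical stand-in for the DataType.TUPLE and DataType.BAG tags, and mayUnwrap is an illustrative name, not a method in POProject.

```java
/** Hypothetical stand-in for the type tags compared in the condition above. */
enum FieldType { TUPLE, BAG, OTHER }

final class ProjectGuardSketch {
    /**
     * True when the single projected field may be unwrapped: a tuple in the
     * normal case, a bag when the operator is overloaded.
     */
    static boolean mayUnwrap(int columnCount, boolean overloaded, FieldType type) {
        if (columnCount != 1) {
            return false;
        }
        // Same logic as (!overloaded && TUPLE) || (overloaded && BAG).
        return overloaded ? type == FieldType.BAG : type == FieldType.TUPLE;
    }

    public static void main(String[] args) {
        System.out.println(mayUnwrap(1, false, FieldType.TUPLE)); // true
        System.out.println(mayUnwrap(1, false, FieldType.BAG));   // false
        System.out.println(mayUnwrap(1, true, FieldType.BAG));    // true
        System.out.println(mayUnwrap(1, true, FieldType.TUPLE));  // false
    }
}
```

Written this way, it is easier to see that a single scalar field (FieldType.OTHER) is never unwrapped, with or without the overloaded flag.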





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060245#comment-13060245
 ] 

Dmitriy V. Ryaboy commented on PIG-1890:


Ken, adding all subdirs is how Hadoop + whatever patchset works, given the 
right value for mapred.input.dir.recursive

Now, what version of Hadoop, I have no idea, but it's in there somewhere :). 
And since that's what people decided on it probably behooves us to respect it. 
But fixing that issue is a separate concern from what this ticket tries to 
address. We should open a ticket, though.

> Fix piggybank unit test TestAvroStorage
> ---
>
> Key: PIG-1890
> URL: https://issues.apache.org/jira/browse/PIG-1890
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Daniel Dai
>Assignee: Jakob Homan
> Attachments: PIG-1890-1.patch, PIG-1890-2.patch, PIG-1890-3.patch
>
>
> TestAvroStorage fail on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in first test 
> case testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD: 
> {PIG_WRAPPER: (ARRAY_ELEM: float)})". It seems PIG_WRAPPER is redundant. This 
> issue is hidden until PIG-1188 checked in.





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060241#comment-13060241
 ] 

Ken Goodhope commented on PIG-1890:
---

Dmitriy, when I inherited the code it was already doing the traversal in 
setLocation, and I didn't consider doing it in the InputFormat.  To be honest, I 
am not crazy about adding all the subdirs by default, since this is 
inconsistent with the way a standard map-reduce job works.  But our users 
expect this behavior, and have Pig jobs that depend on it.

If the current patch works, I am inclined to leave it until I get time to do a 
better refactoring.





[jira] [Updated] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Ken Goodhope (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Goodhope updated PIG-1890:
--

Attachment: PIG-1890-3.patch

There are places where we use addInputDir as a true add, not a set.  Otherwise 
your solution would work.  I did incorporate the use of a Set in 
addAllSubDirs.  Since the method name was no longer descriptive, I changed it 
to getAllSubDirs.  This new patch passed unit tests, but currently there isn't 
a test for UNION.  Let me know if this works.





[jira] [Updated] (PIG-1748) Add load/store function AvroStorage for avro data

2011-07-05 Thread Jakob Homan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Homan updated PIG-1748:
-

Assignee: lin guo  (was: Jakob Homan)

> Add load/store function AvroStorage for avro data
> -
>
> Key: PIG-1748
> URL: https://issues.apache.org/jira/browse/PIG-1748
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: lin guo
>Assignee: lin guo
> Fix For: 0.9.0
>
> Attachments: PIG-1748-2.patch, PIG-1748-3.patch, avro_storage.patch, 
> avro_test_files.tar.gz
>
>
> We want to use Pig to process arbitrary Avro data and store results as Avro 
> files. AvroStorage() extends two PigFuncs: LoadFunc and StoreFunc. 
> Due to discrepancies of Avro and Pig data models, AvroStorage has:
> 1. Limited support for "record": we do not support recursively defined record 
> because the number of fields in such records is data dependent.
> 2. Limited support for "union": we only accept nullable union like ["null", 
> "some-type"].
> For simplicity, we also make the following assumptions:
> If the input directory is a leaf directory, then we assume Avro data files in 
> it have the same schema;
> If the input directory contains sub-directories, then we assume Avro data 
> files in all sub-directories have the same schema.
> AvroStorage takes no input parameters when used as a LoadFunc (except for 
> "debug [debug-level]"). 
> Users can provide parameters to AvroStorage when used as a StoreFunc. If they 
> don't, Avro schema of output data is derived from its 
> Pig schema.
> Detailed documentation can be found in 
> http://linkedin.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060218#comment-13060218
 ] 

Dmitriy V. Ryaboy commented on PIG-1890:


I've been a bit out of the loop on this -- you are doing your own directory 
traversal? You shouldn't need to do that in the Pig layer, this should be done 
in your InputFormat. I had to write a wrapper to emulate what MAPREDUCE-1501 
does in Elephant-Bird, and I believe Pig does the same thing (but without 
caring about the mapred.input.dir.recursive config).

As for setLocation, yes. Making it idempotent is "fun".  

I am curious about this business with calling it with different files for the 
same instance for the same job. Patrick, can you show some debug output that 
has the sequence of calls? 





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060213#comment-13060213
 ] 

Patrick Hunt commented on PIG-1890:
---

@ken (and @mads) thanks, I figured something like that. Could this possibly be 
an issue in pig itself? I do see this

{noformat}
LoadFunc.setLocation:
 * This method will be called in the backend multiple times. Implementations
 * should bear in mind that this method is called multiple times and should
 * ensure there are no inconsistent side effects due to the multiple calls.
{noformat}

But what I'm seeing in this UNION case is that setLocation is being called 
multiple times on the same AvroStorage instance, for the same job, with 
different files. This results (current avrostorage code with pig-1890-2.patch 
applied) in the duplication - 2 files are added rather than one (my patch fixes 
this by only taking the most recent argument to setLocation, which is 
consistent with existing loader funcs, whereas avrostorage keeps adding). If 
you check the debugging output you'll see this (I might have added a bit more 
debugging to setLocation to capture this event...)
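
To illustrate the difference between the two behaviours, here is a minimal sketch in which setLocation is called twice on the same job with different files. JobInputs and both setLocation variants are hypothetical stand-ins written for this illustration; they do not use the Hadoop or Pig APIs and only model "append" versus "replace" semantics.

```java
import java.util.ArrayList;
import java.util.List;

final class SetLocationSketch {
    /** Hypothetical stand-in for the input paths stored in a job's config. */
    static final class JobInputs {
        final List<String> paths = new ArrayList<>();
        void add(String path) { paths.add(path); }
        void set(String path) { paths.clear(); paths.add(path); }
    }

    /** Accumulating variant: every call appends to what is already there. */
    static void setLocationAdding(String location, JobInputs job) {
        job.add(location);
    }

    /** Replacing variant: only the most recent location survives. */
    static void setLocationReplacing(String location, JobInputs job) {
        job.set(location);
    }

    public static void main(String[] args) {
        JobInputs adding = new JobInputs();
        setLocationAdding("input_123.avro", adding);
        setLocationAdding("input_789.avro", adding);
        System.out.println(adding.paths); // [input_123.avro, input_789.avro]

        JobInputs replacing = new JobInputs();
        setLocationReplacing("input_123.avro", replacing);
        setLocationReplacing("input_789.avro", replacing);
        System.out.println(replacing.paths); // [input_789.avro]
    }
}
```

With the accumulating variant, the second call leaves both files configured as input, which is consistent with the duplicated tuples seen in the UNION case.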

Regards.





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Mads Moeller (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060190#comment-13060190
 ] 

Mads Moeller commented on PIG-1890:
---

Re-pasting addInputPaths. 

{code}
/**
 * get input paths to job config
 */
public static boolean addInputPaths(String pathString, Job job)
        throws IOException {

    // Collect into a set so that a repeated path is added only once.
    Set<Path> pathSet = new HashSet<Path>();

    if (addAllSubDirs(new Path(pathString), job, pathSet)) {
        Path[] paths = pathSet.toArray(new Path[pathSet.size()]);

        // Replace, rather than append to, the job's input paths.
        FileInputFormat.setInputPaths(job, paths);
        return true;
    }
    return false;
}
{code}





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Mads Moeller (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060178#comment-13060178
 ] 

Mads Moeller commented on PIG-1890:
---

Hi Ken,

I have the same use case as you and am encountering the same behavior as 
Patrick. I made a few modifications to the methods "addInputPaths" and 
"addAllSubDirs" from your patch, which seem to solve the UNION issue. 

{code}
public static boolean addInputPaths(String pathString, Job job)
        throws IOException {

    Set<Path> pathSet = new HashSet<Path>();

    if (addAllSubDirs(new Path(pathString), job, pathSet)) {
        Path[] paths = pathSet.toArray(new Path[pathSet.size()]);

        return true;
    }
    return false;
}

/**
 * Adds all non-hidden directories and subdirectories to the paths set
 * 
 * @throws IOException
 */
private static boolean addAllSubDirs(Path path, Job job, Set<Path> paths)
        throws IOException {
    FileSystem fs = FileSystem.get(job.getConfiguration());

    if (PATH_FILTER.accept(path)) {
        try {
            FileStatus file = fs.getFileStatus(path);
            if (file.isDir()) {
                for (FileStatus sub : fs.listStatus(path)) {
                    addAllSubDirs(sub.getPath(), job, paths);
                }
            } else {
                AvroStorageLog.details("Add input file:" + file);
                paths.add(file.getPath());
            }
        } catch (FileNotFoundException e) {
            AvroStorageLog.details("Input path does not exist: " + path);
            return false;
        }
        return true;
    }
    return false;
}
{code}
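
The traversal idea can be tried outside Hadoop with a minimal sketch that uses java.nio.file in place of the Hadoop FileSystem API. Everything here is an assumption made for illustration: SubDirSketch, collectFiles and demoCount are hypothetical names, and the hidden-path rule (names starting with '.' or '_') is a guess at what PATH_FILTER does, since the filter itself is not shown in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

final class SubDirSketch {
    /** Assumed hidden-path rule: names starting with '.' or '_' are skipped. */
    static boolean accepted(Path path) {
        Path name = path.getFileName();
        if (name == null) {
            return true; // a filesystem root has no name to test
        }
        String s = name.toString();
        return !s.startsWith(".") && !s.startsWith("_");
    }

    /** Adds every non-hidden file under path to out; false if hidden or missing. */
    static boolean collectFiles(Path path, Set<Path> out) throws IOException {
        if (!accepted(path) || !Files.exists(path)) {
            return false;
        }
        if (Files.isDirectory(path)) {
            try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                for (Path child : children) {
                    collectFiles(child, out); // recurse to any depth
                }
            }
        } else {
            out.add(path);
        }
        return true;
    }

    /** Builds a small nested tree in a temp directory and counts the files found. */
    static int demoCount() {
        try {
            Path root = Files.createTempDirectory("avro-sketch");
            Files.createDirectories(root.resolve("07").resolve("05"));
            Files.createFile(root.resolve("07").resolve("05").resolve("part-0.avro"));
            Files.createFile(root.resolve("07").resolve("part-1.avro"));
            Files.createFile(root.resolve("07").resolve("_logs"));
            Set<Path> found = new TreeSet<>();
            collectFiles(root, found);
            collectFiles(root, found); // a second pass adds nothing new to a set
            return found.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoCount()); // 2
    }
}
```

Because the results go into a set, running the traversal twice over the same location does not duplicate entries, which is the property the modified methods rely on.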





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060165#comment-13060165
 ] 

Ken Goodhope commented on PIG-1890:
---

Hi Patrick, for our purposes we need setLocation to add all sub-directories, 
including directories more than 2 levels deep.  A common use case for us is to 
have directories organized by time, /MM/dd/hh/mm.  In that case, if you want 
to load all the data from a particular month, then you need to add all the 
subdirs.  You're right that a UNION can accomplish this, but it can be painful to 
add the directories that way.  I will take a look at why this is still breaking 
in your case.







[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060114#comment-13060114
 ] 

Patrick Hunt commented on PIG-1890:
---

Hi, I'm seeing an issue with both versions of the attached patches when I run 
the following:

{noformat}
REGISTER avro-1.4.1.jar; 
REGISTER json-simple-1.1.jar; 
REGISTER piggybank.jar;

A = LOAD 'input_123.avro' USING 
org.apache.pig.piggybank.storage.avro.AvroStorage();

B = LOAD 'input_789.avro' USING 
org.apache.pig.piggybank.storage.avro.AvroStorage();

C = UNION A, B; 
DUMP C;
{noformat}

where each file contains a single tuple; input_123.avro contains "1,2,3" (ints) 
and input_789.avro contains "7,8,9".
DUMP C should return 2 tuples: one tuple 1,2,3 and one tuple 7,8,9.

Without the patch I see 6 tuples output (three copies of 1,2,3 and three of 
7,8,9).
With either of the proposed patches applied I see 4 tuples output (two copies 
of 1,2,3 and two of 7,8,9).

From looking at other Pig loader functions, it seems like the following would 
address the setLocation issue:

{noformat}
 public void setLocation(String location, Job job) throws IOException {
-    if(AvroStorageUtils.addInputPaths(location, job) && inputAvroSchema == null) {
-        inputAvroSchema = getAvroSchema(location, job);
-    }
+    FileInputFormat.setInputPaths(job, location);
+    inputAvroSchema = getAvroSchema(location, job);
 }
{noformat}

This does resolve the issue for the script I described. However the 
"addInputPaths" functionality of AvroStorageUtils is lost - but I'm wondering 
why this was added rather than just rely on the std capabilities of LOAD? (such 
as globbing).


I'd be happy to package up my suggestion as a patch if there's interest.






[jira] [Commented] (PIG-2153) POProject throws an error with tuples containing a single non-tuple field

2011-07-05 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060062#comment-13060062
 ] 

Ken Goodhope commented on PIG-2153:
---

It looks like the last time this code was touched it was for PIG-1369 by 
Pradeep Kamath.





[jira] [Commented] (PIG-2153) POProject throws an error with tuples containing a single non-tuple field

2011-07-05 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060034#comment-13060034
 ] 

Ken Goodhope commented on PIG-2153:
---

I ran unit tests with the change I recommended in the description.  The good 
news is that several tests that failed before now pass; they are listed below.
org.apache.pig.test.TestBestFitCast 
org.apache.pig.test.TestDataBagAccess 
org.apache.pig.test.TestGrunt 
org.apache.pig.test.TestImplicitSplit 
org.apache.pig.test.TestMapSideCogroup 
org.apache.pig.test.TestPigRunner 
org.apache.pig.test.TestPigSplit 
org.apache.pig.test.TestScriptUDF 

The bad news is several tests that were working now fail.
org.apache.pig.test.TestBuiltin 
org.apache.pig.test.TestCollectedGroup 
org.apache.pig.test.TestCombiner 
org.apache.pig.test.TestCommit 
org.apache.pig.test.TestEvalPipeline2 
org.apache.pig.test.TestEvalPipelineLocal 
org.apache.pig.test.TestFRJoin2 
org.apache.pig.test.TestFilter 
org.apache.pig.test.TestForEach 
org.apache.pig.test.TestForEachNestedPlanLocal 
org.apache.pig.test.TestJoin 
org.apache.pig.test.TestJoinSmoke 
org.apache.pig.test.TestLimitAdjuster 
org.apache.pig.test.TestLocalRearrange 
org.apache.pig.test.TestNativeMapReduce 
org.apache.pig.test.TestNewPlanImplicitSplit 
org.apache.pig.test.TestProject 
org.apache.pig.test.TestStore 
org.apache.pig.test.TestStoreInstances 
org.apache.pig.test.TestUnionOnSchema 

Obviously, there are more tests that break than get fixed.  





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059958#comment-13059958
 ] 

Dmitriy V. Ryaboy commented on PIG-1890:


Marked PIG-2153 as a blocker to this.

I have a feeling that ticket is also blocking EB issue 60 
https://github.com/kevinweil/elephant-bird/issues/60 
