[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-11 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063637#comment-13063637
 ] 

Daniel Dai commented on PIG-1890:
-

All Avro unit tests pass now, and test-patch returns all +1. We no longer get a 
double PIG_WRAPPER, and the schema generated by AvroStorage looks good to me. 
Thanks, guys, for your hard work!

> Fix piggybank unit test TestAvroStorage
> ---
>
> Key: PIG-1890
> URL: https://issues.apache.org/jira/browse/PIG-1890
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Daniel Dai
>Assignee: Jakob Homan
>  Labels: patch
> Attachments: PIG-1890-1.patch, PIG-1890-2.patch, PIG-1890-3.patch, 
> PIG-1890-4.patch, pig_setloc_avro.txt
>
>
> TestAvroStorage fails on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in the first test 
> case, testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD: 
> {PIG_WRAPPER: (ARRAY_ELEM: float)})". It seems PIG_WRAPPER is redundant. This 
> issue was hidden until PIG-1188 was checked in.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-11 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063633#comment-13063633
 ] 

Patrick Hunt commented on PIG-1890:
---

I tested PIG-1890-4.patch against trunk using the UNION example and it 
generated expected (i.e. correct) results.



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-09 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062664#comment-13062664
 ] 

Ken Goodhope commented on PIG-1890:
---

Removing the blocker for PIG-2153.  Turns out the problem, as first asserted, 
was in AvroStorage.  The new logical plan must handle implicit wrapping tuples 
differently than used to be the case.  In order to make this work, I removed 
the wrapping tuple from the schema produced by getSchema.  getNext still 
returns its result in the wrapping tuple.  I also had to modify putNext to 
expect a pig schema without the implicit wrapping tuple.



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-06 Thread Mads Moeller (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060767#comment-13060767
 ] 

Mads Moeller commented on PIG-1890:
---

Hi Ken,

With the latest patch the UNION behaves as expected for me.


Thanks,
Mads



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-06 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060762#comment-13060762
 ] 

Ken Goodhope commented on PIG-1890:
---

A recent change in Pig causes setLocation to be called twice, and if 
setLocation isn't idempotent, then you get twice the output.  My suspicion is 
that UNION is further exacerbating the problem, leading to the input being added 
4X.  Did you still see the problem with the last patch I added?
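
To illustrate the idempotency point, here is a minimal sketch that does not use 
the real Hadoop or Pig classes. FakeJob and both setLocation variants are 
hypothetical stand-ins invented for this example; they only model the 
difference between appending to a job's input paths and replacing them.

```java
import java.util.ArrayList;
import java.util.List;

public class SetLocationSketch {
    /** Hypothetical stand-in for the input-path part of a job configuration. */
    static class FakeJob {
        final List<String> inputPaths = new ArrayList<>();
    }

    /** Non-idempotent: every call appends, so a repeated call duplicates the input. */
    static void setLocationAppending(String location, FakeJob job) {
        job.inputPaths.add(location);
    }

    /** Idempotent: every call replaces the path list, so repeated calls are harmless. */
    static void setLocationReplacing(String location, FakeJob job) {
        job.inputPaths.clear();
        job.inputPaths.add(location);
    }

    public static void main(String[] args) {
        FakeJob appending = new FakeJob();
        setLocationAppending("input_123.avro", appending);
        setLocationAppending("input_123.avro", appending); // second call by the framework
        System.out.println(appending.inputPaths.size()); // 2: the file would be read twice

        FakeJob replacing = new FakeJob();
        setLocationReplacing("input_123.avro", replacing);
        setLocationReplacing("input_123.avro", replacing);
        System.out.println(replacing.inputPaths.size()); // 1
    }
}
```

With the appending variant, each extra call adds another copy of the same 
input, which matches the doubled output described in this comment.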



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-06 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060700#comment-13060700
 ] 

Patrick Hunt commented on PIG-1890:
---

@Dmitriy thanks.

bq. Patrick, can you show some debug output that has the sequence of calls?

Sure, I didn't save the original so I re-ran it, see attached 
(pig_setloc_avro.txt) for full details using the UNION example (this is with 
current trunk - notice that there are 6 tuples output rather than 2). I 
mis-remembered one detail - it's calling setLoc for the same job, with 
different files, but _different_ AvroStorage objects. (see first two lines of 
setLocation debug message). 

Why are there 8 AvroStorage objects being created, shouldn't there just be 2, 
one for loading each of the two input files?



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060245#comment-13060245
 ] 

Dmitriy V. Ryaboy commented on PIG-1890:


Ken, adding all subdirs is how Hadoop + whatever patchset works, given the 
right value for mapred.input.dir.recursive

Now, what version of Hadoop, I have no idea, but it's in there somewhere :). 
And since that's what people decided on it probably behooves us to respect it. 
But fixing that issue is a separate concern from what this ticket tries to 
address. We should open a ticket, though.



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060241#comment-13060241
 ] 

Ken Goodhope commented on PIG-1890:
---

Dmitriy, when I inherited the code it was already doing the traversal in 
setLocation, and I didn't consider doing it in the InputFormat.  To be honest, I 
am not crazy about adding all the subdirs by default, since this is 
inconsistent with the way a standard map-reduce job works.  But our users 
expect this behavior, and have pig jobs that depend on it.

If the current patch works, I am inclined to leave it until I get time to do a 
better refactoring.



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060218#comment-13060218
 ] 

Dmitriy V. Ryaboy commented on PIG-1890:


I've been a bit out of the loop on this -- you are doing your own directory 
traversal? You shouldn't need to do that in the Pig layer, this should be done 
in your InputFormat. I had to write a wrapper to emulate what MAPREDUCE-1501 
does in Elephant-Bird, and I believe Pig does the same thing (but without 
caring about the mapred.input.dir.recursive config).

As for setLocation, yes. Making it idempotent is "fun".  

I am curious about this business with calling it with different files for the 
same instance for the same job. Patrick, can you show some debug output that 
has the sequence of calls? 



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060213#comment-13060213
 ] 

Patrick Hunt commented on PIG-1890:
---

@ken (and @mads) thanks, I figured something like that. Could this possibly be 
an issue in pig itself? I do see this

{noformat}
LoadFunc.setLocation:
 * This method will be called in the backend multiple times. Implementations
 * should bear in mind that this method is called multiple times and should
 * ensure there are no inconsistent side effects due to the multiple calls.
{noformat}

But what I'm seeing in this UNION case is that setLocation is being called 
multiple times on the same AvroStorage instance, for the same job, with 
different files. This results (current avrostorage code with pig-1890-2.patch 
applied) in the duplication - 2 files are added rather than one (my patch fixes 
this by only taking the most recent argument to setLocation, which is 
consistent with existing loader funcs, whereas avrostorage keeps adding). If 
you check the debugging output you'll see this (I might have added a bit more 
debugging to setLocation to capture this event...)

Regards.



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Mads Moeller (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060190#comment-13060190
 ] 

Mads Moeller commented on PIG-1890:
---

Re-pasting addInputPaths. 

{code}
/**
 * Sets the job's input paths from pathString.
 */
public static boolean addInputPaths(String pathString, Job job)
        throws IOException {

    // Collect the paths in a set so each file appears only once.
    Set<Path> pathSet = new HashSet<Path>();

    if (addAllSubDirs(new Path(pathString), job, pathSet)) {
        Path[] paths = pathSet.toArray(new Path[pathSet.size()]);

        // setInputPaths replaces any previously configured input paths.
        FileInputFormat.setInputPaths(job, paths);
        return true;
    }
    return false;
}
{code}



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Mads Moeller (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060178#comment-13060178
 ] 

Mads Moeller commented on PIG-1890:
---

Hi Ken,

I have the same use case as you and am encountering the same behavior as 
Patrick. I made a few modifications to the methods "addInputPaths" and 
"addAllSubDirs" from your patch, which seem to solve the UNION issue. 

{code}
public static boolean addInputPaths(String pathString, Job job)
        throws IOException {

    Set<Path> pathSet = new HashSet<Path>();

    if (addAllSubDirs(new Path(pathString), job, pathSet)) {
        Path[] paths = pathSet.toArray(new Path[pathSet.size()]);
        FileInputFormat.setInputPaths(job, paths);
        return true;
    }
    return false;
}

/**
 * Adds all non-hidden files in the given directory and its
 * subdirectories to the paths set
 * 
 * @throws IOException
 */
private static boolean addAllSubDirs(Path path, Job job, Set<Path> paths)
        throws IOException {
    FileSystem fs = FileSystem.get(job.getConfiguration());

    if (PATH_FILTER.accept(path)) {
        try {
            FileStatus file = fs.getFileStatus(path);
            if (file.isDir()) {
                // Recurse into each child of the directory.
                for (FileStatus sub : fs.listStatus(path)) {
                    addAllSubDirs(sub.getPath(), job, paths);
                }
            } else {
                AvroStorageLog.details("Add input file:" + file);
                paths.add(file.getPath());
            }
        } catch (FileNotFoundException e) {
            AvroStorageLog.details("Input path does not exist: " + path);
            return false;
        }
        return true;
    }
    return false;
}
{code}



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060165#comment-13060165
 ] 

Ken Goodhope commented on PIG-1890:
---

Hi Patrick, for our purposes we need setLocation to add all sub-directories, 
including directories more than 2 levels deep.  A common use case for us is to 
have directories organized by time, /MM/dd/hh/mm.  In that case, if you want 
to load all the data from a particular month, then you need to add all the 
subdirs.  You're right that a UNION can accomplish this, but it can be painful to 
add the directories that way.  I will take a look at why this is still breaking 
in your case.
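
The recursive loading described above can be sketched with plain java.nio 
instead of the Hadoop FileSystem API. This is only an illustration: the class 
name, the helper names, and the rule that names starting with "_" or "." count 
as hidden are assumptions made for the example, not the actual AvroStorage 
code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

public class RecursiveInputSketch {
    /** Assumed filter: names starting with '_' or '.' are treated as hidden. */
    static boolean isVisible(Path p) {
        Path name = p.getFileName();
        if (name == null) {
            return true;
        }
        String s = name.toString();
        return !s.startsWith("_") && !s.startsWith(".");
    }

    /** Adds every visible regular file under path, at any depth, to out. */
    static void collectFiles(Path path, Set<Path> out) throws IOException {
        if (!isVisible(path)) {
            return;
        }
        if (Files.isDirectory(path)) {
            try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                for (Path child : children) {
                    collectFiles(child, out);
                }
            }
        } else if (Files.exists(path)) {
            out.add(path);
        }
    }

    /** Builds a small month/day/hour tree and counts the files that get picked up. */
    static int demo() {
        try {
            Path root = Files.createTempDirectory("avro-input");
            Path hour = Files.createDirectories(
                    root.resolve("07").resolve("05").resolve("13"));
            Files.createFile(hour.resolve("part-0.avro"));
            Files.createFile(hour.resolve("_logs")); // hidden, skipped
            Set<Path> files = new TreeSet<>();
            collectFiles(root, files);
            collectFiles(root, files); // a second traversal adds nothing new
            return files.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 1
    }
}
```

Collecting into a set means that traversing the same tree twice cannot add a 
file twice.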





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060114#comment-13060114
 ] 

Patrick Hunt commented on PIG-1890:
---

Hi, I'm seeing an issue with both versions of the attached patches when I run 
the following:

{noformat}
REGISTER avro-1.4.1.jar; 
REGISTER json-simple-1.1.jar; 
REGISTER piggybank.jar;

A = LOAD 'input_123.avro' USING 
org.apache.pig.piggybank.storage.avro.AvroStorage();

B = LOAD 'input_789.avro' USING 
org.apache.pig.piggybank.storage.avro.AvroStorage();

C = UNION A, B; 
DUMP C;
{noformat}

where each file contains a single tuple; input_123.avro contains "1,2,3" (ints) 
and input_789.avro contains "7,8,9"
Dump C should be returning 2 tuples; 1 tuple 1,2,3 and 1 tuple 7,8,9.

Without the patch I see 6 tuples output (3 1,2,3 and 3 7,8,9)
With either of the proposed patches applied I see 4 tuples output (2 1,2,3 and 
2 7,8,9)

From looking at other pig loader functions it seems like the following would 
address the setLocation issue:

{noformat}
 public void setLocation(String location, Job job) throws IOException {
-    if(AvroStorageUtils.addInputPaths(location, job) && inputAvroSchema == null) {
-        inputAvroSchema = getAvroSchema(location, job);
-    }
+    FileInputFormat.setInputPaths(job, location);
+    inputAvroSchema = getAvroSchema(location, job);
 }
{noformat}

This does resolve the issue for the script I described. However the 
"addInputPaths" functionality of AvroStorageUtils is lost - but I'm wondering 
why this was added rather than just rely on the std capabilities of LOAD? (such 
as globbing).


I'd be happy to package up my suggestion as a patch if there's interest.




[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059958#comment-13059958
 ] 

Dmitriy V. Ryaboy commented on PIG-1890:


Marked PIG-2153 as a blocker to this.

I have a feeling that ticket is also blocking EB issue 60 
https://github.com/kevinweil/elephant-bird/issues/60 



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-01 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058813#comment-13058813
 ] 

Ken Goodhope commented on PIG-1890:
---

The fix for this jira involves two parts: making setLocation idempotent, and a 
fix in POProject.  I have added a jira for the POProject issue, PIG-2153.  I will 
try to get a patch for the setLocation issue added this weekend.  I have made 
some other changes to the version of AvroStorage we are using at LinkedIn and 
want to separate those changes from any patch I submit for this.



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-05-31 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041915#comment-13041915
 ] 

Ken Goodhope commented on PIG-1890:
---

I need some clarification on the contract for POProject.getNext(Tuple).  Right 
now, if it receives a tuple with a single element, it extracts that element and 
attempts to cast it to a tuple and return it.  This breaks for any 
single-element tuple whose single element is not a tuple.  The code could be 
modified to not extract non-tuple elements.
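
The behavior in question, and the suggested guard, can be sketched with a toy 
model. This is not the POProject code: here a "tuple" is simply a List<Object> 
and a float[] stands in for a bag of floats, and both method names are 
hypothetical.

```java
import java.util.Collections;
import java.util.List;

public class SingleElementUnwrapSketch {
    /** Models the described behavior: always unwrap a single element and cast it. */
    @SuppressWarnings("unchecked")
    static List<Object> unwrapAlways(List<Object> tuple) {
        if (tuple.size() == 1) {
            // Throws ClassCastException when the element is not itself a tuple.
            return (List<Object>) tuple.get(0);
        }
        return tuple;
    }

    /** Guarded variant: unwrap only when the single element is itself a tuple. */
    @SuppressWarnings("unchecked")
    static List<Object> unwrapIfTuple(List<Object> tuple) {
        if (tuple.size() == 1 && tuple.get(0) instanceof List) {
            return (List<Object>) tuple.get(0);
        }
        return tuple;
    }

    public static void main(String[] args) {
        // A one-column row whose only column is a "bag of floats".
        List<Object> row = Collections.<Object>singletonList(new float[] {1.0f, 2.0f});

        System.out.println(unwrapIfTuple(row) == row); // true: row is left alone

        try {
            unwrapAlways(row);
        } catch (ClassCastException e) {
            System.out.println("cast failed"); // the bag is not a tuple
        }
    }
}
```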


[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-05-23 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038100#comment-13038100
 ] 

Ken Goodhope commented on PIG-1890:
---

Right now, in this test, AvroStorage is attempting to pass back a single array 
of floats with one call to next. To be consistent with the intent of how the 
data is stored, we want this array returned as a single unit (databag) with each 
foreach call. In other words, we don't want foreach to return each element of 
that array one at a time. If I am understanding the code right, it appears that 
is what it is trying to do. Am I missing something? Is there a way to control 
this behavior?




[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-05-17 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034925#comment-13034925
 ] 

Daniel Dai commented on PIG-1890:
-

Seems it should call POProject.getNext(DataBag) instead. Projecting one item 
assumes this item already has the correct type and need not be converted. The 
issue should be caused by plan generation, which results in a wrong result type 
for POProject.


[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-05-15 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033822#comment-13033822
 ] 

Ken Goodhope commented on PIG-1890:
---

For testArrayDefault, we are attempting to return an entire avro array, which 
is consistent with the schema.  The result is a tuple with one column, a bag of 
floats.  In POProject.getNext(Tuple), tuples with one column have their single 
column extracted, cast to a tuple, and then returned.  Obviously in this case, 
this results in trying to cast the bag of floats to a tuple and an exception 
being thrown.

Does anyone know why this is being done in POProject?
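
The failure described above can be reproduced in miniature without Pig. The sketch below is plain Java: MiniTuple, MiniBag, projectAsTuple and projectAsBag are hypothetical stand-ins, not Pig's Tuple, DataBag or POProject. It only shows why unwrapping a single column and casting it to a tuple fails when that column is a bag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SingleColumnCastSketch {
    // Hypothetical stand-in for a tuple: an ordered list of fields.
    static final class MiniTuple {
        final List<Object> fields;

        MiniTuple(Object... fields) {
            this.fields = new ArrayList<>(Arrays.asList(fields));
        }
    }

    // Hypothetical stand-in for a bag: a collection of tuples.
    static final class MiniBag {
        final List<MiniTuple> rows = new ArrayList<>();
    }

    // Mimics the described behavior: extract the single column and cast it to a tuple.
    static MiniTuple projectAsTuple(MiniTuple input) {
        Object only = input.fields.get(0);
        return (MiniTuple) only; // ClassCastException when the column is a bag
    }

    // A bag-typed projection returns the same column as a bag instead.
    static MiniBag projectAsBag(MiniTuple input) {
        return (MiniBag) input.fields.get(0);
    }

    // Reports whether the tuple-typed projection throws for this input.
    static boolean tupleProjectionFails(MiniTuple input) {
        try {
            projectAsTuple(input);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // One-column tuple whose only field is a bag of floats.
        MiniBag floats = new MiniBag();
        floats.rows.add(new MiniTuple(1.0f));
        floats.rows.add(new MiniTuple(2.5f));
        MiniTuple row = new MiniTuple(floats);

        System.out.println("tuple projection fails: " + tupleProjectionFails(row));
        System.out.println("bag projection size: " + projectAsBag(row).rows.size());
    }
}
```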

> Fix piggybank unit test TestAvroStorage
> ---
>
> Key: PIG-1890
> URL: https://issues.apache.org/jira/browse/PIG-1890
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Daniel Dai
>Assignee: Jakob Homan
> Fix For: 0.9.0
>
> Attachments: PIG-1890-1.patch
>
>
> TestAvroStorage fail on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in first test 
> case testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD: 
> {PIG_WRAPPER: (ARRAY_ELEM: float)})". It seems PIG_WRAPPER is redundant. This 
> issue is hidden until PIG-1188 checked in.



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-05-09 Thread Jakob Homan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030845#comment-13030845
 ] 

Jakob Homan commented on PIG-1890:
--

@Ken - any update now that we're in a new week?

> Fix piggybank unit test TestAvroStorage
> ---
>
> Key: PIG-1890
> URL: https://issues.apache.org/jira/browse/PIG-1890
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Daniel Dai
>Assignee: Jakob Homan
> Fix For: 0.9.0
>
> Attachments: PIG-1890-1.patch
>
>
> TestAvroStorage fail on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in first test 
> case testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD: 
> {PIG_WRAPPER: (ARRAY_ELEM: float)})". It seems PIG_WRAPPER is redundant. This 
> issue is hidden until PIG-1188 checked in.



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-05-02 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027862#comment-13027862
 ] 

Ken Goodhope commented on PIG-1890:
---

I have been working on some fixes to AvroStorage already. I should be able to 
make sure this issue gets addressed in those fixes as well. I will have it done 
sometime this week.

> Fix piggybank unit test TestAvroStorage
> ---
>
> Key: PIG-1890
> URL: https://issues.apache.org/jira/browse/PIG-1890
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Daniel Dai
>Assignee: Jakob Homan
> Fix For: 0.9.0
>
> Attachments: PIG-1890-1.patch
>
>
> TestAvroStorage fail on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in first test 
> case testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD: 
> {PIG_WRAPPER: (ARRAY_ELEM: float)})". It seems PIG_WRAPPER is redundant. This 
> issue is hidden until PIG-1188 checked in.



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-05-02 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027855#comment-13027855
 ] 

Olga Natkovich commented on PIG-1890:
-

Hi Jakob,

Are you planning to address the additional issue for 0.9 or should we delay 
this?

> Fix piggybank unit test TestAvroStorage
> ---
>
> Key: PIG-1890
> URL: https://issues.apache.org/jira/browse/PIG-1890
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Daniel Dai
>Assignee: Jakob Homan
> Fix For: 0.9.0
>
> Attachments: PIG-1890-1.patch
>
>
> TestAvroStorage fail on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in first test 
> case testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD: 
> {PIG_WRAPPER: (ARRAY_ELEM: float)})". It seems PIG_WRAPPER is redundant. This 
> issue is hidden until PIG-1188 checked in.



[jira] Commented: (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-03-09 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004906#comment-13004906
 ] 

Daniel Dai commented on PIG-1890:
-

PIG-1890-1.patch fixes the first issue. I have temporarily commented out all 
test cases in TestAvroStorage.

> Fix piggybank unit test TestAvroStorage
> ---
>
> Key: PIG-1890
> URL: https://issues.apache.org/jira/browse/PIG-1890
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.9.0
>
> Attachments: PIG-1890-1.patch
>
>
> TestAvroStorage fail on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in first test 
> case testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD: 
> {PIG_WRAPPER: (ARRAY_ELEM: float)})". It seems PIG_WRAPPER is redundant. This 
> issue is hidden until PIG-1188 checked in.
