[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-11 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063633#comment-13063633
 ] 

Patrick Hunt commented on PIG-1890:
---

I tested PIG-1890-4.patch against trunk using the UNION example and it 
generated the expected (i.e. correct) results.

> Fix piggybank unit test TestAvroStorage
> ---
>
> Key: PIG-1890
> URL: https://issues.apache.org/jira/browse/PIG-1890
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.9.0
> Reporter: Daniel Dai
> Assignee: Jakob Homan
> Labels: patch
> Attachments: PIG-1890-1.patch, PIG-1890-2.patch, PIG-1890-3.patch, 
> PIG-1890-4.patch, pig_setloc_avro.txt
>
>
> TestAvroStorage fails on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in the first 
> test case, testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: 
> (FIELD: {PIG_WRAPPER: (ARRAY_ELEM: float)})". PIG_WRAPPER seems to be 
> redundant. This issue was hidden until PIG-1188 was checked in.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-06 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated PIG-1890:
--

Attachment: pig_setloc_avro.txt

Demonstrates the setLocation calls made on AvroStorage.





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-06 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060700#comment-13060700
 ] 

Patrick Hunt commented on PIG-1890:
---

@Dmitriy thanks.

bq. Patrick, can you show some debug output that has the sequence of calls?

Sure. I didn't save the original output, so I re-ran it; see the attached 
pig_setloc_avro.txt for full details using the UNION example (this is with 
current trunk; notice that 6 tuples are output rather than 2). I 
misremembered one detail: setLocation is being called for the same job, with 
different files, but on _different_ AvroStorage objects (see the first two 
lines of the setLocation debug messages). 

Why are 8 AvroStorage objects being created? Shouldn't there be just 2, 
one for loading each of the two input files?
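
For anyone who wants to reproduce this kind of trace, here is a minimal, 
self-contained sketch of tagging each loader instance with its own id so the 
debug lines show which object received which setLocation call. TracingLoader 
and TraceDemo are hypothetical names used only for illustration; this is not 
the AvroStorage code and not the debugging that produced pig_setloc_avro.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical loader stand-in: each instance gets a unique id for tracing. */
class TracingLoader {
    private static final AtomicInteger NEXT_ID = new AtomicInteger();

    private final int id = NEXT_ID.incrementAndGet();
    private final List<String> log;

    TracingLoader(List<String> log) {
        this.log = log;
    }

    /** Records which instance was asked to load which location. */
    void setLocation(String location) {
        log.add("loader#" + id + " setLocation(" + location + ")");
    }
}

class TraceDemo {
    /** Creates a fresh loader per location and returns the collected debug lines. */
    static List<String> traceCalls(String... locations) {
        List<String> log = new ArrayList<>();
        for (String location : locations) {
            new TracingLoader(log).setLocation(location);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : traceCalls("input_123.avro", "input_789.avro")) {
            System.out.println(line);
        }
    }
}
```

If two debug lines carry the same loader id, the calls went to the same 
object; different ids mean different objects, which is the distinction that 
matters here.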





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060213#comment-13060213
 ] 

Patrick Hunt commented on PIG-1890:
---

@ken (and @mads) thanks, I figured something like that. Could this possibly be 
an issue in Pig itself? I do see this:

{noformat}
LoadFunc.setLocation:
 * This method will be called in the backend multiple times. Implementations
 * should bear in mind that this method is called multiple times and should
 * ensure there are no inconsistent side effects due to the multiple calls.
{noformat}

But what I'm seeing in this UNION case is that setLocation is being called 
multiple times on the same AvroStorage instance, for the same job, with 
different files. With the current AvroStorage code and PIG-1890-2.patch 
applied, this results in duplication: 2 files are added rather than one. My 
patch fixes this by taking only the most recent argument to setLocation, which 
is consistent with existing loader functions, whereas AvroStorage keeps adding 
paths. If you check the debugging output you'll see this (I may have added a 
bit more debugging to setLocation to capture this event).
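
To make the difference concrete, here is a simplified model of the two 
behaviours. FakeJob, AppendingLoader and ReplacingLoader are hypothetical 
stand-ins, not Pig or Hadoop classes, and the real AvroStorageUtils.addInputPaths 
logic may well differ from this; the sketch only illustrates "keep adding" 
versus "take the most recent argument".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the input-path list held by a job. */
class FakeJob {
    final List<String> inputPaths = new ArrayList<>();
}

/** Appends a path on every call, so repeated calls accumulate inputs. */
class AppendingLoader {
    void setLocation(String location, FakeJob job) {
        job.inputPaths.add(location);
    }
}

/** Overwrites the path list on every call, so only the last location remains. */
class ReplacingLoader {
    void setLocation(String location, FakeJob job) {
        job.inputPaths.clear();
        job.inputPaths.add(location);
    }
}

class SetLocationDemo {
    /** Calls the appending loader once per location on a single job. */
    static List<String> runAppending(String... locations) {
        FakeJob job = new FakeJob();
        AppendingLoader loader = new AppendingLoader();
        for (String location : locations) {
            loader.setLocation(location, job);
        }
        return job.inputPaths;
    }

    /** Calls the replacing loader once per location on a single job. */
    static List<String> runReplacing(String... locations) {
        FakeJob job = new FakeJob();
        ReplacingLoader loader = new ReplacingLoader();
        for (String location : locations) {
            loader.setLocation(location, job);
        }
        return job.inputPaths;
    }

    public static void main(String[] args) {
        System.out.println("appending: " + runAppending("input_123.avro", "input_789.avro"));
        System.out.println("replacing: " + runReplacing("input_123.avro", "input_789.avro"));
    }
}
```

In this model the appending loader ends up reading both files for one job, 
while the replacing loader reads only the file passed in the last call.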

Regards.





[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-07-05 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060114#comment-13060114
 ] 

Patrick Hunt commented on PIG-1890:
---

Hi, I'm seeing an issue with both versions of the attached patches when I run 
the following:

{noformat}
REGISTER avro-1.4.1.jar; 
REGISTER json-simple-1.1.jar; 
REGISTER piggybank.jar;

A = LOAD 'input_123.avro' USING 
org.apache.pig.piggybank.storage.avro.AvroStorage();

B = LOAD 'input_789.avro' USING 
org.apache.pig.piggybank.storage.avro.AvroStorage();

C = UNION A, B; 
DUMP C;
{noformat}

where each file contains a single tuple: input_123.avro contains "1,2,3" (ints) 
and input_789.avro contains "7,8,9".
DUMP C should return 2 tuples: one 1,2,3 tuple and one 7,8,9 tuple.

Without the patch I see 6 tuples output (three 1,2,3 and three 7,8,9).
With either of the proposed patches applied I see 4 tuples output (two 1,2,3 
and two 7,8,9).

From looking at other Pig loader functions, it seems like the following would 
address the setLocation issue:

{noformat}
 public void setLocation(String location, Job job) throws IOException {
-    if(AvroStorageUtils.addInputPaths(location, job) && inputAvroSchema == null) {
-        inputAvroSchema = getAvroSchema(location, job);
-    }
+    FileInputFormat.setInputPaths(job, location);
+    inputAvroSchema = getAvroSchema(location, job);
 }
{noformat}

This does resolve the issue for the script I described. However, the 
"addInputPaths" functionality of AvroStorageUtils is lost. I'm wondering why 
it was added rather than just relying on the standard capabilities of LOAD 
(such as globbing).


I'd be happy to package up my suggestion as a patch if there's interest.

