[ 
https://issues.apache.org/jira/browse/PIG-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414564#comment-13414564
 ] 

Cheolsoo Park commented on PIG-2492:
------------------------------------

Hi,

I am interested in getting this jira resolved, so I posted a new patch 
[^PIG-2492.patch] that hopefully addresses concerns expressed here. To 
summarize, I did the following:

1) I used functions that hadoop provides instead of implementing my own glob 
pattern matching. In fact, it was slightly more complicated than what Scott 
described for two reasons:
- _FileInputFormat.setInputPaths()_ doesn't find files in sub-directories. But 
currently, if the path is a directory, AvroStorage recursively loads files in 
that directory and its sub-directories.
- AvroStorage needs to know the schema of the files to load, so it is necessary 
to expand the glob pattern in AvroStorage.

Nevertheless, I was able to implement glob/comma support using 
_FileSystem.globStatus()_ and _FileInputFormat.setInputPaths()_ without 
changing the current recursive load semantics.
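
To illustrate the two-step idea above (expand the glob first, then descend into any matched directory), here is a minimal sketch. It is not code from the patch: it uses the local filesystem through java.nio as a stand-in for Hadoop's FileSystem, and the class name GlobThenRecurseSketch and the method expand() are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; the real patch works against Hadoop's
// FileSystem API, not java.nio.
public class GlobThenRecurseSketch {

    /**
     * Expands a single-level glob (e.g. "test_dir{1,2}") against the entries
     * of baseDir, then descends into every matched directory.
     * Returns the regular files found, sorted for deterministic output.
     */
    static List<Path> expand(Path baseDir, String glob) throws IOException {
        List<Path> files = new ArrayList<>();
        // Step 1: glob expansion, matched against entry names in baseDir only.
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(baseDir, glob)) {
            for (Path match : matches) {
                // Step 2: a matched directory is walked recursively;
                // a matched regular file yields just itself.
                try (Stream<Path> walk = Files.walk(match)) {
                    files.addAll(walk.filter(Files::isRegularFile)
                                     .collect(Collectors.toList()));
                }
            }
        }
        Collections.sort(files);
        return files;
    }
}
```

Calling expand(baseDir, "test_dir{1,2}") in this sketch returns every regular file under both matched directories, including files in their sub-directories, which mirrors the recursive load semantics described above.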

2) URIs are handled properly because glob patterns are expanded by Hadoop, 
which already knows how to handle URIs.

3) The glob syntax is the same as what's supported in PigStorage, since 
PigStorage also uses _FileInputFormat.setInputPaths()_ to expand glob patterns. 
Some examples are as follows:
{code}
test_dir1/*
test_dir1/test_glob{1,2,3}.avro
{test_dir1,test_dir2}/test_glob*.avro
{code}
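
One detail worth noting in the last example: a comma inside braces is part of the glob, not a separator between input paths. The sketch below shows one way to split a comma-separated location string while leaving brace alternations intact. It is an illustration under that assumption, not code from the patch or from Hadoop, and the names CommaSplitSketch and splitPaths() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the patch or from Hadoop.
public class CommaSplitSketch {

    /** Splits on commas that are not nested inside {...} alternations. */
    static List<String> splitPaths(String location) {
        List<String> paths = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int braceDepth = 0;
        for (char c : location.toCharArray()) {
            if (c == '{') {
                braceDepth++;
            } else if (c == '}' && braceDepth > 0) {
                braceDepth--;
            }
            if (c == ',' && braceDepth == 0) {
                paths.add(current.toString()); // top-level comma ends a path
                current.setLength(0);
            } else {
                current.append(c);             // includes commas inside braces
            }
        }
        paths.add(current.toString());         // last (or only) path
        return paths;
    }

    public static void main(String[] args) {
        String location = "test_dir1/*,{test_dir1,test_dir2}/test_glob*.avro";
        for (String path : splitPaths(location)) {
            System.out.println(path);
        }
    }
}
```

With the sample string in main(), the sketch yields two paths, test_dir1/* and {test_dir1,test_dir2}/test_glob*.avro, each of which can then be expanded as a glob on its own.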

4) I assumed that all the files that match the glob pattern have the same 
schema. In fact, this is the same limitation that we have for loading a 
directory:
{quote}
If the input directory is a leaf directory, then we assume Avro data files in 
it have the same schema;
If the input directory contains sub-directories, then we assume Avro data files 
in all sub-directories have the same schema.
{quote}
https://cwiki.apache.org/PIG/avrostorage.html

5) I added 4 unit tests to verify the functionality, as follows:
- testDir verifies that AvroStorage recursively loads files in a directory and 
its sub-directories.
- testGlob1 through testGlob3 verify that glob patterns are expanded properly.

In addition to the patch, I uploaded some .avro files [^avro_test_files.tar.gz] 
that are needed for my tests. To run the tests, please do the following:
{code}
tar -xf avro_test_files.tar.gz
ant clean compile-test piggybank -Dhadoopversion=20
cd contrib/piggybank/java
ant test -Dtestcase=TestAvroStorage
{code}
Please let me know what you think.

Thanks!
                
> AvroStorage should recognize globs and commas
> ---------------------------------------------
>
>                 Key: PIG-2492
>                 URL: https://issues.apache.org/jira/browse/PIG-2492
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>    Affects Versions: 0.9.1, 0.10.0
>            Reporter: Stan Rosenberg
>         Attachments: AvroStorage.patch, AvroStorageUtils.patch, 
> PIG-2492.patch, avro_test_files.tar.gz
>
>
> I've patched AvroStorage and AvroStorageUtils to support the same file input 
> syntax as currently supported
> by hadoop's FileInputFormat.  Specifically, globs and commas are supported.
> Somebody should write some unit tests for these changes; I am currently 
> pressed for time. 
