[
https://issues.apache.org/jira/browse/PIG-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414564#comment-13414564
]
Cheolsoo Park commented on PIG-2492:
------------------------------------
Hi,
I am interested in getting this jira resolved, so I posted a new patch
[^PIG-2492.patch] that hopefully addresses concerns expressed here. To
summarize, I did the following:
1) I used functions that Hadoop provides instead of implementing my own glob
pattern matching. This turned out to be slightly more complicated than what
Scott described, for two reasons:
- _FileInputFormat.setInputPaths()_ doesn't find files in sub-directories. But
currently, if the path is a directory, AvroStorage recursively loads files in
that directory and its sub-directories.
- AvroStorage needs to know the schema of the files to load, so it is necessary
to expand the glob pattern in AvroStorage.
Nevertheless, I was able to implement glob/comma support using
_FileSystem.globStatus()_ and _FileInputFormat.setInputPaths()_ without
changing the current recursive load semantics.
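To make the two steps in point 1 concrete, here is a rough sketch of the idea, not the code in the patch. It assumes java.nio.file as a stand-in for Hadoop's FileSystem so that it runs without a cluster: the PathMatcher plays the role that _FileSystem.globStatus()_ plays in the patch, and the class name, method names and demo tree are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: java.nio.file stands in for Hadoop's FileSystem here.
class GlobLoadSketch {

    // Step 1: expand the glob relative to base (the role globStatus() plays).
    // Step 2: descend into every match so that directories load recursively.
    static List<Path> expand(Path base, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        TreeSet<Path> files = new TreeSet<>();
        try (Stream<Path> all = Files.walk(base)) {
            List<Path> matched = all
                .filter(p -> !p.equals(base))
                .filter(p -> matcher.matches(base.relativize(p)))
                .collect(Collectors.toList());
            for (Path m : matched) {
                // Walking a regular file yields just that file.
                try (Stream<Path> sub = Files.walk(m)) {
                    sub.filter(Files::isRegularFile).forEach(files::add);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ArrayList<>(files);
    }

    // Builds a throwaway directory tree and returns the matches as relative names.
    static List<String> demo(String glob) {
        try {
            Path base = Files.createTempDirectory("glob_sketch");
            Files.createDirectories(base.resolve("test_dir1").resolve("sub"));
            Files.createDirectories(base.resolve("test_dir2"));
            Files.createFile(base.resolve("test_dir1").resolve("test_glob1.avro"));
            Files.createFile(base.resolve("test_dir1").resolve("sub").resolve("test_glob2.avro"));
            Files.createFile(base.resolve("test_dir2").resolve("test_glob3.avro"));
            return expand(base, glob).stream()
                .map(p -> base.relativize(p).toString().replace('\\', '/'))
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("test_dir1/*"));
    }
}
```

In this sketch, test_dir1/* matches both a file and the sub directory, and the second step is what picks up the file inside sub. Expanding up front also yields a concrete file list from which a schema could be read.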
2) URIs are handled properly because glob patterns are expanded by Hadoop,
which knows how to handle URIs.
3) The glob syntax is the same as what's supported in PigStorage since
PigStorage also uses _FileInputFormat.setInputPaths()_ to expand glob patterns.
Some examples are as follows:
{code}
test_dir1/*
test_dir1/test_glob{1,2,3}.avro
{test_dir1,test_dir2}/test_glob*.avro
{code}
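For readers unfamiliar with the brace syntax, the following is a small illustrative sketch of how the {a,b} alternation in the examples above unfolds. It is not Hadoop's implementation and not part of the patch: the class and method names are hypothetical, nested braces are not handled, and wildcards such as * are left untouched for a later matching step.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that only illustrates the {a,b} alternation syntax.
class BraceExpandSketch {

    // Expands the first {...} group, then recurses to handle any later groups.
    static List<String> expand(String pattern) {
        int open = pattern.indexOf('{');
        int close = open < 0 ? -1 : pattern.indexOf('}', open);
        if (open < 0 || close < 0) {
            // No complete brace group left: the pattern stands for itself.
            return Collections.singletonList(pattern);
        }
        String prefix = pattern.substring(0, open);
        String suffix = pattern.substring(close + 1);
        List<String> out = new ArrayList<>();
        for (String alt : pattern.substring(open + 1, close).split(",", -1)) {
            out.addAll(expand(prefix + alt + suffix));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("test_dir1/test_glob{1,2,3}.avro"));
        System.out.println(expand("{test_dir1,test_dir2}/test_glob*.avro"));
    }
}
```

Under these assumptions the second example becomes test_dir1/test_glob*.avro and test_dir2/test_glob*.avro, each of which is then matched against file names as an ordinary wildcard pattern.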
4) I assumed that all the files that match the glob pattern have the same
schema. In fact, this is the same limitation that we have for loading a
directory:
{quote}
If the input directory is a leaf directory, then we assume Avro data files in
it have the same schema;
If the input directory contains sub-directories, then we assume Avro data files
in all sub-directories have the same schema.
{quote}
https://cwiki.apache.org/PIG/avrostorage.html
5) I added 4 unit tests to verify the functionality, as follows:
- testDir verifies that AvroStorage recursively loads files in a directory and
its sub-directories.
- testGlob1 through testGlob3 verify that glob patterns are expanded properly.
In addition to the patch, I uploaded some .avro files [^avro_test_files.tar.gz]
that are needed for my tests. To run the tests, please do the following:
{code}
tar -xf avro_test_files.tar.gz
ant clean compile-test piggybank -Dhadoopversion=20
cd contrib/piggybank/java
ant test -Dtestcase=TestAvroStorage
{code}
Please let me know what you think.
Thanks!
> AvroStorage should recognize globs and commas
> ---------------------------------------------
>
> Key: PIG-2492
> URL: https://issues.apache.org/jira/browse/PIG-2492
> Project: Pig
> Issue Type: Improvement
> Components: piggybank
> Affects Versions: 0.9.1, 0.10.0
> Reporter: Stan Rosenberg
> Attachments: AvroStorage.patch, AvroStorageUtils.patch,
> PIG-2492.patch, avro_test_files.tar.gz
>
>
> I've patched AvroStorage and AvroStorageUtils to support the same file input
> syntax as currently supported
> by hadoop's FileInputFormat. Specifically, globs and commas are supported.
> Somebody should write some unit tests for these changes; I am currently
> pressed for time.