PiggyBank AllLoader - Load multiple file formats in one load statement
----------------------------------------------------------------------

                 Key: PIG-1722
                 URL: https://issues.apache.org/jira/browse/PIG-1722
             Project: Pig
          Issue Type: New Feature
            Reporter: Gerrit Jansen van Vuuren
            Assignee: Gerrit Jansen van Vuuren
            Priority: Minor


This adds the ability to point a single loader at a directory and have multiple 
formats loaded and used in the same query.

----- Overview -----

Let's say we have a directory with these files:
 /logs/myfile.lzo
 /logs/myfile.rc
 /logs/myfile.bz2
 /logs/myfile.gz

Loading these currently requires a separate loader and LOAD statement in Pig 
for each format, followed by a UNION of the results in the query.

With this loader the query becomes:
a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader();

The AllLoader uses the mapping property file.extension.loaders in 
$PIG_HOME/conf/pig.properties, which can be set up as:

file.extension.loaders=gz:org.apache.pig.builtin.PigStorage(),bz2:org.apache.pig.builtin.PigStorage(),lzo:com.twitter.elephantbird.pig.load.LzoTextLoader(),rc:org.apache.pig.piggybank.storage.HiveColumnarLoader()

Each entry of this property takes one of the following forms:

-> [file extension]:[loader func spec]
-> [file extension]:[optional path tag]:[loader func spec]
-> [file extension]:[optional path tag]:[sequence file key value writer class 
   name]:[loader func spec]
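
As a rough illustration of the three entry forms above, the sketch below parses 
a property value into records. It is written in Python only for brevity and is 
not the AllLoader's actual parsing code. The names LoaderEntry, split_entries 
and parse_loader_mapping are hypothetical, and the sketch assumes that entries 
are separated by commas outside parentheses and that a loader func spec 
contains no colon.

```python
from typing import List, NamedTuple, Optional


class LoaderEntry(NamedTuple):
    """One parsed entry of file.extension.loaders (hypothetical structure)."""
    extension: str
    path_tag: Optional[str]
    seq_key_class: Optional[str]
    loader: str


def split_entries(value: str) -> List[str]:
    """Split on commas that are not inside parentheses, dropping empty entries."""
    entries, current, depth = [], [], 0
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            entries.append("".join(current))
            current = []
        else:
            current.append(ch)
    entries.append("".join(current))
    return [e.strip() for e in entries if e.strip()]


def parse_loader_mapping(value: str) -> List[LoaderEntry]:
    """Parse entries of 2, 3 or 4 colon-separated fields; the loader is last."""
    result = []
    for entry in split_entries(value):
        parts = entry.split(":", 3)
        if len(parts) == 2:
            ext, loader = parts
            tag = seq = None
        elif len(parts) == 3:
            ext, tag, loader = parts
            seq = None
        elif len(parts) == 4:
            ext, tag, seq, loader = parts
        else:
            raise ValueError("malformed entry: %r" % entry)
        # An empty optional field (as in "seq::Class:Loader") becomes None.
        result.append(LoaderEntry(ext, tag or None, seq or None, loader))
    return result
```

An empty path tag, as in an entry of the form seq::SomeClass:SomeLoader, is 
treated here as "no tag".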

----- File path tagging -----

Loaders can also be chosen based on folder names in the file path,
e.g.
file.extension.loaders=gz:type1:Type1Loader(), gz:type2:Type2Loader()

So if you have /logs/type1/mylog and /logs/type2/mylog, then
a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader();
will use Type1Loader for mylog in /logs/type1 and Type2Loader for mylog in 
/logs/type2.
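
The selection described above could be sketched roughly as follows. This is an 
illustration in Python, not the AllLoader's code: the MAPPING dictionary and 
the select_loader function are hypothetical, and the sketch assumes that a tag 
must equal a whole folder name in the path and that a tagged entry takes 
precedence over an untagged one. The file extension is passed in separately 
because it may come from the file name or from guessing the content.

```python
import posixpath
from typing import Dict, Optional, Tuple

# (extension, path tag) -> loader func spec; a tag of None means "any path".
MAPPING: Dict[Tuple[str, Optional[str]], str] = {
    ("gz", "type1"): "Type1Loader()",
    ("gz", "type2"): "Type2Loader()",
    ("gz", None): "org.apache.pig.builtin.PigStorage()",
}


def select_loader(
    path: str,
    extension: str,
    mapping: Dict[Tuple[str, Optional[str]], str],
) -> Optional[str]:
    """Return the loader for a file, preferring entries whose tag is a folder name."""
    folders = posixpath.dirname(path).strip("/").split("/")
    for folder in folders:
        loader = mapping.get((extension, folder))
        if loader is not None:
            return loader
    # Fall back to the untagged entry for this extension, if there is one.
    return mapping.get((extension, None))
```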

----- File content guessing -----

If the files do not have extensions, the AllLoader tries to guess the file type 
from the first three bytes, mapping the following (signed) byte values to an 
extension:

[ -119, 76, 90 ] = lzo
[ 31, -117, 8 ] = gz
[ 66, 90, 104 ] = bz2
[ 83, 69, 81 ] = seq
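
A minimal sketch of this lookup, in Python for illustration only (the function 
names guess_extension and guess_extension_of_file are hypothetical). The table 
above lists Java signed bytes, so each value is masked to the unsigned range 
before it is compared with the bytes read from the file.

```python
from typing import Optional

# Signed byte values exactly as listed in the table above.
SIGNED_MAGIC = {
    (-119, 76, 90): "lzo",
    (31, -117, 8): "gz",
    (66, 90, 104): "bz2",
    (83, 69, 81): "seq",
}

# Python bytes are unsigned (0..255), so mask each signed value with 0xFF.
MAGIC = {bytes(b & 0xFF for b in key): ext for key, ext in SIGNED_MAGIC.items()}


def guess_extension(header: bytes) -> Optional[str]:
    """Map the first three bytes of a file to an extension, or None if unknown."""
    return MAGIC.get(bytes(header[:3]))


def guess_extension_of_file(path: str) -> Optional[str]:
    """Read the first three bytes of a local file and guess its extension."""
    with open(path, "rb") as f:
        return guess_extension(f.read(3))
```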

----- Loader selection based on sequence file writer class -----

Loaders can also be selected based on the value returned by getKeyClassName for 
a sequence file,
e.g.
file.extension.loaders=seq::org.apache.hadoop.hive.ql.io.RCFile:HiveColumnarLoader

will use the HiveColumnarLoader for all sequence files that were written with 
org.apache.hadoop.hive.ql.io.RCFile as the key class name. The path tag is left 
empty in this entry.

Any $ suffix in getKeyClassName's return value is removed before matching.
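
The $ handling can be pictured with the short Python sketch below. The function 
names and the SEQ_LOADERS dictionary are hypothetical, and the sketch assumes 
that everything from the first $ onwards is dropped, which is one reading of 
the rule above.

```python
from typing import Dict, Optional

# Key class name -> loader, as configured in the example entry above.
SEQ_LOADERS: Dict[str, str] = {
    "org.apache.hadoop.hive.ql.io.RCFile": "HiveColumnarLoader",
}


def normalize_key_class(name: str) -> str:
    """Drop everything from the first '$' onwards (assumed meaning of the rule)."""
    return name.split("$", 1)[0]


def loader_for_key_class(key_class_name: str, seq_loaders: Dict[str, str]) -> Optional[str]:
    """Look up the loader configured for a sequence file's key class name."""
    return seq_loaders.get(normalize_key_class(key_class_name))
```

For instance, a key class name carrying an inner-class suffix (the suffix 
$KeyBuffer is used here purely as an illustrative input) would match the entry 
configured for org.apache.hadoop.hive.ql.io.RCFile.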

----- Path Partition Handling -----

Hive-style partitioning is supported in the loader itself, so that if you have 
/logs/type=1 /logs/type=2 /logs/type=3
the partition column will be recognised as "type" and filtering can be done 
with expressions such as type<=2.

In the current implementation, filter expressions should be passed to the 
AllLoader's constructor, e.g.

a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader('type<=2');
will load only the files in /logs/type=1 and /logs/type=2.
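
To make the partition filtering concrete, here is a small Python sketch of the 
idea. It is not the loader's implementation: the names partition_values and 
matches are hypothetical, and the sketch only handles a single comparison of 
one partition column against an integer literal, whereas the expression syntax 
actually accepted by the AllLoader is not specified here.

```python
import operator
import re
from typing import Dict

_OPS = {
    "<=": operator.le, ">=": operator.ge, "!=": operator.ne,
    "==": operator.eq, "<": operator.lt, ">": operator.gt, "=": operator.eq,
}
# column, operator, integer literal; two-character operators are tried first.
_FILTER = re.compile(r"^\s*(\w+)\s*(<=|>=|!=|==|<|>|=)\s*(-?\d+)\s*$")


def partition_values(path: str) -> Dict[str, str]:
    """Collect key=value folder names from a path, e.g. /logs/type=1."""
    values = {}
    for part in path.strip("/").split("/"):
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


def matches(path: str, expression: str) -> bool:
    """Return True if the path's partition value satisfies the expression."""
    m = _FILTER.match(expression)
    if m is None:
        raise ValueError("unsupported filter expression: %r" % expression)
    column, op, literal = m.groups()
    value = partition_values(path).get(column)
    if value is None:
        return False
    try:
        number = int(value)
    except ValueError:
        return False
    return _OPS[op](number, int(literal))
```

Applied to the three directories above, the expression type<=2 keeps 
/logs/type=1 and /logs/type=2 and skips /logs/type=3.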




