[ 
https://issues.apache.org/jira/browse/PIG-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931515#action_12931515
 ] 

Gerrit Jansen van Vuuren commented on PIG-1722:
-----------------------------------------------

Thanks, and sorry about the tabs. I did the auto-formatting in Eclipse, but I 
will check that it is set to use 4 spaces instead of tabs :)


> PiggyBank AllLoader - Load multiple file formats in one load statement
> ----------------------------------------------------------------------
>
>                 Key: PIG-1722
>                 URL: https://issues.apache.org/jira/browse/PIG-1722
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gerrit Jansen van Vuuren
>            Assignee: Gerrit Jansen van Vuuren
>            Priority: Minor
>         Attachments: PIG-1722.patch
>
>
> This gives the ability to point one loader at a directory and have multiple 
> formats loaded and used in the same query.
> ----- Overview -----
> Let's say we have a directory with files:
>  /logs/myfile.lzo
>  /logs/myfile.rc
>  /logs/myfile.bz2
>  /logs/myfile.gz
> Loading these currently requires a separate loader and LOAD statement in Pig 
> for each format, followed by a union of the results in the query.
> With this loader the query becomes:
> a = LOAD '/logs/' USING  org.apache.pig.piggybank.storage.AllLoader();
> The AllLoader uses the file.extension.loaders mapping property in 
> $PIG_HOME/conf/pig.properties, which can be set up as:
> file.extension.loaders=gz:org.apache.pig.builtin.PigStorage(),bz2:org.apache.pig.builtin.PigStorage(),lzo:com.twitter.elephantbird.pig.load.LzoTextLoader(),rc:org.apache.pig.piggybank.storage.HiveColumnarLoader()
> The supported formats for each entry in this property are:
> -> [file extension]:[loader func spec]
> -> [file extension]:[optional path tag]:[loader func spec]
> -> [file extension]:[optional path tag]:[sequence file key value writer class name]:[loader func spec]
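The three entry formats listed above can be told apart by the number of colon-separated fields. The following is a minimal Python sketch of that idea, not the AllLoader's actual parsing code. `LoaderRule` and `parse_loader_rules` are hypothetical names, and the sketch assumes that a loader func spec contains neither ',' nor ':'.

```python
from typing import List, NamedTuple, Optional


class LoaderRule(NamedTuple):
    extension: str
    path_tag: Optional[str]   # None when the tag is absent or empty
    key_class: Optional[str]  # None when no sequence file class is given
    loader: str               # loader func spec, kept as an opaque string


def parse_loader_rules(value: str) -> List[LoaderRule]:
    """Parse a file.extension.loaders value into rules (illustrative only)."""
    rules = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        parts = entry.split(":")
        if len(parts) == 2:
            ext, loader = parts
            tag = key_class = None
        elif len(parts) == 3:
            ext, tag, loader = parts
            key_class = None
        elif len(parts) == 4:
            ext, tag, key_class, loader = parts
        else:
            raise ValueError(f"unrecognised entry: {entry!r}")
        # Empty fields (as in "seq::...") are treated as "not set".
        rules.append(LoaderRule(ext, tag or None, key_class or None, loader))
    return rules
```

An empty path tag, as in an entry starting with `seq::`, is treated here as "no tag".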
> ----- File path tagging: -----
> Loaders can also be chosen based on folder names in the file path, e.g.
> file.extension.loaders=gz:type1:Type1Loader(), gz:type2:Type2Loader()
> So if you have /logs/type1/mylog and /logs/type2/mylog, then
> a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader();
> will use Type1Loader for mylog in /logs/type1 and Type2Loader for mylog in 
> /logs/type2.
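The description does not say exactly how a path tag is matched. The Python sketch below assumes a tag matches when it equals one of the folder names in the file's path; that rule, the `RULES` list, and `loader_for` are illustrative assumptions, not the AllLoader's code. The extension is passed in separately because `mylog` in the example has none.

```python
from pathlib import PurePosixPath
from typing import List, Optional, Tuple

# (extension, path tag, loader) triples taken from the example above.
RULES: List[Tuple[str, str, str]] = [
    ("gz", "type1", "Type1Loader()"),
    ("gz", "type2", "Type2Loader()"),
]


def loader_for(path: str, extension: str) -> Optional[str]:
    """Return the first loader whose extension matches and whose tag is a folder in the path."""
    # Folder names only; the file name itself is not considered (an assumption).
    folders = PurePosixPath(path).parent.parts
    for ext, tag, loader in RULES:
        if ext == extension and tag in folders:
            return loader
    return None
```

With these rules, `loader_for("/logs/type1/mylog", "gz")` selects `Type1Loader()`, and a path under neither folder yields `None`.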
> ----- File content guessing: -----
> If a file has no extension, the AllLoader will try to guess its type by 
> looking at the first three bytes (shown here as signed Java byte values) 
> and mapping them to an extension as follows:
> [ -119, 76, 90 ] = lzo
> [ 31, -117, 8 ] = gz
> [ 66, 90, 104 ] = bz2
> [ 83, 69, 81 ] = seq
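The byte values in the table above are signed, as Java prints them: -119 is 0x89 and -117 is 0x8B. Below is a small Python sketch of the lookup under that reading. `guess_extension` is a hypothetical name, and this is not the AllLoader's implementation.

```python
from typing import Optional

# Signed Java byte triples from the table above, mapped to extensions.
SIGNED_MAGIC = {
    (-119, 76, 90): "lzo",
    (31, -117, 8): "gz",
    (66, 90, 104): "bz2",
    (83, 69, 81): "seq",
}

# Convert to unsigned bytes so they compare directly with data read from a file.
MAGIC = {bytes(b & 0xFF for b in sig): ext for sig, ext in SIGNED_MAGIC.items()}


def guess_extension(first_bytes: bytes) -> Optional[str]:
    """Guess a file type from its first three bytes; None if unrecognised."""
    return MAGIC.get(bytes(first_bytes[:3]))
```

In use, one would read the first three bytes of the file in binary mode and pass them to `guess_extension`.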
> ----- Loader selection based on sequence file writer class -----
> Loaders can also be selected based on the value returned by getKeyClassName 
> for a sequence file, e.g.
> file.extension.loaders=seq::org.apache.hadoop.hive.ql.io.RCFile:HiveColumnarLoader
> will use the HiveColumnarLoader for all sequence files that have been 
> written with org.apache.hadoop.hive.ql.io.RCFile as the key class name.
> All $ extensions are removed from the value returned by getKeyClassName.
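One reading of the "$ extensions" rule is that everything from the first '$' onward is dropped, so that a nested class name reduces to its outer class. The Python sketch below assumes that reading. `normalise_key_class` is a hypothetical name and the inputs are illustrative.

```python
def normalise_key_class(name: str) -> str:
    """Drop everything from the first '$' onward in a Java binary class name."""
    return name.split("$", 1)[0]


# Illustrative inputs; the nested class names are assumptions for the example.
print(normalise_key_class("org.apache.hadoop.hive.ql.io.RCFile$KeyBuffer"))
print(normalise_key_class("com.example.Outer$Inner$Deeper"))
```

A name without a '$' passes through unchanged.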
> ----- Path Partition Handling -----
> Hive-style partitioning is supported in the loader itself, so if you have 
> /logs/type=1 /logs/type=2 /logs/type=3
> the partition column will be recognised as "type" and filtering can be done 
> with expressions like type<=2.
> In the current implementation, filter expressions should be passed into 
> the AllLoader's constructor, e.g.
> a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader('type<=2');
> will load only files that are in /logs/type=1 and /logs/type=2.
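To make the partition filtering concrete, here is a minimal Python sketch that reads key=value folder names from a path and applies a single comparison such as type<=2. It only illustrates the idea. The expression grammar (one column, one operator, one integer), the function names, and the file names used in the comments are assumptions, not the AllLoader's filter syntax or code.

```python
import operator
import re
from typing import Dict

OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt,
       ">": operator.gt, "==": operator.eq, "!=": operator.ne}


def partition_values(path: str) -> Dict[str, str]:
    """Collect key=value folder names, e.g. /logs/type=1/f -> {'type': '1'}."""
    return dict(part.split("=", 1) for part in path.split("/") if "=" in part)


def matches(path: str, expression: str) -> bool:
    """Evaluate a single 'column op integer' filter against a path's partition values."""
    m = re.fullmatch(r"\s*(\w+)\s*(<=|>=|==|!=|<|>)\s*(-?\d+)\s*", expression)
    if m is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    column, op, literal = m.groups()
    values = partition_values(path)
    # A path without the partition column never matches.
    return column in values and OPS[op](int(values[column]), int(literal))
```

Filtering a list of paths with `matches(p, "type<=2")` keeps those under /logs/type=1 and /logs/type=2 and drops /logs/type=3.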

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
