[ https://issues.apache.org/jira/browse/PIG-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gerrit Jansen van Vuuren updated PIG-1722:
------------------------------------------

      Tags: PIG-1722.patch
    Status: Patch Available  (was: Open)

> PiggyBank AllLoader - Load multiple file formats in one load statement
> ----------------------------------------------------------------------
>
>                 Key: PIG-1722
>                 URL: https://issues.apache.org/jira/browse/PIG-1722
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gerrit Jansen van Vuuren
>            Assignee: Gerrit Jansen van Vuuren
>            Priority: Minor
>         Attachments: PIG-1722.patch
>
>
> This gives the ability to point one loader at a directory and have multiple
> formats loaded and used in the same query.
>
> ----- Overview -----
> Let's say we have a directory with files:
>   /logs/myfile.lzo
>   /logs/myfile.rc
>   /logs/myfile.bz2
>   /logs/myfile.gz
> Loading these currently requires multiple loaders, one load statement per
> format in Pig, and a union of the results in the query.
> With this Loader the query becomes:
>   a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader();
> The AllLoader uses the file.extension.loaders mapping property in
> $PIG_HOME/conf/pig.properties, which can be set up as:
>   file.extension.loaders=gz:org.apache.pig.builtin.PigStorage(),bz2:org.apache.pig.builtin.PigStorage(),lzo:com.twitter.elephantbird.pig.load.LzoTextLoader(),rc:org.apache.pig.piggybank.storage.HiveColumnarLoader()
> The format of this property is one of:
>   -> [file extension]:[loader func spec]
>   -> [file-extension]:[optional path tag]:[loader func spec]
>   -> [file-extension]:[optional path tag]:[sequence file key value writer class name]:[loader func spec]
>
> ----- File path tagging: -----
> Loaders can also be chosen based on folder names in the file path:
> e.g.
>   file.extension.loaders:gz:type1:Type1Loader(), gz:type2:Type2Loader()
> So if you have /logs/type1/mylog and /logs/type2/mylog, then
>   a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader();
> will use Type1Loader for mylog in /logs/type1 and Type2Loader for mylog in
> /logs/type2.
>
> ----- File content guessing: -----
> If the files do not have extensions, the AllLoader will try to guess the type
> of each file by looking at its first three bytes, mapping the following
> (signed) byte values to an extension:
>   [ -119, 76, 90 ] = lzo
>   [ 31, -117, 8 ] = gz
>   [ 66, 90, 104 ] = bz2
>   [ 83, 69, 81 ] = seq
>
> ----- Loader selection based on sequence file writer class -----
> Loaders can be configured to be selected based on the getKeyClassName of the
> Sequence File.
> e.g.
>   file.extension.loaders:seq::org.apache.hadoop.hive.ql.io.RCFile:HiveColumnarLoader
> will use the HiveColumnarLoader for all sequence files that have been
> written with org.apache.hadoop.hive.ql.io.RCFile as the KeyClassName.
> All $ extensions are removed from the getKeyClassName's return value.
>
> ----- Path Partition Handling -----
> Hive style partitioning is supported in the Loader itself, so that if you have
>   /logs/type=1
>   /logs/type=2
>   /logs/type=3
> the partition column will be recognised as "type" and filtering can be done
> with expressions like type<=2.
> In the current implementation, filter expressions should be passed into
> the AllLoader's constructor, e.g.
>   a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader('type<=2');
> will load only files that are in /logs/type=1 and /logs/type=2.
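For readers who want to see how the file.extension.loaders entries and the path tags could fit together, here is a minimal Python sketch. It is not the AllLoader code from the patch. The names LoaderMapping, parse_mappings and select_loader are hypothetical, and two behaviours are assumptions: separators are only honoured outside parentheses (so a func spec such as PigStorage(',') survives), and a mapping whose path tag matches a folder name wins over an untagged one.

```python
from typing import List, NamedTuple, Optional


class LoaderMapping(NamedTuple):
    extension: str
    path_tag: str      # "" when the entry has no path tag
    writer_class: str  # "" when the entry has no writer class
    loader_spec: str


def split_outside_parens(text: str, sep: str) -> List[str]:
    """Split on sep, ignoring separators inside (...) argument lists."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_mappings(value: str) -> List[LoaderMapping]:
    """Parse the three entry shapes: ext:spec, ext:tag:spec, ext:tag:writer:spec."""
    mappings = []
    for entry in split_outside_parens(value, ","):
        entry = entry.strip()
        if not entry:
            continue
        fields = [f.strip() for f in split_outside_parens(entry, ":")]
        if len(fields) == 2:
            ext, spec = fields
            tag = writer = ""
        elif len(fields) == 3:
            ext, tag, spec = fields
            writer = ""
        elif len(fields) == 4:
            ext, tag, writer, spec = fields
        else:
            raise ValueError(f"unrecognised mapping entry: {entry!r}")
        mappings.append(LoaderMapping(ext, tag, writer, spec))
    return mappings


def select_loader(extension: str, path: str, mappings: List[LoaderMapping],
                  writer_class: str = "") -> Optional[str]:
    """Pick a loader spec for a file; tagged mappings are tried first (assumption)."""
    folders = path.strip("/").split("/")[:-1]
    candidates = [m for m in mappings
                  if m.extension == extension
                  and (not m.writer_class or m.writer_class == writer_class)]
    for m in candidates:
        if m.path_tag and m.path_tag in folders:
            return m.loader_spec
    for m in candidates:
        if not m.path_tag:
            return m.loader_spec
    return None
```

With the mapping "gz:type1:Type1Loader(), gz:type2:Type2Loader()", calling select_loader("gz", "/logs/type2/mylog", ...) would return "Type2Loader()" under these assumptions.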
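The content-guessing table in this issue lists Java signed bytes. The Python sketch below converts them to unsigned values and looks up the first three bytes of a file. The function name guess_extension is hypothetical, and returning None for an unknown header is an assumption, since the issue does not say what the AllLoader does in that case.

```python
import bz2
import gzip
from typing import Optional

# Signed byte triples exactly as listed in the issue description.
SIGNED_MAGIC = {
    (-119, 76, 90): "lzo",
    (31, -117, 8): "gz",
    (66, 90, 104): "bz2",
    (83, 69, 81): "seq",
}

# Java bytes are signed; masking with 0xFF gives the unsigned value on disk.
MAGIC = {bytes(b & 0xFF for b in triple): ext
         for triple, ext in SIGNED_MAGIC.items()}


def guess_extension(header: bytes) -> Optional[str]:
    """Map the first three bytes of a file to an extension, or None if unknown."""
    return MAGIC.get(bytes(header[:3]))


# Small demonstration using real compressed data from the standard library.
print(guess_extension(gzip.compress(b"hello")))
print(guess_extension(bz2.compress(b"hello")))
```

The guessed extension can then be used for the same lookup as a real file extension.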
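The Hive-style partition filtering can be illustrated in the same way. This Python sketch only shows the idea: partition_values and path_matches are hypothetical names, and the expression grammar here is an assumption, limited to a single comparison of one partition key against an integer literal. The issue does not specify which expressions the AllLoader constructor accepts.

```python
import operator
import re
from typing import Dict

OPERATORS = {
    "<=": operator.le, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
}

# Assumed grammar: <key> <op> <integer>, e.g. "type<=2".
EXPRESSION = re.compile(r"\s*(\w+)\s*(<=|>=|==|!=|<|>)\s*(-?\d+)\s*")


def partition_values(path: str) -> Dict[str, str]:
    """Collect key=value folder names from a path such as /logs/type=1/part-0."""
    return dict(part.split("=", 1)
                for part in path.strip("/").split("/") if "=" in part)


def path_matches(path: str, expression: str) -> bool:
    """Return True when the path's partition value satisfies the expression."""
    match = EXPRESSION.fullmatch(expression)
    if match is None:
        raise ValueError(f"unsupported filter expression: {expression!r}")
    key, op, literal = match.groups()
    values = partition_values(path)
    if key not in values:
        return False  # assumption: unpartitioned paths are filtered out
    return OPERATORS[op](int(values[key]), int(literal))


paths = ["/logs/type=1", "/logs/type=2", "/logs/type=3"]
selected = [p for p in paths if path_matches(p, "type<=2")]
```

Here selected holds /logs/type=1 and /logs/type=2, matching the AllLoader('type<=2') example in the issue.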