Hi everyone,

Thanks in advance to the Pig community, great tool! We're using Pig for a project that is basic ETL: it takes in a client-specific CSV file, filters out the data we want, transforms it into the format we want, and then writes the result to a client-specific database. Currently we've implemented this as a single Pig script that takes the client database name and the CSV file location as parameters, and we use a shell script to call the Pig script once for each client CSV file we have (parsing the client name out of the filename to build the database-name parameter).
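For context, the per-client driver loop described above could be sketched roughly like this (a minimal sketch: the `data/` directory, the `<client>.csv` naming convention, the `etl.pig` script name, and the `INPUT`/`CLIENT_DB` parameter names are all assumptions for illustration):

```shell
#!/bin/sh
# Hypothetical driver: one pig invocation per client CSV file.
# Assumes files are named <client>.csv and that etl.pig reads
# $INPUT and $CLIENT_DB via Pig parameter substitution.
run_all() {
  for f in data/*.csv; do
    # basename strips the directory and the .csv suffix,
    # e.g. data/acme.csv -> acme
    client=$(basename "$f" .csv)
    pig -param INPUT="$f" -param CLIENT_DB="$client" etl.pig
  done
}
```

The function is only defined here, not invoked; in practice you would call `run_all` (or inline the loop) from cron or your scheduler of choice.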
Basically like this:

    single_csv = LOAD 'file_name.csv' USING PigStorage() AS (fields);
    filtered_set = FILTER single_csv BY ...;  -- filter as needed
    STORE filtered_set INTO 'table' USING DBStorage(driver, unique_client_db_info, 'INSERT INTO table (columns) VALUES (?,?)');

My question is: is it possible/advisable to load all the CSV files at once with something like:

    all_csv = LOAD '*.csv' USING PigStorage() AS (client_name:nameParsefunc(), field1:$1, field2:$2);

The idea is to somehow parse the unique client filename for each CSV and insert the client name into each tuple as it is stored, so the tuple is properly associated with its client. That way we could write each tuple to the correct database, and the 'unique_client' data in the STORE query would be loaded dynamically via ? for each tuple. Or is our current approach best practice: keep the Pig script light and do the ETL on a per-CSV basis?

I hope this makes sense, and thanks in advance for any feedback, suggestions, or criticisms of how we're going about this! Also, let me know if this would be better brought to the IRC channel...

-- 
Harrison Cavallero
cavallero.me <http://cavallero.me>
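For reference, Pig's built-in PigStorage supports a '-tagFile' option (Pig 0.12+; '-tagsource' in earlier releases) that prepends the source filename to each tuple, which gets at the filename-tagging idea without a custom load func. A minimal sketch, assuming comma-delimited files named <client>.csv and two data fields (the field names here are illustrative):

    -- Load every CSV; '-tagFile' makes the source filename the first field.
    all_csv = LOAD '*.csv' USING PigStorage(',', '-tagFile')
              AS (filename:chararray, field1:chararray, field2:chararray);

    -- Derive the client name by stripping the '.csv' extension
    -- (assumes files are named <client>.csv).
    with_client = FOREACH all_csv GENERATE
        SUBSTRING(filename, 0, LAST_INDEX_OF(filename, '.')) AS client_name,
        field1, field2;

Note that this only tags the tuples; DBStorage still writes to a single connection per STORE, so routing tuples to per-client databases would presumably still require something extra (e.g. a SPLIT per known client, or a custom StoreFunc).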
