It looks like the best option at this point is to write a custom UDF that loads a set of regular expressions from a file and runs the data through all of them.
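
A rough sketch of what such a UDF could look like in Python, written so that it might be registered as a Jython UDF. I have not run this under Pig. The pattern file layout (one "label<TAB>regex" per line), the function names, and the fallback for outputSchema are all assumptions made for illustration. Python's regex dialect also differs slightly from the Java one behind Pig's matches operator, so patterns may need adjusting.

```python
import re

# Under Pig's Jython engine the outputSchema decorator is supplied by Pig.
# This fallback is an assumption so the module can also be run standalone.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

# Compiled patterns per file path, so the file is parsed once per process.
_CACHE = {}


def load_patterns(path):
    """Read 'label<TAB>regex' lines, skipping blank lines and # comments."""
    patterns = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip() or line.startswith("#"):
                continue
            label, regex = line.split("\t", 1)
            # Pig's `matches` needs the whole string to match, so anchor the
            # end with \Z and use match(), which anchors the start.
            patterns.append((label, re.compile("(?:" + regex + r")\Z")))
    return patterns


@outputSchema("label:chararray")
def match_label(value, patterns_path):
    """Return the label of the first regex matching the whole value, else None."""
    if value is None:
        return None
    if patterns_path not in _CACHE:
        _CACHE[patterns_path] = load_patterns(patterns_path)
    for label, regex in _CACHE[patterns_path]:
        if regex.match(value):
            return label
    return None
```

The idea is to tag each record with a label in a single FOREACH and then filter or split on that label, instead of repeating one FILTER ... matches line per regex. The pattern file would have to be readable on the task nodes, and you would still need one STORE per label unless the storage function can split output by a field.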

On Mon, Oct 6, 2014 at 1:44 PM, Ankur Kasliwal <ankur.kasliwal...@gmail.com> wrote:

> Thanks for replying everyone. Few comments to everyone's suggestion.
>
> 1> I am processing sequence file which consist of many CSV files. I need
> to extract only few among all CSV'S. So that is the reason I am doing
> 'SelectFieldByValue' which is file name in my case not by field directly.
>
> 2> All selected files ( different RegEx ) are stored in HDFS separately.
> So one STORE statement for each extracted file in a bag.
>
> 3> Cannot do cross join as all files input will get combined, do not
> want to do that.
>
> 4> Cannot do AND/OR operator as i need different bags for each selected
> file ( RegEx).
>
> Let me know if any one has any other suggestions.
> Sorry for not being clear with specification at first place.
>
> Thanks.
>
> On Mon, Oct 6, 2014 at 4:12 PM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
>> In case you haven't seen this already, take a look at
>> http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies on
>> optimizing your pig scripts.
>>
>> On Mon, Oct 6, 2014 at 1:08 PM, Russell Jurney <russell.jur...@gmail.com>
>> wrote:
>>
>> > Actually, I don't think you need SelectFieldByValue. Just use the name of
>> > the field directly.
>> >
>> > On Monday, October 6, 2014, Prashant Kommireddi <prash1...@gmail.com>
>> > wrote:
>> >
>> > > Are these regex static? If yes, this is easily achieved with embedding
>> > > your script in Java or any other language that Pig supports
>> > > http://pig.apache.org/docs/r0.13.0/cont.html
>> > >
>> > > You could also possibly write a UDF that loops through all the regex
>> > > and returns result.
>> > >
>> > > On Mon, Oct 6, 2014 at 12:44 PM, Ankur Kasliwal
>> > > <ankur.kasliwal...@gmail.com> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I have written a 'Pig Script' which is processing Sequence files
>> > > > given as input.
>> > > >
>> > > > It is working fine but there is one problem mentioned below.
>> > > >
>> > > > I have repetitive statements in my pig script, as shown below:
>> > > >
>> > > > - Filtered_Data_1 = FILTER BagName BY ($0 matches 'RegEx-1');
>> > > > - Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
>> > > > - Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
>> > > > - So on…
>> > > >
>> > > > Question :
>> > > >
>> > > > So is there any way by which I can have above statement written once
>> > > > and then loop through all possible "RegEx" and substitute in Pig
>> > > > script.
>> > > >
>> > > > For Example:
>> > > >
>> > > > Filtered_Data_X = FILTER BagName BY ($0 matches 'RegEx'); ( have
>> > > > this statement once )
>> > > >
>> > > > ( loop through all possible RegEx and substitute value in the
>> > > > statement )
>> > > >
>> > > > Right now I am calling Pig script from a shell script, so any way
>> > > > from shell script will be also be welcome..
>> > > >
>> > > > Thanks in advance.
>> > > >
>> > > > Happy Pigging!!!!
>> >
>> > --
>> > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
>> > datasyndrome.com