Optimizing Pig script

2014-10-06 Thread Ankur Kasliwal
Hi, I have written a Pig script that processes sequence files given as input. It works fine, but there is one problem: the script contains repetitive statements, as shown below: - Filtered_Data_1 = FILTER BagName BY ($0 matches 'RegEx-1'); -

Re: Optimizing Pig script

2014-10-06 Thread Russell Jurney
Load the regex patterns from a file (one pattern per line), CROSS their relation with BagName, and then use SelectFieldByName UDF to summon the regex pattern from the regex relation. https://issues.apache.org/jira/plugins/servlet/mobile#issue/DATAFU-69 I believe you can use a field name against

Re: Optimizing Pig script

2014-10-06 Thread Prashant Kommireddi
Are these regexes static? If yes, this is easily achieved by embedding your script in Java or any other language that Pig supports: http://pig.apache.org/docs/r0.13.0/cont.html You could also write a UDF that loops through all the regexes and returns the result. On Mon, Oct 6, 2014 at 12:44
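As a rough illustration of the embedding idea above, the repeated FILTER statements could be generated from a static list of patterns instead of being written by hand. This is only a sketch: the function name build_filter_statements and its parameters are hypothetical, the relation and alias names are taken from the original post, and the patterns are assumed to be already quoted and escaped the way they would appear in a Pig script. How the generated text is then handed to Pig is left out.

```python
def build_filter_statements(patterns, relation="BagName", prefix="Filtered_Data_"):
    """Return one Pig FILTER statement per pattern (hypothetical helper)."""
    statements = []
    for index, pattern in enumerate(patterns, start=1):
        # Assumption: each pattern is already escaped for a Pig string literal.
        statements.append(
            "{0}{1} = FILTER {2} BY ($0 matches '{3}');".format(
                prefix, index, relation, pattern
            )
        )
    return statements


if __name__ == "__main__":
    for statement in build_filter_statements(["RegEx-1", "RegEx-2"]):
        print(statement)
```

Adding another pattern then means adding one entry to the list rather than another hand-written statement.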

Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
Hi Ankur, Is the list of regular expressions static or dynamic? If it's a static list, you can collapse all the filter operators into a single operator and use the AND keyword to combine them. E.g. Filtered_Data = FILTER BagName BY ($0 matches 'RegEx-1') AND ($0 matches 'RegEx-2') AND ($0
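For a static list, the single combined FILTER statement described above could also be assembled programmatically. The sketch below is an assumption-laden illustration, not part of the original reply: build_combined_filter and its parameters are hypothetical names, and the patterns are assumed to be already escaped for a Pig string literal.

```python
def build_combined_filter(patterns, relation="BagName", alias="Filtered_Data", joiner="AND"):
    """Return a single Pig FILTER statement combining all patterns (hypothetical helper)."""
    # One parenthesised 'matches' condition per pattern.
    conditions = ["($0 matches '{0}')".format(pattern) for pattern in patterns]
    combined = " {0} ".format(joiner).join(conditions)
    return "{0} = FILTER {1} BY {2};".format(alias, relation, combined)


if __name__ == "__main__":
    print(build_combined_filter(["RegEx-1", "RegEx-2"]))
```

With AND, a row is kept only if it matches every pattern; passing joiner="OR" would keep rows that match any one of them.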

Re: Optimizing Pig script

2014-10-06 Thread Russell Jurney
Actually, I don't think you need SelectFieldByValue. Just use the name of the field directly. On Monday, October 6, 2014, Prashant Kommireddi prash1...@gmail.com wrote: Are these regexes static? If yes, this is easily achieved by embedding your script in Java or any other language that Pig

Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
In case you haven't seen this already, take a look at http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies on optimizing your pig scripts. On Mon, Oct 6, 2014 at 1:08 PM, Russell Jurney russell.jur...@gmail.com wrote: Actually, I don't think you need SelectFieldByValue. Just

Re: Optimizing Pig script

2014-10-06 Thread Ankur Kasliwal
Thanks for replying, everyone. A few comments on the suggestions. 1. I am processing a sequence file that consists of many CSV files, and I need to extract only a few of them. That is why I am selecting by value ('SelectFieldByValue'), which is the file name in my case, rather than by field directly. 2. All

Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
It looks like the best option at this point is to write a custom UDF that loads a set of regular expressions from a file and runs the data through all of them. On Mon, Oct 6, 2014 at 1:44 PM, Ankur Kasliwal ankur.kasliwal...@gmail.com wrote: Thanks for replying, everyone. A few comments on
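The core logic of such a UDF might look roughly like the following. This is a plain Python sketch of the matching logic only, not a registered Pig UDF: load_patterns and matching_patterns are hypothetical names, the input is assumed to hold one pattern per line, and whole-string matching is used to mirror the behaviour of Pig's matches operator.

```python
import re


def load_patterns(lines):
    """Compile one regular expression per non-empty line (file handle or list of strings)."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def matching_patterns(value, compiled):
    """Return the source text of every pattern that matches the whole value."""
    # fullmatch requires the entire string to match, like Pig's 'matches'.
    return [pattern.pattern for pattern in compiled if pattern.fullmatch(value)]


if __name__ == "__main__":
    # Hypothetical patterns; in practice they would be read from a file.
    compiled = load_patterns(["a.*\\.csv\n", "b\\d+\n"])
    print(matching_patterns("alpha.csv", compiled))
```

Compiling the patterns once and reusing them for every record avoids recompiling each expression per row.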

Re: Optimizing Pig script

2014-10-06 Thread Russell Jurney
If you can describe the layout of your input files more thoroughly, it would help. On Monday, October 6, 2014, Pradeep Gollakota pradeep...@gmail.com wrote: It looks like the best option at this point is to write a custom UDF that loads a set of regular expressions from a file and runs the