It looks like the best option at this point is to write a custom UDF that
loads a set of regular expressions from a file and runs the data through
all of them.
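
A rough sketch of that matching logic, written in plain Python rather than as an actual Pig UDF. The file layout (one regex per line) and the names load_patterns and matching_indexes are assumptions made for illustration, not anything from your script. It uses re.fullmatch because Pig's `matches` operator requires the whole string to match.

```python
import os
import re
import tempfile


def load_patterns(path):
    """Read one regular expression per line (assumed layout), skipping blank lines."""
    patterns = []
    with open(path) as handle:
        for line in handle:
            text = line.strip()
            if text:
                patterns.append(re.compile(text))
    return patterns


def matching_indexes(patterns, value):
    """Return the 0-based index of every pattern that matches the whole value."""
    return [i for i, pattern in enumerate(patterns) if pattern.fullmatch(value)]


if __name__ == "__main__":
    # Hypothetical regex file; the file names below are made up for the demo.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("sales_.*\\.csv\n.*_2014\\.csv\n")
        regex_file = tmp.name

    patterns = load_patterns(regex_file)
    for name in ["sales_2014.csv", "users_2014.csv", "notes.txt"]:
        print(name, matching_indexes(patterns, name))

    os.remove(regex_file)
```

Inside a real UDF, the returned indexes could tag each record with the regex it matched, so the script only has to scan the data once before splitting it into separate outputs.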

On Mon, Oct 6, 2014 at 1:44 PM, Ankur Kasliwal <ankur.kasliwal...@gmail.com>
wrote:

> Thanks for replying, everyone. A few comments on everyone's suggestions.
>
> 1>  I am processing a sequence file which consists of many CSV files. I need
> to extract only a few of those CSVs. That is why I am using
> 'SelectFieldByValue' on the file name (in my case) rather than referencing
> a field directly.
>
> 2>  All selected files (one per RegEx) are stored in HDFS separately,
> so there is one STORE statement for the bag of each extracted file.
>
> 3>  I cannot do a cross join, as the input from all files would get combined,
> and I do not want that.
>
> 4>  I cannot use AND/OR operators, as I need a different bag for each selected
> file (RegEx).
>
>
>
> Let me know if anyone has any other suggestions.
> Sorry for not being clear about the specification in the first place.
>
> Thanks.
>
> On Mon, Oct 6, 2014 at 4:12 PM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
>> In case you haven't seen this already, take a look at
>> http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies on
>> optimizing your pig scripts.
>>
>> On Mon, Oct 6, 2014 at 1:08 PM, Russell Jurney <russell.jur...@gmail.com>
>> wrote:
>>
>> > Actually, I don't think you need SelectFieldByValue. Just use the name of
>> > the field directly.
>> >
>> > On Monday, October 6, 2014, Prashant Kommireddi <prash1...@gmail.com>
>> > wrote:
>> >
>> > > Are these regexes static? If yes, this is easily achieved by embedding
>> > > your script in Java or any other language that Pig supports:
>> > > http://pig.apache.org/docs/r0.13.0/cont.html
>> > >
>> > > You could also possibly write a UDF that loops through all the regexes
>> > > and returns the result.
>> > >
>> > >
>> > >
>> > > On Mon, Oct 6, 2014 at 12:44 PM, Ankur Kasliwal <
>> > > ankur.kasliwal...@gmail.com> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > >
>> > > >
>> > > > I have written a ‘Pig Script’ which processes Sequence files given as
>> > > > input.
>> > > >
>> > > > It is working fine but there is one problem mentioned below.
>> > > >
>> > > >
>> > > >
>> > > > I have repetitive statements in my pig script,  as shown below:
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >    -  Filtered_Data_1 = FILTER BagName BY ($0 matches 'RegEx-1');
>> > > >    -  Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
>> > > >    -  Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
>> > > >    - So on…
>> > > >
>> > > >
>> > > >
>> > > > Question :
>> > > >
>> > > > So is there any way I can write the above statement once and then
>> > > > loop through all possible “RegEx” values, substituting each into the
>> > > > Pig script?
>> > > >
>> > > >
>> > > >
>> > > > For Example:
>> > > >
>> > > >
>> > > > Filtered_Data_X = FILTER BagName BY ($0 matches 'RegEx');
>> > > > (have this statement once)
>> > > >
>> > > > (loop through all possible RegEx values and substitute each into the
>> > > > statement)
>> > > >
>> > > >
>> > > >
>> > > > Right now I am calling the Pig script from a shell script, so any way
>> > > > of doing this from the shell script will also be welcome.
>> > > >
>> > > >
>> > > >
>> > > > Thanks in advance.
>> > > >
>> > > > Happy Pigging!!!!
>> > > >
>> > >
>> >
>> >
>> > --
>> > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
>> > datasyndrome.com
>> >
>>
>
>
