Hi folks,
I was able to extract that string using an alternative approach. Sharing it
in case it helps someone who hits a similar issue and can't upgrade for
some reason.
When I used REGEX_EXTRACT in a single statement as suggested, I got a
mismatch error: the parser treated the semicolon in the regex as the end of
the statement.
B=foreach D generate REGEX_EXTRACT(test,'(B75.*;)',1);
So instead I used a nested foreach, and it worked:
B = foreach D {
    test1 = REGEX_EXTRACT(test, '(B75.*;)', 1);
    -- remove the trailing semicolon (written as the escape \\u003B)
    test2 = REPLACE(test1, '\\u003B', '');
    GENERATE test2;
}
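
A side note for anyone reusing this: '.*' is greedy, so '(B75.*;)' runs to
the last semicolon in the field, and REPLACE removes every semicolon in the
match, not only the final one. Pig's REGEX_EXTRACT and REPLACE are built on
Java regular expressions, so the difference can be checked in plain Java.
The sketch below is only an illustration: the class name, method names and
sample string are made up, and the pattern '(B75[^\u003B]*)' is a suggested
variant that has not been run against Pig 0.11.1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Pig; it only exercises the regexes.
class B75Extract {
    // Mirrors the nested foreach: greedy match through a semicolon,
    // then remove all semicolons from the match.
    static String greedyThenStrip(String field) {
        Matcher m = Pattern.compile("(B75.*;)").matcher(field);
        if (!m.find()) {
            return null;
        }
        return m.group(1).replaceAll("\\u003B", "");
    }

    // Suggested variant (an assumption, untested in Pig): stop before the
    // first semicolon, with no literal ';' anywhere in the pattern.
    static String upToFirstSemicolon(String field) {
        Matcher m = Pattern.compile("(B75[^\\u003B]*)").matcher(field);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "x=1;B75abc;y=2;"; // made-up field value
        System.out.println(greedyThenStrip(sample));
        System.out.println(upToFirstSemicolon(sample));
    }
}
```

For that sample the first method gives B75abcy=2 and the second gives
B75abc. If the field can hold more than one semicolon after B75, the second
pattern is probably closer to what was wanted. In Pig it would presumably be
written as REGEX_EXTRACT(test, '(B75[^\\u003B]*)', 1), using the same
escaping as the REPLACE call above, but that is untested.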
Cheers,
Kartik
On Mon, May 12, 2014 at 9:27 PM, kartik manocha <[email protected]>wrote:
> Thanks, it could be due to this bug, as I'm using 0.11.1.
>
> Upgrading isn't a feasible option for me at the moment.
>
> I'll explore writing a UDF. By the way, thanks for the quick response.
>
>
> Thanks,
> Kartik
>
>
> On Mon, May 12, 2014 at 9:11 PM, Pradeep Gollakota
> <[email protected]>wrote:
>
>> Kartik,
>>
>> Looks like you're facing this issue:
>> https://issues.apache.org/jira/browse/PIG-2507
>> What version of Pig are you using? The issue is fixed in 0.11.2 and 0.12.
>> So if you upgrade to these versions, your problem should go away.
>>
>> If you're unable to upgrade for some reason, your best bet is to write a
>> custom UDF. The general idea remains the same: write a regex to extract
>> the appropriate substring and project that from the UDF.
>>
>>
>> Unmesha,
>>
>> Start a new thread with your question so we don't pollute this thread for
>> Kartik. Can you give some samples as well? I'm not sure I understood your
>> question.
>>
>>
>> On Mon, May 12, 2014 at 3:05 AM, kartik manocha <[email protected]> wrote:
>>
>> > Pradeep,
>> >
>> > Thanks for the pointers, but as I mentioned, I need to extract the string
>> > up to the semicolon, and that is where I'm running into trouble.
>> >
>> > The semicolon is the painful part: when I put it in the regex, Pig treats
>> > it as the end of the statement and produces an error.
>> >
>> > Without the semicolon it works fine, but it returns everything from B75
>> > onward, e.g.:
>> > B=foreach D generate REGEX_EXTRACT(test,'(B75.*)',1);
>> >
>> > Is there any way to include a semicolon in the regex above so that it
>> > returns only the string before it?
>> >
>> >
>> > Thanks,
>> > Kartik
>> >
>> >
>> >
>> > On Mon, May 12, 2014 at 2:03 PM, Pradeep Gollakota <[email protected]> wrote:
>> >
>> > > Check out
>> > > http://archive.cloudera.com/cdh/3/pig/piglatin_ref2.html#REGEX_EXTRACT
>> > >
>> > > This may suit your needs
>> > >
>> > >
>> > > On Mon, May 12, 2014 at 12:16 AM, kartik manocha <[email protected]> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I am new to Pig and am having trouble extracting a string from a
>> > > > field. Here is the scenario:
>> > > >
>> > > > -> I am loading data with several fields, one of which is named
>> > > >    'test_data'.
>> > > > -> This field contains a lot of content. I want to extract from it the
>> > > >    string that starts with B75 and ends with a semicolon.
>> > > > -> After extracting this string, I want to add it as a new field to
>> > > >    the existing bag that was loaded.
>> > > >
>> > > > I tried the INDEXOF UDF, but it works only for a single character.
>> > > > Even with a single character, it returned () instead of an index.
>> > > > As a test, I passed indexes manually to the SUBSTRING UDF, and that
>> > > > did produce the string.
>> > > >
>> > > > However, I can't get the position from INDEXOF, or maybe there is a
>> > > > better way of doing this.
>> > > >
>> > > > If you have any pointers / suggestions, please share.
>> > > >
>> > > > Thanks in advance.
>> > > >
>> > > >
>> > > > Best,
>> > > > Kartik
>> > > >
>> > >
>> >
>>
>
>