[
https://issues.apache.org/jira/browse/PIG-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693758#action_12693758
]
Mridul Muralidharan commented on PIG-738:
-----------------------------------------
As a workaround, you can use the Unicode escape for '.' in your script,
for example 'www\u002eyahoo\u002ecom/sports', instead of trying to escape '.' with a backslash.
But yes, the underlying problem does seem to exist.
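To illustrate why the workaround can help, here is a minimal Java sketch. It assumes (not verified here) that Pig hands the quoted text to the UDF constructor unchanged, so the UDF sees the six characters \u002e. java.util.regex reads that escape as the literal character '.', so only a real dot matches. The class and method names (UnicodeDotEscape, countMatches) are made up for this illustration; the flags are the ones from the UDF snippet in the issue.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeDotEscape {
    // Counts non-overlapping matches of regex in text, with the same flags as the UDF snippet.
    static int countMatches(String regex, String text) {
        Pattern p = Pattern.compile(regex, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the Java compiler from expanding the escape,
        // so at run time the regex text contains a backslash followed by u002e.
        // Assumption: this is what the UDF would receive from the Pig script.
        String regex = "www\\u002eyahoo\\u002ecom/sports";

        System.out.println(countMatches(regex, "jsjsjwww.yahoo.com/sports")); // prints 1
        System.out.println(countMatches(regex, "wwwJyahooMcom/sports"));      // prints 0
    }
}
```

If Pig instead expands the escape itself before calling the constructor, the UDF would receive a plain '.', which matches any character, so the last input line would match as well.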
> Regexp passed from pigscript fails in UDF
> -------------------------------------------
>
> Key: PIG-738
> URL: https://issues.apache.org/jira/browse/PIG-738
> Project: Pig
> Issue Type: Bug
> Components: grunt
> Affects Versions: 0.3.0
> Reporter: Viraj Bhat
> Fix For: 0.3.0
>
> Attachments: myregexp.jar, RegexGroupCount.java, regexp.pig,
> regexpinput.txt
>
>
> Consider a Pig script that counts matches of a regular expression in a text
> file.
> The regular expression supplied in the Pig script needs to escape the "."
> (dot) character.
> {code}
> register myregexp.jar;
> -- pattern not picked up
> define minelogs ci_pig_udfs.RegexGroupCount('www\\.yahoo\\.com/sports');
> A = load '/user/viraj/regexpinput.txt' using PigStorage() as (source : chararray);
> B = foreach A generate minelogs(source) as sportslogs;
> dump B;
> {code}
> Snippet of UDF RegexGroupCount.java
> {code}
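> import java.io.IOException;
> import java.util.regex.Pattern;
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.Tuple;
>
> public class RegexGroupCount extends EvalFunc<Integer> {
>     private final Pattern pattern_;
>
>     public RegexGroupCount(String patternStr) {
>         // Debug output: show the pattern exactly as received from the Pig script.
>         System.out.println("My pattern supplied is " + patternStr);
>         System.out.println("Equality test " + patternStr.equals("www\\.yahoo\\.com/sports"));
>         pattern_ = Pattern.compile(patternStr, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
>     }
>
>     public Integer exec(Tuple input) throws IOException {
>         // (body not shown in this snippet)
>     }
> }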
> {code}
> Running the above script on the following dataset:
> ====================================================================================================
> dshfdskfwww.yahoo.com/sportsjoadfjdslpdshfdskfwww.yahoo.com/sportsjoadfjdsl
> kas;dka;sd
> jsjsjwww.yahoo.com/sports
> jsdLSJDcom/sports
> wwwJyahooMcom/sports
> ====================================================================================================
> Results in the following:
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> Userfunc: (Name: UserFunc viraj-Sat Mar 28 02:06:31 PDT 2009-14 function:
> ci_pig_udfs.RegexGroupCount('www\\.yahoo\\.com/sports') Operator Key:
> viraj-Sat Mar 28 02:06:31 PDT 2009-14)
> Userfunc fs: int
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> 2009-03-28 02:06:43,923 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 100% complete
> 2009-03-28 02:06:43,923 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Success!
> (0)
> (0)
> (0)
> (0)
> (0)
> ====================================================================================================
> In essence, there seems to be no way of passing this kind of constructor
> argument through the Pig script. The only workaround seems to be hard-coding
> the pattern in the UDF.
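
One possible reading of the output above, sketched in Java: the log prints the pattern with two backslashes before each dot, which suggests the string reaching the constructor contains two backslash characters rather than one. Under that assumption the equality test is false, and the regex means "a literal backslash followed by any character", which none of the input lines contain. The class and method names (DoubledBackslash, found) are hypothetical.

```java
import java.util.regex.Pattern;

class DoubledBackslash {
    // True if regex occurs anywhere in text, with the same flags as the UDF snippet.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex, Pattern.DOTALL | Pattern.CASE_INSENSITIVE)
                      .matcher(text)
                      .find();
    }

    public static void main(String[] args) {
        // Run-time text www\.yahoo\.com/sports (one backslash per dot): the intended regex.
        String intended = "www\\.yahoo\\.com/sports";
        // Run-time text www\\.yahoo\\.com/sports (two backslashes per dot): what the log shows.
        // Assumption: the log reflects the actual characters of the string.
        String received = "www\\\\.yahoo\\\\.com/sports";
        String line = "jsjsjwww.yahoo.com/sports";

        System.out.println(received.equals(intended)); // prints false
        System.out.println(found(intended, line));     // prints true
        System.out.println(found(received, line));     // prints false
        // The received regex only matches text that itself contains backslashes.
        System.out.println(found(received, "www\\Xyahoo\\Ycom/sports")); // prints true
    }
}
```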
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.