[ https://issues.apache.org/jira/browse/PIG-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai resolved PIG-4803.
-----------------------------
Resolution: Fixed
Hadoop Flags: Reviewed
Fix Version/s: 0.16.0
+1. Patch committed to trunk.
> Improve performance of regex-based builtin functions
> ----------------------------------------------------
>
> Key: PIG-4803
> URL: https://issues.apache.org/jira/browse/PIG-4803
> Project: Pig
> Issue Type: Improvement
> Reporter: Eyal Allweil
> Assignee: Eyal Allweil
> Labels: perfomance, regex
> Fix For: 0.16.0
>
> Attachments: PIG-4803.patch
>
>
> There are three strategies used by Pig's regex-based built-in functions.
> 1) REPLACE doesn't do any pattern caching.
> 2) REGEX_EXTRACT and REGEX_EXTRACT_ALL attempt to cache a single pattern as
> an instance variable.
> 3) PluckTuple attempts to cache a single pattern statically. (Doesn't this
> cause problems if two clashing defines for different PluckTuples are used?)
> I have a little fix and a medium fix in mind. The little fix is to give
> REPLACE a similar caching strategy, and to fix PluckTuple, if the static
> nature of the pattern is indeed a problem.
> The medium fix is to give all four functions an additional constructor that
> takes a constant regex (and therefore one less argument in evaluation) and to
> use that regex if it exists. This would be backwards compatible and should
> barely affect the performance of the existing code path, if at all. I think
> that in cases where there are two clashing usages of the functions in the
> same foreach..generate, it would allow the pattern caching to work.
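
For readers following the description above, here is a rough sketch of the two ideas in plain Java. It is not the committed patch and does not use Pig's EvalFunc API. The class names (RegexCachingSketch, SinglePatternCache, ConstantRegexExtract), the compileCount field, and the use of find() are all hypothetical and exist only for illustration. The first class mimics the "cache a single pattern as an instance variable" strategy; the second mimics the proposed constructor that takes a constant regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCachingSketch {

    /** Hypothetical stand-in for strategy 2: remember only the last regex seen. */
    public static class SinglePatternCache {
        private String lastRegex;
        private Pattern lastPattern;
        public int compileCount = 0; // illustration only: counts recompilations

        public Pattern get(String regex) {
            // Recompile whenever the regex differs from the previous call.
            if (!regex.equals(lastRegex)) {
                lastPattern = Pattern.compile(regex);
                lastRegex = regex;
                compileCount++;
            }
            return lastPattern;
        }
    }

    /** Hypothetical stand-in for the "medium fix": regex fixed at construction. */
    public static class ConstantRegexExtract {
        private final Pattern pattern; // compiled exactly once

        public ConstantRegexExtract(String regex) {
            this.pattern = Pattern.compile(regex);
        }

        /** Returns the requested capture group, or null if nothing matches. */
        public String exec(String input, int group) {
            Matcher m = pattern.matcher(input);
            return m.find() ? m.group(group) : null;
        }
    }

    public static void main(String[] args) {
        // If one cache sees alternating regexes, every call is a cache miss.
        SinglePatternCache cache = new SinglePatternCache();
        for (int i = 0; i < 4; i++) {
            cache.get(i % 2 == 0 ? "a+" : "b+");
        }
        System.out.println("compiles with alternating regexes: " + cache.compileCount);

        // With one constant-regex instance per usage, nothing is recompiled.
        ConstantRegexExtract first = new ConstantRegexExtract("(\\d+)-(\\d+)");
        ConstantRegexExtract second = new ConstantRegexExtract("([a-z]+)@");
        System.out.println(first.exec("12-34", 2));
        System.out.println(second.exec("user@example.org", 1));
    }
}
```

Whether two usages in the same foreach..generate actually end up sharing one cached pattern depends on how Pig instantiates the functions, which this sketch does not model. It only shows why a single-slot cache stops helping once two different regexes alternate, and why a pattern compiled in the constructor avoids that.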