[ https://issues.apache.org/jira/browse/PIG-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eyal Allweil updated PIG-4803:
------------------------------
    Attachment: PIG-4803.patch

This patch gives REPLACE a similar compiled pattern caching strategy to that of REGEX_EXTRACT and REGEX_EXTRACT_ALL. Because the existing implementation swallows exceptions with a warning, I did the same in my implementation (although they are different, more informative warnings). I changed TestStringUDFs.TestReplace slightly to cover the new code.

> Improve performance of regex-based builtin functions
> ----------------------------------------------------
>
>                 Key: PIG-4803
>                 URL: https://issues.apache.org/jira/browse/PIG-4803
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>         Attachments: PIG-4803.patch
>
>
> There are three strategies used by Pig's regex-based built in functions.
> 1) REPLACE doesn't do any pattern caching.
> 2) REGEX_EXTRACT and REGEX_EXTRACT_ALL attempt to cache a single pattern as
> an instance variable.
> 3) PluckTuple attempts to cache a single pattern statically. (doesn't this
> cause problems if two clashing defines for different PluckTuples are used?)
> I have a little fix and a medium fix in mind. The little fix is to give
> REPLACE a similar caching strategy, and to fix PluckTuple, if the static
> nature of the pattern is indeed a problem.
> The medium fix is to make all four functions take an additional constructor
> with a constant regex (and therefore one less argument in evaluation) and use
> that if it exists. This would be backwards compatible, should barely (or not)
> affect the performance of the existing code path, but I think that in cases
> where there are two clashing usages of the functions in the same
> foreach..generate it would allow the pattern caching to work.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
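
For readers who have not looked at the patch, the single-pattern caching strategy described above can be sketched roughly as follows. This is not the code from PIG-4803.patch: the class name CachedReplace, the compileCount field, and the warning text are hypothetical, and a real Pig UDF would extend EvalFunc and report through Pig's warning mechanism rather than System.err. It only illustrates the idea of keeping the last compiled pattern as an instance variable and returning null with a warning when the regex is invalid.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical stand-in for a REPLACE-like function; not the actual Pig class.
final class CachedReplace {
    private String lastRegex;     // regex string that produced lastPattern
    private Pattern lastPattern;  // cached compiled form of lastRegex
    int compileCount;             // illustration only: counts Pattern.compile calls

    String replace(String source, String regex, String replacement) {
        if (source == null || regex == null || replacement == null) {
            return null;
        }
        try {
            // Recompile only when the regex differs from the cached one.
            if (!regex.equals(lastRegex)) {
                lastPattern = Pattern.compile(regex);
                lastRegex = regex;
                compileCount++;
            }
            return lastPattern.matcher(source).replaceAll(replacement);
        } catch (PatternSyntaxException e) {
            // Swallow the error with a warning; the previous cache entry is kept
            // because compile() threw before either field was reassigned.
            System.err.println("WARN: invalid regex '" + regex + "': " + e.getDescription());
            return null;
        }
    }

    public static void main(String[] args) {
        CachedReplace r = new CachedReplace();
        System.out.println(r.replace("a b c", " ", "-"));
        System.out.println(r.replace("d e f", " ", "-"));
        System.out.println("compilations: " + r.compileCount);
    }
}
```

Because only one pattern is remembered, two call sites that alternate between different regexes on the same instance would recompile on every call, which is the clashing-usage case the issue description mentions.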