[ 
https://issues.apache.org/jira/browse/LANG-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pascal Schumacher resolved LANG-1206.
-------------------------------------
       Resolution: Fixed
         Assignee: Pascal Schumacher
    Fix Version/s:     (was: 3.4)
                   3.5

Patch applied. Thanks!

> Improve CharSetUtils.squeeze() performance
> ------------------------------------------
>
>                 Key: LANG-1206
>                 URL: https://issues.apache.org/jira/browse/LANG-1206
>             Project: Commons Lang
>          Issue Type: Improvement
>          Components: lang.*
>    Affects Versions: 3.4
>            Reporter: Mohammed Alfallaj
>            Assignee: Pascal Schumacher
>              Labels: patch, performance
>             Fix For: 3.5
>
>         Attachments: CharSetUtils_squeeze_performance_issue.patch, 
> Description_Squeeze.docx
>
>
> Before reading the description below, please note that there was an issue 
> created in 10/Jul/2011 titled "CharSetUtils.squeeze() speedup" LANG-715. This 
> solution does not prevent the repetitive calls to the expensive contains() 
> method explained below. In addition, I updated this to (major) because of the 
> high CPU cost of the current solution.
> A readable form of the description is attached in Description_Squeeze.docx.
> Below is a text format of the description (in case the .docx does not work):
> ***********************************************************************
> Performance problems of current CharSetUtils.squeeze() algorithm:
> ***********************************************************************
> ************************************
> FIRST PROBLEM: 
> ************************************
> -Method org.apache.commons.lang3.CharSet.contains() is called in each 
> character iteration for a consecutive set of duplicated characters in str 
> (starting from the second character in the duplicated set). This large number 
> of calls to contains() are determined to increase the performance overhead 
> for squeeze() calls given the high computing cost of executing contains() 
> method. Note that only one call to contains() is needed for the second 
> character in this set of duplicated characters. There is no need to repeat 
> the calls to contains() method for the same duplicated character that we have 
> already checked in the second occurrence. For example, if str=”abbbbbc” and 
> set=”b”, then the algorithm should call contains() only once for the second b 
> character occurrence and prevent calling contains() again given that it has 
> already been called before for this block. The current algorithm adds high 
> performance overhead by calling contains() repetitively for each duplicated b 
> character starting from the second occurrence. For this example where 
> str=”abbbbbc” and set=”b”, the suggested solution will call contains() method 
> only once instead of 4 times which will largely reduce the performance 
> overhead of squeeze() method. 
> -Consider the following example, System.currentTimeMillis() testing shows a 
> performance improvement by around 83% if we run the suggested enhanced 
> solution with the following sample input data 
>       str = 
> “abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc”
>       set = “b”
> ************************************
> PROPOSED SOLUTION FOR FIRST PROBLEM:
> ************************************
> -Using two character flags inChars and notInChars:
>       inChars: flag to record the recent character that has been determined 
> to be in chars CharSet
>       notInChars: flag to record the recent character that has been 
> determined to NOT be in chars CharSet
> -Adding the following conditions to prevent unnecessary/expensive calls to 
> contains() method. Only one call to contains() method is needed starting from 
> the second character in the consecutive set of duplicated characters:
>       if ((inChars != null) && (ch == inChars))
>               this if-statement checks if this ch char has recently been 
> determined to be in chars CharSet, this prevents calling contains() more than 
> once for a consecutive set of duplicated characters in str
>               if one of these conditions fails, it means that the consecutive 
> set of duplicated characters for this ch character has not yet been checked 
> for this block of duplicated characters (i.e. has not been checked = we have 
> not yet done the one-time call to contains method for this particular set of 
> duplicated characters)
>       if ((notInChars == null) || (ch != notInChars))
>               if both conditions are false, it means that ch has recently 
> been determined to NOT be in chars CharSet
>       if (chars.contains(ch))
>               here we call the expensive chars.contains() only once for this 
> consecutive set of duplicated characters.
> -Note: this suggested solution does not break the functionality of 
> org.apache.commons.lang3.CharSetUtils.squeeze(String str, String… set). This 
> suggested solution is executed against the latest UnitTest 
> CharSetUtilsTest.java written by Apache.
> ************************************
> SECOND PROBLEM: 
> ************************************
> the current algorithm starts the for-loop with i=0 which is not a necessary 
> step. This adds the overhead of executing the if condition (i != 0). In this 
> logic, the algorithm will always append the first character from str to the 
> buffer. So there is no need to execute (i != 0) for every character in the 
> consecutive set of duplicated characters.
> ************************************
> PROPOSED SOLUTION FOR SECOND PROBLEM: 
> ************************************
> lastChar is assigned the first character in str and appended to buffer before 
> executing the for loop. Then the for loop begins with i=1 removing the 
> unnecessary (i != 0) executions.
> END OF DESCRIPTION



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to