[ https://issues.apache.org/jira/browse/LUCENE-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand closed LUCENE-7256.
--------------------------------
    Resolution: Won't Fix

Closing then.

> PatternReplaceCharFilter can make Lucene hang
> ---------------------------------------------
>
>                 Key: LUCENE-7256
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7256
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.4.1
>         Environment: alpine linux v3.3
>            Reporter: Tom Fotherby
>            Priority: Minor
>
> I'm using ElasticSearch (v2.2.0, Lucene v5.4.1) and its [Pattern Replace Char Filter|https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html] (Lucene's PatternReplaceCharFilter). I need to filter out URLs from my query text before it is tokenised. But I found that some input strings cause ElasticSearch to "hang" (slowly eating more CPU and memory) until the system crashes.
> ----
> *Example*
> {code}
> // Character filters are used to "tidy up" a string *before* it is tokenized.
> 'char_filter' => [
>     'url_removal_pattern' => [
>         'type' => 'pattern_replace',
>         'pattern' =>
>             '(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»""'']))',
>         'replacement' => '',
>     ],
> ],
> {code}
> This filter was working fine for some weeks until suddenly ElasticSearch started crashing. We found someone was trying to do a JavaScript injection attack in our search box.
> I pasted the regex and the attack string into https://regex101.com
> * Regexp:
> {code}(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s!()\[\]{};:\'".,<>?«»""''])){code}
> * Test string:
> {code}twitter.com/widgets.js\";fjs.parentNode.insertBefore(js,fjs);}}(document,\"script\",\"twitter-wjs\"{code}
>
> https://regex101.com shows the problem to be "Catastrophic backtracking":
> bq. Catastrophic backtracking has been detected and the execution of your expression has been halted. To find out more what this is, please read the following article: [Runaway Regular Expressions|http://www.regular-expressions.info/catastrophic.html].
>
> It would be great if Lucene could detect "Catastrophic backtracking" and throw an error or return null.
> ----
> As an aside, I created a unit test for our PHP application that uses the same regexp and test string. (PHP can understand the same regexp, even though it's obviously for Java in the ElasticSearch case.) Interestingly, in PHP the regex results in `null`, which is the documented response of [preg_replace|http://php.net/manual/en/function.preg-replace.php] when an error occurs. If PHP can return an error rather than crashing - surely Lucene / Java can too :trollface: ?
> {code}
> namespace app\tests\unit;
>
> use \yii\codeception\TestCase;
>
> class TagsControllerTest extends TestCase
> {
>     public function testRegexForURLDetection()
>     {
>         $regex =
>             '(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»""'']))';
>
>         // Test the Catastrophic backtracking problem
>         $testString =
>             "twitter.com/widgets.js\";fjs.parentNode.insertBefore(js,fjs);}}(document,\"script\",\"twitter-wjs\"";
>
>         // This shows the regex is not working for our test string - it gives
>         // null but should give 'hello '
>         $this->assertEquals(null, preg_replace("/$regex/", '', "hello $testString"));
>     }
> }
> {code}
> ----
> (I originally [opened a ticket|https://github.com/elastic/elasticsearch/issues/17934] with the ElasticSearch project but was told that opening it here would be more appropriate - sorry if I'm wrong)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
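
The behaviour requested in the report (fail with an error instead of hanging) can be approximated on the application side. The following is a minimal Java sketch, not Lucene's or Elasticsearch's implementation: it assumes that `java.util.regex.Matcher` reads its input through `CharSequence.charAt`, and wraps the input so that the match is aborted once a fixed number of character reads has been used up. The names `RegexGuard`, `BudgetedCharSequence`, `RegexBudgetExceededException` and the `maxReads` parameter are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Lucene or the JDK. */
final class RegexGuard {

    /** Thrown when a match reads more characters than its budget allows. */
    static final class RegexBudgetExceededException extends RuntimeException {
        RegexBudgetExceededException(long budget) {
            super("regex aborted after " + budget + " character reads");
        }
    }

    /** Wraps the input and counts every charAt call made while matching. */
    private static final class BudgetedCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long budget;
        private final long[] reads; // shared counter, so sub-sequences draw on the same budget

        BudgetedCharSequence(CharSequence inner, long budget, long[] reads) {
            this.inner = inner;
            this.budget = budget;
            this.reads = reads;
        }

        @Override
        public char charAt(int index) {
            // Runaway backtracking shows up as a very large number of reads.
            if (++reads[0] > budget) {
                throw new RegexBudgetExceededException(budget);
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new BudgetedCharSequence(inner.subSequence(start, end), budget, reads);
        }

        @Override
        public String toString() {
            // Matcher.replaceAll returns text.toString() when nothing matches.
            return inner.toString();
        }
    }

    /** Like Matcher.replaceAll, but gives up after maxReads character reads. */
    static String replaceAll(Pattern pattern, CharSequence input, String replacement, long maxReads) {
        CharSequence guarded = new BudgetedCharSequence(input, maxReads, new long[1]);
        return pattern.matcher(guarded).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // A well-behaved replacement stays far below the budget.
        System.out.println(replaceAll(Pattern.compile("[0-9]+"), "abc123def", "", 10_000));

        // A deliberately tiny budget shows the abort path on a long input.
        StringBuilder longInput = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            longInput.append('a');
        }
        try {
            replaceAll(Pattern.compile("[a-z]+x"), longInput, "", 100);
        } catch (RegexBudgetExceededException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A read budget is used here rather than a wall-clock timeout because it behaves the same on every machine; a deadline check inside `charAt` would be a variation on the same idea. A suitable budget depends on the pattern and the expected input length, so any concrete value would have to be found by measurement.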