[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200018#comment-14200018 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1637054 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1637054 ]

LUCENE-6046: let this test determinize massive automata

RegExp.toAutomaton high memory use
----------------------------------

Key: LUCENE-6046
URL: https://issues.apache.org/jira/browse/LUCENE-6046
Project: Lucene - Core
Issue Type: Bug
Components: core/queryparser
Affects Versions: 4.10.1
Reporter: Lee Hinman
Assignee: Michael McCandless
Priority: Minor
Fix For: 4.10.3, 5.0, Trunk
Attachments: LUCENE-6046.patch, LUCENE-6046.patch, LUCENE-6046.patch

When creating an automaton from an org.apache.lucene.util.automaton.RegExp, it's possible for the automaton to use so much memory that it exceeds the maximum array size for Java.

The following caused an OutOfMemoryError with a 32 GB heap:

{noformat}
new RegExp("\\[\\[(Datei|File|Bild|Image):[^]]*alt=[^]|}]{50,200}").toAutomaton();
{noformat}

When increased to a 60 GB heap, the following exception is thrown:

{noformat}
1 java.lang.IllegalArgumentException: requested array size 2147483624 exceeds maximum array in java (2147483623)
1 __randomizedtesting.SeedInfo.seed([7BE81EF678615C32:95C8057A4ABA5B52]:0)
1 org.apache.lucene.util.ArrayUtil.oversize(ArrayUtil.java:168)
1 org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:295)
1 org.apache.lucene.util.automaton.Automaton$Builder.addTransition(Automaton.java:639)
1 org.apache.lucene.util.automaton.Operations.determinize(Operations.java:741)
1 org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:62)
1 org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:51)
1 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:477)
1 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:426)
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
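The two sizes in the exception message differ by exactly one, and the reported maximum sits 24 below Integer.MAX_VALUE. One possible reading, offered as an assumption rather than something the report states, is that the limit is Integer.MAX_VALUE minus the JVM's array header size, and that the header is 24 bytes because a 60 GB heap is too large for compressed object pointers. The class and constant names below are hypothetical and only illustrate the arithmetic.

```java
public class ArrayLimitSketch {
    // Assumption: 24-byte array header, as on a 64-bit JVM running
    // without compressed object pointers.
    static final int ASSUMED_ARRAY_HEADER_BYTES = 24;

    // Hypothetical stand-in for the limit quoted in the exception message.
    static int maxArrayLength() {
        return Integer.MAX_VALUE - ASSUMED_ARRAY_HEADER_BYTES;
    }

    public static void main(String[] args) {
        int limit = maxArrayLength();
        long requested = 2147483624L; // value taken from the exception message
        System.out.println("limit     = " + limit);
        System.out.println("requested = " + requested);
        System.out.println("over by   = " + (requested - limit));
    }
}
```

If the reading is right, the failing request was the first size past the limit, which would suggest the transition array had grown as far as an int-indexed array can go rather than that the heap itself ran out.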
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200019#comment-14200019 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1637055 from [~rcmuir] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1637055 ]

LUCENE-6046: let this test determinize massive automata
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200021#comment-14200021 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1637056 from [~rcmuir] in branch 'dev/branches/lucene_solr_4_10'
[ https://svn.apache.org/r1637056 ]

LUCENE-6046: let this test determinize massive automata
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200079#comment-14200079 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1637078 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1637078 ]

LUCENE-6046: remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200086#comment-14200086 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1637080 from [~mikemccand] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1637080 ]

LUCENE-6046: remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200090#comment-14200090 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1637082 from [~mikemccand] in branch 'dev/branches/lucene_solr_4_10'
[ https://svn.apache.org/r1637082 ]

LUCENE-6046: remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197970#comment-14197970 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1636830 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1636830 ]

LUCENE-6046: let this test determinize massive automata
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197972#comment-14197972 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1636831 from [~mikemccand] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1636831 ]

LUCENE-6046: let this test determinize massive automata
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197974#comment-14197974 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1636832 from [~mikemccand] in branch 'dev/branches/lucene_solr_4_10'
[ https://svn.apache.org/r1636832 ]

LUCENE-6046: let this test determinize massive automata
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196070#comment-14196070 ]

Nik Everett commented on LUCENE-6046:
-------------------------------------

A couple of updates:

This affects version 4.9 as well, and probably all versions. But its impact is likely minor enough that the fix is only worth adding to the 4.10 line.

I found a few test cases that need lots and lots of states. Any time you feed a couple hundred random Unicode words to the automata, you'll end up needing more than ten thousand states. I've updated those tests to ask for a million states, and they caught a few places where I hadn't been as diligent in piping maxDeterminizedStates through.
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196295#comment-14196295 ]

Michael McCandless commented on LUCENE-6046:
--------------------------------------------

Thanks Nik, your patch looks great; I'll fold in some more minor things from my patch and commit!
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196723#comment-14196723 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1636716 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1636716 ]

LUCENE-6046: add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize
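The idea behind a determinized-state cap can be sketched with a toy subset construction. This is not Lucene's code: the class, the exception name and the bitmask NFA are all hypothetical, and the limit of ten thousand only echoes the figure mentioned elsewhere in this thread. The NFA accepts (a|b)*a(a|b){n}, a textbook case where a bounded repetition after a loop forces roughly 2^(n+1) deterministic states.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class CappedDeterminizeSketch {

    /** Hypothetical exception; the name is illustrative, not Lucene's. */
    static class TooManyStatesException extends RuntimeException {
        TooManyStatesException(int limit) {
            super("determinizing would need more than " + limit + " states");
        }
    }

    /**
     * One NFA step for (a|b)*a(a|b){n}. State 0 loops on both letters and
     * also moves to state 1 on 'a'; states 1..n advance on either letter;
     * state n+1 accepts and has no outgoing transitions.
     * A set of NFA states is encoded as a bitmask.
     */
    static long step(long stateSet, char c, int n) {
        long next = 0;
        for (int s = 0; s <= n; s++) {
            if ((stateSet & (1L << s)) == 0) continue;
            if (s == 0) {
                next |= 1L;                     // self-loop on a and b
                if (c == 'a') next |= 1L << 1;  // this 'a' may start the tail
            } else {
                next |= 1L << (s + 1);          // advance through the tail
            }
        }
        return next;
    }

    /** Subset construction that gives up once more than maxStates sets exist. */
    static int determinize(int n, int maxStates) {
        Set<Long> seen = new HashSet<>();
        Deque<Long> work = new ArrayDeque<>();
        seen.add(1L);   // start set {0}
        work.push(1L);
        while (!work.isEmpty()) {
            long current = work.pop();
            for (char c : new char[] {'a', 'b'}) {
                long next = step(current, c, n);
                if (seen.add(next)) {
                    if (seen.size() > maxStates) {
                        throw new TooManyStatesException(maxStates);
                    }
                    work.push(next);
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int limit = 10_000; // assumed limit, for illustration only
        for (int n : new int[] {3, 10, 15}) {
            try {
                System.out.println("n=" + n + ": " + determinize(n, limit) + " DFA states");
            } catch (TooManyStatesException e) {
                System.out.println("n=" + n + ": " + e.getMessage());
            }
        }
    }
}
```

In this sketch n = 3 and n = 10 finish with 16 and 2048 states, while n = 15 would need 65536 and is rejected early. Failing fast with an exception trades completeness for a bounded cost, which is the trade-off the commit message describes.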
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196770#comment-14196770 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1636728 from [~mikemccand] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1636728 ]

LUCENE-6046: add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196947#comment-14196947 ]

ASF subversion and git services commented on LUCENE-6046:
---------------------------------------------------------

Commit 1636758 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1636758 ]

LUCENE-6046: fix test failure, add maxDeterminizedStates to AutomatonQuery and WildcardQuery too
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14196949#comment-14196949 ]

ASF subversion and git services commented on LUCENE-6046:

Commit 1636759 from [~mikemccand] in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1636759 ]
LUCENE-6046: fix test failure, add maxDeterminizedStates to AutomatonQuery and WildcardQuery too
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14196981#comment-14196981 ]

ASF subversion and git services commented on LUCENE-6046:

Commit 1636762 from [~mikemccand] in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1636762 ]
LUCENE-6046: add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize
Re: [jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
Thanks Mike!

On Nov 4, 2014 5:29 PM, ASF subversion and git services (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14196981#comment-14196981 ]

Commit 1636762 from [~mikemccand] in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1636762 ]
LUCENE-6046: add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194395#comment-14194395 ]

Michael McCandless commented on LUCENE-6046:

Hmm, two bugs here.

First off, RegExp.toAutomaton is an inherently costly method: it is wasteful of RAM and CPU, doing minimize after each recursive operation in order to build a DFA in the end. It's unfortunately quite easy to concoct regular expressions that make it consume ridiculous resources. I'll look at this example and see if we can improve it, but in the end it will always have its adversarial cases unless we give up on making the resulting automaton deterministic, which would be a very big change.

Maybe we should add adversary defenses to it, e.g. you set a limit on the number of states it's allowed to create, and it throws a RegExpTooHardException if it would exceed that?

Second off, ArrayUtil.oversize has the wrong (too large) value for MAX_ARRAY_LENGTH, which is a bug from LUCENE-5844. Which JVM did you run this test on?
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194397#comment-14194397 ]

Dawid Weiss commented on LUCENE-6046:

Just a note -- Russ Cox wrote a series of excellent articles about different approaches to implementing regexp scanners.

http://swtch.com/~rsc/regexp/regexp1.html

(There is no clear winner -- both DFAs and NFAs have advantages.)
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194400#comment-14194400 ]

Lee Hinman commented on LUCENE-6046:

[~mikemccand] I ran it with the following JVM:

{noformat}
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
{noformat}
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194406#comment-14194406 ]

Michael McCandless commented on LUCENE-6046:

bq. Russ Cox wrote a series of excellent articles about different approaches of implementing regexp scanners.

Thanks Dawid, these are great.

Switching to NFA-based matching would be a very large change ... I don't think we should pursue it here. The Terms.intersect implementation for block tree is already very complex ... though I suppose if we could hide the on-the-fly subset construction (and convert the regexp to a Thompson NFA) under an API, then the Terms.intersect implementation wouldn't have to change much.

Still, there will always be adversarial cases no matter which approach we choose. I think for this issue we should allow passing a "how much work are you willing to do" limit to RegExp.toAutomaton, and it throws an exception when it would exceed that.
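To make the "on-the-fly subset construction" idea concrete, here is a minimal sketch, not Lucene code. It assumes a hypothetical illustrative pattern of the shape `[ac]*a[ac]{n}` (a bounded repetition after a free prefix, similar in spirit to the `{50,200}` in the reported regexp) and hand-codes its NFA: state 0 loops on both letters, an 'a' may also enter state 1, and states 1..n advance on either letter until the accepting state n+1. The class and method names are made up for this example.

```java
class NfaSimulation {
    /**
     * Whole-string match against [ac]*a[ac]{n}, for 1 <= n <= 60.
     * Bit i of 'active' is set when NFA state i is currently active.
     */
    static boolean matches(String input, int n) {
        long active = 1L;                                  // start in NFA state 0 only
        long middleMask = ((1L << (n + 1)) - 1) & ~1L;     // bits for states 1..n
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch != 'a' && ch != 'c') {
                return false;                              // letter outside the alphabet
            }
            long next = (active & middleMask) << 1;        // states 1..n advance on any letter
            if ((active & 1L) != 0) {
                next |= 1L;                                // state 0 loops on both letters
                if (ch == 'a') {
                    next |= 2L;                            // an 'a' may start the counted tail
                }
            }
            active = next;
        }
        return (active & (1L << (n + 1))) != 0;            // is accepting state n+1 active?
    }

    public static void main(String[] args) {
        System.out.println(matches("caccc", 3));
        System.out.println(matches("acc", 3));
    }
}
```

The set of active NFA states is exactly the DFA state that subset construction would have built, but only the current one is ever materialized, so memory stays at one `long` here instead of a table whose size can grow exponentially in n. The trade-off is more work per input character, which is part of why this would be a large change for something like Terms.intersect.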
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194413#comment-14194413 ]

Dawid Weiss commented on LUCENE-6046:

I didn't mean to imply we should change the regexp implementation! :) This was just a pointer in case somebody wished to understand why regexps can explode.

I actually wish there was an NFA-based regexp implementation for Java (with a low memory footprint) because this would make concatenating thousands of regexps (e.g., for pattern detection) much easier.
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194412#comment-14194412 ]

Michael McCandless commented on LUCENE-6046:

bq. Michael McCandless I ran it with the following JVM:

Thanks [~dakrone].

I was wrong about the second bug: there is no bug in ArrayUtil.oversize. That exception just means RegExp is trying to create a too-big array ... so just the one bug :)
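The numbers in the exception from the issue description support this reading: the limit 2147483623 is Integer.MAX_VALUE (2147483647) minus 24, and the requested size 2147483624 is exactly one element past it. The sketch below shows a guarded growth function of that general kind. It is not Lucene's ArrayUtil; the class name, the method name, and the 1.5x growth factor are assumptions for illustration, and only the limit and the message wording are taken from the quoted exception.

```java
class BoundedGrow {
    // Limit quoted in the exception message; equals Integer.MAX_VALUE - 24.
    static final int MAX_ARRAY_LENGTH = 2147483623;

    /**
     * Returns a capacity of roughly 1.5x minSize, capped at MAX_ARRAY_LENGTH.
     * Throws if even minSize itself cannot fit in a single array.
     */
    static int grownSize(long minSize) {
        if (minSize > MAX_ARRAY_LENGTH) {
            throw new IllegalArgumentException("requested array size " + minSize
                + " exceeds maximum array in java (" + MAX_ARRAY_LENGTH + ")");
        }
        long grown = minSize + (minSize >> 1);   // long arithmetic so 1.5x cannot overflow int
        return (int) Math.min(grown, MAX_ARRAY_LENGTH);
    }

    public static void main(String[] args) {
        System.out.println(grownSize(100));
        System.out.println(Integer.MAX_VALUE - MAX_ARRAY_LENGTH);
    }
}
```

Under these assumptions the check is doing its job when it throws: the caller really did ask for more transitions than one Java array can hold, which points back at determinization as the place where the growth has to be stopped.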
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194436#comment-14194436 ]

Lee Hinman commented on LUCENE-6046:

bq. I think for this issue we should allow passing a "how much work are you willing to do" limit to RegExp.toAutomaton, and it throws an exception when it would exceed that.

For what it's worth, I think this would be a good solution for us, much better than silently (from the user's perspective) freezing and then hitting an OOME.
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194484#comment-14194484 ]

Nik Everett commented on LUCENE-6046:

I'm working on a first cut of something that does that. A better regex implementation would be great, but the biggest thing to me is being able to limit the amount of work the determinize operation performs. It's such a costly operation that I don't think it should ever really be abstracted away from the user. Something like having determinize throw a checked exception when it performs too much work would make you keenly aware whenever you might be straying into exponential territory.
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194585#comment-14194585 ]

Michael McCandless commented on LUCENE-6046:

OK I boiled down the adversarial regexp to this simpler still-adversarial version:

[ac]*a[ac]{50,200}

I suspect this is a legitimate adversary and not a bug in our RegExp/automaton impl, i.e. the number of states in the DFA for this is exponential as a function of the 50/200.
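The exponential growth can be checked on small cases. The sketch below is a simplification, not Lucene code: it replaces the `{50,200}` range with an exact count n, hand-builds the NFA for `[ac]*a[ac]{n}` as a bitmask (bit i means NFA state i is active), and counts how many distinct state sets plain subset construction reaches. Class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class SubsetBlowup {
    /** One NFA step for [ac]*a[ac]{n}; 'states' and the result are bitmasks of NFA states. */
    static long step(long states, char ch, int n) {
        long middleMask = ((1L << (n + 1)) - 1) & ~1L;  // bits for states 1..n
        long next = (states & middleMask) << 1;         // states 1..n advance on either letter
        if ((states & 1L) != 0) {
            next |= 1L;                                 // state 0 loops on both letters
            if (ch == 'a') {
                next |= 2L;                             // 'a' may also enter state 1
            }
        }
        return next;                                    // state n+1 (accepting) has no outgoing edges
    }

    /** Counts the DFA states reachable by subset construction, for 1 <= n <= 20. */
    static int countDfaStates(int n) {
        Set<Long> seen = new HashSet<>();
        Deque<Long> work = new ArrayDeque<>();
        seen.add(1L);                                   // start set = {state 0}
        work.add(1L);
        while (!work.isEmpty()) {
            long current = work.poll();
            for (char ch : new char[] {'a', 'c'}) {
                long target = step(current, ch, n);
                if (seen.add(target)) {
                    work.add(target);
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.println("n=" + n + " -> " + countDfaStates(n) + " DFA states");
        }
    }
}
```

Each reachable set records which of the last n+1 letters were 'a', so the count comes out as 2^(n+1): 16 states for n = 3 and 2048 for n = 10. Extrapolating that to n = 50 gives on the order of 10^15 states, which fits the observation that neither a 32 GB nor a 60 GB heap was enough.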
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194592#comment-14194592 ]

Nik Everett commented on LUCENE-6046:

Oh yeah, it's totally running into 2^n territory legitimately here. This is totally something that'd be rejected by a framework to prevent explosive growth during determinization.
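A state budget of the kind discussed in this thread could look roughly like the following. This is a sketch under stated assumptions, not the attached patch: the exception name `TooManyStatesException`, the class `BudgetedDeterminizer`, and the bitmask NFA encoding (`delta[s][c]` is the set of NFA states reachable from state s on symbol c) are all invented for illustration. The helper builds the NFA for `[ac]*a[ac]{n}` with an exact count n in place of the `{50,200}` range.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical checked exception; the name is illustrative only. */
class TooManyStatesException extends Exception {
    TooManyStatesException(int limit) {
        super("determinization would exceed " + limit + " states");
    }
}

class BudgetedDeterminizer {
    /** Subset construction that gives up once more than maxStates DFA states are needed. */
    static int[][] determinize(long[][] delta, int maxStates) throws TooManyStatesException {
        int symbols = delta[0].length;
        Map<Long, Integer> ids = new HashMap<>();   // NFA state set -> DFA state id
        List<int[]> rows = new ArrayList<>();       // DFA transition table under construction
        Deque<Long> work = new ArrayDeque<>();
        ids.put(1L, 0);                             // start set = {NFA state 0}
        rows.add(new int[symbols]);
        work.add(1L);
        while (!work.isEmpty()) {
            long set = work.poll();
            int from = ids.get(set);
            for (int c = 0; c < symbols; c++) {
                long target = 0L;
                for (int s = 0; s < delta.length; s++) {
                    if ((set & (1L << s)) != 0) {
                        target |= delta[s][c];
                    }
                }
                Integer to = ids.get(target);
                if (to == null) {
                    if (ids.size() >= maxStates) {
                        throw new TooManyStatesException(maxStates);  // budget exhausted, stop early
                    }
                    to = ids.size();
                    ids.put(target, to);
                    rows.add(new int[symbols]);
                    work.add(target);
                }
                rows.get(from)[c] = to;
            }
        }
        return rows.toArray(new int[0][]);
    }

    /** NFA for [ac]*a[ac]{n}; symbol 0 is 'a', symbol 1 is 'c'; state n+1 is accepting. */
    static long[][] adversarialNfa(int n) {
        long[][] delta = new long[n + 2][2];
        delta[0][0] = 0b11L;                        // on 'a': stay in 0 or move to 1
        delta[0][1] = 0b01L;                        // on 'c': stay in 0
        for (int s = 1; s <= n; s++) {
            delta[s][0] = 1L << (s + 1);
            delta[s][1] = 1L << (s + 1);
        }
        return delta;
    }
}
```

With a budget of 10,000 states, `determinize(adversarialNfa(3), 10000)` finishes with a 16-state table, while `determinize(adversarialNfa(20), 10000)` throws after creating 10,000 states instead of trying to build all of them. Making the exception checked forces every caller to decide what to do with a too-hard pattern, at the cost of a noisier API.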
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194716#comment-14194716 ]

Nik Everett commented on LUCENE-6046:

Oh - I'm still running the solr tests against this. I imagine they'll pass as they've been running fine for 30 minutes now but I should throw that out there in case someone gets them to fail with this before I do.
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195033#comment-14195033 ]

Nik Everett commented on LUCENE-6046:

Oh no! I wrote a very similar patch! Sorry to duplicate effort there.

I found that 10,000 states wasn't quite enough to handle some of the tests, so I went with 1,000,000 as the default. It's pretty darn huge, but it does get the job done.
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195047#comment-14195047 ]

Michael McCandless commented on LUCENE-6046:

Woops, sorry, I didn't see you had a patch here! Thank you.

I like your patch: it's good to make all hidden usages of determinize visible. Let's start from your patch and merge anything from mine in? E.g. I think we can collapse minimizeHopcroft into just minimize...

bq. I found that 10,000 states wasn't quite enough to handle some of the tests so I went with 1,000,000 as the default. It's pretty darn huge but it does get the job done.

Whoa, which tests needed 1M max states? I worry about passing a 1M state automaton to minimize...
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195053#comment-14195053 ]

Michael McCandless commented on LUCENE-6046:

I like the test simplifications, and removing dead code from Operations.determinize.

Can we fix the exception thrown from RegExp to include the offending regular expression, and fix the test to confirm the message contains it?

Maybe also add RegExp.toStringTree? I found it very useful while debugging the original regexp :)

I think QueryParserBase should also have a set/get for this option?
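For readers following the thread, the two requests made so far (a cap on the number of states created during determinization, and an exception message that names the offending regular expression) can be sketched roughly as below. This is an illustration only, not the code from either patch: `StateBudget`, `TooManyStatesException`, and `DEFAULT_MAX_STATES` are hypothetical names, and the 10,000 default simply echoes the number discussed above.

```java
/**
 * Hypothetical sketch: a counter that a determinizer could consult each time
 * it creates a new state. Not the Lucene API; all names are illustrative.
 */
final class StateBudget {
  /** Echoes the 10,000-state default discussed in the thread. */
  static final int DEFAULT_MAX_STATES = 10_000;

  /** Thrown when the budget is exhausted; the message names the regexp. */
  static final class TooManyStatesException extends RuntimeException {
    TooManyStatesException(String regexp, int maxStates) {
      super("Determinizing regexp \"" + regexp + "\" would create more than "
          + maxStates + " states");
    }
  }

  private final String regexp;
  private final int maxStates;
  private int used;

  StateBudget(String regexp, int maxStates) {
    this.regexp = regexp;
    this.maxStates = maxStates;
  }

  /** Records one newly created state, failing fast once the cap is reached. */
  void allocateState() {
    if (used >= maxStates) {
      throw new TooManyStatesException(regexp, maxStates);
    }
    used++;
  }

  /** Number of states recorded so far. */
  int used() {
    return used;
  }
}
```

A test in the spirit of the request above would drive the budget past its cap and check that `getMessage()` contains the original regexp string, so a user who hits the limit can see which expression caused it.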
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195056#comment-14195056 ]

Nik Everett commented on LUCENE-6046:

TestDeterminizeLexicon wants to make an automaton that accepts 5000 random strings, so 10,000 isn't enough states for it. I'll drop the default limit to 10,000 again and just feed a million to that test case.
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195065#comment-14195065 ]

Nik Everett commented on LUCENE-6046:

I'll certainly add the regexp string to the exception message. And I'll merge the toStringTree from your patch into mine if you'd like.

Yeah - QueryParserBase should have this option too.

The thing I found most useful for debugging this was to call toDot on the automata before and after normalization. I just looked at it and went, oh, of course you have to do it that way. No wonder the states explode. And then I read https://en.wikipedia.org/wiki/Powerset_construction and remembered it from my rusty CS degree.
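The state explosion mentioned above comes from the powerset (subset) construction: each deterministic state stands for a set of nondeterministic states, so the result can need exponentially many states. The sketch below is a stdlib-only illustration and does not use Lucene's Operations.determinize. It uses the textbook language "the n-th symbol from the end is a", written (a|b)*a(a|b){n-1}, rather than the regexp from this issue, and `PowersetBlowup` and its methods are hypothetical names. The regexp in the issue has a similar shape (an unbounded loop followed by a long bounded repetition), which is presumably why its determinization blows up.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical demo: counts DFA states produced by subset construction. */
final class PowersetBlowup {

  /**
   * NFA for (a|b)*a(a|b){n-1} with states 0..n:
   * state 0 loops on a and b and also moves to 1 on a;
   * state i (1 <= i < n) moves to i+1 on a or b; state n accepts.
   * A set of NFA states is encoded as a bitmask in a long.
   */
  static long step(long set, int n, boolean symbolIsA) {
    long next = 0L;
    for (int s = 0; s <= n; s++) {
      if ((set & (1L << s)) == 0) {
        continue; // state s is not in the current set
      }
      if (s == 0) {
        next |= 1L;       // self loop on either symbol
        if (symbolIsA) {
          next |= 2L;     // guess that this 'a' is n-th from the end
        }
      } else if (s < n) {
        next |= 1L << (s + 1);
      }
      // state n has no outgoing transitions
    }
    return next;
  }

  /** Explores all reachable subsets with a worklist and returns their count. */
  static int countDeterminizedStates(int n) {
    if (n < 1 || n > 20) {
      throw new IllegalArgumentException("n must be in 1..20, got " + n);
    }
    Set<Long> seen = new HashSet<>();
    Deque<Long> work = new ArrayDeque<>();
    long start = 1L; // the subset containing only NFA state 0
    seen.add(start);
    work.add(start);
    while (!work.isEmpty()) {
      long current = work.poll();
      for (boolean symbolIsA : new boolean[] {true, false}) {
        long next = step(current, n, symbolIsA);
        if (seen.add(next)) {
          work.add(next); // a subset not seen before becomes a new DFA state
        }
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    for (int n : new int[] {1, 2, 3, 10}) {
      System.out.println("n=" + n + " NFA states=" + (n + 1)
          + " DFA states=" + countDeterminizedStates(n));
    }
  }
}
```

For this family the NFA has n + 1 states while the subset construction reaches 2^n subsets (for example, 1024 for n = 10), because the deterministic automaton has to remember which of the last n symbols were a. Every increment of n doubles the count, which is the kind of growth a cap on determinized states is meant to stop early.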