[ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15778606#comment-15778606 ]
Steve Rowe commented on LUCENE-6664:
------------------------------------

Looks like the new {{FlattenGraphFilter}} is implicated in this reproducing {{TestRandomChains}} failure from Policeman Jenkins [https://builds.apache.org/job/Lucene-Solr-NightlyTests-master/1193/]:

{noformat}
[junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
[junit4] 2> TEST FAIL: useCharFilter=true text='\ud991\udc33\u0662 vb wlvvo \ufe0f\ufe04\ufe01\ufe05\ufe00\ufe07 ]u[{1,5 ntwwqlyvt \ua4ed\ua4d2\ua4ff\ua4fd\ua4ef\ua4db\ua4e3\ua4e4\ua4db\ua4e2\ua4ea jlrzerz'
[junit4] 2> Exception from random analyzer:
[junit4] 2> charfilters=
[junit4] 2> tokenizer=
[junit4] 2>   org.apache.lucene.analysis.wikipedia.WikipediaTokenizer()
[junit4] 2> filters=
[junit4] 2>   org.apache.lucene.analysis.commongrams.CommonGramsFilter(ValidatingTokenFilter@68052231 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0, [wfo, i, ecngk, lntfhzycu, f])
[junit4] 2>   org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter(ValidatingTokenFilter@4c507013 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,keyword=false)
[junit4] 2>   org.apache.lucene.analysis.synonym.FlattenGraphFilter(ValidatingTokenFilter@27c78e22 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,keyword=false)
[junit4] 2> offsetsAreCorrect=false
[junit4] 2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=740BB1C4895371B0 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master/test-data/enwiki.random.lines.txt -Dtests.locale=es-CO -Dtests.timezone=EET -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
[junit4] ERROR 14.8s J0 | TestRandomChains.testRandomChainsWithLargeStrings <<<
[junit4] > Throwable #1: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=24,endOffset=22
[junit4] >   at __randomizedtesting.SeedInfo.seed([740BB1C4895371B0:1E500ED5D01D5143]:0)
[junit4] >   at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:107)
[junit4] >   at org.apache.lucene.analysis.synonym.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:237)
[junit4] >   at org.apache.lucene.analysis.synonym.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:264)
[junit4] >   at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
[junit4] >   at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:724)
[junit4] >   at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:635)
[junit4] >   at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:533)
[junit4] >   at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:869)
[junit4] >   at java.lang.Thread.run(Thread.java:745)
[junit4] 2> NOTE: leaving temporary files on disk at: /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master/checkout/lucene/build/analysis/common/test/J0/temp/lucene.analysis.core.TestRandomChains_740BB1C4895371B0-001
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene70): {dummy=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, maxPointsInLeafNode=1442, maxMBSortInHeap=6.705070576143851, sim=RandomSimilarity(queryNorm=true): {dummy=DFR I(n)L1}, locale=es-CO, timezone=EET
[junit4] 2> NOTE: Linux 3.13.0-85-generic amd64/Oracle Corporation 1.8.0_102 (64-bit)/cpus=4,threads=1,free=136682616,total=285736960
[junit4] 2> NOTE: All tests run in this JVM: [TestRussianLightStemFilter, TestEnglishAnalyzer, EdgeNGramTokenFilterTest, TestSwedishAnalyzer, TestHindiFilters, TestHindiNormalizer, TestHungarianLightStemFilterFactory, TestPorterStemFilter, TestCondition, TestTruncateTokenFilterFactory, TestCollationKeyAnalyzer, TestSpanishAnalyzer, TestHTMLStripCharFilterFactory, TestArabicFilters, TestFactories, TestIrishLowerCaseFilterFactory, TestBrazilianAnalyzer, TestLatvianAnalyzer, TestEscaped, TestPortugueseLightStemFilter, TestElisionFilterFactory, TestHungarianAnalyzer, TestGreekLowerCaseFilterFactory, TestElision, TestCustomAnalyzer, TestTurkishLowerCaseFilterFactory, TestFullStrip, TestSpanishLightStemFilterFactory, TestNorwegianMinimalStemFilterFactory, QueryAutoStopWordAnalyzerTest, NGramTokenizerTest, WikipediaTokenizerTest, TestIndonesianStemFilterFactory, TestBasqueAnalyzer, DateRecognizerFilterFactoryTest, TestNorwegianLightStemFilter, TestFrenchMinimalStemFilterFactory, TestCommonGramsFilterFactory, TestPersianNormalizationFilterFactory, TestTwoSuffixes, TestIndonesianStemmer, TypeAsPayloadTokenFilterTest, TestFrenchLightStemFilterFactory, TestThaiAnalyzer, TestCaseSensitive, TestRandomChains]
[junit4] Completed [133/275 (1!)] on J0 in 121.18s, 2 tests, 1 error <<< FAILURES!
{noformat}

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.4
>
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch,
> LUCENE-6664.patch, LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself).  I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time syn expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and then phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new syn filter still cannot consume an arbitrary graph.
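
For readers following the issue description quoted above, here is a minimal sketch of the "enumerate all paths" idea in plain Java, with no Lucene dependency. Everything in it is an assumption made for illustration: the {{Tok}} class, the {{enumerate}} and {{sample}} methods, and the "dns" / "domain name service" synonym are hypothetical and are not the API of SynonymGraphFilter or TermAutomatonQuery. Each token records the position it starts at and its {{posLength}} (how many positions it spans); a path is a chain of tokens leading from position 0 to the end position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TokenGraphPaths {

    /** Hypothetical token: term text, start position, and number of positions spanned. */
    static final class Tok {
        final String term;
        final int from;
        final int posLength;

        Tok(String term, int from, int posLength) {
            if (from < 0 || posLength < 1) {
                // posLength >= 1 guarantees every step moves forward, so the walk terminates
                throw new IllegalArgumentException("from must be >= 0 and posLength >= 1");
            }
            this.term = term;
            this.from = from;
            this.posLength = posLength;
        }
    }

    /** Returns every token sequence that leads from position 0 to endPos. */
    static List<String> enumerate(List<Tok> toks, int endPos) {
        List<String> out = new ArrayList<>();
        walk(toks, 0, endPos, "", out);
        return out;
    }

    private static void walk(List<Tok> toks, int pos, int endPos, String prefix, List<String> out) {
        if (pos == endPos) {
            out.add(prefix);
            return;
        }
        for (Tok t : toks) {
            if (t.from == pos) {
                String next = prefix.isEmpty() ? t.term : prefix + " " + t.term;
                // Follow the token's arc: it ends posLength positions later
                walk(toks, pos + t.posLength, endPos, next, out);
            }
        }
    }

    /** Made-up graph: "dns" (spanning 3 positions) alongside "domain name service", then "lookup". */
    static List<Tok> sample() {
        return Arrays.asList(
            new Tok("dns", 0, 3),
            new Tok("domain", 0, 1),
            new Tok("name", 1, 1),
            new Tok("service", 2, 1),
            new Tok("lookup", 3, 1));
    }

    public static void main(String[] args) {
        for (String path : enumerate(sample(), 4)) {
            System.out.println(path);
        }
    }
}
```

With this made-up graph the two paths are "dns lookup" and "domain name service lookup". If the same tokens were indexed by start position only, with {{posLength}} dropped, "dns" would sit at position 0 and "lookup" at position 3, so the phrase "dns lookup" would not match while "dns name" would. That is the kind of loss the description refers to.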