[
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781557#comment-15781557
]
Steve Rowe commented on LUCENE-6664:
------------------------------------
Another TestRandomChains failure, from
[https://builds.apache.org/job/Lucene-Solr-NightlyTests-6.x/239/]:
{noformat}
Checking out Revision 9dde8a30303de4bce5da45189219dd6150252b29
(refs/remotes/origin/branch_6x)
[...]
[junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
[junit4] 2> TEST FAIL: useCharFilter=true text='ivi[q.(k--r
f\u0002\uf672o\u983c'
[junit4] 2> Exception from random analyzer:
[junit4] 2> charfilters=
[junit4] 2>
org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@261ff7a0)
[junit4] 2> tokenizer=
[junit4] 2> org.apache.lucene.analysis.wikipedia.WikipediaTokenizer()
[junit4] 2> filters=
[junit4] 2>
org.apache.lucene.analysis.StopFilter(ValidatingTokenFilter@62f3af70
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,
[hejalskyy, d, skap, nfd, nirasnsmg, hmdqqn])
[junit4] 2>
org.apache.lucene.analysis.synonym.FlattenGraphFilter(ValidatingTokenFilter@1a9001e5
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0)
[junit4] 2> offsetsAreCorrect=false
[junit4] 2> NOTE: download the large Jenkins line-docs file by running
'ant get-jenkins-line-docs' in the lucene directory.
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=127E19CE02B54D17 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-6.x/test-data/enwiki.random.lines.txt -Dtests.locale=zh-HK -Dtests.timezone=America/Virgin -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 62.4s J1 | TestRandomChains.testRandomChains <<<
[junit4] > Throwable #1: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=4,endOffset=3
[junit4] > at __randomizedtesting.SeedInfo.seed([127E19CE02B54D17:2F9F30AF45A750D7]:0)
[junit4] > at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:107)
[junit4] > at org.apache.lucene.analysis.synonym.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:237)
[junit4] > at org.apache.lucene.analysis.synonym.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:264)
[junit4] > at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
[junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:723)
[junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:634)
[junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:532)
[junit4] > at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:842)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] 2> NOTE: leaving temporary files on disk at:
/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-6.x/checkout/lucene/build/analysis/common/test/J1/temp/lucene.analysis.core.TestRandomChains_127E19CE02B54D17-001
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene62):
{dummy=PostingsFormat(name=Memory doPackFST= true)}, docValues:{},
maxPointsInLeafNode=772, maxMBSortInHeap=6.693205231616328,
sim=RandomSimilarity(queryNorm=false,coord=yes): {dummy=DFR I(F)LZ(0.3)},
locale=zh-HK, timezone=America/Virgin
[junit4] 2> NOTE: Linux 3.13.0-85-generic amd64/Oracle Corporation
1.8.0_102 (64-bit)/cpus=4,threads=1,free=177327976,total=255852544
[junit4] 2> NOTE: All tests run in this JVM: [TestPatternTokenizerFactory,
TestCircumfix, TestReverseStringFilterFactory, TestSnowball, TestIrishAnalyzer,
TestBulgarianAnalyzer, TestHomonyms, TestKeywordRepeatFilter,
TestPrefixAwareTokenFilter, CommonGramsFilterTest,
TestHyphenationCompoundWordTokenFilterFactory, TestSoraniAnalyzer,
TestGermanStemFilterFactory, TestEmptyTokenStream, TestIndicNormalizer,
TestTurkishLowerCaseFilter, TestGalicianMinimalStemFilterFactory,
TestDecimalDigitFilterFactory, TestLatvianStemmer, TestItalianLightStemFilter,
TestKeepWordFilter, TestLithuanianStemming, TestKeepFilterFactory,
TestPortugueseMinimalStemFilter, TestAnalyzers, TestAlternateCasing,
TestSoraniStemFilter, TestApostropheFilterFactory, TestDictionary,
TestCodepointCountFilterFactory, TestDanishAnalyzer, TestRomanianAnalyzer,
TestPortugueseMinimalStemFilterFactory, TestArabicNormalizationFilter,
TestLimitTokenOffsetFilterFactory, TestZeroAffix, DateRecognizerFilterTest,
TestGermanLightStemFilter, TestCJKAnalyzer, TestMorphData,
TestBulgarianStemFilterFactory, TestSynonymGraphFilter,
TestGermanNormalizationFilterFactory, TestBulgarianStemmer,
DelimitedPayloadTokenFilterTest, TestStrangeOvergeneration, TestFactories,
TestRandomChains]
[junit4] Completed [141/275 (1!)] on J1 in 119.08s, 2 tests, 1 error <<<
FAILURES!
{noformat}
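For readers unfamiliar with the check that fails here: the exception message states the invariant that the offset setter enforces, and FlattenGraphFilter violated it by passing startOffset=4, endOffset=3. The following is a minimal standalone restatement of that invariant, reconstructed only from the message text in the log. The class and method names (OffsetCheck, checkOffsets) are hypothetical and this is not Lucene's source.

```java
// Hypothetical, Lucene-free illustration of the invariant named in the
// exception message above. Not the actual PackedTokenAttributeImpl code.
final class OffsetCheck {

    // Rejects a negative start offset or an end offset that precedes the start.
    static void checkOffsets(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "startOffset must be non-negative, and endOffset must be >= startOffset, "
                    + "startOffset=" + startOffset + ",endOffset=" + endOffset);
        }
    }

    public static void main(String[] args) {
        checkOffsets(3, 4); // a valid pair passes silently
        try {
            checkOffsets(4, 3); // the pair reported in the failing run
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

An empty token (start equal to end) is accepted under this check; only an end offset strictly before the start is rejected.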
> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
> Key: LUCENE-6664
> URL: https://issues.apache.org/jira/browse/LUCENE-6664
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch,
> LUCENE-6664.patch, LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself). I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time synonym expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and then phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new synonym filter still cannot consume an arbitrary graph.
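
The "enumerate all paths" idea in the description can be sketched without any Lucene dependency. The example below is only an illustration under stated assumptions: the Token class, the TokenGraphPaths class, and the enumeratePaths method are hypothetical names, the graph is assumed to have no holes (every position increment is 0 or 1), and each token carries just a term, a position increment, and a position length. Each returned string is one phrase that a query-time expansion could turn into a separate phrase query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, Lucene-free model of an unflattened token graph.
final class TokenGraphPaths {

    // One token: text, position increment, and position length.
    static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    // Returns the phrase spelled by every path from the first position
    // to the final position of the graph.
    static List<String> enumeratePaths(List<Token> tokens) {
        int[] start = new int[tokens.size()];
        int position = -1;
        int end = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            position += t.positionIncrement; // absolute start position
            start[i] = position;
            end = Math.max(end, position + t.positionLength);
        }
        List<String> paths = new ArrayList<>();
        walk(tokens, start, 0, end, "", paths);
        return paths;
    }

    // Depth-first walk: follow every token that leaves the current position.
    // A position with no outgoing token (other than the end) yields no path.
    private static void walk(List<Token> tokens, int[] start, int node, int end,
                             String prefix, List<String> paths) {
        if (node == end) {
            paths.add(prefix);
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (start[i] == node) {
                Token t = tokens.get(i);
                String next = prefix.isEmpty() ? t.term : prefix + " " + t.term;
                walk(tokens, start, node + t.positionLength, end, next, paths);
            }
        }
    }

    public static void main(String[] args) {
        // "usa" spans four positions, alongside "united states of america".
        List<Token> graph = Arrays.asList(
            new Token("usa", 1, 4),
            new Token("united", 0, 1),
            new Token("states", 1, 1),
            new Token("of", 1, 1),
            new Token("america", 1, 1));
        for (String phrase : enumeratePaths(graph)) {
            System.out.println(phrase);
        }
    }
}
```

Because the position length is kept, "usa" is recognized as covering the same span as the four-word phrase. This is the information the description says is lost once the graph is flattened for indexing.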