[jira] [Commented] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

Steve Rowe (JIRA) Mon, 26 Dec 2016 08:29:22 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15778606#comment-15778606 ]


Steve Rowe commented on LUCENE-6664:
------------------------------------

Looks like the new {{FlattenGraphFilter}} is implicated in this reproducible 
{{TestRandomChains}} failure from Policeman Jenkins 
[https://builds.apache.org/job/Lucene-Solr-NightlyTests-master/1193/]:

{noformat}
   [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=true text='\ud991\udc33\u0662 vb 
wlvvo \ufe0f\ufe04\ufe01\ufe05\ufe00\ufe07 ]u[{1,5 ntwwqlyvt 
\ua4ed\ua4d2\ua4ff\ua4fd\ua4ef\ua4db\ua4e3\ua4e4\ua4db\ua4e2\ua4ea jlrzerz'
   [junit4]   2> Exception from random analyzer: 
   [junit4]   2> charfilters=
   [junit4]   2> tokenizer=
   [junit4]   2>   org.apache.lucene.analysis.wikipedia.WikipediaTokenizer()
   [junit4]   2> filters=
   [junit4]   2>   
org.apache.lucene.analysis.commongrams.CommonGramsFilter(ValidatingTokenFilter@68052231
 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,
 [wfo, i, ecngk, lntfhzycu, f])
   [junit4]   2>   
org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter(ValidatingTokenFilter@4c507013
 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,keyword=false)
   [junit4]   2>   
org.apache.lucene.analysis.synonym.FlattenGraphFilter(ValidatingTokenFilter@27c78e22
 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,keyword=false)
   [junit4]   2> offsetsAreCorrect=false
   [junit4]   2> NOTE: download the large Jenkins line-docs file by running 
'ant get-jenkins-line-docs' in the lucene directory.
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains 
-Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=740BB1C4895371B0 
-Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true 
-Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master/test-data/enwiki.random.lines.txt
 -Dtests.locale=es-CO -Dtests.timezone=EET -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   14.8s J0 | 
TestRandomChains.testRandomChainsWithLargeStrings <<<
   [junit4]    > Throwable #1: java.lang.IllegalArgumentException: startOffset 
must be non-negative, and endOffset must be >= startOffset, 
startOffset=24,endOffset=22
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([740BB1C4895371B0:1E500ED5D01D5143]:0)
   [junit4]    >        at 
org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:107)
   [junit4]    >        at 
org.apache.lucene.analysis.synonym.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:237)
   [junit4]    >        at 
org.apache.lucene.analysis.synonym.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:264)
   [junit4]    >        at 
org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
   [junit4]    >        at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:724)
   [junit4]    >        at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:635)
   [junit4]    >        at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:533)
   [junit4]    >        at 
org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:869)
   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: leaving temporary files on disk at: 
/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master/checkout/lucene/build/analysis/common/test/J0/temp/lucene.analysis.core.TestRandomChains_740BB1C4895371B0-001
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
{dummy=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, 
maxPointsInLeafNode=1442, maxMBSortInHeap=6.705070576143851, 
sim=RandomSimilarity(queryNorm=true): {dummy=DFR I(n)L1}, locale=es-CO, 
timezone=EET
   [junit4]   2> NOTE: Linux 3.13.0-85-generic amd64/Oracle Corporation 
1.8.0_102 (64-bit)/cpus=4,threads=1,free=136682616,total=285736960
   [junit4]   2> NOTE: All tests run in this JVM: [TestRussianLightStemFilter, 
TestEnglishAnalyzer, EdgeNGramTokenFilterTest, TestSwedishAnalyzer, 
TestHindiFilters, TestHindiNormalizer, TestHungarianLightStemFilterFactory, 
TestPorterStemFilter, TestCondition, TestTruncateTokenFilterFactory, 
TestCollationKeyAnalyzer, TestSpanishAnalyzer, TestHTMLStripCharFilterFactory, 
TestArabicFilters, TestFactories, TestIrishLowerCaseFilterFactory, 
TestBrazilianAnalyzer, TestLatvianAnalyzer, TestEscaped, 
TestPortugueseLightStemFilter, TestElisionFilterFactory, TestHungarianAnalyzer, 
TestGreekLowerCaseFilterFactory, TestElision, TestCustomAnalyzer, 
TestTurkishLowerCaseFilterFactory, TestFullStrip, 
TestSpanishLightStemFilterFactory, TestNorwegianMinimalStemFilterFactory, 
QueryAutoStopWordAnalyzerTest, NGramTokenizerTest, WikipediaTokenizerTest, 
TestIndonesianStemFilterFactory, TestBasqueAnalyzer, 
DateRecognizerFilterFactoryTest, TestNorwegianLightStemFilter, 
TestFrenchMinimalStemFilterFactory, TestCommonGramsFilterFactory, 
TestPersianNormalizationFilterFactory, TestTwoSuffixes, TestIndonesianStemmer, 
TypeAsPayloadTokenFilterTest, TestFrenchLightStemFilterFactory, 
TestThaiAnalyzer, TestCaseSensitive, TestRandomChains]
   [junit4] Completed [133/275 (1!)] on J0 in 121.18s, 2 tests, 1 error <<< 
FAILURES!
{noformat}
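
For anyone skimming the trace: the exception is the offset sanity check firing inside {{setOffset}}, reached from {{FlattenGraphFilter.releaseBufferedToken}}. Below is a minimal stand-alone sketch of that invariant, not the Lucene source; {{OffsetCheck}} and {{checkOffsets}} are made-up names used only for illustration. It shows that the pair reported above (startOffset=24, endOffset=22) is rejected while the swapped pair is accepted.

```java
// Hypothetical stand-in for the offset validation that threw in the trace above.
// This is an illustration of the invariant, not code copied from Lucene.
final class OffsetCheck {

    /** Throws unless 0 <= startOffset <= endOffset. */
    static void checkOffsets(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "startOffset must be non-negative, and endOffset must be >= startOffset, "
                    + "startOffset=" + startOffset + ",endOffset=" + endOffset);
        }
    }

    public static void main(String[] args) {
        checkOffsets(22, 24); // a well-formed pair passes silently
        try {
            checkOffsets(24, 22); // the pair reported in the failure
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

So the filter appears to be emitting a token whose end offset precedes its start offset for this random chain; the sketch says nothing about why that happens.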

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.4
>
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, 
> LUCENE-6664.patch, LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself).  I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time synonym expansion is necessarily a "lossy" graph
> transformation when multi-token (input or output) synonyms are applied,
> because the index does not store {{posLength}}, so there will always be
> phrase queries that should match but do not, and phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new syn filter still cannot consume an arbitrary graph.
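
The "enumerate all paths" idea in the description above can be sketched without Lucene. This is only an illustration, under the assumption that each token is an arc from its start position to start + {{posLength}}. {{TokenGraphPaths}}, {{Token}} and {{allPaths}} are hypothetical names, and the "dns" synonym is a made-up example, not one taken from the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: enumerates every path through a small token graph.
// None of these types are Lucene classes.
final class TokenGraphPaths {

    /** One arc of the graph: `term` leads from node `from` to node `from + posLength`. */
    static final class Token {
        final String term;
        final int from;
        final int posLength; // assumed to be >= 1, so the walk always advances

        Token(String term, int from, int posLength) {
            this.term = term;
            this.from = from;
            this.posLength = posLength;
        }
    }

    /** Returns every path from node 0 to `endNode`, each as a space-joined phrase. */
    static List<String> allPaths(List<Token> tokens, int endNode) {
        List<String> out = new ArrayList<>();
        walk(tokens, 0, endNode, new ArrayList<>(), out);
        return out;
    }

    // Depth-first walk: follow each arc leaving `node`, then backtrack.
    private static void walk(List<Token> tokens, int node, int endNode,
                             List<String> prefix, List<String> out) {
        if (node == endNode) {
            out.add(String.join(" ", prefix));
            return;
        }
        for (Token t : tokens) {
            if (t.from == node) {
                prefix.add(t.term);
                walk(tokens, node + t.posLength, endNode, prefix, out);
                prefix.remove(prefix.size() - 1);
            }
        }
    }

    public static void main(String[] args) {
        // "dns" spans three positions, side by side with "domain name service".
        List<Token> graph = Arrays.asList(
            new Token("dns", 0, 3),
            new Token("domain", 0, 1),
            new Token("name", 1, 1),
            new Token("service", 2, 1),
            new Token("works", 3, 1));
        for (String phrase : allPaths(graph, 4)) {
            System.out.println(phrase);
        }
    }
}
```

Here {{main}} prints "dns works" and then "domain name service works". In the approach described above, each such phrase would presumably become one PhraseQuery in the union. Dropping {{posLength}}, as happens at index time, would leave "dns" looking like a one-position token, which is the lossiness the description refers to.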



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
