Re: synonym question
Hello,

just a guess: have you tried escaping the space in your multi-word terms with a backslash?

    isoweek,iso\ week

Regards
Bernd

On 14.03.22 at 15:54, Trevor Nicholls wrote:
> I have technical data which I am querying with Lucene; one of the
> features of the content is that a large number of technical terms may
> be written as multiple words or as a compound word. For example,
> ISOWEEK or ISO WEEK. Or SynonymFilter or synonym filter.
>
> I have a synonym table which includes all of these phrases, thus
> isoweek=iso week, isoyear=iso year, etc.
>
> My understanding is that including the synonyms (with a SynonymFilter
> in my analyzer) at index time means that I shouldn't have to include
> the synonym filter in the query analyzer, because if any of the
> synonyms appear in a query they will match records containing any of
> the synonymous terms, as all values are indexed for any one of them.
>
> Checking with Luke, this appears to be the case; however, the queries
> are not matching all the records I expect them to, so I am taking a
> deeper look.
>
> In the indexing phase, input text is tokenised on whitespace and
> punctuation, lowercased, and then processed by a synonym filter. The
> relevant part of the analyzer is this:
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName) {
>         WhitespaceTokenizer src = new WhitespaceTokenizer();
>         // Lowercase first, then apply the custom TechTokenFilter.
>         TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
>         // Third argument is ignoreCase.
>         result = new SynonymGraphFilter(result,
>                 getSynonyms(options.getSynonymsList()), Boolean.TRUE);
>         // Index-time graphs must be flattened before they are written.
>         result = new FlattenGraphFilter(result);
>         return new TokenStreamComponents(src, result);
>     }
>
> The getSynonyms method builds a synonym map from a comma-delimited
> text file, and I know this is working because all the one-word synonym
> replacements index and search perfectly. The problem I have is with
> synonym phrases.
>
> So if the synonyms input file contains
>
>     isoweek,isodate
>
> then (using Luke) I can see that any document containing either
> 'isoweek' or 'isodate' has indexed both terms, and a search with
> either term returns matching results for both. Great.
>
> However, if the input file contains
>
>     isoweek,iso week
>
> then (again using Luke) I can see that while any document containing
> 'isoweek' has indexed the terms 'isoweek', 'iso' and 'week',
> unfortunately any document containing 'iso week' has only indexed
> 'iso' and 'week'.
>
> Am I chasing the impossible here? Is there something I can do in the
> query analyzer to make it work? (Currently the query analyzer is the
> same as the indexing analyzer with the SynonymGraphFilter and
> FlattenGraphFilter omitted.) Or do I have to manually pre-process any
> query to include OR options for all phrase synonyms?
>
> I haven't produced a small test case for this because I'm hoping a
> high-level discussion is all I need to put me on the right track.
>
> cheers
> T

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
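
On the last option raised in the question (manually pre-processing a query to include OR options for phrase synonyms), here is a minimal sketch of what that could look like in plain Java, without any Lucene dependency. Everything in it is an assumption for illustration: the class name PhraseSynonymQueryExpander, the methods addGroup and expand, and the choice to emit classic query-parser syntax such as (isoweek OR "iso week"). It lowercases the whole query, so any operators the user typed would be lowercased too; a real implementation would need to handle that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not part of Lucene): ORs phrase synonyms into a query string. */
final class PhraseSynonymQueryExpander {

    // Each group holds the normalised entries of one synonyms-file line.
    private final List<List<String>> groups = new ArrayList<>();

    /** Registers one comma-delimited line such as "isoweek,iso week". */
    void addGroup(String line) {
        List<String> group = new ArrayList<>();
        for (String entry : line.split(",")) {
            String cleaned = entry.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
            if (!cleaned.isEmpty()) {
                group.add(cleaned);
            }
        }
        if (group.size() > 1) {
            groups.add(group);
        }
    }

    /** Replaces every synonym occurrence in the query with an OR clause over its group. */
    String expand(String query) {
        String[] words = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < words.length) {
            List<String> bestGroup = null;
            int bestLen = 0;
            // Prefer the longest synonym entry that matches at position i.
            for (List<String> group : groups) {
                for (String entry : group) {
                    String[] entryWords = entry.split(" ");
                    if (entryWords.length > bestLen && matchesAt(words, i, entryWords)) {
                        bestGroup = group;
                        bestLen = entryWords.length;
                    }
                }
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            if (bestGroup == null) {
                out.append(words[i]);
                i++;
            } else {
                out.append(toClause(bestGroup));
                i += bestLen;
            }
        }
        return out.toString();
    }

    // True if entryWords occurs in words starting exactly at index start.
    private static boolean matchesAt(String[] words, int start, String[] entryWords) {
        if (start + entryWords.length > words.length) {
            return false;
        }
        for (int k = 0; k < entryWords.length; k++) {
            if (!words[start + k].equals(entryWords[k])) {
                return false;
            }
        }
        return true;
    }

    // Builds "(a OR "b c")", quoting multi-word entries as phrases.
    private static String toClause(List<String> group) {
        StringBuilder sb = new StringBuilder("(");
        for (int k = 0; k < group.size(); k++) {
            if (k > 0) {
                sb.append(" OR ");
            }
            String entry = group.get(k);
            sb.append(entry.contains(" ") ? "\"" + entry + "\"" : entry);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        PhraseSynonymQueryExpander expander = new PhraseSynonymQueryExpander();
        expander.addGroup("isoweek,iso week");
        System.out.println(expander.expand("ISO WEEK number"));
    }
}
```

With the single group isoweek,iso week registered, the main method prints (isoweek OR "iso week") number. Whether this is preferable to fixing the index-time expansion depends on how many phrase synonyms there are; it is shown only to make the trade-off concrete.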
RE: synonym question
Hi, thanks for such a quick response!

No, I hadn't thought of that. In how many of the following would I need to do this?

- synonym map creation
- analyzing text for indexing
- analyzing text for querying

If either of the latter two, then I can see lots of complications ensuing; it more or less makes a synonym map redundant if I have to manually parse the text and identify all the potential synonyms in advance.

I may be missing something critical, of course.

cheers
T

-----Original Message-----
From: Bernd Fehling
Sent: Tuesday, 15 March 2022 04:16
To: java-user@lucene.apache.org
Subject: Re: synonym question

Hello,

just a guess: have you tried escaping the space in your multi-word terms with a backslash?

    isoweek,iso\ week

Regards
Bernd
RE: synonym question
Just to confirm: escaping the spaces in synonym table construction, in query construction, or in both does not solve the problem.

-----Original Message-----
From: Trevor Nicholls
Sent: Tuesday, 15 March 2022 05:02
To: java-user@lucene.apache.org
Subject: RE: synonym question

Hi, thanks for such a quick response!

No, I hadn't thought of that. In how many of the following would I need to do this?

- synonym map creation
- analyzing text for indexing
- analyzing text for querying

If either of the latter two, then I can see lots of complications ensuing; it more or less makes a synonym map redundant if I have to manually parse the text and identify all the potential synonyms in advance.

I may be missing something critical, of course.

cheers
T
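
Since the thread has no small test case yet, the following plain-Java sketch may help pin down what the index is expected to contain when every synonym is expanded in both directions, so the result can be compared against what Luke shows. It is a toy model and an assumption throughout: the class ExpectedIndexTerms and its methods are hypothetical, it does not use Lucene, and it ignores token positions entirely, looking only at which terms should be present.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy model (not Lucene): terms expected after two-way synonym expansion. */
final class ExpectedIndexTerms {

    /** Returns the sorted set of terms expected in the index for the given text. */
    static Set<String> expectedTerms(String text, List<String> synonymLines) {
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        Set<String> terms = new TreeSet<>(Arrays.asList(words));
        for (String line : synonymLines) {
            String[] entries = line.toLowerCase(Locale.ROOT).split(",");
            // If any entry of the line occurs in the text, every entry's words are expected.
            boolean present = false;
            for (String entry : entries) {
                if (containsPhrase(words, entry.trim().split("\\s+"))) {
                    present = true;
                    break;
                }
            }
            if (present) {
                for (String entry : entries) {
                    terms.addAll(Arrays.asList(entry.trim().split("\\s+")));
                }
            }
        }
        return terms;
    }

    // True if phrase occurs as a contiguous run of words.
    private static boolean containsPhrase(String[] words, String[] phrase) {
        for (int start = 0; start + phrase.length <= words.length; start++) {
            boolean match = true;
            for (int k = 0; k < phrase.length; k++) {
                if (!words[start + k].equals(phrase[k])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> synonyms = Arrays.asList("isoweek,iso week");
        System.out.println(expectedTerms("isoweek", synonyms));
        System.out.println(expectedTerms("See ISO WEEK", synonyms));
    }
}
```

Under this model a document containing 'iso week' is expected to carry 'isoweek' as well, which is the term reported missing in the index; a document containing only 'isoweek' is expected to carry 'iso' and 'week', which matches what was observed.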