Preserve Original Option In Stemming (EnglishMinimalStemFilterFactory).
Hi,

I am working with Lucene 5.2 and trying to index some documents. I am using EnglishMinimalStemFilterFactory and found that there is no option for keeping the original text as well as the analyzed term in the Lucene index. WordDelimiterFilterFactory provides a preserveOriginal option to do this. Can anyone tell me why this option is not provided for stemming? For example, if I want to store both *Methods* and *Method* in my index, I think there is no option available in Lucene to do this.

I also noticed that if we place EnglishMinimalStemFilterFactory after WordDelimiterFilterFactory with the option preserveOriginal="1", then it stores both *Methods* and *Method*.

--
View this message in context: http://lucene.472066.n3.nabble.com/Preserve-Original-Option-In-Stemming-EnglishMinimalStemFilterFactory-tp4225116.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Preserve Original Option In Stemming (EnglishMinimalStemFilterFactory).
On Tue, Aug 25, 2015 at 2:12 PM, Vishnu Mishra wrote:
> Can anyone tell me why this option is not provided for Stemming.

I am not sure about the reason, but the original token can also be preserved by using <filter class="solr.KeywordRepeatFilterFactory"/>. To avoid duplicate tokens in the document, <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> can be used at the end of the analysis chain.

Hope this helps.

Regards,
Modassar
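A minimal sketch of how the chain suggested in this reply might look in a Solr schema. The field type name `text_en_keep_original`, the tokenizer, and the lowercase filter are assumptions made for illustration; only KeywordRepeatFilterFactory, EnglishMinimalStemFilterFactory, and RemoveDuplicatesTokenFilterFactory come from the thread.

```xml
<!-- Hypothetical field type; the name and the tokenizer choice are assumptions -->
<fieldType name="text_en_keep_original" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits every token twice: one copy flagged as a keyword, one copy not flagged -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <!-- Stems only the copy that is not flagged as a keyword -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <!-- Drops the second copy when stemming left the token unchanged -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, a token such as "methods" should be indexed as both "methods" and "method" at the same position, while a token the stemmer does not change is indexed only once.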
Re: Preserve Original Option In Stemming (EnglishMinimalStemFilterFactory).
On Tue, Aug 25, 2015 at 5:05 AM, Modassar Ather wrote:
> I am not sure about it but the original token can be preserved by using
> <filter class="solr.KeywordRepeatFilterFactory"/> too.

It's actually a real pain to do this right, considering all the different analysis chains.

As Modassar says, the KeywordRepeatFilterFactory is often "good enough". It'll boost the exact match, but it won't guarantee that only exact-match docs are returned.

Ideally, you'd want the option to turn unstemmed matching on and off. But to do that, you need some way to signal the analysis chain to emit only the original token at _query_ time. So say there's some rule like "when you see a $ appended to the term, it shouldn't be stemmed at query time". Now if WordDelimiterFilterFactory comes before the stemmer (as it really should), the $ is removed and the signal not to stem is lost. Oops. And any of the ReplaceFilterFactories often remove such terms.

So the "usual" answer is either to use the KeywordRepeatFilterFactory, or to use a copyField that doesn't stem and, when exact matches are required, search on that field.

Best,
Erick
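The copyField approach mentioned in this reply could be sketched roughly as follows. All field and type names (`body`, `body_exact`, `text_en_stemmed`, `text_en_exact`) are hypothetical, and the tokenizer and lowercase filter are assumptions; the thread only specifies that one field is stemmed and the other is not.

```xml
<!-- Stemmed type: used for ordinary searches (names are assumptions) -->
<fieldType name="text_en_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Unstemmed type: same chain without the stemmer, for exact matches -->
<fieldType name="text_en_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body" type="text_en_stemmed" indexed="true" stored="true"/>
<field name="body_exact" type="text_en_exact" indexed="true" stored="false"/>

<!-- copyField copies the raw input text, so body_exact is analyzed independently -->
<copyField source="body" dest="body_exact"/>
```

A query that needs exact matches would then target `body_exact`, while ordinary queries keep using `body`.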
RE: Preserve Original Option In Stemming (EnglishMinimalStemFilterFactory).
Hi,

> So the "usual" answer is either to use the KeywordRepeatFilterFactory, or
> use a copyField that doesn't stem and when exact matches are required,
> search on that field.

Or even better, search on both fields (stemmed and unstemmed; I generally also have an ASCII-folded one) with SHOULD clauses. An exact match gets a higher score because it hits both clauses (the stemmed and the unstemmed field), while a stem-only match automatically gets a lower score because only one Boolean clause matches.

Best,
Uwe
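Uwe describes this in terms of Lucene Boolean clauses. For a Solr setup, one way to approximate the same effect (an assumption, not something stated in the thread) is an edismax handler that queries both fields and sums their scores. The handler name and the field names `body` and `body_exact` are hypothetical.

```xml
<!-- Hypothetical handler; field names are assumptions for illustration -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Search the stemmed and the unstemmed field together -->
    <str name="qf">body body_exact</str>
    <!-- tie=1.0 adds the scores of all matching fields instead of taking only the best one -->
    <float name="tie">1.0</float>
  </lst>
</requestHandler>
```

Under these assumptions, a document containing the exact term matches in both fields and accumulates both scores, while a stem-only match scores from the stemmed field alone.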
Lucene indexes reverting to past state
*Summary:* Lucene indexes appear to revert to some past state after an application restart.

*Background:* We're running an enterprise application written in Java/Spring/Hibernate, deployed within Jetty, with a Postgres backend. See below for version info. We use Lucene to index certain components of the database to enable fast/complex searching.

The indexes are built by querying the relevant database tables, transferring the data to Lucene documents and writing to disk. An IndexWriter is used to add and commit the documents. A commit is performed at the end of a batch of database reads (generally 5,000). The reading and writing of batches is multi-threaded.

The writer is configured with the following TieredMergePolicy attributes:
segmentsPerTier=50.0
maxMergeAtOnce=5
maxMergedSegmentMB=100.0

No merge scheduler is set. The writer has its RAMBufferSizeMB set to 48.

There are 23 separate indexes used to represent different logical components of the database. The largest index on disk is 13.7G. The largest index by number of documents contains around 32 million documents.

Once the indexes are built they are maintained dynamically by the application to reflect the current state of the database. Dynamic updates are performed by a TrackingIndexWriter.

*Problem:* After a reindex is run (as described above, a destructive process) the application runs okay and all Lucene queries return expected values that reflect the current state of the database. Subsequent usage of the system maintains the indexes in the correct state, as evidenced by search results.

In the last month we have found that after a restart of the application the indexes appear to revert to some unknown past state. The indexes can be queried okay (they're not corrupt, there are no logged errors or stack traces) but the data is either out of date (reflecting a past state of the database entries they represent) or missing. We first assumed the "past state" was based on the last reindex time, but have subsequently found that restarting the application immediately following a reindex still puts the indexes in a state that pre-dates the time of the last reindex.

This is only occurring on a single site (our largest production site), and has only started in recent months. We have yet to reproduce the problem using an identical process with an identical configuration on near-identical data. We are not sure if the problem affects all of the indexes, but we know the larger (and most important) indexes are affected.

*Question:* We are inclined to think that the problem is somewhere in our code, but are wondering if any of the described symptoms have been seen before by the Lucene community. Suggestions on how to isolate the problem, or configuration changes that may help, are also most welcome.

*Version Info:*

Lucene:
lucene-analyzers-common-4.9.1.jar
lucene-core-4.9.1.jar
lucene-grouping-4.9.1.jar
lucene-join-4.9.1.jar
lucene-misc-4.9.1.jar
lucene-queries-4.9.1.jar
lucene-queryparser-4.9.1.jar
lucene-sandbox-4.9.1.jar
lucene-snowball-2.4.1.jar
lucene-suggest-4.9.1.jar

Postgres:
server: PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit
client access: postgresql-9.1-901.jdbc4.jar

OS:
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch
Red Hat Enterprise Linux Server release 6.5 (Santiago)

Java:
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

Jetty:
jetty-6.1.22.jar

Hibernate:
hibernate-commons-annotations-4.0.2.Final.jar
hibernate-core-4.2.2.Final.jar
hibernate-ehcache-4.2.2.Final.jar
hibernate-jpa-2.0-api-1.0.1.Final.jar

Spring:
spring-aop-4.0.4.RELEASE.jar
spring-aspects-4.0.4.RELEASE.jar
spring-beans-4.0.4.RELEASE.jar
spring-context-4.0.4.RELEASE.jar
spring-context-support-4.0.4.RELEASE.jar
spring-core-4.0.4.RELEASE.jar
spring-expression-4.0.4.RELEASE.jar
spring-instrument-4.0.4.RELEASE.jar
spring-jdbc-4.0.4.RELEASE.jar
spring-jms-4.0.4.RELEASE.jar
spring-orm-4.0.4.RELEASE.jar