delete/reset the index
hi all

i would like to delete the index, to allow reindexing from scratch.
is there a way to delete all entries in an index?

any hint is much appreciated.

simon
Re: delete/reset the index
Delete the index directory in the file system - I think this is the simplest!!!
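In code, the suggestion amounts to something like this - a minimal sketch, assuming the index lives in ./index and no IndexReader or IndexWriter currently has it open:

    import java.io.File;

    public class WipeIndex {
        public static void main(String[] args) {
            File indexDir = new File("index");   // assumed location
            File[] files = indexDir.listFiles();
            if (files != null) {
                for (File f : files) {
                    f.delete();                  // remove every index file
                }
            }
        }
    }

As a reply further down this digest notes, this can fail on Windows if a reader still has the files open.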
Re: Similarity percentage between two Strings
Googling for "java string similarity" throws up some stuff you might find useful. -- Ian. On Wed, Sep 3, 2008 at 11:58 PM, Thiago Moreira <[EMAIL PROTECTED]> wrote: > > Well, the similar definition that I'm looking for is the number 2, maybe > the number 3, but to start the number 2 is enough. If you guys think that is > not a Lucene problem what else tool can I use to implement this > requirement?? > > Thanks > > Thiago Moreira > Software Engineer > [EMAIL PROTECTED] > Liferay, Inc. > Enterprise. Open Source. For Life. > > > N. Hira wrote: > > I don't know how much of this is a Lucene problem, but -- as I'm sure you > will inevitably hear from others on the list -- it depends on what your > definition of "similar" is. > > By similar, do you mean: > 1. Identical, except for variations in case (upper/lower) > 2. Allow 1., but also allow prefixes/suffixes (e.g., "FW: " or "... > (summary") > 3. Allow 1., 2. and permit some new terms ... how many? > 4. Allow all of the above and allow some changes to terms using stemming > (E.g., "Google releases Chrome" is similar to "Google announces the release > of its new Chrome web browser") > > > I'm sure you see where this is going. So ... how do you define similar? > > Good luck! > > -h > -- > Hira, N.R. > Cognocys, Inc. > > On 03-Sep-2008, at 2:52 PM, Thiago Moreira wrote: > > > Hey all, > > I want to know how much two Strings are similar! The thing is: I'm > processing an email box and I want to group all messages that have the > subject similar, makes sense?? I looked on the documentation but I didn't > find how to accomplish this. It's not necessary add the messages or the > subjects on some kind of index. I'm using 2.3.2 version of Lucene. > > Anyone has some idea? > > Thanks in advance. > -- > Thiago Moreira > Software Engineer > [EMAIL PROTECTED] > Liferay, Inc. > Enterprise. Open Source. For Life. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Pre-filtering for expensive query
Grant Ingersoll wrote:
> On Aug 30, 2008, at 3:14 PM, Andrzej Bialecki wrote:
>> I think you can use a FilteredQuery in a BooleanClause. This may be
>> faster than the filtering code in the Searcher, because the evaluation
>> is done during scoring and not afterwards.
>
> FYI, not sure if this is exactly what you are talking about Andrzej, but
> IndexSearcher no longer filters after scoring. This was changed in
> https://issues.apache.org/jira/browse/LUCENE-584

Ah, indeed - I was working with the 2.3.0 release ... so there should be no
visible performance difference when using the trunk version of
IndexSearcher. The only difference now between the IndexSearcher method and
ConjunctionScorer would be when the supplied filter matches many documents.
IndexSearcher always runs skipTo on the filter first, so potentially it
would stop at many docIds that don't match in the scorer - whereas the
ConjunctionScorer tries to order its sub-scorers so that "sparse" scorers
are checked first.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
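In code, the construction being discussed looks roughly like this - a sketch against the 2.3-era API, where the cheap filter, its field name and the caching wrapper are placeholders of my choosing:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.FilteredQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;

    public class PreFilteredQuery {
        // Wrap the expensive query with a cheap, cacheable pre-filter and
        // use the result as an ordinary BooleanQuery clause.
        public static Query build(Query expensiveQuery) {
            Filter cheapFilter = new CachingWrapperFilter(new QueryWrapperFilter(
                    new TermQuery(new Term("type", "article"))));
            BooleanQuery combined = new BooleanQuery();
            combined.add(new FilteredQuery(expensiveQuery, cheapFilter),
                         BooleanClause.Occur.MUST);
            return combined;
        }
    }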
QueryParser vs. BooleanQuery
Hello,

I am experiencing strange behaviour when querying the same thing via
BooleanQuery vs. via the know-it-all QueryParser class. Precisely, the index
contains the document "12,Visual C++,4.2" with the field layout
ID,name,version (thus "12" is the ID field, "Visual C++" is the name field
and "4.2" is the version field). The search string is "Visual C++" for the
name field.

The following test, using QueryParser, passes:

    public final void testUsingQueryParser()
    {
        IndexSearcher recordSearcher;
        Query q;
        QueryParser parser = new QueryParser("name", new StandardAnalyzer());
        try
        {
            q = parser.parse("name:visual +name:c++");

            Directory directory = FSDirectory.getDirectory();
            recordSearcher = new IndexSearcher(directory);

            Hits h = recordSearcher.search(q);

            assertEquals(1, h.length());
            assertEquals(12, Integer.parseInt(h.doc(0).get("ID")));
        }
        catch (Exception exn)
        {
            fail("Exception occurred.");
        }
    }

But this one, using a BooleanQuery, fails:

    public final void testUsingTermQuery()
    {
        IndexSearcher recordSearcher;
        BooleanQuery bq = new BooleanQuery();

        bq.add(new TermQuery(new Term("name", "visual")),
               BooleanClause.Occur.SHOULD);
        bq.add(new TermQuery(new Term("name", "c++")),
               BooleanClause.Occur.MUST);

        try
        {
            Directory directory = FSDirectory.getDirectory();
            recordSearcher = new IndexSearcher(directory);

            Hits h = recordSearcher.search(bq);

            assertEquals(1, h.length()); // fails, saying it expects 0 !!!
            assertEquals(12, Integer.parseInt(h.doc(0).get("ID")));
        }
        catch (Exception exn)
        {
            fail("Exception occurred.");
        }
    }

Rewriting the BooleanQuery and calling toString() yields the same String
given to QueryParser.parse() in the first test. I am using Lucene 2.3.0. Can
somebody explain the difference?
Re: How can we know if 2 lucene indexes are same?
No documents can be added to the index while it is optimizing, and
optimizing can't run while documents are being added to the index. So,
barring other errors, I think we can believe the two indexes are indeed the
same. :)

2008/9/4 Noble Paul നോബിള് नोब्ळ् <[EMAIL PROTECTED]>
> The use case is as follows.
>
> I have two indexes, one at the master and one at the slave. The user
> occasionally commits on the master and the delta is replicated every
> time. But when an optimize happens the transfer size can be really large.
> So I am thinking of doing the optimize separately on master and slave.
>
> So far, so good. But how can I really know that after the optimize the
> indexes are indeed the same, and that no documents got added in between?
>
> On Fri, Aug 29, 2008 at 3:13 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:
>> 29 aug 2008 kl. 11.35 skrev Noble Paul നോബിള് नोब्ळ्:
>>> hi,
>>> I wish to know if the contents of two indexes have the same data.
>>> Will all the files be exactly the same if I put the same set of
>>> documents in both?
>>
>> If you insert the documents in the same order with the same settings and
>> both indices are optimized, then the files ought to be identical. I'm
>> however not sure.
>>
>> The instantiated index contrib module contains a test that asserts two
>> index readers are identical. You could use this to be really sure, but
>> it is a rather long-running process for a large index:
>>
>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java
>>
>> Perhaps you should explain why you need to do this.
>>
>> karl
>
> --
> --Noble Paul
Re: QueryParser vs. BooleanQuery
Have a look at the index with Luke to see what has actually been indexed.
StandardAnalyzer may well be removing the pluses, or you may need to escape
them. And watch out for case - Visual != visual in term query land.

--
Ian.
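If you want to check the analysis in code rather than with Luke, run the text through the analyzer yourself. A minimal sketch against the 2.3-era TokenStream API; the field name is arbitrary:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            TokenStream ts = new StandardAnalyzer()
                    .tokenStream("name", new StringReader("Visual C++"));
            Token t;
            while ((t = ts.next()) != null) {
                // If StandardAnalyzer behaves as suspected above, this prints
                // "visual" and "c" - the pluses are gone and case is folded.
                System.out.println(t.termText());
            }
        }
    }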
Re: delete/reset the index
If you're on Windows, the safest way to do this in general, if there is any
possibility that readers are still using the index, is to create a new
IndexWriter with create=true. Windows does not let you remove open files;
IndexWriter will gracefully handle failed deletes by retrying them over
time...

Mike
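A minimal sketch of that approach; the path is an assumption:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class ResetIndex {
        public static void main(String[] args) throws Exception {
            // create=true replaces the existing index with a new, empty one;
            // deletes that fail (e.g. open files on Windows) are retried later.
            IndexWriter writer = new IndexWriter("/path/to/index",
                                                 new StandardAnalyzer(), true);
            writer.close();
        }
    }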
Re: How can we know if 2 lucene indexes are same?
Actually, as of 2.3, this is no longer true: merges and optimizing run in
the background, and allow add/update/delete documents to run at the same
time.

I think it's probably best to use application logic (outside of Lucene) to
keep track of what updates happened to the master while the slave was
optimizing.

Mike
Re: getTimestamp method in IndexCommit
Noble Paul നോബിള് नोब्ळ् wrote:
> On Wed, Sep 3, 2008 at 2:06 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>> Noble Paul നോബിള് नोब्ळ् wrote:
>>> On Tue, Sep 2, 2008 at 1:56 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>>>> Are you thinking this would just fall back to Directory.fileModified
>>>> on the segments_N file for that commit?
>>>>
>>>> You could actually do that without any API change, because IndexCommit
>>>> exposes a getSegmentsFileName().
>>> If it is a RAMDirectory, how can we get the last modified time?
>> RAMDirectory will report the System.currentTimeMillis() when the file
>> was last changed. Is that not sufficient?
>>> Isn't it a lot of overhead to read the file modified time every time
>>> the timestamp is to be obtained?
>> I would think this method does not need to be super fast -- how
>> frequently are you planning to call it?
> Only during an onCommit() or an onInit(). So if the commit point is
> passed over multiple times it would call this as many times. Not a big
> deal in terms of performance. But it is still some 3-4 lines of code
> which could very well be added to the API and exposed as a method
> getTimestamp().

OK I'll commit this -- it's trivial. It's simply a convenience for calling
Directory.fileModified.

Note that the segments_N file has no other means of extracting a timestamp
for itself; it does not store a timestamp internally or anything.

Mike
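Those 3-4 lines look roughly like this - a sketch of the fallback being discussed, assuming the trunk IndexCommit exposes getDirectory() alongside getSegmentsFileName():

    import java.io.IOException;
    import org.apache.lucene.index.IndexCommit;

    class CommitTimestamp {
        // Modification time of the commit's segments_N file, in milliseconds.
        static long getTimestamp(IndexCommit commit) throws IOException {
            return commit.getDirectory().fileModified(commit.getSegmentsFileName());
        }
    }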
Re: getTimestamp method in IndexCommit
YOU ARE FAST, thanks.

--Noble
Re: getTimestamp method in IndexCommit
Thanks for raising it! It's through requests like this that Lucene's API
improves.

Mike
Re: delete/reset the index
Agreed with Michael McCandless!! That way, it is handled gracefully.
string similarity measures
Hello,

This came up before, but - if we were to make a swear word filter, string
edit distances are no good. For example, words like `shot` get confused
with `shit`. There is also a problem with words like hitchcock. Apparently
I need something like Soundex or Double Metaphone. The thing is - these are
language specific, and I am not operating in English.

I need a fuzzy curse word filter for Turkish, simply put.

Best regards,
-C.B.
Re: Realtime Search for Social Networks Collaboration
Hello Jason,

I have been trying to do this for a long time on my own. Keep up the good
work.

What I tried was a document cache using Apache Collections, and before an
index write/delete I would sync the cache with the index.

I am waiting for Lucene 2.4 to proceed (delete by query).

Best.

On Wed, Sep 3, 2008 at 10:20 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I don't mean this to sound like a solicitation. I've been working on
> realtime search and created some Lucene patches etc. I am wondering if
> there are social networks (or anyone else) out there who would be
> interested in collaborating with Apache on realtime search to get it to
> the point it can be used in production. It is a challenging problem that
> only Google has solved and made scale. I've been working on the problem
> for a while and though a lot has been completed, there is still a lot
> more to do, and collaboration amongst the most probable users (social
> networks) seems like a good thing to try at this point. I guess I'm
> saying it seems like a hard enough problem that perhaps it's best to
> work together on it rather than each company trying to complete their
> own. However I could be wrong.
>
> Realtime search benefits social networks by providing a scalable,
> searchable alternative to large MySQL implementations. MySQL, I have
> heard, is difficult to scale at a certain point. Apparently Google has
> created things like BigTable (a large database) and an online service
> called GData (for which Google has not published any whitepapers on the
> underlying technology) to address scaling large database systems.
> BigTable does not offer search. GData does, and is used by all of
> Google's web services instead of something like MySQL (this is at least
> how I understand it). Social networks usually grow, and so scaling is
> continually an issue. It is possible to build a realtime search system
> that scales linearly, something that I have heard becomes difficult
> with MySQL. There is an article that discusses some of these issues:
> http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337
> I don't think the current GData implementation is perfect and there is
> a lot that can be improved on. It might be helpful to figure out
> together what helpful things can be added.
>
> If this sounds like something of interest to anyone, feel free to send
> your input.
>
> Take care,
> Jason
Re: string similarity measures
4 sep 2008 kl. 14.38 skrev Cam Bazz:
> if we were to make a swear word filter, string edit distances are no good.

You probably need to make a large list of words. I would try to learn from
the users that do swear, perhaps even trust my users to report each other.
I would probably also look at storing the context in which a word is used,
perhaps by adding the surrounding words (ngrams, shingles, markov chains).
Compare "go to hell" and "when hell freezes over": the first is rather
derogatory while the second doesn't have to be bad at all.

I'm thinking Hidden Markov Models and Neural Networks.

karl
Re: Similarity percentage between two Strings
I would create 1-5 ngram sized shingles and measure the distance using the
Tanimoto coefficient. That would probably work out just fine. You might
want to add more weight the greater the size of the shingle.

There are shingle filters in lucene/java/contrib/analyzers and there is a
Tanimoto distance in lucene/mahout/. Feel free to report back on how well
it works.

karl

4 sep 2008 kl. 00.58 skrev Thiago Moreira:
> Well, the definition of similar that I'm looking for is number 2, maybe
> number 3, but number 2 is enough to start with. ...
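The measure Karl describes can also be sketched in plain Java without any Lucene machinery - character n-gram sets for n = 1..5, compared with Tanimoto = |A ∩ B| / (|A| + |B| - |A ∩ B|). The unweighted variant, as a starting point:

    import java.util.HashSet;
    import java.util.Set;

    public class ShingleSimilarity {
        // All character n-grams of s, for n = 1..5.
        static Set<String> shingles(String s) {
            Set<String> grams = new HashSet<String>();
            for (int n = 1; n <= 5; n++) {
                for (int i = 0; i + n <= s.length(); i++) {
                    grams.add(s.substring(i, i + n));
                }
            }
            return grams;
        }

        // Tanimoto coefficient of the two shingle sets, in [0, 1].
        static double tanimoto(String a, String b) {
            Set<String> sa = shingles(a.toLowerCase());
            Set<String> sb = shingles(b.toLowerCase());
            Set<String> common = new HashSet<String>(sa);
            common.retainAll(sb);
            int c = common.size();
            return (double) c / (sa.size() + sb.size() - c);
        }

        public static void main(String[] args) {
            // Similar subjects score high, unrelated ones low.
            System.out.println(tanimoto("FW: air balloon", "air balloon"));
            System.out.println(tanimoto("air balloon", "quarterly report"));
        }
    }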
Re: How can we know if 2 lucene indexes are same?
I don't agree with Michael McCandless. :)

I know that after 2.3, add and delete can run in one IndexWriter at one
time, and also lucene has an update method which deletes documents by term
and then adds the new document.

In my test I get either a LockObtainFailedException, with the thread sleep
statement:

org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
[EMAIL PROTECTED]:\index\write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:85)
    at org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:298)
    at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
    at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
    at org.test.IndexThread.run(IndexThread.java:33)

or a StaleReaderException, without the thread sleep statement:

org.apache.lucene.index.StaleReaderException: IndexReader out of date and
no longer valid for delete, undelete, or setNorm operations
    at org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:308)
    at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
    at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
    at org.test.IndexThread.run(IndexThread.java:31)

My test code:

    public class Main {
        public static void main(String[] args) throws IOException {
            Directory directory = FSDirectory.getDirectory("e:/index");
            IndexWriter writer = new IndexWriter(directory, null, false);
            Document document = new Document();
            document.add(new Field("bbb", "bbb", Store.YES, Index.UN_TOKENIZED));
            writer.addDocument(document);

            Thread t = new IndexThread();
            t.start();

            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

            writer.optimize();
            writer.close();
            System.out.println("out");
        }
    }

    public class IndexThread extends Thread {
        @Override
        public void run() {
            Directory directory;
            try {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }

                directory = FSDirectory.getDirectory("e:/index");
                System.out.println("thread begin");
                IndexReader reader = IndexReader.open(directory);
                Term term = new Term("bbb", "bbb");
                reader.deleteDocuments(term);
                reader.close();
                System.out.println("thread end");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

2008/9/4, Michael McCandless <[EMAIL PROTECTED]>:
> Actually, as of 2.3, this is no longer true: merges and optimizing run
> in the background, and allow add/update/delete documents to run at the
> same time. ...
Re: string similarity measures
Yes, I already have a system for users reporting words. They fall onto an
operator screen, and if the operator approves, or if 3 other people mark a
word as a curse, then it is filtered.

In the other thread you wrote:
> I would create 1-5 ngram sized shingles and measure the distance using
> the Tanimoto coefficient. That would probably work out just fine. You
> might want to add more weight the greater the size of the shingle.
>
> There are shingle filters in lucene/java/contrib/analyzers and there is
> a Tanimoto distance in lucene/mahout/.

Would that apply to my case? Tanimoto coefficient over shingles?

Best.
Re: string similarity measures
4 sep 2008 kl. 15.54 skrev Cam Bazz:
> would that apply to my case? tanimoto coefficient over shingles?

Not really, no.

karl
Re: How can we know if 2 lucene indexes are same?
Sorry, I should have said: you must always use the same writer. That is, as
of 2.3, while IndexWriter.optimize (or normal segment merging) is running
under one thread, another thread can use that *same* writer to
add/delete/update documents, and both are free to make changes to the
index.

Before 2.3, optimize() was fully synchronized and blocked
add/update/delete documents from changing the index until the optimize()
call completed.

So, your test is expected to fail: you're not allowed to open 2 "writers"
on a single index at the same time, where "writer" includes an IndexReader
that deletes documents; those exceptions (LockObtainFailed, StaleReader)
are expected.

Mike

叶双明 wrote:
> I don't agree with Michael McCandless. :)
>
> I know that after 2.3, add and delete can run in one IndexWriter at one
> time ...
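A minimal sketch of the pattern Mike describes, reworking the earlier test so the delete goes through the *same* writer; the path and field name mirror that test:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class SharedWriter {
        public static void main(String[] args) throws Exception {
            final IndexWriter writer =
                new IndexWriter("e:/index", new StandardAnalyzer(), true);

            Document doc = new Document();
            doc.add(new Field("bbb", "bbb", Field.Store.YES,
                              Field.Index.UN_TOKENIZED));
            writer.addDocument(doc);

            // Delete through the same writer, from another thread, while the
            // main thread optimizes - no second write lock is needed.
            Thread deleter = new Thread() {
                public void run() {
                    try {
                        writer.deleteDocuments(new Term("bbb", "bbb"));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            };
            deleter.start();

            writer.optimize();
            deleter.join();
            writer.close();
        }
    }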
Re: string similarity measures
Let me rephrase the problem. I already have a set of bad words. I want to
avoid people inputting typos of the bad words. For example 'shit' is
banned, but someone may enter sh1t.

How can I flag words that are phonetically similar to the marked bad words?

Best.
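One low-tech option (my own suggestion, not something proposed in this thread) is to normalize the common look-alike substitutions before checking the banned list; that catches 'sh1t'-style evasions without any language-specific phonetics. The substitution map here is an assumption to be adapted for Turkish:

    import java.util.HashSet;
    import java.util.Set;

    public class BadWordFilter {
        private final Set<String> banned = new HashSet<String>();

        public BadWordFilter(String... words) {
            for (String w : words) {
                banned.add(normalize(w));
            }
        }

        // Undo common digit/symbol look-alikes before the lookup.
        static String normalize(String word) {
            return word.toLowerCase()
                       .replace('1', 'i').replace('0', 'o')
                       .replace('3', 'e').replace('@', 'a')
                       .replace('$', 's').replace('5', 's');
        }

        public boolean isBanned(String word) {
            return banned.contains(normalize(word));
        }

        public static void main(String[] args) {
            BadWordFilter f = new BadWordFilter("shit");
            System.out.println(f.isBanned("sh1t"));  // true
            System.out.println(f.isBanned("shot"));  // false - no edit distance
        }
    }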
lucene ram buffering
hello,

I was reading the performance optimization guides and found:

    writer.setRAMBufferSizeMB()

combined with:

    writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);

This can be used to flush automatically, so that if the RAM buffer goes
over a certain size it will flush.

Now the question: I would like to manage all flushes myself, yet I still
need a large RAM buffer (for search performance). How can I set the RAM
buffer size to a large value, yet not use auto flush? I just want to flush
every 32 documents added - and manage that myself.

If the RAM buffer size is high, but we have a small number of documents,
does lucene try to write the entire contents of the RAM buffer - thus
resulting in a higher flush time?

Usually in OODBMS systems you use a larger RAM buffer for search, and a
smaller RAM buffer for write optimization, the reasoning being that a
smaller RAM buffer is writable to disk faster. Is that the case with
lucene?

Best.

-C.B.
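No answer appears in this thread, so the following is only my reading of the 2.3 javadocs, where DISABLE_AUTO_FLUSH is accepted by both triggers and flush() is public - a sketch of fully manual flushing:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ManualFlush {
        public static void main(String[] args) throws Exception {
            IndexWriter writer =
                new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

            // Disable both auto-flush triggers: neither a document count nor
            // a RAM threshold will flush behind your back.
            writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
            writer.setRAMBufferSizeMB(IndexWriter.DISABLE_AUTO_FLUSH);

            for (int i = 0; i < 1000; i++) {
                Document doc = new Document();
                doc.add(new Field("id", Integer.toString(i),
                                  Field.Store.YES, Field.Index.UN_TOKENIZED));
                writer.addDocument(doc);
                if ((i + 1) % 32 == 0) {
                    writer.flush();   // manual flush every 32 documents
                }
            }
            writer.close();
        }
    }

Note that the writer's RAM buffer only holds pending documents for indexing; it is not a search-side cache, so a large value speeds up indexing, not searching.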
Re: How can we know if 2 lucene indexes are same?
I see now - thanks Michael McCandless, well explained!!
Re: Realtime Search for Social Networks Collaboration
Hi Cam,

Thanks! It has not been easy; it has probably taken 3 years or so to get
this far. At first I thought the new reopen code would be the solution. I
used it, but then needed to modify it to do a clone instead of referencing
the old deleted docs. Then as I iterated, I realized that just using reopen
on a RAMDirectory would not be quite fast enough because of the merging.
Then I started using InstantiatedIndex, which provides an in-memory version
of the document without the overhead of merging during the transaction.
There are other complexities as well. The basic code works if you are
interested in trying it out.

Take care,
Jason
Re: string similarity measures
I submitted a patch to handle Aspell phonetic rules. You can find it in
JIRA.
ramdisks
hello,

Is anyone using ramdisks for storage? There is RamSan and there is also
Fusion-io, but they are kind of expensive. Any other alternatives, I
wonder?

Best.
PhraseQuery issues - differences with SpanNearQuery
Hi,

I am having an issue when using the PhraseQuery which is best illustrated
with this example: I have created 2 documents to emulate URLs, one with a
URL of "http://www.airballoon.com" and title "air balloon", and a second
one with URL "http://www.balloonair.com" and title "balloon air".

Test1 (PhraseQuery)
===================
When I use the phrase query - title:"air balloon"~2 - I get back:

url: "http://www.airballoon.com" - score: 1.0
url: "http://www.balloonair.com" - score: 0.57

Test2 (PhraseQuery)
===================
When I use the phrase query - title:"balloon air"~2 - I get back:

url: "http://www.balloonair.com" - score: 1.0
url: "http://www.airballoon.com" - score: 0.57

Test3 (PhraseQuery)
===================
When I use the phrase query - title:"air balloon"~2 title:"balloon air"~2 -
I get back:

url: "http://www.airballoon.com" - score: 1.0
url: "http://www.balloonair.com" - score: 1.0

Test4 (SpanNearQuery)
=====================
With spanNear([title:air, title:balloon], 2, false) I get back:

url: "http://www.airballoon.com" - score: 1.0
url: "http://www.balloonair.com" - score: 1.0

I would have expected that Test1 and Test2 would actually return both URLs
with a score of 1.0, since I am setting the slop to 2. It seems though that
lucene really favors an absolutely exact match.

Is it safe to assume that for what I am looking for (basically score the
docs the same regardless of whether someone searches for "air balloon" or
"balloon air") it would be better to use the SpanNearQuery rather than the
PhraseQuery?

Any input would be appreciated.

Thanks in advance,

Yannis.
Re: PhraseQuery issues - differences with SpanNearQuery
Sounds like it's more in line with what you are looking for. If I remember
correctly, the phrase query factors the edit distance into scoring, but the
SpanNearQuery will just use the combined idf for each of the terms in it,
so distance shouldn't matter with spans (I'm sure Paul will correct me if I
am wrong).

- Mark

Yannis Pavlidis wrote:
> Is it safe to assume that for what I am looking for (basically score the
> docs the same regardless of whether someone searches for "air balloon"
> or "balloon air") it would be better to use the SpanNearQuery rather
> than the PhraseQuery?
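For reference, a sketch that reproduces Test2 and Test4 against the 2.3-era API, so the score difference can be seen directly:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.store.RAMDirectory;

    public class PhraseVsSpan {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            writer.addDocument(doc("http://www.airballoon.com", "air balloon"));
            writer.addDocument(doc("http://www.balloonair.com", "balloon air"));
            writer.close();

            // Test2: sloppy PhraseQuery - the out-of-order match is discounted.
            PhraseQuery pq = new PhraseQuery();
            pq.add(new Term("title", "balloon"));
            pq.add(new Term("title", "air"));
            pq.setSlop(2);

            // Test4: unordered SpanNearQuery - both documents score alike here.
            SpanNearQuery snq = new SpanNearQuery(new SpanQuery[] {
                    new SpanTermQuery(new Term("title", "air")),
                    new SpanTermQuery(new Term("title", "balloon")) }, 2, false);

            show(dir, pq);
            show(dir, snq);
        }

        static Document doc(String url, String title) {
            Document d = new Document();
            d.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
            d.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
            return d;
        }

        static void show(RAMDirectory dir, Query q) throws Exception {
            IndexSearcher searcher = new IndexSearcher(dir);
            Hits hits = searcher.search(q);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("url") + " - " + hits.score(i));
            }
            searcher.close();
        }
    }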
Re: Newbie question: using Lucene to index hierarchical information.
Hi all,

Thanks a lot for such a quick reply. Both scenarios sound good to me. I would like to do my best, try to implement one of them (as a proof of concept) and then incrementally improve, retest, investigate and rewrite :)

So, from the soap opera to the question part then:

- How do I implement those things (a and b) on the Lucene and Lucene contribs codebase?
- I looked at http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7 and didn't like it (too big, too heavy, a ready-to-use solution instead of a toolkit). And I didn't understand how to implement the "Normal scenario" on top of that.
- Any suggestions on how I could begin implementing these things, gently moving from the "Normal" scenario to the more advanced "Complex" one? What should I be afraid of, and what are the possible impacts, if any?

Has anybody tried to use Lucene to analyse things like that? What would be possible solutions to store the indexed data and perform queries on it? If Lucene isn't the right tool for this job, maybe some other toolkit would be more useful (possibly on top of Lucene)?

Thanks in advance for any suggestions and comments. I would appreciate any ideas and directions to look into.

On Tue, Sep 2, 2008 at 11:46 AM, Karsten F. <[EMAIL PROTECTED]> wrote:
>
> Hi Leonid,
>
> what kind of query is your use case?
>
> Complex scenario:
> You need all the hierarchical structure information in one query. This means
> you want to search with xpath in a real xml-Database. (like: All Documents
> with a subtitle XY which contains directly after this subtitle a table with
> the same columns like ...)
>
> Normal scenario:
> You want to search for only one part of your hierarchical information, like
> 'Documents with word xy in title' and 'Documents with word xy in table'.
>
> I am not familiar with lucene use in xml-Databases, but I can advise for the
> "normal scenario":
>
> Take a look at the xml-aware search in xtf (
> http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
> ).
> The idea is to use one lucene-document for each section with only two
> fields: "text" and "sectionType",
> but to collect all hits belonging to one hierarchical information (e.g. one
> html-File) and compress them to one representative hit in lucene.
>
> Best regards
> Karsten
>
> leonardinius wrote:
> >
> > Any comments, suggestions? Maybe I should rephrase my original message or
> > describe it in more detail?
> > I really would like to get any response if possible.
> >
> > Thanks a lot in advance!
> >
> > On Mon, Sep 1, 2008 at 10:25 AM, Leonid Maslov <[EMAIL PROTECTED]> wrote:
> >
> >> Hi all,
> >>
> >> First of all, sorry for my poor English. It's not my native language.
> >>
> >> I'm trying to use Lucene to index hierarchical kinds of information: I have
> >> structured html and pdf/word documents and I want to index them in ways to
> >> perform search in titles, text, paragraphs or tables only, or any
> >> combination of the items mentioned above. At the moment I see 3 possible
> >> solutions:
> >>
> >> - Create the set of all possible fields, like: contents, title,
> >>   heading, table etc., and index the data in all of them accordingly.
> >>   Possible impacts:
> >>   - a big count of fields
> >>   - data duplication (because a search in paragraphs needs to look inside
> >>     all the inner elements, every outer element indexed will contain all
> >>     the inner element content as well)
> >> - Create a hierarchy of fields, like "title", "paragraph/title",
> >>   "paragraph/title/subparagraph/table". Possible impacts:
> >>   - count of fields remains the same
> >>   - soft set of fields (not consistent)
> >>   - I'm not sure about the ways I could process the required information
> >>     and perform search
> >>   - Performance issues?
> >> - Use one field for content and just add a location prefix to the content,
> >>   for example "contents:*paragraph/heading:*token1 token2".
> >>   *paragraph/heading:* here is used as an additional information prefix.
> >>   So I (possibly?) could reuse PrefixQuery functionality or something.
> >>   Impacts:
> >>   - Strong (small) set of index fields
> >>   - Additional information processing - all the queries I'll use will
> >>     have to work as PrefixQuery
> >>   - Performance issues?
> >>
> >> So, has anyone tried to make things work like that? Or am I trying to use
> >> a wrench to hammer in nails? I assume Lucene wasn't designed to be used
> >> like that, but it's worth trying (at least asking).
> >> Any results / suggestions are welcome!
> >>
> >> --
> >> Bests regards,
> >> Leonid Maslov!
> >> Adrienne Gusoff - "Opportunity knocked. My doorman threw him out."
>
> --
> Bests regards,
> Leonid Maslov!
> Adrienne Gusoff - "Opportunity knocked. My doorman threw him
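As a rough illustration of the "normal scenario" Karsten describes above, indexing one Lucene document per section might look like this. The "text" and "sectionType" field names follow Karsten's mail; the "source" field is an added assumption, there so that section hits can later be collapsed into one representative hit per file:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class SectionIndexing {

        // Called once per title/paragraph/table section of a source file.
        static Document sectionToDoc(String sectionType, String text,
                String sourceFile) {
            Document doc = new Document();
            doc.add(new Field("sectionType", sectionType,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("text", text,
                    Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("source", sourceFile,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        }
    }

A search for 'word xy in title' then becomes an ordinary boolean query such as +sectionType:title +text:xy.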
Lucene debug logging?
Is there a way to turn on debug logging / trace logging for Lucene?
Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds
We have some code that uses lucene which has been working perfectly well for several months.

Recently, a QA team in our organization has set up a server with a much larger data set than we have ever tested with in the past: the resulting lucene index is about 3G in size.

On this particular server, the same lucene code which has been reliable in the past is now exhibiting erratic behavior. The first time you do a search, it returns the correct number of hits. The second time you do a search, it may or may not return the correct set. By the third time you do a search, it will return 0 hits even for a search that was returning hundreds of hits only a few seconds earlier. All subsequent searches will return 0 hits until you stop and restart the java process.

A snippet of the relevant code follows:

    // getReader() returns the singleton IndexReader object
    final IndexReader reader = getReader();

    // ANALYZER is another singleton
    final QueryParser queryParser = new QueryParser("text", ANALYZER);
    queryParser.setDefaultOperator(spec.getDefaultOp());
    final Query query = queryParser.parse(spec.getSearchText()).rewrite(reader);
    final IndexSearcher searcher = new IndexSearcher(reader);

    final Hits hits = searcher.search(query, new CachingWrapperFilter(
            new QueryWrapperFilter(visibilityFilter)));
    total = hits.length();

I understand that Lucene should be able to handle very large datasets, so I'd be surprised if this were an actual Lucene bug. I'm hoping it's just that I'm doing something "wrong" which has gone unnoticed so far for several months because we've never had an index this large.

We're using lucene version 2.2.0.

Thanks!

Justin Grunau
Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds
* And what about the visibility filter?
* Are you sure no one else accesses the IndexReader and modifies the index? Check reader.maxDoc() to be confident.

On Fri, Sep 5, 2008 at 12:19 AM, Justin Grunau <[EMAIL PROTECTED]> wrote:
> [...]

--
Bests regards,
Leonid Maslov!
Personal blog: http://leonardinius.blogspot.com/

Random thought:
Princess Margaret - "I have as much privacy as a goldfish in a bowl."
Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds
Sorry, I forgot to include the visibility filters:

    final BooleanQuery visibilityFilter = new BooleanQuery();
    visibilityFilter.add(new TermQuery(new Term("isPublic", "true")),
            Occur.SHOULD);
    visibilityFilter.add(new TermQuery(new Term("reader", user.getId())),
            Occur.SHOULD);

These visibility filters ensure that a user only sees files which he or she has access to see.

I am pretty certain nobody else has modified the index in the meantime, but why is that important? We have several other servers -- whose only difference is a smaller data set -- with dozens of concurrent users, and the index on those servers gets modified and read concurrently all the time, but none of these other servers have ever exhibited this bug.

----- Original Message -----
From: Leonid M. <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, September 4, 2008 5:35:47 PM
Subject: Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

> [...]
Re: Lucene debug logging?
On Thursday, 4 September 2008, Justin Grunau wrote:

> Is there a way to turn on debug logging / trace logging for Lucene?

You can use IndexWriter's setInfoStream(). Besides that, Lucene doesn't do any logging AFAIK. Are you experiencing any problems that you want to diagnose with debugging?

Regards
Daniel

--
http://www.danielnaber.de
Re: Lucene debug logging?
For IndexWriter there's setInfoStream, which logs details about when flushing & merging is happening.

Mike

Justin Grunau wrote:
> Is there a way to turn on debug logging / trace logging for Lucene?
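Putting the two answers together, a minimal sketch (the index path is made up, and a new index is created just for illustration):

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class InfoStreamDemo {
        public static void main(String[] args) throws IOException {
            IndexWriter writer = new IndexWriter("/tmp/demo-index",
                    new StandardAnalyzer(), true);
            writer.setInfoStream(System.out); // prints flush/merge details
            // ... add documents here; diagnostics appear on stdout ...
            writer.close();
        }
    }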
Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds
Anyway it is worth trying (to ensure docs aren't removed between searches). What if you run MatchAllDocsQuery or something similar? Do you still get a different hit count when the query is rerun?

PS. I'm kind of a newbie with Lucene and the Lucene API, so don't take my notes too seriously :)

On Fri, Sep 5, 2008 at 12:46 AM, Justin Grunau <[EMAIL PROTECTED]> wrote:
> [...]

--
Bests regards,
Leonid Maslov!
Personal blog: http://leonardinius.blogspot.com/

Random thought:
John Belushi - "I owe it all to little chocolate donuts."
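If staleness of the singleton reader is a suspect here, a quick check along these lines might help rule it in or out. This is only a sketch against the 2.2-era API; the helper name and the availability of the underlying Directory are assumptions, not part of Justin's actual code:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;

    public class ReaderSanityCheck {

        // Hypothetical helper: returns a reader that reflects the
        // latest committed state of the index.
        static IndexReader freshen(IndexReader reader, Directory directory)
                throws IOException {
            if (!reader.isCurrent()) {      // has the index changed on disk?
                reader.close();
                reader = IndexReader.open(directory);
            }
            // Sanity-check what the reader actually sees before
            // blaming the query or the filter.
            System.out.println("maxDoc=" + reader.maxDoc()
                    + " numDocs=" + reader.numDocs());
            return reader;
        }
    }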
Re: Lucene debug logging?
Daniel, yes, please see my "Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds" thread.

----- Original Message -----
From: Daniel Naber <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, September 4, 2008 6:10:56 PM
Subject: Re: Lucene debug logging?

> [...]
Re: PhraseQuery issues - differences with SpanNearQuery
On Thursday 04 September 2008 20:39:13, Mark Miller wrote:
> Sounds like it's more in line with what you are looking for. If I
> remember correctly, the phrase query factors the edit distance into
> scoring, but the SpanNearQuery will just use the combined idf for
> each of the terms in it, so distance shouldn't matter with spans (I'm
> sure Paul will correct me if I am wrong).

SpanScorer will use the similarity slop factor for each matching span size to adjust the effective frequency. The span size is the difference in position between the first and last matching term, and idf is not used for scoring Spans.

The reason why idf is not used could be that there is no basic score value associated with inner spans; only top level spans are scored by SpanScorer.

For more details, please consult the SpanScorer code.

Regards,
Paul Elschot
Re: QueryParser vs. BooleanQuery
Indeed, StandardAnalyzer removes the pluses, so 'c++' is analyzed to 'c'. The QueryParser query contains terms that have been analyzed, while the hand-built BooleanQuery contains terms that haven't. I think this is the difference between them.

2008/9/4 Ian Lea <[EMAIL PROTECTED]>
> Have a look at the index with Luke to see what has actually been
> indexed. StandardAnalyzer may well be removing the pluses, or you may
> need to escape them. And watch out for case - Visual != visual in
> term query land.
>
> --
> Ian.
>
> On Thu, Sep 4, 2008 at 9:46 AM, bogdan71 <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> >
> > I am experiencing a strange behaviour when trying to query the same thing
> > via BooleanQuery vs. via the know-it-all QueryParser class. Precisely, the
> > index contains the document "12,Visual C++,4.2" with the field layout
> > ID,name,version (thus, "12" is the ID field, "Visual C++" is the name field
> > and "4.2" is the version field).
> > The search string is "Visual C++" for the name field.
> >
> > The following test, using QueryParser, goes fine:
> >
> > public final void testUsingQueryParser()
> > {
> >     IndexSearcher recordSearcher;
> >     Query q;
> >     QueryParser parser = new QueryParser("name", new StandardAnalyzer());
> >     try
> >     {
> >         q = parser.parse("name:visual +name:c++");
> >
> >         Directory directory = FSDirectory.getDirectory();
> >         recordSearcher = new IndexSearcher(directory);
> >
> >         Hits h = recordSearcher.search(q);
> >
> >         assertEquals(1, h.length());
> >         assertEquals(12, Integer.parseInt(h.doc(0).get("ID")));
> >     }
> >     catch(Exception exn)
> >     {
> >         fail("Exception occurred.");
> >     }
> > }
> >
> > But this one, using a BooleanQuery, fails:
> >
> > public final void testUsingTermQuery()
> > {
> >     IndexSearcher recordSearcher;
> >     BooleanQuery bq = new BooleanQuery();
> >
> >     bq.add(new TermQuery(new Term("name", "visual")), BooleanClause.Occur.SHOULD);
> >     bq.add(new TermQuery(new Term("name", "c++")), BooleanClause.Occur.MUST);
> >
> >     try
> >     {
> >         Directory directory = FSDirectory.getDirectory();
> >         recordSearcher = new IndexSearcher(directory);
> >
> >         Hits h = recordSearcher.search(bq);
> >
> >         assertEquals(1, h.length()); // fails, saying it expects 0 !!!
> >         assertEquals(12, Integer.parseInt(h.doc(0).get("ID")));
> >     }
> >     catch(Exception exn)
> >     {
> >         fail("Exception occurred.");
> >     }
> > }
> >
> > Rewriting the BooleanQuery and taking toString() yields the same String
> > given to QueryParser.parse() in the first test. I am using Lucene 2.3.0.
> > Can somebody explain the difference?
> > --
> > View this message in context: http://www.nabble.com/QueryParser-vs.-BooleanQuery-tp19306087p19306087.html
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
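The difference is easy to see by running the analyzer directly. A small sketch (Lucene 2.3-era API) printing what StandardAnalyzer produces for "Visual C++":

    import java.io.StringReader;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AnalyzerDemo {
        public static void main(String[] args) throws Exception {
            TokenStream ts = new StandardAnalyzer()
                    .tokenStream("name", new StringReader("Visual C++"));
            // In the 2.3 API, next() returns one Token at a time.
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t.termText()); // "visual", then "c"
            }
        }
    }

So the hand-built equivalent of the parsed query would use new Term("name", "c") (lowercased and stripped of the pluses), not new Term("name", "c++").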
Re: Beginner: Specific indexing
Honestly: your problem doesn't sound like a Lucene problem to me at all ... I would write custom code to check your files for the pattern you are looking for. If you find it, *then* construct a Document object and add your 3 fields. I probably wouldn't even use an analyzer.

-Hoss
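A tiny sketch of that approach, with field names invented for illustration (UN_TOKENIZED stores the values without running them through an analyzer):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class PatternIndexing {

        // Build a Document only after your own code has found the pattern.
        static Document toDoc(String file, String match, String category) {
            Document doc = new Document();
            doc.add(new Field("file", file,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("match", match,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("category", category,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        }
    }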
Javadoc wording in IndexWriter.addIndexesNoOptimize()
The Javadoc for this method has the following comment: "This requires this index not be among those to be added, and the upper bound* of those segment doc counts not exceed maxMergeDocs."

What does the second part of that mean? It is especially confusing given that MAX_MERGE_DOCS is deprecated.

Thanks

Antony
Re: Hits document offset information
: Now, I would like to access the best fragments offsets from each
: document (hits.doc(i)).

I seem to recall that the recommended method for doing this is to subclass your favorite Formatter and record the information from each TokenGroup before delegating to the super class. But there may be an easier way.

-Hoss
Merging indexes - which is best option?
I am creating several temporary batches of indexes in separate indices and will periodically merge those batches into a set of master indices. I'm using IndexWriter#addIndexesNoOptimize(), but the problem that gives me is that the master may already contain the index for a document, so I get a duplicate.

Duplicates are prevented in the temporary index because, when adding Documents, I call IndexWriter#deleteDocuments(Term) with my UID before I add the Document.

I have two choices:

a) merge the indexes, then clean up any duplicates in the master (or vice versa). Probably IndexWriter.deleteDocuments(Term[]) would suit here, with all the UIDs of the incoming documents.

b) iterate through the Documents in the temporary index and add them to the master.

b sounds worse, as it seems an IndexWriter's Analyzer cannot be null, and I guess there's a penalty in assembling each Document from the reader.

Any views?

Antony
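Option a) could look roughly like this. A sketch only: the "uid" field name and the surrounding setup are assumptions, not Antony's actual code:

    import java.io.IOException;

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;

    public class BatchMerge {

        static void mergeBatch(IndexWriter master, Directory batch,
                String[] uids) throws IOException {
            // Delete any master documents the batch is about to re-add.
            Term[] dupes = new Term[uids.length];
            for (int i = 0; i < uids.length; i++) {
                dupes[i] = new Term("uid", uids[i]);
            }
            master.deleteDocuments(dupes);
            // Then merge the temporary batch into the master index.
            master.addIndexesNoOptimize(new Directory[] { batch });
        }
    }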
Re: How can we know if 2 lucene indexes are same?
I am not using the same index with different writers. These are two separate indexes; both have their own reader/writer. I just wanted to minimize the network load by avoiding the download of an optimized index if the contents are indeed the same.

--noble

On Thu, Sep 4, 2008 at 7:36 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> Sorry, I should have said: you must always use the same writer, ie as of
> 2.3, while IndexWriter.optimize (or normal segment merging) is running,
> under one thread, another thread can use that *same* writer to
> add/delete/update documents, and both are free to make changes to the index.
>
> Before 2.3, optimize() was fully synchronized and blocked add/update/delete
> documents from changing the index until the optimize() call completed.
>
> So, your test is expected to fail: you're not allowed to open 2 "writers" on
> a single index at the same time, where "writer" includes an IndexReader that
> deletes documents; so those exceptions (LockObtainFailed, StaleReader) are
> expected.
>
> Mike
>
> 叶双明 wrote:
>
>> I don't agree with Michael McCandless. :)
>>
>> I know that after 2.3, add and delete can run in one IndexWriter at one
>> time, and also lucene has an update method which deletes documents by term
>> and then adds the new document.
>>
>> In my test, I get either a LockObtainFailedException with the thread sleep
>> statement:
>>
>> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
>> [EMAIL PROTECTED]:\index\write.lock
>>   at org.apache.lucene.store.Lock.obtain(Lock.java:85)
>>   at org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:298)
>>   at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
>>   at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
>>   at org.test.IndexThread.run(IndexThread.java:33)
>>
>> or a StaleReaderException without the thread sleep statement:
>>
>> org.apache.lucene.index.StaleReaderException: IndexReader out of date and no
>> longer valid for delete, undelete, or setNorm operations
>>   at org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:308)
>>   at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
>>   at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
>>   at org.test.IndexThread.run(IndexThread.java:31)
>>
>> My test code:
>>
>> public class Main {
>>
>>   public static void main(String[] args) throws IOException {
>>     Directory directory = FSDirectory.getDirectory("e:/index");
>>     IndexWriter writer = new IndexWriter(directory, null, false);
>>     Document document = new Document();
>>     document.add(new Field("bbb", "bbb", Store.YES, Index.UN_TOKENIZED));
>>     writer.addDocument(document);
>>
>>     Thread t = new IndexThread();
>>     t.start();
>>
>>     try {
>>       Thread.sleep(1000);
>>     } catch (InterruptedException e) {
>>       e.printStackTrace();
>>     }
>>
>>     writer.optimize();
>>     writer.close();
>>     System.out.println("out");
>>   }
>> }
>>
>> public class IndexThread extends Thread {
>>
>>   @Override
>>   public void run() {
>>     Directory directory;
>>     try {
>>       try {
>>         Thread.sleep(10);
>>       } catch (InterruptedException e) {
>>         e.printStackTrace();
>>       }
>>
>>       directory = FSDirectory.getDirectory("e:/index");
>>       System.out.println("thread begin");
>>       //IndexWriter reader = new IndexWriter(directory, null, false);
>>       IndexReader reader = IndexReader.open(directory);
>>       Term term = new Term("bbb", "bbb");
>>       reader.deleteDocuments(term);
>>       reader.close();
>>       System.out.println("thread end");
>>     } catch (IOException e) {
>>       e.printStackTrace();
>>     }
>>   }
>> }
>>
>> 2008/9/4, Michael McCandless <[EMAIL PROTECTED]>:
>>>
>>> Actually, as of 2.3, this is no longer true: merges and optimizing run in
>>> the background, and allow add/update/delete documents to run at the same
>>> time.
>>>
>>> I think it's probably best to use application logic (outside of Lucene) to
>>> keep track of what updates happened to the master while the slave was
>>> optimizing.
>>>
>>> Mike
>>>
>>> 叶双明 wrote:
>>>
>>>> No documents can be added to the index while the index is optimizing, or
>>>> optimizing can't run while documents are being added to the index. So,
>>>> without other errors, I think we can believe the two indexes are indeed
>>>> the same. :)
>>>>
>>>> 2008/9/4 Noble Paul നോബിള്‍ नोब्ळ् <[EMAIL PROTECTED]>
>>>>
>>>>> The use case is as follows
>>>>>
>>>>> I have two indexes. One at the master and one at the slave. The user
>>>>> occasionally keeps committing on the master and the delta is
>>>>> replicated every time. But when the optimize happens the transfer size
>>>>> can be really large. So I am thinking of doing the optimize
>>>>> separately on master and slave.
>>>>>
>>>>> So far, so good. But how can I really know that after the op