Re: How large an index can Lucene handle per server?
It depends what you call a server:
- 4 dual-Xeon CPUs, 64 GB RAM, and 1 TB of 15,000 rpm RAID-10 disks is one thing
- 1 P4, 512 MB RAM, a 40 GB 5,400 rpm disk, and Win2K is something else entirely

It depends on the index structure and the size of the documents you index/store.

It depends on the way you query your index:
- a simple TermQuery, top 500 by relevance, should be fast
- a complicated fuzzy and prefix query, sorted by a string field and retrieving 10k stored documents, will definitely be slow

It depends on what "slow" means for you... 1 ms, 50 ms, 1 s, 1 min?

I have seen indexes with 100 million documents and tens of GB in size with reasonable performance (on reasonable hardware).

On Tue, Mar 3, 2009 at 05:40, buddha1021 wrote:
>
> hi:
> How large an index can Lucene handle per server? I mean, up to what size
> does search stay fast, and beyond what per-server limit does a huge index
> make search slow?
> thank you!

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Analyze other language using English Analyzer
Hello all, I am using the default English Snowball analyzer to index and search English documents. There is a chance I will also need to index European and Chinese documents. What would the impact be of using the English analyzer on European- or Chinese-language documents? Would indexing and search still work as expected? The application will be installed on an English OS, but the chance of receiving documents in other languages is high, and I will not be able to detect the language of a document. Regards, Ganesh
How to index Named Entities
I want to index document contents in two ways: once as plain content, and once as named entities. The scenario is this: given the document "the source of Nile is Ethiopia", I want to index "source" as normal content, "Nile" as a river name, and "Ethiopia" as a country name, so that if a question like "where is the source of Nile" is asked later, it retrieves "Ethiopia" as the answer. Note: I will have lists of river names, country names, etc., so that during indexing I can compare every word of a document against my lists. Thanks a lot, Seid M
--
"RABI ZIDNI ILMA"
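The list-lookup step described above can be sketched in a few lines. This is only a toy model with hypothetical field names and tiny in-line gazetteers (the real lists would come from the poster's own data); during indexing each token would be added to the plain contents field and, when a lookup hits, to an extra entity field as well:

```java
import java.util.Locale;
import java.util.Set;

// Sketch: route each token to an extra Lucene field name based on gazetteer
// lookup. "river" and "country" are hypothetical field names; the real lists
// of river and country names would be loaded from external data.
public class EntityFieldRouter {
    private static final Set<String> RIVERS = Set.of("nile", "amazon", "danube");
    private static final Set<String> COUNTRIES = Set.of("ethiopia", "egypt", "brazil");

    // Returns the extra field this token should be indexed under, or null
    // when the token is plain content only.
    public static String entityFieldFor(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (RIVERS.contains(t)) return "river";
        if (COUNTRIES.contains(t)) return "country";
        return null;
    }

    public static void main(String[] args) {
        for (String word : "the source of Nile is Ethiopia".split(" ")) {
            String field = entityFieldFor(word);
            System.out.println(word + " -> "
                + (field == null ? "contents only" : "contents + " + field));
        }
    }
}
```

A question such as "where is the source of Nile" could then be answered by searching the entity fields for documents that also match the plain-content terms.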
How large an index can Lucene handle per server?
hi:
How large an index can Lucene handle per server? I mean, up to what index size does search stay fast and unaffected by the index being huge? What is the per-server limit beyond which search becomes slow? Does any expert have experience to share? thank you!
Re: Indexing synonyms for multiple words
Thanks for your suggestion Michael and thanks to Uwe for clarifying. Payload is currently used to store only the start positions. What I gathered from your suggestion is that we could possibly store the end position, or span, or some other complex encoding in order to store the extra information. Am I right? --Sumukh Michael McCandless-2 wrote: > > > Since Lucene doesn't represent/store end position for a token, I don't > think the index can properly represent SYN spanning two positions? > > I suppose you could encode this into payloads, and create a custom > query that would look at the payload to enforce the constraint. > > Or, if you switch to doing SYN expansion only at runtime (not adding > it to the index), that might work. > > Mike > > Uwe Schindler wrote: > >> I think his problem is, that "SYN" is a synonym for the phrase "WORD1 >> WORD2". Using these positions, a phrase like "SYN WORD2" would also >> match >> (or other problems in queries that depend on order of words). >> >> Uwe >> >> - >> Uwe Schindler >> H.-H.-Meier-Allee 63, D-28213 Bremen >> http://www.thetaphi.de >> eMail: u...@thetaphi.de >> >>> -Original Message- >>> From: Michael McCandless [mailto:luc...@mikemccandless.com] >>> Sent: Monday, March 02, 2009 4:07 PM >>> To: java-user@lucene.apache.org >>> Subject: Re: Indexing synonyms for multiple words >>> >>> >>> Shouldn't WORD2's position be 1 more than your SYN? >>> >>> Ie, don't you want these positions?: >>> >>>WORD1 2 >>>WORD2 3 >>>SYN 2 >>> >>> The position is the starting position of the token; Lucene doesn't >>> store an ending position >>> >>> Mike >>> >>> Sumukh wrote: >>> Hi, I'm fairly new to Lucene. I'd like to know how we can index synonyms for multiple words. This is the scenario: Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. Now assume the two words combined WORD1 WORD2 can be replaced by another word SYN. 
If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will follow SYN, which is incorrect; and the other way round if I place it after WORD2. If any of you have solved a similar problem, I'd be thankful if you could shed some light on the solution. Regards, Sumukh
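Mike's payload suggestion above could start from something as small as encoding the synonym's span (how many source positions it covers, e.g. 2 for SYN replacing "WORD1 WORD2") into a few bytes. A minimal sketch of just that encoding, assuming the bytes would then be wrapped in a Lucene Payload on the SYN token and checked by a custom query:

```java
// Sketch: encode the number of positions a synonym token spans into a byte[]
// (big-endian int). In the indexing chain these bytes would become the SYN
// token's payload; a custom query could read them back and reject matches
// whose next term falls inside the spanned region (e.g. "SYN WORD2").
public class SpanPayload {
    public static byte[] encodeSpan(int span) {
        return new byte[] {
            (byte) (span >>> 24), (byte) (span >>> 16),
            (byte) (span >>> 8), (byte) span
        };
    }

    public static int decodeSpan(byte[] payload) {
        return ((payload[0] & 0xFF) << 24) | ((payload[1] & 0xFF) << 16)
             | ((payload[2] & 0xFF) << 8) | (payload[3] & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(decodeSpan(encodeSpan(2))); // prints 2
    }
}
```

The custom-query side is the harder part, as the thread notes; runtime expansion of synonyms avoids the problem entirely.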
Re: term position in phrase query using queryparser
On Feb 25, 2009, at 2:52 PM, Tim Williams wrote:

Is there a syntax to set the term position in a query built with the QueryParser? For example, I would like something like:

PhraseQuery q = new PhraseQuery();
q.add(t1, 0);
q.add(t2, 0);
q.setSlop(0);

As I understand it, the slop defaults to 0, but I don't know how to search for two tokens at the same term position using the QueryParser syntax.

I don't think this is available from the QueryParser. You could make a subclass that handles this for the phrase-query syntax: if you see something like "term1 term2" you can build your own Query and return it, but then you can't use normal phrase queries anymore... Either that, or write your own parser...
--
Matt Ronge
mro...@mronge.com
http://www.mronge.com
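The semantics Tim is after (two terms required at the same position, which `q.add(t1, 0); q.add(t2, 0)` expresses programmatically) can at least be pinned down with a toy model. This is pure Java with hypothetical names, not Lucene's implementation:

```java
import java.util.List;
import java.util.Map;

// Toy model of "two tokens at the same term position": a tiny index maps each
// term to the positions it occupies in a document, and a match requires both
// terms to share at least one position. This is what the programmatic
// PhraseQuery above asks for and what plain query syntax cannot express.
public class SamePositionMatch {
    public static boolean matchSamePosition(Map<String, List<Integer>> index,
                                            String t1, String t2) {
        List<Integer> p1 = index.get(t1);
        List<Integer> p2 = index.get(t2);
        if (p1 == null || p2 == null) return false;
        for (int p : p1) {
            if (p2.contains(p)) return true;  // shared position found
        }
        return false;
    }

    public static void main(String[] args) {
        // "automobile" indexed as a synonym at the same position as "car"
        Map<String, List<Integer>> index = Map.of(
            "car", List.of(3), "automobile", List.of(3), "red", List.of(2));
        System.out.println(matchSamePosition(index, "car", "automobile")); // true
        System.out.println(matchSamePosition(index, "car", "red"));        // false
    }
}
```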
Re: Confidence scores at search time
On 3/2/09 4:23 PM, "Ken Williams" wrote:
> On 3/2/09 1:58 PM, "Erik Hatcher" wrote:
>
>> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>>> In the output, I get explanations like "0.88922405 = (MATCH) product
>>> of:" with no details. Perhaps I need to do something different in
>>> indexing?
>>
>> Explanation.toString() only returns the first line. You can use
>> toString(int depth) or loop over all the getDetails(). toHtml()
>> returns a decently formatted tree of <div>s of the whole explanation
>> also.
>
> It looks like toString(int) is a protected method, and toHtml() only seems
> to return a single <div> with no content. I can start writing a recursive
> routine to dive down into getDetails(), but I thought there must be
> something easier.

Okay, silly me - notice that in my code I was printing the string with println(). I didn't realize println() truncated strings that contain newline characters (nor was I aware that the string had any newlines, I guess!). Once I ran it through replaceAll("\n", "\\n") I'm getting the output I need.

Thanks,
--
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN
Re: Confidence scores at search time
On 3/2/09 1:58 PM, "Erik Hatcher" wrote:
> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>> In the output, I get explanations like "0.88922405 = (MATCH) product
>> of:" with no details. Perhaps I need to do something different in
>> indexing?
>
> Explanation.toString() only returns the first line. You can use
> toString(int depth) or loop over all the getDetails(). toHtml()
> returns a decently formatted tree of <div>s of the whole explanation
> also.

It looks like toString(int) is a protected method, and toHtml() only seems to return a single <div> with no content. I can start writing a recursive routine to dive down into getDetails(), but I thought there must be something easier.
--
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN
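The recursive routine Ken mentions is only a few lines. A sketch against a stand-in tree class (hypothetical names; the real code would walk Lucene's Explanation via getValue(), getDescription(), and getDetails()):

```java
import java.util.List;

// Stand-in for Lucene's Explanation tree: a value, a description, and child
// details. dump() indents each level, mirroring what a recursive loop over
// Explanation.getDetails() would print.
public class ExplanationDump {
    final float value;
    final String description;
    final List<ExplanationDump> details;

    ExplanationDump(float value, String description, List<ExplanationDump> details) {
        this.value = value;
        this.description = description;
        this.details = details;
    }

    public static String dump(ExplanationDump e, int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append("  ".repeat(depth))
          .append(e.value).append(" = ").append(e.description).append("\n");
        for (ExplanationDump child : e.details) {
            sb.append(dump(child, depth + 1));  // recurse into the details
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ExplanationDump leaf = new ExplanationDump(0.5f, "tf(termFreq=1)", List.of());
        ExplanationDump root =
            new ExplanationDump(0.889f, "(MATCH) product of:", List.of(leaf));
        System.out.print(dump(root, 0));
    }
}
```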
Re: Confidence scores at search time
On 3/2/09 4:19 PM, "Steven A Rowe" wrote: > On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote: >> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: >>> Also, while perusing the threads you refer to below, I saw a >>> reference to the following link, which seems to have gone dead: >>> >>> https://issues.apache.org/bugzilla/show_bug.cgi?id=31841 >> >> Hmm, bugzilla has moved to JIRA. I'm not sure where the mapping is >> anymore. There used to be a Bugzilla Id in JIRA, I think. Sorry. > > http://issues.apache.org/jira/browse/LUCENE-295 > Great, thanks! -- Ken Williams Research Scientist The Thomson Reuters Corporation Eagan, MN
RE: Confidence scores at search time
On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote: > On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: > > Also, while perusing the threads you refer to below, I saw a > > reference to the following link, which seems to have gone dead: > > > > https://issues.apache.org/bugzilla/show_bug.cgi?id=31841 > > Hmm, bugzilla has moved to JIRA. I'm not sure where the mapping is > anymore. There used to be a Bugzilla Id in JIRA, I think. Sorry. http://issues.apache.org/jira/browse/LUCENE-295 I found this by looking up the issue number in the map of Bugzilla -> JIRA issue numbers I put into the changes2html.pl script[1], so that linkification of old Bugzilla issues would continue to work in the Changes.html[2] it generates from CHANGES.txt[3]. Bug 31841 is mentioned (and now linked to LUCENE-295 in Changes.html) as item #4 under the "Changes in runtime behavior" section of the release notes for Release 1.9 RC1 - see [2]. Steve [1] changes2html.pl (look for "setup_bugzilla_jira_map" at the bottom of the file): http://svn.apache.org/viewvc/lucene/java/trunk/src/site/changes/changes2html.pl?view=markup [2] Changes.html: http://lucene.apache.org/java/2_4_0/changes/Changes.html [3] CHANGES.txt: http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=markup
Re: Faceted Search using Lucene
So then all is good; we were only pursuing this to explain it. Now that we know your directories are empty at startup, that explains it. So you should call maybeReopen() inside get(), as long as it does not slow queries down.

Mike

Amin Mohammed-Coleman wrote:

I think that is the case. When my SearchManager is initialised the directories are empty, so when I do a get() nothing is present. Subsequent calls seem to work. Is there something I can do, or do I accept this, or just do a maybeReopen() and then a get()? As you mentioned it depends on timing, but I would be keen to know what the best practice would be in this situation...

Cheers

On Mon, Mar 2, 2009 at 8:43 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

Well the code looks fine. I can't explain why you see no search results if you don't call maybeReopen() in get(), unless at the time you first create SearcherManager the Directories each have an empty index in them.

Mike

Amin Mohammed-Coleman wrote:

Hi, Here is the code that I am using; I've modified the get() method to include the maybeReopen() call. Again I'm not sure if this is a good idea.

public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
final String searchTerm = searchRequest.getSearchTerm();
if (StringUtils.isBlank(searchTerm)) {
throw new SearchExecutionException("Search string cannot be empty. 
There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = get(); try { LOGGER.debug("Ensuring all index readers are up to date..."); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } TopDocs topDocs = multiSearcher.search(query,chainedFilter , 100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocument baseDocument = new BaseDocument(doc, score); Summary documentSummary = new DocumentSummaryImpl(baseDocument); summaryList.add(documentSummary); } } catch (Exception e) { throw new IllegalStateException(e); } finally { if (multiSearcher != null) { release(multiSearcher); } } stopWatch.stop(); LOGGER.debug("total time taken for document seach: " + stopWatch.getTotalTimeMillis() + " ms"); return summaryList.toArray(new Summary[] {}); } @Autowired public void setDirectories(@Qualifier("directories")ListFactoryBean listFactoryBean) throws Exception { this.directories = (List) listFactoryBean.getObject(); } @PostConstruct public void initialiseDocumentSearcher() { StopWatch stopWatch = new StopWatch("document-search-initialiser"); stopWatch.start(); PerFieldAnalyzerWrapper analyzerWrapper = new PerFieldAnalyzerWrapper( analyzer); analyzerWrapper.addAnalyzer(FieldNameEnum.TYPE.getDescription(), newKeywordAnalyzer()); queryParser = newMultiFieldQueryParser(FieldNameEnum.fieldNameDescriptions(), 
analyzerWrapper); try { LOGGER.debug("Initialising document searcher "); documentSearcherManagers = new DocumentSearcherManager[directories.size()]; for (int i = 0; i < directories.size() ;i++) { Directory directory = directories.get(i); DocumentSearcherManager documentSearcherManager = newDocumentSearcherManager(directory); documentSearcherManagers[i]=documentSearcherManager; } LOGGER.debug("Document searcher initialised"); } catch (IOException e) { throw new IllegalStateException(e); } stopWatch.stop(); LOGGER.debug("Total time taken to initialise DocumentSearcher '" + stopWatch.getTotalTimeMillis() +"' ms."); } private void maybeReopen() throws SearchExecutionException { LOGGER.debug("Initiating reopening of index readers..."); for (DocumentSearcherManager documentSearcherManager : documentSearcherManagers) { try { documentSearcherManager.maybeReopen(); } catch (InterruptedException e) { throw new SearchExecutionException(e); } catch (IOException e) { throw new SearchExecutionException(e); } } LOGGER.debug("reopening of index readers complete."); } private void release(MultiSearcher multiSeacher) { IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSeacher.getSearchables(); for(int i =0 ; i < indexSearchers.length;i++) { try { documentSearcherManagers[i].release(indexSearchers[i]); } catch (IOException e) { throw new IllegalStateException(e); } } } private MultiSearcher get() throws SearchExecutionException { maybeReopen(); MultiSearcher multiSearcher =
Re: Faceted Search using Lucene
I think that is the case. When my SearchManager is initialised the directories are empty so when I do a get() nothing is present. Subsequent calls seem to work. Is there something I can do? or do I accept this or just do a maybeReopen and do a get(). As you mentioned it depends on timiing but I would be keen to know what the best practice would be in this situation... Cheers On Mon, Mar 2, 2009 at 8:43 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > Well the code looks fine. > > I can't explain why you see no search results if you don't call > maybeReopen() in get, unless at the time you first create SearcherManager > the Directories each have an empty index in them. > > Mike > > Amin Mohammed-Coleman wrote: > > Hi >> Here is the code that I am using, I've modified the get() method to >> include >> the maybeReopen() call. Again I'm not sure if this is a good idea. >> >> public Summary[] search(final SearchRequest searchRequest) >> throwsSearchExecutionException { >> >> final String searchTerm = searchRequest.getSearchTerm(); >> >> if (StringUtils.isBlank(searchTerm)) { >> >> throw new SearchExecutionException("Search string cannot be empty. 
There >> will be too many results to process."); >> >> } >> >> List summaryList = new ArrayList(); >> >> StopWatch stopWatch = new StopWatch("searchStopWatch"); >> >> stopWatch.start(); >> >> MultiSearcher multiSearcher = get(); >> >> try { >> >> LOGGER.debug("Ensuring all index readers are up to date..."); >> >> Query query = queryParser.parse(searchTerm); >> >> LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + >> query.toString() +"'"); >> >> Sort sort = null; >> >> sort = applySortIfApplicable(searchRequest); >> >> Filter[] filters =applyFiltersIfApplicable(searchRequest); >> >> ChainedFilter chainedFilter = null; >> >> if (filters != null) { >> >> chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); >> >> } >> >> TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); >> >> ScoreDoc[] scoreDocs = topDocs.scoreDocs; >> >> LOGGER.debug("total number of hits for [" + query.toString() + " ] = >> "+topDocs. >> totalHits); >> >> for (ScoreDoc scoreDoc : scoreDocs) { >> >> final Document doc = multiSearcher.doc(scoreDoc.doc); >> >> float score = scoreDoc.score; >> >> final BaseDocument baseDocument = new BaseDocument(doc, score); >> >> Summary documentSummary = new DocumentSummaryImpl(baseDocument); >> >> summaryList.add(documentSummary); >> >> } >> >> } catch (Exception e) { >> >> throw new IllegalStateException(e); >> >> } finally { >> >> if (multiSearcher != null) { >> >> release(multiSearcher); >> >> } >> >> } >> >> stopWatch.stop(); >> >> LOGGER.debug("total time taken for document seach: " + >> stopWatch.getTotalTimeMillis() + " ms"); >> >> return summaryList.toArray(new Summary[] {}); >> >> } >> >> >> @Autowired >> >> public void setDirectories(@Qualifier("directories")ListFactoryBean >> listFactoryBean) throws Exception { >> >> this.directories = (List) listFactoryBean.getObject(); >> >> } >> >> @PostConstruct >> >> public void initialiseDocumentSearcher() { >> >> StopWatch stopWatch = new 
StopWatch("document-search-initialiser"); >> >> stopWatch.start(); >> >> PerFieldAnalyzerWrapper analyzerWrapper = new PerFieldAnalyzerWrapper( >> analyzer); >> >> analyzerWrapper.addAnalyzer(FieldNameEnum.TYPE.getDescription(), >> newKeywordAnalyzer()); >> >> queryParser = >> newMultiFieldQueryParser(FieldNameEnum.fieldNameDescriptions(), >> analyzerWrapper); >> >> try { >> >> LOGGER.debug("Initialising document searcher "); >> >> documentSearcherManagers = new >> DocumentSearcherManager[directories.size()]; >> >> for (int i = 0; i < directories.size() ;i++) { >> >> Directory directory = directories.get(i); >> >> DocumentSearcherManager documentSearcherManager = >> newDocumentSearcherManager(directory); >> >> documentSearcherManagers[i]=documentSearcherManager; >> >> } >> >> LOGGER.debug("Document searcher initialised"); >> >> } catch (IOException e) { >> >> throw new IllegalStateException(e); >> >> } >> >> stopWatch.stop(); >> >> LOGGER.debug("Total time taken to initialise DocumentSearcher '" + >> stopWatch.getTotalTimeMillis() +"' ms."); >> >> } >> >> private void maybeReopen() throws SearchExecutionException { >> >> LOGGER.debug("Initiating reopening of index readers..."); >> >> for (DocumentSearcherManager documentSearcherManager : >> documentSearcherManagers) { >> >> try { >> >> documentSearcherManager.maybeReopen(); >> >> } catch (InterruptedException e) { >> >> throw new SearchExecutionException(e); >> >> } catch (IOException e) { >> >> throw new SearchExecutionException(e); >> >> } >> >> } >> >> LOGGER.debug("reopening of index readers complete."); >> >> } >> >> >> >> private void release(MultiSearcher multiSeacher) { >> >> IndexSearcher[] indexSearchers = (IndexSearcher[]) >> multiSeacher.getSearchables(); >> >> for(int i =0 ; i < indexSearchers.length;i++) { >> >> try { >> >> documentSearcherMa
Re: Confidence scores at search time
On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:

Hi Grant, It's true, I may have an X-Y problem here. =) My basic need is to sacrifice recall to achieve greater precision. Rather than always presenting the user with the top N documents, I need to return *only* the documents that seem relevant. For some searches this may be 3 documents, for some it may be none.

Therein lies the rub. How are you determining what is relevant? In some sense, you are asking Lucene to determine what is relevant and then turning around and telling it you are not happy with it doing what you told it to do (I'm exaggerating a bit, I know), namely telling you which documents are relevant for a given query and a set of documents, based on its scoring model.

As an alternate tack, I usually look at this type of thing and try to figure out a way to make my queries more precise (e.g. replace OR with AND, introduce phrase queries, add filter or NOT clauses or some other qualifiers) or apply some other relevance tricks [1], [2].

That being said, I could see determining a delta value such that if the distance between any two scores is more than the delta, you cut off the rest of the docs. This takes into account the relative spread of the scores and is not some arbitrary absolute value (although the delta itself is arbitrary, of course). Since you are allowing the user to "explore", it may be reasonable to cut off at some point, too, but I still don't know of a good way to determine what that point is in a generic way. Maybe with some specific knowledge about how you are creating your queries and which query terms matched you could come up with something, but still, I am uncertain.

The other thing that strikes me is that you could add some type of learning/memory component that tracks your click-through information and gives feedback into the system about relevance. 
My user interface in this case isn't the standard "type words in a box and we'll show you the best docs" - I'm using Lucene as a tool in the background to do some exploration of how I could augment a set of traditional results with a few alternative results gleaned from a different path. Not sure if this helps with the X-Y problem, but that's my task at hand.

Yes. Also, keep in mind there are other techniques for encouraging exploration: clustering, faceting, information extraction (identifying named entities, etc., and presenting them). Just throwing out some food for thought.

Also, while perusing the threads you refer to below, I saw a reference to the following link, which seems to have gone dead: https://issues.apache.org/bugzilla/show_bug.cgi?id=31841

Hmm, Bugzilla has moved to JIRA. I'm not sure where the mapping is anymore. There used to be a Bugzilla Id in JIRA, I think. Sorry.

-Grant

[1] http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-in-Search/
[2] http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-in-Lucene-and-Solr/
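Grant's delta idea above can be sketched directly: walk the scores in descending order and stop as soon as the gap between two consecutive scores exceeds a threshold. The 0.2 delta here is an arbitrary placeholder, not a recommended value:

```java
import java.util.ArrayList;
import java.util.List;

// Score-delta cutoff: keep the leading results until the score "falls off a
// cliff", i.e. the drop between two consecutive (descending) scores exceeds
// delta. The top hit is always kept.
public class ScoreDeltaCutoff {
    public static List<Float> cutoff(List<Float> descendingScores, float delta) {
        List<Float> kept = new ArrayList<>();
        for (int i = 0; i < descendingScores.size(); i++) {
            if (i > 0 && descendingScores.get(i - 1) - descendingScores.get(i) > delta) {
                break;  // big gap: everything from here on is cut off
            }
            kept.add(descendingScores.get(i));
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(cutoff(List.of(0.9f, 0.85f, 0.4f, 0.39f), 0.2f));
        // prints [0.9, 0.85]
    }
}
```

In a real application the floats would come from TopDocs.scoreDocs, and as the thread notes, choosing the delta is itself a judgment call.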
Re: Extracting TFIDF vectors
Have a look at the MoreLikeThis contrib module in the contrib section of Lucene. You can start with that, and then do the additions and subtractions from there. On Mar 2, 2009, at 9:35 AM, Gregory Gay wrote: Hi, I'm a complete novice at Lucene, and I'm looking for a little bit of help with something. How can I extract the TF*IDF vector for each document in the indexed collection? Also for the query? I need to build a user-feedback system which manipulates the query based on the liked and disliked documents from the local collection. This query modification uses the TF*IDF vectors. Thanks for your help! -- Gregory Gay Editor - 4 Color Rebellion (http://www.4colorrebellion.com) Research Assistant - WVU CSEE -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
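For reference, the vector Gregory asks about is just per-term tf x idf. A self-contained sketch of the arithmetic (in Lucene the raw counts would come from a TermFreqVector and IndexReader.docFreq(); Lucene's own Similarity uses a slightly different idf formula, so treat this as the general shape, not Lucene's exact scoring):

```java
import java.util.HashMap;
import java.util.Map;

// tf-idf per term: tf = raw term frequency in the document,
// idf = log(numDocs / docFreq). Common terms ("the") get weight near zero,
// rare discriminative terms get high weight.
public class TfIdf {
    public static Map<String, Double> vector(Map<String, Integer> termFreqs,
                                             Map<String, Integer> docFreqs,
                                             int numDocs) {
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 1);
            v.put(e.getKey(), e.getValue() * Math.log((double) numDocs / df));
        }
        return v;
    }

    public static void main(String[] args) {
        Map<String, Double> v = vector(Map.of("lucene", 3, "the", 3),
                                       Map.of("lucene", 10, "the", 1000), 1000);
        System.out.println(v.get("lucene") > v.get("the")); // prints true
    }
}
```

The relevance-feedback step (adding liked-document terms, subtracting disliked ones) then becomes component-wise addition and subtraction of these vectors, which is essentially what MoreLikeThis automates.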
Re: Faceted Search using Lucene
Well the code looks fine. I can't explain why you see no search results if you don't call maybeReopen() in get, unless at the time you first create SearcherManager the Directories each have an empty index in them. Mike Amin Mohammed-Coleman wrote: Hi Here is the code that I am using, I've modified the get() method to include the maybeReopen() call. Again I'm not sure if this is a good idea. public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = get(); try { LOGGER.debug("Ensuring all index readers are up to date..."); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. 
totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocument baseDocument = new BaseDocument(doc, score); Summary documentSummary = new DocumentSummaryImpl(baseDocument); summaryList.add(documentSummary); } } catch (Exception e) { throw new IllegalStateException(e); } finally { if (multiSearcher != null) { release(multiSearcher); } } stopWatch.stop(); LOGGER.debug("total time taken for document seach: " + stopWatch.getTotalTimeMillis() + " ms"); return summaryList.toArray(new Summary[] {}); } @Autowired public void setDirectories(@Qualifier("directories")ListFactoryBean listFactoryBean) throws Exception { this.directories = (List) listFactoryBean.getObject(); } @PostConstruct public void initialiseDocumentSearcher() { StopWatch stopWatch = new StopWatch("document-search-initialiser"); stopWatch.start(); PerFieldAnalyzerWrapper analyzerWrapper = new PerFieldAnalyzerWrapper( analyzer); analyzerWrapper.addAnalyzer(FieldNameEnum.TYPE.getDescription(), newKeywordAnalyzer()); queryParser = newMultiFieldQueryParser(FieldNameEnum.fieldNameDescriptions(), analyzerWrapper); try { LOGGER.debug("Initialising document searcher "); documentSearcherManagers = new DocumentSearcherManager[directories.size()]; for (int i = 0; i < directories.size() ;i++) { Directory directory = directories.get(i); DocumentSearcherManager documentSearcherManager = newDocumentSearcherManager(directory); documentSearcherManagers[i]=documentSearcherManager; } LOGGER.debug("Document searcher initialised"); } catch (IOException e) { throw new IllegalStateException(e); } stopWatch.stop(); LOGGER.debug("Total time taken to initialise DocumentSearcher '" + stopWatch.getTotalTimeMillis() +"' ms."); } private void maybeReopen() throws SearchExecutionException { LOGGER.debug("Initiating reopening of index readers..."); for (DocumentSearcherManager documentSearcherManager : documentSearcherManagers) { try { 
documentSearcherManager.maybeReopen(); } catch (InterruptedException e) { throw new SearchExecutionException(e); } catch (IOException e) { throw new SearchExecutionException(e); } } LOGGER.debug("reopening of index readers complete."); } private void release(MultiSearcher multiSeacher) { IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSeacher.getSearchables(); for(int i =0 ; i < indexSearchers.length;i++) { try { documentSearcherManagers[i].release(indexSearchers[i]); } catch (IOException e) { throw new IllegalStateException(e); } } } private MultiSearcher get() throws SearchExecutionException { maybeReopen(); MultiSearcher multiSearcher = null; List listOfIndexSeachers = new ArrayList(); for (DocumentSearcherManager documentSearcherManager : documentSearcherManagers) { listOfIndexSeachers.add(documentSearcherManager.get()); } try { multiSearcher = new MultiSearcher(listOfIndexSeachers.toArray(newIndexSearcher[] {})); } catch (IOException e) { throw new SearchExecutionException(e); } return multiSearcher; } Hope there is enough information. Cheers Amin P.S. I will continue to debug. On Mon, Mar 2, 2009 at 6:55 PM, Michael McCandless < luc...@mikemccandless.com> wrote: It makes perfect sense to call maybeReopen() followed by get(), as long as maybeReopen() is never slow enough to be noticeable to an end user (because you are mak
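The maybeReopen()/get()/release() discipline running through this thread boils down to reference counting. A minimal pure-Java model with hypothetical names (the real SearcherManager-style code wraps an IndexSearcher, and maybeReopen() swaps in a fresh one while old searchers drain):

```java
// Model of the get()/release() pattern: get() hands out the current resource
// and bumps its reference count; release() drops it; the resource is only
// closed once the owner AND every outstanding get() have released it. This is
// why an in-flight search keeps working across a reopen.
public class RefCountedResource {
    private int refCount = 1;   // the owner's (manager's) reference
    private boolean closed = false;

    public synchronized RefCountedResource get() {
        refCount++;
        return this;
    }

    public synchronized void release() {
        if (--refCount == 0) {
            closed = true;      // real code: searcher.close()
        }
    }

    public synchronized boolean isClosed() {
        return closed;
    }

    public static void main(String[] args) {
        RefCountedResource current = new RefCountedResource();
        RefCountedResource searcher = current.get(); // a search in flight
        current.release();  // maybeReopen() swapped in a new resource
        System.out.println(current.isClosed()); // prints false: search still holds it
        searcher.release();
        System.out.println(current.isClosed()); // prints true: last reference gone
    }
}
```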
Re: Marking commit points as deleted does not clean up on IW.close
You mean on calling IndexWriter.close, with a deletion policy that's functionally equivalent to KeepOnlyLastCommitDeletionPolicy, you somehow see the last 2 commits remaining in the Directory once IndexWriter is done closing? That's odd. Are you sure onCommit() is really calling delete() on all the IndexCommits except the last one? Can you post the source for the deletion policy?

Mike

Shalin Shekhar Mangar wrote:

Hello, In Solr, when a user calls commit, the IndexWriter is closed (causing a commit). It is opened again only when another document is added or a delete is performed. In order to support replication, Solr trunk now uses a deletion policy. The default policy is (should be?) equivalent to KeepOnlyLastCommitDeletionPolicy. However, once a commit is performed, we see that the last two commit points are being kept. The second-to-last one is cleaned up only once the IndexWriter is opened again. It'd be great if someone could suggest what we might be doing wrong. For the time being, we can work around this by using IW.commit and keeping the IW open.
--
Regards,
Shalin Shekhar Mangar.
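What the default policy is supposed to do fits in a few lines. A pure-Java model of the onCommit() contract (the real hook receives a List of IndexCommit objects in chronological order and calls delete() on each stale one; names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Model of KeepOnlyLastCommitDeletionPolicy.onCommit(): given commits in
// chronological order, delete every commit except the newest. A policy that
// instead preserved the last two entries would reproduce the symptom Shalin
// describes (two commit points surviving IndexWriter.close()).
public class KeepOnlyLastCommit {
    public static List<String> onCommit(List<String> commits) {
        List<String> deleted = new ArrayList<>();
        for (int i = 0; i < commits.size() - 1; i++) {
            deleted.add(commits.get(i)); // real code: commits.get(i).delete()
        }
        return deleted;
    }

    public static void main(String[] args) {
        System.out.println(onCommit(List.of("segments_1", "segments_2", "segments_3")));
        // prints [segments_1, segments_2]
    }
}
```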
Re: Faceted Search using Lucene
Hi, here is the code that I am using; I've modified the get() method to include the maybeReopen() call. Again, I'm not sure if this is a good idea.

public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
  final String searchTerm = searchRequest.getSearchTerm();
  if (StringUtils.isBlank(searchTerm)) {
    throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
  }
  List summaryList = new ArrayList();
  StopWatch stopWatch = new StopWatch("searchStopWatch");
  stopWatch.start();
  MultiSearcher multiSearcher = get();
  try {
    LOGGER.debug("Ensuring all index readers are up to date...");
    Query query = queryParser.parse(searchTerm);
    LOGGER.debug("Search Term '" + searchTerm + "' > Lucene Query '" + query.toString() + "'");
    Sort sort = applySortIfApplicable(searchRequest);
    Filter[] filters = applyFiltersIfApplicable(searchRequest);
    ChainedFilter chainedFilter = null;
    if (filters != null) {
      chainedFilter = new ChainedFilter(filters, ChainedFilter.OR);
    }
    TopDocs topDocs = multiSearcher.search(query, chainedFilter, 100, sort);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    LOGGER.debug("total number of hits for [" + query.toString() + " ] = " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : scoreDocs) {
      final Document doc = multiSearcher.doc(scoreDoc.doc);
      float score = scoreDoc.score;
      final BaseDocument baseDocument = new BaseDocument(doc, score);
      Summary documentSummary = new DocumentSummaryImpl(baseDocument);
      summaryList.add(documentSummary);
    }
  } catch (Exception e) {
    throw new IllegalStateException(e);
  } finally {
    if (multiSearcher != null) {
      release(multiSearcher);
    }
  }
  stopWatch.stop();
  LOGGER.debug("total time taken for document search: " + stopWatch.getTotalTimeMillis() + " ms");
  return summaryList.toArray(new Summary[] {});
}

@Autowired
public void setDirectories(@Qualifier("directories") ListFactoryBean listFactoryBean) throws Exception {
  this.directories = (List) listFactoryBean.getObject();
}

@PostConstruct
public void initialiseDocumentSearcher() {
  StopWatch stopWatch = new StopWatch("document-search-initialiser");
  stopWatch.start();
  PerFieldAnalyzerWrapper analyzerWrapper = new PerFieldAnalyzerWrapper(analyzer);
  analyzerWrapper.addAnalyzer(FieldNameEnum.TYPE.getDescription(), new KeywordAnalyzer());
  queryParser = new MultiFieldQueryParser(FieldNameEnum.fieldNameDescriptions(), analyzerWrapper);
  try {
    LOGGER.debug("Initialising document searcher");
    documentSearcherManagers = new DocumentSearcherManager[directories.size()];
    for (int i = 0; i < directories.size(); i++) {
      Directory directory = directories.get(i);
      documentSearcherManagers[i] = new DocumentSearcherManager(directory);
    }
    LOGGER.debug("Document searcher initialised");
  } catch (IOException e) {
    throw new IllegalStateException(e);
  }
  stopWatch.stop();
  LOGGER.debug("Total time taken to initialise DocumentSearcher '" + stopWatch.getTotalTimeMillis() + "' ms.");
}

private void maybeReopen() throws SearchExecutionException {
  LOGGER.debug("Initiating reopening of index readers...");
  for (DocumentSearcherManager documentSearcherManager : documentSearcherManagers) {
    try {
      documentSearcherManager.maybeReopen();
    } catch (InterruptedException e) {
      throw new SearchExecutionException(e);
    } catch (IOException e) {
      throw new SearchExecutionException(e);
    }
  }
  LOGGER.debug("reopening of index readers complete.");
}

private void release(MultiSearcher multiSeacher) {
  IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSeacher.getSearchables();
  for (int i = 0; i < indexSearchers.length; i++) {
    try {
      documentSearcherManagers[i].release(indexSearchers[i]);
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }
}

private MultiSearcher get() throws SearchExecutionException {
  maybeReopen();
  MultiSearcher multiSearcher = null;
  List listOfIndexSeachers = new ArrayList();
  for (DocumentSearcherManager documentSearcherManager : documentSearcherManagers) {
    listOfIndexSeachers.add(documentSearcherManager.get());
  }
  try {
    multiSearcher = new MultiSearcher(listOfIndexSeachers.toArray(new IndexSearcher[] {}));
  } catch (IOException e) {
    throw new SearchExecutionException(e);
  }
  return multiSearcher;
}

Hope there is enough information. Cheers Amin P.S. I will continue to debug. On Mon, Mar 2, 2009 at 6:55 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > It makes perfect sense to call maybeReopen() followed by get(), as long as > maybeReopen() is never slow enough to be noticeable to an end user (because > you are making random queries pay the reopen/warming cost). > > If you call maybeReopen() after get(), then that search will not see the > newly opened readers, but the next search will. > > I'm just thinking that since you see no results with get() alone, debug > that case first. Then put back the maybeReopen().
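The get()/release() pairing in the code above follows the reference-counting discipline behind the SearcherManager pattern being discussed in this thread. A minimal, Lucene-free sketch of that contract (RefCounted is a hypothetical stand-in; a real manager would hold an IndexSearcher and close it when the count reaches zero):

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal reference-counted holder, mimicking the get()/release() contract. */
class RefCounted<T> {
    private final T resource;
    // starts at 1: the manager itself holds one reference
    private final AtomicInteger refs = new AtomicInteger(1);

    RefCounted(T resource) { this.resource = resource; }

    /** Callers must pair every get() with exactly one release(). */
    T get() {
        refs.incrementAndGet();
        return resource;
    }

    /** Returns true when the last reference is dropped and the resource can be closed. */
    boolean release() {
        return refs.decrementAndGet() == 0;
    }

    int refCount() { return refs.get(); }
}
```

The point of the pattern is that a searcher handed out by get() stays valid for the duration of one search even if a reopen swaps in a new one concurrently; release() lets the old searcher be closed only after the last in-flight search finishes.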
Marking commit points as deleted does not clean up on IW.close
Hello, In Solr, when a user calls commit, the IndexWriter is closed (causing a commit). It is opened again only when another document is added or, a delete is performed. In order to support replication, Solr trunk now uses a deletion policy. The default policy is (should be?) equivalent to KeepOnlyLastCommitDeletionPolicy. However, once a commit is performed, we see that the last two commit points are being kept back. The 2nd last one is cleaned up once the IndexWriter is opened again. It'd be great if someone can suggest on what we might be doing wrong. For the time being, we can work around this by using IW.commit and keeping the IW open. -- Regards, Shalin Shekhar Mangar.
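For reference, the behaviour a KeepOnlyLastCommitDeletionPolicy-equivalent is expected to have can be sketched without Lucene classes. FakeCommit and KeepLastOnly below are hypothetical stand-ins for IndexCommit and the policy's onCommit() callback, which receives the commit points ordered oldest first; if a policy does this and two commits still survive IndexWriter.close(), something else is retaining them:

```java
import java.util.List;

/** Stand-in for Lucene's IndexCommit; delete() flags the commit point for removal. */
class FakeCommit {
    boolean deleted;
    void delete() { deleted = true; }
}

/** The logic of keep-only-last: every commit except the newest gets deleted. */
class KeepLastOnly {
    static void onCommit(List<FakeCommit> commits) {
        // commits arrive ordered oldest-first; keep only the final entry
        for (int i = 0; i < commits.size() - 1; i++) {
            commits.get(i).delete();
        }
    }
}
```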
Re: Confidence scores at search time
On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: Finally, I seem unable to get Searcher.explain() to do much useful - my code looks like: Searcher searcher = new IndexSearcher(reader); QueryParser parser = new QueryParser(LuceneIndex.CONTENT, analyzer); Query query = parser.parse(queryString); TopDocCollector collector = new TopDocCollector(n); searcher.search(query, collector); for ( ScoreDoc d : collector.topDocs().scoreDocs ) { String explanation = searcher.explain(query, d.doc).toString(); Field id = searcher.doc( d.doc ).getField( LuceneIndex.ID ); System.out.println(id + "\t" + d.score + "\t" + explanation); } In the output, I get explanations like "0.88922405 = (MATCH) product of:" with no details. Perhaps I need to do something different in indexing? Explanation.toString() only returns the first line. You can use toString(int depth) or loop over all the getDetails(). toHtml() returns a decently formatted HTML tree of the whole explanation also. Erik - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Confidence scores at search time
Hi Grant, It's true, I may have an X-Y problem here. =) My basic need is to sacrifice recall to achieve greater precision. Rather than always presenting the user with the top N documents, I need to return *only* the documents that seem relevant. For some searches this may be 3 documents, for some it may be none. My user interface in this case isn't the standard "type words in a box and we'll show you the best docs" - I'm using Lucene as a tool in the background to do some exploration about how I could augment a set of traditional results with a few alternative results gleaned from a different path. Not sure if this helps with the X-Y problem, but that's my task at hand. Also, while perusing the threads you refer to below, I saw a reference to the following link, which seems to have gone dead: https://issues.apache.org/bugzilla/show_bug.cgi?id=31841 (in http://www.lucidimagination.com/search/document/1618ce933c8ebd6b ) Has the issue tracker moved somewhere else? Finally, I seem unable to get Searcher.explain() to do much useful - my code looks like: Searcher searcher = new IndexSearcher(reader); QueryParser parser = new QueryParser(LuceneIndex.CONTENT, analyzer); Query query = parser.parse(queryString); TopDocCollector collector = new TopDocCollector(n); searcher.search(query, collector); for ( ScoreDoc d : collector.topDocs().scoreDocs ) { String explanation = searcher.explain(query, d.doc).toString(); Field id = searcher.doc( d.doc ).getField( LuceneIndex.ID ); System.out.println(id + "\t" + d.score + "\t" + explanation); } In the output, I get explanations like "0.88922405 = (MATCH) product of:" with no details. Perhaps I need to do something different in indexing? Thanks, -Ken On 2/26/09 10:36 AM, "Grant Ingersoll" wrote: > I don't know of anyone doing work on it in the Lucene community. My > understanding to date is that it is not really worth trying, but that > may in fact be an outdated view. 
I haven't stayed up on the > literature on this subject, so background info on what you are > interested in would be helpful. > > Digging around in the archives a bit more, I come up with some more > relevant emails: > http://www.lucidimagination.com/search/?q=comparing+scores+across+searches#/ > p:lucene,solr/s:email > > What is the bigger problem that you are trying to solve? That is, you > imply that score comparison is the solution, but you haven't said the > problem you are trying to solve. > > Cheers, > Grant > > > On Feb 25, 2009, at 11:38 AM, Ken Williams wrote: > >> Hi all, >> >> I didn't get a response to this - not sure whether the question was >> ill-posed, or too-frequently-asked, or just not interesting. But if >> anyone >> could take a stab at it or let me know a different place to look, >> I'd really >> appreciate it. >> >> Thanks, >> >> -Ken >> >> >> On 2/20/09 12:00 PM, "Ken Williams" >> wrote: >> >>> Hi, >>> >>> Has there been any work done on getting confidence scores at >>> runtime, so >>> that scores of documents can be compared across queries? I found one >>> reference in the mailing list to some work in 2003, but couldn't >>> find any >>> follow-up: >>> >>> http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html >>> >>> Thanks. 
>> >> -- >> Ken Williams >> Research Scientist >> The Thomson Reuters Corporation >> Eagan, MN >> >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) > using Solr/Lucene: > http://www.lucidimagination.com/search > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Ken Williams Research Scientist The Thomson Reuters Corporation Eagan, MN - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Restricting the result set with hierarchical ACL
There are two ways to handle this: 1) At indexing time, expand the group tree and store all of a document's groups in the document, like "groups:1 2 3" 2) At indexing time, store only the exact group the document belongs to. Then at search time, expand the group tree to search all the groups the user belongs to, including the sub-groups. Approach 2 should be more flexible. I don't think a user will belong to so many groups that the default maxClauseCount of 1024 is exceeded. -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding! On Mon, Mar 2, 2009 at 7:58 AM, wrote: > Dear list > > I need to restrict the result list to the appropriate rights of the user > who is searching the index. > > A document may belong to several groups. > > A user must belong to all groups of the document to find it. There's one > additional problem: The groups are a tree. A user is automatically > in every parent group of his groups. For example A is a child of B, so a > user in group A would also be allowed to see documents of group B. > > And now I have no idea how to get a restricted search result from > lucene. There are about 10000 documents, so I'm not very happy to filter > them after the index was searched. > > I tried to get all allowed document ids (there's a field for the id) and > put them into a BooleanQuery (id1 or id2, ...), but then I get a > BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 > > So how can I restrict my search results with lucene? > > Markus Malkusch > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
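The search-time expansion of approach 2 can be sketched in plain Java. GroupExpander and the child-to-parent map are illustrative, not Lucene API; the expanded set would then feed a filter or query clause:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of approach 2: expand the user's groups up the tree at search time. */
class GroupExpander {
    /**
     * Returns the user's groups plus every ancestor group, given a
     * child -> parent map describing the group tree.
     */
    static Set<String> expand(Set<String> userGroups, Map<String, String> parentOf) {
        Set<String> all = new HashSet<>(userGroups);
        for (String g : userGroups) {
            String p = parentOf.get(g);
            while (p != null && all.add(p)) {  // climb until the root or an already-seen group
                p = parentOf.get(p);
            }
        }
        return all;
    }
}
```

A document is then visible if its stored groups are a subset of the expanded user groups, which matches the "user must belong to all groups of the document" rule.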
Re: Faceted Search using Lucene
It makes perfect sense to call maybeReopen() followed by get(), as long as maybeReopen() is never slow enough to be noticeable to an end user (because you are making random queries pay the reopen/warming cost). If you call maybeReopen() after get(), then that search will not see the newly opened readers, but the next search will. I'm just thinking that since you see no results with get() alone, debug that case first. Then put back the maybeReopen(). Can you post your full code at this point? Mike Amin Mohammed-Coleman wrote: Hi Just out of curiosity does it not make sense to call maybeReopen and then call get()? If I call get() then I have a new mulitsearcher, so a call to maybeopen won't reinitialise the multi searcher. Unless I pass the multi searcher into the maybereopen method. But somehow that doesn't make sense. I maybe missing something here. Cheers Amin On 2 Mar 2009, at 15:48, Amin Mohammed-Coleman wrote: I'm seeing some interesting behviour when i do get() first followed by maybeReopen then there are no documents in the directory (directory that i am interested in. When i do the maybeReopen and then get() then the doc count is correct. I can post stats later. Weird... On Mon, Mar 2, 2009 at 2:17 PM, Amin Mohammed-Coleman > wrote: oh dear...i think i may cry...i'll debug. On Mon, Mar 2, 2009 at 2:15 PM, Michael McCandless > wrote: Or even just get() with no call to maybeReopen(). That should work fine as well. Mike Amin Mohammed-Coleman wrote: In my test case I have a set up method that should populate the indexes before I start using the document searcher. I will start adding some more debug statements. So basically I should be able to do: get() followed by maybeReopen. I will let you know what the outcome is. Cheers Amin On Mon, Mar 2, 2009 at 1:39 PM, Michael McCandless < luc...@mikemccandless.com> wrote: Is it possible that when you first create the SearcherManager, there is no index in each Directory? If not... you better start adding diagnostics. 
EG inside your get(), print out the numDocs() of each IndexReader you get from the SearcherManager? Something is wrong and it's best to explain it... Mike Amin Mohammed-Coleman wrote: Nope. If i remove the maybeReopen the search doesn't work. It only works when i cal maybeReopen followed by get(). Cheers Amin On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: That's not right; something must be wrong. get() before maybeReopen() should simply let you search based on the searcher before reopening. If you just do get() and don't call maybeReopen() does it work? Mike Amin Mohammed-Coleman wrote: I noticed that if i do the get() before the maybeReopen then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try clause to the minimum that needs to be there. Ie, your code that creates a query, finds the right sort & filter to use, etc, can all happen outside the try, because you have not yet acquired the multiSearcher. If you do that, you also don't need the null check in the finally clause, because multiSearcher must be non-null on entering the try. Mike Amin Mohammed-Coleman wrote: Hi there Good morning! Here is the final search code: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. 
There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = null; try { LOGGER.debug("Ensuring all index readers are up to date..."); maybeReopen(); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } multiSearcher = get(); TopDocs topDocs = multiSearcher.search(query,chainedFilter , 100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocu
Re: Restricting the result set with hierarchical ACL
Hi Markus, I need to restrict the result set to the appropriate rights of the user who is searching the index. A document may belong to several groups. A user must belong to all groups of the document to find it. There's one additional problem: The groups are a tree. A user is automatically in every parent group of his groups. For example A is a child of B, so a user in group A would also be allowed to see documents of group B. And now I have no idea how to get a restricted search result from lucene. There are about 10000 documents, so I'm not very happy to filter them after the index was searched. Well, 10K is actually a small number of docs. And the real question is how many documents will typically be part of the found set, and thus in the set that needs to be filtered. So try that first, as that's the obvious approach (to me, at least). Note that for this type of filtering, the way that you do the calculation will have a performance impact - e.g. you might want to use bitfields versus iterating over group names in the stored field. Since the set of a document's groups has to be a complete subset of the user's groups, you can't use the typical approach of having a doc field with every group in it, then adding a required subclause to your query with every group as a boolean OR term. -- Ken -- Ken Krugler +1 530-210-6378 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Question on Proximity Search in Lucene Query
See page 88 in Lucene In Action for a fuller explanation, including ordering considerations. But basically, phrase query slop is the maximum number of "moves" required to get all the words next to each other in the proper order. If you can get all the words next to each other within slop moves, you succeed. So it's not pairwise: the slop applies to the phrase as a whole, not to each pair of words. I don't want to reproduce the example in the book, but that'd be the place to start. Best Erick On Mon, Mar 2, 2009 at 1:07 PM, Vasudevan Comandur wrote: > Hi All, > > I had posted the below-mentioned query a week back and I have not > received any response from the group so far. > I was wondering if this is a trivial question to the group or it has been > answered previously. > > I'd appreciate your answers; any pointers to the answers are also welcome. > > Regards > Vasu > > ** > > > Hi, > > I have a question on the proximity query usage in Lucene Query Syntax. > > The documentation says "W1 W2"~5 means W1 and W2 can occur within 5 words. > Here W1 & W2 represent words. > > What happens when I give "W1 W2 W3 W4"~25 as a proximity query? > > Does it mean that each word pair (W1, W2), (W1, W3), (W1, W4), (W2, W3), > (W2, W4), (W3, W4) can occur within 25 words? > > Looking forward to your reply. > > Regards > Vasu > > *** >
Question on Proximity Search in Lucene Query
Hi All, I had posted the below-mentioned query a week back and I have not received any response from the group so far. I was wondering if this is a trivial question to the group or it has been answered previously. I'd appreciate your answers; any pointers to the answers are also welcome. Regards Vasu ** Hi, I have a question on the proximity query usage in Lucene Query Syntax. The documentation says "W1 W2"~5 means W1 and W2 can occur within 5 words. Here W1 & W2 represent words. What happens when I give "W1 W2 W3 W4"~25 as a proximity query? Does it mean that each word pair (W1, W2), (W1, W3), (W1, W4), (W2, W3), (W2, W4), (W3, W4) can occur within 25 words? Looking forward to your reply. Regards Vasu ***
Re: Restricting the result set with hierarchical ACL
If you have a reasonable way of getting the doc IDs that your user is allowed to see (and it appears you do), you probably want a Filter. At root a Filter is just a BitSet where you turn on the bit for each document that *could* be allowed in the results and pass that filter to the appropriate search routine. CachingWrapperFilter may be your friend if you want to keep some of these filters around after you've created them. Erick On Mon, Mar 2, 2009 at 10:58 AM, wrote: > Dear list > > I need to restrict the result list to the appropriate rights of the user > who is searching the index. > > A document may belong to several groups. > > A user must belong to all groups of the document to find it. There's one > additional problem: The groups are a tree. A user is automatically > in every parent group of his groups. For example A is a child of B, so a > user in group A would also be allowed to see documents of group B. > > And now I have no idea how to get a restricted search result from > lucene. There are about 10000 documents, so I'm not very happy to filter > them after the index was searched. > > I tried to get all allowed document ids (there's a field for the id) and > put them into a BooleanQuery (id1 or id2, ...), but then I get a > BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 > > So how can I restrict my search results with lucene? > > Markus Malkusch > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
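Erick's point that a Filter is at root just a BitSet can be illustrated without any Lucene classes. AclBits below is a hypothetical sketch; in a real implementation the allowed-docs BitSet is what the Filter hands to the searcher, and the intersection happens inside Lucene:

```java
import java.util.BitSet;

/** Minimal sketch of BitSet-based restriction, independent of Lucene's Filter API. */
class AclBits {
    /** One bit per document; set bits mark docs the user may see. */
    static BitSet allowed(int maxDoc, int[] allowedDocIds) {
        BitSet bits = new BitSet(maxDoc);
        for (int id : allowedDocIds) bits.set(id);
        return bits;
    }

    /** Intersect raw hits with the allowed set; this is all a security filter does. */
    static BitSet restrict(BitSet hits, BitSet allowed) {
        BitSet out = (BitSet) hits.clone();
        out.and(allowed);
        return out;
    }
}
```

Because the BitSet is built once per user (and can be cached, e.g. via CachingWrapperFilter), this sidesteps the 1024-clause BooleanQuery limit entirely.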
Restricting the result set with hierarchical ACL
Dear list I need to restrict the result list to the appropriate rights of the user who is searching the index. A document may belong to several groups. A user must belong to all groups of the document to find it. There's one additional problem: The groups are a tree. A user is automatically in every parent group of his groups. For example A is a child of B, so a user in group A would also be allowed to see documents of group B. And now I have no idea how to get a restricted search result from lucene. There are about 10000 documents, so I'm not very happy to filter them after the index was searched. I tried to get all allowed document ids (there's a field for the id) and put them into a BooleanQuery (id1 or id2, ...), but then I get a BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 So how can I restrict my search results with lucene? Markus Malkusch - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Faceted Search using Lucene
Hi Just out of curiosity, does it not make sense to call maybeReopen and then call get()? If I call get() then I have a new MultiSearcher, so a call to maybeReopen won't reinitialise the MultiSearcher. Unless I pass the MultiSearcher into the maybeReopen method. But somehow that doesn't make sense. I may be missing something here. Cheers Amin On 2 Mar 2009, at 15:48, Amin Mohammed-Coleman wrote: I'm seeing some interesting behaviour: when I do get() first followed by maybeReopen, then there are no documents in the directory (the directory that I am interested in). When I do the maybeReopen and then get(), the doc count is correct. I can post stats later. Weird... On Mon, Mar 2, 2009 at 2:17 PM, Amin Mohammed-Coleman > wrote: oh dear...I think I may cry...I'll debug. On Mon, Mar 2, 2009 at 2:15 PM, Michael McCandless > wrote: Or even just get() with no call to maybeReopen(). That should work fine as well. Mike Amin Mohammed-Coleman wrote: In my test case I have a set up method that should populate the indexes before I start using the document searcher. I will start adding some more debug statements. So basically I should be able to do: get() followed by maybeReopen. I will let you know what the outcome is. Cheers Amin On Mon, Mar 2, 2009 at 1:39 PM, Michael McCandless < luc...@mikemccandless.com> wrote: Is it possible that when you first create the SearcherManager, there is no index in each Directory? If not... you better start adding diagnostics. EG inside your get(), print out the numDocs() of each IndexReader you get from the SearcherManager? Something is wrong and it's best to explain it... Mike Amin Mohammed-Coleman wrote: Nope. If I remove the maybeReopen the search doesn't work. It only works when I call maybeReopen followed by get(). Cheers Amin On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: That's not right; something must be wrong.
get() before maybeReopen() should simply let you search based on the searcher before reopening. If you just do get() and don't call maybeReopen() does it work? Mike Amin Mohammed-Coleman wrote: I noticed that if i do the get() before the maybeReopen then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try clause to the minimum that needs to be there. Ie, your code that creates a query, finds the right sort & filter to use, etc, can all happen outside the try, because you have not yet acquired the multiSearcher. If you do that, you also don't need the null check in the finally clause, because multiSearcher must be non-null on entering the try. Mike Amin Mohammed-Coleman wrote: Hi there Good morning! Here is the final search code: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. 
There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = null; try { LOGGER.debug("Ensuring all index readers are up to date..."); maybeReopen(); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } multiSearcher = get(); TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocument baseDocument = new BaseDocument(doc, score); Summary documentSummary = new DocumentSummaryImpl(baseDocument); summaryList.add(documentSummary); } } catch (Exception e) { throw new IllegalStateException(e); } finally { if (multiSearcher != null) { release(multiSearcher); } } stopWatch.stop(); LOGGER.debug("total time taken for document seach: " + stopWatch.getTotalTimeMillis() + " ms"); return summaryList.toArray(new Summary[] {}); } I hope this makes sense...thanks again! Cheers Amin On Sun, Mar 1, 2009 at 8:09 PM, Michael Mc
Re: N-grams with numbers and Shinglefilters
Yes, I understand now that I don't need a ShingleFilter. Yes, I will have many of these phrases in the documents... this is why I thought I shouldn't use Lucene fields. I will investigate further; your keyword approach sounds feasible, thanks for the tip. However, I presume I may need to normalize the phrases for the search phase, so it may not work. Keep in touch, -RB- On Mon, Mar 2, 2009 at 5:23 PM, Steven A Rowe wrote: > Hi Raymond, > > On 3/2/2009 at 10:09 AM, Raymond Balmès wrote: > > suppose I have a tri-gram, what I want to do is index the tri-gram > > "string digit1 digit2" as one indexing phrase, and not index each token > > separately. > > As long as you don't want any transformation performed on the phrase or its > components, you can add your phrase as a "keyword", i.e. a non-analyzed > string that will be indexed as-is. > > Unless your phrase field will be the only field on this document (pretty > unlikely), you'll want to use PerFieldAnalyzerWrapper[1] over > KeywordAnalyzer[2] for the phrase field, and whatever other analyzer you > like for the other document field(s). > > AFAICT, you don't need ShingleFilter. > > Steve > > [1] PerFieldAnalyzerWrapper: > http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html > [2] KeywordAnalyzer: > http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: Indexing synonyms for multiple words
Since Lucene doesn't represent/store an end position for a token, I don't think the index can properly represent SYN spanning two positions? I suppose you could encode this into payloads, and create a custom query that would look at the payload to enforce the constraint. Or, if you switch to doing SYN expansion only at runtime (not adding it to the index), that might work. Mike

Uwe Schindler wrote: I think his problem is that "SYN" is a synonym for the phrase "WORD1 WORD2". Using these positions, a phrase like "SYN WORD2" would also match (or cause other problems in queries that depend on the order of words). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

-Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Monday, March 02, 2009 4:07 PM To: java-user@lucene.apache.org Subject: Re: Indexing synonyms for multiple words

Shouldn't WORD2's position be 1 more than your SYN? Ie, don't you want these positions?: WORD1 2 WORD2 3 SYN 2 The position is the starting position of the token; Lucene doesn't store an ending position. Mike

Sumukh wrote: Hi, I'm fairly new to Lucene. I'd like to know how we can index synonyms for multiple words. This is the scenario: Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. Now assume the two words combined WORD1 WORD2 can be replaced by another word SYN. If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will follow SYN, which is incorrect; and the other way round if I place it after WORD2. If any of you have solved a similar problem, I'd be thankful if you could shed some light on the solution.
Regards, Sumukh - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
RE: Sort Collection of ScoreDocs
Perfect Thanks. Was also looking at org.apache.lucene.search.ScoreDocComparator Uwe Schindler wrote: > > How about java.util.Arrays.sort() on the array using a simple > Comparator with a compare() that returns -Float.compare(a.score, > b.score)? This is just about 7 lines of Java code. > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > >> -Original Message- >> From: Chetan Shah [mailto:chetankrs...@gmail.com] >> Sent: Monday, March 02, 2009 4:47 PM >> To: java-user@lucene.apache.org >> Subject: Sort Collection of ScoreDocs >> >> >> Is there an existing Utility class which will sort a collection of >> ScoreDocs >> ? I have a result set (array of ScoreDocs) stored in JVM and want to sort >> them by relevanceScore. I do not want to execute the query again. The >> stored >> result set is sorted by another term and hence the need. >> >> Would highly appreciate if you would please let me know how do I do so? >> >> Thanks, >> >> -Chetan >> -- >> View this message in context: http://www.nabble.com/Sort-Collection-of- >> ScoreDocs-tp22290563p22290563.html >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > -- View this message in context: http://www.nabble.com/Sort-Collection-of-ScoreDocs-tp22290563p22291550.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: queryNorm affect on score
If I set the boost=0 at query time and the query contains only terms with boost=0, the scores are NaN (because weight.queryNorm = 1/0 = infinity), instead of 0. Peter On Sun, Mar 1, 2009 at 9:27 PM, Erick Erickson wrote: > FWIW, Hossman pointed out that the difference between index and > query time boosts is that index time boosts on title, for instance, > express "I care about this document's title more than other documents' > titles [when it matches]" Query time boosts express "I care about matches > on the title field more than matches on other fields". > > Best > Erick > > On Sun, Mar 1, 2009 at 8:57 PM, Peter Keegan > wrote: > > > As suggested, I added a query-time boost of 0.0f to the 'literals' field > > (with index-time boost still there) and I did get the same scores for > both > > queries :) (there is a subtlety between index-time and query-time > boosting > > that I missed.) > > > > I also tried disabling the coord factor, but that had no effect on the > > score, when combined with the above. This seems ok in this example since > > the matching terms had boost = 0. > > > > Thanks Yonik, > > Peter > > > > > > > > On Sat, Feb 28, 2009 at 6:02 PM, Yonik Seeley < > yo...@lucidimagination.com > > >wrote: > > > > > On Sat, Feb 28, 2009 at 3:02 PM, Peter Keegan > > > wrote: > > > >> in situations where you deal with simple query types, and matching > > > query > > > > structures, the queryNorm > > > >> *can* be used to make scores semi-comparable. > > > > > > > > Hmm. My example used matching query structures. The only difference > was > > a > > > > single term in a field with zero weight that didn't exist in the > > matching > > > > document. But one score was 3X the other. > > > > > > But the zero boost was an index-time boost, and the queryNorm takes > > > into account query-time boosts and idfs. 
You might get closer to what > > > you expect with a query time boost of 0.0f > > > > > > The other thing affecting the score is the coord factor - the fact > > > that fewer of the optional terms matched (1/2) lowers the score. The > > > coordination factor can be disabled on any BooleanQuery. > > > > > > If you do both of the above, I *think* you would get the same scores > > > for this specific example. > > > > > > -Yonik > > > http://www.lucidimagination.com > > > > > > - > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > >
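Peter's NaN observation at the top of this thread follows directly from float arithmetic in the scoring formula, where queryNorm = 1/sqrt(sum of squared query weights). A minimal pure-Java sketch (no Lucene classes involved; the variable names are illustrative) reproduces it:

```java
public class QueryNormNaN {
    public static void main(String[] args) {
        // All query-time boosts are 0, so the sum of squared weights is 0.
        float sumOfSquaredWeights = 0.0f;
        float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
        System.out.println(queryNorm);          // Infinity

        // Each term's weight is 0 (boost of 0), and 0 * Infinity is NaN.
        float termWeight = 0.0f;
        float score = termWeight * queryNorm;
        System.out.println(score);              // NaN
        System.out.println(Float.isNaN(score)); // true
    }
}
```

This is why the scores come out as NaN rather than 0: the zero boosts zero out both the numerator and the normalizer.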
RE: N-grams with numbers and Shinglefilters
Hi Raymond, On 3/2/2009 at 10:09 AM, Raymond Balmès wrote: > suppose I have a tri-gram, what I want to do is index the tri-gram > "string digit1 digit2" as one indexing phrase, and not index each token > separately. As long as you don't want any transformation performed on the phrase or its components, you can add your phrase as a "keyword", i.e. a non-analyzed string that will be indexed as-is. Unless your phrase field will be the only field on this document (pretty unlikely), you'll want to use PerFieldAnalyzerWrapper[1] over KeywordAnalyzer[2] for the phrase field, and whatever other analyzer you like for the other document field(s). AFAICT, you don't need ShingleFilter. Steve [1] PerFieldAnalyzerWrapper: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html [2] KeywordAnalyzer: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
RE: Sort Collection of ScoreDocs
How about java.util.Arrays.sort() on the array using a simple Comparator with a compare() that returns -Float.compare(a.score, b.score)? This is just about 7 lines of Java code. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Chetan Shah [mailto:chetankrs...@gmail.com] > Sent: Monday, March 02, 2009 4:47 PM > To: java-user@lucene.apache.org > Subject: Sort Collection of ScoreDocs > > > Is there an existing Utility class which will sort a collection of > ScoreDocs > ? I have a result set (array of ScoreDocs) stored in JVM and want to sort > them by relevanceScore. I do not want to execute the query again. The > stored > result set is sorted by another term and hence the need. > > Would highly appreciate if you would please let me know how do I do so? > > Thanks, > > -Chetan > -- > View this message in context: http://www.nabble.com/Sort-Collection-of- > ScoreDocs-tp22290563p22290563.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
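Uwe's suggestion can be sketched in plain Java. The ScoreDoc class below is a minimal stand-in for Lucene's (doc id plus relevance score), just enough to show the descending sort by score with a negated Float.compare:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortByScore {
    // Minimal stand-in for Lucene's ScoreDoc: a doc id and its score.
    static class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    public static void main(String[] args) {
        ScoreDoc[] hits = {
            new ScoreDoc(1, 0.3f), new ScoreDoc(2, 0.9f), new ScoreDoc(3, 0.5f)
        };
        // Descending by score: negate Float.compare, as suggested on the list.
        Arrays.sort(hits, new Comparator<ScoreDoc>() {
            public int compare(ScoreDoc a, ScoreDoc b) {
                return -Float.compare(a.score, b.score);
            }
        });
        for (ScoreDoc sd : hits) {
            System.out.println(sd.doc + " " + sd.score); // docs 2, 3, 1 in that order
        }
    }
}
```

Since the hits are already in memory, this re-sorts them by relevance without re-executing the query, which is exactly what Chetan asked for.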
Restricting the result set with hierarchical ACL
Dear list, I need to restrict the result list to the rights of the user who is searching the index. A document may belong to several groups, and a user must belong to all groups of a document to find it. There's one additional problem: the groups form a tree, and a user is automatically in every parent group of his groups. For example, A is a child of B, so a user in group A would also be allowed to see documents of group B. And now I have no idea how to get a restricted search result from Lucene. There are about 1 documents, so I'm not very happy to filter them after the index was searched. I tried to get all allowed document ids (there's a field for the id) and put them into a BooleanQuery (id1 OR id2, ...), but then I get a BooleanQuery$TooManyClauses: maxClauseCount is set to 1024. So how can I restrict my search results with Lucene? Markus Malkusch - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
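One way to approach the group-tree part is to expand the user's groups with all their ancestors up front, and then constrain the search with a cached Lucene Filter built from the allowed group terms rather than a huge BooleanQuery of document ids, which avoids TooManyClauses entirely. Below is a pure-Java sketch of only the ancestor expansion (class and method names are hypothetical, not from any Lucene API):

```java
import java.util.*;

public class GroupClosure {
    // parentOf maps child group -> parent group (the groups form a tree).
    static Set<String> withAncestors(Set<String> userGroups, Map<String, String> parentOf) {
        Set<String> closure = new HashSet<String>();
        for (String g : userGroups) {
            // Walk up the tree from each group until the root is reached.
            for (String cur = g; cur != null; cur = parentOf.get(cur)) {
                closure.add(cur);
            }
        }
        return closure;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<String, String>();
        parentOf.put("A", "B"); // A is a child of B
        Set<String> effective = withAncestors(Collections.singleton("A"), parentOf);
        // A user in group A may also see documents of the parent group B.
        System.out.println(effective.containsAll(Arrays.asList("A", "B"))); // true
    }
}
```

The expanded set is small (bounded by tree depth times group count), so it can be turned into a per-user filter over an indexed group field instead of enumerating document ids.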
Sort Collection of ScoreDocs
Is there an existing Utility class which will sort a collection of ScoreDocs ? I have a result set (array of ScoreDocs) stored in JVM and want to sort them by relevanceScore. I do not want to execute the query again. The stored result set is sorted by another term and hence the need. Would highly appreciate if you would please let me know how do I do so? Thanks, -Chetan -- View this message in context: http://www.nabble.com/Sort-Collection-of-ScoreDocs-tp22290563p22290563.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
RE: Indexing synonyms for multiple words
I think his problem is, that "SYN" is a synonym for the phrase "WORD1 WORD2". Using these positions, a phrase like "SYN WORD2" would also match (or other problems in queries that depend on order of words). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Michael McCandless [mailto:luc...@mikemccandless.com] > Sent: Monday, March 02, 2009 4:07 PM > To: java-user@lucene.apache.org > Subject: Re: Indexing synonyms for multiple words > > > Shouldn't WORD2's position be 1 more than your SYN? > > Ie, don't you want these positions?: > > WORD1 2 > WORD2 3 > SYN 2 > > The position is the starting position of the token; Lucene doesn't > store an ending position > > Mike > > Sumukh wrote: > > > Hi, > > > > I'm fairly new to Lucene. I'd like to know how we can index synonyms > > for > > multiple words. > > > > This is the scenario: > > > > Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. > > > > Now assume the two words combined WORD1 WORD2 can be replaced by > > another > > word SYN. > > > > If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will > > follow SYN, > > which is incorrect; and the other way round if I place it after WORD2. > > > > If any of you have solved a similar problem, I'd be thankful if you > > could > > share some light on > > the solution. > > > > Regards, > > Sumukh > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
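Uwe's point can be demonstrated with a toy positional index and a naive exact-phrase check (plain Java, not Lucene's PhraseQuery): with SYN stacked at WORD1's position and no stored end position, the phrase "SYN WORD2" matches even though it never occurs in the text.

```java
import java.util.*;

public class PhraseOverPositions {
    // index maps term -> set of positions, mimicking one document's postings.
    static boolean phraseMatches(Map<String, Set<Integer>> index, String... phrase) {
        Set<Integer> starts = index.getOrDefault(phrase[0], Collections.emptySet());
        for (int start : starts) {
            boolean ok = true;
            // An exact phrase requires term i at position start + i.
            for (int i = 1; i < phrase.length; i++) {
                if (!index.getOrDefault(phrase[i], Collections.emptySet()).contains(start + i)) {
                    ok = false;
                    break;
                }
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // "AAA BBB WORD1 WORD2 ..." with SYN stacked at WORD1's position 2:
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("WORD1", Set.of(2));
        index.put("WORD2", Set.of(3));
        index.put("SYN",   Set.of(2));
        System.out.println(phraseMatches(index, "WORD1", "WORD2")); // true (intended)
        System.out.println(phraseMatches(index, "SYN", "WORD2"));   // true (the false positive Uwe describes)
    }
}
```

Because a token carries only a start position, there is no way for this index layout to record that SYN "covers" positions 2 and 3, which is exactly the gap Mike suggests filling with payloads or query-time expansion.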
Re: N-grams with numbers and Shinglefilters
Well, In the mean time I've looked at the details of the implementation and it gave me an idea for what I'm looking for : suppose I have a tri-gram, what I want to do is index the tri-gram "string digit1 digit2" as one indexing phrase, and not index each token separately. In the shingler filter, if I understood it correctly, tokens are separated by '_' whilst n-grams are separated by " ", that is the mechanism which I was missing. And of course I need my logic around to filter valid tri-grams but I don't need help for this, I can easily do that using regex for instance. My documents look like regular html or pdf pages although some of them contains those specific tri-grams. Thx, -RB- On Mon, Mar 2, 2009 at 2:37 PM, Steven A Rowe wrote: > Hi Raymond, > > On 3/1/2009, Raymond Balmès wrote: > > I'm trying to index (& search later) documents that contain tri-grams > > however they have the following form: > > > > <2 digit> <2 digit> > > > > Does the ShingleFilter work with numbers in the match ? > > Yes, though it is the tokenizer and previous filters in the chain that will > be the (potential) source of difficulties, not ShingleFilter. > > > Another complication, in future features I'd like to add optional > > digits like > > > > [<1 digit>] <2 digit> <2 digit> > > > > I suppose the ShingleFilter won't do it ? > > ShingleFilter just pastes together the tokens produced by the previous > component in the analysis chain, in a sliding window. As currently written, > it doesn't provide the sort of functionality you seem to be asking for. > > > Any better advice ? > > What do your documents look like? What do you hope to accomplish using > ShingleFilter? It's tough to give advice without knowing what you want to > do. > > Steve > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: Indexing synonyms for multiple words
Shouldn't WORD2's position be 1 more than your SYN? Ie, don't you want these positions?: WORD1 2 WORD2 3 SYN 2 The position is the starting position of the token; Lucene doesn't store an ending position Mike Sumukh wrote: Hi, I'm fairly new to Lucene. I'd like to know how we can index synonyms for multiple words. This is the scenario: Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. Now assume the two words combined WORD1 WORD2 can be replaced by another word SYN. If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will follow SYN, which is incorrect; and the other way round if I place it after WORD2. If any of you have solved a similar problem, I'd be thankful if you could share some light on the solution. Regards, Sumukh - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Indexing synonyms for multiple words
This has been discussed in the user list, so searching there might get you answer quicker. See: http://wiki.apache.org/lucene-java/MailingListArchives I don't remember the results, but... Best Erick On Mon, Mar 2, 2009 at 9:13 AM, Sumukh wrote: > Hi, > > I'm fairly new to Lucene. I'd like to know how we can index synonyms for > multiple words. > > This is the scenario: > > Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. > > Now assume the two words combined WORD1 WORD2 can be replaced by another > word SYN. > > If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will > follow SYN, > which is incorrect; and the other way round if I place it after WORD2. > > If any of you have solved a similar problem, I'd be thankful if you could > share some light on > the solution. > > Regards, > Sumukh >
Extracting TFIDF vectors
Hi, I'm a complete novice at Lucene, and I'm looking for a little bit of help with something. How can I extract the TF*IDF vector for each document in the indexed collection? Also for the query? I need to build a user-feedback system which manipulates the query based on the liked and disliked documents from the local collection. This query modification uses the TF*IDF vectors. Thanks for your help! -- Gregory Gay Editor - 4 Color Rebellion (http://www.4colorrebellion.com) Research Assistant - WVU CSEE
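In Lucene 2.4 one common route is to index the field with term vectors enabled and read the raw counts back via IndexReader.getTermFreqVector; the TF*IDF weighting itself is then a small computation. Below is a self-contained pure-Java sketch of that computation over an in-memory toy corpus, using the textbook idf = ln(N/df) variant (not Lucene's exact Similarity formula; class and method names are illustrative):

```java
import java.util.*;

public class TfIdf {
    // Build the tf*idf vector for one document in a tiny corpus.
    static Map<String, Double> vector(List<String[]> docs, int docIndex) {
        int n = docs.size();

        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs) {
            for (String t : new HashSet<>(Arrays.asList(doc))) {
                df.merge(t, 1, Integer::sum);
            }
        }

        // Term frequency within the chosen document.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : docs.get(docIndex)) tf.merge(t, 1, Integer::sum);

        // Weight each term by tf * ln(N / df).
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            v.put(e.getKey(), e.getValue() * Math.log((double) n / df.get(e.getKey())));
        }
        return v;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            new String[] {"lucene", "search", "search"},
            new String[] {"lucene", "index"});
        // "search" occurs twice in doc 0 and nowhere else, so it dominates;
        // "lucene" occurs everywhere, so its idf (and weight) is 0.
        System.out.println(vector(docs, 0));
    }
}
```

For relevance feedback, the query can be treated as one more term-count map and nudged toward the vectors of liked documents (Rocchio-style), which is the manipulation Gregory describes.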
Indexing synonyms for multiple words
Hi, I'm fairly new to Lucene. I'd like to know how we can index synonyms for multiple words. This is the scenario: Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. Now assume the two words combined WORD1 WORD2 can be replaced by another word SYN. If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will follow SYN, which is incorrect; and the other way round if I place it after WORD2. If any of you have solved a similar problem, I'd be thankful if you could share some light on the solution. Regards, Sumukh
Re: Faceted Search using Lucene
In my test case I have a set up method that should populate the indexes before I start using the document searcher. I will start adding some more debug statements. So basically I should be able to do: get() followed by maybeReopen. I will let you know what the outcome is. Cheers Amin On Mon, Mar 2, 2009 at 1:39 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > Is it possible that when you first create the SearcherManager, there is no > index in each Directory? > > If not... you better start adding diagnostics. EG inside your get(), print > out the numDocs() of each IndexReader you get from the SearcherManager? > > Something is wrong and it's best to explain it... > > > Mike > > Amin Mohammed-Coleman wrote: > > Nope. If i remove the maybeReopen the search doesn't work. It only works >> when i cal maybeReopen followed by get(). >> >> Cheers >> Amin >> >> On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < >> luc...@mikemccandless.com> wrote: >> >> >>> That's not right; something must be wrong. >>> >>> get() before maybeReopen() should simply let you search based on the >>> searcher before reopening. >>> >>> If you just do get() and don't call maybeReopen() does it work? >>> >>> >>> Mike >>> >>> Amin Mohammed-Coleman wrote: >>> >>> I noticed that if i do the get() before the maybeReopen then I get no >>> results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: There is no such thing as final code -- code is alive and is always > changing ;) > > It looks good to me. > > Though one trivial thing is: I would move the code in the try clause up > to > and including the multiSearcher=get() out above the try. I always > attempt > to "shrink wrap" what's inside a try clause to the minimum that needs > to > be > there. 
Ie, your code that creates a query, finds the right sort & > filter > to > use, etc, can all happen outside the try, because you have not yet > acquired > the multiSearcher. > > If you do that, you also don't need the null check in the finally > clause, > because multiSearcher must be non-null on entering the try. > > Mike > > Amin Mohammed-Coleman wrote: > > Hi there > > Good morning! Here is the final search code: >> >> public Summary[] search(final SearchRequest searchRequest) >> throwsSearchExecutionException { >> >> final String searchTerm = searchRequest.getSearchTerm(); >> >> if (StringUtils.isBlank(searchTerm)) { >> >> throw new SearchExecutionException("Search string cannot be empty. >> There >> will be too many results to process."); >> >> } >> >> List summaryList = new ArrayList(); >> >> StopWatch stopWatch = new StopWatch("searchStopWatch"); >> >> stopWatch.start(); >> >> MultiSearcher multiSearcher = null; >> >> try { >> >> LOGGER.debug("Ensuring all index readers are up to date..."); >> >> maybeReopen(); >> >> Query query = queryParser.parse(searchTerm); >> >> LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + >> query.toString() +"'"); >> >> Sort sort = null; >> >> sort = applySortIfApplicable(searchRequest); >> >> Filter[] filters =applyFiltersIfApplicable(searchRequest); >> >> ChainedFilter chainedFilter = null; >> >> if (filters != null) { >> >> chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); >> >> } >> >> multiSearcher = get(); >> >> TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); >> >> ScoreDoc[] scoreDocs = topDocs.scoreDocs; >> >> LOGGER.debug("total number of hits for [" + query.toString() + " ] = >> "+topDocs. 
>> totalHits); >> >> for (ScoreDoc scoreDoc : scoreDocs) { >> >> final Document doc = multiSearcher.doc(scoreDoc.doc); >> >> float score = scoreDoc.score; >> >> final BaseDocument baseDocument = new BaseDocument(doc, score); >> >> Summary documentSummary = new DocumentSummaryImpl(baseDocument); >> >> summaryList.add(documentSummary); >> >> } >> >> } catch (Exception e) { >> >> throw new IllegalStateException(e); >> >> } finally { >> >> if (multiSearcher != null) { >> >> release(multiSearcher); >> >> } >> >> } >> >> stopWatch.stop(); >> >> LOGGER.debug("total time taken for document seach: " + >> stopWatch.getTotalTimeMillis() + " ms"); >> >> return summaryList.toArray(new Summary[] {}); >> >> } >> >> >> >> I hope this makes sense...thanks again! >> >> >> Cheers >> >> Amin >> >> >> >> On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < >> luc.
Re: Faceted Search using Lucene
Is it possible that when you first create the SearcherManager, there is no index in each Directory? If not... you better start adding diagnostics. EG inside your get(), print out the numDocs() of each IndexReader you get from the SearcherManager? Something is wrong and it's best to explain it... Mike Amin Mohammed-Coleman wrote: Nope. If i remove the maybeReopen the search doesn't work. It only works when i cal maybeReopen followed by get(). Cheers Amin On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: That's not right; something must be wrong. get() before maybeReopen() should simply let you search based on the searcher before reopening. If you just do get() and don't call maybeReopen() does it work? Mike Amin Mohammed-Coleman wrote: I noticed that if i do the get() before the maybeReopen then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try clause to the minimum that needs to be there. Ie, your code that creates a query, finds the right sort & filter to use, etc, can all happen outside the try, because you have not yet acquired the multiSearcher. If you do that, you also don't need the null check in the finally clause, because multiSearcher must be non-null on entering the try. Mike Amin Mohammed-Coleman wrote: Hi there Good morning! Here is the final search code: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. 
There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = null; try { LOGGER.debug("Ensuring all index readers are up to date..."); maybeReopen(); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } multiSearcher = get(); TopDocs topDocs = multiSearcher.search(query,chainedFilter , 100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocument baseDocument = new BaseDocument(doc, score); Summary documentSummary = new DocumentSummaryImpl(baseDocument); summaryList.add(documentSummary); } } catch (Exception e) { throw new IllegalStateException(e); } finally { if (multiSearcher != null) { release(multiSearcher); } } stopWatch.stop(); LOGGER.debug("total time taken for document seach: " + stopWatch.getTotalTimeMillis() + " ms"); return summaryList.toArray(new Summary[] {}); } I hope this makes sense...thanks again! Cheers Amin On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: You're calling get() too many times. For every call to get() you must match with a call to release(). So, once at the front of your search method you should: MultiSearcher searcher = get(); then use that searcher to do searching, retrieve docs, etc. Then in the finally clause, pass that searcher to release. So, only one call to get() and one matching call to release(). 
Mike Amin Mohammed-Coleman wrote: Hi The searchers are injected into the class via Spring. So when a client calls the class it is fully configured with a list of index searchers. However I have removed this list and instead injecting a list of directories which are passed to the DocumentSearchManager. DocumentSearchManager is SearchManager (should've mentioned that earlier). So finally I have modified by release code to do the following: private void release(MultiSearcher multiSeacher) throws Exception { IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSeacher.getSearchables(); for(int i =0 ; i < indexSearchers.length;i++) { documentSearcherManagers[i].release(indexSearchers[i]); } } and it's use looks like this: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isB
RE: N-grams with numbers and Shinglefilters
Hi Raymond, On 3/1/2009, Raymond Balmès wrote: > I'm trying to index (& search later) documents that contain tri-grams > however they have the following form: > > <2 digit> <2 digit> > > Does the ShingleFilter work with numbers in the match ? Yes, though it is the tokenizer and previous filters in the chain that will be the (potential) source of difficulties, not ShingleFilter. > Another complication, in future features I'd like to add optional > digits like > > [<1 digit>] <2 digit> <2 digit> > > I suppose the ShingleFilter won't do it ? ShingleFilter just pastes together the tokens produced by the previous component in the analysis chain, in a sliding window. As currently written, it doesn't provide the sort of functionality you seem to be asking for. > Any better advice ? What do your documents look like? What do you hope to accomplish using ShingleFilter? It's tough to give advice without knowing what you want to do. Steve - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Faceted Search using Lucene
Nope. If i remove the maybeReopen the search doesn't work. It only works when i cal maybeReopen followed by get(). Cheers Amin On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > That's not right; something must be wrong. > > get() before maybeReopen() should simply let you search based on the > searcher before reopening. > > If you just do get() and don't call maybeReopen() does it work? > > > Mike > > Amin Mohammed-Coleman wrote: > > I noticed that if i do the get() before the maybeReopen then I get no >> results. But otherwise I can change it further. >> >> On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < >> luc...@mikemccandless.com> wrote: >> >> >>> There is no such thing as final code -- code is alive and is always >>> changing ;) >>> >>> It looks good to me. >>> >>> Though one trivial thing is: I would move the code in the try clause up >>> to >>> and including the multiSearcher=get() out above the try. I always >>> attempt >>> to "shrink wrap" what's inside a try clause to the minimum that needs to >>> be >>> there. Ie, your code that creates a query, finds the right sort & filter >>> to >>> use, etc, can all happen outside the try, because you have not yet >>> acquired >>> the multiSearcher. >>> >>> If you do that, you also don't need the null check in the finally clause, >>> because multiSearcher must be non-null on entering the try. >>> >>> Mike >>> >>> Amin Mohammed-Coleman wrote: >>> >>> Hi there >>> Good morning! Here is the final search code: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. 
There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = null; try { LOGGER.debug("Ensuring all index readers are up to date..."); maybeReopen(); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } multiSearcher = get(); TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocument baseDocument = new BaseDocument(doc, score); Summary documentSummary = new DocumentSummaryImpl(baseDocument); summaryList.add(documentSummary); } } catch (Exception e) { throw new IllegalStateException(e); } finally { if (multiSearcher != null) { release(multiSearcher); } } stopWatch.stop(); LOGGER.debug("total time taken for document seach: " + stopWatch.getTotalTimeMillis() + " ms"); return summaryList.toArray(new Summary[] {}); } I hope this makes sense...thanks again! Cheers Amin On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: You're calling get() too many times. For every call to get() you must > match with a call to release(). > > So, once at the front of your search method you should: > > MultiSearcher searcher = get(); > > then use that searcher to do searching, retrieve docs, etc. > > Then in the finally clause, pass that searcher to release. > > So, only one call to get() and one matching call to release(). 
> > Mike > > Amin Mohammed-Coleman wrote: > > Hi > > The searchers are injected into the class via Spring. So when a >> client >> calls the class it is fully configured with a list of index searchers. >> However I have removed this list and instead injecting a list of >> directories which are passed to the DocumentSearchManager. >> DocumentSearchManager is SearchManager (should've mentioned that >> earlier). >> So finally I have modified by release code to do the following: >> >> private void release(MultiSearcher multiSeacher) throws
Re: Faceted Search using Lucene
That's not right; something must be wrong. get() before maybeReopen() should simply let you search based on the searcher as it was before reopening. If you just do get() and don't call maybeReopen(), does it work? Mike

Amin Mohammed-Coleman wrote: I noticed that if I do the get() before the maybeReopen then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try clause to the minimum that needs to be there. I.e., your code that creates a query, finds the right sort & filter to use, etc., can all happen outside the try, because you have not yet acquired the multiSearcher. If you do that, you also don't need the null check in the finally clause, because multiSearcher must be non-null on entering the try. Mike

Amin Mohammed-Coleman wrote: Hi there Good morning! Here is the final search code:

    public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
      final String searchTerm = searchRequest.getSearchTerm();
      if (StringUtils.isBlank(searchTerm)) {
        throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
      }
      List summaryList = new ArrayList();
      StopWatch stopWatch = new StopWatch("searchStopWatch");
      stopWatch.start();
      MultiSearcher multiSearcher = null;
      try {
        LOGGER.debug("Ensuring all index readers are up to date...");
        maybeReopen();
        Query query = queryParser.parse(searchTerm);
        LOGGER.debug("Search Term '" + searchTerm + "' > Lucene Query '" + query.toString() + "'");
        Sort sort = applySortIfApplicable(searchRequest);
        Filter[] filters = applyFiltersIfApplicable(searchRequest);
        ChainedFilter chainedFilter = null;
        if (filters != null) {
          chainedFilter = new ChainedFilter(filters, ChainedFilter.OR);
        }
        multiSearcher = get();
        TopDocs topDocs = multiSearcher.search(query, chainedFilter, 100, sort);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        LOGGER.debug("total number of hits for [" + query.toString() + "] = " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : scoreDocs) {
          final Document doc = multiSearcher.doc(scoreDoc.doc);
          float score = scoreDoc.score;
          final BaseDocument baseDocument = new BaseDocument(doc, score);
          Summary documentSummary = new DocumentSummaryImpl(baseDocument);
          summaryList.add(documentSummary);
        }
      } catch (Exception e) {
        throw new IllegalStateException(e);
      } finally {
        if (multiSearcher != null) {
          release(multiSearcher);
        }
      }
      stopWatch.stop();
      LOGGER.debug("total time taken for document search: " + stopWatch.getTotalTimeMillis() + " ms");
      return summaryList.toArray(new Summary[] {});
    }

I hope this makes sense... thanks again! Cheers Amin

On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: You're calling get() too many times. For every call to get() you must match with a call to release(). So, once at the front of your search method you should: MultiSearcher searcher = get(); then use that searcher to do searching, retrieve docs, etc. Then in the finally clause, pass that searcher to release. So, only one call to get() and one matching call to release(). Mike

Amin Mohammed-Coleman wrote: Hi The searchers are injected into the class via Spring, so when a client calls the class it is fully configured with a list of index searchers. However, I have removed this list and am instead injecting a list of directories which are passed to the DocumentSearchManager. DocumentSearchManager is a SearchManager (should've mentioned that earlier). So finally I have modified my release code to do the following:

    private void release(MultiSearcher multiSearcher) throws Exception {
      IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSearcher.getSearchables();
      for (int i = 0; i < indexSearchers.length; i++) {
        documentSearcherManagers[i].release(indexSearchers[i]);
      }
    }

and its use looks like this:

    public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
      final String searchTerm = searchRequest.getSearchTerm();
      if (StringUtils.isBlank(searchTerm)) {
        throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
      }
      List summaryList = new ArrayList();
      StopWatch stopWatch = new StopWatch("searchStopWatch");
      stopWatch.start();
      List indexSearchers = new ArrayList();
      try {
        LOGGER.debug("Ensuring all index readers are up to date...");
        maybeReopen();
        LOGGER.debug("All Index Searchers are up to date. No of index searchers '" + indexSearchers.size() + "'");
        Query query = queryParser.parse(searchTerm);
        LOGGER.debug("Search Term '"
Re: Faceted Search using Lucene
I noticed that if i do the get() before the maybeReopen then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > > There is no such thing as final code -- code is alive and is always > changing ;) > > It looks good to me. > > Though one trivial thing is: I would move the code in the try clause up to > and including the multiSearcher=get() out above the try. I always attempt > to "shrink wrap" what's inside a try clause to the minimum that needs to be > there. Ie, your code that creates a query, finds the right sort & filter to > use, etc, can all happen outside the try, because you have not yet acquired > the multiSearcher. > > If you do that, you also don't need the null check in the finally clause, > because multiSearcher must be non-null on entering the try. > > Mike > > Amin Mohammed-Coleman wrote: > > Hi there >> Good morning! Here is the final search code: >> >> public Summary[] search(final SearchRequest searchRequest) >> throwsSearchExecutionException { >> >> final String searchTerm = searchRequest.getSearchTerm(); >> >> if (StringUtils.isBlank(searchTerm)) { >> >> throw new SearchExecutionException("Search string cannot be empty. 
There >> will be too many results to process."); >> >> } >> >> List summaryList = new ArrayList(); >> >> StopWatch stopWatch = new StopWatch("searchStopWatch"); >> >> stopWatch.start(); >> >> MultiSearcher multiSearcher = null; >> >> try { >> >> LOGGER.debug("Ensuring all index readers are up to date..."); >> >> maybeReopen(); >> >> Query query = queryParser.parse(searchTerm); >> >> LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + >> query.toString() +"'"); >> >> Sort sort = null; >> >> sort = applySortIfApplicable(searchRequest); >> >> Filter[] filters =applyFiltersIfApplicable(searchRequest); >> >> ChainedFilter chainedFilter = null; >> >> if (filters != null) { >> >> chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); >> >> } >> >> multiSearcher = get(); >> >> TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); >> >> ScoreDoc[] scoreDocs = topDocs.scoreDocs; >> >> LOGGER.debug("total number of hits for [" + query.toString() + " ] = >> "+topDocs. >> totalHits); >> >> for (ScoreDoc scoreDoc : scoreDocs) { >> >> final Document doc = multiSearcher.doc(scoreDoc.doc); >> >> float score = scoreDoc.score; >> >> final BaseDocument baseDocument = new BaseDocument(doc, score); >> >> Summary documentSummary = new DocumentSummaryImpl(baseDocument); >> >> summaryList.add(documentSummary); >> >> } >> >> } catch (Exception e) { >> >> throw new IllegalStateException(e); >> >> } finally { >> >> if (multiSearcher != null) { >> >> release(multiSearcher); >> >> } >> >> } >> >> stopWatch.stop(); >> >> LOGGER.debug("total time taken for document seach: " + >> stopWatch.getTotalTimeMillis() + " ms"); >> >> return summaryList.toArray(new Summary[] {}); >> >> } >> >> >> >> I hope this makes sense...thanks again! >> >> >> Cheers >> >> Amin >> >> >> >> On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < >> luc...@mikemccandless.com> wrote: >> >> >>> You're calling get() too many times. 
For every call to get() you must >>> match with a call to release(). >>> >>> So, once at the front of your search method you should: >>> >>> MultiSearcher searcher = get(); >>> >>> then use that searcher to do searching, retrieve docs, etc. >>> >>> Then in the finally clause, pass that searcher to release. >>> >>> So, only one call to get() and one matching call to release(). >>> >>> Mike >>> >>> Amin Mohammed-Coleman wrote: >>> >>> Hi >>> The searchers are injected into the class via Spring. So when a client calls the class it is fully configured with a list of index searchers. However I have removed this list and instead injecting a list of directories which are passed to the DocumentSearchManager. DocumentSearchManager is SearchManager (should've mentioned that earlier). So finally I have modified by release code to do the following: private void release(MultiSearcher multiSeacher) throws Exception { IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSeacher.getSearchables(); for(int i =0 ; i < indexSearchers.length;i++) { documentSearcherManagers[i].release(indexSearchers[i]); } } and it's use looks like this: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); List indexSearchers =
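Mike's "shrink wrap" advice above can be sketched generically (the names below are illustrative, not the poster's real classes): only the work that needs the acquired searcher lives inside the try, and because acquisition happens immediately before the try, the finally clause can release unconditionally with no null check.

```java
/**
 * Sketch of "shrink wrapping" a try clause: query preparation happens
 * before the try, the resource is acquired just before entering it, so
 * on entry the reference is known non-null and finally releases it
 * unconditionally. The log records the order of operations.
 */
public class ShrinkWrappedSearch {
    static final StringBuilder log = new StringBuilder();

    static String prepareQuery(String term) { log.append("parse;"); return term.trim(); }
    static Object acquire()                 { log.append("get;");   return new Object(); }
    static void release(Object searcher)    { log.append("release;"); }

    static String search(String term) {
        // Work that cannot leak a searcher goes first.
        String query = prepareQuery(term);
        // Acquire immediately before the try.
        Object searcher = acquire();
        try {
            log.append("search;");
            return "results for " + query;
        } finally {
            release(searcher); // no null check needed
        }
    }
}
```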
Re: Merging database index with fulltext index
Hi: The key to avoiding bad performance when merging a database result is to reduce the number of rows visited by your first query. As an example, take a look at these two queries using Lucene Domain Index; the two are equivalent:

Option A:

    select * from
      (select rownum as ntop_pos, q.* from
        (select extractValue(object_value,'/page/revision/timestamp'),
                extractValue(object_value,'/page/title')
         from pages
         where lcontains(object_value, 'musica') > 0
           and extractValue(object_value,'/page/revision/timestamp')
               between TO_TIMESTAMP_TZ('06-JAN-07 12.20.05.0 PM +00:00')
                   and TO_TIMESTAMP_TZ('17-JUL-07 11.47.38.0 AM +00:00')
         order by extractValue(object_value,'/page/revision/timestamp')) q)
    where ntop_pos >= 20 and ntop_pos <= 30;

Option B:

    select /*+ DOMAIN_INDEX_SORT */
           extractValue(object_value,'/page/revision/timestamp'),
           extractValue(object_value,'/page/title')
    from pages
    where lcontains(object_value,
           'rownum:[20 TO 30] AND musica AND revisionDate:[20070101 TO 20070718]',
           'revisionDate') > 0;

The first query uses traditional SQL syntax for filtering, sorting and pagination (Oracle Top-N syntax); the second resolves the filtering (revisionDate:[20070101 TO 20070718]), sorting (revisionDate) and pagination (rownum:[20 TO 30], Lucene Domain Index syntax) inside the Lucene Domain Index. Run over a subset (around 32,000 pages) of the Wikipedia dumps uploaded into an Oracle 11g, the first option takes 4 minutes and the second 55 milliseconds. The big difference is how many rows the DB needs to visit and then discard: the first option performs 2,900,671 buffer gets (disk blocks loaded into memory) versus 21 for the second. In the second execution plan the optimizer receives exactly the 10 rows to return from the Domain Index. So, no matter the technology used, the more you can filter in the index, the faster the query will be.
Obviously there will be queries for which this rule does not hold; for example, if you have a bitmap index on some column, querying the bitmap index first could be faster than a Domain Index scan, but the optimizer knows the truth. Best regards, Marcelo. PS: If you need more information about how to use Lucene Domain Index or how it works inside Oracle, please take a look at: http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg On Sat, Feb 28, 2009 at 5:07 PM, wrote: > Hi, > > what is the best approach to merge a database index with a lucene fulltext > index? Both databases store a unique ID per doc. This is the join criteria. > > requirements: > > * both resultsets may be very big (100.000 and much more) > * the merged resultset must be sorted by database index and/or relevance > * optional paging the merged resultset, a page has a size of 1000 docs max. > > example: > > select a, b from dbtable where c = 'foo' and content='bar' order by > relevance, a desc, d > > I would split this into: > > database: select ID, a, b from dbtable where c = 'foo' order by a desc, d > lucene: content:bar (sort:relevance) > merge: loop over the lucene resultset and add the db record into a new list > if the ID matches. > > If the resultset must be paged: > > database: select ID from dbtable where c = 'foo' order by a desc, d > lucene: content:bar (sort:relevance) > merge: loop over the lucene resultset and add the db record into a new list > if the ID matches. > page 1: select a,b from dbtable where ID IN (list of the ID's of page 1) > page 2: select a,b from dbtable where ID IN (list of the ID's of page 2) > ... > > > Is there a better way? > > Thank you. > -- Marcelo F. Ochoa http://marceloochoa.blogspot.com/ http://marcelo.ochoa.googlepages.com/home __ Want to integrate Lucene and Oracle?
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html Is Oracle 11g REST ready? http://marceloochoa.blogspot.com/2008/02/is-oracle-11g-rest-ready.html
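The merge step sketched in the question (loop over the Lucene result set in score order, keep rows whose ID also appears in the database result, then cut a page out of the merged list) can be written out as a small self-contained sketch; the IDs and page size below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/**
 * Sketch of merging a Lucene hit list with a database result by shared
 * ID: keep Lucene's (relevance) order, filter by membership in the DB
 * result, then page. With very large result sets a HashSet makes the
 * membership test O(1) per hit.
 */
public class ResultMerger {
    /** luceneIdsInScoreOrder: hit IDs, best first; dbIds: IDs matching the SQL WHERE clause. */
    static List<String> merge(List<String> luceneIdsInScoreOrder, Set<String> dbIds) {
        List<String> merged = new ArrayList<>();
        for (String id : luceneIdsInScoreOrder) {
            if (dbIds.contains(id)) {
                merged.add(id);
            }
        }
        return merged;
    }

    /** pageNo is 1-based; returns an empty list past the end. */
    static List<String> page(List<String> merged, int pageNo, int pageSize) {
        int from = (pageNo - 1) * pageSize;
        if (from >= merged.size()) {
            return Collections.emptyList();
        }
        return merged.subList(from, Math.min(from + pageSize, merged.size()));
    }
}
```

Each page of IDs would then feed a `select ... where ID IN (...)` to fetch the row data, as in the question.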
Re: Faceted Search using Lucene
There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try clause to the minimum that needs to be there. I.e., your code that creates a query, finds the right sort & filter to use, etc., can all happen outside the try, because you have not yet acquired the multiSearcher. If you do that, you also don't need the null check in the finally clause, because multiSearcher must be non-null on entering the try. Mike

Amin Mohammed-Coleman wrote: Hi there Good morning! Here is the final search code:

    public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
      final String searchTerm = searchRequest.getSearchTerm();
      if (StringUtils.isBlank(searchTerm)) {
        throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
      }
      List summaryList = new ArrayList();
      StopWatch stopWatch = new StopWatch("searchStopWatch");
      stopWatch.start();
      MultiSearcher multiSearcher = null;
      try {
        LOGGER.debug("Ensuring all index readers are up to date...");
        maybeReopen();
        Query query = queryParser.parse(searchTerm);
        LOGGER.debug("Search Term '" + searchTerm + "' > Lucene Query '" + query.toString() + "'");
        Sort sort = applySortIfApplicable(searchRequest);
        Filter[] filters = applyFiltersIfApplicable(searchRequest);
        ChainedFilter chainedFilter = null;
        if (filters != null) {
          chainedFilter = new ChainedFilter(filters, ChainedFilter.OR);
        }
        multiSearcher = get();
        TopDocs topDocs = multiSearcher.search(query, chainedFilter, 100, sort);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        LOGGER.debug("total number of hits for [" + query.toString() + "] = " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : scoreDocs) {
          final Document doc = multiSearcher.doc(scoreDoc.doc);
          float score = scoreDoc.score;
          final BaseDocument baseDocument = new BaseDocument(doc, score);
          Summary documentSummary = new DocumentSummaryImpl(baseDocument);
          summaryList.add(documentSummary);
        }
      } catch (Exception e) {
        throw new IllegalStateException(e);
      } finally {
        if (multiSearcher != null) {
          release(multiSearcher);
        }
      }
      stopWatch.stop();
      LOGGER.debug("total time taken for document search: " + stopWatch.getTotalTimeMillis() + " ms");
      return summaryList.toArray(new Summary[] {});
    }

I hope this makes sense... thanks again! Cheers Amin

On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: You're calling get() too many times. For every call to get() you must match with a call to release(). So, once at the front of your search method you should: MultiSearcher searcher = get(); then use that searcher to do searching, retrieve docs, etc. Then in the finally clause, pass that searcher to release. So, only one call to get() and one matching call to release(). Mike

Amin Mohammed-Coleman wrote: Hi The searchers are injected into the class via Spring, so when a client calls the class it is fully configured with a list of index searchers. However, I have removed this list and am instead injecting a list of directories which are passed to the DocumentSearchManager. DocumentSearchManager is a SearchManager (should've mentioned that earlier). So finally I have modified my release code to do the following:

    private void release(MultiSearcher multiSearcher) throws Exception {
      IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSearcher.getSearchables();
      for (int i = 0; i < indexSearchers.length; i++) {
        documentSearcherManagers[i].release(indexSearchers[i]);
      }
    }

and its use looks like this:

    public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
      final String searchTerm = searchRequest.getSearchTerm();
      if (StringUtils.isBlank(searchTerm)) {
        throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
      }
      List summaryList = new ArrayList();
      StopWatch stopWatch = new StopWatch("searchStopWatch");
      stopWatch.start();
      List indexSearchers = new ArrayList();
      try {
        LOGGER.debug("Ensuring all index readers are up to date...");
        maybeReopen();
        LOGGER.debug("All Index Searchers are up to date. No of index searchers '" + indexSearchers.size() + "'");
        Query query = queryParser.parse(searchTerm);
        LOGGER.debug("Search Term '" + searchTerm + "' > Lucene Query '" + query.toString() + "'");
        Sort sort = applySortIfApplicable(searchRequest);
        Filter[] filters = applyFiltersIfApplicable(searchRequest);
        ChainedFilter chainedFilter = null;
        if (filters != null) {
          chainedFilter = new ChainedFilter(filters, ChainedFilter.OR);
        }
        TopDocs topDocs = get().search(query, chainedFilter, 100, sort);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        LOGGER.debug("total number of h
Re: Adding another factor to Lucene search
Hi Document.setBoost(float boost), where boost is either your score as is or a value based on that score, might do the trick for you. Other boosting and custom score options include BoostingQuery, BoostingTermQuery and CustomScoreQuery. A Google search for "lucene boosting" throws up lots of hits. -- Ian. On Mon, Mar 2, 2009 at 10:05 AM, liat oren wrote: > Hi, > > I would like to add to lucene's score another factor - a score between > words. > I have an index that holds couple of words with their score. > How can I take it into account when using Lucene search? > > Many thanks, > Liat >
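As a purely conceptual sketch of what a custom score does (plain Java, not the CustomScoreQuery API; the combination rule here is an assumption, not Lucene's): treat the external word-association score as a multiplicative boost on top of Lucene's relevance score, so documents the external index also likes are promoted.

```java
/**
 * Illustrative combination of a Lucene relevance score with an external
 * per-word score. An external score of 0 leaves the Lucene score
 * unchanged; larger values boost it proportionally.
 */
public class ScoreCombiner {
    /** externalScore >= 0; returned value preserves Lucene's ordering when externalScore is constant. */
    static float combine(float luceneScore, float externalScore) {
        return luceneScore * (1.0f + externalScore);
    }
}
```

With CustomScoreQuery the same idea is expressed by overriding the score-combination hook instead of post-processing hits.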
Re: search by word offset
Not sure what you are asking about, but you might want to take a look at http://lucene.apache.org/java/2_4_0/api/contrib-surround/index.html The Surround parser offers many features around the span query (which I suspect is what you are looking for) Shashi On Mon, Mar 2, 2009 at 4:57 AM, shb wrote: > > hi i need help. > > i need to search by word in sentences with lucene. for example by the word > "bbb" i got the right results of all the sentences : > > "text ok ok ok bbb" , "text 2 bbb text " , "bbb text 4...". > > but i need the result by the word offset in the sentence like this: > > "bbb text 4...". , "text 2 bbb text " , "text 1 ok ok ok bbb" .. > > waiting for ideas.. thanks.. > > > -- > View this message in context: > http://www.nabble.com/search-by-word-offset-tp22284787p22284787.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Adding another factor to Lucene search
Hi, I would like to add another factor to Lucene's score - a score between words. I have an index that holds pairs of words with their score. How can I take it into account when using Lucene search? Many thanks, Liat
search by word offset
Hi, I need help. I need to search by word in sentences with Lucene. For example, for the word "bbb" I get the right results, all the sentences: "text ok ok ok bbb", "text 2 bbb text", "bbb text 4...". But I need the results ordered by the word's offset in the sentence, like this: "bbb text 4...", "text 2 bbb text", "text 1 ok ok ok bbb". Waiting for ideas... thanks.
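Conceptually, the ordering asked for here ranks matching sentences by the token offset of the query word's first occurrence. In Lucene itself this is span-query territory (e.g. SpanFirstQuery, as suggested in the reply above), but the desired ordering can be shown in a few lines of plain Java:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative ranking of sentences by the position of the query word:
 * keep only sentences containing the word, earliest occurrence first.
 * Tokenization here is a naive whitespace split, for demonstration only.
 */
public class OffsetRanker {
    /** Token offset of the first occurrence of word, or Integer.MAX_VALUE if absent. */
    static int firstOffset(String sentence, String word) {
        String[] tokens = sentence.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(word)) {
                return i;
            }
        }
        return Integer.MAX_VALUE;
    }

    /** Sentences containing word, ordered by ascending first-occurrence offset. */
    static List<String> rank(List<String> sentences, String word) {
        List<String> hits = new ArrayList<>();
        for (String s : sentences) {
            if (firstOffset(s, word) != Integer.MAX_VALUE) {
                hits.add(s);
            }
        }
        hits.sort(Comparator.comparingInt(s -> firstOffset(s, word)));
        return hits;
    }
}
```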
Re: Faceted Search using Lucene
Hi there Good morning! Here is the final search code:

    public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
      final String searchTerm = searchRequest.getSearchTerm();
      if (StringUtils.isBlank(searchTerm)) {
        throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
      }
      List summaryList = new ArrayList();
      StopWatch stopWatch = new StopWatch("searchStopWatch");
      stopWatch.start();
      MultiSearcher multiSearcher = null;
      try {
        LOGGER.debug("Ensuring all index readers are up to date...");
        maybeReopen();
        Query query = queryParser.parse(searchTerm);
        LOGGER.debug("Search Term '" + searchTerm + "' > Lucene Query '" + query.toString() + "'");
        Sort sort = applySortIfApplicable(searchRequest);
        Filter[] filters = applyFiltersIfApplicable(searchRequest);
        ChainedFilter chainedFilter = null;
        if (filters != null) {
          chainedFilter = new ChainedFilter(filters, ChainedFilter.OR);
        }
        multiSearcher = get();
        TopDocs topDocs = multiSearcher.search(query, chainedFilter, 100, sort);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        LOGGER.debug("total number of hits for [" + query.toString() + "] = " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : scoreDocs) {
          final Document doc = multiSearcher.doc(scoreDoc.doc);
          float score = scoreDoc.score;
          final BaseDocument baseDocument = new BaseDocument(doc, score);
          Summary documentSummary = new DocumentSummaryImpl(baseDocument);
          summaryList.add(documentSummary);
        }
      } catch (Exception e) {
        throw new IllegalStateException(e);
      } finally {
        if (multiSearcher != null) {
          release(multiSearcher);
        }
      }
      stopWatch.stop();
      LOGGER.debug("total time taken for document search: " + stopWatch.getTotalTimeMillis() + " ms");
      return summaryList.toArray(new Summary[] {});
    }

I hope this makes sense... thanks again! Cheers Amin On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > You're calling get() too many times.
For every call to get() you must > match with a call to release(). > > So, once at the front of your search method you should: > > MultiSearcher searcher = get(); > > then use that searcher to do searching, retrieve docs, etc. > > Then in the finally clause, pass that searcher to release. > > So, only one call to get() and one matching call to release(). > > Mike > > Amin Mohammed-Coleman wrote: > > Hi >> The searchers are injected into the class via Spring. So when a client >> calls the class it is fully configured with a list of index searchers. >> However I have removed this list and instead injecting a list of >> directories which are passed to the DocumentSearchManager. >> DocumentSearchManager is SearchManager (should've mentioned that earlier). >> So finally I have modified by release code to do the following: >> >> private void release(MultiSearcher multiSeacher) throws Exception { >> >> IndexSearcher[] indexSearchers = (IndexSearcher[]) >> multiSeacher.getSearchables(); >> >> for(int i =0 ; i < indexSearchers.length;i++) { >> >> documentSearcherManagers[i].release(indexSearchers[i]); >> >> } >> >> } >> >> >> and it's use looks like this: >> >> >> public Summary[] search(final SearchRequest searchRequest) >> throwsSearchExecutionException { >> >> final String searchTerm = searchRequest.getSearchTerm(); >> >> if (StringUtils.isBlank(searchTerm)) { >> >> throw new SearchExecutionException("Search string cannot be empty. There >> will be too many results to process."); >> >> } >> >> List summaryList = new ArrayList(); >> >> StopWatch stopWatch = new StopWatch("searchStopWatch"); >> >> stopWatch.start(); >> >> List indexSearchers = new ArrayList(); >> >> try { >> >> LOGGER.debug("Ensuring all index readers are up to date..."); >> >> maybeReopen(); >> >> LOGGER.debug("All Index Searchers are up to date. 
No of index searchers '" >> + >> indexSearchers.size() +"'"); >> >> Query query = queryParser.parse(searchTerm); >> >> LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + >> query.toString() +"'"); >> >> Sort sort = null; >> >> sort = applySortIfApplicable(searchRequest); >> >> Filter[] filters =applyFiltersIfApplicable(searchRequest); >> >> ChainedFilter chainedFilter = null; >> >> if (filters != null) { >> >> chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); >> >> } >> >> TopDocs topDocs = get().search(query,chainedFilter ,100,sort); >> >> ScoreDoc[] scoreDocs = topDocs.scoreDocs; >> >> LOGGER.debug("total number of hits for [" + query.toString() + " ] = >> "+topDocs. >> totalHits); >> >> for (ScoreDoc scoreDoc : scoreDocs) { >> >> final Document doc = get().doc(scoreDoc.doc); >> >> float score = scoreDoc.score; >> >> final BaseDocument baseDocument = new BaseDocument(doc, score); >> >> Summary documentSummary = new DocumentSummaryImpl(baseDocument); >> >> summaryList.add(documentSummary); >> >> } >> >> } catch (Exception e) { >> >> throw new IllegalStateException(e); >> >> } finally { >> >>