Big problem with Solr on a production server
Hi everybody: I have a big problem with the amount of memory Solr is using on a server. I would like to know how to configure it to use a limited amount of memory. I am starting Solr with the java -jar start.jar command on an Ubuntu server, and the start.jar process is using 7 GB of memory, which is considerably hurting the server's performance. Could you help me, please? Thanks in advance. Regards, Ariel
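A common way to cap this is to limit the JVM heap when launching Jetty; the 512 MB figure below is only an illustrative value, not a recommendation from this thread:

```shell
# Start Solr's bundled Jetty with an explicit heap cap.
# -Xms sets the initial heap, -Xmx the maximum the JVM may use;
# tune these to your index size and query load.
java -Xms256m -Xmx512m -jar start.jar
```

Note that the process's resident size can still exceed -Xmx somewhat (permgen, thread stacks, OS buffers), but the heap itself will be bounded.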
How can I delete documents from an index, and how can I refresh the remote multisearcher so the deleted docs no longer show up in the search results?
Hi everybody: I am using Lucene version 2.3.2 to index and search my documents. The problem is that I have a remote search server implemented this way:

[code]
Searcher parallelSearcher;
try {
    parallelSearcher = new ParallelMultiSearcher(searchables);
    parallelImpl = new RemoteSearchable(parallelSearcher);
    Naming.rebind(rmiUrlSearch, parallelImpl);
} catch (RemoteException e) {
    log.error(ERROR, e);
} catch (MalformedURLException e) {
    log.error(ERROR, e);
} catch (IOException e) {
    log.error(ERROR, e);
}
[/code]

Then a client on another host connects to the search server to obtain the search results. But when a document is deleted from the indexes on the search server, the deleted documents still appear in the search results; the only way to make them disappear is to restart the RMI service on the search server. Could you please help me figure out how to make deleted documents stop appearing in the search results as soon as they are deleted? I hope you can help me. Regards, Ariel - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
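One likely cause: an open Searcher works on a point-in-time snapshot of the index, so deletions made afterwards are invisible to it until a fresh searcher is opened. A minimal sketch of delete-then-rebind, assuming a unique "id" field, the index path, and the rmiUrlSearch variable from the snippet above:

```java
import java.rmi.Naming;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.RemoteSearchable;
import org.apache.lucene.search.Searchable;

// Sketch for Lucene 2.3.x: delete by a unique "id" field (an assumed
// field name), then rebuild the searcher and rebind it under the same
// RMI name so new client lookups see the change.
public class RefreshRemoteSearcher {
    static void deleteAndRefresh(String indexDir, String docId,
                                 String rmiUrlSearch) throws Exception {
        // 1) Delete against the index itself.
        IndexReader reader = IndexReader.open(indexDir);
        reader.deleteDocuments(new Term("id", docId));
        reader.close(); // flushes the deletions to disk

        // 2) The old Searcher still sees its snapshot, so open a fresh
        //    one over the updated index and rebind it.
        Searchable[] searchables = { new IndexSearcher(indexDir) };
        RemoteSearchable remote =
            new RemoteSearchable(new ParallelMultiSearcher(searchables));
        Naming.rebind(rmiUrlSearch, remote);
    }
}
```

The old searcher should also be closed once no client is using it; how to do that safely depends on your RMI setup.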
Problem with ranking in Lucene
Hi everybody: I have a question about Lucene's ranking. Here is the problem: when I search my index for bank OR transference, I get 10 results. The first two documents returned have both terms in the content field, but the 3rd, 4th and 5th have only the word bank, and then the 6th is a document that has both terms. Why is this happening? Isn't a search with the OR operator supposed to return first the documents that have both terms, and only then the documents that have just one of the two? I am indexing two fields, title and content, and searching both of them with a MultiFieldQuery, using the same analyzer for indexing and searching. I hope you can help me. Thanks in advance. Regards, Ariel
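Lucene's Searcher.explain shows exactly why a one-term document can outscore a two-term one: term frequency and the field-length norm can outweigh an extra matching clause. A minimal sketch, assuming the searcher and query from the question above:

```java
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

// Sketch: print Lucene's score breakdown (tf, idf, fieldNorm per
// clause) for the top hits. "searcher" and "query" stand for the
// MultiFieldQuery search described above.
class ExplainTopHits {
    static void explainTopHits(Searcher searcher, Query query)
            throws Exception {
        Hits hits = searcher.search(query);
        int n = Math.min(5, hits.length());
        for (int i = 0; i < n; i++) {
            Explanation exp = searcher.explain(query, hits.id(i));
            System.out.println(exp.toString());
        }
    }
}
```

Comparing the fieldNorm values of the single-term documents against the two-term ones usually pinpoints the cause.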
How can I make Lucene use the AND operator between terms by default?
When I do a search, Lucene internally uses the OR operator between terms by default. How can I make Lucene use the AND operator between terms by default? Regards, Ariel
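The query parser has a setting for this. A minimal sketch for Lucene 2.3.x; the field name "content" is only an example:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Sketch: make the parser require all terms (AND) instead of the
// default OR. The same call works on a MultiFieldQueryParser
// instance, but not via the static MultiFieldQueryParser.parse.
public class DefaultAndOperator {
    public static void main(String[] args) throws Exception {
        QueryParser parser =
            new QueryParser("content", new StandardAnalyzer());
        parser.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query q = parser.parse("bank transference");
        // Prints the parsed query, e.g. +content:bank +content:transference
        System.out.println(q.toString());
    }
}
```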
How can I make an analyzer that ignores the numbers in the texts?
Hi everybody: I would like to know how I can make an analyzer that ignores the numbers in the texts, the same way stop words are ignored. For example, the terms 3.8, 100, 4.15 and 4,33 should not be added to the index. How can I do that? Regards, Ariel
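One way is a custom TokenFilter that drops numeric tokens, wired into a custom Analyzer. A sketch against the Lucene 2.3.x token API; the "parse as a double" test is just one possible definition of "number":

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch: skip any token whose text parses as a number, so terms
// like 3.8, 100 and 4,33 never reach the index.
class NumberFilter extends TokenFilter {
    NumberFilter(TokenStream in) { super(in); }

    public Token next() throws IOException {
        for (Token t = input.next(); t != null; t = input.next()) {
            // Treat "4,33" like "4.33" so European decimals are caught.
            String text = t.termText().replace(',', '.');
            try {
                Double.parseDouble(text);
                // Parsed as a number: drop it and fetch the next token.
            } catch (NumberFormatException e) {
                return t; // not a number, keep it
            }
        }
        return null;
    }
}

class NoNumbersAnalyzer extends Analyzer {
    public TokenStream tokenStream(String field, Reader reader) {
        return new LowerCaseFilter(
            new NumberFilter(new StandardTokenizer(reader)));
    }
}
```

The same analyzer must then be used at both index and query time, exactly as with stop words.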
Re: How to search a phrase using quotes in a query?
catch block: e.printStackTrace(); } } [/code]

EnglishAnalyzer is a custom analyzer that has all these filters: SynonymFilter, SnowballFilter, StopFilter, LowerCaseFilter, StandardFilter and a StandardTokenizer. So I don't know why, when I do a search like "the bank of america", the results don't include the documents that contain the exact phrase "the bank of america". Could you help me, please? Regards, Ariel

On Mon, Apr 6, 2009 at 5:26 PM, Erick Erickson erickerick...@gmail.com wrote:
If you have Luke, you should be able to submit your query and use the explain functionality to gain some insights into what the query actually looks like as well. Best, Erick
How to search a phrase using quotes in a query?
Hi everybody: When I search with the query "the fool of the hill", why don't the results include documents that contain the entire phrase "the fool of the hill", even though documents containing that phrase do exist? I am using the Snowball analyzer for English. Could you help me with this, please? Regards, Ariel
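Quoting the phrase makes QueryParser build a PhraseQuery, and printing the parsed query shows what is actually being searched. A sketch, assuming the field is named "content" as elsewhere in these threads; note that an analyzer with a StopFilter may remove "the" and "of" from the phrase, which is a frequent cause of exact-phrase searches finding nothing:

```java
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Sketch: parse a quoted phrase and print the resulting query to see
// which terms survive analysis (stop words may have been stripped).
public class PhraseQueryDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser =
            new QueryParser("content", new SnowballAnalyzer("English"));
        Query q = parser.parse("\"the fool of the hill\"");
        System.out.println(q.toString("content"));
    }
}
```

If the stop words were removed at index time too, the remaining terms must still be adjacent for the phrase to match; position increments left by the StopFilter can break that.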
Re: How to search a phrase using quotes in a query?
Well, I have Luke; the index has been built fine. The field where I am searching is the content field. I am using the same analyzer at query and indexing time: the Snowball English analyzer. I will post the code snippet later. Regards, Ariel

On Mon, Apr 6, 2009 at 4:37 PM, Erick Erickson erickerick...@gmail.com wrote:
We really need some more data. First, I *strongly* recommend you get a copy of Luke and examine your index to see what is *actually* there. Google "lucene luke". That often answers many questions. Second, query.toString is your friend. For instance, if the query you provided below is all that you're submitting, it's going against the default field you might have specified when you instantiated your query parser. Third, what analyzers are you using at index and query time? Code snippets would also help. Best, Erick
How to index correctly, taking synonyms into account, using WordNet?
Hi everybody: I am using WordNet to index my documents, taking synonyms into account. After I indexed the whole document collection, I made a query with the word snort, but documents that contain the word bird are retrieved. I don't understand this, because snort and bird are not synonyms, so why are the documents that contain bird retrieved? Could you help me solve this problem? How do you index your documents using WordNet? Thanks in advance. Regards, Ariel
Re: How to index correctly, taking synonyms into account, using WordNet?
Well, I have Luke 0.8. I opened my index with that tool, but there is no sign of synonyms in the field I indexed with the synonym analyzer. I don't know how to see the group of synonyms for each term; could somebody tell me how to do that?

On Wed, Feb 4, 2009 at 5:09 PM, Erick Erickson erickerick...@gmail.com wrote:
The first thing I'd do is get a copy of Luke (google "lucene luke") and examine your index to see what's actually there in the document you claim is incorrectly returned. If that doesn't enlighten you, you really have to provide more details and code examples, because your question is unanswerable as it stands. Best, Erick
Re: How to index correctly, taking synonyms into account, using WordNet?
How can I see the senses of a word with WordNet? And how could I select the most popular ones? Is there a way to make queries that ignore the synonyms I have added to the index? I hope you can help me. Regards, Ariel

On Wed, Feb 4, 2009 at 7:46 PM, Manu Konchady mkonch...@yahoo.com wrote:
In WordNet, bird is one of the noun senses of snort:

Noun Sense 1: snicker, snort, snigger
Description: a disrespectful laugh

Noun Sense 2: boo, hoot, Bronx cheer, hiss, raspberry, razzing, razz, snort, bird
Description: a cry or noise made to express displeasure or contempt

You may want to try and select just the synonyms of the most popular sense of the word. Regards, Manu
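On "queries ignoring the synonyms": one common approach (an assumption, not something from this thread) is to index the text twice, a plain field and a synonym-expanded field, so queries can target either. A sketch for Lucene 2.3.x:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: store the same text in "content" (plain analysis) and
// "content_syn" (synonym analysis). Queries against "content" then
// ignore the injected synonyms; queries against "content_syn" use them.
// Field names are illustrative; "synonymAnalyzer" stands for the
// custom WordNet analyzer discussed in this thread.
public class TwoFieldSynonymIndexing {
    static Document makeDoc(String text) {
        Document doc = new Document();
        doc.add(new Field("content", text,
                Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("content_syn", text,
                Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }

    static Analyzer buildAnalyzer(Analyzer synonymAnalyzer) {
        // Give the IndexWriter this wrapper so only "content_syn"
        // gets the synonym expansion.
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("content_syn", synonymAnalyzer);
        return wrapper;
    }
}
```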
Re: Default and optimal use of RAMDirectory
Did you mean that the people who think using a RAMDirectory is going to speed up the indexing process are wrong?

On Sun, Dec 21, 2008 at 10:22 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
Let me add to that that I clearly recall having a hard time getting the tests for that particular section of LIA1 to clearly and consistently show that using the RAMDirectory buffering approach instead of a vanilla IndexWriter yields faster indexing. Even back then IndexWriter buffered indexed data in memory, though today's IndexWriter is much, much better at it. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Michael McCandless luc...@mikemccandless.com
To: java-user@lucene.apache.org
Sent: Saturday, December 20, 2008 4:25:13 AM
Subject: Re: Default and optimal use of RAMDirectory

Actually, things have improved since LIA1 was written a few years ago: IndexWriter now does a good job managing the RAM buffer you assign to it, so you should not see much benefit by doing your own buffering with RAMDirectory (and if you somehow do, I'd like to know about it!). Instead you should call IndexWriter.setRAMBufferSizeMB. Also, FSDirectory does no RAM buffering on its own. See here for further ways to tune for indexing throughput: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed Mike

wrote:
Hi all, First off, I'd like to say I'm quite pleased to be a part of this mailing list - it's even more exciting to know that we have Otis G. and Erik H., authors of (at least in my opinion) the Lucene Bible, Lucene in Action, actively answering all these inquiries =) We're currently in the initial stages of implementing Lucene as part of our product, and one problem that we need to resolve is optimizing Lucene. I've been reading the Lucene in Action book, and one of the tips for optimizing Lucene indexing is to use a RAMDirectory as a buffer before writing to an FSDirectory. According to the book, this is done internally and automatically when I use FSDirectory. My questions are: 1) What is the default implementation/computation used in allocating the RAMDirectory when we use FSDirectory? and 2) What is the optimal way of customizing RAMDirectory usage - any tips on how to do it? BTW, we're using Lucene 2.3.2. Thanks for all the help. Joseph
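The setRAMBufferSizeMB call Mike mentions looks like this in practice; the 48 MB figure and index path below are illustrative values, not recommendations from the thread:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// Sketch for Lucene 2.3.x: instead of hand-rolled RAMDirectory
// buffering, let IndexWriter manage its own RAM buffer and flush
// segments to disk when the buffer fills.
public class TuneIndexingBuffer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        writer.setRAMBufferSizeMB(48.0);
        // ... writer.addDocument(...) calls go here ...
        writer.close();
    }
}
```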
Re: How to search documents taking the dates into account?
What I am doing is this:

[code]
Sort sort = new Sort();
sort.setSort("year", true);
hits = searcher.search(pquery, sort);
[/code]

How must I change my code to sort first by date and then by score? Greetings, Ariel

On Thu, Dec 18, 2008 at 4:48 AM, Ian Lea ian@gmail.com wrote:
Lucene lets you sort by multiple fields, including score. See the javadocs for Sort and SortField, specifically SortField.SCORE. -- Ian.
Re: How to search documents taking the dates into account?
Thank you, it works very well. Regards, Ariel

On Thu, Dec 18, 2008 at 8:22 AM, Erick Erickson erickerick...@gmail.com wrote:
Use the setSort that takes an array of SortField objects...
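For reference, the SortField-array version Erick suggests can be sketched like this, assuming the pquery and searcher variables from the earlier snippets and a "year" field holding the year as in this thread:

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Sketch: sort primarily by the "year" field (true = descending),
// breaking ties by relevance score.
Sort sort = new Sort(new SortField[] {
    new SortField("year", SortField.INT, true),
    SortField.FIELD_SCORE // then by score
});
Hits hits = searcher.search(pquery, sort);
```

SortField.INT assumes the field was indexed as a plain integer term; if the years are indexed as padded strings, SortField.STRING works too.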
Re: How to search documents taking the dates into account?
Hi: This solution has a problem: the results are sorted by the year criterion, but after sorting by year I need them sorted by the scoring criterion too. How can I do this? I hope you can help me. Greetings, Ariel

On Wed, Nov 19, 2008 at 5:28 PM, Erick Erickson erickerick...@gmail.com wrote:
Well, MultiSearcher is just a Searcher, so you have available all of the search methods on Searcher. One of which is:

public TopFieldDocs search(Query query, Filter filter, int n, Sort sort) throws IOException

Expert: low-level search implementation with arbitrary sorting. Finds the top n hits for query, applying filter if non-null, and sorting the hits by the criteria in sort. Best, Erick
Re: I would like to know more about the Lucene implementation in C++
Thank you very much.

On Thu, Dec 4, 2008 at 11:33 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
There is CLucene. It's not a part of Apache, but lives on SourceForge, I think. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Ariel [EMAIL PROTECTED]
To: lucene user java-user@lucene.apache.org
Sent: Tuesday, December 2, 2008 2:13:08 PM
Subject: I would like to know more about the Lucene implementation in C++

Hi everybody: I have seen that the Lucene project for C++ has been abandoned. Could you tell me if there is another similar implementation of Java Lucene in C++?
I would like to know more about the Lucene implementation in C++
Hi everybody: I have seen that the Lucene project for C++ has been abandoned. Could you tell me if there is another similar implementation of Java Lucene in C++?
How to search documents taking the dates into account?
Hi everybody: I need to search with Lucene 2.3.2 taking dates into account. When I built the index, I created a date field in which I stored the year the document was created. At search time I would like to retrieve documents that were created before a given year or after a given year, for example documents before 2002 or after 2003. Is it possible to do that with Lucene? Regards, Ariel
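This is what range queries are for. A sketch, assuming the field is named "year"; since range queries compare terms lexicographically, the years should be indexed zero-padded to a fixed width:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Sketch: open-ended year ranges expressed with the query parser's
// [low TO high] syntax. "0000" and "9999" are illustrative bounds.
public class YearRangeQueries {
    public static void main(String[] args) throws Exception {
        QueryParser parser =
            new QueryParser("year", new StandardAnalyzer());
        Query before2002 = parser.parse("year:[0000 TO 2001]");
        Query after2003  = parser.parse("year:[2004 TO 9999]");
        System.out.println(before2002 + " / " + after2003);
    }
}
```

Curly-brace syntax, e.g. year:{2003 TO 9999}, gives an exclusive range if the boundary year itself should not match.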
Re: How to search documents taking the dates into account?
Thanks, that was very helpful, but I have a question: when I search, the results are not sorted according to the range. For example, with year:[2003 TO 2008], 2003 documents are shown on the first page, 2005 documents on the second, and 2004 documents on the third; I don't see any sort criterion. How can I fix that problem? Greetings, Ariel

On Wed, Nov 19, 2008 at 11:09 AM, Ian Lea [EMAIL PROTECTED] wrote:
Hi - sounds like you need a range query. http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Range%20Searches -- Ian.
Re: How to search documents taking the dates into account?
It is supposed that Lucene does a lexicographic sort, but this is not happening. Could you tell me what I'm doing wrong? I hope you can help me. Regards
Re: How to search documents taking the dates into account?
Well, this is what I am doing: queryString = year:[2003 TO 2005]

[CODE]
Query pquery = null;
Hits hits = null;
Analyzer analyzer = new SnowballAnalyzer("English");
try {
    pquery = MultiFieldQueryParser.parse(
            new String[] {queryString, queryString},
            new String[] {"title", "content"}, analyzer);
} catch (ParseException e1) {
    e1.printStackTrace();
}
MultiSearcher searcher = (MultiSearcher) searcherCache.get(name);
try {
    hits = searcher.search(pquery);
} catch (IOException e1) {
    e1.printStackTrace();
}
[/CODE]

I don't know the methods that include sorting. So far I have only sorted by the score criterion; I don't know how to change it to the year field criterion. As you can see, I am using a MultiSearcher because I have several indexes. I hope you can help me. Regards, thanks in advance, Ariel

On Wed, Nov 19, 2008 at 3:58 PM, Ian Lea [EMAIL PROTECTED] wrote:
Are you using one of the search methods that includes sorting? If not, then do. If you are, then you need to tell us exactly what you are doing and exactly what you reckon is going wrong. -- Ian.
What is the size of Lucene's index as a percentage of the indexed data?
I need to know the size of Lucene's index as a percentage of the information I am going to index. I have read some articles saying that if I index 120 GB of information, the index will grow to about 40 GB, which is roughly a third (about 33 %). Could somebody tell me how that can be verified? Is there any official Apache Lucene document that says so? I hope somebody can help me. Thanks. Ariel
How to make documents clustering and topic classification with lucene
Hi everybody: Do you have any idea how to do document clustering and topic classification using Lucene? Is there any way to do this? Please, I need help. Thanks everybody. Ariel
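Lucene itself does not cluster or classify; a common approach is to use Lucene's stored term vectors as feature vectors and feed them to an external clustering or classification library (projects such as Carrot2 do search-result clustering on top of Lucene). A hedged sketch of the extraction step only, assuming the field was indexed with term vectors enabled:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class FeatureExtractor {
    // Builds a term -> frequency map for one document's field, usable as a
    // feature vector for an external clustering/classification library.
    // Requires the field to have been indexed with Field.TermVector.YES.
    static Map termFrequencies(IndexReader reader, int docId, String field)
            throws IOException {
        Map features = new HashMap();
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        if (tfv == null) {
            return features; // no term vector was stored for this doc/field
        }
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            features.put(terms[i], new Integer(freqs[i]));
        }
        return features;
    }
}
```

The clustering algorithm itself (k-means, cosine-based grouping, etc.) would then operate on these maps outside Lucene.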
Re: boosting relevance of certain documents
OK, so I'm not an expert on the scoring algorithm, but based on tf*idf you can tell that the first-returned document scores higher because it has a higher term frequency. Using explain you can see the following:

Doc 1:
0.643841 = (MATCH) fieldWeight(searchable:fifa in 0), product of:
  1.0 = tf(termFreq(searchable:fifa)=1)
  1.287682 = idf(docFreq=2)
  0.5 = fieldNorm(field=searchable, doc=0)

Doc 2:
0.68289655 = (MATCH) fieldWeight(searchable:fifa in 1), product of:
  1.4142135 = tf(termFreq(searchable:fifa)=2)
  1.287682 = idf(docFreq=2)
  0.375 = fieldNorm(field=searchable, doc=1)

On Fri, Apr 25, 2008 at 2:30 PM, Daniel Freudenberger [EMAIL PROTECTED] wrote: I'm using the StandardAnalyzer - hope this answers your question (I'm quite new to the Lucene thing) -Original Message- From: Jonathan Ariel [mailto:[EMAIL PROTECTED] Sent: Friday, April 25, 2008 6:59 PM To: java-user@lucene.apache.org Subject: Re: boosting relevance of certain documents How are you analyzing the searchable field? On Fri, Apr 25, 2008 at 12:49 PM, Daniel Freudenberger [EMAIL PROTECTED] wrote: Hello, I'm using Lucene within a new project and I'm not sure how to solve the following problem: My index consists of two attributes, id and searchable. id is the id of a product and searchable is a combination of the product name and its category name. Example:

id | searchable
1  | fifa 08 - playstation 3
2  | fifa 2003 fifa 03 - playstation 3
3  | playstation 60gb hdd - playstation 3
4  | playstation i like you - playstation 3

When searching for fifa, Lucene returns the product with id 2 first, whereas id 1 (fifa 08) would be the much more relevant result (from the user's point of view). The same problem arises when searching for playstation - the customer expects products having playstation in their names first, ideally the console itself. In reality, however, he gets all the products that are in the playstation category as well. 
My idea was to introduce another attribute, relevance, which could increase the relevance of an entry. The actual relevance shouldn't be suppressed completely though, but should only be taken into account for products that are similarly relevant for a specific search term. Does anybody have an idea on how to solve this problem? Thank you in advance, Daniel
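One alternative to a stored "relevance" attribute is to split the combined searchable text into separate name and category fields and give the name field an index-time boost, so name matches outweigh category matches without suppressing tf/idf entirely. A sketch against the Lucene 2.x API; the field names and the 3.0f factor are assumptions to be tuned, not from the original thread:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostedProduct {
    // Builds a product document whose name field outweighs its category text.
    // The field boost is folded into the field norm at index time, so it
    // raises scores for matches in that field relative to the other field.
    static Document makeDoc(String id, String name, String category) {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
        Field nameField = new Field("name", name,
                Field.Store.YES, Field.Index.TOKENIZED);
        nameField.setBoost(3.0f); // hypothetical weight; tune experimentally
        doc.add(nameField);
        doc.add(new Field("category", category,
                Field.Store.YES, Field.Index.TOKENIZED));
        return doc;
    }
}
```

Queries would then search both fields (e.g. via MultiFieldQueryParser), and the boosted name field dominates when both match.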
Re: MoreLikeThis over a subset of documents
Yes, it would be too much to do in real time, but it is a good idea though. I don't know if a vector of term frequencies is stored with the document. Because I could search the index to get the subset of documents and then take the term frequencies from there. In that case I could change MoreLikeThis to receive a set of term frequencies, instead of an IndexReader, and use that to do all the processing. Does anyone know if a document contains the term frequencies for its fields? On Wed, Apr 23, 2008 at 7:46 AM, Karl Wettin [EMAIL PROTECTED] wrote: Jonathan Ariel skrev: Smart idea, but it won't help me. I have almost 50 categories and eventually I would like to filter not just on category but maybe also on language, etc. Karl: what do you mean by measure the distance between the term vectors and cluster them in real time? I mean exactly what I say: if your subsets are small enough, you could evaluate the cosine coefficient and group documents accordingly. 2 million documents is however way too much data to do that in real time. I would probably create one index for each filter you want to use. karl
MoreLikeThis patch to support boost factor
This is a patch I made to be able to boost the terms with a specific factor besides the relevancy returned by MoreLikeThis. This is helpful when having more than one MoreLikeThis in the query, so words in field A (i.e. Title) can be boosted more than words in field B (i.e. Description). Any feedback? Jonathan

Index: /home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java
===
--- /home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java (revision 651048)
+++ /home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java (working copy)
@@ -284,6 +284,11 @@
     private final IndexReader ir;
 
     /**
+     * Boost factor to use when boosting the terms
+     */
+    private int boostFactor = 1;
+
+    /**
      * Constructor requiring an IndexReader.
      */
     public MoreLikeThis(IndexReader ir) {
@@ -574,7 +579,7 @@
         }
         float myScore = ((Float) ar[2]).floatValue();
-        tq.setBoost(myScore / bestScore);
+        tq.setBoost(boostFactor * myScore / bestScore);
     }
     try {
@@ -921,6 +926,22 @@
             x = 1;
         }
     }
+
+    /**
+     * Returns the boost factor used when boosting terms
+     * @return the boost factor used when boosting terms
+     */
+    public int getBoostFactor() {
+        return boostFactor;
+    }
+
+    /**
+     * Sets the boost factor to use when boosting terms
+     * @param boostFactor
+     */
+    public void setBoostFactor(int boostFactor) {
+        this.boostFactor = boostFactor;
+    }
 }
MoreLikeThis over a subset of documents
Is there any way to execute a MoreLikeThis over a subset of documents? I need to retrieve a set of interesting keywords from a subset of documents and not the entire index (imagine that my index has documents categorized as A, B and C and I just want to work with those categorized as A). Right now it is using docFreq from the IndexReader. So I looked into the FilterIndexReader to see if I can override the docFreq behavior, but I'm not sure if it's possible. What do you think? Jonathan
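The FilterIndexReader idea floated in the question can be sketched as follows. This is a hedged sketch only: `CategoryStats` is a hypothetical source of precomputed per-category statistics, not a Lucene class, and overriding docFreq changes only idf-style statistics seen by MoreLikeThis, not which documents the reader exposes:

```java
import java.io.IOException;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Delegates docFreq to per-category statistics instead of the whole index.
// Everything else still reflects the full index, so this reshapes term
// weighting (idf) for the subset without physically splitting the index.
public class CategoryScopedReader extends FilterIndexReader {

    // Hypothetical interface for per-category term statistics.
    public interface CategoryStats {
        int docFreq(Term t) throws IOException;
    }

    private final CategoryStats stats;

    public CategoryScopedReader(IndexReader in, CategoryStats stats) {
        super(in);
        this.stats = stats;
    }

    public int docFreq(Term t) throws IOException {
        return stats.docFreq(t); // frequency within the chosen category only
    }
}
```

Building the `CategoryStats` table (e.g. by iterating the terms of each category's documents offline) is the expensive part; the reader wrapper itself is cheap.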
Re: MoreLikeThis over a subset of documents
But that doesn't help me with my problem, because the interesting terms are taken from the entire index and not a subset as I need. On Tue, Apr 22, 2008 at 6:46 PM, Glen Newton [EMAIL PROTECTED] wrote: Instead of this:

MoreLikeThis mlt = new MoreLikeThis(ir);
Reader target = ...; // orig source of doc you want to find similarities to
Query query = mlt.like(target);
Hits hits = is.search(query);

do this:

MoreLikeThis mlt = new MoreLikeThis(ir);
Reader target = ...; // orig source of doc you want to find similarities to
Query moreQuery = mlt.like(target);
BooleanQuery bq = new BooleanQuery();
bq.add(moreQuery, BooleanClause.Occur.MUST);
Query restrictQuery = new TermQuery(new Term("category", "A"));
bq.add(restrictQuery, BooleanClause.Occur.MUST);
Hits hits = is.search(bq);

-glen 2008/4/22 Jonathan Ariel [EMAIL PROTECTED]: Is there any way to execute a MoreLikeThis over a subset of documents? I need to retrieve a set of interesting keywords from a subset of documents and not the entire index (imagine that my index has documents categorized as A, B and C and I just want to work with those categorized as A). Right now it is using docFreq from the IndexReader. So I looked into the FilterIndexReader to see if I can override the docFreq behavior, but I'm not sure if it's possible. What do you think? Jonathan
Re: MoreLikeThis over a subset of documents
I could have up to 2 million documents and growing. On Tue, Apr 22, 2008 at 7:29 PM, Karl Wettin [EMAIL PROTECTED] wrote: Jonathan Ariel skrev: Is there any way to execute a MoreLikeThis over a subset of documents? I need to retrieve a set of interesting keywords from a subset of documents and not the entire index (imagine that my index has documents categorized as A, B and C and I just want to work with those categorized as A). Right now it is using docFreq from the IndexReader. So I looked into the FilterIndexReader to see if I can override the docFreq behavior, but I'm not sure if it's possible. What do you think? It might be tricky. How many documents do you have in the subset? Could you measure the distance between the term vectors and cluster them in real time? karl - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: MoreLikeThis over a subset of documents
Smart idea, but it won't help me. I have almost 50 categories and eventually I would like to filter not just on category but maybe also on language, etc. Karl: what do you mean by measure the distance between the term vectors and cluster them in real time? On Tue, Apr 22, 2008 at 7:39 PM, Glen Newton [EMAIL PROTECTED] wrote: Sorry, I misunderstood the problem. My mistake. While not optimal and rather expensive space-wise, you could have - in addition to existing keyword field - a field for each category. If the document being indexed is in category A, only add the text to the catA field. Now do MoreLikeThis on catA. This assumes you know the categories at index time, of course. Redundant but workable. -Glen 2008/4/22 Jonathan Ariel [EMAIL PROTECTED]: Is there any way to execute a MoreLikeThis over a subset of documents? I need to retrieve a set of interesting keywords from a subset of documents and not the entire index (imagine that my index has documents categorized as A, B and C and I just want to work with those categorized as A). Right now it is using docFreq from the IndexReader. So I looked into the FilterIndexReader to see if I can override the docFreq behavior, but I'm not sure if it's possible. What do you think? Jonathan -- - - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
How to obtain the freq term vector of a field from a remote index ?
Hi folks: I need to know how to get the term frequency vector of a field from a remote index on another host. I know that the IndexSearcher class has a method, getIndexReader().getTermFreqVector(idDoc, fieldName), to get the term frequency vector of a certain field, but I am using RemoteSearchable, which is a Searchable, because my search functionality lives in an RMI server. I access the RemoteSearchable from another host to obtain the hits, but so far I haven't found a way to also obtain the term frequency vector of a field. Do you know if it is possible to do that? How can I do it? Any help is appreciated. Greetings Ariel
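RemoteSearchable only exposes the Searchable interface, which has no term-vector methods, so term vectors cannot be fetched through it. One option, sketched here as a hypothetical custom RMI service (not a standard Lucene API), is to export a second remote object backed by the same IndexReader:

```java
import java.io.IOException;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

// Hypothetical remote interface: ships terms and frequencies as plain
// arrays, since Lucene's TermFreqVector implementations are not
// guaranteed to be Serializable.
interface RemoteTermVectors extends Remote {
    String[] getTerms(int docId, String field)
            throws RemoteException, IOException;
    int[] getTermFrequencies(int docId, String field)
            throws RemoteException, IOException;
}

class RemoteTermVectorsImpl extends UnicastRemoteObject
        implements RemoteTermVectors {
    private final IndexReader reader;

    RemoteTermVectorsImpl(IndexReader reader) throws RemoteException {
        this.reader = reader;
    }

    public String[] getTerms(int docId, String field)
            throws RemoteException, IOException {
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        return tfv == null ? new String[0] : tfv.getTerms();
    }

    public int[] getTermFrequencies(int docId, String field)
            throws RemoteException, IOException {
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        return tfv == null ? new int[0] : tfv.getTermFrequencies();
    }
}
```

The implementation would be bound with Naming.rebind next to the existing RemoteSearchable binding, and clients would look up both services.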
MoreLikeThis jar doesn't contain classes
Hi, I've downloaded Lucene 2.3.0 binaries and in the contrib folder I can see the Similarity package, but inside the Jar there are no classes! Downloading the sources I ran into the same issue. Am I doing something wrong? Where should I get the MoreLikeThis classes from? Thanks! Jonathan
MoreLikeThis queries
Hi, I'm trying to use MoreLikeThis but I can't find how to make a MoreLikeThis query that will return related documents given a document and some conditions, like country field in the related documents should be 1, etc. Is there any documentation on how to do this kind of queries? Thanks, Jonathan
Re: Why is lucene so slow indexing in nfs file system ?
Thanks to all of you for your answers; I'm going to change a few things in my application and run tests. One thing: I haven't found another good PDF-to-text converter like PDFBox. Do you know any faster one? Greetings, thanks for your answers, Ariel On Jan 9, 2008 11:08 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Ariel, I believe PDFBox is not the fastest thing and was built more to handle all possible PDFs than for speed (just my impression - Ben, PDFBox's author, might still be on this list and might comment). Pulling data from NFS to index seems like a bad idea. I hope at least the indices are local and not on a remote NFS... We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one) and indexing over NFS was slooow. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ariel [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Wednesday, January 9, 2008 2:50:41 PM Subject: Why is lucene so slow indexing in nfs file system ? Hi: I have seen the post in http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and I am implementing a similar application in a distributed enviroment, a cluster of nodes only 5 nodes. The operating system I use is Linux(Centos) so I am using nfs file system too to access the home directory where the documents to be indexed reside and I would like to know how much time an application spends to index a big amount of documents like 10 Gb ? I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb in every nodes, LAN: 1Gbits/s. The problem I have is that my application spends a lot of time to index all the documents, the delay to index 10 gb of pdf documents is about 2 days (to convert pdf to text I am using pdfbox) that is of course a lot of time, others applications based in lucene, for instance ibm omnifind only takes 5 hours to index the same amount of pdfs documents. I would like to find out why my application has this big delay to index, any help is welcome. 
Do you know of other distributed architecture applications that use Lucene to index big amounts of documents? How long do they take to index? I hope you can help me. Greetings
Re: Why is lucene so slow indexing in nfs file system ?
In a distributed environment the application makes exhaustive use of the network, and there is no other way to access the documents in a remote repository than over the NFS file system. One thing I must clarify: I build the index in memory using a RAMDirectory, and when the RAMDirectory reaches its limit (I have set about 10 MB) I serialize the index to disk (NFS) to merge it with the central index (the central index is on the NFS file system). Is that correct? I hope you can help me. I have taken into consideration the suggestions you made before; I am going to run some tests. Ariel On Jan 10, 2008 8:45 AM, Ariel [EMAIL PROTECTED] wrote: Thanks all you for yours answers, I going to change a few things in my application and make tests. One thing I haven't find another good pdfToText converter like pdfBox Do you know any other faster ? Greetings Thanks for yours answers Ariel On Jan 9, 2008 11:08 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Ariel, I believe PDFBox is not the fastest thing and was built more to handle all possible PDFs than for speed (just my impression - Ben, PDFBox's author might still be on this list and might comment). Pulling data from NFS to index seems like a bad idea. I hope at least the indices are local and not on a remote NFS... We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one) and indexing overNFS was slooow. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ariel [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Wednesday, January 9, 2008 2:50:41 PM Subject: Why is lucene so slow indexing in nfs file system ? Hi: I have seen the post in http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and I am implementing a similar application in a distributed enviroment, a cluster of nodes only 5 nodes. 
The operating system I use is Linux(Centos) so I am using nfs file system too to access the home directory where the documents to be indexed reside and I would like to know how much time an application spends to index a big amount of documents like 10 Gb ? I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb in every nodes, LAN: 1Gbits/s. The problem I have is that my application spends a lot of time to index all the documents, the delay to index 10 gb of pdf documents is about 2 days (to convert pdf to text I am using pdfbox) that is of course a lot of time, others applications based in lucene, for instance ibm omnifind only takes 5 hours to index the same amount of pdfs documents. I would like to find out why my application has this big delay to index, any help is welcome. Dou you know others distributed architecture application that uses lucene to index big amounts of documents ? How long time it takes to index ? I hope yo can help me Greetings - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Why is lucene so slow indexing in nfs file system ?
I am indexing into RAM and then merging explicitly because my application demands it: I designed it as a distributed environment, so many threads or workers on different machines index into RAM and serialize to disk, and another thread on another machine takes the segment index and merges it with the principal one. That is faster than having just one thread indexing the documents, isn't it? Your suggestions are very useful. I hope you can help me. Greetings Ariel On Jan 10, 2008 10:21 AM, Erick Erickson [EMAIL PROTECTED] wrote: This seems really clunky. Especially if your merge step also optimizes. There's not much point in indexing into RAM then merging explicitly. Just use an FSDirectory rather than a RAMDirectory. There is *already* buffering built into FSDirectory, and your merge factor etc. control how much RAM is used before flushing to disk. There's considerable discussion of this on the Wiki I believe, but in the mail archive for sure. And I believe there's a RAM-usage-based flushing policy somewhere. You're adding complexity where it's probably not necessary. Did you adopt this scheme because you *thought* it would be faster or because you were addressing a *known* problem? Don't *ever* write complex code to support a theoretical case unless you have considerable certainty that it really is a problem. "It would be faster" is a weak argument when you don't know whether you're talking about saving 1% or 95%. The added maintenance is just not worth it. There's a famous quote about that from Donald Knuth (paraphrasing Hoare): "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." It's true. So the very *first* measurement I'd take is to get rid of the in-RAM stuff and just write the index to local disk. I suspect you'll be *far* better off doing this and then just copying your index to the nfs mount. 
Best Erick On Jan 10, 2008 10:05 AM, Ariel [EMAIL PROTECTED] wrote: In a distributed enviroment the application should make an exhaustive use of the network and there is not another way to access to the documents in a remote repository but accessing in nfs file system. One thing I must clarify: I index the documents in memory, I use RAMDirectory to do that, then when the RAMDirectory reach the limit(I have put about 10 Mb) then I serialize to disk(nfs) the index to merge it with the central index(the central index is in nfs file system), is that correct? I hope you can help me. I have take in consideration the suggestions you have make me before, I going to do some things to test it. Ariel On Jan 10, 2008 8:45 AM, Ariel [EMAIL PROTECTED] wrote: Thanks all you for yours answers, I going to change a few things in my application and make tests. One thing I haven't find another good pdfToText converter like pdfBox Do you know any other faster ? Greetings Thanks for yours answers Ariel On Jan 9, 2008 11:08 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Ariel, I believe PDFBox is not the fastest thing and was built more to handle all possible PDFs than for speed (just my impression - Ben, PDFBox's author might still be on this list and might comment). Pulling data from NFS to index seems like a bad idea. I hope at least the indices are local and not on a remote NFS... We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one) and indexing overNFS was slooow. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ariel [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Wednesday, January 9, 2008 2:50:41 PM Subject: Why is lucene so slow indexing in nfs file system ? Hi: I have seen the post in http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and I am implementing a similar application in a distributed enviroment, a cluster of nodes only 5 nodes. 
The operating system I use is Linux(Centos) so I am using nfs file system too to access the home directory where the documents to be indexed reside and I would like to know how much time an application spends to index a big amount of documents like 10 Gb ? I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb in every nodes, LAN: 1Gbits/s. The problem I have is that my application spends a lot of time to index all the documents, the delay to index 10 gb of pdf documents is about 2 days (to convert pdf to text I am using pdfbox) that is of course a lot of time, others applications based in lucene, for instance ibm omnifind only takes 5 hours to index the same amount of pdfs documents. I would like to find out why my application has this big delay to index, any help is welcome. Dou you know
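Erick's advice (drop the manual RAMDirectory merging and write straight to local disk) can be sketched against the Lucene 2.2 API like this. A minimal sketch under assumptions: the class name, the analyzer choice, and the tuning values are illustrative, not from the thread:

```java
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class LocalDiskIndexer {
    // Index straight to a local-disk directory. IndexWriter buffers in RAM
    // internally and flushes segments itself, so no manual RAMDirectory
    // merging is needed; copy or merge the result into the NFS-hosted
    // central index afterwards.
    static void indexAll(File localIndexDir, Iterable docs) throws IOException {
        IndexWriter writer = new IndexWriter(localIndexDir,
                new StandardAnalyzer(), true /* create */);
        writer.setMergeFactor(10);        // default; raise for bulk loads
        writer.setMaxBufferedDocs(1000);  // buffer more docs before flushing
        for (Object o : docs) {
            writer.addDocument((Document) o);
        }
        writer.optimize();
        writer.close();
    }
}
```

The point of the design choice is that flushing and merging happen inside IndexWriter on a local disk, which is far cheaper than serializing small RAM segments over NFS.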
Re: Why is lucene so slow indexing in nfs file system ?
Thanks for your suggestions. I'm sorry, I didn't know: what do you mean by SAN and FC? Another thing: I have visited the Lucene home page and the 2.3 version is not released there; could you tell me where the download link is? Thanks in advance. Ariel On Jan 10, 2008 2:59 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Ariel, Comments inline. - Original Message From: Ariel [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, January 10, 2008 10:05:28 AM Subject: Re: Why is lucene so slow indexing in nfs file system ? In a distributed enviroment the application should make an exhaustive use of the network and there is not another way to access to the documents in a remote repository but accessing in nfs file system. OG: What about SAN connected over FC for example? One thing I must clarify: I index the documents in memory, I use RAMDirectory to do that, then when the RAMDirectory reach the limit(I have put about 10 Mb) then I serialize to disk(nfs) the index to merge it with the central index(the central index is in nfs file system), is that correct? OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it will do the in-memory thing for you. Make good use of your RAM and use 2.3, which gives you more control over RAM use during indexing. Parallelizing indexing over multiple machines and merging at the end is faster, so that's a good approach. Also, if your boxes have multiple CPUs, write your code so that it has multiple worker threads that do indexing and feed docs to IndexWriter.addDocument(Document) to keep the CPUs fully utilized. OG: Oh, something faster than PDFBox? There is (can't remember the name now... itextstream or something like that?), though it may not be free like PDFBox. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch On Jan 10, 2008 8:45 AM, Ariel [EMAIL PROTECTED] wrote: Thanks all you for yours answers, I going to change a few things in my application and make tests. 
One thing I haven't find another good pdfToText converter like pdfBox Do you know any other faster ? Greetings Thanks for yours answers Ariel On Jan 9, 2008 11:08 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Ariel, I believe PDFBox is not the fastest thing and was built more to handle all possible PDFs than for speed (just my impression - Ben, PDFBox's author might still be on this list and might comment). Pulling data from NFS to index seems like a bad idea. I hope at least the indices are local and not on a remote NFS... We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one) and indexing overNFS was slooow. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ariel [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Wednesday, January 9, 2008 2:50:41 PM Subject: Why is lucene so slow indexing in nfs file system ? Hi: I have seen the post in http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and I am implementing a similar application in a distributed enviroment, a cluster of nodes only 5 nodes. The operating system I use is Linux(Centos) so I am using nfs file system too to access the home directory where the documents to be indexed reside and I would like to know how much time an application spends to index a big amount of documents like 10 Gb ? I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb in every nodes, LAN: 1Gbits/s. The problem I have is that my application spends a lot of time to index all the documents, the delay to index 10 gb of pdf documents is about 2 days (to convert pdf to text I am using pdfbox) that is of course a lot of time, others applications based in lucene, for instance ibm omnifind only takes 5 hours to index the same amount of pdfs documents. I would like to find out why my application has this big delay to index, any help is welcome. Dou you know others distributed architecture application that uses lucene to index big amounts of documents ? 
How long time it takes to index ? I hope yo can help me Greetings
Why is lucene so slow indexing in nfs file system ?
Hi: I have seen the post at http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and I am implementing a similar application in a distributed environment, a cluster of only 5 nodes. The operating system I use is Linux (CentOS), so I am also using the NFS file system to access the home directory where the documents to be indexed reside, and I would like to know how much time an application should spend indexing a big amount of documents, say 10 GB. I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU and 512 MB of RAM; LAN: 1 Gbit/s. The problem I have is that my application spends a lot of time indexing all the documents: the delay to index 10 GB of PDF documents is about 2 days (to convert PDF to text I am using PDFBox), which is of course a lot of time; other applications based on Lucene, for instance IBM OmniFind, take only 5 hours to index the same amount of PDF documents. I would like to find out why my application has this big indexing delay; any help is welcome. Do you know of other distributed architecture applications that use Lucene to index big amounts of documents? How long do they take to index? I hope you can help me. Greetings
Re: How to build your custom termfreq vector an add it to the field ?
Very interesting, the link you suggested, Mr. Grant Ingersoll. Let's see if I understand how the ranking issue in Lucene could be implemented: 1. First I must create my own query class extending the abstract Query class; the only method I must implement from this class is toString. Is this right? 2. I must implement, inside my own query class, the Weight interface, but I really don't understand how this is going to let me change my ranking scoring. 3. I must implement my custom Scorer? I don't know how to integrate this. There are a lot of little pieces of information, but nothing concrete. Greetings On Nov 7, 2007 1:48 PM, Grant Ingersoll [EMAIL PROTECTED] wrote: Term Vectors (specifically TermFreqVector) in Lucene are a storage mechanism for convenience and applications to use. They are not an integral part of the scoring in the way you may be thinking of them in terms of the traditional Vector Space Model, thus there may be some confusion from the different usages of that terminology. If you want to see examples of how to implement scorers have a look at classes like TermScorer, BoostingTermQuery, and any of the other classes that extend Scorer. You might also find the file formats page (off of the Lucene Java website under Documentation) helpful for understanding what Lucene stores so that it can do scoring. There really isn't any tutorial on scoring, as it is not something that many people have expressed an interest in or no one has made it a high enough priority to write one. Having written a Scorer (or maybe two, I forget) I can give advice on specific things, but I am not sure I could write a tutorial that is general enough to be useful at this point. One thought for associating a weight to a given term based on its cooccurring terms is to use the new Payload mechanism whereby you can store a byte array at each term which can then be used in scoring via things like the BoostingTermQuery (or your own implementation.) 
If that is of interest, you can search the archives for payloads (I also think Michael Busch is presenting on Payloads, amongst other things, at ApacheCon in Atlanta) and have a look at the BoostingTermQuery. There certainly are other PayloadQueries that need to be implemented. See the Lucene wiki for some background and details on Payloads as well. I don't know that it is a big mistake to try this in Lucene. The community hasn't put a huge priority on making altering the innards of scoring easier to deal with (if possible), but that doesn't mean we are not open to suggestions and patches.You may find https://issues.apache.org/jira/browse/LUCENE-965 to be informative for both the implementation and the discussion of things that need to happen to be accepted into Lucene. This JIRA issue specifically attempts to provide Lucene with a new scoring mechanism. You might also have a look at Lemur (http://www.lemurproject.org/) which is much more academically focused. Cheers, Grant On Nov 7, 2007, at 12:49 PM, Ariel wrote: Then if I want to use another scoring formula I must to implement my own Query/Weigh/Scorer ? For example instead of cousine distance leiderbage distance or .. another. I'm studying Query/Weigh/Scorer classes to find out how to do that but there is not much documentation about that. I have seen I could change similarity factors extending the simlarity class, but I have not seen any example about changing scoring formula and changing the weight by term in the term vector. Do you know any tutorial about this ? What I want to do changing frecuency in the terms vector is this: for example instead of take the tf term frecuency of the term and stored in the vector I want to consider the correlation of the term with the other terms of the documents and store that measure by term in the vector so later with my custom similarity formula calculate the ranking of a document against a query considering the correlation between terms. 
Do you think it is a big mistake to try to do this with Lucene? Is there any way? -- Grant Ingersoll http://lucene.grantingersoll.com Lucene Boot Camp Training: ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http://www.apachecon.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
Re: How to build your custom termfreq vector an add it to the field ?
Then if I want to use another scoring formula I must implement my own Query/Weight/Scorer? For example, instead of cosine distance, Levenshtein distance or another one. I'm studying the Query/Weight/Scorer classes to find out how to do that, but there is not much documentation about it. I have seen that I could change similarity factors by extending the Similarity class, but I have not seen any example of changing the scoring formula or changing the weight per term in the term vector. Do you know any tutorial about this? What I want to do by changing the frequency in the term vector is this: for example, instead of taking the tf (term frequency) of the term and storing it in the vector, I want to consider the correlation of the term with the other terms of the document and store that measure per term in the vector, so that later, with my custom similarity formula, I can calculate the ranking of a document against a query considering the correlation between terms. Do you think it is a big mistake to try to do this with Lucene? Is there any way?
Re: How to change the similarity function of Lucene
Sorry for the delay. What I want to do is change the term weights: I don't want a term's weight to be the frequency with which the term appears in the document; instead, I want it to be another, special measure, and to change the similarity function accordingly. I don't know how to change the term weights in a document's term vector. How can I do it? Greetings Ariel

On 9/24/07, Grant Ingersoll [EMAIL PROTECTED] wrote: Perhaps you can explain in what way you want to make it more powerful? There are several possibilities: 1. Change the Similarity class (a callback mechanism) 2. Implement or extend Queries, Scorers, etc. 3. Others? See http://lucene.apache.org/java/docs/scoring.html for some insights. In other words, it can be as complex as you want it to be... -Grant

On Sep 24, 2007, at 5:24 PM, Ariel wrote: Hi everybody: I would like to know how to change the similarity function of Lucene to extend the possibilities of searching and make it more powerful. Has somebody done this before? Could you help me, please? I don't know how complex this might be. I hope you can help me. Greetings Ariel
How to change the similarity function of Lucene
Hi everybody: I would like to know how to change the similarity function of Lucene to extend the possibilities of searching and make it more powerful. Has somebody done this before? Could you help me, please? I don't know how complex this might be. I hope you can help me. Greetings Ariel
How to get documents similar to another document?
Hi everybody: My question is whether there is an API function in Lucene to obtain documents similar to another document by comparing the term frequency vector of a field. I suppose a lot of people have asked this before, but I haven't found the answer with Google or in the Lucene API docs. This could be a very useful piece of functionality in the Lucene API. I am using Lucene version 1.9. I hope you can help me. Greetings. Ariel
Re: How to get documents similar to another document?
Excuse me, could you give more details? Are you telling me that this functionality exists? Which class should I use for it? I hope I'm not bothering you. Greetings

On 9/11/07, Grant Ingersoll [EMAIL PROTECTED] wrote: See the MoreLikeThis functionality in the contrib package; also search this archive for MoreLikeThis.

On Sep 11, 2007, at 11:50 AM, Ariel wrote: Hi everybody: My question is whether there is an API function in Lucene to obtain documents similar to another document by comparing the term frequency vector of a field. I suppose a lot of people have asked this before, but I haven't found the answer with Google or in the Lucene API docs. This could be a very useful piece of functionality in the Lucene API. I am using Lucene version 1.9. I hope you can help me. Greetings. Ariel
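For reference, MoreLikeThis lives in the contrib area (package org.apache.lucene.search.similar). A minimal usage sketch; the index path, field name, and document number below are assumptions:

[code]
// Hedged sketch: find documents similar to internal doc #42 by
// comparing terms of the "content" field. Path, field name, and the
// doc number are illustrative assumptions.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;

IndexReader reader = IndexReader.open("/path/to/index");
IndexSearcher searcher = new IndexSearcher(reader);

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "content" });
mlt.setMinTermFreq(2);  // ignore terms rare in the source document
mlt.setMinDocFreq(2);   // ignore terms rare in the corpus

Query query = mlt.like(42);          // 42 = internal document number
Hits hits = searcher.search(query);  // hit 0 is usually the doc itself
[/code]

The term-vector comparison works best when the field was indexed with Field.TermVector.YES; otherwise MoreLikeThis has to re-analyze the stored field text.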
Indexing
Hi, I'm new to this list, so first of all hello to everyone! Right now I have a little issue I would like to discuss with you. Suppose that you are in a really big application where the data in your database is updated really fast. I reindex Lucene every 5 minutes, but since my application lists everything from Lucene, there is a window of up to 5 minutes (in the worst case) where I don't see new stuff. What do you think would be the best approach to this problem? Thanks! Jonathan
Re: Indexing
I'm not reindexing the entire index; I'm just committing the updates. But I'm not sure how committing in real time would affect performance. Right now I have about 10 updates per minute.

On 8/22/07, Erick Erickson [EMAIL PROTECTED] wrote: There are several approaches. First, is your index small enough to fit in RAM? You might consider just putting it all in RAM and searching that. A more complex solution would be to keep the increments in a separate RAMDir AND your FSDir, search both, and keep things coordinated. Something like:

[code]
open FSDir
create RAMDir
while (whatever) {
    get request
    if (modification) {
        write to FSDir and RAMDir
    }
    if (search) {
        search FSDir
        open RAMDir reader
        search RAMDir
        close RAMDir reader (but not writer!)
    }
}
close FSDir
close RAMDir
start again from the top
[/code]

Warning: I haven't done this, but it *should* work. The sticky part seems to me to be coordinating deletes, since the open FSDir may contain documents also in the RAMDir, but that's an exercise for the reader <g>. You could also define the problem away and just live with a 5-minute latency. Best Erick

On 8/22/07, Jonathan Ariel [EMAIL PROTECTED] wrote: Hi, I'm new to this list, so first of all hello to everyone! Right now I have a little issue I would like to discuss with you. Suppose that you are in a really big application where the data in your database is updated really fast. I reindex Lucene every 5 minutes, but since my application lists everything from Lucene, there is a window of up to 5 minutes (in the worst case) where I don't see new stuff. What do you think would be the best approach to this problem? Thanks! Jonathan
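Another middle ground between committing on every update and a 5-minute reindex is to commit as updates arrive but refresh the searcher on a short timer, which bounds staleness without the RAMDir bookkeeping. A minimal sketch, assuming the Lucene 2.x API (the class and its name are hypothetical):

[code]
// Hedged sketch: serve queries from a cached IndexSearcher and swap in
// a fresh one periodically, so committed updates become visible within
// the refresh interval.
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

public class RefreshingSearcher {
    private volatile IndexSearcher current;
    private final String indexDir;

    public RefreshingSearcher(String indexDir) throws IOException {
        this.indexDir = indexDir;
        this.current = new IndexSearcher(indexDir);
    }

    public IndexSearcher get() { return current; }

    // call this from a timer thread, e.g. every 30 seconds
    public synchronized void refresh() throws IOException {
        IndexSearcher fresh = new IndexSearcher(indexDir);
        IndexSearcher old = current;
        current = fresh;   // new queries see the fresh reader
        old.close();       // NOTE: unsafe if queries are still in flight;
                           // a real version would reference-count readers
    }
}
[/code]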
One index per user or one index per day?
Greetings, I'm creating an application that requires the indexing of millions of documents on behalf of a large group of users, and was hoping to get an opinion on whether I should use one index per user or one index per day. My application will have to handle the following:

- the indexing of about 1 million 5K documents per day, with each document containing about 5 fields
- expiration of documents, since after a while my hard drive would run out of room
- queries that consist of boolean expressions (e.g., the body field contains a AND b, and the title field contains c), as well as ranges (e.g., the document needs to have been indexed between 2/25/07 10:00 am and 2/28/07 9:00 pm)
- permissions; in other words, user A might be able to search on documents X and Y, but user B might be able to search on documents Y and Z
- up to 1,000 users

So, I was considering the following:

1) Using one index per user. This would entail creating and using up to 1,000 indices. Document Y in the example above would have to be duplicated. Expiration is performed via IndexWriter.deleteDocuments. The advantage here is that querying should be reasonably quick, because each index would only contain tens of thousands of documents instead of millions. The disadvantages: I'm concerned about the "too many open files" error, and I'm also concerned about the performance of deleteDocuments.

2) Using one index per day. Each day, I create a new index. Again, document Y in the example above would have to be duplicated (is there any way around this?). The advantage here is that expiring documents means simply deleting the index corresponding to a particular day. The disadvantage is query performance, since the queries, which are already very complex, would have to be performed using MultiSearcher (if expiration is after 10 days, that's 10 indices to search across).

Tough to know for sure which option is better without testing, but does anyone have a gut reaction?
Any advice would be greatly appreciated! Thanks, Ariel
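If option 2 wins, the per-day layout makes expiration trivial: a day expires by simply never opening its directory again and deleting it. A sketch, assuming the Lucene 1.9/2.x API and a hypothetical one-directory-per-day layout, of opening a MultiSearcher over the last N days:

[code]
// Hedged sketch: one index directory per day, searched together.
// A layout like /indexes/2007-02-25, /indexes/2007-02-26, ... is an
// assumption; date-named directories sort chronologically.
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;

public Searcher openLastDays(File indexRoot, int days) throws IOException {
    File[] dirs = indexRoot.listFiles();
    Arrays.sort(dirs);
    int from = Math.max(0, dirs.length - days);
    Searchable[] searchables = new Searchable[dirs.length - from];
    for (int i = from; i < dirs.length; i++) {
        searchables[i - from] = new IndexSearcher(dirs[i].getPath());
    }
    // expired days are simply never opened; their directories can be
    // deleted once no open searcher still holds them
    return new MultiSearcher(searchables);
}
[/code]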
Re: Full disk space during indexing process with 120 GB of free disk space
Here is my source code where I convert PDF files to text for indexing. I got this source code from the Lucene in Action examples and adapted it for my needs. I hope you can help me fix this problem; and if you know a more efficient way to do it, please tell me:

[code]
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.encryption.DecryptDocument;
import org.pdfbox.exceptions.CryptographyException;
import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.util.PDFTextStripper;

import cu.co.cenatav.kernel.parser.DocumentHandler;
import cu.co.cenatav.kernel.parser.DocumentHandlerException;
import cu.co.cenatav.kernel.parser.schema.SchemaExtractor;

public class PDFBoxPDFHandler implements DocumentHandler {

    public static String password = "-password";

    public Document getDocument(InputStream is) throws DocumentHandlerException {
        COSDocument cosDoc = null;
        try {
            cosDoc = parseDocument(is);
        } catch (IOException e) {
            closeCOSDocument(cosDoc);
            throw new DocumentHandlerException("Cannot parse PDF document", e);
        }

        // decrypt the PDF document, if it is encrypted
        try {
            if (cosDoc.isEncrypted()) {
                DecryptDocument decryptor = new DecryptDocument(cosDoc);
                decryptor.decryptDocument(password);
            }
        } catch (CryptographyException e) {
            closeCOSDocument(cosDoc);
            throw new DocumentHandlerException("Cannot decrypt PDF document", e);
        } catch (InvalidPasswordException e) {
            closeCOSDocument(cosDoc);
            throw new DocumentHandlerException("Cannot decrypt PDF document", e);
        } catch (IOException e) {
            closeCOSDocument(cosDoc);
            throw new DocumentHandlerException("Cannot decrypt PDF document", e);
        }

        // extract the PDF document's textual content
        String bodyText = null;
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            bodyText = stripper.getText(new PDDocument(cosDoc));
        } catch (IOException e) {
            closeCOSDocument(cosDoc);
            throw new DocumentHandlerException("Cannot parse PDF document", e);
        }

        Document doc = new Document();
        if (bodyText != null) {
            PDDocument pdDoc = null;
            PDDocumentInformation docInfo = null;
            try {
                pdDoc = new PDDocument(cosDoc);
                docInfo = pdDoc.getDocumentInformation();
            } catch (Exception e) {
                closeCOSDocument(cosDoc);
                closePDDocument(pdDoc);
                System.err.println("Cannot extract metadata from PDF: " + e.getMessage());
            }

            SchemaExtractor schemaExtractor = new SchemaExtractor(bodyText);

            String author = null;
            if (docInfo != null) author = docInfo.getAuthor();
            if (author == null || author.equals("")) {
                // TODO: finish the schemaExtractor component
                List authors = schemaExtractor.getAuthor();
                Iterator it = authors.iterator();
                while (it.hasNext()) {
                    String a = (String) it.next();
                    doc.add(new Field("author", a, Field.Store.YES,
                            Field.Index.TOKENIZED, Field.TermVector.YES));
                }
            } else {
                doc.add(new Field("author", author, Field.Store.YES,
                        Field.Index.TOKENIZED, Field.TermVector.YES));
            }

            String title = null;
            if (docInfo != null) title = docInfo.getTitle();
            if (title == null || title.equals("")) {
                title = schemaExtractor.getTitle();
            }

            String keywords = null;
            if (docInfo != null) keywords = docInfo.getKeywords();
            if (keywords == null) keywords = "";

            String summary = null;
            if (docInfo != null)
                summary = docInfo.getProducer() + " " + docInfo.getCreator()
                        + " " + docInfo.getSubject();
            if (summary == null || summary.equals("")) {
                summary = schemaExtractor.getAbstract();
            }

            String content = schemaExtractor.getContent();

            Field fieldTitle = new Field("title", title, Field.Store.YES,
                    Field.Index.TOKENIZED, Field.TermVector.YES);
            // fieldTitle.setBoost(new Float(1.5));
            doc.add(fieldTitle);

            Field fieldSumary = new Field("sumary", summary, Field.Store.YES,
                    Field.Index.TOKENIZED, Field.TermVector.YES);
            // fieldSumary.setBoost(new Float(1.3));
            doc.add(fieldSumary);

            doc.add(new Field("content", content, Field.Store.YES,
[/code]
Full disk space during indexing process with 120 GB of free disk space
Hi everybody: I am getting a problem during the indexing process. I am indexing big amounts of text, most of it in PDF format, using PDFBox version 0.6. The free space on the hard disk before the indexing process begins is around 120 GB, but incredibly, even though my Lucene index hasn't yet reached 300 MB, my hard disk has no free space left. More incredible is that when I stop the indexing process, the free disk space rapidly returns to 120 GB. How can this happen if I don't copy the documents to the disk? I have a Linux machine for the indexing process. I have been thinking that it could be the temporary files of something, maybe PDFBox? Could you help me please? Greetings
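A note on the symptom above: old PDFBox releases buffer parsed object streams in temporary scratch files that are deleted only when the document is closed, so a handler that closes documents only on its error paths can fill the temp directory during a long indexing run (the space returning as soon as the process stops fits that pattern, though the exact cause depends on the PDFBox version, so treat this as an assumption). A minimal sketch that always releases resources, reusing the parseDocument/closeCOSDocument/closePDDocument helpers assumed by the handler posted in this thread:

[code]
// Hedged sketch: release PDFBox resources in finally, even on success.
// parseDocument, closeCOSDocument, and closePDDocument are the helper
// methods assumed by the PDFBoxPDFHandler posted in this thread.
import java.io.IOException;
import java.io.InputStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public Document getDocumentSafely(InputStream is) throws DocumentHandlerException {
    COSDocument cosDoc = null;
    PDDocument pdDoc = null;
    try {
        cosDoc = parseDocument(is);
        pdDoc = new PDDocument(cosDoc);
        String bodyText = new PDFTextStripper().getText(pdDoc);
        Document doc = new Document();
        doc.add(new Field("content", bodyText, Field.Store.YES,
                Field.Index.TOKENIZED, Field.TermVector.YES));
        return doc;
    } catch (IOException e) {
        throw new DocumentHandlerException("Cannot parse PDF document", e);
    } finally {
        // closing the documents deletes PDFBox's temporary scratch files
        closePDDocument(pdDoc);
        closeCOSDocument(cosDoc);
    }
}
[/code]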
Re: Big problem with big indexes
Here are pieces of my source code. First of all, I search all the indexes for a given query string with a parallel searcher. As you can see, I make a multi-field query. Then you can see the index format I use; I store all the fields in the index. My index is optimized.

[code]
public Hits search(String query) throws IOException {
    AnalyzerHandler analizer = new AnalyzerHandler();
    Query pquery = null;
    try {
        pquery = MultiFieldQueryParser.parse(query,
                new String[] { "title", "sumary", "filename", "content", "author" },
                analizer.getAnalyzer());
    } catch (ParseException e1) {
        e1.printStackTrace();
    }
    Searchable[] searchables = new Searchable[IndexCount];
    for (int i = 0; i < IndexCount; i++) {
        searchables[i] = new IndexSearcher(
                RAMIndexsManager.getInstance().getDirectoryAt(i));
    }
    Searcher parallelSearcher = new ParallelMultiSearcher(searchables);
    return parallelSearcher.search(pquery);
}
[/code]

Then in another method I obtain the fragments where the terms occur. As you can see, I use an EnglishAnalyzer that filters stopwords and does stemming, synonym detection, and so on:
[code]
public Vector getResults(Hits h, String string) throws IOException {
    Vector resultItems = new Vector();
    int cantHits = h.length();
    if (cantHits != 0) {
        QueryParser qparser = new QueryParser("content",
                new AnalyzerHandler().getAnalyzer());
        Query query1 = null;
        try {
            query1 = qparser.parse(string);
        } catch (ParseException e1) {
            e1.printStackTrace();
        }
        QueryScorer scorer = new QueryScorer(query1);
        Highlighter highlighter = new Highlighter(scorer);
        Fragmenter fragmenter = new SimpleFragmenter(150);
        highlighter.setTextFragmenter(fragmenter);
        for (int i = 0; i < cantHits; i++) {
            org.apache.lucene.document.Document doc = h.doc(i);
            String filename = doc.get("filename");
            filename = filename.substring(filename.indexOf("/") + 1);
            String filepath = doc.get("filepath");
            Integer id = new Integer(h.id(i));
            String score = h.score(i) + "";
            int fileSize = Integer.parseInt(doc.get("filesize"));
            String title = doc.get("title");
            String summary = doc.get("sumary");
            // build the highlighted fragments
            String body = h.doc(i).get("content");
            TokenStream stream = new EnglishAnalyzer().tokenStream("content",
                    new StringReader(body));
            String[] fragment = highlighter.getBestFragments(stream, body, 4);
            if (fragment.length == 0) {
                fragment = new String[1];
                fragment[0] = "";
            }
            StringBuilder buffer = new StringBuilder();
            for (int j = 0; j < fragment.length; j++) {
                buffer.append(validateCad(fragment[j]) + "...\n");
            }
            String stringFragment = buffer.toString();
            ResultItem result = new ResultItem();
            result.setFilename(filename);
            result.setFilepath(filepath);
            result.setFilesize(fileSize);
            result.setScore(Double.parseDouble(score));
            result.setFragment(stringFragment);
            result.setId(id);
            result.setSummary(summary);
            result.setTitle(title);
            resultItems.add(result);
        }
    }
    return resultItems;
}
[/code]

So these are the principal methods that perform the search. Could you tell me if I am doing something wrong or inefficient?
As you can see, I make a parallel search. I have a dual Xeon machine with two hyperthreaded CPUs at 2.4 GHz and 512 MB of RAM, but when I run the parallel searcher I can see at my command prompt on Linux that 3 of my 4 CPUs are always idle while only one is working. Why does that happen, if the parallel searcher is supposed to saturate all the CPUs with work? I hope you can help me.
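One thing worth checking in the code above: a new IndexSearcher is opened for every index on every call to search(), and a new EnglishAnalyzer is created per hit in getResults(); both are expensive. A sketch, assuming the Lucene 2.x API and the poster's own RAMIndexsManager class, of opening the parallel searcher once and reusing it across queries:

[code]
// Hedged sketch: build the ParallelMultiSearcher once at startup and
// reuse it. IndexCount and RAMIndexsManager are the poster's own
// classes, assumed from the code above.
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;

public class SearcherHolder {
    private static Searcher parallelSearcher;

    public static synchronized Searcher get(int indexCount) throws IOException {
        if (parallelSearcher == null) {
            Searchable[] searchables = new Searchable[indexCount];
            for (int i = 0; i < indexCount; i++) {
                searchables[i] = new IndexSearcher(
                        RAMIndexsManager.getInstance().getDirectoryAt(i));
            }
            parallelSearcher = new ParallelMultiSearcher(searchables);
        }
        return parallelSearcher;
    }
}
[/code]

Reusing one searcher also lets Lucene's internal caches warm up, which matters far more than parallelism on a box with 512 MB of RAM.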
Big problem with big indexes
Hi everybody: I have a big problem making parallel searches in big indexes. I have indexed over 60,000 articles with Lucene and distributed the indexes across 10 computer nodes so that each index does not exceed 60 MB in size. I make parallel searches in those indexes, but I get the search results after 40 MINUTES! Then I put the indexes in memory to do the parallel searches, but I still get the search results after 3 minutes! That's too much time to wait! How can I reduce the search time? Could you help me please? I need help! Greetings
RE: graphically representing an index
Hi Andrzej, Thanks for the tip, it does what I want. You are right, though, that it's of limited use for helping the user access data. But I'm sure it will come in handy for my own analysis. Best, Ariel

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 31, 2006 15:49
To: java-user@lucene.apache.org
Subject: Re: graphically representing an index

SOMMERIA KLEIN Ariel Ext VIACCESS-BU_DRM wrote: Hi all, I'm a newbie with Lucene and I'm looking to implement the following: I want to index posts from a forum and, rather than offering a search on the contents, graphically represent the contents of the index. More precisely, I would like to have a list of the most popular words, with a number next to each indicating how often they occur. The icing on the cake would be to be able to click on such a word and get the subset of posts including that word. Can Lucene be used for this? Has anyone already implemented it? Any links? I've dug around a bit without any success, but my apologies if this has already been dealt with.

See http://www.getopt.org/luke for an example of such functionality. However, I must disappoint you: the most frequent words in a corpus are quite probably also the most useless words. For English these are: the, a, to, for, by, in, can, I, ... So you will need to eliminate them from the top of the list to get any useful results.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

- Privileged/Confidential information may be contained in this e-mail and attachments. This e-mail, including attachments, constitutes non-public information intended to be conveyed only to the designated recipient(s).
If you are not an intended recipient, please delete this e-mail, including attachments, and notify us immediately. The unauthorized use, dissemination, distribution or reproduction of this e-mail, including attachments, is prohibited and may be unlawful. In general, the content of this e-mail and attachments does not constitute any form of commitment by VIACCESS SA.
graphically representing an index
Hi all, I'm a newbie with Lucene and I'm looking to implement the following: I want to index posts from a forum and, rather than offering a search on the contents, graphically represent the contents of the index. More precisely, I would like to have a list of the most popular words, with a number next to each indicating how often they occur. The icing on the cake would be to be able to click on such a word and get the subset of posts including that word. Can Lucene be used for this? Has anyone already implemented it? Any links? I've dug around a bit without any success, but my apologies if this has already been dealt with.
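Beyond Luke, the "most popular words" list itself is a short walk over the index's term dictionary. A sketch, assuming the Lucene 1.9/2.x API; the index path and field name are assumptions, and the TreeMap ranking is deliberately coarse (terms with equal frequencies collide):

[code]
// Hedged sketch: list the terms of the "contents" field with the
// highest document frequency. Field name and index path are
// illustrative assumptions.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import java.util.TreeMap;

IndexReader reader = IndexReader.open("/path/to/index");
TermEnum terms = reader.terms();
TreeMap top = new TreeMap(); // docFreq -> term (coarse: collisions overwrite)
while (terms.next()) {
    Term t = terms.term();
    if ("contents".equals(t.field())) {
        top.put(new Integer(terms.docFreq()), t.text());
    }
}
terms.close();
reader.close();
// iterate 'top' in descending key order for the most frequent words;
// clicking a word then maps to a simple TermQuery on that field
[/code]

As Andrzej notes, feed the field through an analyzer with a stopword list at indexing time, or the top of this list will be "the, a, to, ...".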
How to merge Lucene indexes?
Hi everybody: I need to know how to merge one index into another. I have a master index into which indexes from other nodes are added. I want to merge the indexes from the other nodes into the master index, so I made this method:

[code]
public void merge(String masterIndexDir, String indexToMergeDir) {
    FSDirectory fsDir;
    try {
        fsDir = FSDirectory.getDirectory(masterIndexDir, false);
        IndexReader indexToMerge = IndexReader.open(indexToMergeDir);
        AnalyzerHandler analyzer = new AnalyzerHandler();
        IndexWriter fsWriter = new IndexWriter(fsDir, analyzer.getAnalyzer(), false);
        fsWriter.addIndexes(new IndexReader[] { indexToMerge });
        fsWriter.close();
    } catch (IOException e) {
        System.err.println(e.getMessage());
        e.printStackTrace();
    }
}
[/code]

But with this method I get the following exception:

[code]
Lock obtain timed out: [EMAIL PROTECTED]:\DOCUME~1\a\LOCALS~1\Temp\lucene-f9488d465badf2bf80c713184c580f65-write.lock
java.io.IOException: Lock obtain timed out: [EMAIL PROTECTED]:\DOCUME~1\aromero\LOCALS~1\Temp\lucene-f9488d465badf2bf80c713184c580f65-write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:58)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:223)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:213)
        at cu.co.cenatav.kernel.indexing.MergeIndexes.merge(MergeIndexes.java:18)
        at cu.co.cenatav.kernel.indexing.MergeIndexes.main(MergeIndexes.java:36)
[/code]

Could you help me? I don't know why this is happening. Sorry for my English.
Re: How to merge Lucene indexes?
I think that does not solve my problem, because the line throwing the exception is this one:

[code]
IndexWriter fsWriter = new IndexWriter(fsDir, analyzer.getAnalyzer(), false);
[/code]

Besides, if I created a new master index each time I merge, I would lose the other indexes I merged into the master index before; that's why I can't pass true as the boolean parameter. I really need help, please. I'm open to any suggestion.

On 5/15/06, Daniel Naber [EMAIL PROTECTED] wrote: On Monday, 15 May 2006, 19:51, Ariel Isaac Romero wrote:

[code]
IndexReader indexToMerge = IndexReader.open(IndexToMerge);
AnalyzerHandler analyzer = new AnalyzerHandler();
IndexWriter fsWriter = new IndexWriter(fsDir, analyzer.getAnalyzer(), false);
[/code]

Don't open a reader; supply an array of Directories instead, and use an IndexWriter that creates a new index (true as the last parameter). Regards Daniel -- http://www.danielnaber.de
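For reference, IndexWriter also has an addIndexes(Directory[]) overload that merges existing on-disk indexes into the writer's index without recreating it, and a "Lock obtain timed out" usually means another IndexWriter (or a stale write.lock left by a crash) still holds the master index. A minimal sketch, assuming the Lucene 1.9/2.x API (StandardAnalyzer stands in for the poster's AnalyzerHandler):

[code]
// Hedged sketch: merge another on-disk index into an existing master
// index via addIndexes(Directory[]). Make sure no other IndexWriter is
// open on masterDir and that stale write.lock files have been cleared.
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public void merge(String masterDir, String otherDir) throws IOException {
    FSDirectory master = FSDirectory.getDirectory(masterDir, false); // false: append, don't recreate
    FSDirectory other = FSDirectory.getDirectory(otherDir, false);
    IndexWriter writer = new IndexWriter(master, new StandardAnalyzer(), false);
    try {
        writer.addIndexes(new Directory[] { other });
    } finally {
        writer.close(); // releases the write lock on the master index
    }
}
[/code]

This keeps the master's existing contents, which is what ruled out passing true as the IndexWriter's create flag.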