BTRFS ?
Hi, I spotted Uwe's comment in JIRA the other day about BTRFS, which might also bring some cool things for Lucene. Has anyone tried Lucene (or Solr or Elasticsearch) with BTRFS and seen some (performance) benefits over ext3/4 or xfs, for example? Thanks, Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/
JOB @ Sematext: Professional Services Lead = Head
Hello, We have what I think is a great opening at Sematext. The ideal candidate would be in New York, but that's not an absolute must. More info below + on http://sematext.com/about/jobs.html in job-ad-speak, but I'd be happy to describe what we are looking for, what we do, and what types of companies we work with in regular-human-speak off-line. DESCRIPTION Sematext is hiring a technical, hands-on Professional Services Lead to join, lead, and grow the Professional Services side of Sematext and potentially grow into the Head role. REQUIREMENTS * Experience working with Solr or Elasticsearch * Plan and coordinate customer engagements from a business and technical perspective * Identify customer pain points, needs, and success criteria at the onset of each engagement * Provide expert-level consulting and support services and strive to be a trustworthy advisor to a wide range of customers * Resolve complex search issues involving Solr or Elasticsearch * Identify opportunities to provide customers with additional value through our products or services * Communicate high-value use cases and customer feedback to our Product teams * Participate in the open source community by contributing bug fixes, improvements, answering questions, etc. EXPERIENCE * BS or higher in Engineering or Computer Science preferred * 2 or more years of IT Consulting and/or Professional Services experience required * Exposure to other related open source projects (Hadoop, Nutch, Kafka, Storm, Mahout, etc.) a plus * Experience with other commercial and open source search technologies a plus * Enterprise Search, eCommerce, and/or Business Intelligence experience a plus * Experience working in a startup a plus Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/
Re: MergePolicy for append-only indices?
Thanks Mike(s) Co. Added https://issues.apache.org/jira/browse/LUCENE-5419 Sounds like a killer feature :) Otis On Wed, Jan 8, 2014 at 4:17 AM, Michael McCandless luc...@mikemccandless.com wrote: On Mon, Jan 6, 2014 at 3:42 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: I think the key optimization when there are no deletions is that you don't need to renumber documents and can bulk-copy blocks of contiguous documents, and that is independent of merge policy. I think :) Merging of term vectors and stored fields will always use bulk-copy for contiguous chunks of non-deleted docs, so for the append-only case these will be the max chunk size and be efficient. We have no codec that implements bulk merging for postings, which would be interesting to pursue: in the append-only case it's possible, and merging of postings is normally by far the most time consuming step of a merge. Also, no RAM will be used holding the doc mapping, since the docIDs don't change. These benefits are independent of the MergePolicy. I think TieredMergePolicy will work fine for append-only; I'm not sure how you'd improve on its approach. It will in general renumber the docs, so if that's a problem, apps should use LogByteSizeMP. Mike McCandless http://blog.mikemccandless.com
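Mike's advice above (use LogByteSizeMergePolicy if docID renumbering is a problem) can be sketched as a writer configuration. This is a minimal sketch, not from the thread: the class and method names are from the Lucene 4.x API, but the index path and wrapper class are made up for illustration.

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AppendOnlyWriter {
    public static IndexWriter open() throws Exception {
        // Hypothetical index path.
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        IndexWriterConfig iwc =
            new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
        // LogByteSizeMergePolicy merges only adjacent segments, so docIDs keep
        // their insertion order; TieredMergePolicy may pick non-adjacent
        // segments and renumber docs in the merged result.
        iwc.setMergePolicy(new LogByteSizeMergePolicy());
        return new IndexWriter(dir, iwc);
    }
}
```

This is a configuration fragment, not a complete indexing program; it needs the Lucene core jar on the classpath.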
MergePolicy for append-only indices?
Hi, (cross-posting to both Solr and Lucene user lists because while this is a Lucene-level question, I suspect a lot of people who know about this or are interested in this subject are actually on the Solr list) I have a large append-only index and I looked at merge policies hoping to identify one that is naturally more suitable for indices without any updates and deletions, just adds. I've read http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/index/TieredMergePolicy.html and the javadocs for its cousins, but it doesn't look like any of them is more suited for an append-only index than the others, and Tiered MP, having more knobs, is probably the best one to use. I was wondering if I was missing something, if one of the MPs is in fact better for append-only indices OR if one can suggest how one could write a custom MP that's specialized for append-only indices. Thanks, Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/
Re: Lucene for Log file indexing and search
Hi, Logstash is the piece that first touches your logs, filters them, and then outputs them somewhere. People often use it with ElasticSearch. Once logs are in ES, they look at them with Kibana. Note: somebody should write a Logstash output for Solr! In the Solr world there is Flume, which has a Solr sink. Flume has file tailing capability, and Cloudera's Morphlines should allow one to process the log much like Logstash filters let you process them. At Sematext we've built something called Logsene - http://sematext.com/logsene/ , which uses some of the above technologies or plays nice with them. Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm From: Ivan Krišto ivan.kri...@gmail.com To: java-user@lucene.apache.org Cc: gudiseashok gudise.as...@gmail.com Sent: Friday, September 20, 2013 1:59 AM Subject: Re: Lucene for Log file indexing and search On 09/19/2013 07:41 PM, gudiseashok wrote: I am learning Lucene and developing an application to do searches in log files on multi-environment boxes. I have googled for a deeper understanding, but all the examples just refer to simple field searches (i.e. field types associated with text search) and returning results. Hello! If you don't have some extremely specific needs, check out Logstash -- http://logstash.net/ http://www.elasticsearch.org/overview/logstash/ It is powered by ElasticSearch (a product similar to Solr, also based on Lucene). Regards, Ivan Krišto
Re: Content based recommender using lucene/solr
Hi, Have a look at http://www.youtube.com/watch?v=13yQbaW2V4Y . I'd say it's easier than Mahout, especially if you already have and know your way around Solr. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Fri, Jun 28, 2013 at 2:02 PM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: Hey saikat, thanks for your suggestion. I've looked into mahout and other alternatives for computing k nearest neighbors. I would have to run a job to compute the k nearest neighbors and track them in the index for retrieval. I wanted to see if this was something I could do with lucene using lucene's scoring function and solr's morelikethis component. The job you specifically mention is for Item based recommendation which would require me to track the different items users have viewed. I'm looking for a content based approach where I would use a distance measure to establish how near items are (how similar) and have some kind of training phase to adjust weights. On Fri, Jun 28, 2013 at 12:42 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Why not just use mahout to do this, there is an item similarity algorithm in mahout that does exactly this :) https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html You can use mahout in distributed and non-distributed mode as well. From: lcguerreroc...@gmail.com Date: Fri, 28 Jun 2013 12:16:57 -0500 Subject: Content based recommender using lucene/solr To: solr-u...@lucene.apache.org; java-user@lucene.apache.org Hi, I'm using lucene and solr right now in a production environment with an index of about a million docs. I'm working on a recommender that basically would list the n most similar items to the user based on the current item he is viewing. 
I've been thinking of using solr/lucene since I already have all docs available and I want a quick version that can be deployed while we work on a more robust recommender. How about overriding the default similarity so that it scores documents based on the euclidean distance of normalized item attributes and then using a morelikethis component to pass in the attributes of the item for which I want to generate recommendations? I know it has its issues like recomputing scores/normalization/weight application at query time which could make this idea unfeasible/impractical. I'm at a very preliminary stage right now with this and would love some suggestions from experienced users. thank you, Luis Guerrero -- Luis Carlos Guerrero Covo M.S. Computer Engineering (57) 3183542047 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
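The distance-based scoring Luis describes can be sketched in plain Java, independent of Lucene. All names and numbers below are illustrative, not from any actual recommender: normalize each item's attributes to a common scale, then rank candidates by Euclidean distance to the currently viewed item, smallest distance first.

```java
// Sketch of Euclidean-distance scoring over normalized item attributes.
public class EuclideanScore {
    // Scale each attribute to [0, 1] using per-attribute min/max bounds.
    public static double[] normalize(double[] v, double[] min, double[] max) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (v[i] - min[i]) / (max[i] - min[i]);
        }
        return out;
    }

    // Smaller distance means more similar items.
    public static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] min = {0, 0}, max = {100, 10}; // made-up attribute ranges
        double[] current = normalize(new double[]{50, 5}, min, max);
        double[] candidate = normalize(new double[]{60, 5}, min, max);
        System.out.println(distance(current, candidate)); // roughly 0.1
    }
}
```

Per-attribute weights would be one extra multiplier inside the distance loop, which is where the "training phase to adjust weights" would plug in.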
Re: Content based recommender using lucene/solr
Hi, It doesn't have to be one or the other. In the past I've built a news recommender engine based on CF (Mahout) and combined it with a Content Similarity-based engine (it wasn't Solr/Lucene, but something custom that worked with ngrams, though it may well have been Lucene/Solr/ES). It worked well. If you haven't worked with Mahout before, I'd suggest the approach in that video and going from there to Mahout only if it's limiting. See Ted's stuff on this topic, too: http://www.slideshare.net/tdunning/search-as-recommendation + http://berlinbuzzwords.de/sessions/multi-modal-recommendation-algorithms (note: Mahout, Solr, Pig) Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Fri, Jun 28, 2013 at 2:07 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: You could build a custom recommender in mahout to accomplish this, also just out of curiosity why the content based approach as opposed to building a recommender based on co-occurrence. One other thing, what is your data size, are you looking at scale where you need something like hadoop? From: lcguerreroc...@gmail.com Date: Fri, 28 Jun 2013 13:02:00 -0500 Subject: Re: Content based recommender using lucene/solr To: solr-u...@lucene.apache.org CC: java-user@lucene.apache.org Hey saikat, thanks for your suggestion. I've looked into mahout and other alternatives for computing k nearest neighbors. I would have to run a job to compute the k nearest neighbors and track them in the index for retrieval. I wanted to see if this was something I could do with lucene using lucene's scoring function and solr's morelikethis component. The job you specifically mention is for Item based recommendation which would require me to track the different items users have viewed. I'm looking for a content based approach where I would use a distance measure to establish how near items are (how similar) and have some kind of training phase to adjust weights. 
On Fri, Jun 28, 2013 at 12:42 PM, Saikat Kanjilal sxk1...@hotmail.comwrote: Why not just use mahout to do this, there is an item similarity algorithm in mahout that does exactly this :) https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html You can use mahout in distributed and non-distributed mode as well. From: lcguerreroc...@gmail.com Date: Fri, 28 Jun 2013 12:16:57 -0500 Subject: Content based recommender using lucene/solr To: solr-u...@lucene.apache.org; java-user@lucene.apache.org Hi, I'm using lucene and solr right now in a production environment with an index of about a million docs. I'm working on a recommender that basically would list the n most similar items to the user based on the current item he is viewing. I've been thinking of using solr/lucene since I already have all docs available and I want a quick version that can be deployed while we work on a more robust recommender. How about overriding the default similarity so that it scores documents based on the euclidean distance of normalized item attributes and then using a morelikethis component to pass in the attributes of the item for which I want to generate recommendations? I know it has its issues like recomputing scores/normalization/weight application at query time which could make this idea unfeasible/impractical. I'm at a very preliminary stage right now with this and would love some suggestions from experienced users. thank you, Luis Guerrero -- Luis Carlos Guerrero Covo M.S. Computer Engineering (57) 3183542047 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Document scoring order?
Hi, When Lucene scores matching documents, what is the order in which documents are processed/scored, and can that be changed? I'm guessing it scores matches in whichever order they are stored in the index/on disk, which means by increasing docIDs? I do see that some out-of-order scoring is possible, but can one visit docs to score in, say, lexicographical order of a specific document field? Thanks, Otis -- Solr ElasticSearch Support http://sematext.com/
Re: Any benchmark corpus to evaluate performance of a specified query?
Hi, Maybe https://github.com/sematext/ActionGenerator could be of help? We use it to produce query load for Solr and ElasticSearch and the whole thing is extensible, so you could easily add support for talking directly to Lucene. Oh, and there is the benchmark module in Lucene: http://lucene.apache.org/core/4_0_0/benchmark/index.html Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm From: lukai lukai1...@gmail.com To: java-user@lucene.apache.org Sent: Wednesday, January 16, 2013 2:19 PM Subject: Any benchmark corpus to evaluate performance of a specified query? As the title says, do we have any benchmark corpus to test the performance of a new query implementation? Like 10k docs, or 1M docs? Thanks,
Poll: how to report # of docs in index over time
Hello, Quick poll for those who have an opinion about what index size monitoring should report in terms of the number of documents in the index. Poll: http://blog.sematext.com/2012/02/13/poll-solr-index-size-monitoring/ For example, imagine that in some 5-minute time period (say 10:00 AM to 10:05 AM) we check the index 5 times (in reality we do it much more frequently) and each time we do that we find the index has a different number of documents in it: 10, 15, 20, 25, and finally 30 documents. Now imagine this data as a graph showing the number of indexed documents over time, but with the smallest time period shown being a 5-minute interval. Given the above example, how many documents should this graph report for the 10:00 – 10:05 AM period? Should it show the minimum – 10? Average – 20? Median – 20? Maximum – 30? Minimum, average, and maximum – 10, 20, 30? Something else? Thanks! Otis
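For concreteness, the candidate roll-ups for the example bucket can be computed directly. A plain-Java sketch using the sample counts from the post (the class name is made up):

```java
import java.util.Arrays;

// Roll up the doc counts sampled in one 5-minute monitoring bucket.
public class DocCountRollup {
    public static void main(String[] args) {
        int[] samples = {10, 15, 20, 25, 30}; // doc counts observed 10:00-10:05

        int min = Arrays.stream(samples).min().getAsInt();           // 10
        int max = Arrays.stream(samples).max().getAsInt();           // 30
        double avg = Arrays.stream(samples).average().getAsDouble(); // 20.0

        // Prints: 10 / 20.0 / 30
        System.out.println(min + " / " + avg + " / " + max);
    }
}
```

Reporting all three (min/avg/max) keeps the most information; any single number hides either the growth within the bucket or its endpoints.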
Re: How can i search lucene java user list archive?
Have a look at http://search-lucene.com/ where you can search the Lucene mailing list archives (user, dev, common), its web site, wiki, source code, JIRA, etc., as well as the same types of data for Solr, Nutch, and so on. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: janwen tom.grade1...@163.com To: java-user java-user@lucene.apache.org Sent: Thursday, October 20, 2011 4:46 AM Subject: How can i search lucene java user list archive? I want to know how to search the java user list archive. There is no search function on the site: http://mail-archives.apache.org/mod_mbox/lucene-java-user/ Any idea? thanks 2011-10-20 janwen | China website : http://www.qianpin.com/
Hit search-lucene.com a little harder
Hello folks, Do you ever use http://search-lucene.com (SL) or http://search-hadoop.com (SH)? If you do, I'd like to ask you for a small favour: We are at Lucene Eurocon in Barcelona and we are about to show the Search Analytics [1] and Performance Monitoring [2] tools/services we've built and that we use on these two sites. We would like to show the audience various pretty graphs and would love those graphs to be a little less sparse. :) So if you use SL and/or SH, please feel free to use them a little extra now, if you feel like helping. [1] http://sematext.com/search-analytics/index.html [2] http://sematext.com/spm/solr-performance-monitoring/index.html I think we'll open up both of the above services to the public tomorrow (and 100% free for an undetermined length of time), but if you don't have time to sign up and set it up for yourself, yet are interested in reports, graphs, etc., let me know and we'll put together a blog post or something and include interesting things in it. Thanks, Otis
Re: OutOfMemoryError
Bok Tamara, You didn't say what -Xmx value you are using. Try a little higher value. Note that loading field values (and it looks like this one may be big because it is compressed) from a lot of hits is not recommended. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Tamara Bobic tamara.bo...@scai.fraunhofer.de To: java-user@lucene.apache.org Cc: Roman Klinger roman.klin...@scai.fraunhofer.de Sent: Tuesday, October 18, 2011 12:21 PM Subject: OutOfMemoryError Hi all, I am using Lucene to query Medline abstracts and as a result I get around 3 million hits. Each of the hits is processed and information from a certain field is used. After a certain number of hits, somewhere around 1 million (not always the same number), I get an OutOfMemory exception that looks like this: Exception in thread "main" java.lang.OutOfMemoryError at java.util.zip.Inflater.inflateBytes(Native Method) at java.util.zip.Inflater.inflate(Inflater.java:221) at java.util.zip.Inflater.inflate(Inflater.java:238) at org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108) at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:609) at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:385) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:231) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:1013) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:520) at org.apache.lucene.index.FilterIndexReader.document(FilterIndexReader.java:149) at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:152) at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:156) at org.apache.lucene.search.Hits.doc(Hits.java:180) at de.fhg.scai.bio.tamara.corpusBuilding.LuceneCmdLineInterface.queryMedline(LuceneCmdLineInterface.java:178) at 
de.fhg.scai.bio.tamara.corpusBuilding.LuceneCmdLineInterface.main(LuceneCmdLineInterface.java:152) The line which causes the problem is: String docText = hits.doc(j).getField("DOCUMENT").stringValue(); I am using Java 1.6 and I tried solving this issue with different garbage collectors (-XX:+UseParallelGC and -XX:+UseParallelOldGC) but it didn't help. Does anyone have any idea how to solve this problem? There is also an official bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6293787 Help is much appreciated. :) Best regards, Tamara Bobic
Castle for Lucene/Solr?
Hello, I saw mentions of something called Castle a while back, but only now looked at what it is, and it sounds like something that's potentially interesting/useful (performance-wise) for Lucene/Solr. See http://twitter.com/#!/otisg/status/109768673467699200 Has anyone tried it with Lucene/Solr by any chance? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: distributing the indexing process
We've used Hadoop MapReduce with Solr to parallelize indexing for a customer and that brought their multi-hour indexing process down to a couple of minutes. There is/was also a Lucene-level contrib in Hadoop that makes use of MapReduce to parallelize indexing. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message - From: Guru Chandar guru.chan...@consona.com To: java-user@lucene.apache.org Cc: Sent: Thursday, June 30, 2011 5:12 AM Subject: distributing the indexing process If we have to index a lot of documents, is there a way to divide the documents into multiple sets and index them on multiple machines in parallel, and then merge the resulting indexes back into a single machine? If yes, will the result be logically equivalent to indexing all the documents on a single machine? Thanks, -gc
Re: How do I sort lucene search results by relevance and time?
If only you were using Solr http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Johnbin Wang johnbin.w...@gmail.com To: java-user@lucene.apache.org Sent: Sun, May 8, 2011 11:59:11 PM Subject: How do I sort lucene search results by relevance and time? What I want to do is just like Google search results. The results on the first page are the most relevant and also recent documents, but not absolutely sorted by time desc. -- cheers, Johnbin Wang
Re: AW: AW: AW: AW: fuzzy prefix search
We do have EdgeNGramTokenizer if that is what you are after. See how Solr uses it here: http://search-lucene.com/c/Solr:/src/java/org/apache/solr/analysis/EdgeNGramTokenizerFactory.java||EdgeNGramTokenizer Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Wed, May 4, 2011 2:07:40 AM Subject: AW: AW: AW: AW: fuzzy prefix search I know this is just an example. But even the WhitespaceAnalyzer takes the words apart, which I don't want. I would like the phrases as they are (maximum 3 words, e.g. Merlot del Ticino, ...) to be n-gram-ed. I hence want to have the n-grams: Mer Merl Merlo Merlot Merlot Merlot d ... Regards Clemens -Ursprüngliche Nachricht- Von: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Gesendet: Dienstag, 3. Mai 2011 23:12 An: java-user@lucene.apache.org Betreff: Re: AW: AW: AW: fuzzy prefix search Clemens - that's just an example. Stick another tokenizer in there, like WhitespaceTokenizer, for example. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Tue, May 3, 2011 4:31:14 PM Subject: AW: AW: AW: fuzzy prefix search But doesn't the KeyWordTokenizer extract single words out of the stream? I would like to create n-grams on the stream (field content) as it is... -Ursprüngliche Nachricht- Von: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Gesendet: Dienstag, 3. 
Mai 2011 21:31 An: java-user@lucene.apache.org Betreff: Re: AW: AW: fuzzy prefix search Clemens, Something a la: public TokenStream tokenStream (String fieldName, Reader r) { return new EdgeNGramTokenFilter(new KeywordTokenizer(r), EdgeNGramTokenFilter.Side.FRONT, 1, 4); } Check out page 265 of Lucene in Action 2. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Tue, May 3, 2011 12:57:39 PM Subject: AW: AW: fuzzy prefix search What does a simple Analyzer look like that just n-grams the docs/fields? class SimpleNGramAnalyzer extends Analyzer { @Override public TokenStream tokenStream ( String fieldName, Reader reader ) { EdgeNGramTokenFilter... ??? } } -Ursprüngliche Nachricht- Von: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Gesendet: Dienstag, 3. Mai 2011 13:36 An: java-user@lucene.apache.org Betreff: Re: AW: fuzzy prefix search Hi, I didn't read this thread closely, but just in case: * Is this something you can handle with synonyms? * If this is for English and you are trying to handle typos, there is a list of common English misspellings out there that you could use for this perhaps. * Have you considered n-gramming your tokens? Not sure if this would help, didn't read messages/examples closely enough, but you may want to look at this if you haven't done so yet. 
Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Tue, May 3, 2011 5:25:30 AM Subject: AW: fuzzy prefix search PrefixQuery I'd like the combination of prefix and fuzzy ;-) because people could also type menlo or märl and in any of these cases I'd like to get a hit on Merlot (for suggesting Merlot) -Ursprüngliche Nachricht- Von: Ian Lea [mailto:ian@gmail.com] Gesendet: Dienstag, 3. Mai 2011 11:22 An: java-user@lucene.apache.org Betreff: Re: fuzzy prefix search I'd assumed that FuzzyQuery wouldn't ignore case but I could be wrong. What would be the edit distance between mer and merlot? Would it be less than 1.5 which I reckon would be the value of length(term)*0.5 as detailed in the javadocs? Seems
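What Clemens is after (feeding the whole field value through KeywordTokenizer and edge-n-gramming it) produces every prefix of the input between the min and max gram sizes. A plain-Java sketch of that expansion, with no Lucene involved and a made-up class name:

```java
import java.util.ArrayList;
import java.util.List;

// Emit the leading-edge n-grams of a string: its prefixes of length
// minGram through maxGram, capped at the input length.
public class EdgeNGrams {
    public static List<String> edgeNGrams(String input, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, input.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(input.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [Mer, Merl, Merlo, Merlot] -- the grams Clemens listed.
        System.out.println(edgeNGrams("Merlot", 3, 6));
    }
}
```

For a phrase like "Merlot del Ticino" the same call simply keeps going past the space, which is why the whole-field KeywordTokenizer approach, rather than a word-splitting tokenizer, gives the grams Clemens wants.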
Re: AW: fuzzy prefix search
Hi, I didn't read this thread closely, but just in case: * Is this something you can handle with synonyms? * If this is for English and you are trying to handle typos, there is a list of common English misspellings out there that you could use for this perhaps. * Have you considered n-gramming your tokens? Not sure if this would help, didn't read messages/examples closely enough, but you may want to look at this if you haven't done so yet. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Tue, May 3, 2011 5:25:30 AM Subject: AW: fuzzy prefix search PrefixQuery I'd like the combination of prefix and fuzzy ;-) because people could also type menlo or märl and in any of these cases I'd like to get a hit on Merlot (for suggesting Merlot) -Ursprüngliche Nachricht- Von: Ian Lea [mailto:ian@gmail.com] Gesendet: Dienstag, 3. Mai 2011 11:22 An: java-user@lucene.apache.org Betreff: Re: fuzzy prefix search I'd assumed that FuzzyQuery wouldn't ignore case but I could be wrong. What would be the edit distance between mer and merlot? Would it be less than 1.5 which I reckon would be the value of length(term)*0.5 as detailed in the javadocs? Seems unlikely, but I don't really know anything about the Levenshtein (edit distance) algorithm as used by FuzzyQuery. Wouldn't a PrefixQuery be more appropriate here? -- Ian. On Tue, May 3, 2011 at 10:10 AM, Clemens Wyss clemens...@mysign.ch wrote: Unfortunately lowercasing doesn't help. Also, doesn't the FuzzyQuery ignore casing? -Ursprüngliche Nachricht- Von: Ian Lea [mailto:ian@gmail.com] Gesendet: Dienstag, 3. Mai 2011 11:06 An: java-user@lucene.apache.org Betreff: Re: fuzzy prefix search Mer != mer. The latter will be what is indexed because StandardAnalyzer calls LowerCaseFilter. -- Ian. 
On Tue, May 3, 2011 at 9:56 AM, Clemens Wyss clemens...@mysign.ch wrote: Sorry for coming back to my issue. Can anybody explain why my simple unit test below fails? Any hint/help appreciated. Directory directory = new RAMDirectory(); IndexWriter indexWriter = new IndexWriter( directory, new StandardAnalyzer( Version.LUCENE_31 ), IndexWriter.MaxFieldLength.UNLIMITED ); Document document = new Document(); document.add( new Field( "test", "Merlot", Field.Store.YES, Field.Index.ANALYZED ) ); indexWriter.addDocument( document ); IndexReader indexReader = indexWriter.getReader(); IndexSearcher searcher = new IndexSearcher( indexReader ); Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.5f, 0, 10 ); // or Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.5f); TopDocs result = searcher.search( q, 10 ); Assert.assertEquals( 1, result.totalHits ); - Clemens -Ursprüngliche Nachricht- Von: Clemens Wyss [mailto:clemens...@mysign.ch] Gesendet: Montag, 2. Mai 2011 23:01 An: java-user@lucene.apache.org Betreff: AW: fuzzy prefix search Is it the combination of FuzzyQuery and Term which makes the search go for word boundaries? -Ursprüngliche Nachricht- Von: Clemens Wyss [mailto:clemens...@mysign.ch] Gesendet: Montag, 2. Mai 2011 14:13 An: java-user@lucene.apache.org Betreff: AW: fuzzy prefix search I tried this too, but unfortunately I only get hits when the search term is at least as long as the word to be looked up. E.g.: ... 
Directory directory = new RAMDirectory(); IndexWriter indexWriter = new IndexWriter( directory, IndexManager.getIndexingAnalyzer( LOCALE_DE ), IndexWriter.MaxFieldLength.UNLIMITED ); Document document = new Document(); document.add( new Field( "test", "Merlot", Field.Store.YES, Field.Index.ANALYZED ) ); indexWriter.addDocument( document ); IndexReader indexReader = indexWriter.getReader(); IndexSearcher searcher = new IndexSearcher( indexReader ); Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.6f, 1 ); TopDocs result = searcher.search( q, 10 ); Assert.assertEquals( 1, result.totalHits ); ... -Ursprüngliche Nachricht- Von: Uwe Schindler [mailto:u...@thetaphi.de] Gesendet: Montag, 2. Mai 2011 13:50 An: java-user@lucene.apache.org Betreff: RE: fuzzy prefix search Hi, You can pass an integer to FuzzyQuery which defines the number of characters that are seen as prefix. So all terms must match this prefix and the rest of each term is matched using fuzzy. Uwe - Uwe
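Ian's edit-distance reasoning in this thread can be checked directly: FuzzyQuery is based on Levenshtein distance, and with a minimumSimilarity of 0.5 on the 3-character term "mer" only about 1.5 edits are allowed, while turning "mer" into "merlot" takes 3 insertions, so the fuzzy query can never match. A plain-Java sketch of the distance computation (the class name is made up; this is the textbook dynamic-programming algorithm, not Lucene's internal implementation):

```java
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn string a into string b.
public class EditDistance {
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("mer", "merlot")); // 3 > 1.5 allowed edits
    }
}
```

This is why the thread converges on prefix/edge-n-gram approaches instead of FuzzyQuery for suggest-as-you-type: a short prefix is always "far" from the full term by edit distance.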
Re: AW: AW: fuzzy prefix search
Clemens, Something a la: public TokenStream tokenStream (String fieldName, Reader r) { return new EdgeNGramTokenFilter(new KeywordTokenizer(r), EdgeNGramTokenFilter.Side.FRONT, 1, 4); } Check out page 265 of Lucene in Action 2. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Tue, May 3, 2011 12:57:39 PM Subject: AW: AW: fuzzy prefix search What does a simple Analyzer look like that just n-grams the docs/fields? class SimpleNGramAnalyzer extends Analyzer { @Override public TokenStream tokenStream ( String fieldName, Reader reader ) { EdgeNGramTokenFilter... ??? } } -Ursprüngliche Nachricht- Von: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Gesendet: Dienstag, 3. Mai 2011 13:36 An: java-user@lucene.apache.org Betreff: Re: AW: fuzzy prefix search Hi, I didn't read this thread closely, but just in case: * Is this something you can handle with synonyms? * If this is for English and you are trying to handle typos, there is a list of common English misspellings out there that you could use for this perhaps. * Have you considered n-gramming your tokens? Not sure if this would help, didn't read messages/examples closely enough, but you may want to look at this if you haven't done so yet. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Tue, May 3, 2011 5:25:30 AM Subject: AW: fuzzy prefix search PrefixQuery I'd like the combination of prefix and fuzzy ;-) because people could also type menlo or märl and in any of these cases I'd like to get a hit on Merlot (for suggesting Merlot) -Ursprüngliche Nachricht- Von: Ian Lea [mailto:ian@gmail.com] Gesendet: Dienstag, 3. 
Mai 2011 11:22 An: java-user@lucene.apache.org Betreff: Re: fuzzy prefix search I'd assumed that FuzzyQuery wouldn't ignore case but I could be wrong. What would be the edit distance between mer and merlot? Would it be less that 1.5 which I reckon would be the value of length(term)*0.5 as detailed in the javadocs? Seems unlikely, but I don't really know anything about the Levenshtein (edit distance) algorithm as used by FuzzyQuery. Wouldn't a PrefixQuery be more appropriate here? -- Ian. On Tue, May 3, 2011 at 10:10 AM, Clemens Wyss clemens...@mysign.ch wrote: Unfortunately lowercasing doesn't help. Also, doesn't the FuzzyQuery ignore casing? -Ursprüngliche Nachricht- Von: Ian Lea [mailto:ian@gmail.com] Gesendet: Dienstag, 3. Mai 2011 11:06 An: java-user@lucene.apache.org Betreff: Re: fuzzy prefix search Mer != mer. The latter will be what is indexed because StandardAnalyzer calls LowerCaseFilter. -- Ian. On Tue, May 3, 2011 at 9:56 AM, Clemens Wyss clemens...@mysign.ch wrote: Sorry for coming back to my issue. Can anybody explain why my simple unit test below fails? Any hint/help appreciated. Directory directory = new RAMDirectory(); IndexWriter indexWriter = new IndexWriter( directory, new StandardAnalyzer( Version.LUCENE_31 ), IndexWriter.MaxFieldLength.UNLIMITED ); Document document = new Document(); document.add( new Field( test, Merlot, Field.Store.YES, Field.Index.ANALYZED ) ); indexWriter.addDocument( document ); IndexReader indexReader = indexWriter.getReader(); IndexSearcher searcher = new IndexSearcher( indexReader ); Query q = new FuzzyQuery( new Term( test, Mer ), 0.5f, 0, 10 ); // or Query q = new FuzzyQuery( new Term( test, Mer ), 0.5f); TopDocs result = searcher.search( q, 10 ); Assert.assertEquals( 1, result.totalHits ); - Clemens -Ursprüngliche Nachricht- Von: Clemens Wyss [mailto:clemens...@mysign.ch] Gesendet: Montag, 2. 
Mai 2011 23:01 An: java-user@lucene.apache.org Betreff: AW: fuzzy prefix search Is it the combination of FuzzyQuery and Term which makes the search to go for word boundaries? -Ursprüngliche Nachricht- Von: Clemens Wyss [mailto:clemens...@mysign.ch] Gesendet: Montag, 2. Mai 2011 14:13 An: java-user@lucene.apache.org Betreff: AW: fuzzy prefix search I tried this too, but unfortunately I only get hits when
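Ian's back-of-the-envelope calculation above can be checked with a few lines of plain Java. This is a sketch independent of Lucene's internals; the class and method names are illustrative:

```java
public class EditDistance {

    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        // "mer" -> "merlot" needs 3 insertions, but per Ian's reading of the
        // javadocs only about length("mer") * 0.5 = 1.5 edits are tolerated
        // at minimumSimilarity 0.5 - so the FuzzyQuery in the test cannot match.
        System.out.println(levenshtein("mer", "merlot")); // prints 3
    }
}
```

That distance of 3 is why the unit test above fails regardless of casing.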
Re: AW: AW: AW: fuzzy prefix search
Clemens - that's just an example. Stick another tokenizer in there, like WhitespaceTokenizer, for example. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Tue, May 3, 2011 4:31:14 PM Subject: AW: AW: AW: fuzzy prefix search But doesn't the KeywordTokenizer extract single words out of the stream? I would like to create n-grams on the stream (field content) as it is... -----Original Message----- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, May 3, 2011 21:31 To: java-user@lucene.apache.org Subject: Re: AW: AW: fuzzy prefix search Clemens, Something à la: public TokenStream tokenStream (String fieldName, Reader r) { return new EdgeNGramTokenFilter(new KeywordTokenizer(r), EdgeNGramTokenFilter.Side.FRONT, 1, 4); } Check out page 265 of Lucene in Action 2. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Tue, May 3, 2011 12:57:39 PM Subject: AW: AW: fuzzy prefix search How does a simple Analyzer look that just n-grams the docs/fields? class SimpleNGramAnalyzer extends Analyzer { @Override public TokenStream tokenStream ( String fieldName, Reader reader ) { EdgeNGramTokenFilter... ??? } } -----Original Message----- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, May 3, 2011 13:36 To: java-user@lucene.apache.org Subject: Re: AW: fuzzy prefix search Hi, I didn't read this thread closely, but just in case: * Is this something you can handle with synonyms? * If this is for English and you are trying to handle typos, there is a list of common English misspellings out there that you could use for this perhaps. * Have you considered n-gramming your tokens? Not sure if this would help, didn't read messages/examples closely enough, but you may want to look at this if you haven't done so yet. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Tue, May 3, 2011 5:25:30 AM Subject: AW: fuzzy prefix search PrefixQuery I'd like the combination of prefix and fuzzy ;-) because people could also type menlo or märl and in any of these cases I'd like to get a hit on Merlot (for suggesting Merlot) -----Original Message----- From: Ian Lea [mailto:ian@gmail.com] Sent: Tuesday, May 3, 2011 11:22 To: java-user@lucene.apache.org Subject: Re: fuzzy prefix search I'd assumed that FuzzyQuery wouldn't ignore case but I could be wrong. What would be the edit distance between mer and merlot? Would it be less than 1.5, which I reckon would be the value of length(term)*0.5 as detailed in the javadocs? Seems unlikely, but I don't really know anything about the Levenshtein (edit distance) algorithm as used by FuzzyQuery. Wouldn't a PrefixQuery be more appropriate here? -- Ian. On Tue, May 3, 2011 at 10:10 AM, Clemens Wyss clemens...@mysign.ch wrote: Unfortunately lowercasing doesn't help. Also, doesn't the FuzzyQuery ignore casing? -----Original Message----- From: Ian Lea [mailto:ian@gmail.com] Sent: Tuesday, May 3, 2011 11:06 To: java-user@lucene.apache.org Subject: Re: fuzzy prefix search Mer != mer. The latter will be what is indexed because StandardAnalyzer calls LowerCaseFilter. -- Ian. On Tue, May 3, 2011 at 9:56 AM, Clemens Wyss clemens...@mysign.ch wrote: Sorry for coming back to my issue. Can anybody explain why my simple unit test below fails? Any hint/help appreciated. 
Directory directory = new RAMDirectory(); IndexWriter indexWriter = new IndexWriter( directory, new StandardAnalyzer( Version.LUCENE_31 ), IndexWriter.MaxFieldLength.UNLIMITED ); Document document = new Document(); document.add( new Field( test
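What the EdgeNGramTokenFilter configuration discussed in this thread produces per token can be sketched without Lucene at all. The helper name below is made up; it mimics the front-edge behaviour of Side.FRONT with minGram=1, maxGram=4:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGrams {

    // Front edge n-grams of one token, mimicking what an edge n-gram
    // filter configured with Side.FRONT, minGram, maxGram would emit.
    static List<String> frontEdgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        for (int n = minGram; n <= Math.min(maxGram, token.length()); n++) {
            grams.add(token.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Indexing "merlot" this way lets prefix queries m/me/mer/merl match.
        System.out.println(frontEdgeNGrams("merlot", 1, 4));
    }
}
```

This is also the difference behind the KeywordTokenizer question above: KeywordTokenizer treats the whole field value as one token, so the grams come off the start of the entire string, while a tokenizer like WhitespaceTokenizer yields grams for each individual word.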
Re: MultiPhraseQuery slowing down over time in Lucene 3.1
Hi, I think this describes what's going on:

10 load N stored queries
20 parse N stored queries, keep them in some List forever
30 for each incoming document create a new MemoryIndex instance mi
40 for query 1 to N do mi.search(query)

Over time this step 40 takes longer and longer and longer -- if some of the queries are MultiPhraseQueries. This is even with mergeSort being used in MultiPhraseQuery. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Michael McCandless luc...@mikemccandless.com To: java-user@lucene.apache.org Sent: Mon, May 2, 2011 12:15:40 PM Subject: Re: MultiPhraseQuery slowing down over time in Lucene 3.1 By slowing down over time do you mean you use the same index (no new docs added) yet running the same MPQ over and over you see it taking longer to execute over time? Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 12:00 PM, Tomislav Poljak tpol...@gmail.com wrote: Hi, after running tests on both MemoryIndex and RAMDirectory based index in Lucene 3.1, it seems MultiPhraseQueries are slowing down over time (each iteration of executing the same MultiPhraseQueries on the same doc seems to require more and more execution time). Are there any existing/known issues related to the MultiPhraseQuery in Lucene 3.1 which could lead to this performance drop? Tomislav - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Thoughts on Search Analytics?
Hi, I'd like to solicit your thoughts about Search Analytics if you are doing any sort of analysis/reporting of search logs or click stream or anything related. * Which information or reports do you find the most useful and why? * Which reports would you like to have, but don't have for whatever reason (don't have the needed data, or it's too hard to produce such reports, or ...) * Which tool(s) or service(s) do you use and find the most useful? I'm preparing a presentation on the topic of Search Analytics, so I'm trying to solicit opinions, practices, desires, etc. on this topic. Your thoughts would be greatly appreciated. If you could reply directly, that would be great, since this may be a bit OT for the list. Thanks! Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: SorterTemplate.quickSort causes StackOverflowError
Hi, OK, so it looks like it's not MemoryIndex and its Comparator that are funky. After switching from the quickSort call in MemoryIndex to mergeSort, the problem persists: '1205215856@qtp-684754483-7' Id=18, RUNNABLE on lock=, total cpu time=497060 ms user time=495210 ms at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:105) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) So something else is calling quickSort when it gets stuck. Weirdly, when I get a thread dump and get the above, I don't see the original caller. Maybe because the stack is already too deep and the printout is limited to N lines per call stack? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Uwe Schindler u...@thetaphi.de To: java-user@lucene.apache.org Sent: Thu, April 28, 2011 5:54:44 PM Subject: RE: SorterTemplate.quickSort causes StackOverflowError Thanks for confirming, Javier! :) Uwe, I assume you are referring to this line 528 in MemoryIndex? 528 if (size > 1) ArrayUtil.quickSort(entries, termComparator); And this funky Comparator from MemoryIndex: 208 private static final Comparator<Object> termComparator = new Comparator<Object>() { 209 @SuppressWarnings("unchecked") 210 public int compare(Object o1, Object o2) { 211 if (o1 instanceof Map.Entry<?,?>) o1 = ((Map.Entry<?,?>) o1).getKey(); 212 if (o2 instanceof Map.Entry<?,?>) o2 = ((Map.Entry<?,?>) o2).getKey(); 213 if (o1 == o2) return 0; 214 return ((Comparable) o1).compareTo((Comparable) o2); 215 } 216 }; Will try, thanks! Yeah, simply try with mergeSort in line 528. If that helps, this comparator is buggy. 
Uwe - Original Message From: Uwe Schindler u...@thetaphi.de To: java-user@lucene.apache.org Sent: Thu, April 28, 2011 5:36:13 PM Subject: RE: SorterTemplate.quickSort causes StackOverflowError Hi Otis, Can you reproduce this somehow and send test code? I could look into it. I don't expect the error in the quicksort algorithm itself, as this one is used e.g. in BytesRefHash / TermsHash; if there were a bug we would have seen it a long time ago. I have not seen this before, but I suspect a problem in this very strange comparator in MemoryIndex (which is very broken, if you look at its code - it can compare Strings with Map.Entry and so on), maybe the comparator is not stable? In this case, quicksort can easily loop endlessly and overflow the stack. In Lucene 3.0 this used the stock Java sort (which is mergesort); maybe replace the ArrayUtils.quickSort by ArrayUtils.mergeSort() and see if the problem is still there? Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Thursday, April 28, 2011 11:17 PM To: java-user@lucene.apache.org Subject: SorterTemplate.quickSort causes StackOverflowError Hi, I'm looking at some code that uses MemoryIndex (Lucene 3.1) and that's exhibiting a strange behaviour - it slows down over time. The MemoryIndex contains 1 doc, of course, and executes a set of a few thousand queries against it. The set of queries does not change - the same set of queries gets executed on all incoming documents. This code runs very quickly in the beginning. But with time it gets slower and slower and slower. 
and then I get this: 4/28/11 10:32:52 PM (S) SolrException.log : java.lang.StackOverflowError at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java: 104) I haven't profiled this code yet (remote server, firewall in between, can't use YourKit...), but does the above look familiar to anyone? I've looked at the code and obviously there is the recursive call that's problematic here - it looks like the recursion just gets deeper and deeper and gets stuck, eventually getting too deep for the JVM's taste. Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com
Re: SorterTemplate.quickSort causes StackOverflowError
Hi, Yeah, that's what we were going to do, but instead we did: * changed MemoryIndex to use ArrayUtil.mergeSort * ran the app and did a thread dump that shows SorterTemplate.quickSort in deep recursion again! * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump * we now occasionally catch SorterTemplate.mergeSort in our thread dumps, but only a few levels deep, which looks healthy I don't think we'll be able to reproduce this easily - this happens with MemoryIndex and a few thousand stored queries that are confidential customer data :( I'll be back if after a while mergeSort starts behaving the same as quickSort. Thanks! Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Dawid Weiss dawid.we...@gmail.com To: java-user@lucene.apache.org Sent: Fri, April 29, 2011 7:51:39 AM Subject: Re: SorterTemplate.quickSort causes StackOverflowError Don't know if this helps, but debugging stuff like this I simply add a (manually inserted or aspectj-injected) recursion count, add a breakpoint inside an if checking for recursion count X and run the vm with an attached socket debugger. This lets you run at (nearly) full speed and once you hit the breakpoint, inspect the stack, variables, etc... Dawid On Fri, Apr 29, 2011 at 1:40 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, OK, so it looks like it's not MemoryIndex and its Comparator that are funky. 
After switching from the quickSort call in MemoryIndex to mergeSort, the problem persists: '1205215856@qtp-684754483-7' Id=18, RUNNABLE on lock=, total cpu time=497060 ms user time=495210 ms at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:105) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) So something else is calling quickSort when it gets stuck. Weirdly, when I get a thread dump and get the above, I don't see the original caller. Maybe because the stack is already too deep and the printout is limited to N lines per call stack? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Uwe Schindler u...@thetaphi.de To: java-user@lucene.apache.org Sent: Thu, April 28, 2011 5:54:44 PM Subject: RE: SorterTemplate.quickSort causes StackOverflowError Thanks for confirming, Javier! :) Uwe, I assume you are referring to this line 528 in MemoryIndex? 528 if (size > 1) ArrayUtil.quickSort(entries, termComparator); And this funky Comparator from MemoryIndex: 208 private static final Comparator<Object> termComparator = new Comparator<Object>() { 209 @SuppressWarnings("unchecked") 210 public int compare(Object o1, Object o2) { 211 if (o1 instanceof Map.Entry<?,?>) o1 = ((Map.Entry<?,?>) o1).getKey(); 212 if (o2 instanceof Map.Entry<?,?>) o2 = ((Map.Entry<?,?>) o2).getKey(); 213 if (o1 == o2) return 0; 214 return ((Comparable) o1).compareTo((Comparable) o2); 215 } 216 }; Will try, thanks! Yeah, simply try with mergeSort in line 528. If that helps, this comparator is buggy. 
Uwe - Original Message From: Uwe Schindler u...@thetaphi.de To: java-user@lucene.apache.org Sent: Thu, April 28, 2011 5:36:13 PM Subject: RE: SorterTemplate.quickSort causes StackOverflowError Hi Otis, Can you reproduce this somehow and send test code? I could look into it. I don't expect the error in the quicksort algorithm itself, as this one is used e.g. in BytesRefHash / TermsHash; if there were a bug we would have seen it a long time ago. I have not seen this before, but I suspect a problem in this very strange comparator in MemoryIndex (which is very broken, if you look at its code - it can compare Strings with Map.Entry and so on), maybe the comparator is not stable? In this case, quicksort can easily loop endlessly and overflow the stack. In Lucene 3.0 this used the stock Java sort (which is mergesort); maybe replace the ArrayUtils.quickSort by ArrayUtils.mergeSort() and see if the problem is still there? Uwe - Uwe Schindler H.-H.-Meier
Reusing Query instances
Hi, Is there any reason why one would *not* want to reuse Query instances? I'm using MemoryIndex with a fixed set of queries and I'm executing them all on each new document that comes in. Because each document needs to have many tens of thousands of queries executed against it, I thought I'd just run all queries through QueryParser once at the beginning, and then just reuse Query instances on each incoming document. What I've noticed is that my fixed set of queries takes longer and longer to execute as time passes (more and more time is spent inside memoryIndex.search() somewhere). The problem is not heap/memory - there is no crazy GCing and the heap is not full, but the CPU is 100% busy. I should note that queries I'm dealing with are ugly and big, using lots of wildcards, both trailing and prefix ones (and this is Lucene 3.1, so no faster Wildcard impl). I should also emphasize that at this point I only *suspect* that maaaybe the gradual slowdown I'm seeing has something to do with the fact that I'm reusing Query instances. Is there any reason why one should not reuse Query instances? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
SorterTemplate.quickSort causes StackOverflowError
Hi, I'm looking at some code that uses MemoryIndex (Lucene 3.1) and that's exhibiting a strange behaviour - it slows down over time. The MemoryIndex contains 1 doc, of course, and executes a set of a few thousand queries against it. The set of queries does not change - the same set of queries gets executed on all incoming documents. This code runs very quickly in the beginning. But with time it gets slower and slower and slower, and then I get this: 4/28/11 10:32:52 PM (S) SolrException.log : java.lang.StackOverflowError at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) I haven't profiled this code yet (remote server, firewall in between, can't use YourKit...), but does the above look familiar to anyone? I've looked at the code and obviously there is the recursive call that's problematic here - it looks like the recursion just gets deeper and deeper and gets stuck, eventually getting too deep for the JVM's taste. Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
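The depth problem described above is easy to reproduce outside Lucene with a toy quicksort. This is not SorterTemplate's actual code; it just shows how an unlucky pivot choice makes the recursion depth linear in the input size instead of logarithmic:

```java
public class QuickSortDepth {

    private static int maxDepth;

    // Textbook quicksort with the first element as pivot.
    static void quickSort(int[] a, int lo, int hi, int depth) {
        maxDepth = Math.max(maxDepth, depth);
        if (lo >= hi) return;
        int pivot = a[lo], i = lo, j = hi;
        while (i < j) {
            while (i < j && a[j] >= pivot) j--;
            a[i] = a[j];
            while (i < j && a[i] <= pivot) i++;
            a[j] = a[i];
        }
        a[i] = pivot;
        quickSort(a, lo, i - 1, depth + 1);
        quickSort(a, i + 1, hi, depth + 1);
    }

    // Max recursion depth reached while sorting an already-sorted array:
    // every partition is maximally unbalanced, so the depth is n, not log n.
    static int depthOnSortedInput(int n) {
        int[] a = new int[n];
        for (int k = 0; k < n; k++) a[k] = k;
        maxDepth = 0;
        quickSort(a, 0, n - 1, 1);
        return maxDepth;
    }

    public static void main(String[] args) {
        // Linear in n - crank n up far enough and you get a StackOverflowError.
        System.out.println(depthOnSortedInput(2000));
    }
}
```

A merge sort's recursion depth stays around log2(n), which is consistent with the later observation in this thread that mergeSort only ever shows up a few stack frames deep.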
Re: SorterTemplate.quickSort causes StackOverflowError
Thanks for confirming, Javier! :) Uwe, I assume you are referring to this line 528 in MemoryIndex? 528 if (size > 1) ArrayUtil.quickSort(entries, termComparator); And this funky Comparator from MemoryIndex: 208 private static final Comparator<Object> termComparator = new Comparator<Object>() { 209 @SuppressWarnings("unchecked") 210 public int compare(Object o1, Object o2) { 211 if (o1 instanceof Map.Entry<?,?>) o1 = ((Map.Entry<?,?>) o1).getKey(); 212 if (o2 instanceof Map.Entry<?,?>) o2 = ((Map.Entry<?,?>) o2).getKey(); 213 if (o1 == o2) return 0; 214 return ((Comparable) o1).compareTo((Comparable) o2); 215 } 216 }; Will try, thanks! Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Uwe Schindler u...@thetaphi.de To: java-user@lucene.apache.org Sent: Thu, April 28, 2011 5:36:13 PM Subject: RE: SorterTemplate.quickSort causes StackOverflowError Hi Otis, Can you reproduce this somehow and send test code? I could look into it. I don't expect the error in the quicksort algorithm itself, as this one is used e.g. in BytesRefHash / TermsHash; if there were a bug we would have seen it a long time ago. I have not seen this before, but I suspect a problem in this very strange comparator in MemoryIndex (which is very broken, if you look at its code - it can compare Strings with Map.Entry and so on), maybe the comparator is not stable? In this case, quicksort can easily loop endlessly and overflow the stack. In Lucene 3.0 this used the stock Java sort (which is mergesort); maybe replace the ArrayUtils.quickSort by ArrayUtils.mergeSort() and see if the problem is still there? 
Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Thursday, April 28, 2011 11:17 PM To: java-user@lucene.apache.org Subject: SorterTemplate.quickSort causes StackOverflowError Hi, I'm looking at some code that uses MemoryIndex (Lucene 3.1) and that's exhibiting a strange behaviour - it slows down over time. The MemoryIndex contains 1 doc, of course, and executes a set of a few thousand queries against it. The set of queries does not change - the same set of queries gets executed on all incoming documents. This code runs very quickly in the beginning. But with time it gets slower and slower and slower, and then I get this: 4/28/11 10:32:52 PM (S) SolrException.log : java.lang.StackOverflowError at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) I haven't profiled this code yet (remote server, firewall in between, can't use YourKit...), but does the above look familiar to anyone? I've looked at the code and obviously there is the recursive call that's problematic here - it looks like the recursion just gets deeper and deeper and gets stuck, eventually getting too deep for the JVM's taste. Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: NRT consistency
I think what's being described here is a lot like what I *think* ElasticSearch does, where there is no single master and index changes made to any node get propagated to N-1 other nodes (N=number of index replicas). I'm not sure how it deals with situations where incompatible index changes are made to the same index via 2 different nodes at the same time. Is that what vector clocks are about? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Mark Miller markrmil...@gmail.com To: java-user@lucene.apache.org Sent: Mon, April 11, 2011 11:52:05 AM Subject: Re: NRT consistency On Apr 10, 2011, at 4:34 AM, Em wrote: Hello list, I am currently trying to understand Lucene's Near-Real-Time-Feature which was covered in Lucene in Action, Second Edition. Let's say I got a distributed system with a master and a slave. In Solr replication is solved by checking for any differences in the index-directory and to consume those differences to keep indices consistent. How is this possible within a NRT-System? Is there any possibility to consume snapshots of the internal buffer of the index writer to send them to the slave? I think for near real time, Solr index replication may not be appropriate. Though I think it would be cool to use Andrzej's mythical single pass index splitter to create a single+ doc segment that could be shipped around. Most likely, a system that just sends each doc to each replica is probably going to work a lot better. Introduces other issues of course - some of which we hope to alleviate with further SolrCloud work. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/NRT-consistency-tp2801878p2801878.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. 
- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - Mark Miller lucidimagination.com Lucene/Solr User Conference May 25-26, San Francisco www.lucenerevolution.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
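On the vector clock question above, the gist is: each node keeps a per-node update counter, and two versions conflict exactly when neither clock dominates the other. A minimal sketch (the class and method names are illustrative, not ElasticSearch's implementation):

```java
import java.util.HashMap;
import java.util.Map;

public class VectorClock {

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // A node increments its own entry on every local update.
    void tick(String node) {
        Integer c = counts.get(node);
        counts.put(node, c == null ? 1 : c + 1);
    }

    // true if this clock has seen at least everything the other clock has seen
    boolean dominates(VectorClock other) {
        for (Map.Entry<String, Integer> e : other.counts.entrySet()) {
            Integer mine = counts.get(e.getKey());
            if (mine == null || mine < e.getValue()) return false;
        }
        return true;
    }

    // Concurrent (i.e. conflicting) updates: neither side saw the other's write.
    static boolean concurrent(VectorClock a, VectorClock b) {
        return !a.dominates(b) && !b.dominates(a);
    }
}
```

So vector clocks can *detect* that two replicas accepted independent writes to the same index; what to do with the conflict (last-write-wins, merge, surface it to the application) is a separate policy decision.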
Re: Indexing Non-Textual Data
Hi Chris, Yes, people have done classification with Lucene before. Have a look at http://search-lucene.com/?q=classifier&fc_project=Lucene for some discussions and actual code (in old JIRA issues). Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Chris Spencer chriss...@gmail.com To: java-user@lucene.apache.org Sent: Wed, April 6, 2011 7:46:45 PM Subject: Indexing Non-Textual Data Hi, I'm new to Lucene, so forgive me if this is a newbie question. I have a dataset composed of several thousand lists of 128 integer features, each list associated with a class label. Would it be possible to use Lucene as a classifier, by indexing the label with respect to these integer features, and then classify a new list by finding the most similar labels with Lucene? I'm specifically interested in doing so through the PyLucene API, so I've been going through the PyLucene samples, but they only seem to involve indexing text, not continuous features (understandably). Could anyone point me to an example that indexes non-textual data? I think the project Lire (http://www.semanticmetadata.net/lire/) is using Lucene to do something similar to this, although with an emphasis on image features. I've dug into their code a little, but I'm not a strong Java programmer, so I'm not sure how they're pulling it off, nor how I might translate this into the PyLucene API. In your opinion, is this a practical use of Lucene? Regards, Chris - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
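Independent of Lucene, the core of what Chris describes - label a new integer feature list by its most similar labeled neighbour - fits in a few lines. A 1-nearest-neighbour sketch using cosine similarity (names are made up; vectors shortened from 128 dimensions for the example):

```java
import java.util.Map;

public class NearestLabel {

    // Cosine similarity between two equal-length integer feature vectors.
    static double cosine(int[] a, int[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            na += (double) a[i] * a[i];
            nb += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // 1-nearest-neighbour classification: return the label of the most
    // similar labeled example.
    static String classify(int[] query, Map<String, int[]> labeled) {
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, int[]> e : labeled.entrySet()) {
            double sim = cosine(query, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
        }
        return best;
    }
}
```

The Lucene-based approaches in the linked JIRA issues amount to the same idea, except the "most similar" lookup is done by the index's scoring instead of a linear scan.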
Re: Detecting duplicates
Mark, Keep in mind that there are actually multiple patches for this. SOLR-236 and SOLR-1086 IIRC. Also, I just noticed this is java-user@lucene. You may want to continue on solr-user@lucene. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Mark static.void@gmail.com To: java-user@lucene.apache.org Sent: Sat, March 5, 2011 8:35:13 PM Subject: Re: Detecting duplicates I'm familiar with Deduplication however I do not wish to remove my duplicates and my needs are slightly different. I would like to mark the first document with signature 'xyz' as unique but the next one as a duplicate. This way I can filter out duplicates during searching using a filter query but still return the original document. The only thing I know of at the moment is to use field collapsing but I tried the patch on 1.4.1 and it was terribly slow. On 3/5/11 4:43 AM, Grant Ingersoll wrote: See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to pull out if you are doing just Lucene. On Mar 5, 2011, at 1:49 AM, Mark wrote: Is there a way one could detect duplicates (say by using some unique hash of certain fields) and marking a document as a duplicate but not remove it. Here is an example: Doc 1) This is my test Doc 2) This is my test Doc 3) Another test Doc 4) This is my test Doc 1 and 3 should be considered unique whereas 2 and 4 should be marked as duplicates (of doc 1). Can this be easily accomplished? 
- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem docs using Solr/Lucene: http://www.lucidimagination.com/search - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
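Marking rather than removing duplicates, as Mark wants, can be sketched outside Lucene with a signature set: the first document with a given signature stays unmarked, later ones get flagged. Method and class names below are made up for illustration; in Solr the flag could be indexed as a boolean field and excluded with a filter query:

```java
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateMarker {

    // Signature over the de-dup fields; identical content -> identical signature.
    static String signature(String text) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(text.getBytes(Charset.forName("UTF-8")));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // Flags every document whose signature was already seen; the first
    // occurrence stays unflagged, so it can still be returned by searches.
    static List<Boolean> markDuplicates(List<String> docs) throws Exception {
        Set<String> seen = new HashSet<String>();
        List<Boolean> duplicate = new ArrayList<Boolean>();
        for (String doc : docs) {
            duplicate.add(!seen.add(signature(doc)));
        }
        return duplicate;
    }
}
```

With Mark's example this yields [false, true, false, true]: docs 1 and 3 stay unique, docs 2 and 4 are marked as duplicates of doc 1.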
Re: Backup or replication option with lucene
Hi Ganesh, You could probably use replication scripts from Solr. But why not just use Solr? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Ganesh emailg...@yahoo.co.in To: java-user@lucene.apache.org Sent: Thu, March 3, 2011 12:03:20 AM Subject: Re: Backup or replication option with lucene Any suggestions? We are planning to move towards the cloud and it's become a mandatory requirement to have backup or replication of the search db. Regards Ganesh - Original Message - From: Ganesh emailg...@yahoo.co.in To: java-user@lucene.apache.org Sent: Tuesday, March 01, 2011 12:06 PM Subject: Backup or replication option with lucene Hello all, Could anyone guide me how to backup or do replication with Lucene? Regards Ganesh Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Best practices for multiple languages?
Hi Clemens, If you will be searching individual languages, go with language-specific indices. Wunder likes to give the example of die in German vs. English. :) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss clemens...@mysign.ch To: java-user@lucene.apache.org Sent: Tue, January 18, 2011 12:53:57 PM Subject: Best practices for multiple languages? What is the best practice to support multiple languages, i.e. Lucene Documents that have content/fields in multiple languages? Should a) each language be indexed in a separate index/directory, or should b) the Documents (in a single directory) hold the various localized fields? We will most often be searching language-dependently, which (at least performance-wise) suggests one directory per language... Any (Lucene-specific) white papers on this topic? Thx in advance Clemens
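To make option (a) concrete, a one-index-per-language setup usually reduces to a tiny routing step: pick the index (and, in real code, the matching Analyzer) from the document's or query's language. A toy JDK-only sketch; the paths and the English fallback are made-up assumptions:

```java
import java.util.Map;

// One-index-per-language routing sketch: map an ISO language code to
// its own index path. In real Lucene code the same map would also
// carry the language-specific Analyzer.
class LanguageRouter {
    private static final Map<String, String> INDEX_PATHS = Map.of(
            "en", "/indexes/en",
            "de", "/indexes/de",
            "fr", "/indexes/fr");

    static String indexFor(String languageCode) {
        // Fall back to English when the language is unknown.
        return INDEX_PATHS.getOrDefault(languageCode, INDEX_PATHS.get("en"));
    }
}
```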
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
[X] ASF Mirrors (linked in our release announcements or via the Lucene website) [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) [X] I/we build them from source via an SVN/Git checkout. [ ] Other (someone in your company mirrors them internally or via a downstream project)
Re: does lucene support Database full text search
Hello, You can use LuSQL to index DB content into Lucene. Solr (the Lucene server) has the DataImportHandler for indexing data from DBs: http://search-lucene.com/?q=dataimporthandler Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: yang Yang m4ecli...@gmail.com To: java-user@lucene.apache.org Sent: Fri, September 10, 2010 9:38:58 AM Subject: does lucene support Database full text search Hi: I am using MySQL, and its full text search is rather weak. So I used Sphinx; however, I found it cannot support Chinese word searching perfectly. So I wonder if Lucene can work better?
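For the Solr/DataImportHandler route, the mapping from a SQL query to index fields lives in a data-config file. The fragment below is only a sketch: the JDBC URL, credentials, and the `items` table/columns are placeholders, and the target field names must match your schema.xml.

```xml
<!-- data-config.xml sketch; table and column names are hypothetical -->
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="dbuser" password="dbpass"/>
  <document>
    <entity name="item" query="SELECT id, title, body FROM items">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="text"/>
    </entity>
  </document>
</dataConfig>
```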
Re: Calculate Term Co-occurrence Matrix
Ahmed, That's what the KPE (link in my previous email, below) will do for you. It's not open source at this time, but that is exactly one of the things it does. I think Mahout's collocations stuff might work for you, too. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: ahmed algohary algoharya...@gmail.com To: java-user@lucene.apache.org Sent: Sat, August 21, 2010 7:20:03 AM Subject: Re: Calculate Term Co-occurrence Matrix Thanks for all your answers! It seems I did not make my question clear. I have a text corpus and I need to determine the pairs of words that occur together in many documents. I need to do that to be able to measure the semantic proximity between words. This approach is expanded on here: http://forums.searchenginewatch.com/showthread.php?t=48. I hope to find some code that, given a text corpus, generates all the word pairs with their probability of occurring together. On Sat, Aug 21, 2010 at 1:46 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: There is also a non-Mahout Key Phrase Extractor for Collocations, SIPs, and a few other things: http://sematext.com/products/key-phrase-extractor/index.html One of the demos that uses news data is at http://sematext.com/demo/kpe/index.html Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Grant Ingersoll gsing...@apache.org To: java-user@lucene.apache.org Sent: Fri, August 20, 2010 8:52:17 AM Subject: Re: Calculate Term Co-occurrence Matrix You might also be interested in Mahout's collocations package: http://cwiki.apache.org/confluence/display/MAHOUT/Collocations -Grant On Aug 19, 2010, at 11:39 AM, ahmed algohary wrote: Hi all, I need to know if there is a Lucene plug-in or a Lucene-based API for calculating the term co-occurrence matrix for a given text corpus. Thanks!
-- Ahmed
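Whatever tool ends up supplying the term vectors, the document-level co-occurrence counting itself is simple. A JDK-only sketch (class and method names invented here) that counts, for each unordered pair of distinct terms, how many documents contain both; these raw counts are what collocation measures like PMI are derived from:

```java
import java.util.*;

// Document-level co-occurrence counts: for every unordered pair of
// distinct terms that appear in the same document, count the number
// of documents containing both.
class Cooccurrence {
    static Map<String, Integer> pairCounts(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            // Deduplicate terms within a document and sort them so that
            // (a, b) and (b, a) share the single key "a|b".
            List<String> terms = new ArrayList<>(new TreeSet<>(doc));
            for (int i = 0; i < terms.size(); i++) {
                for (int j = i + 1; j < terms.size(); j++) {
                    counts.merge(terms.get(i) + "|" + terms.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Dividing each pair count by the number of documents gives the "probability of occurring together" the original question asked for.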
Re: lucene indexing configuration
Hi, Are you actually talking about Solr? Sounds like it. Check the solr-u...@lucene list. Maybe you need to treat those words as protected words? See the protwords.txt file in the conf dir. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Shuai Weng sh...@genome.stanford.edu To: java-user@lucene.apache.org Sent: Fri, August 20, 2010 5:47:31 PM Subject: Re: lucene indexing configuration Hey, We have currently indexed some biological full-text pages, and I was wondering how to configure the schema.xml so that the gene names 'met1', 'met2', 'met3' will be treated as different words. Currently they are all mapped to 'met'. Thanks, Shuai
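To illustrate the protwords.txt route: list the gene names in conf/protwords.txt (one term per line, e.g. met1, met2, met3) and reference that file from the stemmer in the field type, so those terms bypass stemming. The field type below is a sketch only; the name `text_bio` and the exact filter chain are made up, so adapt it to the analyzer already in your schema.xml.

```xml
<!-- schema.xml sketch: the stemmer skips terms listed in protwords.txt -->
<fieldType name="text_bio" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
```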
Re: understanding lucene
Manning, the Lucene in Action publisher, frequently offers 30-50% off on a number of their books, including LIA2. See http://twitter.com/ManningBooks Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Yakob jacob...@opensuse-id.org To: java-user@lucene.apache.org Sent: Sun, August 8, 2010 5:54:38 AM Subject: Re: understanding lucene On 8/8/10, Uwe Schindler u...@thetaphi.de wrote: The example code you found is very old (it seems to be from version 1.x of Lucene) and does not work with version 2.x or 3.x of Lucene (the previously deprecated Hits class is gone in 3.0, and the static Field constructors were gone long ago in 2.0, so you get compilation errors). If you want to learn Lucene, buy the book Lucene in Action, 2nd Edition; everything is explained there, with lots of examples for everyday use with the newest version, 3.0.2. See http://www.manning.com/hatcher2/ for ordering the PDF version, or go to your local bookstore. In any case, if you are new to Lucene, don't use version 2.9.x or earlier; use 3.0.x with its clean API. This makes it easier for beginners. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de The ebook costs 30 dollars; can't I just get the free pirate version instead? hehe... I mean, if you had the ebook yourself, maybe you could email me the PDF version, so that it would not cost me money. :-) Or maybe I can find it on Rapidshare; maybe someone was kind enough to put the book there. -- http://jacobian.web.id
Re: Using categories with Lucene
Hello Luan, I think you are looking for facets and faceted search. In short, it means storing the category of a document (web page) in a Document Field in the Lucene index. Then, at search time, you count how many matches were in each category. You can implement this yourself or you can use Solr, which has this functionality built in. If you want to stick with Lucene and don't want Solr, you can use Bobo Browse with Lucene - Lucene in Action 2 has a case study about Bobo Browse, where you can learn how this is done. Slick stuff. Thanks for using http://search-lucene.com :) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Luan Cestari luan.cest...@gmail.com To: java-user@lucene.apache.org Sent: Sun, August 8, 2010 7:16:05 PM Subject: Using categories with Lucene Lucene developers, We've been working on an undergraduate college project about changing Apache Nutch (which uses Lucene to index its web pages) to include a category filter, and we are having problems with the query part. We want to develop an application with good performance, so we thought this would be the best place to ask this kind of question. The idea is that the user can search pages stored for only one category. So the number of results found should display the number of pages actually classified in that category. The problem is how to add the category information to the Lucene indexes, and how to filter the search on it. We asked for help on the Nutch mailing list (Nabble), but people there think we should use a plug-in like Carrot, which takes something like 100 pages and classifies them at query time. We are not very confident that this is the best solution. We thought of two other ideas: #1 Classify the pages, store that information in a DB, and at query time use that DB to filter the results.
#2 Use different index servers, one for each category and one to search without filtering by category. We have seen on this project http://search-lucene.com/ that there are pre-defined categories. We think that this should be classified at indexing time, as we wanted. Do you have any other idea about how to do that? Sincerely, Daniel Costa Gimenes Luan Cestari Undergraduate students of University Center of FEI Brazil
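The counting step described above (how many matches fall in each category) is tiny once you can read each hit's category; in Lucene the real work is reading that stored field efficiently (e.g. via a field cache). A JDK-only sketch of just the tallying, with invented names:

```java
import java.util.*;

// Facet counting sketch: given the category of each document that
// matched a query, tally matches per category. In Lucene these
// categories would come from a stored field or a cached field value.
class FacetCounts {
    static Map<String, Integer> count(List<String> categoriesOfHits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String c : categoriesOfHits) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }
}
```

Filtering to a single category is then just a TermQuery (or filter) on the category field.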
Re: LUCENE-2456 (A Column-Oriented Cassandra-Based Lucene Directory)
Utku, you should ask via comments on https://issues.apache.org/jira/browse/LUCENE-2453. What happened with Lucandra? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Utku Can Topçu u...@topcu.gen.tr To: java-user@lucene.apache.org Sent: Fri, July 23, 2010 12:59:36 PM Subject: LUCENE-2456 (A Column-Oriented Cassandra-Based Lucene Directory) Hi All, I'm trying to use the patch, provided in the issue, for testing. I downloaded the patch and the dependency LUCENE-2453 (https://issues.apache.org/jira/browse/LUCENE-2453). I tested this contribution against the r942817 revision, which I assume the contributor was using at the time of development. The tests seemed to fail. I then updated CassandraDirectory.java to match the new Cassandra interface. It unfortunately failed again. Does anyone here have an idea which Cassandra revision and Lucene revision this patch works against? Best Regards, Utku
Re: Personal Intro and a question on find top 10 similar items functionality
Igor, You can treat that question as the query and use it to search the index where you've indexed the other questions. More Like This is another option. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Igor Chudov ichu...@gmail.com To: java-user@lucene.apache.org Sent: Thu, July 8, 2010 6:12:37 PM Subject: Personal Intro and a question on find top 10 similar items functionality Hello, My name is Igor and I own the website algebra.com. I just joined. I have a database of answered algebra questions (208,000 and growing). A typical question is here (original spelling): ``who long does it take 2 people to finish painting a house if the first one takes 6 days and the second one takes 9 days'' What I would like to do is, for anyone viewing an archived problem, find the top 10 problems most similar to the currently viewed one. Note that the meaning of similar is not defined in my question. Is Lucene even capable of this sort of thing? Could I expect reasonable performance (under 1-2 seconds) from it? Thanks a bunch, guys. i
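For intuition about what "similar" means in this kind of matching: MoreLikeThis-style similarity is essentially weighted term overlap. A stripped-down, JDK-only sketch scoring two questions by cosine similarity of raw term frequencies; Lucene's MoreLikeThis additionally applies TF-IDF weighting and selects only the most interesting query terms, and the class below is invented for illustration:

```java
import java.util.*;

// Toy "similar items" scoring: cosine similarity between raw
// term-frequency maps of two texts.
class Similarity {
    static Map<String, Integer> tf(String text) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) m.merge(t, 1, Integer::sum);
        }
        return m;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double na = 0, nb = 0;
        for (int v : a.values()) na += (double) v * v;
        for (int v : b.values()) nb += (double) v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

With 208,000 short documents, an inverted index (which Lucene maintains for you) makes this kind of top-10 lookup comfortably sub-second.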
Re: arguments in favour of lucene over commercial competition
And I was just thinking the other day how cool it would be to take, say, Lucene 1.4, then some 2.* version, and now the latest 3.* version and compare them. :) Want to do it and share? I don't think anyone has done this before. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: jm jmugur...@gmail.com To: java-user@lucene.apache.org Sent: Thu, June 24, 2010 3:50:58 AM Subject: Re: arguments in favour of lucene over commercial competition I want to add some perf numbers too, to show how it has improved in the last versions (not that it was bad before). Does anyone have a link to a nice page with numbers/graphs? On Thu, Jun 24, 2010 at 7:43 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Coincidentally, just after I replied to this thread I received an email from one of our customers. In that email was a quote from one of the commercial search vendors. My jaw didn't drop, because I've seen similar numbers from other commercial search vendors before. I won't mention the customer or the vendor, but I can tell you that the amount could put a couple of kids through a top-notch private college in the U.S. Talk about TCO reduction through the use of open source! Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: jm jmugur...@gmail.com To: java-user@lucene.apache.org Sent: Wed, June 23, 2010 5:57:32 PM Subject: Re: arguments in favour of lucene over commercial competition yes, in my case the competition is one on the list...
On Wed, Jun 23, 2010 at 11:41 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Off the top of my head: FAST Endeca Coveo Attivio Vivisimo Google Search Appliance (tell me when to stop) Dieselpoint IBM OmniFind Exalead Autonomy dtSearch ISYS Oracle ... ... Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Hans Merkl hme...@rightonpoint.us To: java-user@lucene.apache.org Sent: Wed, June 23, 2010 5:15:46 PM Subject: Re: arguments in favour of lucene over commercial competition Just curious. What commercial alternatives are out there?
On Wed, Jun 23, 2010 at 04:01, jm jmugur...@gmail.com wrote: Hi, I am trying to compile some arguments in favour of lucene, as management is deciding whether to standardize on lucene or a competing commercial product (we have a couple of products, one using lucene, another using a commercial product; imagine which one I am using). I searched the lists but could not find any posts, although I remember seeing such posts in the past. Has somebody kept such posts linked or something? Or does someone know of some page that would help me? I would like to show: - traction of lucene, really improving a lot in the last couple of years - rich ecosystem (solr...) - references of other companies choosing lucene/solr over commercial offerings (be it FAST or whatever) thanks
Re: arguments in favour of lucene over commercial competition
Lucene/Solr choice typically means: * lower cost of ownership (think about the various crazy licensing models some of the commercial search vendors have: per doc, per server, per query, per year) * faster implementation (just think about the duration of the sales/negotiation phase with commercial search vendors) * flexibility -- it's open source; you can change whatever you want. Try that with a closed-source commercial search vendor's package. * super fast and knowledgeable community -- see http://www.jroller.com/otis/entry/lucene_solr_nutch_amazing_tech * commercial support and experts still available -- see http://www.sematext.com/services/index.html * adoption -- small companies, medium companies, HUGE companies, secret organizations; everyone's using some form of Lucene -- see http://wiki.apache.org/lucene-java/PoweredBy , http://wiki.apache.org/solr/PublicServers * maturity -- Lucene is over 10 years old. Solr is over 4 years old. * future -- look at JIRA, look at mailing list traffic, look at the pace of development, look at CHANGES.txt * searchable documentation and mailing list archives -- http://search-lucene.com/ * ... Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: jm jmugur...@gmail.com To: java-user@lucene.apache.org Sent: Wed, June 23, 2010 4:01:05 AM Subject: arguments in favour of lucene over commercial competition Hi, I am trying to compile some arguments in favour of lucene, as management is deciding whether to standardize on lucene or a competing commercial product (we have a couple of products, one using lucene, another using a commercial product; imagine which one I am using). I searched the lists but could not find any posts, although I remember seeing such posts in the past. Has somebody kept such posts linked or something? Or does someone know of some page that would help me? I would like to show: - traction of lucene, really improving a lot in the last couple of years - rich ecosystem (solr...) - references of other companies choosing lucene/solr over commercial offerings (be it FAST or whatever) thanks
Re: arguments in favour of lucene over commercial competition
Off the top of my head: FAST Endeca Coveo Attivio Vivisimo Google Search Appliance (tell me when to stop) Dieselpoint IBM OmniFind Exalead Autonomy dtSearch ISYS Oracle ... ... Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Hans Merkl hme...@rightonpoint.us To: java-user java-user@lucene.apache.org Sent: Wed, June 23, 2010 5:15:46 PM Subject: Re: arguments in favour of lucene over commercial competition Just curious. What commercial alternatives are out there? On Wed, Jun 23, 2010 at 04:01, jm jmugur...@gmail.com wrote: Hi, I am trying to compile some arguments in favour of lucene, as management is deciding whether to standardize on lucene or a competing commercial product (we have a couple of products, one using lucene, another using a commercial product; imagine which one I am using). I searched the lists but could not find any posts, although I remember seeing such posts in the past. Has somebody kept such posts linked or something? Or does someone know of some page that would help me? I would like to show: - traction of lucene, really improving a lot in the last couple of years - rich ecosystem (solr...) - references of other companies choosing lucene/solr over commercial offerings (be it FAST or whatever) thanks
Re: arguments in favour of lucene over commercial competition
I won't comment on Attivio, as I think I might have signed an NDA with them. But they do claim to combine full-text search with DB-like joins. Can't MarkLogic do that, too? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Itamar Syn-Hershko ita...@code972.com To: java-user@lucene.apache.org Sent: Wed, June 23, 2010 5:54:34 PM Subject: RE: arguments in favour of lucene over commercial competition Otis, I'm 99% sure Attivio is just a wrapper around Lucene... And I personally wouldn't count full-text search solutions such as Oracle's. Itamar. -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Thursday, June 24, 2010 12:42 AM To: java-user@lucene.apache.org Subject: Re: arguments in favour of lucene over commercial competition Off the top of my head: FAST Endeca Coveo Attivio Vivisimo Google Search Appliance (tell me when to stop) Dieselpoint IBM OmniFind Exalead Autonomy dtSearch ISYS Oracle ... ... Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Hans Merkl hme...@rightonpoint.us To: java-user@lucene.apache.org Sent: Wed, June 23, 2010 5:15:46 PM Subject: Re: arguments in favour of lucene over commercial competition Just curious. What commercial alternatives are out there? On Wed, Jun 23, 2010 at 04:01, jm jmugur...@gmail.com wrote: Hi, I am trying to compile some arguments in favour of lucene, as management is deciding whether to standardize on lucene or a competing commercial product (we have a couple of products, one using lucene, another using a commercial product; imagine which one I am using). I searched the lists but could not find any posts, although I remember seeing such posts in the past. Has somebody kept such posts linked or something? Or does someone know of some page that would help me? I would like to show: - traction of lucene, really improving a lot in the last couple of years - rich ecosystem (solr...) - references of other companies choosing lucene/solr over commercial offerings (be it FAST or whatever) thanks
Re: arguments in favour of lucene over commercial competition
Coincidentally, just after I replied to this thread I received an email from one of our customers. In that email was a quote from one of the commercial search vendors. My jaw didn't drop, because I've seen similar numbers from other commercial search vendors before. I won't mention the customer or the vendor, but I can tell you that the amount could put a couple of kids through a top-notch private college in the U.S. Talk about TCO reduction through the use of open source! Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: jm jmugur...@gmail.com To: java-user@lucene.apache.org Sent: Wed, June 23, 2010 5:57:32 PM Subject: Re: arguments in favour of lucene over commercial competition yes, in my case the competition is one on the list... On Wed, Jun 23, 2010 at 11:41 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Off the top of my head: FAST Endeca Coveo Attivio Vivisimo Google Search Appliance (tell me when to stop) Dieselpoint IBM OmniFind Exalead Autonomy dtSearch ISYS Oracle ... ... Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Hans Merkl hme...@rightonpoint.us To: java-user java-user@lucene.apache.org Sent: Wed, June 23, 2010 5:15:46 PM Subject: Re: arguments in favour of lucene over commercial competition Just curious. What commercial alternatives are out there? On Wed, Jun 23, 2010 at 04:01, jm jmugur...@gmail.com wrote: Hi, I am trying to compile some arguments in favour of lucene, as management is deciding whether to standardize on lucene or a competing commercial product (we have a couple of products, one using lucene, another using a commercial product; imagine which one I am using). I searched the lists but could not find any posts, although I remember seeing such posts in the past. Has somebody kept such posts linked or something? Or does someone know of some page that would help me? I would like to show: - traction of lucene, really improving a lot in the last couple of years - rich ecosystem (solr...) - references of other companies choosing lucene/solr over commercial offerings (be it FAST or whatever) thanks
Re: Monitoring low level IO
Ah, there is another one I came across several months back - http://wiki.sdn.sap.com/wiki/display/Java/JPicus. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Otis Gospodnetic otis_gospodne...@yahoo.com To: java-user@lucene.apache.org Sent: Fri, June 4, 2010 1:54:15 AM Subject: Re: Monitoring low level IO Other than iostat, vmstat, and such? Otis - Original Message From: Jason Rutherglen jason.rutherg...@gmail.com To: java-user@lucene.apache.org Sent: Thu, June 3, 2010 2:13:17 PM Subject: Monitoring low level IO This is more of a Unix-related question than a Lucene-specific one, but because Lucene is being used I'm asking here, as perhaps other people have run into a similar issue. On Amazon EC2, merge, read, and write operations are possibly blocking due to underlying IO. Is there a tool that you have used to monitor this type of thing?
Re: is there any resources that explain detailed implementation of lucene?
Li Li: Then best to go to the source. Here's one version with syntax highlighting and line numbers, should you have questions about specific parts of that class: http://search-lucene.com/c/Lucene:/src/java/org/apache/lucene/search/PhraseQuery.java Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Li Li fancye...@gmail.com To: java-user@lucene.apache.org Sent: Thu, June 3, 2010 2:51:02 AM Subject: Re: is there any resources that explain detailed implementation of lucene? e.g. I want to know the code behind phrase queries so that I can make some extensions. 2010/6/3 Erick Erickson erickerick...@gmail.com: Why do you care (tm)? Or, put another way, are you asking just for general understanding of how Lucene works, or is there a higher-level problem you're trying to solve? Best Erick On Wed, Jun 2, 2010 at 8:54 PM, Li Li fancye...@gmail.com wrote: such as the detailed process of the storage data structures, indexing, search and sorting. not just APIs. thanks. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: numDeletedDocs()
Btw. folks, http://search-lucene.com/ has a really handy source code search with auto-completion for Lucene, Solr, etc. For example, I typed in: numDel - and immediately found those methods. Use it. :) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Michael McCandless luc...@mikemccandless.com To: java-user@lucene.apache.org Sent: Thu, June 3, 2010 4:02:09 PM Subject: Re: numDeletedDocs() Hmm... I don't think IndexWriter has ever had a numDeletedDocs() (w/ no params)? (IndexReader does). Mike On Thu, Jun 3, 2010 at 3:50 PM, Woolf, Ross ross_wo...@bmc.com wrote: There seems to be a mismatch between the IndexWriter().numDeletedDocs() method as stated in the javadocs supplied in the 2.9.2 download and what is actual. JavaDocs for 2.9.2 as came with the 2.9.2 download: numDeletedDocs public int numDeletedDocs() Returns the number of deleted documents. (No parameter required) -- Source code for 2.9.2: public int numDeletedDocs(SegmentInfo info) throws IOException { (Parameter required) Why is there no longer a no-parameter numDeletedDocs as stated in the JavaDocs? I'm not sure how to use the experimental SegmentInfo just to get the delete count in my index. Any help appreciated. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
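[Editor's note: a minimal sketch of the relationship behind the question above. On IndexReader, the deleted-doc count is simply the gap between maxDoc() and numDocs(); the counts below are hypothetical, not from a real index.]

```java
// Sketch: how a deleted-doc count relates to IndexReader's maxDoc()/numDocs().
// The values used here are hypothetical examples, not from a real index.
public class DeletedDocsSketch {
    // maxDoc counts all doc IDs ever assigned (including deletions);
    // numDocs counts only live docs; their difference is the deleted count.
    static int numDeletedDocs(int maxDoc, int numDocs) {
        return maxDoc - numDocs;
    }

    public static void main(String[] args) {
        int maxDoc = 1000;  // hypothetical: total doc IDs in the index
        int numDocs = 950;  // hypothetical: docs still live
        System.out.println(numDeletedDocs(maxDoc, numDocs)); // prints 50
    }
}
```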
Re: Monitoring low level IO
Other than iostat, vmstat and such? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Jason Rutherglen jason.rutherg...@gmail.com To: java-user@lucene.apache.org Sent: Thu, June 3, 2010 2:13:17 PM Subject: Monitoring low level IO This is more of a unix-related question than a Lucene-specific one, but because Lucene is being used, I'm asking here, as perhaps other people have run into a similar issue. On Amazon EC2, merge, read, and write operations are possibly blocking due to underlying IO. Is there a tool that you have used to monitor this type of thing? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Which way would you recommend for successive-words similarity and scoring?
Hi Pablo, This question comes up every once in a while. You'll find some previous discussions and answers here: http://search-lucene.com/?q=terms+closer+together+score Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Pablo pablo.queixa...@gmail.com To: java-user@lucene.apache.org Sent: Mon, May 3, 2010 3:20:10 PM Subject: Which way would you recommend for successive-words similarity and scoring? Hello, Lucene core doesn't seem to use relative word positioning (?) for scoring. For example, after indexing the phrase 'a b c d e f g h i j k l m n o p q r s t u v w x y z', these queries give the same results (0.19308087): - 1 : phrase:'e f g' - 2 : phrase:'o k z' I'm a bit familiar with lucene and snowballs, but I never (really) needed this feature before, and didn't browse the lucene contribs. Maybe I'm misunderstanding something, but what can I do so that query 1 gets a better score than the second? Should I implement a Scorer and/or a Similarity, or can an analyser and a specific stemmer be sufficient? Thanks, [I first wrote to dev, wasn't the right place.] Pablo - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
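[Editor's note: a small sketch of the mechanism behind the answers in those linked threads. Lucene's Similarity exposes sloppyFreq(int distance), which DefaultSimilarity implements as 1/(distance+1), so sloppy phrase matches with terms closer together score higher. The distances below are hypothetical.]

```java
// Sketch of how sloppy-phrase matching rewards term closeness, modeled on
// DefaultSimilarity.sloppyFreq(distance) = 1 / (distance + 1).
// The distances used below are hypothetical examples.
public class SloppyFreqSketch {
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // A phrase whose terms sit in their expected positions (distance 0)
        // contributes more than one whose terms had to be moved 5 positions:
        System.out.println(sloppyFreq(0)); // adjacent terms: 1.0
        System.out.println(sloppyFreq(5)); // terms 5 moves apart: ~0.167
    }
}
```

Note this only kicks in for sloppy phrase queries (slop > 0); exact phrase queries either match or don't, which is why the two example queries above scored identically.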
Re: Grouping or de-duping
Pasha, Maybe Field Collapsing (Solr) can help? See SOLR-236 in JIRA: http://search-lucene.com/?q=field+collapsing&fc_project=Lucene&fc_project=Solr Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Паша Минченков char...@gmail.com To: java-user@lucene.apache.org Sent: Mon, May 31, 2010 4:15:40 PM Subject: Grouping or de-duping Sorry for my similar questions. I need to remove duplicates from search results for a given field (or group by it). Documents are not ordered on this field. Which of the duplicates ends up in the search results - I do not care. I tried to use DuplicateFilter and PerParentLimitedQuery, but they didn't help. In searching for an answer I found references to SimpleFacetParameters, but I do not understand how this material can be useful to me because it refers to the Solr project. Maybe someone has an example of grouping search results, or something like DeDupinQuery. At the link below I found a solution, but there is no sample and I can't make these modifications myself. http://markmail.org/message/uvrh3y5ogjgu4gfx#query:group%20lucene%20results%20by%20field+page:1+mid:uvrh3y5ogjgu4gfx+state:results Thanks. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Is Lucene a document oriented database?
I think those doc-oriented DBs tend to be distributed, with replication built in and such, but yes, in some ways a schemaless DB with docs and fields (whether they are pumped in as JSON or XML or Java objects) feels the same. I saw something from Grant about 2 months ago on how Lucene is nosql-ish. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Shashi Kant sk...@sloan.mit.edu To: java-user@lucene.apache.org Sent: Mon, May 31, 2010 12:20:36 PM Subject: Is Lucene a document oriented database? There seems to be considerable buzz on the internets about document-oriented DBs such as MongoDB, CouchDB etc. I am at a loss as to what the principal differences between Lucene and the DODBs are. I could very well use Lucene as any of the above (schema-free, document oriented) and perform similar queries, *with* the added benefit of text search. I fail to see what benefits such DODBs bring, or is it old wine in new bottles? Thanks Shashi - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Using JSON for index input and search output
VL, Solr (not Lucene, but you can embed Solr) has JsonUpdateRequestHandler, which lets you send docs to Solr for indexing in JSON (instead of the usual XML): http://search-lucene.com/c/Solr:/src/java/org/apache/solr/handler/JsonUpdateRequestHandler.java And you can get Solr to respond with JSON, as you pointed out: http://wiki.apache.org/solr/SolJSON Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Visual Logic visual.lo...@gmail.com To: java-user@lucene.apache.org Sent: Sun, May 30, 2010 1:33:19 PM Subject: Using JSON for index input and search output Lucene, JSON is the format used for all the configuration and property files in the RIA application we are developing. Is Lucene able to create a document from a given JSON file and index it? Is Lucene able to provide a JSON output response from a query made to an index? Does the Tika package provide this? Local indexing and searching is needed on the local client, so Solr is not a solution even though it does provide a search response in JSON format. VL - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: TermsFilter instead of should TermQueries
I think what Tomislav was trying to ask is: Can filters replace only strictly boolean clauses (i.e. only MUST and MUST_NOT), such as +gender:F, -rating:xxx? Or can filters also replace SHOULD clauses, such as food:banana (which is neither absolutely required nor strictly prohibited)? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Erick Erickson erickerick...@gmail.com To: java-user@lucene.apache.org Sent: Fri, May 7, 2010 8:30:18 PM Subject: Re: TermsFilter instead of should TermQueries Well, you construct the filter by enumerating the terms you're interested in and pass it along to the relevant search. But it looks like you've figured that part out. If you're asking how you can use a Filter and still have the terms replaced by the filter contribute to scoring, you can't. But it's a reasonable question to ask whether it changes the score enough to matter, given that this is only a problem when there are many terms. If this doesn't speak to your question, can you ask with more detail? HTH Erick On Fri, May 7, 2010 at 1:19 PM, Tomislav Poljak tpol...@gmail.com wrote: Hi, the API documentation for TermsFilter: http://search-lucene.com/jd/lucene/org/apache/lucene/search/TermsFilter.html states: 'As a filter, this is much faster than the equivalent query (a BooleanQuery with many should TermQueries)' I would like to replace should TermQueries with a TermsFilter to benefit in performance, but I'm trying to understand how this change/switch can work. I was under the impression that a BooleanQuery with many should TermQueries affects scoring like this: each should term present in a result increases the result's score.
If someone could explain how a TermsFilter (which is, like any filter, a binary thing - a result document is matched or not) can be used to replace should clauses, I would really appreciate it. Tomislav - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
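[Editor's note: a conceptual sketch of why filters map cleanly onto MUST/MUST_NOT but not onto SHOULD. This models a filter as a plain bitset over doc IDs, which is not Lucene's actual Filter API, just the mental model; the doc IDs below are hypothetical.]

```java
import java.util.BitSet;

// Sketch: a filter is conceptually a bitset over doc IDs -- a doc is in
// or out, with no score contribution. MUST and MUST_NOT map to AND and
// AND-NOT set operations; a SHOULD clause only boosts scores, which is
// exactly what a binary filter cannot reproduce.
// The doc IDs matching each clause below are hypothetical.
public class FilterSketch {
    static BitSet applyFilters(BitSet must, BitSet mustNot) {
        BitSet result = (BitSet) must.clone();
        result.andNot(mustNot); // MUST keeps the set, MUST_NOT subtracts from it
        return result;
    }

    static BitSet bits(int... setBits) {
        BitSet b = new BitSet();
        for (int i : setBits) b.set(i);
        return b;
    }

    public static void main(String[] args) {
        BitSet genderF = bits(0, 2, 3, 5); // docs matching +gender:F
        BitSet ratingXxx = bits(3, 6);     // docs matching rating:xxx (excluded)
        System.out.println(applyFilters(genderF, ratingXxx)); // prints {0, 2, 5}
    }
}
```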
Re: Filter vs. TermQuery performance
I think others will have more thoughts on this, esp. for Numeric* questions... but I'll try answering... - Original Message From: Tomislav Poljak tpol...@gmail.com To: java-user@lucene.apache.org Sent: Fri, May 7, 2010 2:34:46 PM Subject: Filter vs. TermQuery performance Hi, when is it wise to replace a TermQuery with a cached Filter (regarding search performance)? If a TermQuery is used only to filter results based on a field value (it doesn't participate in scoring), is it always wise to replace it with a filter? Yes, assuming the filter will be reused. I think there is not a lot of value in using a filter (vs. just a regular query) if that filter will not be reused. This is why in Solr fqs (filter queries) are cached in a special filter cache. I *think* the only other benefit of using a filter vs., say, a TermQuery, is that the former will not spend any time/CPU on computing the score for the filter part. Is it only wise if the Filter is cached (wrapped in CachingWrapperFilter) and reused often? I think so. See above. Does it matter how many distinct values the field has (which is related to how many matches/results are returned for one given/selected value, and also to how many times the same filter instance is reused)? I *think* it matters. I think the more docs a filter matches, the higher the benefit from reusing a filter. For example, what if the filter for a single value matches only 5% of docs - should a filter be used, or is it better to use a TermQuery? What about if the filter for a single value matches 20%? or 50% or 75%? I'm not sure... I have a question regarding caching performance/memory usage. Documents have a datetime indexed (as NumericField) with minute resolution, and there are a few thousand unique datetimes in the index. On the search side an open-ended range filter is used (NumericRangeFilter) with the current time as a parameter.
Now, is it wise to cache the NumericRangeFilter here (reuse an instance of CachingWrapperFilter wrapping the NumericRangeFilter), since it will not be reused often (only by users searching at the same time in the same time zone)? If the cache hit rate is low, why waste memory on caching - that is the logic I would apply here. If you have 3 queries, and each uses a different date range query, then you will not see benefits from caching. If 2 of those 3 queries use the exact same date range query, then you will see caching benefits. Is it better to use NumericRangeFilter or NumericRangeQuery in this case? I'm not sure, but I'd be happy to add specific advice to the Javadoc when the answer is clear. Otis - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
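[Editor's note: the hit-rate argument above can be sketched with a toy cache. This models a filter cache as a map keyed by the filter's definition, which is not CachingWrapperFilter's actual implementation, just the idea; the range keys below are hypothetical.]

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch of why caching only pays off on reuse: a cached filter is keyed
// by its definition, so only searches with the exact same range share one
// bitset build. The range keys below are hypothetical.
public class FilterCacheSketch {
    private final Map<String, BitSet> cache = new HashMap<String, BitSet>();
    int computations = 0; // counts simulated expensive bitset builds

    BitSet getFilter(String rangeKey) {
        BitSet cached = cache.get(rangeKey);
        if (cached == null) {
            computations++;        // simulate the expensive filter build
            cached = new BitSet();
            cache.put(rangeKey, cached);
        }
        return cached;
    }

    public static void main(String[] args) {
        FilterCacheSketch c = new FilterCacheSketch();
        c.getFilter("date:[1200 TO *]");
        c.getFilter("date:[1200 TO *]"); // identical range: cache hit
        c.getFilter("date:[1201 TO *]"); // minute resolution moved on: miss
        System.out.println(c.computations); // prints 2: two builds for three searches
    }
}
```

An open-ended range keyed on the current minute changes key almost every search, so the hit rate (and the benefit of CachingWrapperFilter) stays low, as the reply above argues.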
Lucandra - Lucene/Solr on Cassandra: April 26, NYC
Hello folks, Those of you in or near NYC and using Lucene or Solr should come to Lucandra - a Cassandra-based backend for Lucene and Solr on April 26th: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12979971/ The presenter will be Lucandra's author, Jake Luciani. Please spread the word. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Range Query Assistance
Joseph, If you can, get the latest Lucene and use NumericField to index your dates with appropriate precision, and then use NumericRangeQueries when searching. This will be faster than searching for string dates in a given range. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: i...@josephcrawford.com i...@josephcrawford.com To: java-user@lucene.apache.org Sent: Fri, April 16, 2010 9:23:30 AM Subject: Range Query Assistance Hello, I would like to query based on a start and end date. I was thinking something like this: start_date: [2101 TO todays date] end_date: [todays date TO 20900101] Would this work for me? Our dates are stored in the index as strings so I am not sure the syntax above would be correct. Any assistance would be appreciated. Thanks, Joseph Crawford - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
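[Editor's note: a small sketch of why string date ranges are fragile. Range queries on string fields compare terms lexicographically, so they only match chronological order when every date is zero-padded to a fixed yyyyMMdd width; the dates below are hypothetical examples.]

```java
// Sketch: lexicographic order equals chronological order only for
// fixed-width, zero-padded date strings. Dates here are hypothetical.
public class DateOrderSketch {
    public static void main(String[] args) {
        String start = "20010101", today = "20100416", end = "20900101";
        // Fixed-width yyyyMMdd strings compare in date order:
        System.out.println(start.compareTo(today) < 0); // prints true
        System.out.println(today.compareTo(end) < 0);   // prints true
        // But a short, non-padded bound breaks the ordering:
        System.out.println("2101".compareTo(today) < 0); // prints false
    }
}
```

This is one reason the reply recommends NumericField + NumericRangeQuery instead: numeric trie terms compare as numbers and the range query visits far fewer terms.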
Re: NumericField indexing performance
Hi, I actually don't entirely follow your change, because between the two versions the only different thing I see is the separate doc.add(dateField) call, which the first version didn't have. Also, if I understood Uwe correctly, he was suggesting reusing NumericField instances, which means new NumericField("date") should exist and be called only *once* in your code. The same goes for Document instances. GC threads will thank you and Uwe for this change. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message From: Tomislav Poljak tpol...@gmail.com To: java-user@lucene.apache.org Sent: Thu, April 15, 2010 7:41:02 AM Subject: RE: NumericField indexing performance Hi Uwe, thank you very much for your answers. I've done Document and NumericField reuse like this: Document doc = getDocument(); NumericField dateField = new NumericField("date"); for each doc: doc.add(dateField.setLongValue(Long.parseLong(DateTools.dateToString(date, DateTools.Resolution.MINUTE)))); but changing it to: Document doc = getDocument(); NumericField dateField = new NumericField("date"); doc.add(dateField); for each doc: dateField.setLongValue(Long.parseLong(DateTools.dateToString(date, DateTools.Resolution.MINUTE))); did the trick. Now indexing with NumericField takes minutes, not hours. Thanks again, Tomislav On Wed, 2010-04-14 at 23:38 +0200, Uwe Schindler wrote: One addition: If you are indexing millions of numeric fields, you should also try to reuse NumericField and Document instances (as described in the JavaDocs). NumericField internally creates a NumericTokenStream and lots of small objects (attributes), so GC cost may be high. This is just another idea.
Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Wednesday, April 14, 2010 11:28 PM To: java-user@lucene.apache.org Subject: RE: NumericField indexing performance Hi Tomislav, indexing with NumericField takes longer (at least for the default precision step of 4, which means that out of a 32-bit integer it makes 8 subterms, each covering 4 bits of the value). So you produce 8 times more terms during indexing that must be handled by the indexer. If you have lots of documents with distinct values, the term index gets larger and larger, but search performance increases dramatically (for NumericRangeQueries). So if you index *only* numeric fields and nothing else, 8 times slower indexing can be true. If you are not using NumericRangeQuery or you want to tune indexing performance, try larger precision steps like 6 or 8. If you don't use NumericRangeQuery and only want to index the numeric value as *one* term, use precStep=Integer.MAX_VALUE. Also check your memory requirements, as the indexer may need more memory and GC may cost too much. Also the index size will increase, so lots more I/O is done. Without more details I cannot say anything about your configuration. So please tell us: how many documents, how many fields, and how many numeric fields in which configuration do you use?
Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Tomislav Poljak [mailto:tpol...@gmail.com] Sent: Wednesday, April 14, 2010 8:13 PM To: java-user@lucene.apache.org Subject: NumericField indexing performance Hi, is it normal for indexing time to increase up to 10 times after introducing NumericField instead of Field (for two fields)? I've changed two date fields from a String representation (Field) to NumericField; now it is: doc.add(new NumericField("time").setIntValue(date.getTime()/24/3600)) and after this change indexing took 10x more time (before it was a few minutes, and after, more than an hour and a half). I've tested with a simple counter like this: doc.add(new NumericField("endTime").setIntValue(count++)) but nothing changed, it still takes around 10x longer. If I comment out adding one numeric field, index time drops significantly, and if I comment out both fields, indexing takes only a few minutes again. Tomislav - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
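[Editor's note: the term blow-up Uwe describes can be sketched with simple arithmetic: with precisionStep p, a b-bit value is indexed as roughly ceil(b/p) trie terms at decreasing precision.]

```java
// Sketch of the subterm count behind Uwe's explanation: a b-bit value
// indexed with precisionStep p produces about ceil(b / p) trie terms.
public class PrecisionStepSketch {
    static int subterms(int valueBits, int precisionStep) {
        return (valueBits + precisionStep - 1) / precisionStep; // ceil division
    }

    public static void main(String[] args) {
        System.out.println(subterms(32, 4)); // default step 4 on an int: 8 terms
        System.out.println(subterms(32, 8)); // larger step: only 4 terms
        System.out.println(subterms(64, 4)); // a long at step 4: 16 terms
    }
}
```

This is the tradeoff in the thread: a larger precisionStep writes fewer terms (faster indexing, smaller index) but makes NumericRangeQuery visit more terms per range.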
Re: Searching Subversion comments:
Hi Erick, For what it's worth, we are considering indexing JIRA comments over on http://search-lucene.com/ , though I'm not entirely convinced searching in comments would be super valuable. Would it? But note that JIRA (and LucidFind) already do that. For example, go to http://issues.apache.org/jira/browse/LUCENE-2061 and search for Attached first cut python script nrtBench.py.~10 (it's in that issue's comments) and JIRA will find that issue. What exactly are you looking to do/build? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message From: Erick Erickson erickerick...@gmail.com To: java-user java-user@lucene.apache.org Sent: Mon, March 8, 2010 3:48:41 PM Subject: Searching Subversion comments: Before I reinvent the wheel. Is there any convenient way to, say, find all the files associated with patch ? I realize one can (hopefully) get this information from JIRA, but... This is a subset of the problem of searching Subversion comments. I can see it being useful, especially for people coming into the code fresh. Grep (or the equivalent in the IDE) only goes so far. If there's any interest, I'm thinking of playing with http://svn-search.sourceforge.net/ to see what I could see and report back. It should be easy enough to set up on my machine at home, although I'm not set up to show it to others. And it's even based on Lucene. This is feeling recursive.. Mostly I'm checking to see if something like this has already been done and I just missed the boat. Besides, I'm curious... Erick - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: OutOfMemoryError
Maybe it's not a leak, Monique. :) If you use sorting in Lucene, then the FieldCache object will keep some data permanently in memory, for example. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message From: Monique Monteiro monique.lou...@gmail.com To: java-user@lucene.apache.org Sent: Fri, March 5, 2010 1:38:31 PM Subject: OutOfMemoryError Hi all, I’m new to Lucene and I’m evaluating it in a web application which looks up strings in a huge index – the index file contains 32GB. I keep a reference to a Searcher object during the application’s lifetime, but this object has strong memory requirements and keeps memory consumption around 950MB. I did some optimization in order to share some fields in two “composed” indices, but in a web application with less than 1GB for JVM, OutOfMemoryError is generated. It seems that the searcher keeps some form of cache which is not frequently released. I’d like to know if this kind of memory leak is normal according to Lucene’s behaviour and if the only available solution is adding memory to the JVM. Thanks in advance! -- Monique Monteiro, MSc IBM OOAD / SCJP / MCTS Web Blog: http://moniquelouise.spaces.live.com/ Twitter: http://twitter.com/monilouise MSN: monique_lou...@msn.com GTalk: monique.lou...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
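[Editor's note: a back-of-the-envelope sketch of the FieldCache point above. Sorting on a field pins roughly one cache entry per document in memory for the searcher's lifetime; the document count and per-entry size below are hypothetical, chosen only to show the arithmetic.]

```java
// Sketch: why sorting pins memory. Sorting loads one FieldCache entry per
// document for the IndexSearcher's lifetime. The doc count and entry size
// below are hypothetical examples.
public class FieldCacheEstimate {
    static long fieldCacheBytes(long numDocs, int bytesPerEntry) {
        return numDocs * bytesPerEntry;
    }

    public static void main(String[] args) {
        long numDocs = 50_000_000L; // hypothetical index size
        // Sorting on a long field caches roughly 8 bytes per doc:
        long bytes = fieldCacheBytes(numDocs, 8);
        System.out.println(bytes / (1024 * 1024) + " MB"); // prints 381 MB
    }
}
```

Caches for string sort fields cost more (ordinals plus the term values), so a few sorted fields over a large index can account for most of a sub-1GB heap without any leak being involved.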
Lucene: Finite-State Queries, Flexible Indexing, Scoring, and more
Hello folks, Those of you in or near New York and using Lucene or Solr should come to Lucene: Finite-State Queries, Flexible Indexing, Scoring, and more on March 24th: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12720960/ The presenter will be the hyper active Lucene committer Robert Muir. Please spread the word. Otis -- Lucene ecosystem search :: http://search-lucene.com/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Email Filter using Lucene 3.0
Hi Jamie, Could you say more about how it's not working? Not compiling? Run-time exceptions? Doesn't work as expected after you run a unit test for it? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message From: Jamie ja...@stimulussoft.com To: java-user@lucene.apache.org Sent: Fri, January 29, 2010 7:29:13 AM Subject: Email Filter using Lucene 3.0 Hi There, In the absence of documentation, I am trying to convert an EmailFilter class to Lucene 3.0. It's not working! Obviously, my understanding of the new token filter mechanism is misguided. Can someone in the know help me out for a sec and let me know where I am going wrong? Thanks.

import org.apache.commons.logging.*;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Stack;

/* Many thanks to Michael J. Prichard for his original email filter code.
   It is rewritten. */
public class EmailFilter extends TokenFilter implements Serializable {

    public EmailFilter(TokenStream in) {
        super(in);
    }

    public final boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) {
            return false;
        }
        TermAttribute termAtt = (TermAttribute) input.getAttribute(TermAttribute.class);
        char[] buffer = termAtt.termBuffer();
        final int bufferLength = termAtt.termLength();
        String emailAddress = new String(buffer, 0, bufferLength);
        emailAddress = emailAddress.replaceAll("<", "");
        emailAddress = emailAddress.replaceAll(">", "");
        emailAddress = emailAddress.replaceAll("\"", "");
        String[] parts = extractEmailParts(emailAddress);
        clearAttributes();
        for (int i = 0; i < parts.length; i++) {
            if (parts[i] != null) {
                TermAttribute newTermAttribute = addAttribute(TermAttribute.class);
                newTermAttribute.setTermBuffer(parts[i]);
                newTermAttribute.setTermLength(parts[i].length());
            }
        }
        return true;
    }

    private String[] extractWhitespaceParts(String email) {
        String[] whitespaceParts = email.split(" ");
        ArrayList<String> partsList = new ArrayList<String>();
        for (int i = 0; i < whitespaceParts.length; i++) {
            partsList.add(whitespaceParts[i]);
        }
        return whitespaceParts;
    }

    private String[] extractEmailParts(String email) {
        if (email.indexOf('@') == -1)
            return extractWhitespaceParts(email);
        ArrayList<String> partsList = new ArrayList<String>();
        String[] whitespaceParts = extractWhitespaceParts(email);
        for (int w = 0; w < whitespaceParts.length; w++) {
            if (whitespaceParts[w].indexOf('@') == -1)
                partsList.add(whitespaceParts[w]);
            else {
                partsList.add(whitespaceParts[w]);
                String[] splitOnAmpersand = whitespaceParts[w].split("@");
                try {
                    partsList.add(splitOnAmpersand[0]);
                    partsList.add(splitOnAmpersand[1]);
                } catch (ArrayIndexOutOfBoundsException ae) {}
                if (splitOnAmpersand.length > 0) {
                    String[] splitOnDot = splitOnAmpersand[0].split("\\.");
                    for (int i = 0; i < splitOnDot.length; i++) {
                        partsList.add(splitOnDot[i]);
                    }
                }
                if (splitOnAmpersand.length > 1) {
                    String[] splitOnDot = splitOnAmpersand[1].split("\\.");
                    for (int i = 0; i < splitOnDot.length; i++) {
                        partsList.add(splitOnDot[i]);
                    }
                    if (splitOnDot.length > 2) {
                        String domain = splitOnDot[splitOnDot.length - 2] + "." + splitOnDot[splitOnDot.length - 1];
                        partsList.add(domain);
                    }
                }
            }
        }
        return partsList.toArray(new String[0]);
    }
}

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
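[Editor's note: the token expansion that the EmailFilter above is aiming for can be shown outside any TokenStream. This is a standalone sketch of the splitting logic only (one address becoming the full address, local part, domain, and domain labels), not the Lucene 3.0 attribute plumbing that the question is actually about.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch of the intended token expansion: one email address
// becomes the full address, the local part, the domain, and each domain
// label, so each piece is searchable on its own.
public class EmailPartsSketch {
    static List<String> parts(String email) {
        List<String> out = new ArrayList<String>();
        out.add(email); // keep the full address as a token
        String[] atSplit = email.split("@");
        if (atSplit.length == 2) {
            out.add(atSplit[0]);                                 // local part
            out.add(atSplit[1]);                                 // domain
            out.addAll(Arrays.asList(atSplit[1].split("\\."))); // domain labels
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [jane.doe@example.com, jane.doe, example.com, example, com]
        System.out.println(parts("jane.doe@example.com"));
    }
}
```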
Re: index demo throws LockObtainFailedException
Fedora Core 4 is *ancient*! :) Could it be that the NFS client on it is old, and this is causing problems? I remember emails about NFS 3 vs. NFS 4 and some improvements in the latter. I don't recall the details and tend to keep my Lucene and Solr instances away from NFS mounts. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message From: Teruhiko Kurosaka k...@basistech.com To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Thu, January 28, 2010 8:15:26 PM Subject: index demo throws LockObtainFailedException We have many Linux machines of different brands, sharing the same NFS filesystem for home. The Lucene file indexing demo program is failing with LockObtainFailedException only on one particular Linux machine (Fedora Core 4, x86). I am including the console output at the bottom of this message. I tried Lucene 2.9.0, 2.9.1 and 3.0.0, and the result is identical. After searching the Internet, I saw some postings suggesting that this happens when the disk space is low. But there seems to be more than enough for this small demo. I didn't understand the suggestions about lockd. I'd appreciate any advice on how to find the cause of this Exception. Thank you in advance. T.
Kuro Kurosaka -bash-3.00$ cd lucene-3.0.0/ -bash-3.00$ ant demo-index-text Buildfile: build.xml jar.core-check: compile-demo: [mkdir] Created dir: /basis/users/kuro/opt/lucene-3.0.0/build/classes/demo [javac] Compiling 17 source files to /basis/users/kuro/opt/lucene-3.0.0/build/classes/demo jar-demo: [jar] Building jar: /basis/users/kuro/opt/lucene-3.0.0/lucene-demos-3.0.0.jar demo-index-text: [echo] - (1) Prepare dir - [echo] cd /basis/users/kuro/opt/lucene-3.0.0 [echo] rmdir demo-text-dir [echo] mkdir demo-text-dir [mkdir] Created dir: /basis/users/kuro/opt/lucene-3.0.0/demo-text-dir [echo] cd demo-text-dir [echo] - (2) Index the files located under /basis/users/kuro/opt/lucene-3.0.0/src - [echo] java -classpath ../lucene-core-3.0.0.jar;../lucene-demos-3.0.0.jar org.apache.lucene.demo.IndexFiles ../src/demo [java] caught a class org.apache.lucene.store.LockObtainFailedException [java] with message: Lock obtain timed out: NativeFSLock@/basis/users/kuro/opt/lucene-3.0.0/demo-text-dir/index/write.lock: java.io.IOException: Input/output error BUILD SUCCESSFUL Total time: 6 seconds -bash-3.00$ df -k . /tmp Filesystem 1K-blocks Used Available Use% Mounted on storev:/vol/exports/users 3119362560 2790661520 328701040 90% /basis/users /dev/sda2 9718360 7700764 1515968 84% / - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Proximity of More than Single Words?
Yes, that's just phrase slop, allowing for variable gaps between words. I *believe* the Surround QP, which works with the Span family of queries, does handle what you are looking for. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: T. R. Halvorson t...@midrivers.com To: java-user@lucene.apache.org Sent: Tue, January 19, 2010 9:40:07 AM Subject: Proximity of More than Single Words? For proximity expressions, the query parser documentation says, use the tilde, ~, symbol at the end of a Phrase. It gives the example jakarta apache~10 Does this mean that proximity can only be operated on single words enquoted in quotation marks? To clarify the question by comparison, on some systems, the w/ proximity operator lets one search for: crude w/4 west texas or spot prices w/3 gulf coast The Lucene documentation seems to imply that such searches cannot be constructed in any straightforward way (although there might be a way to get the effect by going around Cobb's Hill). Or does the Lucene syntax allow the examples to be cast as: crude west texas~4 or spot prices gulf coast~3 If not, is it a fair assessment to say that in Lucene, proximity is limited to being a part of phrase searching, and its function is exhausted by allowing a slop factor in matching phrases. Thanks in advance for any help with this. T. R. t...@midrivers.com http://www.linkedin.com/in/trhalvorson www.ncodian.com http://twitter.com/trhalvorson
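For readers wondering what the w/N operator means operationally: it is proximity between two sub-phrases, in either order, which the Surround parser expresses with span queries. A toy, self-contained sketch of that check over plain token positions (the class and method names are mine; real span matching in Lucene is considerably more involved):

```java
import java.util.Arrays;
import java.util.List;

public class ProximitySketch {

    // Position of the first occurrence of 'phrase' as a contiguous run in 'tokens', or -1.
    static int indexOfPhrase(List<String> tokens, List<String> phrase) {
        outer:
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            for (int j = 0; j < phrase.size(); j++) {
                if (!tokens.get(i + j).equals(phrase.get(j))) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // True if phrases a and b both occur with at most 'dist' tokens between
    // them, in either order -- roughly the w/N proximity the question asks about.
    static boolean within(List<String> tokens, List<String> a, List<String> b, int dist) {
        int ia = indexOfPhrase(tokens, a);
        int ib = indexOfPhrase(tokens, b);
        if (ia < 0 || ib < 0) {
            return false;
        }
        int gap = (ia < ib) ? ib - (ia + a.size()) : ia - (ib + b.size());
        return gap >= 0 && gap <= dist;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("crude", "oil", "prices", "in", "west", "texas");
        // "crude" w/4 "west texas": 3 tokens separate them, so this matches.
        System.out.println(within(doc, Arrays.asList("crude"),
                                  Arrays.asList("west", "texas"), 4));
    }
}
```

Note this differs from the single-phrase slop of "crude west texas"~4, where the slop is an edit-distance-like budget over one phrase rather than a gap between two phrases.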
Re: Lucene as a primary datastore
Guido, No, you should absolutely not need to constantly rebuild the index. If you find you have to do that, you'll know you are doing something wrong. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Guido Bartolucci guido.bartolu...@gmail.com To: java-user@lucene.apache.org Sent: Wed, January 20, 2010 4:25:09 PM Subject: Re: Lucene as a primary datastore Thanks for the response. I understand all of what you wrote, but what I care about and what I had a little trouble describing exactly in my previous question is: - Are all problems with Lucene obvious (e.g., you get an exception and you know your data is now bad) or are there subtle corruptions that just happen and because of that it makes sense to constantly rebuild the index? I ask this because if this isn't the case then replication isn't going to help; the problems probably get copied over to the other instances (unless I'm missing something). guido. On Wed, Jan 20, 2010 at 11:40 AM, Chris Lu wrote: I have 3 concerns about making Lucene a primary database. 1) Lucene is stable when it's stable. But you will have Java exceptions. What would you do when a FileNotFoundException or a Lucene 2.9.1 'read past EOF' IOException happens under system load? For me, I don't think the data is safe this way. Or, you can understand all Lucene APIs and never make any mistakes. Some databases, like some versions of MySQL, could corrupt data. No better, but it's still more robust. 2) As the name suggests, a Lucene index is just an index, like a database index; it's an auxiliary data structure. It's only fast in one way, but could be slow in other ways. 3) The more robust approach is to pull data out of the database and create a Lucene index. In case something goes wrong, you can always pull the data out again and create the index again. 
-- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding! Guido Bartolucci wrote: I know that the primary use case for Lucene is as an index of data that can be reconstructed (e.g., from a relational database or from spidering your corporate intranet). But, I'm curious if anyone uses Lucene as their primary datastore for their gold data. Is it good enough? Would anyone consider (or do people already) store data in Lucene that, if it was lost, would destroy their business? And no, I'm not suggesting that you don't back up this data, I'm just curious if there are problems with using Lucene in this way. Are there subtle corruptions that might show up in Lucene that wouldn't show up in Oracle or MySQL? I'm considering using Lucene in this way but I haven't been able to find any documentation describing this use case. Are there any studies of Lucene vs MySQL running for N years comparing the corruptions and recovery times? Am I just ignorant and scared of Lucene and too trusting of Oracle and MySQL? Thanks. -guido. (BTW, I did find a similar question asked back in 2007 in the archives but it doesn't really answer my question)
Re: Can you boost multiple terms using brackets ?
Yes, I believe it is the same. I bet the Explain explanation would help confirm this. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Paul Taylor paul_t...@fastmail.fm To: java-user@lucene.apache.org Sent: Wed, January 20, 2010 1:03:14 PM Subject: Can you boost multiple terms using brackets ? Hi is title:(return panther)^3 alias:(return panther) the same as title:return^3 title:panther^3 alias:(return panther) thanks Paul
Re: Lucene as a primary datastore
You are not alone, Guido. It's a good question. In my experience, Lucene is as stable as MySQL/PostgreSQL in terms of its ability to hold your data and not corrupt it. Of course, even with the most expensive databases, you'd want to make backups. The same goes with Lucene. Nowadays, one way people make backups is via replication. :) Solr users thus often get backups for free, as do people who put copies of their data on file systems like HDFS, which tend to have replication turned on. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Guido Bartolucci guido.bartolu...@gmail.com To: java-user@lucene.apache.org Sent: Tue, January 19, 2010 10:58:36 PM Subject: Lucene as a primary datastore I know that the primary use case for Lucene is as an index of data that can be reconstructed (e.g., from a relational database or from spidering your corporate intranet). But, I'm curious if anyone uses Lucene as their primary datastore for their gold data. Is it good enough? Would anyone consider (or do people already) store data in Lucene that, if it was lost, would destroy their business? And no, I'm not suggesting that you don't back up this data, I'm just curious if there are problems with using Lucene in this way. Are there subtle corruptions that might show up in Lucene that wouldn't show up in Oracle or MySQL? I'm considering using Lucene in this way but I haven't been able to find any documentation describing this use case. Are there any studies of Lucene vs MySQL running for N years comparing the corruptions and recovery times? Am I just ignorant and scared of Lucene and too trusting of Oracle and MySQL? Thanks. -guido. 
(BTW, I did find a similar question asked back in 2007 in the archives but it doesn't really answer my question)
Re: Lucene as a primary datastore
Have you seen the Hot Backups with Lucene paper available via http://www.manning.com/hatcher3/ ? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Ganesh emailg...@yahoo.co.in To: java-user@lucene.apache.org Sent: Wed, January 20, 2010 1:13:21 AM Subject: Re: Lucene as a primary datastore We have data in compound files and we use Lucene as the primary database. It's working great and much faster with millions of records. The only issue I face is with sorting. Lucene sorting consumes a good amount of memory. I don't know much about the MySQL/PostgreSQL databases and how they behave with millions of records, but I guess their sorting memory consumption would be less. It would be great if Lucene had the ability to do backups / replication. I don't know how to modify/use the Solr script. Regards Ganesh - Original Message - From: Otis Gospodnetic To: ; Sent: Wednesday, January 20, 2010 10:45 AM Subject: Re: Lucene as a primary datastore You are not alone, Guido. It's a good question. In my experience, Lucene is as stable as MySQL/PostgreSQL in terms of its ability to hold your data and not corrupt it. Of course, even with the most expensive databases, you'd want to make backups. The same goes for Lucene. Nowadays, one way people make backups is via replication. :) Solr users thus often get backups for free, as do people who put copies of their data on file systems like HDFS, which tend to have replication turned on. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Guido Bartolucci To: java-user@lucene.apache.org Sent: Tue, January 19, 2010 10:58:36 PM Subject: Lucene as a primary datastore I know that the primary use case for Lucene is as an index of data that can be reconstructed (e.g., from a relational database or from spidering your corporate intranet). But, I'm curious if anyone uses Lucene as their primary datastore for their gold data. Is it good enough? 
Would anyone consider (or do people already) store data in Lucene that, if it was lost, would destroy their business? And no, I'm not suggesting that you don't back up this data, I'm just curious if there are problems with using Lucene in this way. Are there subtle corruptions that might show up in Lucene that wouldn't show up in Oracle or MySQL? I'm considering using Lucene in this way but I haven't been able to find any documentation describing this use case. Are there any studies of Lucene vs MySQL running for N years comparing the corruptions and recovery times? Am I just ignorant and scared of Lucene and too trusting of Oracle and MySQL? Thanks. -guido. (BTW, I did find a similar question asked back in 2007 in the archives but it doesn't really answer my question)
Re: A way to download URLs and index better ?
Hello, Use Droids, it's much simpler than Nutch or Heritrix: http://incubator.apache.org/droids/ Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Phan The Dai thienthanhom...@gmail.com To: java-user@lucene.apache.org Sent: Sat, January 16, 2010 2:20:47 AM Subject: A way to download URLs and index better ? Hi everyone, please help me with this question: I need to download some webpages from a list of URLs (about 200 links) and then index them with Lucene. This list is not fixed, because it depends on my process. Currently, in my web application, I wrote a class for downloading, but its download time is too long. Please recommend a Java library suitable for my situation to optimize downloading. Examples would be very welcome (INPUT: list of URLs; OUTPUT: webpage content, or indexed repository). Thank you very much.
Re: Max Segmentation Size when Optimizing Index
I think Jason meant 15-20GB segments? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch From: Jason Rutherglen jason.rutherg...@gmail.com To: java-user@lucene.apache.org Sent: Wed, January 13, 2010 5:54:38 PM Subject: Re: Max Segmentation Size when Optimizing Index Yes... You could hack LogMergePolicy to do something else. I use optimise(numsegments:5) regularly on 80GB indexes, that if optimized to 1 segment, would thrash the IO excessively. This works fine because 15-20GB indexes are plenty large and fast. On Wed, Jan 13, 2010 at 2:44 PM, Trin Chavalittumrong mrt...@gmail.com wrote: Seems like optimize() only cares about final number of segments rather than the size of the segment. Is it so? On Wed, Jan 13, 2010 at 2:35 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: There's a different method in LogMergePolicy that performs the optimize... Right, so normal merging uses the findMerges method, then there's a findMergeOptimize (method names could be inaccurate). On Wed, Jan 13, 2010 at 2:29 PM, Trin Chavalittumrong mrt...@gmail.com wrote: Do you mean MergePolicy is only used during index time and will be ignored by by the Optimize() process? On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Oh ok, you're asking about optimizing... I think that's a different algorithm inside LogMergePolicy. I think it ignores the maxMergeMB param. On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong mrt...@gmail.com wrote: Thanks, Jason. Is my understanding correct that LogByteSizeMergePolicy.setMaxMergeMB(100) will prevent merging of two segments that is larger than 100 Mb each at the optimizing time? If so, why do think would I still see segment that is larger than 200 MB? On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Hi Trin, There was recently a discussion about this, the max size is for the before merge segments, rather than the resultant merged segment (if that makes sense). 
It'd be great if we had a merge policy that limited the resultant merged segment, though that'd be a rough approximation at best. Jason On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong mrt...@gmail.com wrote: Hi, I am trying to optimize the index, which would merge different segments together. Let's say the index folder is 1GB in total; I need each segment to be no larger than 200MB. I tried to use *LogByteSizeMergePolicy* and setMaxMergeMB(100) to ensure no segment after merging would be over 200MB. However, I still see segments that are larger than 200MB. I did call IndexWriter.optimize(20) to make sure there are enough segments to allow each segment to be under 200MB. Can someone let me know if I am using this right? Or any suggestion on how to tackle this would be helpful. Thanks, Trin
NYC Search in the Cloud meetup: Jan 20
Hello, If Search Engine Integration, Deployment and Scaling in the Cloud sounds interesting to you, and you are going to be in or near New York next Wednesday (Jan 20) evening: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/ Sorry for dupes to those of you subscribed to multiple @lucene lists. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
Re: how to follow intranet: configuration in nutch website
Zhou, Your question will get more attention if you send it to the nutch-u...@lucene.apache.org list instead. This list is for Lucene Java. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: jyzhou...@yahoo.com jyzhou...@yahoo.com To: java-user@lucene.apache.org Sent: Tue, January 12, 2010 10:51:59 PM Subject: how to follow intranet: configuration in nutch website Hi, I am trying to follow the instructions from http://lucene.apache.org/nutch/tutorial8.html . Intranet: Configuration To configure things for intranet crawling you must: 1. Create a directory with a flat file of root urls. For example, to crawl the nutch site you might start with a file named urls/nutch containing the url of just the Nutch home page. All other Nutch pages should be reachable from this page. The urls/nutch file would thus contain: http://lucene.apache.org/nutch/ I do not understand this. Can anyone help me out? Thanks. zhou
Re: lucene index file randomly crash and need to reindex
Hi, Use the latest version of Lucene, obey Lucene's locks, write with 1 IndexWriter, avoid NFS... Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: zhang99 second_co...@yahoo.com To: java-user@lucene.apache.org Sent: Tue, January 12, 2010 10:41:19 PM Subject: lucene index file randomly crash and need to reindex How do you all deal with this issue of occasionally needing to reindex? What recommendations do you have to minimize this? -- View this message in context: http://old.nabble.com/lucene-index-file-randomly-crash-and-need-to-reindex-tp27139147p27139147.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: a complete solution for building a website search with lucene
Nutch is written in Java, so Nutch itself *should* work on non-Linux OSs that the JVM supports. But it does contain some shell scripts, as does Hadoop, which Nutch uses. Oh, I guess Windows people run it under Cygwin? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: jyzhou...@yahoo.com jyzhou...@yahoo.com To: java-user@lucene.apache.org Sent: Fri, January 8, 2010 5:03:41 AM Subject: Re: a complete solution for building a website search with lucene Hi Paul, Thanks. Use Nutch to do the crawling, and integrate Lucene into the web application, so that it can do search online. BTW, Nutch seems to have only a Linux version, while my development is on Windows. Am I right? Zhou --- On Fri, 8/1/10, Paul Libbrecht wrote: From: Paul Libbrecht Subject: Re: a complete solution for building a website search with lucene To: java-user@lucene.apache.org Date: Friday, 8 January, 2010, 4:27 PM Zhou, Lucene is a back-end library; it's very useful for developers but it is not a complete site-search engine. A Lucene-based site-search engine is Nutch; it does crawl. Solr also provides functions close to these, with a lot of thought given to flexible integration; its crawling methods are rather based on feeds or other acquisition methods (see DIH for example). paul Le 08-janv.-10 à 08:08, a écrit : Hi, I am new to Lucene. To build a web search function, it needs to have a backend indexing function. But, before that, should one run a crawler? Because Lucene indexes are based on HTML documents, while a crawler can turn the website pages into HTML documents. Am I right? If so, could anyone suggest a crawler to me, like Nutch? Thanks Zhou
Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
Yuliya, The index *directory* will be larger *while* you are optimizing. After the optimization is completed successfully, the index directory will be smaller. It is possible that your index directory is large(r) because you have some left-over segments (e.g. from some earlier failed/interrupted optimizations) that are not really a part of the index. After optimizing, you should have only 1 segment, so if you see more than 1 segment, look at the ones with older timestamps. Those can be (re)moved. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Yuliya Palchaninava y...@solute.de To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Thu, January 7, 2010 11:23:08 AM Subject: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index Hi, According to the API documentation: In general, once the optimize completes, the total size of the index will be less than the size of the starting index. It could be quite a bit smaller (if there were many pending deletes) or just slightly smaller. In our case the index becomes not smaller but larger, namely thrice as large. The unoptimized index doesn't contain compressed fields, which could have caused the growth of the index due to the optimization. So we cannot explain what happens. Does someone have an explanation for the index growth due to the optimization? Thanks, Yuliya
Re: Performance Results on changing the way fields are stored
You could try Avro instead of JSON/XML/Java Serialization. It's compact (and new). http://hadoop.apache.org/avro/ Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Paul Taylor paul_t...@fastmail.fm To: java-user@lucene.apache.org Sent: Tue, January 5, 2010 7:44:21 AM Subject: Performance Results on changing the way fields are stored So currently in my index I index and store a number of small fields. I need both so I can search on the fields; then I use the stored versions to generate the output document (which is either an XML or JSON representation). Because I read that stored and indexed fields are dealt with completely separately, I tried another tack: only storing one field which was a serialized version of the output document. This solves a couple of issues I was having, but I was disappointed that both the size of the index and the index build time increased. I thought that if all the stored data was held in one field the resultant index would be smaller, and I didn't expect index time to increase by as much as it did. I was also surprised that Java serialization was slower and used more space than both JSON and XML serialization. Results as follows: Type : Time : Index Size Only indexed, no norms 105 : 38 MB Only indexed 111 : 43 MB Same fields written as Indexed and Stored (current situation) 115 : 83 MB Fields Indexed, One JAXB class Stored using JSON Marshalling 140 : 115 MB Fields Indexed, One JAXB class Stored using XML Marshalling 189 : 198 MB Fields Indexed, One JAXB class Stored using Java Serialization 305 : 485 MB Are these results to be expected? Could anybody suggest anything else I could do? Paul
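The size gap Paul measured is easy to reproduce in miniature: java.io serialization writes a stream header and type metadata on top of every object, while a hand-rolled encoding writes only the payload bytes. A self-contained sketch (not Paul's actual JAXB setup; class and method names are mine):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class SerializedSize {

    // Number of bytes java.io serialization produces for one object.
    static int javaSerializedSize(Serializable o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(o);
            oos.close();
            return bos.size();
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams should not fail
        }
    }

    public static void main(String[] args) {
        String field = "Return of the Pink Panther";
        int ser = javaSerializedSize(field);
        int raw = field.getBytes(StandardCharsets.UTF_8).length;
        // The serialized form carries a stream header plus type info on top of the text,
        // which is one reason Paul's Java-serialization index came out largest.
        System.out.println("java serialization: " + ser + " bytes, raw UTF-8: " + raw + " bytes");
    }
}
```

The per-object overhead shrinks proportionally for large objects, but Paul's stored fields are small, so the header and metadata dominate, and the same effect repeats for every stored document.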
Re: AW: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
Maybe you can paste a directory listing before optimization and after optimization? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Yuliya Palchaninava y...@solute.de To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Thu, January 7, 2010 11:50:29 AM Subject: AW: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index Otis, thanks for the answer. Unfortunately the index *directory* remains larger *after* the optimization. In our case the optimization was/is completed successfully and, as you say, there is only one segment in the directory. Some other ideas? Thanks, Yuliya -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Thursday, January 7, 2010 17:35 To: java-user@lucene.apache.org Subject: Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index Yuliya, The index *directory* will be larger *while* you are optimizing. After the optimization is completed successfully, the index directory will be smaller. It is possible that your index directory is large(r) because you have some left-over segments (e.g. from some earlier failed/interrupted optimizations) that are not really a part of the index. After optimizing, you should have only 1 segment, so if you see more than 1 segment, look at the ones with older timestamps. Those can be (re)moved. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Yuliya Palchaninava To: java-user@lucene.apache.org Sent: Thu, January 7, 2010 11:23:08 AM Subject: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index Hi, According to the API documentation: In general, once the optimize completes, the total size of the index will be less than the size of the starting index. It could be quite a bit smaller (if there were many pending deletes) or just slightly smaller. 
In our case the index becomes not smaller but larger, namely thrice as large. The unoptimized index doesn't contain compressed fields, which could have caused the growth of the index due to the optimization. So we cannot explain what happens. Does someone have an explanation for the index growth due to the optimization? Thanks, Yuliya
Re: Is there a way to limit the size of an index?
Merge factor controls how many segments are merged at once. The default is 10. The maxMergeMB setting sets the max size for a given segment to be included in a merge. I wonder if renaming that to maxSegSizeMergeMB would make it more obvious what this does? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch Roughly, the upper bound on merged segments is the sum of their sizes. So the rough upper bound on any segment's size is mergeFactor * maxMergeMB. Mike On Thu, Jan 7, 2010 at 11:04 AM, Dvora wrote: Can you explain how the combination of merge factor and max merge size control the size of files? For example, if one would like to limit the files size to 3, 4 or 7MB - how can these parameter values be predicted? Michael McCandless-2 wrote: This tells the IndexWriter NOT to merge any segment that's over 1.0 MB in size. With a default merge factor of 10, this should generally mean you don't get a segment over 10MB, though it may not be a hard guarantee (you can lower the 1.0 if you still see a segment over 10 MB). -- View this message in context: http://old.nabble.com/Is-there-a-way-to-limit-the-size-of-an-index--tp27056573p27062291.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
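Mike's rule of thumb above can be written down directly: since up to mergeFactor segments, each no larger than maxMergeMB, are merged into one, the merged result can approach mergeFactor * maxMergeMB. A sketch of that arithmetic (the class and method names are mine; this is the rough bound from the thread, not a hard guarantee):

```java
public class MergeBound {

    // Rough upper bound on the size of a merged segment under LogByteSizeMergePolicy:
    // at most 'mergeFactor' segments, each up to 'maxMergeMB', are merged into one.
    static double roughMaxMergedMB(int mergeFactor, double maxMergeMB) {
        return mergeFactor * maxMergeMB;
    }

    public static void main(String[] args) {
        // Dvora wants segments capped around 10 MB: with the default mergeFactor
        // of 10, setMaxMergeMB(1.0) keeps merged segments at roughly 10 * 1.0 MB.
        System.out.println(roughMaxMergedMB(10, 1.0) + " MB");
    }
}
```

So to target a cap of C megabytes, one would set maxMergeMB to about C / mergeFactor, then lower it further if oversized segments still appear.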
Re: Is there a way to limit the size of an index?
Sure, sounds good, maybe even drop the "ing". Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Michael McCandless luc...@mikemccandless.com To: java-user@lucene.apache.org Sent: Thu, January 7, 2010 2:28:15 PM Subject: Re: Is there a way to limit the size of an index? On Thu, Jan 7, 2010 at 2:23 PM, Otis Gospodnetic wrote: Merge factor controls how many segments are merged at once. The default is 10. The maxMergeMB setting sets the max size for a given segment to be included in a merge. I wonder if renaming that to maxSegSizeMergeMB would make it more obvious what this does? Well... that setting is already in LogByteSizeMergePolicy (not IndexWriter), so I think in that context it's pretty clear? Though I'd love to find a better name that conveys that the size limitation applies to the segments *being* merged, not to the resulting merged segment. maxStartingSegSizeMB? Mike
Re: Implementing filtering based on multiple fields
For something like CSE, I think you want to isolate users and their data/indices. I'd look at Bixo or Nutch or Droids == Lucene or Solr Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Yaniv Ben Yosef yani...@gmail.com To: java-user@lucene.apache.org Sent: Thu, January 7, 2010 3:54:22 PM Subject: Implementing filtering based on multiple fields Hi, I'm very new to Lucene. In fact, I'm at the beginning of an evaluation phase, trying to figure whether Lucene is the right fit for my needs. The project I'm involved in requires something similar to the Google Custom Search Engine (CSE). In CSE, each user can define a set (could be a large set) of websites, and limit the search to only those websites. So for example, I can create a CSE that searches all web pages on cnn.com, msnbc.com and nytimes.com only. I am trying to understand whether and how I can do something similar in Lucene. The FAQ hints about this possibility here, but it mentions a class that no longer exists in 3.0 (QueryFilter), and is very laconic about the suggested options. Also I'm not sure how well it will perform in my use case (or even if it fits at all). I thought about creating a separate index for each user or CSE. However, my system should be able to handle tens of thousands of concurrent users. I haven't done any analysis yet on how this will affect CPU, RAM, I/O and storage size, but was wondering if any of you experienced Lucene users/developers think it's a good direction. If that's not a good idea, what would be a good strategy here? Any help will be much appreciated, Yaniv
Re: Implementing filtering based on multiple fields
Ah, well, masking it didn't help. Yes, ignore Bixo, Nutch, and Droids then. Consider DataImportHandler from Solr, or wait a bit for Lucene Connectors Framework to materialize. Or use LuSql, or DbSight, or Sematext's Database Indexer.

Yes, I was suggesting a separate index for each user. That's what Simpy uses: it has some 200K indices on 1 box and, if I remember correctly, handles dozens of QPS without any caching. Load is under 1.0.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

- Original Message -
From: Yaniv Ben Yosef yani...@gmail.com
To: java-user@lucene.apache.org
Sent: Thu, January 7, 2010 6:55:18 PM
Subject: Re: Implementing filtering based on multiple fields

Thanks Otis. If I understand correctly, Bixo, Nutch and Droids are technologies for crawling the web and building an index. My project is actually about indexing a large database, where you can think of every row as a web page, and a particular column as the equivalent of a web site. (I didn't mention that in the previous post because I didn't want to complicate my question, and it seems equivalent to Google CSE given that Lucene can use virtually any input for indexing, AFAIK.) Therefore I'm not sure the frameworks you've mentioned are applicable to my project, as they seem to be related to web page indexing, but perhaps I'm missing something.

Also, what did you mean about isolating users and their data/indices? Did you mean that I should create a separate index per user?

Thanks again!
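A minimal sketch of the one-index-per-user layout suggested above (the root path and the id-prefix sharding scheme are hypothetical; Lucene 2.9/3.0-era API assumed):

```java
import java.io.File;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class PerUserIndexes {
    private final File root;

    public PerUserIndexes(File root) {
        this.root = root;
    }

    /** Each user searches only his own small index, so queries need no
     *  site/user filtering and users are fully isolated from each other. */
    public IndexSearcher searcherFor(String userId) throws Exception {
        // Shard directories by an id prefix to avoid one huge flat directory,
        // e.g. /indexes/ab/abc123/
        File dir = new File(new File(root, userId.substring(0, 2)), userId);
        return new IndexSearcher(FSDirectory.open(dir), true); // read-only
    }
}
```

The trade-off against a single shared index with a per-user filter is many small commits and open files versus one large index whose filters must be computed and cached per user.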
Re: NGramTokenizer stops working after about 1000 terms
This actually rings a bell for me... have a look at Lucene's JIRA; I think this was reported as a bug once and perhaps has been fixed. Note that Lucene in Action 2 has a case study about searching source code. You may find that study interesting.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

- Original Message -
From: Stefan Trcek wzzelfz...@abas.de
To: java-user@lucene.apache.org
Sent: Mon, December 14, 2009 9:39:34 AM
Subject: NGramTokenizer stops working after about 1000 terms

Hello

For a source code (git repo) search engine I chose to use an ngram analyzer for substring search (something like git blame). This worked fine except it didn't find some strings. I tracked it down to the analyzer: when the ngram analyzer had yielded about 1000 terms it stopped yielding more, seemingly at most (1024 - ngram_length) terms. When I use StandardAnalyzer it works as expected. Is this a bug or did I miss a limit? Tested with lucene-2.9.1 and 3.0. This is the core routine I use:

    public static class NGramAnalyzer5 extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new NGramTokenizer(reader, 5, 5);
        }
    }

    public static String[] analyzeString(Analyzer analyzer, String fieldName,
            String string) throws IOException {
        List<String> output = new ArrayList<String>();
        TokenStream tokenStream = analyzer.tokenStream(fieldName,
                new StringReader(string));
        TermAttribute termAtt = (TermAttribute) tokenStream.addAttribute(
                TermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            output.add(termAtt.term());
        }
        tokenStream.end();
        tokenStream.close();
        return output.toArray(new String[0]);
    }

The complete example is attached. in.txt must be in . and is plain ASCII.

Stefan
Re: Snowball Stemmer Question
Chris,

You could look at KStem to see if that does a better job. Or perhaps WordNet can be used to get the lemma of those terms instead of using stemming. Finally, what was I going to say... ah, yes, using synonyms may be another way this can be handled.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

- Original Message -
From: Christopher Condit con...@sdsc.edu
To: java-user@lucene.apache.org
Sent: Thu, December 3, 2009 3:04:03 PM
Subject: Snowball Stemmer Question

The Snowball analyzer works well for certain constructs but not others. In particular I'm having a problem with things like colossal vs colossus and hippocampus vs hippocampal. Is there a way to customize the analyzer to include these rules?

Thanks,
-Chris
Re: Getting score of explicit documents for a query
I think you should be able to use 1+ FilteredQuery (with the IDs of your docs) with your main query and thus get scores only for the docs that interest you.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

- Original Message -
From: Erdinc Yilmazel erd...@yilmazel.com
To: java-user@lucene.apache.org
Sent: Thu, December 3, 2009 11:37:08 AM
Subject: Getting score of explicit documents for a query

Hi,

Given a query, is there a way to learn the score of some specific documents in the index against this query? I don't want to do a global search in the index and rank and sort all the matching documents. What I want to do is learn the rank of a bunch of documents in the index that I can identify by document id.

Erdinc
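One way to realize the suggestion above, sketched against the Lucene 2.9/3.0 API using the contrib TermsFilter; the "id" field name is a hypothetical stored/indexed identifier field:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermsFilter; // contrib/queries
import org.apache.lucene.search.TopDocs;

public class ScoreSpecificDocs {
    public static TopDocs scoreOnly(IndexSearcher searcher, Query mainQuery,
                                    String... ids) throws Exception {
        TermsFilter idFilter = new TermsFilter();
        for (String id : ids) {
            idFilter.addTerm(new Term("id", id));
        }
        // The filter restricts which docs can match; scoring still comes from
        // the main query, so you get each listed doc's score for that query.
        return searcher.search(new FilteredQuery(mainQuery, idFilter), ids.length);
    }
}
```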
NYC Search Discovery Meetup
Hello,

For those living in or near NYC: you may be interested in joining (and/or presenting at?) the NYC Search Discovery Meetup. Topics are: search, machine learning, data mining, NLP, information gathering, information extraction, etc.

http://www.meetup.com/NYC-Search-and-Discovery/

Our previous/first meetup was about solr-python and parse.ly (a service that makes use of Solr and solr-python). Tomorrow (December 2, 2009) we have: Incorporating Probabilistic Retrieval Knowledge into TFIDF-based Search Engine. You can RSVP at: http://www.meetup.com/NYC-Search-and-Discovery/calendar/11745435/

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
Re: Need help regarding implementation of autosuggest using jquery
Hi,

Have a look at http://www.sematext.com/products/autocomplete/index.html - it handles Chinese and large volumes of data.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

- Original Message -
From: fulin tang tangfu...@gmail.com
To: java-user@lucene.apache.org
Sent: Thu, November 26, 2009 9:10:41 PM
Subject: Re: Need help regarding implementation of autosuggest using jquery

By the way, we search Chinese words, so a trie tree doesn't look perfect for us either.

2009/11/27 fulin tang:
> We have the same needs in our music search, and we found this is not a good
> approach for performance reasons. Has anyone implemented autosuggestion in a
> heavy production environment? Any suggestions?

2009/11/26 Anshum:
> Try this; change the code as required:

    import java.io.IOException;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    /**
     * @author anshum
     */
    public class GetTermsToSuggest {
        private static void getTerms(String inputText) {
            IndexReader reader = null;
            try {
                reader = IndexReader.open("/home/anshum/index/testindex");
                String field = "fieldname";
                field = field.intern();
                TermEnum tenum = reader.terms(new Term("fieldname", ""));
                Boolean hasRun = false;
                try {
                    do {
                        final Term term = tenum.term();
                        if (term == null || term.field() != field)
                            break;
                        final String termText = term.text();
                        if (termText.startsWith(inputText)) {
                            System.out.println(termText);
                            hasRun = true;
                        } else if (hasRun == true)
                            break;
                    } while (tenum.next());
                    tenum.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            } catch (CorruptIndexException e2) {
                e2.printStackTrace();
            } catch (IOException e2) {
                e2.printStackTrace();
            }
        }

        /**
         * @param args
         */
        public static void main(String[] args) {
            GetTermsToSuggest.getTerms(args[0]);
        }
    }

> --
> Anshum Gupta
> Naukri Labs! http://ai-cafe.blogspot.com
> The facts expressed here belong to everybody, the opinions to me.
> The distinction is yours to draw

On Thu, Nov 26, 2009 at 3:19 PM, Uwe Schindler wrote:
> You can fix this if you just create the initial term not with "", instead
> with your prefix:
>
>     TermEnum tenum = reader.terms(new Term(field, prefix));
>
> And inside the while loop just break out:
>
>     if (!termText.startsWith(prefix)) break;
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de  eMail: u...@thetaphi.de

-----Original Message-----
From: DHIVYA M [mailto:dhivyakrishna...@yahoo.com]
Sent: Thursday, November 26, 2009 10:39 AM
To: java-user@lucene.apache.org
Subject: RE: Need help regarding implementation of autosuggest using jquery

Sir,

Your suggestion was fantastic. I tried the code mentioned below, but it shows me the entire list of indexed words starting from the letter I give as input. For example, if I give "fo" I get all the index entries from words starting with "fo" up to words starting with "z" - i.e. it starts displaying from the word matching the search word and ends with the last word available in the index file. Kindly suggest a solution for this problem.

Thanks in advance,
Dhivya

--- On Wed, 25/11/09, Uwe Schindler wrote:

Hi Dhivya,

you can iterate all terms in the index using a TermEnum, which can be retrieved using IndexReader.terms(Term startTerm). If you are interested in all terms from a specific field, position the TermEnum on the first possible term in this field ("") and iterate until the field name changes. As terms in the TermEnum are ordered first by field name, then by term text (in UTF-16 order), the loop would look like this:

    IndexReader reader = ...
    String field = "..."; field = field.intern(); // important for the while loop
    TermEnum tenum = reader.terms(new Term(field, ""));
    try {
        do {
            final Term term = tenum.term();
            if (term == null || term.field() != field) break;
            final String termText = term.text();
            // do something with the termText
        } while (tenum.next());
    } finally {
        tenum.close();
    }

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de  eMail: u...@thetaphi.de

-----Original Message-----
From: DHIVYA M [mailto:dhivyakrishna...@yahoo.com]
Sent: Wednesday, November 25, 2009 8:06 AM
To: java user
Subject: Need help regarding implementation of autosuggest using jquery

Hi all,

Am using lucene
Re: Is Lucene a good choice for PB scale mailbox search?
For what it's worth, AOL uses a Solr cluster to handle searches for @aol users. Each user has his own index.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message -
From: fulin tang tangfu...@gmail.com
To: java-user@lucene.apache.org
Sent: Mon, November 23, 2009 9:35:57 PM
Subject: Is Lucene a good choice for PB scale mailbox search?

We are going to add full-text search to our mailbox service. The problem is we have more than 1 PB of mail there, and obviously we don't want to add another PB of storage for the search service, so we hope the index data will be small enough for storage while search stays fast. Luckily, every user only searches his own mail, so we can split the data into a lot of indexes instead of keeping it in one big one.

So, after all these concerns, the question is: is Lucene a good choice for this? Or what is the right way to do this? Has anyone done this before? All opinions and comments are welcome!

fulin

--
梦的开始挣扎于城市的边缘 心的远方执着在脚步的瞬间 我的宿命埋藏了寂寞的永远
Re: lucene not returning correct results eventhough search query is present
Hi,

Please use the java-user list for user questions.

Are you sure the file got fully indexed in the first place? Use Luke to check. Also, see: IndexWriter.MaxFieldLength - by default only the first 10,000 terms of a document field get indexed, which would explain words after a certain line not being searchable.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message -
From: udayKIRAN udayacc2...@yahoo.com
To: java-...@lucene.apache.org
Sent: Thu, November 19, 2009 12:08:32 AM
Subject: lucene not returning correct results eventhough search query is present

hi,

I am using Lucene to search log files, but I am not able to search any words in the file that come after a certain line. I am using a FileReader to search; Lucene is searching only up to a certain line in the file. Can anyone help me? These are a few lines of my code:

    IndexWriter writer = new IndexWriter(idx, new StandardAnalyzer(), true);
    writer.addDocument(createDocument(filename, new FileReader(new File(filepath))));
    writer.optimize();
    writer.close();

    public static Document createDocument(String folderpath, FileReader fr) {
        Document doc = new Document();
        doc.add(new Field("title", folderpath, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("content", fr));
        return doc;
    }

    // search function
    public static void search(Searcher searcher, String queryString)
            throws ParseException, IOException {
        Query query = new QueryParser("content", new StandardAnalyzer()).parse(queryString);
        // Search for the query
        Hits hits = searcher.search(query);
        TopDocs topdocs = searcher.search(query, 1);
        // Examine the Hits object to see if there were any matches
        int hitCount = hits.length();
        if (hitCount == 0) {
            System.out.println("No matches were found for \"" + queryString + "\"");
        } else {
            System.out.println("Hits for \"" + queryString + "\" were found in files by:");
            for (int i = 0; i < hitCount; i++) {
                Document doc = hits.doc(i);
                System.out.println((i + 1) + ". " + doc.get("title"));
            }
        }
        System.out.println();
    }

--
View this message in context: http://old.nabble.com/lucene-not-returning-correct-results-eventhough-search-query-is-present-tp26420491p26420491.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
Re: Why Lucene takes longer time for the first query and less for subsequent ones
Hello,

Most likely due to the operating system caching the relevant portions of the index after the first set of queries.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message -
From: Dinh pcd...@gmail.com
To: java-user@lucene.apache.org
Sent: Tue, November 17, 2009 12:39:14 PM
Subject: Why Lucene takes longer time for the first query and less for subsequent ones

Hi all,

I made a list of 4 simple, single-term queries, ran the 4 searches via Lucene, and found that when a term is searched for the first time, Lucene takes quite a bit of time to handle it.

- Query A
00:27:28,781 INFO LuceneSearchService:151 - Internal search took 328.21463ms
00:27:28,781 INFO SearchController:86 - Page rendered in 338.29553ms
- Query B
00:27:39,171 INFO LuceneSearchService:151 - Internal search took 480.30908ms
00:27:39,187 INFO SearchController:86 - Page rendered in 493.07327ms
- Query C
00:27:46,765 INFO LuceneSearchService:151 - Internal search took 189.33635ms
00:27:46,765 INFO SearchController:86 - Page rendered in 195.43823ms
- Query D
00:28:00,312 INFO LuceneSearchService:151 - Internal search took 330.3596ms
00:28:00,328 INFO SearchController:86 - Page rendered in 347.34747ms

It looks not so good at first glance because I have only 500,000 indexed documents. However, when I searched them again I found that Lucene ran much faster:

- Query A
00:28:04,046 INFO LuceneSearchService:151 - Internal search took 3.90301ms
00:28:04,062 INFO SearchController:86 - Page rendered in 15.694173ms
- Query C
00:28:15,390 INFO LuceneSearchService:151 - Internal search took 1.425879ms
00:28:15,390 INFO SearchController:86 - Page rendered in 7.946541ms
- Query D
00:28:26,031 INFO LuceneSearchService:151 - Internal search took 1.849956ms
00:28:26,046 INFO SearchController:86 - Page rendered in 12.023037ms
- Query B
00:28:31,609 INFO LuceneSearchService:151 - Internal search took 1.668648ms
00:28:31,625 INFO SearchController:86 - Page rendered in 15.57237ms

Why does this happen? Does it mean that Lucene has an internal cache engine, just like the MySQL query result cache or the Oracle query execution plan cache?

Thanks
Dinh
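Not a cache inside Lucene, then, but the effect can be exploited deliberately: run a few representative queries once at startup so the OS page cache (and Lucene's FieldCache, if sorting is used) is warm before real traffic arrives. A sketch assuming the Lucene 2.9/3.0 API and a hypothetical "content" field:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.util.Version;

public class SearcherWarmer {
    /** Fire a handful of typical queries and discard the results; the only
     *  goal is pulling the hot parts of the index into the OS cache. */
    public static void warm(IndexSearcher searcher, Analyzer analyzer,
                            String[] typicalQueries) throws Exception {
        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        for (String q : typicalQueries) {
            searcher.search(parser.parse(q), 10);
        }
    }
}
```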
Re: Lucene Java 3.0.0 RC1 now available for testing
Well, I think some people will be for hiding complexity, while others will be for being in control and having transparency. Think how surprised one would be to find 1 extra field in his index, say when looking at it with Luke. :)

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message -
From: Glen Newton glen.new...@gmail.com
To: java-user@lucene.apache.org
Sent: Tue, November 17, 2009 10:53:01 PM
Subject: Re: Lucene Java 3.0.0 RC1 now available for testing

I understand the reasons, but - if I may ask so late in the game - was this the best way to do this? From a user (developer) perspective, this is an implementation issue. Couldn't this have been done behind the scenes, so that when I asked for Field.Index.ANALYZED + Field.Store.COMPRESS, instead of what previously happened (and was variously problematic), two fields were transparently created, one being binary compressed stored and the other being indexed only? The Field API could hide all of this complexity, using one underlying Field when I use Field.getString() (the compressed stored one), using the other when I use Field.setBoost() (the indexed one), and both when I call Field.setValue(). This might have less impact on developers and be less disruptive on API changes. Oh, some naming convention could handle the underlying Fields. A little complicated, I agree.

Again, apologies to those who worked hard on these changes: my fault for not noticing this sooner (I hadn't started moving my code to 2.9 from 2.4, so I hadn't read the deprecation signs).

thanks,
Glen

2009/11/17 Mark Miller:
> Here is some of the history:
> https://issues.apache.org/jira/browse/LUCENE-652
> https://issues.apache.org/jira/browse/LUCENE-1960

Glen Newton wrote:
> Could someone send me where the rationale for the removal of COMPRESSED
> fields is? I've looked at
> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.html#3.0.0.changes_in_runtime_behavior
> but it is a little light on the 'why' of this change. My fault - of course -
> for not paying attention.
>
> thanks,
> Glen

2009/11/17 Uwe Schindler:
> Hello Lucene users,
>
> On behalf of the Lucene dev community (a growing community far larger than just the committers) I would like to announce the first release candidate for Lucene Java 3.0. Please download and check it out - take it for a spin and kick the tires. If all goes well, we hope to release the final version of Lucene 3.0 in a little over a week.
>
> The new version is mostly a cleanup release without any new features. All deprecations targeted to be removed in version 3.0 were removed. If you are upgrading from version 2.9.1 of Lucene, you have to fix all deprecation warnings in your code base to be able to recompile against this version.
>
> This is the first Lucene release with Java 5 as a minimum requirement. The API was cleaned up to make use of Java 5's generics, varargs, enums, and autoboxing. New users of Lucene are advised to use this version for new developments, because it has a clean, type-safe new API. Upgrading users can now remove unnecessary casts and add generics to their code, too. If you have not upgraded your installation to Java 5, please read the file JRE_VERSION_MIGRATION.txt (please note that this is not related to Lucene 3.0; it will also happen with any previous release when you upgrade your Java environment).
>
> Lucene 3.0 has some changes regarding compressed fields: 2.9 already deprecated compressed fields; support for them was removed now. Lucene 3.0 is still able to read indexes with compressed fields, but as soon as merges occur or the index is optimized, all compressed fields are decompressed and converted to Field.Store.YES. Because of this, indexes with compressed fields can suddenly get larger.
>
> While we generally try and maintain full backwards compatibility between major versions, Lucene 3.0 has some minor breaks, mostly related to deprecation removal, pointed out in the 'Changes in backwards compatibility policy' section of CHANGES.txt. Notable are:
> - IndexReader.open(Directory) now opens in read-only mode per default (this method was deprecated because of that in 2.9). The same occurs to IndexSearcher.
> - Already started in 2.9, core TokenStreams are now made final to enforce the decorator pattern.
> - If you interrupt an IndexWriter merge thread, IndexWriter now throws an unchecked ThreadInterruptedException that extends RuntimeException and clears the interrupt status.
>
> Also, remember that this is a release candidate, and not the final Lucene 3.0 release. You can find the full list of changes here: HTML version:
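For anyone migrating off Field.Store.COMPRESS, a sketch of the replacement pattern, essentially the two-field split Glen describes, done by hand. It assumes the CompressionTools helper that, if I recall correctly, shipped with Lucene 2.9 when compressed fields were deprecated; the field names are hypothetical:

```java
import org.apache.lucene.document.CompressionTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class CompressedStorage {
    public static Document makeDoc(String largeText) throws Exception {
        Document doc = new Document();
        // Stored (compressed) copy - never indexed:
        doc.add(new Field("body_z", CompressionTools.compressString(largeText),
                          Field.Store.YES));
        // Indexed copy - never stored:
        doc.add(new Field("body", largeText, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }

    public static String readBack(Document doc) throws Exception {
        return CompressionTools.decompressString(doc.getBinaryValue("body_z"));
    }
}
```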
Re: OutofMemory in large index
Hello,

Comments inlined.

- Original Message -
From: vsevel v.se...@lombardodier.com
To: java-user@lucene.apache.org
Sent: Fri, November 13, 2009 11:32:02 AM
Subject: Re: OutofMemory in large index

> Hi, I am jumping into the thread because I have a similar issue. My index is
> 30Gb large and contains 21M docs. I was able to stay with 1Gb of RAM on the
> server for a while. Recently I

Is that 1GB heap or 1GB RAM?

> started to simulate parallel searches. Just 2 parallel searches would get
> the server to crash with out of memory errors. I upgraded the server to 3Gb
> of RAM and I was able to happily run 10 parallel full-text searches on my
> documents. My questions:
> - is 3Gb a relatively normal amount of memory for a server doing lucene
> searches?

These days 3GB of RAM is very little even for a laptop. :)

> - when is that going to stop? I am planning to have at least 40M docs in my
> index. Will I need to go from 2.5 to 5Gb of RAM? What about 60M docs? What
> about 20 concurrent searches?

The more you hit the machine, the more resources it needs. The more resource-intensive the queries (e.g. sorting? fuzzy? wildcard?), the more resources they'll need. One instance of Lucene/Solr I looked at today has an index with 5M not very large documents, but high query rates and relatively expensive queries hitting a 20GB index. Each of 10 servers has 8 cores that were only about 30% idle. This is just an example. Each case is different.

> - are there any safety mechanisms that would get a search to abort rather
> than make the server crash with out of memory?

I don't think so. When an app hits OOM, I think it doesn't have much control over its destiny.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

Simon Willnauer wrote:

On Fri, Nov 13, 2009 at 11:17 AM, Ian Lea wrote:
>> I got OutOfMemoryError at org.apache.lucene.search.Searcher.search(Searcher.java:183).
>> My index is 43G bytes. Is that too big for Lucene? Luke can see the index
>> has over 1800M docs, but the search is also out of memory. I use -Xmx1024M
>> to specify 1G java heap space.
>
> 43Gb is not too big for lucene, but it certainly isn't small and that is a
> lot of docs. Just give it more memory.

I would strongly recommend giving it more memory. What version of Lucene do you use? Depending on your setup you could run into a JVM bug if you use a Lucene version < 2.9. Your index is big enough (document-wise) that your norms file grows > 100MB; depending on your Xmx settings this could trigger a false OOM during index open. So if you are using < 2.9, check out this issue: https://issues.apache.org/jira/browse/LUCENE-1566

>> One abnormal thing is that I broke a running optimize of this index. Is
>> that can be a problem?
>
> Possibly ...

In general, this should not be a problem. The optimize will not destroy the index you are optimizing, as segments are write-once.

>> If so, how can I fix an index after the optimize process is broken.
>
> Probably depends on what you mean by broken. Start with running
> org.apache.lucene.index.CheckIndex. That can also fix some things - but see
> the warning in the javadocs.

100% recommended to make sure nothing is wrong! :)

> --
> Ian.

--
View this message in context: http://old.nabble.com/OutofMemory-in-large-index-tp26332397p26339388.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
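On the "safety mechanisms" question: Lucene itself won't abort a search under memory pressure, but you can bound concurrency in front of it so excess load is rejected instead of piling up until the heap is exhausted. A generic sketch in plain Java (this is not a Lucene API):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class SearchThrottle {
    private final Semaphore permits;

    public SearchThrottle(int maxConcurrentSearches) {
        this.permits = new Semaphore(maxConcurrentSearches);
    }

    /** Runs the search if a slot frees up within timeoutMs; otherwise returns
     *  false so the caller can fail fast instead of queueing unbounded work. */
    public boolean tryRun(Runnable search, long timeoutMs) throws InterruptedException {
        if (!permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            search.run();
            return true;
        } finally {
            permits.release();
        }
    }
}
```

Callers that get false back can return an HTTP 503 or a "try again" page, which is usually preferable to an OOM that takes the whole JVM down.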
Re: Prefix Query for autocomplete - TooManyClauses
Hello,

Also keep in mind that prefix queries are not the cheapest. Plug: we've seen people use this successfully: http://www.sematext.com/products/autocomplete/index.html - I believe somebody is trying this out with a set of 1B suggestions. The demo at http://www.sematext.com/demo/ac/index.html searches 6M Wikipedia titles with a *tiny* JVM heap.

Otis

- Original Message -
From: Anjana Sarkar anjana...@gmail.com
To: java-user@lucene.apache.org
Sent: Fri, November 13, 2009 8:50:38 AM
Subject: Prefix Query for autocomplete - TooManyClauses

We are using Lucene for one of our projects here and it has been working very well for the last 2 years. The new requirement is to use it for autocomplete. Here, queries like a* or ab* pose a problem. I have set BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE) to get around the TooManyClausesException. The issue now is that performance is not acceptable: it takes about 3 secs for an a* query to return results. I have 250,000 documents, each document is 5-15 words in the indexed field, and I am using StandardAnalyzer. I have tried using a filter, since in this case I am only interested in documents with a boost higher than a certain number; I had the boost value as a separate indexed field so I could filter on it. I realized that the filtering is only applied after the boolean query is prepared and scored, so there is no performance benefit to that approach. I cannot use a ConstantScoreQuery as I need the top n matches for the query. Any suggestions on how I can get around this issue will be highly appreciated.
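A common alternative to query-time prefix expansion is to move the cost to index time with edge n-grams, so the typed prefix becomes a single term lookup. A sketch assuming the contrib EdgeNGramTokenFilter from Lucene 2.9-era analyzers (the 1-20 gram range is an arbitrary choice):

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter; // contrib/analyzers

public class EdgeNGramAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new LowerCaseFilter(new WhitespaceTokenizer(reader));
        // "restaurant" is indexed as: r, re, res, ... (up to 20 chars)
        return new EdgeNGramTokenFilter(stream, EdgeNGramTokenFilter.Side.FRONT, 1, 20);
    }
}
```

At query time the typed prefix is searched as a plain TermQuery against the n-gram field: no BooleanQuery expansion, no TooManyClauses, and scoring still works, at the cost of a larger index.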
Re: Lucene index write performance optimization
This is what we have in Lucene in Action 2: ~/lia2$ ff \*Thread\*java ./src/lia/admin/CreateThreadedIndexTask.java ./src/lia/admin/ThreadedIndexWriter.java Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: Jamie Band ja...@stimulussoft.com To: java-user@lucene.apache.org Sent: Tue, November 10, 2009 11:43:30 AM Subject: Lucene index write performance optimization Hi There Our app spends alot of time waiting for Lucene to finish writing to the index. I'd like to minimize this. If you have a moment to spare, please let me know if my LuceneIndex class presented below can be improved upon. It is used in the following way: luceneIndex = new LuceneIndex(Config.getConfig().getIndex().getIndexBacklog(), exitReq,volume.getID()+ indexer,volume.getIndexPath(), Config.getConfig().getIndex().getMaxSimultaneousDocs()); Document doc = new Document(); IndexInfo indexInfo = new IndexInfo(doc); luceneIndex.indexDocument(indexInfo); As an aside note, is there any way for Lucene to support simultaneous writes to an index? For example, each write threads could write to a separate shard, after a period the shared could be merged into a single index? Or is this overkill? I am interested hear the opinion of the Lucene experts. 
Thanks in advance

Jamie

    package com.stimulus.archiva.index;

    import java.io.File;
    import java.io.IOException;
    import java.io.PrintStream;
    import java.util.*;
    import java.util.concurrent.*;
    import java.util.concurrent.locks.ReentrantLock;
    import org.apache.commons.logging.*;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.AlreadyClosedException;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.LockObtainFailedException;

    public class LuceneIndex extends Thread {

        protected ArrayBlockingQueue<LuceneDocument> queue;
        protected static final Log logger = LogFactory.getLog(LuceneIndex.class.getName());
        protected static final Log indexLog = LogFactory.getLog("indexlog");
        IndexWriter writer = null;
        protected static ScheduledExecutorService scheduler;
        protected static ScheduledFuture scheduledTask;
        protected LuceneDocument EXIT_REQ = null;
        ReentrantLock indexLock = new ReentrantLock();
        ArchivaAnalyzer analyzer = new ArchivaAnalyzer();
        File indexLogFile;
        PrintStream indexLogOut;
        IndexProcessor indexProcessor;
        String friendlyName;
        String indexPath;
        int maxSimultaneousDocs;

        public LuceneIndex(int queueSize, LuceneDocument exitReq, String friendlyName,
                           String indexPath, int maxSimultaneousDocs) {
            this.queue = new ArrayBlockingQueue<LuceneDocument>(queueSize);
            this.EXIT_REQ = exitReq;
            this.friendlyName = friendlyName;
            this.indexPath = indexPath;
            this.maxSimultaneousDocs = maxSimultaneousDocs;
            setLog(friendlyName);
        }

        public int getMaxSimultaneousDocs() {
            return maxSimultaneousDocs;
        }

        public void setMaxSimultaneousDocs(int maxSimultaneousDocs) {
            this.maxSimultaneousDocs = maxSimultaneousDocs;
        }

        public ReentrantLock getIndexLock() {
            return indexLock;
        }

        protected void setLog(String logName) {
            try {
                indexLogFile = getIndexLogFile(logName);
                if (indexLogFile != null) {
                    if (indexLogFile.length() > 10485760)
                        indexLogFile.delete();
                    indexLogOut = new PrintStream(indexLogFile);
                }
                logger.debug("set index log file path {path='" + indexLogFile.getCanonicalPath() + "'}");
            } catch (Exception e) {
                logger.error("failed to open index log file:" + e.getMessage(), e);
            }
        }

        protected File getIndexLogFile(String logName) {
            try {
                String logfilepath = Config.getFileSystem().getLogPath() + File.separator + logName + "index.log";
                return new File(logfilepath);
            } catch (Exception e) {
                logger.error("failed to open index log file:" + e.getMessage(), e);
                return null;
            }
        }

        protected void openIndex() throws MessageSearchException {
            Exception lastError = null;
            if (writer == null) {
                logger.debug("openIndex() index " + friendlyName + // (message truncated in the archive)
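As for the simultaneous-writes question: IndexWriter itself is thread-safe and accepts addDocument calls from multiple threads, so separate shards are not strictly required for concurrency. The queue-and-worker shape of the LuceneIndex class above - a bounded ArrayBlockingQueue drained in batches, with an EXIT_REQ "poison pill" to shut down - can be sketched in plain Java without the Lucene and Archiva dependencies. This is a minimal illustration, not the poster's actual code; the names Doc and BatchingConsumer are made up for the sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

// Minimal stand-in for a document waiting to be indexed.
class Doc {
    final String id;
    Doc(String id) { this.id = id; }
}

public class BatchingConsumer extends Thread {
    static final Doc EXIT_REQ = new Doc("EXIT");   // poison pill

    final ArrayBlockingQueue<Doc> queue;
    final int maxBatch;
    final List<String> indexed = new ArrayList<String>();

    BatchingConsumer(int capacity, int maxBatch) {
        this.queue = new ArrayBlockingQueue<Doc>(capacity);
        this.maxBatch = maxBatch;
    }

    @Override
    public void run() {
        List<Doc> batch = new ArrayList<Doc>(maxBatch);
        try {
            while (true) {
                // Block for the first document, then drain up to maxBatch total.
                batch.add(queue.take());
                queue.drainTo(batch, maxBatch - batch.size());
                boolean exit = false;
                for (Doc d : batch) {
                    if (d == EXIT_REQ) { exit = true; continue; }
                    indexed.add(d.id);  // real code would call writer.addDocument(...)
                }
                batch.clear();
                if (exit) return;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        BatchingConsumer c = new BatchingConsumer(100, 10);
        c.start();
        for (int i = 0; i < 25; i++) c.queue.put(new Doc("doc" + i));
        c.queue.put(EXIT_REQ);
        c.join();
        System.out.println(c.indexed.size());  // 25
    }
}
```

Batching like this amortizes per-call overhead; the join() after the poison pill guarantees the producer sees the consumer's final state.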
Re: Filtering query results based on relevance/accuracy
Alex,

If I understand you correctly, all you have to do is either make sure that the query is run as a phrase query (with quotes around it), or as a term query where both terms are required (with a plus sign in front of each term, no space). As for detecting a score gap and such, you could do that with a custom Collector.

Otis
-- Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message
From: Alex azli...@gmail.com
To: java-user@lucene.apache.org
Sent: Monday, September 21, 2009 6:17:53 PM
Subject: Filtering query results based on relevance/accuracy

Hi,

I'm a total newbie with Lucene and trying to understand how to achieve my (complicated) goals. So what I'm doing is still totally experimental for me, but is probably extremely trivial for the experts on this list :)

I use Lucene and Hibernate Search to index locations by their name, type, etc. The LocationType is an object that has its name field indexed both tokenized and untokenized. The following LocationType names are indexed:

Restaurant
Mexican Restaurant
Chinese Restaurant
Greek Restaurant
etc.

Consider the following query: Mexican Restaurant

I systematically get all the entries as a result, most certainly because the Restaurant keyword is present in all of them. I'm trying to get a finer-grained result set. Obviously, for Mexican Restaurant I want the Mexican Restaurant entry as a result, but NOT Chinese Restaurant or Greek Restaurant, as they are irrelevant. Maybe Restaurant itself should be returned with a lower weight/score, or maybe it shouldn't; I'm not sure about this one.

1) How can I do that?
Here is the code I use for querying:

    String[] typeFields = { "name", "tokenized_name" };
    Map<String, Float> boostPerField = new HashMap<String, Float>(2);
    boostPerField.put("name", (float) 4);
    boostPerField.put("tokenized_name", (float) 2);
    QueryParser parser = new MultiFieldQueryParser(typeFields, new StandardAnalyzer(), boostPerField);
    org.apache.lucene.search.Query luceneQuery;
    try {
        luceneQuery = parser.parse(queryString);
    } catch (ParseException e) {
        throw new RuntimeException("Unable to parse query: " + queryString, e);
    }

I guess that there is a way to filter out results that have a score below a given threshold, or a way to filter out results based on a score gap, or anything similar. But I have no idea how to do this... What is the best way to achieve what I want?

Thank you for your help!

Cheers,
Alex

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
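Lucene has no built-in "drop everything below the biggest score gap" filter; the usual approach is to collect the top hits and post-process their scores. The cutoff step Otis alludes to can be sketched independently of Lucene - in a real application the scores would come from TopDocs.scoreDocs, already sorted descending. The class and method names here are illustrative, not a Lucene API:

```java
public class ScoreGapFilter {
    /**
     * Given hit scores sorted in descending order, return how many leading
     * hits to keep: cut at the largest drop between adjacent scores.
     */
    public static int cutoffAtLargestGap(float[] scores) {
        if (scores.length < 2) return scores.length;
        int cut = scores.length;       // default: keep everything
        float biggestGap = 0f;
        for (int i = 1; i < scores.length; i++) {
            float gap = scores[i - 1] - scores[i];
            if (gap > biggestGap) {
                biggestGap = gap;
                cut = i;               // keep hits [0, i)
            }
        }
        return cut;
    }

    public static void main(String[] args) {
        // e.g. the two "Mexican Restaurant" matches scoring well, the rest trailing
        float[] scores = {2.4f, 2.1f, 0.6f, 0.5f, 0.4f};
        System.out.println(cutoffAtLargestGap(scores)); // 2
    }
}
```

A fixed absolute threshold is fragile because Lucene scores are not normalized across queries; a relative cutoff like this adapts to each result set.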
Re: Language Detection for Analysis?
Bradford,

If I may: Have a look at http://www.sematext.com/products/language-identifier/index.html and/or http://www.sematext.com/products/multilingual-indexer/index.html

Otis
-- Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message
From: Bradford Stephens bradfordsteph...@gmail.com
To: solr-u...@lucene.apache.org; java-user@lucene.apache.org
Sent: Thursday, August 6, 2009 3:46:21 PM
Subject: Language Detection for Analysis?

Hey there,

We're trying to add foreign language support to our new search engine -- languages like Arabic, Farsi, and Urdu that don't work with the standard analyzers. But our data source doesn't tell us which languages we're actually collecting -- we just get blocks of text. Has anyone here worked on language detection, so we can figure out which analyzers to use? Are there commercial solutions?

Much appreciated!
-- http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
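For a sense of what such detectors do: a very crude first cut is to count overlap with small per-language stopword lists and route each text block to the analyzer for the best-matching language. The tiny word lists below are illustrative only; production detectors typically use character n-gram profiles instead, which also handle languages like Arabic, Farsi, and Urdu that this toy ignores:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NaiveLanguageGuesser {
    // Toy stopword profiles -- a real detector would use character n-gram models.
    static final Map<String, Set<String>> PROFILES = new HashMap<String, Set<String>>();
    static {
        PROFILES.put("en", new HashSet<String>(Arrays.asList("the", "and", "of", "to", "is")));
        PROFILES.put("de", new HashSet<String>(Arrays.asList("der", "und", "die", "ist", "das")));
        PROFILES.put("fr", new HashSet<String>(Arrays.asList("le", "et", "la", "est", "les")));
    }

    /** Return the language whose stopwords appear most often, or "unknown". */
    public static String guess(String text) {
        String best = "unknown";
        int bestHits = 0;
        String[] tokens = text.toLowerCase().split("\\W+");
        for (Map.Entry<String, Set<String>> e : PROFILES.entrySet()) {
            int hits = 0;
            for (String t : tokens) if (e.getValue().contains(t)) hits++;
            if (hits > bestHits) { bestHits = hits; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("der Hund ist das beste Tier"));       // de
        System.out.println(guess("the quick brown fox and the dog"));   // en
    }
}
```

The guessed code would then select the per-language Analyzer at index time.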
Re: How to improve search time?
With such a large index, be prepared to put it on a server with lots of RAM (even if you follow all the tips from the Wiki). When reporting performance numbers, you really ought to tell us about your hardware, types of queries, etc.

Otis
-- Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message
From: prashant ullegaddi prashullega...@gmail.com
To: java-user@lucene.apache.org
Sent: Monday, August 3, 2009 12:33:46 AM
Subject: How to improve search time?

Hi,

I have a single index of size 87 GB containing around 50M documents. When I search for any query, the best search time I observed was 8 sec. And when the query is expanded with synonyms, search takes minutes (~2-3 min). Is there a better way to search so that overall search time is reduced?

Thanks,
Prashant.
Re: Lucene for dynamic data retrieval
Hi Satish,

Lucene doesn't enforce an index schema, so each document can have a different set of fields. It sounds like you need to write a custom indexer that follows your custom rules and creates Lucene Documents with different Fields, depending on what you want indexed. You also mention searching and retrieval of data from a DB. This, too, sounds like a custom search application - there is nothing in Lucene that uses an (R)DBMS to retrieve field values.

Otis
-- Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message
From: Findsatish findsat...@gmail.com
To: java-user@lucene.apache.org
Sent: Friday, July 31, 2009 7:13:47 AM
Subject: Lucene for dynamic data retrieval

Hi All,

I am new to Lucene and I am working on a search application. My application needs dynamic data retrieval from the database. That means, based on my previous step's output, I need to retrieve entries from the DB for the next step. For example, if my search query contains a Name field entry, I need to retrieve the Designations from the DB that match the identified Name in the query. If there is no Name identified in the query, then I need to retrieve ALL the Designations from the DB. In the next step, if a Designation is also identified in the query, then I need to retrieve the Departments from the DB that match this Designation. If there is no Designation identified, then I need to retrieve ALL the Departments from the DB. Like this, there are around 6-7 steps, all dependent on the previous step's output. In this scenario, I would like to know whether I can use Lucene for creating the index. If so, how can I use it? Any help is highly appreciated.

Thanks,
Satish

-- View this message in context: http://www.nabble.com/Lucene-for-dynamic-data-retrieval-tp24754777p24754777.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: most frequent term in the index
Hello,

Here is a class you can use for that:

./contrib/miscellaneous/src/java/org/apache/lucene/misc/HighFreqTerms.java

Otis
-- Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message
From: starz10de farag_ah...@yahoo.com
To: java-user@lucene.apache.org
Sent: Friday, July 24, 2009 4:54:47 PM
Subject: most frequent term in the index

How do I get the most frequent terms in the index, in descending order?

Thanks

-- View this message in context: http://www.nabble.com/most-frquent-term-in-the-index-tp24651807p24651807.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
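HighFreqTerms walks the index's terms and keeps only the most frequent ones in a bounded priority queue. The selection step can be sketched with plain collections - here a simple term-to-frequency map stands in for the term enumeration, and the class name TopTerms is made up for the sketch:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTerms {
    /** Return the top-n terms by frequency, most frequent first. */
    public static List<String> top(Map<String, Integer> termFreqs, int n) {
        // Min-heap bounded at n: the root is the least frequent of the current top-n.
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<Map.Entry<String, Integer>>(n,
                new Comparator<Map.Entry<String, Integer>>() {
                    public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                        return a.getValue() - b.getValue();
                    }
                });
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll();   // evict the least frequent
        }
        List<String> result = new ArrayList<String>();
        while (!heap.isEmpty()) result.add(0, heap.poll().getKey()); // reverse into descending order
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        freqs.put("lucene", 42); freqs.put("index", 17);
        freqs.put("query", 99);  freqs.put("term", 5);
        System.out.println(top(freqs, 2)); // [query, lucene]
    }
}
```

The bounded heap keeps memory at O(n) no matter how many distinct terms the index holds, which is why HighFreqTerms takes the same approach rather than sorting all terms.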
Re: Cosine similarity
Yes, have a look at this:

http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Similarity.html

Otis
-- Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message
From: starz10de farag_ah...@yahoo.com
To: java-user@lucene.apache.org
Sent: Friday, July 24, 2009 4:50:22 PM
Subject: Cosine similarity

Does Lucene use a cosine similarity measure to measure the similarity between the query and the indexed documents?

Thanks

-- View this message in context: http://www.nabble.com/Cosine-similarity-tp24651759p24651759.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
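For reference, the textbook cosine similarity over sparse term-weight vectors can be sketched as below. Lucene's default scoring (see the Similarity javadoc linked above) is a related TF-IDF vector-space formulation, but not pure cosine - it adds length norms, coord factors, and boosts:

```java
import java.util.HashMap;
import java.util.Map;

public class Cosine {
    /** Cosine similarity between two sparse term-weight vectors. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;   // only shared terms contribute
        }
        for (double v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<String, Double>();
        query.put("cosine", 1.0); query.put("similarity", 1.0);
        Map<String, Double> doc = new HashMap<String, Double>();
        doc.put("cosine", 2.0); doc.put("similarity", 1.0); doc.put("lucene", 1.0);
        System.out.printf("%.3f%n", cosine(query, doc)); // 0.866
    }
}
```

Because both vectors are length-normalized, the result is in [0, 1] for non-negative weights, unlike raw Lucene scores, which are not comparable across queries.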
Re: Loading an index into memory
I haven't verified this myself, but I remember talking to somebody who tried MMapDirectory and compared it to simply using tmpfs (RAM FS). The result was that MMapDirectory had some memory overhead, so putting the index on tmpfs was more memory-efficient. I guess this person had read-only indices, so tmpfs was an option.

Otis
-- Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message
From: Uwe Schindler uschind...@pangaea.de
To: java-user@lucene.apache.org
Sent: Thursday, July 23, 2009 9:47:24 AM
Subject: RE: Loading an index into memory

The size is in bytes and the RAMDirectory stores the bytes as bytes, so the size is equal. I would suggest not copying the directory into a RAMDirectory. It is better to use MMapDirectory in this case, as it maps the files into the address space, much like a normal OS swap file. The OS kernel will automatically page the needed parts into physical RAM. This way the Java heap is not wasted, and only the needed parts are paged into RAM.

-----
UWE SCHINDLER
Webserver/Middleware Development
PANGAEA - Publishing Network for Geoscientific and Environmental Data
MARUM - University of Bremen
Room 2500, Leobener Str., D-28359 Bremen
Tel.: +49 421 218 65595
Fax: +49 421 218 65505
http://www.pangaea.de/
E-mail: uschind...@pangaea.de

-----Original Message-----
From: Dragon Fly [mailto:dragon-fly...@hotmail.com]
Sent: Thursday, July 23, 2009 3:38 PM
To: java-user@lucene.apache.org
Subject: Loading an index into memory

Hi,

I have a question regarding RAMDirectory. I have a 5 GB index on disk and it is opened like the following:

    searcher = new IndexSearcher(new RAMDirectory(indexDirectory));

Approximately how much memory is needed to load the index? 5 GB of memory, or 10 GB because of Unicode? Does the entire index get loaded into memory, or only parts of it?

Thank you.