RE: indexing 100GB of data

2009-07-23 Thread m.harig
Thanks, all. I am very thankful to everyone. I am tired of Hadoop settings. Is it reasonable to read such a large index with Lucene alone? Will it run out of memory (OOM)? Could anyone please advise?

Re: indexing 100GB of data

2009-07-23 Thread Shai Erera
Generally you shouldn't hit OOM. But it may change depending on how you use the index. For example, if you have millions of documents spread across the 100 GB, and you use sorting for various fields, then it will consume lots of RAM. Also, if you run hundreds of queries in parallel, each with a doz

Re: Alternative way to simulate sorting without doing actual sort

2009-07-23 Thread Ian Lea
Another idea - instead of storing MMDDhhmm, as longs, store the value as number of minutes since some start time, as integers. If my sums are correct it should cope with several thousand years, and sorting on integers should use less memory than sorting on longs. -- Ian. On Thu, Jul 23, 20
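A rough check of the "several thousand years" estimate, as a minimal sketch. The class name MinuteStamp, the start time of 2000-01-01, and the use of java.time (Java 8 or later) are assumptions made for illustration; none of them come from the thread or from Lucene.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Lucene: encodes a timestamp as
// whole minutes since a fixed start time so it fits in an int.
final class MinuteStamp {
    // Assumed start time; any fixed instant would do.
    static final Instant START = Instant.parse("2000-01-01T00:00:00Z");

    // Minutes elapsed between START and t, as an int.
    static int toMinutes(Instant t) {
        long minutes = Duration.between(START, t).toMinutes();
        // Throws ArithmeticException instead of silently overflowing.
        return Math.toIntExact(minutes);
    }

    // Number of 365.25-day years a non-negative int of minutes can span.
    static double yearsCovered() {
        return Integer.MAX_VALUE / (60.0 * 24.0 * 365.25);
    }
}
```

With these assumptions, yearsCovered() comes out at a little over 4,000 years, which agrees with the estimate above.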

[ANN] SIREn 0.1 Release

2009-07-23 Thread Renaud Delbru
On behalf of the Data Intensive Infrastructure unit (DERI) [1], I'm pleased to announce the first public version of SIREn (Semantic Information Retrieval Engine). SIREn, the Information Retrieval system at the core of the Semantic Web Index Sindice, is now available for download and includes th

RE: Alternative way to simulate sorting without doing actual sort

2009-07-23 Thread Uwe Schindler
I would propose not sorting the date/time by its string value; instead, I would try to represent the date/time as an integer value (e.g. the long returned by Date.getTime()). If you do not need precision to the millisecond, you could divide it by some value, e.g. Date.getTime()/(1000L*60L) to have it
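The division mentioned above can be sketched as follows. The class and method names are hypothetical; only Date.getTime() and the divisor 1000L*60L come from the message.

```java
import java.util.Date;

// Hypothetical helper: reduces a Date to minute precision so that it can be
// indexed and sorted as a number rather than as a string.
final class EpochMinutes {
    static final long MILLIS_PER_MINUTE = 1000L * 60L;

    // Whole minutes since 1970-01-01T00:00:00Z.
    // Note: plain division truncates toward zero, so dates before 1970 round
    // up rather than down; Math.floorDiv would avoid that if it matters.
    static long toEpochMinutes(Date d) {
        return d.getTime() / MILLIS_PER_MINUTE;
    }
}
```

Two timestamps within the same minute map to the same value, and later timestamps never map to a smaller value, so numeric sort order matches chronological order at minute granularity.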

Re: Doc IDs via IndexReader?

2009-07-23 Thread Michael McCandless
I think you could also delete by Query (using IndexWriter), concocting a single large query that's something like MatchAllDocsQuery AND NOT (Q1 OR Q2 OR Q3...) where Q1, Q2, Q3 are the queries that identify the docs you want to keep. Mike On Wed, Jul 22, 2009 at 10:58 PM, Anuj Bhatt wrote: > Hi,

arabic analyzer

2009-07-23 Thread walid
http://issues.apache.org/jira/browse/LUCENE-1406 http://issues.apache.org/jira/browse/LUCENE-153 based on this, there are two options: 1- using the aramorph library 2- moving the code from trunk to the current release and using the provided arabic analyzer 1- the library works very well in indexi

Re: Batch searching

2009-07-23 Thread Matthew Hall
This was at least one of the threads that was bouncing around... I'm fairly sure there were others as well. Hopefully it's worth the read to you ^^ http://www.opensubscriber.com/message/java-...@lucene.apache.org/11079539.html Phil Whelan wrote: On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall w

Re: arabic analyzer

2009-07-23 Thread Robert Muir
walid, can you provide any more information other than "very poor result"? Others have not measured much difference between morphological analysis and light stemming: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf On Thu, Jul 23, 2009 at 7:34 AM, walid wrote: > http://issues.apache.org/jira/browse

Loading an index into memory

2009-07-23 Thread Dragon Fly
Hi, I have a question regarding RAMDirectory. I have a 5 GB index on disk and it is opened like the following: searcher = new IndexSearcher (new RAMDirectory (indexDirectory)); Approximately how much memory is needed to load the index? 5GB of memory or 10GB because of Unicode? Does the ent

RE: Loading an index into memory

2009-07-23 Thread Uwe Schindler
The size is in bytes, and RAMDirectory stores the files byte for byte, so the in-memory size equals the on-disk size. I would suggest not copying the directory into a RAMDirectory. It is better to use MMapDirectory in this case, as it "swaps" the files into address space like a normal OS swap file. The OS kernel will automatically swa
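For readers unfamiliar with memory mapping, here is a minimal sketch of the JDK facility involved, FileChannel.map. It is not Lucene code: the class MmapSketch, its methods, and the temporary file are all hypothetical, and the sketch only shows that a mapped file is read through the address space while the operating system decides which pages stay in RAM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of memory-mapped reads; not Lucene's implementation.
final class MmapSketch {
    // Maps the whole file read-only and returns the byte at the given offset.
    // The OS pages data in on demand instead of copying the file onto the Java heap.
    static byte readByteAt(Path file, int offset) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(offset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes three bytes to a temporary file and reads the middle one back via the mapping.
    static byte demo() {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[] {7, 42, 9});
            return readByteAt(file, 1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```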

Re: Loading an index into memory

2009-07-23 Thread Erick Erickson
What are you trying to accomplish? I'd ensure that my performance was a problem before doing anything. If you're thinking "it's in RAM so it has to be faster" you might be surprised. So gather evidence that you have a problem before you jump to providing a solution. Erick On Thu, Jul 23, 2009

RE: Loading an index into memory

2009-07-23 Thread Dragon Fly
Thank you both. > Date: Thu, 23 Jul 2009 11:55:58 -0400 > Subject: Re: Loading an index into memory > From: erickerick...@gmail.com > To: java-user@lucene.apache.org > > What are you trying to accomplish? I'd ensure that my performance was a > problem before doing anything. If you're thinking "it

Re: PageRanking with Lucene

2009-07-23 Thread Grant Ingersoll
On Jul 22, 2009, at 6:30 AM, prashant ullegaddi wrote: Is it that boost of a Document is stored in 6-bits? Kind of, the boost is stored in the norm, which also includes other factors like length normalization. There is one byte for all of those factors, whereas w/ the function approach,

Re: Loading an index into memory

2009-07-23 Thread Otis Gospodnetic
I haven't verified this myself, but I remember talking to somebody who tried MMapDirectory and compared it to simply using tmpfs (RAM FS). The result was that MMapDirectory had some memory overhead, so putting the index on tmpfs was more memory-efficient. I guess this person had read-only indi

Re: Loading an index into memory

2009-07-23 Thread eks dev
I do not know much about RAM FS, but I know for sure if you have enough memory for RAMDirectory, you should go for it. That gives you the fastest and the most stable performance, no OS swaps, no sudden performance drops... Uwe's tip is very good, if you/OS occasionally need RAM for other things

Combining hits

2009-07-23 Thread Max Lynch
Hi, I am doing a search on my index for a query like this: query = "\"Term 1\" \"Term 2\" \"Term 3\"" Where I want to find Term 1, Term 2 and Term 3 in the index. However, I only want to search for "Term 3" if I find "Term 1" and "Term 2" first, to avoid doing processing on hits that only contai

Re: Combining hits

2009-07-23 Thread Erick Erickson
What do you mean by "first"? Would you want to process a doc that did NOT have a "Term 3"? Let's say you have the following: doc1: "Term 1" doc2: "Term 2" doc3: "Term 1" "Term 2" doc4: "Term 3" doc5: "Term 1" "Term 2" "Term 3" doc6: "Term 2" "Term 3" Which docs do you want to get from your search?

Re: Combining hits

2009-07-23 Thread Matthew Hall
Erm.. I have to be missing something here, wouldn't you be able to just do the following: do a search on "Term 1" AND "Term 2" do a search on "Term 1" AND "Term 2" AND "Term 3" This would ensure that you have two objects back, one of which is guaranteed to be a subset of the other. Then, when yo

Re: Combining hits

2009-07-23 Thread Max Lynch
> What do you mean by "first"? Would you want to process a doc that did NOT > have a "Term 3"? > > Let's say you have the following: > doc1: "Term 1" > doc2: "Term 2" > doc3: "Term 1" "Term 2" > doc4: "Term 3" > doc5: "Term 1" "Term 2" "Term 3" > doc6: "Term 2" "Term 3" > > Which docs do you want to

Re: Combining hits

2009-07-23 Thread Max Lynch
> do a search on "Term 1" AND "Term 2" > do a search on "Term 1" AND "Term 2" AND "Term 3" > > This would ensure that you have two objects back, one of which is > guaranteed to be a subset of the other. I did start doing this after sending the email. My only concern is search speed. Right now I

Re: Combining hits

2009-07-23 Thread Matthew Hall
Looking at what you wrote: I am doing a weighting system where I rank documents that have Term 1 AND Term 2 AND Term 3 more highly than documents that have just Term 1 AND Term 2, and more highly than documents that just have Term 1 OR Term 2 but not both. Couldn't you maybe get the same effect

Re: Doc IDs via IndexReader?

2009-07-23 Thread Anuj Bhatt
Hi, Thanks Shai and Mike for your suggestions. I went with Shai's second approach. However, I'm confronted with this now: After deleting that document from the index, I also delete it from a copy of the directory that contained the original documents. With this, I expected that both the directory

Re: Combining hits

2009-07-23 Thread Max Lynch
> Couldn't you maybe get the same effect using some clever term boosting? > > I.. think something like > > "Term 1" OR "Term 2" OR "Term 3" ^ .25 > > would return in almost the exact order that you are asking for here, with > the only real difference being that you would have some matches for only

A question about the relevancy

2009-07-23 Thread Naranjo, Pedro
Hi there, I have a question… We have two queries whose only difference is that Query_1 includes phrase queries, whereas Query_2 has the phrase query converted into a Boolean query. When each query is executed, Query_1 gives a relevancy of 1.0 and Query_2 gives one of 0.34. The questio

Re: A question about the relevancy

2009-07-23 Thread Otis Gospodnetic
Hi Pedro, Lucene's Explanation will show you all the juicy details: http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Scorer.html#explain(int) But with a query like that, I'm not sure if you'll be able to follow everything. Maybe pick a super simple pair of queries instead,

Re: A question about the relevancy

2009-07-23 Thread Erick Erickson
Also, see http://wiki.apache.org/lucene-java/ScoresAsPercentages. The relevancy here is that comparing scores across different queries is fairly meaningless, even if you *do* know how that score was arrived at... Best Erick On Thu, Jul 23, 2009 at 6:17 PM, Otis Gospodnetic < otis_gospodne...@yaho

RE: A question about the relevancy

2009-07-23 Thread Naranjo, Pedro
Folks, Thank you so much for your reply. We will share this with management. -Pedro -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thu 7/23/2009 4:41 PM To: java-user@lucene.apache.org Subject: Re: A question about the relevancy Also, see http://wiki.a

Re: Exclusion search

2009-07-23 Thread ba3
Thanks to all for the replies. I thought of a mechanism to achieve the results without reindexing or updating the documents. search1 = boolean query of (vol krish + vol Raj) search2 = boolean query(vol - (vol krish and vol Raj)) Removing the results of search2 from search1 gave the desired resu