date:20090723

RE: indexing 100GB of data

2009-07-23 Thread m.harig

Thanks, all; I'm very grateful. I am tired of Hadoop settings. Is it reasonable to read an index this large with Lucene alone? Will it run out of memory (OOM)? Can anyone please advise? -- View this message in context: http://www.nabble.com/indexing-100GB-of-data-tp24600563p24620846.html Sent

Re: indexing 100GB of data

2009-07-23 Thread Shai Erera

Generally you shouldn't hit OOM. But it may change depending on how you use the index. For example, if you have millions of documents spread across the 100 GB, and you use sorting for various fields, then it will consume lots of RAM. Also, if you run hundreds of queries in parallel, each with a doz

Re: Alternative way to simulate sorting without doing actual sort

2009-07-23 Thread Ian Lea

Another idea - instead of storing MMDDhhmm, as longs, store the value as number of minutes since some start time, as integers. If my sums are correct it should cope with several thousand years, and sorting on integers should use less memory than sorting on longs. -- Ian. On Thu, Jul 23, 20
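
The arithmetic behind "several thousand years" can be checked with a minimal sketch. The class name MinuteStamp, its method names, and the choice of the Unix epoch as the start time are assumptions for illustration, not anything from the thread; a year is approximated as 365 days.

```java
final class MinuteStamp {
    private static final long MILLIS_PER_MINUTE = 60L * 1000L;

    /**
     * Whole minutes elapsed between startMillis and timeMillis, as an int.
     * Math.toIntExact throws ArithmeticException if the value does not fit.
     */
    static int toMinutes(long timeMillis, long startMillis) {
        long minutes = (timeMillis - startMillis) / MILLIS_PER_MINUTE;
        return Math.toIntExact(minutes);
    }

    /** Approximate number of 365-day years a non-negative int of minutes can cover. */
    static long maxYears() {
        long minutesPerYear = 60L * 24L * 365L; // 525,600
        return Integer.MAX_VALUE / minutesPerYear;
    }

    public static void main(String[] args) {
        long start = 0L; // assumption: the Unix epoch is used as the start time
        int nowInMinutes = toMinutes(System.currentTimeMillis(), start);
        System.out.println("minutes since start: " + nowInMinutes);
        System.out.println("years representable: " + maxYears());
    }
}
```

With these assumptions, maxYears() comes out at 4085, which is consistent with the estimate above. Each value occupies 4 bytes as an int rather than 8 as a long.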

[ANN] SIREn 0.1 Release

2009-07-23 Thread Renaud Delbru

On behalf of the Data Intensive Infrastructure unit (DERI) [1], I'm pleased to announce the first public version of SIREn (Semantic Information Retrieval Engine). SIREn, the Information Retrieval system at the core of the Semantic Web Index Sindice, is now available for download and includes th

RE: Alternative way to simulate sorting without doing actual sort

2009-07-23 Thread Uwe Schindler

I would propose not sorting the date/time by its string value; instead I would represent the date/time as an integer value (e.g. the long returned by Date.getTime()). If you do not need precision to the millisecond, you could divide it by some value, e.g. Date.getTime()/(1000L*60L) to have it

Re: Doc IDs via IndexReader?

2009-07-23 Thread Michael McCandless

I think you could also delete by Query (using IndexWriter), concocting a single large query that's something like MatchAllDocsQuery AND NOT (Q1 OR Q2 OR Q3...) where Q1, Q2, Q3 are the queries that identify the docs you want to keep. Mike On Wed, Jul 22, 2009 at 10:58 PM, Anuj Bhatt wrote: > Hi,

arabic analyzer

2009-07-23 Thread walid

http://issues.apache.org/jira/browse/LUCENE-1406 http://issues.apache.org/jira/browse/LUCENE-153 based on this, there are two options: 1- using the aramorph library 2- moving the code from trunk to the current release and using the provided arabic analyzer 1- the library works very well in indexi

Re: Batch searching

2009-07-23 Thread Matthew Hall

This was at least one of the threads that was bouncing around... I'm fairly sure there were others as well. Hopefully it's worth the read to you ^^ http://www.opensubscriber.com/message/java-...@lucene.apache.org/11079539.html Phil Whelan wrote: On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall w

Re: arabic analyzer

2009-07-23 Thread Robert Muir

walid, can you provide any more information other than "very poor result"? Others have not measured much difference between morphological analysis and light stemming: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf On Thu, Jul 23, 2009 at 7:34 AM, walid wrote: > http://issues.apache.org/jira/browse

Loading an index into memory

2009-07-23 Thread Dragon Fly

Hi, I have a question regarding RAMDirectory. I have a 5 GB index on disk and it is opened like the following: searcher = new IndexSearcher (new RAMDirectory (indexDirectory)); Approximately how much memory is needed to load the index? 5GB of memory or 10GB because of Unicode? Does the ent

RE: Loading an index into memory

2009-07-23 Thread Uwe Schindler

The size is in bytes and the RAMDirectory stores the bytes as bytes, so the size is equal. I would suggest not copying the dir into a RAMDirectory. It is better to use MMapDirectory in this case, as it "swaps" the files into address space like a normal OS swap file. The OS kernel will automatically swa
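
Since the in-memory size equals the on-disk size, a rough first check is to compare the total size of the index files with the JVM's maximum heap. The following is only a sketch using the standard library: IndexSizeCheck and its methods are hypothetical names, it assumes the index files sit directly in one directory, and it ignores the additional heap that searching itself needs.

```java
import java.io.File;

final class IndexSizeCheck {
    /** Sum of the sizes, in bytes, of the regular files directly inside dir. */
    static long directorySizeBytes(File dir) {
        File[] files = dir.listFiles();
        if (files == null) {
            return 0L; // not a directory, or it could not be read
        }
        long total = 0L;
        for (File f : files) {
            if (f.isFile()) {
                total += f.length();
            }
        }
        return total;
    }

    /** True if the index bytes alone are smaller than the given heap limit. */
    static boolean fitsInHeap(long indexBytes, long maxHeapBytes) {
        return indexBytes < maxHeapBytes;
    }

    public static void main(String[] args) {
        // Assumption: the index directory path is passed as the first argument.
        File indexDir = new File(args.length > 0 ? args[0] : ".");
        long indexBytes = directorySizeBytes(indexDir);
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("index bytes: " + indexBytes);
        System.out.println("max heap bytes: " + maxHeap);
        System.out.println("could fit in heap: " + fitsInHeap(indexBytes, maxHeap));
    }
}
```

A "true" result here is a necessary condition only, not a guarantee that loading the whole index into the heap will succeed.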

Re: Loading an index into memory

2009-07-23 Thread Erick Erickson

What are you trying to accomplish? I'd ensure that my performance was a problem before doing anything. If you're thinking "it's in RAM so it has to be faster" you might be surprised. So gather evidence that you have a problem before you jump to providing a solution. Erick On Thu, Jul 23, 2009

RE: Loading an index into memory

2009-07-23 Thread Dragon Fly

Thank you both. > Date: Thu, 23 Jul 2009 11:55:58 -0400 > Subject: Re: Loading an index into memory > From: erickerick...@gmail.com > To: java-user@lucene.apache.org > > What are you trying to accomplish? I'd ensure that my performance was a > problem before doing anything. If you're thinking "it

Re: PageRanking with Lucene

2009-07-23 Thread Grant Ingersoll

On Jul 22, 2009, at 6:30 AM, prashant ullegaddi wrote: Is it that boost of a Document is stored in 6-bits? Kind of, the boost is stored in the norm, which also includes other factors like length normalization. There is one byte for all of those factors, whereas w/ the function approach,

Re: Loading an index into memory

2009-07-23 Thread Otis Gospodnetic

I haven't verified this myself, but I remember talking to somebody who tried MMapDirectory and compared it to simply using tmpfs (RAM FS). The result was that MMapDirectory had some memory overhead, so putting the index on tmpfs was more memory-efficient. I guess this person had read-only indi

Re: Loading an index into memory

2009-07-23 Thread eks dev

I do not know much about RAM FS, but I know for sure if you have enough memory for RAMDirectory, you should go for it. That gives you the fastest and the most stable performance, no OS swaps, no sudden performance drops... Uwe's tip is very good, if you/OS occasionally need RAM for other things

Combining hits

2009-07-23 Thread Max Lynch

Hi, I am doing a search on my index for a query like this: query = "\"Term 1\" \"Term 2\" \"Term 3\"" Where I want to find Term 1, Term 2 and Term 3 in the index. However, I only want to search for "Term 3" if I find "Term 1" and "Term 2" first, to avoid doing processing on hits that only contai

Re: Combining hits

2009-07-23 Thread Erick Erickson

What do you mean by "first"? Would you want to process a doc that did NOT have a "Term 3"? Let's say you have the following: doc1: "Term 1" doc2: "Term 2" doc3: "Term 1" "Term 2" doc4: "Term 3" doc5: "Term 1" "Term 2" "Term 3" doc6: "Term 2" "Term 3" Which docs do you want to get from your search?

Re: Combining hits

2009-07-23 Thread Matthew Hall

Erm.. I have to be missing something here, wouldn't you be able to just do the following: do a search on "Term 1" AND "Term 2" do a search on "Term 1" AND "Term 2" AND "Term 3" This would ensure that you have two objects back, one of which is guaranteed to be a subset of the other. Then, when yo
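
The subset relationship between the two result sets can be illustrated without Lucene, using plain sets and the six sample documents from Erick's message. This is a sketch of the set logic only: CombineHits and its methods are hypothetical names, and it takes the second search to require all three terms.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class CombineHits {
    /** Ids of the documents that contain every one of the given terms. */
    static Set<String> docsWithAll(Map<String, Set<String>> docs, String... terms) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : docs.entrySet()) {
            if (e.getValue().containsAll(Arrays.asList(terms))) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    /** The six sample documents listed earlier in the thread. */
    static Map<String, Set<String>> sampleDocs() {
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("doc1", new TreeSet<>(Arrays.asList("Term 1")));
        docs.put("doc2", new TreeSet<>(Arrays.asList("Term 2")));
        docs.put("doc3", new TreeSet<>(Arrays.asList("Term 1", "Term 2")));
        docs.put("doc4", new TreeSet<>(Arrays.asList("Term 3")));
        docs.put("doc5", new TreeSet<>(Arrays.asList("Term 1", "Term 2", "Term 3")));
        docs.put("doc6", new TreeSet<>(Arrays.asList("Term 2", "Term 3")));
        return docs;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> docs = sampleDocs();
        Set<String> firstTwo = docsWithAll(docs, "Term 1", "Term 2");
        Set<String> allThree = docsWithAll(docs, "Term 1", "Term 2", "Term 3");

        // Documents matching the first search but not the second.
        Set<String> onlyFirstTwo = new TreeSet<>(firstTwo);
        onlyFirstTwo.removeAll(allThree);

        System.out.println("Term 1 AND Term 2: " + firstTwo);
        System.out.println("all three terms:   " + allThree);
        System.out.println("first two only:    " + onlyFirstTwo);
    }
}
```

On the sample data the first search yields doc3 and doc5, the second yields doc5, and the difference is doc3, so the second result is indeed contained in the first.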

Re: Combining hits

2009-07-23 Thread Max Lynch

> What do you mean by "first"? Would you want to process a doc that did NOT > have a "Term 3"? > > Let's say you have the following: > doc1: "Term 1" > doc2: "Term 2" > doc3: "Term 1" "Term 2" > doc4: "Term 3" > doc5: "Term 1" "Term 2" "Term 3" > doc6: "Term 2" "Term 3" > > Which docs do you want to

Re: Combining hits

2009-07-23 Thread Max Lynch

> do a search on "Term 1" AND "Term 2" > do a search on "Term 1" AND "Term 2" AND "Term 3" > > This would ensure that you have two objects back, one of which is > guaranteed to be a subset of the other. I did start doing this after sending the email. My only concern is search speed. Right now I

Re: Combining hits

2009-07-23 Thread Matthew Hall

Looking at what you wrote: I am doing a weighting system where I rank documents that have Term 1 AND Term 2 AND Term 3 more highly than documents that have just Term 1 AND Term 2, and more highly than documents that just have Term 1 OR Term 2 but not both. Couldn't you maybe get the same effect

Re: Doc IDs via IndexReader?

2009-07-23 Thread Anuj Bhatt

Hi, Thanks Shai and Mike for your suggestions. I went with Shai's second approach. However, I'm confronted with this now: After deleting that document from the index, I also delete it from a copy of the directory that contained the original documents. With this, I expected that both the directory

Re: Combining hits

2009-07-23 Thread Max Lynch

> Couldn't you maybe get the same effect using some clever term boosting? > > I.. think something like > > "Term 1" OR "Term 2" OR "Term 3" ^ .25 > > would return in almost the exact order that you are asking for here, with > the only real difference being that you would have some matches for only

A question about the relevancy

2009-07-23 Thread Naranjo, Pedro

Hi there, I have a question. We have two queries whose only difference is that Query_1 includes phrase queries, whereas Query_2 has the phrase query converted into a Boolean query. When each query is executed, Query_1 gives a relevancy of 1.0 and Query_2 gives one of 0.34. The questio

Re: A question about the relevancy

2009-07-23 Thread Otis Gospodnetic

Hi Pedro, Lucene's Explanation will show you all the juicy details: http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Scorer.html#explain(int) But with a query like that, I'm not sure if you'll be able to follow everything. Maybe pick a super simple pair of queries instead,

Re: A question about the relevancy

2009-07-23 Thread Erick Erickson

Also, see http://wiki.apache.org/lucene-java/ScoresAsPercentages. The relevancy here is that comparing scores across different queries is fairly meaningless, even if you *do* know how that score was arrived at... Best Erick On Thu, Jul 23, 2009 at 6:17 PM, Otis Gospodnetic < otis_gospodne...@yaho

RE: A question about the relevancy

2009-07-23 Thread Naranjo, Pedro

Folks, Thank you so much for your reply. We will share this with management. -Pedro -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thu 7/23/2009 4:41 PM To: java-user@lucene.apache.org Subject: Re: A question about the relevancy Also, see http://wiki.a

Re: Exclusion search

2009-07-23 Thread ba3

Thanks to all for the replies. I thought of a mechanism to achieve the results without reindexing or updating the documents. search1 = boolean query of (vol krish + vol Raj) search2 = boolean query(vol - (vol krish and vol Raj)) Removing the results of search2 from search1 gave the desired resu

29 matches
