Re: Re: Re: Lucene search problem

2008-12-23 Thread tom
AUTOMATIC REPLY
LUX is closed until 5th January 2009



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Re: Lucene search problem

2008-12-23 Thread tom
AUTOMATIC REPLY
LUX is closed until 5th January 2009



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene search problem

2008-12-23 Thread amar . sannaik
Hi Erick,

I agree Lucene does not index the object itself. In the example below, the
quoted fields are indexed as chain.chainName.
I am able to retrieve Recipe objects using a FullTextQuery such as
"chain.chainName:something"; the question is that in some cases the chain itself
is null.
I was able to achieve what I need as below:

class Recipe {
  @DocumentId
  Integer id;
  @IndexedEmbedded
  Chain chain = new Chain();
  // getter and setter
}

class Chain {
  @DocumentId
  Integer id;
  @Field(index = Index.TOKENIZED, name="chainName")
  String name = "NULLNULNUL";
  // getter and setter
}

So this means there will always be a Chain along with the Recipe object, with its
name defaulted to "NULLNULNUL", and that sentinel will be indexed.
We don't want to do that: Recipe is our persistence object and we would rather
not pollute it this way.

-Amar

On Tue, Dec 23, 2008 at 8:05 PM, Erick Erickson wrote:

> How do you intend to index these? Lucene will not
> index objects for you. You have to break the object
> down into a series of fields. At that point you can
> substitute whatever you want.
>
> Best
> Erick
>
> On Tue, Dec 23, 2008 at 3:36 AM,  wrote:
>
> > Hi Aaron Schon/EricK,
> >
> > That really make sense to me but it really seems easy if is the string
> > object. See the object structure I have it below hopefully that gives you
> > some idea
> >
> > class Recipe {
> > @DocumentId
> > Integer id;
> > @IndexedEmbedded
> > Chain chain;
> > //gettter and setter
> > }
> >
> > class Chain {
> > @DocumentId
> > Integer id;
> > @Field(index = Index.TOKENIZED, name="chainName")
> > String name;
> > //getter and setter
> > }
> >
> > I am creating index on the recipe object. and for some recipe.m_chain
> would
> > be null. So can you tell me how do I assign the value "NULLNULNULLNULL"
> for
> > object chain in recipe.
> >
> > I also was thinking if #FieldBridge help me this way. My plan was to have
> > default value where chain is null as you mentioned. but it does not seems
> > to
> > work for null values.
> >
> > Please suggest
> >
> > Thanks in advance.
> > -Amar
> >
> > On Tue, Dec 23, 2008 at 12:04 AM, Aaron Schon 
> > wrote:
> >
> > > I would second Erick's recommendation - create an arbitrary
> > representation
> > > for NULL such as "NULL" (if you are certain the term "NULL" does not
> > occur
> > > in actual docs. Alternatively, use "NULLNULNULLNULL" or something to
> that
> > > effect.
> > >
> > >
> > >
> > > - Original Message 
> > > From: Erick Erickson 
> > > To: java-user@lucene.apache.org
> > > Sent: Monday, December 22, 2008 8:58:21 AM
> > > Subject: Re: Lucene search problem
> > >
> > > Try searching the mailing list archives for a fuller discussion, but
> > > the short answer is usually to index an unique value for your
> > > "null" entries, then search on that, something totally
> > > outrageous like, say AAABBBCCCDDDEEEFFF.
> > >
> > > Alternatively, you could create, at startup time, a
> > > Filter of all the docs that *do* contain terms for the
> > > field in question, flip the bits and use the Filter in your
> > > searches. (Hint: see TermDocs/TermEnum)
> > >
> > > Best
> > > Erick
> > >
> > > On Mon, Dec 22, 2008 at 8:11 AM,  wrote:
> > >
> > > > Hi,
> > > >
> > > > I have problem with lucene search, I am quite new to this. Can some
> > body
> > > > help or just push me to who can please.
> > > >
> > > > Problem what I am facing we need search for object whose attribute
> > > "chain"
> > > > contaning null, but lucene does not help indexing the null values..
> > > >
> > > > how can I achieve this, or please guide me the alternative way of
> doing
> > > > this.
> > > >
> > > > Thanks in advance.
> > > > -Amar
> > > >
> > >
> > >
> > >
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Amar Sannaik | Programmer | ATHARVA LIBSON Software Pvt Ltd.,
> > # 9886476270, amarsann...@atharvalibson.com
> >
>



-- 
Amar Sannaik | Programmer | ATHARVA LIBSON Software Pvt Ltd.,
# 9886476270, amarsann...@atharvalibson.com
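
A minimal sketch of the sentinel-value approach discussed in this thread, written
against the plain Lucene 2.x API rather than the Hibernate Search annotations
above; the helper class, the field name "chain.chainName" and the sentinel token
are illustrative assumptions, not the poster's actual code:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class NullChainSketch {
    // A token that never occurs in real data stands in for a missing chain.
    private static final String NULL_TOKEN = "NULLNULNULLNULL";

    // Build the Lucene document for a recipe; write the sentinel when chain is null.
    static Document toDocument(Integer recipeId, String chainName) {
        Document doc = new Document();
        doc.add(new Field("id", String.valueOf(recipeId),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        String value = (chainName == null) ? NULL_TOKEN : chainName;
        // UN_TOKENIZED keeps the sentinel as a single term the analyzer cannot change.
        doc.add(new Field("chain.chainName", value,
                Field.Store.NO, Field.Index.UN_TOKENIZED));
        return doc;
    }

    // Matches only recipes whose chain was null at index time.
    static TermQuery nullChainQuery() {
        return new TermQuery(new Term("chain.chainName", NULL_TOKEN));
    }
}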


Re: Optimize and Out Of Memory Errors

2008-12-23 Thread Mark Miller

Mark Miller wrote:

Lebiram wrote:
Also, what are norms?
Norms are a byte value per field stored in the index that is factored
into the score. They are used for length normalization (shorter documents =
more important) and index-time boosting. If you want either of those,
you need norms. When norms are loaded up into an IndexReader, they are
loaded into a byte[maxDoc] array for each field - so even if only one
document out of 400 million has a field, it is still going to load
byte[maxDoc] for that field (so a lot of wasted RAM).  Did you say you
had 400 million docs and 7 fields? Google says that would be:

   400 million x 7 bytes = 2,670.29 megabytes

On top of your other RAM usage.

Just to avoid confusion, that should really read a byte per document per
field. If I remember right, it gives 255 boost possibilities, limited to
25 with length normalization.
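
A back-of-the-envelope version of that calculation, as a small sketch assuming
one norm byte is loaded per document per indexed field that keeps norms (the
numbers are the hypothetical figures from this thread):

// Rough estimate of the RAM taken by norms alone, per the figures above.
public class NormsRamEstimate {
    public static void main(String[] args) {
        long maxDoc = 400000000L;  // documents in the index
        int fieldsWithNorms = 7;   // indexed fields that still carry norms
        long bytes = maxDoc * fieldsWithNorms;
        System.out.println(bytes / (1024.0 * 1024.0) + " MB"); // ~2670 MB
    }
}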


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Optimize and Out Of Memory Errors

2008-12-23 Thread Mark Miller

Lebiram wrote:
Also, what are norms?
Norms are a byte value per field stored in the index that is factored
into the score. They are used for length normalization (shorter documents =
more important) and index-time boosting. If you want either of those,
you need norms. When norms are loaded up into an IndexReader, they are loaded
into a byte[maxDoc] array for each field - so even if only one document out
of 400 million has a field, it is still going to load byte[maxDoc] for
that field (so a lot of wasted RAM).  Did you say you had 400 million
docs and 7 fields? Google says that would be:

   400 million x 7 bytes = 2,670.29 megabytes

On top of your other RAM usage.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Optimize and Out Of Memory Errors

2008-12-23 Thread mark harwood
>>how do I turn off norms and where is it set? 

doc.add(new Field("field2", "sender" + i, Field.Store.NO, 
Field.Index.ANALYZED_NO_NORMS));
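
For completeness, a sketch (assuming the Lucene 2.4 API) of the two ways to drop
norms per field; the field names echo the test code quoted below, the rest is
illustrative. Note Mike's point later in this thread: the index must be rebuilt
for the saving to take effect, because existing norms for a field "spread" on merge.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class OmitNormsSketch {
    static Document build(String content, String sender) {
        Document doc = new Document();

        // Option 1: pick a *_NO_NORMS index mode when the field is created.
        doc.add(new Field("content", content,
                Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));

        // Option 2: create the field as usual and switch norms off afterwards.
        Field field2 = new Field("field2", sender,
                Field.Store.NO, Field.Index.ANALYZED);
        field2.setOmitNorms(true);
        doc.add(field2);

        return doc;
    }
}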





- Original Message 
From: Lebiram 
To: java-user@lucene.apache.org
Sent: Tuesday, 23 December, 2008 17:03:07
Subject: Re: Optimize and Out Of Memory Errors

Hi All, 

Thanks for the replies, 

I've just managed to reproduced the error on my test machine.

What we did was, generate about 100,000,000 documents with about 7 fields in 
it, with terms from 1 to 10.

After the index of about 20GB, we did an optimize and it was able to make 1 big 
index of the size 17GB...

Now, when we do a normal search, it just fails. 

Here is the stack trace:

2008-12-23 16:56:05,388 [main] INFO  LuceneTesterMain  - Max Memory:1598226432
2008-12-23 16:56:05,388 [main] INFO  LuceneTesterMain  - Available 
Memory:854133192
2008-12-23 16:56:05,388 [main] ERROR LuceneTesterMain  - Seaching failed.
java.lang.OutOfMemoryError
at java.io.RandomAccessFile.readBytes(Native Method)
at java.io.RandomAccessFile.read(RandomAccessFile.java:315)
at 
org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:550)
at 
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131)
at 
org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:240)
at 
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131)
at 
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:87)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:834)
at 
org.apache.lucene.index.MultiSegmentReader.norms(MultiSegmentReader.java:335)
at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
at org.apache.lucene.search.Searcher.search(Searcher.java:132)


The search code is quite simple in fact. 

2008-12-23 16:56:05,076 [main] INFO  LuceneTesterMain  - 
query.toString()=+content:test

and the code to search.

Filter filter = new RangeFilter("timestamp", DateTools.dateToString(start, 
DateTools.Resolution.SECOND),
 DateTools.dateToString(end, 
DateTools.Resolution.SECOND), true, true);
Searcher = new IndexSearcher(FSDirectory.getDirectory(IndexName));
TopDocs hits = Searcher.search(query, filter, 1000);

I really have no idea why this is breaking.

Also, what are norms and 
how do I turn off norms and where is it set? 

This is code on adding documents:

Document doc = new Document();
doc.add(new Field("id", String.valueOf(i), Field.Store.YES, 
Field.Index.UN_TOKENIZED));

doc.add(new Field("timestamp", DateTools.dateToString(new 
Date(), DateTools.Resolution.SECOND), Field.Store.YES, 
Field.Index.UN_TOKENIZED));
doc.add(new Field("content", LuceneTesterMain.StaticString, 
Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field2", "sender" + i, Field.Store.NO, 
Field.Index.TOKENIZED));
doc.add(new Field("field3", LuceneTesterMain.StaticString, 
Field.Store.NO, Field.Index.TOKENIZED));

doc.add(new Field("field4", "group" + i, Field.Store.NO, 
Field.Index.TOKENIZED));
doc.add(new Field("field5", "groupId" + i, Field.Store.YES, 
Field.Index.UN_TOKENIZED));


writer.addDocument(doc);






From: mark harwood 
To: java-user@lucene.apache.org
Sent: Tuesday, December 23, 2008 2:42:25 PM
Subject: Re: Optimize and Out Of Memory Errors

I've had reports of OOM exceptions during optimize on a couple of large 
deployments recently (based on Lucene 2.4.0)
I've given the usual advice of turning off norms, providing plenty of RAM and 
also suggested setting IndexWriter.setTermIndexInterval().

I don't have access to these deployment environments and have tried hard to 
reproduce the circumstances that lead to this. For the record, I've 
experimented with huge indexes with hundreds of fields, several "unique value" 
fields e.g. primary keys, "fixed-vocab" fields with limited values e.g. 
male/female and fields with "power-curve" distributions e.g. plain text.
I've wound my index up to 22GB with several commit sessions involving 
deletions, full optimises and partial optimises along the way. Still no error.

However, the errors that have been reported to me from 2 different environments 
with large indexes make me think there is still something to be uncovered 
here...




- Original Message 
From: Michael McCandless 
To: java-user@lucene.apache.org
Cc: Utan Bisaya 
Sent: Tuesday, 23 December, 2008 14:08:26
Subject: Re: Optimize and Out Of Memory Errors


How many indexed fields do you have, overall, in the index?

If you have a very large number of fields that are "sparse" (meaning any given
document would only have a small subset of the fields), then norms could explain
what you are seeing.

Re: Optimize and Out Of Memory Errors

2008-12-23 Thread Lebiram
Hi All, 

Thanks for the replies, 

I've just managed to reproduce the error on my test machine.

What we did was generate about 100,000,000 documents with about 7 fields
each, with terms from 1 to 10.

After indexing about 20GB, we ran an optimize and it produced 1 big
index of about 17GB...

Now, when we do a normal search, it just fails.

Here is the stack trace:

2008-12-23 16:56:05,388 [main] INFO  LuceneTesterMain  - Max Memory:1598226432
2008-12-23 16:56:05,388 [main] INFO  LuceneTesterMain  - Available Memory:854133192
2008-12-23 16:56:05,388 [main] ERROR LuceneTesterMain  - Seaching failed.
java.lang.OutOfMemoryError
at java.io.RandomAccessFile.readBytes(Native Method)
at java.io.RandomAccessFile.read(RandomAccessFile.java:315)
at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:550)
at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131)
at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:240)
at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131)
at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:87)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:834)
at org.apache.lucene.index.MultiSegmentReader.norms(MultiSegmentReader.java:335)
at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
at org.apache.lucene.search.Searcher.search(Searcher.java:132)


The search code is quite simple in fact. 

2008-12-23 16:56:05,076 [main] INFO  LuceneTesterMain  - query.toString()=+content:test

and the code to search.

Filter filter = new RangeFilter("timestamp",
        DateTools.dateToString(start, DateTools.Resolution.SECOND),
        DateTools.dateToString(end, DateTools.Resolution.SECOND),
        true, true);
Searcher = new IndexSearcher(FSDirectory.getDirectory(IndexName));
TopDocs hits = Searcher.search(query, filter, 1000);

I really have no idea why this is breaking.

Also, what are norms, and how and where do I turn them off?

This is code on adding documents:

Document doc = new Document();
doc.add(new Field("id", String.valueOf(i), Field.Store.YES,
        Field.Index.UN_TOKENIZED));
doc.add(new Field("timestamp", DateTools.dateToString(new Date(),
        DateTools.Resolution.SECOND), Field.Store.YES,
        Field.Index.UN_TOKENIZED));
doc.add(new Field("content", LuceneTesterMain.StaticString,
        Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field2", "sender" + i, Field.Store.NO,
        Field.Index.TOKENIZED));
doc.add(new Field("field3", LuceneTesterMain.StaticString,
        Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field4", "group" + i, Field.Store.NO,
        Field.Index.TOKENIZED));
doc.add(new Field("field5", "groupId" + i, Field.Store.YES,
        Field.Index.UN_TOKENIZED));

writer.addDocument(doc);






From: mark harwood 
To: java-user@lucene.apache.org
Sent: Tuesday, December 23, 2008 2:42:25 PM
Subject: Re: Optimize and Out Of Memory Errors

I've had reports of OOM exceptions during optimize on a couple of large 
deployments recently (based on Lucene 2.4.0)
I've given the usual advice of turning off norms, providing plenty of RAM and 
also suggested setting IndexWriter.setTermIndexInterval().

I don't have access to these deployment environments and have tried hard to 
reproduce the circumstances that lead to this. For the record, I've 
experimented with huge indexes with hundreds of fields, several "unique value" 
fields e.g. primary keys, "fixed-vocab" fields with limited values e.g. 
male/female and fields with "power-curve" distributions e.g. plain text.
I've wound my index up to 22GB with several commit sessions involving 
deletions, full optimises and partial optimises along the way. Still no error.

However, the errors that have been reported to me from 2 different environments 
with large indexes make me think there is still something to be uncovered 
here...




- Original Message 
From: Michael McCandless 
To: java-user@lucene.apache.org
Cc: Utan Bisaya 
Sent: Tuesday, 23 December, 2008 14:08:26
Subject: Re: Optimize and Out Of Memory Errors


How many indexed fields do you have, overall, in the index?

If you have a very large number of fields that are "sparse" (meaning any given 
document would only have a small subset of the fields), then norms could 
explain what you are seeing.

Norms are not stored sparsely, so when segments get merged the "holes" get
filled (occupy bytes on disk and in RAM) and consume more resources.  Turning
off norms on sparse fields would resolve it, but you must rebuild the entire
index since if even a single doc in the index has norms enabled for a given
field, it "spreads".

Re: lucene explanation

2008-12-23 Thread Chris Salem
That worked perfectly.
Thanks a lot!
Sincerely,
Chris Salem 


- Original Message - 
To: java-user@lucene.apache.org
From: Erick Erickson 
Sent: 12/22/2008 5:00:51 PM
Subject: Re: lucene explanation


Warning! I'm really reaching on this


But it seems you could use TermDocs/TermEnum to
good effect here. Basically, you should be able, for a
given term, to use the above to determine whether
doc N had a hit in one of your fields pretty efficiently.
There's even a WildcardTermEnum that will iterate
over wildcards.

Filters are surprisingly fast to construct, so you could
use the above to construct a filter on each term for
each field. Then determining whether the doc is a hit
for a particular field is just a matter of seeing if
that bit is on in the relevant filter.

Either one should be way under 30 seconds,
although I don't know how big your index is
or how encompassing your wildcard searches
are...

FWIW
Erick
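
A minimal sketch of the filter-per-field idea, assuming the Lucene 2.4-era
TermDocs API; the field names come from Chris's mail quoted below, while the
term, the helper class, and the usage lines are made up. For wildcard queries,
a WildcardTermEnum could drive the same loop once per matching term.

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class FieldHitBits {
    // Bit i is set if document i contains the given term in the given field.
    static BitSet bitsFor(IndexReader reader, String field, String text) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs(new Term(field, text));
        try {
            while (termDocs.next()) {
                bits.set(termDocs.doc());
            }
        } finally {
            termDocs.close();
        }
        return bits;
    }
}

// Usage per hit, instead of three Explanation calls:
//   BitSet resumeBits = FieldHitBits.bitsFor(reader, "resume_text", "java");
//   if (resumeBits.get(hits.id(i))) { /* this hit matched resume_text */ }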

On Mon, Dec 22, 2008 at 4:48 PM, Chris Salem  wrote:

> Hello,
> I'm wondering what the best way to accomplish this is.
> When a user enters text to search on it customarily searches 3 fields,
> resume_text, profile_text, and summary_text, so a standard query would be
> something like:
> (resume_text:(query) OR profile_text:(query) OR summary_text:(query))
> For each hit (up to 50) I'd like to find out which part of the query
> matched with the document. Right now I use the Explanation object, here's
> the code:
> int len = hits.length();
> if(len > 50) len = 50;
> for(int i=0; i<len; i++) {
> Explanation ex = searcher.explain(Query.parse("resume_text:(query)"),
> hits.id(i));
> if(ex.isMatch()) ...
> ex = searcher.explain(Query.parse("profile_text:(query)"), hits.id(i));
> if(ex.isMatch()) ...
> ex = searcher.explain(Query.parse("summary_text:(query)"), hits.id(i));
> if(ex.isMatch()) ...
> }
> This works fine with regular queries, but if someone does a query with a
> wildcard search times increase to more than 30 seconds. Is there a better
> way to do this?
> Thanks
> Sincerely,
> Chris Salem
>


Re: Optimize and Out Of Memory Errors

2008-12-23 Thread mark harwood
I've had reports of OOM exceptions during optimize on a couple of large
deployments recently (based on Lucene 2.4.0).
I've given the usual advice of turning off norms, providing plenty of RAM, and
also suggested setting IndexWriter.setTermIndexInterval().

I don't have access to these deployment environments and have tried hard to
reproduce the circumstances that led to this. For the record, I've
experimented with huge indexes with hundreds of fields: several "unique value"
fields, e.g. primary keys; "fixed-vocab" fields with limited values, e.g.
male/female; and fields with "power-curve" distributions, e.g. plain text.
I've wound my index up to 22GB with several commit sessions involving 
deletions, full optimises and partial optimises along the way. Still no error.

However, the errors that have been reported to me from 2 different environments 
with large indexes make me think there is still something to be uncovered 
here...




- Original Message 
From: Michael McCandless 
To: java-user@lucene.apache.org
Cc: Utan Bisaya 
Sent: Tuesday, 23 December, 2008 14:08:26
Subject: Re: Optimize and Out Of Memory Errors


How many indexed fields do you have, overall, in the index?

If you have a very large number of fields that are "sparse" (meaning any given 
document would only have a small subset of the fields), then norms could 
explain what you are seeing.

Norms are not stored sparsely, so when segments get merged the "holes" get 
filled (occupy bytes on disk and in RAM) and consume more resources.  Turning 
off norms on sparse fields would resolve it, but you must rebuild the entire 
index since if even a single doc in the index has norms enabled for a given 
field, it "spreads".

Mike

Utan Bisaya wrote:

> Recently, our lucene index version was upgraded to 2.3.1 and the index had to 
> be rebuilt for several weeks which made the entire index a total of 20 GB or 
> so.
> 
> After the the rebuild, a weekly sunday task was executed for optimization.
> 
> During that time, the optimization failed several times complaining about OOM 
> errors but then after a couple of tries, it completes.
> 
> So the entire index is now 1 segment that is 20 GB.
> 
> The problem is that any subsequent searches on that index fails with OOM 
> errors at Lucene's reading of bytes.
> 
> Our environment:
> jvm Xmx1600 (This is the max we could set the box since it's windows)
> 8G Memory available on box
> 4G CPU (8 core) but only 12.5% is used. (Not sure if this would impact it)
> Harddisk available is 120GB
> 
> mergeFactor, and other lucene config is set at default.
> 
> We
> checked this 20GB using luke and it has 400,000,000 documents in it. It was 
> able to count the docs however when we do the search in Luke it fails giving 
> us OOM errors.
> 
> We also did the check index and the tool fails at the 20GB segment but 
> succeeds on the others.
> 
> We've managed to rollback to a previously unoptimized index copy (about 20GB 
> or so) and the searches were find now. This unoptimized index is made up of 
> several 8GB segments and a few smaller segments.
> 
> However there is a big possibility that the optimization error could happen 
> again...
> 
> Does anybody have insights on why this is happening?
> 
> 
> 
> 
> 
> 
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: QueryWrapperFilter

2008-12-23 Thread Erick Erickson
My first bit of advice would be to step back and take a deep
breath and "take off your DB hat". Lucene is a *text* search
application, not an RDBMS.

The usual solution is to flatten your data representation when
you index so you can use simpler searches. Others have
posted that it's hard to use Lucene to express relationships
satisfactorily.

Best
Erick
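
To make the flattening concrete, a sketch assuming the plain Lucene API: if
c_id is copied onto each document at index time, the restriction becomes an
ordinary term filter via QueryWrapperFilter. The field name c_id comes from the
post below; the helper methods are illustrative, and with Hibernate Search the
value would first need to be exposed as an indexed field (for example through a
custom bridge), which is an assumption here.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

public class CIdFilterSketch {
    // Index time: copy the foreign key onto the document built for entity A.
    static void addCId(Document doc, String cId) {
        doc.add(new Field("c_id", cId, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }

    // Search time: restrict the full-text query to one c_id value.
    static Filter cIdFilter(String cId) {
        return new QueryWrapperFilter(new TermQuery(new Term("c_id", cId)));
    }
}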

On Tue, Dec 23, 2008 at 5:24 AM, csantos wrote:

>
> Hello,
>
> I need to filter a FullTextSearch against a query, that means, i search a
> term in a indexed entity "A", A contains a embedded Index "B", entity B has
> a m:1 bidirectional relationship with entity "C", the foreign Key in "B" is
> "c_id".  My filter condition would be like "filter the fulltext search for
> entries where the c_id equals some value", where value is given.
>
> I thought of using the QueryWrapperFilter. But the JavaDoc says for the
> TermQuery: "A Query that matches documents containing a term.". My problem
> is that the field I want to use do not appear on the Lucene Index. Which is
> the best approach?
>
> thanks in advanced
> --
> View this message in context:
> http://www.nabble.com/QueryWrapperFilter-tp21142252p21142252.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Lucene search problem

2008-12-23 Thread Erick Erickson
How do you intend to index these? Lucene will not
index objects for you. You have to break the object
down into a series of fields. At that point you can
substitute whatever you want.

Best
Erick

On Tue, Dec 23, 2008 at 3:36 AM,  wrote:

> Hi Aaron Schon/EricK,
>
> That really make sense to me but it really seems easy if is the string
> object. See the object structure I have it below hopefully that gives you
> some idea
>
> class Recipe {
> @DocumentId
> Integer id;
> @IndexedEmbedded
> Chain chain;
> //gettter and setter
> }
>
> class Chain {
> @DocumentId
> Integer id;
> @Field(index = Index.TOKENIZED, name="chainName")
> String name;
> //getter and setter
> }
>
> I am creating index on the recipe object. and for some recipe.m_chain would
> be null. So can you tell me how do I assign the value "NULLNULNULLNULL" for
> object chain in recipe.
>
> I also was thinking if #FieldBridge help me this way. My plan was to have
> default value where chain is null as you mentioned. but it does not seems
> to
> work for null values.
>
> Please suggest
>
> Thanks in advance.
> -Amar
>
> On Tue, Dec 23, 2008 at 12:04 AM, Aaron Schon 
> wrote:
>
> > I would second Erick's recommendation - create an arbitrary
> representation
> > for NULL such as "NULL" (if you are certain the term "NULL" does not
> occur
> > in actual docs. Alternatively, use "NULLNULNULLNULL" or something to that
> > effect.
> >
> >
> >
> > - Original Message 
> > From: Erick Erickson 
> > To: java-user@lucene.apache.org
> > Sent: Monday, December 22, 2008 8:58:21 AM
> > Subject: Re: Lucene search problem
> >
> > Try searching the mailing list archives for a fuller discussion, but
> > the short answer is usually to index an unique value for your
> > "null" entries, then search on that, something totally
> > outrageous like, say AAABBBCCCDDDEEEFFF.
> >
> > Alternatively, you could create, at startup time, a
> > Filter of all the docs that *do* contain terms for the
> > field in question, flip the bits and use the Filter in your
> > searches. (Hint: see TermDocs/TermEnum)
> >
> > Best
> > Erick
> >
> > On Mon, Dec 22, 2008 at 8:11 AM,  wrote:
> >
> > > Hi,
> > >
> > > I have problem with lucene search, I am quite new to this. Can some
> body
> > > help or just push me to who can please.
> > >
> > > Problem what I am facing we need search for object whose attribute
> > "chain"
> > > contaning null, but lucene does not help indexing the null values..
> > >
> > > how can I achieve this, or please guide me the alternative way of doing
> > > this.
> > >
> > > Thanks in advance.
> > > -Amar
> > >
> >
> >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
>
> --
> Amar Sannaik | Programmer | ATHARVA LIBSON Software Pvt Ltd.,
> # 9886476270, amarsann...@atharvalibson.com
>


Re: Combining results of multiple indexes

2008-12-23 Thread Erick Erickson
You're kind of in uncharted territory. I've been watching this list for
quite a while and you're the first person I remember who's said
"indexing speed is more important than querying speed".

Mostly I'll leave responses to folks who understand the guts of
indexing, except to say that for point (e) you can always use
a Sort object in your queries that ensures this. And that merging
indexes occurs in order. That is, if you merge index1, index2
and index3, the following holds true, just based upon the
position of the index in the merge array:
max(id in index1) < min(id in index2)
max(id in index2) < min(id in index3)

Best
Erick
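
A sketch of point (e) using a Sort object as mentioned above: Sort.INDEXORDER
returns hits in document-id order, which after an in-order merge is the order
the documents were indexed (Lucene 2.4-era API; the helper is illustrative).

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

public class IndexOrderSearchSketch {
    // Return the first n hits in the order the documents were added to the index.
    static TopDocs searchInIndexOrder(IndexSearcher searcher, Query query, int n)
            throws IOException {
        return searcher.search(query, null, n, Sort.INDEXORDER);
    }
}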

On Tue, Dec 23, 2008 at 2:44 AM, Preetham Kajekar (preetham) <
preet...@cisco.com> wrote:

> Hi Erick,
>  Thanks for the heads up. I understand that I am using an implementation
> detail rather than a feature.
>
>  Looks like having a single index is the best option. Hence, any
> optimizations (to improve indexing speed) you would suggest given that
>
> a) once a doc is added to an index, it will not get modified/deleted
> b) all the fields added are keywords (mostly numbers) - no analysis is
> required.
> c) indexing speed is more important than querying speed.
> d) every document is the same - there is no boost or relevancy required.
>
> e) Query results should be sorted in the order they were indexed.
>
>
> Thanks,
>  ~preetham
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, December 19, 2008 12:12 AM
> To: java-user@lucene.apache.org
> Subject: Re: Combining results of multiple indexes
>
> I would recommend, very strongly, that you don't rely on the doc IDs
> being
> the same in two different indexes. Doc IDs are just incremented by one
> for each doc added, but.
>
> optimization can change the doc ID. and is guaranteed to change at
> least some of them if there are deletions from your index. If you, for
> whatever reason indexed document N in one index and then skipped
> it in the other, all subsequent document IDs would not match. If.
>
> The fact that your IDs are the same is more than undocumented, it
> is coincidental.
>
> Best
> Erick
>
> On Thu, Dec 18, 2008 at 11:46 AM, Preetham Kajekar
> wrote:
>
> > Hi,
> > I noticed that the doc id is the same. So, if I have HitCollector,
> just
> > collect the doc-ids of both Searchers (for the two indexes) and find
> the
> > intersection between them, it would work. Also, get the doc is even
> where
> > there are large number of hits is fast.
> >
> > Of course, I am using something undocumented of Lucene.
> >
> >
> > Thanks,
> > ~preetham
> >
> > Preetham Kajekar wrote:
> >
> >> Thanks. Yep the code is very easy. However, it take about 3 mins to
> >> complete merging.
> >>
> >> Looks like I will need to have an out of band merging of indexes once
> they
> >> are closed (planning to store about 50mil entries in each index
> partition)
> >>
> >>
> >> However, as the data is being indexed, is there any other way to
> combine
> >> results ?
> >>
> >> I could get the results of one index, get all the hits and then apply
> this
> >> as a filter for the next index. But if there are large number of hits
> (which
> >> is likely to be the case), this would not perform too well.
> >>
> >> Do you think the document id can be used in anyway. How is the
> document id
> >> generated ? After all, i have the two indexes operating on a common
> List of
> >> objects. Would the doc is in index1 and index2 for object N in the
> list be
> >> the same ?
> >>
> >>
> >> Thanks,
> >> ~preetham
> >>
> >> Erick Erickson wrote:
> >>
> >>> You will be stunned at how easy it is. The merging code should be
> >>> a dozen lines (and that only if you are merging 6 or so indexes)
> >>>
> >>> See IndexWriter.addIndexes or
> >>> IndexWriter.addIndexesNoOptimize
> >>>
> >>> Best
> >>> Erick
> >>>
> >>> On Thu, Dec 18, 2008 at 5:03 AM, Preetham Kajekar
>  >>> >wrote:
> >>>
> >>>
> >>>
>  Hi,
>  I tried out a single IndexWriter used by two threads to index
> different
>  fields. It is slower than using two separate IndexWriters. These
> are my
>  findings
> 
>  All Fields (9) using 1 IndexWriter 1 Thread - 38,000 object per sec
>  5 Fields   using 1 IndexWriter 1 Thread - 62,000 object per sec
>  All Fields (9) using 1 IndexWriter 2 Thread - 29,000 object per sec
>  All Fields (9) using 2 IndexWriter 2 Thread - 55,000 object per sec
> 
>  So, it looks like I will have figure how to combine results of
> multiple
>  indexes.
> 
>  Thanks,
>  ~preetham
> 
> 
>  Preetham Kajekar wrote:
> 
> 
> 
> > Thanks Erick and Michael.
> > I will try out these suggestions and post my findings.
> >
> > ~preetham
> >
> > Erick Erickson wrote:
> >
> >
> >
> >> Well, maybe if I'd read the original post more carefully I'd have
> >> figured
> >> that out,
> >> sorry 'bout that.
> >

Re: Optimize and Out Of Memory Errors

2008-12-23 Thread Michael McCandless


How many indexed fields do you have, overall, in the index?

If you have a very large number of fields that are "sparse" (meaning  
any given document would only have a small subset of the fields), then  
norms could explain what you are seeing.


Norms are not stored sparsely, so when segments get merged the "holes"  
get filled (occupy bytes on disk and in RAM) and consume more  
resources.  Turning off norms on sparse fields would resolve it, but  
you must rebuild the entire index since if even a single doc in the  
index has norms enabled for a given field, it "spreads".


Mike

Utan Bisaya wrote:

Recently, our Lucene version was upgraded to 2.3.1 and the index had to
be rebuilt over several weeks, which made the entire
index a total of 20 GB or so.

After the rebuild, a weekly Sunday task was executed for
optimization.

During that time, the optimization failed several times, complaining
about OOM errors, but after a couple of tries it completed.

So the entire index is now 1 segment that is 20 GB.

The problem is that any subsequent searches on that index fail with
OOM errors while Lucene is reading bytes.


Our environment:
jvm Xmx1600 (This is the max we could set the box since it's windows)
8G Memory available on box
4G CPU (8 core) but only 12.5% is used. (Not sure if this would  
impact it)

Harddisk available is 120GB

mergeFactor, and other lucene config is set at default.

We checked this 20GB index using Luke and it has 400,000,000 documents in it.
It was able to count the docs; however, when we do a search in Luke
it fails, giving us OOM errors.

We also ran a check of the index and the tool fails at the 20GB segment
but succeeds on the others.

We've managed to roll back to a previously unoptimized index copy
(about 20GB or so) and the searches are fine now. This unoptimized
index is made up of several 8GB segments and a few smaller segments.


However there is a big possibility that the optimization error could  
happen again...


Does anybody have insights on why this is happening?










-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multiple IndexReaders from the same Index Directory - issues with Locks / performance

2008-12-23 Thread Michael McCandless


Locking is completely unused from IndexReader unless you do deletes or  
change norms, so sharing a remote mounted index is just fine (except  
for performance concerns).


If you're using 2.4, you should open your readers with readOnly=true.

Mike
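
A sketch combining the two suggestions in this thread: a read-only reader over a
directory opened with a no-op lock factory, for the search-only, shared-index
case. This assumes the Lucene 2.4 IndexReader.open(Directory, boolean readOnly)
overload and NoLockFactory; the path handling is illustrative.

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NoLockFactory;

public class SharedReadOnlySearcher {
    static IndexSearcher open(String path) throws IOException {
        // No-op locking: safe because these JVMs only ever read the index.
        FSDirectory dir = FSDirectory.getDirectory(new File(path),
                NoLockFactory.getNoLockFactory());
        // readOnly=true skips the locking IndexReader would otherwise need
        // for deletes and norm changes.
        IndexReader reader = IndexReader.open(dir, true);
        return new IndexSearcher(reader);
    }
}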

Tomer Gabel wrote:



Ultimately it depends on your specific usage patterns. Generally  
speaking, if
you have IndexReaders (and do not use their delete functionality)  
you don't
need locking at all; you can use a no-op lock factory, in which case  
you'll

pretty much only be constrained by your storage subsystem.


Kay Kay-3 wrote:


For one of our projects - we were planning to have the system of
multiple individual Lucene readers (just read-only instances and no
writes whatsoever ) in different physical machines having their
IndexReader-s warmed up from the same directory for the indices and
working on the same.

I was reading about locks (implemented as files) that Lucene uses
internally. I am just curious if using multiple readers would be a
feasible option here, all sharing the same index directory (across  
NFS /

similar network mounted storage ) in terms of locking etc.

Would there be a performance hit ( ignoring the NFS related  
performance

of course)  that would hinder multiple readers to serve query search
simultaneously from the same set of index files.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






-
--

http://www.tomergabel.com Tomer Gabel


--
View this message in context: 
http://www.nabble.com/Multiple-IndexReaders-from-the-same-Index-Directory---issues-with-Locks---performance-tp21136262p21142273.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multiple IndexReaders from the same Index Directory - issues with Locks / performance

2008-12-23 Thread Tomer Gabel

Ultimately it depends on your specific usage patterns. Generally speaking, if
you have IndexReaders (and do not use their delete functionality) you don't
need locking at all; you can use a no-op lock factory, in which case you'll
pretty much only be constrained by your storage subsystem.


Kay Kay-3 wrote:
> 
> For one of our projects - we were planning to have the system of 
> multiple individual Lucene readers (just read-only instances and no 
> writes whatsoever ) in different physical machines having their 
> IndexReader-s warmed up from the same directory for the indices and 
> working on the same. 
> 
> I was reading about locks (implemented as files) that Lucene uses 
> internally. I am just curious if using multiple readers would be a 
> feasible option here, all sharing the same index directory (across NFS / 
> similar network mounted storage ) in terms of locking etc.
> 
> Would there be a performance hit ( ignoring the NFS related performance 
> of course)  that would hinder multiple readers to serve query search 
> simultaneously from the same set of index files.
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> 


-
--

http://www.tomergabel.com Tomer Gabel 


-- 
View this message in context: 
http://www.nabble.com/Multiple-IndexReaders-from-the-same-Index-Directory---issues-with-Locks---performance-tp21136262p21142273.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



QueryWrapperFilter

2008-12-23 Thread csantos

Hello,

I need to filter a full-text search against a query. That is, I search for a
term in an indexed entity "A"; A contains an embedded index "B"; entity B has
an m:1 bidirectional relationship with entity "C", and the foreign key in "B" is
"c_id".  My filter condition would be like "filter the full-text search for
entries where c_id equals some value", where the value is given.

I thought of using the QueryWrapperFilter. But the JavaDoc says of
TermQuery: "A Query that matches documents containing a term." My problem
is that the field I want to use does not appear in the Lucene index. Which is
the best approach?

Thanks in advance
-- 
View this message in context: 
http://www.nabble.com/QueryWrapperFilter-tp21142252p21142252.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene search problem

2008-12-23 Thread amar . sannaik
Hi Aaron Schon/Erick,

That really makes sense to me, and it would be easy if it were a plain string
field. See the object structure I have below; hopefully that gives you
some idea.

class Recipe {
  @DocumentId
  Integer id;
  @IndexedEmbedded
  Chain chain;
  // getter and setter
}

class Chain {
  @DocumentId
  Integer id;
  @Field(index = Index.TOKENIZED, name="chainName")
  String name;
  // getter and setter
}

I am creating the index on the Recipe object, and for some recipes m_chain would
be null. So can you tell me how to assign the value "NULLNULNULLNULL" for
the chain object in such a Recipe?

I was also thinking a FieldBridge might help me here. My plan was to have a
default value where the chain is null, as you mentioned, but it does not seem to
work for null values.

Please suggest

Thanks in advance.
-Amar

On Tue, Dec 23, 2008 at 12:04 AM, Aaron Schon  wrote:

> I would second Erick's recommendation - create an arbitrary representation
> for NULL such as "NULL" (if you are certain the term "NULL" does not occur
> in actual docs. Alternatively, use "NULLNULNULLNULL" or something to that
> effect.
>
>
>
> - Original Message 
> From: Erick Erickson 
> To: java-user@lucene.apache.org
> Sent: Monday, December 22, 2008 8:58:21 AM
> Subject: Re: Lucene search problem
>
> Try searching the mailing list archives for a fuller discussion, but
> the short answer is usually to index an unique value for your
> "null" entries, then search on that, something totally
> outrageous like, say AAABBBCCCDDDEEEFFF.
>
> Alternatively, you could create, at startup time, a
> Filter of all the docs that *do* contain terms for the
> field in question, flip the bits and use the Filter in your
> searches. (Hint: see TermDocs/TermEnum)
>
> Best
> Erick
>
> On Mon, Dec 22, 2008 at 8:11 AM,  wrote:
>
> > Hi,
> >
> > I have problem with lucene search, I am quite new to this. Can some body
> > help or just push me to who can please.
> >
> > Problem what I am facing we need search for object whose attribute
> "chain"
> > contaning null, but lucene does not help indexing the null values..
> >
> > how can I achieve this, or please guide me the alternative way of doing
> > this.
> >
> > Thanks in advance.
> > -Amar
> >
>
>
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Amar Sannaik | Programmer | ATHARVA LIBSON Software Pvt Ltd.,
# 9886476270, amarsann...@atharvalibson.com