Question on the FAQ list with filters

2002-03-27 Thread Armbrust, Daniel C.

From the FAQ:

***
16. What is filtering and how is it performed ?

Filtering means imposing additional restrictions on the hit list to eliminate
hits that otherwise would be included in the search results. There are two
ways to filter hits:

* Search Query - in this approach, you provide your custom filter object when
you call the search() method. This filter will be called exactly once to
evaluate every document that resulted in a non-zero score.

* Selective Collection - in this approach you perform the regular search and,
when you get back the hit list, collect only those hits that match your
filtering criteria. In this approach, your filter is called only for hits
returned by the search method, which may be only a subset of the non-zero
matches (useful when evaluating your search filter is expensive).

***

I don't see why the second way is useful.  Yes, your filter is called only
for hits that got returned by the search method, but aren't those the same
hits that the search() method would run through the filter?  Maybe I'm just
not reading it closely enough.

Is my assumption correct that it is faster to provide a filter to the
search() method than to do selective collection?









RE: Chainable Filter contribution

2002-03-28 Thread Armbrust, Daniel C.

Thanks.  It makes much more sense now.  For a while there I thought maybe
the BitSets were performing some black magic that I hadn't yet learned
about.

Dan






-Original Message-
From: Kelvin Tan [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 27, 2002 7:58 PM
To: Armbrust, Daniel C.
Cc: [EMAIL PROTECTED]
Subject: Re: Chainable Filter contribution


Dan,

Totally my bad. I had since changed it but hadn't posted it to the list
because I didn't think anyone found it useful.

Here's the correct version. I haven't really documented it since it's pretty
straightforward. Just holler if you need any help.

Regards,
Kelvin






RE: Question on the FAQ list with filters

2002-03-28 Thread Armbrust, Daniel C.

I meant to add, here, that many applications that do searching
and filtering will display the hits only a chunk at a time (typical
web search interface).  This is another situation where it would make
a lot more sense to filter after the search, since you'd only have to
filter a relatively small portion of the hits for each page of results
the user asks for.
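
For what it's worth, a rough sketch of that pattern (the "approved" field, the
page size, and the index path below are made-up placeholders, not anything
from the FAQ):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class PagedSelectiveCollection
{
    public static void main(String[] args) throws Exception
    {
        IndexSearcher searcher = new IndexSearcher("index");
        Query query = QueryParser.parse("lucene", "body", new StandardAnalyzer());
        Hits hits = searcher.search(query);

        int pageSize = 10;
        int shown = 0;
        // Walk the hit list, but only run the (possibly expensive) check against
        // hits we actually intend to display on this page.
        for (int i = 0; i < hits.length() && shown < pageSize; i++)
        {
            Document doc = hits.doc(i);
            if ("yes".equals(doc.get("approved")))
            {
                System.out.println(doc.get("title"));
                shown++;
            }
        }
        searcher.close();
    }
}

The point being that the filtering work is only done for the handful of hits
actually shown, not for every non-zero-scoring document.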



How nice it is to have a list like this where there are thoughtful replies
given.  Thanks all!

I don't know why I didn't think of this case last night.  The various ways
make a lot more sense now.

Dan





RE: Newbie Questions

2002-04-08 Thread Armbrust, Daniel C.

The way that we have done this (and this isn't necessarily the best way, it
was just the solution we came up with) is to store all dates and numbers as
strings, but formatted in such a way that when they are alphabetized, they
are in the right order.

The Lucene date filtering mechanism was useless to us, because it doesn't
allow dates before 1970.

We stored all of our dates as strings in a year month day format, so that
they sort in the proper order.
Then you can write your own date filter, which is basically a cut and paste
from Lucene's DateFilter.

We also had an age field, and to make it sort properly, we had to pad all of
the ages with leading zeros, like

003
050
101

This way they sort properly, and you can write an age filter (again a cut
and paste from DateFilter) that will let you search for ages > 50.

Oh, and to apply more than one filter at a time (the way we did it) you will
need the ChainableFilter class, which I think is now on the contributions
page, but was also in the mailing list archives in the last two weeks.
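
Roughly, the idea looks like this (a sketch only - the field name, padding
width, and filter class below are made up for illustration, and the filter is
just the DateFilter pattern with the date handling stripped out):

import java.io.IOException;
import java.text.DecimalFormat;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.Filter;

// Indexing side: store the number zero-padded so string order == numeric order, e.g.
//   doc.add(Field.Keyword("age", new DecimalFormat("000").format(age)));  // 3 -> "003"
//
// Search side: keep only documents whose padded term falls between lower and upper.
public class PaddedRangeFilter extends Filter
{
    private String field;
    private String lower;   // e.g. "050"
    private String upper;   // e.g. "999"

    public PaddedRangeFilter(String field, String lower, String upper)
    {
        this.field = field;
        this.lower = lower;
        this.upper = upper;
    }

    public BitSet bits(IndexReader reader) throws IOException
    {
        BitSet bits = new BitSet(reader.maxDoc());
        TermEnum terms = reader.terms(new Term(field, lower));
        TermDocs termDocs = reader.termDocs();
        try
        {
            Term stop = new Term(field, upper);
            while (terms.term() != null && terms.term().compareTo(stop) <= 0)
            {
                // mark every document containing the current in-range term
                termDocs.seek(terms.term());
                while (termDocs.next())
                {
                    bits.set(termDocs.doc());
                }
                if (!terms.next())
                {
                    break;
                }
            }
        }
        finally
        {
            termDocs.close();
            terms.close();
        }
        return bits;
    }
}

Then something like searcher.search(query, new PaddedRangeFilter("age", "050",
"999")) behaves like an age > 50 restriction.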

Dan


-Original Message-
From: Chris Withers [mailto:[EMAIL PROTECTED]]
Sent: Sunday, April 07, 2002 4:55 AM
To: [EMAIL PROTECTED]
Subject: Newbie Questions


Hi there,

I'm new to Lucene and have what will hopefully be a couple of simple
questions.

1. Can I index numbers with Lucene? If so, ints or floats or ?

2. Can I index dates with Lucene?

In either case, is there any way I can sort the results returned by a search
on these fields?
Also, can I search for only documents which have been indexed with a range
in one of these fields?

For example: I only want documents where the 'cost' field is between 1000
and 2000 and where the date of manufacture was prior to 13th June 1978.

cheers,

Chris





Proper use of Lucene?

2002-04-11 Thread Armbrust, Daniel C.

I want to know whether this is supposed to be a legal thing to do with Lucene:

I indexed some files into index 1 that had fields x, y, and z.

I indexed some files into index 2 that had fields x, y, and q.

I used a MultiSearcher on the two indexes, and searched for things like

q:term

So far, this all works.  However, if I search for

q:ter*

lucene throws an exception.  Specifically:
Opening index:
/people/medinf1/projects/vocabulary/CNI/test/prototype/index...done
Opening index:
/people/medinf1/projects/vocabulary/CNI/test/prototype/index2...done
Enter query: q:mit*
java.lang.NullPointerException
at org.apache.lucene.index.SegmentTermEnum.clone(Unknown Source)
at org.apache.lucene.index.TermInfosReader.terms(Unknown Source)
at org.apache.lucene.index.SegmentReader.terms(Unknown Source)
at org.apache.lucene.search.PrefixQuery.getQuery(Unknown Source)
at org.apache.lucene.search.PrefixQuery.sumOfSquaredWeights(Unknown
Source)
at org.apache.lucene.search.BooleanQuery.sumOfSquaredWeights(Unknown
Source)
at org.apache.lucene.search.Query.scorer(Unknown Source)
at org.apache.lucene.search.IndexSearcher.search(Unknown Source)
at org.apache.lucene.search.MultiSearcher.search(Unknown Source)
at org.apache.lucene.search.Hits.getMoreDocs(Unknown Source)
at org.apache.lucene.search.Hits.(Unknown Source)
at org.apache.lucene.search.Searcher.search(Unknown Source)
at org.apache.lucene.search.Searcher.search(Unknown Source)
at
edu.mayo.mir.cni.search.SearchTest.commandLineTest(SearchTest.java:82)
at edu.mayo.mir.cni.search.SearchTest.main(SearchTest.java:175)

Should I be able to do this?
If so, I'll grab a copy of the newest release of lucene (I'm using 1.2 beta
1 now I think) and see if it still happens.  I'll try to write up a self
contained bug test too, but I'm not sure I'll be able to do that for a
couple of days.
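
For reference, the kind of setup in question looks roughly like this (the
index paths and the analyzer are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class TwoIndexSearch
{
    public static void main(String[] args) throws Exception
    {
        // two indexes with overlapping but not identical field sets (x, y, z vs x, y, q)
        Searchable[] searchables = new Searchable[] {
            new IndexSearcher("index"),
            new IndexSearcher("index2")
        };
        MultiSearcher searcher = new MultiSearcher(searchables);

        // "q:term" works; "q:ter*" (a prefix query) is what triggers the exception
        Query query = QueryParser.parse("q:ter*", "x", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}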

Thanks, 

Dan





RE: Proper use of Lucene?

2002-04-11 Thread Armbrust, Daniel C.

This error no longer occurs in the latest daily build (lucene-20020411.jar).

I should have checked first.

Thanks, 

Dan







RE: Removing a write.lock file

2002-04-18 Thread Armbrust, Daniel C.

If you're not letting Lucene remove the write.lock file, and are removing it
with your own code, that just screams out to me as a very bad thing to do.

Unless, of course, you want things to get corrupted.

Maybe it has worked for you so far out of luck, but I'm sure it's not the way
you are supposed to be doing things.






-Original Message-
From: Biswas, Goutam_Kumar [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 18, 2002 10:25 AM
To: 'Lucene Users List'
Subject: RE: Removing a write.lock file


I'm not removing the write.lock file by hand.  I'm removing it inside the
code before opening the index.
-Goutam


-Original Message-
From: Aruna Raghavan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 18, 2002 8:37 PM
To: 'Lucene Users List'
Subject: RE: Removing a write.lock file


I don't think it is a good approach to delete the write.lock file by hand.
It is there for a reason. You may want to dig into some of the older
dialogs/e-mails on this topic.

-Original Message-
From: Biswas, Goutam_Kumar [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 18, 2002 9:53 AM
To: 'Lucene Users List'
Subject: RE: Removing a write.lock file


Well Suneetha,

   before I write to the index I check whether a write.lock file exists.  If
it does, I delete it before opening the index.  It works fine for me.

-Goutam

-Original Message-
From: Aruna Raghavan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 18, 2002 8:22 PM
To: 'Lucene Users List'
Subject: RE: Removing a write.lock file



Hi,
The write.lock file won't be there if you close the index using a lock
mechanism. I use my own RWLock to access the index dir and unlock it after I
close the index. Basically, the access to the index is synchronized. I have
never had any problems with this approach.
Aruna.
-Original Message-
From: suneethad [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 17, 2002 11:47 PM
To: Lucene Users List
Subject: Removing a write.lock file


Hi,
I'm currently indexing while allowing multiple access, and I find that a
write.lock file has been created.
I know this is to prevent multiple writers, but now how do I continue?  I do
not want to reindex, as I work on a very large database and it takes a really
long time.  How do I remove this lock file?

Thanx 4 ur help,
Suneetha.







RE: Removing a write.lock file

2002-04-19 Thread Armbrust, Daniel C.

Since I am far from a Lucene expert, I would suggest searching the archive
for write.lock; I'm sure this has been discussed before.

My belief is that when you call the close method to close your IndexWriter,
it removes the write.lock file.

Someone please verify/correct me if I'm wrong.
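
A minimal sketch of what I mean (the index path is a placeholder):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CloseWriterExample
{
    public static void main(String[] args) throws Exception
    {
        IndexWriter writer = null;
        try
        {
            // false = open an existing index rather than creating a new one
            writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
            // ... add documents here ...
        }
        finally
        {
            if (writer != null)
            {
                writer.close();   // closing the writer is what releases write.lock
            }
        }
    }
}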




-Original Message-
From: Biswas, Goutam_Kumar [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 1:25 AM
To: 'Lucene Users List'
Subject: RE: Removing a write.lock file


Then how am I supposed to deal with the write.lock file ?





RE: Lucene's scalability

2002-04-29 Thread Armbrust, Daniel C.

I currently have an index of ~ 12 million documents, which are each about
that size (but in xml form).

When they are transformed for lucene to index, there are upwards of 50
searchable fields.

The index is about 10 GB right now.

I have not yet had any problems with "pushing the limits" of lucene.

In the next few weeks, I will be pushing my number of indexed documents up
into the 15-20 million range.  I can let you know if any problems arise.

Dan



-Original Message-
From: Joel Bernstein [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 1:32 PM
To: [EMAIL PROTECTED]
Subject: Lucene's scalability


Is there a known limit to the number of documents that Lucene can handle
efficiently?  I'm looking to index around 15 million, 2K docs which contain
7-10 searchable fields. Should I be attempting this with Lucene?

Thanks,

Joel






RE: Search all fields

2002-05-01 Thread Armbrust, Daniel C.

There's a cut-and-paste error on that contributions page in the link for
multiple-field searching.

It reads (notice the two http://'s in the link):

http://http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg0
0775.html">
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00775.html



-Original Message-
From: Peter Carlson [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 01, 2002 8:47 AM
To: Lucene Users List
Subject: Re: Search all fields


There is an example of how to do this in the contributions section of the
website (it's toward the bottom).

--Peter

On 5/1/02 5:22 AM, "Christoph Kiehl" <[EMAIL PROTECTED]> wrote:

> 
> is it somehow possible to simple search all indexed fields, without
> explicitly naming them in parse()? Or is there a method to get all fields
> ever indexed?
> 
> Thanks
> Christoph






RE: Lucene's scalability

2002-05-17 Thread Armbrust, Daniel C.

Currently, we are using an Ultra-80 SPARC Solaris box with 4 processors and
4 GB of RAM.

However, we are only making use of one of those processors with the index.
Our biggest speed restriction is the fact that our entire index resides on a
single disk drive.  We have a RAID array coming soon.

The performance has been very impressive, but as you can imagine, the speed
depends highly on the complexity of the query.  If you run a query with half
a dozen terms and fields, which returns ~30,000 results, it usually takes on
the order of a second or two.  If you run a query with 50-60 terms, it may
take 5-6 seconds.

I don't have any better performance stats than this currently.

Dan


-Original Message-
From: Harpreet S Walia [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 17, 2002 7:23 AM
To: Lucene Users List
Subject: Re: Lucene's scalability


Hi ,

I am also trying to do a similar thing.  I am very eager to know what kind
of hardware you are using to maintain such a big index.
In my case it is very important that the search happens very fast, so does
such a big index of 10 GB pose any problems in this direction?

TIA

Regards
Harpreet







RE: Lucene's scalability

2002-05-20 Thread Armbrust, Daniel C.

In my experience the time it takes depends much more on the complexity of
the query than on the number of results returned.  If I am making a query
with 50-60 terms, I am usually getting down to a couple thousand or fewer
results.

Dan


-Original Message-
From: CNew [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 20, 2002 8:46 AM
To: Lucene Users List
Subject: Re: Lucene's scalability


You didn't mention the hit count on the query with 50-60 terms;
just wondering if the time was linear.





deleting question

2002-05-20 Thread Armbrust, Daniel C.

I did a batch deletion on an index.  Then, after searching the archives for
something else, I came across this

> I understand there are three modes for using IndexReader and 
> IndexWriter:
> 
> A- IndexReader for reading only, not deleting
> B- IndexReader for deleting (and reading)
> C- IndexWriter (for adding and optimizing)

> What matters is that only one of B or C can be done at once.  
> That's to say, only a single process/thread may
> modify an index at once.  Modification should be single threaded.

In looking back at the code I ran, I had an IndexWriter open, and then I
opened an IndexReader, and did the deletions.  Every 300,000 deletions, I
called the optimize method of the writer.  When I was done with the
deletions, I closed the reader, then called optimize again on the writer,
then closed the writer.

My question is, does anyone know offhand if having both of those open at
once would have done anything bad (like corrupt) my index?
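
For reference, here is a sketch of the sequence I believe the A/B/C
description above implies - only one of the reader (for deletes) or the
writer open at a time (the "id" field and index path are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class SingleModifierAtATime
{
    public static void main(String[] args) throws Exception
    {
        // C: add documents, then close the writer before any deletions
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
        // ... writer.addDocument(...) calls here ...
        writer.optimize();
        writer.close();

        // B: open a reader only after the writer is closed, and do the deletions
        IndexReader reader = IndexReader.open("index");
        reader.delete(new Term("id", "doc-42"));
        reader.close();

        // C again: reopen the writer to optimize away the deleted documents
        writer = new IndexWriter("index", new StandardAnalyzer(), false);
        writer.optimize();
        writer.close();
    }
}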

Thanks, 

Dan






RE: Searching greater than/less than

2002-05-22 Thread Armbrust, Daniel C.

As long as the field can be sorted alphabetically, you can build your own
filter.

If you index the field as a fixed-length, zero-padded number, e.g.

001
010
136

then you can build a filter (mostly a cut and paste from the DateFilter
class) which will allow you to do less-than and greater-than operations.

Dan





-Original Message-
From: Victor Hadianto [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 21, 2002 7:54 PM
To: [EMAIL PROTECTED]
Subject: Searching greater than/less than


Can I use Lucene to search greater than / less than a value in a field?  I
have a field in the document that functions as a score.  I would need to be
able to search the index with the option of saying a field > 50.

Regards,

-- 
Victor Hadianto





RE: Searching greater than/less than

2002-05-22 Thread Armbrust, Daniel C.

Just a note - I have used this method (the DateFilter style of filtering) on
an index of 12 million documents, and haven't had a problem with
performance.

I need to set up some tests, however, and see whether it is faster to filter
with the DateFilter style of filtering, or to filter by building a query for
every number you want returned.

My guess is that if you are looking for numbers on the scale of 10 < x < 50,
it would be faster to write the query to search for 11, 12, 13, 14, 15, ...,
49.
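
(Something along these lines - the field name and padding width are
placeholders, and this assumes the values were indexed as fixed-width
strings as described earlier:)

import java.text.DecimalFormat;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class EnumeratedRangeQuery
{
    // builds a query matching low < x < high for zero-padded 3-digit values
    public static BooleanQuery build(String field, int low, int high)
    {
        DecimalFormat padded = new DecimalFormat("000");
        BooleanQuery query = new BooleanQuery();
        for (int value = low + 1; value < high; value++)
        {
            // add(query, required, prohibited): an optional "OR" clause
            query.add(new TermQuery(new Term(field, padded.format(value))), false, false);
        }
        return query;
    }
}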

But if you are looking for numbers over a much wider range, it will be
faster to do the DateFilter style of filtering.

This is just a hunch, however, and I'm sure it depends on the range of
numbers in your index, and the number of docs in it.  It would be nice to
know, however, roughly where the scale tips from one way being faster than
the other.

If anyone has tested this, maybe an entry on the FAQ page is in order.

Or, the whole thing could be alleviated if Lucene supported a number field,
rather than just text and a date field (which seems like it was implemented
as an afterthought, and is useless as written if your dates go back before
~1970).

Maybe put that on the list of possible next-version ideas.



-Original Message-
From: David Smiley [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 21, 2002 8:54 PM
To: Lucene Users List
Subject: Re: Searching greater than/less than


Yes.  Check out how Date support is implemented.  As a quick
workaround... you could piggy-back on Lucene's existing date support
by creating a Date via milliseconds that is the number you are
trying to put in the index.
Note that internally, a bit-vector is created that is the same size 
as the index which might cause performance problems depending on the 
size of your index and typical queries you will have in your 
environment.

~ Dave Smiley

On Tuesday, May 21, 2002, at 08:53  PM, Victor Hadianto wrote:

> Can I use lucene to search greater than / less than a value in the 
> field? I
> have a field in the document that function as a score. I would need 
> to be
> able to search the index + the option having to say a field > 50
>
> Regards,
>
> --
> Victor Hadianto
>
> --
> To unsubscribe, e-mail:    [EMAIL PROTECTED]>
> For additional commands, e-mail:  [EMAIL PROTECTED]>
>






RE: Few questions regarding the design of the Filter class

2002-05-24 Thread Armbrust, Daniel C.

Looks to me like you're looking for Kelvin Tan's ChainableFilter:

http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg01168.html

Dan



-Original Message-
From: Christian Meunier [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 24, 2002 5:38 AM
To: Lucene Users List
Subject: Re: Few questions regarding the design of the Filter class


>
> A workaround for what?  It's not clear what you're trying to do.
>

Here is what I am trying to do:

A simple class to filter a field

FieldFilter.java


--
import java.util.BitSet;
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.IndexReader;

public class FieldFilter extends org.apache.lucene.search.Filter
{
    private String field;
    private String value;
    private Term searchTerm;

    public FieldFilter(String field, String value)
    {
        this.field = field;
        this.value = value;
        searchTerm = new Term(field, value);
    }

    public String getField()
    {
        return field;
    }

    public BitSet bits(IndexReader reader) throws IOException
    {
        BitSet bits = new BitSet(reader.maxDoc());
        TermDocs matchingDocs = reader.termDocs(searchTerm);
        try
        {
            while (matchingDocs.next())
            {
                bits.set(matchingDocs.doc());
            }
        }
        catch (Exception e) { /* ignore */ }
        finally
        {
            if (matchingDocs != null)
            {
                matchingDocs.close();
            }
        }
        return bits;
    }
}


--

I then coded a class which handles multiple filters (FieldFilter,
DateFilter, ...) at once:


MultiFilter.java


--
import java.util.ArrayList;
import java.util.BitSet;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

public class MultiFilter extends org.apache.lucene.search.Filter
{
    private ArrayList filterList;

    public MultiFilter()
    {
        filterList = new ArrayList();
    }

    public MultiFilter(int initialCapacity)
    {
        filterList = new ArrayList(initialCapacity);
    }

    public String getField()
    {
        return null;
    }

    public void add(Filter filter)
    {
        filterList.add(filter);
    }

    public BitSet bits(IndexReader reader) throws IOException
    {
        int filterListSize = filterList.size();

        if (filterListSize > 0)
        {
            // AND the bit sets of all the chained filters together
            BitSet bits = ((Filter) filterList.get(0)).bits(reader);
            for (int i = 1; i < filterListSize; i++)
            {
                bits.and(((Filter) filterList.get(i)).bits(reader));
            }
            return bits;
        }

        // with no filters added, let every document through
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < reader.maxDoc(); i++)
        {
            bits.set(i);
        }
        return bits;
    }
}

Best regards
Christian






Merging Question

2002-05-24 Thread Armbrust, Daniel C.

I'm wondering if someone can speak to the normal behavior of Lucene when it
is merging multiple indexes together.

Is it true that when merging multiple FSDirectories together, you should
start seeing files grow in the output directory immediately?

I am trying to merge 10 indexes together.  After about 4 hours of
processing, none of my indexes had changed size at all - input or output
indexes.  Memory usage had grown to 1 GB, after which it died, since that is
where I had it capped.

I tried again this morning, this time only merging 4 indexes into one
existing index.  The same behavior appears to be happening again.

The indexes I am merging are moderately large, about 125 MB each, and I am
merging into a 9 GB index.

The only thing that is different this time from previous occasions when I
merged large indexes together is that I deleted a large batch of documents
from the 9 GB index a couple of days ago.  Could I have done something to
the index in the delete that would cause this strange merging behavior?

Any insight would be appreciated.

Dan





RE: Merging Question

2002-05-24 Thread Armbrust, Daniel C.

Scratch this question; the problem is my fault.

Dan






RE: Merging (adding) indices

2002-05-28 Thread Armbrust, Daniel C.

I have also had the source indexes disappear when I was merging indexes.  It
seems to happen nearly 100% of the time when merging more than 2 indexes.  I
just thought it was the normal behavior.  However, when I merge only 2
indexes, they usually don't get deleted.  I don't have any code that deletes
the indexes on merging.  Maybe I should look closer at some code if this
isn't supposed to be happening?
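
For reference, the kind of merge being discussed is roughly this (the paths
are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIndexes
{
    public static void main(String[] args) throws Exception
    {
        // open the existing target index (false = do not create/erase it)
        IndexWriter writer = new IndexWriter("merged-index", new StandardAnalyzer(), false);
        Directory[] parts = new Directory[] {
            FSDirectory.getDirectory("part1", false),
            FSDirectory.getDirectory("part2", false),
            FSDirectory.getDirectory("part3", false)
        };
        writer.addIndexes(parts);   // merges the parts into the target index
        writer.close();
    }
}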

Dan





-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 27, 2002 5:41 PM
To: Lucene Users List
Subject: Re: Merging (adding) indices


The source code looks like this:

  public final synchronized void addIndexes(Directory[] dirs)
      throws IOException {
    optimize();                               // start with zero or 1 seg
    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos();  // read infos from dir
      sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
        segmentInfos.addElement(sis.info(j)); // add each info
      }
    }
    optimize();                               // final cleanup
  }

So I think the original directories/indices should not be modified in
any way.  Are you sure your application is not deleting them?

Otis



--- Lex Lawrence <[EMAIL PROTECTED]> wrote:
> Hello-
> I am using org.apache.lucene.index.IndexWriter.addIndexes(Directory[]
> dirs) 
> to merge several indices into one.  The resulting index appears to
> work 
> fine, but afterward the original indices seem to have been completely
> 
> emptied.
> 
> I can deal with that, but I just wanted to check: Is this method
> supposed to 
> alter the indices in the 'dirs' parameter?  It's not mentioned in the
> 
> javadoc.
> 
> Thanks- Lex
> 






RE: MS Word Search ??

2002-05-30 Thread Armbrust, Daniel C.

This might be worth looking into for those who need to parse Word, Excel,
PowerPoint, or other Microsoft file types.

OpenOffice - www.openoffice.org - knows how to parse all of the Microsoft
formats (at least all that I've tried so far), and then you can do a Save As
and write out the OpenOffice format, which is a couple of XML files zipped
together.  So, this makes me think of two possible ways that you could get
at the content of the MS files in a text form you can index (neither of
which I have tried or even looked to see if they are possible):

#1 - get the code for OpenOffice - it is open source - and use it for
parsing the MS documents into XML, which could then be indexed

#2 - if OpenOffice is programmatically drivable (which I don't know if it
is), fire up a copy of OpenOffice and use it to convert the files as
necessary.

Just some suggestions.  Does anyone know much more about OpenOffice?  I
would be interested in knowing if either of these would be feasible.

Dan




-Original Message-
From: Ewout Prangsma [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 29, 2002 1:00 PM
To: Lucene Users List
Subject: Re: MS Word Search ??


On Wednesday 29 May 2002 11:56, Karl Øie wrote:
> b: convert the documents to something that is accessible through Java,
> like XML, etc...

We're using wvWare (wvware.com) to convert Word to HTML (or text) and index
that, and xpdf for converting PDF to text and index that.  Any links on
indexing using POI converters (or other Java converters) are very welcome!

Ewout

>
> the best way is to convert as the java api's for MSOffice documents still
> are under development
>
> mvh karl øie
>
> On Wednesday 29 May 2002 11:48, Rama Krishna wrote:
> > Hi,
> >
> > I am trying to build a search engine which search in MS Word, excel, ppt
> > and adobe pdf. I am not sure whether i can use Lucene for this or not. 
> > pl. help me out in this regard.
> >
> >
> > Regards,
> > Ramakrishna
> >
> >
> > _
> > Chat with friends online, try MSN Messenger: http://messenger.msn.com

-- 
Ewout Prangsma, Directeur
Daisy Software
Telefoon/fax: +31-77-3270305/3270306
Email: [EMAIL PROTECTED]
Website: www.daisysoftware.com
KvK Venlo nr. 12046144 








RE: Your experiences with Lucene

2002-10-29 Thread Armbrust, Daniel C.
Currently maintaining an index of approximately 12 million documents here.

It's about an 11 GB index.  We haven't had scalability problems yet, and we
have not done much work toward optimizing things either.

Dan




-Original Message-
From: Tim Jones [mailto:timothy.jones@;mongoosetech.com] 
Sent: Tuesday, October 29, 2002 2:03 PM
To: [EMAIL PROTECTED]
Subject: Your experiences with Lucene


Hi,
 
I am currently starting work on a project that requires indexing and searching on 
potentially thousands, maybe tens of thousands, of text documents.
 
I'm hoping that someone has a great success story about using Lucene for a project 
that required indexing and searching of a large number of documents. Like maybe more 
than 10,000. I guess what I'm trying to figure out is if Lucene's performance will be 
acceptable where the number of documents is very large. I realize this is a very 
general question but I just need a general answer.
 
Thanks,
 
Tim J.





RE: Date can't be before 1970?

2002-11-07 Thread Armbrust, Daniel C.
You can also still use a date filter... you just have to write your own
(this is what we did).

It's basically a copy of the current DateFilter... with a couple of minor
changes.
Dan


-Original Message-
From: Peter Carlson [mailto:carlson@;bookandhammer.com] 
Sent: Monday, November 04, 2002 12:06 AM
To: Lucene Users List
Subject: Re: Date can't be before 1970?


The other option that some people have used is to not use the DateField
and just create your own format following the
yyyyMMdd format.

So 20020101 for Jan 1, 2002

Note that you cannot use the Date Filter to filter dates.

However you can use the built in range query (see query syntax for more 
details).

So to search for all dates in 2002 in the pubDate field you can use

pubDate:[20020101 - 20021231]

Note the space on either side of the dash (-).

Since this is just a string, it is not affected by the 1970 issues. I 
hope this helps.
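
A small sketch of the indexing side (the field name and the choice of a
Keyword field are just one way to do it):

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PubDateExample
{
    public static void main(String[] args)
    {
        SimpleDateFormat yyyymmdd = new SimpleDateFormat("yyyyMMdd");
        Document doc = new Document();
        // works for any date, including those before 1970, because it is just a string
        doc.add(Field.Keyword("pubDate", yyyymmdd.format(new Date())));
        // at query time the range syntax above applies:  pubDate:[20020101 - 20021231]
    }
}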

--Peter


On Sunday, November 3, 2002, at 10:01 PM, Herman Chen wrote:

> Hi,
>
> I noticed that DateField.dateToString does not allow dates before
> 1970.  Is the limitation caused by java's Date or by the way it needs 
> to be encoded for the index.  What is the suggested solution to deal 
> with dates prior to 1970?  Thanks.
>
> --
> Herman






RE: large index -> slow optimize()

2002-11-22 Thread Armbrust, Daniel C.
Note - this is not fact; this is what I think I know about how it works.

My working assumption has been that it's just a matter of disk speed, since
during optimize the entire index is copied into new files, and then at the
end the old one is removed.  So the more GB you have to copy, the longer it
takes.

This is also the reason that you need double the size of your index
available on the drive in order to perform an optimize, correct?  Or does
this only apply when you are merging indexes?


Dan



-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]] 
Sent: Friday, November 22, 2002 12:52 PM
To: [EMAIL PROTECTED]
Subject: large index -> slow optimize()


Hello,

I am building an index with a few million documents, and every X documents
added to the index I call optimize() on the IndexWriter.
I have noticed that as the index grows this call takes more and more
time, even though the number of new segments that need to be merged is
the same between every optimize() call.
I suspect this is normal and not a bug, but is there no way around
that?  Do you know which part is the part that takes longer and longer
as the index grows?

Thanks,
Otis






Lucene Speed under diff JVMs

2002-12-05 Thread Armbrust, Daniel C.
This may be of use to people who want to make Lucene index faster.  Also, I'm curious
as to what JVM most people run Lucene under, and whether anyone else has seen results
like this:

I'm using the class that Otis wrote (see message from about 3 weeks ago) for testing
the scalability of Lucene (more results on that later), and I first tried running it
under different versions of Java to see where it runs the fastest.  The class simply
creates an index out of randomly generated documents.

All of the following runs were on a dual-CPU 1 GHz PIII Windows 2000 machine that
wasn't doing much else during the benchmark.  The indexing program was single
threaded, so it only used one of the processors of the machine.

java version "1.3.1_04"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1_04-b02)
Java HotSpot(TM) Client VM (build 1.3.1_04-b02, mixed mode)

42 seconds/1000 documents

java version "1.4.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-b21)
Java HotSpot(TM) Client VM (build 1.4.1-b21, mixed mode)

42 seconds/1000 documents

Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1_01)
BEA WebLogic JRockit(R) Virtual Machine (build 
8.0_Beta-1.4.1_01-win32-CROSIS-20021105-1617, Native Threads, Generational Concurrent 
Garbage Collector)

35 seconds/1000 documents

java version "1.3.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1)
Classic VM (build 1.3.1, J2RE 1.3.1 IBM Windows 32 build cn131-20020403 (JIT enabled: 
jitc))

27 seconds/1000 documents


As you can see, the IBM JVM pretty much smoked Sun's, and beat out JRockit
as well.  Just a hunch, but it wouldn't surprise me if search times were
also faster under the IBM JDK.  Has anyone else come to this conclusion?


Dan





RE: Lucene Speed under diff JVMs

2002-12-06 Thread Armbrust, Daniel C.
To clarify (which means adding the info I should have put in the first time
but missed), the run was of 40,000 documents.  The number was an average.

Each run was done twice (and the results were identical).

And the machine was a dual-processor machine, so most OS tasks ran on the
idle processor, while the indexing process gobbled up the other one.

And I'm definitely not trying to say one JVM is better than another, but for
this task of creating a Lucene index, there is a very noticeable speed
difference.  I was really just curious whether anyone else had done any
tests similar to this.

Dan




> As you can see, the IBM jvm pretty much smoked Suns.  And beat out
> JRockit as well.  Just a hunch, but it wouldn't surprise me if search
> times were also faster under the IBM jdk.  Has anyone else come to this
> conclusion?

Just a brief note on performance measurements and statistical sampling: no
offense, but if these are measurements of a single trial of 1000 documents
for each JVM, they're not so different that I'd be willing to conclude
that one JVM is notably faster for this task than another.  The problem is
compounded by the fact that it can be hard to tell just how much CPU is
being taken up by OS tasks (and this can fluctuate quite a lot).  If you
really want to quote statistics like this, using 5 or 10 trials would give
a more accurate notion of the real performance differences (if any).

Casuistically :),

Joshua O'Madadhain

  [EMAIL PROTECTED] Per Obscuriuswww.ics.uci.edu/~jmadden
   Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for.  -- Bill Watterson
 My opinions are too rational and insightful to be those of any organization.









RE: Lucene Speed under diff JVMs

2002-12-06 Thread Armbrust, Daniel C.
One more bit of info that I should have included:

The randomly generated documents consisted of 2 fields, one Text with 3 words, and one 
UnStored with 500 words.  Average word length was 7 characters.

If Otis (he wrote it, I just made a tweak or two) doesn't mind, I'll post the source 
code.
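
In the meantime, here is a rough sketch of the general shape of such a
generator (this is not the actual Words2Index.java - the word list, field
names, counts, and path are stand-ins):

import java.util.Random;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class RandomDocIndexer
{
    private static final String[] WORDS = { "lucene", "index", "random", "query", "term" };
    private static final Random RANDOM = new Random();

    private static String randomWords(int count)
    {
        StringBuffer buffer = new StringBuffer();
        for (int i = 0; i < count; i++)
        {
            buffer.append(WORDS[RANDOM.nextInt(WORDS.length)]).append(' ');
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws Exception
    {
        IndexWriter writer = new IndexWriter("speed-test-index", new StandardAnalyzer(), true);
        for (int i = 0; i < 40000; i++)
        {
            Document doc = new Document();
            doc.add(Field.Text("title", randomWords(3)));        // stored and indexed
            doc.add(Field.UnStored("body", randomWords(1000)));  // indexed only
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }
}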

Dan






RE: Lucene Speed under diff JVMs

2002-12-06 Thread Armbrust, Daniel C.
Class that was used (attached)

And correction, the UnStored field had 1000 words, not 500.

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]] 
Sent: Friday, December 06, 2002 10:57 AM
To: Lucene Users List
Subject: RE: Lucene Speed under diff JVMs


Otis doesn't mind.

-

One more bit of info that I should have included:

The randomly generated documents consisted of 2 fields, one Text with 3 words, and one 
UnStored with 500 words.  Average word length was 7 characters.

If Otis (he wrote it, I just made a tweak or two) doesn't mind, I'll post the source 
code.

Dan




Words2Index.java
Description: Binary data


Lucene Benchmarks and Information

2002-12-20 Thread Armbrust, Daniel C.
I've been running some scalability tests on Lucene over the past couple of weeks.  
While there may be some flaws with some of my methods, I think they will be useful for 
people that want an idea as to how Lucene will scale.  If anyone has any questions 
about what I did, or wants clarifications on something, I'll be happy to provide them.

I'll start by filling out the form


  Hardware Environment
* Dedicated machine for indexing: yes
* CPU: 1 2.53 GHz Pentium 4
* RAM: Self-explanatory
* Drive configuration: 100 GB 7200 RPM IDE, 80 GB 7200 RPM IDE

  Software environment
* Java Version: java version "1.3.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1)
Classic VM (build 1.3.1, J2RE 1.3.1 IBM Windows 32 build cn131-20020403 (JIT 
enabled: jitc))
* OS Version: Win XP SP1
* Location of index: Local File Systems

  Lucene indexing variables
* Number of source documents: 43,779,000
* Total filesize of source documents: ~350 GB -- never stored (documents were 
randomly generated)
* Average filesize of source documents: 8 KB
* Source documents storage location: Generated while indexing, never written to 
disk
* File type of source documents: text
* Parser(s) used, if any: None
* Analyzer(s) used: Standard Analyzer
* Number of fields per document: 2
* Type of fields: text, Unstored
* Index persistence: FSDirectory

  Figures
* Time taken (in ms/s as an average of at least 3 indexing runs): See notes below
* Time taken / 1000 docs indexed: 6.5 seconds/1000, not counting optimization 
time.  15 seconds/1000 when optimizing every 100,000 documents, and building an index 
to ~ 5 million documents.  Above 5 million documents, optimization took too much time. 
 See notes below.
* Memory consumption: ~ 200 mb
   *  Index Size: 70.7 GB

  Notes
* Notes: The documents were randomly generated on the fly as part of the
indexing process from a list of ~100,000 words, whose average length was 7.
The documents had 3 words in the title, and 500 words in the body.

While I was trying to build this index, the biggest limitation of Lucene that I ran
into was optimization.  Optimization kills the indexer's performance when you get
between 3-5 million documents in an index.  On my Windows XP box, I had to reoptimize
every 100,000 documents to keep from running out of file handles.  While I could build
a 5 million document index in 24 hours, I could only add about another million over
the next 24 hours due to the pain of the optimizer recopying the entire index over and
over again (about 10 GB at this point), and it would only get worse from there.  So,
to build this large an index, I built several ~5 million document indexes, and then
merged them at the end into a single index.  The second issue (though not really a
problem) was that you have to have at least double the disk space available to build
the index as you need when you are done.  I could have kept building the index bigger,
but I ran out of disk space.

When I was done building indexes, I ran some queries against them to see how the
search performance varied with the size of the index.  Following are my results for
various size indexes.

Index Size (GB)    MS per query
4.53               83
7.92               83
10                 89
12.7               112
52.5               694
70.7               944


These numbers are an average of 3 runs of 500 randomly generated queries
being tossed at the index (single threaded) on the same hardware that built
the index.  The queries were randomly generated (about 50% of the queries
had 0 results, 50% had 1 or more results).

I was happy to see that these numbers make a nice linear plot (attached).  I'm not
sure what other comments to add here, other than to thank the authors of Lucene for
their great design and implementation.

If anyone has anything else they would like me to test on this index before
I dump it, speak up quickly; I have to pull out one of the hard drives this
weekend to pass it on to its real owner.

Dan




RE: Lucene Benchmarks and Information

2002-12-20 Thread Armbrust, Daniel C.
The queries were definitely not very intelligently built.  It was a last-minute thing
I decided to do for the heck of it, as my main reason for thrashing my hard drives in
this exercise was to make sure I could run the document count up significantly higher
than what we are currently up to at work.

The operator was chosen randomly, followed by a random field, followed by a random
word.  I didn't put any phrases in, as I expected the number of hits would be quite
low (since the documents were also randomly generated), but in retrospect, even with
near 0 results, it would probably be interesting.  Maybe I'll run a couple tonight if
I get a chance.

-Original Message-
From: Jonathan Reichhold [mailto:[EMAIL PROTECTED]] 
Sent: Friday, December 20, 2002 12:24 PM
To: 'Lucene Users List'
Subject: RE: Lucene Benchmarks and Information


A question on the queries you used.  What sort of distribution of terms
did you use?  I.e. were all the queries single random words, or did you
add in multi-word queries and phrases?

I'm impressed with the results, just want to understand the testing
methodology better.

JR

-----Original Message-
From: Armbrust, Daniel C. [mailto:[EMAIL PROTECTED]] 
Sent: Friday, December 20, 2002 8:57 AM
To: 'Lucene Users List'
Subject: Lucene Benchmarks and Information


I've been running some scalability tests on Lucene over the past couple
of weeks.  While there may be some flaws with some of my methods, I
think they will be useful for people that want an idea as to how Lucene
will scale.  If anyone has any questions about what I did, or wants
clarifications on something, I'll be happy to provide them.

I'll start by filling out the form


  Hardware Environment
* Dedicated machine for indexing: yes
* CPU: 1 2.53 GHz Pentium 4
* RAM: Self-explanatory
* Drive configuration: 100 GB 7200 RPM IDE, 80 GB 7200 RPM IDE

  Software environment
* Java Version: java version "1.3.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1)
Classic VM (build 1.3.1, J2RE 1.3.1 IBM Windows 32 build
cn131-20020403 (JIT enabled: jitc))
* OS Version: Win XP SP1
* Location of index: Local File Systems

  Lucene indexing variables
* Number of source documents: 43,779,000
* Total filesize of source documents: ~350 GB -- never stored
(documents were randomly generated)
* Average filesize of source documents: 8 KB
* Source documents storage location: Generated while indexing, never
written to disk
* File type of source documents: text
* Parser(s) used, if any: None
* Analyzer(s) used: Standard Analyzer
* Number of fields per document: 2
* Type of fields: text, Unstored
* Index persistence: FSDirectory

  Figures
* Time taken (in ms/s as an average of at least 3 indexing runs):
See notes below
* Time taken / 1000 docs indexed: 6.5 seconds/1000, not counting
optimization time.  15 seconds/1000 when optimizing every 100,000
documents, and building an index to ~ 5 million documents.  Above 5
million documents, optimization took too much time.  See notes below.
* Memory consumption: ~ 200 mb
   *  Index Size: 70.7 GB

  Notes
* Notes: The documents were randomly generated on the fly as part of
the indexing process from a list of ~100,000 words, whose average length
was 7.   The documents had 3 words in the title, and 500 words in the
body.

While I was trying to build this index, the biggest limitation of Lucene
that I ran into was optimization.  Optimization kills the indexer's
performance when you get between 3-5 million documents in an index.  On
my Windows XP box, I had to reoptimize every 100,000 documents to keep
from running out of file handles.  While I could build a 5 million
document index in 24 hours... I could only add about another million
over the next 24 hours due to the pain of the optimizer recopying the
entire index over and over again (about 10 GB at this point), and it
would only get worse from there.  So, to build this large of an index, I
built several ~ 5 million document indexes, and then merged them at the
end into a single index.  The second issue (though not really a problem)
was that you have to have at least double the disk space available to
build the index as you need when you are done.  I could have kept
building the index bigger, but I ran out of disk space.  

When I was done building indexes, I ran some queries against them to see
how the search performance varied with the size of the index.  Following
are my results for various size indexes.

Index Size (GB)    MS per query
4.53               83
7.92               83
10                 89
12.7               112
52.5               694
70.7               944


These numbers are an average of 3 runs of 500 randomly generat

RE: Lucene Benchmarks and Information

2002-12-20 Thread Armbrust, Daniel C.
Not sure what the file handle limit is on XP, but it seemed to be fairly generous.  
I wasn't opening any other files, or index readers while building the index.  However, 
I didn't realize the connection between the merge factor and the number of files held 
open.  In some ad hoc tests, on these docs that I was indexing I got the best speed 
out of the indexer (not taking optimizations into account) with a merge factor of 500 
(which is what I used).  I'll have to try cranking down the merge factor, and see how 
many I can index without calling optimize.

Thanks for the info!

Dan


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]] 
Sent: Friday, December 20, 2002 12:45 PM
To: Lucene Users List
Subject: Re: Lucene Benchmarks and Information


Armbrust, Daniel C. wrote:
> While I was trying to build this index, the biggest limitation of
> Lucene that I ran into was optimization.  Optimization kills the
> indexer's performance when you get between 3-5 million documents in an
> index.  On my Windows XP box, I had to reoptimize every 100,000
> documents to keep from running out of file handles.

What is the file handle limit on XP?

When batch indexing, optimizing before the end slows things down, and 
should not be required.

Are you otherwise opening index readers in the same process?  Index 
readers use a lot more file handles than the index writer, since they 
must keep all files in all segments open.  For large indexes it's best 
to do your indexing in a separate process which never opens an IndexReader.

The max a reader will keep open is:

   mergeFactor * log_base_mergeFactor(N) * files_per_segment

With mergeFactor=10 (the default) and 1 million documents, and 10 files 
per segment, a reader on a never-optimized index should at most require 
600 open files, and typically half that.

A writer will open:

   (1 + mergeFactor) * files_per_segment

With mergeFactor=10 (the default) and 1 million documents, a writer on a 
never-optimized index would require 110 open files.

I just built a 3M document index on Linux in five hours, with no 
intermediate optimizations.  I set the mergeFactor to 50.  This required 
around 500 file handles, well beneath the 1024 limit.

Doug
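
As a quick sanity check, those two estimates can be spelled out directly (plain 
arithmetic, no Lucene calls, using the numbers from the example above):

public class FileHandleEstimate {
    public static void main(String[] args) {
        int mergeFactor = 10;        // Lucene default
        int docs = 1000000;          // one million documents
        int filesPerSegment = 10;

        // Reader estimate: mergeFactor * log_mergeFactor(docs) * files_per_segment
        double readerMax = mergeFactor * (Math.log(docs) / Math.log(mergeFactor)) * filesPerSegment;

        // Writer estimate: (1 + mergeFactor) * files_per_segment
        int writerMax = (1 + mergeFactor) * filesPerSegment;

        System.out.println("Reader, at most: " + Math.round(readerMax) + " open files");  // ~600
        System.out.println("Writer: " + writerMax + " open files");                       // 110
    }
}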


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




RE: Lucene Benchmarks and Information

2002-12-20 Thread Armbrust, Daniel C.
Question about the behavior of 
IndexWriter.addIndexes()

What I observed (and I know I should just go look at the source but) was that if I 
used this method to add several indexes into a new blank index, optimize was not 
called when it was done.  If I added several indexes into an existing index, then when 
it finished merging them, it called optimize.

Is this what occurs?

Assuming yes, then the best performance would be to make indexes in a ram directory, 
write them to a blank index on disk, and then when you're all done, merge all of the 
disk directories in one swoop to avoid optimizing multiple times?  I'll have to give 
it a try.  
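
A minimal sketch of that approach (field names, paths, and the batch size are made up; 
it assumes the 1.4-era IndexWriter and RAMDirectory APIs):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class RamBatchIndexer {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // On-disk index that the in-memory batch gets merged into.
        IndexWriter diskWriter = new IndexWriter("/indexes/part0", analyzer, true);

        // Build a batch entirely in memory.
        Directory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
        for (int i = 0; i < 10000; i++) {                   // hypothetical batch size
            Document doc = new Document();
            doc.add(Field.Text("title", "title " + i));
            doc.add(Field.UnStored("body", "body text for document " + i));
            ramWriter.addDocument(doc);
        }
        ramWriter.close();

        // One merge of the in-memory segment onto disk; repeat per batch as needed.
        diskWriter.addIndexes(new Directory[] { ramDir });
        diskWriter.close();
    }
}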

Thanks for the tip.

Dan



-Original Message-
From: Scott Ganyo [mailto:[EMAIL PROTECTED]] 
Sent: Friday, December 20, 2002 12:58 PM
To: Lucene Users List
Subject: Re: Lucene Benchmarks and Information


FYI: The best thing I've found for both increasing speed and reducing 
file handles is to use an IndexWriter on a RamDirectory for indexing and 
then use IndexWriter.addIndexes() to write the result to disk.  This is 
subject to the amount of memory you have available, of course...

Scott



--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




RE: Lucene Benchmarks and Information

2002-12-23 Thread Armbrust, Daniel C.
I ran some more query sets on the various size indexes, this time the queries 
contained 5 phrases of up to 5 words each.  While the queries took a lot longer to run (as 
expected), the speed per query still grew linearly with the index size.  

Dan


-Original Message-
From: Armbrust, Daniel C. [mailto:[EMAIL PROTECTED]] 
Sent: Friday, December 20, 2002 12:55 PM
To: 'Lucene Users List'
Subject: RE: Lucene Benchmarks and Information


The queries were definitely not very intelligently built.  It was a last-minute thing 
I decided to do for the heck of it, as my main reason for thrashing my hard drives in 
this exercise was to make sure I could run the document count up significantly higher 
than what we are currently up to at work.

The operator was chosen randomly, followed by a random field, followed by a random 
word.  I didn't put any phrases in, as I expected the number of hits I got would be 
quite low (since the documents were also randomly generated) but in retrospect, even 
with near 0 results, it would probably be interesting.  Maybe I'll run a couple 
tonight if I get a chance.

-Original Message-
From: Jonathan Reichhold [mailto:[EMAIL PROTECTED]] 
Sent: Friday, December 20, 2002 12:24 PM
To: 'Lucene Users List'
Subject: RE: Lucene Benchmarks and Information


A question on the queries you used.  What sort of distribution of terms
did you use?  I.e. were all the queries single random words, or did you
add in multi-word queries and phrases?

I'm impressed with the results, just want to understand the testing
methodology better.

JR

-----Original Message-
From: Armbrust, Daniel C. [mailto:[EMAIL PROTECTED]] 
Sent: Friday, December 20, 2002 8:57 AM
To: 'Lucene Users List'
Subject: Lucene Benchmarks and Information


I've been running some scalability tests on Lucene over the past couple
of weeks.  While there may be some flaws with some of my methods, I
think they will be useful for people that want an idea as to how Lucene
will scale.  If anyone has any questions about what I did, or wants
clarifications on something, I'll be happy to provide them.

I'll start by filling out the form


  Hardware Environment
* Dedicated machine for indexing: yes
* CPU: 1 2.53 GHz Pentium 4
* RAM: Self-explanatory
* Drive configuration: 100 GB 7200 RPM IDE, 80 GB 7200 RPM IDE

  Software environment
* Java Version: java version "1.3.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1)
Classic VM (build 1.3.1, J2RE 1.3.1 IBM Windows 32 build
cn131-20020403 (JIT enabled: jitc))
* OS Version: Win XP SP1
* Location of index: Local File Systems

  Lucene indexing variables
* Number of source documents: 43,779,000
* Total filesize of source documents: ~350 GB -- never stored
(documents were randomly generated)
* Average filesize of source documents: 8 KB
* Source documents storage location: Generated while indexing, never
written to disk
* File type of source documents: text
* Parser(s) used, if any: None
* Analyzer(s) used: Standard Analyzer
* Number of fields per document: 2
* Type of fields: text, Unstored
* Index persistence: FSDirectory

  Figures
* Time taken (in ms/s as an average of at least 3 indexing runs):
See notes below
* Time taken / 1000 docs indexed: 6.5 seconds/1000, not counting
optimization time.  15 seconds/1000 when optimizing every 100,000
documents, and building an index to ~ 5 million documents.  Above 5
million documents, optimization took too much time.  See notes below.
* Memory consumption: ~ 200 mb
   *  Index Size: 70.7 GB

  Notes
* Notes: The documents were randomly generated on the fly as part of
the indexing process from a list of ~100,000 words, whose average length
was 7.   The documents had 3 words in the title, and 500 words in the
body.

While I was trying to build this index, the biggest limitation of Lucene
that I ran into was optimization.  Optimization kills the indexer's
performance when you get between 3-5 million documents in an index.  On
my Windows XP box, I had to reoptimize every 100,000 documents to keep
from running out of file handles.  While I could build a 5 million
document index in 24 hours... I could only add about another million
over the next 24 hours due to the pain of the optimizer recopying the
entire index over and over again (about 10 GB at this point), and it
would only get worse from there.  So, to build this large of an index, I
built several ~ 5 million document indexes, and then merged them at the
end into a single index.  The second issue (though not really a problem)
was that you have to have at least double the disk space available to
build the index as you need when you are done.  I could have kept
building the index bigger, but I ran out of disk space.  

When I was done building indexes, I ran some qu

RE: Lucene Benchmarks and Information

2002-12-23 Thread Armbrust, Daniel C.
-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED]] 
Sent: Saturday, December 21, 2002 9:36 AM
To: Lucene Users List
Subject: Re: Lucene Benchmarks and Information
[snip]

>IMHO it is a bug and the
>point why Lucene does not scale well on huge collections of documents. I
>am talking about my previous tests when I used live index and concurrent
>query+insert+delete (I wanted to simulate real application).

[snip]

What is your definition of huge?  I have yet to have a problem, and I am running one 
of the biggest indexes that I have seen posted to the mailing list.  I've been very 
impressed with the way that lucene scales.  Apparently I was not on the mailing list 
when you posted these tests.  (I'm still fairly new)


>BTW, your mail is also an answer to previous topic "how often could one
>call optimize()". The method would be called before the index goes to
>production state. And it also means that tests are irrelevant until they
>are made with lower mergeFactor.

[snip]

Maybe "irrelevant" to you, but I didn't intend my exercise to be a benchmark as to how 
fast I could make a Lucene index, as there are a lot of things that I could have done to 
make it faster.  (And I ended up learning several more via the experiment and follow 
up discussion here)  Maybe "Benchmarks" is a bad word to have in the subject.  They 
were done so that 

A.  So I know that there is no limitation (that will affect me) in Lucene (Hardcoded, 
bug, or designwise) as to how many documents can be put into an index.  That's why I 
built this ~43 million document index.  Just to see if I could.

B.  I know the impact on search times of adding more documents

C.  I know I can search this size of an index without running into problems.


I would imagine any benchmark that says I can index x documents this fast is fairly 
irrelevant to anyone else using different hardware, as it varies too much based  on 
disk speed, platform, cpu, doc size, doc format (in my real apps I'm doing xml 
transformations), how dedicated the machine is, jvm, etc etc etc.  

The results were posted to the list so that the question 

"I just found Lucene.  It looks nice, but can it handle 30 (or more) million 
documents?"

can be answered matter of factly to others in the future.  Additionally, it serves as 
a *very* rough guide to the amount of hardware you would need to construct your index 
of X documents in Y amount of time.

Dan

 

--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Commit.lock?

2003-01-30 Thread Armbrust, Daniel C.
What (if anything) would cause lucene to write out a file called commit.lock inside of 
an index?

Thanks, 

Dan


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




RE: Commit.lock?

2003-01-30 Thread Armbrust, Daniel C.
Ignore me... Found it in the FAQ.



-Original Message-
From: Armbrust, Daniel C. [mailto:[EMAIL PROTECTED]] 
Sent: Thursday, January 30, 2003 11:11 AM
To: 'Lucene Users List'
Subject: Commit.lock?


What (if anything) would cause lucene to write out a file called commit.lock inside of 
an index?

Thanks, 

Dan


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Finding out which field caused the hit?

2003-05-27 Thread Armbrust, Daniel C.
Is there a (better) way that I can use to figure out which field in a document caused 
the document to be returned from a query?  Currently, after I do a search across all 
of my fields and documents, I am re-searching each document that had a hit, on each 
field individually, and keeping track of the scores.  The highest-scoring field is 
the one that I credit with returning the document. 

This is fine for a small index, with a small number of fields, but it definitely 
doesn't seem like the correct way to go about getting this information.

Any suggestions would be appreciated, 

Thanks, 

Dan
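
For concreteness, the per-field re-scoring workaround described in the question could be 
sketched roughly like this (the helper and field handling are made up, and it assumes 
IndexSearcher.explain, which may be newer than the Lucene version in use at the time):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class FieldAttribution {
    // Returns the field whose single-field version of the query scores this document highest.
    static String bestField(IndexSearcher searcher, String queryText,
                            String[] fields, int docId) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        String best = null;
        float bestScore = 0.0f;
        for (int i = 0; i < fields.length; i++) {
            Query q = QueryParser.parse(queryText, fields[i], analyzer);
            float score = searcher.explain(q, docId).getValue();
            if (score > bestScore) {
                bestScore = score;
                best = fields[i];
            }
        }
        return best;
    }
}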




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Document scoring

2003-05-30 Thread Armbrust, Daniel C.
I've noticed an oddity in scoring

If I do my search like this:

searcher.search(query, filter, new HitCollector()
    {
        public void collect(int doc, float score)
        {
            tempHits.add(new LuceneHits(doc, score));
        }
    });


I get different scores for the resulting documents than I do if I do my search like 
this:

hits = searcher.search(query, filter);

Both methods return the same number of hits.  I can live with them returning different 
scores, I'm just curious as to why it happens.

Furthermore, the first method returns several scores that are greater than 1.0.  Isn't 
this supposed to be impossible?  The FAQ states that scores range from 0 to 1.

Thanks, 

Dan






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: parallizing index building

2003-06-30 Thread Armbrust, Daniel C.
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#addIndexes(org.apache.lucene.store.Directory[])

IndexWriter.addIndexes(Directory [])


-Original Message-
From: Lixin Meng [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 30, 2003 2:11 PM
To: Lucene Users List
Subject: RE: parallizing index building


Where can I find any sample code or documentation about merging a set of
small indexes into one big index?

Lixin

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing very large sets (10 million docs)

2003-07-28 Thread Armbrust, Daniel C.
We are currently doing something similar here.

We have upwards of 15 million documents in our index.

There has been a lot of discussion on this in the past... But I'll give a few details:

My current technique for indexing very large amounts of data is to 

Set the merge factor to 90
Leave the maxMergeDocs alone (lucene default)

Index 100,000 (the number you use depends on the filesize/number of fields in the 
documents... If you have a lot of fields, you will run out of file handles sooner I 
believe) documents into an index - no optimize, just index them in.  Close the index.  
Open a new index, write 100,000 docs into this one, etc.  Continue.  If you have more 
than one machine/more than one disk drive, running these processes in parallel helps a 
lot.

After you have a whole bunch of indexes of size 100,000 documents, then do a merge of 
all of the indexes.  You never have to call optimize - it will be optimized when it is 
merged.

I have written an entire wrapper around lucene which handles all of this for me... 
Though I did it at work and I don't know if I can release the code.  I think a couple 
of other people have done similar things, and have released the code.

Dan
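
Since that wrapper code isn't available, here is a rough, hypothetical sketch of the 
indexing loop described above (1.4-era API, where mergeFactor is a public field on 
IndexWriter; the paths, batch size, and document source are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class PartitionedIndexer {
    // Hypothetical source of documents; returns null when there is nothing left.
    static Document nextDocument() { return null; }

    public static void main(String[] args) throws Exception {
        int docsPerIndex = 100000;
        int part = 0;
        Document doc;

        while ((doc = nextDocument()) != null) {
            IndexWriter writer = new IndexWriter("/indexes/part" + part,
                    new StandardAnalyzer(), true);
            writer.mergeFactor = 90;   // public field in the 1.4-era API; newer versions use setMergeFactor()
            int count = 0;
            do {
                writer.addDocument(doc);
                count++;
            } while (count < docsPerIndex && (doc = nextDocument()) != null);
            writer.close();            // no optimize(); everything is merged at the very end
            part++;
        }
        // Finish by merging /indexes/part0, part1, ... with IndexWriter.addIndexes(Directory[]).
    }
}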

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing very large sets (10 million docs)

2003-07-28 Thread Armbrust, Daniel C.
Oh, and you may be short on disk space.  You must have double the amount of disk space 
available as the end size of your index to call optimize - you may get by with less 
disk space if you just do a single merge - never calling optimize - but I'm not sure about 
this.  

Our index of 15 million documents is over 100 GB.  But the index size will vary a lot 
depending on how many of the fields you store (vs. just index) and how big your 
documents are.

Dan


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing very large sets (10 million docs)

2003-07-28 Thread Armbrust, Daniel C.
I would say that something definitely went wrong to make your index that big that 
early - now that I saw you are only storing one field.

Even if you make your indexes partitioned at 2.5 instead of 10 million (which you 
probably don't need to do) I would still recommend that you lower your mergeFactor - 
when I was testing, I had the best performance on my machine at 90.  It will probably 
vary a little bit based on how fast your disk is, etc. 

When you have it set that high, what you end up doing is using a lot of memory - and a 
lot of file handles (which you noticed).  And then pausing for a good while while 
you do all of your disk writes at once.  At a lower number, you will be able to take 
advantage of hardware-level write buffering, etc. by doing more smaller writes, and get 
better performance.  At least that was my experience.  I have also used the RAM index 
- and found its performance to be _worse_ than a properly configured file index - 
because the RAM index ends up writing the whole thing out to disk at once.  And it's 
even worse if you are merging it with a master index every X documents, rather than 
just writing it out to a blank spot on disk, because you really end up repeatedly 
reading and writing the older parts of the index.

Also - Optimizing kills your indexing speed performance... What you are doing is 
stopping indexing, and then reading the entire index in, and writing it out to a new 
set of files.  So, this gets slower and slower as your index gets larger.

If you set your mergeFactor back down to something closer to the default (10) - you 
probably wouldn't have any problems with file handles.  The higher you make it, the 
more open files you will have.  When I set it at 90 for performance reasons, I would 
run out of file handles (on XP) somewhere after 100,000 documents.  So, I simply 
create a new index every 100,000 documents.  This way, I get the best of both worlds.  
Performance while indexing from a relatively high mergefactor, and non-excessive file 
handle usage (without calling optimize), from closing the index before it gets huge.  
I am never recopying the index over itself until the last step, when I merge all of 
the indexes into one master index.  

Plus, with multiple (relatively) small indexes, you can keep them on different disk 
drives - and when you do your final merge, read them from one drive, and write them to 
another to help get around the issue of needing double the disk space free to do the 
merging/optimizing.  And if you make a mistake along the way (which I am prone to) or 
a sysadmin kills one of your processes or something like that - you only end up with 
one small corrupt index, which can be quickly rebuilt, instead of your master index.  

Dan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing very large sets (10 million docs)

2003-07-28 Thread Armbrust, Daniel C.
Execute 'ulimit -f' to see what your current limit is... And then change appropriately 
after reading the man pages.  My redhat machines come up with an unlimited file size 
limit.  I don't know what the real limit is of an "unlimited" limit - but I haven't 
found it yet

Dan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Query creation

2003-12-04 Thread Armbrust, Daniel C.
Is it possible to create a query that would find a match in a document if and only if 
the query (a one word query) matched with the first word in the field I am searching?

Or do I have to rebuild my indexes, with a field that only contains the first word?

Thanks, 

Dan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: build a case insensitive index

2003-12-12 Thread Armbrust, Daniel C.
I believe that if you enter an identical document twice, when you search, you will get 
it back twice.  If you don't want duplicate results, I think you will need to keep a 
hashset of the terms you have already indexed, and not add the document if the 
lowercase values are equal (or something along those lines).

Dan
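
A small sketch of that kind of de-duplication, assuming one term per document as in this 
case (the field name and input terms are made up):

import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class DedupIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/indexes/terms", new StandardAnalyzer(), true);
        Set seen = new HashSet();   // lowercased terms that have already been indexed

        String[] terms = { "Term", "term", "Lucene" };   // hypothetical input
        for (int i = 0; i < terms.length; i++) {
            String key = terms[i].toLowerCase();
            if (!seen.add(key)) {
                continue;           // "Term" is skipped once "term" has been added
            }
            Document doc = new Document();
            doc.add(Field.Text("term", terms[i]));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }
}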

-Original Message-
From: Thomas Krämer [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 11, 2003 3:01 PM
To: Lucene Users List
Subject: build a case insensitive index


Hello Lucene Users

I need a document term matrix to initialize a neural network, that I 
want to use to integrate user feedback in the retrieval process.

until now, i am using a slightly modified class of the IndexHTML example.

How can I create an index of all the terms in a collection without 
"term" and "Term" being indexed twice?

In the example, a standard analyzer is used, and in the documentation it 
says:


Filters StandardTokenizer with StandardFilter, LowerCaseFilter and 
StopFilter.

So, why do I get double entries for terms in upper- and lowercase spelling?


Regards.

Thomas


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Result scoring question

2004-04-14 Thread Armbrust, Daniel C.
I know that the Lucene scoring algorithm is pretty complicated, and I know I don't 
understand all the pieces.  But given these documents:

A) -  left renal calculus
B) -  renal calculus

Should a query of 

other_designation:("renal calculus") OR preferred_designation:("renal calculus")

Score document B higher than document A?

Those documents are a made up example.  Here are the documents and scores I am getting 
back from the query on my real index:

Score 1.0 - Document Text Unindexed 
Text Keyword 
Keyword>

Score 0.85714287 - Document 
Keyword Text Unindexed 
Text 
Text>

Score 0.7409672 - Document Text Unindexed Text Keyword 
Keyword>


Am I just making a dumb mistake somewhere?

Thanks, 

Dan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Result scoring question

2004-04-14 Thread Armbrust, Daniel C.
I should have remembered that.

Here are the 3 explanations for the top 3 documents returned (contents below)

3.3513687 = product of:
  6.7027373 = weight(preferred_designation:"renal calculus" in 48270), product of:
0.8114604 = queryWeight(preferred_designation:"renal calculus"), product of:
  18.88021 = idf(preferred_designation: renal= calculus=37)
  0.04297941 = queryNorm
8.260092 = fieldWeight(preferred_designation:"renal calculus" in 48270), product 
of:
  1.0 = tf(phraseFreq=1.0)
  18.88021 = idf(preferred_designation: renal= calculus=37)
  0.4375 = fieldNorm(field=preferred_designation, doc=48270)
  0.5 = coord(1/2)

2.8726017 = product of:
  5.7452035 = weight(preferred_designation:"renal calculus" in 514631), product of:
0.8114604 = queryWeight(preferred_designation:"renal calculus"), product of:
  18.88021 = idf(preferred_designation: renal= calculus=37)
  0.04297941 = queryNorm
7.080079 = fieldWeight(preferred_designation:"renal calculus" in 514631), product 
of:
  1.0 = tf(phraseFreq=1.0)
  18.88021 = idf(preferred_designation: renal= calculus=37)
  0.375 = fieldNorm(field=preferred_designation, doc=514631)
  0.5 = coord(1/2)

2.4832542 = product of:
  4.9665084 = weight(other_designation:"renal calculus" in 481129), product of:
0.58440757 = queryWeight(other_designation:"renal calculus"), product of:
  13.5973835 = idf(other_designation: renal=8560 calculus=971)
  0.04297941 = queryNorm
8.498364 = fieldWeight(other_designation:"renal calculus" in 481129), product of:
  1.0 = tf(phraseFreq=1.0)
  13.5973835 = idf(other_designation: renal=8560 calculus=971)
  0.625 = fieldNorm(field=other_designation, doc=481129)
  0.5 = coord(1/2) 


Is there anything that I can do in my query construction, to ensure that if a query 
exactly matches a document, it will be the top result?

Thanks, 

Dan


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 14, 2004 12:17 PM
To: Lucene Users List
Subject: Re: Result scoring question

Try using IndexSearcher.explain (and then a toString on the resulting 
Explanation object) to see the details of why things are scoring how 
they are.  This can be most enlightening!

Erik


On Apr 14, 2004, at 12:16 PM, Armbrust, Daniel C. wrote:

> I know that the lucene scoring algorithm is pretty complicated, I know 
> I don't understand all the pieces.  But given these documents:
>
> A) -  left renal calculus
> B) -  renal calculus
>
> Should a query of
>
> other_designation:("renal calculus") OR preferred_designation:("renal 
> calculus")
>
> Score document B higher than document A?
>
> Those documents are a made up example.  Here are the documents and 
> scores I am getting back from the query on my real index:
>
> Score 1.0 - Document 
> Text diverticulum> Unindexed Text 
> Keyword 
> Keyword>
>
> Score 0.85714287 - 
> Document 
> Keyword Text 
> Unindexed Text in a solitary left kidney> Text>
>
> Score 0.7409672 - Document 
> Text Unindexed 
> Text Keyword 
> Keyword>
>
>
> Am I just making a dumb mistake somewhere?
>
> Thanks,
>
> Dan
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Result scoring question

2004-04-15 Thread Armbrust, Daniel C.
Thanks for the advice.

I created a class to extend DefaultSimilarity, and made it return 10 for the idf 
value.  (I don't really have any data to back up picking 10, other than it seems to 
work)

This did indeed cause my exact matches to float up to the top.  Your explanation 
makes sense, because for this particular query, there were only 2 documents in the 
index that contained the words "renal calculus" in the preferred_designation field 
while there were hundreds that  contained those words in the other_designation field.

I'll keep testing it to make sure that nothing odd happens in other searches now, but 
it seems good so far.

Thanks, 

Dan
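
For reference, a minimal sketch of that kind of constant-idf Similarity (the constant 10 
is just the empirical value mentioned above; the index path is made up):

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;

public class ConstantIdfSimilarity extends DefaultSimilarity {
    // Return a constant instead of the usual inverse-document-frequency weight.
    public float idf(int docFreq, int numDocs) {
        return 10.0f;   // purely empirical value
    }

    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/indexes/master");   // hypothetical path
        searcher.setSimilarity(new ConstantIdfSimilarity());
        // ... run queries as usual; idf no longer favours the field where the terms are rarer
    }
}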



 
-Original Message-
From: Ype Kingma [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 15, 2004 2:00 AM
To: Lucene Users List
Subject: Re: Result scoring question


It seems that the problem is in the idf weights.
Try using a scorer that returns a constant for the idf.
You can inherit all the default behaviour and only override the idf().

The idf weights are established for Lucene terms, which are a combination
of a field and a text term. If a text term occurs infrequently in one field, it
will score higher than in a field in which it occurs frequently.
(idf means inverse document frequency).
My guess is this is what's happening here.


Good luck,
Ype


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene shouldn't use java.io.tmpdir

2004-07-09 Thread Armbrust, Daniel C.
The problem I ran into the other day with the new lock location is that Person A had 
started an index, ran into problems, erased the index and asked me to look at it.  I 
tried to rebuild the index (in the same place on a Solaris machine) and found out that 
A) - her locks still existed, B) - I didn't have a clue where it put the locks on the 
Solaris machine (since no full path was given with the error - has this been fixed?) 
and C) - I didn't have permission to remove her locks.

I think the locks should go back in the index, and we should fall back or give an 
option to put them elsewhere for the case of the read-only index.

Dan 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Index Size

2004-08-19 Thread Armbrust, Daniel C.
Have you tried looking at the contents of this small index with Luke, to see what 
actually got put into it?  Maybe one of your stored fields is being fed something you 
didn't expect.

Dan 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene 1.4.1 not listed on jakarta downloads page

2004-08-31 Thread Armbrust, Daniel C.
FYI
I was able to find Lucene 1.4.1 here:  
http://cvs.apache.org/dist/jakarta/lucene/v1.4.1/

But if I go here:
http://jakarta.apache.org/site/binindex.cgi

1.4 is the only lucene download option available.

Dan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Compound File Format question

2004-09-08 Thread Armbrust, Daniel C.
Is it safe to change the compound file format option at any time during the life of an 
index?

Can I build an index with it off, then turn it on, and call optimize, and have a 
compound file formatted index?

And then later, turn it on, call optimize again, and go back the other way?

The JavaDocs don't say much of anything about it (oh - and PS - there is a copy and 
paste error in the description for the getUseCompoundFile() method)

Thanks, 

Dan


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Compound File Format question

2004-09-08 Thread Armbrust, Daniel C.
Hmm, I tried that in Luke - but it doesn't seem to take.  When I uncheck the use 
compound file check box, and then select optimize, it doesn't change anything.

I guess I should just write some code already :)

Dan
 

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 08, 2004 2:37 PM
To: Lucene Users List
Subject: Re: Compound File Format question

Armbrust, Daniel C. wrote:

> Is it safe to change the compound file format option at any time during the life of 
> an index?
> 
> Can I build an index with it off, then turn it on, and call optimize, and have a 
> compound file formatted index?
> 
> And then later, turn it on, call optimize again, and go back the other way?

In my experience it's safe. I've been doing this in a couple of real 
applications, and also in Luke there is an option to re-pack the index 
using compound or not.

-- 
Best regards,
Andrzej Bialecki

-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Compound File Format question

2004-09-08 Thread Armbrust, Daniel C.
Ahh - two new discoveries:

You have to add a document, remove a document, and then call optimize.   Then 
everything works (nearly as expected)

The version of Lucene that ships with Luke still has the broken optimize code in it 
that didn't clean up after itself - so you need to just download Luke, and then run it 
with 1.4.1 of Lucene, rather than what it ships with (which the website indicates is 
1.4 RC4)


Dan
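
Roughly, the add/remove/optimize recipe described above looks like this (a sketch against 
the 1.4-era API; the index path and dummy field are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class RepackIndex {
    public static void main(String[] args) throws Exception {
        String path = "/indexes/master";   // hypothetical index location

        // Add a throwaway document so the index is no longer "already optimized".
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        Document dummy = new Document();
        dummy.add(Field.Keyword("dummy_id", "dummy"));
        writer.addDocument(dummy);
        writer.close();

        // Remove it again.
        IndexReader reader = IndexReader.open(path);
        reader.delete(new Term("dummy_id", "dummy"));
        reader.close();

        // Optimize with the desired format; the whole index gets rewritten in that format.
        writer = new IndexWriter(path, new StandardAnalyzer(), false);
        writer.setUseCompoundFile(true);   // or false to unpack
        writer.optimize();
        writer.close();
    }
}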

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: "Orphan" segment files

2004-10-04 Thread Armbrust, Daniel C.
There was a broken version of Lucene in there - (I think the 1.4 release?) which was 
not cleaning up old files after you did an optimize in certain cases.   For me, 
upgrading to 1.4.1, and re-optimizing automatically cleaned up the index.

You may have to add and remove a "dummy" document first, so the optimize actually 
occurs (if your index is currently optimized)

Dan



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing process causes Tomcat to stop working

2004-10-27 Thread Armbrust, Daniel C.
So, are you creating the indexes from inside the tomcat runtime, or are you creating 
them on the command line (which would be in a different runtime than tomcat)?

What happens to tomcat?  Does it hang - still running but not responsive?  Or does it 
crash?  

If it hangs, maybe you are running out of memory.  By default, Tomcat's limit is set 
pretty low...

There is no reason at all you should have to reboot... If you stop and start tomcat, 
(make sure it actually stopped - sometimes it requires a kill -9 when it really gets 
hung) it should start working again.  Depending on your setup of Tomcat + apache, you 
may  have to restart apache as well to get them linked to each other again...

Dan




-Original Message-
From: James Tyrrell [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 27, 2004 10:49 AM
To: [EMAIL PROTECTED]
Subject: RE: Indexing process causes Tomcat to stop working

Aad,
  D'oh forgot to mention that mildly important info. Rather than 
re-index I am just creating a new index each time, this makes things easier 
to roll-back etc (which is what my boss wants). the command line is 
something like  I 
have wondered about whether sessions could be a problem, but I don't think 
so, otherwise wouldn't a restart of Tomcat be sufficient rather than a 
reboot? I even tried the killall command on java & tomcat then started 
everything again to no avail.

cheers,

JT



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



IndexWriter Constructor question

2004-10-27 Thread Armbrust, Daniel C.
Wouldn't it make more sense if the constructor for the IndexWriter always created an 
index if one doesn't exist - and the boolean parameter were "clear" (instead of 
"create")?

So instead of this (from javadoc):

IndexWriter

public IndexWriter(Directory d,
   Analyzer a,
   boolean create)
throws IOException

Constructs an IndexWriter for the index in d. Text will be analyzed with a. If 
create is true, then a new, empty index will be created in d, replacing the index 
already there, if any.

Parameters:
d - the index directory
a - the analyzer to use
create - true to create the index or overwrite the existing one; false to append 
to the existing index 
Throws:
IOException - if the directory cannot be read/written to, or if it does not exist, 
and create is false


We would have this:

IndexWriter

public IndexWriter(Directory d,
   Analyzer a,
   boolean clear)
throws IOException

Constructs an IndexWriter for the index in d. Text will be analyzed with a. If 
clear is true, and a index exists at location d, then it will be erased, and a new, 
empty index will be created in d.

Parameters:
d - the index directory
a - the analyzer to use
clear - true to overwrite the existing one; false to append to the existing index 
Throws:
IOException - if the directory cannot be read/written to, or if it does not exist.



Its current behavior is kind of annoying, because I have an app that should never 
clear an existing index, it should always append.  So I want create set to false.  But 
when I am starting a brand new index, then I have to change the create flag to keep it 
from throwing an exception...  I guess for now I will have to write code to check if an 
index actually has content yet, and if it doesn't, change the flag on the fly.
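
A small sketch of that check (assuming IndexReader.indexExists, which reports whether an 
index already lives at a given path; the class and method names are made up):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class AppendingWriterFactory {
    // Opens a writer that appends when an index is already there and creates one otherwise,
    // without ever clobbering existing content.
    public static IndexWriter open(String path) throws IOException {
        boolean create = !IndexReader.indexExists(path);
        return new IndexWriter(path, new StandardAnalyzer(), create);
    }
}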

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing process causes Tomcat to stop working

2004-10-28 Thread Armbrust, Daniel C.
You want version 1.4.2, not version 1.4.

The website makes it hard to find 1.4.2, because the mirrors have not been updated yet.

Get 1.4.2 here:  http://cvs.apache.org/dist/jakarta/lucene/v1.4.2/
 

>My queries do use sorting! So I have placed the 1.4 final jar onto my 
>classpath and have started 'another' index, as the company I work for is 
>moving home tomorrow may not be able to tell you if that worked till next 
>week mind.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: What is the best file system for Lucene?

2004-11-30 Thread Armbrust, Daniel C.
You may want to give the IBM JVM a try - I've found it faster in some cases...

http://www-106.ibm.com/developerworks/java/jdk/linux140/


Dan 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: What is the best file system for Lucene?

2004-11-30 Thread Armbrust, Daniel C.
As I understand hyperthreading, this is not true: 

>Also, unless you take your hyperthreading off, with just one index you are
>searching with just one half of the CPU - so your desktop is actually using
>a 1.5GHz CPU for the search.

You still have the full speed of the processor available - the processor itself 
just keeps switching between different threads of execution.  Some people have 
noted that some (single threaded) applications will run 5-10% slower when 
hyperthreading is turned on - but that depends on the app.  It certainly won't 
be running at half speed.

Dan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]