Re: Running OutOfMemory while optimizing and searching

2004-07-02 Thread James Dunn
Ah yes, I don't think I made that clear enough.  From
Mark's original post, I believe he mentioned that he
used separate readers for each simultaneous query.

His other issue was that he was getting an OOM during
an optimize, even when he set the JVM heap to 2GB.  He
said his index was about 10.5GB spread over ~7000
files on Linux.  

My guess is that OOM might actually be a "too many
open files" error.  I have seen that type of error
being reported by the JVM as an OutOfMemory error on
Linux before.  I had the same problem, but since
switching to the new Lucene compound file format I
haven't seen it again.  

Mark, have you tried switching to the compound file
format?  
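
In case it's useful, here's roughly what turning it on looks like
(just a sketch -- the index path is an example, and 1.4 already
defaults to the compound format, assuming your version exposes
setUseCompoundFile):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CompoundFormatSketch {
    public static void main(String[] args) throws Exception {
        // false = open the existing index rather than create a new one
        IndexWriter writer = new IndexWriter("/data/index", new StandardAnalyzer(), false);
        writer.setUseCompoundFile(true);  // pack each segment into a single .cfs file
        writer.optimize();                // rewrites existing segments so they get packed too
        writer.close();
    }
}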

Jim




--- Doug Cutting <[EMAIL PROTECTED]> wrote:
>  > What do your queries look like?  The memory
> required
>  > for a query can be computed by the following
> equation:
>  >
>  > 1 Byte * Number of fields in your query * Number
> of
>  > docs in your index
>  >
>  > So if your query searches on all 50 fields of
> your 3.5
>  > Million document index then each search would
> take
>  > about 175MB.  If your 3-4 searches run
> concurrently
>  > then that's about 525MB to 700MB chewed up at
> once.
> 
> That's not quite right.  If you use the same
> IndexSearcher (or 
> IndexReader) for all of the searches, then only
> 175MB are used.  The 
> arrays in question (the norms) are read-only and can
> be shared by all 
> searches.
> 
> In general, the amount of memory required is:
> 
> 1 byte * Number of searchable fields in your index *
> Number of docs in 
> your index
> 
> plus
> 
> 1k bytes * number of terms in query
> 
> plus
> 
> 1k bytes * number of phrase terms in query
> 
> The latter are for i/o buffers.  There are a few
> other things, but these 
> are the major ones.
> 
> Doug
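
For what it's worth, a rough sketch of the sharing Doug describes --
one IndexSearcher held for the life of the application and reused by
every query (the class name and index path below are illustrative):

import java.io.IOException;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SharedSearcher {
    // A single searcher shared by all threads; searching is thread-safe,
    // so the norms arrays are loaded once rather than once per query.
    private static IndexSearcher searcher;

    public static synchronized IndexSearcher get() throws IOException {
        if (searcher == null) {
            searcher = new IndexSearcher("/data/index");  // example path
        }
        return searcher;
    }

    public static Hits search(Query query) throws IOException {
        return get().search(query);
    }
}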

Re: Question about remote searching

2004-07-02 Thread James Dunn
You could wrap your RemoteSearchable in a
MultiSearcher and then search against that to get your
Hits object:

//Get a handle on the remote searchable
Searchable remoteSearchable = (Searchable) Naming.lookup(serverName);

//Create an array of searchables, for use with MultiSearcher
Searchable[] searchables = new Searchable[1];
searchables[0] = remoteSearchable;

//Create the searcher
MultiSearcher searcher = new MultiSearcher(searchables);

//Build query
Query q = ...;

Hits hits = searcher.search(q);
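
On the server side it would look something like this (a sketch -- it
assumes an RMI registry is already running, and the bind name just has
to match whatever name the client passes to Naming.lookup):

import java.rmi.Naming;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RemoteSearchable;

public class SearchServer {
    public static void main(String[] args) throws Exception {
        IndexSearcher local = new IndexSearcher("/data/index");  // example path
        RemoteSearchable remote = new RemoteSearchable(local);   // RMI-exportable wrapper
        Naming.rebind("//localhost/Searchable", remote);         // name the client looks up
        System.out.println("RemoteSearchable bound");
    }
}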

--- Cocula Remi <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> I am trying to do remote searching via RMI.
> In a first step I wrote my own remote search method
> that should return results as an object of type
> Hits.
> But it does not work as the Hit class is not
> Serializable.
> Then I took a look at the RemoteSearchable class and
> realized that it implements search using the low
> level API (ie:  public void search(Query query,
> Filter filter, HitCollector results)).
> 
> Elsewhere in Lucene source code I read that using
> the high level API (the one that deals with Hits) is
> much more efficient.
> 
> Question : would it be possible to make the Hit
> class Serializable so it could be used through RMI
> mechanisms ?



Re: Sort in Cocoon/XSP

2004-07-02 Thread James Dunn
I think the default Cocoon build includes an older
version of Lucene, 1.2 maybe.  There may be a library
conflict on your machine.

Hope that helps,

Jim

--- Rob Clews <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> I'm experiencing a problem with Cocoon, and whilst I
> know this list is
> for Lucene I wonder if you can help me with a sort
> of related problem.
> 
> When I try and import the Sort class I get the
> following:
> 
> // start error (lines 69-69) "Only a type can be
> imported. org.apache.
> lucene.search.Sort resolves to a package"
> 
> Nothing on Google really gives a solution to this. I
> have lucene 1.4 rc3
> jar in $COCOON_HOME/WEB-INF/lib.
> 
> Thanks
> -- 
> Rob Clews
> Klear Systems Ltd
> t: +44 (0)121 707 8558 e: [EMAIL PROTECTED]



RE: Running OutOfMemory while optimizing and searching

2004-06-29 Thread James Dunn
Mark,

What do your queries look like?  The memory required
for a query can be computed by the following equation:

1 Byte * Number of fields in your query * Number of
docs in your index

So if your query searches on all 50 fields of your 3.5
Million document index then each search would take
about 175MB.  If your 3-4 searches run concurrently
then that's about 525MB to 700MB chewed up at once.

Also, if your queries use wildcards, the memory
requirements could be much greater.  

Hope that helps,

Jim
--- Mark Florence <[EMAIL PROTECTED]> wrote:
> Otis, Thanks for considering this problem.
> 
> I'm using all the default parameters -- and still
> scratching my head!
> 
> I can't see any that would control memory usage.
> Plus, a 2GB heap is
> quite big. I see others have indexes bigger than
> mine, so I'm not sure
> how to find out why mine is throwing OutOfMemory --
> not only on the
> optimize, but when 3-4 searchers are running, too.
> 
> -- Mark
> 
> -Original Message-
> From: Otis Gospodnetic
> [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, June 29, 2004 01:02 am
> To: Lucene Users List
> Subject: Re: Running OutOfMemory while optimizing
> and searching
> 
> 
> Mark,
> 
> Tough situation.  I hate when things like this
> happen on production :(.
>  You are not mentioning what you are using for
> various IndexWriter
> parameters.  You may be able to get this working by
> tweaking them (see
>
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#field_summary).
> Hm, now that I think about it, I am not sure if
> those are considered
> during index optimization.  I'll try checking the
> sources later.
> 
> Otis
> 
> --- Mark Florence <[EMAIL PROTECTED]> wrote:
> > Hi, I'm using Lucene to index ~3.5M documents,
> over about 50 fields.
> > The Lucene
> > index itself is ~10.5GB, spread over ~7,000 files.
> Some of these
> > files are
> > "large" -- that is, several PRX files are ~1.5GB.
> >
> > Lucene runs on a dedicated server (Linux on a 1Ghz
> Dell, with 1GB
> > RAM). Clients
> > on other machines use RMI to perform reads /
> writes. Each night the
> > server
> > automatically performs an optimize.
> >
> > The problem is that the optimize now dies with an
> OutOfMemory
> > exception, even
> > when the JVM heap size is set to its maximum of
> 2GB. I need to
> > optimize, because
> > as the number of Lucene files grows, search
> performance becomes
> > unacceptable.
> >
> > Search performance is also adversely affected
> because I've had to
> > effectively
> > single-thread reads and writes. I was using a
> simple read / write
> > lock
> > mechanism, allowing multiple readers to
> simultaneously search, but
> > now more than
> > 3-4 simultaneous readers will also cause an
> OutOfMemory condition.
> > Searches can
> > take as long as 30-40 seconds, and with
> single-threading, that's
> > crippling the
> > main client application.
> >
> > Needless to say, the Lucene index is
> mission-critical, and must run
> > 24/7.
> >
> > I've seen other posts along this same vein, but no
> definite
> > consensus. Is my
> > problem simply inadequate hardware? Should I run
> on a 64-bit
> > platform, where I
> > can allocate a Java heap of > 2GB?
> >
> > Or could there be something fundamentally "wrong"
> with my index? I
> > should add
> > that I've just spent about a week (!!) rebuilding
> from scratch, over
> > all 3.5M
> > documents.
> >
> > -- Many thanks for any help! Mark Florence
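
Regarding the IndexWriter parameters Otis mentions above, here is a
rough sketch of adjusting them (field names as in the javadocs he
links to; the values are only illustrative, and as he says it isn't
clear how much they affect optimize() itself):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class WriterTuningSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/data/index", new StandardAnalyzer(), false);
        writer.mergeFactor = 10;                  // segments that accumulate before a merge
        writer.minMergeDocs = 100;                // docs buffered in memory before a segment is flushed
        writer.maxMergeDocs = Integer.MAX_VALUE;  // upper bound on the size of a merged segment
        // ... add documents / optimize here ...
        writer.close();
    }
}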


Re: To anyone who has used Luke

2004-06-25 Thread James Dunn
I've seen that too.  I'm pretty sure either Lucene is
creating it, or it's an issue with Luke.

Jim
--- Don Vaillancourt <[EMAIL PROTECTED]> wrote:
> Hello All,
> 
> I'm using Luke, the software that someone mentioned
> before.  It's great for 
> debugging.  My question to anyone who has used it. 
> Under the overview tab 
> in the available fields column on the left is listed
> all the columns that I 
> wanted indexed and/or stored and/or tokenized.
> 
> I only defined 4 fields to get
> indexed/stored/tokenized, but Luke is 
> showing 5, the last one being blank.
> 
> I'm thinking that maybe there's a bug in my code
> that's doing an extra 
> column, but I just wanted to verify that Luke wasn't
> showing an extra 
> column or that maybe Lucene was creating the extra
> column for internal use.
> 
> Thanks
> 
> Don Vaillancourt
> Director of Software Development
> 
> WEB IMPACT INC.
> 416-815-2000 ext. 245
> email: [EMAIL PROTECTED]
> web: http://www.web-impact.com



Re: Compound file format file size question

2004-06-18 Thread James Dunn
Otis,

Thanks for the response.

Yeah, I was copying the file to a brand new hard-drive
and it was formatted to FAT32 by default, which is
probably why it couldn't handle the 13GB file.  

I'm converting the drive to NTFS now, which should get
me through temporarily.  In the future though, I may
break the index up into smaller sub-indexes so that I
can distribute them across separate physical disks for
better disk IO.
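
Roughly what I have in mind for the split (a sketch -- the mount
points and class name are just examples):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

public class SplitIndexSearch {
    public static MultiSearcher open() throws IOException {
        // One sub-index per physical disk (example paths)
        Searchable[] parts = new Searchable[] {
            new IndexSearcher("/disk1/index-part1"),
            new IndexSearcher("/disk2/index-part2")
        };
        return new MultiSearcher(parts);  // merges results across the sub-indexes
    }
}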

Thanks for your help!

Jim
--- Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:
> Hello,
> 
> --- James Dunn <[EMAIL PROTECTED]> wrote:
> > Hello all,
> > 
> > I have an index that's about 13GB on disk.  I'm
> using
> > 1.4 rc3 which uses the compound file format by
> > default.
> > 
> > Once I run optimize on my index, it creates one
> 13GB
> .cfs file.  This isn't a problem on Linux (yet),
> but
> > I'm having some trouble copying the file over to
> my
> > Windows XP box.
> 
> What is the exact problem? The sheer size of it or
> something else? 
> Just curious...
> 
> > Is there some way using the compound file format
> to
> > set the maximum file size and have Lucene break
> the
> > index into multiple files once it hits that limit?
> 
> Can't be done with Lucene, but I seem to recall some
> discussion about
> it.  Nothing concrete, though.
> 
> > Or do I need to go back to using the non-compound
> file
> > format?
> 
> The total size should be (about) the same, but you
> could certainly do
> that, if having more smaller files is better for
> you.
> 
> Otis
> 
> > Another solution, I suppose, would be to break up
> my
> index into separate smaller indexes.  This would
> be my
> > second choice, however.
> > 
> > Thanks a lot,
> > 
> > Jim



Compound file format file size question

2004-06-18 Thread James Dunn
Hello all,

I have an index that's about 13GB on disk.  I'm using
1.4 rc3 which uses the compound file format by
default.

Once I run optimize on my index, it creates one 13GB
.cfs file.  This isn't a problem on Linux (yet), but
I'm having some trouble copying the file over to my
Windows XP box.

Is there some way using the compound file format to
set the maximum file size and have Lucene break the
index into multiple files once it hits that limit?

Or do I need to go back to using the non-compound file
format?

Another solution, I suppose, would be to break up my
index into separate smaller indexes.  This would be my
second choice, however.

Thanks a lot,

Jim




Re: Memory usage

2004-05-27 Thread James Dunn
Otis,

My app does run within Tomcat.  But when I started
getting these OutOfMemoryErrors I wrote a little unit
test to watch the memory usage without Tomcat in the
middle, and I still see the same memory growth.

Thanks,

Jim
--- Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:
> Sorry if I'm stating the obvious.  Is this happening
> in some
> stand-alone unit tests, or are you running things
> from some application
> and in some environment, like Tomcat, Jetty or in
> some non-web app?
> 
> Your queries are pretty big (although I recall some
> people using even
> bigger ones... but it all depends on the hardware
> they had), but are
> you sure running out of memory is due to Lucene, or
> could it be a leak
> in the app from which you are running queries?
> 
> Otis
> 
> 
> --- James Dunn <[EMAIL PROTECTED]> wrote:
> > Doug,
> > 
> > We only search on analyzed text fields.  There are
> a
> > couple of additional fields in the index like
> > OBJECT_ID that are keywords but we don't search
> > against those, we only use them once we get a
> result
> > back to find the thing that document represents.
> > 
> > Thanks,
> > 
> > Jim
> > 
> > --- Doug Cutting <[EMAIL PROTECTED]> wrote:
> > > It is cached by the IndexReader and lives until
> the
> > > index reader is 
> > > garbage collected.  50-70 searchable fields is a
> > > *lot*.  How many are 
> > > analyzed text, and how many are simply keywords?
> > > 
> > > Doug
> > > 
> > > James Dunn wrote:
> > > > Doug,
> > > > 
> > > > Thanks!  
> > > > 
> > > > I just asked a question regarding how to
> calculate
> > > the
> > > > memory requirements for a search.  Does this
> > > memory
> > > > only get used only during the search operation
> > > itself,
> > > > or is it referenced by the Hits object or
> anything
> > > > else after the actual search completes?
> > > > 
> > > > Thanks again,
> > > > 
> > > > Jim
> > > > 
> > > > 
> > > > --- Doug Cutting <[EMAIL PROTECTED]> wrote:
> > > > 
> > > >>James Dunn wrote:
> > > >>
> > > >>>Also I search across about 50 fields but I
> don't
> > > >>
> > > >>use
> > > >>
> > > >>>wildcard or range queries. 
> > > >>
> > > >>Lucene uses one byte of RAM per document per
> > > >>searched field, to hold the 
> > > >>normalization values.  So if you search a 10M
> > > >>document collection with 
> > > >>50 fields, then you'll end up using 500MB of
> RAM.
> > > >>
> > > >>If you're using unanalyzed fields, then an
> easy
> > > >>workaround to reduce the 
> > > >>number of fields is to combine many in a
> single
> > > >>field.  So, instead of, 
> > > >>e.g., using an "f1" field with value "abc",
> and an
> > > >>"f2" field with value 
> > > >>"efg", use a single field named "f" with
> values
> > > >>"1_abc" and "2_efg".
> > > >>
> > > >>We could optimize this in Lucene.  If no
> values of
> > > >>an indexed field are 
> > > >>analyzed, then we could store no norms for the
> > > field
> > > >>and hence read none 
> > > >>into memory.  This wouldn't be too hard to
> > > >>implement...
> > > >>
> > > >>Doug



Re: Memory usage

2004-05-26 Thread James Dunn
Doug,

We only search on analyzed text fields.  There are a
couple of additional fields in the index like
OBJECT_ID that are keywords but we don't search
against those, we only use them once we get a result
back to find the thing that document represents.

Thanks,

Jim

--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> It is cached by the IndexReader and lives until the
> index reader is 
> garbage collected.  50-70 searchable fields is a
> *lot*.  How many are 
> analyzed text, and how many are simply keywords?
> 
> Doug
> 
> James Dunn wrote:
> > Doug,
> > 
> > Thanks!  
> > 
> > I just asked a question regarding how to calculate
> the
> > memory requirements for a search.  Does this
> memory
> > only get used only during the search operation
> itself,
> > or is it referenced by the Hits object or anything
> > else after the actual search completes?
> > 
> > Thanks again,
> > 
> > Jim
> > 
> > 
> > --- Doug Cutting <[EMAIL PROTECTED]> wrote:
> > 
> >>James Dunn wrote:
> >>
> >>>Also I search across about 50 fields but I don't
> >>
> >>use
> >>
> >>>wildcard or range queries. 
> >>
> >>Lucene uses one byte of RAM per document per
> >>searched field, to hold the 
> >>normalization values.  So if you search a 10M
> >>document collection with 
> >>50 fields, then you'll end up using 500MB of RAM.
> >>
> >>If you're using unanalyzed fields, then an easy
> >>workaround to reduce the 
> >>number of fields is to combine many in a single
> >>field.  So, instead of, 
> >>e.g., using an "f1" field with value "abc", and an
> >>"f2" field with value 
> >>"efg", use a single field named "f" with values
> >>"1_abc" and "2_efg".
> >>
> >>We could optimize this in Lucene.  If no values of
> >>an indexed field are 
> >>analyzed, then we could store no norms for the
> field
> >>and hence read none 
> >>into memory.  This wouldn't be too hard to
> >>implement...
> >>
> >>Doug



Re: Memory usage

2004-05-26 Thread James Dunn
Doug,

Thanks!  

I just asked a question regarding how to calculate the
memory requirements for a search.  Does this memory
only get used only during the search operation itself,
or is it referenced by the Hits object or anything
else after the actual search completes?

Thanks again,

Jim


--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> James Dunn wrote:
> > Also I search across about 50 fields but I don't
> use
> > wildcard or range queries. 
> 
> Lucene uses one byte of RAM per document per
> searched field, to hold the 
> normalization values.  So if you search a 10M
> document collection with 
> 50 fields, then you'll end up using 500MB of RAM.
> 
> If you're using unanalyzed fields, then an easy
> workaround to reduce the 
> number of fields is to combine many in a single
> field.  So, instead of, 
> e.g., using an "f1" field with value "abc", and an
> "f2" field with value 
> "efg", use a single field named "f" with values
> "1_abc" and "2_efg".
> 
> We could optimize this in Lucene.  If no values of
> an indexed field are 
> analyzed, then we could store no norms for the field
> and hence read none 
> into memory.  This wouldn't be too hard to
> implement...
> 
> Doug
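
A sketch of the workaround Doug describes, for keyword (unanalyzed)
fields only -- the field names here are just illustrative:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class CombinedKeywordFields {
    public static Document build(String f1Value, String f2Value) {
        Document doc = new Document();
        // Instead of two keyword fields (each with its own norms array):
        //   doc.add(Field.Keyword("f1", f1Value));
        //   doc.add(Field.Keyword("f2", f2Value));
        // prefix the values and index them under one shared field:
        doc.add(Field.Keyword("f", "1_" + f1Value));
        doc.add(Field.Keyword("f", "2_" + f2Value));
        return doc;
    }
}

A query would then look for f:1_abc where it previously used f1:abc.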



Re: Memory usage

2004-05-26 Thread James Dunn
Erik,

Thanks for the response.  

My actual documents are fairly small.  Most docs only
have about 10 fields.  Some of those fields are
stored, however, like the OBJECT_ID, NAME and DESC
fields.  The stored fields are pretty small as well. 
None should be more than 4KB and very few will
approach that limit.

I'm also using the default maxFieldLength value of
10,000.  

I'm not caching hits, either.

Could it be my query?  I have about 80 total unique
fields in the index although no document has all 80. 
My query ends up looking like this:

+(F1:test F2:test ..  F80:test)

From previous mails that doesn't look like an enormous
amount of fields to be searching against.  Is there
some formula for the amount of memory required for a
query based on the number of clauses and terms?
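
For reference, the query above gets built roughly like this -- one
TermQuery per field inside a BooleanQuery (the field names and class
name are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class MultiFieldQuerySketch {
    public static BooleanQuery build(String[] fields, String word) {
        BooleanQuery any = new BooleanQuery();
        for (int i = 0; i < fields.length; i++) {
            // optional clause: a hit in any one field is enough
            any.add(new TermQuery(new Term(fields[i], word)), false, false);
        }
        BooleanQuery query = new BooleanQuery();
        query.add(any, true, false);  // the "+( ... )" part: the group itself is required
        return query;
    }
}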

Jim



--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> How big are your actual Documents?  Are you caching
> Hits?  It stores, 
> internally, up to 200 documents.
> 
>   Erik
> 
> 
> On May 26, 2004, at 4:08 PM, James Dunn wrote:
> 
> > Will,
> >
> > Thanks for your response.  It may be an object
> leak.
> > I will look into that.
> >
> > I just ran some more tests and this time I create
> a
> > 20GB index by repeatedly merging my large index
> into
> > itself.
> >
> > When I ran my test query against that index I got
> an
> > OutOfMemoryError on the very first query.  I have
> my
> > heap set to 512MB.  Should a query against a 20GB
> > index require that much memory?  I page through
> the
> > results 100 at a time, so I should never have more
> > than 100 Document objects in memory.
> >
> > Any help would be appreciated, thanks!
> >
> > Jim
> > --- [EMAIL PROTECTED] wrote:
> >> This sounds like a memory leakage situation.  If
> you
> >> are using tomcat I
> >> would suggest you make sure you are on a recent
> >> version, as it is known to
> >> have some memory leaks in version 4.  It doesn't
> >> make sense that repeated
> >> queries would use more memory than the most
> >> demanding query unless objects
> >> are not getting freed from memory.
> >>
> >> -Will
> >>
> >> -Original Message-
> >> From: James Dunn [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, May 26, 2004 3:02 PM
> >> To: [EMAIL PROTECTED]
> >> Subject: Memory usage
> >>
> >>
> >> Hello,
> >>
> >> I was wondering if anyone has had problems with
> >> memory
> >> usage and MultiSearcher.
> >>
> >> My index is composed of two sub-indexes that I
> >> search
> >> with a MultiSearcher.  The total size of the
> index
> >> is
> >> about 3.7GB with the larger sub-index being 3.6GB
> >> and
> >> the smaller being 117MB.
> >>
> >> I am using Lucene 1.3 Final with the compound
> file
> >> format.
> >>
> >> Also I search across about 50 fields but I don't
> use
> >> wildcard or range queries.
> >>
> >> Doing repeated searches in this way seems to
> >> eventually chew up about 500MB of memory which
> seems
> >> excessive to me.
> >>
> >> Does anyone have any ideas where I could look to
> >> reduce the memory my queries consume?
> >>
> >> Thanks,
> >>
> >> Jim



Re: Problem Indexing Large Document Field

2004-05-26 Thread James Dunn
Gilberto,

Look at the IndexWriter class.  It has a property,
maxFieldLength, which you can set to raise the maximum
number of terms that get indexed for a single field.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html
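
A sketch of raising it before indexing (in these versions it is a
public field on the writer; the path and the new value below are only
examples):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class LongDocumentIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/data/index", new StandardAnalyzer(), true);
        writer.maxFieldLength = 1000000;  // default is 10,000 terms per field
        // ... add the large documents here ...
        writer.optimize();
        writer.close();
    }
}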

Jim

--- Gilberto Rodriguez
<[EMAIL PROTECTED]> wrote:
> I am trying to index a field in a Lucene document
> with about 90,000 
> characters. The problem is that it only indexes part
> of the document. 
> It seems to only index about 65,000 characters. So,
> if I search on terms 
> that are at the beginning of the text, the search
> works, but it fails 
> for terms that are at the end of the document.
> 
> Is there a limitation on how many characters can be
> stored in a 
> document field? Any help would be appreciated,
> thanks
> 
> 
> Gilberto Rodriguez
> Software Engineer
>    
> 370 CenterPointe Circle, Suite 1178
> Altamonte Springs, FL 32701-3451
>    
> 407.339.1177 (Ext.112) • phone
> 407.339.6704 • fax
> [EMAIL PROTECTED] • email
> www.conviveon.com • web



RE: Memory usage

2004-05-26 Thread James Dunn
Will,

Thanks for your response.  It may be an object leak. 
I will look into that.

I just ran some more tests and this time I create a
20GB index by repeatedly merging my large index into
itself.

When I ran my test query against that index I got an
OutOfMemoryError on the very first query.  I have my
heap set to 512MB.  Should a query against a 20GB
index require that much memory?  I page through the
results 100 at a time, so I should never have more
than 100 Document objects in memory.  
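
The paging I mean looks roughly like this -- only the documents
actually fetched through hits.doc() get loaded (the stored field and
variable names are illustrative):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class HitsPaging {
    public static void printPage(Hits hits, int start, int pageSize) throws IOException {
        int end = Math.min(start + pageSize, hits.length());
        for (int i = start; i < end; i++) {
            Document doc = hits.doc(i);  // loads (and caches) just this document
            System.out.println(doc.get("OBJECT_ID"));
        }
    }
}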

Any help would be appreciated, thanks!

Jim
--- [EMAIL PROTECTED] wrote:
> This sounds like a memory leakage situation.  If you
> are using tomcat I
> would suggest you make sure you are on a recent
> version, as it is known to
> have some memory leaks in version 4.  It doesn't
> make sense that repeated
> queries would use more memory than the most
> demanding query unless objects
> are not getting freed from memory.
> 
> -Will
> 
> -Original Message-
> From: James Dunn [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 26, 2004 3:02 PM
> To: [EMAIL PROTECTED]
> Subject: Memory usage
> 
> 
> Hello,
> 
> I was wondering if anyone has had problems with
> memory
> usage and MultiSearcher.
> 
> My index is composed of two sub-indexes that I
> search
> with a MultiSearcher.  The total size of the index
> is
> about 3.7GB with the larger sub-index being 3.6GB
> and
> the smaller being 117MB.
> 
> I am using Lucene 1.3 Final with the compound file
> format.
> 
> Also I search across about 50 fields but I don't use
> wildcard or range queries. 
> 
> Doing repeated searches in this way seems to
> eventually chew up about 500MB of memory which seems
> excessive to me.
> 
> Does anyone have any ideas where I could look to
> reduce the memory my queries consume?
> 
> Thanks,
> 
> Jim



Memory usage

2004-05-26 Thread James Dunn
Hello,

I was wondering if anyone has had problems with memory
usage and MultiSearcher.

My index is composed of two sub-indexes that I search
with a MultiSearcher.  The total size of the index is
about 3.7GB with the larger sub-index being 3.6GB and
the smaller being 117MB.

I am using Lucene 1.3 Final with the compound file
format.

Also I search across about 50 fields but I don't use
wildcard or range queries. 

Doing repeated searches in this way seems to
eventually chew up about 500MB of memory which seems
excessive to me.

Does anyone have any ideas where I could look to
reduce the memory my queries consume?

Thanks,

Jim







Re: Preventing duplicate document insertion during optimize

2004-04-30 Thread James Dunn
Kevin,

I have a similar issue.  The only solution I have been
able to come up with is, after the merge, to open an
IndexReader against the merge index, iterate over all
the docs and delete duplicate docs based on my
"primary key" field.

Jim
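Roughly, the cleanup pass looks like this (a sketch -- the key field
parameter is whatever your "primary key" field happens to be called,
and it assumes that field is stored):

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DuplicateRemover {
    public static void removeDuplicates(String indexPath, String keyField) throws Exception {
        IndexReader reader = IndexReader.open(indexPath);
        Set seen = new HashSet();
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) {
                continue;
            }
            Document doc = reader.document(i);
            String key = doc.get(keyField);
            if (!seen.add(key)) {
                reader.delete(i);  // mark the later duplicate deleted; a later optimize reclaims space
            }
        }
        reader.close();
    }
}
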

--- "Kevin A. Burton" <[EMAIL PROTECTED]> wrote:
> Let's say you have two indexes each with the same
> document literal.  All 
> the fields hash the same and the document is a
> binary duplicate of a 
> different document in the second index.
> 
> What happens when you do a merge to create a 3rd
> index from the first 
> two?  I assume you now have two documents that are
> identical in one 
> index.  Is there any way to prevent this?
> 
> It would be nice to figure out if there's a way to
> flag a field as a 
> primary key so that if it has already added it to
> just skip.
> 
> Kevin
> 
> -- 
> 
> Please reply using PGP.
> 
> http://peerfear.org/pubkey.asc
> 
> NewsMonster - http://www.newsmonster.org/
> 
> Kevin A. Burton, Location - San Francisco, CA, Cell
> - 415.595.9965
>AIM/YIM - sfburtonator,  Web -
> http://peerfear.org/
> GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D
> 8D04 99F1 4412
>   IRC - freenode.net #infoanarchy | #p2p-hackers |
> #newsmonster
> 



Re: Problems From the Word Go

2004-04-29 Thread James Dunn
Alex,

Could you send along whatever error messages you are
receiving?

Thanks,

Jim
--- Alex Wybraniec <[EMAIL PROTECTED]>
wrote:
> I'm sorry if this is not the correct place to post
> this, but I'm very
> confused, and getting towards the end of my tether.
> 
> I need to install/compile and run Lucene on a
> Windows XP Pro based machine,
> running J2SE 1.4.2, with ANT.
> 
> I downloaded both the source code and the
> pre-compile versions, and as yet
> have not been able to get either running. I've been
> through the
> documentation, and still I can find little to help
> me set it up properly.
> 
> All I want to do (to start with) is compile and run
> the demo version.
> 
> I'm sorry to ask such a newbie question, but I'm
> really stuck.
> 
> So if anyone can point me to an idiots guide, or
> offer me some help, I would
> be most grateful.
> 
> Once I get past this stage, I'll have all sorts of
> juicer questions for you,
> but at the minute, I can't even get past stage 1
> 
> Thank you in advance
> Alex



Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread James Dunn
Which version of lucene are you using?  In 1.2, I
believe the lock file was located in the index
directory itself.  In 1.3, it's in your system's tmp
folder.  

Perhaps it's a permission problem on either one of
those folders.  Maybe your process doesn't have write
access to the correct folder and is thus unable to
create the lock file?  

You can also pass lucene a system property to increase
the lock timeout interval, like so:

-Dorg.apache.lucene.commitLockTimeout=60000

or 

-Dorg.apache.lucene.writeLockTimeout=60000

The above sets the timeout to one minute.
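
For example, as part of the launch command (the jar and class names
here are just placeholders):

java -Dorg.apache.lucene.writeLockTimeout=60000 \
     -Dorg.apache.lucene.commitLockTimeout=60000 \
     -cp lucene.jar:myapp.jar com.example.SearchApp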

Hope this helps,

Jim

--- "Kevin A. Burton" <[EMAIL PROTECTED]> wrote:
> I've noticed this really strange problem on one of
> our boxes.  It's 
> happened twice already.
> 
> We have indexes where when Lucnes starts it says
> 'Lock obtain timed out' 
> ... however NO locks exist for the directory. 
> 
> There are no other processes present and no locks in
> the index dir or /tmp.
> 
> Is there anyway to figure out what's going on here?
> 
> Looking at the index it seems just fine... But this
> is only a brief 
> glance.  I was hoping that if it was corrupt (which
> I don't think it is) 
> that lucene would give me a better error than "Lock
> obtain timed out"
> 
> Kevin
> 
> -- 
> 
> Please reply using PGP.
> 
> http://peerfear.org/pubkey.asc
> 
> NewsMonster - http://www.newsmonster.org/
> 
> Kevin A. Burton, Location - San Francisco, CA, Cell
> - 415.595.9965
>AIM/YIM - sfburtonator,  Web -
> http://peerfear.org/
> GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D
> 8D04 99F1 4412
>   IRC - freenode.net #infoanarchy | #p2p-hackers |
> #newsmonster
> 
> 




RE: ArrayIndexOutOfBoundsException

2004-04-28 Thread James Dunn
Philippe, thanks for the reply.  I didn't FTP my index
anywhere, but your response does make it seem that my
index is in fact corrupted somehow.

Does anyone know of a tool that can verify the
validity of a Lucene index, and/or possibly repair it?
 If not, anyone have any idea how difficult it would
be to write one?  
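
Not a real tool, but the kind of walk-through check I have in mind
would be roughly this (a sketch; it only exercises the stored fields):

import org.apache.lucene.index.IndexReader;

public class IndexWalkCheck {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);  // path to the index directory
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) {
                    continue;
                }
                reader.document(i);  // throws if the stored fields for this doc can't be read
            }
            System.out.println("Read " + reader.maxDoc() + " documents without error");
        } finally {
            reader.close();
        }
    }
}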

Thanks,

Jim 

--- Phil brunet <[EMAIL PROTECTED]> wrote:
> 
> Hi.
> 
> I had this problem when I transferred a Lucene index
> by FTP in "ASCII" mode. 
> Using binary mode, I never had such a problem.
> 
> Philippe
> 
> >From: James Dunn <[EMAIL PROTECTED]>
> >Reply-To: "Lucene Users List"
> <[EMAIL PROTECTED]>
> >To: [EMAIL PROTECTED]
> >Subject: ArrayIndexOutOfBoundsException
> >Date: Mon, 26 Apr 2004 12:15:39 -0700 (PDT)
> >
> >Hello all,
> >
> >I have a web site whose search is driven by Lucene
> >1.3.  I've been doing some load testing using
> JMeter
> >and occasionally I will see the exception below
> when
> >the search page is under heavy load.
> >
> >Has anyone seen similar errors during load testing?
> >
> >I've seen some posts with similar exceptions and
> the
> >general consensus is that this error means that the
> >index is corrupt.  I'm not sure my index is corrupt
> >however.  I can run all the queries I use for load
> >testing under normal load and I don't appear to get
> >this error.
> >
> >Is there any way to verify that a Lucene index is
> >corrupt or not?
> >
> >Thanks,
> >
> >Jim
> >
> >java.lang.ArrayIndexOutOfBoundsException: 53 >= 52
> > at
> java.util.Vector.elementAt(Vector.java:431)
> > at
>
>org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)
> > at
>
>org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:103)
> > at
>
>org.apache.lucene.index.SegmentReader.document(SegmentReader.java:275)
> > at
>
>org.apache.lucene.index.SegmentsReader.document(SegmentsReader.java:112)
> > at
>
>org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:107)
> > at
>
>org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
> > at
>
>org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
> > at
> >org.apache.lucene.search.Hits.doc(Hits.java:130)



ArrayIndexOutOfBoundsException

2004-04-26 Thread James Dunn
Hello all,

I have a web site whose search is driven by Lucene
1.3.  I've been doing some load testing using JMeter
and occasionally I will see the exception below when
the search page is under heavy load.

Has anyone seen similar errors during load testing?

I've seen some posts with similar exceptions and the
general consensus is that this error means that the
index is corrupt.  I'm not sure my index is corrupt
however.  I can run all the queries I use for load
testing under normal load and I don't appear to get
this error.

Is there any way to verify that a Lucene index is
corrupt or not? 

Thanks,

Jim

java.lang.ArrayIndexOutOfBoundsException: 53 >= 52
at java.util.Vector.elementAt(Vector.java:431)
at
org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)
at
org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:103)
at
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:275)
at
org.apache.lucene.index.SegmentsReader.document(SegmentsReader.java:112)
at
org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:107)
at
org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
at
org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
at
org.apache.lucene.search.Hits.doc(Hits.java:130)




