Re: Academic Question About Indexing

2004-11-11 Thread Luke Shannon
40 million! Wow. OK, this is the kind of answer I was looking for. The site I
am working on indexes maybe 1000 documents at any given time. I think I am
fine with a single index.

Thanks.



Re: Academic Question About Indexing

2004-11-11 Thread Gard Arneson Haugen
Could I ask how fast searches run against this index, both for simple terms
and for more advanced phrase and boolean queries?
And is there anything clever you have done to make it fast, either in the
infrastructure or in the system itself?

Best regards,
Gard Arneson Haugen
Email : [EMAIL PROTECTED]
Mobile: +47 93 05 01 91 
Fax   : +47 21 95 51 99
Magenta News AS - Møllergata 8, 0179 Oslo




RE: Academic Question About Indexing

2004-11-11 Thread Will Allen
I have a servlet that instantiates a MultiSearcher over 6 indexes:
(du -h)
7.2G    ./0
7.2G    ./1
7.2G    ./2
7.2G    ./3
7.2G    ./4
7.2G    ./5
43G     .

I recreate the indexes from scratch each month from a 50 GB zip file containing
all 40 million documents.  I wanted to keep my indexing time as low as
possible without hurting search performance too much, as each searcher
allocates an amount of memory proportional to the number of terms in its
index.  A single large index has a lot of overlap in terms, so it needs less
memory than multiple indexes.

Anyway, for indexing, I am able to index ~100 documents per second.  The total
indexing process takes 2.5 days.  I have a powerful machine with 2
hyperthreaded processors (Linux sees 4 processors) and 1 GB of RAM.  I also
have pretty fast SCSI disks.

I perform no updates or deletes on my indexes.

The indexing process divides the work equally amongst the indexers.  The
bottleneck of the indexing process is not memory or CPU but rather the disk
I/O of the 6 writers.  If I had faster disks, I could create more indexers.
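
For reference, a minimal sketch of the search side with the Lucene 1.4-era
API (the directory paths and field name here are illustrative, not
necessarily what my servlet does):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // Open one searcher per index directory (./0 through ./5 above).
        Searchable[] searchers = new Searchable[6];
        for (int i = 0; i < searchers.length; i++) {
            searchers[i] = new IndexSearcher("/indexes/" + i);
        }

        // The MultiSearcher presents the 6 indexes as one logical index.
        MultiSearcher searcher = new MultiSearcher(searchers);
        Query query = QueryParser.parse("lucene", "contents",
                new StandardAnalyzer());
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching documents");
        searcher.close();
    }
}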

-Original Message-
From: Sodel Vazquez-Reyes
[mailto:[EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 11:37 AM
To: Lucene Users List
Cc: Will Allen
Subject: Re: Academic Question About Indexing


Will,
could you give more details about your architecture?
- do you update existing indexes or create new ones each time?
- what data is stored in each index?
etc.

It is quite interesting, and I would like to test it.

Sodel



Re: Academic Question About Indexing

2004-11-10 Thread Otis Gospodnetic
Uh, I hate to market it, but it's in the book.  You don't have to wait
for it, though, as there is already a Lucene demo that does what you
described.  I am not sure whether the demo always recreates the index or
whether it deletes and re-adds only the new and modified files, but if
it's the former, you would only need to modify the demo a little to
check the timestamps of the File objects and compare them to those
stored in the index (if they are being stored; if not, you should add a
field to hold that data).
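
A rough sketch of that check with the Lucene 1.4-era API (the "path" and
"modified" field names are hypothetical; use whatever your index actually
stores):

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class FreshnessCheck {
    // True if the file was never indexed, or has changed on disk since.
    public static boolean needsReindex(IndexReader reader, File f)
            throws Exception {
        TermDocs docs = reader.termDocs(new Term("path", f.getPath()));
        try {
            if (!docs.next()) {
                return true;  // no document with this path: never indexed
            }
            Document doc = reader.document(docs.doc());
            // "modified" holds File.lastModified() stored as a string.
            long indexedTime = Long.parseLong(doc.get("modified"));
            return f.lastModified() > indexedTime;
        } finally {
            docs.close();
        }
    }
}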

Otis

--- Luke Shannon [EMAIL PROTECTED] wrote:

 I am working on debugging an existing Lucene implementation.
 
 Before I started, I built a demo to understand Lucene. In my demo I
 indexed the entire content hierarchy all at once, then optimized the
 index and used it for queries. It was time-consuming but very simple.
 
 The code I am currently trying to fix indexes the content hierarchy by
 folder, creating a separate index for each one. Thus it ends up with a
 bunch of indexes. I still don't understand how this works (I am assuming
 they get merged somewhere that I haven't tracked down yet), but I have
 noticed it doesn't always index the right folder. This results in users
 reporting inconsistent search behavior after they make a change to a
 document. To keep things simple, I would like to remove all the logic
 that figures out which folder to index and just index them all (usually
 fewer than 1000 files) so I end up with one index.
 
 Would indexing time be the only area where I would lose out, or is there
 something more to the approach of creating multiple indexes and merging
 them?
 
 What is a good approach I can take to indexing a content hierarchy
 composed primarily of pdf, xsl, doc and xml files, where any of these
 documents can be changed several times a day?
 
 Thanks,
 
 Luke
 
 
 



Re: Academic Question About Indexing

2004-11-10 Thread Luke Shannon
Don't worry; regardless of what I learn in this forum, I am telling my
company to get me a copy of that bad boy when it comes out (which, as far as
I am concerned, can't be soon enough). I will pay for grandma's myself.

I think I have reviewed the code you are referring to and have something
similar working in my own indexer (using the uid). All is well.

My stupid question for the day: why would you ever want multiple indexes
running if you can build one smart indexer that does everything as
efficiently as possible? Does the answer to this question move me into
multithreaded indexing territory?

Thanks,

Luke




RE: Academic Question About Indexing

2004-11-10 Thread Will Allen
I have an application that I run monthly that indexes 40 million documents
into 6 indexes, then uses a MultiSearcher.  The advantage for me is that I can
have multiple writers each indexing 1/6 of the total data, reducing the time
it takes to index by about 5x.
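
A rough sketch of that split with the Lucene 1.4-era API (the paths, field
name, and slicing here are illustrative, not my actual code):

import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ParallelIndexer {
    // Each thread writes its slice to its own index directory, so the
    // writers never contend for a single index's write lock.
    public static void indexSlices(final List[] slices) {
        for (int i = 0; i < slices.length; i++) {
            final int n = i;
            new Thread(new Runnable() {
                public void run() {
                    try {
                        IndexWriter writer = new IndexWriter(
                                "/indexes/" + n, new StandardAnalyzer(), true);
                        for (int j = 0; j < slices[n].size(); j++) {
                            Document doc = new Document();
                            doc.add(Field.Text("contents",
                                    (String) slices[n].get(j)));
                            writer.addDocument(doc);
                        }
                        writer.optimize();  // merge segments for search speed
                        writer.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }
}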
