Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Otis Gospodnetic
2.3 is in the process of being released.  Give it another week to 10 days and 
it will be out.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Ariel <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, January 10, 2008 6:26:44 PM
Subject: Re: Why is lucene so slow indexing in nfs file system ?

Thanks for your suggestions.

I'm sorry, I didn't know those terms; what do you mean by "SAN" and "FC"?

Also, I have visited the Lucene home page and the 2.3 version is not released
there; could you tell me where the download link is?

Thanks in advance.
Ariel


Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Chris Lu
SAN is Storage Area Network; FC is Fibre Channel.

I can confirm from one customer's experience that using a SAN does scale
pretty well, and is pretty simple. It does cost some money, though.

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request)
got 2.6 Million Euro funding!



Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
Thanks for your suggestions.

I'm sorry, I didn't know those terms; what do you mean by "SAN" and "FC"?

Also, I have visited the Lucene home page and the 2.3 version is not released
there; could you tell me where the download link is?

Thanks in advance.
Ariel


Re: how do I get my own TopDocHitCollector?

2008-01-10 Thread Antony Bowesman

Beard, Brian wrote:

Ok, I've been thinking about this some more. Is the cache mechanism pulling
from the cache if the external id already exists there, and then hitting the
searcher if it's not already in the cache (maybe using a FieldSelector for
just retrieving the external id)?


I am warming searchers in the background, and each searcher has one or more
query-related caches.  The external Id cache is normally preloaded by simply
iterating terms, e.g.


String field = fieldName.intern();
final String[] retArray = new String[reader.maxDoc()];
TermDocs termDocs = reader.termDocs();
// Position the enum at the first term of the field.
TermEnum termEnum = reader.terms(new Term(field, ""));
try
{
    do
    {
        Term term = termEnum.term();
        // Stop once we run past the end of this field's terms.
        if (term == null || term.field() != field)
            break;
        String termval = term.text();
        termDocs.seek(termEnum);
        // Record the term value against every document that contains it.
        while (termDocs.next())
        {
            retArray[termDocs.doc()] = termval;
        }
    }
    while (termEnum.next());
}
finally
{
    termDocs.close();
    termEnum.close();
}
return retArray;

I do allow for a partial cache, in which case, as you suggest, the searcher
uses a FieldSelector to get the external Id from the document, which is then
stored to the cache.
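
As an illustration (not Antony's actual code), the cache-miss path could look
somewhat like this, assuming the 2.x FieldSelector API; the field name
"externalId" and the plain-String id are made-up stand-ins, and
MapFieldSelector loads just that one field:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;

public class ExternalIdCache
{
    // Loads only the external-id field instead of the whole document.
    private static final FieldSelector ID_ONLY =
        new MapFieldSelector(new String[] { "externalId" });

    private final Map cache = new HashMap();  // Integer doc -> String id

    public String get(IndexReader reader, int doc) throws IOException
    {
        String id = (String) cache.get(new Integer(doc));
        if (id == null)
        {
            // Cache miss: hit the reader, but load just the one field.
            Document d = reader.document(doc, ID_ONLY);
            id = d.get("externalId");
            cache.put(new Integer(doc), id);
        }
        return id;
    }
}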


Antony






Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Otis Gospodnetic
Ariel,
 
Comments inline.


----- Original Message -----
From: Ariel <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, January 10, 2008 10:05:28 AM
Subject: Re: Why is lucene so slow indexing in nfs file system ?

In a distributed environment the application should make exhaustive use of
the network, and there is no other way to access the documents in a remote
repository but over the NFS file system.

OG: What about SAN connected over FC for example?

One thing I must clarify: I index the documents in memory, using a
RAMDirectory to do that; when the RAMDirectory reaches the limit (I have set
it to about 10 MB), I serialize the index to disk (NFS) to merge it with the
central index (the central index is on the NFS file system). Is that correct?

OG: Nah, don't bother with RAMDirectory; just use FSDirectory and it will do
the in-memory thing for you.  Make good use of your RAM and use 2.3, which
gives you more control over RAM use during indexing.  Parallelizing indexing
over multiple machines and merging at the end is faster, so that's a good
approach.  Also, if your boxes have multiple CPUs, write your code so that it
has multiple worker threads that do indexing and feed docs to
IndexWriter.addDocument(Document) to keep the CPUs fully utilized.
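
A minimal sketch of that setup (not code from this thread; the path, field
name and queue-filling step are placeholders, and setRAMBufferSizeMB()
assumes the 2.3 API):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ParallelIndexer
{
    public static void main(String[] args) throws Exception
    {
        // One FSDirectory-backed writer; no explicit RAMDirectory needed.
        final IndexWriter writer =
            new IndexWriter("/local/disk/index", new StandardAnalyzer(), true);
        writer.setRAMBufferSizeMB(32.0);  // 2.3: flush by RAM used, not doc count

        final BlockingQueue texts = new LinkedBlockingQueue();
        // ... fill 'texts' with extracted text, e.g. from PDFBox ...

        int numThreads = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++)
        {
            workers[i] = new Thread()
            {
                public void run()
                {
                    try
                    {
                        String text;
                        // IndexWriter is thread-safe; all workers share it.
                        while ((text = (String) texts.poll()) != null)
                        {
                            Document doc = new Document();
                            doc.add(new Field("body", text,
                                    Field.Store.NO, Field.Index.TOKENIZED));
                            writer.addDocument(doc);
                        }
                    }
                    catch (Exception e)
                    {
                        e.printStackTrace();
                    }
                }
            };
            workers[i].start();
        }
        for (int i = 0; i < numThreads; i++)
            workers[i].join();
        writer.close();
    }
}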

OG: Oh, something faster than PDFBox?  There is one (can't remember the name
now... itextstream or something like that?), though it may not be free like
PDFBox.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:

> Thanks to all of you for your answers; I am going to change a few things in
> my application and run tests.
> One thing: I haven't found another pdf-to-text converter as good as PDFBox.
> Do you know of any faster one?
> Greetings
> Thanks for your answers
> Ariel
>
>
> On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> > Ariel,
> >
> > I believe PDFBox is not the fastest thing and was built more to handle
> > all possible PDFs than for speed (just my impression - Ben, PDFBox's
> > author, might still be on this list and might comment).  Pulling data
> > from NFS to index seems like a bad idea.  I hope at least the indices
> > are local and not on a remote NFS...
> >
> > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one),
> > and indexing over NFS was slooow.
> >
> > Otis
> >
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
----- Original Message -----
> > From: Ariel <[EMAIL PROTECTED]>
> > To: java-user@lucene.apache.org
> > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > Subject: Why is lucene so slow indexing in nfs file system ?
> >
> > Hi:
> > I have seen the post at
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and
> > I am implementing a similar application in a distributed environment, a
> > cluster of only 5 nodes.  The operating system I use is Linux (CentOS),
> > so I am also using the NFS file system to access the home directory
> > where the documents to be indexed reside, and I would like to know how
> > much time an application spends to index a big amount of documents,
> > like 10 GB.
> > I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU and
> > 512 MB of RAM; LAN: 1 Gbit/s.
> >
> > The problem I have is that my application spends a lot of time to index
> > all the documents; the delay to index 10 GB of PDF documents is about 2
> > days (to convert PDF to text I am using PDFBox), which is of course a
> > lot of time.  Other applications based on Lucene, for instance IBM
> > OmniFind, take only 5 hours to index the same amount of PDF documents.
> > I would like to find out why my application has this big delay to
> > index; any help is welcome.
> > Do you know of other distributed-architecture applications that use
> > Lucene to index big amounts of documents?  How long do they take to
> > index?
> > I hope you can help me.
> > Greetings
> >



Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
I am indexing into RAM and then merging explicitly because my application
demands it: I have designed it as a distributed environment, so many threads
or workers on different machines index into RAM and serialize to disk, and
another thread on another machine accesses the segment index to merge it
with the principal one.  That is faster than if I had just one thread
indexing the documents, isn't it?
Your suggestions are very useful.
I hope you can help me.
Greetings
Ariel
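
For reference, the merge step Ariel describes could look somewhat like the
sketch below (not his code; the paths are invented, and it assumes the 2.x
IndexWriter.addIndexes(Directory[]) call, which also optimizes the index as
part of the merge, part of what makes this step expensive):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CentralMerger
{
    public static void main(String[] args) throws Exception
    {
        // Open the existing central index (create = false).
        IndexWriter central = new IndexWriter("/mnt/nfs/central-index",
                new StandardAnalyzer(), false);
        Directory[] segments = new Directory[] {
            FSDirectory.getDirectory("/mnt/nfs/worker1-index"),
            FSDirectory.getDirectory("/mnt/nfs/worker2-index")
        };
        central.addIndexes(segments);
        central.close();
    }
}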


Re: Retrieve number of terms

2008-01-10 Thread Luis Rodrigo

Hi Chris,

by "number of terms", do you mean the number of different terms that 
compose the index, or the numers of total terms, including repetitions?
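
For the second reading (total occurrences, with repetitions), a sketch along
these lines should work against the 2.x API; the index path is a placeholder:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class TotalTermCounter
{
    public static void main(String[] args) throws Exception
    {
        IndexReader reader = IndexReader.open("/path/to/index");
        TermEnum terms = reader.terms();
        TermDocs termDocs = reader.termDocs();
        long total = 0;
        while (terms.next())
        {
            termDocs.seek(terms);             // postings for this term
            while (termDocs.next())
                total += termDocs.freq();     // occurrences in this doc
        }
        termDocs.close();
        terms.close();
        reader.close();
        System.out.println("total term occurrences: " + total);
    }
}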



chris.b wrote:

I'm sure this has been asked a few times before, but I searched and searched
and found no answer (apart from using Luke), but I would like to know if
there's a way of retrieving the number of terms in an index.
I tried cycling through a TermEnum, but it doesn't do anything :|


--
Luis Rodrigo Aguado
Innovation and R&D, Research Manager
lrodrigo(at)isoco.com
T +34 913349777
C/ Pedro de Valdivia, 10, 28006 Madrid, Spain
iSOCO - intelligent software for the networked economy - www.isoco.com




RE: how do I get my own TopDocHitCollector?

2008-01-10 Thread Beard, Brian
Ok, I've been thinking about this some more.  Is the cache mechanism pulling
from the cache if the external id already exists there, and then hitting the
searcher if it's not already in the cache (maybe using a FieldSelector to
retrieve just the external id)?




Retrieve number of terms

2008-01-10 Thread chris.b

I'm sure this has been asked a few times before, but I searched and searched
and found no answer (apart from using Luke), but I would like to know if
there's a way of retrieving the number of terms in an index.
I tried cycling through a TermEnum, but it doesn't do anything :|
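
For what it's worth, a minimal sketch of counting the distinct terms (the
index path is a placeholder): reader.terms() starts just before the first
term, so a plain while (terms.next()) loop visits every term exactly once.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

public class TermCounter
{
    public static void main(String[] args) throws Exception
    {
        IndexReader reader = IndexReader.open("/path/to/index");
        TermEnum terms = reader.terms();
        int distinct = 0;
        while (terms.next())
            distinct++;
        terms.close();
        reader.close();
        System.out.println("distinct terms: " + distinct);
    }
}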



Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Michael McCandless


If possible you should also test the soon-to-be-released version 2.3,
which has a number of speedups to indexing.

Also try the steps here:

  http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

You should also try an A/B test: A) writing your index to the NFS directory
and then B) to a local IO system, to see how much NFS is really slowing you
down.
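
A rough harness for that A/B test could look like this (not Mike's code; the
paths are placeholders and the synthetic documents just stand in for the
extracted PDF text):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class AbTest
{
    static long timeIndexing(String path, List docs) throws Exception
    {
        long start = System.currentTimeMillis();
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        for (int i = 0; i < docs.size(); i++)
            writer.addDocument((Document) docs.get(i));
        writer.close();
        return System.currentTimeMillis() - start;
    }

    static List buildDocs()
    {
        List docs = new ArrayList();
        for (int i = 0; i < 10000; i++)
        {
            Document d = new Document();
            d.add(new Field("body", "sample text number " + i,
                    Field.Store.NO, Field.Index.TOKENIZED));
            docs.add(d);
        }
        return docs;
    }

    public static void main(String[] args) throws Exception
    {
        List docs = buildDocs();  // identical docs for both runs
        System.out.println("NFS:   " + timeIndexing("/mnt/nfs/test-index", docs) + " ms");
        System.out.println("local: " + timeIndexing("/tmp/test-index", docs) + " ms");
    }
}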


Mike


Re: Highlighting + phrase queries

2008-01-10 Thread Mark Miller
I don't think you would see much of a gain.  Shoving the TokenStream into
the MemoryIndex is actually pretty fast, and I wouldn't be surprised if it
was much faster than reading from disk.  Most of the computational time is
spent in reconstructing the TokenStream, whether you use term vectors or
re-analyze.  Also, if the Query does not have any position-sensitive
clauses, no MemoryIndex is created, so no worries there.


The great speed challenge of the current method (other than needing a
TokenStream created) is that it runs over each Token and stitches the
document together a piece at a time.  This doesn't scale well on huge docs.
There are ways to cut this down and to analyze just the pertinent Tokens,
as is done by a different patch, but you'd need to have TermVectors stored,
and the concept doesn't fit with the current Highlighter framework, which
already has some significant functionality and robustness.
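
For context, the original (position-insensitive) path being discussed looks
roughly like this with the 2.x contrib Highlighter: re-analyze the stored
text to rebuild the TokenStream and let the Highlighter stitch the fragments
together. Field name and fragment settings are placeholders:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class HighlightSketch
{
    static String highlight(Query query, String text) throws Exception
    {
        Analyzer analyzer = new StandardAnalyzer();
        // QueryScorer here matches individual terms only; handling
        // PhraseQuery positions is exactly what LUCENE-794 adds on top.
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        TokenStream tokens =
            analyzer.tokenStream("body", new StringReader(text));
        return highlighter.getBestFragments(tokens, text, 3, "...");
    }
}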


- Mark




Re: Highlighting + phrase queries

2008-01-10 Thread Marjan Celikik

Mark Miller wrote:


That is why the original contrib does not work with PhraseQuery's. It
simply matches Tokens from the query with those in the TokenStream.
LUCENE-794 takes the TokenStream and shoves it into a MemoryIndex. [...]



Mark, thanks for your patience! I have one final (conceptual, high-level)
question concerning the usage of the MemoryIndex over the TokenStream.
Is it a good idea to store the precomputed MemoryIndex (conceptually
speaking) as a field of each document at indexing time, and then just load
this precomputed index from disk (as you do with TermVectors), so that you
save the extra computation at highlighting time?


Marjan.




Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Erick Erickson
This seems really clunky. Especially if your merge step also optimizes.

There's not much point in indexing into RAM then merging explicitly.
Just use an FSDirectory rather than a RAMDirectory. There is *already*
buffering built in to FSDirectory, and your merge factor etc. control
how much RAM is used before flushing to disk. There's considerable
discussion of this on the Wiki I believe, but in the mail archive for sure.
And I believe there's a RAM usage based flushing policy somewhere.

You're adding complexity where it's probably not necessary. Did you
adopt this scheme because you *thought* it would be faster or because
you were addressing a *known* problem? Don't *ever* write complex code
to support a theoretical case unless you have considerable certainty
that it really is a problem. "It would be faster" is a weak argument when
you don't know whether you're talking about saving 1% or 95%. The
added maintenance is just not worth it.

There's a famous quote about that from Donald Knuth
(paraphrasing Hoare) "We should forget about small efficiencies,
say about 97% of the time: premature optimization is the root of
all evil." It's true.

So the very *first* measurement I'd take is to get rid of the in-RAM
stuff and just write the index to local disk. I suspect you'll be *far*
better off doing this and then just copying your index to the nfs mount.

Best
Erick



Re: Highlighting + phrase queries

2008-01-10 Thread Marjan Celikik

Marjan Celikik wrote:

[...]

Thanks! This is what I needed. Still I don't know how to obtain the
source code of your patch :(

ok, I will just apply the patch. Sorry.

Marjan.




RE: how do I get my own TopDocHitCollector?

2008-01-10 Thread Beard, Brian
Thanks for the post. So you're using the doc id as the key into the
cache to retrieve the external id. Then what mechanism fetches the
external id's from the searcher and places them in the cache?


-Original Message-
From: Antony Bowesman [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 09, 2008 7:19 PM
To: java-user@lucene.apache.org
Subject: Re: how do I get my own TopDocHitCollector?

Beard, Brian wrote:
> Question:
>
> The documents that I index have two id's - a unique document id and a
> record_id that can link multiple documents together that belong to a
> common record.
>
> I'd like to use something like TopDocs to return the first 1024 results
> that have unique record_id's, but I will want to skip some of the
> returned documents that have the same record_id. We're using the
> ParallelMultiSearcher.
>
> I read that I could use a HitCollector and throw an exception to get it
> to stop, but is there a cleaner way?

I'm doing a similar thing.  I have external Ids (equivalent to your
record_id), which have one or more Lucene Documents associated with them.
I wrote a custom HitCollector that uses a Map to hold the so-far-collected
external ids along with the collected document.

I had to write my own priority queue to know when an object was dropped
off the bottom of the score-sorted queue, but the latest PriorityQueue on
the trunk now has insertWithOverflow(), which does the same thing.
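
As an illustration (made-up names, assuming the trunk API Antony mentions),
insertWithOverflow() hands back whatever fell off the bottom of the queue:

import org.apache.lucene.util.PriorityQueue;

class ScoredId
{
    String id;
    float score;
    ScoredId(String id, float score) { this.id = id; this.score = score; }
}

class ScoredIdQueue extends PriorityQueue
{
    ScoredIdQueue(int maxSize) { initialize(maxSize); }
    protected boolean lessThan(Object a, Object b)
    {
        return ((ScoredId) a).score < ((ScoredId) b).score;
    }
}

// Usage: the return value is the dropped entry, or null if nothing fell out.
//   ScoredIdQueue pq = new ScoredIdQueue(1024);
//   ScoredId dropped = (ScoredId) pq.insertWithOverflow(new ScoredId("a1", 0.9f));
//   if (dropped != null) results.remove(dropped.id);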

Note that ResultDoc extends ScoreDoc, so that the external Id of the
item 
dropped off the queue can be used to remove it from my Map.

The code snippet is somewhat as below (I am caching my external Ids, hence
the cache usage):

protected Map results;

public void collect(int doc, float score)
{
    if (score > 0.0f)
    {
        totalHits++;
        if (pq.size() < numHits || score > minScore)
        {
            OfficeId id = cache.get(doc);
            ResultDoc rd = results.get(id);
            if (rd == null)
            {
                // No result collected for this external Id yet
                rd = new ResultDoc(id, doc, score);
                ResultDoc added = pq.insert(rd);
                if (added == null)
                {
                    // Nothing dropped off the bottom
                    results.put(id, rd);
                }
                else
                {
                    // 'added' was dropped off the bottom: remove it
                    // from the map as well
                    results.remove(added.id);
                    results.put(id, rd);
                    remaining++;
                }
            }
            else
            {
                // Already found this Id, so replace the high score
                // if necessary
                if (score > rd.score)
                {
                    pq.remove(rd);
                    rd.score = score;
                    pq.insert(rd);
                }
            }
            // Readjust our minimum score from the top entry
            minScore = pq.peek().score;
        }
        else
        {
            remaining++;
        }
    }
}

HTH
Antony






Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
In a distributed environment the application should make exhaustive use of
the network, and there is no other way to access the documents in a remote
repository but over the NFS file system.
One thing I must clarify: I index the documents in memory, using a
RAMDirectory to do that; when the RAMDirectory reaches the limit (I have set
it to about 10 MB), I serialize the index to disk (NFS) to merge it with the
central index (the central index is on the NFS file system). Is that correct?
I hope you can help me.
I have taken into consideration the suggestions you made before; I am going
to do some things to test them.
Ariel




Re: Self Join Query

2008-01-10 Thread Paul Elschot
Sachin,

As the merging of the results is the issue, I'll assume that you don't
have clear user requirements for that.

The simplest way out of that is to allow the users to search the B's first,
and once they have determined which B's they'd like to use, use those B's to
limit the results of user searches in A.  That would normally be done by
filtering on B, much like RangeFilter.  Caching that filter allows for quick
repeated searches in A.
Is that what the users want?
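
A minimal sketch of that filter idea (schema assumed: each A document
carries an indexed "bid" field per associated B; the classic 2.x QueryFilter
caches its bit set per reader, so repeated A searches with the same B
selection stay fast):

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class SearchAWithBFilter
{
    public static Hits search(IndexSearcher aSearcher, Query aTextQuery,
                              String[] selectedBIds) throws IOException
    {
        BooleanQuery bQuery = new BooleanQuery();
        for (int i = 0; i < selectedBIds.length; i++)
        {
            bQuery.add(new TermQuery(new Term("bid", selectedBIds[i])),
                       BooleanClause.Occur.SHOULD);
        }
        QueryFilter bFilter = new QueryFilter(bQuery);
        // The filter restricts hits to A's linked to the chosen B's;
        // note the B part contributes no score, as mentioned below.
        return aSearcher.search(aTextQuery, bFilter);
    }
}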

For each normalization a filter can be used to search across it.
One feature of filters is that the original score is lost.
Would you have user requirements related to this?

As the texts of A and B are the problem for reindexing, you
may want to index these separately: one index for Aid+Atext,
and one for Bid+Btext.

That leaves the A-B 1-n association: one more index for Aid+Bids.
In this last one you could also put a small text field of A.

Denormalizing the Btext into Aid+Bids as Aid+Bids+Btexts can make
it difficult for the users to explicitly select the B's. OTOH it makes it
easy to implicitly select the B's. What do the users want?

Each id field will have to be indexed to allow filtering, and stored to
allow retrieval for filtering in another index. Retrieving stored fields
is normally a performance bottleneck, so a FieldCache might be handy.
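
For example (the field name "aid" is an assumption from the schema above),
the FieldCache gives a per-reader, in-memory array mapping each document to
its id, loaded once instead of via per-hit stored-field reads:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

public class IdLookup
{
    public static String[] loadAids(IndexReader assocReader) throws IOException
    {
        // One entry per document; FieldCache caches the array per reader.
        return FieldCache.DEFAULT.getStrings(assocReader, "aid");
    }
}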

Regards,
Paul Elschot



On Thursday 10 January 2008 12:58:44 sachin wrote:
> Here are more details about my issue.
>
> I have two tables in the database. A row in table 1 can have multiple rows
> associated with it in table 2; it is a one-to-many mapping.
> Let's say a row in table 1 is A and it has multiple rows B1, B2 and B3
> associated with it in table 2. I need to search on both A and B types,
> and the result should have A and all the B's associated with it. Also,
> for your information, the A's and B's are long texts in the database.
>
> I could take two approaches to indexing/searching.
>
> The first approach is to create the index in denormalized form. In this
> case a document would be like A, B1, B2, B3. The issue with this approach
> is that any modification to any row would require me to re-index the
> document again and fetch A and all the B's again from the database. This
> is a heavy process.
>
> The other approach is to index A, B1, B2 and B3 as different documents
> and merge the results after the search. This makes my re-indexing
> lighter, but I need extra logic to merge the results. For this type of
> index I would need a self-join kind of query from lucene. The query can
> be written as a boolean query, but merging the two types of documents is
> an issue. If I go with this approach to indexing, what is the best way to
> fetch the results?
>
> I hope I have made myself clear.
>
> Thanks
> Sachin
> 
> 
> 
> On Tue, 2008-01-08 at 20:13 +0530, Developer Developer wrote:
> > Provide more details please.
> >
> > Can you not use a boolean query and filters if need be?
> >
> > On Jan 8, 2008 7:23 AM, sachin <[EMAIL PROTECTED]> wrote:
> >
> > I need to write a Lucene query similar to an SQL self join.
> >
> > My current implementation is very primitive. I fire the first query,
> > get the results, fire a second query based on the results of the
> > first, and then merge the results from both queries. The whole
> > process is very expensive. Doing this with an SQL query is very
> > easy, as we just write a self-join query and the database does the
> > rest.
> >
> > What is the best way of implementing the above functionality in
> > Lucene?
> >
> > Regards
> > Sachin



Re: Highlighting + phrase queries

2008-01-10 Thread Marjan Celikik

Mark Miller wrote:
The Highlighter works by comparing the TokenStream of the document 
with the Tokens in the query. The TokenStream can be rebuilt from the 
index if you use TermVectors with TokenSources or you can get it by 
reanalyzing the document.  Each Token from the TokenStream is checked 
against Tokens in the query, and if there is a match you have a 
Highlight. The original text is then reconstructed with the Highlights 
from info in the TokenStream about original offsets into the document 
for each Token. Also, there is a Fragment system that will break apart 
the Highlighted text into score-sorted text Fragments.

OK, this is what I already knew...


That is why the original contrib Highlighter does not work with PhraseQueries. It 
simply matches Tokens from the query with those in the TokenStream. 
LUCENE-794 takes the TokenStream and shoves it into a MemoryIndex. 
Then, after converting the query to a SpanQuery approximation, 
getSpans is called on the index for the query. The Spans provide a 
bound on what positions should be Highlighted. Everything else is done 
exactly like the original Highlighter (This is a patch that fits into 
the original Highlighter framework that was developed, thereby 
retaining all of its richness :) ).



Thanks! This is what I needed. Still, I don't know how to obtain the 
source code of your patch :(


Marjan.





Re: Highlighting + phrase queries

2008-01-10 Thread Mark Miller
The Highlighter works by comparing the TokenStream of the document with 
the Tokens in the query. The TokenStream can be rebuilt from the index 
if you use TermVectors with TokenSources or you can get it by 
reanalyzing the document.  Each Token from the TokenStream is checked 
against Tokens in the query, and if there is a match you have a 
Highlight. The original text is then reconstructed with the Highlights 
from info in the TokenStream about original offsets into the document 
for each Token. Also, there is a Fragment system that will break apart 
the Highlighted text into score-sorted text Fragments.


That is why the original contrib Highlighter does not work with PhraseQueries. It 
simply matches Tokens from the query with those in the TokenStream. 
LUCENE-794 takes the TokenStream and shoves it into a MemoryIndex. Then, 
after converting the query to a SpanQuery approximation, getSpans is 
called on the index for the query. The Spans provide a bound on what 
positions should be Highlighted. Everything else is done exactly like 
the original Highlighter (This is a patch that fits into the original 
Highlighter framework that was developed, thereby retaining all of its 
richness :) ).
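
To illustrate this flow, here is a minimal usage sketch of the contrib
Highlighter, assuming Lucene 2.2-era APIs; the "text" field name and
StandardAnalyzer are placeholder assumptions, and this variant
reanalyzes the document rather than rebuilding the TokenStream from
TermVectors with TokenSources.

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class HighlightSketch {
    public static String highlight(Query query, String text) throws Exception {
        // QueryScorer holds the query Tokens to match against;
        // SimpleHTMLFormatter wraps each Highlight in <B>...</B> by default.
        Highlighter highlighter =
            new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
        // Rebuild the TokenStream by reanalyzing the original text.
        TokenStream tokens =
            new StandardAnalyzer().tokenStream("text", new StringReader(text));
        // Return up to three score-sorted Fragments, joined by "...".
        return highlighter.getBestFragments(tokens, text, 3, "...");
    }
}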


Marjan Celikik wrote:

Mark Miller wrote:
Oh yeah...something that you may not have seen is that this has a 
dependency on MemoryIndex from contrib. You need that jar as well.


- Mark
Hm, I need the source code. How do I download the files from 
https://issues.apache.org/jira/browse/LUCENE-794 (all I see are some 
.patch files)?


What I really need is how the highlighter works in a nutshell... I 
am working on a publication and I want to have a reference to Lucene 
and its highlighting...


Thanks again.

Marjan.




Re: Highlighting + phrase queries

2008-01-10 Thread Marjan Celikik

Mark Miller wrote:
Oh yeah...something that you may not have seen is that this has a 
dependency on MemoryIndex from contrib. You need that jar as well.


- Mark
Hm, I need the source code. How do I download the files from 
https://issues.apache.org/jira/browse/LUCENE-794 (all I see are some 
.patch files)?


What I really need is how the highlighter works in a nutshell... I am 
working on a publication and I want to have a reference to Lucene and 
its highlighting...


Thanks again.

Marjan.




Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
Thanks to all of you for your answers; I am going to change a few things
in my application and run tests.
One thing: I haven't found another PDF-to-text converter as good as
PDFBox. Do you know of a faster one?
Greetings
Thanks for your answers
Ariel

On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Ariel,
>
> I believe PDFBox is not the fastest thing and was built more to handle all
> possible PDFs than for speed (just my impression - Ben, PDFBox's author
> might still be on this list and might comment).  Pulling data from NFS to
> index seems like a bad idea.  I hope at least the indices are local and not
> on a remote NFS...
>
> We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one)
> and indexing over NFS was slooow.
>
> Otis
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message 
> From: Ariel <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Wednesday, January 9, 2008 2:50:41 PM
> Subject: Why is lucene so slow indexing in nfs file system ?
>
> Hi:
> I have seen the post at
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> and I am implementing a similar application in a distributed
> environment, a cluster of only 5 nodes. The operating system I use is
> Linux (CentOS), so I am also using the NFS file system to access the
> home directory where the documents to be indexed reside, and I would
> like to know how much time an application should spend indexing a
> large amount of documents, say 10 GB.
> I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU
> with 512 MB of RAM, on a 1 Gbit/s LAN.
>
> The problem I have is that my application spends a lot of time
> indexing all the documents: the delay to index 10 GB of PDF documents
> is about 2 days (to convert PDF to text I am using PDFBox), which is
> of course a lot of time. Other applications based on Lucene, for
> instance IBM OmniFind, take only 5 hours to index the same amount of
> PDF documents. I would like to find out why my application has this
> big indexing delay; any help is welcome.
> Do you know of other distributed applications that use Lucene to
> index large amounts of documents? How long do they take to index?
> I hope you can help me.
> Greetings


Re: Highlighting + phrase queries

2008-01-10 Thread Mark Miller
Oh yeah...something that you may not have seen is that this has a 
dependency on MemoryIndex from contrib. You need that jar as well.


- Mark

Marjan Celikik wrote:

Mark Miller wrote:

The contrib Highlighter doesn't know and highlights them all.

Check out my patch here for position sensitive highlighting:
https://issues.apache.org/jira/browse/LUCENE-794


It seems that the patch does not work with Lucene 2.2 as I get some 
compile errors. Is this really the case?


Marjan.




Re: Highlighting + phrase queries

2008-01-10 Thread Mark Miller
It should work no problem with 2.2. What are the compile errors you are 
getting?


If you send me a note directly I will send you a jar.

- Mark

Marjan Celikik wrote:

Mark Miller wrote:

The contrib Highlighter doesn't know and highlights them all.

Check out my patch here for position sensitive highlighting:
https://issues.apache.org/jira/browse/LUCENE-794


It seems that the patch does not work with Lucene 2.2 as I get some 
compile errors. Is this really the case?


Marjan.




Re: Highlighting + phrase queries

2008-01-10 Thread Marjan Celikik

Mark Miller wrote:

The contrib Highlighter doesn't know and highlights them all.

Check out my patch here for position sensitive highlighting:
https://issues.apache.org/jira/browse/LUCENE-794


It seems that the patch does not work with Lucene 2.2 as I get some 
compile errors. Is this really the case?


Marjan.




Re: Self Join Query

2008-01-10 Thread sachin
Here are more details about my issue.

I have two tables in a database. A row in table 1 can have multiple rows
associated with it in table 2. It is a one-to-many mapping.
Let's say a row in table 1 is A and it has multiple rows B1, B2 and B3
associated with it in table 2. I need to search on both A and B types,
and the result should have A and all the B's associated with it. Also,
for your information, A and the B's are long text fields in the
database.

I could take two approaches to indexing/searching.

The first approach is to create the index in denormalized form. In this
case a document would contain A, B1, B2 and B3. The issue with this
approach is that any modification to any row would require me to
re-index the document and fetch A and all the B's from the database
again. This is a heavy process. (A sketch of this layout follows below.)
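
As an illustration of the denormalized layout, here is a minimal
indexing sketch, assuming Lucene 2.2-era APIs; the field names and the
use of one multi-valued "Btext" field are illustrative assumptions only.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class DenormalizedSketch {
    // aid, atext and btexts stand in for values fetched from the database.
    static void indexA(IndexWriter writer, String aid, String atext,
                       String[] btexts) throws Exception {
        Document doc = new Document();
        doc.add(new Field("Aid", aid, Field.Store.YES,
                          Field.Index.UN_TOKENIZED));
        doc.add(new Field("Atext", atext, Field.Store.NO,
                          Field.Index.TOKENIZED));
        for (int i = 0; i < btexts.length; i++) {
            // All B texts become values of one field in the same document,
            // so a change to any single B means rebuilding the whole
            // document from the database.
            doc.add(new Field("Btext", btexts[i], Field.Store.NO,
                              Field.Index.TOKENIZED));
        }
        writer.addDocument(doc);
    }
}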

The other approach is to index A, B1, B2 and B3 as separate documents
and merge the results after the search. This makes my re-indexing
lighter, but I need extra logic to merge the results. For this type of
index I would need a self-join kind of query from Lucene. The query can
be written using a boolean query, but merging the two types of documents
is an issue. If I go with this approach to indexing, what is the best
way to fetch the results?

I hope I have made myself clear.

Thanks
Sachin



On Tue, 2008-01-08 at 20:13 +0530, Developer Developer wrote:
> Provide more details please.
>
> Can you not use a boolean query and filters if need be?
>
> On Jan 8, 2008 7:23 AM, sachin <[EMAIL PROTECTED]> wrote:
>
> I need to write a Lucene query similar to an SQL self join.
>
> My current implementation is very primitive. I fire the first query,
> get the results, fire a second query based on the results of the
> first, and then merge the results from both queries. The whole process
> is very expensive. Doing this with an SQL query is very easy, as we
> just write a self-join query and the database does the rest.
>
> What is the best way of implementing the above functionality in
> Lucene?
>
> Regards
> Sachin