Re: Duplicate Hits

2005-02-01 Thread sergiu gordea
Erik Hatcher wrote:
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it.  And I also don't think I can keep an IndexReader open to the index at the same time I have an IndexWriter open.  I may have to try and deal with this issue through some sort of filter on the query side, provided it doesn't impact performance too much.

You can use an IndexReader and IndexWriter at the same time (the caveat is that you cannot delete with the IndexReader at the same time you're writing with an IndexWriter).  Is there no other identifying information, though, on the incoming documents - a date stamp?  An identifier?  Or something unique you can go on?

Erik
As Erick suggested earlier, I think that keeping the information in the database and identifying the new entries at the database level is a better approach. Indexing documents and optimizing an index that big will be very time consuming.

Also, consider that in the future you may want to modify the structure of your index. Think how much effort it would take to split some fields into a few smaller parts, or just to change the format of a field - let's say you have a date in DDMMYY format and you need to change it to MMDD.

And consider how much effort is needed to rebuild a completely new index from the database.

Of course, your requirements may not call for keeping the information in the database, and ... it is up to you to use a DB + Lucene index, or just a Lucene index.
Best,
Sergiu



Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it.  And I also don't think I can keep an IndexReader open to the index at the same time I have an IndexWriter open.  I may have to try and deal with this issue through some sort of filter on the query side, provided it doesn't impact performance too much.
You can use an IndexReader and IndexWriter at the same time (the caveat is that you cannot delete with the IndexReader at the same time you're writing with an IndexWriter).  Is there no other identifying information, though, on the incoming documents - a date stamp?  An identifier?  Or something unique you can go on?
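
A minimal sketch of the two open side by side (Lucene 1.4-era API; the path and the stored "id" keyword field are illustrative). The searcher only sees the index as it was when opened, so it won't see documents the writer has added since:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ReaderBesideWriter {
    public static void main(String[] args) throws Exception {
        String path = "/path/to/index";                    // illustrative
        IndexSearcher searcher = new IndexSearcher(path);  // point-in-time view
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        Hits hits = searcher.search(new TermQuery(new Term("id", "doc-42")));
        if (hits.length() == 0) {                          // not indexed yet
            Document doc = new Document();
            doc.add(Field.Keyword("id", "doc-42"));
            writer.addDocument(doc);  // adding while searching is fine;
                                      // deleting via an IndexReader is not
        }
        writer.close();
        searcher.close();
    }
}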

Erik


Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it.  And I also don't think I can keep an IndexReader open to the index at the same time I have an IndexWriter open.  I may have to try and deal with this issue through some sort of filter on the query side, provided it doesn't impact performance too much.
 

I was thinking of indexing in batches of a few documents (10? 100? 1000?), which means flipping between IndexReaders and IndexWriters wouldn't be too onerous.
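
Something like this, perhaps - a sketch against the Lucene 1.4-era API, where each queued Document carries an illustrative stored "id" keyword that is used to delete any stale copy before the batch is added:

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class BatchFlip {
    // call once per batch of, say, 100 or 1000 documents
    public static void flush(String indexPath, List docs) throws IOException {
        IndexReader reader = IndexReader.open(indexPath);
        for (Iterator i = docs.iterator(); i.hasNext();) {
            Document d = (Document) i.next();
            reader.delete(new Term("id", d.get("id")));  // drop any stale copy
        }
        reader.close();                                  // releases the write lock
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        for (Iterator i = docs.iterator(); i.hasNext();) {
            writer.addDocument((Document) i.next());
        }
        writer.close();
    }
}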

jch


RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it.  And I also don't think I can keep an IndexReader open to the index at the same time I have an IndexWriter open.  I may have to try and deal with this issue through some sort of filter on the query side, provided it doesn't impact performance too much.

Thanks.

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


-Original Message-
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 9:48 AM
To: Lucene Users List
Subject: Re: Duplicate Hits


Jerry Jalenak wrote:

>Just to make sure I understand
>
>Do you keep an IndexReader open at the same time you are running the
>IndexWriter?  From what I can see in the JavaDocs, it looks like only
>IndexReader (or IndexSearcher) can peek into the index and see if a document
>exists or not
>  
>
I slightly misled you: it wasn't Lucene that I was using at the time, and in that system the distinction between IndexReader and IndexWriter didn't exist.  I'm just getting to grips with Lucene really, but it would seem to be possible to use a similar scheme, especially if you batch up your documents for indexing: as they come in, check the md5 checksum against what's already known and what's already queued; then, when the time comes to process the queue, you know that everything in it needs to be indexed.

jch




Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote:
Just to make sure I understand
Do you keep an IndexReader open at the same time you are running the
IndexWriter?  From what I can see in the JavaDocs, it looks like only
IndexReader (or IndexSearcher) can peek into the index and see if a document
exists or not
 

I slightly misled you: it wasn't Lucene that I was using at the time, and in that system the distinction between IndexReader and IndexWriter didn't exist.  I'm just getting to grips with Lucene really, but it would seem to be possible to use a similar scheme, especially if you batch up your documents for indexing: as they come in, check the md5 checksum against what's already known and what's already queued; then, when the time comes to process the queue, you know that everything in it needs to be indexed.
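
A sketch of that bookkeeping (raw collections to stay JDK 1.4-friendly; the md5Hex() and buildDocument() helpers are assumptions, not Lucene API):

import java.io.File;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public abstract class ChecksumQueue {
    private final Set known = new HashSet();   // checksums already in the index
    private final Set queued = new HashSet();  // checksums waiting in this batch
    private final List batch = new ArrayList();

    public void submit(File file) throws Exception {
        String sum = md5Hex(file);
        if (known.contains(sum) || !queued.add(sum)) {
            return;                            // already indexed or already queued
        }
        batch.add(file);
    }

    public void flush(IndexWriter writer) throws Exception {
        for (Iterator i = batch.iterator(); i.hasNext();) {
            writer.addDocument(buildDocument((File) i.next()));
        }
        known.addAll(queued);                  // the batch is now in the index
        queued.clear();
        batch.clear();
    }

    protected abstract String md5Hex(File f) throws Exception;
    protected abstract Document buildDocument(File f) throws Exception;
}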

jch


RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Just to make sure I understand

Do you keep an IndexReader open at the same time you are running the
IndexWriter?  From what I can see in the JavaDocs, it looks like only
IndexReader (or IndexSearcher) can peek into the index and see if a document
exists or not

Thanks!

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


-Original Message-
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 9:39 AM
To: Lucene Users List
Subject: Re: Duplicate Hits


Jerry Jalenak wrote:

>Nice idea John - one I hadn't considered.  Once you have the checksum, do
>you 'check' in the index first before storing the second document?  Or do
>you filter on the query side?
>  
>
I do a quick search for the md5 checksum before indexing.

Although I suspect not applicable in your case, I also maintained a 
"last time something was indexed" time alongside the index.  I used this 
to drastically prune the number of documents that needed to be 
considered for indexing if I restarted; anything modified before then 
wasn't a candidate.  Since the MD5 checksum provides the definitive (for 
a sufficiently loose definition of definitive) indication of whether a 
document is indexed I didn't need to worry about ultra-fine granularity 
in the time stamp and I didn't need to worry about it being committed to 
disk; it generally got committed to the magnetic stuff every few seconds 
or so.

It does help a lot, though, if documents have nice unique identifiers that 
you can use instead; then you can use the identifier and the last-modified 
time to decide whether or not to re-index.

jch




Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote:
Nice idea John - one I hadn't considered.  Once you have the checksum, do
you 'check' in the index first before storing the second document?  Or do
you filter on the query side?
 

I do a quick search for the md5 checksum before indexing.
Although I suspect not applicable in your case, I also maintained a 
"last time something was indexed" time alongside the index.  I used this 
to drastically prune the number of documents that needed to be 
considered for indexing if I restarted; anything modified before then 
wasn't a candidate.  Since the MD5 checksum provides the definitive (for 
a sufficiently loose definition of definitive) indication of whether a 
document is indexed I didn't need to worry about ultra-fine granularity 
in the time stamp and I didn't need to worry about it being committed to 
disk; it generally got committed to the magnetic stuff every few seconds 
or so.

It does help a lot, though, if documents have nice unique identifiers that 
you can use instead; then you can use the identifier and the last-modified 
time to decide whether or not to re-index.
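
Both ideas fit in a few lines. A sketch with the Lucene 1.4-era API - the "md5" field and the stamp-file layout are illustrative; docFreq() is a cheap way to ask "does any document carry this term?" without running a full query:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class IndexedChecks {
    // quick "is this checksum already indexed?"
    public static boolean alreadyIndexed(IndexReader reader, String md5)
            throws IOException {
        return reader.docFreq(new Term("md5", md5)) > 0;
    }

    // coarse restart pruning: anything modified before the stamp isn't a candidate
    public static boolean candidate(File f, File indexDir) {
        File stamp = new File(indexDir, "last-indexed.stamp");
        return f.lastModified() >= stamp.lastModified(); // lastModified() is 0 if absent
    }
}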

jch


Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 9:49 AM, Jerry Jalenak wrote:
Given Erik's response of 'don't put duplicate documents in the index', how can I accomplish this in the IndexWriter?
As John said - you'll have to come up with some way of knowing whether 
you should index or not.  For example, when dealing with filesystem 
files, the Ant <index> task (in the sandbox) checks the last modified date 
and only indexes new files.

A unique id on your data (primary key from a DB, URL from web 
pages, etc.) is generally what people use for this.
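
A sketch of that decision (Lucene 1.4-era API; "id" and "lastModified" are illustrative field names, with the timestamp stored as a plain long in a stored keyword field):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class UpToDateCheck {
    public static boolean needsIndexing(IndexSearcher searcher, String id,
                                        long fileModified) throws IOException {
        Hits hits = searcher.search(new TermQuery(new Term("id", id)));
        if (hits.length() == 0) {
            return true;                               // never indexed
        }
        long indexed = Long.parseLong(hits.doc(0).get("lastModified"));
        return fileModified > indexed;                 // changed since then?
    }
}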

Erik


RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Nice idea John - one I hadn't considered.  Once you have the checksum, do
you 'check' in the index first before storing the second document?  Or do
you filter on the query side?

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


-Original Message-
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 9:06 AM
To: Lucene Users List
Subject: Re: Duplicate Hits


Jerry Jalenak wrote:

>Given Erik's response of 'don't put duplicate documents in the index', how
>can I accomplish this in the IndexWriter?
>  
>
I was dealing with a similar requirement recently.   I eventually 
decided on storing the MD5 checksum of the document as a keyword.   It 
means reading it twice (once to calculate the checksum, once to index 
it), but it seems to do the trick.

jch




Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote:
Given Erik's response of 'don't put duplicate documents in the index', how
can I accomplish this in the IndexWriter?
 

I was dealing with a similar requirement recently.   I eventually 
decided on storing the MD5 checksum of the document as a keyword.   It 
means reading it twice (once to calculate the checksum, once to index 
it), but it seems to do the trick.
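
A sketch of that, assuming the Lucene 1.4-era API ("md5" and "contents" are illustrative field names) - the file really is read twice, once for the checksum and once for indexing:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.security.MessageDigest;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class Md5Document {
    public static Document build(File file) throws Exception {
        byte[] bytes = new byte[(int) file.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(file));
        in.readFully(bytes);                              // first read: checksum
        in.close();
        byte[] digest = MessageDigest.getInstance("MD5").digest(bytes);
        StringBuffer hex = new StringBuffer();
        for (int i = 0; i < digest.length; i++) {
            int b = digest[i] & 0xff;
            if (b < 0x10) hex.append('0');                // keep two hex digits per byte
            hex.append(Integer.toHexString(b));
        }
        Document doc = new Document();
        doc.add(Field.Keyword("md5", hex.toString()));    // exact-match duplicate key
        doc.add(Field.Text("contents", new FileReader(file))); // second read: indexed
        return doc;
    }
}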

jch


RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
OK, OK.  I should have seen that response coming  8-)

The documents I'm indexing are sent from a legacy system, and can be sent
multiple times - but I only want to keep the documents if something has
changed.  If the indexed fields match exactly, I don't want to index the
second (or third, fourth, etc.) document.  If the indexed fields have
changed, then I want to index the 'new' document, and keep it.

Given Erik's response of 'don't put duplicate documents in the index', how
can I accomplish this in the IndexWriter?

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 8:35 AM
To: Lucene Users List
Subject: Re: Duplicate Hits


On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote:
> Is there a way to eliminate duplicate hits being returned from the 
> index?

Sure, don't put duplicate documents in the index :)

Erik





Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote:
Is there a way to eliminate duplicate hits being returned from the 
index?
Sure, don't put duplicate documents in the index :)
Erik


Re: Duplicate hits using ParallelMultiSearcher

2005-01-24 Thread Jason Polites
Agreed on the "set of unique messages"; however, the problem I have is with 
the "count" of the Hits.  The Hits object may contain 100 results (for 
example), of which only 90 are unique.  Because I am paging through results 
10 at a time, I need to know the total count without loading each document. 
If I get a count of 100 but a Collection of only 90, my paging breaks.

After careful consideration, I have decided that the better approach is to 
create a separate "global" index in which all messages are stored.  This 
will not only resolve my duplication issue but should also scale better 
if/when there are several hundred or several thousand distinct indexes.

Thanks,
- JP
- Original Message - 
From: "PA" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Monday, January 24, 2005 10:43 PM
Subject: Re: Duplicate hits using ParallelMultiSearcher


On Jan 24, 2005, at 09:14, Jason Polites wrote:
I am aware of the Filter object however the unique identifier of my 
document is a field within the lucene document itself (messageid); and I 
am reluctant to access this field using the public API for every Hit as I 
fear it will have drastic performance implications.
Well... I don't see any way around that as you basically want to uniquely 
identify your messages based on their Message-ID.

That said, you don't need to do it during the search itself. You could 
simply perform your search as you do now and then create a set of unique 
messages while preserving Lucene Hits sort ordering for "relevance" 
purposes.

HTH.
Cheers
--
PA
http://alt.textdrive.com/


Re: Duplicate hits using ParallelMultiSearcher

2005-01-24 Thread PA
On Jan 24, 2005, at 09:14, Jason Polites wrote:
I am aware of the Filter object however the unique identifier of my 
document is a field within the lucene document itself (messageid); and 
I am reluctant to access this field using the public API for every Hit 
as I fear it will have drastic performance implications.
Well... I don't see any way around that as you basically want to 
uniquely identify your messages based on their Message-ID.

That said, you don't need to do it during the search itself. You could 
simply perform your search as you do now and then create a set of 
unique messages while preserving Lucene Hits sort ordering for 
"relevance" purpose.

HTH.
Cheers
--
PA
http://alt.textdrive.com/