Note that Solr (trunk, soon to be 1.4) has a duplicate detection
feature that may fit your needs. See http://wiki.apache.org/solr/Deduplication
(looks like the docs need updating to say 1.4 there) and http://issues.apache.org/jira/browse/SOLR-799
Erik
On Apr 7, 2009, at 11:25 AM, Veselin K wrote:
Thank you very much, Fergus.
I was considering implementing a database which would hold the path name
and an MD5 sum of each file.
Then, as part of Solr indexing, one could check the DB for the file's
path: if it is missing, index the file; if it exists, compare the MD5
sums and only re-index if they differ.
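A minimal sketch of that path/MD5 check, assuming Python and a local SQLite file as the database. The table name `indexed_files` and the helpers `open_store`, `changed_md5` and `record_indexed` are hypothetical names made up for illustration, not anything provided by Solr:

```python
import hashlib
import sqlite3


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_store(db_path):
    """Open (or create) the hypothetical path -> MD5 table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indexed_files "
        "(path TEXT PRIMARY KEY, md5 TEXT NOT NULL)"
    )
    return conn


def changed_md5(conn, path):
    """Return the file's MD5 if it is new or changed, else None."""
    current = file_md5(path)
    row = conn.execute(
        "SELECT md5 FROM indexed_files WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == current:
        return None  # same path, same content: skip indexing
    return current


def record_indexed(conn, path, md5):
    """Remember the MD5 of a file after it has been sent to Solr."""
    conn.execute(
        "INSERT OR REPLACE INTO indexed_files (path, md5) VALUES (?, ?)",
        (path, md5),
    )
    conn.commit()
```

The idea would be to call `changed_md5` before posting each file, post it only when a digest comes back, and call `record_indexed` once the post has succeeded, so that a failed submission is retried on the next scan.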
Regards,
Veselin K
On Tue, Apr 07, 2009 at 09:01:31AM +0100, Fergus McMenemie wrote:
Veselin,
Well, as far as Solr is concerned, there are two issues here:
1) To stop the same document ending up in the index twice, use the
document pathname as the unique ID. Then, if you do index it twice, the
previous index information will be discarded. Not very efficient, but it
may be tolerable. IMHO using the pathname as the unique ID is often best
practice.
2) To stop a document even being submitted to Solr, you need to
implement some middleware that either performs a search/lookup using the
document's pathname to see if it is already indexed or, after examining
timestamps, only submits documents which have changed since the last
folder scan.
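The timestamp variant of option 2 could be sketched roughly as follows, assuming Python. The names `changed_since`, `read_last_scan`, `scan_once`, the state file and the `submit` callback are all hypothetical; `submit` stands in for whatever actually posts a file to Solr (for example a wrapper around the curl command):

```python
import os
import time


def changed_since(folder, last_scan_time):
    """Return regular files in `folder` modified after `last_scan_time`
    (seconds since the epoch). Subfolders are not descended into."""
    changed = []
    for entry in os.scandir(folder):
        if entry.is_file() and entry.stat().st_mtime > last_scan_time:
            changed.append(entry.path)
    return sorted(changed)


def read_last_scan(state_file):
    """Return the time of the previous scan, or 0.0 if there was none."""
    try:
        with open(state_file) as handle:
            return float(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return 0.0


def scan_once(folder, state_file, submit):
    """Submit every file changed since the last scan, then save the scan time."""
    # Take the time before scanning so files modified during the scan
    # are picked up again next time rather than missed.
    started = time.time()
    submitted = []
    for path in changed_since(folder, read_last_scan(state_file)):
        submit(path)
        submitted.append(path)
    with open(state_file, "w") as handle:
        handle.write(repr(started))
    return submitted
```

The state file should live outside the scanned folder, otherwise it would be submitted as a document itself. This relies on modification times being trustworthy; a file copied in with an old timestamp preserved would be skipped, which is where a checksum comparison is more robust.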
Fergus.
Hello Paul,
I'm indexing with "curl http://localhost... -F myfi...@file.pdf"
Regards,
Veselin K
On Mon, Apr 06, 2009 at 02:56:20PM +0530, Noble Paul wrote:
How are you indexing?
On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev
<vese...@campbell-lange.net> wrote:
Hello,
Apologies for the basic question.
How can I avoid double indexing files?
If all my files are in one folder which is scanned frequently, is there
a Solr feature for checking and skipping a file if it has already been
indexed and has not changed since?
Thank you.
Regards,
Veselin K
--
--Noble Paul
--
===============================================================
Fergus McMenemie Email:fer...@twig.me.uk
Techmore Ltd Phone:(UK) 07721 376021
Unix/Mac/Intranets Analyst Programmer
===============================================================