Re: checking existing docs before indexing

Neeraj Gupta Thu, 12 Jul 2007 23:44:50 -0700

Yes, you need to store one untokenized field which will identify the 
exact document you want to update.


You can also check whether any such document exists in your index by 
using the deleteDocuments() method of IndexReader. It returns the number 
of documents deleted for the Term provided. Note that this removes the 
matching documents rather than merely counting them.

Cheers,
Neeraj




From: "Samuel LEMOINE" <[EMAIL PROTECTED]>
Date: 07/12/2007 09:38 PM
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Subject: Re: checking existing docs before indexing






Neeraj Gupta wrote:
> Hi,
>
> You can use the updateDocument() method of IndexWriter to update any
> existing document. It searches for documents matching the Term and, if
> any exist, deletes them. It then adds the provided document to the
> index, whether or not a matching document existed.
>
> Cheers,
> Neeraj
>
>
>
>
> From: "Heba Farouk" <[EMAIL PROTECTED]>
> Date: 07/12/2007 06:57 PM
> Reply-To: java-user@lucene.apache.org, [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: checking existing docs before indexing
>
>
>
>
>
>
> Hello,
> I'm a newbie to the Lucene world and I hope you can help me.
> Is there any option in IndexWriter to check whether a document already
> exists before adding it to the index, or should I maintain that
> manually?
>
> thanks in advance
>
>
> Yours 
>
> Heba
>
> 
>
> 
> The information contained in this e-mail and any accompanying documents
> may contain information that is confidential or otherwise protected from
> disclosure. If you are not the intended recipient of this message, or if
> this message has been addressed to you in error, please immediately
> alert the sender by reply e-mail and then delete this message, including
> any attachments. Any dissemination, distribution or other use of the
> contents of this message by anyone other than the intended recipient
> is strictly prohibited.
>
>
>
> 
I also used updateDocument() for this, but I ran into the issue that it 
takes a Term as its argument, so other documents matching that term may 
be deleted by this method as well. To avoid this, my conclusion was to 
add a stored, untokenized field that serves as a key: each document is 
identified by a string that distinguishes it from all the others (such 
as a URL or file path).
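
To make the risk concrete, here is a minimal sketch in plain Java (no 
Lucene involved) of the "delete everything matching the term, then add" 
behaviour described above. ToyIndex, doc(), count() and update() are 
made-up names for illustration only, and the in-memory list is an 
assumption standing in for a real index; it is not how Lucene is 
implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of "delete by term, then add". Not Lucene code.
class ToyIndex {
    // Each document maps a field name to one exact (untokenized) value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Hypothetical helper: builds a document with a "path" key and a "type" field.
    static Map<String, String> doc(String path, String type) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("path", path);
        d.put("type", type);
        return d;
    }

    // Number of documents whose field holds exactly this value.
    int count(String field, String value) {
        int n = 0;
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                n++;
            }
        }
        return n;
    }

    // Removes every document matching field=value, then adds newDoc.
    // Returns how many documents were removed.
    int update(String field, String value, Map<String, String> newDoc) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        int removed = before - docs.size();
        docs.add(newDoc);
        return removed;
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.update("path", "/docs/a.txt", doc("/docs/a.txt", "text"));
        index.update("path", "/docs/b.txt", doc("/docs/b.txt", "text"));

        // Unique key: only the old copy of a.txt is replaced.
        int byKey = index.update("path", "/docs/a.txt", doc("/docs/a.txt", "text"));
        System.out.println(byKey + " removed, " + index.size() + " total");  // 1 removed, 2 total

        // Non-unique term: both existing documents have type=text and are lost.
        int byType = index.update("type", "text", doc("/docs/c.txt", "text"));
        System.out.println(byType + " removed, " + index.size() + " total"); // 2 removed, 1 total
    }
}
```

With the unique "path" key, an update touches exactly one document; with 
a term shared by several documents, the same call silently removes all 
of them.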

Sam


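PS: Here is the sample code I wrote during my internship; it is quite 
simple to grasp.
(I removed the original comments, as they were in French.)
The method that could interest you is the addDocument(String) one.
Hope it helps.

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.log4j.Logger; // assumption: the Logger used here is log4j's
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class Indexer {

    private static final Logger theLogger = Logger.getLogger(Indexer.class);

    private Analyzer theAnalyzer;
    private IndexWriter theIndexWriter;
    private Reader theReaderContent;
    private String theIndexPath;

    public Indexer(String anIndexPath) {
        theAnalyzer = new StandardAnalyzer();
        theIndexPath = anIndexPath;
    }

    // Indexes one file, replacing any document already stored under the
    // same "path" key. Opening a writer per file is simple but slow.
    public void addDocument(String aFileName) {

        try {
            theIndexWriter = new IndexWriter(theIndexPath, theAnalyzer);
        } catch (IOException e) {
            theLogger.error(e);
            return; // without a writer there is nothing to do
        }

        try {
            theReaderContent = new FileReader(aFileName);

            Document doc = new Document();
            // Note: a bare StandardTokenizer stream skips the analyzer's
            // filters (such as lowercasing).
            TokenStream tokenStreamContent = new StandardTokenizer(theReaderContent);
            // Stored, untokenized key that uniquely identifies the document.
            Field docPath = new Field("path", aFileName, Field.Store.YES,
                    Field.Index.UN_TOKENIZED);
            Field docContent = new Field("content", tokenStreamContent);
            doc.add(docPath);
            doc.add(docContent);

            // updateDocument() rather than addDocument(), so that re-indexing
            // a file replaces its previous entry instead of duplicating it.
            theIndexWriter.updateDocument(new Term("path", aFileName), doc);
        } catch (IOException e) { // also covers FileNotFoundException
            theLogger.error(e);
        } finally {
            closeWriter();
        }
    }

    // Optimizes the index (despite the method's name, nothing is sorted).
    public void sort() {
        try {
            theIndexWriter = new IndexWriter(theIndexPath, theAnalyzer);
        } catch (IOException e) {
            theLogger.error(e);
            return;
        }
        try {
            theIndexWriter.optimize();
        } catch (IOException e) {
            theLogger.error(e);
        } finally {
            closeWriter();
        }
    }

    // Indexes every entry directly inside the given directory.
    public void addAllDocuments(String aDirectoryPath) {
        File directory = new File(aDirectoryPath);
        File[] subDirectory = directory.listFiles();
        if (subDirectory == null) { // not a directory, or not readable
            theLogger.error("Cannot list " + aDirectoryPath);
            return;
        }
        for (File file : subDirectory) {
            addDocument(file.getPath());
        }
        System.out.println(subDirectory.length + " files have been indexed.");
        this.sort();
    }

    // Closes the current writer, logging any failure.
    private void closeWriter() {
        try {
            theIndexWriter.close();
        } catch (IOException e) {
            theLogger.error(e);
        }
    }
}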


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




 
