My first guess is that you're accumulating too many documents before the flush gets triggered. The quick-n-dirty way to test this is to call IndexWriter.commit() after every addDocument(). This will slow down indexing, but it will also tell you whether this is the problem, and then you can look for more elegant solutions...
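Something like this, as a rough diagnostic sketch (the method name and the "contents" field are placeholders, not taken from your code):

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Diagnostic only: committing after every addDocument() keeps the RAM
// buffer close to empty. If the OOM disappears, buffered documents (or a
// single humongous file) were the problem.
static void indexWithPerDocumentCommit(IndexWriter writer, File dir) throws IOException {
    for (File file : dir.listFiles()) {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(file))); // tokenized, not stored
        writer.addDocument(doc);
        writer.commit(); // very slow; for diagnosis, not production
    }
}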
You can also get some stats via IndexWriter.getRAMBufferSizeMB(), and you can force automatic flushes at a particular RAM buffer size via IndexWriter.setRAMBufferSizeMB().
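For instance (a sketch against the 3.0 API, reusing the indexWriter from your code below; 64 is just an example value, the default is 16):

// Bound the RAM buffer explicitly: Lucene flushes automatically once the
// buffered documents exceed this many megabytes, regardless of doc count.
indexWriter.setRAMBufferSizeMB(64.0);

// See what the writer is currently configured with:
System.out.println("RAM buffer MB: " + indexWriter.getRAMBufferSizeMB());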
One thing I'd be interested in is how big your files are. Could it be that you're trying to process a humongous file when it blows up? And if none of that helps, please post your stack trace.

Best
Erick

On Wed, Oct 20, 2010 at 2:45 PM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:

> With the different parameters I still got the same error. My code is very
> simple; indeed, I am only concerned with creating the index, and then I
> will do some private information retrieval experiments on the inverted
> index file, which I created with the information extracted from the index.
> That is why I didn't go over optimization until now. The database size I
> had was very small compared to 4.5 million.
> My code is as follows:
>
> public static void createIndex() throws CorruptIndexException,
>         LockObtainFailedException, IOException {
>     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
>     Directory indexDir = FSDirectory.open(new File("/media/work/WIKI/indexes/"));
>     boolean recreateIndexIfExists = true;
>     IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
>             recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);
>     indexWriter.setUseCompoundFile(false);
>     File dir = new File(FILES_TO_INDEX_DIRECTORY);
>     File[] files = dir.listFiles();
>     for (File file : files) {
>         Document document = new Document();
>
>         //String path = file.getCanonicalPath();
>         //document.add(new Field(FIELD_PATH, path, Field.Store.YES,
>         //        Field.Index.NOT_ANALYZED));
>
>         Reader reader = new FileReader(file);
>         document.add(new Field(FIELD_CONTENTS, reader));
>
>         indexWriter.addDocument(document);
>     }
>     indexWriter.optimize();
>     indexWriter.close();
> }
>
> On Wed, Oct 20, 2010 at 2:39 PM, Qi Li <aler...@gmail.com> wrote:
>
> > 1. What is the difference when you used different VM parameters?
> > 2. What merge policy and optimization strategy did you use?
> > 3. How did you use commit or flush?
> >
> > Qi
> >
> > On Wed, Oct 20, 2010 at 2:05 PM, Sahin Buyrukbilen <
> > sahin.buyrukbi...@gmail.com> wrote:
> >
> > > Thank you so much for this info. It looks pretty complicated for me,
> > > but I will try.
> > >
> > > On Wed, Oct 20, 2010 at 1:18 AM, Johnbin Wang <johnbin.w...@gmail.com> wrote:
> > >
> > > > You can start a fixed thread pool to index all these files in a
> > > > multithreaded way. Every thread executes an indexing task which
> > > > indexes a part of all the files. In the indexing task, after every
> > > > 10,000 files, you need to execute the indexWriter.commit() method
> > > > to flush all the index add operations to the disk file.
> > > >
> > > > If you need to index all these files into only one index, you need
> > > > to hold only one IndexWriter instance among all the indexing
> > > > threads.
> > > >
> > > > Hope it's helpful.
> > > >
> > > > On Wed, Oct 20, 2010 at 1:05 PM, Sahin Buyrukbilen <
> > > > sahin.buyrukbi...@gmail.com> wrote:
> > > >
> > > > > Thank you Johnbin,
> > > > > Do you know which parameter I have to play with?
> > > > >
> > > > > On Wed, Oct 20, 2010 at 12:59 AM, Johnbin Wang <
> > > > > johnbin.w...@gmail.com> wrote:
> > > > >
> > > > > > I think you can write the index file once every 10,000 files or
> > > > > > fewer have been read.
> > > > > >
> > > > > > On Wed, Oct 20, 2010 at 12:11 PM, Sahin Buyrukbilen <
> > > > > > sahin.buyrukbi...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I have to index about 4.5 million txt files. When I run my
> > > > > > > indexing application through Eclipse, I get this error:
> > > > > > > "Exception in thread "main" java.lang.OutOfMemoryError: Java
> > > > > > > heap space"
> > > > > > >
> > > > > > > eclipse -vmargs -Xmx2000m -Xss8192k
> > > > > > > eclipse -vmargs -Xms40M -Xmx2G
> > > > > > >
> > > > > > > I tried running Eclipse with the above memory parameters, but
> > > > > > > still had the same error. The architecture of my computer is
> > > > > > > an AMD X2 64-bit 2 GHz processor, Ubuntu 10.04 LTS 64-bit,
> > > > > > > java-6-openjdk.
> > > > > > >
> > > > > > > Does anybody have a suggestion?
> > > > > > >
> > > > > > > Thank you.
> > > > > > > Sahin.
> > > > > >
> > > > > > --
> > > > > > cheers,
> > > > > > Johnbin Wang
> > > >
> > > > --
> > > > cheers,
> > > > Johnbin Wang
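Below is a minimal sketch of the fixed-thread-pool approach Johnbin describes above, assuming Lucene 3.0 and a single shared IndexWriter; the pool size, the 10,000-document batch, and the "contents" field name are illustrative assumptions, not details from the thread:

import java.io.File;
import java.io.FileReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// One shared IndexWriter, a fixed pool of workers, and a commit() every
// 10,000 documents so buffered adds are flushed to disk periodically.
static void indexInParallel(final IndexWriter writer, File[] files) throws InterruptedException {
    final AtomicInteger counter = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is arbitrary here
    for (final File file : files) {
        pool.submit(new Runnable() {
            public void run() {
                try {
                    Document doc = new Document();
                    doc.add(new Field("contents", new FileReader(file)));
                    writer.addDocument(doc);
                    if (counter.incrementAndGet() % 10000 == 0) {
                        writer.commit(); // flush this batch to disk
                    }
                } catch (Exception e) {
                    e.printStackTrace(); // keep the sketch simple
                }
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
}

IndexWriter.addDocument() is safe to call concurrently, which is why one writer instance is shared across all the worker threads rather than one writer per thread.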