Unfortunately, neither method went through: I am getting the memory error even
when reading the directory contents.

Now I am thinking this: what if I split the 4.5 million files into directories
of 100,000 files each (or fewer, depending on the Java error), index each of
them separately, and then merge those indexes (if that is possible)?

Any suggestions?

On Wed, Oct 20, 2010 at 5:47 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> My first guess is that you're accumulating too many documents in memory
> before the flush gets triggered. The quick-n-dirty way to test this is
> to do an IndexWriter.flush after every addDocument. This will slow
> down indexing, but it will also tell you whether this is the problem, and
> you can look for more elegant solutions...
>
> You can also get some stats by IndexWriter.getRamBufferSizeMB, and
> can force automatic flushes given a particular ram buffer size via
> IndexWriter.setRamBufferSizeMB.
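If I follow that correctly, in my createIndex() loop it would look roughly like
the sketch below (assuming Lucene 3.0.x, where setRAMBufferSizeMB(double) and
commit() are available; the 64 MB buffer and the 10,000-document commit interval
are just values I would experiment with):

    indexWriter.setRAMBufferSizeMB(64.0); // flush buffered documents to disk at ~64 MB

    int count = 0;
    for (File file : files) {
        Document document = new Document();
        document.add(new Field(FIELD_CONTENTS, new FileReader(file)));
        indexWriter.addDocument(document);
        if (++count % 10000 == 0) {
            indexWriter.commit(); // periodically push pending additions to disk
        }
    }
    indexWriter.commit(); // flush whatever remains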
>
> One thing I'd be interested in is how big your files are. Could it be that
> you're trying to process a humongous file when it blows up?
>
> And if none of that helps, please post your stack trace.
>
> Best
> Erick
>
> On Wed, Oct 20, 2010 at 2:45 PM, Sahin Buyrukbilen <
> sahin.buyrukbi...@gmail.com> wrote:
>
> > With the different parameters I still got the same error. My code is very
> > simple; indeed, I am only concerned with creating the index. I will then
> > do some private information retrieval experiments on the inverted index
> > file, which I create from the information extracted from the index. That
> > is why I didn't go into optimization until now: the database I had was
> > very small compared to 4.5 million files.
> > My code is as follows:
> >
> > public static void createIndex() throws CorruptIndexException,
> >         LockObtainFailedException, IOException {
> >     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
> >     Directory indexDir = FSDirectory.open(new File("/media/work/WIKI/indexes/"));
> >     boolean recreateIndexIfExists = true;
> >     IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
> >             recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);
> >     indexWriter.setUseCompoundFile(false);
> >     File dir = new File(FILES_TO_INDEX_DIRECTORY);
> >     File[] files = dir.listFiles();
> >     for (File file : files) {
> >         Document document = new Document();
> >
> >         //String path = file.getCanonicalPath();
> >         //document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.NOT_ANALYZED));
> >
> >         Reader reader = new FileReader(file);
> >         document.add(new Field(FIELD_CONTENTS, reader));
> >
> >         indexWriter.addDocument(document);
> >     }
> >     indexWriter.optimize();
> >     indexWriter.close();
> > }
> >
> >
> > On Wed, Oct 20, 2010 at 2:39 PM, Qi Li <aler...@gmail.com> wrote:
> >
> > > 1. What difference did you see when you used different VM parameters?
> > > 2. What merge policy and optimization strategy did you use?
> > > 3. How did you use commit or flush?
> > >
> > > Qi
> > >
> > > On Wed, Oct 20, 2010 at 2:05 PM, Sahin Buyrukbilen <
> > > sahin.buyrukbi...@gmail.com> wrote:
> > >
> > > > Thank you so much for this info. It looks pretty complicated to me, but
> > > > I will try.
> > > >
> > > >
> > > >
> > > > > On Wed, Oct 20, 2010 at 1:18 AM, Johnbin Wang <johnbin.w...@gmail.com> wrote:
> > > >
> > > > > You can start a fixedThreadPool to index all these files in a
> > > > > multithreaded way. Each thread executes an indexing task that indexes
> > > > > a part of the files. Within each task, after every 10,000 files you
> > > > > need to call the indexWriter.commit() method to flush the pending add
> > > > > operations to the index files on disk.
> > > > >
> > > > > If you need to index all these files into only one index, you need to
> > > > > hold a single indexWriter instance shared among all the indexing
> > > > > threads.
> > > > >
> > > > > Hope it's helpful.
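My rough understanding of that thread-pool idea is sketched below -- the pool
size of 4, the splitting of the file list into 'batches', and the ~10,000-file
batch size are assumptions on my part; only the single shared IndexWriter and
the per-batch commit come from the suggestion above:

    final IndexWriter sharedWriter = indexWriter;            // one writer shared by every task
    ExecutorService pool = Executors.newFixedThreadPool(4);  // assumed pool size
    for (final File[] batch : batches) {                     // 'batches': the file list split into chunks of ~10,000
        pool.submit(new Runnable() {
            public void run() {
                try {
                    for (File f : batch) {
                        Document document = new Document();
                        document.add(new Field(FIELD_CONTENTS, new FileReader(f)));
                        sharedWriter.addDocument(document);  // IndexWriter handles concurrent adds
                    }
                    sharedWriter.commit();                   // flush this batch to disk
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });
    }
    pool.shutdown();
    try {
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    sharedWriter.close();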
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Oct 20, 2010 at 1:05 PM, Sahin Buyrukbilen <
> > > > > sahin.buyrukbi...@gmail.com> wrote:
> > > > >
> > > > > > Thank you, Johnbin. Do you know which parameter I have to play with?
> > > > > >
> > > > > > On Wed, Oct 20, 2010 at 12:59 AM, Johnbin Wang <johnbin.w...@gmail.com> wrote:
> > > > > >
> > > > > > > I think you can write the index to disk once every 10,000 files,
> > > > > > > or fewer, have been read.
> > > > > > >
> > > > > > > On Wed, Oct 20, 2010 at 12:11 PM, Sahin Buyrukbilen <
> > > > > > > sahin.buyrukbi...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I have to index about 4.5 million txt files. When I run my
> > > > > > > > indexing application through Eclipse, I get this error: "Exception
> > > > > > > > in thread "main" java.lang.OutOfMemoryError: Java heap space"
> > > > > > > >
> > > > > > > > eclipse -vmargs -Xmx2000m -Xss8192k
> > > > > > > >
> > > > > > > > eclipse -vmargs -Xms40M -Xmx2G
> > > > > > > >
> > > > > > > > I tried running Eclipse with the above memory parameters, but I
> > > > > > > > still got the same error. My machine has an AMD X2 64-bit 2 GHz
> > > > > > > > processor and runs Ubuntu 10.04 LTS 64-bit with java-6-openjdk.
> > > > > > > >
> > > > > > > > Does anybody have a suggestion?
> > > > > > > >
> > > > > > > > Thank you.
> > > > > > > > Sahin.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > cheers,
> > > > > > > Johnbin Wang
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > cheers,
> > > > > Johnbin Wang
> > > > >
> > > >
> > >
> >
>
