By the way, the file sizes are not big, mostly around 1 kB. I am working on Wikipedia articles in txt format.
On Wed, Oct 20, 2010 at 11:01 PM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:

> Unfortunately, neither method went through. I am getting the memory
> error even when reading the directory contents.
>
> Now I am thinking this: what if I split the 4.5 million files into
> directories of 100,000 files each (or fewer, depending on the Java
> error), index each of them separately, and merge those indexes (if
> possible)?
>
> Any suggestions?
>
> On Wed, Oct 20, 2010 at 5:47 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> My first guess is that you're accumulating too many documents in memory
>> before the flush gets triggered. The quick-and-dirty way to test this
>> is to do an IndexWriter.flush after every addDocument. This will slow
>> down indexing, but it will also tell you whether this is the problem,
>> and then you can look for more elegant solutions...
>>
>> You can also get some stats from IndexWriter.getRAMBufferSizeMB, and
>> you can force automatic flushes at a given RAM buffer size via
>> IndexWriter.setRAMBufferSizeMB.
>>
>> One thing I'd be interested in is how big your files are. Might you be
>> trying to process a humongous file when it blows up?
>>
>> And if none of that helps, please post your stack trace.
>>
>> Best
>> Erick
>>
>> On Wed, Oct 20, 2010 at 2:45 PM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:
>>
>>> With the different parameters I still got the same error. My code is
>>> very simple; indeed, I am only concerned with creating the index, and
>>> then I will do some private information retrieval experiments on the
>>> inverted index file, which I create from the information extracted
>>> from the index. That is why I didn't go into optimization until now.
>>> The database I had before was very small compared to 4.5 million
>>> files. My code is as follows:
>>>
>>> public static void createIndex() throws CorruptIndexException,
>>>         LockObtainFailedException, IOException {
>>>     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
>>>     Directory indexDir =
>>>             FSDirectory.open(new File("/media/work/WIKI/indexes/"));
>>>     boolean recreateIndexIfExists = true;
>>>     IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
>>>             recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);
>>>     indexWriter.setUseCompoundFile(false);
>>>     File dir = new File(FILES_TO_INDEX_DIRECTORY);
>>>     File[] files = dir.listFiles();
>>>     for (File file : files) {
>>>         Document document = new Document();
>>>
>>>         // String path = file.getCanonicalPath();
>>>         // document.add(new Field(FIELD_PATH, path, Field.Store.YES,
>>>         //         Field.Index.NOT_ANALYZED));
>>>
>>>         Reader reader = new FileReader(file);
>>>         document.add(new Field(FIELD_CONTENTS, reader));
>>>
>>>         indexWriter.addDocument(document);
>>>     }
>>>     indexWriter.optimize();
>>>     indexWriter.close();
>>> }
>>>
>>> On Wed, Oct 20, 2010 at 2:39 PM, Qi Li <aler...@gmail.com> wrote:
>>>
>>>> 1. What difference did you see when you used the different VM
>>>>    parameters?
>>>> 2. What merge policy and optimization strategy did you use?
>>>> 3. How did you use commit or flush?
>>>>
>>>> Qi
>>>>
>>>> On Wed, Oct 20, 2010 at 2:05 PM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:
>>>>
>>>>> Thank you so much for this info. It looks pretty complicated for me,
>>>>> but I will try.
>>>>>
>>>>> On Wed, Oct 20, 2010 at 1:18 AM, Johnbin Wang <johnbin.w...@gmail.com> wrote:
>>>>>
>>>>>> You can start a fixed thread pool to index all these files in a
>>>>>> multithreaded way. Each thread executes an indexing task that
>>>>>> indexes a part of all the files. In each indexing task, after every
>>>>>> 10,000 files you need to execute the indexWriter.commit() method to
>>>>>> flush all the pending index-add operations to the disk file.
>>>>>>
>>>>>> If you need to index all these files into only one index, you need
>>>>>> to hold only one IndexWriter instance shared among all the indexing
>>>>>> threads.
>>>>>>
>>>>>> Hope it's helpful.
>>>>>>
>>>>>> On Wed, Oct 20, 2010 at 1:05 PM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you Johnbin,
>>>>>>> do you know which parameter I have to play with?
>>>>>>>
>>>>>>> On Wed, Oct 20, 2010 at 12:59 AM, Johnbin Wang <johnbin.w...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think you can write the index out once every 10,000 files (or
>>>>>>>> fewer) have been read.
>>>>>>>>
>>>>>>>> On Wed, Oct 20, 2010 at 12:11 PM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I have to index about 4.5 million txt files. When I run my
>>>>>>>>> indexing application through Eclipse, I get this error:
>>>>>>>>> "Exception in thread "main" java.lang.OutOfMemoryError: Java heap
>>>>>>>>> space"
>>>>>>>>>
>>>>>>>>> eclipse -vmargs -Xmx2000m -Xss8192k
>>>>>>>>>
>>>>>>>>> eclipse -vmargs -Xms40M -Xmx2G
>>>>>>>>>
>>>>>>>>> I tried running Eclipse with the above memory parameters, but I
>>>>>>>>> still had the same error. My machine is an AMD x2 64-bit 2 GHz
>>>>>>>>> processor running Ubuntu 10.04 LTS 64-bit with java-6-openjdk.
>>>>>>>>>
>>>>>>>>> Does anybody have a suggestion?
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>> Sahin.
>>>>>>>>
>>>>>>>> --
>>>>>>>> cheers,
>>>>>>>> Johnbin Wang
>>>>>>
>>>>>> --
>>>>>> cheers,
>>>>>> Johnbin Wang
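
For reference, a minimal sketch of the createIndex method with Erick's
flush/RAM-buffer suggestion applied. It assumes the Lucene 3.0 API already
used in the thread and reuses the FIELD_CONTENTS and FILES_TO_INDEX_DIRECTORY
constants from the original snippet; the 16 MB buffer size is only an example
value, and the explicit reader.close() is an addition the original code lacked.

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public static void createIndex() throws IOException {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    Directory indexDir =
            FSDirectory.open(new File("/media/work/WIKI/indexes/"));
    IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
            true, IndexWriter.MaxFieldLength.UNLIMITED);
    indexWriter.setUseCompoundFile(false);
    // Flush buffered documents to disk once they exceed 16 MB of RAM
    // instead of letting them accumulate in the heap (example value).
    indexWriter.setRAMBufferSizeMB(16.0);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    for (File file : dir.listFiles()) {
        Document document = new Document();
        Reader reader = new FileReader(file);
        document.add(new Field(FIELD_CONTENTS, reader));
        indexWriter.addDocument(document);
        // Close the reader explicitly; with 4.5 million files, leaving
        // file handles for the GC to release is asking for trouble.
        reader.close();
    }
    indexWriter.optimize();
    indexWriter.close();
}

Note that dir.listFiles() itself materializes one File object per directory
entry, which with 4.5 million files is a plausible source of the "memory
error even at reading the directory contents"; splitting into subdirectories,
as proposed in the thread, sidesteps that.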
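
And a rough sketch of the split-and-merge idea: index each subdirectory into
its own index, then combine them into one. The method and parameter names are
made up for illustration. Note that in Lucene 3.0 the Directory-based merge
method is named addIndexesNoOptimize; later versions renamed it to
addIndexes(Directory...).

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Merge several part indexes (e.g. one per 100,000-file subdirectory)
// into a single index at mergedDir.
public static void mergeIndexes(File mergedDir, File[] partDirs)
        throws IOException {
    IndexWriter writer = new IndexWriter(FSDirectory.open(mergedDir),
            new StandardAnalyzer(Version.LUCENE_30),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
    Directory[] parts = new Directory[partDirs.length];
    for (int i = 0; i < partDirs.length; i++) {
        parts[i] = FSDirectory.open(partDirs[i]);
    }
    writer.addIndexesNoOptimize(parts);  // addIndexes(Directory...) in 3.1+
    writer.optimize();
    writer.close();
}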
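
Finally, a sketch of Johnbin's fixed-thread-pool suggestion: one IndexWriter
shared by all threads (addDocument is safe to call concurrently), with a
commit after every 10,000 documents. FIELD_CONTENTS is again the constant
from the original snippet, and the error handling is deliberately minimal.

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public static void indexInParallel(final IndexWriter writer, File[] files,
        int nThreads) throws InterruptedException {
    final AtomicInteger indexed = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    for (final File file : files) {
        pool.submit(new Runnable() {
            public void run() {
                try {
                    Document doc = new Document();
                    Reader reader = new FileReader(file);
                    doc.add(new Field(FIELD_CONTENTS, reader));
                    writer.addDocument(doc);  // thread-safe in Lucene
                    reader.close();
                    // Flush pending adds to disk every 10,000 documents.
                    if (indexed.incrementAndGet() % 10000 == 0) {
                        writer.commit();
                    }
                } catch (Exception e) {
                    e.printStackTrace();  // minimal handling for the sketch
                }
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
}

The caller still has to commit and close the writer afterwards; and as
Johnbin notes, the single shared writer only matters if everything must end
up in one index.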