Hello everybody, I am currently indexing Wikipedia dumps and creating an index for versioned document collections. So far everything is working fine, but I never thought that a single Wikipedia article would reach a size of around 2 GB! One article, for example, has 20000 versions with an average length of 60000 characters each (HUGE in memory!). This means I need a heap space of around 4 GB to perform indexing, and I would like to decrease my memory consumption ;).
At the moment I load every Wikipedia article completely into memory, including all of its versions. Then I collect some statistical data about the article so that I can store extra information about term occurrences, which is written into the index as payloads. The statistics are gathered in a separate tokenization pass that runs before the document is written to the index. This means I am analyzing my documents twice! :( I know there is a CachingTokenFilter, but I haven't found out how and where to use it exactly (I tried it in my Analyzer, but stream.reset() does not seem to work). Does somebody have a nice example?

1) Can I somehow avoid loading one complete article to get my statistics?
2) Is it possible to index large files without loading them completely into a field?
3) How can I avoid parsing an article twice?

Best regards
Alex
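
P.S. To make question 1 more concrete, here is a rough sketch of the direction I was imagining: stream the dump with StAX (from the JDK) and gather the statistics one version at a time, so that only a single version's text is held in memory instead of the whole article. This is only an illustration, not my real code. The class name RevisionStreamSketch and the method countTerms are made up, the <revision>/<text> element names are what I assume from the dump layout, and the split on non-word characters is just a stand-in for the real analyzer.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: names and tokenization are illustrative assumptions.
public class RevisionStreamSketch {

    /**
     * Counts term occurrences over all <text> elements in the stream.
     * Only the text of the revision currently being read is held in memory.
     */
    static Map<String, Integer> countTerms(Reader xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        Map<String, Integer> counts = new HashMap<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "text".equals(reader.getLocalName())) {
                    // Reads the content of this one <text> element only.
                    String revisionText = reader.getElementText();
                    // Crude tokenization, standing in for a real analyzer.
                    for (String term : revisionText.toLowerCase().split("\\W+")) {
                        if (!term.isEmpty()) {
                            counts.merge(term, 1, Integer::sum);
                        }
                    }
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny stand-in for one article with two versions.
        String xml = "<page><title>Demo</title>"
                + "<revision><id>1</id><text>Lucene indexes text</text></revision>"
                + "<revision><id>2</id><text>Lucene indexes large text</text></revision>"
                + "</page>";
        System.out.println(countTerms(new StringReader(xml)));
    }
}
```

Each version is still read completely here, but that is about 60000 characters rather than all 20000 versions at once. What this does not solve is questions 2 and 3, since the text would still have to be tokenized again when the document is actually indexed.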