Hello everybody,

I am currently indexing Wikipedia dumps to build an index for versioned
document collections. So far everything is working fine, but I never
expected single Wikipedia articles to reach a size of around 2 GB! One
article, for example, has 20,000 versions with an average length of 60,000
characters each (that is roughly 1.2 billion characters, so more than 2 GB
just for the raw character data in memory). This means I need a heap of
around 4 GB to perform the indexing, and I would like to decrease my memory
consumption ;).

At the moment I load every Wikipedia article, with all of its versions,
completely into memory. Then I collect some statistical data about the
article, namely extra information about term occurrences, which is written
into the index as payloads. These statistics are produced in a separate
tokenization run of my own, which happens before the document is written to
the index. This means I am analyzing my documents twice! :( I know there is
a CachingTokenFilter, but I haven't figured out how and where to use it
exactly (I tried it inside my Analyzer, but stream.reset() does not seem to
work there). Does somebody have a nice example?
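
What I had in mind is roughly the sketch below. This is only how I imagine
it could work, assuming the 3.x attribute API (CharTermAttribute; older
releases use TermAttribute instead); analyzer and writer are my existing
Analyzer and IndexWriter, and collectStatistics() stands for my own
statistics/payload code:

    import java.io.StringReader;

    import org.apache.lucene.analysis.CachingTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // ...
    TokenStream stream =
        analyzer.tokenStream("content", new StringReader(versionText));
    CachingTokenFilter cachedStream = new CachingTokenFilter(stream);
    CharTermAttribute termAtt =
        cachedStream.addAttribute(CharTermAttribute.class);

    // First pass: consume the tokens once to collect my statistics.
    // CachingTokenFilter records every token state while it is consumed.
    cachedStream.reset();
    while (cachedStream.incrementToken()) {
        collectStatistics(termAtt.toString()); // my own statistics/payload code
    }

    // Second pass: rewind the cached tokens and hand them to the field, so
    // the IndexWriter replays the cache instead of re-analyzing the text.
    cachedStream.reset();
    Document doc = new Document();
    doc.add(new Field("content", cachedStream));
    writer.addDocument(doc);

Is this the intended way to use CachingTokenFilter, or am I calling reset()
in the wrong place?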

1) Can I somehow avoid loading a complete article into memory just to get
my statistics?
2) Is it possible to index large files without loading them completely into
a field, e.g. by passing a Reader (see the sketch below this list)?
3) How can I avoid parsing an article twice?
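
Regarding 2), what I am thinking of is handing the field a Reader instead
of a String, so the content is streamed during tokenization rather than
held in memory as one huge String. A minimal sketch (the file name is only
a placeholder for wherever a single version's text would live):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.Reader;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Stream one version's text from disk instead of building a huge String.
    // "version-12345.txt" is only a placeholder path for this sketch.
    Reader reader = new BufferedReader(new FileReader("version-12345.txt"));
    Document doc = new Document();
    // Field(String, Reader): the content is tokenized from the Reader and
    // indexed, but not stored, so it never has to sit in memory as a String.
    doc.add(new Field("content", reader));
    writer.addDocument(doc);

Would that be the right direction, or is there a better way for data of
this size?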

Best regards 
Alex


