I tried the notion of a temporary RAMDirectory already, and the documents parse unacceptably slowly: 8-10 seconds. Great minds think alike. Believe it or not, I have to deal with a 7,500-page book that details Civil War records of Michigan volunteers. The XML form is 24M, probably 16M of text exclusive of tags.
About your second suggestion, I'm trying to figure out how to do essentially that. But a word count isn't very straightforward with stop words and dirty ASCII (OCR) data. I'm trying to hook that process into the tokenizer so the counts have a better chance of being accurate, which is the essence of the scheme. I'd far rather get the term offset from the same place the indexer will than try to do a similar-but-not-quite-identical algorithm that fails miserably on, say, the 3,000th and subsequent pages... I'm sure you've been somewhere similar....

OK, you've just caused me to think a bit, for which I thank you. I think it's actually pretty simple (rough sketch below): just instantiate a class that is a thin wrapper around the Lucene analyzer, one that implements the TokenStream (or whatever) interface by calling a contained analyzer (has-a). Return the token, do any recording I want, and provide any additional data to my process as necessary. I'll have to look at that in the morning.

All in all, I'm probably going to make your exact argument about disk space being waaaay cheaper than engineering time. That said, exploring this serves two purposes: first, it lets me back my recommendation with data. Second, and longer term, we're using Lucene on more and more products, and exploring the nooks and crannies involved in exotic schemes vastly increases my ability to quickly triage ways of doing things. The *other* thing my boss is good at is being OK with a reasonable amount of time "wasted" in order to increase my toolkit. So it isn't as frustrating as it might have appeared from my rather off-hand blaming of IT <G>.

Thanks for the suggestions,
Erick
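P.S. Roughly what I have in mind, written against the TokenStream.next() style of the analysis API; the class and method names below are placeholders, not anything that exists yet:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // Placeholder name: a has-a wrapper around whatever analyzer the index already uses.
    public class CountingAnalyzer extends Analyzer {
        private final Analyzer delegate;
        private int tokenCount = 0;

        public CountingAnalyzer(Analyzer delegate) {
            this.delegate = delegate;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            final TokenStream inner = delegate.tokenStream(fieldName, reader);
            return new TokenStream() {
                public Token next() throws IOException {
                    Token t = inner.next();
                    if (t != null) {
                        tokenCount++;   // do whatever offset/position recording is needed here
                    }
                    return t;           // hand the token through to the indexer untouched
                }
            };
        }

        // Read back whatever was recorded after addDocument() returns.
        public int getTokenCount() {
            return tokenCount;
        }
    }

The idea being to pass something like new CountingAnalyzer(new StandardAnalyzer()) to the IndexWriter and read the count back after each addDocument(), so the counts come from exactly the same token stream the indexer sees.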
On 10/18/06, Michael D. Curtin <[EMAIL PROTECTED]> wrote:

Erick Erickson wrote:

> Arbitrary restrictions by IT on the space the indexes can take up.
>
> Actually, I won't say categorically I *can't* make this happen, but in order to
> use this option, I need to be able to present a convincing case. And I can't
> do that until I've exhausted my options/creativity.

Disk space is a LOT cheaper than engineering time. Any manager worth his/her salt should be able to evaluate that tradeoff in a millisecond, and any IT professional unable to do so should be reprimanded. Maybe your boss can fix it. If not, yours is probably not the only such situation in the world ...

If you can retrieve the pre-index content at search time, maybe this would work:

1. Create the "real" index in the form that lets you get the top N books by relevance, on IT's disks.

2. Create a temporary index on those books in the form that gives you the chapter counts in RAM, search it, then discard it.

If N is sufficiently small, #2 could be pretty darn fast.

If that wouldn't work, here's another idea. I'm not clear on how your solution with getLastTermPosition() would work, but how about just counting words in the pages as you document.add() them (instead of relying on getLastTermPosition())? It would mean two passes of parsing, but you wouldn't have to modify any Lucene code ...

--MDC
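For what it's worth, the second idea (counting words as you document.add() them) could look something like this rough sketch; the class name and the "text" field name are made up, and it runs the page through the same analyzer the index uses so stop words and OCR junk are treated exactly as the indexer treats them:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;

    // Placeholder name; pass 1 of the two-pass approach: count the terms on one page.
    public class PageTermCounter {
        public static int countTerms(Analyzer analyzer, String pageText) throws Exception {
            // Run the page through the same analysis chain the indexer will use.
            TokenStream ts = analyzer.tokenStream("text", new StringReader(pageText));
            int count = 0;
            while (ts.next() != null) {
                count++;
            }
            ts.close();
            return count;
        }
    }

Pass 2 would then addDocument() the page as usual, carrying the running total along (for example, in a stored field), without touching any Lucene code.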