I tried the notion of a temporary RAMDirectory already, and the documents
parse unacceptably slowly, 8-10 seconds. Great minds think alike. Believe
it or not, I have to deal with a 7,500 page book that details Civil War
records of Michigan volunteers. The XML form is 24M, probably 16M of text
exclusive of tags.

About your second suggestion, I'm trying to figure out how to do
essentially that. But a word count isn't very straightforward with stop
words and dirty ASCII (OCR) data. I'm trying to hook that process into the
tokenizer so the counts have a better chance of being accurate, which is
the essence of the scheme. I'd far rather get the term offset from the same
place the indexer will than try to write a similar-but-not-quite-identical
algorithm that fails miserably on, say, the 3,000th and subsequent pages...
I'm sure you've been somewhere similar....

OK, you've just caused me to think a bit, for which I thank you. I think
it's actually pretty simple. Just instantiate a class that is a thin
wrapper around the Lucene analyzer (has-a) and implements the TokenStream
(or whatever) interface by calling the contained analyzer. Return the
token, do any recording I want, and provide any additional data to my
process as necessary. I'll have to look at that in the morning.
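
Something like this is what I'm picturing. A rough, untested sketch against
the Lucene 2.x TokenStream API; the class names and counters below are my
own placeholders, not anything in Lucene:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Thin has-a wrapper: delegates all analysis to the contained analyzer
// and records each token it actually emits, so the counts and positions
// match exactly what the indexer sees.
public class CountingAnalyzer extends Analyzer {

    private final Analyzer delegate;
    private int tokenCount = 0;
    private int lastPosition = -1;

    public CountingAnalyzer(Analyzer delegate) {
        this.delegate = delegate;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new CountingFilter(delegate.tokenStream(fieldName, reader));
    }

    public int getTokenCount() { return tokenCount; }
    public int getLastPosition() { return lastPosition; }

    private class CountingFilter extends TokenFilter {
        CountingFilter(TokenStream in) { super(in); }

        public Token next() throws IOException {
            Token t = input.next();
            if (t != null) {
                tokenCount++;            // stop words are already gone here
                lastPosition += t.getPositionIncrement();
                // ...record offsets, per-page tallies, etc. as needed
            }
            return t;
        }
    }
}

The nice side effect is that the count is taken after stop word removal and
whatever the analyzer does with the dirty OCR text, so it can't drift from
what actually gets indexed.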

All in all, I'm probably going to make your exact argument about disk space
being waaaay cheaper than engineering time. That said, exploring this serves
two purposes: first, it lets me back my recommendation with data. Second,
and longer term, we're using Lucene on more and more products, and exploring
the nooks and crannies involved in exotic schemes vastly increases my
ability to quickly triage ways of doing things. The *other* thing my boss is
good at is being OK with a reasonable amount of time "wasted" in order to
increase my toolkit. So it isn't as frustrating as it might have appeared
from my rather off-hand blaming of IT <G>.

Thanks for the suggestions,
Erick

On 10/18/06, Michael D. Curtin <[EMAIL PROTECTED]> wrote:

Erick Erickson wrote:

> Arbitrary restrictions by IT on the space the indexes can take up.
>
> Actually, I won't categorically say I *can't* make this happen, but in
> order to use this option, I need to be able to present a convincing case.
> And I can't do that until I've exhausted my options/creativity.

Disk space is a LOT cheaper than engineering time.  Any manager worth
his/her salt should be able to evaluate that tradeoff in a millisecond, and
any IT professional unable to do so should be reprimanded.  Maybe your boss
can fix it.  If not, yours is probably not the only such situation in the
world ...

If you can retrieve the pre-index content at search time, maybe this would
work:

1.  Create the "real" index in the form that lets you get the top N books
by relevance, on IT's disks.

2.  Create a temporary index on those books in the form that gives you the
chapter counts in RAM, search it, then discard it.

If N is sufficiently small, #2 could be pretty darn fast.
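
For what it's worth, a rough, untested sketch of what #2 might look like,
assuming Lucene 2.x (RAMDirectory, IndexWriter, IndexSearcher); the field
names and the method here are made up for illustration:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class TempChapterIndex {

    // Build a throwaway in-memory index over just the top N books'
    // chapters, search it, and let it be garbage collected afterwards.
    public static Hits searchTopBooks(String[] chapterTexts, Query query)
            throws IOException {
        RAMDirectory dir = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer();

        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        for (int i = 0; i < chapterTexts.length; i++) {
            Document doc = new Document();
            doc.add(new Field("chapter", String.valueOf(i),
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("text", chapterTexts[i],
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        return searcher.search(query);  // per-chapter hits for the caller
    }
}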


If that wouldn't work, here's another idea.  I'm not clear on how your
solution with getLastTermPosition() would work, but how about just counting
words in the pages as you document.add() them (instead of relying on
getLastTermPosition())?  It would mean two passes of parsing, but you
wouldn't have to modify any Lucene code ...
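
One way to read that, as an untested sketch against the Lucene 2.x API (the
"text" field name and the class are just placeholders): run each page
through the same analyzer the index uses and count the tokens right before
document.add(), which keeps the tally consistent with stop word removal:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

public class PageWordCount {

    // Count the terms a page would contribute to the index by running its
    // text through the analyzer and counting the tokens that come out.
    public static int countTerms(Analyzer analyzer, String pageText)
            throws IOException {
        TokenStream ts = analyzer.tokenStream("text", new StringReader(pageText));
        int count = 0;
        while (ts.next() != null) {
            count++;
        }
        ts.close();
        return count;
    }
}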

--MDC
