On Fri, Jun 4, 2010 at 3:14 PM, Chris Hostetter
<hossman_luc...@fucit.org> wrote:
> : That is still really small for 5MB documents. I think the default solr
> : document cache is 512 items, so you would need at least 3 GB of memory
> : if you didn't change that and the cache filled up.
>
> that assumes that the extracted text tika extracts from each document is
> the same size as the original raw files *and* that he's configured that
> content field to be "stored" ... in practice if you only stored=true the

Most of the time the extracted text is much smaller, though the
occasional zip file may expand in size (and, on an unrelated note,
multi-file zip archives currently cause Tika 0.7 to hang).
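
As a stopgap for the hangs, I've been experimenting with running each
parse on a worker thread and giving up after a fixed timeout, so one
bad archive can't wedge the whole run.  This is only a sketch; it
assumes Tika's stock AutoDetectParser/BodyContentHandler API, and a
cancelled parse thread may still linger in the background:

    import java.io.InputStream;
    import java.util.concurrent.*;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TimedExtract {
        // Run one Tika parse on a worker thread; bail out after a fixed
        // timeout so a hung archive can't stall the whole crawl.
        public static String parseWithTimeout(final InputStream in,
                                              long seconds) throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            Future<String> job = pool.submit(new Callable<String>() {
                public String call() throws Exception {
                    // Default write limit applies; raise it for big docs.
                    BodyContentHandler text = new BodyContentHandler();
                    new AutoDetectParser().parse(in, text, new Metadata(),
                                                 new ParseContext());
                    return text.toString();
                }
            });
            try {
                return job.get(seconds, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                job.cancel(true);  // best effort; the parse thread may not die
                throw e;
            } finally {
                pool.shutdownNow();
            }
        }
    }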

> fast, 128MB is really, really, really small for a typical Solr instance.

In any case, I bumped the heap up to 3G as suggested, which has
helped stability.  I have found that in practice I need to commit
after every extraction, because a crash or error will wipe out all
extractions since the last commit.
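
For anyone else doing the same, each post now looks roughly like the
following with SolrJ.  Treat it as a sketch: the URL, file path, and
id are placeholders for my setup.

    import java.io.File;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractOne {
        public static void main(String[] args) throws Exception {
            SolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Post one file to the extracting handler and commit
            // immediately, so a later crash can't lose this document.
            ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("/data/docs/report.pdf"));  // placeholder
            req.setParam("literal.id", "doc42");             // placeholder
            req.setParam("commit", "true");
            server.request(req);
        }
    }

Per-document commits are slower, but after losing a few batches to
crashes it seems like the right trade-off for me.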

> if you are only seeing one log line per request, then you are just looking
> at the "request" log ... there should be more logs with messages from all
> over the code base with various levels of severity -- and using standard
> java log level controls you can turn these up/down for various components.

Unfortunately, I'm not very familiar with Java deployments, so I
don't know where the standard controls are yet.  As a concrete
example, I do see INFO-level logs, but I haven't found a way to move
up to DEBUG level in either Solr or Tomcat.  I was hoping debug
statements would point to where the extraction/indexing hangs were
occurring.  I will keep poking around; thanks for the tips.
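
From what I've pieced together so far, Tomcat logs through
java.util.logging, where FINE is the rough equivalent of DEBUG, so
something like the following in a logging.properties file (pointed at
via -Djava.util.logging.config.file, or Tomcat's own
conf/logging.properties) ought to turn up the Solr loggers.  I
haven't confirmed this on my setup yet, so consider it a sketch:

    handlers = java.util.logging.ConsoleHandler
    # The handler also filters, so it has to be lowered too:
    java.util.logging.ConsoleHandler.level = FINE
    # Turn up everything under the Solr packages:
    org.apache.solr.level = FINE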

Jim
