You don't need to optimize, only commit.

"GC overhead limit exceeded" means the JVM is spending more than 98% of
its time doing garbage collection; in other words, there is not enough
heap memory.
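For reference, the limit itself is tunable where the JVM is started. These are standard HotSpot flags, but the heap size below is just an illustrative value, not something taken from your setup:

```
# In Tomcat's startup environment (e.g. setenv.sh):
CATALINA_OPTS="-Xmx3g -XX:-UseGCOverheadLimit"
# -Xmx raises the maximum heap.
# -XX:-UseGCOverheadLimit disables the 98%-in-GC check, but that
# usually just postpones a plain OutOfMemoryError, so prefer more heap.
```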

I made a mistake: the bug in Lucene is not specific to PDFs. It happens
with every field in every document you index, in any way, so doing the
extraction in Tika outside Solr does not help. The only trick I can
think of is to alternate between indexing large and small documents;
that way the bug does not need memory for two giant documents in a row.

Also, do not query the indexer at all while you are loading. If you
must, avoid sorting and faceting requests: they eat up a lot of memory
that is only freed at the next commit (index reload).
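If you do need every extraction to survive a crash, you can still commit in batches rather than on every POST; the control flow is just counting. A sketch, where the post/commit callables are stand-ins for your HTTP upload and the /solr/update?commit=true call, not any particular client library:

```python
def index_with_batched_commits(docs, post, commit, batch_size=100):
    """Post every document, but commit only once per batch_size
    documents (and once at the end), instead of on every POST."""
    pending = 0
    for doc in docs:
        post(doc)          # stand-in for the HTTP upload/extract call
        pending += 1
        if pending >= batch_size:
            commit()       # stand-in for GET /solr/update?commit=true
            pending = 0
    if pending:
        commit()           # flush the final partial batch
```

A crash then loses at most batch_size extractions, which you can replay, instead of forcing a commit (and index reload) per document.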

On Sat, Jul 3, 2010 at 8:19 AM, Dennis Gearon <gear...@sbcglobal.net> wrote:
> I'll be watching this one, as I hope to be loading lots of docs soon.
> Dennis Gearon
>
>
> --- On Fri, 7/2/10, Jim Blomo <jim.bl...@pbworks.com> wrote:
>
>> From: Jim Blomo <jim.bl...@pbworks.com>
>> Subject: Re: general debugging techniques?
>> To: solr-user@lucene.apache.org
>> Date: Friday, July 2, 2010, 7:06 PM
>> Just to confirm I'm not doing something insane, this is my general setup:
>>
>> - index approx 1MM documents including HTML, pictures, office files, etc.
>> - files are not local to the Solr process
>> - use upload/extract to extract text from them through Tika
>> - use commit=1 on each POST (reasons below)
>> - use optimize=1 every 150 documents or so (reasons below)
>>
>> Through many manual restarts and modifications to the upload script,
>> I've got about halfway (numDocs: 467372, disk usage 1.6G).  The
>> biggest problem is that any serious problem cannot be recovered from
>> without a restart of Tomcat, and serious problems can't be
>> differentiated at the client level from non-serious problems (e.g.
>> Tika exceptions thrown by bad documents).
>>
>> On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo <jim.bl...@pbworks.com> wrote:
>> > In any case I bumped up the heap to 3G as suggested, which has helped
>> > stability.  I have found that in practice I need to commit every
>> > extraction because a crash or error will wipe out all extractions
>> > after the last commit.
>>
>> I've also found that I need to optimize very regularly because I kept
>> getting "too many file handles" errors (though they usually came up as
>> the more cryptic "directory, but cannot be listed: list() returned
>> null" error).
>>
>> What I am running into now is
>>
>> SEVERE: Exception invoking periodic operation:
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at java.lang.String.substring(String.java:1940)
>> [full backtrace below]
>>
>> After a restart and optimize this goes away for a while (~100
>> documents) but then comes back, and every request after the error
>> fails.  Even if I can't prevent this error, is there a way I can
>> recover from it better?  Perhaps an option to Solr or Tomcat to just
>> restart itself if it hits that error?
>>
>> Jim
>>
>> SEVERE: Exception invoking periodic operation:
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at java.lang.String.substring(String.java:1940)
>>         at java.lang.String.substring(String.java:1905)
>>         at java.io.File.getName(File.java:401)
>>         at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:229)
>>         at java.io.File.isDirectory(File.java:754)
>>         at org.apache.catalina.startup.HostConfig.checkResources(HostConfig.java:1000)
>>         at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1214)
>>         at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:293)
>>         at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
>>         at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1306)
>>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1570)
>>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1579)
>>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1559)
>>         at java.lang.Thread.run(Thread.java:619)
>> Jul 3, 2010 1:32:20 AM org.apache.solr.update.processor.LogUpdateProcessor finish
>>
>



-- 
Lance Norskog
goks...@gmail.com
