You don't need to optimize, only commit. The "GC overhead limit exceeded" error below means that the JVM spends 98% of its time doing garbage collection, which means there is not enough memory.
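For reference, the usual way to give a Tomcat-hosted Solr more heap is through CATALINA_OPTS; the sizes below are illustrative assumptions, not values this thread prescribes:

```shell
# Illustrative heap settings for Tomcat (sizes are assumptions; tune to your box).
export CATALINA_OPTS="-Xms512m -Xmx3g"
# The HotSpot JVM can also skip the 98%-in-GC check and throw a plain
# OutOfMemoryError instead, though that only changes the error, not the cause:
# export CATALINA_OPTS="$CATALINA_OPTS -XX:-UseGCOverheadLimit"
```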
I made a mistake - the bug in Lucene is not about PDFs - it happens with
every field in every document you index in any way - so doing this in Tika
outside Solr does not help. The only trick I can think of is to alternate
between indexing large and small documents; that way the bug does not need
memory for two giant documents in a row. Also, do not query the indexer at
all. If you must, don't do sorting or faceting requests. These eat up a lot
of memory that is only freed with the next commit (index reload).

On Sat, Jul 3, 2010 at 8:19 AM, Dennis Gearon <gear...@sbcglobal.net> wrote:
> I'll be watching this one as I hope to be loading lots of docs soon.
> Dennis Gearon
>
> Signature Warning
> ----------------
> EARTH has a Right To Life,
> otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
> --- On Fri, 7/2/10, Jim Blomo <jim.bl...@pbworks.com> wrote:
>
>> From: Jim Blomo <jim.bl...@pbworks.com>
>> Subject: Re: general debugging techniques?
>> To: solr-user@lucene.apache.org
>> Date: Friday, July 2, 2010, 7:06 PM
>>
>> Just to confirm I'm not doing something insane, this is my general setup:
>>
>> - index approx 1MM documents including HTML, pictures, office files, etc.
>> - files are not local to solr process
>> - use upload/extract to extract text from them through tika
>> - use commit=1 on each POST (reasons below)
>> - use optimize=1 every 150 documents or so (reasons below)
>>
>> Through many manual restarts and modifications to the upload script,
>> I've got about half way (numDocs: 467372, disk usage 1.6G). The biggest
>> problem is that any serious problem cannot be recovered from without a
>> restart to tomcat, and serious problems can't be differentiated at the
>> client level from non-serious problems (e.g. tika exceptions thrown by
>> bad documents).
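The upload/extract-with-commit flow described above corresponds to Solr's ExtractingRequestHandler. A minimal sketch of how a client script might build that request URL, assuming a Solr 1.4-era instance on localhost and a hypothetical document id (the names here are illustrative, not from the thread):

```python
# Hedged sketch: build the /update/extract URL that the commit-per-POST
# flow above implies. Host, core path, and document id are assumptions.
from urllib.parse import urlencode

def extract_request_url(solr_base: str, doc_id: str, commit: bool = True) -> str:
    """Return the ExtractingRequestHandler URL for uploading one document."""
    params = {"literal.id": doc_id, "commit": str(commit).lower()}
    return f"{solr_base}/update/extract?{urlencode(params)}"

url = extract_request_url("http://localhost:8983/solr", "doc1")
# The file itself would then be sent as a multipart POST to this URL.
```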
>> On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo <jim.bl...@pbworks.com> wrote:
>> > In any case I bumped up the heap to 3G as suggested, which has helped
>> > stability. I have found that in practice I need to commit every
>> > extraction because a crash or error will wipe out all extractions
>> > after the last commit.
>>
>> I've also found that I need to optimize very regularly because I kept
>> getting "too many file handles" errors (though they usually came up as
>> the more cryptic "directory, but cannot be listed: list() returned
>> null" error).
>>
>> What I am running into now is
>>
>> SEVERE: Exception invoking periodic operation:
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at java.lang.String.substring(String.java:1940)
>> [full backtrace below]
>>
>> After a restart and optimize this goes away for a while (~100 documents)
>> but then comes back, and every request after the error fails. Even if I
>> can't prevent this error, is there a way I can recover from it better?
>> Perhaps an option to solr or tomcat to just restart itself if it hits
>> that error?
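A side note on the "too many file handles" symptom: in Solr 1.4-era solrconfig.xml, segment-merge behavior is controlled in the <indexDefaults> section, and lowering mergeFactor or enabling compound files keeps the open-file count down without constant optimizes. The values below are illustrative assumptions, not settings this thread recommends:

```xml
<indexDefaults>
  <!-- Fewer segments per merge level means fewer open files (illustrative value) -->
  <mergeFactor>4</mergeFactor>
  <!-- Compound files trade some indexing speed for far fewer file handles -->
  <useCompoundFile>true</useCompoundFile>
</indexDefaults>
```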
>> Jim
>>
>> SEVERE: Exception invoking periodic operation:
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at java.lang.String.substring(String.java:1940)
>>         at java.lang.String.substring(String.java:1905)
>>         at java.io.File.getName(File.java:401)
>>         at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:229)
>>         at java.io.File.isDirectory(File.java:754)
>>         at org.apache.catalina.startup.HostConfig.checkResources(HostConfig.java:1000)
>>         at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1214)
>>         at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:293)
>>         at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
>>         at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1306)
>>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1570)
>>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1579)
>>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1559)
>>         at java.lang.Thread.run(Thread.java:619)
>> Jul 3, 2010 1:32:20 AM
>> org.apache.solr.update.processor.LogUpdateProcessor finish

--
Lance Norskog
goks...@gmail.com