Re: general debugging techniques?
Ah! I did not notice the 'too many open files' part. This means that your mergeFactor setting is too high for what your operating system allows. The default mergeFactor is 10 (which translates into thousands of open file descriptors). You should lower this number.

On Tue, Jul 6, 2010 at 1:14 PM, Jim Blomo wrote:

> On Sat, Jul 3, 2010 at 1:10 PM, Lance Norskog wrote:
>> You don't need to optimize, only commit.
>
> OK, thanks for the tip, Lance. I thought the "too many open files" problem was because I wasn't optimizing/merging frequently enough. My understanding of your suggestion is that commit also does merging, and since I am only building the index, not querying or updating it, I don't need to optimize.
>
>> This means that the JVM spends 98% of its time doing garbage collection. This means there is not enough memory.
>
> I'll increase the memory to 4G, decrease the documentCache to 5 and try again.
>
>> I made a mistake - the bug in Lucene is not about PDFs - it happens with every field in every document you index in any way - so doing this in Tika outside Solr does not help. The only trick I can think of is to alternate between indexing large and small documents. This way the bug does not need memory for two giant documents in a row.
>
> I've checked out and built Solr from branch_3x with the tika-0.8-SNAPSHOT patch. (Earlier I was having trouble with Tika crashing too frequently.) I've confirmed that LUCENE-2387 is fixed in this branch, so hopefully I won't run into that this time.
>
>> Also, do not query the indexer at all. If you must, don't do sorted or faceting requests. These eat up a lot of memory that is only freed with the next commit (index reload).
>
> Good to know, though I have not been querying the index and definitely haven't ventured into faceted requests yet.
>
> The advice is much appreciated,
>
> Jim

--
Lance Norskog
goks...@gmail.com
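As a quick sanity check on Lance's diagnosis, the OS limit the JVM inherits can be inspected from the shell. This is only a sketch; the example values and the solrconfig.xml snippet in the comments are illustrative, not settings from this thread:

```shell
# Show the per-process open-file limit that the Solr JVM will inherit.
nofile=$(ulimit -n)
echo "open file limit: $nofile"

# Two complementary fixes (values are examples, not recommendations):
#   1. Lower mergeFactor in solrconfig.xml, inside <indexDefaults>:
#        <mergeFactor>4</mergeFactor>
#   2. Raise the OS limit in the shell that starts Tomcat:
#        ulimit -n 8192
```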
Re: general debugging techniques?
On Sat, Jul 3, 2010 at 1:10 PM, Lance Norskog wrote:
> You don't need to optimize, only commit.

OK, thanks for the tip, Lance. I thought the "too many open files" problem was because I wasn't optimizing/merging frequently enough. My understanding of your suggestion is that commit also does merging, and since I am only building the index, not querying or updating it, I don't need to optimize.

> This means that the JVM spends 98% of its time doing garbage collection. This means there is not enough memory.

I'll increase the memory to 4G, decrease the documentCache to 5 and try again.

> I made a mistake - the bug in Lucene is not about PDFs - it happens with every field in every document you index in any way - so doing this in Tika outside Solr does not help. The only trick I can think of is to alternate between indexing large and small documents. This way the bug does not need memory for two giant documents in a row.

I've checked out and built Solr from branch_3x with the tika-0.8-SNAPSHOT patch. (Earlier I was having trouble with Tika crashing too frequently.) I've confirmed that LUCENE-2387 is fixed in this branch, so hopefully I won't run into that this time.

> Also, do not query the indexer at all. If you must, don't do sorted or faceting requests. These eat up a lot of memory that is only freed with the next commit (index reload).

Good to know, though I have not been querying the index and definitely haven't ventured into faceted requests yet.

The advice is much appreciated,

Jim
Re: general debugging techniques?
You don't need to optimize, only commit.

This means that the JVM spends 98% of its time doing garbage collection. This means there is not enough memory.

I made a mistake - the bug in Lucene is not about PDFs - it happens with every field in every document you index in any way - so doing this in Tika outside Solr does not help. The only trick I can think of is to alternate between indexing large and small documents. This way the bug does not need memory for two giant documents in a row.

Also, do not query the indexer at all. If you must, don't do sorted or faceting requests. These eat up a lot of memory that is only freed with the next commit (index reload).

On Sat, Jul 3, 2010 at 8:19 AM, Dennis Gearon wrote:
> I'll be watching this one as I hope to be loading lots of docs soon.
>
> Dennis Gearon
>
> Signature Warning
>
> EARTH has a Right To Life, otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
> --- On Fri, 7/2/10, Jim Blomo wrote:
>
>> From: Jim Blomo
>> Subject: Re: general debugging techniques?
>> To: solr-user@lucene.apache.org
>> Date: Friday, July 2, 2010, 7:06 PM
>>
>> Just to confirm I'm not doing something insane, this is my general setup:
>>
>> - index approx 1MM documents including HTML, pictures, office files, etc.
>> - files are not local to solr process
>> - use upload/extract to extract text from them through tika
>> - use commit=1 on each POST (reasons below)
>> - use optimize=1 every 150 documents or so (reasons below)
>>
>> Through many manual restarts and modifications to the upload script, I've got about halfway (numDocs: 467372, disk usage 1.6G). The biggest problem is that any serious problem cannot be recovered from without a restart to tomcat, and serious problems can't be differentiated at the client level from non-serious problems (e.g. tika exceptions thrown by bad documents).
>>
>> On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo wrote:
>>> In any case I bumped up the heap to 3G as suggested, which has helped stability. I have found that in practice I need to commit every extraction because a crash or error will wipe out all extractions after the last commit.
>>
>> I've also found that I need to optimize very regularly because I kept getting "too many file handles" errors (though they usually came up as the more cryptic "directory, but cannot be listed: list() returned null" error).
>>
>> What I am running into now is
>>
>> SEVERE: Exception invoking periodic operation:
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at java.lang.String.substring(String.java:1940)
>> [full backtrace below]
>>
>> After a restart and optimize this goes away for a while (~100 documents) but then comes back, and every request after the error fails. Even if I can't prevent this error, is there a way I can recover from it better? Perhaps an option to solr or tomcat to just restart itself if it hits that error?
>>
>> Jim
>>
>> SEVERE: Exception invoking periodic operation:
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at java.lang.String.substring(String.java:1940)
>>         at java.lang.String.substring(String.java:1905)
>>         at java.io.File.getName(File.java:401)
>>         at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:229)
>>         at java.io.File.isDirectory(File.java:754)
>>         at org.apache.catalina.startup.HostConfig.checkResources(HostConfig.java:1000)
>>         at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1214)
>>         at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:293)
>>         at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
>>         at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1306)
>>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1570)
>>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1579)
>>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1559)
>>         at java.lang.Thread.run(Thread.java:619)
>> Jul 3, 2010 1:32:20 AM org.apache.solr.update.processor.LogUpdateProcessor finish

--
Lance Norskog
goks...@gmail.com
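Jim's quoted question about having Tomcat restart itself after this error has a conventional JVM-level answer. The flags below are standard HotSpot options, but wiring them into Tomcat this way is only a sketch, not something prescribed in this thread:

```shell
# 'GC overhead limit exceeded' is an OutOfMemoryError, so the JVM can be
# told to run a command when it is thrown; killing the process lets an
# external supervisor restart the container cleanly.
JAVA_OPTS="-Xmx4g -XX:OnOutOfMemoryError='kill -9 %p'"
echo "$JAVA_OPTS"

# Minimal supervisor loop (sketch):
#   while true; do catalina.sh run; sleep 5; done
```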
Re: general debugging techniques?
I'll be watching this one as I hope to be loading lots of docs soon.

Dennis Gearon

Signature Warning

EARTH has a Right To Life, otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Fri, 7/2/10, Jim Blomo wrote:

> From: Jim Blomo
> Subject: Re: general debugging techniques?
> To: solr-user@lucene.apache.org
> Date: Friday, July 2, 2010, 7:06 PM
>
> Just to confirm I'm not doing something insane, this is my general setup:
>
> - index approx 1MM documents including HTML, pictures, office files, etc.
> - files are not local to solr process
> - use upload/extract to extract text from them through tika
> - use commit=1 on each POST (reasons below)
> - use optimize=1 every 150 documents or so (reasons below)
>
> Through many manual restarts and modifications to the upload script, I've got about halfway (numDocs: 467372, disk usage 1.6G). The biggest problem is that any serious problem cannot be recovered from without a restart to tomcat, and serious problems can't be differentiated at the client level from non-serious problems (e.g. tika exceptions thrown by bad documents).
>
> On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo wrote:
>> In any case I bumped up the heap to 3G as suggested, which has helped stability. I have found that in practice I need to commit every extraction because a crash or error will wipe out all extractions after the last commit.
>
> I've also found that I need to optimize very regularly because I kept getting "too many file handles" errors (though they usually came up as the more cryptic "directory, but cannot be listed: list() returned null" error).
>
> What I am running into now is
>
> SEVERE: Exception invoking periodic operation:
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.lang.String.substring(String.java:1940)
> [full backtrace below]
>
> After a restart and optimize this goes away for a while (~100 documents) but then comes back, and every request after the error fails. Even if I can't prevent this error, is there a way I can recover from it better? Perhaps an option to solr or tomcat to just restart itself if it hits that error?
>
> Jim
>
> SEVERE: Exception invoking periodic operation:
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.lang.String.substring(String.java:1940)
>         at java.lang.String.substring(String.java:1905)
>         at java.io.File.getName(File.java:401)
>         at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:229)
>         at java.io.File.isDirectory(File.java:754)
>         at org.apache.catalina.startup.HostConfig.checkResources(HostConfig.java:1000)
>         at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1214)
>         at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:293)
>         at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
>         at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1306)
>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1570)
>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1579)
>         at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1559)
>         at java.lang.Thread.run(Thread.java:619)
> Jul 3, 2010 1:32:20 AM org.apache.solr.update.processor.LogUpdateProcessor finish
Re: general debugging techniques?
Just to confirm I'm not doing something insane, this is my general setup:

- index approx 1MM documents including HTML, pictures, office files, etc.
- files are not local to solr process
- use upload/extract to extract text from them through tika
- use commit=1 on each POST (reasons below)
- use optimize=1 every 150 documents or so (reasons below)

Through many manual restarts and modifications to the upload script, I've got about halfway (numDocs: 467372, disk usage 1.6G). The biggest problem is that any serious problem cannot be recovered from without a restart to tomcat, and serious problems can't be differentiated at the client level from non-serious problems (e.g. tika exceptions thrown by bad documents).

On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo wrote:
> In any case I bumped up the heap to 3G as suggested, which has helped stability. I have found that in practice I need to commit every extraction because a crash or error will wipe out all extractions after the last commit.

I've also found that I need to optimize very regularly because I kept getting "too many file handles" errors (though they usually came up as the more cryptic "directory, but cannot be listed: list() returned null" error).

What I am running into now is

SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.String.substring(String.java:1940)
[full backtrace below]

After a restart and optimize this goes away for a while (~100 documents) but then comes back, and every request after the error fails. Even if I can't prevent this error, is there a way I can recover from it better? Perhaps an option to solr or tomcat to just restart itself if it hits that error?

Jim

SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.String.substring(String.java:1940)
        at java.lang.String.substring(String.java:1905)
        at java.io.File.getName(File.java:401)
        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:229)
        at java.io.File.isDirectory(File.java:754)
        at org.apache.catalina.startup.HostConfig.checkResources(HostConfig.java:1000)
        at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1214)
        at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:293)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
        at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1306)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1570)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1579)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1559)
        at java.lang.Thread.run(Thread.java:619)
Jul 3, 2010 1:32:20 AM org.apache.solr.update.processor.LogUpdateProcessor finish
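The upload/extract-with-commit loop described above can be driven with curl against the ExtractingRequestHandler; the host, port, document id, and file name here are assumptions for illustration:

```shell
# Build the extract URL with an immediate commit, as in Jim's setup.
base='http://localhost:8080/solr'
url="$base/update/extract?literal.id=doc42&commit=true"
echo "POST $url"

# With a running Solr this would push one file through Tika and commit:
#   curl -sS "$url" -F 'myfile=@report.pdf'
```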
Re: general debugging techniques?
: > if you are only seeing one log line per request, then you are just looking
: > at the "request" log ... there should be more logs with messages from all
: > over the code base with various levels of severity -- and using standard
: > java log level controls you can turn these up/down for various components.
:
: Unfortunately, I'm not very familiar with java deploys so I don't know
: where the standard controls are yet. As a concrete example, I do see
: INFO level logs, but haven't found a way to move up to DEBUG level in
: either solr or tomcat. I was hopeful debug statements would point to
: where extraction/indexing hangs were occurring. I will keep poking
: around, thanks for the tips.

Hmm ... it sounds like maybe you haven't seen this wiki page...

http://wiki.apache.org/solr/SolrLogging

...as mentioned there, for quick debugging, there is an admin page to adjust the log levels on the fly...

http://localhost:8983/solr/admin/logging.jsp

...but for more long term changes to the logging configuration, it depends greatly on whether your servlet container customizes the Java LogManager. There are links there to general info about Java logging, and about tweaking this in the example Jetty setup.

-Hoss
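For the "long term changes" route Hoss mentions, a java.util.logging configuration file is the usual mechanism under Tomcat; the file path and the FINE level below are assumptions for illustration, not settings taken from the wiki page:

```shell
# Write a logging.properties that raises org.apache.solr to FINE
# (java.util.logging's rough equivalent of DEBUG).
cat > /tmp/solr-logging.properties <<'EOF'
handlers = java.util.logging.ConsoleHandler
.level = INFO
java.util.logging.ConsoleHandler.level = FINE
org.apache.solr.level = FINE
EOF

# Then point the container's JVM at it, e.g.:
#   JAVA_OPTS="-Djava.util.logging.config.file=/tmp/solr-logging.properties"
```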
Re: general debugging techniques?
https://issues.apache.org/jira/browse/LUCENE-2387

There is a "memory leak" that causes the last PDF binary file image to stick around while working on the next binary image. When you commit after every extraction, you clear up this "memory leak". This is fixed in trunk and should make it into a 'bug fix' Solr 1.4.1 if such a thing happens.

Lance

On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo wrote:
> On Fri, Jun 4, 2010 at 3:14 PM, Chris Hostetter wrote:
>> : That is still really small for 5MB documents. I think the default solr
>> : document cache is 512 items, so you would need at least 3 GB of memory
>> : if you didn't change that and the cache filled up.
>>
>> that assumes that the extracted text tika extracts from each document is the same size as the original raw files *and* that he's configured that content field to be "stored" ... in practice if you only stored=true the
>
> Most times the extracted text is much smaller, though there are occasional zip files that may expand in size (and in an unrelated note, multifile zip archives cause tika 0.7 to hang currently).
>
>> fast, 128MB is really, really, really small for a typical Solr instance.
>
> In any case I bumped up the heap to 3G as suggested, which has helped stability. I have found that in practice I need to commit every extraction because a crash or error will wipe out all extractions after the last commit.
>
>> if you are only seeing one log line per request, then you are just looking at the "request" log ... there should be more logs with messages from all over the code base with various levels of severity -- and using standard java log level controls you can turn these up/down for various components.
>
> Unfortunately, I'm not very familiar with java deploys so I don't know where the standard controls are yet. As a concrete example, I do see INFO level logs, but haven't found a way to move up to DEBUG level in either solr or tomcat. I was hopeful debug statements would point to where extraction/indexing hangs were occurring. I will keep poking around, thanks for the tips.
>
> Jim

--
Lance Norskog
goks...@gmail.com
Re: general debugging techniques?
On Fri, Jun 4, 2010 at 3:14 PM, Chris Hostetter wrote:
> : That is still really small for 5MB documents. I think the default solr
> : document cache is 512 items, so you would need at least 3 GB of memory
> : if you didn't change that and the cache filled up.
>
> that assumes that the extracted text tika extracts from each document is the same size as the original raw files *and* that he's configured that content field to be "stored" ... in practice if you only stored=true the

Most times the extracted text is much smaller, though there are occasional zip files that may expand in size (and in an unrelated note, multifile zip archives cause tika 0.7 to hang currently).

> fast, 128MB is really, really, really small for a typical Solr instance.

In any case I bumped up the heap to 3G as suggested, which has helped stability. I have found that in practice I need to commit every extraction because a crash or error will wipe out all extractions after the last commit.

> if you are only seeing one log line per request, then you are just looking at the "request" log ... there should be more logs with messages from all over the code base with various levels of severity -- and using standard java log level controls you can turn these up/down for various components.

Unfortunately, I'm not very familiar with java deploys so I don't know where the standard controls are yet. As a concrete example, I do see INFO level logs, but haven't found a way to move up to DEBUG level in either solr or tomcat. I was hopeful debug statements would point to where extraction/indexing hangs were occurring. I will keep poking around, thanks for the tips.

Jim
RE: general debugging techniques?
: That is still really small for 5MB documents. I think the default solr
: document cache is 512 items, so you would need at least 3 GB of memory
: if you didn't change that and the cache filled up.

that assumes that the extracted text tika extracts from each document is the same size as the original raw files *and* that he's configured that content field to be "stored" ... in practice if you only set stored=true on the summary fields (title, author, short summary, etc...) the document cache isn't going to be nearly that big (and even if you do store the entire content field, the plain text is usually *much* smaller than the binary source file)

: -Xmx128M - my understanding is that this bumps heap size to 128M.

FWIW: depending on how many docs you are indexing, and whether you want to support things like faceting that rely on building in-memory caches to be fast, 128MB is really, really, really small for a typical Solr instance. Even on a box that is only doing indexing (no queries) I would imagine Tika likes to have a lot of RAM when doing extraction (most doc types are going to require that the raw binary data is entirely in the heap, plus all the extracted Strings, plus all of the connecting objects to build the DOM, etc.) And that's before you even start thinking about Solr & Lucene and the index itself.

-Hoss
Re: general debugging techniques?
: to format the data from my sources. I can read through the catalina
: log, but this seems to just log requests; not much info is given about
: errors or when the service hangs. Here are some examples:

if you are only seeing one log line per request, then you are just looking at the "request" log ... there should be more logs with messages from all over the code base with various levels of severity -- and using standard java log level controls you can turn these up/down for various components.

: Although I am keeping document size under 5MB, I regularly see
: "SEVERE: java.lang.OutOfMemoryError: Java heap space" errors. How can
: I find what component had this problem?

that's one of java's most annoying problems -- even if you have the full stack trace of the OOM, that just tells you which code path was the straw that broke the camel's back -- it doesn't tell you where all your memory was being used. for that you really need to use a java profiler, or turn on heap dumps and use a heap dump analyzer after the OOM occurs.

: After the above error, I often see this followup error on the next
: document: "SEVERE: org.apache.lucene.store.LockObtainFailedException:
: Lock obtain timed out: NativeFSLock@/var/lib/solr/data/
: index/lucene-d6f7b3bf6fe64f362b4d45bfd4924f54-write.lock" . This has
: a backtrace, so I could dive directly into the code. Is this the best
: way to track down the problem, or are there debugging settings that
: could help show why the lock is being held elsewhere?

probably not -- most java apps are just screwed in general after an OOM (or any other low level error).

: I attempted to turn on indexing logging with the line
: <infoStream file="...">true</infoStream>
: but I can't seem to find this file in either the tomcat or the index directory.

it will probably be in whatever the Current Working Directory (CWD) is -- assuming the file permissions allow writing to it. the top of the Solr admin screen tells you what the CWD is in case it's not clear from how your servlet container is run.

-Hoss
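The "turn on heap dumps" suggestion above maps to standard HotSpot flags; the dump path is an assumption, and jhat is just one of several analyzers that can read the resulting .hprof file:

```shell
# Capture a heap dump automatically whenever an OutOfMemoryError occurs.
JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp"
echo "$JAVA_OPTS"

# After a crash, inspect the dump offline to see what held the memory, e.g.:
#   jhat /var/tmp/java_pid12345.hprof
```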
RE: general debugging techniques?
That is still really small for 5MB documents. I think the default solr document cache is 512 items, so you would need at least 3 GB of memory if you didn't change that and the cache filled up. Try disabling the document cache by removing the <documentCache> block from your solrconfig, or at least turn it down to like 5 documents.

-Kal

-----Original Message-----
From: jim.bl...@pbwiki.com [mailto:jim.bl...@pbwiki.com] On Behalf Of Jim Blomo
Sent: Thursday, June 03, 2010 2:29 PM
To: solr-user@lucene.apache.org
Subject: Re: general debugging techniques?

On Thu, Jun 3, 2010 at 11:17 AM, Nagelberg, Kallin wrote:
> How much memory have you given tomcat? The default is 64M which is going to be really small for 5MB documents.

-Xmx128M - my understanding is that this bumps heap size to 128M. What is a reasonable size? Are there other memory flags I should specify?

Jim
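Kallin's suggestion corresponds to the documentCache entry in solrconfig.xml. This sketch shows the element's shape with the size turned down to the 5 documents suggested above; the class and attribute names follow the stock example config:

```shell
# Shape of a turned-down documentCache for solrconfig.xml.
doc_cache='<documentCache class="solr.LRUCache" size="5" initialSize="5" autowarmCount="0"/>'
echo "$doc_cache"
```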
Re: general debugging techniques?
On Thu, Jun 3, 2010 at 11:17 AM, Nagelberg, Kallin wrote:
> How much memory have you given tomcat? The default is 64M which is going to be really small for 5MB documents.

-Xmx128M - my understanding is that this bumps heap size to 128M. What is a reasonable size? Are there other memory flags I should specify?

Jim
RE: general debugging techniques?
How much memory have you given tomcat? The default is 64M which is going to be really small for 5MB documents.

-----Original Message-----
From: jim.bl...@pbwiki.com [mailto:jim.bl...@pbwiki.com] On Behalf Of Jim Blomo
Sent: Thursday, June 03, 2010 2:05 PM
To: solr-user@lucene.apache.org
Subject: general debugging techniques?

I am new to debugging Java services, so I'm wondering what the best practices are for debugging solr on tomcat. I'm running into a few issues while building up my index, using the ExtractingRequestHandler to format the data from my sources. I can read through the catalina log, but this seems to just log requests; not much info is given about errors or when the service hangs. Here are some examples:

Some zip or Office formats uploaded to the extract requestHandler simply hang with the jsvc process spinning at 100% CPU. I'm unclear where in the process the request is hanging. Did it make it through Tika? Is it attempting to index? The problem is often not reproducible after restarting tomcat and starting with the last failed document.

Although I am keeping document size under 5MB, I regularly see "SEVERE: java.lang.OutOfMemoryError: Java heap space" errors. How can I find what component had this problem?

After the above error, I often see this followup error on the next document: "SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/lib/solr/data/index/lucene-d6f7b3bf6fe64f362b4d45bfd4924f54-write.lock". This has a backtrace, so I could dive directly into the code. Is this the best way to track down the problem, or are there debugging settings that could help show why the lock is being held elsewhere?

I attempted to turn on indexing logging with the line <infoStream file="...">true</infoStream> but I can't seem to find this file in either the tomcat or the index directory.

I am using solr 3.1 with the patch to work with Tika 0.7.

Thanks for any tips,
Jim
general debugging techniques?
I am new to debugging Java services, so I'm wondering what the best practices are for debugging solr on tomcat. I'm running into a few issues while building up my index, using the ExtractingRequestHandler to format the data from my sources. I can read through the catalina log, but this seems to just log requests; not much info is given about errors or when the service hangs. Here are some examples:

Some zip or Office formats uploaded to the extract requestHandler simply hang with the jsvc process spinning at 100% CPU. I'm unclear where in the process the request is hanging. Did it make it through Tika? Is it attempting to index? The problem is often not reproducible after restarting tomcat and starting with the last failed document.

Although I am keeping document size under 5MB, I regularly see "SEVERE: java.lang.OutOfMemoryError: Java heap space" errors. How can I find what component had this problem?

After the above error, I often see this followup error on the next document: "SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/lib/solr/data/index/lucene-d6f7b3bf6fe64f362b4d45bfd4924f54-write.lock". This has a backtrace, so I could dive directly into the code. Is this the best way to track down the problem, or are there debugging settings that could help show why the lock is being held elsewhere?

I attempted to turn on indexing logging with the line <infoStream file="...">true</infoStream> but I can't seem to find this file in either the tomcat or the index directory.

I am using solr 3.1 with the patch to work with Tika 0.7.

Thanks for any tips,
Jim
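The LockObtainFailedException above usually means a write.lock file survived a crashed IndexWriter. A hedged recovery sketch, using the index path from the error message, and only to be run once Solr/Tomcat is fully stopped:

```shell
# Path taken from the error message; adjust to your own data directory.
INDEX_DIR=/var/lib/solr/data/index

# List any leftover lock files; remove them ONLY when no writer is running,
# since deleting a live writer's lock can corrupt the index.
locks=$(ls "$INDEX_DIR"/*write.lock 2>/dev/null || true)
if [ -n "$locks" ]; then
  echo "stale lock(s): $locks"   # then: rm $locks, and restart Tomcat
else
  echo "no stale lock found"
fi
```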