[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413983 ]

Marcel Schnippe commented on NUTCH-292:
---------------------------------------

The cause of the OutOfMemoryError in my case was a (large) document
containing a very large set of tokens. Most of the tokens are made of
overlapping substrings, as in "all your base are belong to us" =>
all, all-your, your, your-base, all-your-base, base-are, etc.

> OpenSearchServlet: OutOfMemoryError: Java heap space
>
> Key: NUTCH-292
> URL: http://issues.apache.org/jira/browse/NUTCH-292
> Project: Nutch
> Type: Bug
> Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Priority: Critical
> Attachments: summarizer.diff
>
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
> org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
> org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
> javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
> javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>
> The URL I use is:
> [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
>
> It seems to be a problem specific to the data I'm working with. Moving the
> start from 0 to 10 or changing the query works fine.
> Or maybe it doesn't have to do with sorting but it's just that I hit one
> "bad search-result" that has a broken summary?
>
> !! The problem is repeatable. So if anybody has an idea where to search /
> what to fix, I can easily try that out !!
[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413982 ]

Marcel Schnippe commented on NUTCH-292:
---------------------------------------

Hi Stefan,

Thanks for trying out the patch. Yes, you were right, it was for 0.7. I
should definitely switch, but I have made so many custom changes. The proper
place to apply it would be in summary-basic's getTokens(), like in:

    private Token[] getTokens(String text) {
      ArrayList result = new ArrayList();
      TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
      Token token = null;
  -   while (true) {
  +   while (result.size() < maxTokens) {

Beware of the above code. I have only proven it correct, not tested it.
(D. Knuth)

> OpenSearchServlet: OutOfMemoryError: Java heap space
>
> Key: NUTCH-292
> URL: http://issues.apache.org/jira/browse/NUTCH-292
> Project: Nutch
> Type: Bug
> Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Priority: Critical
> Attachments: summarizer.diff
>
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
> org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
> org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
> javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
> javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>
> The URL I use is:
> [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
>
> It seems to be a problem specific to the data I'm working with. Moving the
> start from 0 to 10 or changing the query works fine.
> Or maybe it doesn't have to do with sorting but it's just that I hit one
> "bad search-result" that has a broken summary?
>
> !! The problem is repeatable. So if anybody has an idea where to search /
> what to fix, I can easily try that out !!
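To make the intent of the truncated diff concrete, here is a bounded
getTokens() along the lines Marcel describes. This is a sketch only:
maxTokens is a stand-in for whatever cap the actual summarizer.diff uses
(it could come from a config property), and the analyzer field is assumed
from the surrounding summary-basic code.

    // Sketch of a bounded getTokens(); maxTokens is a hypothetical cap,
    // not an existing Nutch constant.
    private static final int maxTokens = 10000;

    private Token[] getTokens(String text) {
      ArrayList result = new ArrayList();
      TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
      try {
        Token token = ts.next();
        // stop either at end-of-stream or once the cap is reached, so a
        // pathological document cannot exhaust the heap
        while (token != null && result.size() < maxTokens) {
          result.add(token);
          token = ts.next();
        }
      } catch (IOException e) {
        // fall through and return whatever was collected so far
      }
      return (Token[]) result.toArray(new Token[result.size()]);
    }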
Re: java 1.4 versus 1.5
Do it; java 1.5 has much better profilability too.

On Tue, May 30, 2006 at 03:21:00PM -0700, Owen O'Malley wrote:
> Java 1.5 has been out for a couple of years now and has some nice
> improvements in the libraries. In particular, I wish I had access to
> the timeout settings on UrlConnections. Would anyone object if starting
> with the 0.3 release this week, we required java 1.5 to compile and run
> Hadoop?
>
> -- Owen
[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12413959 ]

Matt Kangas commented on NUTCH-272:
-----------------------------------

Thanks Doug, that makes more sense now. Running URLFilters.filter() during
Generate seems very handy, albeit costly for large crawls. (Should there be
an option to turn it off?)

I also see that URLFilters.filter() is applied in Fetcher (for redirects) and
ParseOutputFormat, plus other tools.

Another possible choke point: CrawlDbMerger.Merger.reduce(). The key is the
URL, and the keys are sorted. You can veto crawldb additions here. Could you
effectively count URLs per host here? (Not sure how this behaves when
distributed.) Would it require setting a Partitioner, like
crawl.PartitionUrlByHost?

> Max. pages to crawl/fetch per site (emergency limit)
>
> Key: NUTCH-272
> URL: http://issues.apache.org/jira/browse/NUTCH-272
> Project: Nutch
> Type: Improvement
> Reporter: Stefan Neufeind
>
> If I'm right, there is no way in place right now for setting an "emergency
> limit" to fetch a certain max. number of pages per site. Is there an "easy"
> way to implement such a limit, maybe as a plugin?
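To illustrate the counting idea: once URLs are partitioned by host (as
crawl.PartitionUrlByHost does), every URL of a given host reaches the same
reduce, so a plain in-memory counter suffices there. The class below and its
maxPerHost limit are illustrative, not existing Nutch code.

    import java.net.URL;
    import java.util.HashMap;

    // Illustrative per-host cap, as it could be applied inside a reduce()
    // that sees URL-sorted keys; maxPerHost is a hypothetical limit.
    public class HostQuota {
      private final int maxPerHost;
      private final HashMap counts = new HashMap();  // host -> Integer

      public HostQuota(int maxPerHost) { this.maxPerHost = maxPerHost; }

      /** Returns true if this URL may still be added to the crawldb. */
      public boolean admit(String url) {
        String host;
        try {
          host = new URL(url).getHost();
        } catch (Exception e) {
          return false;  // unparseable URL: veto
        }
        Integer c = (Integer) counts.get(host);
        int n = (c == null) ? 0 : c.intValue();
        if (n >= maxPerHost) return false;  // quota exhausted: veto
        counts.put(host, new Integer(n + 1));
        return true;
      }
    }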
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413940 ]

Stefan Groschupf commented on NUTCH-289:
----------------------------------------

+1 Andrzej, I agree that looking up the IP in ParseOutputFormat would be
best, as Doug suggested.

The biggest problem Nutch has at the moment is spam. The most frequently seen
spam method is to set up a DNS server that returns the same IP for all
subdomains and then delivers dynamically generated content; the spammers then
just randomly generate subdomains within the content. It also often happens
that they have many URLs, all of them pointing to the same server == IP.
Buying more IP addresses is possible, but at the moment more expensive than
buying more domains.

Limiting the URLs by IP is a great approach to prevent the crawler from
getting stuck in honeypots with tens of thousands of URLs pointing to the
same IP. However, to do so we need to have the IP available already at
generation time, not look it up while fetching. We would be able to reuse the
IP in the fetcher; we can also try/catch the relevant parts in the fetcher
and, in case the IP is not available, re-resolve it.

I don't think round-robin DNS is a huge problem, since only large sites use
it, and in such a case each IP is able to handle requests.

In any case, storing the IP in the CrawlDatum and using it to limit URLs per
IP would be a big step forward in the fight against web spam.

> CrawlDatum should store IP address
>
> Key: NUTCH-289
> URL: http://issues.apache.org/jira/browse/NUTCH-289
> Project: Nutch
> Type: Bug
> Components: fetcher
> Versions: 0.8-dev
> Reporter: Doug Cutting
>
> If the CrawlDatum stored the IP address of the host of its URL, then one
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname. This
>   would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a
> new outlink, or perhaps during CrawlDB update.
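A minimal sketch of the resolve-early idea, using nothing beyond java.net;
the ResolvedUrl class and its fields are illustrative, not proposed Nutch
API:

    import java.net.InetAddress;
    import java.net.URL;

    // Illustrative outlink record that caches the resolved IP next to the
    // URL, so later phases (generate, fetch) can group and limit by address.
    public class ResolvedUrl {
      public final String url;
      public final String ip;  // dotted-quad, or null if resolution failed

      public ResolvedUrl(String url) {
        this.url = url;
        String addr = null;
        try {
          // one DNS lookup at record-creation time, as proposed for
          // ParseOutputFormat / crawldb update
          addr = InetAddress.getByName(new URL(url).getHost()).getHostAddress();
        } catch (Exception e) {
          // leave null; the fetcher can re-resolve later, as Stefan suggests
        }
        this.ip = addr;
      }
    }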
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413939 ]

Matt Kangas commented on NUTCH-289:
-----------------------------------

+1 to saving the IP address in CrawlDatum, wherever the value comes from
(Fetcher or otherwise).

> CrawlDatum should store IP address
>
> Key: NUTCH-289
> URL: http://issues.apache.org/jira/browse/NUTCH-289
> Project: Nutch
> Type: Bug
> Components: fetcher
> Versions: 0.8-dev
> Reporter: Doug Cutting
>
> If the CrawlDatum stored the IP address of the host of its URL, then one
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname. This
>   would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a
> new outlink, or perhaps during CrawlDB update.
java 1.4 versus 1.5
Java 1.5 has been out for a couple of years now and has some nice
improvements in the libraries. In particular, I wish I had access to the
timeout settings on UrlConnections. Would anyone object if, starting with the
0.3 release this week, we required Java 1.5 to compile and run Hadoop?

-- Owen
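For reference, the Java 5 additions Owen is referring to are
URLConnection.setConnectTimeout() and setReadTimeout(), which have no
per-connection equivalent in 1.4:

    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;

    public class TimeoutDemo {
      public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("http://lucene.apache.org/").openConnection();
        conn.setConnectTimeout(10 * 1000);  // Java 5: fail TCP connect after 10s
        conn.setReadTimeout(60 * 1000);     // Java 5: fail a blocked read after 60s
        InputStream in = conn.getInputStream();
        in.close();
      }
    }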
Do analyzer plugins have access to the Configuration?
Jérôme, or anybody familiar with the language plugin architecture,

I am writing a language analyzer plugin. This plugin has configurable
parameters, which I am hoping I can add to nutch-site.xml. But the German and
French plugin examples don't access the Configuration object.

Does the current analyzer plugin architecture allow each plugin
implementation to access the Configuration object? If not, what would it take
to allow such access? It would be best if it were allowed at plugin class
loading time and at instantiation time separately.

-kuro
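For illustration, the pattern being asked about would look roughly like this,
assuming the plugin framework passes a Configuration in via Hadoop's
Configurable interface; whether it actually does so is exactly the open
question, and the property name is invented for the example.

    import java.io.Reader;
    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Sketch: an analyzer that reads its parameters from nutch-site.xml,
    // assuming setConf() is invoked when the plugin is instantiated.
    public class MyLanguageAnalyzer extends Analyzer implements Configurable {
      private Configuration conf;
      private String stemMode;

      public void setConf(Configuration conf) {
        this.conf = conf;
        // "analyzer.mylang.stem" is a hypothetical property for this example
        this.stemMode = conf.get("analyzer.mylang.stem", "light");
      }

      public Configuration getConf() { return conf; }

      public TokenStream tokenStream(String field, Reader reader) {
        // a real implementation would pick its filters based on stemMode
        return new WhitespaceTokenizer(reader);
      }
    }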
Re: Fetcher and MapReduce
Hi,

so you have 3 boxes, since you run 3 reduce tasks? What happens is that 3
splits of your data are sorted. In the very end you will get as many output
files as you have reduce tasks. The sorting itself does happen in memory.
Check io.sort.factor and io.sort.mb in hadoop-default.xml (it may be inside
the hadoop jar).

HTH,
Stefan

On 24.05.2006 at 11:13, Hamza Kaya wrote:
> Hi,
>
> I'm trying to crawl approx. 500,000 URLs. After inject and generate I
> started fetchers using 6 map tasks and 3 reduce tasks. All the map tasks
> completed successfully, while all the reduce tasks got an OutOfMemory
> exception. This exception was caught after the append phase (during the
> sort phase).
>
> As far as I observed, during a fetch operation all the map tasks output to
> a temporary sequence file. During the reduce operation, each reducer
> copies all map outputs to its local disk and appends them to a single
> sequence file. After this operation the reducer tries to sort this file
> and writes the sorted file to its local disk. Then a record writer is
> opened to write this sorted file to the segment, which is in DFS.
>
> If this scenario is correct, then all the reduce tasks are supposed to do
> the same job: all try to sort the whole map output, and the winner of this
> operation will be able to write to DFS. So only one reducer is expected to
> write to DFS. If this is the case, then an OutOfMemory exception would not
> be surprising for 500,000+ URLs, since the reducers would try to sort a
> file bigger than 1 GB.
>
> Any comments on this scenario are welcome. And how can I avoid these
> exceptions?
>
> Thanx,
> -- Hamza KAYA
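Both knobs are ordinary Hadoop properties and can be overridden in
hadoop-site.xml; the values below are examples to show the mechanism, not
tuning advice:

    <!-- hadoop-site.xml: example overrides for the sort phase -->
    <property>
      <name>io.sort.factor</name>
      <value>50</value>   <!-- number of streams merged at once -->
    </property>
    <property>
      <name>io.sort.mb</name>
      <value>200</value>  <!-- buffer memory (in MB) used while sorting -->
    </property>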
Re: Mailing List nutch-agent Reports of Bots Submitting Forms
Ken Krugler wrote:
> 2. Are the Nutch Devs replying to the emails sent to this list? I could
> understand if they are replying off-list, but to an outside observer such
> as myself it appears as though webmasters are not getting many replies to
> their inquiries.

I can speak for myself only .. I'm not tracking that list. What about others?

Folks who are running a Nutch-based crawler that provides this email address
as the contact address should subscribe to this list and respond to messages,
especially those which may have been caused by their crawler. Others are also
encouraged to subscribe and help respond to messages here, as a bad
reputation for the crawler affects the whole project. This list is actually
fairly low-volume.

> This brings up an issue I've been thinking about. It might make sense to
> require everybody to set the user-agent string, versus having default
> values that point to Nutch. The first time you run Nutch, it would display
> an error re the user-agent string not being set, but if the instructions
> for how to do this were explicit, this wouldn't be much of a hardship for
> anybody trying it out.

+1 That would be a better solution.

Doug
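For operators who want to identify their crawler today, the override belongs
in nutch-site.xml; the values below are placeholders, not recommended
settings:

    <!-- nutch-site.xml: identify the operator, not the Nutch default -->
    <property>
      <name>http.agent.name</name>
      <value>ExampleCrawler</value>
    </property>
    <property>
      <name>http.agent.url</name>
      <value>http://example.com/crawler.html</value>
    </property>
    <property>
      <name>http.agent.email</name>
      <value>crawler at example dot com</value>
    </property>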
Re: Extract infos from documents and query external sites
Think about using the Google API. However, the way to go could be:

+ fetch your pages
+ do not parse the pages
+ write a map-reduce job that extracts your data (see the sketch below)
++ make an XHTML DOM from the HTML, e.g. using Neko
++ use XPath queries to extract your data
++ also check out GATE as a named-entity extraction tool, to extract names
   based on patterns and heuristics
++ write the names to a file
+ build your query URLs
+ inject the query URLs into an empty crawl db
+ create a segment, fetch it, and update a second empty crawl database from
  the segment
+ remove the first segment and db
+ create a segment with your second db and fetch it.

Your second segment will then contain only the paper pages.

HTH
Stefan

On 30.05.2006 at 12:14, HellSpawn wrote:
> I'm working on a search engine for my university, and they want me to
> create a repository of scientific articles on the web :D
> I read something about XPath for extracting exact parts from a document;
> once that's done, building the query is very easy, but my doubts are about
> how to insert all of this into the Nutch crawler...
> Thank you
> --
> View this message in context:
> http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4624272
> Sent from the Nutch - Dev forum at Nabble.com.
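A sketch of the Neko-plus-XPath step from the list above, assuming NekoHTML
on the classpath and the javax.xml.xpath API (Java 5); the XPath expression
and the sample HTML are just examples:

    import java.io.StringReader;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class TitleExtractor {
      public static void main(String[] args) throws Exception {
        String html = "<html><body><h1>A Paper Title</h1></body></html>";

        // NekoHTML builds a DOM even from sloppy real-world HTML
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));
        Document doc = parser.getDocument();

        // example XPath; note NekoHTML upper-cases HTML element names
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList)
            xpath.evaluate("//H1", doc, XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++) {
          System.out.println(hits.item(i).getTextContent());
        }
      }
    }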
Re: JVM error while parsing
Hi,

I heard there is a bug in JVM 1.5_06 beta; can you try an older or maybe a
1.4 JVM and report whether this happens with another JVM as well?

Thanks,
Stefan

On 30.05.2006 at 14:14, Uygar Yüzsüren wrote:
> Hi everyone,
>
> I am using Hadoop-0.2.0 and Nutch-0.8, and at the moment I am trying to
> complete a 1-depth crawl using the DFS and mapreduce structures. However,
> after a fetch step I encounter the JVM error below at one or more task
> trackers during the parsing step. It does not matter whether I use only
> the default parsers or also the additional ones (pdf, excel, etc.). My
> task trackers run on AMD X2 64-bit machines and my JVM version is 1.5_06.
>
> Have you ever faced such a problem at the parse stage? Or how do you think
> I can spot the cause of this JVM error? The error report is:
>
> 060530 144113 task_0007_m_10_0 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060530 144113 task_0007_m_10_0 5.0391704E-6%/crawl/segments/20060521171305/content/part-4/data:0+12303612
> 060530 144114 task_0007_m_10_0 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
> 060530 144114 task_0007_m_07_0 0.084114%/crawl/segments/20060521171305/content/part-00011/data:0+12493176
> 060530 144115 task_0007_m_07_0 0.09551566%/crawl/segments/20060521171305/content/part-00011/data:0+12493176
> 060530 144115 task_0007_m_07_0 #
> 060530 144115 task_0007_m_07_0 # An unexpected error has been detected by HotSpot Virtual Machine:
> 060530 144115 task_0007_m_07_0 #
> 060530 144115 task_0007_m_07_0 # SIGSEGV (0xb) at pc=0x003d1d247c10, pid=25093, tid=182894086496
> 060530 144115 task_0007_m_07_0 #
> 060530 144115 task_0007_m_07_0 # Java VM: Java HotSpot(TM) 64-Bit Server VM (1.5.0_06-b05 mixed mode)
> 060530 144115 task_0007_m_07_0 # Problematic frame:
> 060530 144115 task_0007_m_07_0 # C [libc.so.6+0x47c10] printf_size+0x740
> 060530 144115 task_0007_m_07_0 #
> 060530 144115 task_0007_m_07_0 # An error report file with more information is saved as hs_err_pid25093.log
> 060530 144115 task_0007_m_07_0 #
> 060530 144115 task_0007_m_07_0 # If you would like to submit a bug report, please visit:
> 060530 144115 task_0007_m_07_0 #   http://java.sun.com/webapps/bugreport/crash.jsp
> 060530 144115 task_0007_m_07_0 #
> 060530 144115 Server connection on port 51950 from 192.168.15.61: exiting
> 060530 144115 task_0007_m_07_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 134.
>         at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:242)
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)
>
> Thank you very much.
[jira] Updated: (NUTCH-283) If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads
[ http://issues.apache.org/jira/browse/NUTCH-283?page=all ]

Scott Ganyo updated NUTCH-283:
------------------------------

Attachment: patch.txt

There was a typo in the earlier patch. This patch supersedes the first patch.

> If the Fetcher times out and abandons Fetcher Threads, severe errors will
> occur on those Threads
>
> Key: NUTCH-283
> URL: http://issues.apache.org/jira/browse/NUTCH-283
> Project: Nutch
> Type: Bug
> Components: fetcher
> Versions: 0.8-dev
> Reporter: Scott Ganyo
> Attachments: patch.txt, patch.txt
>
> If a Fetcher has chosen to time out and has abandoned outstanding Fetcher
> Threads, resources that those Fetcher Threads may be using are closed. This
> naturally causes any abandoned Fetcher Threads to fail when they later
> attempt to finish up their work in progress.
> I have a patch that addresses this that I am attaching.
RE: NPE When using a merged segment
I was about to look into it, but wasn't sure which var was holding the new
segment name to replace with "segment" :) Lucky for me you read this
email... :)

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 30, 2006 6:31 PM
To: nutch-dev@lucene.apache.org
Subject: Re: NPE When using a merged segment

Gal Nitzan wrote:
> I think it is a bug. It saves the old segment name instead of replacing it
> with the new segment name

I confirm, this is a bug - I forgot that Indexer relies on this metadata ...
I'll fix it in a moment - sorry for the trouble!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: NPE When using a merged segment
Gal Nitzan wrote:
> I think it is a bug. It saves the old segment name instead of replacing it
> with the new segment name

I confirm, this is a bug - I forgot that Indexer relies on this metadata ...
I'll fix it in a moment - sorry for the trouble!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
RE: NPE When using a merged segment
I think it is a bug. It saves the old segment name instead of replacing it
with the new segment name.

-----Original Message-----
From: Dominik Friedrich [mailto:[EMAIL PROTECTED]
Sent: Monday, May 29, 2006 7:57 PM
To: nutch-dev@lucene.apache.org
Subject: Re: NPE When using a merged segment

I have the same problem with a merged segment. I had a look at the index with
Luke, and it seems that the indexer puts the old segment names in there
instead of the name of the merged segment. I'm not sure if I did something
wrong or if this is a bug.

Dominik

Gal Nitzan wrote:
> Hi,
>
> I have built a new index based on the new segment only.
>
> -----Original Message-----
> From: Stefan Neufeind [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 29, 2006 10:03 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: NPE When using a merged segment
>
> Gal Nitzan wrote:
>> Hi,
>>
>> After using mergesegs to merge all my segments into one segment only, I
>> moved the new segment to segments.
>>
>> When accessing the web UI I get:
>>
>> java.lang.RuntimeException: java.lang.NullPointerException
>> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
>> org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
>> org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:175)
>
> Hi,
>
> I'm not sure - but have you tried reindexing that new segment? To my
> understanding the index holds references to the segment (segment name) -
> and in your case those are invalid. This would also explain the error you
> get (in the call to getSummary), because the summary is fetched from the
> segment.
>
> If this works, then maybe you'll need to find a better way of cleaning up
> the index - not reindexing everything but maybe just rewriting the segment
> names all into one or so.
>
> Feedback welcome.
>
> Good luck,
> Stefan
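Dominik spotted the stale names with Luke; the same check can be scripted. A
small sketch against the Lucene API of the era, assuming the per-document
field is named "segment" (the field FetchedSegments looks up); pass the index
directory as the argument:

    import java.util.HashSet;
    import java.util.Iterator;
    import org.apache.lucene.index.IndexReader;

    // Prints the distinct segment names referenced by an index, so they can
    // be compared against the directories actually present under segments/.
    public class ListSegmentNames {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);
        HashSet names = new HashSet();
        for (int i = 0; i < reader.maxDoc(); i++) {
          if (!reader.isDeleted(i)) {
            names.add(reader.document(i).get("segment"));
          }
        }
        reader.close();
        for (Iterator it = names.iterator(); it.hasNext();) {
          System.out.println(it.next());
        }
      }
    }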
JVM error while parsing
Hi everyone,

I am using Hadoop-0.2.0 and Nutch-0.8, and at the moment I am trying to
complete a 1-depth crawl using the DFS and mapreduce structures. However,
after a fetch step I encounter the JVM error below at one or more task
trackers during the parsing step. It does not matter whether I use only the
default parsers or also the additional ones (pdf, excel, etc.). My task
trackers run on AMD X2 64-bit machines and my JVM version is 1.5_06.

Have you ever faced such a problem at the parse stage? Or how do you think I
can spot the cause of this JVM error? The error report is:

060530 144113 task_0007_m_10_0 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060530 144113 task_0007_m_10_0 5.0391704E-6%/crawl/segments/20060521171305/content/part-4/data:0+12303612
060530 144114 task_0007_m_10_0 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060530 144114 task_0007_m_07_0 0.084114%/crawl/segments/20060521171305/content/part-00011/data:0+12493176
060530 144115 task_0007_m_07_0 0.09551566%/crawl/segments/20060521171305/content/part-00011/data:0+12493176
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An unexpected error has been detected by HotSpot Virtual Machine:
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # SIGSEGV (0xb) at pc=0x003d1d247c10, pid=25093, tid=182894086496
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # Java VM: Java HotSpot(TM) 64-Bit Server VM (1.5.0_06-b05 mixed mode)
060530 144115 task_0007_m_07_0 # Problematic frame:
060530 144115 task_0007_m_07_0 # C [libc.so.6+0x47c10] printf_size+0x740
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An error report file with more information is saved as hs_err_pid25093.log
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # If you would like to submit a bug report, please visit:
060530 144115 task_0007_m_07_0 #   http://java.sun.com/webapps/bugreport/crash.jsp
060530 144115 task_0007_m_07_0 #
060530 144115 Server connection on port 51950 from 192.168.15.61: exiting
060530 144115 task_0007_m_07_0 Child Error
java.io.IOException: Task process exit with nonzero status of 134.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:242)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)

Thank you very much.
Re: Extract infos from documents and query external sites
I'm working on a search engine for my university, and they want me to create
a repository of scientific articles on the web :D

I read something about XPath for extracting exact parts from a document; once
that's done, building the query is very easy, but my doubts are about how to
insert all of this into the Nutch crawler...

Thank you
--
View this message in context: http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4624272
Sent from the Nutch - Dev forum at Nabble.com.
[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413778 ]

Stefan Neufeind commented on NUTCH-292:
---------------------------------------

That patch is for the 0.7 branch, right? In 0.8-dev you'd want to do that in
BasicSummarizer.java. But to me it looks like something similar is already in
place:

    //
    // Iterate through as long as we're before the end of
    // the document and we haven't hit the max-number-of-items
    // -in-a-summary.
    //
    while ((j < endToken) && (j - startToken < sumLength)) {

But I also suspect it might have something to do with tokens. What I
experienced is that several search results currently contain arbitrary binary
data. Those are the cases where a parser plugin has "failed" and where
parse-text was used as a fallback. If I'm right, this might lead to quite
large tokens, because no whitespace is found in a long run of characters.

@Marcel: Thank you for the fix anyway ... your help is very much appreciated.

> OpenSearchServlet: OutOfMemoryError: Java heap space
>
> Key: NUTCH-292
> URL: http://issues.apache.org/jira/browse/NUTCH-292
> Project: Nutch
> Type: Bug
> Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Priority: Critical
> Attachments: summarizer.diff
>
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
> org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
> org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
> javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
> javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>
> The URL I use is:
> [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
>
> It seems to be a problem specific to the data I'm working with. Moving the
> start from 0 to 10 or changing the query works fine.
> Or maybe it doesn't have to do with sorting but it's just that I hit one
> "bad search-result" that has a broken summary?
>
> !! The problem is repeatable. So if anybody has an idea where to search /
> what to fix, I can easily try that out !!
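If oversized tokens from binary fallback text are indeed the trigger, one
defensive option is to drop implausibly long tokens before collecting them.
This is a sketch, not existing Nutch code, and the 4096-character bound is
arbitrary:

    // Sketch: filter out tokens longer than a sanity limit before they are
    // collected for summarization; maxTokenLength is an illustrative bound.
    private static final int maxTokenLength = 4096;

    private boolean isPlausible(Token token) {
      return (token.endOffset() - token.startOffset()) <= maxTokenLength;
    }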
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413780 ]

Stefan Neufeind commented on NUTCH-290:
---------------------------------------

The plugin itself IMHO works fine now: it no longer throws an exception, and
if extraction is allowed it outputs text correctly. However, I still get the
"garbage output" from a PDF. Could that be because, in case no extraction is
allowed (an empty parse text is returned), the parser still falls back to
using the raw text for indexing?

What I did was delete crawl_parse and parse_* from the segments directory,
run "nutch parse", and reindex everything. However, the raw chars in the
search output (summary) remain. :-((

> parse-pdf: Garbage indexed when text-extraction not allowed
>
> Key: NUTCH-290
> URL: http://issues.apache.org/jira/browse/NUTCH-290
> Project: Nutch
> Type: Bug
> Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
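In PDFBox terms, the guard the attachment name suggests would look roughly
like the sketch below; this illustrates the idea, it is not the contents of
the actual patch:

    import java.io.InputStream;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class GuardedPdfText {
      // Returns "" when the PDF forbids text extraction, so no undecoded
      // garbage reaches the index.
      public static String extractText(InputStream in) throws Exception {
        PDDocument pdf = PDDocument.load(in);
        try {
          if (!pdf.getCurrentAccessPermission().canExtractContent()) {
            return "";
          }
          return new PDFTextStripper().getText(pdf);
        } finally {
          pdf.close();
        }
      }
    }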