[jira] Created: (NUTCH-294) Topic-maps of related searchwords
Topic-maps of related searchwords

Key: NUTCH-294
URL: http://issues.apache.org/jira/browse/NUTCH-294
Project: Nutch
Type: New Feature
Components: searcher
Reporter: Stefan Neufeind

Would it be possible to offer users topic-maps? That is, when you search for something, you also get topic-related words that might be of interest to you. I wonder if that's somehow possible with the ngram-index for "did you mean" (see the separate feature-enhancement-bug for this), but we'd need to have a relation between words (in what context do they occur).

For the webfrontend usually trees are used - which for some users offer quite impressive eye-candy :-) E.g. see this advertisement by Novell, where I've just seen a similar topic-map as well: http://www.novell.com/de-de/company/advertising/defineyouropen.html

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)
[ http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ]

Stefan Groschupf commented on NUTCH-282:

Is that related to the host grouping we discussed? If so, can we close this bug?

Showing too few results on a page (Paging not correct)

Key: NUTCH-282
URL: http://issues.apache.org/jira/browse/NUTCH-282
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind

I did a search and got back the value itemsPerPage from opensearch. But the output shows results 1-8 and I have a total of 46 searchresults. Same happens for the webinterface. Why aren't enough results fetched? The problem might be somewhere in the area where Nutch should only display a certain number of results per site.
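To illustrate how a per-site limit could lead to a short page, here is a minimal Java sketch. It is not Nutch's code: `SiteGroupingSketch` and `limitPerSite` are hypothetical names, and the sketch assumes the limit is applied after a fixed-size page of raw hits has already been fetched. Under that assumption, a page shrinks whenever several raw hits share a host.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Nutch implementation.
final class SiteGroupingSketch {

    /** Keeps at most maxPerSite URLs per host, preserving the original order. */
    static List<String> limitPerSite(List<String> urls, int maxPerSite) {
        Map<String, Integer> seenPerHost = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            // Count this hit for its host; drop it once the host is over the limit.
            int count = seenPerHost.merge(host, 1, Integer::sum);
            if (count <= maxPerSite) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

If the limit works this way, filling the page would require fetching further raw hits until enough of them survive the filter.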
[jira] Commented: (NUTCH-286) Handling common error-pages as 404
[ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ]

Stefan Groschupf commented on NUTCH-286:

This is difficult to implement, since the HTTP status code is read from the response in the fetcher and set in the protocol status, whereas content analysis can only be done during parsing. Also, such pages normally do not get a high OPIC score and should not be among the top search results. However, this is a misconfigured HTTP server response, so you may want to open a bug in the typo3 issue tracker. Should we close this issue?

Handling common error-pages as 404

Key: NUTCH-286
URL: http://issues.apache.org/jira/browse/NUTCH-286
Project: Nutch
Type: Improvement
Reporter: Stefan Neufeind

Idea: Some pages from some software-packages/scripts report an http 200 ok even though a specific page could not be found. Example I just found is: http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef

That's a typo3-page explaining in its standard layout and wording: "The requested page did not exist or was inaccessible."

So I had the idea that somebody might create a plugin that could find commonly used formulations for "page does not exist" etc. and turn the page into a 404 before feeding it into the nutch-index - although the server responded with status 200 ok.
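The phrase-matching idea from the issue could be sketched roughly as follows. This is a standalone illustration, not a Nutch plugin: `SoftNotFoundSketch`, `PHRASES` and `looksLikeNotFound` are hypothetical names, and only the first phrase is taken from the issue text; the second is an assumed example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not a Nutch plugin.
final class SoftNotFoundSketch {

    // Lower-case phrases to look for. The first is quoted in the issue,
    // the second is an assumption added for illustration.
    static final List<String> PHRASES = Arrays.asList(
            "the requested page did not exist or was inaccessible",
            "page not found");

    /** Returns true if the extracted page text contains one of the known phrases. */
    static boolean looksLikeNotFound(String pageText) {
        if (pageText == null) {
            return false;
        }
        // Lower-case and collapse whitespace so line breaks do not hide a phrase.
        String normalized = pageText.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        for (String phrase : PHRASES) {
            if (normalized.contains(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

A check like this needs the extracted text, which fits the comment above that content analysis is only possible at parse time, after the protocol status has already been set.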
[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ]

Stefan Groschupf commented on NUTCH-292:

+1. Can someone create a clean patch file?

OpenSearchServlet: OutOfMemoryError: Java heap space

Key: NUTCH-292
URL: http://issues.apache.org/jira/browse/NUTCH-292
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Critical
Attachments: summarizer.diff

java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
    org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
    org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

The URL I use is:
[...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url

It seems to be a problem specific to the data I'm working with. Moving the start from 0 to 10 or changing the query works fine. Or maybe it doesn't have to do with sorting, and it's just that I hit one bad search-result that has a broken summary?

!! The problem is repeatable. So if anybody has an idea where to search / what to fix, I can easily try that out !!
[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified
[ http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ]

Stefan Groschupf commented on NUTCH-291:

lastModified will only be indexed if you switch on the index-more plugin. If you think the way lastModified and date are stored in the index should change, please submit a patch for MoreIndexingFilter.

OpenSearchServlet should return date as well as lastModified

Key: NUTCH-291
URL: http://issues.apache.org/jira/browse/NUTCH-291
Project: Nutch
Type: Improvement
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-291-unfinished.patch

Currently lastModified is provided by OpenSearchServlet - but only if the lastModified date is known. Since you can sort by date (which is lastModified or, if not present, the fetchdate), it might be useful if OpenSearchServlet could provide date as well.
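The fallback rule described in the issue (date is lastModified, or the fetch date if lastModified is not present) can be written down as a tiny Java sketch. `DateFallbackSketch` and `effectiveDate` are hypothetical names, and representing "not known" as a value of zero or less is an assumption made only for this example.

```java
// Hypothetical illustration only; not the OpenSearchServlet or MoreIndexingFilter code.
final class DateFallbackSketch {

    /**
     * Returns the date to report for a hit, in epoch milliseconds.
     * Assumption: a lastModified value of zero or less means "not known".
     */
    static long effectiveDate(long lastModifiedMillis, long fetchTimeMillis) {
        return lastModifiedMillis > 0 ? lastModifiedMillis : fetchTimeMillis;
    }
}
```

Returning this value as a separate date field would leave lastModified untouched for pages where the server did not send it.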
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ]

Stefan Groschupf commented on NUTCH-290:

If a parser throws an exception:

Fetcher, 261:

    try {
      parse = this.parseUtil.parse(content);
      parseStatus = parse.getData().getStatus();
    } catch (Exception e) {
      parseStatus = new ParseStatus(e);
    }
    if (!parseStatus.isSuccess()) {
      LOG.warning("Error parsing: " + key + ": " + parseStatus);
      parse = parseStatus.getEmptyParse(getConf());
    }

then we use the empty parse object, and an empty parse contains no text at all; see getText:

    private static class EmptyParseImpl implements Parse {

      private ParseData data = null;

      public EmptyParseImpl(ParseStatus status, Configuration conf) {
        data = new ParseData(status, "", new Outlink[0], new Metadata(), new Metadata());
        data.setConf(conf);
      }

      public ParseData getData() {
        return data;
      }

      public String getText() {
        return "";
      }
    }

So the problem should be somewhere else.

parse-pdf: Garbage indexed when text-extraction not allowed

Key: NUTCH-290
URL: http://issues.apache.org/jira/browse/NUTCH-290
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-290-canExtractContent.patch

It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed.

Example-PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
[jira] Updated: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space
[ http://issues.apache.org/jira/browse/NUTCH-292?page=all ]

Stefan Neufeind updated NUTCH-292:

Attachment: NUTCH-292-summarizer08.diff

As requested, here is the patch. Please note that I have not thoroughly tested it myself. But the patch looks fine and makes sense :-) Oh, and it compiles clean ...

OpenSearchServlet: OutOfMemoryError: Java heap space

Key: NUTCH-292
URL: http://issues.apache.org/jira/browse/NUTCH-292
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Critical
Attachments: NUTCH-292-summarizer08.diff, summarizer.diff

java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
    org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
    org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

The URL I use is:
[...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url

It seems to be a problem specific to the data I'm working with. Moving the start from 0 to 10 or changing the query works fine. Or maybe it doesn't have to do with sorting, and it's just that I hit one bad search-result that has a broken summary?

!! The problem is repeatable. So if anybody has an idea where to search / what to fix, I can easily try that out !!
[jira] Closed: (NUTCH-287) Exception when searching with sort
[ http://issues.apache.org/jira/browse/NUTCH-287?page=all ]

Stefan Groschupf closed NUTCH-287:

Resolution: Won't Fix

http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html

Exception when searching with sort

Key: NUTCH-287
URL: http://issues.apache.org/jira/browse/NUTCH-287
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Critical

Running a search with sort=url works. But when using sort=title I get the following exception.

2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception
java.lang.RuntimeException: Unknown sort value type!
    at org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
    at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
    at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
    at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
    at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
    at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
    at java.lang.Thread.run(Thread.java:595)

What is in those lines is:

    WritableComparable sortValue;               // convert value to writable
    if (sortField == null) {
      sortValue = new FloatWritable(scoreDocs[i].score);
    } else {
      Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
      if (raw instanceof Integer) {
        sortValue = new IntWritable(((Integer)raw).intValue());
      } else if (raw instanceof Float) {
        sortValue = new FloatWritable(((Float)raw).floatValue());
      } else if (raw instanceof String) {
        sortValue = new UTF8((String)raw);
      } else {
        throw new RuntimeException("Unknown sort value type!");
      }
    }

So I thought that maybe raw is an instance of something strange and tried raw.getClass().getName() and also raw.toString() to track the cause down - but that always resulted in a NullPointerException. So it seems that raw is null for some strange reason. When I try with title2 (or something non-existing) I get a different error, saying that title2 is unknown / not indexed. So I suspect that title
[jira] Closed: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=all ]

Stefan Groschupf closed NUTCH-284:

Resolution: Won't Fix

Yes, I was missing index-basic.

NullPointerException during index

Key: NUTCH-284
URL: http://issues.apache.org/jira/browse/NUTCH-284
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind

For quite a while this reduce sort has been going on. Then it fails. What could be wrong with this?

060524 212613 reduce sort
060524 212614 reduce sort
060524 212615 reduce sort
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212619 Optimizing index.
060524 212619 job_jlbhhm
java.lang.NullPointerException
    at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
[jira] Commented: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ]

Stefan Groschupf commented on NUTCH-284:

Please try to discuss such things on the user mailing list first, before opening an issue. Maintaining the issue tracker is very time-consuming. But if there is a bug, please continue to open bug reports. :) Thanks.

NullPointerException during index

Key: NUTCH-284
URL: http://issues.apache.org/jira/browse/NUTCH-284
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind

For quite a while this reduce sort has been going on. Then it fails. What could be wrong with this?

060524 212613 reduce sort
060524 212614 reduce sort
060524 212615 reduce sort
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212619 Optimizing index.
060524 212619 job_jlbhhm
java.lang.NullPointerException
    at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
[jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments
[ http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ]

Stefan Groschupf commented on NUTCH-281:

Can you submit a patch file?

cached.jsp: base-href needs to be outside comments

Key: NUTCH-281
URL: http://issues.apache.org/jira/browse/NUTCH-281
Project: Nutch
Type: Bug
Components: web gui
Reporter: Stefan Neufeind
Priority: Trivial

See cached.jsp: base href=... does not take effect when showing a cached page, because of the comments around it.
[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error
[ http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ]

Stefan Groschupf commented on NUTCH-274:

Should we fix this in Hadoop's TextInputFormat, so that it ignores empty lines, or in the Injector?

Empty row in/at end of URL-list results in error

Key: NUTCH-274
URL: http://issues.apache.org/jira/browse/NUTCH-274
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: nightly-2006-05-20
Reporter: Stefan Neufeind
Priority: Minor

This is minor - but it's a little unclean :-)

Reproduce: Have a URL-file with one URL followed by a newline, thus producing an empty line.

Outcome: Fetcher-threads try to fetch two URLs at the same time. The first one is fine - but the second is empty and therefore fails proper protocol-detection.

60521 022639 Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
060521 022639 Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
060521 022639 found resource parse-plugins.xml at file:/home/mm/nutch-nightly/conf/parse-plugins.xml
060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060521 022639 fetching http://www.bild.de/
060521 022639 fetching
060521 022639 fetch of failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: no protocol:
060521 022639 http.proxy.host = null
060521 022639 http.proxy.port = 8080
060521 022639 http.timeout = 1
060521 022639 http.content.limit = 65536
060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 022639 fetcher.server.delay = 1000
060521 022639 http.max.delays = 1000
060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 022640 map 0% reduce 0%
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
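Whichever component ends up owning the fix, the filtering itself is small. The following standalone Java sketch shows the idea; `UrlListSketch` and `parseUrlList` are hypothetical names and this is neither the Injector nor Hadoop's TextInputFormat. It simply drops blank lines before any URL is handed on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Injector or TextInputFormat code.
final class UrlListSketch {

    /** Splits the content of a URL file into lines and skips empty or whitespace-only ones. */
    static List<String> parseUrlList(String content) {
        List<String> urls = new ArrayList<>();
        for (String line : content.split("\r?\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }
}
```

With a filter like this in place, the empty second entry from the log above would never reach protocol detection.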
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414458 ]

Stefan Neufeind commented on NUTCH-290:

But if one plugin fails in 0.8-dev, isn't the next one used? I understand that in the default config the text-parser would be used as the last-resort fallback. Also, I'm not sure where the summary-text comes from if I use the patch above to avoid generating an exception and return empty parse-data instead.

parse-pdf: Garbage indexed when text-extraction not allowed

Key: NUTCH-290
URL: http://issues.apache.org/jira/browse/NUTCH-290
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-290-canExtractContent.patch

It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed.

Example-PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified
[ http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414466 ]

Stefan Neufeind commented on NUTCH-291:

Which way is most favorable? To always set lastModified although it was not returned from the webserver (maybe unclean) or always return date as well (cleaner?).

OpenSearchServlet should return date as well as lastModified

Key: NUTCH-291
URL: http://issues.apache.org/jira/browse/NUTCH-291
Project: Nutch
Type: Improvement
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-291-unfinished.patch

Currently lastModified is provided by OpenSearchServlet - but only if the lastModified date is known. Since you can sort by date (which is lastModified or, if not present, the fetchdate), it might be useful if OpenSearchServlet could provide date as well.
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ]

Stefan Groschupf commented on NUTCH-290:

As far as I understand the code, the next parser is only used if the previous parser returns an unsuccessful parse status. If the parser throws an exception, that exception is not caught in ParseUtil at all. So, to solve this problem, the PDF parser should throw an exception rather than report an unsuccessful status, shouldn't it?

parse-pdf: Garbage indexed when text-extraction not allowed

Key: NUTCH-290
URL: http://issues.apache.org/jira/browse/NUTCH-290
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-290-canExtractContent.patch

It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed.

Example-PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
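The fallback behaviour described in this comment can be modelled with a small standalone Java sketch. It is only a model of the description, not the ParseUtil code: `ParserChainSketch`, `SketchParser` and `parseWithFallback` are hypothetical names, and an empty `Optional` stands in for an unsuccessful parse status.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical model of the described behaviour; not the Nutch ParseUtil code.
final class ParserChainSketch {

    /** A parser returns its text, or Optional.empty() to signal an unsuccessful status. */
    interface SketchParser {
        Optional<String> parse(byte[] content);
    }

    /**
     * Tries the parsers in order. The next parser is used only when the previous
     * one reports failure; an exception is not caught here and reaches the caller.
     */
    static String parseWithFallback(List<SketchParser> parsers, byte[] content) {
        for (SketchParser parser : parsers) {
            Optional<String> text = parser.parse(content);
            if (text.isPresent()) {
                return text.get();
            }
        }
        // No parser succeeded: behave like an empty parse.
        return "";
    }
}
```

In this model, a PDF parser that reports failure hands the raw bytes on to the next parser in the list, while one that throws stops the chain, so the caller can substitute an empty parse as in the Fetcher code quoted earlier in this thread.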
[jira] Closed: (NUTCH-286) Handling common error-pages as 404
[ http://issues.apache.org/jira/browse/NUTCH-286?page=all ]

Stefan Groschupf closed NUTCH-286:

Resolution: Won't Fix

I hope everybody agrees with the statement: we cannot detect HTTP response codes based on the returned HTML content. Pruning the index is a good way to solve the problem.

Handling common error-pages as 404

Key: NUTCH-286
URL: http://issues.apache.org/jira/browse/NUTCH-286
Project: Nutch
Type: Improvement
Reporter: Stefan Neufeind

Idea: Some pages from some software-packages/scripts report an http 200 ok even though a specific page could not be found. Example I just found is: http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef

That's a typo3-page explaining in its standard layout and wording: "The requested page did not exist or was inaccessible."

So I had the idea that somebody might create a plugin that could find commonly used formulations for "page does not exist" etc. and turn the page into a 404 before feeding it into the nutch-index - although the server responded with status 200 ok.
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414477 ]

Stefan Neufeind commented on NUTCH-290:

But to my understanding of the plugin, it still extracts as much as possible (meta-data) from the PDF. So if indexing is not allowed but this is a PDF, then returning empty text as the document-body should be fine - shouldn't it? Nothing else except a PDF-plugin will be able to handle PDF correctly in this case.

Stefan G., can you point out why I see binary data as the summary for a PDF, and whether there is a possible fix for it in the context of this bug?

parse-pdf: Garbage indexed when text-extraction not allowed

Key: NUTCH-290
URL: http://issues.apache.org/jira/browse/NUTCH-290
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-290-canExtractContent.patch

It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed.

Example-PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
[jira] Created: (NUTCH-295) More description for fetcher.threads.fetch property
More description for fetcher.threads.fetch property

Key: NUTCH-295
URL: http://issues.apache.org/jira/browse/NUTCH-295
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Dennis Kubes
Priority: Minor

Added some description to the fetcher.threads.fetch property to explain the number of threads running in a cluster. Patch is attached.
[jira] Updated: (NUTCH-295) More description for fetcher.threads.fetch property
[ http://issues.apache.org/jira/browse/NUTCH-295?page=all ]

Dennis Kubes updated NUTCH-295:

Attachment: fetcher_threads_desc.patch

More description for fetcher.threads.fetch property as relating to running in distributed mode.

More description for fetcher.threads.fetch property

Key: NUTCH-295
URL: http://issues.apache.org/jira/browse/NUTCH-295
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Dennis Kubes
Priority: Minor
Attachments: fetcher_threads_desc.patch

Added some description to the fetcher.threads.fetch property to explain the number of threads running in a cluster. Patch is attached.