[jira] Commented: (NUTCH-377) Add possibility to search for multiple values
[ http://issues.apache.org/jira/browse/NUTCH-377?page=comments#action_12439018 ]

Stefan Neufeind commented on NUTCH-377:
---------------------------------------

Hmm, I'm not sure I understand how to do that. There is one part which adds prohibited or required phrases, but ... To my understanding, isn't the above example parsed "as is" into one string for the whole "site:...|..." expression? If so, could the split maybe be done where the site-command is evaluated? I had a look at query-site - but there doesn't seem to be much code over there ... What would be a good syntax that the nutch-community could agree on? And could you maybe wrap up an initial patch for that?

> Add possibility to search for multiple values
> ---------------------------------------------
>
>         Key: NUTCH-377
>         URL: http://issues.apache.org/jira/browse/NUTCH-377
>     Project: Nutch
>  Issue Type: Improvement
>  Components: searcher
>    Reporter: Stefan Neufeind
>
> Searches with boolean operators (AND or OR) are not (yet) possible. All
> search-items are always searched with AND.
> But it would be nice to have the possibility to allow multiple values for a
> certain field. Maybe that could be done using a separator?
> As an example you might want to search for:
> someword site:www.example.org|www.apache.org
> Which (to my understanding) would allow to search for one or more words with a
> restriction to those two sites. It would prevent having to implement AND and
> OR fully (maybe even including brackets) but would allow to cover a few often
> used cases imho.
> Easy/hard to do? To my understanding Lucene itself allows AND/OR-searches. So
> it might basically be a problem of string-parsing and query-building towards
> Lucene?

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-377) Add possibility to search for multiple values
Add possibility to search for multiple values
---------------------------------------------

        Key: NUTCH-377
        URL: http://issues.apache.org/jira/browse/NUTCH-377
    Project: Nutch
 Issue Type: Improvement
 Components: searcher
   Reporter: Stefan Neufeind

Searches with boolean operators (AND or OR) are not (yet) possible. All
search-items are always searched with AND.
But it would be nice to have the possibility to allow multiple values for a
certain field. Maybe that could be done using a separator?
As an example you might want to search for:
someword site:www.example.org|www.apache.org
Which (to my understanding) would allow to search for one or more words with a
restriction to those two sites. It would prevent having to implement AND and
OR fully (maybe even including brackets) but would allow to cover a few often
used cases imho.
Easy/hard to do? To my understanding Lucene itself allows AND/OR-searches. So
it might basically be a problem of string-parsing and query-building towards
Lucene?
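The string-parsing and query-building the report speculates about could be sketched like this. This is a standalone illustration with hypothetical class and method names, not actual Nutch code; it only shows how a "|"-separated field value could be split and turned into an OR-ed Lucene query fragment:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Nutch: split a multi-value field. */
public class SiteClauseSplitter {

    /** Split a value like "www.example.org|www.apache.org" on the separator. */
    public static List<String> splitValues(String fieldValue) {
        return Arrays.asList(fieldValue.split("\\|"));
    }

    /** Build an OR-ed Lucene query-syntax clause for one field. */
    public static String toLuceneClause(String field, String fieldValue) {
        List<String> values = splitValues(fieldValue);
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) sb.append(" OR ");
            sb.append(field).append(':').append(values.get(i));
        }
        return sb.append(')').toString();
    }
}
```

For the example above, `toLuceneClause("site", "www.example.org|www.apache.org")` would produce `(site:www.example.org OR site:www.apache.org)`, which the required search words could then be ANDed with.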
[jira] Commented: (NUTCH-334) I am using the search technique
[ http://issues.apache.org/jira/browse/NUTCH-334?page=comments#action_12424554 ]

Stefan Neufeind commented on NUTCH-334:
---------------------------------------

Right, that's a real bug :-)

> I am using the search technique
> -------------------------------
>
> Key: NUTCH-334
> URL: http://issues.apache.org/jira/browse/NUTCH-334
> Project: Nutch
> Issue Type: Bug
> Reporter: Siddharudh nadgeri
[jira] Commented: (NUTCH-335) Pdf summary corrupt issue
[ http://issues.apache.org/jira/browse/NUTCH-335?page=comments#action_12424553 ]

Stefan Neufeind commented on NUTCH-335:
---------------------------------------

The problem is that in most cases I've come across, the PDF is protected and does not allow text-extraction. Though this could theoretically be worked around, it's not really allowed afaik. Also, there is an issue for this already: http://issues.apache.org/jira/browse/NUTCH-290

What still needs to be worked around imho is that no text should be shown instead of the garbage - and I'd like a clarification of why the raw binary data is currently taken as the summary.

PS: Next time please search before opening a new issue. (Meant just as information, not to make anybody angry ...)

> Pdf summary corrupt issue
> -------------------------
>
> Key: NUTCH-335
> URL: http://issues.apache.org/jira/browse/NUTCH-335
> Project: Nutch
> Issue Type: Bug
> Environment: As it is a web application it is not necessary
> Reporter: Siddharudh nadgeri
>
> I am using the Nutch search but for pdf it is giving summary as some garbage
> like
> "!!"#"#"#"#"#"#"#"!$%$%$##'$$ ("$$$
> please provide the solution
[jira] Commented: (NUTCH-271) Meta-data per URL/site/section
[ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_1246 ]

Stefan Neufeind commented on NUTCH-271:
---------------------------------------

Does somebody have an existing demo-plugin for that, which would read URL-prefixes from a file and, in case matches are found, add certain tags? I don't yet fully get how to do it "the elegant way" :-)

> Meta-data per URL/site/section
> ------------------------------
>
> Key: NUTCH-271
> URL: http://issues.apache.org/jira/browse/NUTCH-271
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 0.7.2
> Reporter: Stefan Neufeind
>
> We have the need to index sites and attach additional meta-data-tags to them.
> Afaik this is not yet possible, or is there a "workaround" I don't see? What
> I think of is using meta-tags per start-url, only indexing content below that
> URL, and having the ability to limit searches upon those meta-tags. E.g.
> http://www.example1.com/something1/ -> meta-tag "companybranch1"
> http://www.example2.com/something2/ -> meta-tag "companybranch2"
> http://www.example3.com/something3/ -> meta-tag "companybranch1"
> http://www.example4.com/something4/ -> meta-tag "companybranch3"
> search for everything in companybranch1, or across 1 and 3, or similar
[jira] Updated: (NUTCH-279) Additions for regex-normalize
[ http://issues.apache.org/jira/browse/NUTCH-279?page=all ]

Stefan Neufeind updated NUTCH-279:
----------------------------------

    Attachment: regex-normalize2.patch

New patch with just one session-ID-regex extended (now also including . - ,), since I came across those extra chars while using it on a popular German website (www.bahn.de).

> Additions for regex-normalize
> -----------------------------
>
> Key: NUTCH-279
> URL: http://issues.apache.org/jira/browse/NUTCH-279
> Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Attachments: regex-normalize.patch, regex-normalize2.patch
>
> Imho needed:
> 1) Extend normalize-rules to commonly used session-ids etc.
> 2) Ship a checker to check rules easily by hand
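For illustration, a session-ID-stripping rule of the kind this patch extends could look like the following plain-Java sketch. The pattern and class name here are made up for the example (they are not the contents of regex-normalize2.patch), and a real normalizer would additionally repair left-over ?/& delimiters:

```java
import java.util.regex.Pattern;

/** Illustrative sketch only, not actual Nutch normalizer code. */
public class SessionIdNormalizer {
    // Strip a session-id parameter whose value may contain letters, digits
    // and the extra . - , characters mentioned in the comment above.
    private static final Pattern SESSION_ID =
        Pattern.compile("(?i)[;&?](?:jsessionid|phpsessid|sid)=[a-zA-Z0-9.,-]+");

    public static String normalize(String url) {
        // Remove the whole matched parameter, delimiter included.
        return SESSION_ID.matcher(url).replaceAll("");
    }
}
```

For example, `normalize("http://example.org/page?foo=1&sid=abc.1-2,3")` drops the trailing session parameter and keeps the rest of the URL intact.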
[jira] Commented: (NUTCH-48) "Did you mean" query enhancement/refignment feature request
[ http://issues.apache.org/jira/browse/NUTCH-48?page=comments#action_12415970 ]

Stefan Neufeind commented on NUTCH-48:
--------------------------------------

Could somebody please have a look? I currently lack a test-system to try that ...

> "Did you mean" query enhancement/refignment feature request
> -----------------------------------------------------------
>
> Key: NUTCH-48
> URL: http://issues.apache.org/jira/browse/NUTCH-48
> Project: Nutch
> Type: New Feature
> Components: web gui
> Environment: All platforms
> Reporter: byron miller
> Assignee: Sami Siren
> Priority: Minor
> Attachments: did-you-mean-combined08.patch, rss-spell.patch,
> spell-check.patch
>
> Looking to implement a "Did you mean" feature for query result pages that
> return <= x amount of results, to invoke a response that would recommend a
> fixed/related or spell-checked query to try.
> Note from Doug to users list:
> David Spencer has worked on this some.
> http://www.searchmorph.com/weblog/index.php?id=23
> I think the code on his site might be more recent than what's committed
> to the lucene/contrib directory.
[jira] Updated: (NUTCH-305) Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP
[ http://issues.apache.org/jira/browse/NUTCH-305?page=all ]

Stefan Neufeind updated NUTCH-305:
----------------------------------

    Attachment: suffix-urlfilter.txt

Find attached a suffix-urlfilter.txt that might be interesting to some people. More contributions are welcome at any time. Maybe we should ship such a list and use the suffix-filter instead of regex to filter by document-extension?

> Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP
> --------------------------------------------------------------
>
> Key: NUTCH-305
> URL: http://issues.apache.org/jira/browse/NUTCH-305
> Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: chris finne
> Attachments: suffix-urlfilter.txt
[jira] Commented: (NUTCH-294) Topic-maps of related searchwords
[ http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12414962 ]

Stefan Neufeind commented on NUTCH-294:
---------------------------------------

1) I enabled it in plugin.includes and restarted Tomcat - but there is no checkbox for me.
2) My "idea" was that maybe an index of top-keywords (from the "did you mean"-plugin possibly?) could be used, and a query could be run on it like "the current search appeared in NNN pages, where the top-10 keywords are ...". Wouldn't that work as a topic-map?

> Topic-maps of related searchwords
> ---------------------------------
>
> Key: NUTCH-294
> URL: http://issues.apache.org/jira/browse/NUTCH-294
> Project: Nutch
> Type: New Feature
> Components: searcher
> Reporter: Stefan Neufeind
>
> Would it be possible to offer a user "topic-maps"? It's when you search for
> something and get topic-related words that might also be of interest to you.
> I wonder if that's somehow possible with the ngram-index for "did you mean"
> (see separate feature-enhancement-bug for this), but we'd need to have a
> relation between words (in what context do they occur).
> For the webfrontend usually trees are used - which for some users offer
> quite impressive eye-candy :-) E.g. see this advertisement by Novell where
> I've just seen a similar "topic-map" as well:
> http://www.novell.com/de-de/company/advertising/defineyouropen.html
[jira] Commented: (NUTCH-294) Topic-maps of related searchwords
[ http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12414653 ]

Stefan Neufeind commented on NUTCH-294:
---------------------------------------

I'm not sure. On a quick run I wasn't able to get the "clustering-carrot2" plugin to work - though I thought I'd simply need to include it. Maybe somebody else has already worked with it and could comment on whether that plugin is within the scope of this feature-request. From what I found about carrot2, it's also used to cluster data from multiple search-engines - not sure how that relates to topic-clusters.

> Topic-maps of related searchwords
> ---------------------------------
>
> Key: NUTCH-294
> URL: http://issues.apache.org/jira/browse/NUTCH-294
> Project: Nutch
> Type: New Feature
> Components: searcher
> Reporter: Stefan Neufeind
>
> Would it be possible to offer a user "topic-maps"? It's when you search for
> something and get topic-related words that might also be of interest to you.
> I wonder if that's somehow possible with the ngram-index for "did you mean"
> (see separate feature-enhancement-bug for this), but we'd need to have a
> relation between words (in what context do they occur).
> For the webfrontend usually trees are used - which for some users offer
> quite impressive eye-candy :-) E.g. see this advertisement by Novell where
> I've just seen a similar "topic-map" as well:
> http://www.novell.com/de-de/company/advertising/defineyouropen.html
[jira] Commented: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host
[ http://issues.apache.org/jira/browse/NUTCH-298?page=comments#action_12414647 ]

Stefan Neufeind commented on NUTCH-298:
---------------------------------------

Is the description-line of this bug correct? I've been indexing pages without robots.txt, and I just checked that those hosts give a 404 since robots.txt does not exist.

> if a 404 for a robots.txt is returned no page is fetched at all from the host
> -----------------------------------------------------------------------------
>
> Key: NUTCH-298
> URL: http://issues.apache.org/jira/browse/NUTCH-298
> Project: Nutch
> Type: Bug
> Reporter: Stefan Groschupf
> Fix For: 0.8-dev
> Attachments: fixNpeRobotRuleSet.patch
>
> What happens:
> If no RobotRuleSet is in the cache for a host, we try to fetch the
> robots.txt.
> In case the http response code is not 200 or 403 but for example 404, we do
> "robotRules = EMPTY_RULES;" (line: 402)
> EMPTY_RULES is a RobotRuleSet created with the default constructor.
> tmpEntries and entries are null and will never change.
> If we now try to fetch a page from that host, the EMPTY_RULES is used
> and we call isAllowed on the RobotRuleSet.
> In this case an NPE is thrown in these lines:
> if (entries == null) {
>   entries = new RobotsEntry[tmpEntries.size()];
> Possible solution:
> We can initialize tmpEntries by default and also remove other null checks
> and initialisations.
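The fix proposed in the issue description can be sketched as follows. This is a simplified stand-in whose names follow the issue text rather than the actual Nutch source: tmpEntries is initialised eagerly, so a default-constructed (EMPTY_RULES-style) instance can no longer throw an NPE in isAllowed():

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified sketch of NUTCH-298's proposed fix, not real Nutch code. */
public class RobotRuleSet {
    // Initialised by default (was: null), per the suggested solution.
    private List<String> tmpEntries = new ArrayList<String>();
    private String[] entries;

    public boolean isAllowed(String path) {
        if (entries == null) {
            // Before the fix this line threw an NPE for EMPTY_RULES,
            // because tmpEntries was still null.
            entries = tmpEntries.toArray(new String[0]);
        }
        for (String disallowedPrefix : entries) {
            if (path.startsWith(disallowedPrefix)) return false;
        }
        return true; // no matching Disallow rule: fetching is allowed
    }
}
```

With empty rules, every path is now reported as allowed instead of killing the fetch for the whole host.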
[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414646 ]

Stefan Neufeind commented on NUTCH-258:
---------------------------------------

Agreed. The root-cause of the loop should be identified. So I'd suggest turning this into a won't-fix bug - and if it occurs again somewhere, we should try to track down the root cause.

> Once Nutch logs a SEVERE log item, Nutch fails forevermore
> ----------------------------------------------------------
>
> Key: NUTCH-258
> URL: http://issues.apache.org/jira/browse/NUTCH-258
> Project: Nutch
> Type: Bug
> Components: fetcher
> Versions: 0.8-dev
> Environment: All
> Reporter: Scott Ganyo
> Priority: Critical
> Attachments: dumbfix.patch
>
> Once a SEVERE log item is written, Nutch shuts down any fetching forevermore.
> This is from the run() method in Fetcher.java:
>
>   public void run() {
>     synchronized (Fetcher.this) {activeThreads++;} // count threads
>
>     try {
>       UTF8 key = new UTF8();
>       CrawlDatum datum = new CrawlDatum();
>
>       while (true) {
>         if (LogFormatter.hasLoggedSevere())     // something bad happened
>           break;                                // exit
>
> Notice the last 2 lines. This will prevent Nutch from ever fetching again
> once this is hit, as LogFormatter is storing this data as a static.
> (Also note that "LogFormatter.hasLoggedSevere()" is also checked in
> org.apache.nutch.net.URLFilterChecker and will disable this class as well.)
> This must be fixed or Nutch cannot be run as any kind of long-running
> service. Furthermore, I believe it is a poor decision to rely on a logging
> event to determine the state of the application - this could have any number
> of side-effects that would be extremely difficult to track down. (As it has
> already for me.)
[jira] Commented: (NUTCH-299) Bittorrent Parser
[ http://issues.apache.org/jira/browse/NUTCH-299?page=comments#action_12414643 ]

Stefan Neufeind commented on NUTCH-299:
---------------------------------------

Could you briefly explain what it does? Extract meta-data and index the comment as "content of that page"? Or does it also follow the URL to the tracker (maybe) to discover other torrents etc.?

> Bittorrent Parser
> -----------------
>
> Key: NUTCH-299
> URL: http://issues.apache.org/jira/browse/NUTCH-299
> Project: Nutch
> Type: New Feature
> Reporter: Hasan Diwan
> Priority: Minor
> Attachments: BitTorrent.jar
>
> BitTorrent information file parser
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414477 ]

Stefan Neufeind commented on NUTCH-290:
---------------------------------------

But to my understanding of the plugin, it still extracts as much as possible (meta-data) from the PDF. So if indexing is not allowed but this is a PDF, then returning empty text as the document-body should be fine - shouldn't it? Nothing except a PDF-plugin will be able to handle PDF correctly in this case. Stefan G., can you point out why I see binary data as the summary for a PDF, and whether there is a possible fix for it in the context of this current bug here?

> parse-pdf: Garbage indexed when text-extraction not allowed
> -----------------------------------------------------------
>
> Key: NUTCH-290
> URL: http://issues.apache.org/jira/browse/NUTCH-290
> Project: Nutch
> Type: Bug
> Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all
[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12414476 ]

Stefan Neufeind commented on NUTCH-275:
---------------------------------------

Maybe just XHTML is something special in this case? In general I guess mime-magic is a good idea. But could it be extended to differentiate xml and xhtml?

> Fetcher not parsing XHTML-pages at all
> --------------------------------------
>
> Key: NUTCH-275
> URL: http://issues.apache.org/jira/browse/NUTCH-275
> Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Environment: problem with nightly-2006-05-20; worked fine with same website
> on 0.7.2
> Reporter: Stefan Neufeind
>
> Server reports page as "text/html" - so I thought it would be processed as
> html.
> But something, I guess, evaluated the headers of the document and re-labeled it
> as "text/xml" (why not text/xhtml?).
> For some reason there is no plugin to be found for indexing text/xml (why
> does TextParser not feel responsible?).
> Links inside this document are NOT indexed at all - so digging into this
> website actually stops here.
> Funny thing: For some magical reason the dtd-files referenced in the header
> seem to be valid links for the fetcher and as such are indexed in the next
> round (if the urlfilter allows).
> 060521 025018 fetching http://www.secreturl.something/
> 060521 025018 http.proxy.host = null
> 060521 025018 http.proxy.port = 8080
> 060521 025018 http.timeout = 1
> 060521 025018 http.content.limit = 65536
> 060521 025018 http.agent = NutchCVS/0.8-dev (Nutch;
> http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
> 060521 025018 fetcher.server.delay = 1000
> 060521 025018 http.max.delays = 1000
> 060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser
> mapped to contentType text/xml via parse-plugins.xml, but
> its plugin.xml file does not claim to support contentType: text/xml
> 060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser
> mapped to contentType text/xml via parse-plugins.xml, but
> not enabled via plugin.includes in nutch-default.xml
> 060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060521 025019 map 0% reduce 0%
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
[jira] Commented: (NUTCH-291) OpenSearchServlet should return "date" as well as "lastModified"
[ http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414466 ]

Stefan Neufeind commented on NUTCH-291:
---------------------------------------

Which way is more favorable? To always set lastModified even though it was not returned from the webserver (maybe unclean), or to always return date as well (cleaner?)?

> OpenSearchServlet should return "date" as well as "lastModified"
> ----------------------------------------------------------------
>
> Key: NUTCH-291
> URL: http://issues.apache.org/jira/browse/NUTCH-291
> Project: Nutch
> Type: Improvement
> Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Attachments: NUTCH-291-unfinished.patch
>
> Currently lastModified is provided by OpenSearchServlet - but only in case
> the lastModified-date is known.
> Since you can sort by "date" (which is lastModified, or if not present the
> fetch-date), it might be useful if OpenSearchServlet could provide "date" as
> well.
[jira] Commented: (NUTCH-286) Handling common error-pages as 404
[ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414464 ]

Stefan Neufeind commented on NUTCH-286:
---------------------------------------

Well, we _could_ close it, though the question still remains for me. The problem imho is that you say it's hard to do. For sure you could always write searches to prune those pages from the index - but I wonder if that's a clean solution, or if it would be better to have a way of excluding certain pages (like these common error-pages whose headers are wrong). I guess it's the typical problem when crawling the web: technicians will say "that webserver/typo3 is wrong and has to be fixed" - but management will not care, and you will have to solve the problem in whatever way.

> Handling common error-pages as 404
> ----------------------------------
>
> Key: NUTCH-286
> URL: http://issues.apache.org/jira/browse/NUTCH-286
> Project: Nutch
> Type: Improvement
> Reporter: Stefan Neufeind
>
> Idea: Some pages from some software-packages/scripts report an "http 200 ok"
> even though a specific page could not be found. An example I just found is:
> http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
> That's a typo3-page explaining in its standard layout and wording: "The
> requested page did not exist or was inaccessible."
> So I had the idea that somebody might create a plugin that could find commonly
> used formulations for "page does not exist" etc. and turn the page into a 404
> before feeding it into the nutch-index - although the server responded
> with status 200 ok.
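The plugin idea floated in the issue could start out as simply as the following sketch. This is hypothetical code (no such plugin exists in Nutch per this thread), and the phrase list is a made-up example of the "commonly used formulations" the reporter mentions:

```java
/** Hypothetical sketch of the NUTCH-286 idea, not an existing Nutch plugin. */
public class SoftErrorDetector {
    // Example phrases that commonly mean "not found" even though the
    // server answered 200 OK; a real plugin would make these configurable.
    private static final String[] NOT_FOUND_PHRASES = {
        "the requested page did not exist",
        "page not found",
    };

    /** Return true if the page text looks like a disguised error-page. */
    public static boolean looksLikeSoft404(String pageText) {
        String lower = pageText.toLowerCase();
        for (String phrase : NOT_FOUND_PHRASES) {
            if (lower.contains(phrase)) return true;
        }
        return false;
    }
}
```

A fetch step could consult such a check and rewrite the status to 404 before the page reaches the index; the hard part, as the comment notes, is keeping the phrase list accurate enough to avoid false positives.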
[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)
[ http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414461 ]

Stefan Neufeind commented on NUTCH-282:
---------------------------------------

Sorry for not getting back to this. Actually it had to do with per-site dedup. I had a page-navigation built on the total number of pages, and the first page I saw was already the "last" result-page. When moving to page 2 I got no results; when moving to a later page, I got exceptions. For me it was fixed simply by using the pagination correctly :-) and applying the fix from NUTCH-288 to not fetch results when out of bounds.

> Showing too few results on a page (Paging not correct)
> ------------------------------------------------------
>
> Key: NUTCH-282
> URL: http://issues.apache.org/jira/browse/NUTCH-282
> Project: Nutch
> Type: Bug
> Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>
> I did a search and got back the value "itemsPerPage" from opensearch. But
> the output shows "results 1-8" and I have a total of 46 search-results.
> Same happens for the webinterface.
> Why aren't "enough" results fetched?
> The problem might be somewhere in the area where Nutch should only display
> a certain number of results per site.
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414458 ]

Stefan Neufeind commented on NUTCH-290:
---------------------------------------

But if one plugin fails in 0.8-dev, isn't the next one used? I understand that in the default config the text-parser would be used as the last-resort fallback. Also, I'm not sure where the summary-text comes from if I use the patch above to prevent throwing an exception and return empty parse-data instead.

> parse-pdf: Garbage indexed when text-extraction not allowed
> -----------------------------------------------------------
>
> Key: NUTCH-290
> URL: http://issues.apache.org/jira/browse/NUTCH-290
> Project: Nutch
> Type: Bug
> Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
[jira] Updated: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space
[ http://issues.apache.org/jira/browse/NUTCH-292?page=all ]

Stefan Neufeind updated NUTCH-292:
----------------------------------

    Attachment: NUTCH-292-summarizer08.diff

As per demand, here is the patch. Please note that it has not been thoroughly tested by myself. But the patch looks fine and makes sense :-) Oh, and it compiles cleanly ...

> OpenSearchServlet: OutOfMemoryError: Java heap space
> ----------------------------------------------------
>
> Key: NUTCH-292
> URL: http://issues.apache.org/jira/browse/NUTCH-292
> Project: Nutch
> Type: Bug
> Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Priority: Critical
> Attachments: NUTCH-292-summarizer08.diff, summarizer.diff
>
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>   org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
>   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
>   org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> The URL I use is:
> [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
> It seems to be a problem specific to the data I'm working with. Moving the
> start from 0 to 10, or changing the query, works fine.
> Or maybe it doesn't have to do with sorting, but it's just that I hit one "bad
> search-result" that has a broken summary?
> !! The problem is repeatable. So if anybody has an idea where to search /
> what to fix, I can easily try that out !!
[jira] Created: (NUTCH-294) Topic-maps of related searchwords
Topic-maps of related searchwords
---------------------------------

        Key: NUTCH-294
        URL: http://issues.apache.org/jira/browse/NUTCH-294
    Project: Nutch
       Type: New Feature
 Components: searcher
   Reporter: Stefan Neufeind

Would it be possible to offer a user "topic-maps"? It's when you search for
something and get topic-related words that might also be of interest to you.
I wonder if that's somehow possible with the ngram-index for "did you mean"
(see separate feature-enhancement-bug for this), but we'd need to have a
relation between words (in what context do they occur).
For the webfrontend usually trees are used - which for some users offer
quite impressive eye-candy :-) E.g. see this advertisement by Novell where
I've just seen a similar "topic-map" as well:
http://www.novell.com/de-de/company/advertising/defineyouropen.html
[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413778 ] Stefan Neufeind commented on NUTCH-292: --- That patch is for the 0.7-branch, right? In 0.8-dev you'd want to do that in BasicSummarizer.java. But to me it looks like something similar is already in place: // Iterate through as long as we're before the end of // the document and we haven't hit the max-number-of-items // -in-a-summary. // while ((j < endToken) && (j - startToken < sumLength)) { But I also suspect it might have something to do with tokens. What I experienced is that several search-results currently contain arbitrary binary data. Those are the cases where a parser-plugin has "failed" and where parse-text was used as a fallback. If I'm right this might lead to actually quite large tokens because no whitespace is found in a row of characters. @Marcel: Thank you for the fix anyway ... you help is very much appreciated. > OpenSearchServlet: OutOfMemoryError: Java heap space > > > Key: NUTCH-292 > URL: http://issues.apache.org/jira/browse/NUTCH-292 > Project: Nutch > Type: Bug > Components: web gui > Versions: 0.8-dev > Reporter: Stefan Neufeind > Priority: Critical > Attachments: summarizer.diff > > java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space > > org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203) > org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329) > > org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155) > javax.servlet.http.HttpServlet.service(HttpServlet.java:689) > javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > The URL I use is: > [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url > It seems to be a problem specific to the date I'm working with. Moving the > start from 0 to 10 or changing the query works fine. 
> Or maybe it doesn't have to do with sorting but it's just that I hit one "bad > search-result" that has a broken summary? > !! The problem is repeatable. So if anybody has an idea where to search / > what to fix, I can easily try that out !! -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
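The suspicion above is that binary fallback text with no whitespace produces enormous "tokens" that blow up the summarizer's memory use. As a hedged illustration (this is not Nutch code; class and method names are made up), a defensive pre-pass could split any whitespace-free run longer than a bound before summarizing:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical defensive pre-pass: split whitespace-free runs longer
// than maxToken so a single binary-garbage "token" stays bounded.
public class TokenClamp {
    public static List<String> clamp(String text, int maxToken) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) continue;
            // Cut an overlong run into maxToken-sized pieces.
            for (int i = 0; i < raw.length(); i += maxToken) {
                out.add(raw.substring(i, Math.min(raw.length(), i + maxToken)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(clamp("short abcdefghij", 4));
    }
}
```

Whether BasicSummarizer's existing `sumLength` bound already prevents this depends on how its tokenizer treats such runs; the sketch only shows the shape of the fix being discussed.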
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413780 ] Stefan Neufeind commented on NUTCH-290: --- The plugin itself imho works fine now. Does not throw an exception anymore and if allowed outputs text correctly. However I still get the "garbage-output" from a PDF. Could that be due to the fact that in case no extraction is allowed (empty parsing-text returned) the parser will still fallback to using the raw text to index? What I did was deleting crawl_parse and parse_* from the segments-directory, running "nutch parse" and reindexing everything. However the raw chars in the search-output (summary) remain. :-(( > parse-pdf: Garbage indexed when text-extraction not allowed > --- > > Key: NUTCH-290 > URL: http://issues.apache.org/jira/browse/NUTCH-290 > Project: Nutch > Type: Bug > Components: indexer > Versions: 0.8-dev > Reporter: Stefan Neufeind > Attachments: NUTCH-290-canExtractContent.patch > > It seems that garbage (or undecoded text?) is indexed when text-extraction > for a PDF is not allowed. > Example-PDF: > http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=all ] Stefan Neufeind updated NUTCH-290: -- Summary: parse-pdf: Garbage indexed when text-extraction not allowed (was: parse-pdf: Garbage (?) indexed when text-extraction now allowed) > parse-pdf: Garbage indexed when text-extraction not allowed > --- > > Key: NUTCH-290 > URL: http://issues.apache.org/jira/browse/NUTCH-290 > Project: Nutch > Type: Bug > Components: indexer > Versions: 0.8-dev > Reporter: Stefan Neufeind > Attachments: NUTCH-290-canExtractContent.patch > > It seems that garbage (or undecoded text?) is indexed when text-extraction > for a PDF is not allowed. > Example-PDF: > http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=all ] Stefan Neufeind updated NUTCH-290: -- Attachment: NUTCH-290-canExtractContent.patch This patch adds a check to first see if text-extraction is allowed - and only in that case tries to extract text (prevents the above-mentioned exception and a parse-fail). Note: The line

((PDStandardEncryption) encDict).setCanExtractContent(true);

is imho open for discussion. It only sets a bit on "encrypted" documents. Since I've read in several places that many people seem to set this to "false" for no good reason, I believe we don't really "break encryption" with this line - and as such should try to index as much data as possible. Does anybody have "problems" with this line? If yes, maybe it could be a config-option that's false by default? > parse-pdf: Garbage (?) indexed when text-extraction now allowed > --- > > Key: NUTCH-290 > URL: http://issues.apache.org/jira/browse/NUTCH-290 > Project: Nutch > Type: Bug > Components: indexer > Versions: 0.8-dev > Reporter: Stefan Neufeind > Attachments: NUTCH-290-canExtractContent.patch > > It seems that garbage (or undecoded text?) is indexed when text-extraction > for a PDF is not allowed. > Example-PDF: > http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
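The control flow the patch describes can be sketched without the PDFBox dependency (the permission flag stands in for PDFBox's extraction-permission check; `PdfParseGuard` and its signature are invented for illustration): only attempt extraction when permitted, and otherwise return an empty parse text so raw bytes never reach the index.

```java
// Hedged sketch of the patch's control flow, with PDFBox replaced by a
// plain boolean so the logic is self-contained: only attempt text
// extraction when the document's permissions allow it; otherwise return
// an empty parse text instead of indexing raw bytes.
public class PdfParseGuard {
    public static String parse(byte[] content, boolean canExtractContent) {
        if (!canExtractContent) {
            return "";               // empty parse text: nothing gets indexed
        }
        return new String(content);  // stand-in for real text extraction
    }

    public static void main(String[] args) {
        System.out.println(parse("hello".getBytes(), true));
        System.out.println(parse("hello".getBytes(), false).isEmpty());
    }
}
```

The config-option question above maps naturally onto the boolean: a default of `false` would mean protected documents always take the empty-parse branch.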
[jira] Commented: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413637 ] Stefan Neufeind commented on NUTCH-290: --- this one here fires in the PDF-parser: } catch (Exception e) { // run time exception LOG.warning("General exception in PDF parser: "+e.getMessage()); e.printStackTrace(); return new ParseStatus(ParseStatus.FAILED, "Can't be handled as pdf document. " + e).getEmptyParse(getConf()); } The exception is: 060522 001010 General exception in PDF parser: You do not have permission to extract text java.io.IOException: You do not have permission to extract text at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:189) at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140) at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:120) at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77) at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:257) at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:143) Could it be that, maybe as a fallback, in case the document can't be parsed and no "description" is returned that in search-output the document itself is used as "description"? If yes: In case of binary files this seems to lead to problems. > parse-pdf: Garbage (?) indexed when text-extraction now allowed > --- > > Key: NUTCH-290 > URL: http://issues.apache.org/jira/browse/NUTCH-290 > Project: Nutch > Type: Bug > Components: indexer > Versions: 0.8-dev > Reporter: Stefan Neufeind > > It seems that garbage (or undecoded text?) is indexed when text-extraction > for a PDF is not allowed. > Example-PDF: > http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space
OpenSearchServlet: OutOfMemoryError: Java heap space Key: NUTCH-292 URL: http://issues.apache.org/jira/browse/NUTCH-292 Project: Nutch Type: Bug Components: web gui Versions: 0.8-dev Reporter: Stefan Neufeind Priority: Critical

java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

The URL I use is: [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url It seems to be a problem specific to the data I'm working with. Moving the start from 0 to 10 or changing the query works fine. Or maybe it doesn't have to do with sorting but it's just that I hit one "bad search-result" that has a broken summary? !! The problem is repeatable. So if anybody has an idea where to search / what to fix, I can easily try that out !! -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-291) OpenSearchServlet should return "date" as well as "lastModified"
[ http://issues.apache.org/jira/browse/NUTCH-291?page=all ] Stefan Neufeind updated NUTCH-291: -- Attachment: NUTCH-291-unfinished.patch I tried implementing this in OpenSearchServlet.java (see patch). The idea for this patch is based on more.jsp. However I receive:

java.lang.NumberFormatException: null
java.lang.Long.parseLong(Long.java:372)
java.lang.Long.<init>(Long.java:671)
org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:230)
javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

I guess that has to do with the date not being present here? I've tried hunting down the "problem" and it seems that in java/org/apache/nutch/searcher/IndexSearcher.java the field also needs to be provided. But I assume that the Lucene-engine here correctly provides the date-field. Maybe somebody could fix up my patch and then commit it as well. I guess always knowing the date from the RSS-feed might be good. > OpenSearchServlet should return "date" as well as "lastModified" > > > Key: NUTCH-291 > URL: http://issues.apache.org/jira/browse/NUTCH-291 > Project: Nutch > Type: Improvement > Components: web gui > Versions: 0.8-dev > Reporter: Stefan Neufeind > Attachments: NUTCH-291-unfinished.patch > > Currently lastModified is provided by OpenSearchServlet - but only in case > the lastModified-date is known. > Since you can sort by "date" (which is lastModified or, if not present, the > fetchdate), it might be useful if OpenSearchServlet could provide "date" as > well. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
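A `NumberFormatException: null` from `Long.parseLong` means the string being parsed was `null`, which fits the guess that the date field is simply absent for some documents. A minimal sketch of the missing guard (field and method names here are illustrative, not the actual patch):

```java
// Guard against a missing stored field before parsing it as a long,
// which is what the NumberFormatException above suggests is needed.
public class DateField {
    public static long parseDateField(String value, long fallback) {
        if (value == null || value.isEmpty()) {
            return fallback;       // e.g. fall back to the fetch date
        }
        return Long.parseLong(value);
    }

    public static void main(String[] args) {
        System.out.println(parseDateField("1148575200000", 0L));
        System.out.println(parseDateField(null, 0L));
    }
}
```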
[jira] Created: (NUTCH-291) OpenSearchServlet should return "date" as well as "lastModified"
OpenSearchServlet should return "date" as well as "lastModified" Key: NUTCH-291 URL: http://issues.apache.org/jira/browse/NUTCH-291 Project: Nutch Type: Improvement Components: web gui Versions: 0.8-dev Reporter: Stefan Neufeind Currently lastModified is provided by OpenSearchServlet - but only in case the lastModified-date is known. Since you can sort by "date" (which is lastModified or, if not present, the fetchdate), it might be useful if OpenSearchServlet could provide "date" as well. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed
parse-pdf: Garbage (?) indexed when text-extraction now allowed --- Key: NUTCH-290 URL: http://issues.apache.org/jira/browse/NUTCH-290 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: Stefan Neufeind It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed. Example-PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation
[ http://issues.apache.org/jira/browse/NUTCH-288?page=all ] Stefan Neufeind updated NUTCH-288: -- Attachment: NUTCH-288-OpenSearch-fix.patch This patch includes Doug's one-line fix to prevent an exception. Also it does go back page by page until you get to the last result-page. The start-value returned in the RSS-feed is correct afterwards(!). This easily allows you to check whether the requested result-start and the one received are identical - otherwise you are on the last page and were "redirected" - and now know that you don't need to display any pages in your page-navigation following this one :-) Applies and works fine for me. > hitsPerSite-functionality "flawed": problems writing a page-navigation > -- > > Key: NUTCH-288 > URL: http://issues.apache.org/jira/browse/NUTCH-288 > Project: Nutch > Type: Bug > Components: web gui > Versions: 0.8-dev > Reporter: Stefan Neufeind > Attachments: NUTCH-288-OpenSearch-fix.patch > > The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads > to problems when trying to offer a page-navigation (e.g. allow the user to > jump to page 10). This is because dedup is done after fetching. > RSS shows a maximum number of 7763 documents (that is without dedup!), I set > it to display 10 items per page. My "naive" approach was to estimate I have > 7763/10 = 777 pages. But already when moving to page 3 I got no more > searchresults (I guess because of dedup). And when moving to page 10 I got > an exception (see below). 
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for > servlet OpenSearch threw exception > java.lang.NegativeArraySizeException > at org.apache.nutch.searcher.Hits.getHits(Hits.java:65) > at > org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:689) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) > at > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214) > at > org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) > at > org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) > at > org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198) > at > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152) > at > org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) > at > org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) > at > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137) > at > org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) > at > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118) > at > org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102) > at > org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) > at > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) > at > org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) > at > org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) > at > 
org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929) > at > org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160) > at > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799) > at > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705) > at > org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577) > at > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) > at java.lang.Thread.run(Thread.java:595) > Only workaround I see for the moment: Fetching RSS without duplication, dedup > myself and cache the RSS-result to improve performance. But a cleaner > solution would imho be nice. Is there a performant way of doing deduplication > and knowing for sure how many documents are available to view? For sure this > would mean to dedup all search-results first ... -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
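The back-off behaviour the patch comment describes - stepping back page by page until the start value the feed returns matches the one requested - can be sketched client-side. This is a hypothetical model, not the patch itself: `serverStart` stands in for the servlet's clamping of the start parameter against the deduplicated hit count.

```java
// Hypothetical client-side sketch of the page-by-page back-off: walk the
// requested start back one page at a time until the returned start equals
// the requested one; that start is the last real page, and any navigation
// entries past it can be dropped.
public class PageBackoff {
    // Stand-in for the servlet: clamps start to the beginning of the
    // last page that actually exists among 'available' deduped hits.
    static int serverStart(int requestedStart, int pageSize, int available) {
        int lastPageStart = Math.max(0, ((available - 1) / pageSize) * pageSize);
        return Math.min(requestedStart, lastPageStart);
    }

    public static int lastReachableStart(int requestedStart, int pageSize, int available) {
        int start = requestedStart;
        while (start > 0 && serverStart(start, pageSize, available) != start) {
            start -= pageSize;     // walk back one page and re-check
        }
        return start;
    }

    public static void main(String[] args) {
        // 23 deduped hits at 10 per page: a request for start=90 walks
        // back to the last real page, which begins at 20.
        System.out.println(lastReachableStart(90, 10, 23));
    }
}
```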
[jira] Commented: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413275 ] Stefan Neufeind commented on NUTCH-288: --- How do they do that? Right, I'm transferred to page 16. But if I click on page 14, that also seems to be the last page in order? Something looks strange there, too ... And using Nutch: How should I know (using the RSS-feed) on which page I am? I'm getting the above exception - no reply, and no new "start"-value from which I could compute which page I'm actually on. Is there a quickfix possible somehow? > hitsPerSite-functionality "flawed": problems writing a page-navigation > -- > > Key: NUTCH-288 > URL: http://issues.apache.org/jira/browse/NUTCH-288 > Project: Nutch > Type: Bug > Components: web gui > Versions: 0.8-dev > Reporter: Stefan Neufeind > > The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads > to problems when trying to offer a page-navigation (e.g. allow the user to > jump to page 10). This is because dedup is done after fetching. > RSS shows a maximum number of 7763 documents (that is without dedup!), I set > it to display 10 items per page. My "naive" approach was to estimate I have > 7763/10 = 777 pages. But already when moving to page 3 I got no more > searchresults (I guess because of dedup). And when moving to page 10 I got > an exception (see below). 
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for > servlet OpenSearch threw exception > java.lang.NegativeArraySizeException > at org.apache.nutch.searcher.Hits.getHits(Hits.java:65) > at > org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:689) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) > at > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214) > at > org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) > at > org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) > at > org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198) > at > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152) > at > org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) > at > org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) > at > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137) > at > org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) > at > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118) > at > org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102) > at > org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) > at > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) > at > org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) > at > org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) > at > 
org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929) > at > org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160) > at > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799) > at > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705) > at > org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577) > at > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) > at java.lang.Thread.run(Thread.java:595) > Only workaround I see for the moment: Fetching RSS without duplication, dedup > myself and cache the RSS-result to improve performance. But a cleaner > solution would imho be nice. Is there a performant way of doing deduplication > and knowing for sure how many documents are available to view? For sure this > would mean to dedup all search-results first ... -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] Stefan Neufeind updated NUTCH-110: -- Attachment: fixIllegalXmlChars08.patch Since the original patch didn't cleanly apply for me on 0.8-dev (nightly-2006-05-20) I re-did it for 0.8 ... With this patch the XML is fine. Without it I had big trouble parsing the RSS-feed in another application. > OpenSearchServlet outputs illegal xml characters > > > Key: NUTCH-110 > URL: http://issues.apache.org/jira/browse/NUTCH-110 > Project: Nutch > Type: Bug > Components: searcher > Versions: 0.7 > Environment: linux, jdk 1.5 > Reporter: [EMAIL PROTECTED] > Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, > fixIllegalXmlChars08.patch > > OpenSearchServlet does not check text-to-output for illegal xml characters; > dependent on search result, it's possible for OSS to output xml that is not > well-formed. For example, if the text has the FF character in it - > i.e. the ascii character at position (decimal) 12 - the produced XML will > show the FF character as '&#12;'. The character/entity '&#12;' is not legal in > XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
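The kind of filtering such a patch performs (this sketch is not the patch itself) follows from the XML 1.0 Char production: a character is legal only if it is #x9, #xA, #xD, or falls in [#x20-#xD7FF] or [#xE000-#xFFFD]. Dropping everything else before writing the feed keeps a stray form-feed (0x0C) from breaking downstream parsers.

```java
// Drop characters outside the XML 1.0 Char production before they are
// written into the feed. Illustrative only; the actual patch may differ.
public class XmlCharFilter {
    public static boolean isLegalXmlChar(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    public static String stripIllegal(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isLegalXmlChar(c)) sb.append(c);   // keep only legal chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripIllegal("ok\u000Cstill ok"));
    }
}
```

One caveat: this per-`char` check also discards surrogate pairs, i.e. characters above U+FFFF, which are legal XML; a production filter would have to treat valid pairs specially.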
[jira] Created: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation
hitsPerSite-functionality "flawed": problems writing a page-navigation -- Key: NUTCH-288 URL: http://issues.apache.org/jira/browse/NUTCH-288 Project: Nutch Type: Bug Components: web gui Versions: 0.8-dev Reporter: Stefan Neufeind The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to problems when trying to offer a page-navigation (e.g. allow the user to jump to page 10). This is because dedup is done after fetching. RSS shows a maximum number of 7763 documents (that is without dedup!), I set it to display 10 items per page. My "naive" approach was to estimate I have 7763/10 = 777 pages. But already when moving to page 3 I got no more searchresults (I guess because of dedup). And when moving to page 10 I got an exception (see below). 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception java.lang.NegativeArraySizeException at org.apache.nutch.searcher.Hits.getHits(Hits.java:65) at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149) at javax.servlet.http.HttpServlet.service(HttpServlet.java:689) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929) at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705) at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) at java.lang.Thread.run(Thread.java:595) Only workaround I see for the moment: Fetching RSS without duplication, dedup myself and cache the RSS-result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean to dedup all search-results first ... -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-287) Exception when searching with sort
Exception when searching with sort -- Key: NUTCH-287 URL: http://issues.apache.org/jira/browse/NUTCH-287 Project: Nutch Type: Bug Components: searcher Versions: 0.8-dev Reporter: Stefan Neufeind Priority: Critical Running a search with &sort=url works. But when using &sort=title I get the following exception.

2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception
java.lang.RuntimeException: Unknown sort value type!
at org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:595)

What is in those lines is:

WritableComparable sortValue; // convert value to writable
if (sortField == null) {
  sortValue = new FloatWritable(scoreDocs[i].score);
} else {
  Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
  if (raw instanceof Integer) {
    sortValue = new IntWritable(((Integer)raw).intValue());
  } else if (raw instanceof Float) {
    sortValue = new FloatWritable(((Float)raw).floatValue());
  } else if (raw instanceof String) {
    sortValue = new UTF8((String)raw);
  } else {
    throw new RuntimeException("Unknown sort value type!");
  }
}

So I thought that maybe raw is an instance of something "strange" and tried raw.getClass().getName() or also raw.toString() to track the cause down - but that always resulted in a NullPointerException. So it seems raw is null for some strange reason. When I try with "title2" (or something non-existing) I get a different error that title2 is unknown / not indexed. So I suspect that title should be fine here ... If there is any information I can help out with, let me know. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
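The null hypothesis fits the quoted dispatch: `raw == null` matches no `instanceof` branch (every `instanceof` test is false for null), so the code falls straight through to the opaque "Unknown sort value type!" error. A minimal sketch of the dispatch with the missing null case made explicit (the `describe` helper is invented for illustration; the real fix would pick a sensible default sort value):

```java
// Mirror of the quoted type dispatch with a null guard added: a document
// that stores no value for the sort field yields raw == null, which the
// original code can only report as "Unknown sort value type!".
public class SortValue {
    public static String describe(Object raw) {
        if (raw == null)            return "missing";   // the unhandled case
        if (raw instanceof Integer) return "int:"    + raw;
        if (raw instanceof Float)   return "float:"  + raw;
        if (raw instanceof String)  return "string:" + raw;
        throw new RuntimeException("Unknown sort value type: " + raw.getClass());
    }

    public static void main(String[] args) {
        System.out.println(describe(null));
        System.out.println(describe("title"));
    }
}
```

That would also explain why `raw.getClass().getName()` itself threw a NullPointerException while a non-existent field name produced a different, earlier error.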
[jira] Commented: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12413240 ] Stefan Neufeind commented on NUTCH-284: --- Yes, I was missing index-basic. My apologies. I needed the extra fields of index-more and thought it would do the basic fields as well. The same thing occurred in NUTCH-51. Would it be possible to maybe demand that index-basic is loaded (same as "well, you need a scoring-plugin" etc.)? What if somebody writes his own index-basic2-plugin - then he'd have to be able to put a "provides index-basic" into his plugin to notify that he indexes the basic fields or so. Maybe something like this could avoid trouble / searching for some people like me :-) > NullPointerException during index > - > > Key: NUTCH-284 > URL: http://issues.apache.org/jira/browse/NUTCH-284 > Project: Nutch > Type: Bug > Components: indexer > Versions: 0.8-dev > Reporter: Stefan Neufeind > > For quite a while this "reduce > sort" has been going on. Then it fails. What > could be wrong with this? > 060524 212613 reduce > sort > 060524 212614 reduce > sort > 060524 212615 reduce > sort > 060524 212615 found resource common-terms.utf8 at > file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8 > 060524 212615 found resource common-terms.utf8 at > file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8 > 060524 212619 Optimizing index. > 060524 212619 job_jlbhhm > java.lang.NullPointerException > at > org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111) > at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269) > at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114) > Exception in thread "main" java.io.IOException: Job failed! 
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341) > at org.apache.nutch.indexer.Indexer.index(Indexer.java:287) > at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
[jira] Created: (NUTCH-286) Handling common error-pages as 404
Handling common error-pages as 404 -- Key: NUTCH-286 URL: http://issues.apache.org/jira/browse/NUTCH-286 Project: Nutch Type: Improvement Reporter: Stefan Neufeind Idea: Some pages from some software-packages/scripts report an "HTTP 200 OK" even though a specific page could not be found. An example I just found is: http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef That's a Typo3 page explaining in its standard layout and wording: "The requested page did not exist or was inaccessible." So I had the idea that somebody might create a plugin that could find commonly used formulations for "page does not exist" etc. and turn the page into a 404 before feeding it into the nutch-index - although the server responded with status 200 OK.
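The core of the proposed plugin could be sketched roughly like this - a minimal illustration only, not Nutch code; the class name and phrase list are made up for the example:

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed check: scan the extracted page text
// for common "page not found" wordings and flag the page as a soft 404,
// even though the server answered with HTTP 200.
public class SoftErrorPageDetector {

    private static final List<Pattern> ERROR_PHRASES = List.of(
            Pattern.compile("requested page (did|does) not exist", Pattern.CASE_INSENSITIVE),
            Pattern.compile("page (could not be|was not) found", Pattern.CASE_INSENSITIVE),
            Pattern.compile("seite (wurde )?nicht gefunden", Pattern.CASE_INSENSITIVE));

    /** True if the page text looks like an error page served with status 200. */
    public static boolean looksLikeSoft404(String pageText) {
        for (Pattern p : ERROR_PHRASES) {
            if (p.matcher(pageText).find()) {
                return true;
            }
        }
        return false;
    }
}
```

A real plugin would need a much larger, configurable phrase list (per language and per CMS), since false positives would wrongly drop valid pages from the index.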
[jira] Created: (NUTCH-284) NullPointerException during index
NullPointerException during index - Key: NUTCH-284 URL: http://issues.apache.org/jira/browse/NUTCH-284 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: Stefan Neufeind For quite a while this "reduce > sort" has been going on. Then it fails. What could be wrong with this? 060524 212613 reduce > sort 060524 212614 reduce > sort 060524 212615 reduce > sort 060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8 060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8 060524 212619 Optimizing index. 060524 212619 job_jlbhhm java.lang.NullPointerException at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269) at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114) Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341) at org.apache.nutch.indexer.Indexer.index(Indexer.java:287) at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
[jira] Commented: (NUTCH-70) duplicate pages - virtual hosts in db.
[ http://issues.apache.org/jira/browse/NUTCH-70?page=comments#action_12413169 ] Stefan Neufeind commented on NUTCH-70: -- Is the content exactly the same? Maybe the page could be checked against an already existing one by an MD5 on the content? But I'm not sure there is a clean way to work around the problem - what if all pages are the same except one, on the other vhost? You would have to crawl all of them anyway, wouldn't you? > duplicate pages - virtual hosts in db. > -- > > Key: NUTCH-70 > URL: http://issues.apache.org/jira/browse/NUTCH-70 > Project: Nutch > Type: Bug > Environment: 0.7 dev > Reporter: YourSoft > > Dear Developers, > I have a problem with nutch: > - There are many site duplicates in the webdb and in the segments. > The source of this problem is: > - If the site uses 'virtual hosts' (like Apache), e.g. www.origo.hu, > origo.hu, origo.matav.hu, origo.matavnet.hu etc.: the result pages are the > same, only the inlinks are different. > - The ip address is the same. > - When searching, all virtual hosts are in the results. > Google only shows one of these virtual hosts; nutch shows all. The resulting > nutch db is larger, and in this case slower, than google's. > Any idea how to remove these duplicates? > Regards, > Ferenc
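The MD5-on-content idea mentioned in the comment could look roughly like this. This is an illustrative sketch only (class and method names are made up); Nutch 0.8 in fact has a pluggable signature mechanism (org.apache.nutch.crawl.MD5Signature, visible in other logs in this thread) serving a similar purpose:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative content signature: pages whose bodies produce the same MD5
// digest are candidates for deduplication, regardless of which virtual host
// they were fetched from.
public class ContentSignature {

    /** MD5 digest of the page body, as a lowercase hex string. */
    public static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant JVM ships MD5, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Two vhosts serving byte-identical bodies map to the same signature and can be collapsed at dedup time; as the comment notes, pages differing in even one byte still hash differently, so this only catches exact duplicates.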
[jira] Commented: (NUTCH-44) too many search results
[ http://issues.apache.org/jira/browse/NUTCH-44?page=comments#action_12413155 ] Stefan Neufeind commented on NUTCH-44: -- hi, any progress on this? > too many search results > --- > > Key: NUTCH-44 > URL: http://issues.apache.org/jira/browse/NUTCH-44 > Project: Nutch > Type: Bug > Components: web gui > Environment: web environment > Reporter: Emilijan Mirceski > > There should be a limitation (user defined) on the number of results the > search engine can return. > For example, if one modifies the search url as: > http:///search.jsp?query=&hitsPerPage=2&hitsPerSite=0 > The search will try to return 20,000 pages which isn't good for the server > side performance. > Is it possible to have a setting in the config xml files to control this? > Thanks, > Emilijan
[jira] Created: (NUTCH-282) Showing too few results on a page (Paging not correct)
Showing too few results on a page (Paging not correct) -- Key: NUTCH-282 URL: http://issues.apache.org/jira/browse/NUTCH-282 Project: Nutch Type: Bug Components: web gui Versions: 0.8-dev Reporter: Stefan Neufeind I did a search and got back the value "itemsPerPage" from opensearch. But the output shows "results 1-8" and I have a total of 46 search results. The same happens for the web interface. Why aren't "enough" results fetched? The problem might be somewhere in the area where Nutch should only display a certain number of results per site.
[jira] Updated: (NUTCH-281) cached.jsp: base-href needs to be outside comments
[ http://issues.apache.org/jira/browse/NUTCH-281?page=all ] Stefan Neufeind updated NUTCH-281: -- Component: web gui Priority: Trivial (was: Major) > cached.jsp: base-href needs to be outside comments > -- > > Key: NUTCH-281 > URL: http://issues.apache.org/jira/browse/NUTCH-281 > Project: Nutch > Type: Bug > Components: web gui > Reporter: Stefan Neufeind > Priority: Trivial > > see cached.jsp > > does not take effect when showing a cached page because of the comments > around it
[jira] Created: (NUTCH-281) cached.jsp: base-href needs to be outside comments
cached.jsp: base-href needs to be outside comments -- Key: NUTCH-281 URL: http://issues.apache.org/jira/browse/NUTCH-281 Project: Nutch Type: Bug Components: web gui Reporter: Stefan Neufeind see cached.jsp does not take effect when showing a cached page because of the comments around it
[jira] Commented: (NUTCH-255) Regular Expression for RegexUrlNormalizer to remove jsessionid
[ http://issues.apache.org/jira/browse/NUTCH-255?page=comments#action_12412777 ] Stefan Neufeind commented on NUTCH-255: --- You might want to have a / right after the .com in the example - but that's not too important here :-) You can also omit the (.*) at the beginning/end of the expression, as it's not needed for this task. NUTCH-279 includes a modified version of your patch. PS: Thanks for the contribution. > Regular Expression for RegexUrlNormalizer to remove jsessionid > -- > > Key: NUTCH-255 > URL: http://issues.apache.org/jira/browse/NUTCH-255 > Project: Nutch > Type: Improvement > Components: fetcher > Versions: 0.8-dev > Environment: Windows XP Media Center 2005, 2 Gigs RAM, 3.0 Ghz Pentium 4 > Hyperthreaded, Eclipse 3.2.0 > Reporter: Dennis Kubes > Priority: Trivial > Attachments: urlnormalize_jessionid.patch > > Some URLs are filtered out by the crawl url filter for special characters (by > default). One of these is the jsessionid urls such as: > http://www.somesite.com;jsessionid=A8D7D812B5EFD3099F099A760F779E3B?query=string > We want to get rid of the jsessionid and keep everything else so that it looks > like this: > http://www.somesite.com?query=string > Below is a regular expression for the regex-normalize.xml file used by the > RegexUrlNormalizer that successfully removes jsessionid strings while leaving > the hostname and querystring. I have also attached a patch for the > regex-normalize.xml.template file that adds the following expression. > > (.*)(;jsessionid=[a-zA-Z0-9]{32})(.*) > $1$3 >
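The substitution above can be checked directly with java.util.regex - a stand-alone demonstration of the rule, not the RegexUrlNormalizer code itself (the class name here is made up):

```java
import java.util.regex.Pattern;

// Stand-alone demonstration of the regex-normalize rule discussed above:
// pattern (.*)(;jsessionid=[a-zA-Z0-9]{32})(.*) with substitution $1$3.
public class JsessionidStrip {

    private static final Pattern JSESSIONID =
            Pattern.compile("(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)");

    /** Removes a 32-character jsessionid token, keeping host and query string. */
    public static String normalize(String url) {
        // URLs without a jsessionid token do not match and pass through unchanged.
        return JSESSIONID.matcher(url).replaceAll("$1$3");
    }
}
```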
[jira] Updated: (NUTCH-279) Additions for regex-normalize
[ http://issues.apache.org/jira/browse/NUTCH-279?page=all ] Stefan Neufeind updated NUTCH-279: -- Attachment: regex-normalize.patch 1) Incorporates jsessionid-normalization from NUTCH-255 2) Adds further normalizations 3) Adds a commandline-checker. Start with: bin/nutch org.apache.nutch.net.RegexUrlNormalizerChecker > Additions for regex-normalize > - > > Key: NUTCH-279 > URL: http://issues.apache.org/jira/browse/NUTCH-279 > Project: Nutch > Type: Improvement > Versions: 0.8-dev > Reporter: Stefan Neufeind > Attachments: regex-normalize.patch > > Imho needed: > 1) Extend normalize-rules to commonly used session-id's etc. > 2) Ship a checker to check rules easily by hand
[jira] Created: (NUTCH-279) Additions for regex-normalize
Additions for regex-normalize - Key: NUTCH-279 URL: http://issues.apache.org/jira/browse/NUTCH-279 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Stefan Neufeind Imho needed: 1) Extend normalize-rules to commonly used session-id's etc. 2) Ship a checker to check rules easily by hand
[jira] Created: (NUTCH-278) Fetcher-status might need clarification: kbit/s instead of kb/s shown
Fetcher-status might need clarification: kbit/s instead of kb/s shown - Key: NUTCH-278 URL: http://issues.apache.org/jira/browse/NUTCH-278 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev Reporter: Stefan Neufeind Priority: Trivial In Fetcher.java, method reportStatus() there is + Math.round(((((float)bytes)*8)/1024)/elapsed)+" kb/s, "; Is that a bit misleading, since a user reading the status might guess it means "kilobytes" (kb), whereas "kbit/s" would be clearer in this case?
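For reference, the conversion in that status line is bytes → bits (×8) → kilobits (÷1024) → per second (÷elapsed), so the reported number really is kilobits per second. A small sketch of the arithmetic (not the actual Fetcher code; the class name is made up):

```java
// Sketch of the rate computation from the fetcher status line:
// bytes * 8 gives bits, / 1024 gives kilobits, / elapsed gives kbit/s.
public class FetchRate {

    /** Rate in kilobits per second, rounded like the status line does. */
    public static long kbitPerSec(long bytes, float elapsedSeconds) {
        return Math.round((((float) bytes) * 8 / 1024) / elapsedSeconds);
    }
}
```

For example, 1 MiB fetched in 8 seconds is 8192 kilobits over 8 seconds, i.e. 1024 kbit/s - which the label "kb/s" invites readers to misread as 1024 kilobytes per second.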
[jira] Commented: (NUTCH-277) Fetcher dies because of "max. redirects" (avoiding infinite loop)
[ http://issues.apache.org/jira/browse/NUTCH-277?page=comments#action_12412706 ] Stefan Neufeind commented on NUTCH-277: --- The problem was reproducible with the URL set we had here. After moving from protocol-httpclient to protocol-http the problem is gone, and crawling is fine. Could there be a problem in the httpclient interface, maybe with redirects? PS: Too bad we're missing https-support for now - but it works for the moment ... > Fetcher dies because of "max. redirects" (avoiding infinite loop) > - > > Key: NUTCH-277 > URL: http://issues.apache.org/jira/browse/NUTCH-277 > Project: Nutch > Type: Bug > Components: fetcher > Versions: 0.8-dev > Environment: nightly-2006-05-20 > Reporter: Stefan Neufeind > Priority: Critical > > Error in the logs is: > 060521 213401 SEVERE Narrowly avoided an infinite loop in execute > org.apache.commons.httpclient.RedirectException: Maximum redirects (100) > exceeded > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:183) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324) > at > org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:87) > at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:97) > at > org.apache.nutch.protocol.http.api.RobotRulesParser.isAllowed(RobotRulesParser.java:394) > at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:173) > at > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135) > This happens during normal crawling. Unfortunately I don't know how to > further track this down. But it's problematic, since it actually makes the > fetcher die. > Workaround (for the symptom) is in NUTCH-258 (avoid dying on SEVERE > logentry). That works for me, crawling works fine and it does not hang/crash. > However this is working around the problems not solving them - I know.
But > it helps for the moment ... > Hope somebody can help - this looks quite important to track down to me.
[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12412705 ] Stefan Neufeind commented on NUTCH-258: --- Beware of simply silencing the error! It helped me in one place - but in another it let an infinite loop keep running. > Once Nutch logs a SEVERE log item, Nutch fails forevermore > -- > > Key: NUTCH-258 > URL: http://issues.apache.org/jira/browse/NUTCH-258 > Project: Nutch > Type: Bug > Components: fetcher > Versions: 0.8-dev > Environment: All > Reporter: Scott Ganyo > Priority: Critical > Attachments: dumbfix.patch > > Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. > This is from the run() method in Fetcher.java: > public void run() { > synchronized (Fetcher.this) {activeThreads++;} // count threads > > try { > UTF8 key = new UTF8(); > CrawlDatum datum = new CrawlDatum(); > > while (true) { > if (LogFormatter.hasLoggedSevere()) // something bad happened > break; // exit > > Notice the last 2 lines. This will prevent Nutch from ever fetching again > once this is hit, as LogFormatter stores this data as a static. > (Also note that "LogFormatter.hasLoggedSevere()" is also checked in > org.apache.nutch.net.URLFilterChecker and will disable this class as well.) > This must be fixed or Nutch cannot be run as any kind of long-running > service. Furthermore, I believe it is a poor decision to rely on a logging > event to determine the state of the application - this could have any number > of side-effects that would be extremely difficult to track down. (As it has > already for me.)
[jira] Updated: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=all ] Stefan Neufeind updated NUTCH-258: -- Attachment: dumbfix.patch I know this is a dumb fix :-) But it solves the problem for the moment ... > Once Nutch logs a SEVERE log item, Nutch fails forevermore > -- > > Key: NUTCH-258 > URL: http://issues.apache.org/jira/browse/NUTCH-258 > Project: Nutch > Type: Bug > Components: fetcher > Versions: 0.8-dev > Environment: All > Reporter: Scott Ganyo > Priority: Critical > Attachments: dumbfix.patch > > Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. > This is from the run() method in Fetcher.java: > public void run() { > synchronized (Fetcher.this) {activeThreads++;} // count threads > > try { > UTF8 key = new UTF8(); > CrawlDatum datum = new CrawlDatum(); > > while (true) { > if (LogFormatter.hasLoggedSevere()) // something bad happened > break;// exit > > Notice the last 2 lines. This will prevent Nutch from ever Fetching again > once this is hit as LogFormatter is storing this data as a static. > (Also note that "LogFormatter.hasLoggedSevere()" is also checked in > org.apache.nutch.net.URLFilterChecker and will disable this class as well.) > This must be fixed or Nutch cannot be run as any kind of long-running > service. Furthermore, I believe it is a poor decision to rely on a logging > event to determine the state of the application - this could have any number > of side-effects that would be extremely difficult to track down. (As it has > already for me.)
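One way to avoid the global-state problem described in NUTCH-258 would be to scope the error flag to a single fetch run rather than to static logging state. This is a hypothetical sketch, not the fix adopted by Nutch; class and method names are made up:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative alternative to the JVM-wide LogFormatter.hasLoggedSevere()
// check: each fetch run carries its own error flag, so a fatal error in one
// run cannot permanently disable all future runs in a long-lived process.
public class FetchRun {

    private final AtomicBoolean fatalError = new AtomicBoolean(false);

    /** Called by a worker thread when it hits an unrecoverable error. */
    public void reportFatal() {
        fatalError.set(true);
    }

    /** Worker threads poll this instead of a static logging flag. */
    public boolean shouldStop() {
        return fatalError.get();
    }
}
```

A fresh FetchRun starts clean, which is exactly what the static flag prevents: once LogFormatter's static is set, every later run sees it.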
[jira] Updated: (NUTCH-277) Fetcher dies because of "max. redirects" (avoiding infinite loop)
[ http://issues.apache.org/jira/browse/NUTCH-277?page=all ] Stefan Neufeind updated NUTCH-277: -- Component: fetcher Version: 0.8-dev > Fetcher dies because of "max. redirects" (avoiding infinite loop) > - > > Key: NUTCH-277 > URL: http://issues.apache.org/jira/browse/NUTCH-277 > Project: Nutch > Type: Bug > Components: fetcher > Versions: 0.8-dev > Environment: nightly-2006-05-20 > Reporter: Stefan Neufeind > Priority: Critical > > Error in the logs is: > 060521 213401 SEVERE Narrowly avoided an infinite loop in execute > org.apache.commons.httpclient.RedirectException: Maximum redirects (100) > exceeded > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:183) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324) > at > org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:87) > at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:97) > at > org.apache.nutch.protocol.http.api.RobotRulesParser.isAllowed(RobotRulesParser.java:394) > at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:173) > at > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135) > This happens during normal crawling. Unfortunately I don't know how to > further track this down. But it's problematic, since it actually makes the > fetcher die. > Workaround (for the symptom) is in NUTCH-258 (avoid dying on SEVERE > logentry). That works for me, crawling works fine and it does not hang/crash. > However this is working around the problems not solving them - I know. But > it helps for the moment ... > Hope somebody can help - this looks quite important to track down to me.
[jira] Created: (NUTCH-277) Fetcher dies because of "max. redirects" (avoiding infinite loop)
Fetcher dies because of "max. redirects" (avoiding infinite loop) - Key: NUTCH-277 URL: http://issues.apache.org/jira/browse/NUTCH-277 Project: Nutch Type: Bug Environment: nightly-2006-05-20 Reporter: Stefan Neufeind Priority: Critical Error in the logs is: 060521 213401 SEVERE Narrowly avoided an infinite loop in execute org.apache.commons.httpclient.RedirectException: Maximum redirects (100) exceeded at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:183) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324) at org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:87) at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:97) at org.apache.nutch.protocol.http.api.RobotRulesParser.isAllowed(RobotRulesParser.java:394) at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:173) at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135) This happens during normal crawling. Unfortunately I don't know how to further track this down. But it's problematic, since it actually makes the fetcher die. A workaround (for the symptom) is in NUTCH-258 (avoid dying on a SEVERE log entry). That works for me, crawling works fine and it does not hang/crash. However, this is working around the problem, not solving it - I know. But it helps for the moment ... Hope somebody can help - this looks quite important to track down to me.
[jira] Commented: (NUTCH-254) Fetcher throws NullPointer if redirect URL is filtered
[ http://issues.apache.org/jira/browse/NUTCH-254?page=comments#action_12412684 ] Stefan Neufeind commented on NUTCH-254: --- Looks fine and applies cleanly for me - could this be merged into the dev-trunk? > Fetcher throws NullPointer if redirect URL is filtered > -- > > Key: NUTCH-254 > URL: http://issues.apache.org/jira/browse/NUTCH-254 > Project: Nutch > Type: Bug > Components: fetcher > Versions: 0.8-dev > Environment: Tested on Windows XP Media Center 2005, 2Gigs RAM, 3.0 Ghz > Pentium 4 Hyperthreaded. Should be on any platform. > Reporter: Dennis Kubes > Priority: Minor > Attachments: fetcher_filter_url_patch.txt > > Inside the Fetcher class, if a redirect URL is filtered (for example jsessionid > pages are filtered with the default URL filter), then a NullPointerException > is thrown when Fetcher tries to print out that the url was skipped for being > an identical url. It is not an identical URL but a filtered url. So what we > really need is two different checks: one for a null url and one for an identical > url. I have included a patch that handles this.
[jira] Updated: (NUTCH-48) "Did you mean" query enhancement/refignment feature request
[ http://issues.apache.org/jira/browse/NUTCH-48?page=all ] Stefan Neufeind updated NUTCH-48: - Attachment: did-you-mean-combined08.patch Here are both patches combined into one, built against 0.8-dev (namely: nightly-2006-05-20). - The necessary API changes in 0.8-dev are incorporated in the patch. - Some smaller things are also fixed, e.g.: --- missing ../ in front of link to search.jsp --- missing at end of did-you-mean-part Small to-do left: maybe put the text "Did you mean" into the template to make it translatable to other languages. But I guess that can be done when finally merging this into the dev-tree. Patch tested and proved to work. > "Did you mean" query enhancement/refignment feature request > > > Key: NUTCH-48 > URL: http://issues.apache.org/jira/browse/NUTCH-48 > Project: Nutch > Type: New Feature > Components: web gui > Environment: All platforms > Reporter: byron miller > Assignee: Sami Siren > Priority: Minor > Attachments: did-you-mean-combined08.patch, rss-spell.patch, > spell-check.patch > > Looking to implement a "Did you mean" feature for query result pages that > return < = x amount of results to invoke a response that would recommend a > fixed/related or spell checked query to try. > Note from Doug to users list: > David Spencer has worked on this some. > http://www.searchmorph.com/weblog/index.php?id=23 > I think the code on his site might be more recent than what's committed > to the lucene/contrib directory.
[jira] Updated: (NUTCH-275) Fetcher not parsing XHTML-pages at all
[ http://issues.apache.org/jira/browse/NUTCH-275?page=all ] Stefan Neufeind updated NUTCH-275: -- Description: Server reports page as "text/html" - so I thought it would be processed as html. But something I guess evaluated the headers of the document and re-labeled it as "text/xml" (why not text/xhtml?). For some reason there is no plugin to be found for indexing text/xml (why does TextParser not feel responsible?). Links inside this document are NOT indexed at all - no digging this website actually stops here. Funny thing: For some magical reasons the dtd-files referenced in the header seem to be valid links for the fetcher and as such are indexed in the next round (if urlfilter allows). 060521 025018 fetching http://www.secreturl.something/ 060521 025018 http.proxy.host = null 060521 025018 http.proxy.port = 8080 060521 025018 http.timeout = 1 060521 025018 http.content.limit = 65536 060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org) 060521 025018 fetcher.server.delay = 1000 060521 025018 http.max.delays = 1000 060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml 060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml 060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature 060521 025019 map 0% reduce 0% 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, was: Server reports page as "text/html" - so I thought it would be processed as html. But something I guess evaluated the headers of the document and re-labeled it as "text/xml" (why not text/xhtml?). 
For some reason there is no plugin to be found for indexing text/xml (why does TextParser not feel responsible?). Links inside this document are NOT indexed at all - no digging this website actually stops here. Funny thing: For some magical reasons the dtd-files referenced in the header seem to be valid links for the fetcher and as such are indexed in the next round (if urlfilter allows). 060521 025018 fetching http://www.speedpartner.de/ 060521 025018 http.proxy.host = null 060521 025018 http.proxy.port = 8080 060521 025018 http.timeout = 1 060521 025018 http.content.limit = 65536 060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org) 060521 025018 fetcher.server.delay = 1000 060521 025018 http.max.delays = 1000 060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml 060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml 060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature 060521 025019 map 0% reduce 0% 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, > Fetcher not parsing XHTML-pages at all > -- > > Key: NUTCH-275 > URL: http://issues.apache.org/jira/browse/NUTCH-275 > Project: Nutch > Type: Bug > Versions: 0.8-dev > Environment: problem with nightly-2006-05-20; worked fine with same website > on 0.7.2 > Reporter: Stefan Neufeind > > Server reports page as "text/html" - so I thought it would be processed as > html. > But something I guess evaluated the headers of the document and re-labeled it > as "text/xml" (why not text/xhtml?). > For some reason there is no plugin to be found for indexing text/xml (why > does TextParser not feel responsible?). 
> Links inside this document are NOT indexed at all - no digging this website > actually stops here. > Funny thing: For some magical reasons the dtd-files referenced in the header > seem to be valid links for the fetcher and as such are indexed in the next > round (if urlfilter allows). > 060521 025018 fetching http://www.secreturl.something/ > 060521 025018 http.proxy.host = null > 060521 025018 http.proxy.port = 8080 > 060521 025018 http.timeout = 1 > 060521 025018 http.content.limit = 65536 > 060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; > http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org) > 060521 025018 fetcher.server.delay = 1000 > 060521 025018 http.max.delays = 1000 > 060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser > mapped to contentType text/xml via parse-plugins.xml, but > its plugin.xml file does not claim to support contentType: tex
[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all
[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12412659 ] Stefan Neufeind commented on NUTCH-275: --- I've found out that the first line actually leads to the problems. Without it, the file is parsed as html. - But why can't XML be parsed at all (not even by TextParser)? - And afaik that header is valid as is - been told so - and validator from w3 does not complain as well. http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";> http://www.w3.org/1999/xhtml"; xml:lang="de" lang="de"> > Fetcher not parsing XHTML-pages at all > -- > > Key: NUTCH-275 > URL: http://issues.apache.org/jira/browse/NUTCH-275 > Project: Nutch > Type: Bug > Versions: 0.8-dev > Environment: problem with nightly-2006-05-20; worked fine with same website > on 0.7.2 > Reporter: Stefan Neufeind > > Server reports page as "text/html" - so I thought it would be processed as > html. > But something I guess evaluated the headers of the document and re-labeled it > as "text/xml" (why not text/xhtml?). > For some reason there is no plugin to be found for indexing text/xml (why > does TextParser not feel responsible?). > Links inside this document are NOT indexed at all - no digging this website > actually stops here. > Funny thing: For some magical reasons the dtd-files referenced in the header > seem to be valid links for the fetcher and as such are indexed in the next > round (if urlfilter allows). 
> 060521 025018 fetching http://www.speedpartner.de/
> 060521 025018 http.proxy.host = null
> 060521 025018 http.proxy.port = 8080
> 060521 025018 http.timeout = 1
> 060521 025018 http.content.limit = 65536
> 060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
> 060521 025018 fetcher.server.delay = 1000
> 060521 025018 http.max.delays = 1000
> 060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
> 060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
> 060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060521 025019 map 0% reduce 0%
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
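For reference, the mismatch the log complains about involves two files: parse-plugins.xml maps a contentType onto a parsing plugin, while each plugin's own plugin.xml declares which contentTypes it supports. A mapping sketch along the lines of the one below (abbreviated and illustrative, not the complete shipped file) would map text/xml onto the TextParser - but the fetcher still refuses it because parse-text's plugin.xml never claims text/xml as a supported contentType:

```xml
<!-- parse-plugins.xml (sketch): map the text/xml contentType onto parse-text -->
<parse-plugins>
  <mimeType name="text/xml">
    <plugin id="parse-text" />
  </mimeType>
  <aliases>
    <alias name="parse-text"
           extension-id="org.apache.nutch.parse.text.TextParser" />
  </aliases>
</parse-plugins>
```

So a workaround, per the log message, would be to make the corresponding plugin.xml also claim text/xml - or to enable parse-rss via plugin.includes.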
[jira] Created: (NUTCH-275) Fetcher not parsing XHTML-pages at all
Fetcher not parsing XHTML-pages at all
--

Key: NUTCH-275
URL: http://issues.apache.org/jira/browse/NUTCH-275
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: problem with nightly-2006-05-20; worked fine with same website on 0.7.2
Reporter: Stefan Neufeind

Server reports the page as "text/html" - so I thought it would be processed as html.
But something, I guess, evaluated the headers of the document and re-labeled it as "text/xml" (why not text/xhtml?).
For some reason there is no plugin to be found for indexing text/xml (why does TextParser not feel responsible?).
Links inside this document are NOT indexed at all - no digging, this website actually stops here.
Funny thing: for some magical reason the dtd-files referenced in the header seem to be valid links for the fetcher and as such are indexed in the next round (if the urlfilter allows).

060521 025018 fetching http://www.speedpartner.de/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 1
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019 map 0% reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
[jira] Created: (NUTCH-274) Empty row in/at end of URL-list results in error
Empty row in/at end of URL-list results in error
--

Key: NUTCH-274
URL: http://issues.apache.org/jira/browse/NUTCH-274
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: nightly-2006-05-20
Reporter: Stefan Neufeind
Priority: Minor

This is minor - but it's a little unclean :-)
Reproduce: have a URL-file with one URL followed by a newline, thus producing an empty line.
Outcome: fetcher-threads try to fetch two URLs at the same time. The first one is fine - but the second is empty and therefore fails proper protocol-detection.

060521 022639 Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
060521 022639 Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
060521 022639 found resource parse-plugins.xml at file:/home/mm/nutch-nightly/conf/parse-plugins.xml
060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060521 022639 fetching http://www.bild.de/
060521 022639 fetching
060521 022639 fetch of failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: no protocol:
060521 022639 http.proxy.host = null
060521 022639 http.proxy.port = 8080
060521 022639 http.timeout = 1
060521 022639 http.content.limit = 65536
060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 022639 fetcher.server.delay = 1000
060521 022639 http.max.delays = 1000
060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 022640 map 0% reduce 0%
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
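The fix this report suggests is simply to skip blank rows before they become fetch entries, so an empty line never reaches protocol-detection. A minimal, hypothetical sketch of that guard (this is not Nutch's actual injector code; the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class SeedFilter {
    // Return only non-blank, trimmed lines. A blank row would otherwise
    // reach the fetcher and fail with "MalformedURLException: no protocol:".
    public static List<String> filterSeeds(List<String> lines) {
        List<String> urls = new ArrayList<String>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }
}
```

With a seed list of `["http://www.bild.de/", ""]` only the first entry survives, which would avoid the "1 pages, 1 errors" outcome in the log above.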
[jira] Updated: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )
[ http://issues.apache.org/jira/browse/NUTCH-173?page=all ] Stefan Neufeind updated NUTCH-173:
--
Attachment: patch08-new.patch

Here is the 08-patch, corrected to work against the nightly from 2006-05-20. Also, fromHost is now only generated if really needed, and nutch-default.xml is patched as well.
By the way: where should a property for "crawl" be located in the config-file? In the "fetcher"-section? In that case please somebody move it up/down or rename the property before including it in the dev-tree.
But could somebody please review it quickly? I'm not sure it's 100% correct. Still investigating on my side ...

> PerHost Crawling Policy ( crawl.ignore.external.links )
> ---
>
> Key: NUTCH-173
> URL: http://issues.apache.org/jira/browse/NUTCH-173
> Project: Nutch
> Type: New Feature
> Components: fetcher
> Versions: 0.7.1, 0.7, 0.8-dev
> Reporter: Philippe EUGENE
> Priority: Minor
> Attachments: patch.txt, patch08-new.patch, patch08.txt
>
> There are two major ways of crawling in Nutch.
> Intranet crawl: forbid all, allow a few hosts.
> Whole-web crawl: allow all, forbid a few things.
> I propose a third type of crawl.
> Directory crawl: the purpose of this crawl is to manage a few thousand hosts without managing rule patterns in UrlFilterRegexp.
> I made two patches, for 0.7/0.7.1 and 0.8-dev.
> I propose a new boolean property in nutch-site.xml: crawl.ignore.external.links, with a default value of false.
> By default this new feature doesn't modify the behavior of the nutch crawler.
> When you set this property to true, the crawler doesn't fetch external links of the host.
> So the crawl is limited to the hosts that you inject at the beginning of the crawl.
> I know there are some proposals for new crawl policies using the CrawlDatum in the 0.8-dev branch.
> This feature could be an easy way to quickly add a new crawl feature to nutch, while waiting for a better way to improve the crawl policy.
> I post two patches.
> Sorry for my very poor english
> --
> Philippe
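The core idea of the patch can be illustrated with a short sketch (this is an illustration of the behavior described in the issue, not the patch code itself; the class and method names are hypothetical): when crawl.ignore.external.links is true, an outlink is kept only if its host matches the host of the page it was found on.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ExternalLinkCheck {
    // Keep an outlink only when its host equals the host of the source
    // page; with the property at its default of false, nothing changes.
    public static boolean keepOutlink(String fromUrl, String toUrl,
                                      boolean ignoreExternalLinks) {
        if (!ignoreExternalLinks) {
            return true; // default: crawler behavior unchanged
        }
        try {
            String fromHost = new URL(fromUrl).getHost();
            String toHost = new URL(toUrl).getHost();
            return fromHost.equalsIgnoreCase(toHost);
        } catch (MalformedURLException e) {
            return false; // unparsable URLs are dropped
        }
    }
}
```

This also shows why fromHost should only be computed "if really needed": when the property is false the host extraction can be skipped entirely.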
[jira] Commented: (NUTCH-175) No input directories specified in: while crawling in nightly build from the 14.1.2006: sh ./nutch crawl urllist.txt -dir tmpdir
[ http://issues.apache.org/jira/browse/NUTCH-175?page=comments#action_12412644 ] Stefan Neufeind commented on NUTCH-175:
---
My bad - I didn't pay close attention when moving from 0.7 to 0.8. But I'd like to stress in this bug-entry that "urls" in the example-call to "nutch crawl" is no longer a file - but actually a directory containing files with urls in them.
RTFM - and now it works :-)

> No input directories specified in: while crawling in nightly build from the 14.1.2006: sh ./nutch crawl urllist.txt -dir tmpdir
> --
>
> Key: NUTCH-175
> URL: http://issues.apache.org/jira/browse/NUTCH-175
> Project: Nutch
> Type: Bug
> Environment: SUSE Linux 9.3
> Reporter: Matthias Günter
> Priority: Trivial
>
> [EMAIL PROTECTED]:~/workspace/lucene/nutch-nightly/bin> sh ./nutch crawl urllist.txt -dir tmpdir
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> 060114 205612 crawl started in: tmpdir
> 060114 205612 rootUrlDir = urllist.txt
> 060114 205612 threads = 10
> 060114 205612 depth = 5
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> 060114 205612 Injector: starting
> 060114 205612 Injector: crawlDb: tmpdir/crawldb
> 060114 205612 Injector: urlDir: urllist.txt
> 060114 205612 Injector: Converting injected urls to crawl db entries.
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> 060114 205612 Running job: job_n0o7ps
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205613 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205613 parsing /tmp/nutch/mapred/local/localRunner/job_n0o7ps.xml
> 060114 205613 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /tmp/nutch/mapred/local/localRunner/job_n0o7ps.xml , nutch-site.xml
>         at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
>         at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
>         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
> 060114 205613 map 0%
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
> urllist.txt contains
> http://www.mentor.ch
>
> PS: Is there a committer or developer (near Switzerland) who can support (paid support) with a mixed index for intranet, some internet sites and scanning of local drives (P:\ , S:\ etc)
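The takeaway from the comment above is the changed calling convention: in 0.8-dev the first argument to "nutch crawl" must be a directory of seed files rather than a single file. A sketch of the setup (directory and file names here are illustrative):

```shell
# 0.8-dev expects a DIRECTORY of seed files, not a single file.
mkdir -p urls
echo "http://www.mentor.ch/" > urls/seeds.txt
# Then, from a Nutch checkout:
# bin/nutch crawl urls -dir tmpdir -depth 5
```

Passing the file urllist.txt directly, as in the report, is what produces "No input directories specified".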
[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412620 ] Stefan Neufeind commented on NUTCH-272:
---
Oh, I just discovered this new parameter was added in 0.8-dev :-)
But to my understanding of the description in nutch-default.xml this only applies "per fetchlist". And that would mean "for one run", right? So in case I set this to 100 and fetch 10 rounds I'd have max. 1000 documents?
But what if there is one document on the first level (theoretically) with 200 links in it? In this case I suspect they are all written to the webdb as "to-do" in the first run; in the next run the first 100 are fetched, with the rest skipped; and in another round the next 100 are fetched? Is that right?
My idea was also to have this as a "per host" or "per site" setting - or to be able to override the value for a certain host ...

> Max. pages to crawl/fetch per site (emergency limit)
> --
>
> Key: NUTCH-272
> URL: http://issues.apache.org/jira/browse/NUTCH-272
> Project: Nutch
> Type: Improvement
> Reporter: Stefan Neufeind
>
> If I'm right, there is no way in place right now for setting an "emergency limit" to fetch a certain max. number of pages per site. Is there an "easy" way to implement such a limit, maybe as a plugin?
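The round-by-round arithmetic in the comment above can be checked with a toy model (this is an illustration of the suspected behavior, not Nutch code): each round fetches at most the per-fetchlist limit, and the leftovers carry over to the next round.

```java
public class FetchRounds {
    // Toy model of a per-fetchlist cap: each round fetches at most
    // "limit" of the URLs currently pending; the rest carry over.
    public static int totalFetched(int queued, int limit, int rounds) {
        int fetched = 0;
        int pending = queued;
        for (int i = 0; i < rounds; i++) {
            int take = Math.min(pending, limit);
            fetched += take;
            pending -= take;
        }
        return fetched;
    }
}
```

Under this model, 200 discovered links with a limit of 100 are indeed fetched as 100 in one round and 100 in the next, and 10 rounds at a limit of 100 can never exceed 1000 documents - matching the reading of the description in the comment.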
[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )
[ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12412530 ] Stefan Neufeind commented on NUTCH-173:
---
Applies fine and works for me on 0.7.2.

> PerHost Crawling Policy ( crawl.ignore.external.links )
> ---
>
> Key: NUTCH-173
> URL: http://issues.apache.org/jira/browse/NUTCH-173
> Project: Nutch
> Type: New Feature
> Components: fetcher
> Versions: 0.7.1, 0.7, 0.8-dev
> Reporter: Philippe EUGENE
> Priority: Minor
> Attachments: patch.txt, patch08.txt
>
> There are two major ways of crawling in Nutch.
> Intranet crawl: forbid all, allow a few hosts.
> Whole-web crawl: allow all, forbid a few things.
> I propose a third type of crawl.
> Directory crawl: the purpose of this crawl is to manage a few thousand hosts without managing rule patterns in UrlFilterRegexp.
> I made two patches, for 0.7/0.7.1 and 0.8-dev.
> I propose a new boolean property in nutch-site.xml: crawl.ignore.external.links, with a default value of false.
> By default this new feature doesn't modify the behavior of the nutch crawler.
> When you set this property to true, the crawler doesn't fetch external links of the host.
> So the crawl is limited to the hosts that you inject at the beginning of the crawl.
> I know there are some proposals for new crawl policies using the CrawlDatum in the 0.8-dev branch.
> This feature could be an easy way to quickly add a new crawl feature to nutch, while waiting for a better way to improve the crawl policy.
> I post two patches.
> Sorry for my very poor english
> --
> Philippe
[jira] Created: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)
Max. pages to crawl/fetch per site (emergency limit)
--

Key: NUTCH-272
URL: http://issues.apache.org/jira/browse/NUTCH-272
Project: Nutch
Type: Improvement
Reporter: Stefan Neufeind

If I'm right, there is no way in place right now for setting an "emergency limit" to fetch a certain max. number of pages per site. Is there an "easy" way to implement such a limit, maybe as a plugin?
[jira] Created: (NUTCH-271) Meta-data per URL/site/section
Meta-data per URL/site/section
--

Key: NUTCH-271
URL: http://issues.apache.org/jira/browse/NUTCH-271
Project: Nutch
Type: New Feature
Versions: 0.7.2
Reporter: Stefan Neufeind

We have the need to index sites and attach additional meta-data-tags to them. Afaik this is not yet possible, or is there a "workaround" I don't see?
What I think of is using meta-tags per start-url, only indexing content below that URL, and having the ability to limit searches to those meta-tags. E.g.:

http://www.example1.com/something1/ -> meta-tag "companybranch1"
http://www.example2.com/something2/ -> meta-tag "companybranch2"
http://www.example3.com/something3/ -> meta-tag "companybranch1"
http://www.example4.com/something4/ -> meta-tag "companybranch3"

Then search for everything in companybranch1, or across 1 and 3, or similar.
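The lookup the feature request describes - assigning a tag to each document based on which start-URL prefix it falls under - is straightforward to sketch. The class below is purely hypothetical (it is not a Nutch plugin, just an illustration of the prefix-to-tag mapping from the issue's example):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetaTagLookup {
    // Hypothetical start-URL prefix -> meta-tag table, as in the example.
    private static final Map<String, String> TAGS = new LinkedHashMap<String, String>();
    static {
        TAGS.put("http://www.example1.com/something1/", "companybranch1");
        TAGS.put("http://www.example2.com/something2/", "companybranch2");
        TAGS.put("http://www.example3.com/something3/", "companybranch1");
        TAGS.put("http://www.example4.com/something4/", "companybranch3");
    }

    // Return the meta-tag for the first matching start-URL prefix,
    // or null when the URL falls under none of them.
    public static String tagFor(String url) {
        for (Map.Entry<String, String> e : TAGS.entrySet()) {
            if (url.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

If such a tag were stored as an extra index field at indexing time, searches could then be restricted to one or more tag values, which is exactly the "search across companybranch1 and 3" case.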