Re: incremental crawling
Hi Doug,

1. How should we deal with "dead URLs"? If I remove a URL after Nutch's first crawl, should Nutch keep the "dead URL" and never fetch it again?

2. Should Nutch expose dedup as an extension point? In my project we add an information-extraction layer on top of Nutch, and I think it would be a good idea to expose dedup as an extension point, since we could then build our "duplicates rule" on the extracted data object; the default would remain the page URL. Thoughts?

/Jack

On 12/2/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> It would be good to improve the support for incremental crawling in
> Nutch. Here are some ideas about how we might implement it. Andrzej
> has posted in the past about this, so he probably has better ideas.
>
> Incremental crawling could proceed as follows:
>
> 1. Bootstrap with a batch crawl, using the 'crawl' command. Modify
> CrawlDatum to store the MD5Hash of the content of fetched urls.
>
> 2. Reduce the fetch interval for high-scoring urls. If the default is
> monthly, then the top-scoring 1% of urls might be set to daily, and the
> top-scoring 10% of urls might be set to weekly.
>
> 3. Generate a fetch list & fetch it. When the url has been previously
> fetched, and its content is unchanged, increase its fetch interval by an
> amount, e.g., 50%. If the content is changed, decrease the fetch
> interval. The percentage of increase and decrease might be influenced
> by the url's score.
>
> 4. Update the crawl db & link db, index the new segment, dedup, etc.
> When updating the crawl db, scores for existing urls should not change,
> since the scoring method we're using (OPIC) assumes each page is fetched
> only once.
>
> Steps 3 & 4 can be packaged as an 'update' command. Step 2 can be
> included in the 'crawl' command, so that crawled indexes are always
> ready for update.
>
> Comments?
>
> Doug

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
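A minimal sketch of how such a pluggable dedup rule could look; the DedupFilter name and the key-based grouping are assumptions for illustration, not an existing Nutch extension point:

// Hypothetical sketch of a pluggable dedup rule -- not an existing Nutch extension point.
// Plugins return a grouping key; documents that share a key are treated as duplicates.
public interface DedupFilter {
  /** Key used to group candidate duplicates for this document. */
  String getDedupKey(String url, byte[] content);
}

// Default rule (package-private here for brevity): duplicates are grouped by URL.
// An information-extraction layer could instead key on its extracted data object.
class UrlDedupFilter implements DedupFilter {
  public String getDedupKey(String url, byte[] content) {
    return url;
  }
}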
Re: Urlfilter Patch
Totally agreed. Neither approach replaces the other. I just wanted to mention the possibility so people don't over-focus on trying to build a hyper-optimized regex list. :)

For the content provider, an HTTP HEAD request saves them bandwidth if we don't do a GET. That's some cost savings for them over doing a blind fetch (especially if we discard the content). I guess the question is, what's worse:

- two server hits when we find content we want?, or
- spending bandwidth on pages that the Nutch installation will ignore anyway?

--matt

On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:

> Matt Kangas wrote:
>> The latter is not strictly true. Nutch could issue an HTTP HEAD before
>> the HTTP GET, and determine the mime-type before actually grabbing the
>> content. It's not how Nutch works now, but this might be more useful
>> than a super-detailed set of regexes...
>
> This could be a useful addition, but it could not replace url-based
> filters. A HEAD request must still be polite, so this could
> substantially slow fetching, as it would incur more delays. Also, for
> most dynamic pages, a HEAD is as expensive for the server as a GET, so
> this would cause more load on servers.
>
> Doug

--
Matt Kangas / [EMAIL PROTECTED]
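A rough sketch of the HEAD-before-GET idea using plain java.net; this is not how the Nutch protocol plugins work, and the accepted-type check below is only an assumed policy for illustration (politeness delays would still apply, as Doug notes):

import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: issue a HEAD request and skip the GET when the Content-Type
// is something no enabled parser would handle.
public class HeadProbe {
  public static boolean looksParseable(String url) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("HEAD");
    String type = conn.getContentType();  // e.g. "text/html; charset=UTF-8"
    conn.disconnect();
    // Assumed policy: only the default html and text parsers are enabled.
    return type != null
        && (type.startsWith("text/html") || type.startsWith("text/plain"));
  }
}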
RE: Urlfilter Patch
Hi Jerome,

> Yes, the fetcher can't rely on the document mime-type.
> The only thing we can use for filtering is the document's URL.
> So, another alternative could be to exclude only file extensions that are
> registered in the mime-type repository (some well-known file extensions)
> but for which no parser is activated, and accept all other ones.
> So that the .foo files will be fetched...

Yup, the key phrase is "well known". It would essentially be an optimization, or heuristic, to save some work in the regex...

Cheers,
Chris

> Jérôme
RE: Urlfilter Patch
Hi Doug,

> Chris Mattmann wrote:
>> In principle, the mimeType system should give us some guidance on
>> determining the appropriate mimeType for the content, regardless of
>> whether it ends in .foo, .bar or the like.
>
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Duh, you're right. Sorry about that.

Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD
> before the HTTP GET, and determine the mime-type before actually
> grabbing the content.
>
> It's not how Nutch works now, but this might be more useful than a
> super-detailed set of regexes...

I liked Matt's idea of the HEAD request, though. I wonder if some performance benchmarks would be useful: in some cases (such as focused crawling, or "non-whole-internet" crawling like an intranet), the penalty of performing the HEAD to get the content-type might well be acceptable and worth the cost...

Cheers,
Chris
Re: Urlfilter Patch
Agreed - looks like this list is too aggressive. A better one would be:

-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|png|pps|ppt|ps|psd|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$

This removes xhtml, xml, php, jsp, py, pl, and cgi. We've seen php/jsp/py/pl/cgi in our error logs as un-parsable, but it looks like most cases are when the server is misconfigured and winds up returning the source code, as opposed to the result of executing the code.

-- Ken

On Thu, 2005-12-01 at 18:53 +, Howie Wang wrote:
> ...And .xhtml seem like they would be parsable by the default HTML parser.

Ditto for .xml. It is a valid (though seldom used) xhtml extension.

> Howie
>
> > From: Doug Cutting <[EMAIL PROTECTED]>
> >
> > Ken Krugler wrote:
> >> For what it's worth, below is the filter list we're using for doing an
> >> html-centric crawl (no word docs, for example). Using the (?i) means we
> >> don't need to have upper & lower-case versions of the suffixes.
> >>
> >> -(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
> >
> > This looks like a more complete suffix list.
> >
> > Should we use this as the default? By default only html and text parsers
> > are enabled, so perhaps that's all we should accept.
> >
> > Why do you exclude .php urls? These are simply dynamic pages, no?
> > Similarly, .jsp and .py are frequently suffixes that return html. Are
> > there other suffixes we should remove from this list before we make it
> > the default exclusion list?
> >
> > Doug

-- Rod Taylor <[EMAIL PROTECTED]>

--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
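For anyone tuning the list, here is a quick sanity check of the proposed pattern against a few made-up URLs; the pattern below is the one above, minus the leading '-' that marks it as an exclusion rule in regex-urlfilter.txt:

import java.util.regex.Pattern;

// Quick check of the proposed exclusion regex against sample URLs (the URLs are made up).
public class SuffixFilterCheck {
  public static void main(String[] args) {
    Pattern exclude = Pattern.compile(
        "(?i)\\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz"
        + "|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|lha|md5|mov|mp3|mp4|mpg|msi|ogg"
        + "|pdf|png|pps|ppt|ps|psd|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz"
        + "|tif|wav|wmf|wmv|xls|z|zip)\\)?$");
    String[] urls = {
        "http://example.com/report.PDF",  // excluded: (?i) makes the match case-insensitive
        "http://example.com/index.php",   // kept: .php was removed from this list
        "http://example.com/page.html"    // kept: html is parsed, not excluded
    };
    for (String u : urls) {
      System.out.println(u + " -> " + (exclude.matcher(u).find() ? "excluded" : "kept"));
    }
  }
}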
Re: Urlfilter Patch
> Suggestion:
> For consistency purposes, and ease of Nutch management, why not filter the
> extensions based on the activated plugins?
> By looking at the mime-types defined in the parse-plugins.xml file and the
> activated plugins, we know which content-types will be parsed.
> So, by getting the file extensions associated with each content-type, we can
> build a list of file extensions to include (all others will be excluded) in
> the fetch process.

I'd asked a Nutch consultant this exact same question a few months ago. It does seem odd that there's an implicit dependency between the file suffixes found in regex-urlfilter.txt and the enabled plug-ins found in nutch-default.xml and nutch-site.xml. What's the point of downloading a 100MB .bz2 file if there's nobody available to handle it? It's also odd that there's a nutch-site.xml, but no equivalent for regex-urlfilter.txt.

There are cases of some suffixes (like .php) that can return any kind of mime-type content, and other suffixes (like .xml) that can mean any number of things. So I think you'd still want regex-urlfilter.txt files (both a default and a site version) that provide explicit additions/deletions to the list generated from the installed and enabled parse plugins.

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
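A rough sketch of deriving the accept list from the enabled parsers; the hard-coded content-type-to-extension map and the set of enabled types below are placeholders for what would really come from parse-plugins.xml and the mime-type repository:

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: compute the URL suffixes to accept from the content types that
// enabled parse plugins can handle. The maps here stand in for data that would
// come from parse-plugins.xml and the mime-type registry.
public class AllowedSuffixes {

  public static Set<String> allowedSuffixes(Set<String> enabledTypes,
                                            Map<String, Set<String>> typeToExtensions) {
    Set<String> allowed = new HashSet<String>();
    for (String type : enabledTypes) {
      Set<String> exts = typeToExtensions.get(type);
      if (exts != null) {
        allowed.addAll(exts);
      }
    }
    return allowed;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> typeToExt = new HashMap<String, Set<String>>();
    typeToExt.put("text/html", new HashSet<String>(Arrays.asList("html", "htm")));
    typeToExt.put("text/plain", new HashSet<String>(Arrays.asList("txt")));
    typeToExt.put("application/pdf", new HashSet<String>(Arrays.asList("pdf")));

    // Suppose only the default html and text parsers are enabled.
    Set<String> enabled = new HashSet<String>(Arrays.asList("text/html", "text/plain"));
    System.out.println(allowedSuffixes(enabled, typeToExt)); // pdf is left out
  }
}

As Ken points out, a per-site regex-urlfilter would still be wanted on top of such a generated list, to cover suffixes like .php or .xml that can map to many content types.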
Re: incremental crawling
> 3. Generate a fetch list & fetch it. When the url has been previously
> fetched, and its content is unchanged, increase its fetch interval by an
> amount, e.g., 50%. If the content is changed, decrease the fetch
> interval. The percentage of increase and decrease might be influenced
> by the url's score.

Hi,

if we tracked the amount of change this way, we could also prefer pages in the ranking algorithm that change more often. Frequently changing pages might be more up-to-date and could have a higher value than pages that never change. Also, pages which have been unchanged for a long time might be running out of date and could lose a little bit of their general scoring. So maybe the fetch interval value could be used as a multiplier for boosting pages in the final result set.

Matthias
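One way to express that idea as a scoring tweak; the clamping range and the use of the default interval as the reference point are arbitrary choices for illustration, not anything Nutch does:

// Sketch only: boost pages whose fetch interval has shrunk (they change often) and
// slightly penalize pages whose interval has grown long. Constants are arbitrary.
public class FreshnessBoost {
  public static float boost(float baseScore, float intervalDays, float defaultIntervalDays) {
    float ratio = defaultIntervalDays / intervalDays;          // > 1 for fast-changing pages
    float multiplier = Math.max(0.5f, Math.min(2.0f, ratio));  // clamp to [0.5, 2.0]
    return baseScore * multiplier;
  }
}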
Re: Urlfilter Patch
Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD before
> the HTTP GET, and determine the mime-type before actually grabbing the
> content. It's not how Nutch works now, but this might be more useful
> than a super-detailed set of regexes...

This could be a useful addition, but it could not replace url-based filters. A HEAD request must still be polite, so this could substantially slow fetching, as it would incur more delays. Also, for most dynamic pages, a HEAD is as expensive for the server as a GET, so this would cause more load on servers.

Doug
Re: Urlfilter Patch
The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content.

It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes...

[EMAIL PROTECTED]:~$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.localdomain.
Escape character is '^]'.
HEAD / HTTP/1.0

HTTP/1.1 200 OK
Date: Thu, 01 Dec 2005 21:25:38 GMT
Server: Apache/2.0
Connection: close
Content-Type: text/html; charset=UTF-8

Connection closed by foreign host.

On Dec 1, 2005, at 4:21 PM, Doug Cutting wrote:

> Chris Mattmann wrote:
>> In principle, the mimeType system should give us some guidance on
>> determining the appropriate mimeType for the content, regardless of
>> whether it ends in .foo, .bar or the like.
>
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.
>
> Doug

--
Matt Kangas / [EMAIL PROTECTED]
Re: Urlfilter Patch
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Yes, the fetcher can't rely on the document mime-type. The only thing we can use for filtering is the document's URL. So another alternative could be to exclude only file extensions that are registered in the mime-type repository (some well-known file extensions) but for which no parser is activated, and accept all other ones, so that the .foo files would still be fetched...

Jérôme
Re: Urlfilter Patch
Chris Mattmann wrote:
> In principle, the mimeType system should give us some guidance on
> determining the appropriate mimeType for the content, regardless of
> whether it ends in .foo, .bar or the like.

Right, but the URL filters run long before we know the mime type, in order to try to keep us from fetching lots of stuff we can't process. The mime type is not known until we've fetched it.

Doug
Re: Urlfilter Patch
Hi Doug,

On 12/1/05 1:11 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Jérôme Charron wrote:
[...]
> What about a site that develops a content system that has urls that end
> in .foo, which we would exclude, even though they return html?
>
> Doug

In principle, the mimeType system should give us some guidance on determining the appropriate mimeType for the content, regardless of whether it ends in .foo, .bar or the like. I'm not sure if the mime type registry is there yet, but I know that Jerome was working on a major update that would help in recognizing these types of situations.

Of course, efficiency comes into play as well, in terms of not slowing down the fetch/parse, but it would be nice to have a general solution that made use of the information available in parse-plugins.xml to determine the appropriate set of allowed extensions in a URLFilter, if possible. It may be a pipe dream, but I'd say it's worth exploring...

Cheers,
Chris

______________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B              Mailstop: 171-246
_________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
Re: Urlfilter Patch
Jérôme Charron wrote:
> For consistency purposes, and ease of Nutch management, why not filter the
> extensions based on the activated plugins?
> By looking at the mime-types defined in the parse-plugins.xml file and the
> activated plugins, we know which content-types will be parsed.
> So, by getting the file extensions associated with each content-type, we can
> build a list of file extensions to include (all others will be excluded) in
> the fetch process.
> No?

What about a site that develops a content system that has urls that end in .foo, which we would exclude, even though they return html?

Doug
Re: [Nutch-dev] incremental crawling
Sounds really good (and it is requested by a lot of Nutch users!).
+1

Jérôme

On 12/1/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Matt Kangas wrote:
> > #2 should be a pluggable/hookable parameter. "high-scoring" sounds like
> > a reasonable default basis for choosing recrawl intervals, but I'm sure
> > that nearly everyone will think of a way to improve upon that for their
> > particular system.
> >
> > e.g. "high-scoring" ain't gonna cut it for my needs. (0.5 wink ;)
>
> In NUTCH-61, Andrzej has a pluggable FetchSchedule. That looks like a
> good idea.
>
> http://issues.apache.org/jira/browse/NUTCH-61
>
> Doug

--
http://motrech.free.fr/
http://www.frutch.org/
Re: Urlfilter Patch
Jérôme Charron wrote:
[...]
> build a list of file extensions to include (other ones will be excluded) in
> the fetch process.
[...]

I would not like to exclude all others - for example, many extensions are valid for html, especially dynamically generated pages (jsp, asp, cgi, just to name the easy ones, plus a lot of custom ones). But the idea of automatically allowing extensions for which plugins are enabled is a good one in my opinion.

Anyway, I will try to find my own list of forbidden extensions, which I prepared based on 80 million URLs - I just compiled the list of the most common extensions and went through it manually. I will try to find it over the weekend so we can combine it with the list discussed in this thread.

P.
Re: Urlfilter Patch
Jerome,

I think that this is a great idea, and it ensures that there isn't replication of so-called "management information" across the system. It could be easily implemented as a utility method, because we have utility java classes that represent the ParsePluginList that you could get the mimeTypes from. Additionally, we could create a utility method that searches the extension point list for parsing plugins and returns true or false depending on whether they are activated. Using this information, I believe that the url filtering would be a snap.

+1

Cheers,
Chris

On 12/1/05 12:11 PM, "Jérôme Charron" <[EMAIL PROTECTED]> wrote:

> Suggestion:
> For consistency purposes, and ease of Nutch management, why not filter the
> extensions based on the activated plugins?
> By looking at the mime-types defined in the parse-plugins.xml file and the
> activated plugins, we know which content-types will be parsed.
> So, by getting the file extensions associated with each content-type, we can
> build a list of file extensions to include (other ones will be excluded) in
> the fetch process.
> No?
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/

______________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B              Mailstop: 171-246
_________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
Re: [Nutch-dev] incremental crawling
Matt Kangas wrote:
> #2 should be a pluggable/hookable parameter. "high-scoring" sounds like
> a reasonable default basis for choosing recrawl intervals, but I'm sure
> that nearly everyone will think of a way to improve upon that for their
> particular system.
>
> e.g. "high-scoring" ain't gonna cut it for my needs. (0.5 wink ;)

In NUTCH-61, Andrzej has a pluggable FetchSchedule. That looks like a good idea.

http://issues.apache.org/jira/browse/NUTCH-61

Doug
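For readers who haven't opened NUTCH-61, the general shape of a pluggable schedule might be something like the following; this is only a guess at the concept, not the interface in Andrzej's patch:

// Hypothetical sketch of a pluggable recrawl schedule -- see NUTCH-61 for the real proposal.
public interface RecrawlSchedule {
  /**
   * Compute the next fetch time for a page, given when it was last fetched, its
   * current fetch interval, whether its content changed, and its score.
   */
  long nextFetchTime(long lastFetchTime, long currentIntervalMs,
                     boolean contentChanged, float score);
}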
Re: [Nutch-dev] incremental crawling
#2 should be a pluggable/hookable parameter. "high-scoring" sounds like a reasonable default basis for choosing recrawl intervals, but I'm sure that nearly everyone will think of a way to improve upon that for their particular system.

e.g. "high-scoring" ain't gonna cut it for my needs. (0.5 wink ;)

--matt

On Dec 1, 2005, at 2:15 PM, Doug Cutting wrote:

> It would be good to improve the support for incremental crawling in
> Nutch. Here are some ideas about how we might implement it. Andrzej
> has posted in the past about this, so he probably has better ideas.
>
> Incremental crawling could proceed as follows:
>
> 1. Bootstrap with a batch crawl, using the 'crawl' command. Modify
> CrawlDatum to store the MD5Hash of the content of fetched urls.
>
> 2. Reduce the fetch interval for high-scoring urls. If the default is
> monthly, then the top-scoring 1% of urls might be set to daily, and the
> top-scoring 10% of urls might be set to weekly.
>
> 3. Generate a fetch list & fetch it. When the url has been previously
> fetched, and its content is unchanged, increase its fetch interval by an
> amount, e.g., 50%. If the content is changed, decrease the fetch
> interval. The percentage of increase and decrease might be influenced
> by the url's score.
>
> 4. Update the crawl db & link db, index the new segment, dedup, etc.
> When updating the crawl db, scores for existing urls should not change,
> since the scoring method we're using (OPIC) assumes each page is fetched
> only once.
>
> Steps 3 & 4 can be packaged as an 'update' command. Step 2 can be
> included in the 'crawl' command, so that crawled indexes are always
> ready for update.
>
> Comments?
>
> Doug

--
Matt Kangas / [EMAIL PROTECTED]
[jira] Resolved: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]

Doug Cutting resolved NUTCH-116:
--------------------------------
    Fix Version: 0.8-dev
     Resolution: Fixed

I just committed this. Thanks, Paul, this is great to have!

> TestNDFS a JUnit test specifically for NDFS
> -------------------------------------------
>
>          Key: NUTCH-116
>          URL: http://issues.apache.org/jira/browse/NUTCH-116
>      Project: Nutch
>         Type: Test
>   Components: fetcher, indexer, searcher
>     Versions: 0.8-dev
>     Reporter: Paul Baclace
>      Fix For: 0.8-dev
>  Attachments: TestNDFS.java, TestNDFS.java,
> comments_msgs_and_local_renames_during_TestNDFS.patch,
> required_by_TestNDFS.patch, required_by_TestNDFS_v2.patch,
> required_by_TestNDFS_v3.patch
>
> TestNDFS is a JUnit test for NDFS using "pseudo multiprocessing" (or more
> strictly, pseudo distributed), meaning all daemons run in one process and
> sockets are used to communicate between daemons.
> The test permutes various block sizes, numbers of files, file sizes, and
> numbers of datanodes. After creating 1 or more files and filling them with
> random data, one datanode is shut down, and then the files are verified.
> Next, all the random test files are deleted and we test for leakage
> (non-deletion) by directly checking the real directories corresponding to
> the datanodes still running.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
Re: NDFS/MapReduce?
Check out the latest source from svn and use the branch called mapred. This url gives you a kick start for installing a map reduce system on several boxes:

http://wiki.media-style.com/display/nutchDocu/setup+a+map+reduce+multi+box+system

The 0.8 branch works very well for me, but for sure there are some bugs, as in 0.7 - feel free to find and report them. ;-)

HTH,
Stefan

On 01.12.2005, at 21:20, Goldschmidt, Dave wrote:

> Hello,
>
> I've been working with Nutch 0.7.1 for the last few months - very cool
> and impressive tool! I'm now on the verge of going to a distributed
> environment. Should I go to the latest nightly build that includes NDFS
> or stick with 0.7.1? What are the disadvantages to using 0.7.1 in a
> distributed manner?
>
> I'd like to go with the latest build, but where does the latest build
> stand - i.e. what doesn't work yet? ;-)
>
> Thanks all,
> DaveG
NDFS/MapReduce?
Hello,

I've been working with Nutch 0.7.1 for the last few months - very cool and impressive tool! I'm now on the verge of going to a distributed environment. Should I go to the latest nightly build that includes NDFS or stick with 0.7.1? What are the disadvantages to using 0.7.1 in a distributed manner?

I'd like to go with the latest build, but where does the latest build stand - i.e. what doesn't work yet? ;-)

Thanks all,
DaveG
Re: Urlfilter Patch
Suggestion:

For consistency purposes, and ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file extensions associated with each content-type, we can build a list of file extensions to include (all others will be excluded) in the fetch process. No?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
incremental crawling
It would be good to improve the support for incremental crawling in Nutch. Here are some ideas about how we might implement it. Andrzej has posted in the past about this, so he probably has better ideas.

Incremental crawling could proceed as follows:

1. Bootstrap with a batch crawl, using the 'crawl' command. Modify CrawlDatum to store the MD5Hash of the content of fetched urls.

2. Reduce the fetch interval for high-scoring urls. If the default is monthly, then the top-scoring 1% of urls might be set to daily, and the top-scoring 10% of urls might be set to weekly.

3. Generate a fetch list & fetch it. When the url has been previously fetched, and its content is unchanged, increase its fetch interval by an amount, e.g., 50%. If the content is changed, decrease the fetch interval. The percentage of increase and decrease might be influenced by the url's score.

4. Update the crawl db & link db, index the new segment, dedup, etc. When updating the crawl db, scores for existing urls should not change, since the scoring method we're using (OPIC) assumes each page is fetched only once.

Steps 3 & 4 can be packaged as an 'update' command. Step 2 can be included in the 'crawl' command, so that crawled indexes are always ready for update.

Comments?

Doug
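A minimal sketch of step 3's interval adjustment, comparing the stored MD5Hash with a hash of the newly fetched content; the 50% increase comes from the text above, while the halving on change is an assumed example value, and the score-based weighting is left out:

import java.security.MessageDigest;

// Sketch of step 3: compare the new content hash with the stored one and
// stretch or shrink the fetch interval accordingly.
public class AdaptiveInterval {
  public static long adjustInterval(byte[] storedMd5, byte[] newContent,
                                    long currentIntervalMs) throws Exception {
    byte[] newMd5 = MessageDigest.getInstance("MD5").digest(newContent);
    if (MessageDigest.isEqual(storedMd5, newMd5)) {
      return (long) (currentIntervalMs * 1.5);  // unchanged: back off by 50%
    }
    return (long) (currentIntervalMs * 0.5);    // changed: refetch sooner (assumed factor)
  }
}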
Re: Urlfilter Patch
On Thu, 2005-12-01 at 18:53 +, Howie Wang wrote:
> ...And .xhtml seem like they would be parsable by the default HTML parser.

Ditto for .xml. It is a valid (though seldom used) xhtml extension.

> Howie
>
> > From: Doug Cutting <[EMAIL PROTECTED]>
> >
> > Ken Krugler wrote:
> >> For what it's worth, below is the filter list we're using for doing an
> >> html-centric crawl (no word docs, for example). Using the (?i) means we
> >> don't need to have upper & lower-case versions of the suffixes.
> >>
> >> -(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
> >
> > This looks like a more complete suffix list.
> >
> > Should we use this as the default? By default only html and text parsers
> > are enabled, so perhaps that's all we should accept.
> >
> > Why do you exclude .php urls? These are simply dynamic pages, no?
> > Similarly, .jsp and .py are frequently suffixes that return html. Are
> > there other suffixes we should remove from this list before we make it
> > the default exclusion list?
> >
> > Doug

--
Rod Taylor <[EMAIL PROTECTED]>
Re: Urlfilter Patch
.pl files are often just perl CGI scripts. And .xhtml seems like it would be parsable by the default HTML parser.

Howie

> From: Doug Cutting <[EMAIL PROTECTED]>
>
> Ken Krugler wrote:
>> For what it's worth, below is the filter list we're using for doing an
>> html-centric crawl (no word docs, for example). Using the (?i) means we
>> don't need to have upper & lower-case versions of the suffixes.
>>
>> -(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
>
> This looks like a more complete suffix list.
>
> Should we use this as the default? By default only html and text parsers
> are enabled, so perhaps that's all we should accept.
>
> Why do you exclude .php urls? These are simply dynamic pages, no?
> Similarly, .jsp and .py are frequently suffixes that return html. Are
> there other suffixes we should remove from this list before we make it
> the default exclusion list?
>
> Doug
Re: Urlfilter Patch
Ken Krugler wrote:
> For what it's worth, below is the filter list we're using for doing an
> html-centric crawl (no word docs, for example). Using the (?i) means we
> don't need to have upper & lower-case versions of the suffixes.
>
> -(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$

This looks like a more complete suffix list.

Should we use this as the default? By default only html and text parsers are enabled, so perhaps that's all we should accept.

Why do you exclude .php urls? These are simply dynamic pages, no? Similarly, .jsp and .py are frequently suffixes that return html. Are there other suffixes we should remove from this list before we make it the default exclusion list?

Doug
[jira] Resolved: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)
[ http://issues.apache.org/jira/browse/NUTCH-130?page=all ]

Doug Cutting resolved NUTCH-130:
--------------------------------
    Fix Version: 0.8-dev
     Resolution: Fixed
      Assign To: Doug Cutting

I just committed this. I moved the version to the default.properties file, and found a few other places where javac is called.

> Be explicit about target JVM when building (1.4.x?)
> ----------------------------------------------------
>
>          Key: NUTCH-130
>          URL: http://issues.apache.org/jira/browse/NUTCH-130
>      Project: Nutch
>         Type: Improvement
>     Reporter: [EMAIL PROTECTED]
>     Assignee: Doug Cutting
>     Priority: Minor
>      Fix For: 0.8-dev
>
> Below is a patch for the nutch build.xml. It stipulates that the target JVM
> is 1.4.x. Without an explicit target, a nutch built with 1.5.x java defaults
> to a 1.5.x java target and won't run in a 1.4.x JVM. This can be annoying.
> (From the ant javac doc, regarding the target attribute: "We highly
> recommend to always specify this attribute.")
>
> [debord 282] nutch > svn diff -u build.xml
> Subcommand 'diff' doesn't accept option '-u [--show-updates]'
> Type 'svn help diff' for usage.
> [debord 283] nutch > svn diff build.xml
> Index: build.xml
> ===================================================================
> --- build.xml (revision 349779)
> +++ build.xml (working copy)
> @@ -72,6 +72,8 @@
>             destdir="${build.classes}"
>             debug="${debug}"
>             optimize="${optimize}"
> +           target="1.4"
> +           source="1.4"
>             deprecation="${deprecation}">

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira