Re: incremental crawling

2005-12-01 Thread Jack Tang
Hi Doug

1. How should we deal with "dead urls"? If I remove a url after Nutch's
first crawl, should Nutch keep the "dead urls" and never fetch them
again?
2. Should Nutch expose dedup as an extension point? In my project we add
an information extraction layer on top of Nutch, and I think it would be a
good idea to expose dedup as an extension point, since we could build our
own "duplicates rule" based on the extracted data objects; the default, of
course, would remain the page url.
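(To make this concrete, here is a rough sketch of what such an extension
point could look like -- the interface name and method are purely
hypothetical, not existing Nutch API. The default implementation would key
on the page url; our layer would key on the extracted data object.)

  // Hypothetical dedup extension point -- a sketch only, not part of Nutch today.
  public interface DeduplicationKeyGenerator {
    // Returns the key used to detect duplicates: the page url by default,
    // or an identifier derived from the extracted data object.
    String generateKey(String url, byte[] content);
  }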

Thoughts?
/Jack

On 12/2/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> It would be good to improve the support for incremental crawling added
> to Nutch.  Here are some ideas about how we might implement it.  Andrzej
> has posted in the past about this, so he probably has better ideas.
>
> Incremental crawling could proceed as follows:
>
> 1. Bootstrap with a batch crawl, using the 'crawl' command.  Modify
> CrawlDatum to store the MD5Hash of the content of fetched urls.
>
> 2. Reduce the fetch interval for high-scoring urls.  If the default is
> monthly, then the top-scoring 1% of urls might be set to daily, and the
> top-scoring 10% of urls might be set to weekly.
>
> 3. Generate a fetch list & fetch it.  When the url has been previously
> fetched, and its content is unchanged, increase its fetch interval by an
> amount, e.g., 50%.  If the content is changed, decrease the fetch
> interval.  The percentage of increase and decrease might be influenced
> by the url's score.
>
> 4. Update the crawl db & link db, index the new segment, dedup, etc.
> When updating the crawl db, scores for existing urls should not change,
> since the scoring method we're using (OPIC) assumes each page is fetched
> only once.
>
> Steps 3 & 4 can be packaged as an 'update' command.  Step 2 can be
> included in the 'crawl' command, so that crawled indexes are always
> ready for update.
>
> Comments?
>
> Doug
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: Urlfilter Patch

2005-12-01 Thread Matt Kangas
Totally agreed. Neither approach replaces the other. I just wanted to  
mention the possibility so people don't over-focus on trying to build a  
hyper-optimized regex list. :)


For the content provider, an HTTP HEAD request saves them bandwidth  
if we don't do a GET. That's some cost savings for them over doing a  
blind fetch (esp. if we discard it).


I guess the question is, what's worse:
- two server hits when we find content we want, or
- spending bandwidth on pages that the Nutch installation will ignore  
anyway?


--matt

On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:


Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD   
before the HTTP GET, and determine the mime-type before actually   
grabbing the content.
It's not how Nutch works now, but this might be more useful than  
a  super-detailed set of regexes...


This could be a useful addition, but it could not replace url-based  
filters.  A HEAD request must still be polite, so this could  
substantially slow fetching, as it would incur more delays.  Also,  
for most dynamic pages, a HEAD is as expensive for the server as a  
GET, so this would cause more load on servers.


Doug


--
Matt Kangas / [EMAIL PROTECTED]




RE: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Hi Jerome,

> Yes, the fetcher can't rely on the document mime-type.
> The only thing we can use for filtering is the document's URL.
> So another alternative could be to exclude only the file extensions that
> are registered in the mime-type repository
> (some well-known file extensions) but for which no parser is activated,
> and to accept all other ones.
> That way the .foo files would still be fetched...

Yup, the key phrase is "well known". It would sort of be an optimization, or
heuristic, to save some work on the regex...

Cheers,
  Chris


> 
> Jérôme



RE: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Hi Doug,

> 
> Chris Mattmann wrote:
> >   In principle, the mimeType system should give us some guidance on
> > determining the appropriate mimeType for the content, regardless of
> whether
> > it ends in .foo, .bar or the like.
> 
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Duh, you're right. Sorry about that. 

Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD  
> before the HTTP GET, and determine the mime-type before actually  
> grabbing the content.
> 
> It's not how Nutch works now, but this might be more useful than a 
> super-detailed set of regexes...


I liked Matt's idea of the HEAD request, though. Some benchmarks of its
performance would be useful, because in some cases (such as focused
crawling, or "non-whole-internet" crawling such as intranets), it would seem
that the performance penalty of performing the HEAD to get the
content-type would be acceptable, and worth the cost...

Cheers,
  Chris





Re: Urlfilter Patch

2005-12-01 Thread Ken Krugler

Agreed - looks like this list is too aggressive. A better one would be:

-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|png|pps|ppt|ps|psd|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$

This removes xhtml, xml, php, jsp, py, pl, and cgi.

We've seen php/jsp/py/pl/cgi in our error logs as un-parsable, but it 
looks like most cases are when the server is misconfigured and 
winds up returning the source code, as opposed to the result of 
executing the code.
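As a quick sanity check of the revised pattern (the URLs below are made
up, and the leading '-' from regex-urlfilter.txt is dropped since it only
marks the rule as an exclusion), a few lines of Java show that .php and
.jsp now get through while .pdf is still rejected:

  import java.util.regex.Pattern;

  public class UrlFilterCheck {
    // the revised case-insensitive suffix exclusion pattern from above
    static final Pattern EXCLUDE = Pattern.compile(
        "(?i)\\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe"
        + "|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|lha|md5|mov|mp3|mp4"
        + "|mpg|msi|ogg|pdf|png|pps|ppt|ps|psd|ram|rdf|rm|rpm|rss|rtf|sit|swf"
        + "|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\\)?$");

    public static void main(String[] args) {
      String[] urls = {
        "http://example.com/index.php",   // accepted: php no longer excluded
        "http://example.com/page.jsp",    // accepted: jsp no longer excluded
        "http://example.com/report.PDF"   // rejected: pdf still on the list
      };
      for (int i = 0; i < urls.length; i++) {
        boolean rejected = EXCLUDE.matcher(urls[i]).find();
        System.out.println(urls[i] + " -> " + (rejected ? "reject" : "accept"));
      }
    }
  }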


-- Ken


On Thu, 2005-12-01 at 18:53 +, Howie Wang wrote:

 .pl files are often just perl CGI scripts. And .xhtml seem like they
 would be parsable by the default HTML parser.


Ditto for .xml. It is a valid (though seldom used) xhtml extension.


 Howie

 >From: Doug Cutting <[EMAIL PROTECTED]>
 >
 >Ken Krugler wrote:
 >>For what it's worth, below is the filter list we're using for doing an
 >>html-centric crawl (no word docs, for example). Using the (?i) means we
 >>don't need to have upper & lower-case versions of the suffixes.
 >>

 >>-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
 >
 >This looks like a more complete suffix list.
 >
 >Should we use this as the default?  By default only html and text parsers
 >are enabled, so perhaps that's all we should accept.
 >
 >Why do you exclude .php urls?  These are simply dynamic pages, no?
 >Similarly, .jsp and .py are frequently suffixes that return html.  Are
 >there other suffixes we should remove from this list before we make it the
 >default exclusion list?
 >
 >Doug




--
Rod Taylor <[EMAIL PROTECTED]>



--
Ken Krugler
Krugle, Inc.
+1 530-470-9200


Re: Urlfilter Patch

2005-12-01 Thread Ken Krugler

Suggestion:
For consistency purposes, and ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file extensions associated with each content-type, we can
build a list of file extensions to include (other ones will be excluded) in
the fetch process.


I'd asked a Nutch consultant this exact same question a few months ago.

It does seem odd that there's an implicit dependency between the file 
suffixes found in regex-urlfilter.txt and the enabled plug-ins found 
in nutch-default.xml and nutch-site.xml. What's the point of 
downloading a 100MB .bz2 file if there's nobody available to handle 
it?


It's also odd that there's a nutch-site.xml, but no equivalent for 
regex-urlfilter.txt.


There are cases where some suffixes (like .php) can return any 
kind of mime-type content, and other suffixes (like .xml) can 
mean any number of things. So I think you'd still want 
regex-urlfilter.txt files (both a default and a site version) that 
provide explicit additions/deletions to the list generated from the 
installed and enabled parse-plugins.
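For example, a hypothetical site-level override file (the file name is made
up -- today only regex-urlfilter.txt exists) could use the same +/- prefix
syntax, with rules checked in order and the first match winning:

  # regex-urlfilter-site.txt (hypothetical), consulted before the generated list
  # accept .foo urls from a content system we know returns html
  +\.foo$
  # reject .xml feeds even though an xml parser may be enabled
  -\.xml$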


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200


Re: incremental crawling

2005-12-01 Thread Matthias Jaekle
3. Generate a fetch list & fetch it.  When the url has been previously 
fetched, and its content is unchanged, increase its fetch interval by an 
amount, e.g., 50%.  If the content is changed, decrease the fetch 
interval.  The percentage of increase and decrease might be influenced 
by the url's score.

Hi,

if we tracked the amount of change in this way, we could also 
prefer pages in the ranking algorithm which change more often.
Frequently changing pages might be more up-to-date and could have a 
higher value than pages that never change.
Also, pages which are unchanged for a long time might go out of date 
and lose a little bit of their general scoring.
So, maybe the fetch interval value could be used as a multiplier for 
boosting pages in the final result set.
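One simple way to express that, as a sketch (the formula and names are
made up, not an existing Nutch scoring rule): treat the ratio of the default
interval to the page's current fetch interval as a boost factor, clamped so
freshness only nudges the score rather than dominating it.

  // Sketch only: derive an index-time boost from the adaptive fetch interval.
  // Both intervals are in days; the clamp range is arbitrary.
  public static float freshnessBoost(float defaultIntervalDays, float fetchIntervalDays) {
    float ratio = defaultIntervalDays / fetchIntervalDays; // > 1 for frequently changing pages
    return Math.max(0.8f, Math.min(1.5f, ratio));          // clamp to [0.8, 1.5]
  }

  // e.g. finalScore = opicScore * freshnessBoost(30f, pageFetchIntervalDays);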


Matthias


Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting

Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD  before 
the HTTP GET, and determine the mime-type before actually  grabbing the 
content.


It's not how Nutch works now, but this might be more useful than a  
super-detailed set of regexes...


This could be a useful addition, but it could not replace url-based 
filters.  A HEAD request must still be polite, so this could 
substantially slow fetching, as it would incur more delays.  Also, for 
most dynamic pages, a HEAD is as expensive for the server as a GET, so 
this would cause more load on servers.


Doug


Re: Urlfilter Patch

2005-12-01 Thread Matt Kangas
The latter is not strictly true. Nutch could issue an HTTP HEAD  
before the HTTP GET, and determine the mime-type before actually  
grabbing the content.


It's not how Nutch works now, but this might be more useful than a  
super-detailed set of regexes...


[EMAIL PROTECTED]:~$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.localdomain.
Escape character is '^]'.
HEAD / HTTP/1.0

HTTP/1.1 200 OK
Date: Thu, 01 Dec 2005 21:25:38 GMT
Server: Apache/2.0
Connection: close
Content-Type: text/html; charset=UTF-8

Connection closed by foreign host
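For reference, the same check done programmatically -- a minimal sketch
using java.net.HttpURLConnection (this is not how Nutch's protocol plugins
are structured; politeness and error handling are omitted):

  import java.net.HttpURLConnection;
  import java.net.URL;

  public class HeadCheck {
    // Returns the Content-Type reported by a HEAD request, e.g. "text/html; charset=UTF-8".
    public static String contentType(String url) throws Exception {
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      conn.setRequestMethod("HEAD");   // headers only, no body transferred
      try {
        return conn.getContentType();
      } finally {
        conn.disconnect();
      }
    }

    public static void main(String[] args) throws Exception {
      // only schedule the full GET if a parser is enabled for this type
      System.out.println("Content-Type: " + contentType("http://localhost/"));
    }
  }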



On Dec 1, 2005, at 4:21 PM, Doug Cutting wrote:


Chris Mattmann wrote:

  In principle, the mimeType system should give us some guidance on
determining the appropriate mimeType for the content, regardless  
of whether

it ends in .foo, .bar or the like.


Right, but the URL filters run long before we know the mime type,  
in order to try to keep us from fetching lots of stuff we can't  
process. The mime type is not known until we've fetched it.


Doug


--
Matt Kangas / [EMAIL PROTECTED]




Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Yes, the fetcher can't rely on the document mime-type.
The only thing we can use for filtering is the document's URL.
So another alternative could be to exclude only the file extensions that are
registered in the mime-type repository
(some well-known file extensions) but for which no parser is activated, and
to accept all other ones.
That way the .foo files would still be fetched...

Jérôme


Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting

Chris Mattmann wrote:

  In principle, the mimeType system should give us some guidance on
determining the appropriate mimeType for the content, regardless of whether
it ends in .foo, .bar or the like.


Right, but the URL filters run long before we know the mime type, in 
order to try to keep us from fetching lots of stuff we can't process. 
The mime type is not known until we've fetched it.


Doug


Re: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Hi Doug,


On 12/1/05 1:11 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Jérôme Charron wrote:
[...]
> 
> What about a site that develops a content system that has urls that end
> in .foo, which we would exclude, even though they return html?
> 
> Doug

  In principle, the mimeType system should give us some guidance on
determining the appropriate mimeType for the content, regardless of whether
it ends in .foo, .bar or the like. I'm not sure if the mime type registry is
there yet, but I know that Jerome was working on a major update that would
help in recognizing these types of situations. Of course, efficiency comes
into play as well, in terms of not slowing down the fetch/parse, but it
would be nice to have a general solution that made use of the information
available in parse-plugins.xml to determine the appropriate set of allowed
extensions in a URLFilter, if possible. It may be a pipe dream, but I'd say
it's worth exploring...

Cheers,
  Chris



__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory          Pasadena, CA
Office: 171-266B                   Mailstop: 171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting

Jérôme Charron wrote:

For consistency purposes, and ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file extensions associated with each content-type, we can
build a list of file extensions to include (other ones will be excluded) in
the fetch process.
No?


What about a site that develops a content system that has urls that end 
in .foo, which we would exclude, even though they return html?


Doug


Re: [Nutch-dev] incremental crawling

2005-12-01 Thread Jérôme Charron
Sounds really good (and it is requested by a lot of nutch users!).
+1

Jérôme

On 12/1/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Matt Kangas wrote:
> > #2 should be a pluggable/hookable parameter. "high-scoring" sounds  like
> > a reasonable default basis for choosing recrawl intervals, but  I'm sure
> > that nearly everyone will think of a way to improve upon  that for their
> > particular system.
> >
> > e.g. "high-scoring" ain't gonna cut it for my needs. (0.5 wink ;)
>
> In NUTCH-61, Andrzej has a pluggable FetchSchedule.  That looks like a
> good idea.
>
> http://issues.apache.org/jira/browse/NUTCH-61
>
> Doug
>



--
http://motrech.free.fr/
http://www.frutch.org/


Re: Urlfilter Patch

2005-12-01 Thread Piotr Kosiorowski

Jérôme Charron wrote:
[...]

build a list of file extensions to include (other ones will be excluded) in
the fetch process.

[...]
I would not like to exclude all others - as for example many extensions 
are valid for html - especially dynamically generated pages (jsp, asp, cgi, 
just to name the easy ones, and a lot of custom ones).  But the idea of 
automatically allowing extensions for which plugins are enabled is good 
in my opinion.
Anyway, I will try to find my own list of forbidden extensions that I prepared 
based on 80 million urls - I just prepared the list of the most common ones 
and went through it manually. I will try to find it over the weekend so we 
can combine it with the list discussed in this thread.

P.




Re: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Jerome,

 I think that this is a great idea, and it ensures that there isn't replication
of so-called "management information" across the system. It could be easily
implemented as a utility method, because we have utility java classes that
represent the ParsePluginList and that you could get the mimeTypes from.
Additionally, we could create a utility method that searches the extension
point list for parsing plugins and returns whether or not they are
activated. Using this information, I believe that the url filtering would
be a snap.
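A rough sketch of the kind of utility described above (class and method
names are hypothetical, and the maps stand in for the real ParsePluginList
and mime-type registries): keep only the content-types whose parsing plugin
is activated, and collect the file extensions registered for those types.

  import java.util.HashSet;
  import java.util.Iterator;
  import java.util.Map;
  import java.util.Set;

  public class AllowedExtensions {
    // typeToPluginId:   content-type -> parsing plugin id (from parse-plugins.xml)
    // activatedPlugins: ids of the plugins enabled in the configuration
    // typeToExtensions: content-type -> Set of file extensions (mime-type repository)
    public static Set build(Map typeToPluginId, Set activatedPlugins,
                            Map typeToExtensions) {
      Set allowed = new HashSet();
      Iterator it = typeToPluginId.entrySet().iterator();
      while (it.hasNext()) {
        Map.Entry entry = (Map.Entry) it.next();
        if (activatedPlugins.contains(entry.getValue())) {       // parser is activated
          Set exts = (Set) typeToExtensions.get(entry.getKey()); // extensions for this type
          if (exts != null) {
            allowed.addAll(exts);
          }
        }
      }
      return allowed;   // e.g. {"html", "htm", "txt"} with the default parsers
    }
  }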

 

+1

Cheers,
  Chris



On 12/1/05 12:11 PM, "Jérôme Charron" <[EMAIL PROTECTED]> wrote:

> Suggestion:
> For consistency purposes, and ease of Nutch management, why not filter the
> extensions based on the activated plugins?
> By looking at the mime-types defined in the parse-plugins.xml file and the
> activated plugins, we know which content-types will be parsed.
> So, by getting the file extensions associated with each content-type, we can
> build a list of file extensions to include (other ones will be excluded) in
> the fetch process.
> No?
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory          Pasadena, CA
Office: 171-266B                   Mailstop: 171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: [Nutch-dev] incremental crawling

2005-12-01 Thread Doug Cutting

Matt Kangas wrote:
#2 should be a pluggable/hookable parameter. "high-scoring" sounds  like 
a reasonable default basis for choosing recrawl intervals, but  I'm sure 
that nearly everyone will think of a way to improve upon  that for their 
particular system.


e.g. "high-scoring" ain't gonna cut it for my needs. (0.5 wink ;)


In NUTCH-61, Andrzej has a pluggable FetchSchedule.  That looks like a 
good idea.


http://issues.apache.org/jira/browse/NUTCH-61
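(For readers who haven't opened the issue: the idea is roughly an extension
point of the following shape -- this is an illustrative sketch only, not the
actual interface in Andrzej's patch -- so the recrawl policy becomes
pluggable instead of hard-wired to score.)

  // Illustrative sketch; see NUTCH-61 for the real FetchSchedule work.
  public interface FetchSchedule {
    // Compute the next fetch interval (in days) for a page, given whether its
    // content changed on the last fetch and its current score.
    float calculateFetchInterval(String url, float currentIntervalDays,
                                 boolean changed, float score);
  }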

Doug


Re: [Nutch-dev] incremental crawling

2005-12-01 Thread Matt Kangas
#2 should be a pluggable/hookable parameter. "high-scoring" sounds  
like a reasonable default basis for choosing recrawl intervals, but  
I'm sure that nearly everyone will think of a way to improve upon  
that for their particular system.


e.g. "high-scoring" ain't gonna cut it for my needs. (0.5 wink ;)

--matt

On Dec 1, 2005, at 2:15 PM, Doug Cutting wrote:

It would be good to improve the support for incremental crawling  
added to Nutch.  Here are some ideas about how we might implement  
it.  Andrzej has posted in the past about this, so he probably has  
better ideas.


Incremental crawling could proceed as follows:

1. Bootstrap with a batch crawl, using the 'crawl' command.  Modify  
CrawlDatum to store the MD5Hash of the content of fetched urls.


2. Reduce the fetch interval for high-scoring urls.  If the default  
is monthly, then the top-scoring 1% of urls might be set to daily,  
and the top-scoring 10% of urls might be set to weekly.


3. Generate a fetch list & fetch it.  When the url has been  
previously fetched, and its content is unchanged, increase its  
fetch interval by an amount, e.g., 50%.  If the content is changed,  
decrease the fetch interval.  The percentage of increase and  
decrease might be influenced by the url's score.


4. Update the crawl db & link db, index the new segment, dedup,  
etc. When updating the crawl db, scores for existing urls should  
not change, since the scoring method we're using (OPIC) assumes  
each page is fetched only once.


Steps 3 & 4 can be packaged as an 'update' command.  Step 2 can be  
included in the 'crawl' command, so that crawled indexes are always  
ready for update.


Comments?

Doug


--
Matt Kangas / [EMAIL PROTECTED]




[jira] Resolved: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-12-01 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
 
Doug Cutting resolved NUTCH-116:


Fix Version: 0.8-dev
 Resolution: Fixed

I just committed this.  Thanks, Paul, this is great to have!

> TestNDFS a JUnit test specifically for NDFS
> ---
>
>  Key: NUTCH-116
>  URL: http://issues.apache.org/jira/browse/NUTCH-116
>  Project: Nutch
> Type: Test
>   Components: fetcher, indexer, searcher
> Versions: 0.8-dev
> Reporter: Paul Baclace
>  Fix For: 0.8-dev
>  Attachments: TestNDFS.java, TestNDFS.java, 
> comments_msgs_and_local_renames_during_TestNDFS.patch, 
> required_by_TestNDFS.patch, required_by_TestNDFS_v2.patch, 
> required_by_TestNDFS_v3.patch
>
> TestNDFS is a JUnit test for NDFS using "pseudo multiprocessing" (or more 
> strictly, pseudo distributed) meaning all daemons run in one process and 
> sockets are used to communicate between daemons.  
> The test permutes various block sizes, number of files, file sizes, and 
> number of datanodes.  After creating 1 or more files and filling them with 
> random data, one datanode is shut down, and then the files are verified. 
> Next, all the random test files are deleted and we test for leakage 
> (non-deletion) by directly checking the real directories corresponding to the 
> datanodes still running.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: NDFS/MapReduce?

2005-12-01 Thread Stefan Groschupf

Check out the latest source from svn and use the branch called mapred.
This url gives you a kick start to installing a map reduce system on
several boxes:
http://wiki.media-style.com/display/nutchDocu/setup+a+map+reduce+multi+box+system
The 0.8 branch works very well for me, but for sure there are some bugs,
as in 0.7 - feel free to find and report them. ;-)

HTH
Stefan

On 01.12.2005, at 21:20, Goldschmidt, Dave wrote:


Hello,



I've been working with Nutch 0.7.1 for the last few months - very cool
and impressive tool!  I'm now on the verge of going to a distributed
environment.  Should I go to the latest nightly build that includes  
NDFS

or stick with 0.7.1?  What are the disadvantages to using 0.7.1 in a
distributed manner?



I'd like to go with the latest build, but where does the latest build
stand - i.e. what doesn't work yet?  ;-)



Thanks all,

DaveG







NDFS/MapReduce?

2005-12-01 Thread Goldschmidt, Dave
Hello,

 

I've been working with Nutch 0.7.1 for the last few months - very cool
and impressive tool!  I'm now on the verge of going to a distributed
environment.  Should I go to the latest nightly build that includes NDFS
or stick with 0.7.1?  What are the disadvantages to using 0.7.1 in a
distributed manner?

 

I'd like to go with the latest build, but where does the latest build
stand - i.e. what doesn't work yet?  ;-)

 

Thanks all,

DaveG

 



Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Suggestion:
For consistency purposes, and ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file extensions associated with each content-type, we can
build a list of file extensions to include (other ones will be excluded) in
the fetch process.
No?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


incremental crawling

2005-12-01 Thread Doug Cutting
It would be good to improve the support for incremental crawling added 
to Nutch.  Here are some ideas about how we might implement it.  Andrzej 
has posted in the past about this, so he probably has better ideas.


Incremental crawling could proceed as follows:

1. Bootstrap with a batch crawl, using the 'crawl' command.  Modify 
CrawlDatum to store the MD5Hash of the content of fetched urls.


2. Reduce the fetch interval for high-scoring urls.  If the default is 
monthly, then the top-scoring 1% of urls might be set to daily, and the 
top-scoring 10% of urls might be set to weekly.


3. Generate a fetch list & fetch it.  When the url has been previously 
fetched, and its content is unchanged, increase its fetch interval by an 
amount, e.g., 50%.  If the content is changed, decrease the fetch 
interval.  The percentage of increase and decrease might be influenced 
by the url's score.


4. Update the crawl db & link db, index the new segment, dedup, etc. 
When updating the crawl db, scores for existing urls should not change, 
since the scoring method we're using (OPIC) assumes each page is fetched 
only once.


Steps 3 & 4 can be packaged as an 'update' command.  Step 2 can be 
included in the 'crawl' command, so that crawled indexes are always 
ready for update.
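A minimal sketch of the step 3 logic (method and constant names are
illustrative, not the actual CrawlDatum API): compare the stored MD5Hash
with the hash of the newly fetched content, then grow or shrink the
interval, nudged by the page's score.

  // Sketch of the adaptive re-fetch interval from step 3; all names illustrative.
  public class AdaptiveInterval {
    static final float MIN_DAYS = 1.0f;
    static final float MAX_DAYS = 365.0f;

    // oldDays:   current fetch interval in days
    // unchanged: true if the new content hash equals the stored MD5Hash
    // score:     page score; higher-scoring pages adapt more aggressively
    public static float nextInterval(float oldDays, boolean unchanged, float score) {
      float step = 0.5f * Math.min(1.0f, Math.max(0.1f, score)); // base 50%, scaled by score
      float next = unchanged ? oldDays * (1.0f + step)   // unchanged: back off
                             : oldDays * (1.0f - step);  // changed: refetch sooner
      return Math.max(MIN_DAYS, Math.min(MAX_DAYS, next));
    }
  }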


Comments?

Doug


Re: Urlfilter Patch

2005-12-01 Thread Rod Taylor
On Thu, 2005-12-01 at 18:53 +, Howie Wang wrote:
> .pl files are often just perl CGI scripts. And .xhtml seem like they
> would be parsable by the default HTML parser.

Ditto for .xml. It is a valid (though seldom used) xhtml extension.

> Howie
> 
> >From: Doug Cutting <[EMAIL PROTECTED]>
> >
> >Ken Krugler wrote:
> >>For what it's worth, below is the filter list we're using for doing an 
> >>html-centric crawl (no word docs, for example). Using the (?i) means we 
> >>don't need to have upper & lower-case versions of the suffixes.
> >>
> >>-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
> >
> >This looks like a more complete suffix list.
> >
> >Should we use this as the default?  By default only html and text parsers 
> >are enabled, so perhaps that's all we should accept.
> >
> >Why do you exclude .php urls?  These are simply dynamic pages, no? 
> >Similarly, .jsp and .py are frequently suffixes that return html.  Are 
> >there other suffixes we should remove from this list before we make it the 
> >default exclusion list?
> >
> >Doug
> 
> 
> 
-- 
Rod Taylor <[EMAIL PROTECTED]>



Re: Urlfilter Patch

2005-12-01 Thread Howie Wang

.pl  files are often just perl CGI scripts. And .xhtml seem like they
would be parsable by the default HTML parser.

Howie


From: Doug Cutting <[EMAIL PROTECTED]>

Ken Krugler wrote:
For what it's worth, below is the filter list we're using for doing an 
html-centric crawl (no word docs, for example). Using the (?i) means we 
don't need to have upper & lower-case versions of the suffixes.


-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$


This looks like a more complete suffix list.

Should we use this as the default?  By default only html and text parsers 
are enabled, so perhaps that's all we should accept.


Why do you exclude .php urls?  These are simply dynamic pages, no? 
Similarly, .jsp and .py are frequently suffixes that return html.  Are 
there other suffixes we should remove from this list before we make it the 
default exclusion list?


Doug





Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting

Ken Krugler wrote:
For what it's worth, below is the filter list we're using for doing an 
html-centric crawl (no word docs, for example). Using the (?i) means we 
don't need to have upper & lower-case versions of the suffixes.


-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$ 


This looks like a more complete suffix list.

Should we use this as the default?  By default only html and text 
parsers are enabled, so perhaps that's all we should accept.


Why do you exclude .php urls?  These are simply dynamic pages, no? 
Similarly, .jsp and .py are frequently suffixes that return html.  Are 
there other suffixes we should remove from this list before we make it 
the default exclusion list?


Doug


[jira] Resolved: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)

2005-12-01 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-130?page=all ]
 
Doug Cutting resolved NUTCH-130:


Fix Version: 0.8-dev
 Resolution: Fixed
  Assign To: Doug Cutting

I just committed this.  I moved the version to the default.properties file, and 
found a few other places where javac is called.

> Be explicit about target JVM when building (1.4.x?)
> ---
>
>  Key: NUTCH-130
>  URL: http://issues.apache.org/jira/browse/NUTCH-130
>  Project: Nutch
> Type: Improvement
> Reporter: [EMAIL PROTECTED]
> Assignee: Doug Cutting
> Priority: Minor
>  Fix For: 0.8-dev

>
> Below is patch for nutch build.xml.  It stipulates the target JVM is 1.4.x.  
> Without explicit target, a nutch built with 1.5.x java defaults to a 1.5.x 
> java target and won't run in a 1.4.x JVM.  Can be annoying (From the ant 
> javac doc, regards the target attribute: "We highly recommend to always 
> specify this attribute.").
> [debord 282] nutch > svn diff -u build.xml
> Subcommand 'diff' doesn't accept option '-u [--show-updates]'
> Type 'svn help diff' for usage.
> [debord 283] nutch > svn diff build.xml
> Index: build.xml
> ===
> --- build.xml   (revision 349779)
> +++ build.xml   (working copy)
> @@ -72,6 +72,8 @@
>   destdir="${build.classes}"
>   debug="${debug}"
>   optimize="${optimize}"
> + target="1.4"
> + source="1.4"
>   deprecation="${deprecation}">
>
>  

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira