[jira] Commented: (NUTCH-796) Zero results problems difficult to troubleshoot due to lack of logging

2010-03-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847219#action_12847219
 ] 

Hudson commented on NUTCH-796:
--

Integrated in Nutch-trunk #1100 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1100/])
  Zero results problems difficult to troubleshoot due to lack of logging.


> Zero results problems difficult to troubleshoot due to lack of logging
> --
>
> Key: NUTCH-796
> URL: https://issues.apache.org/jira/browse/NUTCH-796
> Project: Nutch
>  Issue Type: Improvement
>  Components: searcher, web gui
>Affects Versions: 1.0.0, 1.1
> Environment: Linux, x86, nutch searcher and nutch webapps, v1.0, v1.1
>Reporter: Jesse Hires
>Assignee: Andrzej Bialecki 
> Fix For: 1.1
>
> Attachments: logging.patch
>
>
> There are a few places where search can fail in a distributed environment, 
> but when the configuration is not quite right, there are no error indications 
> or log messages.
> Increased logging of failures would help troubleshoot such problems, as well 
> as reduce the number of "I get 0 results, why?" questions that come across the 
> mailing lists. 
> Areas where logging would be helpful:
> search app cannot locate search-servers.txt
> search app cannot find a searcher node listed in search-servers.txt
> search app cannot connect to the port on the searcher specified in search-servers.txt
> searcher (bin/nutch server...) cannot find the index
> searcher cannot find segments
> Access denied in any of the above scenarios.
> There are probably more that would be helpful, but I am not yet familiar enough 
> to know all the points of possible failure between the webpage and a search node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2010-03-18 Thread Andrew McCall (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847153#action_12847153
 ] 

Andrew McCall commented on NUTCH-693:
-

[http://en.wikipedia.org/wiki/Nofollow]

I don't think there is really any consensus on this standard, to be honest. Most 
search engines don't index nofollow links per se, but they do follow them for 
crawling. Even Google, who first proposed nofollow, sometimes actually does 
follow them, according to some tests linked in the Wikipedia article. The results 
show that if the link is already in the index (e.g. it has been followed from 
elsewhere) then it does get followed and indexed. 

The nofollow is really just a keyword to point out that the link isn't being 
endorsed by the author - it's more a content guideline than a strict order for 
robots to obey. So I disagree that you're breaking standards or creating a 
robot that's not well behaved by ignoring it. 

I would have liked to do a bit more with this so that I could have respected 
nofollow but injected the URL as a brand-new seed URL; however, other 
commitments took over and I never got around to it. Since the ideal nofollow 
behaviour is somewhere between ignoring them and not ignoring them, I figured 
the option to ignore them was a good start and submitted the patch, but I'm not 
precious about it.

> Add configurable option for treating nofollow behaviour.
> 
>
> Key: NUTCH-693
> URL: https://issues.apache.org/jira/browse/NUTCH-693
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Andrew McCall
>Assignee: Otis Gospodnetic
>Priority: Minor
> Attachments: nutch.nofollow.patch
>
>
> For my purposes I'd like to follow links even if they're marked nofollow. 
> Ideally I'd like to follow them, but not pass the link juice between them. 
> I've attached a patch that adds a configuration element, 
> parser.html.outlinks.ignore_nofollow, which allows the parser to ignore the 
> nofollow attributes on a page. 
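For reference, a minimal sketch of how the configuration element described above could be 
switched on in nutch-site.xml, assuming the property name introduced by the attached patch 
and a default of false:

{code:xml}
<property>
  <name>parser.html.outlinks.ignore_nofollow</name>
  <value>true</value>
  <description>If true, the HTML parser ignores rel="nofollow" hints and
  records such links as outlinks anyway (behaviour added by
  nutch.nofollow.patch; default assumed to be false).</description>
</property>
{code}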

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-780) Nutch crawler did not read configuration files

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847094#action_12847094
 ] 

Andrzej Bialecki  commented on NUTCH-780:
-

Is the purpose of this issue to make Crawl.java usable via a strongly-typed API 
instead of the generic main(), e.g. something like this:

{code}
public class Crawl extends Configured {

  public int crawl(Path output, Path seedDir, int threads, int numCycles, int topN, ...) {
    ...
  }
}
{code}
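A caller like the one quoted in the description below could then invoke the crawl directly 
instead of assembling a command-line string. A minimal sketch, assuming the hypothetical 
typed API above (the method does not exist yet; variable names are taken from the quoted 
NutchCrawler.java snippet):

{code:java}
// Hypothetical usage of the typed API sketched above - not an existing Nutch API.
Crawl crawl = new Crawl();
crawl.setConf(NutchConfiguration.create());
int res = crawl.crawl(new Path(domainlist + SUBFIX_CRAWLED), // crawl output directory
                      new Path(domainlist),                  // seed directory
                      threads, depth, topN);
{code}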

> Nutch crawler did not read configuration files
> --
>
> Key: NUTCH-780
> URL: https://issues.apache.org/jira/browse/NUTCH-780
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.0
>Reporter: Vu Hoang
> Attachments: NUTCH-780.patch
>
>
> The Nutch searcher can read properties in its constructor ...
> {code:java|title=NutchSearcher.java|borderStyle=solid}
> NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
> ... // put search engine code here
> {code}
> ... but the Nutch crawler does not; it only reads data from its arguments.
> {code:java|title=NutchCrawler.java|borderStyle=solid}
> StringBuilder builder = new StringBuilder();
> builder.append(domainlist + SPACE);
> builder.append(ARGUMENT_CRAWL_DIR);
> builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
> builder.append(ARGUMENT_CRAWL_THREADS);
> builder.append(threads + SPACE);
> builder.append(ARGUMENT_CRAWL_DEPTH);
> builder.append(depth + SPACE);
> builder.append(ARGUMENT_CRAWL_TOPN);
> builder.append(topN + SPACE);
> Crawl.main(builder.toString().split(SPACE));
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-795) Add ability to maintain nofollow attribute in linkdb

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847075#action_12847075
 ] 

Andrzej Bialecki  commented on NUTCH-795:
-

Please see my comment on that issue. Or is there some other use case that you 
have in mind?

> Add ability to maintain nofollow attribute in linkdb
> 
>
> Key: NUTCH-795
> URL: https://issues.apache.org/jira/browse/NUTCH-795
> Project: Nutch
>  Issue Type: New Feature
>  Components: linkdb
>Affects Versions: 1.1
>Reporter: Sammy Yu
> Attachments: 0001-Updated-with-nofollow-support-for-Outlinks.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847074#action_12847074
 ] 

Andrzej Bialecki  commented on NUTCH-693:
-

This patch is controversial in the sense that a) Nutch strives to adhere to 
Internet standards and netiquette, which say that robots should obey nofollow, 
and b) most Nutch users want a well-behaved robot. You are of course free to 
modify the source as you did. Therefore I think that this functionality is not 
applicable to the majority of Nutch users, and I vote -1 on including it in Nutch.

> Add configurable option for treating nofollow behaviour.
> 
>
> Key: NUTCH-693
> URL: https://issues.apache.org/jira/browse/NUTCH-693
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Andrew McCall
>Assignee: Otis Gospodnetic
>Priority: Minor
> Attachments: nutch.nofollow.patch
>
>
> For my purposes I'd like to follow links even if they're marked nofollow. 
> Ideally I'd like to follow them, but not pass the link juice between them. 
> I've attached a patch that adds a configuration element, 
> parser.html.outlinks.ignore_nofollow, which allows the parser to ignore the 
> nofollow attributes on a page. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-800) Generator builds a URL list that is not encoded

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847071#action_12847071
 ] 

Andrzej Bialecki  commented on NUTCH-800:
-

I'm puzzled by your problem description. Is Nutch affected by potentially 
malicious URL data? URL form encoding is just a transport encoding; it doesn't 
make a URL inherently safe (or unsafe).

> Generator builds a URL list that is not encoded
> ---
>
> Key: NUTCH-800
> URL: https://issues.apache.org/jira/browse/NUTCH-800
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.7.3, 0.9.0, 
> 1.0.0, 1.1
>Reporter: Jesse Campbell
>
> The URL string that is grabbed by the generator when creating the fetch list 
> does not get encoded, which could potentially allow unsafe execution, and it 
> breaks reading improperly encoded URLs from the scraped pages.
> Since we a) cannot guarantee that any site we scrape is not malicious, and b) 
> likely do not have control over all content providers, we are currently 
> forced to use a regex normalizer to perform the same function as a built-in 
> Java class (leaving it alone would be unsafe).
> A quick solution would be to update Generator.java to use the 
> java.net.URLEncoder class:
> line 187: 
> old: String urlString = url.toString();
> new: String urlString = URLEncoder.encode(url.toString(),"UTF-8");
> line 192:
> old: u = new URL(url.toString());
> new: u = new URL(urlString);
> The use of URLEncoder.encode could also be applied at the updatedb stage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-796) Zero results problems difficult to troubleshoot due to lack of logging

2010-03-18 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-796.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Andrzej Bialecki 

Patch applied in rev. 924945. Thanks for reporting it.

> Zero results problems difficult to troubleshoot due to lack of logging
> --
>
> Key: NUTCH-796
> URL: https://issues.apache.org/jira/browse/NUTCH-796
> Project: Nutch
>  Issue Type: Improvement
>  Components: searcher, web gui
>Affects Versions: 1.0.0, 1.1
> Environment: Linux, x86, nutch searcher and nutch webapps, v1.0, v1.1
>Reporter: Jesse Hires
>Assignee: Andrzej Bialecki 
> Fix For: 1.1
>
> Attachments: logging.patch
>
>
> There are a few places where search can fail in a distributed environment, 
> but when the configuration is not quite right, there are no error indications 
> or log messages.
> Increased logging of failures would help troubleshoot such problems, as well 
> as reduce the number of "I get 0 results, why?" questions that come across the 
> mailing lists. 
> Areas where logging would be helpful:
> search app cannot locate search-servers.txt
> search app cannot find a searcher node listed in search-servers.txt
> search app cannot connect to the port on the searcher specified in search-servers.txt
> searcher (bin/nutch server...) cannot find the index
> searcher cannot find segments
> Access denied in any of the above scenarios.
> There are probably more that would be helpful, but I am not yet familiar enough 
> to know all the points of possible failure between the webpage and a search node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Crawling authenticated websites !

2010-03-18 Thread Susam Pal
On Thu, Mar 18, 2010 at 7:27 PM, Ranganath Cuddapah wrote:
> Hello,
> Is there a way to configure Nutch to crawl "forms authenticated" websites?
> What I mean is the kind of websites which look up a database for
> authentication/authorization and does not allow you to view secure pages
> unless authenticated. This need not be specifically on https, but on http
> too..!
> Any help is greatly appreciated.
> Thanks,
> Ranganath
> P.S : Not sure if this is the right email to ask the question. Apologies, in
> advance.

nutch-u...@lucene.apache.org is the right place to ask this. I've
included it in CC.

This feature is not present in Nutch. We have recorded a summary of
some old discussions regarding this at
http://wiki.apache.org/nutch/HttpPostAuthentication, but it was never
implemented.

Regards,
Susam Pal


Crawling authenticated websites !

2010-03-18 Thread Ranganath Cuddapah
Hello,

Is there a way to configure Nutch to crawl "forms authenticated" websites?
What I mean is the kind of websites which look up a database for
authentication/authorization and do not allow you to view secure pages
unless you are authenticated. This need not be specifically over HTTPS, but
over plain HTTP too.

Any help is greatly appreciated.

Thanks,
Ranganath

P.S.: Not sure if this is the right list to ask this question. Apologies in
advance.


[jira] Commented: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846932#action_12846932
 ] 

Andrzej Bialecki  commented on NUTCH-802:
-

We already have a general way to control this and other aspects of URLs, 
namely URLFilters. I agree that this functionality could be useful, but in 
the form of a URLFilter (or by adding this control to e.g. urlfilter-basic or 
urlfilter-validator).
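As a point of reference, a minimal sketch of expressing such a limit with the existing 
regex-urlfilter plugin (the 4,000-byte figure is just the threshold quoted in the description 
below, not a recommendation):

{code:title=regex-urlfilter.txt}
# Hypothetical rule: reject any URL longer than 4000 characters
-^.{4001,}$

# accept anything else
+.
{code}

A dedicated option in urlfilter-basic or urlfilter-validator, as suggested above, would make 
the limit explicit instead of hiding it in a regex.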

> Problems managing outlinks with large url length
> 
>
> Key: NUTCH-802
> URL: https://issues.apache.org/jira/browse/NUTCH-802
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Reporter: Pablo Aragón
>Assignee: Andrzej Bialecki 
> Attachments: ParseOutputFormat.patch
>
>
> Nutch can hang during the collection of outlinks if the URL address of 
> an outlink is too long.
> The maximum URL sizes accepted by the main web servers are:
> * Apache: 4,000 bytes
> * Microsoft Internet Information Server (IIS): 16,384 bytes
> * Perl HTTP::Daemon: 8,000 bytes
> URL addresses bigger than 4,000 bytes are problematic, so the limit should 
> be set in the nutch-default.xml configuration file.
> I attached a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846930#action_12846930
 ] 

Julien Nioche commented on NUTCH-762:
-

Yes, I came across that situation too on a large crawl where a single machine 
was used to host a whole range of unrelated domain names (needless to say, the 
host of the domains was not very pleased). We can now handle such cases 
simply by partitioning by IP (and counting by domain).

I will have a look at reintroducing *generate.update.crawldb* tomorrow.



 

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB, then updating the DB only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well, as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for 
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can cap the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for capping the URLs and 
> for partitioning; however, as we can't count the max number of URLs by IP, 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance, as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via: nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options:
> MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator's, apart from: 
> -noNorm (explicit)
> -topN: max number of URLs per segment
> -maxNumSegments: the actual number of segments generated could be less than 
> the max value selected if, e.g., not enough URLs are available for fetching 
> and they fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
> 
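For anyone trying the patch, a hypothetical invocation of the MultiGenerator described above 
(paths, -topN and the number of fetchers are placeholders, not recommendations):

{code}
bin/nutch org.apache.nutch.crawl.MultiGenerator crawl/crawldb crawl/segments \
    -topN 50000 -numFetchers 4 -maxNumSegments 8 -noNorm
{code}

Each fetchlist goes into its own segment under crawl/segments, so a single pass over the 
crawlDB can feed several fetch rounds before the next updatedb.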

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reopened NUTCH-802:
-

  Assignee: Andrzej Bialecki 

Submitting a patch is not "fixing"; an issue is fixed when the patch is accepted and 
applied.

> Problems managing outlinks with large url length
> 
>
> Key: NUTCH-802
> URL: https://issues.apache.org/jira/browse/NUTCH-802
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Reporter: Pablo Aragón
>Assignee: Andrzej Bialecki 
> Attachments: ParseOutputFormat.patch
>
>
> Nutch can hang during the collection of outlinks if the URL address of 
> an outlink is too long.
> The maximum URL sizes accepted by the main web servers are:
> * Apache: 4,000 bytes
> * Microsoft Internet Information Server (IIS): 16,384 bytes
> * Perl HTTP::Daemon: 8,000 bytes
> URL addresses bigger than 4,000 bytes are problematic, so the limit should 
> be set in the nutch-default.xml configuration file.
> I attached a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846927#action_12846927
 ] 

Andrzej Bialecki  commented on NUTCH-762:
-

In my experience the IP-based fetching was only (rarely) needed when there was 
a large number of URLs from virtual hosts hosted at the same ISP. In other 
words, not a common case - others may have different experience depending on 
their typical crawl targets... IMHO we don't have to reimplement this.

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB, then updating the DB only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well, as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for 
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can cap the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for capping the URLs and 
> for partitioning; however, as we can't count the max number of URLs by IP, 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance, as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via: nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options:
> MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator's, apart from: 
> -noNorm (explicit)
> -topN: max number of URLs per segment
> -maxNumSegments: the actual number of segments generated could be less than 
> the max value selected if, e.g., not enough URLs are available for fetching 
> and they fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
> 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846923#action_12846923
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

That's one option, at least until crawler-commons produces any artifacts 
... Eventually I think that this code and other related code (e.g. deciding 
which URL is canonical in the presence of redirects, URL normalization and 
filtering) should end up in crawler-commons.

> parse-tika is not properly constructing URLs when the target begins with a "?"
> --
>
> Key: NUTCH-797
> URL: https://issues.apache.org/jira/browse/NUTCH-797
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1
> Environment: Win 7, Java(TM) SE Runtime Environment (build 
> 1.6.0_16-b01)
> Also reproduces on RHEL and Java 1.4.2
>Reporter: Robert Hohman
>Priority: Minor
> Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch
>
>
> This is my first bug and patch on nutch, so apologies if I have not provided 
> enough detail.
> In crawling the page at 
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
> links in the page that look like this:
> <a href="?co=0&sk=0&p=2&pi=1">2</a> <a href="?co=0&sk=0&p=3&pi=1">3</a>
> In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
> getOutlinks looks for links, it comes across this link and constructs a new 
> URL with a base URL class built from 
> "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
> target of "?co=0&sk=0&p=2&pi=1"
> The URL class, per RFC 3986 at 
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
> how to merge these two, and per the RFC, the URL class merges these to: 
> http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
> because the RFC explicitly states that the rightmost url segment (the 
> Search.aspx in this case) should be ripped off before combining.
> While this is compliant with the RFC, it means the URLs created for 
> the next round of fetching are incorrect. Modern browsers seem to handle 
> this case (I checked IE8 and Firefox 3.5), so I'm guessing they make an obscure 
> exception to handle what is a poorly formed URL on Accenture's part.
> I have fixed this by modifying DOMContentUtils to look for the case where a ? 
> begins the target, and then pulling the rightmost component out of the base 
> and inserting it into the target before the ?, so the target in this example 
> becomes:
> Search.aspx?co=0&sk=0&p=2&pi=1
> The URL class then properly constructs the new url as:
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
> If it is agreed that this solution works, I believe the other html parsers in 
> nutch would need to be modified in a similar way.
> Can I get feedback on this proposed solution?  Specifically I'm worried about 
> unforeseen side effects.
> Much thanks
> Here is the patch info:
> Index: 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> ===
> --- 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
>(revision 916362)
> +++ 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
>(working copy)
> @@ -299,6 +299,50 @@
>  return false;
>}
>
> +  private URL fixURL(URL base, String target) throws MalformedURLException
> +  {
> +   // handle params that are embedded into the base url - move them to 
> target
> +   // so URL class constructs the new url class properly
> +   if  (base.toString().indexOf(';') > 0)  
> +  return fixEmbeddedParams(base, target);
> +   
> +   // handle the case that there is a target that is a pure query.
> +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on 
> how to assemble
> +   // URLs but I've seen this in numerous places, for example at
> +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
> +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by 
> default
> +   // URL constructs the base+target combo as 
> +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, 
> incorrectly
> +   // dropping the Search.aspx target
> +   //
> +   // Browsers handle these just fine, they must have an exception 
> similar to this
> +   if (target.startsWith("?"))
> +   {
> +   return fixPureQueryTargets(base, target);
> +   }
> +   
> +   return new URL(base, target);
> +  }
> +  
> +  private URL fixPureQueryTargets(URL base, String target) throws 
> MalformedURLException
> +  {
> + if (!target.startsWith("?"))
> + 
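To make the merge behaviour described above concrete, here is a small, self-contained sketch 
(not part of the attached patch) that resolves the quoted base and target with java.net.URL; 
the expected outputs in the comments are taken from the report rather than re-verified here:

{code:java}
import java.net.URL;

public class PureQueryTargetDemo {
  public static void main(String[] args) throws Exception {
    URL base = new URL("http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0");

    // Query-only target: per the report, the rightmost path segment (Search.aspx)
    // is dropped, yielding http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
    System.out.println(new URL(base, "?co=0&sk=0&p=2&pi=1"));

    // Target rewritten as the patch proposes (last base segment re-attached):
    // resolves to http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
    System.out.println(new URL(base, "Search.aspx?co=0&sk=0&p=2&pi=1"));
  }
}
{code}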

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846910#action_12846910
 ] 

Julien Nioche commented on NUTCH-762:
-

OK, there was indeed an assumption that the generator would not need to be 
called again before an update. I am happy to add back generate.update.crawldb. 

Note that this version of the Generator also differs from the original version 
in that:

{quote}
* IP resolution is done ONLY on the entries which have been selected for 
fetching (during the partitioning). Running the IP resolution on the whole 
crawlDb is too slow to be usable on a large scale
* can cap the number of URLs per host or domain (but not by IP)
{quote}

We could allow more flexibility by counting per IP, again at the expense of 
performance. I am not sure it is very useful in practice, though. Since the way we 
count the URLs is now decoupled from the way we partition them, we can have a 
hybrid approach, e.g. count per domain THEN partition by IP. 

Any thoughts on whether or not we should reintroduce the counting per IP?

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB, then updating the DB only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well, as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for 
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can cap the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for capping the URLs and 
> for partitioning; however, as we can't count the max number of URLs by IP, 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance, as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via: nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options:
> MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator's, apart from: 
> -noNorm (explicit)
> -topN: max number of URLs per segment
> -maxNumSegments: the actual number of segments generated could be less than 
> the max value selected if, e.g., not enough URLs are available for fetching 
> and they fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
> 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"

2010-03-18 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846865#action_12846865
 ] 

Jukka Zitting commented on NUTCH-797:
-

I guess we need to apply the same logic to the other Tika parsers that may 
deal with relative URLs as well.

Since we need this functionality in Tika in any case, would it be useful for 
Nutch if it were made available as a public utility class or method in 
tika-core? It would be great if we could avoid duplicating the code in 
different projects.

> parse-tika is not properly constructing URLs when the target begins with a "?"
> --
>
> Key: NUTCH-797
> URL: https://issues.apache.org/jira/browse/NUTCH-797
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1
> Environment: Win 7, Java(TM) SE Runtime Environment (build 
> 1.6.0_16-b01)
> Also reproduces on RHEL and Java 1.4.2
>Reporter: Robert Hohman
>Priority: Minor
> Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch
>
>
> This is my first bug and patch on nutch, so apologies if I have not provided 
> enough detail.
> In crawling the page at 
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
> links in the page that look like this:
> <a href="?co=0&sk=0&p=2&pi=1">2</a> <a href="?co=0&sk=0&p=3&pi=1">3</a>
> In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
> getOutlinks looks for links, it comes across this link and constructs a new 
> URL with a base URL class built from 
> "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
> target of "?co=0&sk=0&p=2&pi=1"
> The URL class, per RFC 3986 at 
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
> how to merge these two, and per the RFC, the URL class merges these to: 
> http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
> because the RFC explicitly states that the rightmost url segment (the 
> Search.aspx in this case) should be ripped off before combining.
> While this is compliant with the RFC, it means the URLs created for 
> the next round of fetching are incorrect. Modern browsers seem to handle 
> this case (I checked IE8 and Firefox 3.5), so I'm guessing they make an obscure 
> exception to handle what is a poorly formed URL on Accenture's part.
> I have fixed this by modifying DOMContentUtils to look for the case where a ? 
> begins the target, and then pulling the rightmost component out of the base 
> and inserting it into the target before the ?, so the target in this example 
> becomes:
> Search.aspx?co=0&sk=0&p=2&pi=1
> The URL class then properly constructs the new url as:
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
> If it is agreed that this solution works, I believe the other html parsers in 
> nutch would need to be modified in a similar way.
> Can I get feedback on this proposed solution?  Specifically I'm worried about 
> unforeseen side effects.
> Much thanks
> Here is the patch info:
> Index: 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> ===
> --- 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
>(revision 916362)
> +++ 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
>(working copy)
> @@ -299,6 +299,50 @@
>  return false;
>}
>
> +  private URL fixURL(URL base, String target) throws MalformedURLException
> +  {
> +   // handle params that are embedded into the base url - move them to 
> target
> +   // so URL class constructs the new url class properly
> +   if  (base.toString().indexOf(';') > 0)  
> +  return fixEmbeddedParams(base, target);
> +   
> +   // handle the case that there is a target that is a pure query.
> +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on 
> how to assemble
> +   // URLs but I've seen this in numerous places, for example at
> +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
> +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by 
> default
> +   // URL constructs the base+target combo as 
> +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, 
> incorrectly
> +   // dropping the Search.aspx target
> +   //
> +   // Browsers handle these just fine, they must have an exception 
> similar to this
> +   if (target.startsWith("?"))
> +   {
> +   return fixPureQueryTargets(base, target);
> +   }
> +   
> +   return new URL(base, target);
> +  }
> +  
> +  private URL fixPureQueryTargets(URL base, String target) throws 
> MalformedURLException

[jira] Closed: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Aragón closed NUTCH-802.
--

Resolution: Fixed

> Problems managing outlinks with large url length
> 
>
> Key: NUTCH-802
> URL: https://issues.apache.org/jira/browse/NUTCH-802
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Reporter: Pablo Aragón
> Attachments: ParseOutputFormat.patch
>
>
> Nutch can hang during the collection of outlinks if the URL address of 
> an outlink is too long.
> The maximum URL sizes accepted by the main web servers are:
> * Apache: 4,000 bytes
> * Microsoft Internet Information Server (IIS): 16,384 bytes
> * Perl HTTP::Daemon: 8,000 bytes
> URL addresses bigger than 4,000 bytes are problematic, so the limit should 
> be set in the nutch-default.xml configuration file.
> I attached a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Aragón updated NUTCH-802:
---

Attachment: ParseOutputFormat.patch

> Problems managing outlinks with large url length
> 
>
> Key: NUTCH-802
> URL: https://issues.apache.org/jira/browse/NUTCH-802
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Reporter: Pablo Aragón
> Attachments: ParseOutputFormat.patch
>
>
> Nutch can get idle during the collection of outlinks if  the URL address of 
> the outlink is too large.
> The maximum sizes of an URL for the main web servers are:
> * Apache: 4,000 bytes
> * Microsoft Internet Information Server (IIS): 16, 384 bytes
> * Perl HTTP::Daemon: 8.000 bytes
> URL adress sizes bigger than 4000 bytes are problematic, so the limit should 
> be set in the nutch-default.xml configuration file.
> I attached a patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread JIRA
Problems managing outlinks with large url length


 Key: NUTCH-802
 URL: https://issues.apache.org/jira/browse/NUTCH-802
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Pablo Aragón


Nutch can hang during the collection of outlinks if the URL address of an 
outlink is too long.

The maximum URL sizes accepted by the main web servers are:

* Apache: 4,000 bytes
* Microsoft Internet Information Server (IIS): 16,384 bytes
* Perl HTTP::Daemon: 8,000 bytes

URL addresses bigger than 4,000 bytes are problematic, so the limit should be 
set in the nutch-default.xml configuration file.

I attached a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.