[jira] Created: (NUTCH-549) Bug

2007-09-06 Thread crossany (JIRA)
Bug
---

 Key: NUTCH-549
 URL: https://issues.apache.org/jira/browse/NUTCH-549
 Project: Nutch
  Issue Type: Bug
Reporter: crossany




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Limiting outlink tags.

2007-09-06 Thread Marcin Okraszewski
Hi,
I have noticed that Nutch considers img/@src an outlink. I suppose in many 
cases people do not want to treat an image as an outlink. At least I don't. 
The same applies to script/@src. But it seems there is no way to limit 
outlink tags. DOMContentUtils.getOutlinks() takes links from all of 
a,area,form,frame,iframe,script,link,img. Only the "form" element can be turned 
off, via the "parser.html.form.use_action" parameter.

I would suggest introducing a new configuration parameter which could be used 
to turn certain elements on or off. It could be done simply with a single 
parameter containing a comma-separated list of tags to be turned off.
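
A minimal sketch of how such a parameter might be interpreted is given here. It is not the Nutch implementation: the class name and the idea of passing the raw parameter value as a string are assumptions made for illustration, and only the tag list is taken from the message above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Nutch.
final class OutlinkTagFilter {
    // Elements named in the message as current outlink sources.
    static final List<String> LINK_TAGS = List.of(
        "a", "area", "form", "frame", "iframe", "script", "link", "img");

    // Parses a comma-separated value such as "img, script" into lowercase tag names.
    static Set<String> parseIgnoredTags(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Set.of();
        }
        return Arrays.stream(value.split(","))
            .map(s -> s.trim().toLowerCase(Locale.ROOT))
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    // Returns the tags that would still be followed as outlinks.
    static List<String> enabledTags(String ignoredValue) {
        Set<String> ignored = parseIgnoredTags(ignoredValue);
        return LINK_TAGS.stream()
            .filter(tag -> !ignored.contains(tag))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The value would come from the (assumed) new configuration parameter.
        System.out.println(enabledTags("img, script"));
    }
}
```

An empty or missing value leaves all tags enabled, so existing configurations would behave as before.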

What is your opinion? If you think it is a valid issue I can make a patch for 
this.

Regards,
Marcin



Labeling URLs a-la Google

2007-09-06 Thread Jeff Maki
Hello everybody,

I'm working on a project that is essentially a searchable database for
academic citations at the University of Pittsburgh. One of our
searching requirements was to be able to break the search results into
sections--in order to do this, I implemented something similar to
Google's "labels".

It's based heavily on the example plugin, and maybe not so pretty
code-wise, but it's a start.

Downloadable here:
http://upclose.lrdc.pitt.edu/people/maki_assets/nutch-regex-label.tar.gz

You configure it by adding something like the following to your nutch-site.xml file:

<property>
  <name>extension.regexlabeler.labels</name>
  <value>
http://dev3\.informalscience\.org/research.*\.php.* = firsttag secondtag thirdtag,
http://dev3\.informalscience\.org/project.*\.php.* = project,
http://www.?\.informalscience\.org.* = oldsite,
http://dev3\.informalscience\.org.* = devsite
  </value>
</property>

Notes:
* Format of each line is <URL pattern> = <tags>
* URLS must be unique.
* Multiple tags for the same pattern are delimited by a space.
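
For readers who want to see how a value in this format could be interpreted, here is a rough sketch. It is not the code from the tarball: the class and method names are made up, and it assumes that entries are separated by commas, that the first "=" separates pattern from tags, that a pattern must match the whole URL, and that every matching entry contributes its tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the plugin's actual code.
final class RegexLabelSketch {
    static final class Rule {
        final Pattern pattern;
        final List<String> tags;
        Rule(Pattern pattern, List<String> tags) {
            this.pattern = pattern;
            this.tags = tags;
        }
    }

    // Splits the property value into "pattern = tag1 tag2" entries.
    // Limitation: a pattern containing a comma, or an "=" before the separator, breaks this simple parser.
    static List<Rule> parse(String value) {
        List<Rule> rules = new ArrayList<>();
        for (String entry : value.split(",")) {
            String[] parts = entry.trim().split("\\s*=\\s*", 2);
            if (parts.length != 2 || parts[0].isEmpty() || parts[1].trim().isEmpty()) {
                continue; // skip blank or malformed entries
            }
            List<String> tags = Arrays.asList(parts[1].trim().split("\\s+"));
            rules.add(new Rule(Pattern.compile(parts[0]), tags));
        }
        return rules;
    }

    // Collects the tags of every rule whose pattern matches the whole URL.
    static List<String> labelsFor(List<Rule> rules, String url) {
        List<String> labels = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).matches()) {
                labels.addAll(rule.tags);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        String value =
            "http://dev3\\.informalscience\\.org/project.*\\.php.* = project,\n"
            + "http://dev3\\.informalscience\\.org.* = devsite";
        List<Rule> rules = parse(value);
        System.out.println(labelsFor(rules, "http://dev3.informalscience.org/project/x.php"));
    }
}
```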

Hope this saves somebody some time,

-Jeff

(BTW, Nutch has worked very well for us--excellent project!)


[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-09-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525475
 ] 

Andrzej Bialecki  commented on NUTCH-530:
-

I'm still against this patch, exactly because we are not sure how many times 
the ScoringFilters will be executed - it may be once, twice or N times. The 
current contract for ScoringFilters is that they are executed once.

CrawlDbReducer itself does not reduce all inlinked datums to a single 
CrawlDatum - it's up to the scoring filters to do whatever they want to do with 
all inlinks - although it's true that scoring-opic performs an operation 
equivalent to this, this may not always be the case.

Second, let's consider the following scenario (BTW, this is close to one of the 
ScoringFilters that I actually implemented, so it's not far fetched): let's say 
I implemented a ScoringFilter that checks for existence of a flag in CrawlDatum 
(presumably put there by Generator), and based on the value of this flag it 
counts the score from inlinks differently. Then it clears the flag to mark a 
successful update. If we ran updatedb that includes your patch, this operation 
would work ok in the first spill from the Combiner (although with vastly 
incomplete information), and then it would fail to do the right thing on 
subsequent runs through the Combiner or Reducer, because the flag would be 
already reset.
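
The scenario can be illustrated with a toy model. It uses none of the real Nutch or Hadoop classes: the Datum class, the flag, and the doubling rule are hypothetical stand-ins. It only shows why an update that clears a flag gives a different result when it runs on partial input more than once.

```java
import java.util.List;

// Toy model of a flag-dependent score update; not Nutch code.
final class CombinerFlagSketch {
    static final class Datum {
        float score;
        boolean flag; // hypothetical flag, imagined as set by the Generator
        Datum(float score, boolean flag) {
            this.score = score;
            this.flag = flag;
        }
    }

    // Counts inlink scores double while the flag is set, then clears the flag.
    static void updateScore(Datum page, List<Float> inlinkScores) {
        float sum = 0f;
        for (float s : inlinkScores) {
            sum += s;
        }
        page.score += page.flag ? 2f * sum : sum;
        page.flag = false; // marks a successful update
    }

    // Contract assumed by the filter: one call that sees all inlinks.
    static float runOnce(List<Float> allInlinks) {
        Datum page = new Datum(0f, true);
        updateScore(page, allInlinks);
        return page.score;
    }

    // Same inlinks, but split across a combiner pass and a later pass.
    static float runInTwoPasses(List<Float> firstSpill, List<Float> rest) {
        Datum page = new Datum(0f, true);
        updateScore(page, firstSpill); // sees the flag, but only partial data
        updateScore(page, rest);       // flag already cleared
        return page.score;
    }

    public static void main(String[] args) {
        System.out.println(runOnce(List.of(1f, 1f, 1f, 1f)));
        System.out.println(runInTwoPasses(List.of(1f, 1f), List.of(1f, 1f)));
    }
}
```

With four inlinks of score 1, the single pass yields 8 while the two-pass run yields 6, because the second pass no longer sees the flag.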

> Add a combiner to improve performance on updatedb
> -
>
> Key: NUTCH-530
> URL: https://issues.apache.org/jira/browse/NUTCH-530
> Project: Nutch
>  Issue Type: Improvement
> Environment: java 1.6
>Reporter: Emmanuel Joke
>Assignee: Emmanuel Joke
> Fix For: 1.0.0
>
> Attachments: NUTCH-530.patch
>
>
> We have a lot of similar links with status "linked" generated at the output of 
> the map task when we try to update the crawldb based on the segment fetched.
> We can use a combiner to improve the performance.




[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

2007-09-06 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525452
 ] 

Emmanuel Joke commented on NUTCH-548:
-

My mistake, you're right. I was using the crawl command for my test, and I 
didn't notice that the code sets urlfilter and urlnormalizer to 
TRUE. 

Anyway, the current patch is still valid and useful.

Thanks again for the explanations.

> Move URLNormalizer from Outlink to ParseOutputFormat
> 
>
> Key: NUTCH-548
> URL: https://issues.apache.org/jira/browse/NUTCH-548
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Emmanuel Joke
>Assignee: Emmanuel Joke
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-548.patch
>
>
> The idea is to avoid instantiating a new URLNormalizer for every OutLink. 
> So I move this operation to the ParseOutputFormat object.




[jira] Commented: (NUTCH-524) Generate Problem with Single Node

2007-09-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525419
 ] 

Doğacan Güney commented on NUTCH-524:
-

Hi Ian and Daniel,

Have you tried the max.threads.per.host option? Or are you still working on this 
one?

> Generate Problem with Single Node
> -
>
> Key: NUTCH-524
> URL: https://issues.apache.org/jira/browse/NUTCH-524
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 0.9.0
> Environment: All
>Reporter: Daniel Clark
>Priority: Minor
> Fix For: 0.9.0
>
> Attachments: nutch-0.9_PartitionUrlByHost.patch
>
>
> Nutch with Hadoop has problems with a single node in URL list when there is a 
> cluster of two or more machines.  I will provide a fix for this.




[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-09-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525418
 ] 

Doğacan Güney commented on NUTCH-530:
-

Andrzej, what do you think about this one in light of Emmanuel's last comment? 
I am still uneasy about ScoringFilters running twice,  but I think Emmanuel is 
right that semantics don't change.

> Add a combiner to improve performance on updatedb
> -
>
> Key: NUTCH-530
> URL: https://issues.apache.org/jira/browse/NUTCH-530
> Project: Nutch
>  Issue Type: Improvement
> Environment: java 1.6
>Reporter: Emmanuel Joke
>Assignee: Emmanuel Joke
> Fix For: 1.0.0
>
> Attachments: NUTCH-530.patch
>
>
> We have a lot of similar links with status "linked" generated at the output of 
> the map task when we try to update the crawldb based on the segment fetched.
> We can use a combiner to improve the performance.




[jira] Updated: (NUTCH-546) file URL are filtered out by the crawler

2007-09-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-546:


Attachment: NUTCH-546-validator-plugin_v1.patch

Here is a patch that removes UrlValidator code from nutch code and adds it as a 
plugin (urlfilter-validator).

> file URL are filtered out by the crawler
> 
>
> Key: NUTCH-546
> URL: https://issues.apache.org/jira/browse/NUTCH-546
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.0
> Environment: Windows XP
> Nutch trunk from Monday, August 20th 2007
>Reporter: Marc Brette
>Assignee: Doğacan Güney
> Attachments: NUTCH-546-validator-plugin_v1.patch, NUTCH-546.patch
>
>
> I tried to index a file system using the file:/ protocol, which worked fine in 
> version 0.9.
> File URLs are being filtered out and not fetched at all.
> I investigated the code and saw that there are 2 issues:
> 1) One is with the class UrlValidator: when validating a URL, it checks the 
> 'authority', a combination of host and port. As it is null for file URLs, the 
> URL is rejected.
> 2) Once this check is removed, files that contain space characters (and maybe 
> other characters that need to be URL-encoded) are also filtered out. It may be 
> because the file protocol plugin doesn't URL-encode space characters and/or 
> UrlValidator enforces the rule that such characters must be encoded.
> To work around these issues, I just commented out the UrlValidator checks and 
> it works fine.
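
Both symptoms can be reproduced in isolation with the standard java.net.URI class. This is only an illustration of the underlying URL properties, not of Nutch's UrlValidator, whose exact checks are not shown in this thread; the class name and sample paths are made up.

```java
import java.net.URI;

// Illustration with the JDK's URI class only; not Nutch's UrlValidator.
final class FileUrlAuthoritySketch {
    // Returns the authority (host and optional port), or null if the URI has none.
    static String authorityOf(String url) {
        return URI.create(url).getAuthority();
    }

    // Reports whether the string is accepted by the strict URI parser.
    static boolean parsesAsUri(String url) {
        try {
            URI.create(url);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // e.g. an unencoded space in the path
        }
    }

    public static void main(String[] args) {
        System.out.println(authorityOf("http://example.org:8080/index.html"));
        System.out.println(authorityOf("file:/tmp/docs/report.txt"));
        System.out.println(parsesAsUri("file:/tmp/docs/my report.txt"));
        System.out.println(parsesAsUri("file:/tmp/docs/my%20report.txt"));
    }
}
```

A validator that requires a non-null authority will therefore reject any file:/ URL, and a strict parser will reject paths with unencoded spaces.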
