[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511985
 ] 

Hudson commented on NUTCH-505:
--

Integrated in Nutch-Nightly #147 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/147/])

 Outlink urls should be validated
 

 Key: NUTCH-505
 URL: https://issues.apache.org/jira/browse/NUTCH-505
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-505.patch, NUTCH-505.patch, NUTCH-505_draft.patch, 
 NUTCH-505_draft_v2.patch


 See discussion here:
 http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
 Parse plugins may extract garbage urls from pages. We need a url validation 
 system that tests these urls and filters out garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread Espen Amble Kolstad (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512071
 ] 

Espen Amble Kolstad commented on NUTCH-505:
---

Automaton (http://www.brics.dk/automaton/), used in AutomatonURLFilter, is even 
faster if you pre-parse the regexes.
It doesn't support all regex syntax, but it covers most of it.

 Outlink urls should be validated
 

 Key: NUTCH-505
 URL: https://issues.apache.org/jira/browse/NUTCH-505
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-505-v2.patch, NUTCH-505.patch, NUTCH-505.patch, 
 NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch





[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512074
 ] 

Doğacan Güney commented on NUTCH-505:
-

Thanks for the suggestion. Automaton really looks good, but using Automaton in 
UrlValidator would mean bringing the automaton jar into Nutch core (it currently 
resides in the urlfilter-automaton plugin's lib). I am not sure if that's OK with 
everyone.

 Outlink urls should be validated
 

 Key: NUTCH-505
 URL: https://issues.apache.org/jira/browse/NUTCH-505
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-505-v2.patch, NUTCH-505.patch, NUTCH-505.patch, 
 NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch





[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512139
 ] 

Andrzej Bialecki  commented on NUTCH-505:
-

Please test Java 1.5 and Java 1.6 - IIRC there are some differences in 
performance of java.util.regex between these two versions.
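A comparison like this could be run with a small timing harness on each JVM. The sketch below is only an illustration: the pattern, the sample URLs, and the class name RegexTiming are assumptions, not taken from any of the NUTCH-505 patches, and a loop like this gives only a rough indication compared with a proper benchmark.

```java
import java.util.regex.Pattern;

// Hypothetical timing harness; pattern and sample URLs are illustrative only.
class RegexTiming {
    // Compiled once and reused, so compilation cost is not measured per URL.
    static final Pattern URL =
        Pattern.compile("^https?://[A-Za-z0-9.-]+(?:/[A-Za-z0-9._~%()-]*)*$");

    // Counts how many URLs match, repeated for the given number of rounds.
    static int countMatches(String[] urls, int rounds) {
        int hits = 0;
        for (int r = 0; r < rounds; r++) {
            for (String u : urls) {
                if (URL.matcher(u).matches()) {
                    hits++;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.variety.com/",
            "http://www.variety.com/index.html",
            "http:/"
        };
        countMatches(urls, 10000); // warm-up so the JIT has a chance to compile the loop
        long start = System.nanoTime();
        int hits = countMatches(urls, 100000);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(System.getProperty("java.version") + ": "
            + hits + " matches in " + millis + " ms");
    }
}
```

Running the same class under each Java version and comparing the printed times would show whether the difference matters for a given pattern.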

 Outlink urls should be validated
 

 Key: NUTCH-505
 URL: https://issues.apache.org/jira/browse/NUTCH-505
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: filtered.txt, NUTCH-505-v2.patch, NUTCH-505-v3.patch, 
 NUTCH-505.patch, NUTCH-505.patch, NUTCH-505_draft.patch, 
 NUTCH-505_draft_v2.patch





[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512201
 ] 

Doğacan Güney commented on NUTCH-505:
-

Andrzej, in my tests, java.util.regex is faster on both Java 1.5 and Java 1.6.

And btw, I added ( and ) as valid path characters to the relevant regex pattern 
because nutch was able to fetch a url containing them.
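The effect of allowing parentheses can be illustrated with two path patterns that differ only in that respect. Both patterns and the sample path are assumptions made for this sketch; the actual pattern in the patch is not quoted in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical path patterns; not the pattern used in the NUTCH-505 patch.
class PathCharsDemo {
    // Path segments limited to letters, digits, and a few punctuation characters.
    static final Pattern WITHOUT_PARENS =
        Pattern.compile("^(?:/[A-Za-z0-9._~%-]+)*/?$");
    // Same character class, with '(' and ')' added.
    static final Pattern WITH_PARENS =
        Pattern.compile("^(?:/[A-Za-z0-9._~%()-]+)*/?$");

    static boolean strict(String path) {
        return WITHOUT_PARENS.matcher(path).matches();
    }

    static boolean relaxed(String path) {
        return WITH_PARENS.matcher(path).matches();
    }

    public static void main(String[] args) {
        String path = "/wiki/Nutch_(software)"; // made-up example path
        System.out.println(strict(path));  // false
        System.out.println(relaxed(path)); // true
    }
}
```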

 Outlink urls should be validated
 

 Key: NUTCH-505
 URL: https://issues.apache.org/jira/browse/NUTCH-505
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: filtered.txt, NUTCH-505-v2.patch, NUTCH-505-v3.patch, 
 NUTCH-505.patch, NUTCH-505.patch, NUTCH-505_draft.patch, 
 NUTCH-505_draft_v2.patch





[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-10 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511447
 ] 

Andrzej Bialecki  commented on NUTCH-505:
-

* In ParseOutputFormat, the calculation of outlinksToStore should not make 
repeated calls to job.getInt() - the value of db.max.outlinks.per.page should 
be retrieved once per invocation of getRecordWriter().

* You should increase the version number of ParseData, and add code to read 
the current (pre-patch) version of ParseData. Otherwise the updated code won't be able to 
read older segments.

Other than that, the patch looks great, +1 for committing it after fixing these 
issues.
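The version-number point can be sketched with a small record that writes a version byte first and branches on it when reading. Everything here is hypothetical: the class VersionedRecord, its two fields, and the version numbers stand in for ParseData and are not its real layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a versioned record; not the real ParseData format.
class VersionedRecord {
    static final byte VERSION = 2; // bumped from 1 when outlinkCount was added

    String title = "";
    int outlinkCount = -1; // -1 means "not present in the stored data"

    void write(DataOutput out) throws IOException {
        out.writeByte(VERSION);
        out.writeUTF(title);
        out.writeInt(outlinkCount); // field that exists only from version 2 on
    }

    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version < 1 || version > VERSION) {
            throw new IOException("Unsupported version: " + version);
        }
        title = in.readUTF();
        // Older data has no outlinkCount, so fall back to the default.
        outlinkCount = version >= 2 ? in.readInt() : -1;
    }

    // Simulates bytes written by the old (version 1) code.
    static byte[] legacyBytes(String title) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(1);
            out.writeUTF(title);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static VersionedRecord fromBytes(byte[] data) {
        try {
            VersionedRecord r = new VersionedRecord();
            r.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return r;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        VersionedRecord old = fromBytes(legacyBytes("old segment"));
        System.out.println(old.title + " / " + old.outlinkCount);
    }
}
```

Without the version branch in readFields, the new code would try to read an int that old data never contained, which is the failure mode described above for older segments.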

 Outlink urls should be validated
 

 Key: NUTCH-505
 URL: https://issues.apache.org/jira/browse/NUTCH-505
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Priority: Minor
 Attachments: NUTCH-505.patch, NUTCH-505_draft.patch, 
 NUTCH-505_draft_v2.patch





Re: [jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-06-26 Thread Kai_testing Middleton

I can confirm that with NUTCH-505_draft_v2.patch I no longer get outlink urls 
that contain HTML markup, as I was getting before on www.variety.com.

--Kai Middleton









   


[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-06-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507803
 ] 

Doğacan Güney commented on NUTCH-505:
-

btw, for http://www.variety.com/, these are the 'urls' filtered:

http:/
http://www.variety.com//div
http://www.variety.com//div/a
mailto:[EMAIL PROTECTED]
http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '?
http://ad.doubleclick.net/ad/variety.dart/;sz=993x47;ord=' + randomnumber + '?

Since we will not distribute score to these, this patch may also slightly 
improve scoring.
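A validator that rejects the strings listed above could look roughly like the sketch below. It is not the UrlValidator from the patch: the class name OutlinkUrlCheck and all three rules (allowed schemes, required host, no empty path segments) are assumptions chosen so that the listed examples are filtered.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical validator; rules are assumptions, not the NUTCH-505 implementation.
class OutlinkUrlCheck {
    // Assumed rule: only follow these schemes.
    private static final List<String> SCHEMES = Arrays.asList("http", "https", "ftp");
    // Assumed rule: every path segment is non-empty (so "//div" is rejected);
    // a single trailing slash is allowed.
    private static final Pattern PATH =
        Pattern.compile("^(?:/[A-Za-z0-9._~%!$&'()*+,;=:@-]+)*/?$");

    static boolean isValid(String url) {
        final URI uri;
        try {
            uri = new URI(url); // rejects spaces and other illegal characters
        } catch (URISyntaxException e) {
            return false;
        }
        String scheme = uri.getScheme();
        if (scheme == null || !SCHEMES.contains(scheme.toLowerCase())) {
            return false;
        }
        if (uri.getHost() == null) {
            return false; // e.g. "http:/" has no host at all
        }
        String path = uri.getRawPath();
        return path == null || PATH.matcher(path).matches();
    }

    // Returns only the outlinks that pass isValid, preserving order.
    static List<String> keepValid(List<String> outlinks) {
        List<String> kept = new ArrayList<String>();
        for (String url : outlinks) {
            if (isValid(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
            "http://www.variety.com/index.html", // made-up valid link for contrast
            "http:/",
            "http://www.variety.com//div",
            "http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '?");
        System.out.println(keepValid(outlinks));
    }
}
```

Dropping such strings before they are stored as outlinks is also what keeps them from receiving a share of the page's score.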


 Outlink urls should be validated
 

 Key: NUTCH-505
 URL: https://issues.apache.org/jira/browse/NUTCH-505
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Priority: Minor
 Attachments: NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch

