[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-10-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584402#comment-15584402
 ] 

Lewis John McGibbney commented on NUTCH-1314:
-

Yes [~wastl-nagel], that is why it is still open. Do you want to port it?

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>Assignee: Lewis John McGibbney
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch
>
>
> In the past we have encountered situations where crawling specific broken 
> sites resulted in ridiculously long urls that caused tasks to stall. 
> The regex plugins (normalizing/filtering) processed single urls for hours, if 
> not hanging indefinitely.
> My suggestion is to limit the outlink url target length as soon as possible. 
> The limit is configurable; the default is 3000. This should be long 
> enough for most uses, but strict enough to make sure the regex 
> plugins do not choke on urls that are too long. Please see the attached patch 
> for the Nutchgora implementation.
> I'd like to hear what you think about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-09-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15522329#comment-15522329
 ] 

Sebastian Nagel commented on NUTCH-1314:


Is there a reason why this issue is still open? To be ported to 1.x?

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>Assignee: Lewis John McGibbney
> Fix For: 2.5
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137467#comment-15137467
 ] 

Hudson commented on NUTCH-1314:
---

SUCCESS: Integrated in Nutch-nutchgora #1549 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1549/])
NUTCH-1314 Impose a limit on the length of outlink target urls (lewismc: 
[http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1729220])
* 2.x/conf/nutch-default.xml
NUTCH-1314 Impose a limit on the length of outlink target urls (lewismc: 
[http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1729219])
* 2.x/src/test/org/apache/nutch/parse/TestParseUtil.java
NUTCH-1314 Impose a limit on the length of outlink target urls (lewismc: 
[http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1729218])
* 2.x/CHANGES.txt
* 2.x/conf/nutch-default.xml
* 2.x/src/java/org/apache/nutch/parse/ParseUtil.java


> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>Assignee: Lewis John McGibbney
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137375#comment-15137375
 ] 

Lewis John McGibbney commented on NUTCH-1314:
-

Committed @ revisions 1729218 and 1729219 in 2.X

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>Assignee: Lewis John McGibbney
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-08 Thread Canan Girgin (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136709#comment-15136709
 ] 

Canan Girgin commented on NUTCH-1314:
-

[~lewismc], could somebody please help me commit NUTCH-1314-v4.patch, if there 
is no problem with it?

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-05 Thread Canan Girgin (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134237#comment-15134237
 ] 

Canan Girgin commented on NUTCH-1314:
-

I tried to apply NUTCH-1314-v3.patch but could not. I have attached a new patch 
file with tests (NUTCH-1314-v4.patch). 

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-02 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129575#comment-15129575
 ] 

Lewis John McGibbney commented on NUTCH-1314:
-

Yep, if someone can consolidate the patches above and generate a test we will 
get this committed. It is a nice improvement for sure.

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128368#comment-15128368
 ] 

Chris A. Mattmann commented on NUTCH-1314:
--

Otis, your patches are always welcome! :)

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-02 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128263#comment-15128263
 ] 

Otis Gospodnetic commented on NUTCH-1314:
-

We've run into this issue with Nutch 1.x and have modified the patch for Nutch 
1.x.  Will try adding to JIRA.  Would be nice if somebody could commit it.

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2013-12-20 Thread Nguyen Manh Tien (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854780#comment-13854780
 ] 

Nguyen Manh Tien commented on NUTCH-1314:
-

[~lewismc] We are using  NUTCH-1314-v3.patch

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.3
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2013-12-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854065#comment-13854065
 ] 

Lewis John McGibbney commented on NUTCH-1314:
-

Hi [~otis], which patch are you using: NUTCH-1314-v3.patch? Can you commit it, 
[~otis]?



> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.3
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2013-12-20 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854056#comment-13854056
 ] 

Otis Gospodnetic commented on NUTCH-1314:
-

BTW, we are using this now, too.  +1 for committing, [~ferdy.g]!

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.3
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2013-06-27 Thread Canan Girgin (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694784#comment-13694784
 ] 

Canan Girgin commented on NUTCH-1314:
-

I tried to test NUTCH-1314-v2.patch, but it removes links of size < 3000. In my 
opinion, the "if (target.length() > maxTargetLength)" lines in the patch file 
are not correct; they should read "if (target.length() < maxTargetLength)". 
NUTCH-1314-v2.patch also uses a new parameter 
("parser.html.outlinks.max_target_length"). I think it must be defined in the 
nutch-default.xml file.

I attached a new patch file. In the ParseUtil class, the target url length is 
checked before the normalizers and filters run. Is that correct?
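
For reference, the two points above (the direction of the comparison and the 
default for the new parameter) can be sketched as follows. This is not code 
from any of the attached patches: java.util.Properties only stands in for 
Nutch's configuration object, and the class and method names are hypothetical. 
The property name and the default of 3000 are taken from this thread. The check 
is written as a "keep" predicate so that its direction is unambiguous.

```java
import java.util.Collections;
import java.util.Properties;

/** Sketch only: Properties is a stand-in for the real configuration object. */
final class OutlinkLengthLimit {
  // Property name as quoted in the comment above; default as stated in the issue.
  static final String MAX_LENGTH_KEY = "parser.html.outlinks.max_target_length";
  static final int DEFAULT_MAX_LENGTH = 3000;

  /** Reads the limit, falling back to the default when the property is unset. */
  static int maxTargetLength(Properties conf) {
    return Integer.parseInt(
        conf.getProperty(MAX_LENGTH_KEY, String.valueOf(DEFAULT_MAX_LENGTH)));
  }

  /** True when the outlink target is short enough to be kept. */
  static boolean keep(String target, int maxTargetLength) {
    return target.length() <= maxTargetLength;
  }

  public static void main(String[] args) {
    int max = maxTargetLength(new Properties()); // nothing set, so 3000
    String longUrl = "http://example.com/"
        + String.join("", Collections.nCopies(5000, "a"));
    System.out.println(keep("http://nutch.apache.org/about.html", max)); // kept
    System.out.println(keep(longUrl, max)); // dropped
  }
}
```

Whether the real code tests for "keep" or for "skip" decides which comparison 
operator is the right one, so the sketch says nothing about which patch 
version is correct.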

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.3
>
> Attachments: NUTCH-1314.patch, NUTCH-1314-trunk.patch, 
> NUTCH-1314-v2.patch, NUTCH-1314-v3.patch


[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2013-04-28 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644269#comment-13644269
 ] 

Tejas Patil commented on NUTCH-1314:


Hi Lewis,
I tried to test both the patches. NUTCH-1314-trunk.patch gave compilation 
errors:
{noformat}
[javac] /home/tejas/Desktop/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:391: error: cannot find symbol
[javac] fixEmbeddedParams(base, target) :  new URL(base, target);
[javac] ^
[javac]   symbol:   method fixEmbeddedParams(URL,String)
[javac]   location: class DOMContentUtils
{noformat}

For NUTCH-1314-v2.patch:
I used [this|http://nutch.apache.org/about.html] url and ran the HtmlParser 
parser.

Before applying the patch:
{noformat}bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser 
about.html
title: About Apache Nutch
text: About Apache Nutch Apache > Nutch > Home   .
outlinks: [toUrl: file:skin/basic.css anchor: , toUrl: file:skin/screen.css 
anchor: , toUrl: file:skin/print.css anchor: , toUrl: file:skin/profile.css 
anchor: , toUrl: file:skin/getBlank.js anchor: , toUrl: file:skin/getMenu.js 
anchor: , toUrl: file:skin/fontsize.js anchor: , toUrl: file:images/favicon.ico 
anchor: , toUrl: http://www.apache.org/ anchor: Apache, toUrl: 
http://nutch.apache.org anchor: Nutch, toUrl: http://nutch.apache.org anchor: 
Home, toUrl: file:skin/breadcrumbs.js anchor: , toUrl: http://www.apache.org/ 
anchor: , toUrl: file:images/feather-small.gif anchor: , toUrl: 
http://nutch.apache.org/ anchor: , toUrl: file:images/nutch_logo_tm.gif anchor: 
, toUrl: file:index.html anchor: Main, toUrl: file:wiki.html anchor: Wiki, 
toUrl: http://issues.apache.org/jira/browse/NUTCH anchor: Jira, toUrl: 
file:index.html anchor: News, toUrl: file:credits.html anchor: Credits, toUrl: 
http://www.apache.org/foundation/thanks.html anchor: Thanks, toUrl: 
http://www.cafepress.com/nutch/ anchor: Buy Stuff, toUrl: 
http://www.apache.org/foundation/sponsorship.html anchor: Sponsorship, toUrl: 
http://www.apache.org/licenses/ anchor: License, toUrl: 
http://www.apache.org/security/ anchor: Security, toUrl: file:faq.html anchor: 
FAQ, toUrl: file:wiki.html anchor: Wiki, toUrl: file:tutorial.html anchor: 
Tutorial, toUrl: file:bot.html anchor: Robot, toUrl: 
file:apidocs-2.1/index.html anchor: API Docs (2.1), toUrl: 
file:apidocs-1.6/index.html anchor: API Docs (1.6), toUrl: 
https://builds.apache.org/job/Nutch-trunk/javadoc/ anchor: API Docs (trunk 
nightly), toUrl: https://builds.apache.org/job/Nutch-nutchgora/javadoc/ anchor: 
API Docs (2.x nightly), toUrl: file:downloads.html anchor: Download, toUrl: 
file:nightly.html anchor: Nightly builds, toUrl: file:sonar.html anchor: Sonar 
Analysis, toUrl: file:mailing_lists.html anchor: Mailing Lists, toUrl: 
file:issue_tracking.html anchor: Issue Tracking, toUrl: 
file:version_control.html anchor: Version Control, toUrl: 
file:old_downloads.html anchor: Older Downloads, toUrl: 
http://lucene.apache.org/java/ anchor: Lucene, toUrl: http://hadoop.apache.org/ 
anchor: Hadoop, toUrl: http://lucene.apache.org/solr/ anchor: Solr, toUrl: 
http://tika.apache.org/ anchor: Tika, toUrl: http://gora.apache.org anchor: 
Gora, toUrl: file:skin/images/rc-b-l-15-1body-2menu-3menu.png anchor: , toUrl: 
file:about.pdf anchor: PDF, toUrl: file:skin/images/pdfdoc.gif anchor: , toUrl: 
file:about.html#Overview anchor: Overview, toUrl: 
http://lucene.apache.org/java/ anchor: Apache Lucene, toUrl: 
http://lucene.apache.org/solr/ anchor: Apache Solr, toUrl: 
http://tika.apache.org/ anchor: Apache Tika, toUrl: http://hadoop.apache.org/ 
anchor: Hadoop cluster, toUrl: http://wiki.apache.org/nutch/ anchor: Nutch 
wiki., toUrl: http://www.apache.org/licenses/ anchor: The Apache Software 
Foundation. Apache Nutch, Nutch, Apache, the Apache feather logo, and the 
Apache Nutch project logo are trademarks of The Apache Software 
Foundation.]{noformat}

After applying the patch:
{noformat}bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser 
about.html
title: About Apache Nutch
text: About Apache Nutch Apache > Nutch > Home   .
outlinks: []{noformat}

Correct me if I am wrong: this patch should only remove links of size > 3000. 
These outlinks are not especially long, so the patch should not have removed 
them.

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1314.patch, NUTCH-1314-trunk.patch, 
> NUTCH-1314-v2.patch

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-04-18 Thread Ferdy Galema (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256438#comment-13256438
 ] 

Ferdy Galema commented on NUTCH-1314:
-

Exactly. Until that merge is properly implemented we can rely on this quickfix.

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Attachments: NUTCH-1314.patch




[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-04-18 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256437#comment-13256437
 ] 

Julien Nioche commented on NUTCH-1314:
--

This makes a good case for the merging of URL filters and normalizers (I think 
there is a JIRA on this) - we wouldn't need to worry about whether the 
normalizer is called first etc... 

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Attachments: NUTCH-1314.patch




[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-04-18 Thread Ferdy Galema (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256431#comment-13256431
 ] 

Ferdy Galema commented on NUTCH-1314:
-

I understand. I think the problem with implementing it as a URL filter is 
that some parts of Nutch run the normalizers first. In the ParseUtil this is 
the case. Thus malformed outlinks (of course this is where the majority of 
new urls are found) would still be problematic. It makes sense to run 
normalizers first: some urls still have a chance to be fixed (normalized) 
before they are filtered out.

Therefore the scope of this issue is to apply a very crude (but effective) 
filter before the normalizing/filtering code is run.
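
The ordering described above can be sketched roughly like this. It is an 
illustration only, not the ParseUtil code: the class name, the method 
signature, and the lambda normalizer and filter are all hypothetical stand-ins 
for the regex plugins. The point is just that the length check runs before 
anything expensive sees the url.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

final class OutlinkPipelineSketch {
  /**
   * Drops over-long targets first, then normalizes, then filters.
   * The normalizer and filter arguments are placeholders for the plugins.
   */
  static List<String> process(List<String> targets, int maxTargetLength,
      UnaryOperator<String> normalizer, Predicate<String> filter) {
    List<String> kept = new ArrayList<>();
    for (String target : targets) {
      if (target.length() > maxTargetLength) {
        continue; // crude length check: the url never reaches the regex steps
      }
      String normalized = normalizer.apply(target);
      if (normalized != null && filter.test(normalized)) {
        kept.add(normalized);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> targets = new ArrayList<>();
    targets.add("HTTP://Example.com/a");
    targets.add("http://example.com/"
        + String.join("", Collections.nCopies(5000, "x")));
    List<String> out = process(targets, 3000,
        url -> url.toLowerCase(Locale.ROOT),   // toy normalizer
        url -> url.startsWith("http://"));     // toy filter
    System.out.println(out); // prints [http://example.com/a]
  }
}
```

Short urls still go through the normalizer before the filter, so they keep 
their chance of being fixed, as argued above.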

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Attachments: NUTCH-1314.patch




[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-04-18 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256423#comment-13256423
 ] 

Julien Nioche commented on NUTCH-1314:
--

I was under the impression that the patch did not remove the URL but 
substituted it with a shorter version. If the idea is to remove the URL 
altogether (which makes perfect sense) then yes it should be a URLFilter 
instead 
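
To make the two options concrete (reject the url versus substitute a shorter 
version), here is a small sketch. The interface below is a hypothetical 
stand-in and not Nutch's actual URLFilter; it assumes the convention that a 
filter returns the url to accept it and null to reject it. All names are made 
up for illustration.

```java
/** Stand-in contract (assumption): return the url to accept it, null to reject it. */
interface UrlFilterSketch {
  String filter(String url);
}

final class MaxLengthUrlFilter implements UrlFilterSketch {
  private final int maxLength;

  MaxLengthUrlFilter(int maxLength) {
    this.maxLength = maxLength;
  }

  /** Option 1: reject urls that are too long. */
  @Override
  public String filter(String url) {
    if (url == null || url.length() > maxLength) {
      return null;
    }
    return url;
  }

  /** Option 2: keep the url but cut off everything past the limit. */
  static String truncate(String url, int maxLength) {
    return url.length() > maxLength ? url.substring(0, maxLength) : url;
  }

  public static void main(String[] args) {
    MaxLengthUrlFilter f = new MaxLengthUrlFilter(10);
    System.out.println(f.filter("http://a/"));                   // accepted
    System.out.println(f.filter("http://example.com/long"));     // rejected (null)
    System.out.println(truncate("http://example.com/long", 10)); // cut to 10 chars
  }
}
```

Truncating yields a url that most likely points nowhere useful, which is one 
argument for rejecting instead.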

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Attachments: NUTCH-1314.patch




[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-04-18 Thread Ferdy Galema (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256398#comment-13256398
 ] 

Ferdy Galema commented on NUTCH-1314:
-

I assume you mean a URLFilter? Or do you want to correct the length by cutting 
off the excess part? I think the urls should be rejected, because they 
were probably malformed anyway.
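
The reject-rather-than-truncate behaviour described above can be sketched as 
follows. This is a minimal standalone illustration, not the attached patch: the 
class name UrlLengthFilter is hypothetical, and it only mirrors the usual filter 
contract (return the url to keep it, return null to reject it) without depending 
on Nutch itself. The default of 3000 is the value from the issue description.

```java
/**
 * Hypothetical standalone sketch of a length-based url filter.
 * It is NOT a Nutch plugin; it only mimics the "null means rejected" contract.
 */
final class UrlLengthFilter {

    /** Default limit, taken from the issue description. */
    static final int DEFAULT_MAX_LENGTH = 3000;

    private final int maxLength;

    UrlLengthFilter() {
        this(DEFAULT_MAX_LENGTH);
    }

    UrlLengthFilter(int maxLength) {
        if (maxLength < 1) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
        this.maxLength = maxLength;
    }

    /**
     * Returns the url unchanged if it is within the limit, or null to reject it.
     * The url is never shortened: a truncated url would point somewhere else.
     */
    String filter(String url) {
        if (url == null || url.length() > maxLength) {
            return null;
        }
        return url;
    }
}
```

The check is a single length comparison, so it costs nothing compared to the 
regex plugins it is meant to protect.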





[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-04-18 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256393#comment-13256393
 ] 

Julien Nioche commented on NUTCH-1314:
--

What about doing this with a URLNormalizer (and making it the first to be 
called)?
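
Why the ordering matters can be sketched with a small chain of url steps. This 
is an illustration under assumptions, not Nutch code: UrlChain, maxLength and 
collapseSlashes are hypothetical names, and the regex step merely stands in for 
a real normalizing plugin. The length check runs first, so an oversized url 
never reaches the regex step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

/** Hypothetical chain of url steps; a null result rejects the url. */
final class UrlChain {

    private final List<UnaryOperator<String>> steps;

    UrlChain(List<UnaryOperator<String>> steps) {
        this.steps = new ArrayList<>(steps);
    }

    /** Runs the steps in order; once a step returns null, the rest are skipped. */
    String apply(String url) {
        String current = url;
        for (UnaryOperator<String> step : steps) {
            if (current == null) {
                return null;
            }
            current = step.apply(current);
        }
        return current;
    }

    /** Cheap guard: a single length comparison. */
    static UnaryOperator<String> maxLength(int limit) {
        return u -> u.length() <= limit ? u : null;
    }

    /** Stand-in for a regex-based step: collapses repeated slashes after the scheme. */
    static UnaryOperator<String> collapseSlashes() {
        Pattern repeated = Pattern.compile("(?<!:)/{2,}");
        return u -> repeated.matcher(u).replaceAll("/");
    }
}
```

A chain built as maxLength(3000) followed by collapseSlashes() only hands urls 
of at most 3000 characters to the regex step; reversing the order would remove 
that protection.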





[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-03-16 Thread Ferdy Galema (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231373#comment-13231373
 ] 

Ferdy Galema commented on NUTCH-1314:
-

Good one. I overlooked those, but they should definitely be treated the same way.





[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-03-16 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231368#comment-13231368
 ] 

Markus Jelsma commented on NUTCH-1314:
--

This should then also work for the Tika parser and the OutlinkExtractor, I 
think. Parse-html is similar to parse-tika: if there are no outlinks obtained by 
getOutlinks in DOMContentUtils, then the OutlinkExtractor is used.
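
One way to treat all of these extraction paths the same is to route their 
results through a single shared helper. The sketch below is only an 
illustration: OutlinkLengthLimiter is a hypothetical name, and it works on plain 
url strings instead of Nutch's own outlink objects so that it stays 
self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical shared helper that every outlink source could call. */
final class OutlinkLengthLimiter {

    private OutlinkLengthLimiter() {
        // static helper, not meant to be instantiated
    }

    /**
     * Returns a new list with only the urls whose length is within maxLength.
     * Null entries and over-long urls are dropped; the order is preserved.
     */
    static List<String> limit(List<String> candidateUrls, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String url : candidateUrls) {
            if (url != null && url.length() <= maxLength) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Calling the same helper from the HTML parser, the Tika parser and the fallback 
extractor would keep the limit consistent regardless of which path produced the 
outlinks.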
