[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584402#comment-15584402 ]

Lewis John McGibbney commented on NUTCH-1314:
---------------------------------------------

Yes [~wastl-nagel], that was why it was still open. Do you want to port?

> Impose a limit on the length of outlink target urls
> ---------------------------------------------------
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Assignee: Lewis John McGibbney
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch
>
>
> In the past we have encountered situations where crawling specific broken
> sites resulted in ridiculously long urls that caused the stalling of tasks.
> The regex plugins (normalizing/filtering) processed single urls for hours, if
> not hanging indefinitely.
> My suggestion is to limit the outlink url target length as soon as possible. It
> is a configurable limit, the default is 3000. This should be long enough for
> most uses, yet strict enough to make sure regex plugins do not choke on urls
> that are too long. Please see attached patch for the Nutchgora implementation.
> I'd like to hear what you think about this.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15522329#comment-15522329 ]

Sebastian Nagel commented on NUTCH-1314:
----------------------------------------

Is there a reason why this issue is still open? To be ported to 1.x?

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Assignee: Lewis John McGibbney
> Fix For: 2.5
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137467#comment-15137467 ]

Hudson commented on NUTCH-1314:
-------------------------------

SUCCESS: Integrated in Nutch-nutchgora #1549 (See [https://builds.apache.org/job/Nutch-nutchgora/1549/])
NUTCH-1314 Impose a limit on the length of outlink target urls (lewismc: [http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1729220])
* 2.x/conf/nutch-default.xml
NUTCH-1314 Impose a limit on the length of outlink target urls (lewismc: [http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1729219])
* 2.x/src/test/org/apache/nutch/parse/TestParseUtil.java
NUTCH-1314 Impose a limit on the length of outlink target urls (lewismc: [http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1729218])
* 2.x/CHANGES.txt
* 2.x/conf/nutch-default.xml
* 2.x/src/java/org/apache/nutch/parse/ParseUtil.java

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Assignee: Lewis John McGibbney
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137375#comment-15137375 ]

Lewis John McGibbney commented on NUTCH-1314:
---------------------------------------------

Committed @ revisions 1729218 and 1729219 in 2.X

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Assignee: Lewis John McGibbney
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136709#comment-15136709 ]

Canan Girgin commented on NUTCH-1314:
-------------------------------------

[~lewismc], could somebody please help me commit NUTCH-1314-v4.patch, if there is no problem?

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134237#comment-15134237 ]

Canan Girgin commented on NUTCH-1314:
-------------------------------------

I tried to apply NUTCH-1314-v3.patch but could not. I attached a new patch file with tests (NUTCH-1314-v4.patch).

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129575#comment-15129575 ]

Lewis John McGibbney commented on NUTCH-1314:
---------------------------------------------

Yep, if someone can consolidate the patches above and generate a test we will get this committed. It is a nice improvement for sure.

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128368#comment-15128368 ]

Chris A. Mattmann commented on NUTCH-1314:
------------------------------------------

Otis, your patches are always welcome! :)

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128263#comment-15128263 ]

Otis Gospodnetic commented on NUTCH-1314:
-----------------------------------------

We've run into this issue with Nutch 1.x and have modified the patch for Nutch 1.x. Will try adding to JIRA. Would be nice if somebody could commit it.

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854780#comment-13854780 ]

Nguyen Manh Tien commented on NUTCH-1314:
-----------------------------------------

[~lewismc] We are using NUTCH-1314-v3.patch

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.3
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854065#comment-13854065 ]

Lewis John McGibbney commented on NUTCH-1314:
---------------------------------------------

Hi [~otis], which patch are you using... NUTCH-1314-v3.patch? Can you commit, [~otis]?

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.3
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854056#comment-13854056 ]

Otis Gospodnetic commented on NUTCH-1314:
-----------------------------------------

BTW. we are using this now, too. +1 for committing, [~ferdy.g]!

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.3
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694784#comment-13694784 ]

Canan Girgin commented on NUTCH-1314:
-------------------------------------

I tried to test NUTCH-1314-v2.patch, but it removes links with size < 3000. In my opinion, the "if (target.length() > maxTargetLength)" lines in the patch file are not correct. They should be "if (target.length() < maxTargetLength)".
The NUTCH-1314-v2.patch file also uses a new parameter ("parser.html.outlinks.max_target_length"). I think it must be defined in the nutch-default.xml file.
I attached a new patch file. In the ParseUtil class, the target url length is checked before the normalizers and filters. Is that correct?

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.3
>
> Attachments: NUTCH-1314.patch, NUTCH-1314-trunk.patch,
> NUTCH-1314-v2.patch, NUTCH-1314-v3.patch
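The direction of the comparison discussed in the comment above can be illustrated with a minimal sketch. This is not the code from any of the attached patches: the class `OutlinkLengthLimit` and the method `keepShortTargets` are hypothetical names, the outlink targets are plain strings rather than Nutch `Outlink` objects, and treating a target of exactly the limit length as acceptable (`<=`) is an assumption. The limit value itself would come from the "parser.html.outlinks.max_target_length" setting mentioned above; here it is simply passed in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not taken from a NUTCH-1314 patch.
class OutlinkLengthLimit {

    // Default limit named in the issue description.
    static final int DEFAULT_MAX_TARGET_LENGTH = 3000;

    /** Returns the targets whose length does not exceed maxTargetLength. */
    static List<String> keepShortTargets(List<String> targets, int maxTargetLength) {
        List<String> kept = new ArrayList<>();
        for (String target : targets) {
            // Keep a target when it is short enough. Writing the test the other
            // way round (>) would keep only the over-long targets and drop every
            // normal outlink.
            if (target.length() <= maxTargetLength) {
                kept.add(target);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build an artificially long target, well past the default limit.
        StringBuilder tooLong = new StringBuilder("http://example.com/?q=");
        for (int i = 0; i < 4000; i++) {
            tooLong.append('a');
        }
        List<String> targets = Arrays.asList(
                "http://nutch.apache.org/",
                "http://www.apache.org/licenses/",
                tooLong.toString());
        // Only the two short targets are expected to remain.
        System.out.println(keepShortTargets(targets, DEFAULT_MAX_TARGET_LENGTH));
    }
}
```

With the check written this way, a page whose outlinks are all of ordinary length keeps all of them, and only targets longer than the limit are dropped.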
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644269#comment-13644269 ]

Tejas Patil commented on NUTCH-1314:
------------------------------------

Hi Lewis,
I tried to test both the patches. NUTCH-1314-trunk.patch gave compilation errors:
{noformat}
[javac] /home/tejas/Desktop/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:391: error: cannot find symbol
[javac] fixEmbeddedParams(base, target) : new URL(base, target);
[javac] ^
[javac] symbol: method fixEmbeddedParams(URL,String)
[javac] location: class DOMContentUtils
{noformat}

For NUTCH-1314-v2.patch: I used [this|http://nutch.apache.org/about.html] url and ran the HtmlParser parser.

Before applying the patch:
{noformat}
bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser about.html
title: About Apache Nutch
text: About Apache Nutch Apache > Nutch > Home .
outlinks: [toUrl: file:skin/basic.css anchor: , toUrl: file:skin/screen.css anchor: , toUrl: file:skin/print.css anchor: , toUrl: file:skin/profile.css anchor: , toUrl: file:skin/getBlank.js anchor: , toUrl: file:skin/getMenu.js anchor: , toUrl: file:skin/fontsize.js anchor: , toUrl: file:images/favicon.ico anchor: , toUrl: http://www.apache.org/ anchor: Apache, toUrl: http://nutch.apache.org anchor: Nutch, toUrl: http://nutch.apache.org anchor: Home, toUrl: file:skin/breadcrumbs.js anchor: , toUrl: http://www.apache.org/ anchor: , toUrl: file:images/feather-small.gif anchor: , toUrl: http://nutch.apache.org/ anchor: , toUrl: file:images/nutch_logo_tm.gif anchor: , toUrl: file:index.html anchor: Main, toUrl: file:wiki.html anchor: Wiki, toUrl: http://issues.apache.org/jira/browse/NUTCH anchor: Jira, toUrl: file:index.html anchor: News, toUrl: file:credits.html anchor: Credits, toUrl: http://www.apache.org/foundation/thanks.html anchor: Thanks, toUrl: http://www.cafepress.com/nutch/ anchor: Buy Stuff, toUrl: http://www.apache.org/foundation/sponsorship.html anchor: Sponsorship, toUrl: http://www.apache.org/licenses/ anchor: License, toUrl: http://www.apache.org/security/ anchor: Security, toUrl: file:faq.html anchor: FAQ, toUrl: file:wiki.html anchor: Wiki, toUrl: file:tutorial.html anchor: Tutorial, toUrl: file:bot.html anchor: Robot, toUrl: file:apidocs-2.1/index.html anchor: API Docs (2.1), toUrl: file:apidocs-1.6/index.html anchor: API Docs (1.6), toUrl: https://builds.apache.org/job/Nutch-trunk/javadoc/ anchor: API Docs (trunk nightly), toUrl: https://builds.apache.org/job/Nutch-nutchgora/javadoc/ anchor: API Docs (2.x nightly), toUrl: file:downloads.html anchor: Download, toUrl: file:nightly.html anchor: Nightly builds, toUrl: file:sonar.html anchor: Sonar Analysis, toUrl: file:mailing_lists.html anchor: Mailing Lists, toUrl: file:issue_tracking.html anchor: Issue Tracking, toUrl: file:version_control.html anchor: Version Control, toUrl: file:old_downloads.html anchor: Older Downloads, toUrl: http://lucene.apache.org/java/ anchor: Lucene, toUrl: http://hadoop.apache.org/ anchor: Hadoop, toUrl: http://lucene.apache.org/solr/ anchor: Solr, toUrl: http://tika.apache.org/ anchor: Tika, toUrl: http://gora.apache.org anchor: Gora, toUrl: file:skin/images/rc-b-l-15-1body-2menu-3menu.png anchor: , toUrl: file:about.pdf anchor: PDF, toUrl: file:skin/images/pdfdoc.gif anchor: , toUrl: file:about.html#Overview anchor: Overview, toUrl: http://lucene.apache.org/java/ anchor: Apache Lucene, toUrl: http://lucene.apache.org/solr/ anchor: Apache Solr, toUrl: http://tika.apache.org/ anchor: Apache Tika, toUrl: http://hadoop.apache.org/ anchor: Hadoop cluster, toUrl: http://wiki.apache.org/nutch/ anchor: Nutch wiki., toUrl: http://www.apache.org/licenses/ anchor: The Apache Software Foundation. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation.]
{noformat}

After applying the patch:
{noformat}
bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser about.html
title: About Apache Nutch
text: About Apache Nutch Apache > Nutch > Home .
outlinks: []
{noformat}

Correct me if I am wrong: this patch would remove links of size > 3000. The outlinks are not super lengthy and that patch should not have removed those.

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1314.patch, NUTCH-1314-trunk.patch,
> NUTCH-1314-v2.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256438#comment-13256438 ]

Ferdy Galema commented on NUTCH-1314:
-------------------------------------

Exactly. Until that merge is properly implemented we can rely on this quickfix.

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Attachments: NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256437#comment-13256437 ]

Julien Nioche commented on NUTCH-1314:
--------------------------------------

This makes a good case for the merging of URL filters and normalizers (I think there is a JIRA on this) - we wouldn't need to worry about whether the normalizer is called first etc...

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Attachments: NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256431#comment-13256431 ]

Ferdy Galema commented on NUTCH-1314:
-------------------------------------

I understand. I think the problem with implementing it with a urlfilter is that some parts of Nutch run the normalizers first. In the ParseUtil this is the case. Thus with malformed outlinks (of course this is where the majority of new urls are found) this will still be problematic.
It makes sense to run normalizers first. Some urls still have a chance to be fixed (normalized) before they are filtered out. Therefore the scope of this issue is to apply a very crude (but effective) filter before normalizing/filtering code is run.

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Attachments: NUTCH-1314.patch
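The ordering argued for in the comment above (crude length check first, then normalizers, then filters) might look roughly like the following sketch. It is an illustration under stated assumptions, not Nutch code: `OutlinkPipelineSketch` and `process` are hypothetical names, and the normalizer and filter are passed in as plain `java.util.function` stand-ins instead of the real Nutch plugin classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical pipeline for illustration; the real normalizer and filter
// plugins are replaced by simple functional stand-ins.
class OutlinkPipelineSketch {

    static List<String> process(List<String> targets, int maxTargetLength,
                                UnaryOperator<String> normalizer,
                                Predicate<String> filter) {
        List<String> accepted = new ArrayList<>();
        for (String target : targets) {
            // 1. Crude length check first, so an over-long target never
            //    reaches the (potentially regex-based) steps below.
            if (target.length() > maxTargetLength) {
                continue;
            }
            // 2. Normalize next, so a fixable url still gets its chance.
            String normalized = normalizer.apply(target);
            // 3. Filter the normalized form last.
            if (normalized != null && filter.test(normalized)) {
                accepted.add(normalized);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        StringBuilder tooLong = new StringBuilder("http://example.com/");
        for (int i = 0; i < 5000; i++) {
            tooLong.append('a');
        }
        List<String> targets = Arrays.asList(
                "  http://nutch.apache.org/  ", tooLong.toString());
        // Stand-ins: the "normalizer" trims whitespace, the "filter" accepts http urls.
        List<String> accepted = process(targets, 3000,
                String::trim, url -> url.startsWith("http://"));
        System.out.println(accepted);
    }
}
```

The point of the design is that the cost of the first step depends only on the length of the string, so a malformed, very long target is discarded before any expensive processing is attempted.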
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256423#comment-13256423 ]

Julien Nioche commented on NUTCH-1314:
--------------------------------------

I was under the impression that the patch did not remove the URL but substituted it with a shorter version. If the idea is to remove the URL altogether (which makes perfect sense) then yes it should be a URLFilter instead

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Attachments: NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256398#comment-13256398 ]

Ferdy Galema commented on NUTCH-1314:
-------------------------------------

I assume you mean a URLFilter? Or do you want to correct the length by cutting off the excess part? I think the urls should be rejected, because they probably were malformed anyway.

> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Attachments: NUTCH-1314.patch
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256393#comment-13256393 ] Julien Nioche commented on NUTCH-1314: -- What about doing this with a URLNormalizer (and making it the first to be called)?
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231373#comment-13231373 ] Ferdy Galema commented on NUTCH-1314: - Good one, I overlooked those but they should definitely be treated the same way.
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231368#comment-13231368 ] Markus Jelsma commented on NUTCH-1314: -- This should then also work for the Tika parser and the OutlinkExtractor, I think. Parse-html is similar to parse-tika: if no outlinks are obtained by getOutlinks in DOMContentUtils, then the OutlinkExtractor is used.
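One way to treat every extraction path the same, as suggested above, is a single shared helper applied to the outlink targets no matter which parser produced them. The sketch below is only an assumption about how that might be structured: `OutlinkLengthLimiter` and `dropOverlong` are hypothetical names, and it works on plain strings rather than Nutch's own outlink objects.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper: applies one length limit to outlink targets,
 * independent of the parser (HTML, Tika, or plain-text extraction) that found them.
 */
final class OutlinkLengthLimiter {

  private OutlinkLengthLimiter() {
    // Static utility class; not meant to be instantiated.
  }

  /** Returns a new list containing only the non-null targets within maxLength. */
  static List<String> dropOverlong(List<String> targets, int maxLength) {
    List<String> kept = new ArrayList<>();
    for (String target : targets) {
      if (target != null && target.length() <= maxLength) {
        kept.add(target);
      }
    }
    return kept;
  }
}
```

Calling the same helper from each extraction path keeps the limit consistent and avoids one parser letting through URLs that another would drop.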