[jira] [Commented] (NUTCH-1558) CharEncodingForConversion in ParseData's ParseMeta, not in ParseData's ContentMeta
[ https://issues.apache.org/jira/browse/NUTCH-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634183#comment-13634183 ] Ken Krugler commented on NUTCH-1558:

I don't see the patch, but the HTML parsing support in Tika has similar support for dealing with ambiguous charset identification/detection - is there any way to leverage that?

> CharEncodingForConversion in ParseData's ParseMeta, not in ParseData's ContentMeta
> --
>
> Key: NUTCH-1558
> URL: https://issues.apache.org/jira/browse/NUTCH-1558
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.7
>
> This patch from GitHub user ysc fixes two bugs related to character encoding:
> * CharEncodingForConversion in ParseData's ParseMeta, not in ParseData's ContentMeta
> * if the HTTP response header Content-Type returns the wrong encoding, then get the encoding from the original content of the page
> Information about this pull request is here: http://s.apache.org/VOP

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Important : Bunch of Spam Created under Nutch Wiki!!
Hi Kiran,

I was just chatting w/Steve Rowe, who handled this for the Solr project. He said:

> It took less than a day, but I went on the #asfinfra IRC channel and asked some questions about the process, which may have gotten Gavin McDonald to move on it sooner.

Since we're still getting slammed with spam, it might be worthwhile to do the same.

Thanks,

-- Ken

On Apr 1, 2013, at 12:30pm, kiran chitturi wrote:

> I have posted the information on the JIRA issue page [0]. Let's hope the issue will be taken care of soon.
>
> [0] - https://issues.apache.org/jira/browse/INFRA-6081
>
> On Mon, Apr 1, 2013 at 3:27 PM, Lewis John Mcgibbney wrote:
> Hi Kiran,
>
> On Mon, Apr 1, 2013 at 6:53 AM, wrote:
> Re: Important : Bunch of Spam Created under Nutch Wiki!!
> 22926 by: kiran chitturi
>
> Hi guys,
>
> Do you know what is the destination for commit mails? Can I give 'dev@nutch.apache.org'?
>
> No, we should put commit emails to the static archive here:
> http://mail-archives.apache.org/mod_mbox/nutch-commits/
>
> Thanks for sorting this out Kiran, we are truly getting hounded with spam just now.
> Best
> Lewis
>
> --
> Kiran Chitturi

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
Re: Important : Bunch of Spam Created under Nutch Wiki!!
Hi Kiran,

On Mar 28, 2013, at 2:03am, kiran chitturi wrote:

> Thank you Ken for the information. I think the access is already restricted to Contributors Only. Someone can please confirm, if it is not.

It's not, as far as I know. I just created a fake account, logged in with it, and edited the front page.

> If anyone needs to edit the wiki, they would need to ask someone to get access to wiki pages.
>
> Do you know if Solr still got hit by spam after locking down the wiki?

I think that change helped cut down most of the spam, but I don't monitor the Solr list that closely, sorry.

-- Ken

> On Thu, Mar 28, 2013 at 1:40 AM, Ken Krugler wrote:
>
> On Mar 27, 2013, at 6:54pm, kiran chitturi wrote:
>
>> Thank you Binoy for reporting.
>>
>> We have been monitoring the pages and deleting them when we get time but there are more coming up. Today, I have seen a spam edit on the home page of the Nutch wiki. It has inserted spam links under tutorials.
>>
>> We need to find a permanent solution to this. I wonder if any other list-servs are facing the same issue.
>
> Yes - Solr recently had to lock down editing on their wiki:
>
>> The wiki at http://wiki.apache.org/solr/ has come under attack by spammers more frequently of late, so the PMC has decided to lock it down in an attempt to reduce the work involved in tracking and removing spam.
>>
>> From now on, only people who appear on http://wiki.apache.org/solr/ContributorsGroup will be able to create/modify/delete wiki pages.
>>
>> Please request either on the solr-u...@lucene.apache.org or on d...@lucene.apache.org to have your wiki username added to the ContributorsGroup page - this is a one-time step.
>
> So I think you need to make a request to Infra to lock down the wiki, then add people (generally in response to explicit requests) to the ContributorsGroup page.
> -- Ken
>
>> On Thu, Mar 28, 2013 at 12:49 AM, Binoy d wrote:
>> I am quite surprised looking at the notifications I am getting for new pages for the Nutch Wiki.
>> Example:
>> http://wiki.apache.org/nutch/KarlPuent
>>
>> I see at least 25-35 emails regarding such notifications.
>>
>> All of the links I got are rooted under http://wiki.apache.org/nutch/
>>
>> Is someone looking into this? If needed I can gladly forward emails to the person cleaning it up, as I am not sure if everyone has access to delete the pages.
>>
>> Regards,
>> b
>>
>> -- Forwarded message --
>> From: Apache Wiki
>> Date: Wed, Mar 27, 2013 at 9:32 PM
>> Subject: [Nutch Wiki] Trivial Update of "EdwinaBro" by EdwinaBro
>> To: Apache Wiki
>>
>> Dear Wiki user,
>>
>> You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
>>
>> The "EdwinaBro" page has been changed by EdwinaBro:
>> http://wiki.apache.org/nutch/EdwinaBro
>>
>> New page:
>> I am 24 years old and my name is Edwina Brownlee. I life in Corjolens (Switzerland).<>
>> <>
>> <>
>> Take a look at my web-site ... [[http://modform.org/SolomonKr|Continue]]
>>
>> --
>> Kiran Chitturi
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
> --
> Kiran Chitturi

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
Re: Important : Bunch of Spam Created under Nutch Wiki!!
On Mar 27, 2013, at 6:54pm, kiran chitturi wrote:

> Thank you Binoy for reporting.
>
> We have been monitoring the pages and deleting them when we get time but there are more coming up. Today, I have seen a spam edit on the home page of the Nutch wiki. It has inserted spam links under tutorials.
>
> We need to find a permanent solution to this. I wonder if any other list-servs are facing the same issue.

Yes - Solr recently had to lock down editing on their wiki:

> The wiki at http://wiki.apache.org/solr/ has come under attack by spammers more frequently of late, so the PMC has decided to lock it down in an attempt to reduce the work involved in tracking and removing spam.
>
> From now on, only people who appear on http://wiki.apache.org/solr/ContributorsGroup will be able to create/modify/delete wiki pages.
>
> Please request either on the solr-u...@lucene.apache.org or on d...@lucene.apache.org to have your wiki username added to the ContributorsGroup page - this is a one-time step.

So I think you need to make a request to Infra to lock down the wiki, then add people (generally in response to explicit requests) to the ContributorsGroup page.

-- Ken

> On Thu, Mar 28, 2013 at 12:49 AM, Binoy d wrote:
> I am quite surprised looking at the notifications I am getting for new pages for the Nutch Wiki.
> Example:
> http://wiki.apache.org/nutch/KarlPuent
>
> I see at least 25-35 emails regarding such notifications.
>
> All of the links I got are rooted under http://wiki.apache.org/nutch/
>
> Is someone looking into this? If needed I can gladly forward emails to the person cleaning it up, as I am not sure if everyone has access to delete the pages.
> Regards,
> b
>
> -- Forwarded message --
> From: Apache Wiki
> Date: Wed, Mar 27, 2013 at 9:32 PM
> Subject: [Nutch Wiki] Trivial Update of "EdwinaBro" by EdwinaBro
> To: Apache Wiki
>
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
>
> The "EdwinaBro" page has been changed by EdwinaBro:
> http://wiki.apache.org/nutch/EdwinaBro
>
> New page:
> I am 24 years old and my name is Edwina Brownlee. I life in Corjolens (Switzerland).<>
> <>
> <>
> Take a look at my web-site ... [[http://modform.org/SolomonKr|Continue]]
>
> --
> Kiran Chitturi

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564057#comment-13564057 ] Ken Krugler commented on NUTCH-1465:

Hi Tejas - the original code didn't, but I checked and now remember that I added support for multiple sitemap URLs to BaseRobotRules in CC.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564019#comment-13564019 ] Ken Krugler commented on NUTCH-1465:

Hi Tejas - I thought the current CC robots parsing code was already extracting the sitemap links. Or is the above comment ("modified the robots parsing code to extract the links to sitemap pages") a change to the current Nutch robots parsing code?

I do remember thinking that the CC version would need to change to support multiple Sitemap links, even though it wasn't clear whether that was actually valid.

-- Ken

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560877#comment-13560877 ] Ken Krugler commented on NUTCH-1031:

I've rolled this into trunk at crawler-commons. Next step is to roll a release. Not sure when I'll get to that, but it's on my list for this week.

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
> Issue Type: Task
> Reporter: Julien Nioche
> Assignee: Tejas Patil
> Priority: Minor
> Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, CC.robots.multiple.agents.v2.patch, NUTCH-1031-trunk.v2.patch, NUTCH-1031.v1.patch
>
> We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560420#comment-13560420 ] Ken Krugler commented on NUTCH-1031:

Hi Tejas,

I've been on the road, but I'll check out your patch when I return to my office tomorrow. Thanks for updating it with a test case!

-- Ken

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
> Issue Type: Task
> Reporter: Julien Nioche
> Assignee: Tejas Patil
> Priority: Minor
> Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, CC.robots.multiple.agents.v2.patch, NUTCH-1031-trunk.v2.patch, NUTCH-1031.v1.patch
>
> We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558400#comment-13558400 ] Ken Krugler commented on NUTCH-1031:

Regarding precedence - my guess is that it's not very important, as I haven't seen many (any?) robots.txt files where it would match the same robot, using related names, in rules blocks with different rules. This issue of precedence is specific to Nutch users, however (not part of the robots.txt RFC), so I'd suggest posting to the Nutch users list to see if anyone thinks it's important.

As far as your review of the CC code, yes it's correct. There's one additional wrinkle in that the target user agent name is split on spaces, due to what appears to be an implicit expectation that you can use a user agent name with spaces (which based on the RFC isn't actually valid) and any piece of the name will match.

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
> Issue Type: Task
> Reporter: Julien Nioche
> Assignee: Tejas Patil
> Priority: Minor
> Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch
>
> We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558340#comment-13558340 ] Ken Krugler commented on NUTCH-1031:

Hi Tejas - I've looked at your patch, and (assuming there's not a requirement to support precedence in the user agent name list) it seems like a valid change. Based on the RFC (http://www.robotstxt.org/norobots-rfc.txt) robot names shouldn't have commas, so splitting on that seems safe.

Do you have a unit test to verify proper behavior? If so, I'd be happy to roll that into CC.

-- Ken

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
> Issue Type: Task
> Reporter: Julien Nioche
> Assignee: Tejas Patil
> Priority: Minor
> Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch
>
> We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546089#comment-13546089 ] Ken Krugler commented on NUTCH-1031:

Based on my reading of the robots.txt RFC ("The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring."), this seems like the User-Agent name (what's in the robots.txt file) is searched for a substring that matches the robot name token (what the caller is using).

So that means in CC we'd either need to assume that a robot name _never_ contains a comma (and we split the caller-provided name) or we add a new API where you pass in a list of robot names. Thoughts?

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
> Issue Type: Task
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Priority: Minor
> Labels: robots.txt
> Fix For: 1.7
>
> Attachments: NUTCH-1031.v1.patch
>
> We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly
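The first option above (split the caller-provided name on commas, then apply the RFC's substring rule) can be sketched in a few lines of plain Java. This is a hypothetical illustration - the class and method names are invented, and it is not the actual crawler-commons implementation:

```java
import java.util.Locale;

// Hypothetical sketch of comma-split matching: the caller passes
// "name1,name2,..."; a robots.txt User-Agent line matches if its value
// contains any one of those tokens as a substring (per the RFC).
public class RobotNameMatcher {

    public static boolean matches(String userAgentLine, String robotNames) {
        String ua = userAgentLine.toLowerCase(Locale.ROOT);
        for (String name : robotNames.split(",")) {
            String token = name.trim().toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && ua.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "Download Ninja" contains the token "download ninja" as a substring
        System.out.println(matches("Download Ninja", "download ninja,*")); // true
        System.out.println(matches("Googlebot/2.1", "mybot,nutch"));       // false
    }
}
```

The alternative API Ken mentions (passing a `List<String>` of names) would make the "names never contain commas" assumption unnecessary, at the cost of a signature change.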
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448774#comment-13448774 ] Ken Krugler commented on NUTCH-1465:

Hi Lewis,

Just to be clear, I think the dead horse is trying to get people interested in porting their code to crawler-commons, and then switching existing functionality to rely on cc.

For anything new (like sitemap parsing) I think it's a no-brainer to use cc, unless the API is totally borked. E.g. if you didn't, then you wouldn't have picked up our BOM fix.

-- Ken

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.1
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447797#comment-13447797 ] Ken Krugler commented on NUTCH-1465:

Hi Lewis - I could start a thread, but I also don't want to flog a dead horse :)

I'm spending occasional small amounts of time trying to move code from Bixo over to CC, and the plan is for the 0.9 release of Bixo to switch over to using CC where possible. But the lack of excitement among Droids, Heritrix, Common Crawl, Nutch, etc. has made it pretty clear getting widespread adoption would be an uphill battle, one that I don't currently have the time to fight.

-- Ken

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.1
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447700#comment-13447700 ] Ken Krugler commented on NUTCH-1465:

The sitemap parsing code referenced in the discussion you note has been placed in crawler-commons. We just finished using it during a crawl (fixed one bug, dealing with sitemaps that have a BOM) and it worked fine for the sites we were crawling.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.1
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
Re: bug in parse-tika or Tika RTFParser?
Hi Lewis,

[Moving to the dev list]

For many Tika parsers, the text you get back from the document starts with the title (if any), and then contains the body. So I'm wondering if what you're seeing in the test failure is that the parse.getText() result is actually "test rtf document\nThe quick brown fox…"

-- Ken

On Aug 15, 2012, at 12:49pm, Lewis John Mcgibbney wrote:

> Hi,
>
> For some time (in 2.x) we have commented out this test as it was waiting for TIKA-748 to be resolved... which now has been resolved, however I'm getting some confusing output when trying to resurrect the test!
>
> So @line 105 we do
>
> String text = parse.getText();
> assertEquals("The quick brown fox jumps over the lazy dog", text.trim());
>
> But I was wanting to implement the suggested test for title e.g.
>
> String title = parse.getTitle();
> String text = parse.getText();
> assertEquals("test rft document", title);
> assertEquals("The quick brown fox jumps over the lazy dog", text.trim());
>
> The test fails on the 2nd assertion with the following
>
> Testcase: testIt took 5.668 sec
> FAILED
> null expected:<[The quick brown fox jumps over the lazy dog]> but was:<[test rft document]>
> junit.framework.ComparisonFailure: null expected:<[The quick brown fox jumps over the lazy dog]> but was:<[test rft document]>
> at org.apache.nutch.parse.tika.TestRTFParser.testIt(TestRTFParser.java:)
>
> So this looks like parse.getText() returns the same (in this instance) as parse.getTitle()... which smells like rotting herring to me.
>
> Any immediate thoughts whether this is a known problem in the Tika RTF parser, parse-tika's DomContentUtils class or somewhere in between?
>
> Thank you
>
> Lewis

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
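If Ken's hypothesis is right, the extracted text would be the title followed by a newline and then the body, so comparing the whole trimmed text against only the body would fail while the first line would equal the title. A tiny self-contained illustration of that hypothesis (plain Java, not actual Tika or Nutch code):

```java
// Illustrates the title-prefixed-text hypothesis (an assumption from the
// thread, not verified against the Tika source).
public class TitleInTextDemo {

    // Return everything before the first newline (or the whole string).
    public static String firstLine(String text) {
        int nl = text.indexOf('\n');
        return nl < 0 ? text : text.substring(0, nl);
    }

    public static void main(String[] args) {
        String text = "test rft document\nThe quick brown fox jumps over the lazy dog";
        // The full trimmed text is NOT equal to just the body...
        System.out.println(text.trim().equals("The quick brown fox jumps over the lazy dog")); // false
        // ...but the first line is the title.
        System.out.println(firstLine(text)); // test rft document
    }
}
```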
[jira] [Commented] (NUTCH-1455) RobotRulesParser to match multi-word user-agent names
[ https://issues.apache.org/jira/browse/NUTCH-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435123#comment-13435123 ] Ken Krugler commented on NUTCH-1455:

I added a test to crawler-commons to confirm that its robots.txt parser handles this correctly :)

> RobotRulesParser to match multi-word user-agent names
> -
>
> Key: NUTCH-1455
> URL: https://issues.apache.org/jira/browse/NUTCH-1455
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 1.5.1
> Reporter: Sebastian Nagel
> Fix For: 1.6
>
> If the user-agent name(s) configured in http.robots.agents contains spaces it is not matched even if it is exactly contained in the robots.txt
> http.robots.agents = "Download Ninja,*"
> If the robots.txt (http://en.wikipedia.org/robots.txt) contains
> {code}
> User-agent: Download Ninja
> Disallow: /
> {code}
> all content should be forbidden. But it isn't:
> {code}
> % curl 'http://en.wikipedia.org/robots.txt' > robots.txt
> % grep -A1 -i ninja robots.txt
> User-agent: Download Ninja
> Disallow: /
> % cat test.urls
> http://en.wikipedia.org/
> % bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt test.urls 'Download Ninja'
> ...
> allowed: http://en.wikipedia.org/
> {code}
> The RFC (http://www.robotstxt.org/norobots-rfc.txt) states that
> bq. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.
> Assumed that "Download Ninja" is a substring of itself, it should match and http://en.wikipedia.org/ should be forbidden.
> The point is that the agent name from the User-Agent line is split at spaces while the names from the http.robots.agents property are not (they are only split at ",").

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433278#comment-13433278 ] Ken Krugler commented on NUTCH-1233:

Hi Markus - two questions. First, is the current Tika (1.1) outlink extraction support sufficient? Second, do you think whitespace trimming should happen in Tika or externally? I'm not sure, as I guess there might be an issue where somebody wants to extract the same anchor text as what was in the HTML, but that seems odd.

> Rely on Tika for outlink extraction
> ---
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, NUTCH-1233-1.6-2.patch
>
> Tika provides outlink extraction features that are not used in Nutch. To be able to use it in Nutch we need Tika to return the rel attr value of each link, which it currently doesn't. There's a patch for Tika 1.1. If that patch is included in Tika and we upgrade to that new version this issue can be worked on. Here's preliminary code that does both Tika and current outlink extraction. This also includes parts of the Boilerpipe code.
[jira] [Commented] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
[ https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405223#comment-13405223 ] Ken Krugler commented on NUTCH-1418:

The path is invalid, so Nutch emitting a warning is fine. If Nutch subsequently bails on processing URLs for such a web site, then that would be a problem - but I don't think that's the case here, as it's being logged as a warning, not an error, and it obviously keeps processing the file (since you get three such warnings).

Are you _sure_ that the reason Nutch isn't fetching is caused by this issue with robots.txt? I'm pretty sure many people use Nutch to crawl Wikipedia.

> error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>
> Key: NUTCH-1418
> URL: https://issues.apache.org/jira/browse/NUTCH-1418
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Arijit Mukherjee
>
> Since learning that nutch will be unable to crawl the javascript function calls in href, I started looking for other alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
> I first tried injecting this URL and following the step-by-step approach till the fetcher - when I realized nutch did not fetch anything from this website. I tried looking into logs/hadoop.log and found the following 3 lines - which I believe could be saying that nutch is unable to parse the robots.txt on the website and therefore the fetcher stopped?
>
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
>
> I tried checking the URL using parsechecker and no issues there! I think it means that the robots.txt is malformed for this website, which is preventing the fetcher from fetching anything. Is there a way to get around this problem, as parsechecker seems to go on its merry way parsing.
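The `%3M` sequences in those warnings are invalid percent-escapes ('M' is not a hex digit), which is why decoding fails. One robust approach is a lenient decoder that keeps invalid escapes literal instead of erroring out, so one bad line in robots.txt doesn't generate warnings or abort rule parsing. This is a hypothetical sketch (not Nutch's actual RobotRulesParser code, and it only handles single-byte escapes, not multi-byte UTF-8 sequences):

```java
// Hypothetical lenient percent-decoder: valid escapes like %20 are decoded,
// while malformed or truncated escapes like %3M or a trailing % are copied
// through unchanged.
public class LenientPathDecoder {

    public static String decode(String path) {
        StringBuilder out = new StringBuilder(path.length());
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()
                    && isHex(path.charAt(i + 1)) && isHex(path.charAt(i + 2))) {
                out.append((char) Integer.parseInt(path.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits
            } else {
                out.append(c); // invalid or truncated escape: keep as-is
            }
        }
        return out.toString();
    }

    private static boolean isHex(char c) {
        return Character.digit(c, 16) >= 0;
    }

    public static void main(String[] args) {
        // %3M is invalid, so the path passes through untouched
        System.out.println(decode("/wiki/Wikipedia%3Mediation_Committee/"));
        // %20 is valid and becomes a space
        System.out.println(decode("/a%20b")); // /a b
    }
}
```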
[jira] [Commented] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties
[ https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295683#comment-13295683 ] Ken Krugler commented on NUTCH-1397:

Should this issue be filed against Tika, versus Nutch? Or is this specific to language identification that's still part of Nutch? Sorry, but I haven't been keeping up with the state of migrating functionality to Tika.

> language-identifier incorrectly handles double-barreled language properties
> ---
>
> Key: NUTCH-1397
> URL: https://issues.apache.org/jira/browse/NUTCH-1397
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.6, 2.1
>
> Currently when language-identifier is activated it parses and identifies language-type=en, however it does not identify en-GB or en-US. This issue should correct that.
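One plausible shape for the fix the issue describes is to reduce a double-barreled language tag like "en-GB" or "en_US" to its primary language subtag before comparison. This is a hypothetical sketch (invented class name, not the actual patch):

```java
import java.util.Locale;

// Hypothetical normalizer: keep only the primary language subtag of a
// BCP 47-style tag, accepting both "-" and "_" as separators.
public class LangTagNormalizer {

    public static String primarySubtag(String tag) {
        if (tag == null) return "";
        String t = tag.trim().toLowerCase(Locale.ROOT);
        int dash = t.indexOf('-');
        int under = t.indexOf('_');
        int cut = dash;
        if (under >= 0 && (cut < 0 || under < cut)) {
            cut = under;
        }
        return cut < 0 ? t : t.substring(0, cut);
    }

    public static void main(String[] args) {
        System.out.println(primarySubtag("en-GB")); // en
        System.out.println(primarySubtag("en_US")); // en
        System.out.println(primarySubtag("en"));    // en
    }
}
```

With this normalization, a page declaring `en-GB` would match rules and statistics keyed on plain `en`.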
Re: Detecting Encoding with plugins
On Feb 14, 2012, at 2:34pm, Lewis John Mcgibbney wrote: > It's in HTMLParser#private static String sniffCharacterEncoding > > I'm still wondering where TikaParser gets the character encoding from though? FYI, the individual Tika parsers have their own detection logic. The HTML parser, for example, uses the response headers and metadata tags in addition to ICU's statistical method. That's something I'm still working on cleaning up, but haven't made much progress in the past few months. -- Ken > Additionally, this doesn't look like something we check for in our JUnit > classes? If we don't then I would like to write some tests to test for this. > > I am working on Any23 tests first, so this provides the justification behind > my question. > > Thanks > > Lewis > > On Tue, Feb 14, 2012 at 10:00 PM, Lewis John Mcgibbney > wrote: > Hi, > > I can't see anywhere within our parser plugins where we detect encoding of > documents. I've also begun looking through the o.a.n.p package but again I > can't see anything. > > Can anyone provide some detail on this please? > > Thank you > > Lewis > > > > -- > Lewis > > > > > -- > Lewis > -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: tika-core, tika-parser
On Feb 8, 2012, at 5:28am, Markus Jelsma wrote: > > > On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote: >> sorry don't understand what your issue is. We have a dependency on >> tika-parsers and the actual parser implementations (listed in tika parsers' >> POM) are pulled transitively just like any other dependency managed by Ivy. >> They end up being copied in runtime/local/plugins/parse-tika/ or put in >> the job in runtime/deploy/ > > My problem is that i am working on some code for Tika-parsers 1.1-SNAPSHOT > that i need to use in Nutch. However, when i build tika-parsers and put it in > Nutch' lib directory i still seem to be missing dependencies. Then trouble > begins: I don't know anything about how Nutch handles jars in its lib directory, but this sounds like you have a "raw" jar (tika-parsers) without its pom.xml. So then Ivy (or Maven) doesn't know about the transitive dependencies on other jars, which are needed to implement the actual parsing support. -- Ken > > Exception in thread "main" java.lang.NoClassDefFoundError: Could not > initialize class org.apache.tika.parser.dwg.DWGParser >at java.lang.Class.forName0(Native Method) >at java.lang.Class.forName(Class.java:247) >at sun.misc.Service$LazyIterator.next(Service.java:271) >at org.apache.nutch.parse.tika.TikaConfig.(TikaConfig.java:149) >at > org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211) >at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:254) >at > org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162) >at > org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132) >at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71) >at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101) >at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138) > > Nick told me to remove DWG from the org.apache.tika.parsers.Parsers config > file, 
which i did. But then other dependency issues come and go. The more > parsers i remove from the config file the better it goes, but then Tika won't > build anymore because of failing tests. > > I asked this on the Nutch list because i wasn't sure anymore how Nutch deals > with these its own deps, which you explained well. > > I'll give up for now :) > > > >> >> On 8 February 2012 13:03, Markus Jelsma wrote: >>> Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's >>> something else. >>> >>> dependencies, dependencies, dependencies :( >>> >>> On Wednesday 08 February 2012 14:04:26 Julien Nioche wrote: >>>> The dependencies for the plugins are defined locally as shown in the >>>> URL below, where you can see the ref to tika-parsers for parse-tika. >>>> Is that more clear for you Markus? >>>> >>>> On 8 February 2012 12:58, Lewis John Mcgibbney >>> >>> wrote: >>>>> Hi Markus, >>>>> >>>>> For starters >>> >>> http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?vi >>> >>>>> ew=markup >>>>> >>>>> Can we pick our way through this? >>>>> >>>>> Thanks >>>>> >>>>> >>>>> On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma >>>>> >>>> >>>>>> wrote: >>>>>> Hi, >>>>>> >>>>>> Can anyone shed light on this? We don't have any parsers in our libs >>> >>> dir >>> >>>>>> and >>>>>> we don't have tika-parsers jar, only the tika-core jar. Where are >>>>>> the parsers >>>>>> and how does this all work? >>>>>> >>>>>> I've posted a question (same subject) on the Tika list and Nick >>>>>> tells >>> >>> me >>> >>>>>> there >>>>>> must be parsers somewhere. Well, i have no idea how we do it in >>>>>> Nutch, do you? >>>>>> >>>>>> Thanks >>>>> >>>>> -- >>>>> *Lewis* >>> >>> -- >>> Markus Jelsma - CTO - Openindex > > -- > Markus Jelsma - CTO - Openindex -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
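As Julien explains, the transitive parser jars are declared in the plugin-level ivy.xml rather than in Nutch's top-level lib. An entry along these lines (the revision shown is illustrative, not the version Nutch actually pinned) is what lets Ivy pull the parser implementations transitively:

```xml
<!-- Illustrative sketch of a plugin-level Ivy dependency; the rev here is
     an example, not the version pinned by parse-tika at the time. -->
<ivy-module version="1.0">
  <dependencies>
    <dependency org="org.apache.tika" name="tika-parsers" rev="1.1"
                conf="*->default"/>
  </dependencies>
</ivy-module>
```

This is why dropping a bare tika-parsers jar into the lib directory fails: without the accompanying POM, neither Ivy nor Maven can resolve its transitive dependencies.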
Re: [DISCUSS] Issues with Fetcher
Hi Eddie, My own personal favorite area would be to integrate with crawler-commons. There's been some occasional work done to move things into this shared project - e.g. robots parser & a base HTTP fetcher from Bixo. I believe there's a Jira issue open to switch Nutch to using that robots.txt parser, which would be an improvement over what Nutch currently has. There are other pieces of Nutch that could/eventually should be moved there, e.g. URL normalization, but that doesn't directly benefit Nutch, just other Java-based crawlers. Or, if you have experience with JSPs/GUI work, then I think there's this big open issue around improving the Nutch GUI, which would likely provide the most benefit to the most users. I haven't been following the current status, but I know that there have been periodic discussions, and I think 101tec did some work on this a while back (for a client), but I don't know if that's been contributed (or could be, for that matter). -- Ken On Jan 21, 2012, at 8:17am, Edward Drapkin wrote: > On 1/21/2012 8:27 AM, Lewis John Mcgibbney wrote: >> >> Hi Julien, >> >> >> There are 8 issues in trunk about the fetcher - some of them unrelated to >> the Fetcher (NUTCH-827 / Nutch-1193) with most of the others being >> improvements (NUTCH-828 / NUTCH-1079) with possibly just a very few being >> real issues. >> >> This puts the whole discussion into much better context, thanks for pointing >> this out. Maybe I should have made it more clear, that I only filtered the >> fetcher issues on our Jira and I was simply modelling my discussion around >> that. You are completely correct though, it would be different if the >> fetcher was in a similar state to protocol-httpclient... which it is >> obviously not. >> >> I am also concerned about getting too radical changes to such a core part of >> the framework, especially when more pressing issues could be looked after >> instead. 
>> +1 >> >> Having said that if someone can come up with an interesting proposal for >> improving the Fetcher that would be very good, I would simply suggest that >> we then have a separate implementation for that. >> +1 >> >> >> >> Ok with this in mind then, is there some guidance we can communicate to >> Eddie? He has specifically mentioned that he shares similar opinions wrt the >> fetcher being a core part of Nutch, radical changes etc, and I also share >> this point of view. He has also added that he doesn't want to spend the time >> changing material which we may or may not merge with trunk, this also makes >> perfect sense. Additionally Ken's comments emphasise that this has been >> somewhat attempted in the past and that lessons have been learned and the >> implementation we have cuts the mustard as is. >> Maybe we could nudge Eddie in the right direction, which would benefit both >> himself and the project over the next while, I think this was the most >> important point I was trying to emphasise, however looking over my original >> comment this was maybe not how it was written. >> >> Thanks >> Lewis > > If there's more important and/or interesting things for me to work on, I'll > be glad to. I'm completely unfamiliar with the current state of the project > as a whole - and looking through JIRA is a bit daunting. The only reason I'm > attracted to working on the fetcher is I think it's a really interesting and > compelling problem to solve, and it's making it more flexible is something > that would directly benefit our use for it, so it will be easier to devote > time to it while I'm at the office. I do have a glut of free time at the > moment though, so I'm perfectly okay working on another area that's more > pressing - I just don't know what it is. I saw that protocol-httpclient > needs to be rewritten, is there someone working on that? 
> > I can work on more important and less controversial / radical things, but I > do think that having a more flexible, pluggable fetcher will be an enormous > improvement to Nutch and can greatly expand the potential uses for it as a > piece of software. There's a ton of cases where pluggable fetching could > make a huge improvement: local filesystem search, single-threaded / small > site indexing, email indexing (SMTP, POP, etc.), etc. I suggested an > extremely (perhaps too much so) abstract architecture for fetching in ticket > #1201, and for the sake of brevity I won't repeat myself here, but I think > that would give Nutch a good base for flexible fetching, which I believe is a > huge improvement to the project. I'm obviously new to the development here > and I'm willing to do whatever needs doing, I just believe the fetching is > something that needs doing. I just want to contribute! > > Thanks, > Eddie -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190050#comment-13190050 ] Ken Krugler commented on NUTCH-1201: My 2 cents, based on ancient history. We extended Nutch in several ways during my Krugle startup, and in general the experience wound up being pretty painful. Even with the help of Andrzej and Stefan Groschupf (two very knowledgeable Nutch developers), we wound up spinning our wheels. Part of the problem was the monolithic nature of Nutch, which made (makes?) it hard to extend in ways beyond plugin extension points that don't need to do much other than output different results for the same input data. My thought here is that I'd look at having a very high level extension point - e.g. "I've got a fetch list (generated by other Nutch code) in the segment, and now I need to process that list, with the end result being data in new sub-dirs in the segment". But keep the fetcher around as a re-usable component (see crawler-commons for one version from Bixo). Then if you want to do some crazy crawl-3-deep, you can craft your own solution (which might not even use map-reduce). -- Ken PS - my personal bias is to implement custom solutions using Cascading & reusable Java classes, but I know that doesn't fit well with the more common user of Nutch, where "programming by XML" (configuration only) seems to be the sweet spot. > Allow for different FetcherThread impls > --- > > Key: NUTCH-1201 > URL: https://issues.apache.org/jira/browse/NUTCH-1201 > Project: Nutch > Issue Type: New Feature > Components: fetcher >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.5 > > > For certain cases we need to modify parts in FetcherThread and make it > pluggable. This introduces a new config directive fetcher.impl that takes a > FQCN and uses that setting in Fetcher.fetch to load a class to use for > job.setMapRunnerClass(). 
This new class has to extend Fetcher and an inner > class FetcherThread. This allows for overriding methods in FetcherThread but > also methods in Fetcher itself if required. > A follow up on this issue would be to refactor parts of FetcherThread to make > it easier to override small sections instead of copying the entire method > body for a small change, which is now the case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
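The FQCN-plus-reflection mechanism the issue describes boils down to a standard pattern, sketched here with illustrative names (this is not the NUTCH-1201 patch itself): load the configured class by name and verify it is a subtype of the expected base before handing it to the job.

```java
// Illustrative sketch (hypothetical names, not the NUTCH-1201 patch):
// load a configured fully-qualified class name reflectively and verify
// it is a subtype of the expected base class before using it.
public class PluggableLoader {
    public static <T> Class<? extends T> loadImpl(String fqcn, Class<T> base) {
        try {
            Class<?> clazz = Class.forName(fqcn);
            if (!base.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(
                    fqcn + " does not extend " + base.getName());
            }
            return clazz.asSubclass(base);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("No such class: " + fqcn, e);
        }
    }
}
```

In the Fetcher case, the loaded class would then be passed to job.setMapRunnerClass(), as the issue description states.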
Re: [DISCUSS] Issues with Fetcher
Thanks for the poke - I'd started writing up a comment to that issue, but got sidetracked by the day job. -- Ken On Jan 20, 2012, at 9:16am, Lewis John Mcgibbney wrote: > Hi Everyone, > > Since Eddie decided to chap in on the dev lists/Jira we have not been able to > get back to him. I'm referring specifically to NUTCH-1201 and his comments > therewith. > > Doing a quick rekkie on the current fetcher issues I can see 32 issues with 7 > of them claiming to be patched up... this kinda indicates that although there > are underlying problems with the fetcher we are currently not getting the > time to address them. It also indicates that there is quite a bit of work to > be done with the fetcher... > > Has anyone had time to consider Eddie's comments or proposals for taking the > work forward. The last thing we would like to see is him allocating his time > elsewhere if we could have a real go at building a more appropriate fetcher > architecture (plugable, etc). > > I was thinking to myself all week that we would seriously be passing up an > opportunity if we didn't try to act on this one. > > Thanks guys. > > -- > Lewis > -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: ANT+MAVEN (was: Nutch Maven build)
t; > > being. > > > Thanks > > > Lewis > > > [1] https://builds.apache.org/view/M-R/view/Nutch/job/nutch-trunk-maven/3/console > > -- > > Markus Jelsma - CTO - Openindex > > http://www.linkedin.com/in/markus17 > > 050-8536620 / 06-50258350 > > -- > > Lewis > ++ > Chris Mattmann, Ph.D. > Senior Computer Scientist > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA > Office: 171-266B, Mailstop: 171-246 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Adjunct Assistant Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > ++ > -- > Open Source Solutions for Text Engineering > http://digitalpebble.blogspot.com/ > http://www.digitalpebble.com -- Ken Krugler http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088875#comment-13088875 ] Ken Krugler commented on NUTCH-1086: For what it's worth, there's a SimpleHttpFetcher in crawler-commons that uses HttpClient 4.1. > Rewrite protocol-httpclient > --- > > Key: NUTCH-1086 > URL: https://issues.apache.org/jira/browse/NUTCH-1086 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Markus Jelsma > > There are several issues about protocol-httpclient and several comments about > rewriting the plugin with the new http client libraries. There is, however, > not yet an issue for rewriting/reimplementing protocol-httpclient. > http://hc.apache.org/httpcomponents-client-ga/ -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1046) Add tests for indexing to SOLR
[ https://issues.apache.org/jira/browse/NUTCH-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068721#comment-13068721 ] Ken Krugler commented on NUTCH-1046: Don't know if this is useful, but I've got some tests for indexing to embedded Solr as part of the cascading.solr scheme. See https://github.com/bixolabs/cascading.solr/ > Add tests for indexing to SOLR > -- > > Key: NUTCH-1046 > URL: https://issues.apache.org/jira/browse/NUTCH-1046 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 1.4, 2.0 >Reporter: Julien Nioche > Fix For: 1.4, 2.0 > > > We currently have no tests for checking that the indexing to SOLR works as > expected. Running an embedded SOLR Server within the tests would be good. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-657) Estonian N-gram profile has wrong name
[ https://issues.apache.org/jira/browse/NUTCH-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066438#comment-13066438 ] Ken Krugler commented on NUTCH-657: --- I'd thought that Nutch was now delegating language detection to Tika (which contains a port of what Nutch has). In any case, it's et.ngp over in Tika-land. > Estonian N-gram profile has wrong name > -- > > Key: NUTCH-657 > URL: https://issues.apache.org/jira/browse/NUTCH-657 > Project: Nutch > Issue Type: Bug >Affects Versions: 0.8.1, 0.9.0 >Reporter: Jonathan Young >Priority: Trivial > > The Nutch language identifier plugin contains an ngram profile, ee.ngp, in > src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang . "ee" > is the ISO-3166-1-alpha-2 code for Estonia (see > http://www.iso.org/iso/country_codes/iso_3166_code_lists/english_country_names_and_code_elements.htm), > but it is the ISO-639-2 code for Ewe (see > http://www.loc.gov/standards/iso639-2/php/English_list.php). "et" is the > ISO-639-2 code for Estonian, and the language profile in ee.ngp is clearly > Estonian. > Proposed solution: rename ee.ngp to et.ngp . -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
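The code mix-up is easy to sanity-check against the JDK's own ISO 639 locale data (a quick illustration, not Nutch code):

```java
import java.util.Locale;

// Quick sanity check of the code mix-up described in NUTCH-657:
// "et" is the ISO 639 code whose English display name is "Estonian";
// "ee" resolves to something else (Ewe, or an unknown code on older JDKs).
public class LangCodes {
    public static String englishName(String iso639) {
        return new Locale(iso639).getDisplayLanguage(Locale.ENGLISH);
    }
}
```

So a profile named ee.ngp containing Estonian n-grams would be looked up under the wrong language, which is exactly the reported bug.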
[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054475#comment-13054475 ] Ken Krugler commented on NUTCH-1012: Tika has code to try to resolve charset names (and handle common error cases) in a graceful manner. Nutch might want to use this code, or we could add a general wrapper to crawler-commons. See CharsetUtils in Tika. > Cannot handle illegal charset $charset > -- > > Key: NUTCH-1012 > URL: https://issues.apache.org/jira/browse/NUTCH-1012 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.3 >Reporter: Markus Jelsma >Priority: Minor > Fix For: 1.4 > > > Pages returning: > {code} > Content-Type: text/html; charset=$charset > {code} > cause: > {code} > Error parsing: http://host/: failed(2,200): > java.nio.charset.IllegalCharsetNameException: $charset > Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: > Followed by 3999 > ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12 > {code} > Stack trace: > {code} > 2011-06-24 01:14:23,442 WARN parse.html - > java.nio.charset.IllegalCharsetNameException: $charset > 2011-06-24 01:14:23,442 WARN parse.html - at > java.nio.charset.Charset.checkName(Charset.java:284) > 2011-06-24 01:14:23,442 WARN parse.html - at > java.nio.charset.Charset.lookup2(Charset.java:458) > 2011-06-24 01:14:23,442 WARN parse.html - at > java.nio.charset.Charset.lookup(Charset.java:437) > 2011-06-24 01:14:23,442 WARN parse.html - at > java.nio.charset.Charset.isSupported(Charset.java:479) > 2011-06-24 01:14:23,442 WARN parse.html - at > org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) > 2011-06-24 01:14:23,442 WARN parse.html - at > org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201) > 2011-06-24 01:14:23,442 WARN parse.html - at > org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208) > 2011-06-24 01:14:23,442 WARN 
parse.html - at > org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193) > 2011-06-24 01:14:23,442 WARN parse.html - at > org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138) > 2011-06-24 01:14:23,442 WARN parse.html - at > org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35) > 2011-06-24 01:14:23,443 WARN parse.html - at > org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24) > 2011-06-24 01:14:23,443 WARN parse.html - at > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > 2011-06-24 01:14:23,443 WARN parse.html - at > java.util.concurrent.FutureTask.run(FutureTask.java:138) > 2011-06-24 01:14:23,443 WARN parse.html - at > java.lang.Thread.run(Thread.java:662) > 2011-06-24 01:14:23,443 WARN parse.ParseSegment - Error parsing: > http://host/: failed(2,200): java.nio.charset.Ill > egalCharsetNameException: $charset > {code} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
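The stack trace above shows the failure mode: java.nio.charset.Charset.isSupported() itself throws IllegalCharsetNameException on a name like the literal "$charset" left behind by a broken server template. A stdlib-only sketch of the graceful pattern the comment suggests (Tika's CharsetUtils is more thorough than this):

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Stdlib-only sketch of graceful charset-name resolution (Tika's
// CharsetUtils does more): Charset.isSupported() throws
// IllegalCharsetNameException on names like "$charset", so a resolver
// must catch it instead of letting it escape into the parse.
public class SafeCharset {
    public static Charset forNameOrDefault(String name, Charset fallback) {
        if (name == null) return fallback;
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback; // e.g. the literal "$charset" from a broken template
        }
    }
}
```

Wrapping the lookup this way turns the parse failure into a fallback to a configured default encoding.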
[jira] [Commented] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054471#comment-13054471 ] Ken Krugler commented on NUTCH-1013: No comment directly related to this patch, but URL normalization seems like a great component to move into crawler-commons, since all web crawlers need to do the same thing. > Migrate RegexURLNormalizer from Apache ORO to java.util.regex > - > > Key: NUTCH-1013 > URL: https://issues.apache.org/jira/browse/NUTCH-1013 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.4, 2.0 > > Attachments: NUTCH-1013-1.4.patch > > > Apache ORO uses old Perl 5-style regular expressions. Features such as the > powerful lookbehind are not available. The project has become retired as > well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
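Lookbehind, mentioned in the issue as missing from ORO's Perl 5 engine, is available in java.util.regex. A small illustration in a URL-normalization flavor (the rule itself is made up for this example, not one of Nutch's shipped regex-normalize rules): collapse duplicate slashes everywhere except the "//" after the scheme.

```java
import java.util.regex.Pattern;

// Illustrative (made-up) normalization rule showing lookbehind, one of the
// java.util.regex features the retired ORO/Perl5 engine lacks: collapse
// runs of slashes, but not the "//" that follows the scheme's ':'.
public class SlashNormalizer {
    private static final Pattern DUP_SLASH = Pattern.compile("(?<!:)/{2,}");

    public static String normalize(String url) {
        return DUP_SLASH.matcher(url).replaceAll("/");
    }
}
```

Expressing this as a single pattern is awkward without lookbehind, which is the kind of win the migration buys.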
[jira] [Created] (NUTCH-1008) Switch to crawler-commons version of robots.txt parsing code
Switch to crawler-commons version of robots.txt parsing code Key: NUTCH-1008 URL: https://issues.apache.org/jira/browse/NUTCH-1008 Project: Nutch Issue Type: Improvement Affects Versions: 1.4 Reporter: Ken Krugler Priority: Minor The Bixo project has an improved version of Nutch's robots.txt parsing code. This was recently contributed to crawler-commons, in a format that should be independent of Bixo, Cascading, and even Hadoop. Nutch could switch to this, and benefit from more robust parsing, better compliance with ad hoc extensions to the robot exclusion protocol, and a wider community of users/developers for that code. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047490#comment-13047490 ] Ken Krugler commented on NUTCH-961: --- The way that Boilerpipe in Tika works is that it acts as a delegate, processing the SAX events generated by the default content handler that knows how to help clean up broken HTML. So it's incremental processing (you don't need to get the full page first). Separate note: Tika's Boilerpipe support now has an option to return HTML markup, so you could run it in this mode to get anchors/anchor text. > Expose Tika's boilerpipe support > > > Key: NUTCH-961 > URL: https://issues.apache.org/jira/browse/NUTCH-961 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Markus Jelsma > Fix For: 1.4, 2.0 > > Attachments: BoilerpipeExtractorRepository.java, > NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, > NUTCH-961v2.patch > > > Tika 0.8 comes with the Boilerpipe content handler which can be used to > extract boilerplate content from HTML pages. We should see how we can expose > Boilerpipe in the Nutch configuration. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
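The delegate arrangement described above - Boilerpipe sitting behind the markup-repairing HTML handler and consuming its SAX events as they stream by - is the standard SAX decorator pattern. A stdlib-only toy sketch of a handler that processes events incrementally (this is not Tika's BoilerpipeContentHandler, just an illustration of the same shape):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Stdlib-only sketch of the decorator arrangement described above: a
// handler that receives SAX events incrementally, the way Tika's
// BoilerpipeContentHandler sits behind the HTML-repairing handler.
// This toy version just counts text characters as they stream by.
public class CountingHandler extends DefaultHandler {
    private int chars = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        chars += length; // incremental: no need to buffer the whole page
    }

    public int charCount() { return chars; }

    public static int countText(String xml) {
        try {
            CountingHandler h = new CountingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
            return h.charCount();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because the handler sees events as they arrive, content extraction can run without materializing the whole document first, which is the point Ken makes about incremental processing.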
[jira] [Commented] (NUTCH-944) Increase the number of elements to look for URLs and add the ability to specify multiple attributes by elements
[ https://issues.apache.org/jira/browse/NUTCH-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019565#comment-13019565 ] Ken Krugler commented on NUTCH-944: --- I'm curious how this relates to [TIKA-463]. Is it that this code extracts the URLs from the attributes that (as of TIKA-463) should be getting returned by Tika's HtmlParser? Also, TIKA-463 doesn't handle "video" - is that a legit XHTML 1.0 element? > Increase the number of elements to look for URLs and add the ability to > specify multiple attributes by elements > --- > > Key: NUTCH-944 > URL: https://issues.apache.org/jira/browse/NUTCH-944 > Project: Nutch > Issue Type: Improvement > Components: parser > Environment: GNU/Linux Fedora 12 >Reporter: Jean-Francois Gingras >Priority: Minor > Fix For: 2.0 > > Attachments: DOMContentUtils.java.path-1.0, > DOMContentUtils.java.path-1.3 > > > Here is a patch for DOMContentUtils.java that increases the number of elements to > look for URLs. It also adds the ability to specify multiple attributes by > elements, for example: > linkParams.put("frame", new LinkParams("frame", "longdesc,src", 0)); > linkParams.put("object", new LinkParams("object", > "classid,codebase,data,usemap", 0)); > linkParams.put("video", new LinkParams("video", "poster,src", 0)); // HTML 5 > I have a patch for release-1.0 and branch-1.3 > I would love to hear your comments about this. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-960) Language ID - confidence factor
[ https://issues.apache.org/jira/browse/NUTCH-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986827#action_12986827 ] Ken Krugler commented on NUTCH-960: --- There are a number of Tika issues filed that relate to this. See TIKA-369, TIKA-496, TIKA-568. > Language ID - confidence factor > --- > > Key: NUTCH-960 > URL: https://issues.apache.org/jira/browse/NUTCH-960 > Project: Nutch > Issue Type: Wish >Affects Versions: 1.2 >Reporter: M Alexander > > Hi > In JAVA implementation, what is the best way to calculate the confidence of > the outcome of the language id for a given text? > For example: > n-gram matching / total n-gram * 100. > when a text is passed. The outcome would be "en" with 89% confidence. What is > the best way to implement this to the existig nutch language id code? > Thanks -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Charset detection algorithm
Hi all, See https://issues.apache.org/jira/browse/TIKA-539 for a Tika issue I'm currently working on, which has to do with the charset detection algorithm. There's the HTML5 proposal, where the priority is - charset from Content-Type response header - charset from the HTML <meta> element - charset detected from page contents Reinhard Schwab proposed a variation on the HTML5 approach, which makes sense to me; in my web crawling experience, too many servers lie for me to just blindly trust the response header contents. I've got a slight modification to Reinhard's approach, as described in a comment on the above issue: https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=12928832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12928832 I'm interested in comments. Thanks! -- Ken ------ Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
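The priority list above amounts to a first-available-clue cascade. A minimal sketch of that naive ordering (the variation under discussion - distrusting the header when the sniffed bytes contradict it - is precisely what this version does NOT capture):

```java
// Minimal sketch of the HTML5-style priority list from the message above:
// take the first available clue in order (header, meta element, detected).
// The variation being discussed -- letting detection override a lying
// Content-Type header -- is exactly what this naive cascade omits.
public class CharsetPriority {
    public static String pick(String headerCharset,
                              String metaCharset,
                              String detectedCharset) {
        if (headerCharset != null && !headerCharset.isEmpty()) return headerCharset;
        if (metaCharset != null && !metaCharset.isEmpty()) return metaCharset;
        return detectedCharset;
    }
}
```

The TIKA-539 discussion is about where in this cascade the statistical detection result should be allowed to win.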
More real-time crawling
Hi Xiao, FWIR there is adaptive refetch interval support in Nutch currently - or are you looking for something different? Regards, -- Ken On Oct 27, 2010, at 1:42am, xiao yang wrote: I want to modify the schedule of crawler to make it more real-time. Some web pages are frequently updated, while others seldom change. My idea is to classify URL into 2 categories which will affect the score of URL, so I want to add a field to store which category a URL belongs to. The idea is simple, but I found it's not so easy to implement in Nutch. Thanks! Xiao ------ Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Tika 0.8-SNAPSHOT and HTML torture testing
I just committed some changes to Tika that (in theory) should ensure all URLs get extracted from HTML documents. See https://issues.apache.org/jira/browse/TIKA-463 for details. It would be great if somebody active in Nutch could try this out with the current suite of Nutch tests for HTML processing. Thanks! -- Ken Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Tika HTML parsing
Hi Andrzej, On Aug 15, 2010, at 12:04am, Andrzej Bialecki wrote: On 2010-08-15 06:54, Ken Krugler wrote: For what it's worth, I just committed some patches to Tika that should improve Tika's ability to extract HTML outlinks (in and elements, at least). Support for should be coming soon :) This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm tracking down, but I think Tika is getting closer to being usable by Nutch for typical web crawling. Thanks Ken for pushing forward this work! A few questions: * does this include image maps as well ()? I've got a patch for that (the same one that does iframes). Hopefully I'll commit that today. * how does the code treat invalid html with both body and frameset? TagSoup should clean up the invalid HTML. The issue you'd run into with is that TagSoup maps it to an empty , followed by I committed a patch that fixes this, at least for the examples that I tried (including the one that Julien reported). * what's the status of extracting the meta robots and link rel information? All elements are now emitted in the resulting element. And and elements should be passed through. It would be great to get input on just how "fixed" things are now, or maybe after the next patch gets committed. Thanks, -- Ken ---- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Tika HTML parsing
For what it's worth, I just committed some patches to Tika that should improve Tika's ability to extract HTML outlinks (in and elements, at least). Support for should be coming soon :) This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm tracking down, but I think Tika is getting closer to being usable by Nutch for typical web crawling. -- Ken -------- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
When a crawl goes bad...
Dear @80legs stop crushing metafilter.com from 2226 distinct IP addresses. Your bots are DDOSing the site with thousands of requests. Stop. <http://twitter.com/mathowie/status/20326707535>

-- Ken

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Re: Parse-tika ignores too much data...
On Jul 8, 2010, at 12:15am, Andrzej Bialecki wrote:

> On 2010-07-07 22:32, Ken Krugler wrote:
>> Hi Julien,
>>
>> See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej.
>>
>>> There seems to be something very wrong with the way is handled, we also saw cases where it was twice in the output.
>>
>> Don't know about the case of it appearing twice. But for the above issue, I added a comment. The test HTML is badly broken, in that you can either have a OR a , but not both.
>
> The HTML was broken on purpose - one of the goals of the original test was to get as much content and links in presence of grave errors - as you know, even major sites often produce badly broken HTML, but the parsers sanitize it and produce a valid DOM. In this case, it produced two nested elements, which is not valid.

I'll need to check this out - the response from TagSoup was followed by the data, and finally a closing . So if Tika is generating two bodies, then that's a bug in Tika.

Though technically, having the following the is also invalid. I'd suggest filing a Tika issue to do a better job of handling invalid framesets like this. Based on my experience, I don't think there would be an easy way to get this change into TagSoup.

> I should also mention that NekoHTML handled this test much better, by removing the and retaining only the .

Yes, that's a well-known issue - certain docs are better handled by NekoHTML, while with others you get better results from TagSoup. Anecdotally I'd heard that NekoHTML was better at extracting links. Tika used to use NekoHTML, but switched to TagSoup last October. One reason was to avoid a troublesome dependency on Xerces.

-- Ken

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Re: Parse-tika ignores too much data...
Hi Julien,

See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej.

> There seems to be something very wrong with the way is handled, we also saw cases where it was twice in the output.

Don't know about the case of it appearing twice. But for the above issue, I added a comment. The test HTML is badly broken, in that you can either have a OR a , but not both.

-- Ken

On 7 July 2010 17:41, Ken Krugler wrote:

> Hi Andrzej,
>
> I've got an old list of cases where Tika was not extracting links:
>
> - frame
> - iframe
> - img
> - map
> - object
> - link (only in section)
>
> I worked around this in my crawling code, by directly processing the DOM, but I should roll this into Tika. If you have a list of problems with test docs, file a TIKA issue and I'll try to fix things up quickly.
>
> Thanks,
>
> -- Ken
>
> On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:
>
>> Hi,
>>
>> I'm going through NUTCH-840, and I tried to eat our own dog food, i.e. prepare the test DOMs with Tika's HtmlParser. Results are not so good for some test cases... Even when using IdentityHtmlMapper, Tika ignores some elements (such as frame/frameset) and for some others (area) it drops the href. As a result, the number of valid outlinks collected with parse-tika is much smaller than with parse-html. I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and a partial fix was applied to Tika 0.8, but still this won't handle the problems I mentioned above. Can we come up with a plan to address this? I'd rather switch completely to Tika's HTML parsing, but at the moment we would lose too much useful data...
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>> Information Retrieval, Semantic Web
>> Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>
> --------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Re: Parse-tika ignores too much data...
Hi Andrzej,

I've got an old list of cases where Tika was not extracting links:

- frame
- iframe
- img
- map
- object
- link (only in section)

I worked around this in my crawling code, by directly processing the DOM, but I should roll this into Tika. If you have a list of problems with test docs, file a TIKA issue and I'll try to fix things up quickly.

Thanks,

-- Ken

On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:

> Hi,
>
> I'm going through NUTCH-840, and I tried to eat our own dog food, i.e. prepare the test DOMs with Tika's HtmlParser. Results are not so good for some test cases... Even when using IdentityHtmlMapper, Tika ignores some elements (such as frame/frameset) and for some others (area) it drops the href. As a result, the number of valid outlinks collected with parse-tika is much smaller than with parse-html. I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and a partial fix was applied to Tika 0.8, but still this won't handle the problems I mentioned above. Can we come up with a plan to address this? I'd rather switch completely to Tika's HTML parsing, but at the moment we would lose too much useful data...
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
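[Editor's note] The element/attribute pairs in the list above can be sketched as a small table plus a naive extractor. This is purely illustrative: Tika's real HtmlParser works on SAX events produced by TagSoup, not on regexes, and the class and method names here are invented for the example — real crawl code should sanitize the page and walk the resulting DOM.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of which attribute carries the outlink for each of
// the element types listed in the thread above. Not Tika's implementation.
public class OutlinkElements {

    static final Map<String, String> LINK_ATTRS = Map.of(
            "a", "href",
            "link", "href",   // only meaningful inside <head>
            "area", "href",   // image maps: <map>/<area>
            "frame", "src",
            "iframe", "src",
            "img", "src",
            "object", "data");

    // Naive regex extractor for well-formed markup; a real crawler should
    // run the page through a sanitizer (TagSoup/NekoHTML) first.
    public static List<String> extractOutlinks(String html) {
        List<String> links = new ArrayList<>();
        for (Map.Entry<String, String> e : LINK_ATTRS.entrySet()) {
            Pattern p = Pattern.compile(
                    "<" + e.getKey() + "\\b[^>]*\\b" + e.getValue()
                            + "\\s*=\\s*\"([^\"]*)\"",
                    Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.com/a\">x</a>"
                + "<iframe src=\"http://example.com/f\"></iframe></body></html>";
        List<String> links = extractOutlinks(html);
        Collections.sort(links);  // Map.of iteration order is undefined
        System.out.println(links);  // [http://example.com/a, http://example.com/f]
    }
}
```

Note that frame, iframe, img, area, and object all carry their links in different attributes, which is why a parser that only handles anchor href values loses so many outlinks.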
[jira] Commented: (NUTCH-696) Timeout for Parser
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885307#action_12885307 ] Ken Krugler commented on NUTCH-696: --- Hey Chris - let me know if you want me to file a Tika issue and attach my current code. I never heard anything back re the general solution I'd proposed, but if you want to run with the ball that would be great. > Timeout for Parser > -- > > Key: NUTCH-696 > URL: https://issues.apache.org/jira/browse/NUTCH-696 > Project: Nutch > Issue Type: Wish > Components: fetcher >Reporter: Julien Nioche >Priority: Minor > Attachments: timeout.patch > > > I found that the parsing sometimes crashes due to a problem on a specific > document, which is a bit of a shame as this blocks the rest of the segment > and Hadoop ends up finding that the node does not respond. I was wondering > about whether it would make sense to have a timeout mechanism for the parsing > so that if a document is not parsed after a time t, it is simply treated as > an exception and we can get on with the rest of the process. > Does that make sense? Where do you think we should implement that, in > ParseUtil? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
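[Editor's note] The timeout mechanism discussed in NUTCH-696 — run each parse in its own thread and abandon it after a fixed wait, so one pathological document can't hang the whole segment — can be sketched with a standard ExecutorService. The parse() call below is a hypothetical stand-in, not Nutch's actual ParseUtil API.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a per-document parse timeout, as discussed in NUTCH-696.
public class TimedParse {

    // Pretend parser (hypothetical): just sleeps for the given duration.
    static String parse(long workMillis) throws InterruptedException {
        Thread.sleep(workMillis);
        return "parsed";
    }

    // Returns the parse result, or null if the parse exceeded the timeout
    // (a real implementation would record a parse failure status instead).
    public static String parseWithTimeout(long workMillis, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = executor.submit(() -> parse(workMillis));
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true);  // interrupt the stuck parse thread
                return null;          // treat as an exception and move on
            } catch (InterruptedException | ExecutionException e) {
                return null;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithTimeout(10, 2000));  // parsed
        System.out.println(parseWithTimeout(2000, 50));  // null
    }
}
```

Creating and releasing a thread per document has some overhead, but as the next comment in this thread notes, it was not noticeable relative to the cost of parsing itself, even for a 20M-page crawl.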
[jira] Commented: (NUTCH-696) Timeout for Parser
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885285#action_12885285 ] Ken Krugler commented on NUTCH-696: --- FWIW, so far I haven't run into issues with creating/releasing threads for each document being parsed, for a 20M page crawl. Or at least relative to the overhead of all that happens during parsing, it hasn't been noticeable. > Timeout for Parser > -- > > Key: NUTCH-696 > URL: https://issues.apache.org/jira/browse/NUTCH-696 > Project: Nutch > Issue Type: Wish > Components: fetcher >Reporter: Julien Nioche >Priority: Minor > Attachments: timeout.patch > > > I found that the parsing sometimes crashes due to a problem on a specific > document, which is a bit of a shame as this blocks the rest of the segment > and Hadoop ends up finding that the node does not respond. I was wondering > about whether it would make sense to have a timeout mechanism for the parsing > so that if a document is not parsed after a time t, it is simply treated as > an exception and we can get on with the rest of the process. > Does that make sense? Where do you think we should implement that, in > ParseUtil? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.