[jira] [Commented] (NUTCH-1558) CharEncodingForConversion in ParseData's ParseMeta, not in ParseData's ContentMeta

2013-04-17 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634183#comment-13634183
 ] 

Ken Krugler commented on NUTCH-1558:


I don't see the patch, but the HTML parser in Tika has similar support 
for dealing with ambiguous charset identification/detection - is there any way 
to leverage that?

> CharEncodingForConversion in ParseData's ParseMeta, not in ParseData's 
> ContentMeta
> --
>
> Key: NUTCH-1558
> URL: https://issues.apache.org/jira/browse/NUTCH-1558
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
>
> This patch from GitHub user ysc fixes two bugs related to character encoding:
> * CharEncodingForConversion in ParseData's ParseMeta, not in ParseData's 
> ContentMeta
> * if the HTTP response header Content-Type reports the wrong encoding, then get 
> the encoding from the original content of the page
> Information about this pull request is here: http://s.apache.org/VOP
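The fallback the second bullet describes can be sketched roughly as follows. This is a hypothetical helper (the class and method names are not from the actual patch): trust the charset in the HTTP Content-Type header only when it names a real encoding, and otherwise sniff a <meta charset=...> declaration from the raw page bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the described fallback, not the patch itself.
public class EncodingFallback {

    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]+charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Extract the charset parameter from a Content-Type header value, or null. */
    public static String charsetFromHeader(String contentType) {
        if (contentType == null) return null;
        int idx = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (idx < 0) return null;
        String name = contentType.substring(idx + 8).split(";")[0].trim().replace("\"", "");
        return name.isEmpty() ? null : name;
    }

    /** Sniff a charset declaration from the first bytes of the page body. */
    public static String charsetFromContent(byte[] content) {
        // ISO-8859-1 maps every byte to a char, so this prefix decode never fails.
        String prefix = new String(content, 0, Math.min(content.length, 4096),
            Charset.forName("ISO-8859-1"));
        Matcher m = META_CHARSET.matcher(prefix);
        return m.find() ? m.group(1) : null;
    }

    /** Prefer a valid header charset; fall back to the in-page declaration. */
    public static String resolveEncoding(String contentType, byte[] content) {
        String name = canonical(charsetFromHeader(contentType));
        if (name == null) name = canonical(charsetFromContent(content));
        return name != null ? name : "UTF-8";  // last-resort default
    }

    private static String canonical(String name) {
        try {
            return (name != null && Charset.isSupported(name))
                ? Charset.forName(name).name() : null;
        } catch (IllegalCharsetNameException e) {
            return null;  // e.g. garbage in the header value
        }
    }
}
```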

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Important : Bunch of Spam Created under Nutch Wiki!!

2013-04-03 Thread Ken Krugler
Hi Kiran,

I was just chatting w/Steve Rowe, who handled this for the Solr project. He 
said:

> It took less than a day, but I went on #asfinfra IRC channel and asked some 
> questions about the process, which may have gotten Gavin McDonald to move on 
> it sooner.

Since we're still getting slammed with spam, it might be worthwhile to do the 
same.

Thanks,

-- Ken


On Apr 1, 2013, at 12:30pm, kiran chitturi wrote:

> I have posted the information on the JIRA issue page [0]. Let's hope the 
> issue will be taken care of soon.
> 
> 
> [0] - https://issues.apache.org/jira/browse/INFRA-6081
> 
> 
> On Mon, Apr 1, 2013 at 3:27 PM, Lewis John Mcgibbney 
>  wrote:
> Hi Kiran,
> 
> 
> On Mon, Apr 1, 2013 at 6:53 AM,  wrote:
> Re: Important : Bunch of Spam Created under Nutch Wiki!!
> 22926 by: kiran chitturi
> 
> 
> Hi guys,
> 
> Do you know what the destination for commit mails is? Can I give 
> 'dev@nutch.apache.org'?
> 
> No, we should put commit emails to the static archive here
> http://mail-archives.apache.org/mod_mbox/nutch-commits/
>  
> 
> Thanks for sorting this out Kiran, we are truly getting hounded with spam 
> just now.
> Best
> Lewis
> 
> 
> 
> -- 
> Kiran Chitturi
> 
> 
> 
> 

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Important : Bunch of Spam Created under Nutch Wiki!!

2013-03-28 Thread Ken Krugler
Hi Kiran,

On Mar 28, 2013, at 2:03am, kiran chitturi wrote:

> Thank you Ken for the information. I think the access is already restricted 
> to Contributors Only. Someone can please confirm, if it is not. 

It's not, as far as I know. I just created a fake account, logged in with it, 
and edited the front page.

> If anyone needs to edit wiki, they would need to ask someone to get access to 
> wiki pages. 
> 
> Do you know if Solr still got hit by spam after locking down the wiki ?

I think that change helped cut down most of the spam, but I don't monitor the 
Solr list that closely, sorry.

-- Ken



> On Thu, Mar 28, 2013 at 1:40 AM, Ken Krugler  
> wrote:
> 
> On Mar 27, 2013, at 6:54pm, kiran chitturi wrote:
> 
>> Thank you Binoy for reporting.
>> 
>> We have been monitoring the pages and deleting them when we get time but 
>> there are more coming up. Today, I have seen a spam editing on the home page 
>> of Nutch wiki. It has inserted spam links under tutorials.
>> 
>> We need to find a permanent solution to this. I wonder if any other 
>> list-servs are facing the same issue.
> 
> Yes - Solr recently had to lock down editing on their wiki:
> 
>> The wiki at http://wiki.apache.org/solr/ has come under attack by spammers 
>> more frequently of late, so the PMC has decided to lock it down in an 
>> attempt to reduce the work involved in tracking and removing spam.
>> 
>> From now on, only people who appear on 
>> http://wiki.apache.org/solr/ContributorsGroup will be able to 
>> create/modify/delete wiki pages.
>> 
>> Please request either on the solr-u...@lucene.apache.org or on 
>> d...@lucene.apache.org to have your wiki username added to the 
>> ContributorsGroup page - this is a one-time step.
> 
> So I think you need to make a request to Infra to lock down the wiki, then 
> add people (generally in response to explicit requests) to the 
> ContributorsGroup page.
> 
> -- Ken
> 
> 
>> 
>> 
>> On Thu, Mar 28, 2013 at 12:49 AM, Binoy d  wrote:
>> I am quite surprised at the notifications I am getting for new pages 
>> on the Nutch Wiki.
>> Example:
>> http://wiki.apache.org/nutch/KarlPuent
>> 
>> I see at least 25-35 emails with such notifications.
>> 
>> All of the links I got are rooted under http://wiki.apache.org/nutch/
>> 
>> 
>> Is someone looking into this? If needed, I can gladly forward emails to the 
>> person cleaning it up, as I am not sure if everyone has access to delete the 
>> pages.
>> 
>> Regards,
>> b
>> 
>> -- Forwarded message --
>> From: Apache Wiki 
>> Date: Wed, Mar 27, 2013 at 9:32 PM
>> Subject: [Nutch Wiki] Trivial Update of "EdwinaBro" by EdwinaBro
>> To: Apache Wiki 
>> 
>> 
>> Dear Wiki user,
>> 
>> You have subscribed to a wiki page or wiki category on "Nutch Wiki" for 
>> change notification.
>> 
>> The "EdwinaBro" page has been changed by EdwinaBro:
>> http://wiki.apache.org/nutch/EdwinaBro
>> 
>> New page:
>> I am 24 years old and my name is Edwina Brownlee. I life in Corjolens 
>> (Switzerland).<>
>> <>
>> <>
>> Take a look at my web-site ... [[http://modform.org/SolomonKr|Continue]]
>> 
>> 
>> 
>> 
>> -- 
>> Kiran Chitturi
>> 
>> 
>> 
>> 
> 
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
> 
> 
> 
> 
> 
> 
> 
> -- 
> Kiran Chitturi
> 
> 
> 
> 

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Important : Bunch of Spam Created under Nutch Wiki!!

2013-03-27 Thread Ken Krugler

On Mar 27, 2013, at 6:54pm, kiran chitturi wrote:

> Thank you Binoy for reporting.
> 
> We have been monitoring the pages and deleting them when we get time but 
> there are more coming up. Today, I have seen a spam editing on the home page 
> of Nutch wiki. It has inserted spam links under tutorials.
> 
> We need to find a permanent solution to this. I wonder if any other 
> list-servs are facing the same issue.

Yes - Solr recently had to lock down editing on their wiki:

> The wiki at http://wiki.apache.org/solr/ has come under attack by spammers 
> more frequently of late, so the PMC has decided to lock it down in an attempt 
> to reduce the work involved in tracking and removing spam.
> 
> From now on, only people who appear on 
> http://wiki.apache.org/solr/ContributorsGroup will be able to 
> create/modify/delete wiki pages.
> 
> Please request either on the solr-u...@lucene.apache.org or on 
> d...@lucene.apache.org to have your wiki username added to the 
> ContributorsGroup page - this is a one-time step.

So I think you need to make a request to Infra to lock down the wiki, then add 
people (generally in response to explicit requests) to the ContributorsGroup 
page.

-- Ken


> 
> 
> On Thu, Mar 28, 2013 at 12:49 AM, Binoy d  wrote:
> I am quite surprised at the notifications I am getting for new pages 
> on the Nutch Wiki.
> Example:
> http://wiki.apache.org/nutch/KarlPuent
> 
> I see at least 25-35 emails with such notifications.
> 
> All of the links I got are rooted under http://wiki.apache.org/nutch/
> 
> 
> Is someone looking into this? If needed, I can gladly forward emails to the 
> person cleaning it up, as I am not sure if everyone has access to delete the 
> pages.
> 
> Regards,
> b
> 
> -- Forwarded message --
> From: Apache Wiki 
> Date: Wed, Mar 27, 2013 at 9:32 PM
> Subject: [Nutch Wiki] Trivial Update of "EdwinaBro" by EdwinaBro
> To: Apache Wiki 
> 
> 
> Dear Wiki user,
> 
> You have subscribed to a wiki page or wiki category on "Nutch Wiki" for 
> change notification.
> 
> The "EdwinaBro" page has been changed by EdwinaBro:
> http://wiki.apache.org/nutch/EdwinaBro
> 
> New page:
> I am 24 years old and my name is Edwina Brownlee. I life in Corjolens 
> (Switzerland).<>
> <>
> <>
> Take a look at my web-site ... [[http://modform.org/SolomonKr|Continue]]
> 
> 
> 
> 
> -- 
> Kiran Chitturi
> 
> 
> 
> 

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-27 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564057#comment-13564057
 ] 

Ken Krugler commented on NUTCH-1465:


Hi Tejas - the original code didn't, but I checked and now remember that I 
added support for multiple sitemap URLs to BaseRobotRules in CC.
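Collecting multiple sitemap URLs from robots.txt can be sketched minimally as below. This is an illustrative helper, not the actual crawler-commons BaseRobotRules API: the point is that the Sitemap directive may appear any number of times, and each occurrence should be accumulated rather than overwriting the previous one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; crawler-commons' real parser is more involved.
public class SitemapDirectives {

    /** Collect every Sitemap: URL from a robots.txt body. */
    public static List<String> extract(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String line : robotsTxt.split("\r?\n")) {
            // Strip trailing comments, then match the directive case-insensitively.
            String cleaned = line.split("#", 2)[0].trim();
            if (cleaned.toLowerCase(Locale.ROOT).startsWith("sitemap:")) {
                String url = cleaned.substring("sitemap:".length()).trim();
                if (!url.isEmpty()) sitemaps.add(url);
            }
        }
        return sitemaps;
    }
}
```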

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-27 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564019#comment-13564019
 ] 

Ken Krugler commented on NUTCH-1465:


Hi Tejas - I thought the current CC robots parsing code was already extracting 
the sitemap links. Or is the above comment ("modified the robots parsing code 
to extract the links to sitemap pages") a change to the current Nutch robots 
parsing code?

I do remember thinking that the CC version would need to change to support 
multiple Sitemap links, even though it wasn't clear whether that was actually 
valid.

-- Ken

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-23 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560877#comment-13560877
 ] 

Ken Krugler commented on NUTCH-1031:


I've rolled this into trunk at crawler-commons. Next step is to roll a release. 
Not sure when I'll get to that, but on my list for this week.

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
>  Issue Type: Task
>Reporter: Julien Nioche
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, 
> CC.robots.multiple.agents.v2.patch, NUTCH-1031-trunk.v2.patch, 
> NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons 
> [http://code.google.com/p/crawler-commons/] which contains a parser for 
> robots.txt files. This parser should also be better than the one we currently 
> have in Nutch. I will delegate this functionality to CC as soon as it is 
> available publicly

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-22 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560420#comment-13560420
 ] 

Ken Krugler commented on NUTCH-1031:


Hi Tejas,

I've been on the road, but I'll check out your patch when I return to my 
office tomorrow. Thanks for updating it with a test case!

-- Ken

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
>  Issue Type: Task
>Reporter: Julien Nioche
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, 
> CC.robots.multiple.agents.v2.patch, NUTCH-1031-trunk.v2.patch, 
> NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons 
> [http://code.google.com/p/crawler-commons/] which contains a parser for 
> robots.txt files. This parser should also be better than the one we currently 
> have in Nutch. I will delegate this functionality to CC as soon as it is 
> available publicly

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558400#comment-13558400
 ] 

Ken Krugler commented on NUTCH-1031:


Regarding precedence - my guess is that it's not very important, as I haven't 
seen many (any?) robots.txt files where it would match the same robot, using 
related names, in rules blocks with different rules.

This issue of precedence is specific to Nutch users, however (not part of the 
robots.txt RFC) so I'd suggest posting to the Nutch users list to see if anyone 
thinks it's important.

As far as your review of the CC code, yes it's correct. There's one additional 
wrinkle in that the target user agent name is split on spaces, due to what 
appears to be an implicit expectation that you can use a user agent name with 
spaces (which based on the RFC isn't actually valid) and any piece of the name 
will match.
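The space-splitting wrinkle described above can be sketched like this (an assumed behavior sketch, not the exact crawler-commons code): the target agent name is split on spaces, and any individual piece matching the User-agent value counts as a match.

```java
import java.util.Locale;

// Sketch of the described wrinkle; names here are illustrative.
public class SpaceSplitMatch {

    /** True if any space-separated piece of targetName appears in the User-agent value. */
    public static boolean anyPieceMatches(String userAgentValue, String targetName) {
        String value = userAgentValue.toLowerCase(Locale.ROOT);
        for (String piece : targetName.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!piece.isEmpty() && value.contains(piece)) {
                return true;
            }
        }
        return false;
    }
}
```

So a target name of "Download Ninja" would match a User-agent value containing just "ninja", which is why the behavior looks questionable relative to the RFC.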

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
>  Issue Type: Task
>Reporter: Julien Nioche
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons 
> [http://code.google.com/p/crawler-commons/] which contains a parser for 
> robots.txt files. This parser should also be better than the one we currently 
> have in Nutch. I will delegate this functionality to CC as soon as it is 
> available publicly

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558340#comment-13558340
 ] 

Ken Krugler commented on NUTCH-1031:


Hi Tejas - I've looked at your patch, and (assuming there's not a requirement 
to support precedence in the user agent name list) it seems like a valid 
change. Based on the RFC (http://www.robotstxt.org/norobots-rfc.txt) robot 
names shouldn't have commas, so splitting on that seems safe. Do you have a 
unit test to verify proper behavior? If so, I'd be happy to roll that into CC.

-- Ken

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
>  Issue Type: Task
>Reporter: Julien Nioche
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons 
> [http://code.google.com/p/crawler-commons/] which contains a parser for 
> robots.txt files. This parser should also be better than the one we currently 
> have in Nutch. I will delegate this functionality to CC as soon as it is 
> available publicly

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546089#comment-13546089
 ] 

Ken Krugler commented on NUTCH-1031:


Based on my reading of the robots.txt RFC ("The robot must obey the first 
record in /robots.txt that contains a User-Agent line whose value contains the 
name token of the robot as a substring."), this seems like the User-Agent name 
(what's in the robots.txt file) is searched for a substring that matches the 
robot name token (what the caller is using).

So that means in CC we'd either need to assume that a robot name _never_ 
contains a comma (and we split the caller-provided name) or we add a new API 
where you pass in a list of robot names. Thoughts?
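The first option (split the caller-provided name on commas, then apply the RFC's substring rule to each token) can be sketched as follows. This is a hypothetical helper for illustration, not the crawler-commons API.

```java
import java.util.Locale;

// Sketch of the RFC matching rule under the assumption that robot names
// never contain commas, so the configured name list can be comma-split.
public class AgentMatcher {

    /**
     * True if the robots.txt User-agent value contains any of the
     * comma-separated configured robot name tokens as a substring.
     */
    public static boolean matches(String userAgentValue, String configuredNames) {
        String value = userAgentValue.toLowerCase(Locale.ROOT);
        for (String token : configuredNames.toLowerCase(Locale.ROOT).split(",")) {
            token = token.trim();
            if (!token.isEmpty() && value.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```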

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
>  Issue Type: Task
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>Priority: Minor
>  Labels: robots.txt
> Fix For: 1.7
>
> Attachments: NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons 
> [http://code.google.com/p/crawler-commons/] which contains a parser for 
> robots.txt files. This parser should also be better than the one we currently 
> have in Nutch. I will delegate this functionality to CC as soon as it is 
> available publicly

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2012-09-05 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448774#comment-13448774
 ] 

Ken Krugler commented on NUTCH-1465:


Hi Lewis,

Just to be clear, I think the dead horse is trying to get people interested in 
porting their code to crawler-commons, and then switching existing 
functionality to rely on cc.

For anything new (like sitemap parsing) I think it's a no-brainer to use cc, 
unless the API is totally borked. E.g. if you didn't, then you wouldn't have 
picked up our BOM fix.

-- Ken

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.1
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2012-09-04 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447797#comment-13447797
 ] 

Ken Krugler commented on NUTCH-1465:


Hi Lewis - I could start a thread, but I also don't want to flog a dead horse :)

I'm spending occasional small amounts of time trying to move code from Bixo 
over to CC, and the plan is for the 0.9 release of Bixo to switch over to using 
CC where possible.

But the lack of excitement among Droids, Heritrix, Common Crawl, Nutch, etc. 
has made it pretty clear getting widespread adoption would be an uphill 
battle, one that I don't have the time currently to fight.

-- Ken

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.1
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2012-09-04 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447700#comment-13447700
 ] 

Ken Krugler commented on NUTCH-1465:


The sitemap parsing code referenced in the discussion you note has been placed 
in crawler-commons. We just finished using it during a crawl (fixed one bug, 
dealing with sitemaps that have a BOM) and it worked fine for the sites we were 
crawling.
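The BOM issue mentioned above amounts to this (an illustrative sketch; the actual crawler-commons fix may differ): a UTF-8 BOM (EF BB BF) at the start of a sitemap trips up XML parsing that expects '<' as the first byte, so it gets stripped before the content is parsed.

```java
import java.util.Arrays;

// Illustrative sketch of stripping a UTF-8 byte order mark before XML parsing.
public class BomStripper {

    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    /** Return the content with a leading UTF-8 BOM removed, if present. */
    public static byte[] stripUtf8Bom(byte[] content) {
        if (content.length >= 3
                && content[0] == UTF8_BOM[0]
                && content[1] == UTF8_BOM[1]
                && content[2] == UTF8_BOM[2]) {
            return Arrays.copyOfRange(content, 3, content.length);
        }
        return content;
    }
}
```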

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.1
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: bug in parse-tika or Tika RTFParser?

2012-08-15 Thread Ken Krugler
Hi Lewis,

[Moving to the dev list]

For many Tika parsers, the text you get back from the document starts with the 
title (if any), and then contains the body.

So I'm wondering if what you're seeing in the test failure is that the 
parse.getText() result is actually "test rtf document\nThe quick brown fox…"

-- Ken

On Aug 15, 2012, at 12:49pm, Lewis John Mcgibbney wrote:

> Hi,
> 
> For some time (in 2.x) we have commented out this test as it was
> waiting for TIKA-748 to be resolved... which now has been resolved
> however I'm getting some confusing output when trying to resurrect the
> test!
> 
> So @line 105 we do
> 
> String text = parse.getText();
> assertEquals("The quick brown fox jumps over the lazy dog", text.trim());
> 
> But I was wanting to implement the suggested test for title e.g.
> 
> String title = parse.getTitle();
> String text = parse.getText();
> assertEquals("test rft document", title);
> assertEquals("The quick brown fox jumps over the lazy dog", text.trim());
> 
> The test fails on the 2nd assertion which with the following
> 
> Testcase: testIt took 5.668 sec
>   FAILED
> null expected:<[The quick brown fox jumps over the lazy dog]> but
> was:<[test rft document]>
> junit.framework.ComparisonFailure: null expected:<[The quick brown fox
> jumps over the lazy dog]> but was:<[test rft document]>
>   at org.apache.nutch.parse.tika.TestRTFParser.testIt(TestRTFParser.java:)
> 
> So this looks like parse.getText() returns the same (in this instance)
> as parse.getTitle()... which smells like rotting herring to me.
> 
> Any immediate thoughts whether this is a known problem in the Tika RTF
> parser, parse-tika's DomContentUtils class or somewhere in between?
> 
> Thank you
> 
> Lewis
> 
> -- 
> Lewis

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






[jira] [Commented] (NUTCH-1455) RobotRulesParser to match multi-word user-agent names

2012-08-15 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435123#comment-13435123
 ] 

Ken Krugler commented on NUTCH-1455:


I added a test to crawler-commons to confirm that its robots.txt parser handles 
this correctly :)

> RobotRulesParser to match multi-word user-agent names
> -
>
> Key: NUTCH-1455
> URL: https://issues.apache.org/jira/browse/NUTCH-1455
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.5.1
>Reporter: Sebastian Nagel
> Fix For: 1.6
>
>
> If the user-agent name(s) configured in http.robots.agents contains spaces it 
> is not matched even if it is exactly contained in the robots.txt
> http.robots.agents = "Download Ninja,*"
> If the robots.txt (http://en.wikipedia.org/robots.txt) contains
> {code}
> User-agent: Download Ninja
> Disallow: /
> {code}
> all content should be forbidden. But it isn't:
> {code}
> % curl 'http://en.wikipedia.org/robots.txt' > robots.txt
> % grep -A1 -i ninja robots.txt 
> User-agent: Download Ninja
> Disallow: /
> % cat test.urls
> http://en.wikipedia.org/
> % bin/nutch plugin lib-http 
> org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt test.urls 
> 'Download Ninja'
> ...
> allowed:http://en.wikipedia.org/
> {code}
> The rfc (http://www.robotstxt.org/norobots-rfc.txt) states that
> bq. The robot must obey the first record in /robots.txt that contains a 
> User-Agent line whose value contains the name token of the robot as a
> substring.
> Assuming that "Download Ninja" is a substring of itself, it should match and 
> http://en.wikipedia.org/ should be forbidden.
> The point is that the agent name from the User-Agent line is split at spaces 
> while the names from the http.robots.agents property are not (they are only 
> split at ",").

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2012-08-13 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433278#comment-13433278
 ] 

Ken Krugler commented on NUTCH-1233:


Hi Markus - two questions. First, is the current Tika (1.1) outlink extraction 
support sufficient? Second, do you think whitespace trimming should happen in 
Tika or externally? I'm not sure, as I guess there might be an issue where 
somebody wants to extract the same anchor text as what was in the HTML, but that 
seems odd.

> Rely on Tika for outlink extraction
> ---
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, 
> NUTCH-1233-1.6-2.patch
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be 
> able to use it in Nutch we need Tika to return the rel attr value of each 
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch 
> is included in Tika and we upgraded to that new version this issue can be 
> worked on. Here's preliminary code that does both Tika and current outlink 
> extraction. This also includes parts of the Boilerpipe code.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/

2012-07-02 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405223#comment-13405223
 ] 

Ken Krugler commented on NUTCH-1418:


The path is invalid, so Nutch emitting a warning is fine.

If Nutch subsequently bails on processing URLs for such a web site, then that 
would be a problem - but I don't think that's the case here, as it's being 
logged as a warning, not an error, and it obviously keeps processing the file 
(since you get three such warnings).

Are you _sure_ that Nutch isn't fetching because of this issue 
with robots.txt? I'm pretty sure many people use Nutch to crawl Wikipedia.
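The warn-and-continue behavior described above can be sketched as follows. This is a hypothetical helper, not Nutch's actual RobotRulesParser code: "%3M" is not a valid percent-escape, so decoding fails; the sketch logs a warning and keeps the raw path instead of aborting robots.txt processing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical sketch of lenient robots.txt path decoding.
public class LenientPathDecoder {

    /** Decode a percent-encoded path, or warn and keep it raw if it's malformed. */
    public static String decodeOrKeep(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            // Matches the spirit of the logged message; processing continues.
            System.err.println("WARN: can't decode path: " + path);
            return path;
        }
    }
}
```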

> error parsing robots rules- can't decode path: 
> /wiki/Wikipedia%3Mediation_Committee/
> 
>
> Key: NUTCH-1418
> URL: https://issues.apache.org/jira/browse/NUTCH-1418
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Arijit Mukherjee
>
> Since learning that nutch will be unable to crawl the javascript function 
> calls in href, I started looking for other alternatives. I decided to crawl 
> http://en.wikipedia.org/wiki/Districts_of_India.
> I first tried injecting this URL and follow the step-by-step approach 
> till fetcher - when I realized, nutch did not fetch anything from this 
> website. I tried looking into logs/hadoop.log and found the following 3 lines 
> - which I believe could be saying that nutch is unable to parse the 
> robots.txt in the website and therefore, fetcher stopped?
>
> 2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots 
> rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots 
> rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots 
> rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
> I tried checking the URL using parsechecker and no issues there! I think 
> it means that the robots.txt is malformed for this website, which is 
> preventing fetcher from fetching anything. Is there a way to get around this 
> problem, as parsechecker seems to go on its merry way parsing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties

2012-06-15 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295683#comment-13295683
 ] 

Ken Krugler commented on NUTCH-1397:


Should this issue be filed against Tika, versus Nutch? Or is this specific to 
language identification that's still part of Nutch? Sorry, but I haven't been 
keeping up with the state of migrating functionality to Tika.

> language-identifier incorrectly handles double-barreled language properties
> ---
>
> Key: NUTCH-1397
> URL: https://issues.apache.org/jira/browse/NUTCH-1397
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.1
>
>
> Currently when language-identifier is activated it parses and identifies 
> language-type=en, however it does not identify en-GB or en-US. This issue 
> should correct that. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Detecting Encoding with plugins

2012-02-14 Thread Ken Krugler

On Feb 14, 2012, at 2:34pm, Lewis John Mcgibbney wrote:

> It's in HTMLParser#private static String sniffCharacterEncoding
> 
> I'm still wondering where TikaParser gets the character encoding from though?

FYI, the individual Tika parsers have their own detection logic.

The HTML parser, for example, uses the response headers and metadata tags in 
addition to ICU's statistical method.

That's something I'm still working on cleaning up, but haven't made much 
progress in the past few months.

-- Ken

> Additionally, this doesn't look like something we check for in our JUnit 
> classes? If we don't then I would like to write some tests to test for this.
> 
> I am working on Any23 tests first, so this provides the justification behind 
> my question.
> 
> Thanks
> 
> Lewis
> 
> On Tue, Feb 14, 2012 at 10:00 PM, Lewis John Mcgibbney 
>  wrote:
> Hi,
> 
> I can't see anywhere within our parser plugins where we detect encoding of 
> documents. I've also begun looking through the o.a.n.p package but again I 
> can't see anything.
> 
> Can anyone provide some detail on this please?
> 
> Thank you
> 
> Lewis 
> 
> 
> 
> -- 
> Lewis 
> 
> 
> 
> 
> -- 
> Lewis 
> 

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: tika-core, tika-parser

2012-02-08 Thread Ken Krugler

On Feb 8, 2012, at 5:28am, Markus Jelsma wrote:

> 
> 
> On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote:
>> sorry don't understand what your issue is. We have a dependency on
>> tika-parsers and the actual parser implementations (listed in tika parsers'
>> POM) are pulled transitively just like any other dependency managed by Ivy.
>> They end up being copied in  runtime/local/plugins/parse-tika/ or put in
>> the job in runtime/deploy/
> 
> My problem is that i am working on some code for Tika-parsers 1.1-SNAPSHOT 
> that i need to use in Nutch. However, when i build tika-parsers and put it in 
Nutch's lib directory i still seem to be missing dependencies. Then trouble 
> begins:

I don't know anything about how Nutch handles jars in its lib directory, but 
this sounds like you have a "raw" jar (tika-parsers) without its pom.xml.

So then Ivy (or Maven) doesn't know about the transitive dependencies on other 
jars, which are needed to implement the actual parsing support.
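For example, a plugin can declare the dependency in its ivy.xml so the parser jars are pulled transitively from the POM (a hypothetical entry; the version number is illustrative):

```xml
<!-- Depending on tika-parsers by name/rev (rather than dropping a raw jar
     into lib/) lets Ivy read its POM and resolve the individual parser
     dependencies transitively. -->
<dependency org="org.apache.tika" name="tika-parsers" rev="1.1"
            conf="*->default"/>
```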

-- Ken

> 
> Exception in thread "main" java.lang.NoClassDefFoundError: Could not 
> initialize class org.apache.tika.parser.dwg.DWGParser
>at java.lang.Class.forName0(Native Method)
>at java.lang.Class.forName(Class.java:247)
>at sun.misc.Service$LazyIterator.next(Service.java:271)
>at org.apache.nutch.parse.tika.TikaConfig.(TikaConfig.java:149)
>at 
> org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
>at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:254)
>at 
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
>at 
> org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
>at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
>at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
> 
> Nick told me to remove DWG from the org.apache.tika.parsers.Parsers config 
> file, which i did. But then other dependency issues come and go. The more 
> parsers i remove from the config file the better it goes, but then Tika won't 
> build anymore because of failing tests.
> 
> I asked this on the Nutch list because i wasn't sure anymore how Nutch deals 
> with these its own deps, which you explained well.
> 
> I'll give up for now :)
> 
> 
> 
>> 
>> On 8 February 2012 13:03, Markus Jelsma  wrote:
>>> Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's
>>> something else.
>>> 
>>> dependencies, dependencies, dependencies :(
>>> 
>>> On Wednesday 08 February 2012 14:04:26 Julien Nioche wrote:
>>>> The dependencies for the plugins are defined locally as shown in the
>>>> URL below, where you can see the ref to tika-parsers for parse-tika.
>>>> Is that more clear for you Markus?
>>>> 
>>>> On 8 February 2012 12:58, Lewis John Mcgibbney
>>> 
>>> wrote:
>>>>> Hi Markus,
>>>>> 
>>>>> For starters
>>> 
>>> http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?vi
>>> 
>>>>> ew=markup
>>>>> 
>>>>> Can we pick our way through this?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> 
>>>>> On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma
>>>>> >>>> 
>>>>>> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Can anyone shed light on this? We don't have any parsers in our libs
>>> 
>>> dir
>>> 
>>>>>> and
>>>>>> we don't have tika-parsers jar, only the tika-core jar. Where are
>>>>>> the parsers
>>>>>> and how does this all work?
>>>>>> 
>>>>>> I've posted a question (same subject) on the Tika list and Nick
>>>>>> tells
>>> 
>>> me
>>> 
>>>>>> there
>>>>>> must be parsers somewhere. Well, i have no idea how we do it in
>>>>>> Nutch, do you?
>>>>>> 
>>>>>> Thanks
>>>>> 
>>>>> --
>>>>> *Lewis*
>>> 
>>> --
>>> Markus Jelsma - CTO - Openindex
> 
> -- 
> Markus Jelsma - CTO - Openindex

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: [DISCUSS] Issues with Fetcher

2012-01-21 Thread Ken Krugler
Hi Eddie,

My own personal favorite area would be to integrate with crawler-commons.

There's been some occasional work done to move things into this shared project 
- e.g. robots parser & a base HTTP fetcher from Bixo.

I believe there's a Jira issue open to switch Nutch to using that robots.txt 
parser, which would be an improvement over what Nutch currently has.

There are other pieces of Nutch that could/eventually should be moved there, 
e.g. URL normalization, but that doesn't directly benefit Nutch, just other 
Java-based crawlers.

Or, if you have experience with JSPs/GUI work, then I think there's this big 
open issue around improving the Nutch GUI, which would likely provide the most 
benefit to the most users. I haven't been following the current status, but I 
know that there have been periodic discussions, and I think 101tec did some 
work on this a while back (for a client), but I don't know if that's been 
contributed (or could be, for that matter).

-- Ken

On Jan 21, 2012, at 8:17am, Edward Drapkin wrote:

> On 1/21/2012 8:27 AM, Lewis John Mcgibbney wrote:
>> 
>> Hi Julien,
>> 
>> 
>> There are 8 issues in trunk about the fetcher - some of them unrelated to 
>> the Fetcher (NUTCH-827 / Nutch-1193) with most of the others being 
>> improvements (NUTCH-828 / NUTCH-1079) with possibly just a very few being 
>> real issues.
>>  
>> This puts the whole discussion into much better context, thanks for pointing 
>> this out. Maybe I should have made it more clear, that I only filtered the 
>> fetcher issues on our Jira and I was simply modelling my discussion around 
>> that. You are completely correct though, it would be different if the 
>> fetcher was in a similar state to protocol-httpclient... which it is 
>> obviously not.
>>  
>> I am also concerned about getting too radical changes to such a core part of 
>> the framework, especially when more pressing issues could be looked after 
>> instead.
>> +1
>>  
>> Having said that if someone can come up with an interesting proposal for 
>> improving the Fetcher that would be very good, I would simply suggest that 
>> we then have a separate implementation for that.
>> +1
>>  
>> 
>> 
>> Ok with this in mind then, is there some guidance we can communicate to 
>> Eddie? He has specifically mentioned that he shares similar opinions wrt the 
>> fetcher being a core part of Nutch, radical changes etc, and I also share 
>> this point of view. He has also added that he doesn't want to spend the time 
>> changing material which we may or may not merge with trunk, this also makes 
>> perfect sense. Additionally Ken's comments emphasise that this has been 
>> somewhat attempted in the past and that lessons have been learned and the 
>> implementation we have cuts the mustard as is. 
>> Maybe we could nudge Eddie in the right direction, which would benefit both 
>> himself and the project over the next while, I think this was the most 
>> important point I was trying to emphasise, however looking over my original 
>> comment this was maybe not how it was written.
>> 
>> Thanks
>> Lewis
> 
> If there's more important and/or interesting things for me to work on, I'll 
> be glad to.  I'm completely unfamiliar with the current state of the project 
> as a whole - and looking through JIRA is a bit daunting.  The only reason I'm 
> attracted to working on the fetcher is I think it's a really interesting and 
> compelling problem to solve, and it's making it more flexible is something 
> that would directly benefit our use for it, so it will be easier to devote 
> time to it while I'm at the office.  I do have a glut of free time at the 
> moment though, so I'm perfectly okay working on another area that's more 
> pressing - I just don't know what it is.  I saw that protocol-httpclient 
> needs to be rewritten, is there someone working on that?
> 
> I can work on more important and less controversial / radical things, but I 
> do think that having a more flexible, pluggable fetcher will be an enormous 
> improvement to Nutch and can greatly expand the potential uses for it as a 
> piece of software.  There's a ton of cases where pluggable fetching could 
> have a huge improvement: local filesystem search, single-threaded / small 
> site indexing, email indexing (SMTP, POP, etc.), etc.  I suggested an 
> extremely (perhaps too much so) abstract archtecture for fetching in ticket 
> #1201, and for the sake of brevity I won't repeat myself here, but I think 
> that would give Nutch a good base for flexible fetching, which I believe is a 
> huge improvement to the project.  I'm obviously new to the development here 
> and I'm willing do whatever needs doing, I just believe the fetching is 
> something that needs doing.  I just want to contribute!
> 
> Thanks,
> Eddie

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-20 Thread Ken Krugler (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190050#comment-13190050
 ] 

Ken Krugler commented on NUTCH-1201:


My 2 cents, based on ancient history.

We extended Nutch in several ways during my Krugle startup, and in general the 
experience wound up being pretty painful. Even with the help of Andrzej and 
Stefan Groschupf (two very knowledgeable Nutch developers), we wound up 
spinning our wheels.

Part of the problem was the monolithic nature of Nutch, which made (makes?) it 
hard to extend in ways beyond plugin extension points that don't need to do 
much other than output different results for the same input data.

My thought here is that I'd look at having a very high level extension point - 
e.g. "I've got a fetch list (generated by other Nutch code) in the segment, and 
now I need to process that list, with the end result being data in new sub-dirs 
in the segment". But keep the fetcher around as a re-usable component (see 
crawler-commons for one version from Bixo).

Then if you want to do some crazy crawl-3-deep, you can craft your own solution 
(which might not even use map-reduce).

-- Ken

PS - my personal bias is to implement custom solutions using Cascading & 
reuseable Java classes, but I know that doesn't fit well with the more common 
user of Nutch, where "programming by XML" (configuration only) seems to be the 
sweet spot.



> Allow for different FetcherThread impls
> ---
>
> Key: NUTCH-1201
> URL: https://issues.apache.org/jira/browse/NUTCH-1201
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.5
>
>
> For certain cases we need to modify parts in FetcherThread and make it 
> pluggable. This introduces a new config directive fetcher.impl that takes a 
> FQCN and uses that setting Fetcher.fetch to load a class to use for 
> job.setMapRunnerClass(). This new class has to extend Fetcher and and inner 
> class FetcherThread. This allows for overriding methods in FetcherThread but 
> also methods in Fetcher itself if required.
> A follow up on this issue would be to refactor parts of FetcherThread to make 
> it easier to override small sections instead of copying the entire method 
> body for a small change, which is now the case.
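The FQCN-loading idea described above can be sketched with plain reflection (names are illustrative; the actual patch wires this through Hadoop's job configuration and `setMapRunnerClass()`):

```java
// Hypothetical sketch: read a fully qualified class name from a config
// value such as "fetcher.impl", load it reflectively, and verify it is
// compatible with the expected base type before use.
public class PluggableLoader {
    static Class<?> loadImpl(String fqcn, Class<?> mustExtend)
            throws ClassNotFoundException {
        Class<?> impl = Class.forName(fqcn);
        if (!mustExtend.isAssignableFrom(impl)) {
            throw new IllegalArgumentException(
                fqcn + " does not extend " + mustExtend.getName());
        }
        return impl;
    }

    public static void main(String[] args) throws Exception {
        // e.g. conf.get("fetcher.impl") would supply the FQCN at runtime
        Class<?> c = loadImpl("java.util.ArrayList", java.util.List.class);
        System.out.println(c.getName()); // prints: java.util.ArrayList
    }
}
```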

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: [DISCUSS] Issues with Fetcher

2012-01-20 Thread Ken Krugler
Thanks for the poke - I'd started writing up a comment to that issue, but got 
sidetracked by the day job.

-- Ken

On Jan 20, 2012, at 9:16am, Lewis John Mcgibbney wrote:

> Hi Everyone,
> 
> Since Eddie decided to chip in on the dev lists/Jira we have not been able to 
> get back to him. I'm referring specifically to NUTCH-1201 and his comments 
> therewith.
> 
> Doing a quick rekkie on the current fetcher issues I can see 32 issues with 7 
> of them claiming to be patched up... this kinda indicates that although there 
> are underlying problems with the fetcher we are currently not getting the 
> time to address them. It also indicates that there is quite a bit of work to 
> be done with the fetcher...
> 
> Has anyone had time to consider Eddie's comments or proposals for taking the 
> work forward. The last thing we would like to see is him allocating his time 
> elsewhere if we could have a real go at building a more appropriate fetcher 
> architecture (plugable, etc).
> 
> I was thinking to myself all week that we would seriously be passing up an 
> opportunity if we didn't try to act on this one.
> 
> Thanks guys. 
> 
> -- 
> Lewis 
> 

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: ANT+MAVEN (was: Nutch Maven build)

2011-10-31 Thread Ken Krugler
> > > > > > being.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Lewis
> > > > > >
> > > > > > [1]
> > > >
> > > > https://builds.apache.org/view/M-R/view/Nutch/job/nutch-trunk-maven/3/con
> > > > so
> > > >
> > > > > > le
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Lewis
> > > >
> > > > ++
> > > > Chris Mattmann, Ph.D.
> > > > Senior Computer Scientist
> > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > > Office: 171-266B, Mailstop: 171-246
> > > > Email: chris.a.mattm...@nasa.gov
> > > > WWW:   http://sunset.usc.edu/~mattmann/
> > > > ++
> > > > Adjunct Assistant Professor, Computer Science Department
> > > > University of Southern California, Los Angeles, CA 90089 USA
> > > > ++
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> >
> >
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> 
> 
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
> 
> 
> 
> 
> -- 
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com

--
Ken Krugler
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient

2011-08-22 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088875#comment-13088875
 ] 

Ken Krugler commented on NUTCH-1086:


For what it's worth, there's a SimpleHttpFetcher in crawler-commons that uses 
HttpClient 4.1.

> Rewrite protocol-httpclient
> ---
>
> Key: NUTCH-1086
> URL: https://issues.apache.org/jira/browse/NUTCH-1086
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Markus Jelsma
>
> There are several issues about protocol-httpclient and several comments about 
> rewriting the plugin with the new http client libraries. There is, however, 
> not yet an issue for rewriting/reimplementing protocol-httpclient.
> http://hc.apache.org/httpcomponents-client-ga/

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1046) Add tests for indexing to SOLR

2011-07-20 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068721#comment-13068721
 ] 

Ken Krugler commented on NUTCH-1046:


Don't know if this is useful, but I've got some tests for indexing to embedded 
Solr as part of the cascading.solr scheme. See 
https://github.com/bixolabs/cascading.solr/

> Add tests for indexing to SOLR
> --
>
> Key: NUTCH-1046
> URL: https://issues.apache.org/jira/browse/NUTCH-1046
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.4, 2.0
>Reporter: Julien Nioche
> Fix For: 1.4, 2.0
>
>
> We currently have no tests for checking that the indexing to SOLR works as 
> expected. Running an embedded SOLR Server within the tests would be good.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-657) Estonian N-gram profile has wrong name

2011-07-16 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066438#comment-13066438
 ] 

Ken Krugler commented on NUTCH-657:
---

I'd thought that Nutch was now delegating language detection to Tika (which 
contains a port of what Nutch has).

In any case, it's et.ngp over in Tika-land.

> Estonian N-gram profile has wrong name
> --
>
> Key: NUTCH-657
> URL: https://issues.apache.org/jira/browse/NUTCH-657
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8.1, 0.9.0
>Reporter: Jonathan Young
>Priority: Trivial
>
> The Nutch language identifier plugin contains an ngram profile, ee.ngp, in 
> src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang .  "ee" 
> is the ISO-3166-1-alpha-2 code for Estonia (see 
> http://www.iso.org/iso/country_codes/iso_3166_code_lists/english_country_names_and_code_elements.htm),
>  but it is the ISO-639-2 code for Ewe (see 
> http://www.loc.gov/standards/iso639-2/php/English_list.php).  "et" is the 
> ISO-639-2 code for Estonian, and the language profile in ee.ngp is clearly 
> Estonian.
> Proposed solution: rename ee.ngp to et.ngp .

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset

2011-06-24 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054475#comment-13054475
 ] 

Ken Krugler commented on NUTCH-1012:


Tika has code to try to resolve charset names (and handle common error cases) 
in a graceful manner. Nutch might want to use this code, or we could add a 
general wrapper to crawler-commons. See CharsetUtils in Tika.

> Cannot handle illegal charset $charset
> --
>
> Key: NUTCH-1012
> URL: https://issues.apache.org/jira/browse/NUTCH-1012
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 1.4
>
>
> Pages returning:
> {code}
> Content-Type: text/html; charset=$charset
> {code}
> cause:
> {code}
> Error parsing: http://host/: failed(2,200): 
> java.nio.charset.IllegalCharsetNameException: $charset
> Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: 
> Followed by 3999
> ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
> {code}
> Stack trace:
> {code}
> 2011-06-24 01:14:23,442 WARN  parse.html - 
> java.nio.charset.IllegalCharsetNameException: $charset
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.checkName(Charset.java:284)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.lookup2(Charset.java:458)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.lookup(Charset.java:437)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.isSupported(Charset.java:479)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.lang.Thread.run(Thread.java:662)
> 2011-06-24 01:14:23,443 WARN  parse.ParseSegment - Error parsing: 
> http://host/: failed(2,200): java.nio.charset.Ill
> egalCharsetNameException: $charset
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex

2011-06-24 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054471#comment-13054471
 ] 

Ken Krugler commented on NUTCH-1013:


No comment directly related to this patch, but URL normalization seems like a 
great component to move into crawler-commons, since all web crawlers need to do 
the same thing.
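As a rough illustration of what such a shared component would cover, a minimal normalizer might look like this (the rules shown are illustrative, not crawler-commons' actual ones; query strings and fragments are ignored for brevity):

```java
import java.net.URL;

public class NormalizeDemo {
    // Lowercase the host, drop the scheme's default port, and default an
    // empty path to "/" -- three rules every crawler tends to share.
    static String normalize(String url) throws Exception {
        URL u = new URL(url);
        String host = u.getHost().toLowerCase();
        int port = u.getPort();
        String portPart = (port == -1 || port == u.getDefaultPort())
                ? "" : ":" + port;
        String path = u.getPath().isEmpty() ? "/" : u.getPath();
        return u.getProtocol() + "://" + host + portPart + path;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(normalize("http://Example.COM:80"));
        // prints: http://example.com/
    }
}
```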

> Migrate RegexURLNormalizer from Apache ORO to java.util.regex
> -
>
> Key: NUTCH-1013
> URL: https://issues.apache.org/jira/browse/NUTCH-1013
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1013-1.4.patch
>
>
> Apache ORO uses old Perl 5-style regular expressions. Features such as the 
> powerful lookbehind are not available. The project has become retired as 
> well. 
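For instance, a lookbehind — unavailable in ORO's Perl 5-era engine — is straightforward with java.util.regex (the rule here is just an example, not one from Nutch's normalizer config):

```java
import java.util.regex.Pattern;

public class LookbehindDemo {
    public static void main(String[] args) {
        // Strip a "www." prefix only when it directly follows the scheme
        // separator, leaving "www." elsewhere in the URL untouched.
        Pattern p = Pattern.compile("(?<=://)www\\.");
        String out = p.matcher("http://www.example.com/www.page")
                      .replaceAll("");
        System.out.println(out);
        // prints: http://example.com/www.page
    }
}
```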

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1008) Switch to crawler-commons version of robots.txt parsing code

2011-06-17 Thread Ken Krugler (JIRA)
Switch to crawler-commons version of robots.txt parsing code


 Key: NUTCH-1008
 URL: https://issues.apache.org/jira/browse/NUTCH-1008
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Ken Krugler
Priority: Minor


The Bixo project has an improved version of Nutch's robots.txt parsing code.

This was recently contributed to crawler-commons, in a format that should be 
independent of Bixo, Cascading, and even Hadoop.

Nutch could switch to this, and benefit from more robust parsing, better 
compliance with ad hoc extensions to the robot exclusion protocol, and a wider 
community of users/developers for that code.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2011-06-10 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047490#comment-13047490
 ] 

Ken Krugler commented on NUTCH-961:
---

The way that Boilerpipe in Tika works is that it acts as a delegate, processing 
the SAX events generated by the default content handler that knows how to help 
clean up broken HTML.

So it's incremental processing (you don't need to get the full page first).

Separate note: Tika's Boilerpipe support now has an option to return HTML 
markup, so you could run it in this mode to get anchors/anchor text.
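The incremental, event-driven processing described above can be sketched with plain SAX (a stand-in for Tika's actual BoilerpipeContentHandler, which additionally delegates the events it keeps to a downstream handler):

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DelegateDemo {
    // A handler that observes SAX events as they stream in: no need to
    // buffer the full page before deciding what text to keep.
    static class CollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // called once per text event
        }
    }

    public static void main(String[] args) throws Exception {
        CollectingHandler h = new CollectingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new java.io.StringReader(
                "<p>hello <b>world</b></p>")), h);
        System.out.println(h.text); // prints: hello world
    }
}
```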


> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-944) Increase the number of elements to look for URLs and add the ability to specify multiple attributes by elements

2011-04-13 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019565#comment-13019565
 ] 

Ken Krugler commented on NUTCH-944:
---

I'm curious how this relates to [TIKA-463]. Is it that this code extracts the 
URLs from the attributes that (as of TIKA-463) should be getting returned by 
Tika's HtmlParser?

Also, TIKA-463 doesn't handle "video" - is that a legit XHTML 1.0 element?


> Increase the number of elements to look for URLs and add the ability to 
> specify multiple attributes by elements
> ---
>
> Key: NUTCH-944
> URL: https://issues.apache.org/jira/browse/NUTCH-944
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
> Environment: GNU/Linux Fedora 12
>Reporter: Jean-Francois Gingras
>Priority: Minor
> Fix For: 2.0
>
> Attachments: DOMContentUtils.java.path-1.0, 
> DOMContentUtils.java.path-1.3
>
>
> Here a patch for DOMContentUtils.java that increase the number of elements to 
> look for URLs. It also add the ability to specify multiple attributes by 
> elements, for example:
> linkParams.put("frame", new LinkParams("frame", "longdesc,src", 0));
> linkParams.put("object", new LinkParams("object", 
> "classid,codebase,data,usemap", 0));
> linkParams.put("video", new LinkParams("video", "poster,src", 0)); // HTML 5
> I have a patch for release-1.0 and branch-1.3
> I would love to hear your comments about this.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-960) Language ID - confidence factor

2011-01-25 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986827#action_12986827
 ] 

Ken Krugler commented on NUTCH-960:
---

There are a number of Tika issues filed that relate to this. See TIKA-369, 
TIKA-496, TIKA-568.

> Language ID - confidence factor
> ---
>
> Key: NUTCH-960
> URL: https://issues.apache.org/jira/browse/NUTCH-960
> Project: Nutch
>  Issue Type: Wish
>Affects Versions: 1.2
>Reporter: M Alexander
>
> Hi
> In JAVA implementation, what is the best way to calculate the confidence of 
> the outcome of the language id for a given text?
> For example:
> n-gram matching / total n-gram * 100.
> when a text is passed. The outcome would be "en" with 89% confidence. What is 
> the best way to implement this in the existing nutch language id code?
> Thanks
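The formula suggested in the question is simple to sketch (the inputs below are illustrative; getting meaningful counts out of the existing profile-matching code is the real work):

```java
public class ConfidenceDemo {
    // matched n-grams / total n-grams * 100, guarding against empty input
    static double confidence(int matched, int total) {
        return total == 0 ? 0.0 : 100.0 * matched / total;
    }

    public static void main(String[] args) {
        // e.g. 89 of 100 n-grams matched the "en" profile
        System.out.println(confidence(89, 100)); // prints: 89.0
    }
}
```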

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Charset detection algorithm

2010-11-06 Thread Ken Krugler

Hi all,

See https://issues.apache.org/jira/browse/TIKA-539 for a Tika issue  
I'm currently working on, which has to do with the charset detection  
algorithm.


There's the HTML5 proposal, where the priority is

- charset from Content-Type response header
- charset from HTML  element
- charset detected from page contents
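That priority order can be sketched as a first-non-null chain (illustrative only; `detectFromContent` is a placeholder for a statistical detector such as ICU's, and the real logic in TIKA-539 weighs the sources rather than trusting them blindly):

```java
public class CharsetPriority {
    // HTML5-style order: response header, then meta element, then detection.
    static String pickCharset(String headerCharset, String metaCharset) {
        if (headerCharset != null) return headerCharset;
        if (metaCharset != null) return metaCharset;
        return detectFromContent();
    }

    static String detectFromContent() {
        return "UTF-8"; // placeholder for a content-based detector
    }

    public static void main(String[] args) {
        // no usable header, so the meta element's charset wins
        System.out.println(pickCharset(null, "ISO-8859-1"));
        // prints: ISO-8859-1
    }
}
```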

Reinhard Schwab proposed a variation on the HTML5 approach, which  
makes sense to me; in my web crawling experience, too many servers lie  
to just blindly trust the response header contents.


I've got a slight modification to Reinhard's approach, as described in  
a comment on the above issue:


https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=12928832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12928832


I'm interested in comments.

Thanks!

-- Ken

------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







More real-time crawling

2010-10-27 Thread Ken Krugler

Hi Xiao,

FWIR there is adaptive refetch interval support in Nutch currently -  
or are you looking for something different?


Regards,

-- Ken

On Oct 27, 2010, at 1:42am, xiao yang wrote:


I want to modify the crawler's schedule to make it more real-time.
Some web pages are frequently updated, while others seldom change. My
idea is to classify URLs into two categories, which will affect the
score of a URL, so I want to add a field to store which category a URL
belongs to.
The idea is simple, but I found it's not so easy to implement in  
Nutch.


Thanks!
Xiao
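
A hypothetical sketch of the two-category idea described above (names
invented for illustration; Nutch's actual adaptive scheduling lives in
its FetchSchedule implementations, and the intervals here are arbitrary):

```java
public class TwoCategorySchedule {

    public enum Category { FREQUENTLY_UPDATED, RARELY_UPDATED }

    // Map the category stored with a URL to a refetch interval in
    // seconds: recheck hot pages often, static pages rarely.
    public static int fetchIntervalSecs(Category category) {
        switch (category) {
            case FREQUENTLY_UPDATED:
                return 60 * 60;           // hourly
            case RARELY_UPDATED:
            default:
                return 30 * 24 * 60 * 60; // monthly
        }
    }
}
```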









Tika 0.8-SNAPSHOT and HTML torture testing

2010-08-17 Thread Ken Krugler
I just committed some changes to Tika that (in theory) should ensure  
all URLs get extracted from HTML documents.


See https://issues.apache.org/jira/browse/TIKA-463 for details.

It would be great if somebody active in Nutch could try this out with  
the current suite of Nutch tests for HTML processing.


Thanks!

-- Ken








Re: Tika HTML parsing

2010-08-15 Thread Ken Krugler

Hi Andrzej,

On Aug 15, 2010, at 12:04am, Andrzej Bialecki wrote:


On 2010-08-15 06:54, Ken Krugler wrote:
For what it's worth, I just committed some patches to Tika that  
should

improve Tika's ability to extract HTML outlinks (in  and 
elements, at least). Support for  should be coming soon :)

This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm
tracking down, but I think Tika is getting closer to being usable by
Nutch for typical web crawling.


Thanks Ken for pushing forward this work! A few questions:

* does this include image maps as well (<map>)?


I've got a patch for that (the same one that does iframes). Hopefully  
I'll commit that today.



* how does the code treat invalid html with both body and frameset?


TagSoup should clean up the invalid HTML.

The issue you'd run into with a <body> plus a <frameset> is that TagSoup  
maps it to an empty <body>, followed by the <frameset>.


I committed a patch that fixes this, at least for the examples that I  
tried (including the one that Julien reported).


* what's the status of extracting the meta robots and link rel  
information?


All <meta> elements are now emitted in the resulting <head> element.

And  and  elements should be passed through.

It would be great to get input on just how "fixed" things are now, or  
maybe after the next patch gets committed.


Thanks,

-- Ken







Tika HTML parsing

2010-08-14 Thread Ken Krugler
For what it's worth, I just committed some patches to Tika that should  
improve Tika's ability to extract HTML outlinks (in  and   
elements, at least). Support for  should be coming soon :)


This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm  
tracking down, but I think Tika is getting closer to being usable by  
Nutch for typical web crawling.


-- Ken







When a crawl goes bad...

2010-08-14 Thread Ken Krugler
Dear @80legs stop crushing metafilter.com from 2226 distinct IP  
addresses.

Your bots are DDOSing the site with thousands of requests. Stop.
<http://twitter.com/mathowie/status/20326707535>


-- Ken







Re: Parse-tika ignores too much data...

2010-07-08 Thread Ken Krugler


On Jul 8, 2010, at 12:15am, Andrzej Bialecki wrote:


On 2010-07-07 22:32, Ken Krugler wrote:

Hi Julien,


See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something  
very wrong with the way <body> is handled; we also saw cases where it
appeared twice in the output.


Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly
broken, in that you can either have a <frameset> OR a <body>, but  
not both.


The HTML was broken on purpose - one of the goals of the original  
test was to extract as much content and as many links as possible in  
the presence of grave errors - as you know, even major sites often  
produce badly broken HTML, but the parser should sanitize it and  
produce a valid DOM. In this case, it produced two nested <body>  
elements, which is not valid.


I'll need to check this out - the response from TagSoup was <body>  
followed by the <frameset> data, and finally a closing </body>.


So if Tika is generating two bodies, then that's a bug in Tika. Though  
technically, having the <frameset> following the <body> is also invalid.


I'd suggest filing a Tika issue to do a better job of handling invalid  
framesets like this. Based on my experience, I don't think there would  
be an easy way to get this change into TagSoup.


I should also mention that NekoHTML handled this test much better,  
by removing the <frameset> and retaining only the <body>.


Yes, that's a well-known issue - certain docs are better handled by  
NekoHTML, while with others you get better results from TagSoup.


Anecdotally I'd heard that NekoHTML was better at extracting links.

Tika used to use NekoHTML, but switched to TagSoup last October. One  
reason was to avoid a troublesome dependency on Xerces.


-- Ken







Re: Parse-tika ignores too much data...

2010-07-07 Thread Ken Krugler

Hi Julien,

See https://issues.apache.org/jira/browse/TIKA-457 for a description  
of one of the cases found by Andrzej. There seems to be something  
very wrong with the way <body> is handled; we also saw cases where it  
appeared twice in the output.


Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly  
broken, in that you can either have a <frameset> OR a <body>, but not  
both.


-- Ken








Re: Parse-tika ignores too much data...

2010-07-07 Thread Ken Krugler

Hi Andrzej,

I've got an old list of cases where Tika was not extracting links:

 - frame
 - iframe
 - img
 - map
 - object
 - link (only in the <head> section)

I worked around this in my crawling code by directly processing the  
DOM, but I should roll this into Tika.
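
A minimal sketch of that kind of DOM workaround, collecting the
link-bearing attribute from each of the elements listed above (the
element-to-attribute mapping here is illustrative, not the actual
crawling code, and a real version would also resolve relative URLs):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomLinkExtractor {

    // Element name -> attribute that carries the outlink.
    private static final Map<String, String> LINK_ATTRS = new HashMap<>();
    static {
        LINK_ATTRS.put("frame", "src");
        LINK_ATTRS.put("iframe", "src");
        LINK_ATTRS.put("img", "src");
        LINK_ATTRS.put("area", "href");   // image maps, inside <map>
        LINK_ATTRS.put("object", "data");
        LINK_ATTRS.put("link", "href");
    }

    // Expects already-sanitized (well-formed) XHTML, as produced by
    // a cleaner like TagSoup or NekoHTML.
    public static List<String> extractLinks(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        List<String> links = new ArrayList<>();
        for (Map.Entry<String, String> e : LINK_ATTRS.entrySet()) {
            NodeList nodes = doc.getElementsByTagName(e.getKey());
            for (int i = 0; i < nodes.getLength(); i++) {
                String url = ((Element) nodes.item(i)).getAttribute(e.getValue());
                if (!url.isEmpty()) {
                    links.add(url);
                }
            }
        }
        return links;
    }
}
```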


If you have a list of problems with test docs, file a TIKA issue and  
I'll try to fix things up quickly.


Thanks,

-- Ken

On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:


Hi,

I'm going through NUTCH-840, and I tried to eat our own dog food,  
i.e. prepare the test DOMs with Tika's HtmlParser.


Results are not so good for some test cases... Even when using  
IdentityHtmlMapper, Tika ignores some elements (such as  
frame/frameset), and for some others (area) it drops the href. As a  
result, the number of valid outlinks collected with parse-tika is much  
smaller than with parse-html.


I know this issue has been reported (TIKA-379, NUTCH-817,  
NUTCH-794), and a partial fix was applied to Tika 0.8, but it still  
won't handle the problems I mentioned above.


Can we come up with a plan to address this? I'd rather switch  
completely to Tika's HTML parsing, but at the moment we would lose  
too much useful data...


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com









[jira] Commented: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885307#action_12885307
 ] 

Ken Krugler commented on NUTCH-696:
---

Hey Chris - let me know if you want me to file a Tika issue and attach my 
current code.

I never heard anything back re the general solution I'd proposed, but if you 
want to run with the ball that would be great.

> Timeout for Parser
> --
>
> Key: NUTCH-696
> URL: https://issues.apache.org/jira/browse/NUTCH-696
> Project: Nutch
>  Issue Type: Wish
>  Components: fetcher
>Reporter: Julien Nioche
>Priority: Minor
> Attachments: timeout.patch
>
>
> I found that the parsing sometimes crashes due to a problem on a specific 
> document, which is a bit of a shame as this blocks the rest of the segment 
> and Hadoop ends up finding that the node does not respond. I was wondering 
> about whether it would make sense to have a timeout mechanism for the parsing 
> so that if a document is not parsed after a time t, it is simply treated as 
> an exception and we can get on with the rest of the process.
> Does that make sense? Where do you think we should implement that, in 
> ParseUtil?




[jira] Commented: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885285#action_12885285
 ] 

Ken Krugler commented on NUTCH-696:
---

FWIW, so far I haven't run into issues with creating/releasing threads for each 
document being parsed, for a 20M page crawl. Or at least relative to the 
overhead of all that happens during parsing, it hasn't been noticeable.


> Timeout for Parser
> --
>
> Key: NUTCH-696
> URL: https://issues.apache.org/jira/browse/NUTCH-696
> Project: Nutch
>  Issue Type: Wish
>  Components: fetcher
>Reporter: Julien Nioche
>Priority: Minor
> Attachments: timeout.patch
>
>
> I found that the parsing sometimes crashes due to a problem on a specific 
> document, which is a bit of a shame as this blocks the rest of the segment 
> and Hadoop ends up finding that the node does not respond. I was wondering 
> about whether it would make sense to have a timeout mechanism for the parsing 
> so that if a document is not parsed after a time t, it is simply treated as 
> an exception and we can get on with the rest of the process.
> Does that make sense? Where do you think we should implement that, in 
> ParseUtil?
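
The thread-per-document approach discussed in these comments can be
sketched roughly as follows (a toy stand-in for the real ParseUtil
change; the `Callable<String> parse` argument is just a placeholder for
an actual parser invocation):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedParse {

    // Run a parse task on its own thread, giving up if it doesn't
    // finish within timeoutSecs so one bad document can't stall the
    // whole segment.
    public static String parseWithTimeout(Callable<String> parse, long timeoutSecs) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(parse);
        try {
            return future.get(timeoutSecs, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck parser thread
            return null;         // caller treats this as a parse failure
        } catch (Exception e) {
            return null;         // parser threw; also a failure
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that cancel(true) only interrupts the thread; a parser that never
checks its interrupt status can still leak the thread, which is one of
the known limits of this approach.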
