[jira] Commented: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851923#action_12851923 ] Ken Krugler commented on NUTCH-706: --- Two comments about this: 1. From my experiences

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846424#action_12846424 ] Ken Krugler commented on NUTCH-797: --- I thought this same issue (relative URL with leading

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846459#action_12846459 ] Ken Krugler commented on NUTCH-797: --- Agreed re crawler-commons...feels like there's

[jira] Commented: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830109#action_12830109 ] Ken Krugler commented on NUTCH-786: --- Is this something that should also be applied

[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

2010-01-11 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798890#action_12798890 ] Ken Krugler commented on NUTCH-751: --- I agree that this should be in crawler-commons. E.g

Re: Update on Integration with Tika

2009-11-16 Thread Ken Krugler
On Nov 16, 2009, at 12:00pm, Andrzej Bialecki wrote: Julien Nioche wrote: Hi, I came across the classloader issue that you mentioned but got everything to work OK by duplicating the class TikaConfiguration into the package used by my plugin. The lib tika-core goes into the main /lib dir

[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

2009-09-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753069#action_12753069 ] Ken Krugler commented on NUTCH-751: --- I'm using HttpClient 4.0 in Bixo, and I agree

Re: Nutch Performance Improvements

2009-08-25 Thread Ken Krugler
not acceptable for general crawl, but would be nice to have such configuration option) Fuad Efendi == http://www.linkedin.com/in/liferay http://www.tokenizer.org http://www.casaGURU.com == -- Ken Krugler TransPac

MeetUp topic list posted

2009-08-03 Thread Ken Krugler
I added a page to the Nutch wiki at http://wiki.apache.org/nutch/ApacheConUs2009MeetUp , with some ideas for discussion topics. Please take a look and add/comment, thanks. -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-210-6378

Re: MeetUp topic list posted

2009-08-03 Thread Ken Krugler
if you think I should move it one day earlier. Thanks, -- Ken On Aug 3, 2009, at 10:27am, Andrzej Bialecki wrote: Ken Krugler wrote: I added a page to the Nutch wiki at http://wiki.apache.org/nutch/ApacheConUs2009MeetUp , with some ideas for discussion topics. Please take a look and add

Re: MeetUp topic list posted

2009-08-03 Thread Ken Krugler
/webcrawlers.php http://www.manageability.org/blog/stuff/open-source-web-crawlers-java http://java-source.net/open-source/crawlers Any input on additional communities to invite? Thanks, -- Ken On Aug 3, 2009, at 10:27am, Andrzej Bialecki wrote: Ken Krugler wrote: I added a page to the Nutch

Web Crawler MeetUp info on wiki

2009-08-02 Thread Ken Krugler
, I'm assuming this is going to work. Would it be OK to use the Nutch wiki for a page on proposed discussion topics? If so, I'll add a placeholder and a link from the ApacheCon page. Thanks, -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530

Meetup at ApacheCon US 2009

2009-07-31 Thread Ken Krugler
others? Also, what evening works best for those who think they can make it? Andrzej is giving his Introduction to Nutch talk on Thursday afternoon (3pm Nov 5th), so that same evening is the leading candidate. Thanks, -- Ken -- Ken Krugler TransPac Software, Inc

Re: Nutch dev. plans

2009-07-20 Thread Ken Krugler
are copied to the local disks of the slaves for performance reasons. There has been work on making Katta work better for near-real time updating, versus the currently very batch-oriented approach. See the Katta list for more details. -- Ken -- Ken Krugler +1 530-210-6378

[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-06-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722242#action_12722242 ] Ken Krugler commented on NUTCH-731: --- This is definitely an issue - I've been pinging

Re: IOException in dedup

2009-06-02 Thread Ken Krugler
is that you've run out of memory. Normally the hadoop.log file would have the OOM exception. If you're running from inside of Eclipse, see http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details. -- Ken -- Ken Krugler +1 530-210-6378

Re: IOException in dedup

2009-06-02 Thread Ken Krugler
On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote: Hello, I am new to Nutch and I have set up Nutch 0.9 on Easy Eclipse for Mac OS X. When I try to start crawling I get the following exception: Dedup: starting Dedup: adding indexes in: crawl/indexes Exception in thread "main"

[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714277#action_12714277 ] Ken Krugler commented on NUTCH-739: --- There's another approach that works well here

Performance issues with queue-based fetching

2009-05-19 Thread Ken Krugler
? -- Ken -- Ken Krugler +1 530-210-6378

Re: Filtering URLs

2009-05-05 Thread Ken Krugler
create a new plugin that implements the URLFilter interface. -- Ken -- Ken Krugler +1 530-210-6378

Re: Nutch Topical / Focused Crawl

2009-04-02 Thread Ken Krugler
following methods for our scoring plugin: setConf(); injectScore(); initialScore(); generateSortValue(); passScoreBeforeParsing(); passScoreAfterParsing(); shouldHarvestOutlinks(); distributeScoreToOutlink(); updateDbScore(); indexerScore(); -- Ken -- Ken Krugler +1 530-210-6378

Re: Fetching inefficiency

2008-04-21 Thread Ken Krugler
have not tried overlapping fetching jobs yet, but I have a feeling that won't help a ton, plus it could lead to two fetchers fetching from the same server and being impolite - am I wrong? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch -- Ken Krugler Krugle

Re: Update to URL ordering from Generator.java

2007-10-24 Thread Ken Krugler
a maximum of one thread/domain, and that URLs are segmented such that each domain is handled inside of one task, such that the one thread/domain and pages/minute/domain restrictions can be enforced properly. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 If you can't find it, you can't fix it
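
The segmentation described above (every URL from one domain handled inside one task) can be illustrated with a small sketch. This is not Nutch's Generator or partitioner code: the class name DomainPartitionSketch, the method partitionFor, and the use of the URL host as a stand-in for "domain" are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch, not taken from Nutch: assign URLs to tasks by host so
// that per-domain politeness limits can be enforced inside a single task.
final class DomainPartitionSketch {
    /** Returns a task index in [0, numTasks) that is the same for all URLs on one host. */
    static int partitionFor(String url, int numTasks) {
        String host = URI.create(url).getHost();
        // Fall back to the whole URL when no host can be parsed (assumption).
        String key = (host == null) ? url : host.toLowerCase(Locale.ROOT);
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numTasks;
    }
}
```

Because the partition depends only on the host, a one-thread-per-domain or pages-per-minute limit kept in task-local state would see every request for that host.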

Re: Update to URL ordering from Generator.java

2007-10-24 Thread Ken Krugler
polite. -- Ken PS - Also note that the HTTP 1.1 response header can contain the server's max requests/connection value, so the site admin does have control over what they feel is a reasonable limit. On Oct 24, 2007, at 8:29 AM, Ken Krugler wrote: Ned Rockson wrote: So recently I switched

Re: Redirects and alias handling (LONG)

2007-08-15 Thread Ken Krugler
Ken Krugler wrote: common case. Thus it could be somewhat computationally expensive (e.g. a winnowing ala http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf). Interesting paper, thanks for the pointer - I always wondered what criteria to use to reduce the number of shingles

Re: Redirects and alias handling (LONG)

2007-08-14 Thread Ken Krugler
be OK, as that's the most common case. Thus it could be somewhat computationally expensive (e.g. a winnowing ala http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf). -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 If you can't find it, you can't fix it

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525 ] Ken Krugler commented on NUTCH-25: -- I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466260 ] Ken Krugler commented on NUTCH-353: --- Another small note about this (see NUTCH-411 for a related but different

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466261 ] Ken Krugler commented on NUTCH-353: --- Wait, looks like maybe change 490607 (fix for NUTCH-273) might fix the issue I

[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-23 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12444162 ] Ken Krugler commented on NUTCH-385: --- There is a middle ground, though we don't know yet how important it is to address. When we crawl partner sites, we

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-02 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ] Ken Krugler commented on NUTCH-353: --- +1 that the redirect target is not always the real URL that we want to keep. For example, http://www.ibm.com/developerworks

Re: OPICScoringFilter

2006-08-13 Thread Ken Krugler
virtual page that every page links to. c. One global score (cash). And then a number of changes to how page scores are calculated. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 Find Code, Find Answers

Re: result comparison tool?

2006-07-23 Thread Ken Krugler
with in the /nutch/quality directory, such as: http://www.krugle.com/files/cvs/cvs.sourceforge.net/nutch/playground/src/java/net/nutch/quality/MarkovRankSolver.java -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 Find Code, Find Answers

Re: Mailing List nutch-agent Reports of Bots Submitting Forms

2006-05-24 Thread Ken Krugler
up some quick text for the Wiki re what a good user agent string should contain, and what should be on the web page that it refers to, since we also went through that same process not too long ago. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 Find Code, Find Answers

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Ken Krugler
by ANTLR, creates a URL parser/validator. It's almost too easy... :) Anyway, waiting to hear back from Ter. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 Find Code, Find Answers

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370424 ] Ken Krugler commented on NUTCH-230: --- So Doug beat me to this comment :) I was going to describe the two cases we'd run into... 1. There's a great page, but most

[jira] Created: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-13 Thread Ken Krugler (JIRA)
Versions: 0.8-dev Reporter: Ken Krugler Priority: Minor In ParseOutputFormat.java, the write() method currently divides the page score by the # of outlinks: score /= links.length; It then loops over the links, and any that pass the normalize/filter gauntlet get added
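
The change proposed in the issue title (divide by the number of valid links rather than the total) might look roughly like the sketch below. It is a standalone illustration, not the code in ParseOutputFormat.java: OutlinkScoreSketch, perValidLinkScore, and the passesFilter predicate are hypothetical names standing in for the normalize/filter step.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not Nutch code: share a page score among only those
// outlinks that survive filtering, instead of using score /= links.length.
final class OutlinkScoreSketch {
    /** Returns the score each surviving outlink would receive. */
    static float perValidLinkScore(float pageScore, List<String> links,
                                   Predicate<String> passesFilter) {
        int valid = 0;
        for (String link : links) {
            if (passesFilter.test(link)) {
                valid++;
            }
        }
        // With no valid outlinks there is nothing to distribute; avoid dividing by zero.
        return valid == 0 ? 0.0f : pageScore / valid;
    }
}
```

With four outlinks of which two pass the filter, each surviving link receives half of the page score instead of a quarter.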

Re: scalability limits getDetails, mapFile Readers?

2006-03-01 Thread Ken Krugler
cached files would seem fairly straightforward. Though every time I think I understand Nutch I'm wrong - or the code changes :) -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 Find Code, Find Answers

Re: URL Partitioning (Lexical vs. IP Address)

2006-02-25 Thread Ken Krugler
207.142.131.245 207.142.131.236 207.142.131.248 207.142.131.203 207.142.131.213 207.142.131.206 207.142.131.214 207.142.131.210 207.142.131.246 207.142.131.204 207.142.131.235 207.142.131.247 207.142.131.205 207.142.131.202 Jeff -- Ken Krugler Krugle, Inc. +1 530-210-6378 Find Code, Find

ArrayIndexOutOfBoundsException during invert link phase

2006-02-04 Thread Ken Krugler
Hi all, Has anybody else seen the java.lang.ArrayIndexOutOfBoundsException error displayed in Diagnostic Text column of the jobdetail.jsp page when running 0.8? This occasionally seems to happen during the invert links phase. The stack crawl looks like:

Integrating Nutch w/Alexa

2006-02-01 Thread Ken Krugler
, or Java. But I couldn't find any other Java refs. So maybe that page is out of date, or a foreshadowing of things to come. -- Ken -- Ken Krugler Krugle, Inc. +1 530-470-9200

Re: Per-page crawling policy

2006-01-17 Thread Ken Krugler
the fetch support. We monitor the fetch threads, and when the ratio of active threads (fetching) to inactive (blocked) threads drops below a threshold we terminate the fetch. This then compensates for issues where a popular site is also a low-performing site. -- Ken -- Ken Krugler Krugle, Inc
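
The termination rule described above can be sketched as a single check. The message gives no threshold value or implementation detail, so FetchMonitorSketch, shouldTerminate, and the minRatio parameter are illustrative assumptions only.

```java
// Hypothetical sketch of the rule described in the message: stop the fetch
// once active (fetching) threads fall too far behind blocked threads.
final class FetchMonitorSketch {
    /** Returns true when active/blocked has dropped below minRatio. */
    static boolean shouldTerminate(int activeThreads, int blockedThreads, double minRatio) {
        if (blockedThreads == 0) {
            // Nothing is blocked, so the ratio is unbounded; keep fetching.
            return false;
        }
        return (double) activeThreads / blockedThreads < minRatio;
    }
}
```

A monitor thread could call this periodically with current thread counts; how often to poll and what ratio to use would need tuning per crawl.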

Re: Per-page crawling policy

2006-01-16 Thread Ken Krugler
the WebDB to support a set of scores per page, while not hurting performance, seems tricky. -- Ken -- Ken Krugler Krugle, Inc. +1 530-470-9200

Normalizing URLs with anchors

2006-01-05 Thread Ken Krugler
://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex and http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindexhttp://www.dina.kvl.dk/~sestoft/gcsharp/index.html Is it safe to always strip # followed by (valid anchor characters) at the end of a URL? Thanks, -- Ken -- Ken Krugler Krugle
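
For reference, stripping a trailing anchor as asked about above could be done as in this minimal sketch. FragmentStripSketch and stripFragment are hypothetical names, not an existing Nutch normalizer, and the sketch does not settle whether stripping is always safe (for example, it would also collapse URLs on sites that use the fragment to select content).

```java
// Hypothetical sketch: drop everything from the first '#' onward, so two URLs
// that differ only by anchor normalize to the same string.
final class FragmentStripSketch {
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }
}
```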

Re: Urlfilter Patch

2005-12-01 Thread Ken Krugler
generated from the installed and enabled parse-plugins. -- Ken -- Ken Krugler Krugle, Inc. +1 530-470-9200

Re: protocol-http versus protocol-httpclient

2005-11-09 Thread Ken Krugler
for any advice, -- Ken -- Ken Krugler Krugle, Inc. +1 530-470-9200

Long delay in httpclient

2005-10-26 Thread Ken Krugler
length of time (30 seconds, in my case) has gone by with only 0 byte results being returned by the read call. Does this make sense? Am I missing something else I should be trying? Thanks, -- Ken -- Ken Krugler Krugle, Inc. +1 530-470-9200

Re: [Nutch-dev] [Fwd: Fetch list priority]

2005-10-19 Thread Ken Krugler
of in-bound links. Note that our usage is also a bit non-standard in that we're doing a vertical crawl, and have a way of scoring page contents at crawl time. So we use this in combination with the OPIC score as the page score that we divide up among the outbound links. -- Ken -- Ken Krugler Krugle

Re: what contibute to fetch slowing down

2005-10-02 Thread Ken Krugler
/page Thanks, AJ -- Ken Krugler Krugle, Inc. +1 530-470-9200

Re: what contibute to fetch slowing down

2005-10-02 Thread Ken Krugler
, 559.6956 kb/s, 25572.332bytes/page Thanks, AJ -- Ken Krugler Krugle, Inc. +1 530-470-9200

Language detection

2005-08-12 Thread Ken Krugler
: Language Identification for Multilingual Documents John M. Prager -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200

Re: Ignore external links from crawled domains

2005-08-08 Thread Ken Krugler
have a link between pages from two of your target domains, this might cause problems, and (b) without mods to FetchListTool you still might wind up fetching a page with a score of 0. -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Ken Krugler
On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote: Yes - small chunks of untagged text are going to be a problem, no matter what you do. But if you're referring to query strings from an HTML page, the default is to use the encoding of the page (which from Nutch defaults to UTF-8). And you can