[
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851923#action_12851923
]
Ken Krugler commented on NUTCH-706:
---
Two comments about this:
1. From my experiences
[
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846424#action_12846424
]
Ken Krugler commented on NUTCH-797:
---
I thought this same issue (relative URL with leading
[
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846459#action_12846459
]
Ken Krugler commented on NUTCH-797:
---
Agreed re crawler-commons...feels like there's
[
https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830109#action_12830109
]
Ken Krugler commented on NUTCH-786:
---
Is this something that should also be applied
[
https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798890#action_12798890
]
Ken Krugler commented on NUTCH-751:
---
I agree that this should be in crawler-commons. E.g
On Nov 16, 2009, at 12:00pm, Andrzej Bialecki wrote:
Julien Nioche wrote:
Hi,
I came across the classloader issue that you mentioned but got
everything to work OK by duplicating the class TikaConfiguration
into the package used by my plugin. The lib tika-core goes into the
main /lib dir
[
https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753069#action_12753069
]
Ken Krugler commented on NUTCH-751:
---
I'm using HttpClient 4.0 in Bixo, and I agree
not acceptable for general crawl, but would be nice to
have such configuration option)
Fuad Efendi
==
http://www.linkedin.com/in/liferay
http://www.tokenizer.org
http://www.casaGURU.com
==
--
Ken Krugler
TransPac
I added a page to the Nutch wiki at http://wiki.apache.org/nutch/ApacheConUs2009MeetUp,
with some ideas for discussion topics.
Please take a look and add/comment, thanks.
-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-210-6378
if you think I should move it one day earlier.
Thanks,
-- Ken
On Aug 3, 2009, at 10:27am, Andrzej Bialecki wrote:
Ken Krugler wrote:
I added a page to the Nutch wiki at http://wiki.apache.org/nutch/ApacheConUs2009MeetUp,
with some ideas for discussion topics.
Please take a look and add
/webcrawlers.php
http://www.manageability.org/blog/stuff/open-source-web-crawlers-java
http://java-source.net/open-source/crawlers
Any input on additional communities to invite?
Thanks,
-- Ken
On Aug 3, 2009, at 10:27am, Andrzej Bialecki wrote:
Ken Krugler wrote:
I added a page to the Nutch
, I'm assuming this is going to work.
Would it be OK to use the Nutch wiki for a page on proposed discussion
topics? If so, I'll add a placeholder and a link from the ApacheCon
page.
Thanks,
-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530
others?
Also, what evening works best for those who think they can make it?
Andrzej is giving his Introduction to Nutch talk on Thursday
afternoon (3pm Nov 5th), so that same evening is the leading candidate.
Thanks,
-- Ken
--
Ken Krugler
TransPac Software, Inc
are
copied to the local disks of the slaves for performance reasons.
There has been work on making Katta work better for near-real time
updating, versus the currently very batch-oriented approach. See the
Katta list for more details.
-- Ken
--
Ken Krugler
+1 530-210-6378
[
https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722242#action_12722242
]
Ken Krugler commented on NUTCH-731:
---
This is definitely an issue - I've been pinging
is
that you've run out of memory. Normally the hadoop.log file would
have the OOM exception.
If you're running from inside of Eclipse, see
http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details.
-- Ken
--
Ken Krugler
+1 530-210-6378
On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:
Hello,
I am new with Nutch and I have set up Nutch 0.9 on Easy Eclipse
for Mac OS X. When I try to start crawling I get the following
exception:
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main"
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714277#action_12714277
]
Ken Krugler commented on NUTCH-739:
---
There's another approach that works well here
?
-- Ken
--
Ken Krugler
+1 530-210-6378
create a new
plugin that implements the URLFilter interface.
-- Ken
--
Ken Krugler
+1 530-210-6378
following methods for our scoring plugin:
setConf()
injectScore()
initialScore();
generateSortValue();
passScoreBeforeParsing();
passScoreAfterParsing();
shouldHarvestOutlinks();
distributeScoreToOutlink();
updateDbScore();
indexerScore();
-- Ken
--
Ken Krugler
+1 530-210-6378
have not tried overlapping fetching jobs yet, but I
have a feeling that won't help a ton, plus it could lead to two
fetchers fetching from the same server and being impolite - am I
wrong?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
--
Ken Krugler
Krugle
a maximum of one thread/domain, and that URLs are
segmented such that each domain is handled inside of one task, such
that the one thread/domain and pages/minute/domain restrictions can
be enforced properly.
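As a rough illustration of that segmentation (not Nutch's actual partitioner), the sketch below assigns each URL to a task by hashing its host, so every URL for one host lands in the same task. The names `HostPartitioner` and `taskFor` are hypothetical, and using the host as a stand-in for "domain" is an assumption made for this example.

```java
import java.net.URI;
import java.util.Locale;

final class HostPartitioner {
    /** Returns a task index in [0, numTasks) that depends only on the URL's host. */
    static int taskFor(String url, int numTasks) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        // Lower-case first so Example.com and example.com share a task.
        int hash = host.toLowerCase(Locale.ROOT).hashCode();
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(hash, numTasks);
    }
}
```

Because the task index depends only on the host, a per-host thread limit or pages/minute limit can be tracked inside a single task without coordination between tasks.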
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it
polite.
-- Ken
PS - Also note that the HTTP 1.1 response header can contain the
server's max requests/connection value, so the site admin does have
control over what they feel is a reasonable limit.
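A minimal sketch of reading such a limit, assuming it arrives as the `max` parameter of a `Keep-Alive` response header (for example `Keep-Alive: timeout=5, max=100`). Which header is meant is an assumption here, and `KeepAliveHeader` and `maxRequests` are hypothetical names.

```java
import java.util.OptionalInt;

final class KeepAliveHeader {
    /** Extracts the "max" parameter from a Keep-Alive header value, if present and numeric. */
    static OptionalInt maxRequests(String headerValue) {
        if (headerValue == null) {
            return OptionalInt.empty();
        }
        // Parameters are comma separated, each of the form name=value.
        for (String part : headerValue.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("max")) {
                try {
                    return OptionalInt.of(Integer.parseInt(kv[1].trim()));
                } catch (NumberFormatException e) {
                    return OptionalInt.empty(); // malformed value, treat as absent
                }
            }
        }
        return OptionalInt.empty();
    }
}
```

A fetcher could use the returned value as an upper bound on requests per connection and fall back to its own default when the parameter is absent.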
On Oct 24, 2007, at 8:29 AM, Ken Krugler wrote:
Ned Rockson wrote:
So recently I switched
Ken Krugler wrote:
common case. Thus it could be somewhat computationally expensive
(e.g. a winnowing ala
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
Interesting paper, thanks for the pointer - I always wondered what
criteria to use to reduce the number of shingles
be OK, as that's the most
common case. Thus it could be somewhat computationally expensive
(e.g. a winnowing ala
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
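For readers unfamiliar with winnowing, here is a small sketch of the selection step as I understand it: hash every k-gram, slide a window of w hashes, and keep the rightmost minimum of each window, recording each position once. The class and method names are hypothetical, and `String.hashCode` is only a placeholder for whatever k-gram hash a real implementation would use.

```java
import java.util.ArrayList;
import java.util.List;

final class Winnowing {
    /** Hashes of every k-gram (substring of length k) in the text. */
    static int[] kgramHashes(String text, int k) {
        int n = text.length() - k + 1;
        if (n <= 0) {
            return new int[0];
        }
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = text.substring(i, i + k).hashCode(); // placeholder hash
        }
        return hashes;
    }

    /** Positions of the fingerprints selected with window size w. */
    static List<Integer> select(int[] hashes, int w) {
        List<Integer> picked = new ArrayList<>();
        int last = -1;
        for (int start = 0; start + w <= hashes.length; start++) {
            int min = start;
            for (int j = start + 1; j < start + w; j++) {
                if (hashes[j] <= hashes[min]) {
                    min = j; // "<=" keeps the rightmost minimum on ties
                }
            }
            if (min != last) { // record each selected position only once
                picked.add(min);
                last = min;
            }
        }
        return picked;
    }
}
```

For the hash sequence 77 74 42 17 98 50 17 98 8 88 67 39 77 74 42 17 98 and w = 4, `select` returns positions 3, 6, 8, 11, 15, so only 5 of the 17 hashes are kept as fingerprints.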
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525
]
Ken Krugler commented on NUTCH-25:
--
I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like
[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466260
]
Ken Krugler commented on NUTCH-353:
---
Another small note about this (see NUTCH-411 for a related but different
[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466261
]
Ken Krugler commented on NUTCH-353:
---
Wait, looks like maybe change 490607 (fix for NUTCH-273) might fix the issue I
[
http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12444162 ]
Ken Krugler commented on NUTCH-385:
---
There is a middle ground, though we don't know yet how important it is to
address.
When we crawl partner sites, we
[
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ]
Ken Krugler commented on NUTCH-353:
---
+1 that the redirect target is not always the real URL that we want to keep.
For example, http://www.ibm.com/developerworks
virtual page that every page links to.
c. One global score (cash).
And then a number of changes to how page scores are calculated.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers
with in the /nutch/quality directory, such as:
http://www.krugle.com/files/cvs/cvs.sourceforge.net/nutch/playground/src/java/net/nutch/quality/MarkovRankSolver.java
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers
up some quick text for the Wiki re what a good user
agent string should contain, and what should be on the web page that
it refers to, since we also went through that same process not too
long ago.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers
by ANTLR, creates a URL parser/validator.
It's almost too easy... :)
Anyway, waiting to hear back from Ter.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers
[
http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370424 ]
Ken Krugler commented on NUTCH-230:
---
So Doug beat me to this comment :)
I was going to describe the two cases we'd run into...
1. There's a great page, but most
Versions: 0.8-dev
Reporter: Ken Krugler
Priority: Minor
In ParseOutputFormat.java, the write() method currently divides the page score
by the # of outlinks:
score /= links.length;
It then loops over the links, and any that pass the normalize/filter gauntlet
get added
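To make the arithmetic concrete, here is a small sketch with hypothetical names (it is not the Nutch code). It assumes each link that survives filtering receives the already divided score, which is my reading of the description above.

```java
final class OutlinkScore {
    /**
     * Total score handed out when the per-link share is score / totalLinks
     * but only passedLinks of those links survive normalizing and filtering.
     */
    static float distributed(float score, int totalLinks, int passedLinks) {
        float perLink = score / totalLinks; // mirrors: score /= links.length;
        return perLink * passedLinks;
    }
}
```

Under that assumption, a page with score 1.0 and 10 outlinks, of which 4 pass the filters, hands out 0.1 per surviving link, for a total of roughly 0.4.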
cached files
would seem fairly straightforward.
Though every time I think I understand Nutch I'm wrong - or the code changes :)
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers
207.142.131.245 207.142.131.236 207.142.131.248 207.142.131.203
207.142.131.213 207.142.131.206 207.142.131.214 207.142.131.210
207.142.131.246 207.142.131.204 207.142.131.235 207.142.131.247
207.142.131.205 207.142.131.202
Jeff
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find
Hi all,
Has anybody else seen the java.lang.ArrayIndexOutOfBoundsException
error displayed in Diagnostic Text column of the jobdetail.jsp page
when running 0.8?
This occasionally seems to happen during the invert links phase. The
stack crawl looks like:
, or Java.
But I couldn't find any other Java refs. So maybe that page is out of
date, or a foreshadowing of things to come.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
the fetch support. We monitor the fetch
threads, and when the ratio of active threads (fetching) to inactive
(blocked) threads drops below a threshold we terminate the fetch.
This then compensates for issues where a popular site is also a
low-performing site.
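A minimal sketch of that check, with hypothetical names (`FetchMonitor`, `shouldTerminate`) and an assumed `minRatio` parameter; the actual threshold used is not stated here.

```java
final class FetchMonitor {
    /**
     * Returns true when the ratio of active (fetching) threads to
     * blocked threads has dropped below minRatio.
     */
    static boolean shouldTerminate(int activeThreads, int blockedThreads, double minRatio) {
        if (blockedThreads == 0) {
            return false; // nothing is blocked, so keep fetching
        }
        return (double) activeThreads / blockedThreads < minRatio;
    }
}
```

For example, with 1 active and 9 blocked threads and a minimum ratio of 0.25, the check says to terminate; with 5 active and 5 blocked it does not.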
-- Ken
--
Ken Krugler
Krugle, Inc
the WebDB to support a set of scores per page, while not hurting
performance, seems tricky.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex
and
http://www.dina.kvl.dk/~sestoft/gcsharp/index.html
Is it safe to always strip # followed by (valid anchor characters) at
the end of a URL?
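For reference, RFC 3986 treats everything after the first `#` as the fragment, and the fragment is not sent to the server in the HTTP request. A minimal sketch of dropping it follows; `UrlFragments` and `stripFragment` are hypothetical names, and this does not settle whether stripping is the right policy for every site.

```java
final class UrlFragments {
    /** Returns the URL without its fragment (the first '#' and everything after it). */
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }
}
```

With this, a URL ending in `index.html#wordindex` and the same URL ending in `index.html` normalize to the same string.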
Thanks,
-- Ken
--
Ken Krugler
Krugle
generated from the
installed and enabled parse-plugins.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
for any advice,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
length of
time (30 seconds, in my case) has gone by with only 0 byte results
being returned by the read call.
Does this make sense? Am I missing something else I should be trying?
Thanks,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
of in-bound links.
Note that our usage is also a bit non-standard in that we're doing a
vertical crawl, and have a way of scoring page contents at crawl
time. So we use this in combination with the OPIC score as the page
score that we divide up among the outbound links.
-- Ken
--
Ken Krugler
Krugle
/page
Thanks,
AJ
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
, 559.6956 kb/s, 25572.332bytes/page
Thanks,
AJ
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
: Language Identification for Multilingual Documents
John M. Prager
-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200
have a link between pages from
two of your target domains, this might cause problems, and (b)
without mods to FetchListTool you still might wind up fetching a page
with a score of 0.
-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200
On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote:
Yes - small chunks of untagged text are going to be a problem, no
matter what you do. But if you're referring to query strings from
an HTML page, the default is to use the encoding of the page (which
from Nutch defaults to UTF-8). And you can