in your segment (you can dump this with the readseg command). It
should contain the plain-text content of your file.
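For example (a sketch - the segment path below is made up, substitute
your own):

  bin/nutch readseg -dump crawl/segments/20100101123456 segdump

This writes a human-readable dump, including the parsed text, into the
segdump/ directory.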
* use Luke (www.getopt.org/luke) to examine your Lucene index. You
should be able to retrieve terms coming from your Java documents - use
Rec
ire major refactoring)
> that could provide this functionality?
Use Nutch for crawling and indexing to Solr, and then use Solr directly
for searching.
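For the indexing part something like this should work (a sketch - the
paths and the Solr URL are examples, adjust to your layout):

  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
    crawl/linkdb crawl/segments/*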
--
Best regards,
Andrzej Bialecki <><
ch
> crawl" command, that means I will have to code my own .sh for crawling, one
> that uses the -noparsing option of the fetcher right ?
You can simply set the fetcher.parse config option to false.
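E.g. in your nutch-site.xml:

  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>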
--
Best regards,
Andrzej Bialecki <><
you can re-parse again after you've fixed the config or the code...
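E.g. (a sketch, with a made-up segment path):

  bin/nutch parse crawl/segments/20100101123456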
--
Best regards,
Andrzej Bialecki <><
n the politeness crawl delay.
>
> 3. When it all goes down, is there a way to restart crawling from where the
> process stopped ?
Unfortunately, no. You should at least crawl without parsing, so tha
o the documentation. The problem should be reported to
the Hadoop project.
--
Best regards,
Andrzej Bialecki <><
, because it needs the Hadoop
>> infrastructure to run).
>
> I thought ant tar did this? That's what it sez on the release guide [1] and
> what I'm familiar with when I did the Nutch 0.9 release.
ant tar packs everything, i.e. both sou
e. We may have been too hasty with that,
though... What do others think?
--
Best regards,
Andrzej Bialecki <><
s_tlp
--
Best regards,
Andrzej Bialecki <><
g backends - the one that is configured
by default uses plain Lucene, and it does not support faceting. The
other backend uses Solr, which of course supports faceting and all the
other Solr features.
So in your case you need to switch to use Solr
On 2010-04-17 05:45, Phil Barnett wrote:
> On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote:
>
>> More details on this (your environment, OS, JDK version) and
>> logs/stacktraces would be highly appreciated! You mentioned that you
>> have some scripts - if yo
get more specific.
More details on this (your environment, OS, JDK version) and
logs/stacktraces would be highly appreciated! You mentioned that you
have some scripts - if you could extract relevant portions from them (or
copy the scripts) it would h
Release the packages as Apache Nutch 1.1.
>
> [ ] -1 Do not release the packages because...
>
+1 - tested both local and distributed workflows, all looks good.
--
Best regards,
Andrzej Bialecki <><
ormal steps to become a TLP.
--
Best regards,
Andrzej Bialecki <><
ep takes too much time, but still the number
of segments is well below a hundred; just don't merge them.
--
Best regards,
Andrzej Bialecki <><
ventDispatchThread.pumpEvents(Unknown Source)
> at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
> at java.awt.EventDispatchThread.run(Unknown Source)
>
> Any ideas why this happens and how
icial, but I'm not
familiar with Maven, so I won't be able to make this change myself...
--
Best regards,
Andrzej Bialecki <><
Best regards,
Andrzej Bialecki <><
On 2010-03-29 17:14, Pedro Bezunartea López wrote:
Thanks Andrzej, I was more curious than bothered by these easy to spot spam
messages. Can I help?
Thanks, not really - I sent an admin unsubscribe and it worked, we'll
see if the problem returns ...
--
Best regards,
Andrzej Bialecki
moderator adds them. It appears that this user slipped through
... I'll try to forcibly unsubscribe him. Sorry!
--
Best regards,
Andrzej Bialecki <><
, otherwise
it's likely to happen again. Are you running this on a cluster? Check
the logs of the crashed tasks (in logs/userlogs/ on respective
tasktracker nodes).
--
Best regards,
Andrzej Bialecki
strongly recommend that you first fetch, and then run the parsing as a
separate step.
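I.e. (a sketch - substitute your own segment path):

  bin/nutch fetch crawl/segments/20100101123456
  bin/nutch parse crawl/segments/20100101123456

with fetcher.parse set to false in nutch-site.xml.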
--
Best regards,
Andrzej Bialecki <><
aviour?
You can define these weights in the configuration; look for the query
boost properties.
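E.g. in nutch-site.xml (the values here are just examples):

  <property>
    <name>query.title.boost</name>
    <value>2.0</value>
  </property>
  <property>
    <name>query.anchor.boost</name>
    <value>1.0</value>
  </property>

See nutch-default.xml for the full list of query.*.boost properties and
their defaults.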
--
Best regards,
Andrzej Bialecki <><
fying the
code directly in ParseOutputFormat, it's complex and fragile.
--
Best regards,
Andrzej Bialecki <><
ripts generating the response ...
it was a total mess.
So, if you target 10 sites, you can make it work. If you target 10,000
sites all using slightly different methods, then forget it.
--
Best regards,
Andrzej Bialecki
really no content for the redirected
url?
--
Best regards,
Andrzej Bialecki <><
still a few months away.
--
Best regards,
Andrzej Bialecki <><
http://code.google.com/p/boilerpipe/.
--
Best regards,
Andrzej Bialecki <><
arently that site no longer exists. Sorry :( However, you can still
check out that code from the CVS repository at nutch.sf.net.
--
Best regards,
Andrzej Bialecki <><
On 2010-02-21 12:36, reinhard schwab wrote:
Andrzej Bialecki schrieb:
On 2010-02-20 23:32, reinhard schwab wrote:
Andrzej Bialecki schrieb:
On 2010-02-20 22:45, reinhard schwab wrote:
the content of one page is stored even 7 times.
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
i
On 2010-02-20 23:32, reinhard schwab wrote:
Andrzej Bialecki schrieb:
On 2010-02-20 22:45, reinhard schwab wrote:
the content of one page is stored even 7 times.
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
i believe this comes from
Recno:: 383
URL::
http://www.cinema-paradiso.at
set of URL params,
such as sessionId, print=yes, etc) or completely unrelated (human
errors, peculiarities of the content management system, or mirrors). In
your case it seems that the same page is available under different
values of g2_highlightId.
--
Best regards,
Andrzej Bialecki
On 2010-02-09 03:08, Hua Su wrote:
Thanks. But heritrix is another project, right?
Please see this Git repository, it contains the latest work in progress
on Nutch+HBase:
git://github.com/dogacan/nutchbase.git
--
Best regards,
Andrzej Bialecki
WARN hdfs.DFSClient - DFS Read:
java.io.IOException: Could not obtain block: blk_-6931814167688802826_9735
file=/user/root/crawl/indexed-segments/20100117235244/part-0/_1lr.prx
This error is commonly caused by running out of disk space on a datanode.
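A quick way to check is the standard Hadoop command:

  hadoop dfsadmin -report

which prints the configured / used / remaining capacity per datanode.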
--
Best regards,
Andrzej Bialecki
On 2010-01-15 20:09, MilleBii wrote:
Inject is meant to seed the database at the start.
But I would like to inject new urls on a production crawldb, I think it
works but I was wondering if somebody could confirm that.
Yes. New urls are merged with the old ones.
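I.e. it's safe to simply run, e.g.:

  bin/nutch inject crawl/crawldb new_urls

where new_urls is a directory of text files listing the additional seed
urls.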
--
Best regards,
Andrzej Bialecki
e urlfilter-automaton, which is slightly less expressive but much
much faster.
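To switch, adjust plugin.includes in nutch-site.xml, e.g. (a sketch -
keep whatever other plugins you already use):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-automaton|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>

and put your +/- patterns in automaton-urlfilter.txt.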
--
Best regards,
Andrzej Bialecki <><
e in a separate plugin then it might. Another
reason is configurability - if you put this code in a separate plugin,
you can easily turn it on/off, but if it sits in HtmlParser this would
be more difficul
which does happen in development & test phases, less in
production though.
Right. Also, a common practice is to keep the raw data for a while just
to make sure that the parsing and indexing went smoothly (in case you
need to re-parse the raw content).
--
Best regards,
Andrzej Bialecki
ks will incrementally merge the existing linkdb with new
links from a new segment.
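E.g.:

  bin/nutch invertlinks crawl/linkdb crawl/segments/20100101123456

Re-running this with further segments keeps merging the new links into
the existing linkdb.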
--
Best regards,
Andrzej Bialecki <><
efae13d6cf878691
Umm .. if anything that comment suggests that properly handling diverse
PDFs is simply a hard thing to do, and PDFBox is not that much to blame.
--
Best regards,
Andrzej Bialecki <><
(2 documents), and if
the problem persists please report this in JIRA.
--
Best regards,
Andrzej Bialecki <><
On 2009-12-22 16:07, Claudio Martella wrote:
Andrzej Bialecki wrote:
On 2009-12-22 13:16, Claudio Martella wrote:
Yes, I'm aware of that. The problem is that I have some fields of the
SolrDocument that I want to compute by text analysis (basically I want
to do some smart keywords extra
ution that you are looking for is an IndexingFilter - this
receives a copy of the document with all fields collected just before
it's sent to the indexing backend - and you can freely modify the
content of NutchDocument, e.g. do additional analysis, add/remove/modify
fields, etc.
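A minimal sketch of such a filter, assuming the Nutch 1.x API (the
"keywords" field and the extractKeywords() helper are hypothetical, and
depending on your exact version the interface may declare additional
methods):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  public class KeywordIndexingFilter implements IndexingFilter {
    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) {
      // run your own analysis on the extracted text and add a field
      doc.add("keywords", extractKeywords(parse.getText()));
      return doc;  // return null to drop the document from the index
    }

    private String extractKeywords(String text) {
      return text;  // placeholder for the "smart keywords" logic
    }

    public Configuration getConf() { return conf; }
    public void setConf(Configuration conf) { this.conf = conf; }
  }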
--
Best regards,
Andrzej Bialecki
ed in that patch required too much maintenance. On the positive
side, it worked well with super-large keys and values (in the order of
gigabytes).
--
Best regards,
Andrzej Bialecki <><
Best regards,
Andrzej Bialecki <><
es) - maybe we should commit the change?
Thanks for reporting this - could you perhaps try to apply that patch
and see if it helps? I hesitated to commit it because it's really a
workaround and not a solution ... but if it works for you then it's
better than nothing.
--
Best regards,
Andrzej Bialecki
On 2009-12-14 16:05, BrunoWL wrote:
Nobody?
Please, any answer would be good.
Please check this issue:
https://issues.apache.org/jira/browse/NUTCH-479
That's the current status, i.e. this functionality is available only as
a patch.
--
Best regards,
Andrzej Bialecki
t contains part-N partial indexes).
--
Best regards,
Andrzej Bialecki <><
e to
regex-urlnormalizer that changes the matching urls to e.g. always lose
the 'www.' part.
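E.g. a rule like this in regex-normalize.xml (a sketch):

  <regex>
    <pattern>^(https?://)www\.</pattern>
    <substitution>$1</substitution>
  </regex>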
--
Best regards,
Andrzej Bialecki <><
that page.
Very good explanation, that's exactly the reason why Nutch never
discards such pages. If you really want to ignore certain pages, then
use URLFilters and/or ScoringFilters.
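With urlfilter-regex that's a one-line rule in regex-urlfilter.txt,
e.g. (the pattern is hypothetical):

  -^http://www\.example\.com/ignore/

Rules starting with '-' reject matching urls, '+' accepts them; the
first matching rule wins.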
--
Best regards,
Andrzej Bialecki <><
g the nutch*.job
to a separate Hadoop cluster? Could you please try it with a standalone
Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)?
--
Best regards,
Andrzej Bialecki <><
, please create a JIRA issue in Nutch and attach
the patch.
--
Best regards,
Andrzej Bialecki <><
the priority of URL during
generation. See ScoringFilter.generatorSortValue(..), you can modify
this method in scoring-opic (or in your own scoring filter) to
prioritize certain urls over others.
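A minimal sketch (Nutch 1.x API; the host check and the boost factor
are made up for illustration):

  public float generatorSortValue(Text url, CrawlDatum datum, float initSort) {
    // urls from the priority host get generated ahead of the others
    if (url.toString().contains("priority.example.com")) {
      return initSort * 10.0f;
    }
    return initSort;
  }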
--
Best regards,
Andrzej Bialecki
.
--
Best regards,
Andrzej Bialecki <><
reinhard schwab wrote:
this crawl date will be fetched and fetched again with 0 days retry
interval.
i will open an issue in jira and attach a patch.
Thanks for catching this bug - please do so.
--
Best regards,
Andrzej Bialecki
d. However, the deduplication process doesn't
accept partial indexes, so you need to specify each /part-NNNN dir as an
input to dedup.
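E.g. (the part numbers depend on how many reduce tasks wrote the index):

  bin/nutch dedup crawl/indexes/part-00000 crawl/indexes/part-00001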
--
Best regards,
Andrzej Bialecki <><
merged index, and "indexes" for
partial indexes), otherwise they won't be found by the NutchBean (the
search component in Nutch). So e.g. your Lucene index in index1/ won't
be found.
--
Best regards,
Andrzej Bialecki <><
Paul Tomblin wrote:
On Sat, Nov 28, 2009 at 5:48 PM, Andrzej Bialecki wrote:
Paul Tomblin wrote:
-bash-3.2$ jstack -F 32507
Attaching to process ID 32507, please wait...
Hm, I can't see anything obviously wrong with that thread dump. What's the
CPU and swap usage, and load
Paul Tomblin wrote:
On Sat, Nov 28, 2009 at 4:45 PM, Andrzej Bialecki wrote:
Paul Tomblin wrote:
How can I tell what's going on and why it's stopped?
Try to generate a thread dump to see what code is being executed.
I didn't do any sort of distributed mode because I&
thread dump to see what code is being executed.
--
Best regards,
Andrzej Bialecki <><
Next week I will be working on integrating the patches from Julien, and
if time permits I could perhaps start working on a speed monitoring to
lock out slow servers.
--
Best regards,
Andrzej Bialecki <><
, slow map tasks tend to hang around, but still
some of them finish and make space for new tasks. As time goes on, the
majority of your tasks become slow tasks, so the overall speed
continues to drop.
--
Best regards,
Andrzej Bialecki
uses ICU4J CharsetDetector plus its own
heuristic (in util.EncodingDetector and in HtmlParser) that tries to
detect character encoding if it's missing or even if it's wrong - but
this is a tricky issue and sometimes results are unpredictable.
--
Best regards,
Andrzej Bialecki
track which
thread you replied to and your question is "hidden" in that thread and
gets less attention. It makes following discussions in the mailing
list archives particularly difficult."
--
Best regards,
Andrzej Bialecki
put, to see how many unique hosts are in the current
working set.
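A rough way to count them from a crawldb dump (a sketch):

  bin/nutch readdb crawl/crawldb -dump crawldb-dump
  cut -d/ -f3 crawldb-dump/part-* | sort -u | wc -l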
--
Best regards,
Andrzej Bialecki <><
innocuous - it helps to debug at which points in the
code the Configuration instances are being created. And you wouldn't
have seen this if you didn't turn on the DEBUG logging. ;)
--
Best regards,
Andrzej Bialecki
the db in order to update the signatures.
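E.g. (a sketch, with a made-up segment path):

  bin/nutch updatedb crawl/crawldb crawl/segments/20100101123456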
--
Best regards,
Andrzej Bialecki <><
d to use a more relaxed Signature implementation, e.g.
TextProfileSignature.
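I.e. in nutch-site.xml:

  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
  </property>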
--
Best regards,
Andrzej Bialecki <><
ls in
your crawldb.
--
Best regards,
Andrzej Bialecki <><
and rebuild
it from all per-segment indexes plus that most recent one. And then
deduplicate. If this sounds wasteful, please keep in mind that when
Lucene merges indexes it needs to re-write the main index anyway, so in
terms of disk IO it should be nearly the same.
--
Best regards,
Andrzej Bialecki
?
Hm, indeed this looks like a bug - we should instead do something like this:
if (datum.getFetchInterval() > maxInterval) {
  datum.setFetchInterval(maxInterval * 0.9f);
}
--
Best regards,
Andrzej Bialecki <><
!
--
Best regards,
Andrzej Bialecki <><
Dennis Kubes wrote:
I would like to get a couple things in this release as well. Let me
know if you want help with the upgrade.
You mean you want to do the Hadoop upgrade? I won't stand in your way :)
--
Best regards,
Andrzej Bialecki
e current code, but its design is obscured by the
ScoringFilter API and the need to maintain its own extended DBs.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|
week) - and I agree that we
should have a 1.1 release in the near future.
--
Best regards,
Andrzej Bialecki <><
(and webmasters who are their victims). The
source code is there, if you choose you can modify it to bypass these
restrictions, just be aware of the consequences (and don't use "Nutch"
as your user agent ;) ).
tform encoding - any characters
outside this encoding will be replaced by question marks.
If you want to get an exact copy of the raw binary content then please
use the SegmentReader API.
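A sketch of reading the raw bytes straight from a segment's content
data (assuming the Nutch 1.x on-disk layout; the segment path is made
up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.protocol.Content;
  import org.apache.nutch.util.NutchConfiguration;

  public class RawContentDump {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      FileSystem fs = FileSystem.get(conf);
      Path data = new Path("crawl/segments/20100101123456/content/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      Content content = new Content();
      while (reader.next(url, content)) {
        byte[] raw = content.getContent();  // the exact bytes as fetched
        System.out.println(url + ": " + raw.length + " bytes");
      }
      reader.close();
    }
  }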
--
Best regards,
Andrzej Bialecki
s of course depends on the "last
modified" timestamp being present on the webpage that is being crawled,
which I believe is not mandatory. Still those who do set it would benefit.
This is already implemented - see the Signature / MD5Signature /
TextProfileSignature.
--
Best regards,
Andrzej Bialecki
--
Best regards,
Andrzej Bialecki <><
them to use different ports AND different local paths.
--
Best regards,
Andrzej Bialecki <><
efines the implementation of the "file://"
scheme FileSystem. Now you probably forgot to put hadoop-default.xml on
your classpath. Go to Build Path and add this file to your classpath,
and all should be ok.
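The relevant entry in hadoop-default.xml looks like this (from memory,
so treat it as a sketch):

  <property>
    <name>fs.file.impl</name>
    <value>org.apache.hadoop.fs.LocalFileSystem</value>
  </property>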
--
Best regards,
Andrzej Bialecki
the index.
--
Best regards,
Andrzej Bialecki <><
mirrors.
Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an
attractive platform to develop and experiment with such components.
-
Briefly ;) that's what comes to my mind when I think about the
adseg), and you can use its API to retrieve either all or
individual records from a segment (using URL as key).
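E.g. to fetch a single record by its url (the segment path and url are
examples):

  bin/nutch readseg -get crawl/segments/20100101123456 http://example.com/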
--
Best regards,
Andrzej Bialecki <><
Andrzej Bialecki wrote:
doesn't work, as reported by me and others last week.
Thanks,
Did you get the message with the subject of "confirm unsubscribe from
nutch-user@lucene.apache.org" and did you respond to it from the same
email account that you were subscribed from?
--
Best regards,
Andrzej Bialecki <><
ntifier code in my plugin
code without actually using the language-identifier plugin?
You need to add the language-identifier plugin to the <requires> section
in your plugin.xml, like this:
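  <requires>
    <import plugin="nutch-extensionpoints"/>
    <import plugin="language-identifier"/>
  </requires>

(a sketch - keep whatever imports your plugin already declares and just
add the language-identifier line).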
--
d re-running the operation.
* minor issue - when specifying the path names of segments and crawldb,
do NOT append the trailing slash - it's not harmful in this particular
case, but you could have a nasty surprise when doing e.g. copy / mv
op
the longest is assigned a lot of URLs from a
single host.
A workaround for this is to limit the max number of URLs per host (in
nutch-site.xml) to a more reasonable number, e.g. 100 or 1000, whatever
works best for you.
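E.g. in nutch-site.xml:

  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
  </property>

(-1, the default, means no limit).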
--
Best regards,
Andrzej Bialecki
--
Best regards,
Andrzej Bialecki <><
ment on or reject it by returning null.
--
Best regards,
Andrzej Bialecki <><
Gora Mohanty wrote:
On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki wrote:
[...]
Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
They are kept in Nutch crawldb to prevent their re-discovery
(through stale links pointing to these URL-s from other pages).
If you really want
URLs directly from CrawlDb (using e.g.
CrawlDbReader API) and then uses SolrJ API to send the same delete
requests + commit.
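A sketch of that tool, assuming the SolrJ API of that era and that the
index uses the page url as its unique id (goneUrls stands for whatever
list you extracted via the CrawlDbReader API):

  import java.util.List;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class DeleteGoneUrls {
    public static void delete(List<String> goneUrls) throws Exception {
      SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
      for (String url : goneUrls) {
        solr.deleteById(url);  // assumes the document id is the url
      }
      solr.commit();
    }
  }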
--
Best regards,
Andrzej Bialecki <><
at.MIN_VALUE) {
return;
}
--
Best regards,
Andrzej Bialecki <><
Java - you need to mount this location as
a local volume.
--
Best regards,
Andrzej Bialecki <><
. This problem is rare - I think I've crawled cumulatively ~500mln
pages in various configs and it never happened to me personally. It
requires a few things to go wrong (see the issue comments).
--
Best regards,
Andrzej Bialecki
le.tar!myfile.txt) and add the original URL in the metadata, to keep
track of the parent URL. The rest should be handled automatically,
although there are some other complications that need to be handled as
well (e.g. don't recraw
ult is 100 - when crawling filesystems
each file in a directory is treated as an outlink, and this limit is
then applied.
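The property in question is db.max.outlinks.per.page, e.g. in
nutch-site.xml:

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>5000</value>
  </property>

(-1 means no limit).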
--
Best regards,
Andrzej Bialecki <><
valid, and cannot be written to.
Are you sure you are running a single datanode process per machine?
--
Best regards,
Andrzej Bialecki <><
I agree with Dennis - use Nutch if you need to do a larger-scale
discovery such as when you crawl the web, but if you already know all
target pages in advance then Solr will be a much better (and much easier
to handle) platform.
--
Best regards,
Andrzej Bialecki