to be altered to achieve
this?
Just remove the following directories from each segment: crawl_parse,
parse_text, parse_data, and then run bin/nutch parse on these segments.
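For illustration, assuming a segment at crawl/segments/20070601123456 on
the local filesystem (the path is an example):

rm -rf crawl/segments/20070601123456/crawl_parse
rm -rf crawl/segments/20070601123456/parse_text
rm -rf crawl/segments/20070601123456/parse_data
bin/nutch parse crawl/segments/20070601123456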
--
Best regards,
Andrzej Bialecki
- it's equivalent to IdentityReducer,
which is used implicitly by this job. This class is a leftover from the
time when it also contained some filtering code.
--
Best regards,
Andrzej Bialecki
, and mergesegs to merge segments ;)
And a simple merge combines the indexes of multiple segments, which is a
performance-related step in the regular Nutch work cycle.
--
Best regards,
Andrzej Bialecki
Carl Cerecke wrote:
Carl Cerecke wrote:
Andrzej Bialecki wrote:
Carl Cerecke wrote:
I've given this a crack and it mostly seems to work, except I'm not
sure how to get the score back into the crawldb. After reading the
Javadoc, I figured that passScoreAfterParsing() was the method I need?
It would probably be too slow, unless you made a copy of linkdb/crawldb
on the local filesystems of each node. But at that point the benefit of this
change would be doubtful, because of all the I/O you would need to do to
prepare each task's environment ...
--
Best regards,
Andrzej Bialecki
), it should be enough to put the
nutch*.job file in ${hadoop.dir}, and copy bin/nutch (possibly with some
minor modifications - my memory is a little vague on this ...).
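For illustration, something like the following (paths are examples, and
the job file name depends on your build):

cp build/nutch-*.job /path/to/hadoop/
cp bin/nutch /path/to/hadoop/bin/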
--
Best regards,
Andrzej Bialecki
get the index.ja.html page instead of the English page.
Please see org.apache.nutch.protocol.httpclient.Http.java:116 -
currently this is hardcoded, but it would be easy to turn it into a
configuration parameter.
--
Best regards,
Andrzej Bialecki
is the historical genesis of this issue (or is that
even relevant)?
The Nutch webapp doesn't have anything to do with it. The limitations in the
query syntax have different roots (see above).
--
Best regards,
Andrzej Bialecki
* sort the text blocks by size,
* drop a certain number (or percentage) of the smallest text blocks,
* put the blocks back in order, and extract only their text content.
This is the main body text.
--
Best regards,
Andrzej Bialecki
if the
dates (with this resolution) were stored in a single field. The other
method (combining) is already in use in Nutch, and implemented in
CommonGrams.
--
Best regards,
Andrzej Bialecki
Doğacan Güney wrote:
On 6/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
These 'urls' most likely come from parse-js plugin. Can you disable it
and see if they disappear? To extract links from js code, parse-js
uses a heuristic that unfortunately also may extract
response times on most queries.
Are you running with a sorted index, and using non-zero
searcher.max.hits? If you use a well-defined PR-like scoring, then using
this feature could do wonders for performance, and increase the max
number of docs per server.
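For reference, a sketch of the setting in nutch-site.xml (the value is
illustrative, and it only pays off with an index sorted by IndexSorter):

<property>
  <name>searcher.max.hits</name>
  <value>1000</value>
</property>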
--
Best regards,
Andrzej
queries:
http://www.nabble.com/Performance-optimization-for-Nutch-index---query-tf3276316.html#a9111523
--
Best regards,
Andrzej Bialecki
this (for
performance reasons). Whenever the full text is needed, it's retrieved
from Nutch segment data. Please see the logic in o.a.n.s.FetchedSegment
for details - this process doesn't use Lucene at all, it simply
retrieves records from Hadoop MapFile using URL as document ID.
--
Best regards,
Andrzej
- this should be parseData instead of parse.
--
Best regards,
Andrzej Bialecki
want him not to see
certain sites in the results that have been crawled.
How can this be achieved?
Has anyone solved this problem? I need this filter too. What's the best
way to do it in Nutch 0.9?
Any thoughts?
http://issues.apache.org/jira/browse/NUTCH-477
--
Best regards,
Andrzej
it in a regex, or you can implement your own URLFilter plugin
that does exactly this.
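For illustration, an exclusion rule in conf/regex-urlfilter.txt could look
like this (the host is a placeholder):

-^http://(www\.)?example\.com/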
--
Best regards,
Andrzej Bialecki
Enzo Michelangeli wrote:
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Sunday, June 10, 2007 5:48 PM
Enzo Michelangeli wrote:
- Original Message - From: Berlin Brown
[EMAIL PROTECTED]
Sent: Sunday, June 10, 2007 11:24 AM
Yea, but how do crawl
and you have a very impolite fetcher. Please don't run this to fetch
a site you don't control :)
... because it defeats the built-in controls that Nutch uses to avoid
making multiple concurrent requests to the same site, or making them
too quickly.
--
Best regards,
Andrzej Bialecki
, it handles cookies properly
without any additional configuration.
However, they are not stored anywhere, so they will be valid only for
the duration of a single fetch.
--
Best regards,
Andrzej Bialecki
the source
of DOMContentUtils to artificially limit the level of recursion in
getOutlinks to something like 200-300.
--
Best regards,
Andrzej Bialecki
Enzo Michelangeli wrote:
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Monday, June 04, 2007 2:05 PM
Er... I saw it mentioned at http://wiki.apache.org/nutch/FetchOptions
, so I thought it was for real...
Sorry, this page is wrong and should be corrected
Enzo Michelangeli wrote:
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Monday, June 04, 2007 1:31 AM
Enzo Michelangeli wrote:
In my case (with Nutch 0.8), it seems not: I set it to 500, and the
fetcher still saturates the 1.5 Mbit/s link... Is it supposed
property with such a name ... Is this perhaps a part of your
local code base?
--
Best regards,
Andrzej Bialecki
fast, although they differ
in their accuracy vs. speed balance. Unfortunately the code is not public -
but the task is certainly doable, and doesn't require major changes.
--
Best regards,
Andrzej Bialecki
?
You can use the *Merger tools to re-write the data. E.g. CrawlDbMerger
for crawldb, giving just a single db as the input argument.
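A minimal sketch of such an invocation, with illustrative paths - the
output directory becomes a rewritten copy of the single input db:

bin/nutch org.apache.nutch.crawl.CrawlDbMerger crawl/crawldb_new crawl/crawldb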
--
Best regards,
Andrzej Bialecki
/fetcher2_robots.patch
Good catch! The patch looks good, too - please go ahead. One question:
why did you remove the call to finishFetchItem() around line 505?
--
Best regards,
Andrzej Bialecki
to
increase the number of concurrent requests and the cache size. This was
on Linux, though - I have no idea how to do this on Windows.
--
Best regards,
Andrzej Bialecki
for this, if
I am mistaken, just give me a nudge and I will send an updated patch.
Indeed, you're right - I should've checked with the base version, not
just the patch.
--
Best regards,
Andrzej Bialecki
should be fine.
--
Best regards,
Andrzej Bialecki
which
config files are loaded in what order and from what locations.
--
Best regards,
Andrzej Bialecki
, and queue info logging.
--
Best regards,
Andrzej Bialecki
Doğacan Güney wrote:
On 5/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
Hi everyone,
Has anyone tried Fetcher2 from the latest trunk? In our tests, Fetcher2 is
always slower (by a large margin) than Fetcher.
For a segment with ~3 urls, we ran Fetcher with 150
junk and spam - unless you tightly
control the quality of URLs, using URLFilters, ScoringFilters and other
means.
--
Best regards,
Andrzej Bialecki
, the
request is terminated and Nutch is able to do the right thing.
The default protocol-http plugin does not use the Apache Commons HttpClient
library, and works correctly.
Could you please create a JIRA issue, so that your analysis and the
possible fix are recorded? Thanks!
--
Best regards,
Andrzej
implement the former.
--
Best regards,
Andrzej Bialecki
hitting OS-wide limits of open file
handles. In another installation the OS-wide limits were ok, but the
limits on this particular account were insufficient.
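For illustration, on Linux you can check and raise the per-shell limit
like this (the value is an example; raising the hard limit may require
root, or an entry in /etc/security/limits.conf):

ulimit -n        # show the current open-file limit
ulimit -n 65536  # raise it for processes started from this shell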
--
Best regards,
Andrzej Bialecki
about such things should be factored out and
encapsulated in a utility class.
This is more work than just adding a single-line check, which may
suggest why it hasn't been done yet. Patches are welcome ;)
--
Best regards,
Andrzej Bialecki
into Nutch queries, and then
translated into Lucene queries, using this tool:
bin/nutch org.apache.nutch.searcher.Query
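For illustration - assuming the tool reads queries from standard input,
which may differ between versions:

echo 'apache nutch' | bin/nutch org.apache.nutch.searcher.Query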
--
Best regards,
Andrzej Bialecki
lowering it.
* Please try the following modification: somewhere around
LinkDb.java:283 add the following line:
job.setCombinerClass(LinkDb.class);
Recompile and re-run (see the sketch after this list).
* Also, as others suggested, you may want to turn on compression.
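A sketch of the recompile-and-rerun step, assuming an ant-based source
checkout and illustrative paths:

ant job                                  # rebuild the nutch*.job file
bin/nutch invertlinks crawl/linkdb -dir crawl/segments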
--
Best regards,
Andrzej Bialecki
unfetched pages.
You can also modify the Generator to completely skip such flagged pages.
--
Best regards,
Andrzej Bialecki
send Nutch-related questions first to Nutch groups).
What is your operating system (uname -a)?
--
Best regards,
Andrzej Bialecki
wangxu wrote:
Linux wangxu.com 2.6.8-2-386 #1 Tue Aug 16 12:46:35 UTC 2005 i686 GNU/Linux
Andrzej Bialecki wrote:
wangxu wrote:
when I use nutch-nightly 0.9, I got this:
Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
And I echo
thing to do then would be to rewrite absolute
outlinks contained in the content, from staging to www - but this can be
done in URLNormalizers.
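For illustration, a rewrite rule for the regex URL normalizer
(conf/regex-normalize.xml; the hostnames are placeholders):

<regex>
  <pattern>^http://staging\.example\.com/</pattern>
  <substitution>http://www.example.com/</substitution>
</regex>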
--
Best regards,
Andrzej Bialecki
, it was completely rewritten - I don't
think there's any detailed documentation on this, though...
--
Best regards,
Andrzej Bialecki
0.7.2 was released but failed to
locate any such discussion).
Please see above. The answer is yes. ;)
--
Best regards,
Andrzej Bialecki
Mathijs Homminga wrote:
Hi all,
Is there a way to split large segments into smaller pieces?
Mathijs
As the name suggests (not ;)), use SegmentMerger with the -slice option.
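For illustration (paths and the slice size are examples; each output
slice will hold at most that many URLs):

bin/nutch mergesegs crawl/segments_sliced -dir crawl/segments -slice 50000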
--
Best regards,
Andrzej Bialecki
any answers.
What helps is when you create a bug issue in JIRA, describe the problem
and attach a patch that helped in your case.
Thank you for your co-operation. ;)
--
Best regards,
Andrzej Bialecki
public URLs, could you please send me your fetchlist ?
--
Best regards,
Andrzej Bialecki
provide a descriptive message instead of
throwing NPE. Care to provide a patch? ;)
--
Best regards,
Andrzej Bialecki
in the Parse
MetaData.
The reason is simple - space. Storing additional data consumes space,
and if someone only occasionally needs this info from one or two pages,
it's less costly to just re-parse the page.
--
Best regards,
Andrzej Bialecki
rubdabadub wrote:
On 3/2/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Dennis Kubes wrote:
Believe it or not, I don't think that meta tags are currently stored.
I looked through the HTML parsing code and didn't see anywhere that they
could be stored, except in HTML filters. I see
by the symbolic name inside the SequenceFile.
--
Best regards,
Andrzej Bialecki
the javadoc says, so that there's no
misunderstanding: if you use DFS and your fetch job is aborted, there is
no way in the world to recover the data - it's permanently lost. If you
run with a local FS, you can try this tool and hope for the best.
--
Best regards,
Andrzej Bialecki
Lucifersam wrote:
Andrzej Bialecki wrote:
Lucifersam wrote:
Finally - I seem to have a problem with identical pages with different
urls
- i.e.
http://website/
http://website/default.htm
I was under the impression that these would be removed by the dedup
process,
but this does
not support it, but it's easy to add.
--
Best regards,
Andrzej Bialecki
partition ... I need to check where the
problem originates - however, this should not happen if you index more
documents than 2 * the number of reduce tasks.
--
Best regards,
Andrzej Bialecki
and quickly,
rather, they make a bunch of requests for resources tied to a single
page, then wait a relatively long time, and then make another bunch of
requests ... So the request pattern is still fairer than that of
a mad crawler.
--
Best regards,
Andrzej Bialecki
or more indexes under crawled/indexes is
invalid - nonexistent, incomplete or corrupt.
--
Best regards,
Andrzej Bialecki
) - and quite often all
requests from such sources get blocked at the firewall level -
sometimes, even whole IP classes get blocked.
So, t(h)read carefully ...
--
Best regards,
Andrzej Bialecki
for automating
these types of job streams in Python, but that is not complete yet.
Andrzej, do you think this is something we should post to the wiki?
Sure, if it's ok for you to release it I'm sure many people would find
it useful.
--
Best regards,
Andrzej Bialecki
function which maps String to Integer,
but even in this case you would have a small probability that existing
URLs will be re-numbered. The space of int is too small to use random
hashing and hope there are no collisions.
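To put a number on it: by the birthday bound, with 2^32 possible hash
values the probability of at least one collision reaches about 50% after
only sqrt(2 * 2^32 * ln 2), i.e. roughly 77,000 URLs - far fewer than even
a modest crawldb contains.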
--
Best regards,
Andrzej Bialecki
Nicolás Lichtmaier wrote:
I'd like to limit nutch to fetch, refetch and index just the injected
URLs. Will setting db.max.outlinks.per.page to 0 enable me to do that?
If not... how could I achieve what I'm looking for?
You need to run updatedb with the -noAdditions switch.
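For illustration (paths are examples, syntax as in Nutch 0.9):

bin/nutch updatedb crawl/crawldb -dir crawl/segments -noAdditions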
--
Best regards,
Andrzej
,
redirecting, running JavaScript, etc. In the end only perhaps 1 out of
50 sites was using a plain form authentication, and even that with
different field names on the form ... so I gave up.
--
Best regards,
Andrzej Bialecki
of this information is already available on the Nutch Wiki. All I
can say is that there is certainly a limit to what you can do using the
local mode - if you need to handle large numbers of pages you will
need to migrate to the distributed setup.
--
Best regards,
Andrzej Bialecki
from
0.8 and later, and offers only limited scalability.
Still, this workaround should work ok ...
--
Best regards,
Andrzej Bialecki
.
There were also other intermittent problems with this library, so after
much deliberation we decided to leave the simpler plugin as the default ...
These issues may have been solved in a newer version of the httpclient library.
--
Best regards,
Andrzej Bialecki
)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
... and that's because urlDir: urls/url-fr.txt is not a directory but
a file. You should pass only the urls directory as the input - Nutch
will read all the text files inside it.
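For illustration, passing the urls directory (not a file inside it):

bin/nutch crawl urls -dir crawl -depth 3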
--
Best regards,
Andrzej Bialecki
of threads accessing a single host, and
delay between requests. Look for 'Exceeded http.max.delays' errors in
your log.
--
Best regards,
Andrzej Bialecki
- most likely you have the default rule that
discards URLs with special characters.
--
Best regards,
Andrzej Bialecki
to physically remove
all blocks that are not accounted for in the current fsimage).
If it's any consolation - this problem is recognized, and people are
actively working on fixing it.
--
Best regards,
Andrzej Bialecki
!
This exception doesn't tell you anything except that the job failed... You
need to increase the logging level to DEBUG - please check
log4j.properties.
My guess is that most likely one of these segments is unfetched or
corrupted.
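For reference, raising the level in conf/log4j.properties can look like
this (a sketch; the logger names follow the usual package layout):

log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=DEBUG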
--
Best regards,
Andrzej Bialecki
)
Nutch 0.8.1 doesn't work with any other version of Hadoop than the one
it's supplied with - i.e. version 0.4.0-patched.
--
Best regards,
Andrzej Bialecki
time I see that Hadoop detects non-obvious
errors in hardware or connectivity on a cluster - on one hand, it would
be nice if it were less susceptible to this kind of error; on the other
hand - it makes for a good diagnostic tool ;)
--
Best regards,
Andrzej Bialecki
this on an NFS volume, using LocalFileSystem? You aren't
running out of disk space by any chance?
--
Best regards,
Andrzej Bialecki
Brian Whitman wrote:
On Jan 15, 2007, at 1:36 PM, Andrzej Bialecki wrote:
Brian Whitman wrote:
(nutch-nightly, hadoop 0.9.1)
The file indicated (bad_files/data.-931801681) is a 255MB binary
file -- running strings on it shows a lot of URIs. There's also a
2MB .data.crc-931801681 file, all
indicate that mapred.speculative.execution is true in your
config - make sure it's explicitly set to false.
--
Best regards,
Andrzej Bialecki
segment. Indexes
contain segment names and document IDs inside, so if you have
merged/sliced your segments you have to rebuild the index too.
--
Best regards,
Andrzej Bialecki
paths for any arguments ... ;)
--
Best regards,
Andrzej Bialecki
mln urls, if even that many. The main bottleneck was
the DB operations, which on any type of hardware would take days
to complete.
These limitations have been largely removed in 0.8 and later, due to the
Hadoop framework.
--
Best regards,
Andrzej Bialecki
to include
information from linkdb when it generates new segments, whichever way is
more suitable to your requirements.
--
Best regards,
Andrzej Bialecki
is more than capable of doing this, all it takes is
one person familiar with the infrastructure and the nightly build process,
and with a day or two to spare ...
--
Best regards,
Andrzej Bialecki
knows anymore).
No, I meant apache.org as in a person (a committer) who is familiar
enough with both Nutch and the local infrastructure at apache.org
to set it up.
--
Best regards,
Andrzej Bialecki
. Any ideas?
In such a case you should always get a full thread dump of the JVM
process. On Unix systems this is done with kill -SIGQUIT <pid>;
under Windows, with Ctrl-Break.
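For illustration (the PID is an example; jps ships with the JDK):

jps -l            # find the PID of the stuck JVM
kill -QUIT 12345  # the thread dump appears on the process's stdout/log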
--
Best regards,
Andrzej Bialecki
/ are incompatible with 0.8.x,
and with earlier versions of trunk/ - see note 17 in CHANGES.txt.
You should also temporarily increase your logging level to DEBUG to see
if there are any problems reported at low level.
--
Best regards,
Andrzej Bialecki
and consuming 100% CPU.
--
Best regards,
Andrzej Bialecki
). This should be trivial to
implement as a scoring plugin.
--
Best regards,
Andrzej Bialecki
)
^^
Please set mapred.speculative.execution to false, and repeat.
--
Best regards,
Andrzej Bialecki
, and Nutch config contains only overrides ... so
you need to put this explicitly into your hadoop-site.xml, like this:
<property>
  <name>mapred.speculative.execution</name>
  <value>false</value>
</property>
If this fixes your problem, I'll put this property in the public sources.
--
Best regards,
Andrzej
trouble, and you come up with some
patches that improve support for *BSD, I may be able to integrate them
back to Hadoop sources.
--
Best regards,
Andrzej Bialecki
again :)
Indeed, this is related to some changes in delete()'s behavior in HDFS -
it seems that previously it would just return false on non-existent
directories, whereas now it throws an exception.
I fixed this in trunk/ and branch-0.8.
--
Best regards,
Andrzej Bialecki
in logs.
--
Best regards,
Andrzej Bialecki
ava:74)
The issue is that this constructor, MapFile.Writer(Configuration,
FileSystem, String, Class, Class), is present only in Hadoop 0.9 -
it didn't exist in earlier versions ...
--
Best regards,
Andrzej Bialecki