I use nutch-0.9 and hadoop-0.12.2. When I run the command "bin/nutch crawl
urls -dir crawled -depth 3", I get this error:
- crawl started in: crawled
- rootUrlDir = input
- threads = 10
- depth = 3
- Injector: starting
- Injector: crawlDb: crawled/crawldb
- Injector: urlDir: input
- Injector: Converting in
Yes.
On 12/13/07 12:22 PM, "Eugeny N Dzhurinsky" <[EMAIL PROTECTED]> wrote:
> On Thu, Dec 13, 2007 at 11:31:49AM -0800, Ted Dunning wrote:
>> After indexing, indexes are moved to multiple query servers. ... (how nutch
>> works) With this architecture, you get good scaling in both queries per
Hi,
I guess that the problem is that I wrote my own LineReader. In this case, the
corresponding InputFormat has to specify that the input is not splitable by
overriding the isSplitable() method. I have got that fixed.
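In case it helps anyone else, this is roughly what the override looks like (a sketch against the old org.apache.hadoop.mapred API; the class name is made up and the exact superclass may differ between Hadoop versions):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// An input format whose files must never be split, e.g. because a custom
// LineReader produces records that may span block boundaries.
public class NonSplittableTextInputFormat extends TextInputFormat {
  // Returning false makes the framework hand each file to a single mapper
  // as one split instead of cutting it at block boundaries.
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}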
Thanks,
Rui
- Original Message
From: Owen O'Malley <[EMAIL PROTEC
On Dec 13, 2007, at 3:03 PM, Runping Qi wrote:
> If your files have .gz as extension, they will split.
They will _not_ split.
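The reason is that the stock text input format checks each file for a compression codec and refuses to split anything compressed, since a gzip stream cannot be decompressed starting from the middle. Roughly like this (an illustrative sketch, not the verbatim Hadoop source):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Illustrative check: decide splittability from the file's compression
// suffix. getCodec() matches .gz to GzipCodec, and any match means the
// file has to be read by a single mapper from the beginning.
public class SplitCheck {
  public static boolean isSplitable(Configuration conf, Path file) {
    CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
    return codecs.getCodec(file) == null;
  }
}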
If your files have .gz as extension, they will split.
Runping
> -Original Message-
> From: Rui Shi [mailto:[EMAIL PROTECTED]
> Sent: Thursday, December 13, 2007 2:53 PM
> To: hadoop-user@lucene.apache.org
> Subject: How to ask hadoop not to split the input
>
> Hi,
>
> My input is a bu
Hi,
My input is a bunch of gz files on local file system. I don't want hadoop to
split them for mappers. How should I specify that?
Thanks,
Rui
On Thu, Dec 13, 2007 at 11:31:49AM -0800, Ted Dunning wrote:
> After indexing, indexes are moved to multiple query servers. The indexes on
> the local query servers are all on local disk.
>
> There are two dimensions to scaling search. The first dimension is query
> rate. To get that scaling, y
Owen O'Malley wrote:
On Dec 12, 2007, at 1:36 PM, Andrzej Bialecki wrote:
Ted Dunning wrote:
Hadoop *normally* uses the Sun JDK. Using gcj successfully would be a bit of a surprise.
GCJ 4.2 does NOT work. With minor tweaks it's possible to compile all Hadoop classes, including contrib, b
After indexing, indexes are moved to multiple query servers. The indexes on
the local query servers are all on local disk.
There are two dimensions to scaling search. The first dimension is query
rate. To get that scaling, you simply replicate your basic search operator
and balance using a si
Ted Dunning wrote:
I don't think so (but I don't run nutch)
To actually run searches, the search engines copy the index to local
storage. Having them in HDFS is very nice, however, as a way to move them
to the right place.
Nutch can search in Lucene indexes on HDFS (see
org.apache.nutch.inde
On Thu, Dec 13, 2007 at 11:03:50AM -0800, Ted Dunning wrote:
>
> I don't think so (but I don't run nutch)
>
> To actually run searches, the search engines copy the index to local
> storage. Having them in HDFS is very nice, however, as a way to move them
> to the right place.
Even in case if th
I don't think so (but I don't run nutch)
To actually run searches, the search engines copy the index to local
storage. Having them in HDFS is very nice, however, as a way to move them
to the right place.
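A minimal sketch of that copy step, assuming the finished index sits in an HDFS directory (the paths below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Pull an index directory out of HDFS onto the query server's local disk,
// then point the searcher at the local copy.
public class IndexFetcher {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem dfs = FileSystem.get(conf);

    Path hdfsIndex = new Path("/indexes/part-00000");   // index produced by the job
    Path localIndex = new Path("/data/search/index");   // local disk on the query node

    // copyToLocalFile() copies the directory (recursively) to the local file system.
    dfs.copyToLocalFile(hdfsIndex, localIndex);
  }
}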
On 12/13/07 10:59 AM, "Eugeny N Dzhurinsky" <[EMAIL PROTECTED]> wrote:
> On Thu, Dec 13,
On Thu, Dec 13, 2007 at 11:36:31AM +0200, Enis Soztutar wrote:
> Hi,
>
> nutch indexes the documents in the org.apache.nutch.indexer.Indexer class.
> In the reduce phase, the documents are output wrapped in ObjectWritable.
> The OutputFormat opens a local indexwriter(FileSystem.startLocalOutput(
On Dec 12, 2007, at 1:36 PM, Andrzej Bialecki wrote:
Ted Dunning wrote:
Hadoop *normally* uses the Sun JDK. Using gcj successfully would be a bit of a surprise.
GCJ 4.2 does NOT work. With minor tweaks it's possible to compile all Hadoop classes, including contrib, but it doesn't run pr
Hi,
Nutch indexes the documents in the org.apache.nutch.indexer.Indexer
class. In the reduce phase, the documents are output wrapped in
ObjectWritable. The OutputFormat opens a local index writer
(FileSystem.startLocalOutput()) and adds all the documents
that are collected. Then puts the index
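For reference, a condensed sketch of that pattern (Lucene 2.x-era calls; the class, field and paths here are made up, this is not the actual Nutch output format):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Build the Lucene index on local disk, then hand the finished directory
// back to the target file system when the task is done.
public class LocalIndexSketch {
  public static void writeIndex(FileSystem fs, Path permanent, Path tmpLocal)
      throws IOException {
    // startLocalOutput() returns a local path to write to; if the target
    // file system is already local, it is the permanent path itself.
    Path localDir = fs.startLocalOutput(permanent, tmpLocal);

    IndexWriter writer = new IndexWriter(localDir.toString(),
                                         new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("url", "http://example.com/",
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

    // completeLocalOutput() moves the finished index to the permanent
    // location (e.g. into HDFS) and removes the temporary local copy.
    fs.completeLocalOutput(permanent, tmpLocal);
  }
}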
Hello!
We would like to use Hadoop to index a lot of documents, and we would like to
have this index in Lucene and utilize Lucene's search engine power for
searching.
At this point I am a bit confused - when we analyze documents in the Map
part, we will end up with:
- document name/location
- li
No sign of 'upgrade still needs to be finalized' or anything similar... so I
assume removing the 'previous' dir is safe then?
On 12.12.2007, at 21:18, Konstantin Shvachko wrote:
2) Is there a way of finding out whether finalize still needs to
be run?
Yes, you can see it on the name-node web UI, a