Nutch crawl problem

2007-12-13 Thread jibjoice
I use nutch-0.9 and hadoop-0.12.2, and when I run the command "bin/nutch crawl urls -dir crawled -depth 3" I get an error: - crawl started in: crawled - rootUrlDir = input - threads = 10 - depth = 3 - Injector: starting - Injector: crawlDb: crawled/crawldb - Injector: urlDir: input - Injector: Converting in

Re: map/reduce and Lucene integration question

2007-12-13 Thread Ted Dunning
Yes. On 12/13/07 12:22 PM, "Eugeny N Dzhurinsky" <[EMAIL PROTECTED]> wrote: > On Thu, Dec 13, 2007 at 11:31:49AM -0800, Ted Dunning wrote: >> After indexing, indexes are moved to multiple query servers. ... (how nutch >> works) With this architecture, you get good scaling in both queries per

Re: How to ask hadoop not to split the input

2007-12-13 Thread Rui Shi
Hi, I guess that the problem is that I wrote my own LineReader. In this case, the corresponding InputFormat has to specify that the input is not splitable by overriding the isSplitable() method. I have got that fixed. Thanks, Rui - Original Message From: Owen O'Malley <[EMAIL PROTEC
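For anyone hitting the same thing, here is a minimal sketch of that fix against the old org.apache.hadoop.mapred API used in the 0.12.x line. The class name is made up, and the exact base class and isSplitable() signature vary a bit between releases:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Illustrative name: an input format whose files are never split, so each
    // input file is handed to exactly one mapper.
    public class WholeFileTextInputFormat extends TextInputFormat {
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;   // never split, regardless of file size
      }
    }

It would then be wired into the job with something like conf.setInputFormat(WholeFileTextInputFormat.class) on the JobConf.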

Re: How to ask hadoop not to split the input

2007-12-13 Thread Owen O'Malley
On Dec 13, 2007, at 3:03 PM, Runping Qi wrote: > If your files have .gz as extension, they will split. They will _not_ split.

RE: How to ask hadoop not to split the input

2007-12-13 Thread Runping Qi
If your files have .gz as extension, they will split. Runping > -Original Message- > From: Rui Shi [mailto:[EMAIL PROTECTED] > Sent: Thursday, December 13, 2007 2:53 PM > To: hadoop-user@lucene.apache.org > Subject: How to ask hadoop not to split the input > > Hi, > > My input is a bu

How to ask hadoop not to split the input

2007-12-13 Thread Rui Shi
Hi, My input is a bunch of gz files on local file system. I don't want hadoop to split them for mappers. How should I specify that? Thanks, Rui

Re: map/reduce and Lucene integration question

2007-12-13 Thread Eugeny N Dzhurinsky
On Thu, Dec 13, 2007 at 11:31:49AM -0800, Ted Dunning wrote: > After indexing, indexes are moved to multiple query servers. The indexes on > the local query servers are all on local disk. > > There are two dimensions to scaling search. The first dimension is query > rate. To get that scaling, y

Re: Error Nutchwax Search

2007-12-13 Thread Andrzej Bialecki
Owen O'Malley wrote: On Dec 12, 2007, at 1:36 PM, Andrzej Bialecki wrote: Ted Dunning wrote: Hadoop *normally* uses the Sun JDK. Using gcj successfully would be a bit of a surprise. GCJ 4.2 does NOT work. With minor tweaks it's possible to compile all Hadoop classes, including contrib, b

Re: map/reduce and Lucene integration question

2007-12-13 Thread Ted Dunning
After indexing, indexes are moved to multiple query servers. The indexes on the local query servers are all on local disk. There are two dimensions to scaling search. The first dimension is query rate. To get that scaling, you simply replicate your basic search operator and balance using a si

Re: map/reduce and Lucene integration question

2007-12-13 Thread Andrzej Bialecki
Ted Dunning wrote: I don't think so (but I don't run nutch) To actually run searches, the search engines copy the index to local storage. Having them in HDFS is very nice, however, as a way to move them to the right place. Nutch can search in Lucene indexes on HDFS (see org.apache.nutch.inde

Re: map/reduce and Lucene integration question

2007-12-13 Thread Eugeny N Dzhurinsky
On Thu, Dec 13, 2007 at 11:03:50AM -0800, Ted Dunning wrote: > > I don't think so (but I don't run nutch) > > To actually run searches, the search engines copy the index to local > storage. Having them in HDFS is very nice, however, as a way to move them > to the right place. Even in case if th

Re: map/reduce and Lucene integration question

2007-12-13 Thread Ted Dunning
I don't think so (but I don't run nutch) To actually run searches, the search engines copy the index to local storage. Having them in HDFS is very nice, however, as a way to move them to the right place. On 12/13/07 10:59 AM, "Eugeny N Dzhurinsky" <[EMAIL PROTECTED]> wrote: > On Thu, Dec 13,
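As a hedged illustration of that copy step (the paths and class name are made up; copyToLocalFile() is the generic FileSystem call, not necessarily what any particular search front end uses):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Pull an index directory out of HDFS onto the query server's local disk,
    // where a local Lucene IndexSearcher can then open it.
    public class FetchIndex {
      public static void main(String[] args) throws Exception {
        FileSystem dfs = FileSystem.get(new Configuration());
        Path remote = new Path("/indexes/part-00000");   // index built by the job
        Path local = new Path("/data/search/index");     // local disk on this box
        dfs.copyToLocalFile(remote, local);
      }
    }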

Re: map/reduce and Lucene integration question

2007-12-13 Thread Eugeny N Dzhurinsky
On Thu, Dec 13, 2007 at 11:36:31AM +0200, Enis Soztutar wrote: > Hi, > > nutch indexes the documents in the org.apache.nutch.indexer.Indexer class. > In the reduce phase, the documents are output wrapped in ObjectWritable. > The OutputFormat opens a local indexwriter(FileSystem.startLocalOutput(

Re: Error Nutchwax Search

2007-12-13 Thread Owen O'Malley
On Dec 12, 2007, at 1:36 PM, Andrzej Bialecki wrote: Ted Dunning wrote: Hadoop *normally* uses the Sun JDK. Using gcj successfully would be a bit of a surprise. GCJ 4.2 does NOT work. With minor tweaks it's possible to compile all Hadoop classes, including contrib, but it doesn't run pr

Re: map/reduce and Lucene integration question

2007-12-13 Thread Enis Soztutar
Hi, nutch indexes the documents in the org.apache.nutch.indexer.Indexer class. In the reduce phase, the documents are output wrapped in ObjectWritable. The OutputFormat opens a local IndexWriter (FileSystem.startLocalOutput()), and adds all the documents that are collected. Then puts the index
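A rough sketch of that pattern, purely for illustration (this is not the actual org.apache.nutch.indexer code; the class, the field names, and the Lucene 2.x-style Field flags are assumptions):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Collects documents from the reduce phase into a Lucene index built on the
    // task's local disk, then promotes the finished index to the job output dir.
    public class LocalIndexSink {
      private final FileSystem fs;
      private final Path finalDir;                          // where the index should end up
      private final Path tmpLocal = new Path("index-tmp");  // scratch name on local disk
      private final IndexWriter writer;

      public LocalIndexSink(FileSystem fs, Path finalDir) throws IOException {
        this.fs = fs;
        this.finalDir = finalDir;
        // startLocalOutput() returns a local path to write to now; the matching
        // completeLocalOutput() call later promotes the result to finalDir.
        Path writeDir = fs.startLocalOutput(finalDir, tmpLocal);
        this.writer = new IndexWriter(writeDir.toString(), new StandardAnalyzer(), true);
      }

      public void add(String url, String text) throws IOException {
        // One call per document collected in the reduce phase.
        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
      }

      public void close() throws IOException {
        // Seal the index and promote it into the job's output directory.
        writer.optimize();
        writer.close();
        fs.completeLocalOutput(finalDir, tmpLocal);
      }
    }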

map/reduce and Lucene integration question

2007-12-13 Thread Eugeny N Dzhurinsky
Hello! We would like to use Hadoop to index a lot of documents, and we would like to have this index in Lucene and utilize Lucene's search engine power for searching. At this point I am a bit confused - when we analyze documents in the Map part, we will end up with - document name/location - li

Re: finalize upgrade

2007-12-13 Thread Torsten Curdt
No sign of 'upgrade still needs to be finalized' or anything similar ... so I assume removing the 'previous' dir is safe then? On 12.12.2007, at 21:18, Konstantin Shvachko wrote: 2) Is there a way of finding out whether finalize still needs to be run? Yes, you can see it on the name-node web UI, a