What are "segments.gen" and "segments_2"?
The warning I am getting happens when I dedup two indexes.

I create index1 and index2 through generate/fetch/index/...etc.
index1 is an index of half the segments; index2 is an index of the other half.

The warning is happening on both datanodes.

The command I am running is "bin/nutch dedup crawl/index1 crawl/index2"

If segments.gen and segments_2 are supposed to be directories, then why are
they created as files?

They are created as files from the start by
"bin/nutch index crawl/index1 crawl/crawldb crawl/linkdb crawl/segments/XXX
crawl/segments/YYY"

I don't see any errors or warnings about creating the index.

I'm using Nutch 1.0, though it has been a while since I've updated the sources
from trunk.
I'm running one namenode and two datanodes.

2009-11-30 18:25:23,497 WARN  mapred.FileInputFormat - Can't open index at
hdfs://nn1:9000/user/nutch/crawl/index2/segments_2:0+2147483647, skipping.
(hdfs://nn1:9000/user/nutch/crawl/index2/segments_2 not a directory)
2009-11-30 18:33:50,200 WARN  mapred.FileInputFormat - Can't open index at
hdfs://nn1:9000/user/nutch/crawl/index2/segments.gen:0+2147483647, skipping.
(hdfs://nn1:9000/user/nutch/crawl/index2/segments.gen not a directory)
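From what I can tell, segments.gen and segments_N are ordinary metadata files
that Lucene writes at the root of every index, so their being files is
expected. The warning appears to come from the dedup job listing the children
of each path it is given and skipping anything that is not a directory. A toy
sketch of that check (illustrative only, not Nutch's actual code; the file
names are taken from the logs above):

```python
import os
import tempfile

def open_indexes(index_path):
    """List children of index_path; treat directories as candidate indexes
    and skip plain files with a warning, mimicking the behaviour seen in
    the hadoop.log messages above (a toy model, not Nutch's real code)."""
    usable, skipped = [], []
    for name in sorted(os.listdir(index_path)):
        child = os.path.join(index_path, name)
        if os.path.isdir(child):
            usable.append(child)
        else:
            print(f"Can't open index at {child}, skipping. "
                  f"({child} not a directory)")
            skipped.append(child)
    return usable, skipped

# Simulate the layout from the logs: a Lucene index directory whose root
# holds metadata files (segments.gen, segments_2) next to one subdirectory.
root = tempfile.mkdtemp()
for fname in ("segments.gen", "segments_2"):
    open(os.path.join(root, fname), "w").close()
os.mkdir(os.path.join(root, "part-00000"))

usable, skipped = open_indexes(root)
```

If that reading is right, pointing dedup (and searcher.dir) at a directory
that *contains* index directories, rather than at a Lucene index itself,
should make the warnings go away.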


Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Mon, Nov 30, 2009 at 9:30 AM, Jesse Hires <jhi...@gmail.com> wrote:

> Actually, searcher.dir is still the default "crawl". The warnings are
> showing up either while indexing segments or merging indexes. I need to
> spend some time figuring out just where it is happening; I will look into
> it later tonight, work doesn't like my hobbies intruding. :)
>
> I may need some more info on "index" vs "indexes" later, if you don't mind
> my asking some dumb questions about them, but so far things seem to be
> working in the manner I have them set up, with the exception of the
> warnings mentioned, of course.
>
> The searchers run out of a different directory; I keep the indexes and
> segments for them locally on the individual nodes, and I am getting search
> results back, which increase with every pass as expected.
>
>
>
> Jesse
>
> int GetRandomNumber()
> {
>    return 4; // Chosen by fair roll of dice
>                 // Guaranteed to be random
> } // xkcd.com
>
>
>
> On Mon, Nov 30, 2009 at 8:57 AM, Andrzej Bialecki <a...@getopt.org> wrote:
>
>> Jesse Hires wrote:
>>
>>> I am getting warnings in hadoop.log that segments.gen and segments_2 are
>>> not directories, and as you can see by the listing, they are in fact
>>> files, not directories. I'm not sure what stage of the process this is
>>> happening in, as I just now stumbled on them, but it concerns me that it
>>> says it is skipping something. Any ideas before I start digging further?
>>>
>>> 2009-11-30 08:28:56,344 WARN  mapred.FileInputFormat - Can't open index at
>>> hdfs://nn1:9000/user/nutch/crawl/index1/segments.gen:0+2147483647, skipping.
>>>
>>
>> The most likely reason for this is that you defined searcher.dir as
>> hdfs://nn1:9000/user/nutch/crawl/index1 - instead, you should set it to
>> hdfs://nn1:9000/user/nutch/crawl . Please also note that the names "index"
>> and "indexes" are magic - Lucene indexes must be located under one of
>> these names ("index" for a single merged index, "indexes" for partial
>> indexes); otherwise they won't be found by the NutchBean (the search
>> component in Nutch). So, e.g., your Lucene index in index1/ won't be found.
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>
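
For reference, the suggestion above maps to an entry like this in
conf/nutch-site.xml (a sketch; searcher.dir is the standard property name,
and the HDFS path is taken from the log messages earlier in the thread):

```xml
<!-- Sketch: point searcher.dir at the crawl directory itself, not at an
     individual index; NutchBean then looks for "index" or "indexes"
     underneath it. -->
<property>
  <name>searcher.dir</name>
  <value>hdfs://nn1:9000/user/nutch/crawl</value>
</property>
```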
