What are "segments.gen" and "segments_2"? The warning I am getting happens when I dedup two indexes.
I create index1 and index2 through generate/fetch/index/etc. index1 is an index of half the segments; index2 is an index of the other half. The warning is happening on both datanodes.

The command I am running is:

bin/nutch dedup crawl/index1 crawl/index2

If segments.gen and segments_2 are supposed to be directories, then why are they created as files? They are created as files from the start by:

bin/nutch index crawl/index1 crawl/crawldb crawl/linkdb crawl/segments/XXX crawl/segments/YYY

I don't see any errors or warnings about creating the index. I'm using Nutch 1.0, though it has been a while since I've updated the sources from trunk. I'm running one namenode and two datanodes.

2009-11-30 18:25:23,497 WARN  mapred.FileInputFormat - Can't open index at hdfs://nn1:9000/user/nutch/crawl/index2/segments_2:0+2147483647, skipping. (hdfs://nn1:9000/user/nutch/crawl/index2/segments_2 not a directory)
2009-11-30 18:33:50,200 WARN  mapred.FileInputFormat - Can't open index at hdfs://nn1:9000/user/nutch/crawl/index2/segments.gen:0+2147483647, skipping. (hdfs://nn1:9000/user/nutch/crawl/index2/segments.gen not a directory)

Jesse

int GetRandomNumber()
{
    return 4; // Chosen by fair roll of dice.
              // Guaranteed to be random.
} // xkcd.com

On Mon, Nov 30, 2009 at 9:30 AM, Jesse Hires <jhi...@gmail.com> wrote:

> Actually, searcher.dir is still the default "crawl". The warnings are
> showing up either while indexing segments or merging indexes. I need to
> spend some time figuring out just where it is happening. I will look into
> it later tonight; work doesn't like my hobbies intruding. :)
>
> I may need some more info on "index" vs "indexes" later, if you don't mind
> my asking some dumb questions about them, but thus far things seem to be
> working in the manner I have them set up, with the exception of the
> warnings mentioned, of course.
> The searching (or searchers) runs out of a different directory, and I run
> the indexes and segments for it locally on the individual nodes. I am
> getting search results back, which increase with every pass, as expected.
>
> Jesse
>
> On Mon, Nov 30, 2009 at 8:57 AM, Andrzej Bialecki <a...@getopt.org> wrote:
>
>> Jesse Hires wrote:
>>
>>> I am getting warnings in hadoop.log that segments.gen and segments_2 are
>>> not directories, and as you can see by the listing, they are in fact
>>> files, not directories. I'm not sure what stage of the process this is
>>> happening in, as I just now stumbled on them, but it concerns me that it
>>> says it is skipping something. Any ideas before I start digging further?
>>>
>>> 2009-11-30 08:28:56,344 WARN  mapred.FileInputFormat - Can't open index
>>> at hdfs://nn1:9000/user/nutch/crawl/index1/segments.gen:0+2147483647,
>>> skipping.
>>
>> The most likely reason for this is that you defined your searcher.dir as
>> hdfs://nn1:9000/user/nutch/crawl/index1 - instead, you should set it to
>> hdfs://nn1:9000/user/nutch/crawl . Please also note that the names
>> "index" and "indexes" are magic - Lucene indexes must be located under
>> one of these names ("index" for a single merged index, "indexes" for
>> partial indexes), otherwise they won't be found by the NutchBean (the
>> search component in Nutch). So, e.g., your Lucene index in index1/ won't
>> be found.
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  || |   Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
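For background on the warning itself: in the Lucene versions that Nutch 1.0 builds on, segments.gen and segments_N are ordinary files that hold index-commit metadata at the top level of a Lucene index directory; they are not Nutch crawl segments. A scan that only descends into subdirectories will therefore always skip them with a "not a directory" message. A minimal Python sketch of that classification (the helper name and return strings are illustrative, not Nutch code):

```python
import re

# Illustrative helper, not Nutch code: decide how a directory-only scan
# (like the one behind the mapred.FileInputFormat warning) would treat
# each entry found at the top level of a Lucene index directory.
LUCENE_COMMIT_FILE = re.compile(r"^segments(\.gen|_\d+)$")

def classify_entry(name: str, is_dir: bool) -> str:
    """Classify one directory entry as a directory-only index scan would."""
    if is_dir:
        return "descend"  # a subdirectory could itself hold a Lucene index
    if LUCENE_COMMIT_FILE.match(name):
        return "skip: Lucene commit metadata file"
    return "skip: plain file"
```

Under this view, "segments_2" and "segments.gen" are skipped with a warning while index part subdirectories are opened, so the skip is expected behavior rather than lost data.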
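Andrzej's point about the magic names can be sketched as follows. This is an illustration of the convention he describes (merged index under "index", partial indexes under "indexes"), not the actual NutchBean lookup code; the function name is hypothetical:

```python
import os

def candidate_index_dirs(searcher_dir: str) -> list:
    """Per the convention described above, only these two locations under
    searcher.dir are ever consulted for Lucene indexes."""
    return [os.path.join(searcher_dir, "index"),    # single merged index
            os.path.join(searcher_dir, "indexes")]  # partial indexes
```

With searcher.dir left at the default "crawl", this yields crawl/index and crawl/indexes, which is why a Lucene index placed in crawl/index1 is never looked at by the search component.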