I don't know about the Nutch format -> Solr schema idea either. The NUTCH-442 system uses Solr for both indexing and searching, and uses Nutch only for crawling.
At my last job we had a custom scripting system that crawled the front page of over 5000 sites. Each site had its own configured script. Yes, it was complex. We also had custom crawlers for YouTube, MySpace, and some other sites that provided APIs, but in general it was all hand-coded.

I have used the RSS format of the DataImportHandler, and it works well but has problems with error detection and the like. That is, it works well when it works but does not fail gracefully in a useful way.

Lance

2009/1/9 Tony Wang <ivyt...@gmail.com>

> Thanks Lance! I have no idea whether the Nutch-generated index could be
> converted to a Solr schema. I wonder what people are using NUTCH-442 for
> (http://issues.apache.org/jira/browse/NUTCH-442).
>
> So what crawler do you use to generate the index for Solr? Thanks a lot!!
>
> On Fri, Jan 9, 2009 at 8:04 PM, Lance Norskog <goks...@gmail.com> wrote:
>
> > http://issues.apache.org/jira/browse/NUTCH-442
> >
> > Haven't used Nutch. Can the Nutch-generated index be reverse-engineered
> > into a Solr schema? In that case, you can just copy the Lucene index files
> > away from Nutch and run them under Solr.
>
> --
> Are you RCholic? www.RCholic.com
> 温 良 恭 俭 让 仁 义 礼 智 信