I don't know about the Nutch format -> Solr schema idea either. The
NUTCH-442 system uses Solr for both indexing and searching, and uses Nutch
only for crawling.

At my last job we had a custom scripting system that crawled the front page
of over 5000 sites. Each site had its own configured script. Yes, it was
complex. We also had custom crawlers for YouTube, MySpace and some other
sites that offered APIs, but in general it was all hand-coded.

I have used the RSS input of the DataImportHandler. It works well when it
works, but it is poor at detecting errors and does not fail gracefully in a
useful way.
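
One way to work around that is to validate the feed before handing it to
Solr. A minimal sketch in Python, assuming an RSS 2.0 feed that has already
been fetched as a string; check_rss is a made-up helper name, not part of
Solr or the DataImportHandler:

```python
import xml.etree.ElementTree as ET


def check_rss(xml_text):
    """Return (items, errors) for an RSS 2.0 document given as a string.

    Hypothetical pre-check: it reports problems explicitly instead of
    silently indexing nothing.
    """
    errors = []
    items = []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Malformed XML: stop here and say why.
        return [], ["not well-formed XML: %s" % exc]
    if root.tag != "rss":
        errors.append("root element is <%s>, expected <rss>" % root.tag)
    for n, item in enumerate(root.iter("item"), start=1):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if not link:
            # An item without a link cannot be given a sensible unique key.
            errors.append("item %d has no <link>" % n)
            continue
        items.append({"title": title, "link": link})
    if not items and not errors:
        errors.append("feed contains no <item> elements")
    return items, errors
```

If errors comes back non-empty, the feed can be logged and skipped rather
than passed on for indexing.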

Lance

2009/1/9 Tony Wang <ivyt...@gmail.com>

> Thanks Lance! I have no idea whether the Nutch-generated index could be
> converted to Solr schema. I wonder what people are using this NUTCH-442 for
> (http://issues.apache.org/jira/browse/NUTCH-442).
>
> So what crawler do you use to generate the index for Solr? Thanks a lot!!
>
> On Fri, Jan 9, 2009 at 8:04 PM, Lance Norskog <goks...@gmail.com> wrote:
>
> > http://issues.apache.org/jira/browse/NUTCH-442
> >
> > Haven't used Nutch. Can the Nutch-generated index be reverse-engineered
> > into a Solr schema? In that case, you can just copy the Lucene index
> > files away from Nutch and run them under Solr.
> >
>
>
>
> --
> Are you RCholic? www.RCholic.com
> 温 良 恭 俭 让 仁 义 礼 智 信
>
