I don't know much about alternative pieces of software. I do know that making 
parse plugins in Nutch is quite easy and flexible with full access to the DOM. 
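
To illustrate the idea of pulling fields out of the DOM in a parse step and handing them to a separate index step, here is a minimal sketch. It is not Nutch code: real Nutch parse and indexing filters are Java plugins, and this uses only the Python standard library. The sample page, the CSS class names, the field names (`price`, `resolution_mp`) and the example URL are all assumptions made up for the illustration.

```python
from xml.dom import minidom

# Hypothetical, well-formed product page; real pages are rarely this tidy.
PAGE = """<html><body>
  <h1 class="title">Digital Camera X100</h1>
  <span class="price">$149.99</span>
  <span class="resolution">12mp</span>
</body></html>"""


def text_of(node):
    """Concatenate all text nodes below a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(text_of(child))
    return "".join(parts).strip()


def parse_stage(dom, wanted=("title", "price", "resolution")):
    """Parse-filter role: walk the DOM and keep raw values as metadata."""
    metadata = {}
    for element in dom.getElementsByTagName("*"):
        css_class = element.getAttribute("class")
        if css_class in wanted:
            metadata[css_class] = text_of(element)
    return metadata


def index_stage(url, metadata):
    """Index-filter role: turn parse metadata into typed index fields."""
    doc = {"id": url, "title": metadata.get("title", "")}
    if "price" in metadata:
        # Strip the currency sign so the value can be faceted as a number.
        doc["price"] = float(metadata["price"].lstrip("$").replace(",", ""))
    if "resolution" in metadata:
        doc["resolution_mp"] = float(metadata["resolution"].lower().rstrip("mp"))
    return doc


metadata = parse_stage(minidom.parseString(PAGE))
solr_doc = index_stage("http://example.com/camera-x100", metadata)
print(solr_doc)
```

Keeping the two steps apart mirrors the split between extracting the information and passing it to the index: the parse step only collects raw strings, and the index step decides which typed fields end up in the search index.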

On Monday 12 September 2011 14:15:49 dpt9876 wrote:
> Ok nice. So it's possible. Do you think this is a better method than
> scraping with an alternative tool? It seems to me that it is, in that it
> will work better with my end state, Solr faceted search, and I can remove
> layers of complexity.
> 
> On Sep 12, 2011 8:03 PM, "Markus Jelsma-2 [via Lucene]"
> <[email protected]> wrote:
> > Yes you can. As Ken replied in your Solr thread you must create custom
> > parse and indexing filters. The parse filter is needed to extract the
> > information and store it in the document and the index filter is used
> > to pass that new information to the Solr index.
> > 
> > On Monday 12 September 2011 12:55:49 dpt9876 wrote:
> >> Hi, the friendly guys at the Solr user group pointed me here.
> >> 
> >> I am wondering if Nutch/Solr will do the following for a project I am
> >> working on.
> >> I want to create a search engine with facets for potentially hundreds of
> >> websites.
> >> Similar to say crawling amazon + buy.com + ebay and someone can search
> >> these 3 sites from my 1 website.
> >> (I realise there are better ways of doing the above example, it's for
> >> illustrative purposes).
> >> Eventually I would build that search crawl to index say 200 or 1000
> >> merchants.
> >> Someone would come to my site and search for "digital camera".
> >> 
> >> They would get results from all 3 indexes and hopefully dynamic facets
> >> eg Price $100-200
> >> Price 200-300
> >> Resolution 1mp-2mp
> >> 
> >> etc etc
> >> 
> >> Can this be done on the fly?
> >> 
> >> I ask this because I am currently developing webscrapers to crawl these
> >> websites, dump that data into a db, then was thinking of tacking on a
> >> Solr server to crawl my db.
> >> 
> >> Problem with that approach is that crawling the world's ecommerce sites
> >> will take forever, when it seems Solr might do that for me? (I have read
> >> about multiple indexes etc).
> >> 
> >> Many thanks
> >> 
> >> --
> >> View this message in context:
> >> http://lucene.472066.n3.nabble.com/Will-Solr-Nutch-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3329346p3329346.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> > 
> > 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Will-Solr-Nutch-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3329346p3329454.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
