Hi Eric,

  I have also developed mini-applications replacing GSA for some of our
clients, using Apache Nutch + Solr to crawl multilingual sites and enable
multilingual search. Nutch + Solr is very stable, and the Nutch mailing
list provides good support.

Reference link to start:
https://sites.google.com/site/profilerajanimaski/webcrawlers/apache-nutch
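One note that may help: once Nutch has indexed the pages into Solr, the
search front end (the PHP form Eric mentioned) reduces to building a Solr
select URL. Here is a minimal sketch in Python (a PHP version would be
analogous); the core name `collection1` is a placeholder for whatever your
setup uses, and the `enableElevation` parameter assumes the
QueryElevationComponent is configured, which is Solr's rough analogue of
GSA key matches:

```python
from urllib.parse import urlencode

def build_solr_query_url(base_url, query, rows=10, enable_elevation=True):
    """Build a Solr /select URL for a simple search front end.

    enableElevation is only meaningful if the QueryElevationComponent
    is configured in solrconfig.xml (used here to mimic GSA key matches).
    """
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        "enableElevation": str(enable_elevation).lower(),
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# Example: query a local Solr core named "collection1" (hypothetical name).
url = build_solr_query_url("http://localhost:8983/solr/collection1",
                           "admissions")
```

The front end then just fetches that URL and renders the JSON response
inside your site design.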

Thanks
Rajani




On Thu, Oct 31, 2013 at 12:27 AM, Palmer, Eric <epal...@richmond.edu> wrote:

> Markus and Jason
>
> thanks for the info.
>
> I will start to research Nutch.  Writing a crawler, agree it is a rabbit
> hole.
>
>
> --
> Eric Palmer
>
> Web Services
> U of Richmond
>
> To report technical issues, obtain technical support or make requests for
> enhancements please visit
> http://web.richmond.edu/contact/technical-support.html
>
>
>
>
>
> On 10/30/13 2:53 PM, "Jason Hellman" <jhell...@innoventsolutions.com>
> wrote:
>
> >Nutch is an excellent option.  It should feel very comfortable for people
> >migrating away from the Google appliances.
> >
> >Apache Droids is another possible approach, and I've found people
> >using Heritrix or ManifoldCF for various use cases (usually in
> >combination with other use cases where the extra overhead was worth the
> >trouble).
> >
> >I think the simplest approach will be Nutch; it's absolutely worth
> >taking a shot at it.
> >
> >DO NOT write a crawler!  That is a rabbit hole you do not want to peer
> >down into :)
> >
> >
> >
> >On Oct 30, 2013, at 10:54 AM, Markus Jelsma <markus.jel...@openindex.io>
> >wrote:
> >
> >> Hi Eric,
> >>
> >> We have also helped a government institution replace their
> >>expensive GSA with open source software. In our case we used Apache
> >>Nutch 1.7 to crawl the websites and index into Apache Solr. It is very
> >>effective, robust, and scales easily with Hadoop if you have to. Nutch
> >>may not be the easiest tool for the job, but it is very stable, feature
> >>rich, and has an active community here at Apache.
> >>
> >> Cheers,
> >>
> >> -----Original message-----
> >>> From:Palmer, Eric <epal...@richmond.edu>
> >>> Sent: Wednesday 30th October 2013 18:48
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Replacing Google Mini Search Appliance with Solr?
> >>>
> >>> Hello all,
> >>>
> >>> Been lurking on the list for a while.
> >>>
> >>> Our two Google Mini search appliances, used to index our public
> >>>web sites, are at end of life. Google is no longer selling the Mini
> >>>appliances, and buying the big appliance is not cost beneficial.
> >>>
> >>> http://search.richmond.edu/
> >>>
> >>> We would run a Solr replacement on Linux (CentOS, Red Hat, or
> >>>similar) with OpenJDK or Oracle Java.
> >>>
> >>> Background
> >>> ==========
> >>> ~130 sites
> >>> only ~12,000 pages (at a depth of 3)
> >>> probably ~40,000 pages if we go to a depth of 4
> >>>
> >>> We use key matches a lot. In Solr terms these are elevated documents
> >>>(elevations).
> >>>
> >>> We would code a search query form in php and wrap it into our design
> >>>(http://www.richmond.edu)
> >>>
> >>> I have played with and love lucidworks and know that their $ solution
> >>>works for our use cases but the cost model is not attractive for such a
> >>>small collection.
> >>>
> >>> So with Solr, what are my open source options, and what are
> >>>people's experiences crawling and indexing web sites with Solr plus a
> >>>crawler? I understand Solr does not ship with a crawler, so getting
> >>>one working would be the first step.
> >>>
> >>> We can code in Java, PHP, Python etc. if we have to, but we don't want
> >>>to write a crawler if we can avoid it.
> >>>
> >>> Thanks in advance for any information.
> >>>
> >>> --
> >>> Eric Palmer
> >>> Web Services
> >>> U of Richmond
> >>>
> >>>
> >
>
>
