Hi Eric,

I have also developed mini-applications replacing the GSA for some of our clients, using Apache Nutch + Solr to crawl multilingual sites and enable multilingual search. Nutch + Solr is very stable, and the Nutch mailing list provides good support.
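For anyone evaluating this route, one round of the Nutch 1.x crawl-and-index cycle looks roughly like the sketch below. This assumes Nutch 1.7 run from its install directory and a Solr instance at localhost:8983; the seed URL, directory names, and Solr URL are placeholders for your own setup, not part of any particular deployment.

```shell
# Seed list: one start URL per line (example URL)
mkdir -p urls
echo "http://www.richmond.edu/" > urls/seed.txt

# One crawl round: inject seeds, generate a fetch list,
# fetch, parse, and fold the results back into the crawl db
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1

# Build the link database and push documents to Solr
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
  -linkdb crawl/linkdb $s1
```

Repeating the generate/fetch/parse/updatedb steps takes the crawl one level deeper each round, which maps naturally onto the depth-3 or depth-4 crawls discussed below.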
Reference link to get started: https://sites.google.com/site/profilerajanimaski/webcrawlers/apache-nutch

Thanks,
Rajani

On Thu, Oct 31, 2013 at 12:27 AM, Palmer, Eric <epal...@richmond.edu> wrote:
> Markus and Jason,
>
> Thanks for the info.
>
> I will start to research Nutch. Writing a crawler, I agree, is a rabbit
> hole.
>
> --
> Eric Palmer
> Web Services
> U of Richmond
>
> To report technical issues, obtain technical support or make requests for
> enhancements please visit
> http://web.richmond.edu/contact/technical-support.html
>
> On 10/30/13 2:53 PM, "Jason Hellman" <jhell...@innoventsolutions.com>
> wrote:
>
> >Nutch is an excellent option. It should feel very comfortable for people
> >migrating away from the Google appliances.
> >
> >Apache Droids is another possible approach, and I've found people
> >using Heritrix or Manifold for various use cases (and usually in
> >combination with other use cases where the extra overhead was worth the
> >trouble).
> >
> >I think the simplest approach will be Nutch… it's absolutely worth taking
> >a shot at it.
> >
> >DO NOT write a crawler! That is a rabbit hole you do not want to peer
> >down into :)
> >
> >On Oct 30, 2013, at 10:54 AM, Markus Jelsma <markus.jel...@openindex.io>
> >wrote:
> >
> >> Hi Eric,
> >>
> >> We have also helped a government institution replace their
> >> expensive GSA with open source software. In our case we use Apache
> >> Nutch 1.7 to crawl the websites and index to Apache Solr. It is very
> >> effective, robust, and scales easily with Hadoop if you have to. Nutch
> >> may not be the easiest tool for the job but it is very stable, feature
> >> rich, and has an active community here at Apache.
> >>
> >> Cheers,
> >>
> >> -----Original message-----
> >>> From: Palmer, Eric <epal...@richmond.edu>
> >>> Sent: Wednesday 30th October 2013 18:48
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Replacing Google Mini Search Appliance with Solr?
> >>>
> >>> Hello all,
> >>>
> >>> I've been lurking on the list for a while.
> >>>
> >>> The two Google Mini search appliances we use to index our public web
> >>> sites are at end of life. Google is no longer selling the Mini
> >>> appliances, and buying the big appliance is not cost beneficial.
> >>>
> >>> http://search.richmond.edu/
> >>>
> >>> We would run a Solr replacement on Linux (CentOS, Red Hat, or
> >>> similar) with OpenJDK or Oracle Java.
> >>>
> >>> Background
> >>> ==========
> >>> ~130 sites
> >>> only ~12,000 pages (at a depth of 3)
> >>> probably ~40,000 pages if we go to a depth of 4
> >>>
> >>> We use key matches a lot. In Solr terms these are elevated documents
> >>> (elevations).
> >>>
> >>> We would code a search query form in PHP and wrap it in our design
> >>> (http://www.richmond.edu).
> >>>
> >>> I have played with and love LucidWorks and know that their paid
> >>> solution works for our use cases, but the cost model is not
> >>> attractive for such a small collection.
> >>>
> >>> So with Solr, what are my open source options, and what are people's
> >>> experiences crawling and indexing web sites with Solr + a crawler? I
> >>> understand Solr does not ship with a crawler, so getting one working
> >>> would be first up.
> >>>
> >>> We can code in Java, PHP, Python etc. if we have to, but we don't
> >>> want to write a crawler if we can avoid it.
> >>>
> >>> Thanks in advance for any information.
> >>>
> >>> --
> >>> Eric Palmer
> >>> Web Services
> >>> U of Richmond
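On the front-end side of the question above: the GSA "key match" feature maps onto Solr's QueryElevationComponent (an elevate.xml file pinning document IDs to specific queries), and the search form just has to send an HTTP request to Solr's /select handler. Here is a minimal sketch, in Python rather than PHP for brevity, of building such a request. The core name "sites", the field names in `qf`, and the host URL are assumptions for illustration, not anything from the thread.

```python
# Sketch of the query a PHP (or any) front end would send to Solr.
# Core name "sites", field names, and host are illustrative assumptions.
from urllib.parse import urlencode

def solr_query_url(user_query, host="http://localhost:8983/solr/sites"):
    params = {
        "q": user_query,
        "defType": "edismax",       # forgiving parser for raw user input
        "qf": "title^4 content",    # boost title matches over body text
        "enableElevation": "true",  # apply elevate.xml "key match" entries
        "hl": "true",               # highlighted snippets for result pages
        "wt": "json",               # easy to consume from PHP/Python
        "rows": "10",
    }
    return host + "/select?" + urlencode(params)

print(solr_query_url("housing application"))
```

The front end would fetch this URL, decode the JSON response, and render hits inside the site template; elevated documents come back first whenever the query matches an elevate.xml entry.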