Thanks AB.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Andrzej Białecki <a...@getopt.org>
Reply-To: "user@nutch.apache.org" <user@nutch.apache.org>
Date: Monday, October 6, 2014 at 3:47 PM
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: Re: Nutch vs Lucidworks Fusion

>On 03 Oct 2014, at 12:44, Julien Nioche <lists.digitalpeb...@gmail.com>
>wrote:
>
>> Attaching Andrzej to this thread. As most of you know, Andrzej was the
>> Nutch PMC chair prior to me and a huge contributor to Nutch over the
>> years. He also works for Lucid.
>> Andrzej: would you mind telling us a bit about LW's crawler and why
>> you went for Aperture? Am I right in thinking that this has to do with
>> the fact that you needed to be able to pilot the crawl via a REST-like
>> service?
>
>Hi Julien, and the Nutch community,
>
>It's been a while. :)
>
>First, let me clarify a few issues:
>
>* Indeed, I now work for Lucidworks and I'm involved in the design and
>implementation of the connectors framework in the Lucidworks Fusion
>product.
>
>* The connectors framework in Fusion allows us to integrate wildly
>different third-party modules, e.g. we have connectors based on GCM,
>Hadoop map-reduce, databases, local files, remote filesystems,
>repositories, etc. In fact, it's relatively straightforward to integrate
>Nutch with this framework, and we actually provide docs on how to do
>this, so nothing stops you from using Nutch if it fits the bill.
>
>* This framework provides a uniform REST API to control the processing
>pipeline for documents collected by connectors, and in most cases to
>manage the crawlers' configurations and processes. Only the first part
>is in place for the integration with Nutch, i.e. configuration and jobs
>have to be managed externally, and only the processing and content
>enrichment is controlled by Lucidworks Fusion. If we get a business case
>that requires a tighter integration, I'm sure we will be happy to do it.
>
>* The previous generation of Lucidworks products (called "LucidWorks
>Search", LWS for short) used Aperture as a web crawler. This was a
>legacy integration, and while it worked fine for what it was originally
>intended to do, it definitely had some painful limitations, not to
>mention the fact that the Aperture project is no longer active.
>
>* The current version of the product DOES NOT use Aperture for web
>crawling. It uses a web- and file-crawler implementation created
>in-house; it re-uses some code from crawler-commons, with some
>insignificant modifications.
>
>* Our content processing framework uses many open-source tools (among
>them Tika, OpenNLP, Drools, of course Solr, and many others), on top of
>which we've built a powerful system for content enrichment, event
>processing and data analytics.
>
>So, those are the facts. Now, let's move on to opinions ;)
>
>There are many different use cases for web/file crawling, and many
>different scalability and content processing requirements. So far the
>target audience for Lucidworks Fusion has required small- to
>medium-scale web crawls, but with sophisticated content processing,
>extensive control over the crawling frontier (handling sessions for
>depth-first crawls, cookies, form logins, etc.) and easy management and
>control of the process over REST/UI. In many cases the effort to set up
>and operate a Hadoop cluster was also deemed too high or irrelevant to
>the core business. And in reality, as you know, there are workload sizes
>for which Hadoop is total overkill and the roundtrip for processing is
>on the order of several minutes instead of seconds.
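For a concrete sense of what the crawler-commons re-use above refers to,
here is a minimal sketch of one thing that library provides: robots.txt
parsing. Only the crawlercommons.robots classes are the library's actual
API; the host, the "example-crawler" agent name, and the fetch wiring are
placeholder assumptions, not anything from Fusion's crawler.

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    import java.io.InputStream;
    import java.net.URL;

    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder host; any site serving a robots.txt works the same way.
            String robotsUrl = "http://example.com/robots.txt";
            byte[] content;
            try (InputStream in = new URL(robotsUrl).openStream()) {
                content = in.readAllBytes(); // Java 9+ convenience
            }

            // Parse the rules as seen by a hypothetical "example-crawler" agent.
            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                    robotsUrl, content, "text/plain", "example-crawler");

            // Consult the parsed rules before fetching any page on the host.
            System.out.println(rules.isAllowed("http://example.com/some/page.html"));
            System.out.println("crawl delay (ms): " + rules.getCrawlDelay());
        }
    }

Any crawler built on the library, in-house or not, would gate its fetch
loop on a check like the isAllowed() call above.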
>For these reasons we wanted to provide a web crawler that is
>self-contained, lean, doesn't require Hadoop, and scales well from small
>to mid-size workloads without Hadoop's overhead, and at the same time to
>provide an easy way to integrate a high-scale crawler like Nutch for
>customers that need it - and for such customers we DO recommend Nutch as
>the best high-scale crawler. :)
>
>So, in my opinion Lucidworks Fusion satisfies these goals and provides a
>reasonable tradeoff between ease of use, scalability, rich content
>processing and ease of integration. Don't take my word for it - download
>a copy and try it yourself!
>
>To Lewis:
>
>> Hopefully the above is my take on things. If LucidWorks has some magic
>> sauce then great. Hopefully they consider bringing some of it back into
>> Nutch rather than writing some Perl or Python scripts. I would never
>> expect this to happen; however, I am utterly depressed at how often I
>> see this happening.
>
>Lucidworks is a Java/Clojure shop; the connectors framework and the web
>crawler are written in Java - no Perl or Python in sight ;) Our magic
>sauce is in enterprise integration and rich content processing
>pipelines, not so much in base web crawling.
>
>So, that's my contribution to this discussion ... I hope this answered
>some questions. Feel free to ask if you need more information.
>
>--
>Best regards,
>Andrzej Bialecki <a...@lucidworks.com>
>
>--=# http://www.lucidworks.com #=--
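As a footnote on the content-processing side of this thread: Tika, one of
the open-source tools Andrzej names, is the piece that handles format
detection and text extraction in a pipeline like the one described. A
minimal sketch using Tika's standard parsing API follows; the input file
name is a placeholder for whatever bytes a crawler or connector hands over.

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ExtractText {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
            Metadata metadata = new Metadata();

            // "fetched.bin" is a placeholder for content collected by a crawler.
            try (InputStream stream = Files.newInputStream(Paths.get("fetched.bin"))) {
                parser.parse(stream, handler, metadata, new ParseContext());
            }

            // Detected MIME type plus extracted plain text, ready for
            // downstream enrichment and indexing (e.g. into Solr).
            System.out.println(metadata.get("Content-Type"));
            System.out.println(handler.toString());
        }
    }

The extracted text and metadata are what enrichment stages (OpenNLP
annotators, Drools rules, etc.) and ultimately the Solr index consume.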