Thanks AB.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Andrzej Białecki <a...@getopt.org>
Reply-To: "user@nutch.apache.org" <user@nutch.apache.org>
Date: Monday, October 6, 2014 at 3:47 PM
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: Re: Nutch vs Lucidworks Fusion

>On 03 Oct 2014, at 12:44, Julien Nioche <lists.digitalpeb...@gmail.com>
>wrote:
>
>> Attaching Andrzej to this thread. As most of you know, Andrzej was the
>> Nutch PMC chair prior to me and a huge contributor to Nutch over the
>> years. He also works for Lucid.
>> Andrzej: would you mind telling us a bit about LW's crawler and why
>> you went for Aperture? Am I right in thinking that this has to do with
>> the fact that you needed to be able to pilot the crawl via a REST-like
>> service?
>
>Hi Julien, and the Nutch community,
>
>It's been a while. :)
>
>First, let me clarify a few issues:
>
>* Indeed, I now work for Lucidworks and I'm involved in the design and
>implementation of the connectors framework in the Lucidworks Fusion
>product.
>
>* The connectors framework in Fusion allows us to integrate wildly
>different third-party modules, e.g. we have connectors based on GCM,
>Hadoop map-reduce, databases, local files, remote filesystems,
>repositories, etc. In fact, it's relatively straightforward to integrate
>Nutch with this framework, and we actually provide docs on how to do
>this, so nothing stops you from using Nutch if it fits the bill.
>
>* This framework provides a uniform REST API to control the processing
>pipeline for documents collected by connectors, and in most cases to
>manage the crawlers' configurations and processes. Only the first part
>is in place for the integration with Nutch, i.e. configuration and jobs
>have to be managed externally, and only the processing and content
>enrichment is controlled by Lucidworks Fusion. If we get a business case
>that requires a tighter integration, I'm sure we will be happy to do it.
>
>* The previous generation of Lucidworks products (called "LucidWorks
>Search", LWS for short) used Aperture as a web crawler. This was a
>legacy integration, and while it worked fine for what it was originally
>intended to do, it definitely had some painful limitations, not to
>mention the fact that the Aperture project is no longer active.
>
>* The current version of the product DOES NOT use Aperture for web
>crawling. It uses a web- and file-crawler implementation created
>in-house; it re-uses some code from crawler-commons, with some
>insignificant modifications.
>
>* Our content processing framework uses many open-source tools (among
>them Tika, OpenNLP, Drools, of course Solr, and many others), on top of
>which we've built a powerful system for content enrichment, event
>processing and data analytics.
>
>So, those are the facts. Now, let's move on to opinions ;)
>
>There are many different use cases for web/file crawling, and many
>different scalability and content processing requirements. So far the
>target audience for Lucidworks Fusion has required small- to
>medium-scale web crawls, but with sophisticated content processing,
>extensive control over the crawling frontier (handling sessions for
>depth-first crawls, cookies, form logins, etc.) and easy management and
>control of the process over REST/UI. In many cases the effort to set up
>and operate a Hadoop cluster was also deemed too high or irrelevant to
>the core business. And in reality, as you know, there are workload sizes
>for which Hadoop is total overkill and the roundtrip for processing is
>on the order of several minutes instead of seconds.
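For a concrete sense of what the crawler-commons re-use above refers to,
here is a minimal sketch of one thing that library provides: robots.txt
parsing. Only the crawlercommons.robots classes are the library's actual
API; the host, the "example-crawler" agent name, and the fetch wiring are
placeholder assumptions, not anything from Fusion's crawler.

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    import java.io.InputStream;
    import java.net.URL;

    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder host; any site serving a robots.txt works the same way.
            String robotsUrl = "http://example.com/robots.txt";
            byte[] content;
            try (InputStream in = new URL(robotsUrl).openStream()) {
                content = in.readAllBytes(); // Java 9+ convenience
            }

            // Parse the rules as seen by a hypothetical "example-crawler" agent.
            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                    robotsUrl, content, "text/plain", "example-crawler");

            // Consult the parsed rules before fetching any page on the host.
            System.out.println(rules.isAllowed("http://example.com/some/page.html"));
            System.out.println("crawl delay (ms): " + rules.getCrawlDelay());
        }
    }

Any crawler built on the library, in-house or not, would gate its fetch
loop on a check like the isAllowed() call above.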
>For these reasons we wanted to provide a web crawler that is
>self-contained, lean, doesn't require Hadoop, and scales well from small
>to mid-size workloads without Hadoop's overhead, and at the same time to
>provide an easy way to integrate a high-scale crawler like Nutch for
>customers that need it - and for such customers we DO recommend Nutch as
>the best high-scale crawler. :)
>
>So, in my opinion Lucidworks Fusion satisfies these goals and provides a
>reasonable tradeoff between ease of use, scalability, rich content
>processing and ease of integration. Don't take my word for it - download
>a copy and try it yourself!
>
>To Lewis:
>
>> Hopefully the above is my take on things. If LucidWorks has some magic
>> sauce then great. Hopefully they consider bringing some of it back into
>> Nutch rather than writing some Perl or Python scripts. I would never
>> expect this to happen; however, I am utterly depressed at how often I
>> see this happening.
>
>Lucidworks is a Java/Clojure shop; the connectors framework and the web
>crawler are written in Java - no Perl or Python in sight ;) Our magic
>sauce is in enterprise integration and rich content processing
>pipelines, not so much in base web crawling.
>
>So, that's my contribution to this discussion ... I hope this answered
>some questions. Feel free to ask if you need more information.
>
>--
>Best regards,
>Andrzej Bialecki <a...@lucidworks.com>
>
>--=# http://www.lucidworks.com #=--
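As a footnote on the content-processing side of this thread: Tika, one of
the open-source tools Andrzej names, is the piece that handles format
detection and text extraction in a pipeline like the one described. A
minimal sketch using Tika's standard parsing API follows; the input file
name is a placeholder for whatever bytes a crawler or connector hands over.

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ExtractText {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
            Metadata metadata = new Metadata();

            // "fetched.bin" is a placeholder for content collected by a crawler.
            try (InputStream stream = Files.newInputStream(Paths.get("fetched.bin"))) {
                parser.parse(stream, handler, metadata, new ParseContext());
            }

            // Detected MIME type plus extracted plain text, ready for
            // downstream enrichment and indexing (e.g. into Solr).
            System.out.println(metadata.get("Content-Type"));
            System.out.println(handler.toString());
        }
    }

The extracted text and metadata are what enrichment stages (OpenNLP
annotators, Drools rules, etc.) and ultimately the Solr index consume.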