Thanks Andrzej. We have been doing some awesome stuff with Tika
lately (OCR, GDAL and other things), and we're glad to hear you are
integrating with it. If you have any good stuff (like NER, etc.), we
would appreciate seeing it pushed upstream, and we'd be happy to
collaborate on it. We are funded under DARPA Memex, and a number of
us are working on that project to expand Nutch, Tika and Solr.

CC'ing dev lists for Nutch and Tika for awareness.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Andrzej Białecki <a...@getopt.org>
Reply-To: "user@nutch.apache.org" <user@nutch.apache.org>
Date: Tuesday, October 14, 2014 at 9:01 AM
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: Re: Nutch vs Lucidworks Fusion

>
>On 13 Oct 2014, at 23:03, Markus Jelsma <markus.jel...@openindex.io>
>wrote:
>
>> Hi - anything on this? These are interesting topics, so I am curious :)
>
>Hi,
>
>Sorry, I was away for a few days (visiting Athens, which is a lovely city
>at this time of the year... :) )
>
>We use Tika plus a few customised ContentHandlers and parsers to
>handle a few corner cases, and to extract text or XML plus metadata
>recursively.
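>
>For illustration only, a minimal sketch of that kind of extraction
>with stock Tika - an AutoDetectParser plus the stock
>BodyContentHandler standing in for the customised handlers mentioned
>above; registering the parser in the ParseContext is what makes Tika
>recurse into embedded documents:
>
>    import java.io.InputStream;
>    import java.nio.file.Files;
>    import java.nio.file.Paths;
>
>    import org.apache.tika.metadata.Metadata;
>    import org.apache.tika.parser.AutoDetectParser;
>    import org.apache.tika.parser.ParseContext;
>    import org.apache.tika.parser.Parser;
>    import org.apache.tika.sax.BodyContentHandler;
>
>    public class RecursiveExtract {
>        public static void main(String[] args) throws Exception {
>            AutoDetectParser parser = new AutoDetectParser();
>            // -1 disables the default write limit on extracted text
>            BodyContentHandler handler = new BodyContentHandler(-1);
>            Metadata metadata = new Metadata();
>            ParseContext context = new ParseContext();
>            // recurse into embedded documents (attachments, archives)
>            context.set(Parser.class, parser);
>            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
>                parser.parse(in, handler, metadata, context);
>            }
>            System.out.println(handler);           // extracted text
>            for (String name : metadata.names()) { // extracted metadata
>                System.out.println(name + " = " + metadata.get(name));
>            }
>        }
>    }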
>
>Linked items are noted as such, but processed independently.
>
>We use a processing pipeline consisting of many stages - among
>others, named-entity recognizers, regex extractors and transformers,
>Drools, etc. The pipeline is fully customizable and scriptable.
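>
>As an illustration of the shape of such a stage (the Doc and Stage
>types below are hypothetical, not our actual pipeline API), a regex
>extractor stage might look like this:
>
>    import java.util.ArrayList;
>    import java.util.HashMap;
>    import java.util.List;
>    import java.util.Map;
>    import java.util.regex.Matcher;
>    import java.util.regex.Pattern;
>
>    // Hypothetical document and stage contracts, for illustration only.
>    class Doc {
>        final String text;
>        final Map<String, Object> fields = new HashMap<>();
>        Doc(String text) { this.text = text; }
>    }
>
>    interface Stage {
>        Doc process(Doc doc);
>    }
>
>    class RegexExtractorStage implements Stage {
>        private final Pattern pattern;
>        private final String field;
>
>        RegexExtractorStage(String regex, String field) {
>            this.pattern = Pattern.compile(regex);
>            this.field = field;
>        }
>
>        public Doc process(Doc doc) {
>            List<String> hits = new ArrayList<>();
>            Matcher m = pattern.matcher(doc.text);
>            while (m.find()) {
>                hits.add(m.group());
>            }
>            // Downstream stages (NER, Drools rules, ...) then see
>            // the enriched document.
>            doc.fields.put(field, hits);
>            return doc;
>        }
>    }
>
>e.g. new RegexExtractorStage("[\\w.+-]+@[\\w.-]+", "emails") would
>collect e-mail-like strings into an "emails" field.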
>
>We don't do anything specific yet to avoid spider traps, so yeah, it's up
>to the filters to handle them as best as possible...
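>
>For Nutch users, the usual place for that kind of filtering is
>conf/regex-urlfilter.txt. The repeated-segment rule below is from the
>stock Nutch file; the session-id and calendar rules are illustrative
>site-specific additions:
>
>    # skip URLs with slash-delimited segment that repeats 3+ times,
>    # to break loops (from the stock regex-urlfilter.txt)
>    -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
>    # illustrative site-specific rules: session ids, endless calendars
>    -[?&](jsessionid|PHPSESSID)=
>    -example\.com/calendar/\d{4}/\d{2}/
>
>    # accept anything else
>    +.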
>
>> 
>> Cheers,
>> Markus
>> 
>> 
>> 
>> -----Original message-----
>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>> Sent: Thursday 9th October 2014 0:46
>>> To: user@nutch.apache.org; a...@getopt.org
>>> Subject: RE: Nutch vs Lucidworks Fusion
>>> 
>>> Hi Andrzej - how are you dealing with text extraction and other
>>>relevant items such as article date and accompanying images? And what
>>>about other metadata such as the author of the article or the rating
>>>some pasta recipe got? Also, must clients (or your consultants)
>>>implement site-specific URL filters to avoid those dreadful spider
>>>traps, or do you automatically resolve traps? If so, how?
>>> 
>>> Looking forward :)
>>> 
>>> Cheers,
>>> Markus
>>> 
>>> 
>>> -----Original message-----
>>>> From: Andrzej Białecki <a...@getopt.org>
>>>> Sent: Monday 6th October 2014 15:47
>>>> To: user@nutch.apache.org
>>>> Subject: Re: Nutch vs Lucidworks Fusion
>>>> 
>>>> On 03 Oct 2014, at 12:44, Julien Nioche
>>>><lists.digitalpeb...@gmail.com> wrote:
>>>> 
>>>>> Attaching Andrzej to this thread. As most of you know, Andrzej
>>>>>was the Nutch PMC chair prior to me and a huge contributor to Nutch
>>>>>over the years. He also works for Lucid.
>>>>> Andrzej : would you mind telling us a bit about LW's crawler and why
>>>>>you went for Aperture? Am I right in thinking that this has to do
>>>>>with the fact that you needed to be able to pilot the crawl via a
>>>>>REST-like service?
>>>>> 
>>>> 
>>>> Hi Julien, and the Nutch community,
>>>> 
>>>> It's been a while. :)
>>>> 
>>>> First, let me clarify a few issues:
>>>> 
>>>> * indeed I now work for Lucidworks and I'm involved in the design and
>>>>implementation of the connectors framework in the Lucidworks Fusion
>>>>product.
>>>> 
>>>> * the connectors framework in Fusion allows us to integrate wildly
>>>>different third-party modules, e.g. we have connectors based on GCM,
>>>>Hadoop map-reduce, databases, local files, remote filesystems,
>>>>repositories, etc. In fact, it's relatively straightforward to
>>>>integrate Nutch with this framework, and we actually provide docs on
>>>>how to do this, so nothing stops you from using Nutch if it fits the
>>>>bill.
>>>> 
>>>> * this framework provides a uniform REST API to control the
>>>>processing pipeline for documents collected by connectors and, in
>>>>most cases, to manage the crawlers' configurations and processes.
>>>>Only the first part is in place for the integration with Nutch,
>>>>i.e. configuration and jobs have to be managed externally, and only
>>>>the processing and content enrichment is controlled by Lucidworks
>>>>Fusion. If we get a business case that requires a tighter
>>>>integration, I'm sure we will be happy to build it.
>>>> 
>>>> * the previous generation of Lucidworks products (called
>>>>"LucidWorks Search", or LWS for short) used Aperture as its web
>>>>crawler. This was a legacy integration, and while it worked fine for
>>>>what it was originally intended to do, it had some painful
>>>>limitations - not to mention that the Aperture project is no longer
>>>>active.
>>>> 
>>>> * the current version of the product DOES NOT use Aperture for web
>>>>crawling. It uses a web- and file-crawler implementation created
>>>>in-house that re-uses some code from crawler-commons, with minor
>>>>modifications (a short robots.txt sketch follows below this list).
>>>> 
>>>> * our content processing framework uses many Open Source tools (among
>>>>them Tika, OpenNLP, Drools, of course Solr, and many others), on top
>>>>of which we've built a powerful system for content enrichment, event
>>>>processing and data analytics.
>>>> 
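>>>> For illustration, one of the pieces crawler-commons provides is
>>>>robots.txt parsing; a minimal sketch against the crawler-commons
>>>>0.x API (method signatures may differ between versions):
>>>>
>>>>    import java.nio.charset.StandardCharsets;
>>>>
>>>>    import crawlercommons.robots.BaseRobotRules;
>>>>    import crawlercommons.robots.SimpleRobotRulesParser;
>>>>
>>>>    public class RobotsCheck {
>>>>        public static void main(String[] args) {
>>>>            byte[] robotsTxt = ("User-agent: *\n"
>>>>                    + "Disallow: /private/\n")
>>>>                    .getBytes(StandardCharsets.UTF_8);
>>>>
>>>>            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
>>>>            BaseRobotRules rules = parser.parseContent(
>>>>                    "http://example.com/robots.txt", robotsTxt,
>>>>                    "text/plain", "mycrawler");
>>>>
>>>>            // false - disallowed by the rules above
>>>>            System.out.println(rules.isAllowed("http://example.com/private/x"));
>>>>            // true
>>>>            System.out.println(rules.isAllowed("http://example.com/public/x"));
>>>>        }
>>>>    }
>>>>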
>>>> So, those are the facts. Now, let's move on to opinions ;)
>>>> 
>>>> There are many different use cases for web/file crawling, with
>>>>many different scalability and content processing requirements. So
>>>>far the target audience for Lucidworks Fusion has required small- to
>>>>medium-scale web crawls, but with sophisticated content processing,
>>>>extensive control over the crawling frontier (handling sessions for
>>>>depth-first crawls, cookies, form logins, etc.) and easy management
>>>>and control of the process over REST / UI. In many cases the effort
>>>>to set up and operate a Hadoop cluster was also deemed too high, or
>>>>irrelevant to the core business. And in reality, as you know, there
>>>>are workload sizes for which Hadoop is total overkill and the
>>>>processing round-trip is on the order of several minutes instead of
>>>>seconds.
>>>> 
>>>> For these reasons we wanted to provide a web crawler that is
>>>>self-contained and lean, doesn't require Hadoop, and scales well
>>>>from small to mid-size workloads without Hadoop's overhead, while at
>>>>the same time providing an easy way to integrate a high-scale
>>>>crawler like Nutch for customers that need it - and for such
>>>>customers we DO recommend Nutch as the best high-scale crawler. :)
>>>> 
>>>> So, in my opinion Lucidworks Fusion satisfies these goals, and
>>>>provides a reasonable tradeoff between ease of use, scalability, rich
>>>>content processing and ease of integration. Don't take my word for it
>>>>- download a copy and try it yourself!
>>>> 
>>>> To Lewis:
>>>> 
>>>>> Hopefully the above makes clear my take on things. If LucidWorks
>>>>>has some magic sauce, then great. Hopefully they'll consider
>>>>>bringing some of it back into Nutch rather than writing some Perl
>>>>>or Python scripts. I would never expect this to happen; however, I
>>>>>am utterly depressed at how often I see it happen.
>>>> 
>>>> Lucidworks is a Java/Clojure shop; the connectors framework and
>>>>the web crawler are written in Java - no Perl or Python in sight ;)
>>>>Our magic sauce is in enterprise integration and rich content
>>>>processing pipelines, not so much in basic web crawling.
>>>> 
>>>> So, that's my contribution to this discussion ... I hope this
>>>>answered some questions. Feel free to ask if you need more
>>>>information.
>>>> 
>>>> --
>>>> Best regards,
>>>> Andrzej Bialecki <a...@lucidworks.com>
>>>> 
>>>> --=# http://www.lucidworks.com #=--
>>>> 
>>>> 
>>> 
>
>---
>Best regards,
>
>Andrzej Bialecki
>
