Hi Mark,

I have some initial impressions; please read below.

On Mon, Jan 9, 2012 at 9:29 AM, Mark Bennett <mbenn...@ideaeng.com> wrote:
> We've been hoping to do some work this year to embed pipeline processing
> into MCF, such as UIMA or OpenPipeline or XPump.
>
> But reading through some recent posts there was a discussion about leaving
> this sort of thing to the Solr pipeline, and it suddenly dawned on me that
> maybe not everybody was on board with the idea of moving this into MCF.
>

A pipeline for content extraction is a pretty standard thing for a
search engine to have.  Having said that, I agree there are times
when this is not enough.

> So, before we spin our wheels, I wanted explain some reasons why this would
> be a GOOD thing to do, and get some reactions:
>
>
> 1: Not everybody is using Solr / or using exclusively Solr.
>
> Lucene and Solr are great of course, but open source isn't about walled
> gardens.  Most companies have multiple search engines.
>
> And, even if you just wanted to use Lucene (and not Solr), then the Solr
> pipeline is not very attractive.
>
> As an example, the Google appliance gets lots of press for Enterprise
> search.  And it's got enough traction that its connector format is
> starting to be used by other companies.  BUT, at least in the past,
> Google's document processing wasn't very pipeline-friendly.  They had calls
> you could make, but there were issues.
>
> Wouldn't it be cool if Manifold could be used to feed Google appliances?  I
> realize some open source folks might not care, but it would suddenly make
> MCF interesting to a lot more developers.
>
> Or look at FAST ESP (which was bought by Microsoft).  FAST ESP had a rich
> tradition of pipeline goodness, but since Microsoft acquired them, that
> pipeline technology is being re-cast in a very Microsoft-centric stack.
> That's fine if you're a Microsoft shop; you might like it even better than
> before.  But if your company prefers Linux, you might be looking for
> something else.
>
>
> 2: Not every information application is about search
>
> Classically there's been a need to go from one database to another.  But in
> more recent times there's been a need to go from Databases into Content
> Management Systems, or from one CMS to another, or to convert one corpus of
> documents into another.
>
> Sure, there was ETL technology (Extract, Transform, Load), but that tended
> to revolve around structured data.
>
> More generally there's the class of going between structured and
> unstructured data, and vice versa.  The latter, going from unstructured
> back to structured, is where Entity Extraction comes into play, and where I
> had thought MCF could really shine.
>
> There's a somewhat subtle point here as well.  There's the format of
> individual documents or files, such as HTML, PDF, RSS or MS Word, but also
> the type of repository they reside in (filesystem, database, CMS, web
> services, etc.).  I was drawn to MCF for the connections, but a document
> pipeline would let it work on the document formats as well.
>
>
> 3: Even spidering to feed a search engine can benefit from "early binding"
> and "extended state"
>
> A slight aside: generic web page spidering doesn't often need fancy
> processing.  What I'm about to talk about might at first seem like "edge
> cases".  BUT, almost by definition, many of us are not brought into a
> project unless it's well outside the mainstream use case.  So many
> programmers find themselves working almost full-time on rather unusual
> projects.  Open source is quite attractive because it provides a wealth of
> tools to choose from.
>
> "Early Binding" for Spiders:
>
> Generally it's the need to deeply parse a document before instructing the
> spider what action to take next.
>

We've looked at this as primarily a connector-specific activity.  For
example, you wouldn't want to do such a thing from within documents
fetched via JCIFS.  The main use case I can see is in extracting links
from web content.
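
To make the link-extraction case concrete, here's a minimal sketch of
the kind of "early binding" step a web-oriented connector performs
before deciding what to fetch next.  It uses plain java.util.regex
rather than ManifoldCF's actual HTML handling, and the class and
method names are purely illustrative:

import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative only: a crude "early binding" step that pulls anchor hrefs
 * out of fetched HTML so a crawler can decide which URLs to queue next.
 * This is not ManifoldCF code; it just shows the shape of the operation.
 */
public class LinkExtractorSketch {
  private static final Pattern HREF =
      Pattern.compile("<a\\s+[^>]*href\\s*=\\s*[\"']([^\"'#]+)[\"']",
          Pattern.CASE_INSENSITIVE);

  public static List<String> extractLinks(String baseUrl, String html) {
    List<String> links = new ArrayList<String>();
    Matcher m = HREF.matcher(html);
    while (m.find()) {
      // Resolve relative links against the page that contained them.
      links.add(URI.create(baseUrl).resolve(m.group(1)).toString());
    }
    return links;
  }
}

A real implementation would use an HTML parser rather than a regex,
but the shape of the step - fetch, parse, emit candidate URLs - is the
same.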

> Let me give one simple example, but trust me there are many more!
>
> Suppose you have Web pages (or PDF files!) filled with part numbers.  And
> you have a REST API that, presented with a part number, will give more
> details.
>
> But you need to parse out the part numbers in order to create the URLs
> that the spider needs to fetch next.
>
> Many other applications of this involve helping the spider decide what type
> of document it has, or what quality of data it's getting.  You might decide
> to tell the spider to drill down deeper, or conversely, give up and work on
> higher-value targets.
>

What you've described is a case where ManifoldCF obtains content
references from one source and indexes content from another.  Today,
in order to pull that kind of thing off with ManifoldCF, you need to
write a custom connector to do it.  That's not so unreasonable; it
involves a lot of domain-specific pieces - e.g. how to obtain the
PDFs, how to build the URLs, etc.  A similar situation existed for
crawling wikis; there was a custom API which basically worked with
HTTP requests.  A generic web crawler could have been used, but
because of the very specific requirements for understanding the API,
it was best modeled as a new connector.

I think the rough breakdown of which component of the ManifoldCF
system is responsible for what remains correct.  Making it easier to
construct a custom connector in this way, by using building blocks
that ManifoldCF would make available to all connectors, makes sense
to some degree.
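
To make that concrete with your part-number example above: the
domain-specific piece such a custom connector would carry is fairly
small.  Here's a rough sketch in plain Java; the part-number pattern
and the REST endpoint are invented for illustration and would come
from the connector's configuration in practice:

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch of the domain-specific logic a custom connector
 * would carry for the "part number" case: parse identifiers out of
 * already-fetched text, then build the REST URLs to fetch next.
 * The part-number format and endpoint below are invented.
 */
public class PartNumberSeedingSketch {
  // Assumed part-number format, e.g. "PN-123456"; adjust to the real corpus.
  private static final Pattern PART_NUMBER = Pattern.compile("\\bPN-(\\d{6})\\b");
  // Assumed details endpoint; a real connector would make this configurable.
  private static final String DETAILS_URL = "https://example.com/api/parts/%s";

  public static Set<String> urlsToFetch(String extractedText) {
    Set<String> urls = new LinkedHashSet<String>();
    Matcher m = PART_NUMBER.matcher(extractedText);
    while (m.find()) {
      urls.add(String.format(DETAILS_URL, m.group(1)));
    }
    return urls;
  }
}

In ManifoldCF terms, each of these URLs would become a new document
identifier that the connector hands back to the framework for queuing;
the actual queuing call belongs to the connector interface and is
omitted here.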

> I could imagine a workaround where Manifold passes documents to Solr, and
> then Solr's pipeline later resubmits URLs back into MCF, but it's a lot
> more direct to just make these determinations immediately.  In a few
> cases it WOULD be nice to have Solr's full-text index, so maybe it'd be
> nice to have both options.  Commercial software companies would want to
> make the decision for you; they'd choose one way or the other, but this
> ain't their garden.  ;-)
>
>
> "extended state" for Spiders:
>
> This is where you need the context of 2 or 3 pages back in your traversed
> path in order to make full use of the current page.
>

The ManifoldCF framework uses the concept of "carrydown" data for this
purpose.  This is covered in Chapters 6 and 7 of "ManifoldCF in
Action".

> Here's an example from a few years back:
>
> Steps:
> 1: Start with a list of concert venue web sites.
> 2: For each venue, look up the upcoming events, including dates, bands,
> and ticketing links.
> 3: For each band, go to another site and look up their albums.
> 4: For each album, look up each song.
> 5: For each song, go to a third site to get the lyrics.
>

Yeah, and this is why we have a connector framework.  The actual
content here is an amalgam of many different pages.  Each individual
"document" you'd index into your search engine contains content that
comes from many sources, and it has to be the connector's
responsibility to pull all that together.
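
As a rough illustration of how carrydown-style data lets the leaf
"song" document keep context gathered several hops earlier, here's a
plain-Java sketch.  It deliberately does not quote the real
IProcessActivity method signatures (the book's carrydown chapters
cover those); the little map-based API below is invented purely to
show the data flow:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Invented sketch of the carrydown idea: data parsed several hops back
 * (venue, event date, band, album) rides along with each child reference,
 * so the leaf "song" document can be indexed with its full context.
 * This is NOT the ManifoldCF API; it only illustrates the data flow.
 */
public class CarrydownSketch {
  // Carrydown values keyed by child document identifier.
  private final Map<String, Map<String, String>> carrydown =
      new HashMap<String, Map<String, String>>();

  // Called while processing a parent page: record a child reference and
  // attach whatever context was parsed at this hop (venue, band, album...).
  public void addChildReference(String childId, Map<String, String> parentData) {
    Map<String, String> existing = carrydown.get(childId);
    if (existing == null) {
      existing = new HashMap<String, String>();
      carrydown.put(childId, existing);
    }
    existing.putAll(parentData);
  }

  // Called while processing the child (e.g. the lyrics page): merge the
  // carried-down context with the fields parsed from the child itself.
  public Map<String, String> buildIndexDocument(String childId,
      Map<String, String> childFields) {
    Map<String, String> carried = carrydown.get(childId);
    Map<String, String> doc = new HashMap<String, String>(
        carried == null ? Collections.<String, String>emptyMap() : carried);
    doc.putAll(childFields);
    return doc;
  }
}

The framework persists carrydown data in its database along with the
document queue, which is what makes this workable for long-running
jobs.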

> Now users can search for songs, including the text of the lyrics.
> When a match is found, also show them upcoming performances near them, and
> maybe even let them click to buy tickets.
>
> You can see that the unit of retrieval is particular songs, in steps 4 and
> 5.  But we want data that we parsed from several steps back.
>
> Even in the case of song lyrics, where the page will have the band's name,
> it might not have the album title.  (And a song could have been on several
> albums, of course.)  So even for things you'd expect to be able to parse,
> you've often already had that info during a previous step.
>
> I realize MCF probably doesn't include this type of state trail now.  But I
> was thinking it'd at least be easier to build something on top of MCF than
> going way out to Solr and then back into Manifold.
>
> In the past I think folks would have used Perl or Python to handcraft these
> types of projects.  But that doesn't scale very well, and you still need
> persistence for long-running jobs, AND it doesn't encourage code reuse.
>
>
> So, Manifold could really benefit from pipelines!
>
> I have a lot of technical thoughts about how this might be achieved, and a
> bunch of related thoughts.  But if pipelines are really unwelcome, I don't
> want to force it.
>
>
> One final thought:
>
> The main search vendors seem to be abandoning high-end, precision
> spidering.  There's a tendency now to see all the world as "Internet", and
> the data behind firewalls as just "a smaller Internet (intranet)".
>
> This is fine for 80-90% of common use cases.
>
> But that last 5-10% of atypical projects are HUGELY under-served at this
> time.  They often have expensive problems that simply won't go away.
>
> Sure, true open source folks may or may not care about "markets" or
> "expensive problems".
>
> BUT these are also INTERESTING problems!  If you're bored with "appliances"
> and the latest downloadable free search bar, then trust me, these edge
> cases will NOT bore you!!!
>
> And I've given up on the current crop of Tier 1 search vendors for solving
> anything "interesting".  They are so distracted with so many other
> things.... and selling a hot 80% solution to a hungry market is fine with
> them anyway.
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

I'd be interested in hearing your broad-brush idea of a proposal.  It
seems to me that there are several wholly independent situations you
are describing, which don't offhand seem like they'd be served by a
common architectural component.  These are:

(1) Providing a content-extraction and modification pipeline, for
those output connectors that are targeting systems that cannot do
content extraction on their own.
(2) Providing framework-level services that allow "connectors" to be
readily constructed along a pipeline model.

Karl
