We've been hoping to do some work this year to embed pipeline processing
into MCF, using something like UIMA, OpenPipeline, or XPump.

But reading through some recent posts, I saw a discussion about leaving
this sort of thing to the Solr pipeline, and it suddenly dawned on me
that maybe not everybody is on board with the idea of moving this into
MCF.

So, before we spin our wheels, I wanted to explain some reasons why this
would be a GOOD thing to do, and get some reactions:


1: Not everybody is using Solr, or using Solr exclusively.

Lucene and Solr are great of course, but open source isn't about walled
gardens.  Most companies have multiple search engines.

And even if you just wanted to use Lucene (and not Solr), the Solr
pipeline is not very attractive.

As an example, the Google Search Appliance gets lots of press for
Enterprise search.  And it's got enough traction that its connector
format is starting to be adopted by other companies.  BUT, at least in
the past, Google's document processing wasn't very pipeline friendly.
They had calls you could make, but there were issues.

Wouldn't it be cool if Manifold could be used to feed Google appliances?  I
realize some open source folks might not care, but it would suddenly make
MCF interesting to a lot more developers.

Or look at FAST ESP (which was bought by Microsoft).  FAST ESP had a
rich tradition of pipeline goodness, but since the acquisition that
pipeline technology is being re-cast in a very Microsoft-centric stack.
That's fine if you're a Microsoft shop; you might like it even better
than before.  But if your company prefers Linux, you might be looking
for something else.


2: Not every information application is about search

Classically there's been a need to move data from one database to
another.  But more recently there's been a need to go from databases
into Content Management Systems, or from one CMS to another, or to
convert one corpus of documents into another.

Sure, there's been ETL technology (Extract, Transform, Load), but that
has tended to revolve around structured data.

More generally, there's the class of problems that go between
structured and unstructured data, and vice versa.  The latter, going
from unstructured back to structured, is where Entity Extraction comes
into play, and where I had thought MCF could really shine.
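
To make that unstructured-to-structured direction concrete, here's a
tiny Java sketch of what one extraction step might look like.  To be
clear, the class name, the regexes, and the (type, value) output are
all invented for illustration; real entity extraction is far richer:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Toy entity extractor: recover structured (type, value) pairs
    // from free text.  Purely a sketch of the idea, not real MCF code.
    public class TinyEntityExtractor {
      private static final Pattern EMAIL =
          Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
      private static final Pattern DATE =
          Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");

      public static List<String[]> extract(String text) {
        List<String[]> entities = new ArrayList<String[]>();
        Matcher m = EMAIL.matcher(text);
        while (m.find())
          entities.add(new String[] {"email", m.group()});
        m = DATE.matcher(text);
        while (m.find())
          entities.add(new String[] {"date", m.group()});
        return entities;  // rows you could load into a database or CMS
      }
    }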

There's a somewhat subtle point here as well.  There's the format of
individual documents or files, such as HTML, PDF, RSS or MS Word, but
there's also the type of repository they reside in (filesystem,
database, CMS, web services, etc.).  I was drawn to MCF for the
repository connectors, but a document pipeline would let it work on
the document formats as well.


3: Even spidering to feed a search engine can benefit from "early binding"
and "extended state"

A slight aside: generic web page spidering doesn't often need fancy
processing.  What I'm about to talk about might at first seem like "edge
cases".  BUT, almost by definition, many of us are not brought into a
project unless it's well outside the mainstream use case.  So many
programmers find themselves working almost full-time on rather unusual
projects.  Open source is quite attractive because it provides a wealth of
tools to choose from.

"Early Binding" for Spiders:

Generally it's the need to deeply parse a document before deciding
what action the spider should take next.

Let me give one simple example, but trust me, there are many more!

Suppose you have web pages (or PDF files!) filled with part numbers.
And you have a REST API that, presented with a part number, will give
more details.

But you need to parse the part numbers out of those documents in order
to construct the URLs for the spider to fetch next.
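
Here's a rough Java sketch of what that "early binding" step could
look like.  Everything here is hypothetical (the PN-##### pattern, the
example.com endpoint, the class itself); it just shows the shape of
the idea:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // "Early binding" sketch: deep-parse the fetched page, pull out
    // part numbers, and turn each one into a new URL for the spider.
    public class PartNumberLinkFinder {
      // Assume part numbers look like "PN-12345" (invented format).
      private static final Pattern PART =
          Pattern.compile("\\bPN-\\d{5}\\b");
      // Hypothetical REST endpoint that expands a part number.
      private static final String REST_TEMPLATE =
          "http://example.com/api/parts/%s";

      public static List<String> discoverUrls(String pageText) {
        List<String> urls = new ArrayList<String>();
        Matcher m = PART.matcher(pageText);
        while (m.find()) {
          // Each extracted part number becomes a new fetch target.
          urls.add(String.format(REST_TEMPLATE, m.group()));
        }
        return urls;
      }
    }

The point is that the spider queues those URLs immediately, with no
round trip through an external indexing pipeline.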

Many other applications of this involve helping the spider decide
what type of document it has, or what quality of data it's getting.
You might decide to tell the spider to drill down deeper, or
conversely, to give up and work on higher-value targets.

I could imagine a workaround where Manifold passes documents to Solr,
and then Solr's pipeline later resubmits URLs back into MCF, but it's
a lot more direct to just make these determinations immediately.  In a
few cases it WOULD be nice to have Solr's full-text index, so maybe
it'd be nice to have both options.  Commercial software companies would
want to make the decision for you; they'd choose one way or the other.
But this ain't their garden.  ;-)


"extended state" for Spiders:

This is where you need the context of 2 or 3 pages back in your traversed
path in order to make full use of the current page.

Here's an example from a few years back:

Steps:
1: Start with a list of concert venue web sites.
2: For each venue, look up the upcoming events, including dates,
bands, and ticketing links.
3: For each band, go to another site and look up their albums.
4: For each album, look up each song.
5: For each song, go to a third site to get the lyrics.

Now users can search for songs, including the text of the lyrics.
When a match is found, we also show them upcoming performances near
them, and maybe even let them click to buy tickets.

You can see that the unit of retrieval is a particular song, in steps
4 and 5.  But we want data that was parsed several steps back.

Even in the case of song lyrics, the lyrics page will usually have
the band's name, but it might not have the album title.  (And a song
could have appeared on several albums, of course.)  So even for things
you'd expect to be able to parse from the current page, you've often
already had that info during a previous step.

I realize MCF probably doesn't include this type of state trail now.
But I was thinking it'd at least be easier to build something like
that on top of MCF than going way out to Solr and then back into
Manifold.
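
To show what I mean by building something on top, here's a minimal
Java sketch of a breadcrumb-style context object.  It's purely
hypothetical (MCF has nothing like this today); the idea is just that
each discovered URL inherits the metadata parsed on the pages that led
to it:

    import java.util.HashMap;
    import java.util.Map;

    // "Extended state" sketch: a trail of metadata that follows the
    // crawl path from venue, to event, to band, to album, to song.
    public class CrawlContext {
      private final Map<String,String> trail =
          new HashMap<String,String>();

      // Derive a child context for a newly discovered URL: inherit
      // the parent's trail, then add what this page contributed
      // (e.g. band name, album title).
      public CrawlContext child(Map<String,String> parsedHere) {
        CrawlContext c = new CrawlContext();
        c.trail.putAll(this.trail);
        c.trail.putAll(parsedHere);
        return c;
      }

      // Everything known at this point in the traversal; attach these
      // fields to the final song document at index time.
      public Map<String,String> fields() {
        return new HashMap<String,String>(trail);
      }
    }

By the time step 5 indexes a song's lyrics, fields() would still carry
the venue, dates, band, and album from steps 1 through 4.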

In the past I think folks would have used Perl or Python to handcraft
these types of projects.  But that doesn't scale very well, you still
need persistence for long-running jobs, AND it doesn't encourage code
reuse.


So, Manifold could really benefit from pipelines!

I have a lot of technical thoughts about how this might be achieved,
and a bunch of related thoughts.  But if pipelines are really
unwelcome, I don't want to force it.


One final thought:

The main search vendors seem to be abandoning high-end, precision
spidering.  There's a tendency now to see all the world as "Internet",
and the data behind firewalls as just "a smaller Internet" (an
intranet).

This is fine for 80-90% of common use cases.

But that last 5-10% of atypical projects is HUGELY under-served at
this time.  Those projects often have expensive problems that simply
won't go away.

Sure, true open source folks may or may not care about "markets" or
"expensive problems".

BUT these are also INTERESTING problems!  If you're bored with "appliances"
and the latest downloadable free search bar, then trust me, these edge
cases will NOT bore you!!!

And I've given up on the current crop of Tier 1 search vendors ever
solving anything "interesting".  They are distracted by so many other
things... and selling a hot 80% solution to a hungry market is fine
with them anyway.

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
