Re: Revisiting: Should Manifold include Pipelines

2012-01-12 Thread Karl Wright
Hi Mark,





 I'm not sure if this question is revisiting the motivation for preferring
 this in MCF, or a technical question about how to package metadata for
 different engines that might want it in a different format.


I'm looking not so much for justification, but for enough context as
to how to structure the code.  Based on what I've heard, it probably
makes the most sense to provide a service available for both
repository connectors and output connectors to use in massaging
content.  The configuration needed for the service would therefore be
managed by the repository connector or output connector which required
the pipeline's services.
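To make that concrete, here's a very rough sketch of what such a shared
service might look like.  None of these types exist in MCF today; the names
are invented purely for illustration, in Java:

// Hypothetical sketch only -- not an existing ManifoldCF interface.
// The idea: a shared content-massaging service that either a repository
// connector or an output connector can call, with the calling connector
// owning the configuration it passes in.
import java.io.InputStream;
import java.util.Map;

interface ContentMassager {
  // Apply the configured transformations to a document's content stream
  // and metadata, returning the (possibly modified) document.
  MassagedDocument massage(MassagedDocument doc, Map<String, String> config)
      throws Exception;
}

class MassagedDocument {
  InputStream content;                // raw document bytes
  Map<String, String[]> metadata;     // field name -> values

  MassagedDocument(InputStream content, Map<String, String[]> metadata) {
    this.content = content;
    this.metadata = metadata;
  }
}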


 For the latter, how to pass metadata to engines, that's interesting.  One
 almost universal way is to add metadata tags to the header portion of an HTML
 file.  There are some other microformats that some engines understand.
 Could we just assume, for now, that additional metadata will be jammed
 into the HTML header, perhaps with an x- prefix for the name (a convention
 some folks like)?


I would presume that the Java coder who writes the output connector for a
given search engine would tackle this problem in the appropriate way.  I
don't think it's a pipeline question.
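That said, purely to illustrate the x- convention from your note, an output
connector could jam extra metadata into the HTML head along these lines.
The class, field names, and naive insertion are all made up; real code would
need proper HTML parsing and attribute escaping:

// Sketch: emit extra metadata as <meta> tags in the HTML head, with an
// x- prefix on the names.  Purely illustrative.
import java.util.Map;

class MetaTagWriter {
  static String withMetaTags(String html, Map<String, String> metadata) {
    StringBuilder tags = new StringBuilder();
    for (Map.Entry<String, String> e : metadata.entrySet()) {
      tags.append("<meta name=\"x-").append(e.getKey())
          .append("\" content=\"").append(e.getValue()).append("\"/>\n");
    }
    // Naive insertion right after <head>; a real connector would parse the
    // HTML and escape the attribute values.
    return html.replaceFirst("(?i)<head>", "<head>\n" + tags);
  }
}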


 Including Tika would be useful for connectors that need to look at binary
 doc files to do their parsing.  Even if the pipeline then discards Tika's
 output when it's done, it's still a worthwhile expense *if* it meets the
 project objective.

 As an example, the current MCF system looks for links in HTML.  But
 hyperlinks can also appear in Word, Excel and PDF files.  Tika could, in
 theory, convert those docs so that they can also be scanned for links, and
 the converted file discarded afterwards.


Sure, that's why I'd make the pipeline available to every connector.  The
Java code for each connector would be modified, if appropriate, to use the
pipeline where helpful.
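For what it's worth, here is a minimal sketch of how a connector might use
Tika to pull hyperlinks out of a binary document and then throw the
converted text away.  It assumes the standard Tika parsers are on the
classpath; error handling is omitted:

// Sketch: parse a binary document with Tika and collect the hyperlinks it
// contains; the converted content itself is discarded.
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.Link;
import org.apache.tika.sax.LinkContentHandler;

class LinkExtractor {
  static List<String> extractLinks(InputStream document) throws Exception {
    LinkContentHandler links = new LinkContentHandler();
    new AutoDetectParser().parse(document, links, new Metadata(),
        new ParseContext());
    List<String> uris = new ArrayList<String>();
    for (Link link : links.getLinks()) {
      uris.add(link.getUri());
    }
    return uris;
  }
}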


 Given the dismal state of open tools, I'd be excited to just see 1:1
 pipeline functionality made widely available.

 I'm regretting, to some extent, bringing in the more complex Pipeline logic
 as it may have partially derailed the conversation.  I'm one of the authors
 of the old XPump tool, which was able to do very fancy things, but suffered
 from other issues.

 But better to have something now than nothing.  And I'll ponder the more
 complex scenarios some more.


I'll say more about this further down.



 So, my question to you is, what would the main use case(s) be for a
 pipeline in your view?


 I've given a couple examples above, of 1:1 transforms.  I *KNOW* this is of
 interest to some folks, but it sounds like I've failed to convince you.
 I'd ask you to take it on faith, but you don't know me very well, so that'd
 be asking a lot.


The goal of the question was to confirm that you thought the value of
having a pipeline (vs. a Pipeline, as we've defined them) was high
enough.  I wanted to be sure there was no communication issue and that we
understood one another before anybody went off and started writing code.


 A final question for you Karl, since we've both invested some time in
 discussing something that would normally be very complex to others.  What
 open source tools would YOU suggest I look at, for a new home for uber
 pipeline processing?  I think you understand some of the logical
 functionality I want to model.

 Some other wish list items:
 * Leverage MCF connectors
 * A web UI framework for monitoring

 I'd say up front that I've considered Nutch, but I don't think it's a good
 fit for other reasons.

 I'm still looking around at UIMA.  I keep finding justifications for
 UIMA and how awesome it is, but less on the technical side.  I'm not sure it
 models a data flow design that well.

 The other area I looked at was some of the Eclipse process graph stuff,
 Business Process Management I think.


 There's a TON of open source projects.


I can't claim to know all the open-source projects out there.  But I'm
unaware of one that really focuses on Pipeline building from the
perspective of crawling.

On the other hand, it seems pretty clear to me how one would go about
converting ManifoldCF to a Pipeline project.  What you'd get would
be a tool with UI components where you'd either glue the components
together with code, or use an amalgamation UI to generate the
necessary data flow.  There may already be tools in this space I don't
know of, but before you'd get to that point you'd want to have all the
technical underpinnings worked out.

The Pipeline services you'd want to provide would include functions
that each connector currently performs, but broken out as I'd
described in one of my earlier posts.  The document queue, which is
managed by the ManifoldCF framework right now, would need to be
redesigned since the entire notion of what a job is would require
redesign in a Pipeline world.

In order to develop such a thing, I'd be tempted to say fork 

Re: Revisiting: Should Manifold include Pipelines

2012-01-11 Thread Mark Bennett
Hi Karl,

Still pondering our last discussion.  Wondering if I got things off track.

As a start, what if I backtracked a bit, to this:

What's the easiest way to do this:
* A connector that tweaks metadata from a single source.
* Sits between any existing MCF datasource connector and the main MCF engine

Before:

CMS/DB -> Existing MCF connector -> MCF core -> output

After:

CMS/DB -> Existing MCF connector -> Metadata tweaker -> MCF core -> output


Assume the metadata changes don't have any impact on security, or that no
security is being used (public data).
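If it helps make the Metadata tweaker box concrete, this is the kind of
tiny 1:1 tweak I have in mind (the field names are invented, it's just a
sketch):

// Sketch of a 1:1 metadata tweak sitting between a repository connector
// and the core: rename a field and normalize its value.  Field names are
// invented for illustration.
import java.util.HashMap;
import java.util.Map;

class MetadataTweaker {
  static Map<String, String> tweak(Map<String, String> metadata) {
    Map<String, String> out = new HashMap<String, String>(metadata);
    String author = out.remove("dc:creator");
    if (author != null) {
      out.put("author", author.trim().toLowerCase());  // rename + normalize
    }
    return out;
  }
}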


Re: Revisiting: Should Manifold include Pipelines

2012-01-11 Thread Karl Wright
Hi Mark,

I think I'd describe this simplified proposal as pipeline (vs.
Pipeline; your original description was the latter).  This proposal
is simpler but does not have the ability to amalgamate content from
multiple connectors, correct?  As long as it is just modifying the
content and metadata (as described by RepositoryDocument), it's not
hard to develop a generic idea of a content processing pipeline, e.g.
Tika.

There's a question in my mind as to where it belongs.  If its purpose
is to make up for missing code in particular search engines, then I'd
argue it should be a service available to output connector coders, who
can then choose how much configurability makes sense from the point of
view of their target system.  For instance, since Tika is already part
of Solr, there would seem little benefit in adding a Tika pipeline
upstream of Solr as well, but maybe a Google Appliance connector would
want it and therefore expose it.  If the pipeline's purpose is to
include arbitrary business logic, on the other hand, then I think what
you'd really need is a Pipeline and not a pipeline, if you see what I
mean.

So, my question to you is, what would the main use case(s) be for a
pipeline in your view?

Karl

On Wed, Jan 11, 2012 at 6:31 AM, Mark Bennett mbenn...@ideaeng.com wrote:
 Hi Karl,

 Still pondering our last discussion.  Wondering if I got things off track.

 As a start, what if I backtracked a bit, to this:

 What's the easiest way to do this:
 * A connector that tweaks metadata from a single source.
 * Sits between any existing MCF datasource connector and the main MCF engine

 Before:

 CMS/DB -> Existing MCF connector -> MCF core -> output

 After:

 CMS/DB -> Existing MCF connector -> Metadata tweaker -> MCF core -> output


 Assume the metadata changes don't have any impact on security, or that no
 security is being used (public data).


Re: Revisiting: Should Manifold include Pipelines

2012-01-10 Thread Mark Bennett
Hi Karl,

I wanted to acknowledge and thank you for your 2 emails.

I need to think a bit.  I *do* have answers to some of your concerns, and
hopefully reasonable-sounding ones at that.

Also, maybe I should take another look at Nutch - BUT Manifold's Web UI is
so much further along, and more in line with the type of admin view of
what's going on, that I had given up on Nutch for a bit.  I have some other
thoughts about Nutch but won't go into them here.

Also, to be clear, I in no way meant to even imply you had any other
motives for having materials in the book.  You've demonstrated, time and
again, that you sincerely want to share MCF, and info about it, with the
whole world!

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Tue, Jan 10, 2012 at 12:27 AM, Karl Wright daddy...@gmail.com wrote:

 you wanted a connection to be a pipeline component rather than what it
 is today.



Revisiting: Should Manifold include Pipelines

2012-01-09 Thread Mark Bennett
We've been hoping to do some work this year to embed pipeline processing
into MCF, such as UIMA or OpenPipeline or XPump.

But reading through some recent posts there was a discussion about leaving
this sort of thing to the Solr pipeline, and it suddenly dawned on me that
maybe not everybody was on board with the idea of moving this into MCF.

So, before we spin our wheels, I wanted to explain some reasons why this
would be a GOOD thing to do, and get some reactions:


1: Not everybody is using Solr / or using exclusively Solr.

Lucene and Solr are great, of course, but open source isn't about walled
gardens.  Most companies have multiple search engines.

And, even if you just wanted to use Lucene (and not Solr), then the Solr
pipeline is not very attractive.

As an example, the Google appliance gets lots of press for Enterprise
search.  And it's got enough traction that its connector format is
starting to be used by other companies.  BUT, at least in the past,
Google's document processing wasn't very pipeline friendly.  They had calls
you could make, but there were issues.

Wouldn't it be cool if Manifold could be used to feed Google appliances?  I
realize some open source folks might not care, but it would suddenly make
MCF interesting to a lot more developers.

Or look at FAST ESP (which was bought by Microsoft).  FAST ESP had a rich
tradition of pipeline goodness, but since Microsoft acquired them, that
pipeline technology is being recast in a very Microsoft-centric stack.
That's fine if you're a Microsoft shop, you might like it even better than
before, but if your company prefers Linux, you might be looking for
something else.


2: Not every information application is about search

Classically there's been a need to go from one database to another.  But in
more recent times there's been a need to go from Databases into Content
Management Systems, or from one CMS to another, or to convert one corpus of
documents into another.

Sure there was ETL technology (Extract, Transform, Load), but that tended
to be around structured data.

More generally there's the class of going between structured and
unstructured data, and vice versa.  The latter, going from unstructured
back to structured, is where Entity Extraction comes into play, and where I
had thought MCF could really shine.

There's a somewhat subtle point here as well.  There's the format of
individual documents or files, such as HTML, PDF, RSS or MS Word, but also
the type of repository they reside in (filesystem, database, CMS, web
services, etc.).  I was drawn to MCF for the connections, but a document
pipeline would let it work on the document formats as well.


3: Even spidering to feed a search engine can benefit from early binding
and extended state

A slight aside: generic web page spidering doesn't often need fancy
processing.  What I'm about to talk about might at first seem like edge
cases.  BUT, almost by definition, many of us are not brought into a
project unless it's well outside the mainstream use case.  So many
programmers find themselves working almost fulltime on rather unusual
projects.  Open source is quite attractive because it provides a wealth of
tools to choose from.

Early Binding for Spiders:

Generally it's the need to deeply parse a document before instructing the
spider what action to take next.

Let me give one simple example, but trust me there are many more!

Suppose you have Web pages (or PDF files!) filled with part numbers.  And
you have a REST API that, presented with a part number, will give more
details.

But you need to parse out the part numbers in order to create the URLs that
the spider needs to fetch next.
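Just to make that concrete, the parsing step might look something like this
(the part number pattern and the REST URL are of course made up):

// Sketch: pull part numbers out of a page's text and turn each one into a
// REST URL for the spider to fetch next.  Pattern and URL are invented.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartNumberLinks {
  private static final Pattern PART_NO = Pattern.compile("\\bPN-\\d{6}\\b");

  static List<String> urlsToFetch(String pageText) {
    List<String> urls = new ArrayList<String>();
    Matcher m = PART_NO.matcher(pageText);
    while (m.find()) {
      urls.add("http://example.com/api/parts/" + m.group());
    }
    return urls;
  }
}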

Many other applications of this involve helping the spider decide what type
of document it has, or what quality of data it's getting.  You might decide
to tell the spider to drill down deeper, or conversely, give up and work on
higher value targets.

I could imagine a workaround where Manifold passes documents to Solr, and
then Solr's pipeline later resubmits URLs back into MCF, but it's a lot
more direct to just make these determinations immediately.  In a few
cases it WOULD be nice to have Solr's full-word index, so maybe it'd be
nice to have both options.  Commercial software companies would want to
make the decision for you, they'd choose one way or the other, but this
ain't their garden.  ;-)


Extended State for Spiders:

This is where you need the context of 2 or 3 pages back in your traversed
path in order to make full use of the current page.

Here's an example from a few years back:

Steps:
1: Start with a list of concert venue web sites.
2: Foreach venue, lookup the upcoming events, including dates, bands and
ticketing links.
3: Foreach band, go to this other site and lookup their albums.
4: Foreach album, lookup each song.
5: Foreach song, go to a third site to get the lyrics.
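Here's a very rough sketch of the kind of per-URL context you'd want
carried along through those steps (hypothetical types only, nothing that
exists in MCF today):

// Sketch: carry the context of pages already traversed along with each URL
// the spider queues, so that step 5 (lyrics) still knows the venue, event
// and band from steps 1-3.  Hypothetical types only.
import java.util.HashMap;
import java.util.Map;

class CrawlTask {
  final String url;                   // next page to fetch
  final Map<String, String> context;  // e.g. venue, event date, band, album

  CrawlTask(String url, Map<String, String> parentContext) {
    this.url = url;
    this.context = new HashMap<String, String>(parentContext);
  }

  // When a page yields a child link, hand the accumulated context down.
  CrawlTask child(String childUrl, String key, String value) {
    CrawlTask t = new CrawlTask(childUrl, this.context);
    t.context.put(key, value);
    return t;
  }
}

Each child URL inherits the accumulated context, so by step 5 the lyrics
page still knows which venue, event and band it belongs to.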

Now users can search for songs including the text in the lyrics.
When a match is found, also show 

Re: Revisiting: Should Manifold include Pipelines

2012-01-09 Thread Mark Bennett
Hi Karl,

Thanks for the reply, most comments inline.

General comments:

I was wondering if you've used a custom pipeline like FAST ESP or
Ultraseek's old patches.py, and if there were any that you liked or
disliked?  In more recent times the OpenPipeline effort has been a bit
nascent, I think in part because it lacks some of the connectors.  Coming
from my background I'm probably a bit biased toward thinking of problems in
terms of a pipeline, and it's also a frequent discussion with some of our
more challenging clients.

Generally speaking we define the virtual document to be the basic unit of
retrieval, and it doesn't really matter whether it starts life as a web
page or PDF or Outlook node.  Most documents have a create / modified
date, some type of title, and a few other semi-common metadata fields.
They do vary by source, but there are mapping techniques.
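By virtual document I mean roughly this shape, with per-source mapping
into the semi-common fields (a hypothetical sketch, not an existing class):

// Sketch of a "virtual document": the basic unit of retrieval regardless
// of whether it started life as a web page, a PDF or an Outlook node.
// Hypothetical; shown only to pin down the idea of semi-common fields.
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

class VirtualDocument {
  String sourceId;                    // where it came from
  String title;                       // some type of title
  Date created;
  Date modified;
  Map<String, String[]> fields = new HashMap<String, String[]>();  // the rest
  byte[] content;                     // the body, whatever its original format
}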

Having more connector services, or even just more examples, is certainly a
step in the right direction.

But leaving it at writing custom monolithic connectors has a few
disadvantages:
- Not as modular, so it discourages code reuse
- Keeps everything at 100% coding, vs. some mix of configuration and code
- Keeps the bar at rather advanced Java programming, vs. opening up to
folks that feel more comfortable with scripting (of a sort, not
suggesting a full language)
- I think folks tend to share more when using configurable systems,
though I have no proof.  It might just be the larger number of people.
- Sort of the blank-canvas syndrome as each person tries to grasp all the
nuances; granted, the one I'm suggesting merely presents a smaller blank
canvas, but maybe with crayons and connect-the-dots, vs. oil paints.

On to specific comments

On Mon, Jan 9, 2012 at 6:55 AM, Karl Wright daddy...@gmail.com wrote:

 Hi Mark,

 I have some initial impressions; please read below.

 On Mon, Jan 9, 2012 at 9:29 AM, Mark Bennett mbenn...@ideaeng.com wrote:
  We've been hoping to do some work this year to embed pipeline processing
  into MCF, such as UIMA or OpenPipeline or XPump.
 
  But reading through some recent posts there was a discussion about leaving
  this sort of thing to the Solr pipeline, and it suddenly dawned on me that
  maybe not everybody was on board with the idea of moving this into MCF.
 

 Having a pipeline for content extraction is a pretty standard thing
 for a search engine to have.  Having said that, I agree there are
 times when this is not enough.


But every engine has a different pipeline, and they're not always
comparable.

And virtually every large company has multiple search engines.  So
re-implementing business logic over and over is expensive and buggy.  And
there's also the question of basic connector and filter availability and
licensing.

And some vendors are fussy about their IP so code is rarely shared online.

And a standard open source pipeline that actually gets some use would
benefit from a much larger pool of users.



  So, before we spin our wheels, I wanted to explain some reasons why this
  would be a GOOD thing to do, and get some reactions:
 
 
  1: Not everybody is using Solr / or using exclusively Solr.
 
  Lucene and Solr are great, of course, but open source isn't about walled
  gardens.  Most companies have multiple search engines.
 
  And, even if you just wanted to use Lucene (and not Solr), then the Solr
  pipeline is not very attractive.
 
  As an example, the Google appliance gets lots of press for Enterprise
  search.  And it's got enough traction that their format of connector is
  starting to be used by other companies.  BUT, at least in the past,
  Google's document processing wasn't very pipeline friendly.  They had calls
  you could make, but there were issues.
 
  Wouldn't it be cool if Manifold could be used to feed Google appliances?  I
  realize some open source folks might not care, but it would suddenly make
  MCF interesting to a lot more developers.
 
  Or look at FAST ESP (which was bought by Microsoft).  FAST ESP had a rich
  tradition of pipeline goodness, but once Microsoft acquired them, that
  pipeline technology is being re-cast in a very Microsoft centric stack.
  That's fine if you're a Microsoft shop, you might like it even better than
  before, but if your company prefers Linux, you might be looking for
  something else.
 
 
  2: Not every information application is about search
 
  Classically there's been a need to go from one database to another.  But in
  more recent times there's been a need to go from Databases into Content
  Management Systems, or from one CMS to another, or to convert one corpus of
  documents into another.
 
  Sure there was ETL technology (Extract, Transform, Load), but that tended
  to be around structured data.
 
  More generally there's the class of going between structured and
  unstructured data, and vice versa.  The latter, going from unstructured
  back to structured, is where Entity Extraction comes into play, and where I
  had thought MCF could really shine.
 
  There's