I'm relatively new to this field, and I have a problem that seems solvable in lots of different ways. I'm looking for recommendations on how to approach a data-refinement pipeline.

I'm not sure where to look for this type of architecture description. My best finds so far have been some of the talks on the lucidimagination.com site.

I have a project for which I'm tentatively planning to use Nutch to crawl a medium-sized set of sites (20,000 to 50,000), extracting maybe 1 million documents in total.

My big problem is deciding where I should do most of my document processing. There are hooks in Nutch, Lucene, and Solr to parse or modify documents. A large number of these documents will contain semantically similar information, but it comes in hundreds if not thousands of different formats. I want to get as much of the data as I can into fields so that I can do faceted searches. The product descriptions I'm parsing will mostly have 10-20 common pieces of data that I'd like to capture.
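
To make the question concrete, here is the kind of thing I imagine hanging on the Solr side: an UpdateRequestProcessor that sits in the update chain and adds or rewrites fields before a document is indexed. This is only a sketch of the idea, not working code: the field names ("content", "product_price") and the price regex are placeholders, and package names vary a little between Solr versions.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

/** Pulls one structured field out of the raw text before indexing. */
public class ProductFieldExtractorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new ProductFieldExtractor(next);
  }

  static class ProductFieldExtractor extends UpdateRequestProcessor {
    // Deliberately naive price pattern, purely illustrative.
    private static final Pattern PRICE =
        Pattern.compile("\\$\\s*(\\d+(?:\\.\\d{2})?)");

    ProductFieldExtractor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      Object content = doc.getFieldValue("content");
      if (content != null) {
        Matcher m = PRICE.matcher(content.toString());
        if (m.find()) {
          doc.setField("product_price", m.group(1));
        }
      }
      super.processAdd(cmd);  // pass the enriched document down the chain
    }
  }
}

As I understand it, the factory would get wired into an updateRequestProcessorChain in solrconfig.xml, and the update handler that receives the crawled documents would point at that chain. But maybe the same work belongs in a Nutch plugin instead, which is really my question.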

I will have various processing goals:
  - language detection (a quick sketch follows this list)
  - parsing specific data fields that may be represented in many different
    ways
  - product description parsing, which I can likely recognize by vocabulary
    with a naive Bayes filter
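
For the language detection step I was thinking of starting with something as simple as Tika's n-gram based LanguageIdentifier, since Tika is already part of the Nutch/Solr stack I'd be using. A minimal sketch (the helper and class names are just mine):

import org.apache.tika.language.LanguageIdentifier;

public class LanguageTagger {

  /** Returns an ISO 639 code, or null if the identifier isn't confident. */
  public static String detectLanguage(String text) {
    LanguageIdentifier identifier = new LanguageIdentifier(text);
    return identifier.isReasonablyCertain() ? identifier.getLanguage() : null;
  }

  public static void main(String[] args) {
    // Likely prints "fr" for this snippet.
    System.out.println(detectLanguage("Ceci est une description de produit."));
  }
}

For the naive Bayes product-description filter I'd probably lean on an existing implementation (Mahout and Weka both have one) rather than writing my own.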

Given the many different formats, I expect I'll take an iterative approach to the parsing problem: crawl once, then successively refine my parsing.
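
Part of what would make that iteration feasible, I think, is storing the raw extracted text in the index so I can re-run improved parsers without another crawl. Something along these lines with SolrJ, where the "content" field, the URL, and the commented-out extractFields call are all placeholders (and older SolrJ releases call the server class CommonsHttpSolrServer):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

/** Second-pass job: re-parse documents already in the index, no re-crawl. */
public class ReindexPass {

  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("*:*");
    query.setFields("id", "content");
    query.setRows(500);  // a real job would page through the full result set

    QueryResponse rsp = solr.query(query);
    for (SolrDocument found : rsp.getResults()) {
      SolrInputDocument updated = new SolrInputDocument();
      // Re-adding with the same id overwrites the whole document, so every
      // stored field worth keeping has to be copied across.
      updated.addField("id", found.getFieldValue("id"));
      updated.addField("content", found.getFieldValue("content"));
      // updated.addField("product_price", extractFields(found));  // latest parsing pass
      solr.add(updated);
    }
    solr.commit();
  }
}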

For the data fields, I plan to run the documents through multiple parsers and score each result based on the number of fields parsed and the consistency of the data in those fields.
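
To make that concrete, the skeleton I have in mind looks roughly like this, where FieldParser, the scoring weights, and the consistency check are all invented for illustration:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Runs several candidate parsers over one document, keeps the best result. */
public class ParserRace {

  /** Hypothetical interface each format-specific parser would implement. */
  interface FieldParser {
    /** Returns whatever fields it could extract; possibly empty. */
    Map<String, String> parse(String rawText);
  }

  /** Score = number of fields extracted, plus a bonus for consistent data. */
  static int score(Map<String, String> fields) {
    int s = fields.size();
    // Example consistency check: a price field should actually look numeric.
    String price = fields.get("price");
    if (price != null && price.matches("\\d+(\\.\\d{2})?")) {
      s += 2;
    }
    return s;
  }

  static Map<String, String> bestOf(String rawText, List<FieldParser> parsers) {
    Map<String, String> best = new HashMap<String, String>();
    int bestScore = -1;
    for (FieldParser parser : parsers) {
      Map<String, String> fields = parser.parse(rawText);
      int s = score(fields);
      if (s > bestScore) {
        bestScore = s;
        best = fields;
      }
    }
    return best;
  }
}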

So where should I consider hanging my code in this document-processing chain?

thanks for any suggestions!
