I'm relatively new to this field, and I have a problem that seems solvable in lots of different ways, so I'm looking for recommendations on how to approach a data refining pipeline. I'm not sure where to look for this type of architecture description; my best finds so far have been some of the talks on the lucidimagination.com site.
I have a project where I'm tentatively planning to use Nutch to crawl a medium-sized set of sites (20,000 to 50,000), extracting maybe 1 million documents in total.
My big problem is: where should I do most of my document processing? There are hooks in Nutch, Lucene, and Solr for parsing or modifying documents. A large number of these documents will carry semantically similar information, but in hundreds if not thousands of different formats. I want to get as much of the data as I can into fields so that I can do faceted searches. The product descriptions I'm parsing will mostly have 10-20 common pieces of data that I'd like to capture.
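
For what it's worth, the Solr-side hook I've been looking at is a custom UpdateRequestProcessor. Something roughly like this is what I had in mind (untested sketch; ExtractFieldsProcessor, the raw_content field, and extractFields() are just names I made up):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Untested sketch: pull structured fields out of the raw content
// before the document gets indexed, so they're available as facets.
public class ExtractFieldsProcessor extends UpdateRequestProcessor {

  public ExtractFieldsProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object raw = doc.getFieldValue("raw_content"); // placeholder field name
    if (raw != null) {
      for (Map.Entry<String, String> e : extractFields(raw.toString()).entrySet()) {
        doc.addField(e.getKey(), e.getValue());
      }
    }
    super.processAdd(cmd); // pass the document down the chain
  }

  // Placeholder for the real extraction logic: map raw text to the
  // 10-20 common product fields (price, brand, weight, ...).
  private Map<String, String> extractFields(String raw) {
    Map<String, String> out = new HashMap<String, String>();
    // regex / DOM rules would go here
    return out;
  }
}

As far as I can tell this would get wired in through a factory and an updateRequestProcessorChain in solrconfig.xml. Part of why I'm asking is whether that's the right layer, or whether a Nutch parse/indexing plugin is a better place for the same logic.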
I will have several processing goals:
- language detection
- parsing specific data fields that may be represented in many different ways
- recognizing product descriptions, which I can likely do by vocabulary with a naive Bayes filter (rough sketch after this list)
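
For the naive Bayes filter, I was picturing something bare-bones along these lines (just a sketch, no particular library; the class name and tokenizer are placeholders):

import java.util.HashMap;
import java.util.Map;

// Bare-bones multinomial naive Bayes for "product description" (class 1)
// vs "other" (class 0), trained on whatever pages I hand-label.
public class DescriptionClassifier {

  private final Map<String, int[]> counts = new HashMap<String, int[]>(); // word -> count per class
  private final int[] totalTokens = new int[2];
  private final int[] docs = new int[2];

  public void train(String text, int label) {
    docs[label]++;
    for (String w : tokenize(text)) {
      int[] c = counts.get(w);
      if (c == null) { c = new int[2]; counts.put(w, c); }
      c[label]++;
      totalTokens[label]++;
    }
  }

  // True if the text looks more like a product description than not.
  public boolean isProductDescription(String text) {
    double[] logProb = new double[2];
    int vocab = counts.size();
    for (int k = 0; k < 2; k++) {
      // smoothed class prior
      logProb[k] = Math.log((docs[k] + 1.0) / (docs[0] + docs[1] + 2.0));
      for (String w : tokenize(text)) {
        int[] c = counts.get(w);
        int count = (c == null) ? 0 : c[k];
        // Laplace-smoothed word likelihood
        logProb[k] += Math.log((count + 1.0) / (totalTokens[k] + vocab + 1.0));
      }
    }
    return logProb[1] > logProb[0];
  }

  // Stand-in for a real analyzer.
  private String[] tokenize(String text) {
    return text.trim().toLowerCase().split("\\W+");
  }
}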
Given the many different formats, I expect to take an iterative approach to the parsing problem, so I'll likely crawl everything once and then successively refine my parsing.
For the data fields, I plan to run each document through multiple parsers and score the results based on the number of fields parsed and the consistency of the data in those fields.
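
Concretely, I was imagining the scoring step looking something like this (again just a sketch; the Extractor interface and consistencyScore() are invented for illustration):

import java.util.List;
import java.util.Map;

// Sketch of the "run several parsers, keep the best result" idea.
public class BestOfParsers {

  public interface Extractor {
    // Returns field name -> extracted value; empty map if nothing matched.
    Map<String, String> extract(String text);
  }

  // Returns the highest-scoring parse, or null if no extractors are given.
  public static Map<String, String> bestParse(String text, List<Extractor> extractors) {
    Map<String, String> best = null;
    double bestScore = -1.0;
    for (Extractor ex : extractors) {
      Map<String, String> fields = ex.extract(text);
      // Score = number of fields filled in, weighted by how internally
      // consistent the values look.
      double score = fields.size() * consistencyScore(fields);
      if (score > bestScore) {
        bestScore = score;
        best = fields;
      }
    }
    return best;
  }

  // Placeholder: e.g. does "price" parse as a number, do units make sense,
  // are dates plausible. Returns a value in [0, 1].
  private static double consistencyScore(Map<String, String> fields) {
    return fields.isEmpty() ? 0.0 : 1.0;
  }
}

The consistency check is where I'd put things like "does the price field actually parse as a number", which I'm hoping will let me pick among parsers automatically.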
So where should I consider hanging my code in this document processing chain?
Thanks for any suggestions!