I'm relatively new to this field, and I have a problem that could be tackled in lots of different ways, so I'm looking for recommendations on how to approach a data-refining pipeline.
I'm not sure where to look for this type of architecture description. My best finds so far have been some of the talks on the lucidimagination.com site.

I have a project where I'm tentatively planning to use Nutch to crawl a medium-sized set of sites (20,000 to 50,000), extracting maybe 1 million documents in total. My big question is where I should do most of my document processing. There are hooks in Nutch, Lucene, and Solr to parse or modify documents. A large number of these documents will carry semantically similar information, but in hundreds if not thousands of different formats. I want to get as much of the data as I can into fields so that I can do faceted searches. The product descriptions I'm parsing will mostly have 10 to 20 common pieces of data that I'd like to capture.

I will have several processing goals:

- language detection
- parsing specific data fields that may be represented in many different ways
- recognizing product descriptions, which I can likely do by vocabulary with a naive Bayes filter

Given the many different formats, I expect to take an iterative approach to the parsing problem: crawl once, then successively refine my extraction. For the data fields, I plan to run each document through multiple parsers and score the results by the number of fields parsed and the consistency of the data in those fields.

So where should I consider hanging my code in this document-processing chain? I've added a couple of rough sketches below to make the question more concrete. Thanks for any suggestions!
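For the multi-parser scoring idea, here's roughly what I have in mind. The FieldParser interface, the class names, and the scoring heuristic are all placeholders I made up for illustration, not anything from Nutch, Lucene, or Solr:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    // Hypothetical interface: each parser knows one format and returns
    // whatever fields it managed to extract (empty map if it can't cope).
    interface FieldParser {
        Map<String, String> parse(String rawText);
    }

    public class BestParseSelector {

        private final List<FieldParser> parsers;

        public BestParseSelector(List<FieldParser> parsers) {
            this.parsers = parsers;
        }

        // Run every parser over the document and keep the highest-scoring result.
        public Map<String, String> bestParse(String rawText) {
            Map<String, String> best = Collections.emptyMap();
            double bestScore = -1.0;
            for (FieldParser parser : parsers) {
                Map<String, String> fields = parser.parse(rawText);
                double score = score(fields);
                if (score > bestScore) {
                    bestScore = score;
                    best = fields;
                }
            }
            return best;
        }

        // Score = number of fields extracted, with a very crude consistency
        // check (penalize blank values; a real check would validate formats,
        // units, cross-field agreement, and so on).
        private double score(Map<String, String> fields) {
            double score = 0.0;
            for (String value : fields.values()) {
                score += (value != null && !value.trim().isEmpty()) ? 1.0 : -0.5;
            }
            return score;
        }
    }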
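And on the Solr side, the kind of hook I'm thinking of is an update request processor in the indexing chain, something along these lines. This is only a sketch based on my reading of the docs; I haven't verified the exact package names against a particular Solr release, and the field names and the detectLanguage call are hypothetical:

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class ProductFieldExtractorFactory extends UpdateRequestProcessorFactory {

        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                                  SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new ProductFieldExtractor(next);
        }

        static class ProductFieldExtractor extends UpdateRequestProcessor {

            ProductFieldExtractor(UpdateRequestProcessor next) {
                super(next);
            }

            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object raw = doc.getFieldValue("content");
                if (raw != null) {
                    // Run language detection / field extraction here and add
                    // the results as extra fields for faceting, e.g.:
                    // doc.addField("language", detectLanguage(raw.toString()));
                }
                super.processAdd(cmd);  // hand the document on to the rest of the chain
            }
        }
    }

The alternative, as I understand it, would be to do the same kind of extraction in a Nutch parse or indexing filter before the documents ever reach Solr, which is really the heart of my question.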