This is a good question. I think we should carry this conversation forward on connectors-dev.
My initial thought on this issue is that the functionality really belongs in Tika. Tika is set up to extract and filter in exactly this way. The only reason you'd want to do it in MCF is if it would change the links you might extract (or skip), and that seems less interesting to me. How do you feel about it?

Karl

On Thu, Mar 31, 2011 at 10:41 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:
>
> All major commercial search engines are shipped with a web crawler which
> allows one to filter out unwanted content, such as certain HTML blocks,
> comments etc. Would it be advisable to add such a functionality to MCF? Or
> will it be difficult to implement since the idea behind the
> ExtractingRequestHandler is to send binary files to Solr?
>
> Say that you have an HTML document which includes the following comments:
> <!-- stop indexing -->
> <!-- start indexing -->
> All content within these comments should then be skipped from the index.
>
> I managed to rewrite Apache Nutch in order to add this functionality
> some months ago.
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
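
For reference, here is a rough sketch of the comment-based filtering described in the quoted message, written as a standalone Python function. It is only an illustration of the idea and not how Tika, MCF, or Nutch implement anything: the function name `strip_unindexed` and the two regular expressions are hypothetical, and the sketch assumes that a "stop indexing" comment with no matching "start indexing" comment should suppress everything up to the end of the document.

```python
import re

# Marker comments as given in the quoted message. The patterns tolerate
# extra whitespace and letter-case differences (an assumption of this sketch).
_SKIP_BLOCK = re.compile(
    r"<!--\s*stop indexing\s*-->.*?<!--\s*start indexing\s*-->",
    re.DOTALL | re.IGNORECASE,
)

# Assumption: an unmatched "stop indexing" hides the rest of the document.
_UNCLOSED_STOP = re.compile(
    r"<!--\s*stop indexing\s*-->.*\Z",
    re.DOTALL | re.IGNORECASE,
)


def strip_unindexed(html: str) -> str:
    """Return html with every stop/start indexing block removed."""
    # Remove each complete stop...start block (non-greedy, so several
    # blocks in one document are handled independently).
    without_blocks = _SKIP_BLOCK.sub("", html)
    # Then drop anything after a stop marker that was never closed.
    return _UNCLOSED_STOP.sub("", without_blocks)


if __name__ == "__main__":
    page = (
        "<p>keep</p>"
        "<!-- stop indexing --><p>drop</p><!-- start indexing -->"
        "<p>also keep</p>"
    )
    print(strip_unindexed(page))
```

A regex pass like this runs on the raw markup before any parsing, so it would also affect which links a crawler sees. Whether that is desirable is the question raised above about doing this in MCF rather than in Tika.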