On Tue, May 18, 2004 at 12:30:10PM -0700, Byron Miller wrote:
> Is there a way that the fetcher could be extended, not
> necessarily as a plugin interface per se, but to read
> an XML document that describes how to handle specific
> file types?
>
> For example, many of the pdf-to-html, word-to-html and
> other applications already translate the content into
> html source, so the fetcher wouldn't need to be
> extended to do this. However, if there was an XML
> document that described the program that is called and the
> variables passed to an external program to
> handle the translation of the doc into "html", then
> that would be best.
>
> For instance, if the fetcher sees a pdf it would call the
> command as defined in the XML file to handle the pdf
> document, and it would just know to parse the results
> from the converted document, thus allowing your
> cached copy to be an HTML document like Google does.
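As a rough illustration of the idea quoted above, here is a minimal Python sketch of a converter table read from XML. Everything in it is an assumption made for illustration: the `<converters>`/`<converter>`/`<arg>` layout, the `mime` attribute, the function names, and the converter programs `my-pdf-to-html` and `my-word-to-html` are hypothetical and are not an existing Nutch format or real tools. It also assumes each converter reads the document on stdin and writes HTML to stdout.

```python
import subprocess
import xml.etree.ElementTree as ET

# Hypothetical config: the element names, the "mime" attribute, and the
# converter programs are illustrative assumptions, not a real Nutch format.
EXAMPLE_CONFIG = """\
<converters>
  <converter mime="application/pdf">
    <arg>my-pdf-to-html</arg>
    <arg>--stdin</arg>
  </converter>
  <converter mime="application/msword">
    <arg>my-word-to-html</arg>
  </converter>
</converters>
"""


def load_converters(xml_text):
    """Map each MIME type to the argv list of its external converter."""
    root = ET.fromstring(xml_text)
    return {
        conv.get("mime"): [arg.text for arg in conv.findall("arg")]
        for conv in root.findall("converter")
    }


def to_html(mime, data, converters, timeout=60):
    """Return HTML bytes for a fetched document, or None if no converter is configured."""
    if mime == "text/html":
        return data  # already HTML, nothing to convert
    argv = converters.get(mime)
    if argv is None:
        return None  # unknown type: the caller decides whether to skip it
    # Unix-style contract (an assumption): document on stdin, HTML on stdout.
    result = subprocess.run(
        argv, input=data, stdout=subprocess.PIPE, check=True, timeout=timeout
    )
    return result.stdout
```

Passing the command as a list of `<arg>` elements rather than one shell string avoids shell quoting problems; a real config would probably also need a way to pass a file name to converters that cannot read stdin.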
I think it is possible, and it would be better handled by separate tools.
The crawler should be less concerned with content conversion/analysis.
Even for outlink harvesting, a separate tool will be more manageable.
I believe htdig has such things.

> This way extensions could be defined as any program
> that has input/output (unix way) and not necessarily
> a plugin that requires java knowledge or re-writes of
> what is already done elsewhere into java.
>
> Heck, I wouldn't mind even having the ability to
> define separate indices for each data type and use the
> distributed search to consolidate these, so this way
> you could have search ftp, pdf, html, word as separate
> entities fairly easily :)

Yes, that is the way I do my fetch/search cycles:
first round, fetch text/html only, basically to collect as many links as possible;
second round, application/msword;
third round, application/pdf;
...
All can go in parallel, and this gives better storage management, because
pdf and doc files are typically much larger than html and you do not want
to mix them with html in the same segment.

John

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
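The per-type fetch rounds described in the reply could be planned roughly as follows. This is only a sketch under stated assumptions: it presumes the content type of each URL is already known (for example from an earlier pass), the function name `plan_rounds` is hypothetical, and the catch-all handling of other types is an addition not mentioned in the thread.

```python
from collections import defaultdict

# Round order as listed in the reply; other types go last (an assumption).
ROUND_ORDER = ["text/html", "application/msword", "application/pdf"]


def plan_rounds(urls_with_types, order=ROUND_ORDER):
    """Group (url, mime) pairs into one fetch list per content type, in round order."""
    by_type = defaultdict(list)
    for url, mime in urls_with_types:
        by_type[mime].append(url)
    # Known types first, in the configured order, so each gets its own segment.
    rounds = [(mime, by_type.pop(mime)) for mime in order if mime in by_type]
    # Remaining types follow, sorted by name for a deterministic plan.
    rounds.extend(sorted(by_type.items()))
    return rounds
```

Each `(mime, urls)` pair could then be fetched into its own segment, which keeps the large pdf and doc files apart from the html segment.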
