I'm interested in crawling multiple shared folders (among other things) on a corporate LAN. The LAN consists of MS clients with accounts managed through Active Directory.
The users routinely access the files based on NTFS-level (and sharing?) permissions.

Ideally, I'd like to set up a central server (probably Linux, but any *n*x would do) where I'd mount all the shared folders. I'd then set up Apache so that the files are accessible via HTTP and, more importantly, WebDAV. I imagine Apache could use mod_dav, mod_auth and possibly one or two other modules to regulate access privileges, but I could very well be completely wrong here.

Finally, I'd like to set up Nutch to crawl the shared documents through the web server, so that the stored links are valid across the whole LAN. Nutch would therefore need unrestricted access to all documents, while the documents themselves would be served by a web server that checks user identities and access rights.

If any Nutch users have tackled the access rights problem themselves, a couple of pointers on how to go about the whole security issue would save me a world of time, effort and trouble. If the setup I described is the worst possible way to go about it, I'd appreciate a note saying so and explaining why. :)

TIA,
t.n.a.