Hi Mark,

Yes, but I'm afraid we *can't* emulate the redirect behavior, because that's an upstream connector choice. WGet can operate in a mode where it uses the pre-redirect URL, and yet it still resolves these conflicts. How does it do that?
Karl

On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <[email protected]> wrote:

> wget -x uses the redirect url as the basis for the path it creates.
>
> So, if http://mysite/news returns a 302 redirecting to
> http://mysite/news/index.html, wget saves as:
>
> mysite/news/index.html
>
> MCF, on the other hand, saves as:
>
> http/mysite/news
>
> Mark
>
>
> On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Mark,
>>
>> The filesystem connector is supposed to emulate WGET behavior. What does
>> WGET do in this case?
>>
>> Karl
>>
>>
>> On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <[email protected]> wrote:
>>
>>> Noticed this problem while crawling a web site and saving to the file
>>> system with the FileSystem output connector.
>>>
>>> Let's say the website defines a URL like this:
>>>
>>> http://mysite/news
>>>
>>> That URI actually gets mapped to a file on the web server, say
>>> http://mysite/news/index.html, but the http://mysite/news URI does
>>> exist and gets sent as the documentURI to addOrReplaceDocument().
>>>
>>> MCF's FileSystem connector gets the http://mysite/news URL and saves
>>> that content at a path that looks like this: http/mysite/news,
>>> where news is a file.
>>>
>>> But then, if the site also defines a URL like
>>> http://mysite/news/local/today.html, MCF's FileSystem connector fails
>>> trying to create the directory http/mysite/news/local, because part of
>>> it, http/mysite/news, already exists as a file.
>>>
>>> Of course, if the URIs are crawled in the reverse order, the file can't
>>> be created, because a directory already exists with that name.
>>>
>>> Make sense?
>>>
>>> The real killer is that when this happens, it's fatal to the job. That
>>> is, it doesn't just fail to get that one URL; the connector returns a
>>> fatal error and the crawl is stopped.
>>>
>>> Mark
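For anyone wanting to reproduce the collision Mark describes, here is a minimal standalone sketch. It is not MCF code: `uri_to_path` and `save_document` are hypothetical helpers that only assume the scheme/host/path layout implied by the `http/mysite/news` example in the thread. Saving http://mysite/news first creates a *file* named `news`, so a later URL under http://mysite/news/... cannot create the `news` directory it needs (wget avoids this in -x mode by using the post-redirect URL, mysite/news/index.html, which is a file *inside* the directory).

```python
import os
import tempfile
from urllib.parse import urlparse

def uri_to_path(root, uri):
    # Assumed layout, per the thread's example: <root>/<scheme>/<host>/<path>
    parts = urlparse(uri)
    return os.path.join(root, parts.scheme, parts.netloc,
                        *parts.path.strip("/").split("/"))

def save_document(root, uri, content):
    # Hypothetical stand-in for the connector's save step:
    # create parent directories, then write the document as a file.
    path = uri_to_path(root, uri)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(content)

root = tempfile.mkdtemp()

# First crawl order: the "directory-like" URI arrives first
# and is written as a plain file at http/mysite/news.
save_document(root, "http://mysite/news", b"news landing page")

try:
    # Now http/mysite/news must become a directory -- but it is a file,
    # so os.makedirs raises an OSError (NotADirectoryError on POSIX).
    save_document(root, "http://mysite/news/local/today.html", b"story")
except OSError as e:
    print("collision:", type(e).__name__)
```

The reverse crawl order fails symmetrically: once `news` exists as a directory, `open(path, "wb")` on it raises an `IsADirectoryError` instead.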
