wget -x uses the redirect URL (the one it ends up at after following the redirect) as the basis for the path it creates. So, if http://mysite/news returns a 302 redirecting to http://mysite/news/index.html, wget saves as:

mysite/news/index.html

MCF, on the other hand, saves as:

http/mysite/news

Mark

On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <[email protected]> wrote:

> Hi Mark,
>
> The filesystem connector is supposed to emulate WGET behavior. What does
> WGET do in this case?
>
> Karl
>
> On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <[email protected]> wrote:
>
>> Noticed this problem while crawling a web site and saving to the file
>> system with the FileSystem output connector.
>>
>> Let's say the website defines a URL like this:
>>
>> http://mysite/news
>>
>> That URI actually gets mapped to a file on the web server, say
>> http://mysite/news/index.html, but the http://mysite/news URI does exist
>> and gets sent as the documentURI to addOrReplaceDocument().
>>
>> MCF's FileSystem connector gets the http://mysite/news URL and creates a
>> directory for saving that content that looks like this http/mysite/news,
>> where news is a file.
>>
>> But then if the site also defines a URL like this
>> http://mysite/news/local/today.html, MCF's FileSystem connector fails
>> trying to create the directory http/mysite/news/local because part of it,
>> http/mysite/news, already exists as a file.
>>
>> Of course, if the URIs are crawled in the reverse order, the file can't
>> be created because a directory already exists with that name.
>>
>> Make sense?
>>
>> The real killer is that when this happen it's fatal to the job. That is,
>> it doesn't just fail to get that one URL, the connector returns a fatal
>> error and the crawl is stopped.
>>
>> Mark
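
To make the collision described in this thread concrete, here is a minimal Python sketch. It is not MCF's code (the connector itself is Java) and it is not wget's implementation either. The function names mcf_style_path, wget_style_path and find_conflicts are made up for illustration, and the two path layouts are only what the messages above describe: <scheme>/<host>/<path> built from the original documentURI, versus <host>/<path> built from the post-redirect URL. The index.html default for URLs ending in a slash is how I understand wget to behave, so treat that part as an assumption.

```python
from urllib.parse import urlsplit


def mcf_style_path(document_uri: str) -> str:
    """Layout described in this thread: <scheme>/<host>/<path segments>."""
    parts = urlsplit(document_uri)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.scheme, parts.netloc] + segments)


def wget_style_path(final_url: str) -> str:
    """Rough approximation of `wget -x`: <host>/<path>, from the post-redirect URL."""
    parts = urlsplit(final_url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: a URL that names a directory is saved as index.html inside it.
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")
    return "/".join([parts.netloc] + segments)


def find_conflicts(paths):
    """Return (file_path, other_path) pairs where file_path would also
    have to be a directory in order to hold other_path."""
    unique = sorted(set(paths))
    return [(a, b) for a in unique for b in unique if b.startswith(a + "/")]


# Original URIs, as passed to the connector in the report above.
mcf_paths = [
    mcf_style_path("http://mysite/news"),
    mcf_style_path("http://mysite/news/local/today.html"),
]

# Same two documents, but the first one keyed by its redirect target.
wget_paths = [
    wget_style_path("http://mysite/news/index.html"),
    wget_style_path("http://mysite/news/local/today.html"),
]

print(find_conflicts(mcf_paths))
print(find_conflicts(wget_paths))
```

With the original URIs, http/mysite/news has to be both a file and a directory, which is the failure Mark reported. Keyed by the redirect target, news is only ever a directory, so the two documents can coexist regardless of crawl order.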
