I couldn't find a site that demonstrates the issue; if you know of one, please let me know.

Karl

On Tue, Nov 19, 2013 at 5:57 PM, Karl Wright <[email protected]> wrote:

> Hi Mark,
>
> Yes, at least the materials I see online say that this is the case. But I
> don't know exactly how.
>
> For the purposes of the File System Output Connector, it doesn't matter,
> since anyone can construct a site that does NOT redirect and still has the
> URL layout as you originally described. So the problem has to be solved.
>
> I can experiment with WGET here, to check out what its behavior might be,
> but not while I'm doing Windows stuff - so I thought you might be able to
> do that.
>
> Thanks,
> Karl
>
> On Tue, Nov 19, 2013 at 5:52 PM, Mark Libucha <[email protected]> wrote:
>
>> So you're saying wget can be run in a mode whereby it follows the
>> redirect to fetch the content but uses the original, pre-redirect url to
>> create the directory to store the content?
>>
>> On Tue, Nov 19, 2013 at 2:41 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> Yes, but I'm afraid we *can't* emulate the redirect behavior because
>>> that's an upstream connector choice. WGet can operate in a mode where it
>>> uses the pre-redirect URL, and resolves conflicts nonetheless. How does it
>>> do it?
>>>
>>> Karl
>>>
>>> On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <[email protected]> wrote:
>>>
>>>> wget -x uses the redirect url as the basis for the path it creates.
>>>>
>>>> So, if http://mysite/news returns a 302 redirecting to
>>>> http://mysite/news/index.html, wget saves as:
>>>>
>>>> mysite/news/index.html
>>>>
>>>> MCF, on the other hand, saves as:
>>>>
>>>> http/mysite/news
>>>>
>>>> Mark
>>>>
>>>> On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> The filesystem connector is supposed to emulate WGET behavior. What
>>>>> does WGET do in this case?
>>>>>
>>>>> Karl
>>>>>
>>>>> On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <[email protected]> wrote:
>>>>>
>>>>>> Noticed this problem while crawling a web site and saving to the file
>>>>>> system with the FileSystem output connector.
>>>>>>
>>>>>> Let's say the website defines a URL like this:
>>>>>>
>>>>>> http://mysite/news
>>>>>>
>>>>>> That URI actually gets mapped to a file on the web server, say
>>>>>> http://mysite/news/index.html, but the http://mysite/news URI does
>>>>>> exist and gets sent as the documentURI to addOrReplaceDocument().
>>>>>>
>>>>>> MCF's FileSystem connector gets the http://mysite/news URL and
>>>>>> creates a directory for saving that content that looks like this
>>>>>> http/mysite/news, where news is a file.
>>>>>>
>>>>>> But then if the site also defines a URL like this
>>>>>> http://mysite/news/local/today.html, MCF's FileSystem connector
>>>>>> fails trying to create the directory http/mysite/news/local because
>>>>>> part of it, http/mysite/news, already exists as a file.
>>>>>>
>>>>>> Of course, if the URIs are crawled in the reverse order, the file
>>>>>> can't be created because a directory already exists with that name.
>>>>>>
>>>>>> Make sense?
>>>>>>
>>>>>> The real killer is that when this happen it's fatal to the job. That
>>>>>> is, it doesn't just fail to get that one URL, the connector returns a
>>>>>> fatal error and the crawl is stopped.
>>>>>>
>>>>>> Mark
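
For anyone who wants to reproduce the collision from Mark's original report without a live site, the sketch below may help. It is not MCF or wget code: `url_to_path`, `save`, and `crawl` are hypothetical helper names, and the scheme/host/path layout is only assumed from the `http/mysite/news` example in the thread. It writes both URLs into a temporary directory using a naive mapping and reports which writes succeed.

```python
import os
import tempfile
from urllib.parse import urlsplit


def url_to_path(root, url):
    """Map a URL to root/scheme/host/path (layout assumed from the thread)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return os.path.join(root, parts.scheme, parts.netloc, *segments)


def save(root, url, content):
    """Naively write content to the mapped path.

    Returns True on success, False if the file system refuses because a
    path component already exists as the wrong kind of entry.
    """
    target = url_to_path(root, url)
    try:
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "w") as handle:
            handle.write(content)
    except OSError:
        return False
    return True


def crawl(urls):
    """Save each URL in order into a fresh temporary directory."""
    with tempfile.TemporaryDirectory() as root:
        return [save(root, url, "body") for url in urls]


if __name__ == "__main__":
    news = "http://mysite/news"
    today = "http://mysite/news/local/today.html"
    print(crawl([news, today]))  # file first, then the directory cannot be created
    print(crawl([today, news]))  # directory first, then the file cannot be created
```

In either crawl order the first write succeeds and the second fails, which matches the two cases described in the report. In this sketch the failure is simply reported per URL; the report says the connector instead treats it as fatal to the whole job.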
