Re: FileSystem connector path issue

Mark Libucha Tue, 19 Nov 2013 14:34:15 -0800

wget -x uses the redirect URL as the basis for the path it creates.

So, if http://mysite/news returns a 302 redirecting to
http://mysite/news/index.html, wget saves as:


mysite/news/index.html

MCF, on the other hand, saves as:

http/mysite/news
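
To make the difference concrete, here is a rough Python sketch of the two mappings as I understand them from the examples above. This is not MCF or wget code; the helper names (mcf_style_path, wget_style_path, conflicts) are made up for illustration, and the exact rules each tool applies are an assumption based only on the two paths shown.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def mcf_style_path(document_uri):
    # Assumed layout: scheme/host/path of the URI as passed in,
    # e.g. http://mysite/news -> http/mysite/news
    parts = urlsplit(document_uri)
    return PurePosixPath(parts.scheme, parts.netloc, parts.path.lstrip("/"))


def wget_style_path(final_url):
    # Assumed layout: host/path of the URL *after* redirects,
    # e.g. http://mysite/news/index.html -> mysite/news/index.html
    parts = urlsplit(final_url)
    return PurePosixPath(parts.netloc, parts.path.lstrip("/"))


def conflicts(file_path, other_path):
    # True when file_path would have to be a regular file and also
    # a parent directory of other_path.
    return file_path in other_path.parents


news = mcf_style_path("http://mysite/news")
today = mcf_style_path("http://mysite/news/local/today.html")
print(news, today, conflicts(news, today))

index = wget_style_path("http://mysite/news/index.html")
today_wget = wget_style_path("http://mysite/news/local/today.html")
print(index, today_wget, conflicts(index, today_wget))
```

With the first mapping, news has to be both a file and a directory. With the second, index.html sits inside the news directory, so the two documents don't collide.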

Mark


On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <[email protected]> wrote:

> Hi Mark,
>
> The filesystem connector is supposed to emulate WGET behavior.  What does
> WGET do in this case?
>
> Karl
>
>
>
> On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <[email protected]> wrote:
>
>> Noticed this problem while crawling a web site and saving to the file
>> system with the FileSystem output connector.
>>
>> Let's say the website defines a URL like this:
>>
>> http://mysite/news
>>
>> That URI actually gets mapped to a file on the web server, say
>> http://mysite/news/index.html, but the http://mysite/news URI does exist
>> and gets sent as the documentURI to addOrReplaceDocument().
>>
>> MCF's FileSystem connector gets the http://mysite/news URL and creates a
>> path for saving that content that looks like this: http/mysite/news,
>> where news is a file.
>>
>> But then if the site also defines a URL like this
>> http://mysite/news/local/today.html, MCF's FileSystem connector fails
>> trying to create the directory http/mysite/news/local because part of it,
>> http/mysite/news, already exists as a file.
>>
>> Of course, if the URIs are crawled in the reverse order, the file can't
>> be created because a directory already exists with that name.
>>
>> Make sense?
>>
>> The real killer is that when this happens it's fatal to the job. That is,
>> it doesn't just fail to get that one URL; the connector returns a fatal
>> error and the crawl is stopped.
>>
>> Mark
>>
>>
>
