Re: FileSystem connector path issue

Karl Wright Tue, 19 Nov 2013 16:12:15 -0800

I couldn't find a site that demonstrates the issue; if you know of one,
please let me know.


Karl



On Tue, Nov 19, 2013 at 5:57 PM, Karl Wright <[email protected]> wrote:

> Hi Mark,
>
> Yes, at least the materials I see online say that this is the case.  But I
> don't know exactly how.
>
> For the purposes of the File System Output Connector, it doesn't matter,
> since anyone can construct a site that does NOT redirect and still has the
> URL layout as you originally described.  So the problem has to be solved.
>
> I can experiment with WGET here, to check out what its behavior might be,
> but not while I'm doing Windows stuff - so I thought you might be able to
> do that.
>
> Thanks,
> Karl
>
>
>
> On Tue, Nov 19, 2013 at 5:52 PM, Mark Libucha <[email protected]> wrote:
>
>> So you're saying wget can be run in a mode whereby it follows the
>> redirect to fetch the content but uses the original, pre-redirect url to
>> create the directory to store the content?
>>
>>
>> On Tue, Nov 19, 2013 at 2:41 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> Yes, but I'm afraid we *can't* emulate the redirect behavior because
>>> that's an upstream connector choice.  WGet can operate in a mode where it
>>> uses the pre-redirect URL, and resolves conflicts nonetheless.  How does it
>>> do it?
>>>
>>> Karl
>>>
>>>
>>>
>>> On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <[email protected]> wrote:
>>>
>>>> wget -x uses the redirect url as the basis for the path it creates.
>>>>
>>>> So, if http://mysite/news returns a 302 redirecting to
>>>> http://mysite/news/index.html, wget saves as:
>>>>
>>>> mysite/news/index.html
>>>>
>>>> MCF, on the other hand, saves as:
>>>>
>>>> http/mysite/news
>>>>
>>>> Mark
>>>>
>>>>
>>>> On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> The filesystem connector is supposed to emulate WGET behavior.  What
>>>>> does WGET do in this case?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <[email protected]> wrote:
>>>>>
>>>>>> Noticed this problem while crawling a web site and saving to the file
>>>>>> system with the FileSystem output connector.
>>>>>>
>>>>>> Let's say the website defines a URL like this:
>>>>>>
>>>>>> http://mysite/news
>>>>>>
>>>>>> That URI actually gets mapped to a file on the web server, say
>>>>>> http://mysite/news/index.html, but the http://mysite/news URI does
>>>>>> exist and gets sent as the documentURI to addOrReplaceDocument().
>>>>>>
>>>>>> MCF's FileSystem connector gets the http://mysite/news URL and
>>>>>> creates a path for saving that content that looks like this:
>>>>>> http/mysite/news, where news is a file.
>>>>>>
>>>>>> But then if the site also defines a URL like this
>>>>>> http://mysite/news/local/today.html, MCF's FileSystem connector
>>>>>> fails trying to create the directory http/mysite/news/local because
>>>>>> part of it, http/mysite/news, already exists as a file.
>>>>>>
>>>>>> Of course, if the URIs are crawled in the reverse order, the file
>>>>>> can't be created because a directory already exists with that name.
>>>>>>
>>>>>> Make sense?
>>>>>>
>>>>>> The real killer is that when this happens it's fatal to the job. That
>>>>>> is, it doesn't just fail to get that one URL; the connector returns a
>>>>>> fatal error and the crawl is stopped.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
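
The file-versus-directory conflict described in the original report can be
sketched in a few lines. This is only an illustration of one possible
workaround, not what the FileSystem output connector or wget actually does.
The function names (url_to_path, save) and the "_index" placeholder file name
are hypothetical, chosen for this example. The idea: when a path component
already exists as a file, move that file's content into a directory of the
same name; when the target already exists as a directory, store the content
inside it under the placeholder name.

```python
import os
import tempfile
from urllib.parse import urlsplit

# Hypothetical file name for content whose URL is also a directory prefix.
PLACEHOLDER = "_index"


def url_to_path(root, url):
    """Map a URL to root/<scheme>/<host>/<path segments...>."""
    parts = urlsplit(url)
    segments = [parts.scheme, parts.netloc]
    segments += [s for s in parts.path.split("/") if s]
    return os.path.join(root, *segments)


def save(root, url, data):
    """Write data for url under root, resolving file/directory clashes."""
    target = url_to_path(root, url)
    rel = os.path.relpath(os.path.dirname(target), root)
    current = root
    for segment in rel.split(os.sep):
        if segment == ".":
            continue
        current = os.path.join(current, segment)
        if os.path.isfile(current):
            # An earlier document was saved here as a file: turn it into a
            # directory and keep the old content under the placeholder name.
            tmp = current + ".tmp"
            os.rename(current, tmp)
            os.mkdir(current)
            os.rename(tmp, os.path.join(current, PLACEHOLDER))
        elif not os.path.isdir(current):
            os.mkdir(current)
    if os.path.isdir(target):
        # Reverse crawl order: the directory already exists, so store the
        # content inside it instead of failing.
        target = os.path.join(target, PLACEHOLDER)
    with open(target, "wb") as f:
        f.write(data)
    return target
```

With this sketch, saving http://mysite/news and then
http://mysite/news/local/today.html (or the same two URLs in the reverse
order) succeeds in both cases, and the content for http://mysite/news ends up
at http/mysite/news/_index. A real fix would also need to handle a document
whose last path segment is literally the placeholder name.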
