Re: web connector : links extraction issues

2018-11-15 Thread Karl Wright
Hi Olivier, The HTML parser built into MCF is quite resilient against badly formed HTML, but there are limits. Characters like "<" and ">" are used to denote tags and thus they confuse the parser when they are present in unescaped form. It may be possible, with a fair bit of work, to handle

Re: web connector : links extraction issues

2018-11-15 Thread Olivier Tavard
Hi Karl, Thanks for your answer. Could you elaborate, please? Just so I understand: do you mean that the MCF code cannot escape these special characters in this case, i.e. the website itself needs to escape them, otherwise the extraction will

Re: web connector : links extraction issues

2018-11-15 Thread Karl Wright
Hi Olivier, You can create a ticket, but I don't have a good solution for you in any case. Karl On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard <olivier.tav...@francelabs.com> wrote: > Hi Karl, > > Do you think that I need to create a Jira issue for this bug, i.e. > that the links

Re: web connector : links extraction issues

2018-10-30 Thread Olivier Tavard
Hi Karl, Thanks for your answer. I kept looking into this and found the problem. The JavaScript code inside the tags contained the character '<'. When it does, link extraction does not work with the web connector. To reproduce it, I created this page hosted in local Apache
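
The symptom described in this message can be checked outside ManifoldCF with a small script. The following is a minimal sketch, not MCF's parser and not the page from the original message: `PAGE` is a hypothetical test page, and `LinkAudit` and `audit` are illustrative names. It uses Python's standard `html.parser`, which treats the body of a script element as raw text, to list the links on a page and to count inline scripts that contain a '<' character. Pages with a non-zero count are candidates for the extraction problem discussed in this thread.

```python
from html.parser import HTMLParser

# Hypothetical test page (an assumption, not the page from the thread):
# an inline script containing an unescaped '<', followed by an ordinary link.
PAGE = """<html><body>
<script>for (var i = 0; i < 10; i++) { count(i); }</script>
<a href="/after-script.html">link after the script</a>
</body></html>"""


class LinkAudit(HTMLParser):
    """Collect href values and count script bodies that contain '<'."""

    def __init__(self):
        super().__init__()
        self.links = []            # href values of all <a> tags seen
        self.suspect_scripts = 0   # inline scripts containing '<'
        self._in_script = False
        self._flagged = False      # current script already counted?

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._flagged = False
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as raw text, so '<' is visible as-is.
        if self._in_script and "<" in data and not self._flagged:
            self.suspect_scripts += 1
            self._flagged = True


def audit(page):
    """Parse a page and return the populated LinkAudit instance."""
    parser = LinkAudit()
    parser.feed(page)
    parser.close()
    return parser
```

Calling `audit(PAGE)` and inspecting `.links` and `.suspect_scripts` shows whether a script-aware parser still sees the link that follows the script, which helps separate a problem in the page from a limitation of the crawler's parser.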

Re: web connector : links extraction issues

2018-10-29 Thread Karl Wright
Hi Olivier, JavaScript included in a page is not evaluated by the Web Connector. In fact, no JavaScript is executed at all. Therefore it should not matter what is included via JavaScript. Thanks, Karl On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <olivier.tav...@francelabs.com> wrote: > Hi, > >

web connector : links extraction issues

2018-10-29 Thread Olivier Tavard
Hi, Regarding the web connector, I noticed that for specific websites, some JavaScript code can prevent the web connector from correctly fetching all the links present on the page. Specifically, for websites that contain a deprecated version of the New Relic web agent as