Re: web connector : links extraction issues

2018-11-15 Thread Karl Wright
Hi Olivier, The HTML parser built into MCF is quite resilient against badly formed HTML, but there are limits. Characters like "<" and ">" are used to denote tags and thus they confuse the parser when they are present in unescaped form. It may be possible, with a fair bit of work, to handle
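
To illustrate the escaping mentioned above: a page that wants a literal "<" or ">" in its text should emit character references so that no HTML parser mistakes them for tag delimiters. The sketch below is not MCF code; the class and method names are hypothetical and it only shows what the escaped form looks like.

```java
// Hypothetical helper, not part of ManifoldCF: escapes the characters
// that an HTML parser would otherwise treat as markup.
class HtmlEscapeSketch {
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;   // must be escaped so existing text is not read as an entity
                case '<': out.append("&lt;"); break;    // would otherwise open a tag
                case '>': out.append("&gt;"); break;    // would otherwise close a tag
                case '"': out.append("&quot;"); break;  // matters inside attribute values
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this applied on the site side, text such as `price < 10` is served as `price &lt; 10`, which a link extractor can skip over without losing track of the surrounding tags.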

Re: web connector : links extraction issues

2018-11-15 Thread Olivier Tavard
Hi Karl, Thanks for your answer. Could you elaborate, please? Just to better understand: you mean that there is no chance that special characters could be escaped in the MCF code in this case, i.e. the website itself needs to escape the special characters, otherwise the extraction will

Re: web connector : links extraction issues

2018-11-15 Thread Karl Wright
Hi Olivier,

You can create a ticket but I don't have a good solution for you in any case.

Karl

On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard <olivier.tav...@francelabs.com> wrote:
> Hi Karl,
>
> Do you think that I need to create a Jira issue relative to this bug ie
> that the links

Re: Error Job stop after repeatidly interruption

2018-11-15 Thread Karl Wright
(1) I increased the retries so that they continue for at least 10 minutes.
(2) I handled the 503 response explicitly, with the same logic.

See: https://issues.apache.org/jira/browse/CONNECTORS-1556

Karl

On Thu, Nov 15, 2018 at 3:35 AM Bisonti Mario wrote:
> Yes, Karl.
>
> Is it possible to apply the same
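
The two changes described in this message (a retry window of at least 10 minutes, and treating an HTTP 503 response as retriable) can be sketched roughly as follows. This is not the change committed under CONNECTORS-1556; the class name, method name, and parameters are hypothetical, and the request is abstracted as a supplier of status codes so the example stays self-contained.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch, not ManifoldCF code: retry a request while the
// server answers 503, giving up once the retry window has elapsed.
class RetrySketch {
    static int executeWithRetry(IntSupplier request, long maxRetryMillis, long pauseMillis) {
        long deadline = System.currentTimeMillis() + maxRetryMillis;
        while (true) {
            int status = request.getAsInt();
            // Anything other than 503 is returned to the caller immediately.
            if (status != 503) {
                return status;
            }
            // Still 503 after the retry window: stop retrying and report it.
            if (System.currentTimeMillis() >= deadline) {
                return status;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt and stop retrying.
                Thread.currentThread().interrupt();
                return status;
            }
        }
    }
}
```

A caller wanting a 10-minute window might invoke it as `RetrySketch.executeWithRetry(request, 10L * 60L * 1000L, 5000L)`, where the 5-second pause is an arbitrary illustrative value.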

Re: Error Job stop after repeatidly interruption

2018-11-15 Thread Karl Wright
Hi Mario,

Here's the code:

>>
try {
  //System.out.println("About to do a content PUT");
  response = this.httpClient.execute(tikaHost, httpPut);
  //System.out.println("... content PUT succeeded");
} catch (IOException e) {