Re: Broken MFC Documentation Link

2020-07-02 Thread Karl Wright
The problem is that the documentation build for the 2.16 release silently
failed on a formatting issue.  I've fixed it in trunk, but the only way to
have documentation online for 2.16 is to do a point release, e.g. 2.16.1.

Karl


On Thu, Jul 2, 2020 at 1:26 PM Cihad Guzel  wrote:

> Hi Karl,
>
> The documentation link for the latest ManifoldCF version is broken:
> https://manifoldcf.apache.org/release/release-2.16/en_US/index.html
>
> Kind Regards,
> Cihad Güzel
>


[jira] [Resolved] (CONNECTORS-1646) Notification Connector JDK > 11

2020-06-26 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1646.
-
Fix Version/s: ManifoldCF 2.17
   Resolution: Fixed

r1879219


> Notification Connector JDK > 11
> ---
>
> Key: CONNECTORS-1646
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1646
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Email connector
>Affects Versions: ManifoldCF 2.16
>Reporter: Uwe Wolfinger
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.17
>
>
> When running MCF with a JDK >=11 there is a problem with the Email 
> notification connector. Whenever it tries to send a mail, the following error 
> is thrown:
> FATAL 2020-06-25T14:58:43,097 (Job reset thread) - Error tossed: javax/activation/DataSource
> java.lang.NoClassDefFoundError: javax/activation/DataSource
>  at org.apache.manifoldcf.crawler.notifications.email.EmailSession.send(EmailSession.java:95) ~[?:?]
>  at org.apache.manifoldcf.crawler.notifications.email.EmailConnector$SendThread.run(EmailConnector.java:963) ~[?:?]
> Caused by: java.lang.ClassNotFoundException: javax.activation.DataSource
>  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:582) ~[?:?]
>  at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
>  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
>  ... 2 more
>  
> This is because the activation jars are no longer part of the JDK:
> [https://docs.oracle.com/en/java/javase/11/migrate/index.html#JSMIG-GUID-561005C1-12BB-455C-AD41-00455CAD23A6]
> A verified solution is to put
> javax.activation-1.2.0.jar
> and
> javax.activation-api-1.2.0.jar
> in the lib directory and add them to the classpath (cp) in jetty-options.env... and 
> options.env...
>  
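
One way to confirm whether the activation classes are visible to a given JVM
is a small classpath probe.  The sketch below is a hypothetical diagnostic,
not part of ManifoldCF: the names ActivationCheck and isLoadable are made up
for illustration, and the answer only reflects the class loader the probe
runs under, which may differ from the one ManifoldCF uses for connectors.

```java
// Hypothetical diagnostic, not part of ManifoldCF.
final class ActivationCheck {

    /** Returns true if the named class can be found by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ActivationCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class named in the stack trace above.
        String name = "javax.activation.DataSource";
        System.out.println(name + (isLoadable(name) ? " is on the classpath" : " is MISSING"));
    }
}
```

Running it with the same JDK and classpath as the failing process should show
whether the two jars named in the report are actually being picked up.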



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1646) Notification Connector JDK > 11

2020-06-25 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144988#comment-17144988
 ] 

Karl Wright commented on CONNECTORS-1646:
-

Hi,

Please verify that this still works for you if you put these jars into the 
connector-common-lib folder instead (and make no changes to the *.env files).








[jira] [Assigned] (CONNECTORS-1646) Notification Connector JDK > 11

2020-06-25 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1646:
---

Assignee: Karl Wright






Re: release-2.15 ant build is failing in my local machine

2020-06-23 Thread Karl Wright
https://manifoldcf.apache.org/en_US/download.html#Latest+2.x+release+%28Apache+ManifoldCF+2.16%2C+2020+May+06%29



On Tue, Jun 23, 2020 at 8:28 AM Kirankumar Mothukuri <
kirankumar.mothuk...@datafabricx.com> wrote:

> Thank you for the response, Karl.
>
> Could you please tell me where I can download the lib distribution for
> released versions?
>
>
>
>
> On 2020/06/22 13:55:47, Karl Wright  wrote:
> > Hi Kirankumar,
> >
> > Please download the lib distribution for released versions.  The ant
> > build references URLs which seemingly change frequently so you can really
> > only reliably build using the provided libs.  Trunk, however, should be
> > buildable at all times, because we work to keep it up to date.
> >
> > Karl
> >
> >
> > On Mon, Jun 22, 2020 at 9:53 AM Kirankumar Mothukuri <
> > kirankumar.mothuk...@datafabricx.com> wrote:
> >
> > > Hello Team,
> > >
> > > I have cloned the manifoldcf release-2.15 code and tried to build it on
> > > my local machine, and I am getting the error below. Can someone tell me
> > > what the problem is here?
> > >
> > > BUILD FAILED
> > > D:\Datafabricx\Datafabricx workspace\manifoldcf\build.xml:1514:
> > > java.net.UnknownHostException: repo2.maven.org
> > >
> > >
> > >
> > >
> >
>


Re: release-2.15 ant build is failing in my local machine

2020-06-22 Thread Karl Wright
Hi Kirankumar,

Please download the lib distribution for released versions.  The ant build
references URLs which seemingly change frequently so you can really only
reliably build using the provided libs.  Trunk, however, should be
buildable at all times, because we work to keep it up to date.

Karl


On Mon, Jun 22, 2020 at 9:53 AM Kirankumar Mothukuri <
kirankumar.mothuk...@datafabricx.com> wrote:

> Hello Team,
>
> I have cloned the manifoldcf release-2.15 code and tried to build it on my
> local machine, and I am getting the error below. Can someone tell me what
> the problem is here?
>
> BUILD FAILED
> D:\Datafabricx\Datafabricx workspace\manifoldcf\build.xml:1514:
> java.net.UnknownHostException: repo2.maven.org
>
>
>
>


[jira] [Commented] (CONNECTORS-1645) Identical login regex rules bug

2020-06-10 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130948#comment-17130948
 ] 

Karl Wright commented on CONNECTORS-1645:
-

Sorry, fixed again.

r1878719


> Identical login regex rules bug
> ---
>
> Key: CONNECTORS-1645
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1645
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
>
> If a login sequence involves the same URL for different login types (e.g. form 
> and redirect), you can't configure the same regex for each of them; otherwise 
> they will override each other and only the last configured one will be 
> considered by the login sequence. 
> Currently the only workaround is to write a different regex for each login 
> type that matches the same URL.





Re: Sharepoint 2019

2020-06-10 Thread Karl Wright
The Sharepoint.dll would allow me to do the build, yes.  I'll email you
directly if you want to send it to me via google docs or some such.

Karl


On Wed, Jun 10, 2020 at 10:10 AM Shelly Singh 
wrote:

> Hi,
>
> Thanks for your response.
> It could be a tricky activity for me to build the plugin and change code
> if and when needed. I am only using the ManifoldCF as a blackbox and not
> familiar with code at all. Would really appreciate if you could add that
> support. I will also try finding someone who can do this, but chances of
> success by that route are bleak.
>
> If it would help, I could share the Sharepoint.dll from 2019 sharepoint.
>
> Thanks!
> Shelly
>
>
>
> On 2020/06/10 05:48:01, Karl Wright  wrote:
> > Hi,
> > One is not available yet.  In order to build one I need a copy of the
> > Sharepoint.dll from a Sharepoint 2019 instance and some time.
> >
> > Karl
> >
> >
> > On Wed, Jun 10, 2020 at 1:30 AM Shelly Singh 
> > wrote:
> >
> > > I am looking for Sharepoint 2019 plugin. Is one available?
> > >
> >
>


Re: Sharepoint 2019

2020-06-10 Thread Karl Wright
Forgot the svn path:

https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2019/trunk

Karl


On Wed, Jun 10, 2020 at 2:07 AM Karl Wright  wrote:

> I've set up an svn path for this plugin.  If you can "svn co" this path it
> should give you the plugin source (no different from 2016 I hope) plus all
> the necessary instructions for building.  So, theoretically, you could
> build it yourself, BUT:
>
> - you need to find the version number of the sharepoint DLL and put it
> into the appropriate file before you can do that, and
> - if there are any Sharepoint API method signature changes there may need
> to be code changes, and
> - access to SharePoint via web services is deprecated and I have no idea
> if 2019 even contains it.
>
> Microsoft wants people to use REST access now, and that requires a
> complete redevelopment of the connector, something that's too massive to
> undertake myself.  Co-development might be possible; I did begin work two
> years ago and would have to dust that off.  If you are able to build the
> plugin, on the other hand, chances are good it will just work.  If you
> would like me to build a distribution release, I will need the DLL in order
> to be able to do that.
>
> Karl
>
>
> On Wed, Jun 10, 2020 at 1:48 AM Karl Wright  wrote:
>
>> Hi,
>> One is not available yet.  In order to build one I need a copy of the
>> Sharepoint.dll from a Sharepoint 2019 instance and some time.
>>
>> Karl
>>
>>
>> On Wed, Jun 10, 2020 at 1:30 AM Shelly Singh 
>> wrote:
>>
>>> I am looking for Sharepoint 2019 plugin. Is one available?
>>>
>>


Re: Sharepoint 2019

2020-06-10 Thread Karl Wright
I've set up an svn path for this plugin.  If you can "svn co" this path it
should give you the plugin source (no different from 2016 I hope) plus all
the necessary instructions for building.  So, theoretically, you could
build it yourself, BUT:

- you need to find the version number of the sharepoint DLL and put it into
the appropriate file before you can do that, and
- if there are any Sharepoint API method signature changes there may need
to be code changes, and
- access to SharePoint via web services is deprecated and I have no idea if
2019 even contains it.

Microsoft wants people to use REST access now, and that requires a complete
redevelopment of the connector, something that's too massive to undertake
myself.  Co-development might be possible; I did begin work two years ago
and would have to dust that off.  If you are able to build the plugin, on
the other hand, chances are good it will just work.  If you would like me
to build a distribution release, I will need the DLL in order to be able to
do that.

Karl


On Wed, Jun 10, 2020 at 1:48 AM Karl Wright  wrote:

> Hi,
> One is not available yet.  In order to build one I need a copy of the
> Sharepoint.dll from a Sharepoint 2019 instance and some time.
>
> Karl
>
>
> On Wed, Jun 10, 2020 at 1:30 AM Shelly Singh 
> wrote:
>
>> I am looking for Sharepoint 2019 plugin. Is one available?
>>
>


Re: Sharepoint 2019

2020-06-09 Thread Karl Wright
Hi,
One is not available yet.  In order to build one I need a copy of the
Sharepoint.dll from a Sharepoint 2019 instance and some time.

Karl


On Wed, Jun 10, 2020 at 1:30 AM Shelly Singh 
wrote:

> I am looking for Sharepoint 2019 plugin. Is one available?
>


Re: Web connector cookie cache

2020-06-04 Thread Karl Wright
The cookies in the cookie cache expire after a fairly short amount of time,
an hour I believe.  So this would have resolved itself in any case, about an
hour after your first attempt.

The web UI connector contributions do not have a facility for
connector-added buttons.  The cookie management table is created and managed
by the web connector.  I do not see an easy way around these restrictions.

Karl


On Thu, Jun 4, 2020 at 4:34 PM  wrote:

> Hi Karl,
>
>
>
> I noticed that the cookies used by the web connector are stored both into
> memory and in the cookiedata table of the manifold database. The cookiedata
> table still keeps cookies of a connector even if this one is removed from
> MCF admin UI. This can lead to problematic behaviors. Let me explain:
>
>
>
> I have configured a login sequence to cover the following use case:
>
>
>
> URL=test.com
>
> Step1 = form a (set cookie)
>
> Step2 = redirect b (set cookie)
>
> Step3 = redirect c (set cookie)
>
>
>
> The 3 different cookies are required to be able to crawl the wanted website,
> but I made a mistake in the configuration and the login sequence was
> interrupted at step 2. So the connector retrieved 2 cookies, then ended up
> in an infinite loop. I corrected the configuration, but when I restarted
> the job, it did not work. By checking the logs I noticed that the job was
> sending the 2 retrieved cookies at Step1, and the problem was that with
> those cookies, the form behaves differently: it does not redirect to 'b'
> (Step2) but returns a 200 OK response, which ends the login sequence
> prematurely. As a consequence, the third required cookie was never
> retrieved.
>
> The solution was simple, I needed to remove the cookies so that the job
> restarts with an empty cookie cache for the website. Indeed it worked, but
> to be able to do that I had to:
>
> 1.  remove the cookies from the cookiedata table
> 2.  reboot the mcf agent so that its in memory cache was emptied.
>
>
>
> Without those manipulations, the job always used the cookies (even deleting
> and recreating the job and connector did not work).
>
>
>
> Would it be possible to add a button in the connector's view that removes
> the cookies from the cookiedata table and the in-memory cache, in order to
> avoid such manipulations?
>
>
>
> Julien
>
>
>
>


Re: Window shares job-Error ERROR: invalid byte sequence for encoding "UTF8": 0x00

2020-06-03 Thread Karl Wright
This is a Postgresql problem of some kind.  It could be the network
connection between your ManifoldCF process(es) and the Postgresql server.
If it's repeating I'd worry about it, otherwise it will recover.

Karl


On Wed, Jun 3, 2020 at 3:58 AM ritika jain  wrote:

> Hi All,
>
> I am using the Windows shares connector, with ES as the output connector and
> Postgres as the database, in ManifoldCF 2.14.
>
> The job is to crawl almost 20 lakh (2 million) records.
> When I checked the logs I found this error:
>
>
> Worker thread aborting and restarting due to database connection reset:
> Database exception: SQLException doing query (22021): ERROR: invalid byte
> sequence for encoding "UTF8": 0x00
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
> exception: SQLException doing query (22021): ERROR: invalid byte sequence
> for encoding "UTF8": 0x00
>
> I have not customized the code yet. Can anybody please let me know what is
> causing this error and how we can get rid of it?
>
> Thanks
> Ritika
>
>
>


[jira] [Commented] (CONNECTORS-1645) Identical login regex rules bug

2020-06-02 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124021#comment-17124021
 ] 

Karl Wright commented on CONNECTORS-1645:
-

r1878400







[jira] [Assigned] (CONNECTORS-1645) Identical login regex rules bug

2020-06-02 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1645:
---

Assignee: Karl Wright






Re: Web connector login sequence

2020-06-02 Thread Karl Wright
Thanks for the followup.

If you could create a ticket for this, and supply as much information as
possible, I'll try to look at it and understand the implications for the
session login model as it stands today.  I believe I presumed that the same
URL wouldn't be used for wildly different kinds of things; as you say,
turning this into a list rather than a map may make all the difference.
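
The difference between keying the rules by regex and keeping them in a list
can be sketched as follows.  This is an illustration only, not the web
connector's code: LoginRuleDemo, Rule, asMap and asList are hypothetical
names, and the two rules are the ones quoted in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the web connector's data model.
final class LoginRuleDemo {

    /** A URL regex plus the login page type it applies to. */
    static final class Rule {
        final String regex;
        final String type;

        Rule(String regex, String type) {
            this.regex = regex;
            this.type = type;
        }
    }

    /** Keyed by regex: a later rule with the same regex replaces the earlier one. */
    static Map<String, String> asMap(List<Rule> rules) {
        Map<String, String> byRegex = new LinkedHashMap<>();
        for (Rule r : rules) {
            byRegex.put(r.regex, r.type);
        }
        return byRegex;
    }

    /** Kept as a list: every rule survives, including rules with identical regexes. */
    static List<Rule> asList(List<Rule> rules) {
        return new ArrayList<>(rules);
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("other-site\\/cas\\/login", "redirect"));
        rules.add(new Rule("other-site\\/cas\\/login", "form"));

        System.out.println("map entries:  " + asMap(rules).size());
        System.out.println("list entries: " + asList(rules).size());
    }
}
```

With a map, only the last rule for a given regex remains, which matches the
symptom reported; a list keeps both, at the cost of having to decide which
rule applies when several match.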

Karl


On Tue, Jun 2, 2020 at 7:34 AM  wrote:

> Hi Karl,
>
>
>
> Thanks for your answer.
>
>
>
> The login sequence I configured was the problem, but not because some parts
> were missing. The main problem was that I entered the same regular
> expression to address two different login types: a login page and a
> redirect page.
> I did not check the code, but it seems that the connector saves the login
> sequence into a HashMap with the login regex as key. So my redirect rule
> “other-site\/cas\/login = redirect” was overridden by the form rule
> “other-site\/cas\/login = form”. This is why, in the debug log, the
> other-site 302 response was not recognized by the login sequence.
>
>
>
> I have modified the two rules so that the regexes are different, and it
> works!
>
>
>
> I hope my use case will help other people if they encounter the same
> problem.
>
>
>
> Note that the solution I implemented sounds to me more like a workaround
> than a solution. Let me explain: I was able to differentiate the regex
> rules by removing a letter in one of them:
> “other-site\/cas\/logi = redirect” vs “other-site\/cas\/login = form”. But
> this does not feel like a “clean” solution
>
>
>
> Regards,
> Julien
>
>
>
>
>
>
>
> *From:* Karl Wright 
> *Sent:* Friday, May 29, 2020 22:32
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Web connector login sequence
>
>
>
> Hi Julien,
>
> The login sequence must include all parts of the login sequence, from
> initiation (the first 302 that you get when you load /site) all the way
> through to the last action that sets the cookie.  After the login sequence
> is completed, the /site URL will be fetched again.  If you need more than
> one fetch to set more than one cookie, ALL the fetches must match your
> description of the login sequence or it will abort early.  If the cookie
> gets set on a final redirection, be sure to include that redirection too.
>
>
>
> Karl
>
>
>
>
>
> On Fri, May 29, 2020 at 12:01 PM  wrote:
>
> Hi MCF community,
>
>
>
> I need some help with the configuration of a login sequence with the Web
> connector. Here is the login sequence on a web browser :
>
>
>
> GET site/
>
> 302 -> site/login
>
> 302 -> other-site/cas/login
>
> 401 other-site/cas/login
>
> POST other-site/cas/login (set cookie)
>
> 302 -> site/login?param1=value (set cookie)
>
> 302 -> site/login?param1=value (set cookie)
>
> 302 -> site/
>
>
>
> I tested the following conf :
>
>
>
> Session: site
>
>   site\/login = redirect
>
>   other-site\/cas\/login = redirect
>
>   other-site\/cas\/login = form
>
>     username=john
>
>     password=***
>
>
>
> This configuration works till the form POST, after the form POST, the
> first cookie is correctly retrieved by the job but then it ends up in an
> infinite loop. Here are the debug logs:
>
>
>
> ….
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: For
> https://other-site/cas/login, setting virtual host to other-site
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Got an HttpClient object
> after 1 ms.
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Post method for
> '/cas/login'
>
> …..
>
> DEBUG 2020-05-29T15:07:18,442 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Retrieving cookies...
>
> DEBUG 2020-05-29T15:07:18,442 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB:   Cookie '[version:
> 0]xx
>
> INFO 2020-05-29T15:07:18,448 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: FETCH LOGIN|
> https://other-site/cas/login|1590764838416+31|302|0|
>
> DEBUG 2020-05-29T15:07:18,448 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Document '
> https://other-site/cas/login' did not match expected form, link,
> redir

Re: Crawling / Indexation Query

2020-05-30 Thread Karl Wright
We can't.  You need to follow the instructions and send email to the
appropriate address, listed here:

http://manifoldcf.apache.org/en_US/mail.html

Karl


On Sat, May 30, 2020 at 4:40 PM Shashank Saurabh 
wrote:

> Please unsubscribe me from your mailing list.
>
> Thanks,
> Shashank
>
> On Thu, May 7, 2020 at 4:11 PM Karl Wright  wrote:
>
>> Hi,
>>
>> ManifoldCF is not a crawler, it's a synchronizer.  If robots says not to
>> crawl something, then it will not be indexed.  If robots is changed to
>> prohibit crawling of certain documents, then yes, those documents will be
>> removed from the index.
>>
>> But you can override the robots behavior in the document specification or
>> configuration, I believe.
>>
>> Karl
>>
>>
>> On Thu, May 7, 2020 at 6:27 AM ritika jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> Can anybody explain: if a URL was indexed, and a noindex tag was added
>>> afterwards, will that URL be deleted from the index when the crawler
>>> visits it again?
>>>
>>>
>>> Say a URL previously had a meta tag that allowed indexing and was
>>> present in the Elastic index, but later
>>> 
>>> was added to the page design.
>>>
>>> Should it be deleted from the index when the ManifoldCF job crawls that
>>> URL again, or will the URL still be present in the index?
>>>
>>> Thanks
>>>
>>>
>>>
>>


Re: Web connector login sequence

2020-05-29 Thread Karl Wright
Hi Julien,

The login sequence must include all parts of the login sequence, from
initiation (the first 302 that you get when you load /site) all the way
through to the last action that sets the cookie.  After the login sequence
is completed, the /site URL will be fetched again.  If you need more than
one fetch to set more than one cookie, ALL the fetches must match your
description of the login sequence or it will abort early.  If the cookie
gets set on a final redirection, be sure to include that redirection too.
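
The requirement that every fetch be covered can be checked by hand before a
job is run.  The sketch below is a hypothetical helper, not connector code:
it assumes each rule is applied as an unanchored regex search against the
full URL (the connector's actual matching logic may differ), and the
https://site prefix is assumed for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical pre-flight check; not part of the web connector.
final class LoginCoverageCheck {

    /** Returns the first URL that no regex matches, or null if every URL is covered. */
    static String firstUncovered(List<String> urls, List<String> regexes) {
        for (String url : urls) {
            boolean covered = false;
            for (String regex : regexes) {
                // find() searches anywhere in the URL (an assumption about matching).
                if (Pattern.compile(regex).matcher(url).find()) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hops observed in the browser, in order.
        List<String> hops = Arrays.asList(
            "https://site/login",
            "https://other-site/cas/login",
            "https://site/login?param1=value");

        // A deliberately incomplete rule set: the CAS hop is not described.
        List<String> rules = Arrays.asList("site\\/login");

        System.out.println("first uncovered hop: " + firstUncovered(hops, rules));
    }
}
```

Any hop reported as uncovered is a point where a login sequence described by
those rules would be expected to stop early.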

Karl


On Fri, May 29, 2020 at 12:01 PM  wrote:

> Hi MCF community,
>
>
>
> I need some help with the configuration of a login sequence with the Web
> connector. Here is the login sequence on a web browser :
>
>
>
> GET site/
>
> 302 -> site/login
>
> 302 -> other-site/cas/login
>
> 401 other-site/cas/login
>
> POST other-site/cas/login (set cookie)
>
> 302 -> site/login?param1=value (set cookie)
>
> 302 -> site/login?param1=value (set cookie)
>
> 302 -> site/
>
>
>
> I tested the following conf :
>
>
>
> Session: site
>
>   site\/login = redirect
>
>   other-site\/cas\/login = redirect
>
>   other-site\/cas\/login = form
>
>     username=john
>
>     password=***
>
>
>
> This configuration works till the form POST, after the form POST, the
> first cookie is correctly retrieved by the job but then it ends up in an
> infinite loop. Here are the debug logs:
>
>
>
> ….
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: For
> https://other-site/cas/login, setting virtual host to other-site
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Got an HttpClient object
> after 1 ms.
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Post method for
> '/cas/login'
>
> …..
>
> DEBUG 2020-05-29T15:07:18,442 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Retrieving cookies...
>
> DEBUG 2020-05-29T15:07:18,442 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB:   Cookie '[version:
> 0]xx
>
> INFO 2020-05-29T15:07:18,448 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: FETCH LOGIN|
> https://other-site/cas/login|1590764838416+31|302|0|
> 
>
> DEBUG 2020-05-29T15:07:18,448 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Document '
> https://other-site/cas/login' did not match expected form, link,
> redirection, or content for sequence 'site'
>
> ….
>
>
>
> It seems that the redirection after the form POST is not considered by the
> job but I don’t know why. After that, there is an infinite loop where the
> cookie is passed on the GET “site/login” which redirects to
> “other-site/login”, but this time, when “other-site/login” get the cookie
> in the request, it does not send a 302 redirect response code but a 200 OK
>
>
>
> I don’t know why this behavior occurs and I would be glad to have your
> advice!
>
>
>
> Thanks for your help
>
>
>
> Julien
>
>
>


[jira] [Resolved] (CONNECTORS-1644) LDAPAuthority.java - group search by dn encoding/escaping

2020-05-29 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1644.
-
Fix Version/s: ManifoldCF 2.17
   Resolution: Fixed

r1878269


> LDAPAuthority.java - group search by dn encoding/escaping
> -
>
> Key: CONNECTORS-1644
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1644
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: LDAP authority
>Affects Versions: ManifoldCF 2.15
>Reporter: Uwe Wolfinger
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.17
>
>
> I just came across a problem with escaping, when searching groups by dn.
> A person has the following dn:
> cn=John\2C Doe,ou=Internal,ou=Users,ou=ORG,o=comp
> which results in:
> cn=John\5c2C Doe,ou=Internal,ou=Users,ou=ORG,o=comp
> after passing escapeLDAPSearchFilter.
> With a groupSearch Filter of "(&(objectClass=groupOfNames)(member={0}))" the 
> String that is sent to the LDAP Server is:
> (&(objectClass=groupOfNames)(member=cn=John5c2C 
> Doe,ou=Internal,ou=Users,ou=ORG,o=comp))
> -> this leads to an empty result set, as the \ disappeared.
> Changing 
> String searchFilter = groupSearch.replaceAll("\\{0\\}", escapedDN);
> to
> String searchFilter = groupSearch.replace("{0}", escapedDN);
> the following searchFilter is used, which is correct and leads to results:
> (&(objectClass=groupOfNames)(member=cn=John\5c2C 
> Doe,ou=Internal,ou=Users,ou=ORG,o=comp))
> So it seems that there is a problem with escaping/encoding when using the 
> regex based replaceAll method.
> Is there a reason to use replaceAll instead of replace at this position? 
> Would it be a problem, to use the simple string replace method?
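
The behavior reported above follows from how String.replaceAll treats its
replacement argument: a backslash escapes the next character and is itself
dropped, and '$' introduces a group reference.  A minimal sketch, assuming
the filter placeholder is the literal text {0} and using the DN from the
report (ReplaceDemo and its method names are hypothetical; this is not the
LDAPAuthority code):

```java
import java.util.regex.Matcher;

// Hypothetical demonstration; not the LDAPAuthority implementation.
final class ReplaceDemo {

    static final String FILTER = "(&(objectClass=groupOfNames)(member={0}))";

    /** Regex-based: a backslash in the replacement escapes the next char and is lost. */
    static String withReplaceAll(String dn) {
        return FILTER.replaceAll("\\{0\\}", dn);
    }

    /** Literal replacement: the DN is inserted unchanged. */
    static String withReplace(String dn) {
        return FILTER.replace("{0}", dn);
    }

    /** Regex-based, with the replacement quoted so that '\' and '$' stay literal. */
    static String withQuotedReplaceAll(String dn) {
        return FILTER.replaceAll("\\{0\\}", Matcher.quoteReplacement(dn));
    }

    public static void main(String[] args) {
        // In Java source, "\\5c" is the three characters \5c.
        String dn = "cn=John\\5c2C Doe,ou=Internal,ou=Users,ou=ORG,o=comp";
        System.out.println("replaceAll:        " + withReplaceAll(dn));
        System.out.println("replace:           " + withReplace(dn));
        System.out.println("quoted replaceAll: " + withQuotedReplaceAll(dn));
    }
}
```

String.replace performs a literal substitution, so it is the simpler choice
when no regex is needed; Matcher.quoteReplacement is the alternative when the
search pattern has to remain a regex.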





[jira] [Updated] (CONNECTORS-1638) JCIFS connector optional hidden files

2020-05-29 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1638:

Fix Version/s: (was: ManifoldCF 2.16)
   ManifoldCF next

> JCIFS connector optional hidden files
> -
>
> Key: CONNECTORS-1638
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1638
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: JCIFS connector
>Affects Versions: ManifoldCF 2.15
>Reporter: Cihad Guzel
>Assignee: Cihad Guzel
>Priority: Major
> Fix For: ManifoldCF next
>
>
> The JCIFS connector should optionally index hidden files.





[jira] [Assigned] (CONNECTORS-1644) LDAPAuthority.java - group search by dn encoding/escaping

2020-05-29 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1644:
---

Assignee: Karl Wright

> LDAPAuthority.java - group search by dn encoding/escaping
> -
>
> Key: CONNECTORS-1644
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1644
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: LDAP authority
>Affects Versions: ManifoldCF 2.15
>Reporter: Uwe Wolfinger
>Assignee: Karl Wright
>Priority: Major
>
> I just came across a problem with escaping, when searching groups by dn.
> A person has the following dn:
> cn=John\2C Doe,ou=Internal,ou=Users,ou=ORG,o=comp
> which results in:
> cn=John\5c2C Doe,ou=Internal,ou=Users,ou=ORG,o=comp
> after passing escapeLDAPSearchFilter.
> With a groupSearch filter of "(&(objectClass=groupOfNames)(member={0}))" the 
> String that is sent to the LDAP server is:
> (&(objectClass=groupOfNames)(member=cn=John5c2C 
> Doe,ou=Internal,ou=Users,ou=ORG,o=comp))
> -> this leads to an empty result set, as the \ disappeared.
> Changing 
> String searchFilter = groupSearch.replaceAll("\\{0\\}", escapedDN);
> to
> String searchFilter = groupSearch.replace("{0}", escapedDN);
> the following searchFilter is used, which is correct and leads to results:
> (&(objectClass=groupOfNames)(member=cn=John\5c2C 
> Doe,ou=Internal,ou=Users,ou=ORG,o=comp))
> So it seems that there is a problem with escaping/encoding when using the 
> regex-based replaceAll method.
> Is there a reason to use replaceAll instead of replace at this position? 
> Would it be a problem to use the simple string replace method?
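
The behaviour reported above can be reproduced outside ManifoldCF. The sketch
below is only an illustration (the class and method names are made up, not
taken from LDAPAuthority.java): String.replaceAll treats a backslash in the
replacement string as an escape character and drops it, while String.replace
and Matcher.quoteReplacement keep the replacement literal.

```java
import java.util.regex.Matcher;

public class FilterSubstitutionDemo {
    // Regex-based substitution: a backslash in the replacement escapes the
    // next character, so the backslash itself is lost.
    static String withReplaceAll(String template, String escapedDN) {
        return template.replaceAll("\\{0\\}", escapedDN);
    }

    // Literal substitution: neither the target nor the replacement is parsed.
    static String withReplace(String template, String escapedDN) {
        return template.replace("{0}", escapedDN);
    }

    // Regex-based substitution with the replacement quoted, for cases where
    // replaceAll has to stay.
    static String withQuotedReplaceAll(String template, String escapedDN) {
        return template.replaceAll("\\{0\\}", Matcher.quoteReplacement(escapedDN));
    }

    public static void main(String[] args) {
        String template = "(&(objectClass=groupOfNames)(member={0}))";
        // The DN as it looks after filter escaping; "\\" is one backslash.
        String escapedDN = "cn=John\\5c2C Doe,ou=Internal,ou=Users,ou=ORG,o=comp";

        System.out.println(withReplaceAll(template, escapedDN));
        System.out.println(withReplace(template, escapedDN));
        System.out.println(withQuotedReplaceAll(template, escapedDN));
    }
}
```

A DN containing "$" would be affected in the same way by the unquoted
replaceAll call, since "$" introduces a group reference in the replacement.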





Re: URL Mapping

2020-05-28 Thread Karl Wright
That's a much better case for using the url mapper, yes.


On Thu, May 28, 2020 at 1:40 PM Michael Cizmar 
wrote:

> Right.  Another case that I'm exploring...crawling an internal site and
> wanting a load balanced url.  So you would crawl something like this:
>
> http://mystaging-server.myco.com/index.html
>
> and then want to change it to:
>
> https://www.myco.com/index.html
>
> Is that better for the url mapper?
>
>
>
> --
>
> Michael Cizmar
> Managing Director
>
> p: 312.585.6396
>
> d: 312.585.6286
> twitter: @michaelcizmar <http://twitter.com/michaelcizmar>
>
> http://www.mcplusa.com/
>
>
> --
> *From:* Karl Wright 
> *Sent:* Thursday, May 28, 2020 12:03 PM
> *To:* user@manifoldcf.apache.org 
> *Subject:* Re: URL Mapping
>
> Thanks!  It's far better to implement this than to try and hack it.  A
> general way of removing session information with regular expressions is
> probably not going to cut it either, so for now it's got to be in Java.
>
> Karl
>
>
> On Thu, May 28, 2020 at 12:47 PM Michael Cizmar <
> michael.ciz...@mcplusa.com> wrote:
>
> The "!ut" followed by a bunch of session information comes from WebSphere
> Portal.  Some information about it here:
>
> https://books.google.com/books?id=bqAXnpmj5LwC=PA180=PA180=%22!ut%22+session+variables+websphere#v=onepage=%22!ut%22%20session%20variables%20websphere=false
>
> I'll look at making a change to the web crawler to support this, like the
> BV and ASP.NET sessions.
>
> --
> *From:* Karl Wright 
> *Sent:* Thursday, May 28, 2020 11:41 AM
> *To:* user@manifoldcf.apache.org 
> *Subject:* Re: URL Mapping
>
> Hi,
>
> There are provisions in the URL canonicalization part of the world for
> removal of session information from the URL.  It only knows about some
> kinds of widely used sessions; java app server sessions, for example,
> Broadvision sessions, etc.  If you can convince me that your session
> information is (a) uniquely identifiable, and (b) commonly used, the proper
> approach is to incorporate session removal in this framework.  Please let
> me know.
>
> Karl
>
>
> On Thu, May 28, 2020 at 12:11 PM Michael Cizmar <
> michael.ciz...@mcplusa.com> wrote:
>
> I've got a really long url with a bunch of unnecessary session query
> string parameters.  I've been trying unsuccessfully to map it to the same
> url without the session.
>
> An example of the URL is below.  I thought I could do this:
>
> url map regular expression:
>
> (.*)\/!ut
>
> replacement configuration:
>
>
>
>
> So the goal would be for the URL to become:
>
> http://localhost:8080/mcplusa/myportal/agents/portal/quoteenroll/digs%20-%20quoting%20%20enrollment%20(individual)/
>
> But the url gets rejected.
>
> Sample Crawl Url
>
>
> http://localhost:8080/mcplusa/myportal/agents/portal/quoteenroll/digs%20-%20quoting%20%20enrollment%20(individual)/!ut/p/a1/rZHLTsMwEEV_hS6yjDx5OWZpdRFImzYCAYk3lZM6D5TYSWoqPh8HFu2GQhHejEeae-aOLmIoQ0zyY1tz3SrJu7lneLfdBtTxI1iRhzsMFEfrpZ_6AFFoBnIzAN88Cj_pXxBDrJR60A3KeS2kvimV1KZaMKhJ886C8U1pIeSkOtNM3Pz5QewO3IJG9WIGDGW7RzkB7hZFIWxyyx3bL8LAJo6L7QoELitMPAH7r4WXLefmpvBkOoqfiTHth6vYTRxIAT1eufMy8D74Z2DqXg2Mf5Fz-zqOjJq05nzeNcr-FpchuVOyTGpjkOvGbmWlUHYmQtmZCGWfoqF_6omHq83G5gUBL-iOa0oXiw9FOxLu/dl5/d5/L0lJS2FZcHBpbW1LYVlwcGltbVlwcGchIS9vSHd3QUFBSXdpRUFJSkRBQ1VZaUVJVTVCZ09DbFFBQUlBQVNvU0FyUnFBQURBQWF0QXdMTzlRQUFFQUJ3WWVBR0tTQUFDa0k1Z21HU3dTaXJTQUFDZ0s5ZzBIUS80SmlHcGhxRWFoR29ScUVhbEdwaC9aNl9PTzVBMTRHMEs4Ukg2MEE2R0xDNFA0MDBHNy9hZ2VudCBjb250ZW50JTBwb3J0YWwlMHF1b3RlZW5yb2xsJTBkaWdzIC0gcXVvdGluZyAgZW5yb2xsbWVudCAoaW5kaXZpZHVhbCkvZjQ0YmEyOWUtODQwOC00YjFlLTg4MzktMTFlMjI4NDgxYTVhL2RpZ3MgLSBxdW90aW5nICBlbnJvbGxtZW50IChpbmRpdmlkdWFsKQ
>
>
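
The rewrite asked for in this thread (drop everything from "!ut" onward and
keep the trailing slash) can be sketched in plain Java. This is only an
illustration using java.util.regex; the class and method names are made up,
the sample URL suffix is a shortened placeholder, and it is neither
ManifoldCF's canonicalization code nor the URL mapper's replacement syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PortalSessionStripper {
    // Capture everything up to and including the slash that precedes the
    // first "!ut/" segment; the rest of the URL is session state.
    private static final Pattern SESSION_SUFFIX = Pattern.compile("^(.*?/)!ut/.*$");

    // Returns the URL without the "!ut/..." suffix, or the URL unchanged
    // when no such segment is present.
    static String stripSession(String url) {
        Matcher m = SESSION_SUFFIX.matcher(url);
        return m.matches() ? m.group(1) : url;
    }

    public static void main(String[] args) {
        String crawled = "http://localhost:8080/mcplusa/myportal/agents/portal/quoteenroll/"
            + "digs%20-%20quoting%20%20enrollment%20(individual)/!ut/p/a1/abc/dl5/d5/xyz";
        System.out.println(stripSession(crawled));
    }
}
```

Whether "!ut" identifies such URLs uniquely enough is the open question
raised above, so a pattern like this would need checking against real portal
URLs before being relied on.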


Re: Error: Repeated service interruptions - failure processing document: Failed to acquire credits in time

2020-05-21 Thread Karl Wright
So the folder is accessible, but can you open the specific document
itself?  There may be an issue there unrelated to the folder.

If it does open OK, then I'm afraid you're beyond my knowledge of what the
problem might be.  The current JCIFS library comes from a Github project
and perhaps you can contact the maintainers to get them to interpret what
it means.  Sometimes just googling the precise error message (not
ManifoldCF's, but the underlying JCIFS error) can help clarify the issue.

Karl


On Thu, May 21, 2020 at 4:00 AM ritika jain 
wrote:

> Reply:-
> The smb exception means that it is coming from the JCIFS library, which is
> trying to find documents and their metadata from your windows shares, and
> is apparently not getting something it needs back promptly. Perhaps the
> user you are using to do the crawl has insufficient privileges? Also, the
> error you are seeing is a new one; I've never seen that before, so the
> connector hasn't either, and it basically doesn't know whether to skip the
> document or hard fail. But what I'd do is try to open the document yourself
> in Windows and find out whether it seems to work or not, for a start.
>
> Many thanks for your reply.
> I will follow the mail chain from now on.
> I have checked the user privileges. The user has all access rights.
> Manual access to the folders also works fine, and the folder is accessible.
> Is it possible that the Windows shares connector faces some problem while
> connecting (a network issue)?
>
> Thanks
> Ritika
>
> On Tue, May 19, 2020 at 2:39 PM Karl Wright  wrote:
>
>> I commented in the ticket you created.
>> Thanks,
>> Karl
>>
>> On Tue, May 19, 2020 at 3:07 AM ritika jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have configured a Units job (ManifoldCF 2.14, ES 7.6.2, and Postgres
>>> 9.6.10) on a server to access files from a Samba SMBv3 server, with
>>> jcifs-ng-2.1.2.jar loaded in ManifoldCF's lib directory.
>>>
>>> After ingesting some records into the index, I got this error in the logs:
>>>
>>> Unrecognized SmbException thrown getting document version for
>>> smb://store1.directory.intra/folders/UnitsTag1/Hydraulic Engineering/13 HYE
>>> Data/morelis/VSS/MatlabTools/data/s/srca.a
>>> jcifs.smb.SmbException: Failed to acquire credits in time.
>>>
>>> Can anybody please help me understand the possible cause of this error?
>>> Could it be a network connection issue or something else?
>>>
>>> For info: no authority connection/Active Directory is being used so far.
>>> Also, the "Use SID for security" checkbox in the ManifoldCF UI is
>>> UNCHECKED.
>>>
>>> Any help will be greatly appreciated.
>>>
>>> Thanks
>>> RItika
>>>
>>>
>>>
>>>
>>>
>>>


Re: Error: Repeated service interruptions - failure processing document: Failed to acquire credits in time

2020-05-19 Thread Karl Wright
I commented in the ticket you created.
Thanks,
Karl

On Tue, May 19, 2020 at 3:07 AM ritika jain 
wrote:

> Hi All,
>
> I have configured a Units job (ManifoldCF 2.14, ES 7.6.2, and Postgres
> 9.6.10) on a server to access files from a Samba SMBv3 server, with
> jcifs-ng-2.1.2.jar loaded in ManifoldCF's lib directory.
>
> After ingesting some records into the index, I got this error in the logs:
>
> Unrecognized SmbException thrown getting document version for
> smb://store1.directory.intra/folders/UnitsTag1/Hydraulic Engineering/13 HYE
> Data/morelis/VSS/MatlabTools/data/s/srca.a
> jcifs.smb.SmbException: Failed to acquire credits in time.
>
> Can anybody please help me understand the possible cause of this error?
> Could it be a network connection issue or something else?
>
> For info: no authority connection/Active Directory is being used so far.
> Also, the "Use SID for security" checkbox in the ManifoldCF UI is
> UNCHECKED.
>
> Any help will be greatly appreciated.
>
> Thanks
> RItika
>
>
>
>
>
>


[jira] [Commented] (CONNECTORS-1643) Error: Repeated service interruptions - failure processing document: Failed to acquire credits in time

2020-05-19 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110992#comment-17110992
 ] 

Karl Wright commented on CONNECTORS-1643:
-

The proper way to ask this question is not to open a bug ticket but to post it 
to the user mailing list, as described here:

http://manifoldcf.apache.org/en_US/mail.html

The smb exception means that it is coming from the JCIFS library, which is 
trying to find documents and their metadata from your windows shares, and is 
apparently not getting something it needs back promptly.  Perhaps the user you 
are using to do the crawl has insufficient privileges?  Also, the error you are 
seeing is a new one; I've never seen that before, so the connector hasn't 
either, and it basically doesn't know whether to skip the document or hard 
fail.  But what I'd do is try to open the document yourself in Windows and find 
out whether it seems to work or not, for a start.


> Error: Repeated service interruptions - failure processing document: Failed 
> to acquire credits in time
> --
>
> Key: CONNECTORS-1643
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1643
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JCIFS connector
>Affects Versions: ManifoldCF 2.14
>Reporter: Ritika Jain
>Priority: Major
> Fix For: ManifoldCF 2.14
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi All,
>  
> I have configured a Units job (ManifoldCF 2.14, ES 7.6.2, and Postgres 
> 9.6.10) on a server to access files from a Samba SMBv3 server, with 
> jcifs-ng-2.1.2.jar loaded in ManifoldCF's lib directory.
>  
> After ingesting some records into the index, I got this error in the logs:
>  
> Unrecognized SmbException thrown getting document version for 
> smb://store1.directory.intra/folders/UnitsTag1/Hydraulic Engineering/13 HYE 
> Data/morelis/VSS/MatlabTools/data/s/srca.a
> jcifs.smb.SmbException: Failed to acquire credits in time.
>  
> Can anybody please help me understand the possible cause of this error? 
> Could it be a network connection issue or something else?
>  
> For info: no authority connection/Active Directory is being used so far. 
> Also, the "Use SID for security" checkbox in the ManifoldCF UI is UNCHECKED.
>  
> Any help will be greatly appreciated.
>  
> Thanks
> RItika





[jira] [Assigned] (CONNECTORS-1643) Error: Repeated service interruptions - failure processing document: Failed to acquire credits in time

2020-05-19 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1643:
---

Assignee: Karl Wright

> Error: Repeated service interruptions - failure processing document: Failed 
> to acquire credits in time
> --
>
> Key: CONNECTORS-1643
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1643
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JCIFS connector
>Affects Versions: ManifoldCF 2.14
>Reporter: Ritika Jain
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.14
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi All,
>  
> I have configured a Units job (ManifoldCF 2.14, ES 7.6.2, and Postgres 
> 9.6.10) on a server to access files from a Samba SMBv3 server, with 
> jcifs-ng-2.1.2.jar loaded in ManifoldCF's lib directory.
>  
> After ingesting some records into the index, I got this error in the logs:
>  
> Unrecognized SmbException thrown getting document version for 
> smb://store1.directory.intra/folders/UnitsTag1/Hydraulic Engineering/13 HYE 
> Data/morelis/VSS/MatlabTools/data/s/srca.a
> jcifs.smb.SmbException: Failed to acquire credits in time.
>  
> Can anybody please help me understand the possible cause of this error? 
> Could it be a network connection issue or something else?
>  
> For info: no authority connection/Active Directory is being used so far. 
> Also, the "Use SID for security" checkbox in the ManifoldCF UI is UNCHECKED.
>  
> Any help will be greatly appreciated.
>  
> Thanks
> RItika





[jira] [Resolved] (CONNECTORS-1642) PostgreSQL Version >= 12.2 DB Initialization Problems

2020-05-12 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1642.
-
Fix Version/s: ManifoldCF 2.16
   Resolution: Fixed

r1877648.

Worked fine on Postgresql 9, so I think we should be good.  I would appreciate 
it if [~wolfingeru] would try this against Postgresql 12.2 however.


> PostgreSQL Version >= 12.2 DB Initialization Problems
> -
>
> Key: CONNECTORS-1642
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1642
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.15
>Reporter: Uwe Wolfinger
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
> Attachments: image-2020-05-12-14-34-45-754.png, 
> image-2020-05-12-14-35-41-885.png
>
>
> When trying to run the "./executecommand.sh 
> org.apache.manifoldcf.crawler.InitializeAndRegister" script, the following 
> error shows up and the initialization process stops:
> {{ WARNING: Illegal reflective access by org.postgresql.jdbc.TimestampUtils 
> ([file:/home/suche/crawler/lib/postgresql-42.1.3.jar|file:///home/suche/crawler/lib/postgresql-42.1.3.jar])
>  to field java.util.TimeZone.defaultTimeZone}}
> {{ WARNING: Please consider reporting this to the maintainers of 
> org.postgresql.jdbc.TimestampUtils}}
> {{ WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations}}
> {{ WARNING: All illegal access operations will be denied in a future release}}
> {{ org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database 
> exception: SQLException doing query (42703): FEHLER: Spalte pg_attrdef.adsrc 
> existiert nicht}}
> {{ Position: 447}}
> {{ at 
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:715)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:741)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:803)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1457)}}
> {{ at 
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:204)}}
> {{ at 
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:837)}}
> {{ at 
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.getTableSchema(DBInterfacePostgreSQL.java:696)}}
> {{ at 
> org.apache.manifoldcf.core.database.BaseTable.getTableSchema(BaseTable.java:185)}}
> {{ at 
> org.apache.manifoldcf.agents.agentmanager.AgentManager.install(AgentManager.java:67)}}
> {{ at 
> org.apache.manifoldcf.agents.system.ManifoldCF.installTables(ManifoldCF.java:112)}}
>  
> The column "pg_attrdef.adsrc" no longer exists in PostgreSQL 12.2.
> [https://www.postgresql.org/docs/11/catalog-pg-attrdef.html]
> This means that it is impossible to initialize the core DB on PostgreSQL 
> 12.2.





[jira] [Commented] (CONNECTORS-1642) PostgreSQL Version >= 12.2 DB Initialization Problems

2020-05-12 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105378#comment-17105378
 ] 

Karl Wright commented on CONNECTORS-1642:
-

[~michaelcizmar] Yes, a "latest versions" issue.
[~wolfingeru] Do your admins think this change will work back to PGSQL 9 or so? 
 I am not much worried about supporting versions older than that.



> PostgreSQL Version >= 12.2 DB Initialization Problems
> -
>
> Key: CONNECTORS-1642
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1642
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.15
>Reporter: Uwe Wolfinger
>Assignee: Karl Wright
>Priority: Major
>
> When trying to run the "./executecommand.sh 
> org.apache.manifoldcf.crawler.InitializeAndRegister" script, the following 
> error shows up and the initialization process stops:
> {{ WARNING: Illegal reflective access by org.postgresql.jdbc.TimestampUtils 
> ([file:/home/suche/crawler/lib/postgresql-42.1.3.jar|file:///home/suche/crawler/lib/postgresql-42.1.3.jar])
>  to field java.util.TimeZone.defaultTimeZone}}
> {{ WARNING: Please consider reporting this to the maintainers of 
> org.postgresql.jdbc.TimestampUtils}}
> {{ WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations}}
> {{ WARNING: All illegal access operations will be denied in a future release}}
> {{ org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database 
> exception: SQLException doing query (42703): FEHLER: Spalte pg_attrdef.adsrc 
> existiert nicht}}
> {{ Position: 447}}
> {{ at 
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:715)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:741)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:803)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1457)}}
> {{ at 
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:204)}}
> {{ at 
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:837)}}
> {{ at 
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.getTableSchema(DBInterfacePostgreSQL.java:696)}}
> {{ at 
> org.apache.manifoldcf.core.database.BaseTable.getTableSchema(BaseTable.java:185)}}
> {{ at 
> org.apache.manifoldcf.agents.agentmanager.AgentManager.install(AgentManager.java:67)}}
> {{ at 
> org.apache.manifoldcf.agents.system.ManifoldCF.installTables(ManifoldCF.java:112)}}
>  
> The column "pg_attrdef.adsrc" no longer exists in PostgreSQL 12.2.
> [https://www.postgresql.org/docs/11/catalog-pg-attrdef.html]
> This means that it is impossible to initialize the core DB on PostgreSQL 
> 12.2.





Need a new query to find a table's schema for Postgresql 12 +

2020-05-12 Thread Karl Wright
Hi all,

See https://issues.apache.org/jira/browse/CONNECTORS-1642 .  It appears
that Postgresql has moved away from supporting some of their internal
schema and have broken the query we have used for a decade to obtain the
schema for an existing table.  The code in the method in question (included
in the ticket) I grabbed from some place verbatim long ago; I didn't write
the query myself.  I am hoping that one of our intrepid developers can find
the Postgresql 12+ equivalent, preferably backwards compatible, that we can
replace it with?  Thanks in advance!

Karl


[jira] [Commented] (CONNECTORS-1642) PostgreSQL Version >= 12.2 DB Initialization Problems

2020-05-12 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105367#comment-17105367
 ] 

Karl Wright commented on CONNECTORS-1642:
-

The full code of the current method is the following:

{code}
  /** Get a table's schema.
  *@param tableName is the name of the table.
  *@param cacheKeys are the keys against which to cache the query, or null.
  *@param queryClass is the name of the query class, or null.
  *@return a map of column names and ColumnDescription objects, describing the schema, or null if the
  * table doesn't exist.
  */
  @Override
  public Map<String,ColumnDescription> getTableSchema(String tableName, StringSet cacheKeys, String queryClass)
    throws ManifoldCFException
  {
    StringBuilder query = new StringBuilder();
    List list = new ArrayList();
    query.append("SELECT pg_attribute.attname AS \"Field\",");
    query.append("CASE pg_type.typname WHEN 'int2' THEN 'smallint' WHEN 'int4' THEN 'int'");
    query.append(" WHEN 'int8' THEN 'bigint' WHEN 'varchar' THEN 'varchar(' || pg_attribute.atttypmod-4 || ')'");
    query.append(" WHEN 'text' THEN 'longtext'");
    query.append(" WHEN 'bpchar' THEN 'char(' || pg_attribute.atttypmod-4 || ')'");
    query.append(" ELSE pg_type.typname END AS \"Type\",");
    query.append("CASE WHEN pg_attribute.attnotnull THEN '' ELSE 'YES' END AS \"Null\",");
    query.append("CASE pg_type.typname WHEN 'varchar' THEN substring(pg_attrdef.adsrc from '^(.*).*$') ELSE pg_attrdef.adsrc END AS Default ");
    query.append("FROM pg_class INNER JOIN pg_attribute ON (pg_class.oid=pg_attribute.attrelid) INNER JOIN pg_type ON (pg_attribute.atttypid=pg_type.oid) ");
    query.append("LEFT JOIN pg_attrdef ON (pg_class.oid=pg_attrdef.adrelid AND pg_attribute.attnum=pg_attrdef.adnum) ");
    query.append("WHERE pg_class.relname=? AND pg_attribute.attnum>=1 AND NOT pg_attribute.attisdropped ");
    query.append("ORDER BY pg_attribute.attnum");
    list.add(tableName);

    IResultSet set = performQuery(query.toString(),list,cacheKeys,queryClass);
    if (set.getRowCount() == 0)
      return null;
    // Digest the result
    Map<String,ColumnDescription> rval = new HashMap<String,ColumnDescription>();
    int i = 0;
    while (i < set.getRowCount())
    {
      IResultRow row = set.getRow(i++);
      String fieldName = row.getValue("Field").toString();
      String type = row.getValue("Type").toString();
      boolean isNull = row.getValue("Null").toString().equals("YES");
      boolean isPrimaryKey = false; // row.getValue("Key").toString().equals("PRI");
      rval.put(fieldName,new ColumnDescription(type,isPrimaryKey,isNull,null,null,false));
    }

    return rval;
  }
{code}

The query I got from StackOverflow or the Postgresql manual (can't remember 
which) a decade ago.  I think just replacing the query with a more modern 
version would work but I have no idea what the more modern version would be.  
Patches welcome.
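
For reference, the PostgreSQL catalog documentation points to 
pg_get_expr(adbin, adrelid) as the way to obtain a column default as text, 
which avoids the removed adsrc column. Below is a minimal sketch of how the 
query could be rebuilt on that basis. The class and method names are 
hypothetical, the varchar-specific substring handling is left out, and this 
is not necessarily the change that was committed for this ticket.

```java
public class TableSchemaQuerySketch {
    /** Build the column-listing query without referencing pg_attrdef.adsrc. */
    static String buildSchemaQuery() {
        // pg_get_expr decompiles the stored default expression (adbin) to text.
        String defaultExpr = "pg_get_expr(pg_attrdef.adbin, pg_attrdef.adrelid)";

        StringBuilder query = new StringBuilder();
        query.append("SELECT pg_attribute.attname AS \"Field\",");
        query.append("CASE pg_type.typname WHEN 'int2' THEN 'smallint' WHEN 'int4' THEN 'int'");
        query.append(" WHEN 'int8' THEN 'bigint' WHEN 'varchar' THEN 'varchar(' || pg_attribute.atttypmod-4 || ')'");
        query.append(" WHEN 'text' THEN 'longtext'");
        query.append(" WHEN 'bpchar' THEN 'char(' || pg_attribute.atttypmod-4 || ')'");
        query.append(" ELSE pg_type.typname END AS \"Type\",");
        query.append("CASE WHEN pg_attribute.attnotnull THEN '' ELSE 'YES' END AS \"Null\",");
        query.append(defaultExpr).append(" AS Default ");
        query.append("FROM pg_class INNER JOIN pg_attribute ON (pg_class.oid=pg_attribute.attrelid) ");
        query.append("INNER JOIN pg_type ON (pg_attribute.atttypid=pg_type.oid) ");
        query.append("LEFT JOIN pg_attrdef ON (pg_class.oid=pg_attrdef.adrelid AND pg_attribute.attnum=pg_attrdef.adnum) ");
        query.append("WHERE pg_class.relname=? AND pg_attribute.attnum>=1 AND NOT pg_attribute.attisdropped ");
        query.append("ORDER BY pg_attribute.attnum");
        return query.toString();
    }

    public static void main(String[] args) {
        // Print the query so it can be pasted into psql for a manual check.
        System.out.println(buildSchemaQuery());
    }
}
```

Since the method above only reads the "Field", "Type" and "Null" columns, 
dropping the Default column from the query altogether may be another option.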


> PostgreSQL Version >= 12.2 DB Initialization Problems
> -
>
> Key: CONNECTORS-1642
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1642
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.15
>Reporter: Uwe Wolfinger
>Assignee: Karl Wright
>Priority: Major
>
> When trying to run the "./executecommand.sh 
> org.apache.manifoldcf.crawler.InitializeAndRegister" script, the following 
> error shows up and the initialization process stops:
> {{ WARNING: Illegal reflective access by org.postgresql.jdbc.TimestampUtils 
> ([file:/home/suche/crawler/lib/postgresql-42.1.3.jar|file:///home/suche/crawler/lib/postgresql-42.1.3.jar])
>  to field java.util.TimeZone.defaultTimeZone}}
> {{ WARNING: Please consider reporting this to the maintainers of 
> org.postgresql.jdbc.TimestampUtils}}
> {{ WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations}}
> {{ WARNING: All illegal access operations will be denied in a future release}}
> {{ org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database 
> exception: SQLException doing query (42703): FEHLER: Spalte pg_attrdef.adsrc 
> existiert nicht}}
> {{ Position: 447}}
> {{ at 
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:715)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:741)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeUncac

[jira] [Commented] (CONNECTORS-1642) PostgreSQL Version >= 12.2 DB Initialization Problems

2020-05-12 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105355#comment-17105355
 ] 

Karl Wright commented on CONNECTORS-1642:
-

Sounds like the method DBInterfacePostgreSQL.getTableSchema() needs to be 
updated to reflect the newest Postgresql version.  Does anyone want to propose 
a patch for this?


> PostgreSQL Version >= 12.2 DB Initialization Problems
> -
>
> Key: CONNECTORS-1642
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1642
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.15
>Reporter: Uwe Wolfinger
>Assignee: Karl Wright
>Priority: Major
>
> When trying to run the "./executecommand.sh 
> org.apache.manifoldcf.crawler.InitializeAndRegister" script, the following 
> error shows up and the initialization process stops:
> {{ WARNING: Illegal reflective access by org.postgresql.jdbc.TimestampUtils 
> ([file:/home/suche/crawler/lib/postgresql-42.1.3.jar|file:///home/suche/crawler/lib/postgresql-42.1.3.jar])
>  to field java.util.TimeZone.defaultTimeZone}}
> {{ WARNING: Please consider reporting this to the maintainers of 
> org.postgresql.jdbc.TimestampUtils}}
> {{ WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations}}
> {{ WARNING: All illegal access operations will be denied in a future release}}
> {{ org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database 
> exception: SQLException doing query (42703): FEHLER: Spalte pg_attrdef.adsrc 
> existiert nicht}}
> {{ Position: 447}}
> {{ at 
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:715)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:741)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:803)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1457)}}
> {{ at 
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:204)}}
> {{ at 
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:837)}}
> {{ at 
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.getTableSchema(DBInterfacePostgreSQL.java:696)}}
> {{ at 
> org.apache.manifoldcf.core.database.BaseTable.getTableSchema(BaseTable.java:185)}}
> {{ at 
> org.apache.manifoldcf.agents.agentmanager.AgentManager.install(AgentManager.java:67)}}
> {{ at 
> org.apache.manifoldcf.agents.system.ManifoldCF.installTables(ManifoldCF.java:112)}}
>  
> The column "pg_attrdef.adsrc" no longer exists in PostgreSQL 12.2.
> [https://www.postgresql.org/docs/11/catalog-pg-attrdef.html]
> This means that it is impossible to initialize the core DB on PostgreSQL 
> 12.2.





[jira] [Assigned] (CONNECTORS-1642) PostgreSQL Version >= 12.2 DB Initialization Problems

2020-05-12 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1642:
---

Assignee: Karl Wright

> PostgreSQL Version >= 12.2 DB Initialization Problems
> -
>
> Key: CONNECTORS-1642
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1642
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.15
>Reporter: Uwe Wolfinger
>Assignee: Karl Wright
>Priority: Major
>
> When trying to run the "./executecommand.sh 
> org.apache.manifoldcf.crawler.InitializeAndRegister" script, the following 
> error shows up and the initialization process stops:
> {{ WARNING: Illegal reflective access by org.postgresql.jdbc.TimestampUtils 
> ([file:/home/suche/crawler/lib/postgresql-42.1.3.jar|file:///home/suche/crawler/lib/postgresql-42.1.3.jar])
>  to field java.util.TimeZone.defaultTimeZone}}
> {{ WARNING: Please consider reporting this to the maintainers of 
> org.postgresql.jdbc.TimestampUtils}}
> {{ WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations}}
> {{ WARNING: All illegal access operations will be denied in a future release}}
> {{ org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database 
> exception: SQLException doing query (42703): FEHLER: Spalte pg_attrdef.adsrc 
> existiert nicht}}
> {{ Position: 447}}
> {{ at 
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:715)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:741)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:803)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1457)}}
> {{ at 
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)}}
> {{ at 
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:204)}}
> {{ at 
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:837)}}
> {{ at 
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.getTableSchema(DBInterfacePostgreSQL.java:696)}}
> {{ at 
> org.apache.manifoldcf.core.database.BaseTable.getTableSchema(BaseTable.java:185)}}
> {{ at 
> org.apache.manifoldcf.agents.agentmanager.AgentManager.install(AgentManager.java:67)}}
> {{ at 
> org.apache.manifoldcf.agents.system.ManifoldCF.installTables(ManifoldCF.java:112)}}
>  
> The column "pg_attrdef.adsrc" no longer exists in PostgreSQL 12.2.
> [https://www.postgresql.org/docs/11/catalog-pg-attrdef.html]
> This means that it is impossible to initialize the core DB on PostgreSQL 
> 12.2.





Re: [VOTE] Solr to become a top-level Apache project (TLP)

2020-05-12 Thread Karl Wright
+1 from me (binding)
Karl

On Tue, May 12, 2020 at 3:54 AM Atri Sharma  wrote:

> +1 (binding).
>
> Regards,
>
> Atri
>
> On Tue, 12 May 2020 at 13:07, Dawid Weiss  wrote:
>
>> Dear Lucene and Solr developers!
>>
>> According to an earlier [DISCUSS] thread on the dev list [2], I am
>> calling for a vote on the proposal to make Solr a top-level Apache
>> project (TLP) and separate Lucene and Solr development into two
>> independent entities.
>>
>> To quickly recap the reasons and consequences of such a move: it seems
>> like the reasons for the initial merge of Lucene and Solr, around 10
>> years ago, have been achieved. Both projects are in good shape and
>> exhibit signs of independence already (mailing lists, committers,
>> patch flow). There are many technical considerations that would make
>> development much easier if we move Solr out into its own TLP.
>>
>> We discussed this issue [2] and both PMC members and committers had a
>> chance to review all the pros and cons and express their views. The
>> discussion showed that there are clearly different opinions on the
>> matter - some people are in favor, some are neutral, others are
>> against or not seeing the point of additional labor. Realistically, I
>> don't think reaching 100% level consensus is going to be possible --
>> we are a diverse bunch with different opinions and personalities. I
>> firmly believe this is the right direction hence the decision to put
>> it under the voting process. Should something take a wrong turn in the
>> future (as some folks worry it may), all blame is on me.
>>
>> Therefore, the proposal is to separate Solr from under Lucene TLP, and
>> make it a TLP on its own. The initial structure of the new PMC,
>> committer base, git repositories and other managerial aspects can be
>> worked out during the process if the decision passes.
>>
>> Please indicate one of the following (see [1] for guidelines):
>>
>> [ ] +1 - yes, I vote for the proposal
>> [ ] -1 - no, I vote against the proposal
>>
>> Please note that anyone in the Lucene+Solr community is invited to
>> express their opinion, though only Lucene+Solr committers cast binding
>> votes (indicate non-binding votes in your reply, please).
>>
>> The vote will be active for a week to give everyone a chance to read
>> and cast a vote.
>>
>> Dawid
>>
>> [1] https://www.apache.org/foundation/voting.html
>> [2]
>> https://lists.apache.org/thread.html/rfae2440264f6f874e91545b2030c98e7b7e3854ddf090f7747d338df%40%3Cdev.lucene.apache.org%3E
>>
>>
>> --
> Regards,
>
> Atri
> Apache Concerted
>


[jira] [Commented] (CONNECTORS-1641) Cannot find doap file: http://manifoldcf.apache.org/doap_ManifoldCF.rdf

2020-05-09 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103243#comment-17103243
 ] 

Karl Wright commented on CONNECTORS-1641:
-

It is present in SVN:

{code}
C:\wip\mcf-site\publish>dir
 Volume in drive C is Windows
 Volume Serial Number is F4D8-E4E0

 Directory of C:\wip\mcf-site\publish

05/08/2020  08:22 AM    <DIR>          .
05/08/2020  08:22 AM    <DIR>          ..
05/08/2020  08:22 AM             3,845 .htaccess
05/08/2020  08:22 AM                33 broken-links.xml
05/08/2020  08:22 AM             3,013 doap_ManifoldCF.rdf
05/08/2020  08:22 AM    <DIR>          en_US
05/08/2020  08:22 AM    <DIR>          images
05/08/2020  08:22 AM             5,081 index.html
05/08/2020  08:22 AM    <DIR>          ja_JP
05/08/2020  08:22 AM            16,599 linkmap.html
05/08/2020  08:20 AM    <DIR>          release
05/08/2020  08:38 AM    <DIR>          skin
05/08/2020  08:38 AM    <DIR>          zh_CN
               5 File(s)         28,571 bytes
               8 Dir(s)  123,887,923,200 bytes free

C:\wip\mcf-site\publish>
{code}

and here:

{code}
C:\wip\mcf-site\publish>svn list https://svn.apache.org/repos/asf/manifoldcf/site/publish
.htaccess
broken-links.xml
doap_ManifoldCF.rdf
en_US/
images/
index.html
ja_JP/
linkmap.html
release/
skin/
zh_CN/
{code}

I guess this needs to be escalated to INFRA.

> Cannot find doap file: http://manifoldcf.apache.org/doap_ManifoldCF.rdf
> ---
>
> Key: CONNECTORS-1641
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1641
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Sebb
>Priority: Major
>
> URL: http://manifoldcf.apache.org/doap_ManifoldCF.rdf
> HTTP Error 404: Not Found
> Source: 
> https://svn.apache.org/repos/asf/comdev/projects.apache.org/trunk/data/projects.xml





[jira] [Resolved] (CONNECTORS-1640) http://manifoldcf.apache.org/ Website missing

2020-05-08 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1640.
-
Fix Version/s: ManifoldCF 2.16
   Resolution: Fixed

Looks like it finally went live.


> http://manifoldcf.apache.org/ Website missing
> -
>
> Key: CONNECTORS-1640
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1640
> Project: ManifoldCF
>  Issue Type: Bug
> Environment: http://manifoldcf.apache.org/
>Reporter: Sebb
>    Assignee: Karl Wright
>Priority: Critical
> Fix For: ManifoldCF 2.16
>
>
> As the subject says - there is currently no usable website at 
> http://manifoldcf.apache.org/





[jira] [Commented] (CONNECTORS-1640) http://manifoldcf.apache.org/ Website missing

2020-05-08 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102566#comment-17102566
 ] 

Karl Wright commented on CONNECTORS-1640:
-

I've re-committed to the svn instance that the site is mirrored from.  The 
workarea looks fine.  But the site still doesn't appear to have changed, so 
I may need to involve Infra.



> http://manifoldcf.apache.org/ Website missing
> -
>
> Key: CONNECTORS-1640
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1640
> Project: ManifoldCF
>  Issue Type: Bug
> Environment: http://manifoldcf.apache.org/
>Reporter: Sebb
>    Assignee: Karl Wright
>Priority: Critical
>
> As the subject says - there is currently no usable website at 
> http://manifoldcf.apache.org/





[jira] [Commented] (CONNECTORS-1640) http://manifoldcf.apache.org/ Website missing

2020-05-08 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102528#comment-17102528
 ] 

Karl Wright commented on CONNECTORS-1640:
-

Looks like svn failed spectacularly for some reason here.  Retrying the publish 
now.


> http://manifoldcf.apache.org/ Website missing
> -
>
> Key: CONNECTORS-1640
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1640
> Project: ManifoldCF
>  Issue Type: Bug
> Environment: http://manifoldcf.apache.org/
>Reporter: Sebb
>Priority: Critical
>
> As the subject says - there is currently no usable website at 
> http://manifoldcf.apache.org/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (CONNECTORS-1640) http://manifoldcf.apache.org/ Website missing

2020-05-08 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1640:
---

Assignee: Karl Wright

> http://manifoldcf.apache.org/ Website missing
> -
>
> Key: CONNECTORS-1640
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1640
> Project: ManifoldCF
>  Issue Type: Bug
> Environment: http://manifoldcf.apache.org/
>Reporter: Sebb
>    Assignee: Karl Wright
>Priority: Critical
>
> As the subject says - there is currently no usable website at 
> http://manifoldcf.apache.org/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Crawling / Indexation Query

2020-05-07 Thread Karl Wright
Hi,

ManifoldCF is not a crawler, it's a synchronizer.  If robots says not to
crawl something, then it will not be indexed.  If robots is changed to
prohibit crawling of certain documents, then yes, those documents will be
removed from the index.

But you can override the robots behavior in the document specification or
configuration, I believe.

Karl


On Thu, May 7, 2020 at 6:27 AM ritika jain  wrote:

> Hi All,
>
> Can anybody explain: if a URL was indexed, and afterwards a noindex tag was
> added, will that URL then be deleted from the index when it is visited
> again by the crawler?
>
> Say a URL previously had a meta tag that allowed indexing and was present
> in the Elastic index, but a noindex meta tag was later added to the page.
>
> Should it be deleted from the index when the ManifoldCF job crawls that URL
> again, or will the URL still be present in the index?
>
> Thanks
>
>
>


Re: ES 7.6.2

2020-05-07 Thread Karl Wright
Hi Ritika,

ManifoldCF's ElasticSearch connector does not include any code that
requires Java 11, so you are all set.

Because JDK 11 removes many packages, however, you should expect to run
ManifoldCF 2.14 with Java 8.  ManifoldCF 2.16, just released, supports Java
11.

Karl


On Thu, May 7, 2020 at 5:14 AM ritika jain  wrote:

> Hi,
>
> Can anybody please tell me whether ManifoldCF 2.14 is compatible
> with Elasticsearch 7.6.2, given that it requires Java 11?
>
> Thanks
> Ritika
>


[RESULT] [VOTE] Release Apache ManifoldCF 2.16, RC0

2020-05-06 Thread Karl Wright
Three +1's, >72 hrs.  Vote passes!

Karl

On Tue, May 5, 2020 at 4:33 PM Karl Wright  wrote:

> +1 from me, although I've now had to patch trunk to get the mongodb test
> materials to download as of today.
>
> Karl
>
>
> On Sun, May 3, 2020 at 6:48 PM Cihad Guzel  wrote:
>
>> Hi,
>>
>> +1 from me.
>>
>> I have built successfully with JDK 11 via maven. All tests passed
>> successfully.
>>
>> Regards,
>> Cihad Güzel
>>
>> On Sun, 3 May 2020 at 22:27, Furkan KAMACI wrote:
>>
>> > Hi,
>> >
>> > +1 from me.
>> >
>> > I checked:
>> >
>> > - LICENSE is fine
>> > - NOTICE should be updated (minor issue, I can fix it)
>> > - No unexpected binary files
>> > - Checked PGP signatures
>> > - Checked Checksums
>> > - Apache rat checks are OK
>> >
>> >
>> > PS: I got an error while compiling and testing with Java 11, but it seems
>> > to be related to my environment:
>> >
>> > [ERROR] Failed to execute goal on project mcf-cmis-connector: Could not
>> > resolve dependencies for project
>> > org.apache.manifoldcf:mcf-cmis-connector:jar:2.16: Failed to collect
>> > dependencies at
>> > org.apache.chemistry.opencmis:chemistry-opencmis-client-impl:jar:1.1.0 ->
>> > org.apache.chemistry.opencmis:chemistry-opencmis-client-bindings:jar:1.1.0
>> > -> org.apache.cxf:cxf-rt-frontend-jaxws:jar:3.0.12 ->
>> > org.apache.cxf:cxf-rt-bindings-soap:jar:3.0.12 ->
>> > org.apache.cxf:cxf-rt-databinding-jaxb:jar:3.0.12 ->
>> > com.sun.xml.bind:jaxb-impl:jar:2.1.14: Failed to read artifact descriptor
>> > for com.sun.xml.bind:jaxb-impl:jar:2.1.14: Could not transfer artifact
>> > com.sun.xml.bind:jaxb-impl:pom:2.1.14 from/to central
>> > (https://repo.maven.apache.org/maven2): Transfer failed for
>> > https://repo.maven.apache.org/maven2/com/sun/xml/bind/jaxb-impl/2.1.14/jaxb-impl-2.1.14.pom:
>> > Operation timed out (Read failed)
>> >
>> > Kind Regards,
>> > Furkan KAMACI
>> >
>> > On Sun, May 3, 2020 at 9:59 PM Michael Cizmar <mich...@michaelcizmar.com>
>> > wrote:
>> >
>> > > Great work Karl!  I'm looking forward to trying this out.
>> > >
>> > > On Sun, May 3, 2020 at 1:31 PM Karl Wright wrote:
>> > >
>> > > > Please vote on whether to release Apache ManifoldCF 2.16, RC0.  This
>> > > > release has a new confluence connector as well as preliminary
>> > > > support for Java 11.  The release artifact can be found at:
>> > > >
>> > > > https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.16
>> > > >
>> > > > There is a release tag at:
>> > > >
>> > > > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.16-RC0
>> > > >
>> > > > Thanks in advance!
>> > > > Karl
>> > > >
>> > >
>> >
>>
>


Re: [VOTE] Release Apache ManifoldCF 2.16, RC0

2020-05-05 Thread Karl Wright
+1 from me, although I've now had to patch trunk to get the mongodb test
materials to download as of today.

Karl


On Sun, May 3, 2020 at 6:48 PM Cihad Guzel  wrote:

> Hi,
>
> +1 from me.
>
> I have built successfully with JDK 11 via maven. All tests passed
> successfully.
>
> Regards,
> Cihad Güzel
>
> On Sun, 3 May 2020 at 22:27, Furkan KAMACI wrote:
>
> > Hi,
> >
> > +1 from me.
> >
> > I checked:
> >
> > - LICENSE is fine
> > - NOTICE should be updated (minor issue, I can fix it)
> > - No unexpected binary files
> > - Checked PGP signatures
> > - Checked Checksums
> > - Apache rat checks are OK
> >
> >
> > PS: I got an error while compiling and testing with Java 11, but it seems
> > to be related to my environment:
> >
> > [ERROR] Failed to execute goal on project mcf-cmis-connector: Could not
> > resolve dependencies for project
> > org.apache.manifoldcf:mcf-cmis-connector:jar:2.16: Failed to collect
> > dependencies at
> > org.apache.chemistry.opencmis:chemistry-opencmis-client-impl:jar:1.1.0 ->
> > org.apache.chemistry.opencmis:chemistry-opencmis-client-bindings:jar:1.1.0
> > -> org.apache.cxf:cxf-rt-frontend-jaxws:jar:3.0.12 ->
> > org.apache.cxf:cxf-rt-bindings-soap:jar:3.0.12 ->
> > org.apache.cxf:cxf-rt-databinding-jaxb:jar:3.0.12 ->
> > com.sun.xml.bind:jaxb-impl:jar:2.1.14: Failed to read artifact descriptor
> > for com.sun.xml.bind:jaxb-impl:jar:2.1.14: Could not transfer artifact
> > com.sun.xml.bind:jaxb-impl:pom:2.1.14 from/to central
> > (https://repo.maven.apache.org/maven2): Transfer failed for
> > https://repo.maven.apache.org/maven2/com/sun/xml/bind/jaxb-impl/2.1.14/jaxb-impl-2.1.14.pom:
> > Operation timed out (Read failed)
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > On Sun, May 3, 2020 at 9:59 PM Michael Cizmar wrote:
> >
> > > Great work Karl!  I'm looking forward to trying this out.
> > >
> > > On Sun, May 3, 2020 at 1:31 PM Karl Wright  wrote:
> > >
> > > > Please vote on whether to release Apache ManifoldCF 2.16, RC0.  This
> > > > release has a new confluence connector as well as preliminary support
> > > > for Java 11.  The release artifact can be found at:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.16
> > > >
> > > > There is a release tag at:
> > > >
> > > > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.16-RC0
> > > >
> > > > Thanks in advance!
> > > > Karl
> > > >
> > >
> >
>


Re: MongoDB download now no longer works

2020-05-05 Thread Karl Wright
Thanks, yes, 2.2.0 works fine.
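
For reference, a released artifact's download URL follows the standard Maven repository layout, so it can be derived from the coordinates instead of being hard-coded. A minimal sketch, assuming the repo1.maven.org base URL quoted below; the class and method names are illustrative and not part of the ManifoldCF build:

```java
// Illustrative helper, not part of the ManifoldCF build.
class MavenCoordinates {
  static final String CENTRAL = "https://repo1.maven.org/maven2/";

  /** Snapshot versions are not published to Maven Central. */
  static boolean isSnapshot(String version) {
    return version.endsWith("-SNAPSHOT");
  }

  /** Builds the jar URL: group path / artifactId / version / artifactId-version.jar */
  static String jarUrl(String groupId, String artifactId, String version) {
    if (isSnapshot(version)) {
      throw new IllegalArgumentException("not a release version: " + version);
    }
    return CENTRAL + groupId.replace('.', '/') + "/" + artifactId + "/"
        + version + "/" + artifactId + "-" + version + ".jar";
  }

  public static void main(String[] args) {
    // Coordinates taken from the message quoted below.
    System.out.println(jarUrl("de.flapdoodle.embed", "de.flapdoodle.embed.mongo", "2.2.0"));
  }
}
```

Pinning a release version this way avoids depending on a timestamped snapshot jar that the snapshot repository may later purge.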

Karl


On Tue, May 5, 2020 at 10:32 AM Michael Cizmar 
wrote:

> Looks like that is using version 2.1.2-SNAPSHOT.  Key part I think is
> "Snapshot".  I did a search and found this one (and there is a 2.1.2
> version):
>
>
> https://search.maven.org/artifact/de.flapdoodle.embed/de.flapdoodle.embed.mongo/2.2.0/jar
>
> and it appears to be available for download still:
>
>
> https://repo1.maven.org/maven2/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.2.0/
>
> --
> Michael Cizmar
>
>
> On 5/5/20, 8:55 AM, "Karl Wright"  wrote:
>
> A unit test for the MongoDB connector needs to download a version of
> MongoDB to do this testing.  Unfortunately it looks like MongoDB has
> removed their old, free versions from the download repository.  We get
> this now:
>
> C:\wip\mcf\trunk\connectors\mongodb>ant download-dependencies
> Buildfile: C:\wip\mcf\trunk\connectors\mongodb\build.xml
>
> download-dependencies:
>   [get] Getting:
>
> https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
>   [get] To:
>
> C:\wip\mcf\trunk\connectors\mongodb\test-materials\de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
>   [get] Error opening connection java.io.FileNotFoundException:
>
> https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
>   [get] Error opening connection java.io.FileNotFoundException:
>
> https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
>   [get] Error opening connection java.io.FileNotFoundException:
>
> https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
>   [get] Can't get
>
> https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
> to
>
> C:\wip\mcf\trunk\connectors\mongodb\test-materials\de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
>
> BUILD FAILED
> C:\wip\mcf\trunk\connectors\mongodb\build.xml:92: Can't get
>
> https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
> to
>
> C:\wip\mcf\trunk\connectors\mongodb\test-materials\de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
>
> Total time: 4 seconds
>
> C:\wip\mcf\trunk\connectors\mongodb>
>
> Any ideas?  Or do we just need to disable this test too?
>
> Karl
>
>


MongoDB download now no longer works

2020-05-05 Thread Karl Wright
A unit test for the MongoDB connector needs to download a version of
MongoDB to do this testing.  Unfortunately it looks like MongoDB has
removed their old, free versions from the download repository.  We get this
now:

C:\wip\mcf\trunk\connectors\mongodb>ant download-dependencies
Buildfile: C:\wip\mcf\trunk\connectors\mongodb\build.xml

download-dependencies:
  [get] Getting:
https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
  [get] To:
C:\wip\mcf\trunk\connectors\mongodb\test-materials\de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
  [get] Error opening connection java.io.FileNotFoundException:
https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
  [get] Error opening connection java.io.FileNotFoundException:
https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
  [get] Error opening connection java.io.FileNotFoundException:
https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
  [get] Can't get
https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
to
C:\wip\mcf\trunk\connectors\mongodb\test-materials\de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar

BUILD FAILED
C:\wip\mcf\trunk\connectors\mongodb\build.xml:92: Can't get
https://oss.sonatype.org/content/repositories/snapshots/de/flapdoodle/embed/de.flapdoodle.embed.mongo/2.1.2-SNAPSHOT/de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar
to
C:\wip\mcf\trunk\connectors\mongodb\test-materials\de.flapdoodle.embed.mongo-2.1.2-20180621.063700-1.jar

Total time: 4 seconds

C:\wip\mcf\trunk\connectors\mongodb>

Any ideas?  Or do we just need to disable this test too?

Karl


[VOTE] Release Apache ManifoldCF 2.16, RC0

2020-05-03 Thread Karl Wright
Please vote on whether to release Apache ManifoldCF 2.16, RC0.  This
release has a new confluence connector as well as preliminary support for
Java 11.  The release artifact can be found at:

https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.16

There is a release tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.16-RC0

Thanks in advance!
Karl


[jira] [Resolved] (CONNECTORS-1624) Get ManifoldCF to run under Java 11 or higher

2020-05-03 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1624.
-
Resolution: Fixed

Commits were made to address Java 11 changes, including the ES testing version.


> Get ManifoldCF to run under Java 11 or higher
> -
>
> Key: CONNECTORS-1624
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1624
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Framework core
>    Reporter: Karl Wright
>    Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
>
> Java 11 doesn't include a number of classes that Java 8 does.  We need to 
> explicitly include jars that provide these classes or ManifoldCF will not 
> function under higher Java revs.
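
As a rough sketch of how the missing classes could be detected up front rather than at first use, the following probes for classes that shipped with Java 8 but were removed from the JDK in Java 11. The class RemovedModuleCheck and its methods are hypothetical and not part of ManifoldCF; javax.activation.DataSource and javax.xml.bind.JAXBContext are just two examples of such classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ManifoldCF.
class RemovedModuleCheck {
  /** Returns true if the named class can be located (without initializing it). */
  static boolean isPresent(String className) {
    try {
      Class.forName(className, false, RemovedModuleCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  /** Returns the subset of the given class names that cannot be loaded. */
  static List<String> missing(String... classNames) {
    List<String> result = new ArrayList<>();
    for (String name : classNames) {
      if (!isPresent(name)) {
        result.add(name);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // On Java 8 this list is empty; on Java 11+ it is empty only if
    // replacement jars are on the classpath.
    System.out.println("Missing: "
        + missing("javax.activation.DataSource", "javax.xml.bind.JAXBContext"));
  }
}
```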



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-05-03 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1639.
-
Resolution: Fixed

Merged branch into trunk.  r1877321


> Upgrade Elastic Search Version
> --
>
> Key: CONNECTORS-1639
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1639
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Cihad Guzel
>    Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
> Attachments: CONNECTORS-1639.diff, 
> elastic-search-1.0.1-java11-build-error.log, es_start.sh, es_stop.sh
>
>
> Current Elastic Search version is 1.0.1. According to [this 
> matrix|https://www.elastic.co/support/matrix#matrix_jvm], Java 11 is not 
> supported by any ES version below 6.5.
> Besides, ES 1.x is no longer supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-05-01 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097744#comment-17097744
 ] 

Karl Wright commented on CONNECTORS-1639:
-

Thanks, these are helpful.  I will be able to integrate them into the java code 
for starting and stopping the ES instance.

There's only one thing I still need: a compatible ES version and mapper 
attachment.


> Upgrade Elastic Search Version
> --
>
> Key: CONNECTORS-1639
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1639
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Cihad Guzel
>    Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
> Attachments: CONNECTORS-1639.diff, 
> elastic-search-1.0.1-java11-build-error.log
>
>
> Current Elastic Search version is 1.0.1. According to [this 
> matrix|https://www.elastic.co/support/matrix#matrix_jvm], Java 11 is not 
> supported by any ES version below 6.5.
> Besides, ES 1.x is no longer supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Should we continue to hold the 4/30 release until the ES testing is ironed out?

2020-05-01 Thread Karl Wright
Thanks!!

Will look later tonight.
Karl


On Fri, May 1, 2020 at 4:20 PM Michael Cizmar 
wrote:

> Karl,
>
> We created the script for you today and added it to:
> https://issues.apache.org/jira/browse/CONNECTORS-1639
>
> I had them make one to start it up and one to shut it down.  Please let me
> know if this works for you.
>
> --
> Michael Cizmar
>
>
> On 5/1/20, 6:02 AM, "Karl Wright"  wrote:
>
> I've got time this weekend to make code changes, but I don't actually
> know how to proceed, so I think we're stuck.  What I need to have is
> instructions on how to set up a modern ES release with the mapper
> attachment or equivalent.  I currently download the latest ES release
> but find that the mapper attachment available from the Maven repo is
> incompatible with this version, and the new ES-supported mapper is not
> available until ES 8.0.  Help?!?
>
> If nobody knows how to resolve this right now, I would still release,
> and delay the work for updating ES to next release.  Thoughts?
>
> Karl
>
>


Should we continue to hold the 4/30 release until the ES testing is ironed out?

2020-05-01 Thread Karl Wright
I've got time this weekend to make code changes, but I don't actually know
how to proceed, so I think we're stuck.  What I need to have is
instructions on how to set up a modern ES release with the mapper
attachment or equivalent.  I currently download the latest ES release but
find that the mapper attachment available from the Maven repo is
incompatible with this version, and the new ES-supported mapper is not
available until ES 8.0.  Help?!?

If nobody knows how to resolve this right now, I would still release, and
delay the work for updating ES to next release.  Thoughts?

Karl


Re: Status of Elastic Search integration tests

2020-04-29 Thread Karl Wright
Can you suggest what components I need to download and configure for the
test, then?  Step-by-step directions would be very helpful, e.g. "download
es version x, and this component from this maven repo URL, and then unpack
here, and create this config file..."

The mapper attachment I download from the maven repo seems to not work with
the ES version I am downloading, which is why I'm stuck.  The new plugin
you pointed me at says:

Installation

Version 8.0.0 of the Elastic Stack has not yet been released.

On Wed, Apr 29, 2020 at 7:49 AM Michael Cizmar 
wrote:

> Right.  On my radar is refactoring of this to use the Elastic Java SDK.  If
> we use that then, in my view, the encoding of the document would be the
> responsibility of the SDK and one less thing to test.  (The Java SDK is
> somewhat complicated as well because they tend to rewrite underlying
> transmission pieces)
>
> For testing purposes currently we can install the mapper attachment and
> create an ingestion workflow to handle that.
>
> On Wed, Apr 29, 2020 at 6:44 AM Karl Wright  wrote:
>
> > The connector itself encodes binary documents and sends them to ES,
> > purportedly for the mapper attachment to process and convert to text.
> > The test used to exercise that.
> > Perhaps it's worth reviewing the connector code itself to see what is
> > outdated/legacy, and only test the parts that are not outdated?
> >
> > Specifically, my concern is that we need to support binary document
> > transmission to ES, and ES obviously needs to handle those for the
> > integration to work properly.
> >
> > Karl
> >
> >
> > On Wed, Apr 29, 2020 at 7:27 AM Michael Cizmar <mich...@michaelcizmar.com>
> > wrote:
> >
> > > There have been some changes to Elasticsearch, like reducing the
> > > document types and ingestion/mapper.  The mapper attachment I believe
> > > has been deprecated in favor of:
> > >
> > > https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html
> > >
> > > This should be incorporated into a pipeline.  Do we need something like
> > > this in our integration test?  I don't think it's the responsibility of
> > > the output connector to handle this.
> > >
> > > On Wed, Apr 29, 2020 at 5:25 AM Karl Wright wrote:
> > >
> > > > Hello all,
> > > >
> > > > I set up a branch (branches/CONNECTORS-1639) to work on the
> > > > elasticsearch test problem for JDK 11.  The branch downloads an ES and
> > > > a mapper attachment but it turns out that the mapper attachment is
> > > > apparently incompatible with the current (7.x) version of ES.  Does
> > > > anyone know whether the mapper attachment is still supported?  If so,
> > > > where can I find it in the Maven repo?
> > > > Karl
> > > >
> > >
> >
>


Re: Status of Elastic Search integration tests

2020-04-29 Thread Karl Wright
The connector itself encodes binary documents and sends them to ES,
purportedly for the mapper attachment to process and convert to text.  The
test used to exercise that.

Perhaps it's worth reviewing the connector code itself to see what is
outdated/legacy, and only test the parts that are not outdated?

Specifically, my concern is that we need to support binary document
transmission to ES, and ES obviously needs to handle those for the
integration to work properly.

Karl


On Wed, Apr 29, 2020 at 7:27 AM Michael Cizmar 
wrote:

> There have been some changes to Elasticsearch, like reducing the document
> types and ingestion/mapper.  The mapper attachment I believe has been
> deprecated in favor of:
>
>
> https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html
>
> This should be incorporated into a pipeline.  Do we need something like
> this in our integration test?  I don't think it's the responsibility of the
> output connector to handle this.
>
> On Wed, Apr 29, 2020 at 5:25 AM Karl Wright  wrote:
>
> > Hello all,
> >
> > I set up a branch (branches/CONNECTORS-1639) to work on the elasticsearch
> > test problem for JDK 11.  The branch downloads an ES and a mapper
> > attachment but it turns out that the mapper attachment is apparently
> > incompatible with the current (7.x) version of ES.  Does anyone know
> > whether the mapper attachment is still supported?  If so, where can I
> > find it in the Maven repo?
> >
> > Karl
> >
>


Status of Elastic Search integration tests

2020-04-29 Thread Karl Wright
Hello all,

I set up a branch (branches/CONNECTORS-1639) to work on the elasticsearch
test problem for JDK 11.  The branch downloads an ES and a mapper
attachment but it turns out that the mapper attachment is apparently
incompatible with the current (7.x) version of ES.  Does anyone know
whether the mapper attachment is still supported?  If so, where can I find
it in the Maven repo?

Karl


Re: Release schedule

2020-04-23 Thread Karl Wright
The problem with running anything under Ant is that it's not set up for
this kind of flow:

- start service
- run tests
- stop service

Ant is about building and is not a sequential language, so starting this
under Ant is the wrong idea.

Instead, we can invoke scripts at will from within the java test class
itself.  But we need both Linux and Windows scripts for that, and we
download only one ES instance, and therefore we get only one kind of script
to go with it.

I suppose I can platform-conditionalize the download itself so we get
different scripts for different platforms.  Since ES doesn't include all
script variants in every download, it seems we're kind of stuck with this.
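
A rough sketch of the platform check from within the test class, assuming the unpacked download has a bin directory containing either elasticsearch (Linux/macOS) or elasticsearch.bat (Windows). The class name, method names and the test-materials path are illustrative only:

```java
import java.io.File;
import java.util.Locale;

// Illustrative only; names and paths are assumptions.
class EsScriptLocator {
  /** Decides on the platform from the value of the os.name system property. */
  static boolean isWindows(String osName) {
    return osName.toLowerCase(Locale.ROOT).startsWith("windows");
  }

  /** Name of the startup script expected for the given platform. */
  static String scriptName(String osName) {
    return isWindows(osName) ? "elasticsearch.bat" : "elasticsearch";
  }

  /** Full path of the startup script under the unpacked ES directory. */
  static File startScript(File esHome, String osName) {
    return new File(new File(esHome, "bin"), scriptName(osName));
  }

  public static void main(String[] args) {
    File script = startScript(new File("test-materials/elasticsearch"),
        System.getProperty("os.name"));
    System.out.println(script.getPath());
  }
}
```

The same os.name test could also drive which archive gets downloaded, so that the script matching the platform is actually present.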

The other issue we have to address is waiting for ES to actually fully
start.  I believe the code we had did this via a specific HTTP Get
request fired at the instance, so maybe we can reuse that code.  There may
also be a way to shut ES down via a similar HTTP Get mechanism.

Can you verify that waiting for ES to come up and shutting down ES can be
done with the same mechanism as is currently in the test code?
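
To make the "wait until it is up" part concrete, here is a minimal sketch of polling an HTTP endpoint with GET until it answers 200 or a timeout expires. This is not the existing test code; the class name, the timeouts and the http://localhost:9200/ URL are assumptions. It covers only the startup wait, not shutdown.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only; not the existing ManifoldCF test code.
class HttpReadinessProbe {
  /** Issues one GET and reports whether the server answered HTTP 200. */
  static boolean isOk(String url) {
    HttpURLConnection conn = null;
    try {
      conn = (HttpURLConnection) new URL(url).openConnection();
      conn.setRequestMethod("GET");
      conn.setConnectTimeout(2000);
      conn.setReadTimeout(2000);
      return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
    } catch (IOException e) {
      // Connection refused, timeout or bad URL: treat as "not up yet".
      return false;
    } finally {
      if (conn != null) {
        conn.disconnect();
      }
    }
  }

  /** Polls the URL until it answers 200 or timeoutMillis has elapsed. */
  static boolean waitForOk(String url, long timeoutMillis, long pollMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (true) {
      if (isOk(url)) {
        return true;
      }
      if (System.currentTimeMillis() >= deadline) {
        return false;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    boolean up = waitForOk("http://localhost:9200/", 10_000, 500);
    System.out.println(up ? "ES is up" : "ES did not start in time");
  }
}
```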

Karl


On Thu, Apr 23, 2020 at 7:44 AM Michael Cizmar 
wrote:

> Karl,
>
> I found this:
>
> https://gquintana.github.io/2016/11/30/Testing-a-Java-and-Elasticsearch-50-application.html
>
> Which should solve the issue of running elastic in the background via Ant.
>
> I can also provide a simple setup script (sh) if that helps as well.
>
> Michael
>
> On Wed, Apr 22, 2020 at 12:35 PM Karl Wright  wrote:
>
> > I looked into trying to get things working under Ant and created a branch
> > CONNECTORS-1639 containing some changes relating to download of
> > elasticsearch artifacts.  I did a little exploration as to whether we
> > could use the Elasticsearch Runner package to start a cluster, but that
> > is really painful because it has a ton of dependencies, so I think I'll
> > just try calling the main class that the ES startup script uses and see
> > how we do that way.
> >
> > But I'm snowed under with work related tasks again so it will have to
> > wait.
> > Karl
> >
> >
> > On Sat, Apr 18, 2020 at 5:52 PM Michael Cizmar <
> mich...@michaelcizmar.com>
> > wrote:
> >
> > > I've updated the ticket with the changes:
> > >
> > >
> >
> https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1639
> > >
> > > On Sat, Apr 18, 2020 at 3:59 PM Cihad Guzel  wrote:
> > >
> > > > Thanks folks for your information.
> > > >
> > > > I reviewed the pom.xml of the  ES connector. It uses an old elastic
> > > search
> > > > version as dependency. It misled me. On the other hand, you are
> right.
> > I
> > > > agree with your thoughts on this matter. It is best if we can
> rearrange
> > > > them.
> > > >
> > > > Kind regards,
> > > > Cihad Guzel
> > > >
> > > >
> > > > > On Sat, 18 Apr 2020 at 17:49, Karl Wright wrote:
> > > >
> > > > > I can help with Ant tasks but I need information as to how you're
> > > > supposed
> > > > > to start the ES instance.  An ant task snippet would be sufficient
> I
> > > > think.
> > > > >
> > > > >
> > > > > On Sat, Apr 18, 2020 at 10:33 AM Michael Cizmar <
> > > > > michael.ciz...@mcplusa.com>
> > > > > wrote:
> > > > >
> > > > > > I believe so.  I only modified the Pom in the es connector
> project
> > > and
> > > > > > removed the Node references.  I know there is a way to do this in
> > ant
> > > > as
> > > > > > well.  I will look it up but may need some guidance on Ant.
> > > > > >
> > > > > > From: Karl Wright 
> > > > > > Sent: Saturday, April 18, 2020 9:15:15 AM
> > > > > > To: dev 
> > > > > > Subject: Re: Release schedule
> > > > > >
> > > > > > Hi Michael,
> > > > > > This has to run under Ant as well.  Any way to make that happen?
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > > On Sat, Apr 18, 2020 at 9:49 AM Michael Cizmar <
> > > > > michael.ciz...@mcplusa.com
> > > > > > >
> > > >

Re: Release schedule

2020-04-22 Thread Karl Wright
I looked into trying to get things working under Ant and created a branch
CONNECTORS-1639 containing some changes relating to download of
elasticsearch artifacts.  I did a little exploration as to whether we could
use the Elasticsearch Runner package to start a cluster, but that is really
painful because it has a ton of dependencies, so I think I'll just try
calling the main class that the ES startup script uses and see how we do
that way.

But I'm snowed under with work related tasks again so it will have to wait.

Karl


On Sat, Apr 18, 2020 at 5:52 PM Michael Cizmar 
wrote:

> I've updated the ticket with the changes:
>
> https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1639
>
> On Sat, Apr 18, 2020 at 3:59 PM Cihad Guzel  wrote:
>
> > Thanks folks for your information.
> >
> > I reviewed the pom.xml of the  ES connector. It uses an old elastic
> search
> > version as dependency. It misled me. On the other hand, you are right. I
> > agree with your thoughts on this matter. It is best if we can rearrange
> > them.
> >
> > Kind regards,
> > Cihad Guzel
> >
> >
> > On Sat, 18 Apr 2020 at 17:49, Karl Wright wrote:
> >
> > > I can help with Ant tasks but I need information as to how you're
> > supposed
> > > to start the ES instance.  An ant task snippet would be sufficient I
> > think.
> > >
> > >
> > > On Sat, Apr 18, 2020 at 10:33 AM Michael Cizmar <
> > > michael.ciz...@mcplusa.com>
> > > wrote:
> > >
> > > > I believe so.  I only modified the Pom in the es connector project
> and
> > > > removed the Node references.  I know there is a way to do this in ant
> > as
> > > > well.  I will look it up but may need some guidance on Ant.
> > > >
> > > > From: Karl Wright 
> > > > Sent: Saturday, April 18, 2020 9:15:15 AM
> > > > To: dev 
> > > > Subject: Re: Release schedule
> > > >
> > > > Hi Michael,
> > > > This has to run under Ant as well.  Any way to make that happen?
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Sat, Apr 18, 2020 at 9:49 AM Michael Cizmar <
> > > michael.ciz...@mcplusa.com
> > > > >
> > > > wrote:
> > > >
> > > > > I've got a fix for this.  I switched to using a Maven plugin that
> > spins
> > > > up
> > > > > an Elasticsearch instance.  With this, you need only to remove the
> > Node
> > > > > code in the integration tests.  Tested with 6.x client and 7.x
> > > > > elasticsearch.
> > > > >
> > > > > There are more things we can do with this output plugin in the
> future
> > > > like
> > > > > moving to the SDK.
> > > > >
> > > > > M
> > > > >
> > > > > On 4/18/20, 8:32 AM, "Karl Wright"  wrote:
> > > > >
> > > > > Thanks for the quick reply.
> > > > > I agree we don't want to turn off the ES connector itself, but
> > yes
> > > we
> > > > > will
> > > > > need to shut down the tests.  Cihad, would you like to propose
> a
> > > > > strategy
> > > > > for that?  I think for now just marking them with @Ignore
> should
> > be
> > > > OK,
> > > > > since the tests don't have compile time dependencies on missing
> > > > > classes.
> > > > > What do you think?
> > > > >
> > > > > Upgrading to ES 6.x is obviously the right thing to do but who
> > here
> > > > > has the
> > > > > knowledge to do a good job with this?  I am certain there are a
> > > > number
> > > > > of
> > > > > ES users lurking on this list.  Please volunteer if so.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Sat, Apr 18, 2020 at 9:15 AM Furkan KAMACI <
> > > > furkankam...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > There is a compatibility matrix for Elasticsearch. We need to
> > > > > support at
> > > > > > least

[jira] [Comment Edited] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-04-19 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087245#comment-17087245
 ] 

Karl Wright edited comment on CONNECTORS-1639 at 4/19/20, 11:09 PM:


Since this is firing off a java task, we could use the Ant Java task, described 
here:

https://ant.apache.org/manual/Tasks/java.html

However, this will not work for maven so I'd steer away from that.

We can do something similar to this instead, which would invoke the same Java 
again, although we'd want ES's classpath obviously:

{code}
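import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public final class JavaProcess {

  private JavaProcess() {}

  /** Runs the given main class in a child JVM and returns its exit code. */
  public static int exec(Class<?> klass, List<String> args) throws IOException,
      InterruptedException {
    // Reuse the java binary and classpath of the currently running JVM.
    String javaHome = System.getProperty("java.home");
    String javaBin = javaHome +
        File.separator + "bin" +
        File.separator + "java";
    String classpath = System.getProperty("java.class.path");
    String className = klass.getName();

    List<String> command = new ArrayList<>();
    command.add(javaBin);
    command.add("-cp");
    command.add(classpath);
    command.add(className);
    if (args != null) {
      command.addAll(args);
    }

    ProcessBuilder builder = new ProcessBuilder(command);

    // inheritIO() forwards the child's stdout and stderr to this process.
    Process process = builder.inheritIO().start();
    // waitFor() blocks until the child exits and returns its exit code.
    return process.waitFor();
  }

}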
{code}

There are, however, the following problems to be addressed: (1) waiting for the 
instance to start, and (2) stopping the instance.
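
For (2), note that exec() above blocks until the child exits, so for a 
long-running server we would have to keep the Process handle around instead of 
waiting on it.  As far as I know the REST shutdown endpoint was removed in ES 
2.0, so the sketch below stops the child process directly.  The class name and 
the grace period parameter are made up for this example:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: stops a child process started by the test.
final class ProcessStopper {
  private ProcessStopper() {}

  // Asks the process to terminate, escalating to a forced kill if it has
  // not exited within graceMillis. Returns the exit code of the process.
  static int stop(Process process, long graceMillis)
      throws InterruptedException {
    process.destroy();
    if (!process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
      process.destroyForcibly();
      process.waitFor();
    }
    return process.exitValue();
  }
}
```

The test would call this from a finally block (or an @AfterClass method) so 
the instance is torn down even when the test itself fails.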


was (Author: kwri...@metacarta.com):
Since this is firing off a java task, we could use the Ant Java task, described 
here:

https://ant.apache.org/manual/Tasks/java.html

However, this will not work for maven so I'd steer away from that.

We can do something similar to this instead, which would invoke the same Java 
again, although we'd want ES's classpath obviously:

{code}


> Upgrade Elastic Search Version
> --
>
> Key: CONNECTORS-1639
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1639
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Cihad Guzel
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
> Attachments: CONNECTORS-1639.diff, 
> elastic-search-1.0.1-java11-build-error.log
>
>
> Current Elastic Search version is 1.0.1 . According to [this 
> matrix|https://www.elastic.co/support/matrix#matrix_jvm], Java 11 is not 
> supported by any ES version below 6.5.
> Besides, ES 1.x is no longer supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-04-19 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087245#comment-17087245
 ] 

Karl Wright commented on CONNECTORS-1639:
-

Since this is firing off a java task, we could use the Ant Java task, described 
here:

https://ant.apache.org/manual/Tasks/java.html

However, this will not work for maven so I'd steer away from that.

We can do something similar to this instead, which would invoke the same Java 
again, although we'd want ES's classpath obviously:

{code}




[jira] [Commented] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-04-19 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086994#comment-17086994
 ] 

Karl Wright commented on CONNECTORS-1639:
-

It looks like we can reverse engineer the startup script here:

{code}
test-materials/elasticsearch-7.6.2/bin/elasticsearch
{code}

The basic daemon runner is:

{code}
  exec \
"$JAVA" \
$ES_JAVA_OPTS \
-Des.path.home="$ES_HOME" \
-Des.path.conf="$ES_PATH_CONF" \
-Des.distribution.flavor="$ES_DISTRIBUTION_FLAVOR" \
-Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
-Des.bundled_jdk="$ES_BUNDLED_JDK" \
-cp "$ES_CLASSPATH" \
org.elasticsearch.bootstrap.Elasticsearch \
"$@"
{code}

This can be shelled out easily enough in Java, BUT we'd also need to make sure 
it comes up before starting the test.
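
As a rough sketch, the same command line could be assembled in Java like this.  
It assumes ES_PATH_CONF is $ES_HOME/config and ES_CLASSPATH is $ES_HOME/lib/*, 
which I believe are the defaults for the archive distribution, and it leaves 
out ES_JAVA_OPTS and the es.distribution.* / es.bundled_jdk properties, which 
may turn out to be needed.  The class and method names are made up for this 
example:

```java
import java.io.File;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: mirrors the exec line of bin/elasticsearch.
final class EsCommandLine {
  static final String MAIN_CLASS = "org.elasticsearch.bootstrap.Elasticsearch";

  private EsCommandLine() {}

  static List<String> build(Path javaHome, Path esHome) {
    List<String> cmd = new ArrayList<>();
    cmd.add(javaHome.resolve("bin").resolve("java").toString());
    // Assumption: config lives in <esHome>/config (archive layout).
    cmd.add("-Des.path.home=" + esHome);
    cmd.add("-Des.path.conf=" + esHome.resolve("config"));
    cmd.add("-cp");
    // Assumption: all ES jars live in <esHome>/lib; the JVM expands "*".
    cmd.add(esHome.resolve("lib") + File.separator + "*");
    cmd.add(MAIN_CLASS);
    return cmd;
  }
}
```

The resulting list can be handed to new ProcessBuilder(cmd); because no shell 
is involved, the "*" in the classpath reaches the JVM unexpanded, which is what 
the classpath wildcard needs.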




[jira] [Comment Edited] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-04-19 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086926#comment-17086926
 ] 

Karl Wright edited comment on CONNECTORS-1639 at 4/19/20, 11:31 AM:


Unfortunately the dependencies for ElasticSearch Cluster Runner are huge.  See:

https://mvnrepository.com/artifact/org.codelibs/elasticsearch-cluster-runner/7.6.2.0

This is a problem because we'd need to download all these dependencies and 
their dependencies.  But it may be that the cluster runner does not actually 
use all these.

It may not even be the right thing to use.  It doesn't seem like it would be 
hard to write something that starts up a cluster based on a specific image.  
[~michaelcizmar], if you were starting a cluster from a downloaded, unpacked 
image, what steps would you take?

FWIW, if you want to try this, svn checkout 
https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1639 and then 
do the following:

{code}
ant make-core-deps
ant build
cd connectors/elasticsearch
ant download-dependencies
{code}

The unpacked ES download, and the mapper attachments plugin, should live in 
test-materials at that point.



was (Author: kwri...@metacarta.com):
Unfortunately the dependencies for ElasticSearch Cluster Runner are huge.  See:

https://mvnrepository.com/artifact/org.codelibs/elasticsearch-cluster-runner/7.6.2.0

This is a problem because we'd need to download all these dependencies and 
their dependencies.  But it may be that the cluster runner does not actually 
use all these.

It may not even be the right thing to use.  It doesn't seem like it would be 
hard to write something that starts up a cluster based on a specific image.  
[~michaelcizmar], if you were starting a cluster from a downloaded, unpacked 
image, what steps would you take?

(FWIW, if you want to try this, svn checkout 
https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1639, and then 
do: the following:

{code}
ant make-core-deps
ant build
cd connectors/elasticsearch
ant download-dependencies
{code}

The unpacked ES download, and the mapper attachments plugin, should live in 
test-materials at that point.




[jira] [Comment Edited] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-04-19 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086926#comment-17086926
 ] 

Karl Wright edited comment on CONNECTORS-1639 at 4/19/20, 11:31 AM:


Unfortunately the dependencies for ElasticSearch Cluster Runner are huge.  See:

https://mvnrepository.com/artifact/org.codelibs/elasticsearch-cluster-runner/7.6.2.0

This is a problem because we'd need to download all these dependencies and 
their dependencies.  But it may be that the cluster runner does not actually 
use all these.

It may not even be the right thing to use.  It doesn't seem like it would be 
hard to write something that starts up a cluster based on a specific image.  
[~michaelcizmar], if you were starting a cluster from a downloaded, unpacked 
image, what steps would you take?

FWIW, if you want to try this, svn checkout 
https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1639 and then 
do the following:

{code}
ant make-core-deps
ant build
cd connectors/elasticsearch
ant download-dependencies
{code}

The unpacked ES download, and the mapper attachments plugin, should live in 
test-materials at that point.



was (Author: kwri...@metacarta.com):
Unfortunately the dependencies for ElasticSearch Cluster Runner are huge.  See:

https://mvnrepository.com/artifact/org.codelibs/elasticsearch-cluster-runner/7.6.2.0





[jira] [Commented] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-04-19 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086926#comment-17086926
 ] 

Karl Wright commented on CONNECTORS-1639:
-

Unfortunately the dependencies for ElasticSearch Cluster Runner are huge.  See:

https://mvnrepository.com/artifact/org.codelibs/elasticsearch-cluster-runner/7.6.2.0





[jira] [Commented] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-04-19 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086902#comment-17086902
 ] 

Karl Wright commented on CONNECTORS-1639:
-

Here are some examples of how to invoke these from code:

https://www.programcreek.com/java-api-examples/?api=org.codelibs.elasticsearch.runner.ElasticsearchClusterRunner



[jira] [Commented] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-04-19 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086901#comment-17086901
 ] 

Karl Wright commented on CONNECTORS-1639:
-

I created a branch 
(https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1639) that 
updates the ant build to download ES 7.6.2 and mapper attachments version 
3.1.2, and unpacks them.  I hope these are compatible?  But in any case, all 
that should be needed now is incorporating Elasticsearch Runner to start the 
instance programmatically, if that is how it works.  Looking into that now.




[jira] [Commented] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-04-18 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086693#comment-17086693
 ] 

Karl Wright commented on CONNECTORS-1639:
-

The pom diff you included looks fine.  Once you've got a diff for the test 
itself I can make the rest of the ant changes.




[jira] [Commented] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-04-18 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086674#comment-17086674
 ] 

Karl Wright commented on CONNECTORS-1639:
-

[~michaelcizmar] Cihad Guzel has made some commits to get MCF JDK 11 compatible.

The elasticsearch maven plugin I can't find in the maven repo, but looking for 
it I found this:

https://mvnrepository.com/artifact/org.codelibs/elasticsearch-cluster-runner

It seems like this could in theory be used by our standard integration tests to 
start a cluster and tear it down, no?




Re: Release schedule

2020-04-18 Thread Karl Wright
I can help with Ant tasks but I need information as to how you're supposed
to start the ES instance.  An ant task snippet would be sufficient I think.


On Sat, Apr 18, 2020 at 10:33 AM Michael Cizmar 
wrote:

> I believe so.  I only modified the Pom in the es connector project and
> removed the Node references.  I know there is a way to do this in ant as
> well.  I will look it up but may need some guidance on Ant.
>
> From: Karl Wright 
> Sent: Saturday, April 18, 2020 9:15:15 AM
> To: dev 
> Subject: Re: Release schedule
>
> Hi Michael,
> This has to run under Ant as well.  Any way to make that happen?
>
> Karl
>
>
> On Sat, Apr 18, 2020 at 9:49 AM Michael Cizmar
> wrote:
>
> > I've got a fix for this.  I switched to using a Maven plugin that spins
> up
> > an Elasticsearch instance.  With this, you need only to remove the Node
> > code in the integration tests.  Tested with 6.x client and 7.x
> > elasticsearch.
> >
> > There are more things we can do with this output plugin in the future
> like
> > moving to the SDK.
> >
> > M
> >
> > On 4/18/20, 8:32 AM, "Karl Wright"  wrote:
> >
> > Thanks for the quick reply.
> > I agree we don't want to turn off the ES connector itself, but yes we
> > will
> > need to shut down the tests.  Cihad, would you like to propose a
> > strategy
> > for that?  I think for now just marking them with @Ignore should be
> OK,
> > since the tests don't have compile time dependencies on missing
> > classes.
> > What do you think?
> >
> > Upgrading to ES 6.x is obviously the right thing to do but who here
> > has the
> > knowledge to do a good job with this?  I am certain there are a
> number
> > of
> > ES users lurking on this list.  Please volunteer if so.
> >
> > Karl
> >
> >
> > On Sat, Apr 18, 2020 at 9:15 AM Furkan KAMACI <
> furkankam...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > There is a compatibility matrix for Elasticsearch. We need to
> > support at
> > > least Elasticsearch 6.5.x for Java 11 support. You can check it
> from
> > here:
> > > https://www.elastic.co/de/support/matrix#matrix_jvm
> > >
> > > @Cihad
> > >
> > > As far as I know, current support is not 2.0.0. It is 5.5.2:
> > > https://github.com/apache/manifoldcf-integration-elasticsearch-5.5
> > >
> > > @Karl Wright 
> > >
> > > So, such an upgrade from 5.5.2 to 6.5.x may not be so painful.
> > Committers
> > > who use ES can comment on this.
> > >
> > > My comments:
> > >
> > > +1 to temporarily turning those tests off
> > > -1 to temporarily turning the connector off
> > >
> > > Kind Regards,
> > > Furkan KAMACI
> > >
> > > On Sat, Apr 18, 2020 at 3:27 PM Cihad Guzel 
> > wrote:
> > >
> > >> Hi Karl,
> > >>
> > >> MFC ES Connector uses the Elastic Search 2.0.0 . It's an ancient
> > version.
> > >> The latest version is 7.6.2 . So, I agree with you and I think we
> > can
> > >> temporarily turn the connector off.
> > >>
> > >> +1
> > >>
> > >> Kind Regards,
> > >> Cihad Güzel
> > >>
> > >>
> > >> On Sat, 18 Apr 2020 at 11:41, Karl Wright wrote:
> > >>
> > >> > Hi all,
> > >> >
> > >> > We're due to release ManifoldCF 2.16 by April 30th.  The major
> > work for
> > >> > this release was adoption of Java 11, and that work is
> incomplete
> > >> because
> > >> > of ElasticSearch incompatibilities.  I'm therefore tempted to
> > hold the
> > >> > release until we at least have a plan for dealing with ES going
> > forward.
> > >> >
> > >> > It's not clear that our ES connector support is affected, but
> > certainly
> > >> our
> > >> > integration tests are, because Java 11 isn't supported in any of
> > the ES
> > >> > versions we run for those tests.  So at the least we need to
> > decide to
> > >> turn
> > >> > those off.  And indeed, we really need to have someone with ES
> > >> experience
> > >> > map a strategy for getting our ES support back into compliance
> > with
> > >> what's
> > >> > out in the world at large now.  Cihad Guzel did much work on
> Java
> > 11 but
> > >> > stumbled over the Elastic Search problem.  Any of our committers
> > who
> > >> know
> > >> > ES and are stuck inside at the moment, please speak up.
> > >> >
> > >> > Thanks in advance,
> > >> > Karl
> > >> >
> > >>
> > >
> >
> >
>


Re: Release schedule

2020-04-18 Thread Karl Wright
Hi Michael,
This has to run under Ant as well.  Any way to make that happen?

Karl


On Sat, Apr 18, 2020 at 9:49 AM Michael Cizmar 
wrote:

> I've got a fix for this.  I switched to using a Maven plugin that spins up
> an Elasticsearch instance.  With this, you need only to remove the Node
> code in the integration tests.  Tested with 6.x client and 7.x
> elasticsearch.
>
> There are more things we can do with this output plugin in the future like
> moving to the SDK.
>
> M
>
> On 4/18/20, 8:32 AM, "Karl Wright"  wrote:
>
> Thanks for the quick reply.
> I agree we don't want to turn off the ES connector itself, but yes we
> will
> need to shut down the tests.  Cihad, would you like to propose a
> strategy
> for that?  I think for now just marking them with @Ignore should be OK,
> since the tests don't have compile time dependencies on missing
> classes.
> What do you think?
>
> Upgrading to ES 6.x is obviously the right thing to do but who here
> has the
> knowledge to do a good job with this?  I am certain there are a number
> of
> ES users lurking on this list.  Please volunteer if so.
>
> Karl
>
>
> On Sat, Apr 18, 2020 at 9:15 AM Furkan KAMACI 
> wrote:
>
> > Hi,
> >
> > There is a compatibility matrix for Elasticsearch. We need to
> support at
> > least Elasticsearch 6.5.x for Java 11 support. You can check it from
> here:
> > https://www.elastic.co/de/support/matrix#matrix_jvm
> >
> > @Cihad
> >
> > As far as I know, current support is not 2.0.0. It is 5.5.2:
> > https://github.com/apache/manifoldcf-integration-elasticsearch-5.5
> >
> > @Karl Wright 
> >
> > So, such an upgrade from 5.5.2 to 6.5.x may not be so painful.
> Committers
> > who use ES can comment on this.
> >
> > My comments:
> >
> > +1 to temporarily turning those tests off
> > -1 to temporarily turning the connector off
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > On Sat, Apr 18, 2020 at 3:27 PM Cihad Guzel 
> wrote:
> >
> >> Hi Karl,
> >>
> >> MFC ES Connector uses the Elastic Search 2.0.0 . It's an ancient
> version.
> >> The latest version is 7.6.2 . So, I agree with you and I think we
> can
> >> temporarily turn the connector off.
> >>
> >> +1
> >>
> >> Kind Regards,
> >> Cihad Güzel
> >>
> >>
> >> On Sat, 18 Apr 2020 at 11:41, Karl Wright wrote:
> >>
> >> > Hi all,
> >> >
> >> > We're due to release ManifoldCF 2.16 by April 30th.  The major
> work for
> >> > this release was adoption of Java 11, and that work is incomplete
> >> because
> >> > of ElasticSearch incompatibilities.  I'm therefore tempted to
> hold the
> >> > release until we at least have a plan for dealing with ES going
> forward.
> >> >
> >> > It's not clear that our ES connector support is affected, but
> certainly
> >> our
> >> > integration tests are, because Java 11 isn't supported in any of
> the ES
> >> > versions we run for those tests.  So at the least we need to
> decide to
> >> turn
> >> > those off.  And indeed, we really need to have someone with ES
> >> experience
> >> > map a strategy for getting our ES support back into compliance
> with
> >> what's
> >> > out in the world at large now.  Cihad Guzel did much work on Java
> 11 but
> >> > stumbled over the Elastic Search problem.  Any of our committers
> who
> >> know
> >> > ES and are stuck inside at the moment, please speak up.
> >> >
> >> > Thanks in advance,
> >> > Karl
> >> >
> >>
> >
>
>


Re: Release schedule

2020-04-18 Thread Karl Wright
Thanks for the quick reply.
I agree we don't want to turn off the ES connector itself, but yes we will
need to shut down the tests.  Cihad, would you like to propose a strategy
for that?  I think for now just marking them with @Ignore should be OK,
since the tests don't have compile time dependencies on missing classes.
What do you think?

Upgrading to ES 6.x is obviously the right thing to do but who here has the
knowledge to do a good job with this?  I am certain there are a number of
ES users lurking on this list.  Please volunteer if so.

Karl


On Sat, Apr 18, 2020 at 9:15 AM Furkan KAMACI 
wrote:

> Hi,
>
> There is a compatibility matrix for Elasticsearch. We need to support at
> least Elasticsearch 6.5.x for Java 11 support. You can check it from here:
> https://www.elastic.co/de/support/matrix#matrix_jvm
>
> @Cihad
>
> As far as I know, current support is not 2.0.0. It is 5.5.2:
> https://github.com/apache/manifoldcf-integration-elasticsearch-5.5
>
> @Karl Wright 
>
> So, such an upgrade from 5.5.2 to 6.5.x may not be so painful. Committers
> who use ES can comment on this.
>
> My comments:
>
> +1 to temporarily turning those tests off
> -1 to temporarily turning the connector off
>
> Kind Regards,
> Furkan KAMACI
>
> On Sat, Apr 18, 2020 at 3:27 PM Cihad Guzel  wrote:
>
>> Hi Karl,
>>
>> MFC ES Connector uses Elastic Search 2.0.0. It's an ancient version.
>> The latest version is 7.6.2. So, I agree with you and I think we can
>> temporarily turn the connector off.
>>
>> +1
>>
>> Kind Regards,
>> Cihad Güzel
>>
>>
>> On Sat, Apr 18, 2020 at 11:41, Karl Wright wrote:
>>
>> > Hi all,
>> >
>> > We're due to release ManifoldCF 2.16 by April 30th.  The major work for
>> > this release was adoption of Java 11, and that work is incomplete
>> because
>> > of ElasticSearch incompatibilities.  I'm therefore tempted to hold the
>> > release until we at least have a plan for dealing with ES going forward.
>> >
>> > It's not clear that our ES connector support is affected, but certainly
>> our
>> > integration tests are, because Java 11 isn't supported in any of the ES
>> > versions we run for those tests.  So at the least we need to decide to
>> turn
>> > those off.  And indeed, we really need to have someone with ES
>> experience
>> > map a strategy for getting our ES support back into compliance with
>> what's
>> > out in the world at large now.  Cihad Guzel did much work on Java 11 but
>> > stumbled over the Elastic Search problem.  Any of our committers who
>> know
>> > ES and are stuck inside at the moment, please speak up.
>> >
>> > Thanks in advance,
>> > Karl
>> >
>>
>


Release schedule

2020-04-18 Thread Karl Wright
Hi all,

We're due to release ManifoldCF 2.16 by April 30th.  The major work for
this release was adoption of Java 11, and that work is incomplete because
of ElasticSearch incompatibilities.  I'm therefore tempted to hold the
release until we at least have a plan for dealing with ES going forward.

It's not clear that our ES connector support is affected, but certainly our
integration tests are, because Java 11 isn't supported in any of the ES
versions we run for those tests.  So at the least we need to decide to turn
those off.  And indeed, we really need to have someone with ES experience
map a strategy for getting our ES support back into compliance with what's
out in the world at large now.  Cihad Guzel did much work on Java 11 but
stumbled over the Elastic Search problem.  Any of our committers who know
ES and are stuck inside at the moment, please speak up.

Thanks in advance,
Karl


Re: Welcome Eric Pugh as a Lucene/Solr committer

2020-04-06 Thread Karl Wright
Welcome, Eric!


On Mon, Apr 6, 2020 at 9:52 AM Steve Rowe  wrote:

> Congrats and welcome Eric!
>
> --
> Steve
>
> > On Apr 6, 2020, at 8:21 AM, Jan Høydahl  wrote:
> >
> > Hi all,
> >
> > Please join me in welcoming Eric Pugh as the latest Lucene/Solr
> committer!
> >
> > Eric has been part of the Solr community for over a decade, as a code
> contributor, book author, company founder, blogger and mailing list
> contributor! We look forward to his future contributions!
> >
> > Congratulations and welcome! It is a tradition to introduce yourself
> with a brief bio, Eric.
> >
> > Jan Høydahl
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Delta deletion

2020-03-31 Thread Karl Wright
The framework uses the model to figure out what the seeds mean, and really
nothing else.  So making it dynamic would be pretty much useless.

Karl

On Tue, Mar 31, 2020 at 10:21 AM  wrote:

> Thanks Karl !
>
> The key thing of the solution I needed was : "Carrydown changes force the
> children that rely on them to be processed again"
>
> I have one last question : Is it possible to "dynamically" change the
> return value of the getConnectorModel method ? For example, is it possible
> to rely it on a job specification parameter ? Or is it too late because the
> framework has already considered the value at this step ?
>
> Julien
>
> -Original Message-
> From: Karl Wright 
> Sent: Friday, March 27, 2020 04:51
> To: dev 
> Subject: Re: Delta deletion
>
> Just to be clear, this is what I think would work:
>
> (1) Your addSeedDocuments() method adds your seeds.
> (2) Each seed document, when processed, either decides it's dead, or it
> calls IProcessActivity.addDocumentReference()
> to populate children AND pass on a "Parent is alive" value.
> (3) The seed document, inside processDocuments(), when it decides it is
> dead, does nothing other than call IProcessActivity.removeDocument() on
> itself.  If it's still alive it checks to be sure that it needs to index
> using the standard IProcessActivity method for that, and if so, calls the
> indexing method.
> (4) processDocuments, when called for a child document, looks for the
> "Parent is alive" value.  If it does not find it, it should call
> IProcessActivity.removeDocument() on itself.  If it does find it, it
> should check to be sure it needs indexing etc etc.  Note that if the parent
> deletes itself, the carrydown data from the parent will be removed -- it
> will not be changed, just gone.
>
> Note that the child document may be processed more than once in any given
> job run depending on the order things happen.  Carrydown changes force the
> children that rely on them to be processed again; that's how ManifoldCF
> keeps it all straight and consistent.
>
> I haven't thought through whether calling IProcessActivity.noDocument() is
> better than IProcessActivity.removeDocument().  I suspect it is because it
> will leave a versioned gravemarker around while
> IProcessActivity.removeDocument() gets rid of all traces of the document.
> You would know best which makes the most sense in your context.
>
> Karl
>
>
> On Thu, Mar 26, 2020 at 11:11 PM Karl Wright  wrote:
>
> > There is no such restriction.  The only requirement is that every
> > document deletes itself and cannot delete any other documents.
> >
> > Perhaps you can share some code snippets?
> >
> >
> > On Thu, Mar 26, 2020, 11:52 AM  wrote:
> >
> >> Hi Karl,
> >>
> >> I tried to use the carrydown mechanism to perform the delete of
> >> children documents but I am facing a problem:
> >>
> >> During the first crawl, the connector registers children documents of
> >> a document as carrydown data in the processDocuments method through
> >> the activities.addDocumentReference method, and all is working well.
> >> During a delta crawl, the addSeedDocuments method declares deleted
> >> parent documents, but in the processDocuments, although I am able to
> >> retrieve the child document ids of the parent document thanks to the
> >> carrydown data, I am unable to delete them. My guess is that the ids
> >> I want to delete have not been declared in the addSeedDocuments
> >> method. If this is correct, is there any way to change this behavior ?
> >> Otherwise, is there another way to do things ? As I cannot retrieve
> >> carrydown data in the addSeedDocuments I seem to be in a dead-end.
> >>
> >> Julien
> >>
> >>
> >> -Original Message-
> >> From: Julien Massiera
> >> Sent: Monday, March 9, 2020 14:24
> >> To: dev@manifoldcf.apache.org
> >> Subject: Re: Delta deletion
> >>
> >> Yes I consider the confluence connector complete.
> >>
> >> As you suggest, I will try to use the "carrydown" mechanism to do
> >> what I want.
> >>
> >> Thanks,
> >>
> >> Julien
> >>
> >> On 09/03/2020 13:59, Karl Wright wrote:
> >> > Do you consider the confluence connector in the branch complete?
> >> > If so I'll look at it as time permits later today.
> >> >
> >> > As far as your proposal is concerned, maintaining lists of
> >> > dependencies for all documents is quite expensive.  We do this for
> >

Re: Delta deletion

2020-03-26 Thread Karl Wright
Just to be clear, this is what I think would work:

(1) Your addSeedDocuments() method adds your seeds.
(2) Each seed document, when processed, either decides it's dead, or it
calls IProcessActivity.addDocumentReference()
to populate children AND pass on a "Parent is alive" value.
(3) The seed document, inside processDocuments(), when it decides it is
dead, does nothing other than call IProcessActivity.removeDocument() on
itself.  If it's still alive it checks to be sure that it needs to index
using the standard IProcessActivity method for that, and if so, calls the
indexing method.
(4) processDocuments, when called for a child document, looks for the
"Parent is alive" value.  If it does not find it, it should call
IProcessActivity.removeDocument() on itself.  If it does find it, it should
check to be sure it needs indexing etc etc.  Note that if the parent
deletes itself, the carrydown data from the parent will be removed -- it
will not be changed, just gone.
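
The four steps above can be sketched roughly as follows. This is a minimal illustration, not connector code: CrawlActivity is a hypothetical stand-in for the framework's activity object (it does not reproduce the real IProcessActivity signatures), the "parentAlive" key is an assumed name, and FakeActivity is an in-memory fake that only imitates the behavior described above, namely that carrydown data disappears when the parent removes itself.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the activity object; NOT the real IProcessActivity. */
interface CrawlActivity {
  void addChild(String parentId, String childId, String dataName, String dataValue);
  String[] parentData(String docId, String dataName);
  void removeSelf(String docId);
  void index(String docId);
}

final class CarrydownSketch {
  static final String ALIVE = "parentAlive"; // assumed carrydown key name

  /** Steps (2) and (3): a seed either removes itself or registers children and indexes. */
  static void processSeed(String id, boolean existsInRepo, List<String> childIds, CrawlActivity a) {
    if (!existsInRepo) {
      a.removeSelf(id); // a document only ever deletes itself
      return;
    }
    for (String child : childIds) {
      a.addChild(id, child, ALIVE, "true"); // pass "parent is alive" down to each child
    }
    a.index(id);
  }

  /** Step (4): a child removes itself when no "parent is alive" value is present. */
  static void processChild(String id, CrawlActivity a) {
    String[] alive = a.parentData(id, ALIVE);
    if (alive == null || alive.length == 0) {
      a.removeSelf(id);
      return;
    }
    a.index(id);
  }
}

/** In-memory fake used only to exercise the sketch. */
final class FakeActivity implements CrawlActivity {
  // (childId + "|" + dataName) -> (parentId -> value)
  private final Map<String, Map<String, String>> carrydown = new HashMap<>();
  final Set<String> indexed = new HashSet<>();
  final Set<String> removed = new HashSet<>();

  public void addChild(String parentId, String childId, String dataName, String dataValue) {
    carrydown.computeIfAbsent(childId + "|" + dataName, k -> new HashMap<>()).put(parentId, dataValue);
  }

  public String[] parentData(String docId, String dataName) {
    Map<String, String> values = carrydown.get(docId + "|" + dataName);
    return values == null ? new String[0] : values.values().toArray(new String[0]);
  }

  public void removeSelf(String docId) {
    indexed.remove(docId);
    removed.add(docId);
    // Imitate the framework: data carried down from a removed parent is gone.
    for (Map<String, String> values : carrydown.values()) {
      values.remove(docId);
    }
  }

  public void index(String docId) {
    removed.remove(docId);
    indexed.add(docId);
  }
}
```

In a real connector the framework decides when each child is reprocessed; the fake leaves that scheduling to the caller.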

Note that the child document may be processed more than once in any given
job run depending on the order things happen.  Carrydown changes force the
children that rely on them to be processed again; that's how ManifoldCF
keeps it all straight and consistent.

I haven't thought through whether calling IProcessActivity.noDocument() is
better than IProcessActivity.removeDocument().  I suspect it is because it
will leave a versioned gravemarker around while
IProcessActivity.removeDocument() gets rid of all traces of the document.
You would know best which makes the most sense in your context.

Karl


On Thu, Mar 26, 2020 at 11:11 PM Karl Wright  wrote:

> There is no such restriction.  The only requirement is that every document
> deletes itself and cannot delete any other documents.
>
> Perhaps you can share some code snippets?
>
>
> On Thu, Mar 26, 2020, 11:52 AM  wrote:
>
>> Hi Karl,
>>
>> I tried to use the carrydown mechanism to perform the delete of children
>> documents but I am facing a problem:
>>
>> During the first crawl, the connector registers children documents of a
>> document as carrydown data in the processDocuments method through the
>> activities.addDocumentReference method, and all is working well.
>> During a delta crawl, the addSeedDocuments method declares deleted parent
>> documents, but in the processDocuments, although I am able to retrieve the
>> child document ids of the parent document thanks to the carrydown data, I
>> am unable to delete them. My guess is that the ids I want to delete have
>> not been declared in the addSeedDocuments method. If this is correct, is
>> there any way to change this behavior ?
>> Otherwise, is there another way to do things ? As I cannot retrieve
>> carrydown data in the addSeedDocuments I seem to be in a dead-end.
>>
>> Julien
>>
>>
>> -Original Message-
>> From: Julien Massiera 
>> Sent: Monday, March 9, 2020 14:24
>> To: dev@manifoldcf.apache.org
>> Subject: Re: Delta deletion
>>
>> Yes I consider the confluence connector complete.
>>
>> As you suggest, I will try to use the "carrydown" mechanism to do what I
>> want.
>>
>> Thanks,
>>
>> Julien
>>
>> On 09/03/2020 13:59, Karl Wright wrote:
>> > Do you consider the confluence connector in the branch complete?
>> > If so I'll look at it as time permits later today.
>> >
>> > As far as your proposal is concerned, maintaining lists of
>> > dependencies for all documents is quite expensive.  We do this for hop
>> > counting and we basically tell people to only use it if they must,
>> > because of the huge amount of database overhead involved.  We also
>> > maintain "carrydown" data which is accessible during document
>> > processing.  It is typically used for ingestion, but maybe you could
>> > use that for a signal that child documents should delete themselves or
>> something.
>> >
>> > Major crawling model changes are a gigantic effort; there are always
>> > many things to consider and many problems encountered that need to be
>> > worked around.  If you are concerned simply with the load on your API
>> > to handle deletions, I'd suggest using one of the existing mechanisms
>> > for reducing that.  But I can see no straightforward way to
>> > incrementally add dependency deletion to the current framework.
>> >
>> > Karl
>> >
>> >
>> > On Mon, Mar 9, 2020 at 5:53 AM Julien Massiera <
>> > julien.massi...@francelabs.com> wrote:
>> >
>> >> Hi Karl,
>> >>
>> >> Now that I finished the confl

Re: Delta deletion

2020-03-26 Thread Karl Wright
There is no such restriction.  The only requirement is that every document
deletes itself and cannot delete any other documents.

Perhaps you can share some code snippets?


On Thu, Mar 26, 2020, 11:52 AM  wrote:

> Hi Karl,
>
> I tried to use the carrydown mechanism to perform the delete of children
> documents but I am facing a problem:
>
> During the first crawl, the connector registers children documents of a
> document as carrydown data in the processDocuments method through the
> activities.addDocumentReference method, and all is working well.
> During a delta crawl, the addSeedDocuments method declares deleted parent
> documents, but in the processDocuments, although I am able to retrieve the
> child document ids of the parent document thanks to the carrydown data, I
> am unable to delete them. My guess is that the ids I want to delete have
> not been declared in the addSeedDocuments method. If this is correct, is
> there any way to change this behavior ?
> Otherwise, is there another way to do things ? As I cannot retrieve
> carrydown data in the addSeedDocuments I seem to be in a dead-end.
>
> Julien
>
>
> -Original Message-
> From: Julien Massiera 
> Sent: Monday, March 9, 2020 14:24
> To: dev@manifoldcf.apache.org
> Subject: Re: Delta deletion
>
> Yes I consider the confluence connector complete.
>
> As you suggest, I will try to use the "carrydown" mechanism to do what I
> want.
>
> Thanks,
>
> Julien
>
> On 09/03/2020 13:59, Karl Wright wrote:
> > Do you consider the confluence connector in the branch complete?
> > If so I'll look at it as time permits later today.
> >
> > As far as your proposal is concerned, maintaining lists of
> > dependencies for all documents is quite expensive.  We do this for hop
> > counting and we basically tell people to only use it if they must,
> > because of the huge amount of database overhead involved.  We also
> > maintain "carrydown" data which is accessible during document
> > processing.  It is typically used for ingestion, but maybe you could
> > use that for a signal that child documents should delete themselves or
> something.
> >
> > Major crawling model changes are a gigantic effort; there are always
> > many things to consider and many problems encountered that need to be
> > worked around.  If you are concerned simply with the load on your API
> > to handle deletions, I'd suggest using one of the existing mechanisms
> > for reducing that.  But I can see no straightforward way to
> > incrementally add dependency deletion to the current framework.
> >
> > Karl
> >
> >
> > On Mon, Mar 9, 2020 at 5:53 AM Julien Massiera <
> > julien.massi...@francelabs.com> wrote:
> >
> >> Hi Karl,
> >>
> >> Now that I finished the confluence connector, I am getting back to
> >> the other one I was working on, and it would greatly help me to have
> >> your thoughts on my proposal below.
> >>
> >> Thanks,
> >> Julien
> >>
> >> On 02/03/2020 16:40, julien.massi...@francelabs.com wrote:
> >>> Hi Karl,
> >>>
> >>> Thanks for your answer.
> >>>
> >>> Your explanations validate what I was anticipating on the way MCF is
> >> currently implementing its model. As you stated, this does mean that
> >> in order to use the _DELETE model properly, the seeding process has
> >> to provide the complete list of deleted documents.
> >>> Yet wouldn't it be a useful improvement to update the
> >> activities.deleteDocument method (or create an additional delete
> >> method) so that it automatically, and optionally, removes the
> >> referenced documents of a document Id ?
> >>> For instance, since the activities.addDocumentReference method
> >>> already
> >> asks the document identifier of the "parent" document, couldn’t we
> >> maintain in postgres a list of "child ids" and use it during the
> >> delete process to delete them ?
> >>> This is very useful in the use case I already described but I am
> >>> sure it
> >> would be useful for other type of connectors and/or future
> >> connectors. The benefits of such modification increase with the number
> of crawled documents.
> >>> Here is an illustration of the benefits of this MCF modification:
> >>>
> >>> With my current connector, if my first crawl ingests 1M documents
> >>> and on
> >> the delta crawl only 1 document that has 2 children is deleted, it
> >> must r

[jira] [Assigned] (CONNECTORS-1639) Upgrade Elastic Search Version

2020-03-22 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1639:
---

Assignee: Karl Wright

> Upgrade Elastic Search Version
> --
>
> Key: CONNECTORS-1639
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1639
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Cihad Guzel
>    Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
> Attachments: elastic-search-1.0.1-java11-build-error.log
>
>
> Current Elastic Search version is 1.0.1. According to [this 
> matrix|https://www.elastic.co/support/matrix#matrix_jvm], Java 11 is not 
> supported by any ES version below 6.5.
> Besides, ES 1.x is no longer supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


ElasticSearch-literate volunteers needed for JDK 11 port

2020-03-22 Thread Karl Wright
Hi All,

The version of ElasticSearch we support apparently is incompatible with JDK
11.  We therefore will need to update that connector and the associated
tests as well.  See:

https://issues.apache.org/jira/browse/CONNECTORS-1624?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17059835


I would ask anyone with any recent experience with ElasticSearch to please
step forward and lend a hand with this exercise.  Any volunteers?

Thanks in advance,
Karl


[jira] [Commented] (CONNECTORS-1624) Get ManifoldCF to run under Java 11 or higher

2020-03-22 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064298#comment-17064298
 ] 

Karl Wright commented on CONNECTORS-1624:
-

I will request volunteers in dev group for the ElasticSearch task.


> Get ManifoldCF to run under Java 11 or higher
> -
>
> Key: CONNECTORS-1624
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1624
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Framework core
>    Reporter: Karl Wright
>    Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
>
> Java 11 doesn't include a number of classes that Java 8 does.  We need to 
> explicitly include jars that provide these classes or ManifoldCF will not 
> function under higher Java revs.
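
One way to find out which classes a given JVM no longer provides is to probe for them by name before deciding which jars to add. The sketch below is only an illustration: MissingClassProbe is a made-up helper, and the class names to probe (for example javax.activation.DataSource, reported missing in CONNECTORS-1646) are supplied by the caller rather than taken from any definitive list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of ManifoldCF. */
final class MissingClassProbe {
  /** Returns true when the named class can be loaded without initializing it. */
  static boolean isLoadable(String className) {
    try {
      Class.forName(className, false, MissingClassProbe.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  /** Returns the subset of the given class names that cannot be loaded. */
  static List<String> missing(String... classNames) {
    List<String> result = new ArrayList<>();
    for (String name : classNames) {
      if (!isLoadable(name)) {
        result.add(name);
      }
    }
    return result;
  }
}
```

Calling MissingClassProbe.missing("javax.activation.DataSource") on the target JVM and classpath returns the names that still need a jar.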



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Welcome Alessandro Benedetti as a Lucene/Solr committer

2020-03-18 Thread Karl Wright
Welcome, Alessandro!
Karl

On Wed, Mar 18, 2020 at 9:01 AM David Smiley 
wrote:

> Hi all,
>
> Please join me in welcoming Alessandro Benedetti as the latest Lucene/Solr
> committer!
>
> Alessandro has been contributing to Lucene and Solr in areas such as More
> Like This, Synonym boosting, and Suggesters, and other areas for years.
> Furthermore he's been a help to many users on the solr-user mailing list
> and has helped others through his blog posts and presentations about
> search.  We look forward to his future contributions.
>
> Congratulations and welcome!  It is a tradition to introduce yourself with
> a brief bio, Alessandro.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>


Re: Delta deletion

2020-03-09 Thread Karl Wright
Do you consider the confluence connector in the branch complete?
If so I'll look at it as time permits later today.

As far as your proposal is concerned, maintaining lists of dependencies for
all documents is quite expensive.  We do this for hop counting and we
basically tell people to only use it if they must, because of the huge
amount of database overhead involved.  We also maintain "carrydown" data
which is accessible during document processing.  It is typically used for
ingestion, but maybe you could use that for a signal that child documents
should delete themselves or something.

Major crawling model changes are a gigantic effort; there are always many
things to consider and many problems encountered that need to be worked
around.  If you are concerned simply with the load on your API to handle
deletions, I'd suggest using one of the existing mechanisms for reducing
that.  But I can see no straightforward way to incrementally add dependency
deletion to the current framework.

Karl


On Mon, Mar 9, 2020 at 5:53 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:

> Hi Karl,
>
> Now that I finished the confluence connector, I am getting back to the
> other one I was working on, and it would greatly help me to have your
> thoughts on my proposal below.
>
> Thanks,
> Julien
>
> On 02/03/2020 16:40, julien.massi...@francelabs.com wrote:
> > Hi Karl,
> >
> > Thanks for your answer.
> >
> > Your explanations validate what I was anticipating on the way MCF is
> currently implementing its model. As you stated, this does mean that in
> order to use the _DELETE model properly, the seeding process has to provide
> the complete list of deleted documents.
> >
> > Yet wouldn't it be a useful improvement to update the
> activities.deleteDocument method (or create an additional delete method) so
> that it automatically, and optionally, removes the referenced documents
> of a document Id ?
> >
> > For instance, since the activities.addDocumentReference method already
> asks the document identifier of the "parent" document, couldn’t we maintain
> in postgres a list of "child ids" and use it during the delete process to
> delete them ?
> >
> > This is very useful in the use case I already described but I am sure it
> would be useful for other type of connectors and/or future connectors. The
> benefits of such modification increase with the number of crawled documents.
> >
> > Here is an illustration of the benefits of this MCF modification:
> >
> > With my current connector, if my first crawl ingests 1M documents and on
> the delta crawl only 1 document that has 2 children is deleted, it must
> rely on the processDocument method to check the version of each of the 1M
> documents to figure out and delete the 3 concerned ones (so at least 1M
> calls to the API of the targeted repository). With the suggested optional
> modification, the seeding process would use the delta API of the targeted
> repository and declare the parent document (only one API call), then the
> processDocuments method would be triggered only once to check the
> version of the document (one more API call), figure out that it no longer
> exists, and delete it along with its 2 children. That's 2 API calls vs 1M.
> Even if the framework has to perform one more request to postgres,
> I think it is worth the processing time.
> >
> > What do you think ?
> >
> > Julien
> >
> > -Original Message-
> > From: Karl Wright 
> > Sent: Saturday, February 29, 2020 15:51
> > To: dev 
> > Subject: Re: Delta deletion
> >
> > Hi Julien,
> >
> > First, ALL models rely on individual existence checks for documents.
> That is, when your connector fetches a deleted document, the framework has
> to be told that the document is gone, or it will not be removed.  There is
> no "discovery" process for deleted documents other than seeding (and only
> when the model includes _DELETE).
> >
> > The upshot of this is that IF your seeding method does not return all
> documents that have been removed THEN it cannot be a _DELETE model.
> >
> > I hope this helps.
> >
> > Karl
> >
> >
> > On Sat, Feb 29, 2020 at 8:10 AM  wrote:
> >
> >> Hi dev community,
> >>
> >>
> >>
> >> I am trying to develop a connector for an API that exposes a
> >> hierarchical arborescence of documents: each document can have children
> documents.
> >>
> >> During the init crawl, the child documents are referenced in the MCF
> >> connector through the method
> >> activities.addDocumentRefenrece(c

[jira] [Commented] (CONNECTORS-1637) New Confluence connector

2020-03-06 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053563#comment-17053563
 ] 

Karl Wright commented on CONNECTORS-1637:
-

At the root level:

{code}
ant make-core-deps
ant build
{code}

This will download the dependencies, then build the framework and build all the 
connectors.  After that you can debug your build locally for your connector by:

{code}
cd connectors/
ant build
{code}


> New Confluence connector
> 
>
> Key: CONNECTORS-1637
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1637
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Confluence connector
>Reporter: Julien Massiera
>Assignee: Julien Massiera
>Priority: Major
>
> We need to address 3 main issues of the current Confluence connector :
> - it does not correctly implement security 
> - it has performance problems when handling a huge dataset 
> - it generates a version string for documents that is not sufficient to 
> detect all changes
> To resolve some of these issues, the connector has to use the new confluence 
> API which is available from the v6. For that reason we need to release a new 
> connector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1637) New Confluence connector

2020-03-06 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053389#comment-17053389
 ] 

Karl Wright commented on CONNECTORS-1637:
-

I saw you committed code in a branch (CONNECTORS-1637) for this.  Wonderful!
Can you let me know when you think the branch is done?
Key things: getting the ant build to work, and making sure ant rat-sources is 
happy with your Apache headers.



> New Confluence connector
> 
>
> Key: CONNECTORS-1637
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1637
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Confluence connector
>Reporter: Julien Massiera
>Assignee: Julien Massiera
>Priority: Major
>
> We need to address 3 main issues of the current Confluence connector :
> - it does not correctly implement security 
> - it has performance problems when handling a huge dataset 
> - it generates a version string for documents that is not sufficient to 
> detect all changes
> To resolve some of these issues, the connector has to use the new confluence 
> API which is available from the v6. For that reason we need to release a new 
> connector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: New confluence connector

2020-03-05 Thread Karl Wright
Hi Julien,

I agree if there are significant version restrictions it should be released
as a different connector than the one that is already there.

Can you create an svn branch and start setting this up in it?  There should
also be a ticket created that you can use to name your branch.  It's
important to make sure there are no name collisions with the existing
connector, of course.

Thanks!
Karl


On Thu, Mar 5, 2020 at 12:59 PM  wrote:

> Hi Karl,
>
>
>
> I have developed a new Confluence connector based on the existing one in
> order to address 3 main issues :
>
> - the current connector does not correctly implement security
>
> - the current connector has performance problems when handling a huge
> dataset
>
> - the current connector generates a version string for documents that is
> not
> sufficient to detect all changes
>
>
>
> I would like to contribute it to the MCF community. But I am not sure what
> is the best way to do it, since the core code is very different from the
> original one. Furthermore it is only compatible from confluence v6 and
> above. So I was thinking that we may release it as a different connector,
> but I'd prefer to have your opinion on that.
>
>
>
> What do you think?
>
>
>
> Best regards,
>
> Julien
>
>


Re: Delta deletion

2020-02-29 Thread Karl Wright
Hi Julien,

First, ALL models rely on individual existence checks for documents.  That
is, when your connector fetches a deleted document, the framework has to be
told that the document is gone, or it will not be removed.  There is no
"discovery" process for deleted documents other than seeding (and only when
the model includes _DELETE).

The upshot of this is that IF your seeding method does not return all
documents that have been removed THEN it cannot be a _DELETE model.
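
Put as code, the rule reads roughly like the sketch below. The enum values are illustrative names only and are not the framework's actual model constants; the point is simply that a connector should claim no more than what its seeding method can guarantee.

```java
/** Illustrative model names only; the real framework constants may be named differently. */
enum SeedingModel { ADD, ADD_CHANGE, ADD_CHANGE_DELETE }

/** Hypothetical helper; not part of ManifoldCF. */
final class ModelChoice {
  /**
   * Picks the strongest model that seeding can honestly back up.
   * A _DELETE style model is only valid when seeding returns ALL removed documents.
   */
  static SeedingModel strongestModel(boolean reportsAllChanges, boolean reportsAllDeletions) {
    if (!reportsAllChanges) {
      return SeedingModel.ADD;
    }
    return reportsAllDeletions ? SeedingModel.ADD_CHANGE_DELETE : SeedingModel.ADD_CHANGE;
  }
}
```

In the case described below, the API reports a deleted parent but not its deleted children, so reportsAllDeletions would be false and the sketch falls back to ADD_CHANGE.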

I hope this helps.

Karl


On Sat, Feb 29, 2020 at 8:10 AM  wrote:

> Hi dev community,
>
>
>
> I am trying to develop a connector for an API that exposes a hierarchical
> tree of documents: each document can have child documents.
>
> During the init crawl, the child documents are referenced in the MCF
> connector through the method
> activities.addDocumentReference(childDocumentIdentifier,
> parentDocumentIdentifier, parentDataNames, parentDataValues)
>
> The API is able to provide delta modifications/deletions from a provided
> date but, when a document that has children is deleted, the API only
> returns
> the id of the document, not its children. On the MCF connector side, I
> thought that, as I have referenced the children, by deleting the parent
> document all its children would be deleted with it, but it appears that it
> is not the case.
>
> So my question is: did I miss something? Is there another way to perform
> delta deletions? Unfortunately, if I don't find a way to solve this issue,
> I will not be able to take advantage of the delta feature and thus I will
> have to use the "add_modify" connector type and test every id on a delta
> crawl to figure out which ids are missing. This would be a huge loss of
> performance.
>
>
>
> Regards,
>
> Julien Massiera
>
>


[jira] [Commented] (CONNECTORS-1629) Support Solr Kerberos Authentication

2020-02-16 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038017#comment-17038017
 ] 

Karl Wright commented on CONNECTORS-1629:
-

Great news!
I'll close the ticket.


> Support Solr Kerberos Authentication
> 
>
> Key: CONNECTORS-1629
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1629
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Affects Versions: ManifoldCF 2.14
>Reporter: Jörn Franke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
>
> Several enterprise deployments of Solr are leveraging SolrCloud Kerberos 
> authentication.
> The integration seems to be rather simple and the goal of this Jira is to 
> evaluate the steps potentially needed to eventually contribute the Kerberos 
> integration to the ManifoldCF project.
> The following steps would be needed:
>  * One can pass the JVM parameter java.security.auth.login.config to the 
> ManifoldCF JVM using -Djava.security.auth.login.config=/path/to/jaas.conf in 
> which Kerberos authentication details, such as keytab and principal that has 
> the right access to Solr is configured
>  * A small adaption to the SolrCloudClient that is used within Manifold needs 
> to be done to enable Kerberos authentication: 
> HttpClientUtil.setConfigurer(new Krb5HttpClientConfigurer());
> Should this be integrated in Manifold, one may want to consider one input 
> field in the configuration in the UI where one can select / flow which user 
> defined in the Jaas conf (you can define multiple one) should be chosen. By 
> default one may simply select "client" or "SolrJClient" if Jaas.conf is 
> present in the System properties. This does not mean the user needs to be 
> named like this, but the configuration entry referencing any user should be 
> named like this.
> Having a confiugration allows to have a different users per flow. This might 
> also be needed in case you have multiple Solr clusters. 
> Related discussion 
> [http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201912.mbox/browser]
> SolrJ Kerberos integration: 
> [https://lucene.apache.org/solr/guide/8_3/kerberos-authentication-plugin.html#using-solrj-with-a-kerberized-solr]
> Jaas conf documentation: 
> [https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/LoginConfigFile.html]
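
For illustration only, here is a minimal Java sketch of the first step in the
description above: it writes a JAAS entry for the JDK's Krb5LoginModule and
points java.security.auth.login.config at it. The keytab path and principal
are made up, the entry name "SolrJClient" is the one suggested in the
description, and nothing here is taken from ManifoldCF's code. The SolrJ-side
call quoted in the description depends on the SolrJ version in use, so it is
deliberately left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class JaasConfigSketch {
  /** Renders one JAAS login entry that authenticates from a keytab. */
  static String renderEntry(String entryName, String keytabPath, String principal) {
    return entryName + " {\n"
        + "  com.sun.security.auth.module.Krb5LoginModule required\n"
        + "  useKeyTab=true\n"
        + "  keyTab=\"" + keytabPath + "\"\n"
        + "  storeKey=true\n"
        + "  useTicketCache=false\n"
        + "  principal=\"" + principal + "\";\n"
        + "};\n";
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical keytab path and principal; substitute real values.
    String entry = renderEntry(
        "SolrJClient", "/etc/security/keytabs/mcf.keytab", "mcf@EXAMPLE.COM");
    Path conf = Files.createTempFile("jaas", ".conf");
    Files.write(conf, entry.getBytes(StandardCharsets.UTF_8));
    // Equivalent to -Djava.security.auth.login.config=<path> on the command line,
    // as long as it is set before the first JAAS login is attempted.
    System.setProperty("java.security.auth.login.config", conf.toString());
    System.out.print(entry);
  }
}
```

Setting the property in code only helps if it happens before the first JAAS
login; passing -D on the command line, as the description suggests, avoids
that ordering concern.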



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1629) Support Solr Kerberos Authentication

2020-02-16 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037904#comment-17037904
 ] 

Karl Wright commented on CONNECTORS-1629:
-

Hi,

Your symptom sounds like stuck locks, which can happen if you're using 
file-based sync and a multiprocess model and you kill processes with kill -9 
rather than shutting them down gracefully.
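
To make the kill -9 point concrete, here is a small Java sketch. It is a
stand-in for file-based locking in general, not ManifoldCF's actual lock
manager: the class name, the lock-file name and the temp directory are all
invented for the example. The lock is released from a JVM shutdown hook, which
runs on a normal exit or a plain kill (SIGTERM) but never on kill -9, so in
that case the file is left behind.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

final class LockFileSketch {
  /** Takes the lock by creating the file atomically; false if it already exists. */
  static boolean tryLock(Path lockFile) throws IOException {
    try {
      Files.createFile(lockFile);
      return true;
    } catch (FileAlreadyExistsException e) {
      return false;
    }
  }

  /** Releases the lock; safe to call when the file is already gone. */
  static void unlock(Path lockFile) throws IOException {
    Files.deleteIfExists(lockFile);
  }

  public static void main(String[] args) throws Exception {
    Path lock = Files.createTempDirectory("sync").resolve("job.lock");
    if (!tryLock(lock)) {
      System.err.println("lock is held (or was left behind): " + lock);
      return;
    }
    // Runs on normal exit and on SIGTERM. SIGKILL (kill -9) ends the JVM without
    // running hooks, so the lock file stays on disk and looks "stuck".
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      try {
        unlock(lock);
      } catch (IOException ignored) {
        // Nothing more can be done while shutting down.
      }
    }));
    System.out.println("holding " + lock);
  }
}
```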



[jira] [Assigned] (CONNECTORS-1636) ElasticSearch Connector not working with ingest pipeline processor attachment

2020-02-14 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1636:
---

Assignee: Karl Wright

> ElasticSearch Connector not working with ingest pipeline processor attachment
> -
>
> Key: CONNECTORS-1636
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1636
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.15
>Reporter: Rohit Batta
>Assignee: Karl Wright
>Priority: Major
>  Labels: ManifoldCF, connector, elasticsearch, manifoldcf
> Fix For: ManifoldCF next
>
>
> While using the Apache ManifoldCF Elasticsearch connector with Elasticsearch
> version 6.6.x, I found that the connector does not work as expected with the
> ingest pipeline processor "attachment".
> The processor requires a Base64 string in order to turn the input stream into content.
> It works with the "mapper-attachment" plugin, but that plugin is deprecated in
> newer versions of Elasticsearch.
> When an Elasticsearch pipeline is used and mapper-attachment is set to false,
> the content is processed as a byte array to index the document, which is not
> the correct type for indexing to Elasticsearch.
>  
> {code:java}
> if (!useMapperAttachments && inputStream != null) {
>   if (contentAttributeName != null) {
>     Reader r = new InputStreamReader(inputStream, Consts.UTF_8);
>     if (needComma) {
>       pw.print(",");
>     }
>     pw.append(jsonStringEscape(contentAttributeName)).append(" : \"");
>     char[] buffer = new char[65536];
>     while (true) {
>       int amt = r.read(buffer, 0, buffer.length);
>       if (amt == -1)
>         break;
>       for (int j = 0; j < amt; j++) {
>         final char x = buffer[j];
>         if (x == '\n')
>           pw.append('\\').append('n');
>         else if (x == '\r')
>           pw.append('\\').append('r');
>         else if (x == '\t')
>           pw.append('\\').append('t');
>         else if (x == '\b')
>           pw.append('\\').append('b');
>         else if (x == '\f')
>           pw.append('\\').append('f');
>         else if (x < 32) {
>           pw.append("\\u").append(String.format(Locale.ROOT, "%04x", (int) x));
>         } else {
>           if (x == '\"' || x == '\\' || x == '/')
>             pw.append('\\');
>           pw.append(x);
>         }
>       }
>     }
>     pw.append("\"");
>     needComma = true;
>   }
> }
> {code}
>  
>  
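
A possible direction for a patch, as a sketch only: encode the document stream
as Base64 before writing it into the JSON body. This uses java.util.Base64
from the JDK; the class name, the method name and the "data" field name are
placeholders and are not taken from the connector.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64ContentSketch {
  /** Reads the stream to the end and returns its bytes as one Base64 string. */
  static String toBase64(InputStream in) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    // The wrapping stream encodes as it goes; closing it writes the final padding.
    try (OutputStream enc = Base64.getEncoder().wrap(sink)) {
      byte[] buffer = new byte[8192];
      int n;
      while ((n = in.read(buffer)) != -1) {
        enc.write(buffer, 0, n);
      }
    }
    return new String(sink.toByteArray(), StandardCharsets.US_ASCII);
  }

  public static void main(String[] args) throws IOException {
    InputStream doc = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
    // "data" is a placeholder; use whichever field the pipeline's processor reads.
    System.out.println("{\"data\" : \"" + toBase64(doc) + "\"}");
  }
}
```

Base64 output contains only letters, digits, '+', '/' and '=', so the
character-by-character JSON escaping in the snippet above would not be needed
for that field.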





[jira] [Commented] (CONNECTORS-1636) ElasticSearch Connector not working with ingest pipeline processor attachment

2020-02-14 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037006#comment-17037006
 ] 

Karl Wright commented on CONNECTORS-1636:
-

Patches welcome.
ElasticSearch changes so quickly it's very hard to keep up.




[jira] [Commented] (CONNECTORS-1617) Date format extraction problem in XLS/XLSX

2020-02-13 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036165#comment-17036165
 ] 

Karl Wright commented on CONNECTORS-1617:
-

Jira -> Create -> pull down "TIKA" in the "Project" pulldown.


> Date format extraction problem in XLS/XLSX
> --
>
> Key: CONNECTORS-1617
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1617
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Tika extractor, Tika service connector
>Affects Versions: ManifoldCF 2.10
>    Reporter: Zoltan Farago
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
> Attachments: exceldatum.xlsx
>
>
> Currently TIKA/ManifoldCF 2.10 extracts dates from the attached file this way:
> 2018.05.10 -> 10/05/18
> 2002.02.02 -> 2/2/2
> We need this format:
> 2018.05.10 -> 2018-05-10
> 2002.02.02 -> 2002-02-02
> This occurs only when the field type is date. When the field type is text 
> then the output is fine.
>  
> Please help us with a recommendation with any settings in the pipeline (Tika 
> configs, excel setting, OS local settings, etc.), or provide a fix. 





Re: Extraction of related links

2020-02-12 Thread Karl Wright
This is not functionality that ManifoldCF supports out of the box.  The
extracted links are used for crawling, not as metadata.

I don't see a general use-case for this either, so I think you're on your
own modifying the web connector code to do what you want.  The
RepositoryDocument structure has arbitrary multi-valued fields; just put
what you want into one such field and you should see it in Elastic Search.
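
A rough sketch of that idea in Java, under stated assumptions: the Map below
is only a stand-in for the document's multi-valued fields and is not the real
RepositoryDocument class (check its Javadoc for the actual method that takes a
String[] value), and the field name "related_links" and the class and method
names are invented. It keeps the links that sit underneath the document's own
URL and stores them as one multi-valued field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RelatedLinksSketch {
  /** Keeps only links underneath the document's own URL, without duplicates. */
  static String[] sublinksOf(String documentUrl, List<String> extractedLinks) {
    String prefix = documentUrl.endsWith("/") ? documentUrl : documentUrl + "/";
    List<String> kept = new ArrayList<>();
    for (String link : extractedLinks) {
      if (link.startsWith(prefix) && !kept.contains(link)) {
        kept.add(link);
      }
    }
    return kept.toArray(new String[0]);
  }

  public static void main(String[] args) {
    // Stand-in for the document's fields; a plain Map, not RepositoryDocument.
    Map<String, String[]> fields = new LinkedHashMap<>();
    List<String> links = Arrays.asList(
        "www.xyz.com/abc", "www.other.com/x", "www.xyz.com/pqr");
    fields.put("related_links", sublinksOf("www.xyz.com", links));
    System.out.println(Arrays.toString(fields.get("related_links")));
    // prints [www.xyz.com/abc, www.xyz.com/pqr]
  }
}
```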

Karl


On Thu, Feb 13, 2020 at 1:57 AM ritika jain 
wrote:

> Hi All,
>
> I am using ManifoldCF 2.12, with the Web connector as repository and ES as output.
> As per a new requirement, I want to save all related sub-links of a
> particular document identifier (one at a time). For example, for the document ID
> www.xyz.com, I would like to extract all related sub-links, say
> www.xyz.com/abc, www.xyz.com/pqr, etc., save them in a variable and then
> pass it to Elasticsearch.
>
> I went through the Web connector code and thought the function extractLinks
> (protected boolean extractLinks(String documentIdentifier,
> IProcessActivity activities, DocumentURLFilter filter)) could do this.
> Can the existing functionality of ManifoldCF do this extraction, or do we have
> to customize it? Any help would be appreciated.
>
>
> Thanks
> Ritika
>


[jira] [Resolved] (CONNECTORS-1617) Date format extraction problem in XLS/XLSX

2020-02-06 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1617.
-
Fix Version/s: ManifoldCF 2.16
   Resolution: Won't Fix

I'm marking this as "won't fix" although it should really be "can't fix".  If a 
Tika ticket gets created to address date format configurability then please 
include it here; if there's already some configurability present we can work 
with that.  Thanks!



[jira] [Commented] (CONNECTORS-1617) Date format extraction problem in XLS/XLSX

2020-02-06 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031505#comment-17031505
 ] 

Karl Wright commented on CONNECTORS-1617:
-

The internal Tika extractor treats all metadata as strings, using the Tika 
library.  I don't think the date format is configurable.  Indeed, there's a 
blog post on this:

https://grokbase.com/t/tika/user/10982he7yd/how-can-i-configure-tika-to-extract-dates-in-single-format

Note that Tika tries to maintain the date format present in the original 
spreadsheet!!

The solution proposed when you want a specific date format is this:

{quote}
* Write your own excel parser for Tika, which ignores the date formatting
set for cells, and always uses iso8601
{quote}

That's not going to cut it here because we don't have any information that 
would allow us to autodetect the incoming format properly.  It's basically just 
a text file and there are no hints, especially for dates like "01-01-2010".  
Which comes first, the day or the month?

The external Tika extractor has even less configurability because you cannot 
run custom code there.

Now, suppose all you want to do is post-process just *dates* to change the 
separator character.  Well, we do not even know whether the field being returned 
from Tika is a date.  If we replaced all /'s with -'s in it then we'd 
corrupt other kinds of fields.

My conclusion: there's nothing we can do in ManifoldCF to fix this problem.  A 
solution might be found in Tika itself, but only if somebody tickets it.  Tika 
would need to go through the column definitions and understand which columns 
were dates and act accordingly.  Feel free to open a Tika ticket accordingly.
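
Both problems are easy to reproduce with plain java.time, as the sketch below
shows. It is independent of Tika and ManifoldCF, and the sample values and
class name are invented: the same text parses to two different dates
depending on which field order is assumed, and a blind separator swap also
rewrites a value that is not a date at all.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class DateAmbiguitySketch {
  static LocalDate parse(String text, String pattern) {
    return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
  }

  /** The naive post-processing step: swap every '/' for '-'. */
  static String swapSeparators(String fieldValue) {
    return fieldValue.replace('/', '-');
  }

  public static void main(String[] args) {
    String cell = "02-03-2010";
    System.out.println(parse(cell, "dd-MM-yyyy")); // 2010-03-02
    System.out.println(parse(cell, "MM-dd-yyyy")); // 2010-02-03
    // The replacement that would "fix" a date also damages a non-date field.
    System.out.println(swapSeparators("10/05/18"));          // 10-05-18
    System.out.println(swapSeparators("docs/reports/2018")); // docs-reports-2018
  }
}
```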





[jira] [Assigned] (CONNECTORS-1617) Date format extraction problem in XLS/XLSX

2020-02-06 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1617:
---

Assignee: Karl Wright



[jira] [Commented] (CONNECTORS-1617) Date format extraction problem in XLS/XLSX

2020-02-06 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031498#comment-17031498
 ] 

Karl Wright commented on CONNECTORS-1617:
-

Are you using the external Tika extractor, or the embedded one?




[jira] [Resolved] (CONNECTORS-1597) reflected cross-site scripting vulnerability

2020-02-03 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1597.
-
Fix Version/s: ManifoldCF 2.13
   Resolution: Fixed

> reflected cross-site scripting vulnerability
> 
>
> Key: CONNECTORS-1597
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1597
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Assignee: Kishore Kumar
>Priority: Minor
> Fix For: ManifoldCF 2.13
>
>
> This is the full report of a penetration test, performed at a client where we 
> deployed a system which uses manifold:
> *Summary*
> A reflected cross-site scripting vulnerability was discovered in the 
> application.
> Reflected cross-site scripting occurs when a web application displays data 
> submitted by the user that
> contains HTML markup and scripting code without properly escaping it. An 
> attacker will create a link to the
> vulnerable page that will display JavaScript code crafted by the attacker. The 
> attacker will then trick an
> authenticated application user into clicking or following this crafted link. 
> When the user's browser parses the
> generated page, it will execute the code crafted by the attacker. If the user 
> was logged in to the application
> when he followed the link, the attacker's code could perform any action in 
> the application that the user can
> perform.
> *Impact*
> Reflected cross-site scripting can be used by attackers to compromise the 
> session of an authenticated user.
> By persuading the victim to click on a specially crafted link, the attacker 
> can execute his own JavaScript
> payload in the browser context of the victim. In this specific case, an 
> attacker could hijack its victim's session
> given that the session token is not flagged as HttpOnly as demonstrated in 
> [G190204T1F4][MANIFOLD]
> Insecure Cookie Configuration.
> Additional attacks exist where an attacker can deceive end users of the 
> application by redirecting them to
> replica sites or trick them into downloading trojans or other malware. The 
> attacker can also use a so called
> browser exploitation framework. In this scenario the attacker injects 
> JavaScript code that communicates to
> the attack framework running on the attacker's computer. When the victim user 
> executes the JavaScript code
> the attacker can control the victim's browser. Publicly available frameworks 
> exist (BeEF -
> [http://www.bindshell.net/tools/beef], Backframe 
> -[http://www.gnucitizen.org/projects/backframe/], XSS Proxy -
> [http://xss-proxy.sourceforge.net/]).
> *Affected Systems*
>  * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/] [name of an arbitrarily 
> supplied URL parameter]
> *Description*
> A case where the application includes user input into the generated HTML 
> pages without properly escaping
> the user supplied data was discovered in the application. The HTTP requests 
> and responses shown below
> demonstrate the problem.
> {code:java}
> GET /mcf-crawler-ui/?smafi"><script>alert(1)</script>non7x=1 HTTP/1.1
> Host: els-manifold-uat.bc:8475
> Accept-Encoding: gzip, deflate
> Accept: */*
> Accept-Language: en
> User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; 
> Trident/5.0)
> Connection: close
> Cookie: JSESSIONID=ov3qae9biucxdat0xiin5s18
> {code}
> {code:java}
> HTTP/1.1 200 OK
> Server: nginx/1.12.2
> Date: Mon, 18 Feb 2019 13:07:02 GMT
> Content-Type: text/html;charset=utf-8
> Content-Length: 2576
> Connection: close
> Pragma: No-cache
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: no-cache
> max-age: Thu, 01 Jan 1970 00:00:00 GMT
> 
> 
> 
> http://www.w3.org/1999/xhtml;>
> 
> 
> 
> 
>  type="text/css"/>
> 
> Apache ManifoldCF™ Login
> 
> <!--
> function login()
> {
> document.loginform.submit();
> }
> document.onkeypress = loginKeyPress;
> function loginKeyPress(e)
> {
> e = e || window.event;
> if (e.keyCode == 13)
> {
> document.getElementById('buttonLogin').click();
> return false;
> }
> return true;
> }
> //-->
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Sign in to start your session
>  method="POST">
> alert(1)non7x=1">
> 
> --snip--
> {code}
> *Recommendations*
> We recommend that the application enforces proper validation on user input. 
> In most situations where u

[jira] [Commented] (CONNECTORS-1597) reflected cross-site scripting vulnerability

2020-02-03 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028833#comment-17028833
 ] 

Karl Wright commented on CONNECTORS-1597:
-

Final analysis of this ticket is that, since ManifoldCF's UI does not have 
different classes of users, and since we've already dealt with any problems 
with the login page, escalation of privileges is not a valid attack vector 
against the ManifoldCF UI.
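
For reference, the output escaping that the report's description refers to can
be sketched in a few lines of Java. This is a generic illustration, not the
change that was made to the ManifoldCF login page; the class and method names
are invented.

```java
final class HtmlEscapeSketch {
  /** Escapes the five characters that matter in HTML text and quoted attribute values. */
  static String escapeHtml(String input) {
    StringBuilder out = new StringBuilder(input.length() + 16);
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      switch (c) {
        case '&': out.append("&amp;"); break;
        case '<': out.append("&lt;"); break;
        case '>': out.append("&gt;"); break;
        case '"': out.append("&quot;"); break;
        case '\'': out.append("&#39;"); break;
        default: out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // A reflected value that tries to break out of an attribute and open a script tag.
    String reflected = "x\"><script>alert(1)</script>";
    System.out.println("<input type=\"hidden\" value=\"" + escapeHtml(reflected) + "\"/>");
  }
}
```

With the value escaped, the quote and angle brackets arrive in the browser as
text inside the attribute rather than as markup.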



[jira] [Resolved] (CONNECTORS-1595) cross-site request forgery vulnerability

2020-02-03 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1595.
-
Resolution: Not A Problem

> cross-site request forgery vulnerability
> 
>
> Key: CONNECTORS-1595
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1595
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Assignee: Kishore Kumar
>Priority: Minor
>
> Below is the full analysis and description as a result from the penetration 
> test.
> *Summary*
> The application is vulnerable to Cross-Site Request Forgery (CSRF).
> A cross-site request forgery attack uses the following scenario:
> 1. An attacker creates a web page that includes an image or a form pointing 
> to the attacked application.
> The image source would actually be a URL with parameters pointing to the 
> application page that
> performs some action. In case of a form, the form action would point to the 
> action page in the target
> application, and the form is submitted automatically by JavaScript when the 
> page is viewed.
> 2. The attacker tricks the victim user to browse to this page. The attacker 
> may get the victim to click a
> link, or embed the attacking HTML code into some page the victim views, for 
> example in a bulletin
> board or chat.
> 3. When the victim views the attacker's page, his browser sends a request 
> prepared by the attacker to
> the attacked application. If the victim is logged in to the target 
> application, his browser will possess
> all necessary session tokens, so the request will appear as authorized to the 
> application and
> succeed.
> A cross-site request forgery attack uses the fact that the victim's browser 
> possesses the necessary
> authentication tokens to perform some actions in the target application.
> *Impact*
> A remote, unauthenticated attacker that can trick an authenticated user into 
> clicking a link crafted by the
> attacker or open a malicious web page, can force the victim to unknowingly 
> perform various actions within
> the application.
> Given that the whole application is not protected against CSRF, any action 
> that an administrator can take on
> Apache Manifold could be unknowingly performed if they fall for a CSRF attack.
> *Affected Systems*
>  * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/]
> *Description*
> It appears that the application does not implement any CSRF protection. 
> Consider the following example. An
> attacker tricks a logged in application user to visit a page containing the 
> following code:
> {code:java}
> 
> 
> 
> history.pushState('', '', '/')
> https://els-manifold-uat.bc:8475/mcf-crawler-ui/execute.jsp;
> method="POST" enctype="multipart/form-data">
> 
> 
> 
> 
> 
> 
>  value="orgapachemanifoldcfcrawlerconnectorswebcrawlerWebcr
> awlerConnector" />
> 
> 
> 
>  value="ferdiklompcraftworkznl" />
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  value="httpsintrauatwebbc" />
>  />
> 
>  value="validation" />
>  value=""
> />
>  value="Continue" />
>  value="username" />
>  value="id996812" />
>  value="" />
>  value="Continue" />
>  value="password" />
>  value="Th1sIs4cl1X" />
>  value="" />
>  value="Continue" />
>  value="loginformtype" />
>  value="pwd" />
>  value="" />
>  value="3" />
> 
> 
>  value="httpsintrauatwebbc" />
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> {code}
> When the victim's browser parses the page and tries to load images, it will 
> cause them to execute any action
> of the attacker's choosing on Manifold.
> *Recommendations*
> The usual approach to preventing CSRF attacks is to add a new parameter with 
> an unpredictable value to
> each form or link that performs some action in the application, commonly 
> referred to as a CSRF-Token. The
> parameter value should have enough entropy so that it cannot be predicted by 
> an attacker and should be
> unique to the current user session. When the user submits the form or clicks 
> the link, the server side code
> checks the parameter value. If it is valid, the request is accepted, 
> otherwise it is denied. The attacker has no
> way of knowing the value of the unpredictable parameter, so he cannot 
> construct a form or link that will
> submit a valid request.
> *References*
>  * OWASP - Cross-Site Request Forgery - 
> [https://www.owasp.org/index.php/Cross-Site_Request_Forgery]
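
The token scheme in the recommendations above could look roughly like the
following Java sketch. It is an illustration under assumptions, not ManifoldCF
code: the class and method names are invented, and storing the token in the
user's session and embedding it in each form are left to the application.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

final class CsrfTokenSketch {
  private static final SecureRandom RANDOM = new SecureRandom();

  /** Returns a fresh 256-bit token as URL-safe Base64 without padding. */
  static String newToken() {
    byte[] bytes = new byte[32];
    RANDOM.nextBytes(bytes);
    return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
  }

  /** Checks the submitted value against the session's token. */
  static boolean matches(String sessionToken, String submitted) {
    if (sessionToken == null || submitted == null) {
      return false;
    }
    // MessageDigest.isEqual is used instead of String.equals to avoid
    // leaking, through timing, how many leading characters matched.
    return MessageDigest.isEqual(
        sessionToken.getBytes(StandardCharsets.UTF_8),
        submitted.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    String token = newToken();
    System.out.println(matches(token, token));     // true
    System.out.println(matches(token, "guessed")); // false
  }
}
```

A request whose hidden field does not match the session's token would be
rejected before any action is executed.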





[jira] [Commented] (CONNECTORS-1595) cross-site request forgery vulnerability

2020-02-03 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028832#comment-17028832
 ] 

Karl Wright commented on CONNECTORS-1595:
-

Note: the team here did analyze this potential attack vector, noting the 
following:

(1) The UI does not have different capabilities for individual users.  The only 
security is whether you are "in" or not.
(2) The attack described in this ticket does not get an attacker "in"; it only 
gives them, once they are in, capabilities that they might not otherwise have.  
But since all users are created equal in the MCF UI, this is not problematic.


> cross-site request forgery vulnerability
> 
>
> Key: CONNECTORS-1595
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1595
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Assignee: Kishore Kumar
>Priority: Minor
>
> Below is the full analysis and description as a result from the penetration 
> test.
> *Summary*
> The application is vulnerable to Cross-Site Request Forgery (CSRF).
> A cross-site request forgery attack uses the following scenario:
> 1. An attacker creates a web page that includes an image or a form pointing 
> to the attacked application.
> The image source would actually be a URL with parameters pointing to the 
> application page that
> performs some action. In case of a form, the form action would point to the 
> action page in the target
> application, and the form is submitted automatically by JavaScript when the 
> page is viewed.
> 2. The attacker tricks the victim user to browse to this page. The attacker 
> may get the victim to click a
> link, or embed the attacking HTML code into some page the victim views, for 
> example in a bulletin
> board or chat.
> 3. When the victim views the attacker's page, his browser sends a request 
> prepared by the attacker to
> the attacked application. If the victim is logged in to the target 
> application, his browser will possess
> all necessary session tokens, so the request will appear as authorized to the 
> application and
> succeed.
> A cross-site request forgery attack uses the fact that the victim's browser 
> possesses the necessary
> authentication tokens to perform some actions in the target application.
> *Impact*
> A remote, unauthenticated attacker that can trick an authenticated user into 
> clicking a link crafted by the
> attacker or open a malicious web page, can force the victim to unknowingly 
> perform various actions within
> the application.
> Given that the whole application is not protected against CSRF, any action 
> that an administrator can take on
> Apache Manifold could be unknowingly performed if they fall for a CSRF attack.
> *Affected Systems*
>  * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/]
> *Description*
> It appears that the application does not implement any CSRF protection. 
> Consider the following example. An
> attacker tricks a logged in application user to visit a page containing the 
> following code:
> {code:java}
> 
> 
> 
> history.pushState('', '', '/')
> https://els-manifold-uat.bc:8475/mcf-crawler-ui/execute.jsp;
> method="POST" enctype="multipart/form-data">
> 
> 
> 
> 
> 
> 
>  value="orgapachemanifoldcfcrawlerconnectorswebcrawlerWebcr
> awlerConnector" />
> 
> 
> 
>  value="ferdiklompcraftworkznl" />
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  value="httpsintrauatwebbc" />
>  />
> 
>  value="validation" />
>  value=""
> />
>  value="Continue" />
>  value="username" />
>  value="id996812" />
>  value="" />
>  value="Continue" />
>  value="password" />
>  value="Th1sIs4cl1X" />
>  value="" />
>  value="Continue" />
>  value="loginformtype" />
>  value="pwd" />
>  value="" />
>  value="3" />
> 
> 
>  value="httpsintrauatwebbc" />
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> {code}
> When the victim's browser parses the page and tries to load images, it will 
> cause them to execute any action
> of the attacker's choosing on Manifold.
> *Recommendations*
> The usual approach to preventing CSRF attacks is to add a new parameter with 
> an u

[jira] [Commented] (CONNECTORS-1635) CSWS Connector: Issues with connecting to OpenText system

2020-01-30 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027255#comment-17027255
 ] 

Karl Wright commented on CONNECTORS-1635:
-

You're creating a brand-new ResourceResolver() in this fix.  Very probably you 
need to extend the one that is being used rather than creating something brand 
new, because I bet other methods in the resolver are going to help it resolve 
links in the wsdl.

In other words, you fixed one problem but the way you fixed it broke the 
ability to fetch the other referenced wsdl components.
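
One way to read the suggestion above is as a delegation pattern: wrap the
resolver that is already in use, handle only the case that needs special
treatment, and pass everything else through. The sketch below uses a stand-in
Resolver interface defined only for this example; it is not the CXF
ResourceResolver type, and all names in it are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

/** Stand-in interface for this sketch only; NOT the CXF type. */
interface Resolver {
    Optional<String> resolve(String name);
}

/** Handles one extra resource itself and delegates all others. */
final class ExtendingResolver implements Resolver {
    private final Resolver existing;
    private final String specialName;
    private final String specialValue;

    ExtendingResolver(Resolver existing, String specialName, String specialValue) {
        this.existing = existing;
        this.specialName = specialName;
        this.specialValue = specialValue;
    }

    @Override
    public Optional<String> resolve(String name) {
        if (specialName.equals(name)) {
            // The one case the fix is meant to cover.
            return Optional.of(specialValue);
        }
        // Everything else keeps the behaviour of the resolver already in use.
        return existing.resolve(name);
    }
}

/** Trivial map-backed resolver standing in for "the one that is being used". */
final class MapResolver implements Resolver {
    private final Map<String, String> entries;

    MapResolver(Map<String, String> entries) {
        this.entries = entries;
    }

    @Override
    public Optional<String> resolve(String name) {
        return Optional.ofNullable(entries.get(name));
    }
}
```

With a brand-new resolver the second kind of lookup would have nothing to fall
back on; with the wrapper, references that the original resolver could already
find stay resolvable.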


> CSWS Connector: Issues with connecting to OpenText system
> -
>
> Key: CONNECTORS-1635
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1635
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.15
>Reporter: Jörn Franke
>    Assignee: Karl Wright
>Priority: Major
>
> This is about the CSWS connector. It has the following issues:
>  * It cannot fetch the WSDL from an https URL, because the CAs are ignored. 
> The reason is that the underlying CXF framework fetches the WSDL with the 
> WSDL4JAVA library, which is a completely different mechanism from making a 
> web service call within CXF (the latter is correctly addressed by the 
> connector). See below on how to fix this.
>  * After fixing the WSDL fetch from an https URL, another issue occurs. It 
> fetches the WSDL correctly, but not the references included in it. The error 
> message mentions the URL of the included reference, and this URL is 
> reachable and on the same server as the WSDL. So my theory is that something 
> blocks the CXF request to fetch included files from an https URL.
>  
> Error trace for the second point:
> Caused by: org.apache.cxf.service.factory.ServiceConstructionException: 
> Failed to create service.
>      at 
> org.apache.cxf.wsdl11.WSDLServiceFactory.create(WSDLServiceFactory.java:169)
>      at 
> org.apache.cxf.wsdl.service.factory.ReflectionServiceFactoryBean.buildServiceFromWSDL(ReflectionServiceFactoryBean.java:408)
>      at 
> org.apache.cxf.wsdl.service.factory.ReflectionServiceFactoryBean.initializeServiceModel(ReflectionServiceFactoryBean.java:528)
>      at 
> org.apache.cxf.wsdl.service.factory.ReflectionServiceFactoryBean.create(ReflectionServiceFactoryBean.java:263)
>      at 
> org.apache.cxf.jaxws.support.JaxWsServiceFactoryBean.create(JaxWsServiceFactoryBean.java:199)
>      at 
> org.apache.cxf.frontend.AbstractWSDLBasedEndpointFactory.createEndpoint(AbstractWSDLBasedEndpointFactory.java:103)
>      at 
> org.apache.cxf.frontend.ClientFactoryBean.create(ClientFactoryBean.java:91)
>      at 
> org.apache.cxf.frontend.ClientProxyFactoryBean.create(ClientProxyFactoryBean.java:159)
>      at 
> org.apache.cxf.jaxws.JaxWsProxyFactoryBean.create(JaxWsProxyFactoryBean.java:142)
>      at org.apache.cxf.jaxws.ServiceImpl.createPort(ServiceImpl.java:492)
>      at org.apache.cxf.jaxws.ServiceImpl.getPort(ServiceImpl.java:358)
>      ... 51 more
>  Caused by: org.apache.ws.commons.schema.XmlSchemaException: Unable to locate 
> imported document at 'https://server:443/cws/services/Authentication?xsd=2', 
> relative to 
> 'https://server:443/cws/services/Authentication?wsdl#types1'.
>      at 
> org.apache.cxf.catalog.CatalogXmlSchemaURIResolver.resolveEntity(CatalogXmlSchemaURIResolver.java:76)
>      at 
> org.apache.ws.commons.schema.SchemaBuilder.resolveXmlSchema(SchemaBuilder.java:684)
>      at 
> org.apache.ws.commons.schema.SchemaBuilder.handleImport(SchemaBuilder.java:538)
>      at 
> org.apache.ws.commons.schema.SchemaBuilder.handleSchemaElementChild(SchemaBuilder.java:1515)
>      at 
> org.apache.ws.commons.schema.SchemaBuilder.handleXmlSchemaElement(SchemaBuilder.java:658)
>      at 
> org.apache.ws.commons.schema.XmlSchemaCollection.read(XmlSchemaCollection.java:550)
>      at 
> org.apache.cxf.common.xmlschema.SchemaCollection.read(SchemaCollection.java:129)
>      at org.apache.cxf.wsdl11.SchemaUtil.extractSchema(SchemaUtil.java:141)
>      at org.apache.cxf.wsdl11.SchemaUtil.getSchemas(SchemaUtil.java:74)
>      at org.apache.cxf.wsdl11.SchemaUtil.getSchemas(SchemaUtil.java:66)
>      at org.apache.cxf.wsdl11.SchemaUtil.getSchemas(SchemaUtil.java:61)
>      at 
> org.apache.cxf.wsdl11.WSDLServiceBuilder.getSchemas(WSDLServiceBuilder.java:378)
>      at 
> org.apache.cxf.wsdl11.WSDLServiceBuilder.bu

[jira] [Assigned] (CONNECTORS-1635) CSWS Connector: Issues with connecting to OpenText system

2020-01-30 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1635:
---

Assignee: Karl Wright

> CSWS Connector: Issues with connecting to OpenText system
> -
>
> Key: CONNECTORS-1635
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1635
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.15
>Reporter: Jörn Franke
>    Assignee: Karl Wright
>Priority: Major
>
> This is about the CSWS connector. It has the following issues:
>  * It cannot fetch the WSDL from an https URL, because the CAs are ignored. 
> The reason is that the underlying CXF framework fetches the WSDL with the 
> WSDL4JAVA library, which is a completely different mechanism from making a 
> web service call within CXF (the latter is correctly addressed by the 
> connector). See below on how to fix this.
>  * After fixing the WSDL fetch from an https URL, another issue occurs. It 
> fetches the WSDL correctly, but not the references included in it. The error 
> message mentions the URL of the included reference, and this URL is 
> reachable and on the same server as the WSDL. So my theory is that something 
> blocks the CXF request to fetch included files from an https URL.
>  
> Error trace for the second point:
> Caused by: org.apache.cxf.service.factory.ServiceConstructionException: 
> Failed to create service.
>      at 
> org.apache.cxf.wsdl11.WSDLServiceFactory.create(WSDLServiceFactory.java:169)
>      at 
> org.apache.cxf.wsdl.service.factory.ReflectionServiceFactoryBean.buildServiceFromWSDL(ReflectionServiceFactoryBean.java:408)
>      at 
> org.apache.cxf.wsdl.service.factory.ReflectionServiceFactoryBean.initializeServiceModel(ReflectionServiceFactoryBean.java:528)
>      at 
> org.apache.cxf.wsdl.service.factory.ReflectionServiceFactoryBean.create(ReflectionServiceFactoryBean.java:263)
>      at 
> org.apache.cxf.jaxws.support.JaxWsServiceFactoryBean.create(JaxWsServiceFactoryBean.java:199)
>      at 
> org.apache.cxf.frontend.AbstractWSDLBasedEndpointFactory.createEndpoint(AbstractWSDLBasedEndpointFactory.java:103)
>      at 
> org.apache.cxf.frontend.ClientFactoryBean.create(ClientFactoryBean.java:91)
>      at 
> org.apache.cxf.frontend.ClientProxyFactoryBean.create(ClientProxyFactoryBean.java:159)
>      at 
> org.apache.cxf.jaxws.JaxWsProxyFactoryBean.create(JaxWsProxyFactoryBean.java:142)
>      at org.apache.cxf.jaxws.ServiceImpl.createPort(ServiceImpl.java:492)
>      at org.apache.cxf.jaxws.ServiceImpl.getPort(ServiceImpl.java:358)
>      ... 51 more
>  Caused by: org.apache.ws.commons.schema.XmlSchemaException: Unable to locate 
> imported document at 'https://server:443/cws/services/Authentication?xsd=2', 
> relative to 
> 'https://server:443/cws/services/Authentication?wsdl#types1'.
>      at 
> org.apache.cxf.catalog.CatalogXmlSchemaURIResolver.resolveEntity(CatalogXmlSchemaURIResolver.java:76)
>      at 
> org.apache.ws.commons.schema.SchemaBuilder.resolveXmlSchema(SchemaBuilder.java:684)
>      at 
> org.apache.ws.commons.schema.SchemaBuilder.handleImport(SchemaBuilder.java:538)
>      at 
> org.apache.ws.commons.schema.SchemaBuilder.handleSchemaElementChild(SchemaBuilder.java:1515)
>      at 
> org.apache.ws.commons.schema.SchemaBuilder.handleXmlSchemaElement(SchemaBuilder.java:658)
>      at 
> org.apache.ws.commons.schema.XmlSchemaCollection.read(XmlSchemaCollection.java:550)
>      at 
> org.apache.cxf.common.xmlschema.SchemaCollection.read(SchemaCollection.java:129)
>      at org.apache.cxf.wsdl11.SchemaUtil.extractSchema(SchemaUtil.java:141)
>      at org.apache.cxf.wsdl11.SchemaUtil.getSchemas(SchemaUtil.java:74)
>      at org.apache.cxf.wsdl11.SchemaUtil.getSchemas(SchemaUtil.java:66)
>      at org.apache.cxf.wsdl11.SchemaUtil.getSchemas(SchemaUtil.java:61)
>      at 
> org.apache.cxf.wsdl11.WSDLServiceBuilder.getSchemas(WSDLServiceBuilder.java:378)
>      at 
> org.apache.cxf.wsdl11.WSDLServiceBuilder.buildServices(WSDLServiceBuilder.java:345)
>      at 
> org.apache.cxf.wsdl11.WSDLServiceBuilder.buildServices(WSDLServiceBuilder.java:209)
>      at 
> org.apache.cxf.wsdl11.WSDLServiceFactory.create(WSDLServiceFactory.java:161)
>      ... 61 more
>  
>  
>  
> ---
> fixing https CAs for fetching WSDLs:
> // enable https for wsdl requests (this goes via 

Re: CSWS Connector : ServiceConstructionException: Failed to create service

2020-01-30 Thread Karl Wright
Hi,

I've been waiting for a ticket to appear that summarizes what's been
happening for this issue but I haven't seen one.  Can you bring us up to
date?  Thanks in advance,
Karl

On Wed, Jan 22, 2020 at 12:06 PM Jörn Franke  wrote:

> I will try to help. If you do not mind, I will create a Jira ticket in which
> I also explain how I got the WSDL fetch working over https, which could not
> have worked before.
> The reason is that fetching the WSDL uses a completely different approach
> (WSDL4Java), for which no truststore was defined in the connector code
> (only for the actual SOAP requests). This was fixed and it can create the
> service. But in the lines of code that follow service creation, it fetches
> the xsds, which does not work (it says it cannot find them). This is strange,
> as the URL is correct and reachable. Hence, I suspect it does not understand
> that it is an http URL and instead tries to open it on the file system.
> I am pretty busy at the moment, but I will try to support this and give
> feedback as soon as possible.
>
> I don't know what is different from your 2 installations - maybe they use
> http, or partly http, or there is some patch to the code that did not make
> it into git.
> I can test with Content Server 10 as well as 16.
> SOAP UI has no problem, and in the end it does exactly the same thing
> (starting from the https WSDL etc.)
>
> On 22.01.2020 at 13:17, Karl Wright wrote:
>
> 
> The whole web services java + cxf architecture is pretty mysterious.  The
> only way I've made progress is by finding code snippets in stackoverflow;
> the documentation is not adequate.  BUT there are configuration files that
> determine how the WSDL parser resolves references.  I don't know how we
> would force that configuration to be in effect but something like that
> would need to be done.  I'm just surprised that you're having this problem
> when two other installations didn't.  There must be a difference somewhere.
>
> Karl
>
>
> On Wed, Jan 22, 2020 at 5:11 AM Jörn Franke  wrote:
>
>> Sorry, I did not have much time. My next step is to try to modify the
>> catalogue xml to fetch it directly over https. For some reason it can
>> fetch the WSDL (after my fix), but not the included xsds, even though the
>> error message contains their correct URL.
>> Are you aware of any configuration that forces file-based access to
>> those? In the code I did not find anything suspicious.
>>
>> On 22.01.2020 at 10:28, Karl Wright wrote:
>>
>> 
>> Has there been any news?
>> I'd love to get this tied up so that you're able to proceed.
>> Karl
>>
>> On Thu, Jan 16, 2020 at 12:08 PM Jörn Franke 
>> wrote:
>>
>>> Ok, I understand. I will try and let you know. Thanks again very much for
>>> your fast and detailed answer. Really appreciated. I hope I can give back
>>> with the solution to fetch WSDLs from https and maybe a solution to this
>>> problem (maybe others will face this as well).
>>>
>>> About the connector: the WSDL is successfully fetched via https (not
>>> file - no clue why) after the modification I made. The only problem I see
>>> now is that the xsds to which the WSDL refers are not fetched. The
>>> bizarre thing is that the https URL it mentions for the xsd is
>>> absolutely correct. So I assume it does not understand an http URL; maybe
>>> that is related to configuration.
>>>
>>> On 16.01.2020 at 14:53, Karl Wright wrote:
>>>
>>> 
>>> The WSDLS are bundled with the jar.  We intended this to be the ONLY way
>>> the wsdls were accessed, and made lots of changes to the wsdls accordingly,
>>> so that they referenced other wsdls via the "file system".  The wsdls are
>>> the fixed up ones that are used to build the java stubs locally, plus a
>>> config file that's supposed to tell CXF how to resolve referenced wsdls.
>>> That config file may or may not be correct, because we never were able to
>>> get CXF to use the local resource wsdls during actual connection.
>>>
>>> Except now they seem to be both fetched via https AND locally sourced.
>>> I have no idea how that can be.  I had assumed it was done one way or the
>>> other but not both.
>>>
>>> Perhaps the problem is that the configuration file is being read but the
>>> resource wsdls are not being found?  Removing the meta-inf from the jar
>>> would then force everything to go through https.  Ideally I'd love it if
>>> that wasn't needed and we could get the resource fetch working everywhere.
>>>
>
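
The truststore point raised in this thread can be illustrated with a JDK-only
sketch: load a keystore and build an SSLSocketFactory that trusts its
certificates. The class name TrustStoreSketch and its methods are hypothetical.
How such a factory would be handed to the WSDL-fetching code in the connector
is not shown here and is not claimed to match the actual fix.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManagerFactory;

/** Hypothetical helper, for illustration only; not connector code. */
final class TrustStoreSketch {

    /** Loads a keystore file; path and password are supplied by the caller. */
    static KeyStore load(Path path, char[] password) throws Exception {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = Files.newInputStream(path)) {
            ks.load(in, password);
        }
        return ks;
    }

    /** Builds a socket factory whose trust decisions come from the given store. */
    static SSLSocketFactory socketFactory(KeyStore trustStore) throws Exception {
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);
        SSLContext ctx = SSLContext.getInstance("TLS");
        // No client keys and the default SecureRandom: only trust is configured.
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx.getSocketFactory();
    }
}
```

For a plain JDK fetch, the factory could be applied with
HttpsURLConnection.setSSLSocketFactory(...) before the connection is opened.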

Re: sharepoint crawler documents limit

2020-01-27 Thread Karl Wright
I'm glad you got by this.  Thanks for letting us know what the issue was.
Karl

On Mon, Jan 27, 2020 at 4:05 AM Jorge Alonso Garcia 
wrote:

> Hi,
> We have changed the timeout on the SharePoint IIS, and now the process is
> able to crawl all documents.
> Thanks for your help
>
>
>
> On Mon, 30 Dec 2019 at 12:18, Gaurav G wrote:
>
>> We had faced a similar issue, wherein our repo had 100,000 documents but
>> our crawler stopped after 5 documents. The issue turned out to be that
>> the Sharepoint query that was fired by the Sharepoint web service gets
>> progressively slower and eventually the connection starts timing out before
>> the next 1 records get returned. We increased a timeout parameter on
>> Sharepoint to 10 minutes and then after that we were able to crawl all
>> documents successfully.  I believe we had increased the parameter indicated
>> in the link below
>>
>>
>> https://weblogs.asp.net/jeffwids/how-to-increase-the-timeout-for-a-sharepoint-2010-website
>>
>>
>>
>> On Fri, Dec 20, 2019 at 6:27 PM Karl Wright  wrote:
>>
>>> Hi Priya,
>>>
>>> This has nothing to do with anything in ManifoldCF.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Dec 20, 2019 at 7:56 AM Priya Arora  wrote:
>>>
>>>> Hi All,
>>>>
>>>> Does this issue have something to do with the values/parameters below,
>>>> set in properties.xml?
>>>> [image: image.png]
>>>>
>>>>
>>>> On Fri, Dec 20, 2019 at 5:21 PM Jorge Alonso Garcia 
>>>> wrote:
>>>>
>>>>> And what other SharePoint parameter could I check?
>>>>>
>>>>> Jorge Alonso Garcia
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 20 Dec 2019 at 12:47, Karl Wright wrote:
>>>>>
>>>>>> The code seems correct and many people are using it without
>>>>>> encountering this problem.  There may be another SharePoint configuration
>>>>>> parameter you also need to look at somewhere.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia <
>>>>>> jalon...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Karl,
>>>>>>> On SharePoint the list view threshold is 150,000, but we only receive
>>>>>>> 20,000 from MCF.
>>>>>>> [image: image.png]
>>>>>>>
>>>>>>>
>>>>>>> Jorge Alonso Garcia
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 19 Dec 2019 at 19:19, Karl Wright wrote:
>>>>>>>
>>>>>>>> If the job finished without error it implies that the number of
>>>>>>>> documents returned from this one library was 1 when the service is
>>>>>>>> called the first time (starting at doc 0), 1 when it's called the
>>>>>>>> second time (starting at doc 1), and zero when it is called the
>>>>>>>> third time (starting at doc 2).
>>>>>>>>
>>>>>>>> The plugin code is unremarkable and actually gets results in chunks
>>>>>>>> of 1000 under the covers:
>>>>>>>>
>>>>>>>> >>>>>>
>>>>>>>> SPQuery listQuery = new SPQuery();
>>>>>>>> listQuery.Query = ">>>>>>> Override=\"TRUE\">";
>>>>>>>> listQuery.QueryThrottleMode =
>>>>>>>> SPQueryThrottleOption.Override;
>>>>>>>> listQuery.ViewAttributes =
>>>>>>>> "Scope=\"Recursive\"";
>>>>>>>> listQuery.ViewFields = ">>>>>>> Name='FileRef' />";
>>>>>>>> listQuery.RowLimit = 1000;
>>>>>>>>
>>>>>>>> XmlDocument doc = new XmlDocument();
>>>>>>>> 
