Re: Manifold Job process issue

2021-11-09 Thread Karl Wright
If your docker image's clock is badly out of sync with the real world, then
System.currentTimeMillis() may give bogus values, and ManifoldCF uses that
to manage throttling etc.  I don't know if that is the correct explanation,
but it's the only thing I can think of.
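As a hedged illustration (hypothetical names, not ManifoldCF's actual throttling code), a time-based throttle built on System.currentTimeMillis() can stall indefinitely if the container clock sits behind the recorded last-fetch time:

```java
// Sketch only: shows why a badly skewed clock can make a
// currentTimeMillis()-based throttle refuse to schedule any work.
public class ThrottleSketch {

    // A fetch is allowed once at least minIntervalMillis have elapsed
    // since the last recorded fetch.
    static boolean fetchAllowed(long nowMillis, long lastFetchMillis, long minIntervalMillis) {
        return nowMillis - lastFetchMillis >= minIntervalMillis;
    }

    public static void main(String[] args) {
        long lastFetch = 1_000_000L;
        // Healthy clock: 600 ms elapsed, above the 500 ms minimum.
        System.out.println(fetchAllowed(lastFetch + 600, lastFetch, 500)); // true
        // Container clock an hour behind the recorded last-fetch time:
        // elapsed time is hugely negative, so the throttle never opens.
        System.out.println(fetchAllowed(lastFetch - 3_600_000L, lastFetch, 500)); // false
    }
}
```

Comparing `date` inside the container against the host clock is a quick way to rule this out.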

Karl


On Tue, Nov 9, 2021 at 4:56 AM ritika jain  wrote:

>
> Hi All,
>
> I am using the Windows Shares connector, ManifoldCF 2.14, and ES as output.
> I have configured a job to process 60k documents. These documents are new
> and have no corresponding values in the DB or the ES index.
>
> So ideally it should process/index the documents as soon as the job starts,
> but ManifoldCF does not process anything for many hours after job start-up.
> I have tried restarting the docker container as well, but it didn't help
> much. Also, the logs only show long-running queries.
>
> Why does ManifoldCF behave like that?
>
> Thanks
> Ritika
>


Re: Duplicate key error

2021-10-27 Thread Karl Wright
We see errors like this because MCF is a highly multithreaded
application, and two threads sometimes collide in what they are
doing even though they are transactionally separated.  That happens because
of bugs in the database software.  So if you restart the job it should not
encounter the same problem.

If the problem IS repeatable, we will of course look deeper into what is
going on.
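One way a caller can cope with such a transient collision (a sketch under assumptions, not ManifoldCF code) is to retry only when the database reports a unique-constraint violation; PostgreSQL signals this with SQLSTATE 23505:

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

public class RetryOnDuplicateKey {
    // Retry an operation that may fail with a transient duplicate-key
    // collision. PostgreSQL reports unique-constraint violations as
    // SQLSTATE 23505; any other SQLException is rethrown immediately.
    static <T> T withRetry(Callable<T> op, int maxAttempts) throws Exception {
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (SQLException e) {
                if (!"23505".equals(e.getSQLState())) throw e;
                last = e; // transient collision: try again
            }
        }
        throw last; // still failing after maxAttempts: genuinely repeatable
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails once with a duplicate-key error, then succeeds on retry.
        String result = withRetry(() -> {
            if (calls[0]++ == 0)
                throw new SQLException("duplicate key value violates unique constraint", "23505");
            return "ok";
        }, 3);
        System.out.println(result); // prints "ok"
    }
}
```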

Karl


On Wed, Oct 27, 2021 at 9:52 AM Karl Wright  wrote:

> Is it repeatable?  My guess is it is not repeatable.
> Karl
>
> On Wed, Oct 27, 2021 at 4:43 AM ritika jain 
> wrote:
>
>> So, can it be left as it is? Because it is preventing the job from
>> completing, and it is stopping.
>>
>> On Tue, Oct 26, 2021 at 8:40 PM Karl Wright  wrote:
>>
>>> That's a database bug.  All of our underlying databases have some bugs
>>> of this kind.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Oct 26, 2021 at 9:17 AM ritika jain 
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> While using ManifoldCF 2.14 with the Web connector and ES connector,
>>>> after the job had been running for some time (the jobs ingest documents
>>>> in lakhs, i.e. hundreds of thousands), we got this error on PROD.
>>>>
>>>> Can anybody suggest what could be the problem?
>>>>
>>>> PRODUCTION MANIFOLD ERROR:
>>>>
>>>> Error: ERROR: duplicate key value violates unique constraint
>>>> "ingeststatus_pkey" Detail: Key (id)=(1624***7) already exists.
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>


Re: Duplicate key error

2021-10-27 Thread Karl Wright
Is it repeatable?  My guess is it is not repeatable.
Karl

On Wed, Oct 27, 2021 at 4:43 AM ritika jain 
wrote:

> So, can it be left as it is? Because it is preventing the job from
> completing, and it is stopping.
>
> On Tue, Oct 26, 2021 at 8:40 PM Karl Wright  wrote:
>
>> That's a database bug.  All of our underlying databases have some bugs of
>> this kind.
>>
>> Karl
>>
>>
>> On Tue, Oct 26, 2021 at 9:17 AM ritika jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> While using ManifoldCF 2.14 with the Web connector and ES connector, after
>>> the job had been running for some time (the jobs ingest documents in lakhs,
>>> i.e. hundreds of thousands), we got this error on PROD.
>>>
>>> Can anybody suggest what could be the problem?
>>>
>>> PRODUCTION MANIFOLD ERROR:
>>>
>>> Error: ERROR: duplicate key value violates unique constraint
>>> "ingeststatus_pkey" Detail: Key (id)=(1624***7) already exists.
>>>
>>>
>>> Thanks
>>>
>>>
>>>


Re:

2021-10-26 Thread Karl Wright
That's a database bug.  All of our underlying databases have some bugs of
this kind.

Karl


On Tue, Oct 26, 2021 at 9:17 AM ritika jain 
wrote:

> Hi All,
>
> While using ManifoldCF 2.14 with the Web connector and ES connector, after
> the job had been running for some time (the jobs ingest documents in lakhs,
> i.e. hundreds of thousands), we got this error on PROD.
>
> Can anybody suggest what could be the problem?
>
> PRODUCTION MANIFOLD ERROR:
>
> Error: ERROR: duplicate key value violates unique constraint
> "ingeststatus_pkey" Detail: Key (id)=(1624***7) already exists.
>
>
> Thanks
>
>
>


Re: Windows Shares job-Limit on defining no of paths

2021-10-25 Thread Karl Wright
The only limit is that the more you add, the slower it gets.

Karl


On Mon, Oct 25, 2021 at 6:06 AM ritika jain 
wrote:

> Hi ,
> Is there any limit on the number of paths we can define in a job using
> Windows Shares as the repository and ES as the output?
>
> Thanks
>


Re: Null Pointer Exception

2021-10-25 Thread Karl Wright
The API should really catch this situation.  Basically, you are calling a
function that requires an input but you are not providing one.  In that
case the API sets the input to "null", and the detailed operation is
called.  The detailed operation is not expecting a null input.

This is the API piece that is not flagging the error properly:

// Parse the input
Configuration input;

if (protocol.equals("json"))
{
  if (argument.length() != 0)
  {
input = new Configuration();
input.fromJSON(argument);
  }
  else
input = null;
}
else
{
  response.sendError(response.SC_BAD_REQUEST,"Unknown API protocol: "+protocol);
  return;
}

Since this is POST, it should assume that the input cannot be null, and if
it is, it's a bad request.
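A minimal sketch of the guard being described (illustrative names, not the committed fix): a null or empty POST body yields a 400 instead of being dispatched:

```java
public class PostInputGuard {
    // Mirrors the parsing branch quoted above: an empty JSON argument yields a
    // null input. For POST, that should be rejected with 400 (SC_BAD_REQUEST)
    // rather than handed to the detailed operation, which does not expect null.
    static int validatePostInput(String argument) {
        String input = (argument != null && argument.length() != 0) ? argument : null;
        return input == null ? 400 : 200; // SC_BAD_REQUEST vs. OK
    }

    public static void main(String[] args) {
        System.out.println(validatePostInput(""));             // 400
        System.out.println(validatePostInput("{\"job\":{}}")); // 200
    }
}
```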

Karl


On Mon, Oct 25, 2021 at 2:44 AM ritika jain 
wrote:

> Hi,
>
> I am getting a NullPointerException while creating a job via a programmatic
> approach in PHP.
> Can anybody suggest the reason for this?
>
> Error 500: Server Error
> HTTP ERROR 500. Problem accessing
> /mcf-api-service/json/jobs. Reason: Server Error. Caused
> by: java.lang.NullPointerException at
> org.apache.manifoldcf.agents.system.ManifoldCF.findConfigurationNode(ManifoldCF.java:208)
> at
> org.apache.manifoldcf.crawler.system.ManifoldCF.apiPostJob(ManifoldCF.java:3539)
> at
> org.apache.manifoldcf.crawler.system.ManifoldCF.executePostCommand(ManifoldCF.java:3585)
> at
> org.apache.manifoldcf.apiservlet.APIServlet.executePost(APIServlet.java:576)
> at org.apache.manifoldcf.apiservlet.APIServlet.doPost(APIServlet.java:175)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.eclipse.jetty.server.Server.handle(Server.java:497) at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
> at
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
> at java.lang.Thread.run(Thread.java:748)
>
>


[jira] [Resolved] (CONNECTORS-1675) Unable to delete Mapping Connections via JSON API

2021-10-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1675.
-
Fix Version/s: ManifoldCF 2.21
   Resolution: Fixed

r1894400


> Unable to delete Mapping Connections via JSON API
> -
>
> Key: CONNECTORS-1675
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1675
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API
>Affects Versions: ManifoldCF 2.20
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.21
>
>
> The DELETE action via the JSON API 
> mappingconnections/__ does not seem to work. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (CONNECTORS-1675) Unable to delete Mapping Connections via JSON API

2021-10-19 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1675:
---

Assignee: Kishore Kumar

> Unable to delete Mapping Connections via JSON API
> -
>
> Key: CONNECTORS-1675
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1675
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API
>Affects Versions: ManifoldCF 2.20
>Reporter: Julien Massiera
>Assignee: Kishore Kumar
>Priority: Major
>
> The DELETE action via the JSON API 
> mappingconnections/__ does not seem to work. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (CONNECTORS-1675) Unable to delete Mapping Connections via JSON API

2021-10-19 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1675:
---

Assignee: Karl Wright  (was: Kishore Kumar)

> Unable to delete Mapping Connections via JSON API
> -
>
> Key: CONNECTORS-1675
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1675
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API
>Affects Versions: ManifoldCF 2.20
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Major
>
> The DELETE action via the JSON API 
> mappingconnections/__ does not seem to work. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CONNECTORS-1674) KEYS file must be called KEYS

2021-10-05 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1674:

Fix Version/s: ManifoldCF next

> KEYS file must be called KEYS
> -
>
> Key: CONNECTORS-1674
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1674
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Sebb
>    Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF next
>
>
> https://infra.apache.org/release-download-pages.html#download-page
> The KEYS file must be called KEYS, and should be at the root of the download 
> tree.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (CONNECTORS-1674) KEYS file must be called KEYS

2021-10-05 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1674:
---

Assignee: Karl Wright

> KEYS file must be called KEYS
> -
>
> Key: CONNECTORS-1674
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1674
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Sebb
>    Assignee: Karl Wright
>Priority: Major
>
> https://infra.apache.org/release-download-pages.html#download-page
> The KEYS file must be called KEYS, and should be at the root of the download 
> tree.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CONNECTORS-1660) Patch for MCF HTML extractor connector

2021-10-04 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1660:

Fix Version/s: (was: ManifoldCF 2.18)
   ManifoldCF next

> Patch for MCF HTML extractor connector
> --
>
> Key: CONNECTORS-1660
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1660
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: HTML extractor
>Reporter: Olivier Tavard
>    Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF next
>
> Attachments: patch_html_extractor_connector_02_12_2020.txt, 
> patch_html_extractor_connector_11_12_2020.txt
>
>
> Hello,
> Here is a patch for the HTML extractor connector regarding text 
> extraction with or without HTML stripping: 
> [^patch_html_extractor_connector_02_12_2020.txt]
>  * Extraction of HTML code: I added a whitelist through the Jsoup cleaner to 
> define which HTML elements are allowed, to enforce security. In the code I 
> set it to “relaxed”:
> This whitelist allows a full range of text and structural body HTML: a, b, 
> blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, 
> h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, 
> sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul
> (more details here: 
> [https://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html#relaxed()])
> A future improvement would be to add a new parameter in the interface to 
> choose which whitelist to use.
>  
>  * Extraction of text with HTML stripping activated: we keep only text nodes; 
> all HTML is stripped (same as before). The change is that the Jsoup 
> pretty-print option is now set to false, to keep line breaks.
>  
> Best regards



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (CONNECTORS-1673) Download page must use https for sigs and hashes

2021-10-04 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1673.
-
Fix Version/s: ManifoldCF 2.20
   Resolution: Fixed

The 2.20 release update should fix this.


> Download page must use https for sigs and hashes
> 
>
> Key: CONNECTORS-1673
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1673
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Sebb
>    Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.20
>
>
> The download page
> https://manifoldcf.apache.org/en_US/download.html
> currently uses http for links to sigs and hashes.
> This is not secure; please use HTTPS
> https://infra.apache.org/release-download-pages.html#links



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (CONNECTORS-1673) Download page must use https for sigs and hashes

2021-10-04 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1673:
---

Assignee: Karl Wright

> Download page must use https for sigs and hashes
> 
>
> Key: CONNECTORS-1673
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1673
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Sebb
>    Assignee: Karl Wright
>Priority: Major
>
> The download page
> https://manifoldcf.apache.org/en_US/download.html
> currently uses http for links to sigs and hashes.
> This is not secure; please use HTTPS
> https://infra.apache.org/release-download-pages.html#links



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[RESULT] [VOTE] Release Apache ManifoldCF 2.20, RC0

2021-10-01 Thread Karl Wright
Three +1's, >72 hours.  Vote passes!

Karl

On Fri, Oct 1, 2021 at 10:00 AM Karl Wright  wrote:

> Ran tests, verified things still work.  +1 from me.
>
> Karl
>
>
> On Mon, Sep 27, 2021 at 12:46 PM Cihad Guzel  wrote:
>
>> +1
>>
>> Cihad Güzel
>>
>>
>> On Sun, 26 Sep 2021 at 15:06,  wrote:
>>
>> > +1
>> >
>> > Julien
>> >
>> > -Original Message-
>> > From: Karl Wright 
>> > Sent: Saturday, 25 September 2021 15:24
>> > To: dev 
>> > Subject: [VOTE] Release Apache ManifoldCF 2.20, RC0
>> >
>> > Please vote on whether to release Apache ManifoldCF 2.20, RC0.
>> >
>> > This release has a new connector in it, and a few bug fixes, but is
>> > otherwise pretty light.  Nevertheless, it's a month behind schedule so
>> I'm
>> > calling a vote for release, by the end of the month.
>> >
>> > The release artifact can be found at:
>> >
>> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.20
>> >
>> > There is also a release tag here:
>> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.20-RC0
>> >
>> > Karl
>> >
>> >
>>
>


Re: [VOTE] Release Apache ManifoldCF 2.20, RC0

2021-10-01 Thread Karl Wright
Ran tests, verified things still work.  +1 from me.

Karl


On Mon, Sep 27, 2021 at 12:46 PM Cihad Guzel  wrote:

> +1
>
> Cihad Güzel
>
>
> On Sun, 26 Sep 2021 at 15:06,  wrote:
>
> > +1
> >
> > Julien
> >
> > -Original Message-
> > From: Karl Wright 
> > Sent: Saturday, 25 September 2021 15:24
> > To: dev 
> > Subject: [VOTE] Release Apache ManifoldCF 2.20, RC0
> >
> > Please vote on whether to release Apache ManifoldCF 2.20, RC0.
> >
> > This release has a new connector in it, and a few bug fixes, but is
> > otherwise pretty light.  Nevertheless, it's a month behind schedule so
> I'm
> > calling a vote for release, by the end of the month.
> >
> > The release artifact can be found at:
> > https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.20
> >
> > There is also a release tag here:
> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.20-RC0
> >
> > Karl
> >
> >
>


Re: Error: Repeated service interruptions - failure processing document: Read timed out

2021-09-30 Thread Karl Wright
Hi,

You say this is a "Tika error".  Is this Tika as a stand-alone service?  I
do not recognize any ManifoldCF classes whatsoever in this thread dump.

If this is Tika, I suggest contacting the Tika team.

Karl


On Thu, Sep 30, 2021 at 3:02 AM Bisonti Mario 
wrote:

> Additional info.
>
>
>
> I am using 2.17-dev version
>
>
>
>
>
>
>
> *Da:* Bisonti Mario
> *Inviato:* martedì 28 settembre 2021 17:01
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Error: Repeated service interruptions - failure processing
> document: Read timed out
>
>
>
> Hello
>
>
>
> I have error on a Job that parses a network folder.
>
>
>
> This is the tika error:
> 2021-09-28 16:14:50 INFO  Server:415 - Started @1367ms
>
> 2021-09-28 16:14:50 WARN  ContextHandler:1671 - Empty contextPath
>
> 2021-09-28 16:14:50 INFO  ContextHandler:916 - Started
> o.e.j.s.h.ContextHandler@3dd69f5a{/,null,AVAILABLE}
>
> 2021-09-28 16:14:50 INFO  TikaServerCli:413 - Started Apache Tika server
> at http://sengvivv02.vimar.net:9998/
>
> 2021-09-28 16:15:04 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:26:46 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:26:46 INFO  TikaResource:484 - tika (application/pdf)
>
> 2021-09-28 16:27:23 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:27:24 INFO  TikaResource:484 - tika (application/pdf)
>
> 2021-09-28 16:27:26 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:27:26 INFO  TikaResource:484 - tika (application/pdf)
>
> 2021-09-28 16:30:28 WARN  PhaseInterceptorChain:468 - Interceptor for {
> http://resource.server.tika.apache.org/}MetadataResource has thrown
> exception, unwinding now
>
> org.apache.cxf.interceptor.Fault: Could not send Message.
>
> at
> org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:67)
>
> at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>
> at
> org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
>
> at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>
> at
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>
> at
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
>
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
>
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
>
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
>
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>
> at org.eclipse.jetty.server.Server.handle(Server.java:516)
>
> at
> org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
>
> at
> org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
>
> at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
>
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
>
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
>
> at
> org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
>
> at
> org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
>
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
>
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
>
> at java.base/java.lang.Thread.run(Thread.java:834)
>
> Caused by: org.eclipse.jetty.io.EofException
>
> at
> org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)
>
> at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)
>
> at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:277)
>
> at
> org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
>
> at
> org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:826)
>
> at
> 

[VOTE] Release Apache ManifoldCF 2.20, RC0

2021-09-25 Thread Karl Wright
Please vote on whether to release Apache ManifoldCF 2.20, RC0.

This release has a new connector in it, and a few bug fixes, but is
otherwise pretty light.  Nevertheless, it's a month behind schedule so I'm
calling a vote for release, by the end of the month.

The release artifact can be found at:
https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.20

There is also a release tag here:
https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.20-RC0

Karl


[jira] [Updated] (CONNECTORS-1671) Solr output connector behavior on some exceptions

2021-09-08 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1671:

Fix Version/s: (was: ManifoldCF 2.19)
   ManifoldCF 2.20

> Solr output connector behavior on some exceptions
> -
>
> Key: CONNECTORS-1671
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1671
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Affects Versions: ManifoldCF 2.19
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.20
>
> Attachments: patch-CONNECTORS-1671.txt
>
>
> In the « handleIOException » method of the « HttpPoster » class, the unknown 
> case triggers a job failure even though the exception can only concern the 
> document/action itself and not a potential "Solr down" issue 
> (all "Solr down" issues are handled upstream).
> The same applies to the « handleSolrServerException » method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (CONNECTORS-1671) Solr output connector behavior on some exceptions

2021-09-08 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1671:
---

Fix Version/s: (was: ManifoldCF next)
   ManifoldCF 2.19
 Assignee: Karl Wright
   Resolution: Fixed

r1893132


> Solr output connector behavior on some exceptions
> -
>
> Key: CONNECTORS-1671
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1671
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Affects Versions: ManifoldCF 2.19
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.19
>
> Attachments: patch-CONNECTORS-1671.txt
>
>
> In the « handleIOException » method of the « HttpPoster » class, the unknown 
> case triggers a job failure even though the exception can only concern the 
> document/action itself and not a potential "Solr down" issue 
> (all "Solr down" issues are handled upstream).
> The same applies to the « handleSolrServerException » method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Tika Parser Issue

2021-09-07 Thread Karl Wright
This is something you should contact the Tika project about.
Karl


On Tue, Sep 7, 2021 at 8:46 AM ritika jain  wrote:

> Hi All,
>
> I am using the tika-core 1.21 and tika-parsers 1.21 jar files as Tika
> dependencies in ManifoldCF 2.14.
> I am getting some issues while parsing *PDF* files: some strange characters
> appeared. I also tried changing the Tika jar versions to 1.24 and 1.27 (those
> didn't even extract files correctly).
>
> [image: 365.jfif]
> Also checked with the document content, it seems to be fine.
> Can anybody help me on this.
>
> Thanks
> Ritika
>


Re: Query:JCIFS connector

2021-08-23 Thread Karl Wright
I have a work day today, with limited time.
The UI is what it is; it does not have capabilities beyond what is stated
in the UI and in the manual.  It's meant to allow construction of paths
piece by piece, not a full subdirectory path at a time.

You can obviously use the API if you want to construct path specifications
some other way.  It sounds like you are doing things programmatically
anyway so I would definitely look into that.

Karl


On Mon, Aug 23, 2021 at 3:52 AM ritika jain 
wrote:

> Can anybody have a clue on this ?
>
> On Fri, Aug 20, 2021 at 12:33 PM ritika jain 
> wrote:
>
>> Hi All,
>>
>> I have a query: is there a way to specify a subdirectory path in the file
>> spec of the Windows Shares connector?
>>
>> My requirement is to put the topmost folder in the hierarchy at the top, as
>> shown in the screenshot, and in the file spec to mention the file name
>> preceded by its subdirectories.
>> *Say for example there is a file*
>> E:\sharing\demo\Index.pdf
>>
>> The requirement is to mention "sharing" at the top and the rest of the path
>> in the file spec.
>>
>> [image: image.png]
>>
>> Is there any way to do it? Any help would be appreciated
>>
>> Thanks
>> Ritika
>>
>>


Re: Job Deletion query

2021-08-12 Thread Karl Wright
Yes, when you delete a job, the indexed documents associated with that job
are removed from the index.

ManifoldCF is a synchronizer, not a crawler, so when you remove the
synchronization job then if it didn't delete the indexed documents they
would be left dangling.

Karl


On Thu, Aug 12, 2021 at 3:46 AM ritika jain 
wrote:

> Hi All,
>
> When we delete a job in ManifoldCF, does it also delete the documents
> indexed via that job from the Elastic index?
>
> I understand that when a job is deleted from the ManifoldCF interface it
> will delete all the documents referenced via that job from Postgres. But
> why is it also deleted from the ES index?
>
> Thanks
> Ritika
>


Re: Window shares dynamic Job issue

2021-08-11 Thread Karl Wright
,"_value_":"1599130705168"},{"_type_":"description","_value_":"Demo_job"},{"_type_":"repository_connection","_value_":"mas_Repo"},{"_type_":"document_specification","_children_":[{"_type_":"startpoint","include":[{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pdf","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.doc","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docm","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dot","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dotx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wpd
>  
> ","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pptx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.ppt","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp4","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp5","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp6","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp7","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsm
>  
> ","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xls","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xls","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.png","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpeg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.bmp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.gif","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpeg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpg","_value_":"","_attribute_type":"file"},{"_attribute_filespec":"*","_value_":"","_attribute_type":"directory"}],"_attribute_path":"*windows\/Job\/Demo School Network\/Information\*","_value_":""},{"_type_":"maxlength","_value_":"","_attribute_value":"500"},{"_type_":"security","_value_":"","_attribute_value":"on"},{"_type_":"sharesecurity","_value_":"","_attribute_value":"on"},{"_type_":"parentfoldersecurity","_value_":"","_attribute_value":"off"}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"0"},{"_type_":"stage_isoutput","_value_":"false"},{"_type_":"stage_connectionname","_value_":"Tika"},{"_type_":"stage_specification","_children_":[{"_type_":"keepAllMetadata","_value_":"","_attribute_value":"true"},{"_type_":"lowerNames","_value_":"","_attribute_value":"false"},{"_type_":"writeLimit","_value_":"","_attribute_value":""},{"_type_":"ignoreException","_value_":"","_attribute_value":"true"},{"_type_":"boilerplateprocessor","_value_":"","_attribute_value":"de.l3s.boilerpipe.extractors.KeepEverythingExtractor"}]}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"1"},{"_type_":"stage_prerequisite","_value_":"0"},{"_type_":"stage_isoutput","_value_":"false"},{"_type_":"stage_connectionname","_value_":"Metadata Adjuster"},{"_type_":"stage_specification","_children_":[{"_type_":"expression","_attribute_parameter":"d_connector_type","_value_":"","_attribute_value":"FileShare"},{"_type_":"expression","_attribute_parameter":"d_description","_value_":"","_attribute_value":"\"${dc:description}\" "},{"_type_":"keepAllMetadata","_value_":"","_attribute_value":"true"},{"_type_":"filterEmpty","_value_":"","_attribute_value":"true"}]}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"2"},{"_type_":"stage_prerequisite","_value_":"1"},{"_type_":"stage_isoutput","_value_":"true"},{"_type_":"stage_connectionname","_value_":"Deltares_Output"},{"_type_":"stage_specification"}]},{"_type_":"start_mode","_value_":"manual"},{"_type_":"run_mode","_value_":"scan once"},{"_type_":"hopcount_mode","_value_":"accurate"},{"_type_":"priority","_value_":"5"},{"_type_":"recrawl_interval","_value_":"8640"},{"_type_":"max_recrawl_interval","_value_":"infinite"},{"_type_":"expiration_interval","_value_":"infinite"},{"_type_":"reseed_interval","_value_":"360"}]}}
>
> Basically these two job structures are exactly the same, except for Path: in
> the first it is the complete path down to the file, and in the second only
> the path down to the folders.
>
> In the first case the ingested file has a slash at the end, and in the
> second case it doesn't.
>
>
> Thanks
>
> Ritika
>
>
> On Tue, Aug 10, 2021 at 6:52 PM Karl Wright  wrote:
>
>> I am sorry, but I'm having trouble understanding how exactly you are
>> configuring the JCIFS connector in these two cases. Can you view the job
>> in each case and provide cut-and-paste of the view?
>>
>> Karl
>>
>>
>> On Tue, Aug 10, 2021 at 9:09 AM ritika jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> I am using Window shares connector in 2.14 manifoldcf version and
>>> Elastic as output.
>>> I have created a dynamic manifoldcf job API via which a job will be
>>> created in manifoldcf with an inclusions list and path; only a particular
>>> file path is to be mentioned. Example file path: C:/Users/Dell/Desktop/abc.txt.
>>>
>>> A job will be created to crawl only this single file .
>>> *Issue is :-*
>>> When this job ingests a document into Elasticsearch, there is a slash that
>>> gets appended at the end
>>>
>>> *Ingested file is* :- C:/Users/Dell/Desktop/abc.txt/
>>>
>>> But when the same file is crawled via Manifoldcf job settings by mentioning
>>> the path down to the folder structure (as manual job creation does not
>>> allow a file path down to a particular file; it allows folders only),
>>> it does not append /
>>>
>>> *Ingested file in this case:-*
>>> C:/Users/Dell/Desktop/abc.txt
>>> as expected, the original file.
>>>
>>> *Query*
>>> Why is this the case? It makes searching in ES ambiguous.
>>>
>>> Thanks
>>> Ritika
>>>
>>>
>>>
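One pragmatic workaround, until the root cause of the appended slash is found, is to normalize the document identifier before indexing. This is a minimal sketch under the assumption that stripping a single trailing slash is acceptable for these document URIs; `UriNormalizer` is a hypothetical helper, not ManifoldCF code:

```java
// Hypothetical post-processing helper, not part of ManifoldCF.
public class UriNormalizer {
    static String stripTrailingSlash(String uri) {
        // Keep a bare "/" intact; otherwise drop exactly one trailing slash.
        if (uri.length() > 1 && uri.endsWith("/"))
            return uri.substring(0, uri.length() - 1);
        return uri;
    }
    public static void main(String[] args) {
        System.out.println(stripTrailingSlash("C:/Users/Dell/Desktop/abc.txt/"));
        // prints "C:/Users/Dell/Desktop/abc.txt"
    }
}
```

Normalizing at index time keeps the two job configurations from producing two distinct Elasticsearch ids for the same file.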


Re: Window shares dynamic Job issue

2021-08-10 Thread Karl Wright
I am sorry, but I'm having trouble understanding how exactly you are
configuring the JCIFS connector in these two cases. Can you view the job
in each case and provide cut-and-paste of the view?

Karl


On Tue, Aug 10, 2021 at 9:09 AM ritika jain 
wrote:

> Hi All,
>
> I am using Window shares connector in 2.14 manifoldcf version and Elastic
> as output.
> I have created a dynamic manifoldcf job API via which a job will be
> created in manifoldcf with an inclusions list and path; only a particular
> file path is to be mentioned. Example file path: C:/Users/Dell/Desktop/abc.txt.
>
> A job will be created to crawl only this single file .
> *Issue is :-*
> When this job ingests a document into Elasticsearch, there is a slash that
> gets appended at the end
>
> *Ingested file is* :- C:/Users/Dell/Desktop/abc.txt/
>
> But when the same file is crawled via Manifoldcf job settings by mentioning
> the path down to the folder structure (as manual job creation does not
> allow a file path down to a particular file; it allows folders only),
> it does not append /
>
> *Ingested file in this case:-*
> C:/Users/Dell/Desktop/abc.txt
> as expected, the original file.
>
> *Query*
> Why is this the case? It makes searching in ES ambiguous.
>
> Thanks
> Ritika
>
>
>


[jira] [Commented] (CONNECTORS-1671) Solr output connector behavior on some exceptions

2021-07-31 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390936#comment-17390936
 ] 

Karl Wright commented on CONNECTORS-1671:
-

[~julienFL], you are catching RuntimeException.  That is the broad class of 
exceptions that includes everything that you really don't want to ignore, e.g. 
OutOfMemoryException.  Can you be more specific about what you want to catch 
here?  I am not comfortable with this.
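For readers unfamiliar with why catching `RuntimeException` is considered too broad here: a minimal, self-contained sketch (not the actual HttpPoster code; the class and method names are made up for illustration) showing how catching only a narrow exception type lets other failures still propagate:

```java
// Illustrative only: catch a specific exception instead of RuntimeException.
public class CatchScope {
    static String handle(Runnable r) {
        try {
            r.run();
            return "ok";
        } catch (IllegalArgumentException e) {
            // A narrow, document-specific failure we explicitly choose to skip.
            return "skipped: " + e.getMessage();
        }
        // Anything else (NullPointerException, IllegalStateException, ...)
        // still propagates instead of being silently swallowed.
    }
    public static void main(String[] args) {
        System.out.println(handle(() -> { throw new IllegalArgumentException("bad metadata"); }));
        // prints "skipped: bad metadata"
    }
}
```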


> Solr output connector behavior on some exceptions
> -
>
> Key: CONNECTORS-1671
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1671
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Affects Versions: ManifoldCF 2.19
>Reporter: Julien Massiera
>Priority: Major
> Fix For: ManifoldCF next
>
> Attachments: patch-CONNECTORS-1671.txt
>
>
> In the « handleIOException » method of the « HttpPoster » class, the unknown 
> case triggers a job failure even though the exception can only concern the 
> document/action itself and not a problem with a potential "Solr down" issue 
> (all "Solr down" issues are handled upstream).
> The same applies to the « handleSolrServerException » method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: JCIFS Connector File Size Attribute

2021-07-26 Thread Karl Wright
The parameter in the Solr connection UI is:

"Original size field name:"

Karl

On Mon, Jul 26, 2021 at 12:24 PM Wolfinger Uwe 
wrote:

> Do I need any extra configuration in the solr connection? When I look at
> the query string that sends the request to solr, there is no field
> "originalSize".
>
> Kind Regards,
> Uwe
>
>
>
>
> -----Original Message-----
> From: Karl Wright 
> Sent: Friday, 23 July 2021 20:34
> To: dev 
> Subject: Re: JCIFS Connector File Size Attribute
>
> Hi,
> The original size field is provided by the Repository Connector, and
> passed to the output connector.
>
> In this case, the code that sets the field is here:
>
> kawright@1USDKAWRIGHT:/mnt/c/wip/mcf/trunk$ grep -R
> "rd.setOriginalSize(originalLength);" . --include "*.java"
>
> ./connectors/jcifs/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/sharedrive/SharedDriveConnector.java:  rd.setOriginalSize(originalLength);
>
> The code that uses this field and pushes it into Solr is configured in the
> Solr connection.  That is probably why you are overlooking it.
>
> Thanks,
>
> Karl
>
>
> On Fri, Jul 23, 2021 at 10:13 AM Wolfinger Uwe 
> wrote:
>
> > Hi,
> >
> > we are using the JCIFs shared drive connector to crawl windows shares.
> > What we would like to have is, that the file size can be displayed in
> > the search results, i.e. that an appropriate attribute is sent to solr.
> >
> > According to this issue:
> > https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1204
> > this should already work.
> >
> > Unfortunately i am not able configure the corresponding job to send
> > such an attribute. A look at
> >
> >
> > https://github.com/apache/manifoldcf/blob/trunk/connectors/jcifs/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/sharedrive/SharedDriveConnector.java
> >
> > shows, that only the following attributes are added as fields:
> > rd.addField("lastModified", lastModifiedDate.toString());
> >
> > rd.addField("fileLastModified",DateParser.formatISO8601Date(lastModifiedDate));
> > rd.addField("createdOn", creationDate.toString());
> > rd.addField("fileCreatedOn",DateParser.formatISO8601Date(creationDate));
> > rd.addField("attributes", Integer.toString(attributes));
> > rd.addField("shareName", shareName);
> >
> > Am I missing something? Or is the fileSize attribute missing when
> > populating the crawling result.
> >
> > kind regards,
> > Uwe
> >
> >
> >
>
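To make the data flow concrete: a hedged sketch of the pattern Karl describes, with a stand-in `RepositoryDocument` class. The real class lives in the ManifoldCF framework; only the `setOriginalSize`/`addField` calls mirror the actual connector code, and the values here are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for MCF's RepositoryDocument; illustrative only.
public class SizeFieldSketch {
    static class RepositoryDocument {
        Long originalSize;
        Map<String, String> fields = new HashMap<>();
        void setOriginalSize(long n) { originalSize = n; }
        void addField(String name, String value) { fields.put(name, value); }
    }
    public static void main(String[] args) {
        RepositoryDocument rd = new RepositoryDocument();
        long originalLength = 123456L;       // e.g. the file length from JCIFS
        rd.setOriginalSize(originalLength);  // consumed by the Solr connection's
                                             // "Original size field name" setting
        rd.addField("shareName", "docs");    // ordinary metadata travels separately
        System.out.println(rd.originalSize); // prints 123456
    }
}
```

The point of the sketch: the size is not an ordinary `addField` metadata field, which is why it only appears in the Solr query once the output connection names a field for it.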


Re: JCIFS Connector File Size Attribute

2021-07-23 Thread Karl Wright
Hi,
The original size field is provided by the Repository Connector, and passed
to the output connector.

In this case, the code that sets the field is here:

kawright@1USDKAWRIGHT:/mnt/c/wip/mcf/trunk$ grep -R
"rd.setOriginalSize(originalLength);" . --include "*.java"
./connectors/jcifs/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/sharedrive/SharedDriveConnector.java:  rd.setOriginalSize(originalLength);

The code that uses this field and pushes it into Solr is configured in the
Solr connection.  That is probably why you are overlooking it.

Thanks,

Karl


On Fri, Jul 23, 2021 at 10:13 AM Wolfinger Uwe 
wrote:

> Hi,
>
> we are using the JCIFs shared drive connector to crawl windows shares.
> What we would like to have is, that the file size can be displayed in the
> search results, i.e. that an appropriate attribute is sent to solr.
>
> According to this issue:
> https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1204
> this should already work.
>
> Unfortunately i am not able configure the corresponding job to send such
> an attribute. A look at
>
>
> https://github.com/apache/manifoldcf/blob/trunk/connectors/jcifs/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/sharedrive/SharedDriveConnector.java
>
> shows, that only the following attributes are added as fields:
> rd.addField("lastModified", lastModifiedDate.toString());
>
> rd.addField("fileLastModified",DateParser.formatISO8601Date(lastModifiedDate));
> rd.addField("createdOn", creationDate.toString());
> rd.addField("fileCreatedOn",DateParser.formatISO8601Date(creationDate));
> rd.addField("attributes", Integer.toString(attributes));
> rd.addField("shareName", shareName);
>
> Am I missing something? Or is the fileSize attribute missing when
> populating the crawling result.
>
> kind regards,
> Uwe
>
>
>


Re: Solr output connector - behavior on some exceptions

2021-07-13 Thread Karl Wright
It is called the "Lucene/Solr Connector" component.
Karl


On Tue, Jul 13, 2021 at 10:11 AM  wrote:

> Ok, there is no "Solr connector" component in JIRA, can you add it please?
>
> -----Original Message-----
> From: Karl Wright 
> Sent: Tuesday, 13 July 2021 16:04
> To: dev 
> Subject: Re: Solr output connector - behavior on some exceptions
>
> Null values causing exceptions in the output connector should be addressed
> independently in the output connector.  But basically as long as that is
> done I am fine with your proposal.
>
> Karl
>
>
> On Tue, Jul 13, 2021 at 9:59 AM  wrote:
>
> > I ended up in that part of the code while debugging after we had a
> > crawling job stopped because of an exception concerning a document
> > having a null value for a specific metadata and another one with a
> > value that triggered a request parsing issue on Solr side.
> >
> > Julien
> >
> > -----Original Message-----
> > From: Karl Wright 
> > Sent: Tuesday, 13 July 2021 15:48
> > To: dev 
> > Subject: Re: Solr output connector - behavior on some exceptions
> >
> > If the "solr is down" exceptions are indeed caught upstream, I'm
> > tentatively in agreement that this fallback logic can be changed.  But
> > I would like to understand what specifically you are seeing this happen
> for.
> > What cases are you hoping to improve?
> >
> > Karl
> >
> >
> > On Tue, Jul 13, 2021 at 9:39 AM  wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > I would like to change the behavior of the Solr output connector
> > > concerning two exception handling cases :
> > >
> > >
> > >
> > >1. In the current « handleIOException » method of the HttpPoster
> > >class, the « unknown » case looks like this :
> > >
> > >
> > >
> > >As the comment says, we don’t know the type of IOException, so it is
> > >not necessary to make the ServiceInterruption fail after a period,
> > >especially since all « Solr down » exceptions have been handled
> > > upstream
> > >
> > >2. The current « handleSolrServerException » method of the HttPoster
> > >class. Same as above, this method is called for an unknown
> > > exception
> > that
> > >cannot be related to a « Solr down » issue; it can only be
> > > related to
> > some
> > >missconfiguration or document specific issue. It is therefore not
> > necessary
> > >to throw a ManifoldCFException that will stop the job with a
> > > failure state
> > >
> > >
> > >
> > >
> > >
> > > What do you think ? If you agree with me, I can create a ticket for
> > > that and submit a patch. This would allow us to gracefully keep the job
> > > running while properly skipping identified exceptions.
> > >
> > >
> > >
> > >
> > >
> > > Regards,
> > > Julien
> > >
> > >
> > >
> > >
> > >
> >
> >
>
>


Re: Solr output connector - behavior on some exceptions

2021-07-13 Thread Karl Wright
Null values causing exceptions in the output connector should be addressed
independently in the output connector.  But basically as long as that is
done I am fine with your proposal.

Karl


On Tue, Jul 13, 2021 at 9:59 AM  wrote:

> I ended up in that part of the code while debugging after we had a
> crawling job stopped because of an exception concerning a document having a
> null value for a specific metadata and another one with a value that
> triggered a request parsing issue on Solr side.
>
> Julien
>
> -----Original Message-----
> From: Karl Wright 
> Sent: Tuesday, 13 July 2021 15:48
> To: dev 
> Subject: Re: Solr output connector - behavior on some exceptions
>
> If the "solr is down" exceptions are indeed caught upstream, I'm
> tentatively in agreement that this fallback logic can be changed.  But I
> would like to understand what specifically you are seeing this happen for.
> What cases are you hoping to improve?
>
> Karl
>
>
> On Tue, Jul 13, 2021 at 9:39 AM  wrote:
>
> > Hi,
> >
> >
> >
> > I would like to change the behavior of the Solr output connector
> > concerning two exception handling cases :
> >
> >
> >
> >1. In the current « handleIOException » method of the HttpPoster
> >class, the « unknown » case looks like this :
> >
> >
> >
> >As the comment says, we don’t know the type of IOException, so it is
> >not necessary to make the ServiceInterruption fail after a period,
> >especially since all « Solr down » exceptions have been handled
> > upstream
> >
> >2. The current « handleSolrServerException » method of the HttPoster
> >class. Same as above, this method is called for an unknown exception
> that
> >cannot be related to a « Solr down » issue; it can only be related to
> some
> >missconfiguration or document specific issue. It is therefore not
> necessary
> >to throw a ManifoldCFException that will stop the job with a
> > failure state
> >
> >
> >
> >
> >
> > What do you think ? If you agree with me, I can create a ticket for
> > that and submit a patch. This would allow us to gracefully keep the job
> > running while properly skipping identified exceptions.
> >
> >
> >
> >
> >
> > Regards,
> > Julien
> >
> >
> >
> >
> >
>
>


Re: Solr output connector - behavior on some exceptions

2021-07-13 Thread Karl Wright
If the "solr is down" exceptions are indeed caught upstream, I'm
tentatively in agreement that this fallback logic can be changed.  But I
would like to understand what specifically you are seeing this happen for.
What cases are you hoping to improve?

Karl


On Tue, Jul 13, 2021 at 9:39 AM  wrote:

> Hi,
>
>
>
> I would like to change the behavior of the Solr output connector
> concerning two exception handling cases :
>
>
>
>1. In the current « handleIOException » method of the HttpPoster
>class, the « unknown » case looks like this :
>
>
>
>As the comment says, we don’t know the type of IOException, so it is
>not necessary to make the ServiceInterruption fail after a period,
>especially since all « Solr down » exceptions have been handled upstream
>
>2. The current « handleSolrServerException » method of the HttPoster
>class. Same as above, this method is called for an unknown exception that
>cannot be related to a « Solr down » issue; it can only be related to some
>missconfiguration or document specific issue. It is therefore not necessary
>to throw a ManifoldCFException that will stop the job with a failure state
>
>
>
>
>
> What do you think ? If you agree with me, I can create a ticket for that
> and submit a patch. This would allow us to gracefully keep the job running
> while properly skipping identified exceptions.
>
>
>
>
>
> Regards,
> Julien
>
>
>
>
> 
>


Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
If you wish to add a feature request, please create a CONNECTORS ticket
that describes the functionality you think the connector should have.

Karl


On Wed, Jul 7, 2021 at 9:29 AM h0444xk8  wrote:

> Hi,
>
> yes, that seems to be the reason. In:
>
>
> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
>
> there is the following code sequence:
>
> else if (lowercaseLine.startsWith("sitemap:"))
>{
>  // We don't complain about this, but right now we don't
> listen to it either.
>}
>
> But if I have a look at:
>
>
> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
>
> a sitemap containing an urlset seems to be handled
>
> else if (localName.equals("urlset") || localName.equals("sitemapindex"))
>{
>  // Sitemap detected
>  outerTagCount++;
>  return new
>
> UrlsetContextClass(theStream,namespace,localName,qName,atts,documentURI,handler);
>}
>
> So, my question is: is there another way to handle sitemaps inside the
> Web Crawler?
>
> Cheers Sebastian
>
>
>
>
>
> Am 07.07.2021 12:23 schrieb Karl Wright:
>
> > The robots parsing does not recognize the "sitemaps" line, which was
> > likely not in the spec for robots when this connector was written.
> >
> > Karl
> >
> > On Wed, Jul 7, 2021 at 3:31 AM h0444xk8  wrote:
> >
> >> Hi,
> >>
> >> I have a general question. Is the Web connector supporting sitemap
> >> files
> >> referenced by the robots.txt? In my use case the robots.txt is stored
> >> in
> >> the root of the website and is referencing two compressed sitemaps.
> >>
> >> Example of robots.txt
> >> 
> >> User-Agent: *
> >> Disallow:
> >> Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz [1]
> >> Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz [2]
> >>
> >> When start crawling in „Simple History" there is an error log entry as
> >> follows:
> >>
> >> Unknown robots.txt line: 'Sitemap:
> >> https://www.example.de/sitemap/en-sitemap.xml.gz [2]'
> >>
> >> Is there a general problem with sitemaps at all or with sitemaps
> >> referenced in robots.txt or with compressed sitemaps?
> >>
> >> Best regards
> >>
> >> Sebastian
>
>
> Links:
> --
> [1] https://www.example.de/sitemap/de-sitemap.xml.gz
> [2] https://www.example.de/sitemap/en-sitemap.xml.gz
>


Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
The robots parsing does not recognize the "sitemaps" line, which was likely
not in the spec for robots when this connector was written.

Karl


On Wed, Jul 7, 2021 at 3:31 AM h0444xk8  wrote:

> Hi,
>
> I have a general question. Is the Web connector supporting sitemap files
> referenced by the robots.txt? In my use case the robots.txt is stored in
> the root of the website and is referencing two compressed sitemaps.
>
> Example of robots.txt
> 
> User-Agent: *
> Disallow:
> Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz
> Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz
>
> When start crawling in „Simple History" there is an error log entry as
> follows:
>
> Unknown robots.txt line: 'Sitemap:
> https://www.example.de/sitemap/en-sitemap.xml.gz'
>
> Is there a general problem with sitemaps at all or with sitemaps
> referenced in robots.txt or with compressed sitemaps?
>
> Best regards
>
> Sebastian
>
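A minimal sketch of what recognizing the `Sitemap:` line could look like. This is illustrative only: the real fix would go into the connector's `Robots.java`, and the discovered URLs would have to be fed back into the crawl queue rather than just collected:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sitemap extraction from a robots.txt body; not actual MCF code.
public class RobotsSitemaps {
    static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String line : robotsTxt.split("\n")) {
            // The directive name is case-insensitive per the sitemaps protocol.
            if (line.toLowerCase().trim().startsWith("sitemap:"))
                // Keep the original casing of the URL itself.
                sitemaps.add(line.trim().substring("sitemap:".length()).trim());
        }
        return sitemaps;
    }
    public static void main(String[] args) {
        String robots = "User-Agent: *\nDisallow:\n"
            + "Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz\n"
            + "Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz\n";
        System.out.println(extractSitemaps(robots));
    }
}
```

Handling the `.xml.gz` targets would additionally need gzip decompression before the existing urlset/sitemapindex parsing can run.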


[jira] [Commented] (CONNECTORS-1670) PostgreSQL: transaction in progress

2021-07-06 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375979#comment-17375979
 ] 

Karl Wright commented on CONNECTORS-1670:
-

It's not clear that this is anything other than a warning.  There may be a way 
to turn it off, if so.


> PostgreSQL: transaction in progress
> ---
>
> Key: CONNECTORS-1670
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1670
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.19
>Reporter: Uwe Wolfinger
>Priority: Major
>
> We recently upgraded Manifold version to 2.19 and PostgreSQL version to 12.5. 
> So far everything worked fine, except some warnings within the database (row 
> 13):
> 1: 2021-07-06 08:51:46.366 CEST-1013441-172.16.172.133(45412)-xxx: LOG: 
> Execute : START TRANSACTION 
>  2: 2021-07-06 08:51:46.366 CEST-1013441-172.16.172.133(45412)-xxx: LOG: 
> Execute S_38: SELECT id,status,connectionname FROM jobs WHERE 
> assessmentstate=$1 FOR UPDATE 
>  3: 2021-07-06 08:51:46.366 CEST-1013441-172.16.172.133(45412)-xxx: 
> DETAIL: Parameter: $1 = 'N' 
>  4: 2021-07-06 08:51:46.367 CEST-1013441-172.16.172.133(45412)-xxx: LOG: 
> Execute S_39: SELECT transformationname FROM jobpipelines WHERE ownerid=$1 
> AND transformationname IS NOT NULL 
>  5:2021-07-06 08:51:46.367 CEST-1013441-172.16.172.133(45412)-xxx: 
> DETAIL: Parameter: $1 = '1615294637875' 
>  6:2021-07-06 08:51:46.368 CEST-1013441-172.16.172.133(45412)-xxx: LOG: 
> Execute S_40: SELECT outputname FROM jobpipelines WHERE ownerid=$1 AND 
> outputname IS NOT NULL 
>  7:2021-07-06 08:51:46.368 CEST-1013441-172.16.172.133(45412)-xxx: 
> DETAIL: Parameter: $1 = '1615294637875' 
>  8:2021-07-06 08:51:46.369 CEST-10134
>  9:2021-07-06 08:55:02.148 CEST-1012399-172.16.172.133(45330)-xxx: LOG: 
> Execute : START TRANSACTION 
>  10:2021-07-06 08:55:02.149 CEST-1012399-172.16.172.133(45330)-xxx: LOG: 
> Execute S_4: SELECT id,lasttime,status,startmethod,connectionname FROM jobs 
> WHERE (status=$1 OR status=$2 OR status=$3 OR status=$4 OR status=$5) AND 
> startmethod!=$6 FOR UPDATE 
>  11:2021-07-06 08:55:02.149 CEST-1012399-172.16.172.133(45330)-xxx: 
> DETAIL: Parameter: $1 = 'N', $2 = 'W', $3 = 'w', $4 = 'Z', $5 = 'z', $6 = 'D' 
>  12:2021-07-06 08:55:02.151 CEST-1012399-172.16.172.133(45330)-xxx: LOG: 
> Execute : START TRANSACTION 
>  13:2021-07-06 08:55:02.151 CEST-1012399-172.16.172.133(45330)-xxx: 
> WARNING: there is already a transaction in progress 
>  14:2021-07-06 08:55:02.153 CEST-1012399-172.16.172.133(45330)-xxx: LOG: 
> execute S_14: SELECT * FROM schedules WHERE ownerid=$1 ORDER BY ordinal ASC 
>  15:2021-07-06 08:55:02.153 CEST-1012399-172.16.172.133(45330)-xxx: 
> DETAIL: Parameter: $1 =
> as we can see, a new transaction is started in row 9 and then another one in 
> row 12, which causes the warning in row 13. The reason seems to be that 
> the first transaction is not terminated when the second one starts. I tried 
> to look up the code in the source, but I couldn't find the right location.
> Has anyone else encountered this behaviour, or does anyone know where to 
> look up the location in the source code?
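The warning itself can be reproduced standalone in psql, independent of ManifoldCF: PostgreSQL emits it whenever a transaction-start command is issued while a transaction is already open (session transcript, not MCF code):

```sql
-- psql session reproducing the warning from row 13 of the log above
BEGIN;   -- server replies: BEGIN
BEGIN;   -- server replies: WARNING:  there is already a transaction in progress
COMMIT;  -- ends the one (still single) open transaction
```

So the warning is harmless to data integrity; the second BEGIN is simply a no-op inside the already-open transaction.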



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Commit on CONNECTORS-1667 branch

2021-06-25 Thread Karl Wright
Thanks!

No, I haven't had time to integrate this, but if the branch is ready, I'd
be happy to pull it in now.  Please let me know.

Karl


On Fri, Jun 25, 2021 at 9:52 AM  wrote:

> Hi Karl,
>
>
>
> I needed to patch my contribution for a new Tika connector from 3 months
> ago: I assumed that you had not yet found time to integrate it into
> the trunk, so I decided to commit it on the branch CONNECTORS-1667. As a
> reminder, here is the corresponding JIRA ticket:
> https://issues.apache.org/jira/browse/CONNECTORS-1667
>
>
>
> Regards,
> Julien
>
>
>
>


[jira] [Resolved] (LUCENE-10012) Cache concurrency for GeoStandardPath is poorly designed

2021-06-22 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved LUCENE-10012.
--
Fix Version/s: main (9.0)
   Resolution: Fixed

> Cache concurrency for GeoStandardPath is poorly designed
> 
>
> Key: LUCENE-10012
> URL: https://issues.apache.org/jira/browse/LUCENE-10012
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Affects Versions: 8.9
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Minor
> Fix For: main (9.0)
>
>
> The synchronization strategy used for caching of distance segments for the 
> GeoStandardPath shape is poorly designed and blocks way more than it needs to.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10012) Cache concurrency for GeoStandardPath is poorly designed

2021-06-22 Thread Karl Wright (Jira)
Karl Wright created LUCENE-10012:


 Summary: Cache concurrency for GeoStandardPath is poorly designed
 Key: LUCENE-10012
 URL: https://issues.apache.org/jira/browse/LUCENE-10012
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/spatial3d
Affects Versions: 8.9
Reporter: Karl Wright
Assignee: Karl Wright


The synchronization strategy used for caching of distance segments for the 
GeoStandardPath shape is poorly designed and blocks way more than it needs to.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
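A common shape for this kind of fix is per-key lazy computation with `ConcurrentHashMap.computeIfAbsent`, so lookups of different segments no longer serialize on one lock. A hedged sketch of the direction, with illustrative class and method names rather than Lucene's actual spatial3d code:

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative cache shape; not the actual GeoStandardPath implementation.
public class SegmentDistanceCache {
    private final ConcurrentHashMap<Integer, Double> cache = new ConcurrentHashMap<>();

    double distanceToSegment(int segmentIndex) {
        // Only computations for the SAME key serialize; distinct keys proceed
        // concurrently, unlike a single synchronized(this) critical section.
        return cache.computeIfAbsent(segmentIndex, i -> expensiveDistance(i));
    }
    private double expensiveDistance(int i) {
        return i * 2.0; // placeholder for the real geometric computation
    }
    public static void main(String[] args) {
        SegmentDistanceCache c = new SegmentDistanceCache();
        System.out.println(c.distanceToSegment(3)); // prints 6.0
    }
}
```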



Re: Manifoldcf Redirection process

2021-05-28 Thread Karl Wright
302 does get recognized as a redirection, yes


On Fri, May 28, 2021 at 5:07 AM ritika jain 
wrote:

> Is the process the same when the fetch/process status code returned is 302,
> when running a job with the web crawler and ES output connector?
>
> Can anybody give a clue about this?
>


[jira] [Commented] (CONNECTORS-1668) Use of Wild Characters in SharePoint Connector.

2021-05-23 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350025#comment-17350025
 ] 

Karl Wright commented on CONNECTORS-1668:
-

If you think you have a web service call that will locate the list of virtual 
sites given a root site, I'd create a method in SPSProxyHelper that implements 
that.  If you can show it works, the next thing to do is:

(1) Come up with a document identifier format that represents the root site.
(2) Change the processDocuments() method of the connector to recognize that 
document identifier format and call your new method.  The results should be 
added to the queue using "processActivities.addDocumentReference()".
(3) Decide how the document specification for this connector would need to be 
extended to support virtual site discovery - this is actually the tricky part, 
because you will need to modify the HTML editor for document specification 
editing to include this.

I can help you with (3) but first you need to prove you can do (1).
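Steps (1) and (2) could be prototyped roughly as follows. The `R|` identifier prefix, the `discoverVirtualSites` helper, and the queue stand-in are all assumptions for illustration, not existing MCF or SPSProxyHelper code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative-only sketch of root-site discovery dispatch.
public class SiteDiscoverySketch {
    // Stand-in for processActivities.addDocumentReference()
    static List<String> queue = new ArrayList<>();

    // Hypothetical identifier format: "R|<rootSiteUrl>" marks a root site.
    static void processDocument(String documentIdentifier) {
        if (documentIdentifier.startsWith("R|")) {
            String root = documentIdentifier.substring(2);
            // Would call the new SPSProxyHelper method from step (1).
            for (String site : discoverVirtualSites(root))
                queue.add(site);
        }
        // Non-root identifiers would fall through to the existing handling.
    }
    static List<String> discoverVirtualSites(String root) {
        // Assumed web-service result; hard-coded for illustration.
        return List.of(root + "/Project 1", root + "/Project 2");
    }
    public static void main(String[] args) {
        processDocument("R|https://sp.example.com/Projects");
        System.out.println(queue.size()); // prints 2
    }
}
```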
 

> Use of Wild Characters in SharePoint Connector.
> ---
>
> Key: CONNECTORS-1668
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1668
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: SharePoint connector
>Affects Versions: ManifoldCF 2.16
>Reporter: Shashank Dwivedi
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
> Attachments: image-2021-05-23-00-36-45-378.png
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, 
> My SharePoint site is of the following *Format* :
> -*Projects(root)*
>    -*Project 1*
>         -Project Library
>         -Folder 1
>         -Folder 2 ... Folder N
>    -*Project 2 ... Project N*
>         -Project Library
>         -Folder 1 .. Folder N
> We have the *Projects(root site)* in this fashion from Project 1 to *Project 
> N(2)*, where N is a *large number.* I wish to process all files present 
> inside the *Project Library folder* of all the projects.
> So, as a Path rule I am currently supplying "*Projects/**/*Project Library/* 
> *". There is no space between / and * in the last.
> However, this is *not working out*. It is also pulling documents inside 
> *Folder 1, Folder2,..Folder N.* I want it to Process files only inside 
> Project Library.
> Please suggest me the right way to accomplish this Task.
> I could not identify any suggestion regarding the same in the End user 
> Documentation.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1668) Use of Wild Characters in SharePoint Connector.

2021-05-22 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349817#comment-17349817
 ] 

Karl Wright commented on CONNECTORS-1668:
-

About whether we can implement "site discovery" in this connector: the problem 
is that Microsoft has deprecated the entire API we use, and the connector must 
be 100% redeveloped.  The old API did not have any ability to do site 
discovery.  Not sure what the new API has, but nobody on the MCF team has the 
six free weeks of coding time and access to a MS Sharepoint instance to build 
what is needed.  Volunteers welcome.


> Use of Wild Characters in SharePoint Connector.
> ---
>
> Key: CONNECTORS-1668
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1668
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: SharePoint connector
>Affects Versions: ManifoldCF 2.16
>Reporter: Shashank Dwivedi
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
> Attachments: image-2021-05-23-00-36-45-378.png
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, 
> My SharePoint site is of the following *Format* :
> -*Projects(root)*
>    -*Project 1*
>         -Project Library
>         -Folder 1
>         -Folder 2 ... Folder N
>    -*Project 2 ... Project N*
>         -Project Library
>         -Folder 1 .. Folder N
> We have the *Projects(root site)* in this fashion from Project 1 to *Project 
> N(2)*, where N is a *large number.* I wish to process all files present 
> inside the *Project Library folder* of all the projects.
> So, as a Path rule I am currently supplying "*Projects/**/*Project Library/* 
> *". There is no space between / and * in the last.
> However, this is *not working out*. It is also pulling documents inside 
> *Folder 1, Folder2,..Folder N.* I want it to Process files only inside 
> Project Library.
> Please suggest me the right way to accomplish this Task.
> I could not identify any suggestion regarding the same in the End user 
> Documentation.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1668) Use of Wild Characters in SharePoint Connector.

2021-05-22 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349815#comment-17349815
 ] 

Karl Wright commented on CONNECTORS-1668:
-

The logic for path rules is as follows:

{code}
  if (sn.getType().equals("pathrule"))
  {
    // New-style rule.
    // Here's the trick: We do what the first matching rule tells us to do.
    String pathMatch = sn.getAttributeValue("match");
    String action = sn.getAttributeValue("action");
    String ruleType = sn.getAttributeValue("type");

    // First, find out if we match EXACTLY.
    if (checkMatch(libraryPath,0,pathMatch))
    {
      // If this is true, the type also has to match if the rule is to apply.
      if (ruleType.equals("library"))
      {
        if (Logging.connectors.isDebugEnabled())
          Logging.connectors.debug("SharePoint: Library '"+libraryPath+"' exactly matched rule path '"+pathMatch+"'");
        if (action.equals("include"))
        {
          // For include rules, partial match is good enough to proceed.
          if (Logging.connectors.isDebugEnabled())
            Logging.connectors.debug("SharePoint: Including library '"+libraryPath+"'");
          return true;
        }
        if (Logging.connectors.isDebugEnabled())
          Logging.connectors.debug("SharePoint: Excluding library '"+libraryPath+"'");
        return false;
      }
    }
    else if (ruleType.equals("file") && checkPartialPathMatch(libraryPath,0,pathMatch,1) && action.equals("include"))
    {
      if (Logging.connectors.isDebugEnabled())
        Logging.connectors.debug("SharePoint: Library '"+libraryPath+"' partially matched file rule path '"+pathMatch+"' - including");
      return true;
    }
    else if (ruleType.equals("folder") && checkPartialPathMatch(libraryPath,0,pathMatch,1) && action.equals("include"))
    {
      if (Logging.connectors.isDebugEnabled())
        Logging.connectors.debug("SharePoint: Library '"+libraryPath+"' partially matched folder rule path '"+pathMatch+"' - including");
      return true;
    }
  }
}
{code}

I need to see the rule type; as you can see, to include a library, you need a 
library rule, and to include a site, you need a site rule.

The checkMatch() method does this:

{code}
  /** Recursive worker method for checkMatch.  Returns 'true' if there is a path that consumes both
  * strings in their entirety in a matched way.
  *@param caseSensitive is true if file names are case sensitive.
  *@param sourceMatch is the source string (w/o wildcards)
  *@param match is the match string (w/wildcards)
  *@return true if there is a match.
  */
  protected static boolean checkMatch(boolean caseSensitive, String sourceMatch, String match)
{code}

The partial path match method looks like this:

{code}
  protected static boolean checkPartialPathMatch( String sourceMatch, int sourceIndex, String match, int requiredExtraPathSections )
  {
    // The partial match must be of a complete path, with at least a specified number of trailing path components possible in what remains.
    // Path components can include everything but the "/" character itself.
    //
    // The match string is the one containing the wildcards.  Both the "*" wildcard and the "?" wildcard will match a "/", which is intended but is why this
    // matcher is a little tricky to write.
    //
    // Note also that it is OK to return "true" more than strictly necessary, but it is never OK to return "false" incorrectly.

    // This is a partial path match.  That means that we don't have to completely use up the match string, but what's left on the match string after the source
    // string is used up MUST either be capable of being null, or be capable of starting with a "/" followed by integral path sections, and MUST include at least n of these sections.
    //
{code}


If you look at the code, you will note there's quite a bit of debug logging 
around path matching.  The basic point though is that the entire match string 
must be consumed for the full match, meaning that anything that is not a 
wildcard MUST match, and for a partial match there must be at least N sections 
left over after the match is entirely consumed.

To summarize:

(1) You need a Site rule to include a site.
(2) You need a Library rule to include a library.
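Karl's description of checkMatch above - both wildcards match '/', and the entire match string must be consumed for a full match - can be illustrated with a minimal, hypothetical re-implementation (this is NOT the connector's actual code; it ignores case sensitivity and partial-path matching):

```java
// PathMatchSketch.java -- illustrative only; assumes the semantics described
// above: '*' matches zero or more characters (including '/'), '?' matches
// exactly one character (including '/'), and a full match must consume both
// the source string and the match string entirely.
public class PathMatchSketch {

  public static boolean checkMatch(String source, String match) {
    return checkMatch(source, 0, match, 0);
  }

  private static boolean checkMatch(String source, int si, String match, int mi) {
    if (mi == match.length())
      return si == source.length();  // match string consumed: source must be too
    char mc = match.charAt(mi);
    if (mc == '*') {
      // '*' can absorb any number of source characters, '/' included
      for (int k = si; k <= source.length(); k++)
        if (checkMatch(source, k, match, mi + 1))
          return true;
      return false;
    }
    if (si == source.length())
      return false;
    // '?' (or an exact character) consumes one source character
    if (mc == '?' || mc == source.charAt(si))
      return checkMatch(source, si + 1, match, mi + 1);
    return false;
  }

  public static void main(String[] args) {
    // Matches one level down...
    System.out.println(checkMatch("Projects/Project 1/Project Library",
                                  "Projects/*/Project Library")); // true
    // ...but '*' spans '/', so deeper paths match too -- which is why the
    // rule TYPE (site/library/file/folder) matters, not just the pattern.
    System.out.println(checkMatch("Projects/Project 1/Folder 1/Project Library",
                                  "Projects/*/Project Library")); // true
  }
}
```

This also suggests why the reporter's pattern pulls in documents under Folder 1..Folder N: a '*' happily spans multiple path components.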







> Use of Wild Characters in SharePoint Connector.
> -------
>
>   

[jira] [Assigned] (CONNECTORS-1668) Use of Wild Characters in SharePoint Connector.

2021-05-22 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1668:
---

Assignee: Karl Wright

> Use of Wild Characters in SharePoint Connector.
> ---
>
> Key: CONNECTORS-1668
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1668
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: SharePoint connector
>Affects Versions: ManifoldCF 2.16
>Reporter: Shashank Dwivedi
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.16
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, 
> My SharePoint site is of the following *Format* :
> -*Projects(root)*
>    -*Project 1*
>         -Project Library
>         -Folder 1
>         -Folder 2 ... Folder N
>    -*Project 2 ... Project N*
>         -Project Library
>         -Folder 1 .. Folder N
> We have the *Projects(root site)* in this fashion from Project 1 to *Project 
> N(2)*, where N is a *large number.* I wish to process all files present 
> inside the *Project Library folder* of all the projects.
> So, as a Path rule I am currently supplying "*Projects/**/*Project Library/* 
> *". There is no space between / and * in the last.
> However, this is *not working out*. It is also pulling documents inside 
> *Folder 1, Folder2,..Folder N.* I want it to Process files only inside 
> Project Library.
> Please suggest me the right way to accomplish this Task.
> I could not identify any suggestion regarding the same in the End user 
> Documentation.
>  





[jira] [Commented] (CONNECTORS-1668) Use of Wild Characters in SharePoint Connector.

2021-05-22 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349807#comment-17349807
 ] 

Karl Wright commented on CONNECTORS-1668:
-

Could you view your job, and include a screen shot of the inclusion rules you 
have?  There are several different kinds of rules, I have to see what you're 
actually trying.



> Use of Wild Characters in SharePoint Connector.
> ---
>
> Key: CONNECTORS-1668
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1668
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: SharePoint connector
>Affects Versions: ManifoldCF 2.16
>Reporter: Shashank Dwivedi
>Priority: Major
> Fix For: ManifoldCF 2.16
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, 
> My SharePoint site is of the following *Format* :
> -*Projects(root)*
>    -*Project 1*
>         -Project Library
>         -Folder 1
>         -Folder 2 ... Folder N
>    -*Project 2 ... Project N*
>         -Project Library
>         -Folder 1 .. Folder N
> We have the *Projects(root site)* in this fashion from Project 1 to *Project 
> N(2)*, where N is a *large number.* I wish to process all files present 
> inside the *Project Library folder* of all the projects.
> So, as a Path rule I am currently supplying "*Projects/**/*Project Library/* 
> *". There is no space between / and * in the last.
> However, this is *not working out*. It is also pulling documents inside 
> *Folder 1, Folder2,..Folder N.* I want it to Process files only inside 
> Project Library.
> Please suggest me the right way to accomplish this Task.
> I could not identify any suggestion regarding the same in the End user 
> Documentation.
>  





Re: Manifoldcf Redirection process

2021-05-19 Thread Karl Wright
ManifoldCF reads all the URLs on its queue.
If it's a 301, it detects this and pushes the new URL onto the document
queue.
When it gets to the new URL, it processes it like any other.

Karl
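As a rough illustration of the queueing decision Karl describes (hypothetical helper, not the actual web connector code), the 3xx case boils down to:

```java
import java.util.Optional;

// RedirectSketch.java -- illustrative decision helper only. The real web
// connector also resolves relative Location headers against the base URL
// and applies the job's inclusion/exclusion rules before queueing.
public class RedirectSketch {

  /** For a 3xx response with a Location header, return the URL to push onto
   *  the document queue instead of indexing the current one. */
  public static Optional<String> nextUrl(int statusCode, String locationHeader) {
    if (statusCode >= 300 && statusCode < 400 && locationHeader != null)
      return Optional.of(locationHeader);
    return Optional.empty();  // not a redirect: process/index normally
  }

  public static void main(String[] args) {
    System.out.println(nextUrl(301, "https://example.com/new")); // Optional[https://example.com/new]
    System.out.println(nextUrl(200, null));                      // Optional.empty
  }
}
```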


On Wed, May 19, 2021 at 8:32 AM ritika jain 
wrote:

> Hi
>
> I want to understand how ManifoldCF handles redirection of URLs in the
> case of the web crawler connector.
>
> If there is a page redirect (through a 301) to another URL, then the next
> crawl will detect the redirect and index the new (final) URL and display it
> in the search results. (instead of the old URL that redirects). Just as is
> also done by search engines like Google / Bing.
>
> Is it true that ManifoldCF is capable of avoiding a URL that returns a 301
> and instead picking up and ingesting the URL to which it redirects?
>
> If not, what process does ManifoldCF follow to ingest redirected URLs?
>
> Thanks
> Ritika
>
>
>


Re: Interrupted while acquiring credits

2021-05-14 Thread Karl Wright
"crashing the manifold" is probably running out of memory, and it is
probably due to having too many worker threads and insufficient memory, not
the error you found.

If that error caused a problem, it would simply abort the job, not "crash"
Manifold.

Karl


On Fri, May 14, 2021 at 4:10 AM ritika jain 
wrote:

> It retries 3 times and it usually crashes ManifoldCF.
>
> Similar ticket i observed
> https://issues.apache.org/jira/browse/CONNECTORS-1633; is ManifoldCF
> itself capable of skipping the file that causes the issue instead of
> aborting the job or crashing Manifold?
>
> On Fri, May 14, 2021 at 1:34 PM Karl Wright  wrote:
>
>> 'JCIFS: Possibly transient exception detected on attempt 1 while getting
>> share security'
>>
>> Yes, it is going to retry.
>>
>> Karl
>>
>> On Fri, May 14, 2021 at 1:45 AM ritika jain 
>> wrote:
>>
>>> Hi,
>>> I am using Windows shares connector in manifoldcf 2.14 and ElasticSearch
>>> connector as Output connector and Tika and Metadata adjuster as
>>> Transformation connector
>>>
>>> Trying to crawl files from an SMB server; the machine has 64 GB of
>>> memory, and the ManifoldCF start-options file is given 32 GB of memory.
>>>  But many times got different errors while processing documents:-
>>> *1) Access is denied*
>>> *2) ... 23 more*
>>>
>>>
>>> WARN 2021-05-13T12:33:16,318 (Worker thread '6') - Service
>>> interruption reported for job 1599130705168 connection 'Themas_Repo':
>>> Timeout or other service interruption: Interrupted while acquiring credits
>>> WARN 2021-05-13T12:33:17,315 (Worker thread '6') - JCIFS: Possibly
>>> transient exception detected on attempt 1 while getting share security:
>>> Interrupted while acquiring credits
>>> jcifs.smb.SmbException: Interrupted while acquiring credits
>>> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1530)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbSessionImpl.sessionSetupSMB2(SmbSessionImpl.java:549)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbSessionImpl.sessionSetup(SmbSessionImpl.java:483)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:369)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:607)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:609)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:563)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:484)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:460)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:421)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbPipeHandleImpl.ensureTreeConnected(SmbPipeHandleImpl.java:111)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:166)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2500)
>>> [mcf-jcifs-connector.jar:2.14]
>>> at
>>> org.apache.

Re: Interrupted while acquiring credits

2021-05-14 Thread Karl Wright
'JCIFS: Possibly transient exception detected on attempt 1 while getting
share security'

Yes, it is going to retry.

Karl

On Fri, May 14, 2021 at 1:45 AM ritika jain 
wrote:

> Hi,
> I am using Windows shares connector in manifoldcf 2.14 and ElasticSearch
> connector as Output connector and Tika and Metadata adjuster as
> Transformation connector
>
> Trying to crawl files from an SMB server; the machine has 64 GB of
> memory, and the ManifoldCF start-options file is given 32 GB of memory.
>  But many times got different errors while processing documents:-
> *1) Access is denied*
> *2) ... 23 more*
>
>
> WARN 2021-05-13T12:33:16,318 (Worker thread '6') - Service interruption
> reported for job 1599130705168 connection 'Themas_Repo': Timeout or other
> service interruption: Interrupted while acquiring credits
> WARN 2021-05-13T12:33:17,315 (Worker thread '6') - JCIFS: Possibly transient
> exception detected on attempt 1 while getting share security: Interrupted
> while acquiring credits
> jcifs.smb.SmbException: Interrupted while acquiring credits
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1530)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbSessionImpl.sessionSetupSMB2(SmbSessionImpl.java:549)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbSessionImpl.sessionSetup(SmbSessionImpl.java:483)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:369)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:607)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:609)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:563)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:484)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:460)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:421)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbPipeHandleImpl.ensureTreeConnected(SmbPipeHandleImpl.java:111)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:166)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2500)
> [mcf-jcifs-connector.jar:2.14]
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1261)
> [mcf-jcifs-connector.jar:2.14]
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:647)
> [mcf-jcifs-connector.jar:2.14]
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
> Caused by: java.io.InterruptedIOException: Interrupted while acquiring
> credits
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:976)
> ~[?:?]
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[?:?]
> ... 23 more
> Caused by: java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
> ~[?:1.8.0_292]
> at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:582)
> ~[?:1.8.0_292]
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:959)
> ~[?:?]
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[?:?]
> ... 23 more
>  WARN 2021-05-13T12:33:17,314 (Worker thread '2') - JCIFS: Possibly
> transient exception detected on attempt 2 while getting share security:
> Interrupted while acquiring credits
> jcifs.smb.SmbException: Interrupted while acquiring credits
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.
>
> Do we have functionality such that, in case an error like this occurs, it
> skips the particular record and then continues to process
> further instead of 

Re: CONNECTORS-1667 integration to trunk ?

2021-05-12 Thread Karl Wright
Hi Julien,

I was occupied with several work-related escalations and trying to get 2.19
out the door.  I will have time this weekend to review the new connector
but for right now can you hold off?  Thanks!
Karl


On Wed, May 12, 2021 at 10:56 AM  wrote:

> Hi Karl,
>
>
>
> Is that ticket OK for you ?
> https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1667
>
>
>
> I did not have any news
>
>
>
> Regards,
>
> Julien
>
>
>
>


Re: Notification connector error

2021-05-11 Thread Karl Wright
This used to work fine, but I suspect that when SSH was declared unsafe, it
was disabled, and now only TLS will work.

Karl
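A common workaround for the "No appropriate protocol (protocol is disabled or cipher suites are inappropriate)" handshake failure on newer JDKs - an assumption based on the stack trace above, not verified against this exact setup - is to pin JavaMail to a protocol version the JDK still enables, via one extra session property:

```properties
# Existing properties from the report above:
mail.smtp.ssl.trust=smtp.gmail.com
mail.smtp.starttls.enable=true
mail.smtp.auth=true
# Additional property to force a protocol the JDK still enables:
mail.smtp.ssl.protocols=TLSv1.2
```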


On Tue, May 11, 2021 at 12:13 PM  wrote:

> Hello,
>
>
>
> I am trying to use an email notification connector but without success.
> When the connector tries to send an email I keep having the following error:
>
>
>
> Email: Error sending email: Could not convert socket to TLS
>
> javax.mail.MessagingException: Could not convert socket to TLS
>
> at
> com.sun.mail.smtp.SMTPTransport.startTLS(SMTPTransport.java:1918)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> com.sun.mail.smtp.SMTPTransport.protocolConnect(SMTPTransport.java:652)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Service.connect(Service.java:317)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Service.connect(Service.java:176)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Service.connect(Service.java:125)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Transport.send0(Transport.java:194)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Transport.send(Transport.java:124)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> org.apache.manifoldcf.crawler.notifications.email.EmailSession.send(EmailSession.java:112)
> ~[?:?]
>
> at
> org.apache.manifoldcf.crawler.notifications.email.EmailConnector$SendThread.run(EmailConnector.java:963)
> ~[?:?]
>
> Caused by: javax.net.ssl.SSLHandshakeException: No appropriate protocol
> (protocol is disabled or cipher suites are inappropriate)
>
> at
> sun.security.ssl.HandshakeContext.(HandshakeContext.java:170) ~[?:?]
>
> at
> sun.security.ssl.ClientHandshakeContext.(ClientHandshakeContext.java:98)
> ~[?:?]
>
> at
> sun.security.ssl.TransportContext.kickstart(TransportContext.java:221)
> ~[?:?]
>
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:433) ~[?:?]
>
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:411) ~[?:?]
>
> at
> com.sun.mail.util.SocketFetcher.configureSSLSocket(SocketFetcher.java:548)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> com.sun.mail.util.SocketFetcher.startTLS(SocketFetcher.java:485)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> com.sun.mail.smtp.SMTPTransport.startTLS(SMTPTransport.java:1913)
> ~[mail-1.4.5.jar:1.4.5]
>
> ... 8 more
>
>
>
>
>
> The connector is configured with a gmail SMTP, using the configuration
> recommended by the documentation:
>
>
>
> Hostname: smtp.gmail.com
>
> Port: 587
>
>
>
> Configuration properties:
>
> mail.smtp.ssl.trust : smtp.gmail.com
>
> mail.smtp.starttls.enable : true
>
> mail.smtp.auth : true
>
>
>
>
>
> The username and password I use are correct and I also tried with the
> office365 SMTP and I get the same error.
>
>
>
> I am using openjdk version "11.0.11" 2021-04-20. Do you have any idea
> about my issue ?
>
>
>
> Julien
>
>
>


[RESULT] [VOTE] Release Apache ManifoldCF 2.19, RC1

2021-05-10 Thread Karl Wright
Three +1s.  > 72 hours.  Vote passes!

(Note we still need somebody to look at the maven test failures for
alfresco-webscript.)

Karl


On Fri, May 7, 2021 at 7:35 AM  wrote:

> Ant build and tests are OK; we also tested crawling our test corpus with
> the filer and the web connectors, and all is OK.
>
> But the Maven tests fail on the Alfresco webscript tests, and it is
> impossible to skip them...
>
> Still, it is a +1
>
> Julien
>
> -Original Message-
> From: Karl Wright 
> Sent: Thursday, May 6, 2021 12:37
> To: dev 
> Subject: Re: [VOTE] Release Apache ManifoldCF 2.19, RC1
>
> Ran tests, including Alfresco webscript one.  +1 from me.
> Karl
>
>
> On Thu, May 6, 2021 at 5:28 AM Karl Wright  wrote:
>
> > Please vote to release Apache ManifoldCF 2.19, RC1.  The release
> > artifact can be found at:
> > https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.19
> .
> > There is also a release tag at
> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.19-RC1 .
> >
> > This release has a significant update to the Elastic Search connector,
> > to both bring it up to date, and to work around the problem with
> > limited ID length that ES has.
> >
> > This RC also addresses the jetty-introduced test problem present in RC0.
> >
> > Karl
> >
>
>


Re: [VOTE] Release Apache ManifoldCF 2.19, RC1

2021-05-06 Thread Karl Wright
Ran tests, including Alfresco webscript one.  +1 from me.
Karl


On Thu, May 6, 2021 at 5:28 AM Karl Wright  wrote:

> Please vote to release Apache ManifoldCF 2.19, RC1.  The release artifact
> can be found at:
> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.19 .
> There is also a release tag at
> https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.19-RC1 .
>
> This release has a significant update to the Elastic Search connector, to
> both bring it up to date, and to work around the problem with limited ID
> length that ES has.
>
> This RC also addresses the jetty-introduced test problem present in RC0.
>
> Karl
>


[VOTE] Release Apache ManifoldCF 2.19, RC1

2021-05-06 Thread Karl Wright
Please vote to release Apache ManifoldCF 2.19, RC1.  The release artifact
can be found at:
https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.19 .
There is also a release tag at
https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.19-RC1 .

This release has a significant update to the Elastic Search connector, to
both bring it up to date, and to work around the problem with limited ID
length that ES has.

This RC also addresses the jetty-introduced test problem present in RC0.

Karl


Re: [VOTE] Release Apache ManifoldCF 2.19, RC0

2021-05-05 Thread Karl Wright
I have a patch for this and will spin a new RC.

Karl


On Wed, May 5, 2021 at 1:43 PM Karl Wright  wrote:

> After 1 1/2 hours spent downloading, I see the issue:
>
> >>>>>>
> [junit] java.lang.IllegalStateException: No config path set
> [junit] at
> org.eclipse.jetty.security.PropertyUserStore.loadUsers(PropertyUserStore.java:237)
> [junit] at
> org.eclipse.jetty.security.PropertyUserStore.doStart(PropertyUserStore.java:308)
> [junit] at
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
> [junit] at
> org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:169)
> [junit] at
> org.eclipse.jetty.util.component.ContainerLifeCycle.addBean(ContainerLifeCycle.java:384)
> [junit] at
> org.eclipse.jetty.util.component.ContainerLifeCycle.addBean(ContainerLifeCycle.java:313)
> [junit] at
> org.eclipse.jetty.util.component.ContainerLifeCycle.updateBean(ContainerLifeCycle.java:864)
> [junit] at
> org.eclipse.jetty.security.HashLoginService.setUserStore(HashLoginService.java:126)
> <<<<<<
>
> This is likely due to a recent Jetty upgrade - I think it was last release
> cycle.  It sounds like we need to provide a configuration path now for
> Jetty to come up.
>
> Karl
>
>
> On Wed, May 5, 2021 at 9:01 AM Karl Wright  wrote:
>
>> I'm having severe bandwidth issues with my internet connection today.  I
>> will have to wait to try this completely until that clears up.
>> Karl
>>
>>
>> On Wed, May 5, 2021 at 8:25 AM Karl Wright  wrote:
>>
>>> I just run "ant test".  The alfresco test doesn't run, though, unless
>>> you run "ant make-deps" first.  That downloads the test artifacts.
>>>
>>> I'm just running through the alfresco test here and will post when done.
>>>
>>> Karl
>>>
>>>
>>> On Wed, May 5, 2021 at 7:20 AM Piergiorgio Lucidi <
>>> piergior...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have issues executing integration tests.
>>>> It seems that there is a problem with the Alfresco WebScript connector
>>>> that
>>>> is not running the Alfresco instance before starting the test.
>>>> I have the same error on Ant and in Maven.
>>>> Which commands are you executing for testing everything?
>>>> Maybe I'm missing something.
>>>> Cheers,
>>>> PJ
>>>>
>>>> On Thu, Apr 29, 2021 at 01:21 Karl Wright <
>>>> daddy...@gmail.com>
>>>> wrote:
>>>>
>>>> > Please vote to release Apache ManifoldCF 2.19, RC0.  The release
>>>> artifact
>>>> > can be found at:
>>>> >
>>>> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.19
>>>> .
>>>> > There is also a release tag at
>>>> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.19-RC0 .
>>>> >
>>>> > This release has a significant update to the Elastic Search
>>>> connector, to
>>>> > both bring it up to date, and to work around the problem with limited
>>>> ID
>>>> > length that ES has.
>>>> >
>>>> > Karl
>>>> >
>>>>
>>>>
>>>> --
>>>> Piergiorgio
>>>>
>>>


Re: [VOTE] Release Apache ManifoldCF 2.19, RC0

2021-05-05 Thread Karl Wright
After 1 1/2 hours spent downloading, I see the issue:

>>>>>>
[junit] java.lang.IllegalStateException: No config path set
[junit] at
org.eclipse.jetty.security.PropertyUserStore.loadUsers(PropertyUserStore.java:237)
[junit] at
org.eclipse.jetty.security.PropertyUserStore.doStart(PropertyUserStore.java:308)
[junit] at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
[junit] at
org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:169)
[junit] at
org.eclipse.jetty.util.component.ContainerLifeCycle.addBean(ContainerLifeCycle.java:384)
[junit] at
org.eclipse.jetty.util.component.ContainerLifeCycle.addBean(ContainerLifeCycle.java:313)
[junit] at
org.eclipse.jetty.util.component.ContainerLifeCycle.updateBean(ContainerLifeCycle.java:864)
[junit] at
org.eclipse.jetty.security.HashLoginService.setUserStore(HashLoginService.java:126)
<<<<<<

This is likely due to a recent Jetty upgrade - I think it was last release
cycle.  It sounds like we need to provide a configuration path now for
Jetty to come up.

Karl
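The configuration path Karl refers to is the realm properties file that Jetty's `PropertyUserStore` now requires before `HashLoginService` can start. A sketch of the fix (requires the Jetty security jar; the class name, realm name, and path below are assumptions for illustration, not the actual MCF test-harness code):

```java
import org.eclipse.jetty.security.HashLoginService;
import org.eclipse.jetty.server.Server;

// Sketch only: newer Jetty throws "No config path set" unless the realm
// properties file is supplied before the login service starts.
public class JettyRealmFix {
  public static Server configure(int port) {
    Server server = new Server(port);
    HashLoginService loginService = new HashLoginService("ManifoldCFRealm");
    loginService.setConfig("path/to/realm.properties"); // hypothetical path; file holds "user: password,role" lines
    server.addBean(loginService);
    return server;
  }
}
```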


On Wed, May 5, 2021 at 9:01 AM Karl Wright  wrote:

> I'm having severe bandwidth issues with my internet connection today.  I
> will have to wait to try this completely until that clears up.
> Karl
>
>
> On Wed, May 5, 2021 at 8:25 AM Karl Wright  wrote:
>
>> I just run "ant test".  The alfresco test doesn't run, though, unless you
>> run "ant make-deps" first.  That downloads the test artifacts.
>>
>> I'm just running through the alfresco test here and will post when done.
>>
>> Karl
>>
>>
>> On Wed, May 5, 2021 at 7:20 AM Piergiorgio Lucidi 
>> wrote:
>>
>>> Hi,
>>>
>>> I have issues executing integration tests.
>>> It seems that there is a problem with the Alfresco WebScript connector
>>> that
>>> is not running the Alfresco instance before starting the test.
>>> I have the same error on Ant and in Maven.
>>> Which commands are you executing for testing everything?
>>> Maybe I'm missing something.
>>> Cheers,
>>> PJ
>>>
>>> On Thu, Apr 29, 2021 at 01:21 Karl Wright >> >
>>> wrote:
>>>
>>> > Please vote to release Apache ManifoldCF 2.19, RC0.  The release
>>> artifact
>>> > can be found at:
>>> >
>>> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.19
>>> .
>>> > There is also a release tag at
>>> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.19-RC0 .
>>> >
>>> > This release has a significant update to the Elastic Search connector,
>>> to
>>> > both bring it up to date, and to work around the problem with limited
>>> ID
>>> > length that ES has.
>>> >
>>> > Karl
>>> >
>>>
>>>
>>> --
>>> Piergiorgio
>>>
>>


Re: [VOTE] Release Apache ManifoldCF 2.19, RC0

2021-05-05 Thread Karl Wright
I'm having severe bandwidth issues with my internet connection today.  I
will have to wait to try this completely until that clears up.
Karl


On Wed, May 5, 2021 at 8:25 AM Karl Wright  wrote:

> I just run "ant test".  The alfresco test doesn't run, though, unless you
> run "ant make-deps" first.  That downloads the test artifacts.
>
> I'm just running through the alfresco test here and will post when done.
>
> Karl
>
>
> On Wed, May 5, 2021 at 7:20 AM Piergiorgio Lucidi 
> wrote:
>
>> Hi,
>>
>> I have issues executing integration tests.
>> It seems that there is a problem with the Alfresco WebScript connector
>> that
>> is not running the Alfresco instance before starting the test.
>> I have the same error on Ant and in Maven.
>> Which commands are you executing for testing everything?
>> Maybe I'm missing something.
>> Cheers,
>> PJ
>>
>> On Thu, Apr 29, 2021 at 01:21 Karl Wright 
>> wrote:
>>
>> > Please vote to release Apache ManifoldCF 2.19, RC0.  The release
>> artifact
>> > can be found at:
>> >
>> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.19
>> .
>> > There is also a release tag at
>> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.19-RC0 .
>> >
>> > This release has a significant update to the Elastic Search connector,
>> to
>> > both bring it up to date, and to work around the problem with limited ID
>> > length that ES has.
>> >
>> > Karl
>> >
>>
>>
>> --
>> Piergiorgio
>>
>


Re: [VOTE] Release Apache ManifoldCF 2.19, RC0

2021-05-05 Thread Karl Wright
I just run "ant test".  The alfresco test doesn't run, though, unless you
run "ant make-deps" first.  That downloads the test artifacts.

I'm just running through the alfresco test here and will post when done.

Karl


On Wed, May 5, 2021 at 7:20 AM Piergiorgio Lucidi 
wrote:

> Hi,
>
> I have issues executing integration tests.
> It seems that there is a problem with the Alfresco WebScript connector that
> is not running the Alfresco instance before starting the test.
> I have the same error on Ant and in Maven.
> Which commands are you executing for testing everything?
> Maybe I'm missing something.
> Cheers,
> PJ
>
> On Thu, Apr 29, 2021 at 01:21 Karl Wright 
> wrote:
>
> > Please vote to release Apache ManifoldCF 2.19, RC0.  The release artifact
> > can be found at:
> > https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.19
> .
> > There is also a release tag at
> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.19-RC0 .
> >
> > This release has a significant update to the Elastic Search connector, to
> > both bring it up to date, and to work around the problem with limited ID
> > length that ES has.
> >
> > Karl
> >
>
>
> --
> Piergiorgio
>


Re: [VOTE] Release Apache ManifoldCF 2.19, RC0

2021-05-03 Thread Karl Wright
Reminder: Voting underway.  Please evaluate and vote.

Ran tests.  +1 from me.

Karl


On Wed, Apr 28, 2021 at 7:21 PM Karl Wright  wrote:

> Please vote to release Apache ManifoldCF 2.19, RC0.  The release artifact
> can be found at:
> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.19
> .  There is also a release tag at
> https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.19-RC0 .
>
> This release has a significant update to the Elastic Search connector, to
> both bring it up to date, and to work around the problem with limited ID
> length that ES has.
>
> Karl
>
>


[VOTE] Release Apache ManifoldCF 2.19, RC0

2021-04-28 Thread Karl Wright
Please vote to release Apache ManifoldCF 2.19, RC0.  The release artifact
can be found at:
https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.19 .
There is also a release tag at
https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.19-RC0 .

This release has a significant update to the Elastic Search connector, to
both bring it up to date, and to work around the problem with limited ID
length that ES has.

Karl


I've created the 2.19 release branch, and will spin an RC0 this evening, with luck

2021-04-28 Thread Karl Wright
Voting will commence when the candidate is ready.  Please be ready to
evaluate it.

Thanks,
Karl


It's release time again

2021-04-12 Thread Karl Wright
I'd like to build RC0 by the end of the week, so if there are any pressing
issues I don't know about, this would be the time to address them.

The one thing I know that is still outstanding is the elastic search
connector patch that addresses a changed field name.  I have been quite
busy and have to dig that up again, but that is all I know of.

Karl


Re: General questions

2021-04-12 Thread Karl Wright
Hi,

There was a book written, but never published, on ManifoldCF and how to
write connectors.  ManifoldCF itself is meant to be extended in that way.
The PDFs for the book are available for free online, and they are linked
from the ManifoldCF web site.

Karl


On Mon, Apr 12, 2021 at 8:49 AM koch  wrote:

> Hi everyone,
>
> I would like to know, what is planned for manifoldCF in the future?
> How much activity is in the project, or is there already an 'end of
> life' in sight?
>
> Is it compatible with Java 11 or higher?
>
> Has someone tried to use it in an OSGi container like Karaf?
>
> How can I extend ManifoldCF? If I would like to write my own repository
> or output connectors,
> do I have to plug them in at build time, or is it possible to add
> connectors at runtime?
>
> Any help would be much appreciated.
>
> Kind regards,
> Matthias
>
>
>


Re: Manifoldcf Deletion Process

2021-03-30 Thread Karl Wright
Hi Ritika,

There is no deletion process.  Deletion takes place when a job is run in a
mode where deletion is possible (there are some where it is not).  The way
it takes place depends on the kind of repository connector (what model it
declares itself to use).

For the most common kinds of connectors, the job sequence involves scanning
all documents described by the job.  If the document is gone, it is deleted
right away.  If the document just wasn't accessed on the crawl, then at
the end those no-longer-referenced documents are removed.

Karl
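
The two-phase behavior described above can be sketched in miniature.  This is only an illustration of the described flow; the class and method names below are invented, not MCF's actual API:

```java
import java.util.*;

// Rough sketch of the deletion behavior described above, for the common
// connector models: a document the repository reports as gone (e.g. a 404)
// is deleted right away, and any previously indexed document that was never
// referenced during the crawl is swept at the end of the job.
public class DeletionSketch {

  // indexed: documents currently in the output index
  // repo: documents referenced this crawl, mapped to "still exists?"
  public static List<String> runJob(Set<String> indexed, Map<String, Boolean> repo) {
    List<String> deleted = new ArrayList<>();
    Set<String> referenced = new HashSet<>();
    for (Map.Entry<String, Boolean> doc : repo.entrySet()) {
      referenced.add(doc.getKey());
      if (!doc.getValue()) {
        deleted.add(doc.getKey());   // gone from the repository: delete immediately
      }
    }
    // End-of-job sweep: indexed documents never referenced on this crawl
    for (String doc : indexed) {
      if (!referenced.contains(doc)) {
        deleted.add(doc);
      }
    }
    return deleted;
  }
}
```

In this model, if the index holds {a, b, c} and the crawl finds a alive, b returning 404, and never references c, then both b and c end up deleted while a is kept.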


On Tue, Mar 30, 2021 at 9:03 AM ritika jain 
wrote:

> Hi All,
>
> I want to understand the ManifoldCF deletion process, i.e. in which
> cases the deletion process (when checked in Simple History) executes.
> One case, as far as I know, is whenever the seed URL of a
> particular job is changed.
> What are all the cases in which the deletion process runs?
>
> My requirement is to research whether ManifoldCF is capable of handling the
> scenario where a URL exists and has been ingested into the Elastic index
> (say www.abc.com).
>
> The next time the job is run, say the URL www.abc.com no longer exists and
> results in a 404.  Is ManifoldCF capable of handling (by default) this 404
> URL and deleting it from the database and from the ElasticSearch index
> (into which it was already ingested)?
>
> Any help will be thankful.
> Thanks
> Ritika
>


Re: How to override carry down data

2021-03-21 Thread Karl Wright
It gets called during JobManager.finishDocuments(), here:

  @Override
  public DocumentDescription[] finishDocuments(Long jobID, String[]
legalLinkTypes, String[] parentIdentifierHashes, int hopcountMethod)
throws ManifoldCFException
...
  // A certain set of carrydown records are going to be deleted by
the ensuing restoreRecords command.  Calculate that set of records!
  rval =
calculateAffectedRestoreCarrydownChildren(jobID,parentIdentifierHashes);
  carryDown.restoreRecords(jobID,parentIdentifierHashes);
  database.performCommit();
...

Is your connector calling the IProcessActivity methods meant to signal that
document processing has finished?  If not, that is the problem!

Karl



On Sun, Mar 21, 2021 at 9:14 PM Karl Wright  wrote:

> Ah, so it appears that the way this works is subtle and clever.
>
> Values are added or updated in one phase of activity.  At this time the
> records are flagged with either "new" or "existing".  At a later time,
> values still in the "base state" are removed, and the "new" and "existing"
> states are mapped back to the base state.
>
> This is the Carrydown class method that supposedly does the deletion and
> rejiggering of the states:
>
>   /** Return all records belonging to the specified parent documents to
> the base state,
>   * and delete the old (eliminated) child records.
>   */
>   public void restoreRecords(Long jobID, String[] parentDocumentIDHashes)
> throws ManifoldCFException
>
> ... and it appears that it does the right thing:
>
> // Delete
> StringBuilder sb = new StringBuilder("WHERE ");
> ArrayList newList = new ArrayList();
>
> sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
>   new UnitaryClause(jobIDField,jobID),
>   new MultiClause(parentIDHashField,list)})).append(" AND ");
>
> sb.append(newField).append("=?");
> newList.add(statusToString(ISNEW_BASE));
> performDelete(sb.toString(),newList,null);
>
> // Restore new values
> sb = new StringBuilder("WHERE ");
> newList.clear();
>
> sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
>   new UnitaryClause(jobIDField,jobID),
>   new MultiClause(parentIDHashField,list)})).append(" AND ");
>
> sb.append(newField).append(" IN (?,?)");
> newList.add(statusToString(ISNEW_EXISTING));
> newList.add(statusToString(ISNEW_NEW));
>
> HashMap map = new HashMap();
> map.put(newField,statusToString(ISNEW_BASE));
> map.put(processIDField,null);
> performUpdate(map,sb.toString(),newList,null);
>
> noteModifications(0,list.size(),0);
>
> So the question becomes: does it get called appropriately?
>
> Karl
>
>
>
> On Sun, Mar 21, 2021 at 8:45 PM Karl Wright  wrote:
>
>> I've tried to refresh my memory by looking at the carrydown code, which
>> is quite old at this point.  But one thing is very clear: that code never
>> removes carrydown data values unless the child or parent document goes
>> away, and wasn't intended to.
>>
>> It's not at all trivial to do but the code here could be modified to set
>> the carrydown values to exactly what is specified in the reference for the
>> given parent.  However, I worry that changing this behavior will break
>> something.  Carrydown has a built-in assumption that if the reference is
>> added multiple times with different data during a crawl, eventually the
>> data will stabilize and no more downstream processing will be necessary.
>> Carrydown changes that are incautious will result in jobs that never
>> complete.
>>
>> I think it is worth looking at changing the behavior such that no
>> accumulation of values takes place, though.  It's not an easy change I
>> fear.  I'll look into how to make it happen.
>>
>> Karl
>>
>>
>>
>> On Sun, Mar 21, 2021 at 1:18 PM  wrote:
>>
>>>  First crawl
>>> -
>>>
>>> In the processDocument method the following code is triggered on the
>>> parentIdendifier:
>>>
>>> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
>>> new String[] { "content" }, new String[][] { { "someContent" } });
>>>
>>> Then the childIdentifier is processed and the following code is
>>> triggered in the processDocument method:
>>>
>>> final String[] contentArray =
>>> activities.retrieveParentData(childIdentifier, "content");
>>>

Re: How to override carry down data

2021-03-21 Thread Karl Wright
Ah, so it appears that the way this works is subtle and clever.

Values are added or updated in one phase of activity.  At this time the
records are flagged with either "new" or "existing".  At a later time,
values still in the "base state" are removed, and the "new" and "existing"
states are mapped back to the base state.

This is the Carrydown class method that supposedly does the deletion and
rejiggering of the states:

  /** Return all records belonging to the specified parent documents to the
base state,
  * and delete the old (eliminated) child records.
  */
  public void restoreRecords(Long jobID, String[] parentDocumentIDHashes)
throws ManifoldCFException

... and it appears that it does the right thing:

// Delete
StringBuilder sb = new StringBuilder("WHERE ");
ArrayList newList = new ArrayList();

sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
  new UnitaryClause(jobIDField,jobID),
  new MultiClause(parentIDHashField,list)})).append(" AND ");

sb.append(newField).append("=?");
newList.add(statusToString(ISNEW_BASE));
performDelete(sb.toString(),newList,null);

// Restore new values
sb = new StringBuilder("WHERE ");
newList.clear();

sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
  new UnitaryClause(jobIDField,jobID),
  new MultiClause(parentIDHashField,list)})).append(" AND ");

sb.append(newField).append(" IN (?,?)");
newList.add(statusToString(ISNEW_EXISTING));
newList.add(statusToString(ISNEW_NEW));

HashMap map = new HashMap();
map.put(newField,statusToString(ISNEW_BASE));
map.put(processIDField,null);
performUpdate(map,sb.toString(),newList,null);

noteModifications(0,list.size(),0);

So the question becomes: does it get called appropriately?

Karl
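
The state sequence described above — records flagged "new" or "existing" during processing, then stale base-state records deleted and the rest returned to base — can be sketched in miniature.  Names here are invented for illustration; the real implementation works against database rows, not an in-memory map:

```java
import java.util.*;

// Miniature sketch of the carrydown record lifecycle described above.
// During processing, each value a parent supplies is flagged NEW (first
// seen) or EXISTING (seen before); restoreRecords() then deletes records
// still in the BASE state (values the parent no longer supplies) and
// returns the rest to BASE.
public class CarrydownSketch {
  enum State { BASE, EXISTING, NEW }

  // key: "parent|dataName|value" -> record state
  final Map<String, State> records = new HashMap<>();

  void addRecord(String parent, String dataName, String value) {
    String id = parent + "|" + dataName + "|" + value;
    // Re-adding an existing record flags it EXISTING; a fresh one is NEW
    records.put(id, records.containsKey(id) ? State.EXISTING : State.NEW);
  }

  void restoreRecords(String parent) {
    String prefix = parent + "|";
    // Delete records still in the base state for this parent
    records.entrySet().removeIf(e ->
        e.getKey().startsWith(prefix) && e.getValue() == State.BASE);
    // Return NEW/EXISTING records to the base state
    records.replaceAll((k, v) -> k.startsWith(prefix) ? State.BASE : v);
  }
}
```

In this model, a second crawl that adds only "newContent" and then restores leaves exactly one record for the parent; if restoreRecords() were never invoked, the old value would remain alongside the new one, which matches the accumulation symptom reported in this thread.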



On Sun, Mar 21, 2021 at 8:45 PM Karl Wright  wrote:

> I've tried to refresh my memory by looking at the carrydown code, which is
> quite old at this point.  But one thing is very clear: that code never
> removes carrydown data values unless the child or parent document goes
> away, and wasn't intended to.
>
> It's not at all trivial to do but the code here could be modified to set
> the carrydown values to exactly what is specified in the reference for the
> given parent.  However, I worry that changing this behavior will break
> something.  Carrydown has a built-in assumption that if the reference is
> added multiple times with different data during a crawl, eventually the
> data will stabilize and no more downstream processing will be necessary.
> Carrydown changes that are incautious will result in jobs that never
> complete.
>
> I think it is worth looking at changing the behavior such that no
> accumulation of values takes place, though.  It's not an easy change I
> fear.  I'll look into how to make it happen.
>
> Karl
>
>
>
> On Sun, Mar 21, 2021 at 1:18 PM  wrote:
>
>>  First crawl
>> -
>>
>> In the processDocument method the following code is triggered on the
>> parentIdendifier:
>>
>> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
>> new String[] { "content" }, new String[][] { { "someContent" } });
>>
>> Then the childIdentifier is processed and the following code is triggered
>> in the processDocument method:
>>
>> final String[] contentArray =
>> activities.retrieveParentData(childIdentifier, "content");
>>
>> At this point, the childIdentifier correctly retrieve a contentArray
>> containing 1 value which is "someContent"
>>
>>  Second crawl
>> -
>>
>> In the processDocument method the following code is triggered on the
>> parentIdendifier:
>>
>> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
>> new String[] { "content" }, new String[][] { { "newContent" } });
>>
>> Then the childIdentifier is processed and the following code is triggered
>> in the processDocument method:
>>
>> final String[] contentArray =
>> activities.retrieveParentData(childIdentifier, "content");
>>
>> At this point, the childIdentifier retrieves a contentArray containing 2
>> values, the old one "someContent", and the new one "newContent"
>>
>> I can guarantee that the parentIdentifier between the two crawls is the
>> same and that on the second crawl, only the "newContent" is added, I
>> debugged the code to confirm everything

Re: How to override carry down data

2021-03-21 Thread Karl Wright
I've tried to refresh my memory by looking at the carrydown code, which is
quite old at this point.  But one thing is very clear: that code never
removes carrydown data values unless the child or parent document goes
away, and wasn't intended to.

It's not at all trivial to do but the code here could be modified to set
the carrydown values to exactly what is specified in the reference for the
given parent.  However, I worry that changing this behavior will break
something.  Carrydown has a built-in assumption that if the reference is
added multiple times with different data during a crawl, eventually the
data will stabilize and no more downstream processing will be necessary.
Carrydown changes that are incautious will result in jobs that never
complete.

I think it is worth looking at changing the behavior such that no
accumulation of values takes place, though.  It's not an easy change I
fear.  I'll look into how to make it happen.

Karl



On Sun, Mar 21, 2021 at 1:18 PM  wrote:

>  First crawl
> -
>
> In the processDocument method the following code is triggered on the
> parentIdendifier:
>
> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
> new String[] { "content" }, new String[][] { { "someContent" } });
>
> Then the childIdentifier is processed and the following code is triggered
> in the processDocument method:
>
> final String[] contentArray =
> activities.retrieveParentData(childIdentifier, "content");
>
> At this point, the childIdentifier correctly retrieve a contentArray
> containing 1 value which is "someContent"
>
>  Second crawl
> -
>
> In the processDocument method the following code is triggered on the
> parentIdendifier:
>
> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
> new String[] { "content" }, new String[][] { { "newContent" } });
>
> Then the childIdentifier is processed and the following code is triggered
> in the processDocument method:
>
> final String[] contentArray =
> activities.retrieveParentData(childIdentifier, "content");
>
> At this point, the childIdentifier retrieves a contentArray containing 2
> values, the old one "someContent", and the new one "newContent"
>
> I can guarantee that the parentIdentifier between the two crawls is the
> same and that on the second crawl, only the "newContent" is added, I
> debugged the code to confirm everything.
>
>
>
> Julien
>
>
> -Original Message-
> From: Karl Wright 
> Sent: Sunday, March 21, 2021 16:05
> To: dev 
> Subject: Re: How to override carry down data
>
> Can you give me a code example?
> The carry-down information is set by the parent, as you say.  The specific
> information is keyed to the parent so when the child is added to the queue,
> all old carrydown information from the same parent is deleted at that time,
> and until that happens the carrydown information is preserved for every
> child.  As you say, it can be augmented by other parents that refer to the
> same child, but it is never *replaced* by carrydown info from a different
> parent, just augmented.
>
> If it didn't work this way, MCF would have horrendous order dependencies
> in what documents got processed first.  As it is, when the carrydown
> information changes because another parent is discovered, the children are
> queued for processing to achieve stable results.
>
> Karl
>
>
> On Sun, Mar 21, 2021 at 10:45 AM  wrote:
>
> > Hi Karl,
> >
> >
> >
> > I am using carry-down data in a repository connector but I have
> > found that I am unable to update/override a value that has already
> > been set.
> > Indeed, even though I am using the same key and the same parent
> > identifier, the values are stacked. So, when I retrieve carry-down
> > data through the key I get more and more values in the array instead
> > of only the updated one.
> > It seems I misunderstood the documentation; I believed that the
> > carry-down data values are stacked only if there are several parent
> > identifiers for the same key.
> > What can I do to maintain only one carry-down data value for a given
> > key and a given parent identifier?
> >
> >
> >
> > Regards,
> >
> > Julien
> >
> >
> >
> >
>
>


Re: How to override carry down data

2021-03-21 Thread Karl Wright
Can you give me a code example?
The carry-down information is set by the parent, as you say.  The specific
information is keyed to the parent so when the child is added to the queue,
all old carrydown information from the same parent is deleted at that time,
and until that happens the carrydown information is preserved for every
child.  As you say, it can be augmented by other parents that refer to the
same child, but it is never *replaced* by carrydown info from a different
parent, just augmented.

If it didn't work this way, MCF would have horrendous order dependencies in
what documents got processed first.  As it is, when the carrydown
information changes because another parent is discovered, the children are
queued for processing to achieve stable results.

Karl


On Sun, Mar 21, 2021 at 10:45 AM  wrote:

> Hi Karl,
>
>
>
> I am using carry-down data in a repository connector but I have found that
> I am unable to update/override a value that has already been set.
> Indeed, even though I am using the same key and the same parent identifier,
> the values are stacked. So, when I retrieve carry-down data through the key
> I get more and more values in the array instead of only the updated one.
> It seems I misunderstood the documentation; I believed that the
> carry-down data values are stacked only if there are several parent
> identifiers for the same key.
> What can I do to maintain only one carry-down data value for a given key
> and a given parent identifier?
>
>
>
> Regards,
>
> Julien
>
>
>
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-20 Thread Karl Wright
I have now updated (I think) everything that this patch actually has, save
for one deprecated field substitution (the "types" field is now the "_doc"
field).  I would like to know more about this.  Does the "types" field no
longer work?  Should we send both, in order to be sure that the connector
works with most versions of ElasticSearch?  Please help clarify so that I
can finish this off.

The changes are committed to trunk; I would be very appreciative if Shirai
Takashi / 白井隆 reviewed them there.  Thanks!
Karl


On Sat, Mar 20, 2021 at 4:32 AM Karl Wright  wrote:

> Hi,
>
> Please see https://issues.apache.org/jira/browse/CONNECTORS-1666 .
>
> I did not commit the patches as given because I felt that the fix was a
> relatively narrow one and it could be implemented with no user
> involvement.  Adding control for the user was therefore beyond the scope of
> the repair.
>
> There are more changes in these patches than just the ID length issue.  I
> am working to add this functionality as well but without anything I would
> consider to be unneeded.
> Karl
>
>
> On Fri, Mar 19, 2021 at 3:48 AM Karl Wright  wrote:
>
>> Thanks for the information.  I'll see what I can do.
>> Karl
>>
>>
>> On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 <
>> shi...@nintendo.co.jp> wrote:
>>
>>> Hi, Karl.
>>>
>>> Karl Wright wrote:
>>> >Hi - I'm still waiting for this patch to be attached to a ticket.  That
>>> is
>>> >the only way I believe we're allowed to accept it legally.
>>>
>>> Do you ask me to send the patch to the JIRA ticket?
>>> I can't access the JIRA because of our firewall.
>>> Sorry.
>>> What can I do without JIRA?
>>>
>>> 
>>> Nintendo, Co., Ltd.
>>> Product Technology Dept.
>>> Takashi SHIRAI
>>> PHONE: +81-75-662-9600
>>> mailto:shi...@nintendo.co.jp
>>>
>>


[jira] [Commented] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305410#comment-17305410
 ] 

Karl Wright commented on CONNECTORS-1666:
-

r1887848 updates Japanese translations included with this ticket.


> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch, 
> apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch, 
> apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.
> (Patches submitted on behalf of Shirai Takashi)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305408#comment-17305408
 ] 

Karl Wright commented on CONNECTORS-1666:
-

r1887847 adds support for ingestattachment and for document URI attribute.


> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch, 
> apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch, 
> apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.
> (Patches submitted on behalf of Shirai Takashi)





Re: Another Elasticsearch patch to allow the long URI

2021-03-20 Thread Karl Wright
Hi,

Please see https://issues.apache.org/jira/browse/CONNECTORS-1666 .

I did not commit the patches as given because I felt that the fix was a
relatively narrow one and it could be implemented with no user
involvement.  Adding control for the user was therefore beyond the scope of
the repair.

There are more changes in these patches than just the ID length issue.  I
am working to add this functionality as well but without anything I would
consider to be unneeded.
Karl


On Fri, Mar 19, 2021 at 3:48 AM Karl Wright  wrote:

> Thanks for the information.  I'll see what I can do.
> Karl
>
>
> On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 
> wrote:
>
>> Hi, Karl.
>>
>> Karl Wright wrote:
>> >Hi - I'm still waiting for this patch to be attached to a ticket.  That
>> is
>> >the only way I believe we're allowed to accept it legally.
>>
>> Do you ask me to send the patch to the JIRA ticket?
>> I can't access the JIRA because of our firewall.
>> Sorry.
>> What can I do without JIRA?
>>
>> 
>> Nintendo, Co., Ltd.
>> Product Technology Dept.
>> Takashi SHIRAI
>> PHONE: +81-75-662-9600
>> mailto:shi...@nintendo.co.jp
>>
>


[jira] [Commented] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305365#comment-17305365
 ] 

Karl Wright commented on CONNECTORS-1666:
-

r1887840 submits my take on what is necessary to correct the actual problem.  
Users do not need to know or care much about the format of the ID ES is using.  
Nor is it necessary to use hashing beyond what MCF itself provides and uses 
internally for its database keys.

Note that the patches provided also include some updates to the ES API.  These 
should be discussed and perhaps implemented as well.
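
For context, Elasticsearch limits the `_id` field to 512 bytes, which is why overly long document URIs must be shortened somehow.  A minimal sketch of one length-capping strategy (illustrative only, not necessarily the exact scheme committed to trunk) might be:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of an ID-capping strategy: Elasticsearch rejects _id values
// longer than 512 bytes, so over-long document URIs are replaced with a
// fixed-length hash while short ones pass through unchanged.
public class EsIdSketch {
  static final int MAX_ID_BYTES = 512;

  static String toEsId(String uri) {
    byte[] utf8 = uri.getBytes(StandardCharsets.UTF_8);
    if (utf8.length <= MAX_ID_BYTES) {
      return uri;  // short enough: use the URI verbatim
    }
    try {
      byte[] digest = MessageDigest.getInstance("SHA-256").digest(utf8);
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b));  // hex-encode the digest
      }
      return sb.toString();  // 64 hex characters, well under the limit
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);  // SHA-256 is always available
    }
  }
}
```

A short URI passes through unchanged, so typical IDs stay human-readable; only over-long ones are replaced by a 64-character digest.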


> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch, 
> apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch, 
> apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.
> (Patches submitted on behalf of Shirai Takashi)





[jira] [Updated] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1666:

Description: 
The size of the ElasticSearch ID field is severely limited.  We therefore need 
to use a strategy to hash the ID when it gets too long so that ES doesn't fail 
on such documents.

(Patches submitted on behalf of Shirai Takashi)


  was:
The size of the ElasticSearch ID field is severely limited.  We therefore need 
to use a strategy to hash the ID when it gets too long so that ES doesn't fail 
on such documents.



> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch, 
> apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch, 
> apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.
> (Patches submitted on behalf of Shirai Takashi)





[jira] [Updated] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1666:

Attachment: apache-manifoldcf-elastic-id-2.patch

> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch, 
> apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch, 
> apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.





[jira] [Updated] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1666:

Attachment: apache-manifoldcf-elastic-id.patch

> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch, 
> apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch, 
> apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.





[jira] [Updated] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1666:

Attachment: apache-manifoldcf-2.18-elastic-id.patch

> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch, 
> apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch, 
> apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.





[jira] [Updated] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1666:

Attachment: (was: apache-manifoldcf-2.18-elastic-id.patch)

> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch, 
> apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch, 
> apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1666:

Attachment: apache-manifoldcf-2.18-elastic-id.patch

> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch, 
> apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305344#comment-17305344
 ] 

Karl Wright commented on CONNECTORS-1666:
-

{quote}
Hi, there.

I've found another problem in the Elasticsearch connector.
The Elasticsearch output connector uses the URI string as the document ID.
Elasticsearch allows IDs of no more than 512 bytes, so if the URI is too
long, indexing fails with an HTTP 400 error.

The attached patch offers two solutions.
The first is URI decoding.
If the URI includes multibyte characters, the ID ends up doubly
URL-encoded.
Ex) U+3000 -> %E3%80%80 -> %25E3%2580%2580
This enlarges the ID unnecessarily, so I added an option to decode the
URI before it is encoded as the ID.

Even so, the result may still be longer than 512 bytes.
The other solution is hashing. The newly added options are:
Raw) uses the URI string as is.
Hash) always hashes (SHA-1) the URI string.
Hash if long) hashes the URI only if its length exceeds 512 bytes.
The last option is provided for backward compatibility.

Both solutions introduce a new problem: if the URI is decoded or hashed,
the original URI is no longer kept in each document.
I therefore added two new fields:
URI field name) keeps the original URI string as is.
Decoded URI field name) keeps the decoded URI string.
By default these fields are left empty.


I sent the patch for Ingest-Attachment the other day,
so this mail attaches two patches.
apache-manifoldcf-2.18-elastic-id.patch.gz:
 the patch for 2.18, including the patch from the other day.
apache-manifoldcf-elastic-id.patch.gz:
 the patch for the source already patched the other day.

By the way, I tried to document the above, but found no suitable
document in the ManifoldCF package; the Elasticsearch documentation
seems to have been written for an older specification.
Where should I document these new options?
{quote}


> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
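The double-encoding blow-up quoted above can be reproduced in a few lines of plain Java (the class name is illustrative; this is not connector code):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Reproduces the double URL-encoding growth: U+3000 -> %E3%80%80 -> %25E3%2580%2580
public class DoubleEncodeDemo {
    public static void main(String[] args) {
        String s = "\u3000"; // ideographic space (3 bytes in UTF-8)
        String once = URLEncoder.encode(s, StandardCharsets.UTF_8);
        String twice = URLEncoder.encode(once, StandardCharsets.UTF_8);
        System.out.println(once);  // %E3%80%80 (9 characters)
        System.out.println(twice); // %25E3%2580%2580 (15 characters)
        // Decoding first, as the patch's "URI decoding" option does, undoes the growth:
        String decoded = URLDecoder.decode(twice, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(once)); // true
    }
}
```

Each extra encoding pass turns every `%` into `%25`, so an already-encoded multibyte URI grows by two bytes per escape, which is why decoding before building the ID keeps it shorter.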


[jira] [Updated] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1666:

Attachment: apache-manifoldcf-elastic-id.patch.gz

> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1666:

Attachment: apache-manifoldcf-2.18-elastic-id.patch.gz

> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-2.18-elastic-id.patch.gz, 
> apache-manifoldcf-elastic-id-2.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1666:

Attachment: apache-manifoldcf-elastic-id-2.patch.gz

> ElasticSearch connector cannot use full URLs for IDs
> 
>
> Key: CONNECTORS-1666
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.17
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Attachments: apache-manifoldcf-elastic-id-2.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore 
> need to use a strategy to hash the ID when it gets too long so that ES 
> doesn't fail on such documents.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

2021-03-20 Thread Karl Wright (Jira)
Karl Wright created CONNECTORS-1666:
---

 Summary: ElasticSearch connector cannot use full URLs for IDs
 Key: CONNECTORS-1666
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
 Project: ManifoldCF
  Issue Type: Bug
  Components: Elastic Search connector
Affects Versions: ManifoldCF 2.17
Reporter: Karl Wright
Assignee: Karl Wright


The size of the ElasticSearch ID field is severely limited.  We therefore need 
to use a strategy to hash the ID when it gets too long so that ES doesn't fail 
on such documents.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Another Elasticsearch patch to allow the long URI

2021-03-19 Thread Karl Wright
Thanks for the information.  I'll see what I can do.
Karl


On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, Karl.
>
> Karl Wright wrote:
> >Hi - I'm still waiting for this patch to be attached to a ticket.  That is
> >the only way I believe we're allowed to accept it legally.
>
> Are you asking me to attach the patch to the JIRA ticket?
> I can't access JIRA because of our firewall.
> Sorry.
> What can I do without JIRA?
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-18 Thread Karl Wright
Hi - I'm still waiting for this patch to be attached to a ticket.  That is
the only way I believe we're allowed to accept it legally.

Karl


On Thu, Mar 4, 2021 at 7:16 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, Karl.
>
> Karl Wrightさんは書きました:
> >I agree it is unlikely that the JDK will lose support for SHA-1 because it
> >is used commonly, as is MD5.  So please feel free to use it.
>
> I know.
> I think that SHA-1 is better on the whole.
> I don't mind if apache-manifoldcf-elastic-id-2.patch.gz is discarded.
>
> SHA-256 is certainly safer against the risk of collision.
> But with SHA-1 that risk can be ignored unless a collision is produced
> intentionally; it only needs to be considered when ManifoldCF is used
> on worldwide data.
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
>


Re: Inactive MCF agent

2021-03-16 Thread Karl Wright
If anything running in the agents process runs out of memory, it's fatal
and corrupting and MUST shut the process down.  So if your connector throws
an OOM, the agents process will log something to the console and exit.  It
should exit with a very specific exit code.  So all you have to do to make
MCF agents process shut itself down is NOT catch any OutOfMemory error
exceptions, or if you do, rethrow them.

Karl
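A minimal sketch of that rule, with hypothetical method names (this is not actual ManifoldCF code): catch whatever recoverable errors you like, but let OutOfMemoryError propagate.

```java
// Sketch: a connector must let OutOfMemoryError escape so the agents
// process can log it and exit with its specific exit code.
public class OomHandlingSketch {
    static void safeProcess(Runnable work) {
        try {
            work.run();
        } catch (OutOfMemoryError e) {
            // Fatal and potentially corrupting: never swallow, always rethrow
            throw e;
        } catch (RuntimeException e) {
            // Recoverable problems may be handled or logged here
            System.err.println("Recoverable: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        safeProcess(() -> { throw new IllegalStateException("transient"); }); // handled
        boolean propagated = false;
        try {
            safeProcess(() -> { throw new OutOfMemoryError("simulated"); });
        } catch (OutOfMemoryError e) {
            propagated = true;
        }
        System.out.println(propagated); // true: the OOM escaped safeProcess
    }
}
```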


On Tue, Mar 16, 2021 at 6:06 AM  wrote:

> Hi Karl,
>
> Took me some time to reproduce, but I was able to dump the process after
> it happened again, and it appears that an OOM is the cause of the problem.
> After investigation, it seems that this OOM was triggered by a
> transformation connector I had developed. I increased the JVM heap size a
> little and the problem never happened again. For info, I had limited the
> number of connections of that connector to only 1, to be sure this was not
> a potential cause of the issue.
> My question is : To make sure that the agent process crashes instead of
> staying up in a similar case (OOM in my scenario), is there something that
> can be done at the connector level or at a more global level in MCF ?
>
> Regards,
> Julien
>
>
> -Message d'origine-
> De : Karl Wright 
> Envoyé : mardi 2 mars 2021 19:17
> À : dev 
> Objet : Re: Inactive MCF agent
>
> The MCF Agents process shouldn't get hung up under normal operation.  If
> it encounters a problem that may call its continued activity into question,
> it shuts itself down.
>
> There are two situations where the process could theoretically hang.
>
> The first is when you are using file-based synch, and you forcibly kill
> another ManifoldCF process so that it doesn't clean up locks after itself.
> But if you are using Zookeeper, it should not ever fail to clean up after
> a process is killed.
>
> The second situation is when certain database conditions arise, and MCF
> decides it needs to reset all its worker threads.  When it does this, it
> blocks all worker threads from proceeding until it reaches a point where
> they are all quiescent, and then it resets all of them at the same time.
> When it is waiting for all threads to shut down in this way, if that never
> completely happens, MCF will be paused forever.
>
> What I'd like to do in that case is get a thread dump of the agents
> process.  That will tell us what the problem is.
>
> Karl
>
>
> On Tue, Mar 2, 2021 at 12:53 PM  wrote:
>
> > Hi Karl,
> >
> > I recently faced a weird case where a job in a "running" state was not
> > doing anything for several hours. The MCF agent process was up but
> > neither the Simple History nor the logs showed any activity. Since we
> > could not wait more than 12 hours, we decided to restart the agent,
> > and the job "went back on rails" and continued its work normally.
> > In order to avoid as much as possible the need for such a manual
> > intervention, I would have two questions:
> > - Is there a way to "test" the agent process ? Like a "process ping"
> > which can detect if the process is doing or ready to do something ?
> > And if not, is there a way to implement such thing easily ? The idea
> > being to make the detection and restart automatically rather than
> > manually.
> > - Knowing that we have activated the debug log level, would you have
> > recommendation on what to look at to find a potential cause of such an
> > issue ?
> >
> > Regards,
> > Julien Massiera
> >
> >
>
>


Re: Add activity records to the web connector

2021-03-09 Thread Karl Wright
Yes, please go ahead
Karl


On Tue, Mar 9, 2021 at 11:06 AM  wrote:

> Hi Karl,
>
>
>
> I would like to add more activity records in the web connector to keep
> track
> in the simple history of filtered URLs that would match exclude filters.
> May
> I create a ticket for this and propose a patch ?
>
>
>
> Regards,
>
> Julien Massiera
>
>
>
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-04 Thread Karl Wright
I agree it is unlikely that the JDK will lose support for SHA-1 because it
is used commonly, as is MD5.  So please feel free to use it.

Karl


On Wed, Mar 3, 2021 at 7:54 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, Horn.
>
> Jörn Franke wrote:
> >Makes sense
>
> I don't think that it's easy.
>
>
> >>> Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when
> it will be removed from JDK.
>
> I also know SHA-1 is dangerous.
> Someone can generate a string that hashes to the same SHA-1 value in
> order to impersonate another, so SHA-1 should not be used for
> certificates, and a future JDK may stop allowing SHA-1 in certificates.
> But the JDK will never stop supporting the SHA-1 algorithm itself.
>
> If SHA-1 were removed from the JDK, ManifoldCF could not be built at
> all, because SHA-1 is used elsewhere too.
> Some connectors already use SHA-1 for ID values, so previously saved
> records would become inaccessible.
> I can use SHA-256 in the Elasticsearch connector, but how should the
> other uses of SHA-1 be managed?
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
>


Re: Inactive MCF agent

2021-03-02 Thread Karl Wright
The MCF Agents process shouldn't get hung up under normal operation.  If it
encounters a problem that may call its continued activity into question, it
shuts itself down.

There are two situations where the process could theoretically hang.

The first is when you are using file-based synch, and you forcibly kill
another ManifoldCF process so that it doesn't clean up locks after itself.
But if you are using Zookeeper, it should not ever fail to clean up after a
process is killed.

The second situation is when certain database conditions arise, and MCF
decides it needs to reset all its worker threads.  When it does this, it
blocks all worker threads from proceeding until it reaches a point where
they are all quiescent, and then it resets all of them at the same time.
When it is waiting for all threads to shut down in this way, if that never
completely happens, MCF will be paused forever.

What I'd like to do in that case is get a thread dump of the agents
process.  That will tell us what the problem is.

Karl


On Tue, Mar 2, 2021 at 12:53 PM  wrote:

> Hi Karl,
>
> I recently faced a weird case where a job in a "running" state was not
> doing
> anything for several hours. The MCF agent process was up but neither the
> Simple History nor the logs showed any activity. Since we could not wait
> more than 12 hours, we decided to restart the agent, and the job "went back
> on rails" and continued its work normally.
> In order to avoid as much as possible the need for such a manual
> intervention, I would have two questions:
> - Is there a way to "test" the agent process ? Like a "process ping" which
> can detect if the process is doing or ready to do something ? And if not,
> is
> there a way to implement such thing easily ? The idea being to make the
> detection and restart automatically rather than manually.
> - Knowing that we have activated the debug log level, would you have
> recommendation on what to look at to find a potential cause of such an
> issue
> ?
>
> Regards,
> Julien Massiera
>
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-02 Thread Karl Wright
Hi - this is very helpful.  I would like you to officially create a ticket
in Jira: https://issues.apache.org/jira , project "CONNECTORS", and attach
these patches.  Backwards compatibility means that we very likely have to
use the hash approach, and not use the decoding approach.

Thanks,
Karl


On Mon, Mar 1, 2021 at 10:10 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, there.
>
> I've found another problem in the Elasticsearch connector.
> The Elasticsearch output connector uses the URI string as the document ID.
> Elasticsearch allows IDs of no more than 512 bytes, so if the URI is too
> long, indexing fails with an HTTP 400 error.
>
> The attached patch offers two solutions.
> The first is URI decoding.
> If the URI includes multibyte characters, the ID ends up doubly
> URL-encoded.
> Ex) U+3000 -> %E3%80%80 -> %25E3%2580%2580
> This enlarges the ID unnecessarily, so I added an option to decode the
> URI before it is encoded as the ID.
>
> Even so, the result may still be longer than 512 bytes.
> The other solution is hashing. The newly added options are:
> Raw) uses the URI string as is.
> Hash) always hashes (SHA-1) the URI string.
> Hash if long) hashes the URI only if its length exceeds 512 bytes.
> The last option is provided for backward compatibility.
>
> Both solutions introduce a new problem: if the URI is decoded or hashed,
> the original URI is no longer kept in each document.
> I therefore added two new fields:
> URI field name) keeps the original URI string as is.
> Decoded URI field name) keeps the decoded URI string.
> By default these fields are left empty.
>
>
> I sent the patch for Ingest-Attachment the other day,
> so this mail attaches two patches.
> apache-manifoldcf-2.18-elastic-id.patch.gz:
>  the patch for 2.18, including the patch from the other day.
> apache-manifoldcf-elastic-id.patch.gz:
>  the patch for the source already patched the other day.
>
> By the way, I tried to document the above, but found no suitable
> document in the ManifoldCF package; the Elasticsearch documentation
> seems to have been written for an older specification.
> Where should I document these new options?
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
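A rough sketch of the "Hash if long" option described above (illustrative only: it assumes a hex-encoded SHA-1 digest and is not the patch's actual code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the "Hash if long" ID strategy: keep short URIs as-is,
// replace over-long ones with a hex SHA-1 digest that fits the 512-byte limit.
public class EsIdSketch {
    static final int MAX_ID_BYTES = 512; // Elasticsearch _id size limit

    static String toId(String uri) throws NoSuchAlgorithmException {
        byte[] raw = uri.getBytes(StandardCharsets.UTF_8);
        if (raw.length <= MAX_ID_BYTES)
            return uri; // short enough: use the URI itself
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(raw);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest)
            hex.append(String.format("%02x", b));
        return hex.toString(); // 40 hex characters, always well under 512 bytes
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toId("http://example.com/doc")); // unchanged
        System.out.println(toId("http://example.com/" + "x".repeat(600)).length()); // 40
    }
}
```

Since hashing discards the original URI from the ID, storing the URI in a separate field (the patch's "URI field name" option) is what keeps it retrievable from the indexed document.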


Re: Congratulations to the new Lucene PMC Chair, Michael Sokolov!

2021-02-20 Thread Karl Wright
Congratulations!


On Sat, Feb 20, 2021 at 4:17 PM Namgyu Kim  wrote:

> Congratulations, Mike! :D
>
> On Thu, Feb 18, 2021 at 6:32 AM Anshum Gupta 
> wrote:
>
>> Every year, the Lucene PMC rotates the Lucene PMC chair and Apache Vice
>> President position.
>>
>> This year we nominated and elected Michael Sokolov as the Chair, a
>> decision that the board approved in its February 2021 meeting.
>>
>> Congratulations, Mike!
>>
>> --
>> Anshum Gupta
>>
>


Re: Congratulations to the new Apache Solr PMC Chair, Jan Høydahl!

2021-02-20 Thread Karl Wright
Congratulations!
Karl

On Sat, Feb 20, 2021 at 6:28 AM Uwe Schindler  wrote:

> Congrats Jan!
>
>
>
> Uwe
>
>
>
> -
>
> Uwe Schindler
>
> Achterdiek 19, D-28357 Bremen
>
> https://www.thetaphi.de
>
> eMail: u...@thetaphi.de
>
>
>
> *From:* Anshum Gupta 
> *Sent:* Thursday, February 18, 2021 7:55 PM
> *To:* Lucene Dev ; solr-user@lucene.apache.org
> *Subject:* Congratulations to the new Apache Solr PMC Chair, Jan Høydahl!
>
>
>
> Hi everyone,
>
>
>
> I’d like to inform everyone that the newly formed Apache Solr PMC
> nominated and elected Jan Høydahl for the position of the Solr PMC Chair
> and Vice President. This decision was approved by the board in its February
> 2021 meeting.
>
>
>
> Congratulations Jan!
>
>
>
> --
>
> Anshum Gupta
>


Re: Multiprocess file installation of manifold

2021-02-17 Thread Karl Wright
File synchronization is still supported but is deprecated.  We recommend
ZooKeeper synchronization unless you have a very good reason not to.

Karl


On Wed, Feb 17, 2021 at 12:26 PM Ananth Peddinti  wrote:

> Hello Team ,
>
>
> I would like to know if someone has already done a multi-process model
> installation of ManifoldCF on a Linux machine. I would like to know the
> process in detail. We are running into issues with the quick-start model.
>
>
>
> Regards
>
> Ananth
> --
> 
> -SECURITY/CONFIDENTIALITY WARNING-
>
> This message and any attachments are intended solely for the individual or
> entity to which they are addressed. This communication may contain
> information that is privileged, confidential, or exempt from disclosure
> under applicable law (e.g., personal health information, research data,
> financial information). Because this e-mail has been sent without
> encryption, individuals other than the intended recipient may be able to
> view the information, forward it to others or tamper with the information
> without the knowledge or consent of the sender. If you are not the intended
> recipient, or the employee or person responsible for delivering the message
> to the intended recipient, any dissemination, distribution or copying of
> the communication is strictly prohibited. If you received the communication
> in error, please notify the sender immediately by replying to this message
> and deleting the message and any accompanying files from your system. If,
> due to the security risks, you do not wish to receive further
> communications via e-mail, please reply to this message and inform the
> sender that you do not wish to receive further e-mail from the sender.
> (LCP301)
> 
>


Re: Job Content Length issue

2021-02-17 Thread Karl Wright
The internal Tika is not memory bounded; some transformations stream, but
others put everything into memory.

You can try using the external Tika, with a Tika instance you run
separately, and that would likely help.  But you may need to give it lots
of memory too.

Karl


On Wed, Feb 17, 2021 at 3:50 AM ritika jain 
wrote:

> Hi Karl,
>
> I am using Elasticsearch as the output connector and yes, an internal
> Tika extractor; I am not using a Solr output connection.
>
> Also, the Elasticsearch server is hosted on a different server with a
> large memory allocation.
>
> On Tue, Feb 16, 2021 at 7:29 PM Karl Wright  wrote:
>
>> Hi, do you mean content limiter length of 100?
>>
>> I assume you are using the internal Tika transformer?  Are you combining
>> this with a Solr output connection that is not using the extract handler?
>>
>> By "manifold crashes" I assume you actually mean it runs out of memory.
>> The "long running query" concern is a red herring because that does not
>> cause a crash under any circumstances.
>>
>> This is quite likely if I described your setup above, because if you do
>> not use the Solr extract handler, the entire content of every document must
>> be loaded into memory.  That is why we require you to fill in a Solr field
>> on those kind of output connections that limits the number of bytes.
>>
>> Karl
>>
>>
>> On Tue, Feb 16, 2021 at 8:45 AM ritika jain 
>> wrote:
>>
>>>
>>>
>>> Hi users
>>>
>>>
>>> I am using the ManifoldCF 2.14 file share connector to crawl files from
>>> an SMB server which has millions, even billions, of documents to process
>>> and crawl.
>>>
>>> Total system memory is 64 GB, of which 32 GB is allocated to ManifoldCF
>>> in its start-options file.
>>>
>>> We have some larger files to crawl, around 30 MB or more.
>>>
>>> When the size in the Content Limiter tab is 10, that is 1 MB, the job
>>> works fine, but when it is changed to 1000, that is 10 MB, ManifoldCF
>>> crashes, with logs showing only long-running queries.
>>>
>>> How can we tune the job specification to process large documents as
>>> well?
>>>
>>> Do I need to increase or decrease the number of connections or the
>>> worker thread count?
>>>
>>> Can anybody help me crawl larger files too, at least up to 10 MB?
>>>
>>> Thanks
>>>
>>> Ritika
>>>
>>


Re: Job Content Length issue

2021-02-16 Thread Karl Wright
Hi, do you mean content limiter length of 100?

I assume you are using the internal Tika transformer?  Are you combining
this with a Solr output connection that is not using the extract handler?

By "manifold crashes" I assume you actually mean it runs out of memory.
The "long running query" concern is a red herring because that does not
cause a crash under any circumstances.

This is quite likely if I described your setup above, because if you do not
use the Solr extract handler, the entire content of every document must be
loaded into memory.  That is why we require you to fill in a Solr field on
those kind of output connections that limits the number of bytes.

Karl


On Tue, Feb 16, 2021 at 8:45 AM ritika jain 
wrote:

>
>
> Hi users
>
>
> I am using the ManifoldCF 2.14 file share connector to crawl files from an
> SMB server which has millions, even billions, of documents to process and
> crawl.
>
> Total system memory is 64 GB, of which 32 GB is allocated to ManifoldCF in
> its start-options file.
>
> We have some larger files to crawl, around 30 MB or more.
>
> When the size in the Content Limiter tab is 10, that is 1 MB, the job
> works fine, but when it is changed to 1000, that is 10 MB, ManifoldCF
> crashes, with logs showing only long-running queries.
>
> How can we tune the job specification to process large documents as well?
>
> Do I need to increase or decrease the number of connections or the worker
> thread count?
>
> Can anybody help me crawl larger files too, at least up to 10 MB?
>
> Thanks
>
> Ritika
>


Re: content length tab

2021-02-15 Thread Karl Wright
This parameter is in bytes.

Karl
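So a megabyte-sized cap must be entered as its byte count. A trivial illustration (the helper name is made up, not ManifoldCF API):

```java
// The content length field is in bytes, so a megabyte cap must be
// entered as its byte count.
public class ContentLengthBytes {
    static long megabytesToFieldValue(long mb) {
        return mb * 1_000_000L; // decimal megabytes: 1 MB = 1000000 bytes
    }

    public static void main(String[] args) {
        System.out.println(megabytesToFieldValue(1));  // 1000000  -> 1 MB cap
        System.out.println(megabytesToFieldValue(10)); // 10000000 -> 10 MB cap
    }
}
```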


On Mon, Feb 15, 2021 at 9:03 AM ritika jain 
wrote:

> Hi Users,
>
> Can anybody tell me if this can be filled as bytes or kilobytes here.
>
> The "Content Length tab looks like this:
>
>
> [image: Windows Share Job, Content Length tab]
>
> Values are to be filled as 100 , will this be 100 bytes or 100 kilobytes
> or in MB.
>
> Thanks
> Ritika
>


[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML

2021-02-12 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283929#comment-17283929
 ] 

Karl Wright commented on CONNECTORS-1656:
-

The patch is fine.  I was not notified it was attached, for some reason.


> HTML extractor produces invalid XML
> ---
>
> Key: CONNECTORS-1656
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1656
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: HTML extractor
>Affects Versions: ManifoldCF 2.17
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF next
>
> Attachments: patch-CONNECTORS-1656
>
>
> The HTML extractor connector produces a valid HTML doc (when the 'Strip HTML' 
> option is disabled) but invalid XML (some tags, like img, do not have a 
> closing tag), and in some cases this is problematic. For example, when Tika 
> runs downstream, it processes the document as XML, and most of the time a 
> parse exception is raised.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
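The failure mode is easy to reproduce with the JDK's XML parser: HTML void elements such as img are fine for browsers but are not well-formed XML (class and method names below are illustrative, not connector code):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Shows why HTML with unclosed void elements (img, br, ...) breaks a
// downstream XML parser such as the one Tika may apply.
public class VoidTagDemo {
    static boolean parsesAsXml(String doc) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(doc)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsXml("<p><img src=\"x\"></p>"));  // false: img never closed
        System.out.println(parsesAsXml("<p><img src=\"x\"/></p>")); // true: self-closed
    }
}
```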


Re: GSOC - Mavenisation of MCF ?

2021-02-10 Thread Karl Wright
Hi Furkan,

There is already a mongoDB connector:

12/12/2020  03:51 AM  mongodb



On Wed, Feb 10, 2021 at 4:10 AM Furkan KAMACI 
wrote:

> Hi,
>
> It could be nice to have a connector for a GSoC task i.e. a MongoDB
> repository connector.
>
> Kind Regards,
> Furkan KAMACI
>
> On Tue, Feb 9, 2021 at 10:48 PM Cedric Ulmer 
> wrote:
>
> > Hi Karl and Piergiorgio,
> >
> > Ok got it, was a bad idea!
> > Wrt connectors, not sure either, doing a good connector (incl.
> > documentation) is rather time consuming, and 2 months flies very fast.
> The
> > best a student could do would probably be to look at a source API and
> > devise how to best use it to create a proper connector.
> >
> > Regards,
> >
> > Cedric
> > CEO
> > France Labs – Makers of Datafari Enterprise Search
> > Vainqueur du trophée du Jury à IMAgineDAY 2021
> >
> >
> > -Message d'origine-
> > De : Piergiorgio Lucidi 
> > Envoyé : lundi 8 février 2021 13:56
> > À : dev 
> > Objet : Re: GSOC - Mavenisation of MCF ?
> >
> > I agree with Karl,
> > for GSoC purposes it's better to propose something related to an
> > independent and well defined connector.
> > PJ
> >
> > Il giorno lun 8 feb 2021 alle ore 12:57 Karl Wright 
> > ha
> > scritto:
> >
> > > There are already poms throughout.  However, release distribution
> > > structure cannot be built with Maven at this time because the
> > > directory structure of the final release artifact is complex, and
> > > because we want individual external connector developers to be able to
> > "add into" this.
> > >
> > > In other words, it is far too big a project for a GSOC exercise.
> > >
> > > Karl
> > >
> > >
> > > On Mon, Feb 8, 2021 at 6:49 AM Cedric Ulmer
> > > 
> > > wrote:
> > >
> > > > Hi Karl,
> > > >
> > > > With regards to the Google Summer of Code, we have a suggestion:
> > > > would it be a good idea to propose as a subject the mavenisation of
> > > > MCF ? If yes,
> > > is
> > > > it you who needs to apply since we as France Labs are not
> > > > responsible for the Apache MCF project ? We can of course act as
> > mentors of the student.
> > > > Also, if Manifoldians have other ideas for the GSOC, I think it's
> > > > time to talk about it because the deadline for organizations is Feb.
> > 19th.
> > > >
> > > > Cedric
> > > > CEO
> > > > France Labs - Makers of Datafari Enterprise Search<
> > > > https://www.datafari.com/en> Winners of the trophy of the Jury at
> > > > IMAgineDAY< https://www.ima-dt.org/ima/event/detail.html/idConf/938>
> > > > 2021
> > > >
> > > >
> > > >
> > >
> >
> >
> > --
> > Piergiorgio
> >
>


Re: GSOC - Mavenisation of MCF ?

2021-02-08 Thread Karl Wright
There are already poms throughout.  However, release distribution structure
cannot be built with Maven at this time because the directory structure of
the final release artifact is complex, and because we want individual
external connector developers to be able to "add into" this.

In other words, it is far too big a project for a GSOC exercise.

Karl


On Mon, Feb 8, 2021 at 6:49 AM Cedric Ulmer 
wrote:

> Hi Karl,
>
> With regards to the Google Summer of Code, we have a suggestion: would it
> be a good idea to propose as a subject the mavenisation of MCF ? If yes, is
> it you who needs to apply since we as France Labs are not responsible for
> the Apache MCF project ? We can of course act as mentors of the student.
> Also, if Manifoldians have other ideas for the GSOC, I think it's time to
> talk about it because the deadline for organizations is Feb. 19th.
>
> Cedric
> CEO
> France Labs - Makers of Datafari Enterprise Search<
> https://www.datafari.com/en>
> Winners of the trophy of the Jury at IMAgineDAY<
> https://www.ima-dt.org/ima/event/detail.html/idConf/938> 2021
>
>
>


Re: JIRA Authority connector - Remove potential domain in username

2021-01-29 Thread Karl Wright
What you need is a user mapping.
See:

http://manifoldcf.apache.org/release/release-2.18/en_US/end-user-documentation.html#mappers



On Fri, Jan 29, 2021 at 7:16 AM  wrote:

> Hi,
>
>
>
> In my use cases, as I often combine several authorities with the Active
> Directory authority, I request the MCF authorities servlet with a username
> formatted like 'username@domain'. The problem with the JIRA instance that I
> crawl is that users do not have the @domain part, so the authority does not
> find the user.
>
> So I wonder if we can add an option in the JIRA authority configuration in
> order to "clean" the username input parameter and remove the @domain part?
>
>
>
> Regards,
> Julien
>
>
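
A user-mapping connection placed in front of the authority can perform exactly
this kind of domain stripping. As a rough illustration of the regex
transformation involved (the class and method names below are hypothetical,
not MCF API; only the regex logic is the point):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the transformation a regexp user mapper performs:
// match "username@domain" and keep only the part before the '@'.
public class UserDomainStripper {

    // Capture everything before the first '@'; require at least one character.
    private static final Pattern USER_AT_DOMAIN = Pattern.compile("^([^@]+)@.*$");

    public static String mapUser(String userId) {
        Matcher m = USER_AT_DOMAIN.matcher(userId);
        // Users without a domain pass through unchanged.
        return m.matches() ? m.group(1) : userId;
    }

    public static void main(String[] args) {
        System.out.println(mapUser("jdoe@EXAMPLE.COM")); // jdoe
        System.out.println(mapUser("jdoe"));             // jdoe
    }
}
```

In the MCF UI, the equivalent would presumably be a regular-expression user
mapping with a match expression like `([^@]+)@.*` and replacement `$1`,
prerequisite to the JIRA authority, so the authority only ever sees the bare
username.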


[jira] [Assigned] (CONNECTORS-1662) JIRA connector - NullPointerException after getCharSet method

2021-01-29 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1662:
---

Assignee: Karl Wright

> JIRA connector - NullPointerException after getCharSet method
> -------------------------------------------------------------
>
> Key: CONNECTORS-1662
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1662
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JIRA connector
>Affects Versions: ManifoldCF 2.17
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Major
> Attachments: patch-CONNECTORS-1662
>
>
> Sometimes the following exception is triggered on some documents during crawl:
> {code:java}
> Error tossed: charset
> java.lang.NullPointerException: charset
>     at java.io.InputStreamReader.<init>(InputStreamReader.java:115) ~[?:?]
>     at org.apache.manifoldcf.crawler.connectors.jira.JiraSession.convertToString(JiraSession.java:183) ~[?:?]
>     at org.apache.manifoldcf.crawler.connectors.jira.JiraSession.getRest(JiraSession.java:237) ~[?:?]
>     at org.apache.manifoldcf.crawler.connectors.jira.JiraSession.getIssue(JiraSession.java:317) ~[?:?]
>     at org.apache.manifoldcf.crawler.connectors.jira.JiraRepositoryConnector$GetIssueThread.run(JiraRepositoryConnector.java:1409) ~[?:?]
> {code}
> After investigation, it appears that the getCharSet method of the JiraSession
> class may return a null charset, either when the charset name is null (there
> is no check) or when an UnsupportedCharsetException occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
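
The attached patch is not reproduced here, but the defensive pattern the bug
report implies is straightforward: never hand a null or unknown charset name
to InputStreamReader; fall back to a default such as UTF-8 instead. A minimal
sketch, with illustrative names that are not the actual JiraSession code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only -- not the actual JiraSession code.
// InputStreamReader throws NullPointerException when given a null charset,
// and Charset.forName throws UnsupportedCharsetException for unknown names;
// both cases fall back to UTF-8 here.
public class CharsetFallback {

    public static Charset safeCharset(String name) {
        if (name == null) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and
            // UnsupportedCharsetException, which extend IllegalArgumentException.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(safeCharset(null));          // UTF-8
        System.out.println(safeCharset("no-such-cs"));  // UTF-8
        System.out.println(safeCharset("ISO-8859-1"));  // ISO-8859-1
    }
}
```

A Reader built with `new InputStreamReader(stream, safeCharset(header))` then
cannot hit the NullPointerException path regardless of what the HTTP response
declares.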

