Re: (CONNECTORS-1740) Solr 9 output connector

2023-06-01 Thread Karl Wright
Okay, it's as I suspected: the Zookeeper update didn't change any
functionality but just broke stuff.

The first thing I'd do is alert the Solr team to the problem.  They should
for now roll back their dependency so that an earlier Zookeeper is used.
The next step would be to work with the Zookeeper team to use ManifoldCF
unit tests to allow them to fix the problem, as you say. Rather than
assuming this is the same problem we see in previous Zookeeper tickets (it
probably is but we can't be sure of that), I'd create a new one describing
very carefully how to reproduce this using a ManifoldCF branch checkout.
Be prepared to interact with the Zookeeper team at some length about the
problem and how to reproduce it.

My sense is that Zookeeper's original authors are long gone and you may not
get very far here.  And I have very limited time availability these days.
If you are blocked in this in some way let me know and I will do my best to
jump in and unblock you.

I'd also fix the Solr 9 branch (after you make a copy of it for the
Zookeeper folks) so that a working version of Zookeeper is downloaded and
we can then merge that branch.  Please let me know when that is done and
I'll integrate that work.

Thanks,
Karl


On Thu, Jun 1, 2023 at 5:56 AM Guylaine BASSETTE <
guylaine.basse...@francelabs.com> wrote:

> Hi Karl,
>
> Following up on your discussion with Julien: I did some further testing
> and I’m commenting here because I cannot comment on the existing ticket
> (
> https://issues.apache.org/jira/browse/CONNECTORS-1740?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17643980#comment-17643980
> ).
> We tested the Solr 9 output connector using version 3.5.6 of the ZK
> library, targeting Solr 9.2, and it worked, so for now this output
> connector can be considered valid.
>
> Still, in the long run, I think this ZK bug will become an issue for
> MCF. Since, thanks to your testing, the problem can be reproduced,
> wouldn’t it be worthwhile for you to comment on their ZK issue, letting
> them know that the issue is still present in ZK 3.5.7, that it does not
> only happen in Docker mode, and that it can be reproduced every time
> using the MCF testing framework?
>
> --
>
> Best Regards,
> Guylaine
>
> France Labs – Your knowledge, now
> Datafari Enterprise Search – Découvrez la version 5 / Discover our version
> 5
> www.datafari.com 
>
>


[RESULT][VOTE] Release ManifoldCF 2.25, RC0

2023-06-01 Thread Karl Wright
Three binding +1's, >72 hrs.  Vote passes!
Karl


On Thu, Jun 1, 2023 at 6:27 PM Karl Wright  wrote:

> +1 from me as well.
>
>
> On Thu, Jun 1, 2023 at 12:07 PM Karl Wright  wrote:
>
>> Hi -
>> This is a vote thread on a specific release artifact.  CONNECTORS-1746 is
>> indeed included in this release.
>>
>> Incorporating a JSON-based generic connector hasn't happened yet because
>> the contribution needed to be complete, and a release was requested before
>> that happened.
>>
>> Karl
>>
>>
>>
>>
>> On Thu, Jun 1, 2023 at 7:37 AM Guylaine BASSETTE <
>> guylaine.basse...@francelabs.com> wrote:
>>
>>> Hi all,
>>>
>>> Do you think it would make sense to also include the following
>>> modifications in 2.25? They don’t require many changes, but
>>> they would benefit everyone:
>>>
>>>   * the fix on CSV connector I proposed in mail "Control over number of
>>> processed documents per thread" on 2023/05/22
>>>   * (CONNECTORS-1740) Solr 9 output connector (mail today)
>>>
>>> As for my 2 other suggestions, I leave it up to you to decide.
>>>
>>>   * Json based generic authority connector (mail on 2023/05/23)
>>>   * Reading a document in Transfo Connector: Utility Classes (mail today)
>>>
>>> As a side note, did you also plan to include the optimisation of
>>> PostgreSQL usage proposed by Mingchun Zhao?
>>> https://issues.apache.org/jira/browse/CONNECTORS-1746
>>>
>>>
>>> BTW, many thanks, Mingchun, for your 2 proposals on PostgreSQL!
>>>
>>>
>>> Best regards,
>>> Guylaine
>>>
>>> France Labs – Your knowledge, now
>>> Datafari Enterprise Search – Découvrez la version 5 / Discover our
>>> version 5
>>> www.datafari.com 
>>>
>>>
>>> On 30/05/2023 at 11:13, Mingchun Zhao wrote:
>>> > +1 (non-binding)
>>> >
>>> > The following tests passed.
>>> > - Unit tests
>>> > - Integration tests with PostgreSQL
>>> > - Load tests with PostgreSQL
>>> > - New feature: the ability to disable hopcount tracking entirely, for
>>> > better performance of the web connector
>>> >
>>> > Regards,
>>> > Mingchun
>>> >
>>> > On Tue, May 30, 2023 at 6:08, Karl Wright wrote:
>>> >> Please vote on whether to release ManifoldCF 2.25, RC0.
>>> >>
>>> >> This release contains one new feature: the ability to disable hopcount
>>> >> tracking entirely, for better performance of the web connector.  The
>>> >> attempt to update the Solr connector to release 9.x of Solr did NOT
>>> >> make it in because that version of SolrJ depends on a broken version
>>> >> of zookeeper, our thread coordination library.
>>> >>
>>> >> A release artifact can be found here:
>>> >> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.25
>>> >>
>>> >> A release tag can also be found at
>>> >> https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.25-RC0  .
>>> >>
>>> >> Karl
>>
>>


Re: [VOTE] Release ManifoldCF 2.25, RC0

2023-06-01 Thread Karl Wright
+1 from me as well.


On Thu, Jun 1, 2023 at 12:07 PM Karl Wright  wrote:

> Hi -
> This is a vote thread on a specific release artifact.  CONNECTORS-1746 is
> indeed included in this release.
>
> Incorporating a JSON-based generic connector hasn't happened yet because
> the contribution needed to be complete, and a release was requested before
> that happened.
>
> Karl
>
>
>
>
> On Thu, Jun 1, 2023 at 7:37 AM Guylaine BASSETTE <
> guylaine.basse...@francelabs.com> wrote:
>
>> Hi all,
>>
>> Do you think it would make sense to also include the following
>> modifications in 2.25? They don’t require many changes, but
>> they would benefit everyone:
>>
>>   * the fix on CSV connector I proposed in mail "Control over number of
>> processed documents per thread" on 2023/05/22
>>   * (CONNECTORS-1740) Solr 9 output connector (mail today)
>>
>> As for my 2 other suggestions, I leave it up to you to decide.
>>
>>   * Json based generic authority connector (mail on 2023/05/23)
>>   * Reading a document in Transfo Connector: Utility Classes (mail today)
>>
>> As a side note, did you also plan to include the optimisation of
>> PostgreSQL usage proposed by Mingchun Zhao?
>> https://issues.apache.org/jira/browse/CONNECTORS-1746
>>
>>
>> BTW, many thanks, Mingchun, for your 2 proposals on PostgreSQL!
>>
>>
>> Best regards,
>> Guylaine
>>
>> France Labs – Your knowledge, now
>> Datafari Enterprise Search – Découvrez la version 5 / Discover our
>> version 5
>> www.datafari.com 
>>
>>
>> On 30/05/2023 at 11:13, Mingchun Zhao wrote:
>> > +1 (non-binding)
>> >
>> > The following tests passed.
>> > - Unit tests
>> > - Integration tests with PostgreSQL
>> > - Load tests with PostgreSQL
>> > - New feature: the ability to disable hopcount tracking entirely, for
>> > better performance of the web connector
>> >
>> > Regards,
>> > Mingchun
>> >
>> > On Tue, May 30, 2023 at 6:08, Karl Wright wrote:
>> >> Please vote on whether to release ManifoldCF 2.25, RC0.
>> >>
>> >> This release contains one new feature: the ability to disable hopcount
>> >> tracking entirely, for better performance of the web connector.  The
>> >> attempt to update the Solr connector to release 9.x of Solr did NOT
>> >> make it in because that version of SolrJ depends on a broken version
>> >> of zookeeper, our thread coordination library.
>> >>
>> >> A release artifact can be found here:
>> >> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.25
>> >>
>> >> A release tag can also be found at
>> >> https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.25-RC0  .
>> >>
>> >> Karl
>
>


Re: [VOTE] Release ManifoldCF 2.25, RC0

2023-06-01 Thread Karl Wright
Hi -
This is a vote thread on a specific release artifact.  CONNECTORS-1746 is
indeed included in this release.

Incorporating a JSON-based generic connector hasn't happened yet because
the contribution needed to be complete, and a release was requested before
that happened.

Karl




On Thu, Jun 1, 2023 at 7:37 AM Guylaine BASSETTE <
guylaine.basse...@francelabs.com> wrote:

> Hi all,
>
> Do you think it would make sense to also include the following
> modifications in 2.25? They don’t require many changes, but
> they would benefit everyone:
>
>   * the fix on CSV connector I proposed in mail "Control over number of
> processed documents per thread" on 2023/05/22
>   * (CONNECTORS-1740) Solr 9 output connector (mail today)
>
> As for my 2 other suggestions, I leave it up to you to decide.
>
>   * Json based generic authority connector (mail on 2023/05/23)
>   * Reading a document in Transfo Connector: Utility Classes (mail today)
>
> As a side note, did you also plan to include the optimisation of
> PostgreSQL usage proposed by Mingchun Zhao?
> https://issues.apache.org/jira/browse/CONNECTORS-1746
>
>
> BTW, many thanks, Mingchun, for your 2 proposals on PostgreSQL!
>
>
> Best regards,
> Guylaine
>
> France Labs – Your knowledge, now
> Datafari Enterprise Search – Découvrez la version 5 / Discover our version
> 5
> www.datafari.com 
>
>
> On 30/05/2023 at 11:13, Mingchun Zhao wrote:
> > +1 (non-binding)
> >
> > The following tests passed.
> > - Unit tests
> > - Integration tests with PostgreSQL
> > - Load tests with PostgreSQL
> > - New feature: the ability to disable hopcount tracking entirely, for
> > better performance of the web connector
> >
> > Regards,
> > Mingchun
> >
> > On Tue, May 30, 2023 at 6:08, Karl Wright wrote:
> >> Please vote on whether to release ManifoldCF 2.25, RC0.
> >>
> >> This release contains one new feature: the ability to disable hopcount
> >> tracking entirely, for better performance of the web connector.  The
> >> attempt to update the Solr connector to release 9.x of Solr did NOT
> >> make it in because that version of SolrJ depends on a broken version
> >> of zookeeper, our thread coordination library.
> >>
> >> A release artifact can be found here:
> >> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.25
> >>
> >> A release tag can also be found at
> >> https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.25-RC0  .
> >>
> >> Karl


[jira] [Resolved] (CONNECTORS-1746) Adding conditions to execute PostgreSQL's ANALYZE command to avoid crawling become extremely slow.

2023-06-01 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1746.
-
Fix Version/s: ManifoldCF 2.25
   Resolution: Fixed

> Adding conditions to execute PostgreSQL's ANALYZE command to avoid crawling 
> become extremely slow.
> --
>
> Key: CONNECTORS-1746
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1746
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Web connector
> Environment: Using ManifoldCF 2.24 with PostgreSQL 12.14 as the 
> database. 
>Reporter: Mingchun Zhao
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.25
>
> Attachments: DBInterfacePostgreSQL.java.patch
>
>
> Sometimes, the crawling does not process any documents for a while and there 
> is nothing logged about long-running queries. The performance can be restored 
> by firing the 'ANALYZE' command manually. It seems that a bad query plan 
> caused this performance problem.
> Therefore, in addition to the current configuration parameter 
> 'org.apache.manifoldcf.db.postgres.analyze.', it is considered 
> necessary to execute 'ANALYZE' in the following situations as well:
> 1. When the number of records in the table exceeds the number required for 
> creating an execution plan after the job starts.
> 2. When the crawling performance slows down, for example, if the processing 
> rate of documents drops below a specified threshold.
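
As a rough illustration of the two conditions proposed in the ticket, the decision of when to run 'ANALYZE' could be sketched in Java as below. This is not the attached DBInterfacePostgreSQL.java.patch: the class name AnalyzeTrigger, its parameters, and the cooldown between rate-triggered runs are all assumptions made for this sketch.

```java
/**
 * Hypothetical sketch: decides when an extra ANALYZE should be issued.
 * All names and thresholds are assumptions, not ManifoldCF code.
 */
final class AnalyzeTrigger {
  private final long rowThreshold;        // assumed table size at which a fresh plan matters
  private final double minDocsPerSecond;  // assumed rate below which crawling counts as slow
  private final long cooldownMillis;      // assumed minimum gap between rate-triggered runs

  private boolean thresholdHandled = false;
  private long lastAnalyzeMillis = Long.MIN_VALUE;

  AnalyzeTrigger(long rowThreshold, double minDocsPerSecond, long cooldownMillis) {
    this.rowThreshold = rowThreshold;
    this.minDocsPerSecond = minDocsPerSecond;
    this.cooldownMillis = cooldownMillis;
  }

  /** Call when a job starts so that condition 1 can fire again for this job. */
  void onJobStart() {
    thresholdHandled = false;
  }

  /** Returns true when the caller should issue ANALYZE for the table now. */
  boolean shouldAnalyze(long nowMillis, long rowCount, double docsPerSecond) {
    // Condition 1: the table has grown past the row threshold since the job started.
    boolean sizeTrigger = !thresholdHandled && rowCount > rowThreshold;

    // Condition 2: the processing rate dropped below the threshold,
    // and the previous ANALYZE is old enough that repeating it is reasonable.
    boolean cooledDown = lastAnalyzeMillis == Long.MIN_VALUE
        || nowMillis - lastAnalyzeMillis >= cooldownMillis;
    boolean slowTrigger = cooledDown && docsPerSecond < minDocsPerSecond;

    if (sizeTrigger) {
      thresholdHandled = true;
    }
    if (sizeTrigger || slowTrigger) {
      lastAnalyzeMillis = nowMillis;
      return true;
    }
    return false;
  }
}
```

The cooldown is there only so that a persistently slow crawl does not run 'ANALYZE' on every check; whether the actual patch needs something similar is not stated in the ticket.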



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Release ManifoldCF 2.25, RC0

2023-06-01 Thread Guylaine BASSETTE

Hi all,

Do you think it would make sense to also include the following 
modifications in 2.25? They don’t require many changes, but 
they would benefit everyone:


 * the fix on CSV connector I proposed in mail "Control over number of
   processed documents per thread" on 2023/05/22
 * (CONNECTORS-1740) Solr 9 output connector (mail today)

As for my 2 other suggestions, I leave it up to you to decide.

 * Json based generic authority connector (mail on 2023/05/23)
 * Reading a document in Transfo Connector: Utility Classes (mail today)

As a side note, did you also plan to include the optimisation of 
PostgreSQL usage proposed by Mingchun Zhao? 
https://issues.apache.org/jira/browse/CONNECTORS-1746


BTW, many thanks, Mingchun, for your 2 proposals on PostgreSQL!


Best regards,
Guylaine

France Labs – Your knowledge, now
Datafari Enterprise Search – Découvrez la version 5 / Discover our version 5
www.datafari.com 


On 30/05/2023 at 11:13, Mingchun Zhao wrote:

+1 (non-binding)

The following tests passed.
- Unit tests
- Integration tests with PostgreSQL
- Load tests with PostgreSQL
- New feature: the ability to disable hopcount tracking entirely, for
better performance of the web connector

Regards,
Mingchun

On Tue, May 30, 2023 at 6:08, Karl Wright wrote:

Please vote on whether to release ManifoldCF 2.25, RC0.

This release contains one new feature: the ability to disable hopcount
tracking entirely, for better performance of the web connector.  The
attempt to update the Solr connector to release 9.x of Solr did NOT make it
in because that version of SolrJ depends on a broken version of zookeeper,
our thread coordination library.

A release artifact can be found here:
https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.25

A release tag can also be found at
https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.25-RC0  .

Karl

(CONNECTORS-1740) Solr 9 output connector

2023-06-01 Thread Guylaine BASSETTE

Hi Karl,

Following up on your discussion with Julien: I did some further testing 
and I’m commenting here because I cannot comment on the existing ticket 
(https://issues.apache.org/jira/browse/CONNECTORS-1740?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17643980#comment-17643980). 
We tested the Solr 9 output connector using version 3.5.6 of the ZK 
library, targeting Solr 9.2, and it worked, so for now this output 
connector can be considered valid.


Still, in the long run, I think this ZK bug will become an issue for 
MCF. Since, thanks to your testing, the problem can be reproduced, 
wouldn’t it be worthwhile for you to comment on their ZK issue, letting 
them know that the issue is still present in ZK 3.5.7, that it does not 
only happen in Docker mode, and that it can be reproduced every time 
using the MCF testing framework?


--

Best Regards,
Guylaine

France Labs – Your knowledge, now
Datafari Enterprise Search – Découvrez la version 5 / Discover our version 5
www.datafari.com 



Reading a document in Transfo Connector: Utility Classes

2023-06-01 Thread Guylaine BASSETTE

Hello,

I would like to contribute some utility classes, whose purpose is 
described below.


When you need to read through a document in a Transformation Connector, 
you have to store its stream, because once the stream has been read it 
cannot be read again by the Output Connector for Solr indexing.


I have created utility classes that store the content of a document so 
it can be read in a Transformation Connector, because each connector 
currently has its own way of doing this.
A build method automatically chooses the most suitable way to store the 
data, based on the document size passed to the method: memory storage or 
temporary file storage. The maximum size for memory storage is a constant 
set to 65536 bytes.


Here's an example:

DestinationStorage will be in memory or in a temporary file 
(File.createTempFile(prefix, "tmp")).
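
The example code itself is missing from this archived copy of the message. As a rough illustration only, the behaviour described above could be sketched in Java as follows. Apart from DestinationStorage, the 65536-byte limit, and the File.createTempFile(prefix, "tmp") call, every name here (build, fill, isInMemory, MemoryStorage, FileStorage) is an assumption and not the contributed code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;

/** Hypothetical sketch: keeps a copy of a document's content so it can be read more than once. */
abstract class DestinationStorage implements AutoCloseable {
  /** Maximum size kept in memory, as stated in the message: 65536 bytes. */
  static final long MAX_MEMORY_SIZE = 65536L;

  /** Chooses memory or temporary-file storage from the declared document size. */
  static DestinationStorage build(long documentSize, String prefix) throws IOException {
    if (documentSize >= 0 && documentSize <= MAX_MEMORY_SIZE) {
      return new MemoryStorage((int) documentSize);
    }
    return new FileStorage(prefix);
  }

  abstract OutputStream getOutputStream() throws IOException;

  /** Returns a fresh stream over the stored content on every call. */
  abstract InputStream getInputStream() throws IOException;

  abstract boolean isInMemory();

  @Override
  public abstract void close() throws IOException;

  /** Copies the whole source stream into this storage. */
  void fill(InputStream source) throws IOException {
    try (OutputStream out = getOutputStream()) {
      byte[] buffer = new byte[8192];
      int n;
      while ((n = source.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
    }
  }
}

final class MemoryStorage extends DestinationStorage {
  private final ByteArrayOutputStream buffer;

  MemoryStorage(int expectedSize) {
    buffer = new ByteArrayOutputStream(expectedSize);
  }

  @Override OutputStream getOutputStream() { return buffer; }
  @Override InputStream getInputStream() { return new ByteArrayInputStream(buffer.toByteArray()); }
  @Override boolean isInMemory() { return true; }
  @Override public void close() { /* nothing to release */ }
}

final class FileStorage extends DestinationStorage {
  private final File file;

  FileStorage(String prefix) throws IOException {
    // File.createTempFile requires a prefix of at least three characters.
    file = File.createTempFile(prefix, "tmp");
  }

  @Override OutputStream getOutputStream() throws IOException { return new FileOutputStream(file); }
  @Override InputStream getInputStream() throws IOException { return new FileInputStream(file); }
  @Override boolean isInMemory() { return false; }
  @Override public void close() throws IOException { Files.deleteIfExists(file.toPath()); }
}
```

In this sketch a Transformation Connector would call build with the document's declared size, fill the storage from the incoming stream, read it as needed through getInputStream(), pass a fresh getInputStream() downstream, and close the storage afterwards so that any temporary file is deleted.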


It is useful for many Transformation Connectors and is already in use in 
ours, where it's doing well.



--

Best Regards,
Guylaine

France Labs – Your knowledge, now
Datafari Enterprise Search – Découvrez la version 5 / Discover our version 5
www.datafari.com