Unless I have a massive misunderstanding somewhere...
*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896
st...@remcam.net
http://remcam.net
Skype: svanschalkwyk
http://linkedin.com/in/vanschalkwyk

On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <st...@remcam.net> wrote:

> Hi Karl
> I'm addressing it in the ES Output Connector.
> Not touching the framework :)
> S
>
> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddy...@gmail.com> wrote:
>
>> Let's make sure we're talking about the same thing.
>>
>> Here is the output connector method that receives the ID (as the
>> documentURI parameter):
>>
>> public int addOrReplaceDocumentWithException(String documentURI,
>>     VersionContext pipelineDescription, RepositoryDocument document,
>>     String authorityNameString, IOutputAddActivity activities)
>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>
>> ManifoldCF doesn't say anywhere that this ID is case insensitive. If you
>> make it case insensitive in an output connector, this will potentially
>> break a lot of things, for example incremental indexing (which organizes
>> the last indexed version by document ID).
>>
>> I therefore highly recommend that any "sloppiness" in this parameter be
>> addressed in the Repository Connector that constructs it. If the connector
>> is crawling a repository that believes that URLs are case insensitive, then
>> it should map these IDs to lower case. If not, then it shouldn't.
>>
>> Karl
>>
>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <st...@remcam.net> wrote:
>>
>>> Hi Karl.
>>> The issue is that the ES Output Connector uses the URI to create the _id.
>>> When used with IIS, which allows case variation in the URI, it creates
>>> multiple documents. Clients on Windows IIS are rarely cognizant of that
>>> issue, as IIS is so lax in policing that OTB.
>>> Currently, every case variation in the URI results in a new doc in the
>>> index. This is only in the ES output connector.
>>> I can add an optional checkbox to toggle that behavior, if that would
>>> help?
>>> Regards,
>>> Steph
>>>
>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Thanks for the update.
>>>> Lower-casing the ID would be fine except there are some connectors that
>>>> care about case. The web connector is one such, because it's up to the web
>>>> service to decide if case matters, so the web connector does not view URLs
>>>> with case differences as being the same. Other connectors will likely
>>>> care as well. So I don't think lower-casing the document ID is a smart
>>>> thing to do.
>>>>
>>>> You could add this bit of configuration to the web connector, if that's
>>>> what you are using, or to whatever other connector constructs the ID.
>>>>
>>>> Karl
>>>>
>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>
>>>>> Thanks Karl.
>>>>>
>>>>> I'll look into that.
>>>>>
>>>>> Another note:
>>>>> Regarding the ES connector - I have made three additions to it and
>>>>> should probably diff them for inclusion after approval:
>>>>> 1. Lowercased the _id (the doc URI).
>>>>> 2. Removed dual "/", e.g. "//", in the _id (I have sloppy sources,
>>>>>    particularly IIS...)
>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does
>>>>>    not allow access to _id in the schema anymore, so no copy_field etc.
>>>>>    from _id). Hence "url".
>>>>>
>>>>> Regards,
>>>>> Steph
>>>>>
>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>
>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we may
>>>>>> need to upgrade it.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>
>>>>>>> Olivier
>>>>>>> By all means.
>>>>>>> The only issue I have seen (totally unrelated) is with Jetty, which
>>>>>>> has to be restarted about once a week. Still trying to find the issue.
>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres10
>>>>>>> may be a bit slower. I have no empirical evidence at the moment as I'm
>>>>>>> still delivering the project to UAT. Will keep you posted.
>>>>>>> Regards,
>>>>>>> Steph
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <olivier.tav...@francelabs.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry for
>>>>>>>> the late answer). I will test it soon.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Olivier TAVARD
>>>>>>>>
>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>>>
>>>>>>>> These are the rpm installs:
>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>
>>>>>>>> postgresql_version: 10
>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>> postgresql_packages:
>>>>>>>>   - postgresql10-libs
>>>>>>>>   - postgresql10
>>>>>>>>   - postgresql10-server
>>>>>>>>   - postgresql10-contrib
>>>>>>>> # - postgresql10-devel
>>>>>>>>
>>>>>>>> postgresql_hba_entries:
>>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>
>>>>>>>> postgresql_global_config_options:
>>>>>>>>   - option: unix_socket_directories
>>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>
>>>>>>>>   - option: standard_conforming_strings
>>>>>>>>     value: 'on'
>>>>>>>>
>>>>>>>>   - option: shared_buffers
>>>>>>>>     value: '1024MB'
>>>>>>>>
>>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>   # checkpoint_segments=300
>>>>>>>>   - option: max_wal_size
>>>>>>>>     value: '14400MB'
>>>>>>>>
>>>>>>>>   - option: min_wal_size
>>>>>>>>     value: '80MB'
>>>>>>>>
>>>>>>>>   - option: maintenance_work_mem
>>>>>>>>     value: '2MB'
>>>>>>>>
>>>>>>>>   - option: listen_addresses
>>>>>>>>     value: '*'
>>>>>>>>
>>>>>>>>   - option: max_connections
>>>>>>>>     value: '400'
>>>>>>>>
>>>>>>>>   - option: checkpoint_timeout
>>>>>>>>     value: '900'
>>>>>>>>
>>>>>>>>   - option: datestyle
>>>>>>>>     value: "iso, mdy"
>>>>>>>>
>>>>>>>>   - option: autovacuum
>>>>>>>>     value: 'off'
>>>>>>>>
>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night,
>>>>>>>> # lazy vacuum every night)
>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>   cron:
>>>>>>>>     name: lazy_vacuum
>>>>>>>>     hour: 8
>>>>>>>>     minute: 0
>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>   cron:
>>>>>>>>     name: full_vacuum
>>>>>>>>     weekday: 0
>>>>>>>>     hour: 10
>>>>>>>>     minute: 0
>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>> # re-index all databases once a week
>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>   cron:
>>>>>>>>     name: reindex
>>>>>>>>     weekday: 0
>>>>>>>>     hour: 12
>>>>>>>>     minute: 0
>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>
>>>>>>>> This is how I run 2.10.
>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>> @Karl: Any comments please?
>>>>>>>> Steph
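
A minimal sketch of the ID clean-up discussed in this thread (lower-casing the document URI and collapsing repeated "/"), written as a standalone Java helper. The class name DocumentIdNormalizer, the method normalize, and the caseInsensitive flag are hypothetical and not part of ManifoldCF. Following Karl's advice, something like this would probably belong in the repository connector that constructs the ID, with lower-casing enabled only when the crawled repository (for example IIS) treats URLs as case insensitive.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not a ManifoldCF class. */
public final class DocumentIdNormalizer {
  private DocumentIdNormalizer() {}

  /**
   * Collapses runs of "/" after the scheme separator and, if requested,
   * lower-cases the whole URI.
   *
   * @param documentURI the raw document URI (may be null)
   * @param caseInsensitive assumption: true only when the source repository
   *        treats URLs as case insensitive
   */
  public static String normalize(String documentURI, boolean caseInsensitive) {
    if (documentURI == null) {
      return null;
    }
    // Keep the "scheme://" prefix intact so its "//" is not collapsed.
    int schemeEnd = documentURI.indexOf("://");
    String prefix = schemeEnd >= 0 ? documentURI.substring(0, schemeEnd + 3) : "";
    String rest = schemeEnd >= 0 ? documentURI.substring(schemeEnd + 3) : documentURI;
    // Collapse any run of two or more slashes into a single slash.
    String cleaned = prefix + rest.replaceAll("/{2,}", "/");
    // Locale.ROOT avoids locale-specific case mappings (e.g. Turkish dotless i).
    return caseInsensitive ? cleaned.toLowerCase(Locale.ROOT) : cleaned;
  }
}
```

For example, normalize("http://Host/A//b.HTML", true) gives "http://host/a/b.html", while passing false only collapses the slashes and leaves the case untouched. This simple version also collapses "//" inside a query string, which may not be wanted for every source.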