Let's make sure we're talking about the same thing. Here is the output connector method that receives the ID (as the documentURI parameter):

  public int addOrReplaceDocumentWithException(String documentURI, VersionContext pipelineDescription,
    RepositoryDocument document, String authorityNameString, IOutputAddActivity activities)
    throws ManifoldCFException, ServiceInterruption, IOException;

ManifoldCF doesn't say anywhere that this ID is case insensitive. If you make it case insensitive in an output connector, this will potentially break a lot of things, for example incremental indexing (which tracks the last indexed version by document ID).

I therefore highly recommend that any "sloppiness" in this parameter be addressed in the repository connector that constructs it. If the connector is crawling a repository that treats URLs as case insensitive, then it should map these IDs to lower case. If not, then it shouldn't.

Karl

On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <st...@remcam.net> wrote:

> Hi Karl.
> The issue is that the ES Output Connector uses the URI to create the _id.
> When used with IIS, which allows case variation in the URI, it creates
> multiple documents. Clients on Windows IIS are rarely cognizant of that
> issue as IIS is so lax in policing that OTB.
> Currently, every case variation in URI results in a new doc in the index.
> This is only in the ES output connector.
> I can add an optional checkbox to determine that particular action if
> that would help?
> Regards,
> Steph
>
> *Steph van Schalkwyk*
> Principal, Remcam Search Engines
> +1.314.452.2896 st...@remcam.net http://remcam.net
> Skype: svanschalkwyk
> http://linkedin.com/in/vanschalkwyk
>
> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddy...@gmail.com> wrote:
>
>> Thanks for the update.
>> Lower-casing the ID would be fine except there are some connectors that
>> care about case.
>> The web connector is one such because it's up to the web
>> service to decide if case matters, so the web connector does not view URLs
>> with case differences as being the same. Other connectors will likely
>> care as well. So I don't think lower-casing the document ID is a smart
>> thing to do.
>>
>> You could add this bit of configuration to the web connector, if that's
>> what you are using, or to whatever other connector constructs the ID.
>>
>> Karl
>>
>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <st...@remcam.net>
>> wrote:
>>
>>> Thanks Karl.
>>>
>>> I'll look into that.
>>>
>>> Another note:
>>> Regarding the ES connector - I have made three additions to it and should
>>> probably diff them for inclusion after approval:
>>> 1. Lowercased _id (the doc URI).
>>> 2. Removed dual "/", e.g. "//", in the _id (I have sloppy sources,
>>> particularly IIS...)
>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does not
>>> allow access to _id in the schema anymore, so no copy_field etc. from _id).
>>> Hence "url".
>>>
>>> Regards,
>>> Steph
>>>
>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we may
>>>> need to upgrade it.
>>>>
>>>> Karl
>>>>
>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <st...@remcam.net>
>>>> wrote:
>>>>
>>>>> Olivier
>>>>> By all means.
>>>>> The only issue I have seen (totally unrelated) is with Jetty, which
>>>>> has to be restarted about once a week. Still trying to find the issue.
>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres10 may
>>>>> be a bit slower.
>>>>> I have no empirical evidence at the moment as I'm still
>>>>> delivering the project to UAT. Will keep you posted.
>>>>> Regards,
>>>>> Steph
>>>>>
>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <
>>>>> olivier.tav...@francelabs.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry for the
>>>>>> late answer). I will test it soon.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Olivier TAVARD
>>>>>>
>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <st...@remcam.net>
>>>>>> wrote:
>>>>>>
>>>>>> These are the rpm installs:
>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>
>>>>>> postgresql_version: 10
>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>> postgresql_daemon: postgresql-10.service
>>>>>> postgresql_packages:
>>>>>>   - postgresql10-libs
>>>>>>   - postgresql10
>>>>>>   - postgresql10-server
>>>>>>   - postgresql10-contrib
>>>>>> # - postgresql10-devel
>>>>>>
>>>>>> postgresql_hba_entries:
>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32',
>>>>>>     auth_method: md5 }
>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>
>>>>>> postgresql_global_config_options:
>>>>>>   - option: unix_socket_directories
>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>
>>>>>>   - option: standard_conforming_strings
>>>>>>     value: 'on'
>>>>>>
>>>>>>   - option: shared_buffers
>>>>>>     value: '1024MB'
>>>>>>
>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>   # checkpoint_segments=300
>>>>>>   - option: max_wal_size
>>>>>>     value: '14400MB'
>>>>>>
>>>>>>   - option: min_wal_size
>>>>>>     value: '80MB'
>>>>>>
>>>>>>   - option: maintenance_work_mem
>>>>>>     value: '2MB'
>>>>>>
>>>>>>   - option: listen_addresses
>>>>>>     value: '*'
>>>>>>
>>>>>>   - option: max_connections
>>>>>>     value: '400'
>>>>>>
>>>>>>   - option: checkpoint_timeout
>>>>>>     value: '900'
>>>>>>
>>>>>>   - option: datestyle
>>>>>>     value: "iso, mdy"
>>>>>>
>>>>>>   - option: autovacuum
>>>>>>     value: 'off'
>>>>>>
>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy
>>>>>> # vacuum every night)
>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>   cron:
>>>>>>     name: lazy_vacuum
>>>>>>     hour: 8
>>>>>>     minute: 0
>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>> - name: add postgresql cron full vacuum
>>>>>>   cron:
>>>>>>     name: full_vacuum
>>>>>>     weekday: 0
>>>>>>     hour: 10
>>>>>>     minute: 0
>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>> # re-index all databases once a week
>>>>>> - name: add postgresql cron reindex
>>>>>>   cron:
>>>>>>     name: reindex
>>>>>>     weekday: 0
>>>>>>     hour: 12
>>>>>>     minute: 0
>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database
>>>>>>       order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c
>>>>>>       \"reindex database {};\"' "
>>>>>>
>>>>>> This is how I run 2.10.
>>>>>> Been running fine for some weeks without user intervention.
>>>>>> @Karl: Any comments please?
>>>>>> Steph
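
Returning to the document ID discussion at the top of the thread: the following is a rough sketch, not ManifoldCF code, of the kind of optional normalization a repository connector could apply before it hands the ID to the framework. The class name DocumentIdNormalizer, the method normalize, and both boolean switches are hypothetical names chosen for illustration; how such options would be wired into a connector's configuration is left open.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ManifoldCF: normalizes a document ID
// in the repository connector, before any output connector sees it.
final class DocumentIdNormalizer {
    // Runs of two or more slashes that do not start right after ':'.
    // This leaves the "//" of "http://" alone. Assumption: IDs are http/https
    // URLs; a "file:///" prefix would be shortened by this pattern.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    private DocumentIdNormalizer() {}

    // lowerCase: enable only when the crawled repository (e.g. IIS) treats
    // URLs as case insensitive. collapseSlashes: turn "//" into "/".
    static String normalize(String documentId, boolean lowerCase, boolean collapseSlashes) {
        String result = documentId;
        if (collapseSlashes) {
            result = DUPLICATE_SLASHES.matcher(result).replaceAll("/");
        }
        if (lowerCase) {
            // Locale.ROOT avoids locale-specific surprises such as the Turkish dotless i.
            result = result.toLowerCase(Locale.ROOT);
        }
        return result;
    }
}
```

With both switches off the ID passes through unchanged, which keeps the default behavior for repositories where case matters. Because the normalization happens where the ID is constructed, incremental indexing and every output connector see one consistent ID per document.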