Hi Karl,
I'm addressing it in the ES Output Connector. Not touching the framework :)
S
*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896 [email protected] http://remcam.net
Skype: svanschalkwyk
http://linkedin.com/in/vanschalkwyk

On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <[email protected]> wrote:

> Let's make sure we're talking about the same thing.
>
> Here is the output connector method that receives the ID (as the
> documentURI parameter):
>
> public int addOrReplaceDocumentWithException(String documentURI,
>     VersionContext pipelineDescription, RepositoryDocument document,
>     String authorityNameString, IOutputAddActivity activities)
>     throws ManifoldCFException, ServiceInterruption, IOException;
>
> ManifoldCF doesn't say anywhere that this ID is case insensitive. If you
> make it case insensitive in an output connector, this will potentially
> break a lot of things, for example incremental indexing (which organizes
> the last indexed version by document ID).
>
> I therefore highly recommend that any "sloppiness" in this parameter be
> addressed in the Repository Connector that constructs it. If the connector
> is crawling a repository that believes that URLs are case insensitive then
> it should map these IDs to lower case. If not, then it shouldn't.
>
> Karl
>
> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <[email protected]>
> wrote:
>
>> Hi Karl.
>> The issue is that the ES Output Connector uses the URI to create the _id.
>> When used with IIS, which allows case variation in the URI, it creates
>> multiple documents. Clients on Windows IIS are rarely cognizant of that
>> issue as IIS is so lax in policing that OTB.
>> Currently, every case variation in the URI results in a new doc in the
>> index. This is only in the ES output connector.
>> I can add an optional checkbox to determine that particular action if
>> that would help?
>> Regards,
>> Steph
>>
>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <[email protected]> wrote:
>>
>>> Thanks for the update.
>>> Lower-casing the ID would be fine except there are some connectors that
>>> care about case. The web connector is one such because it's up to the
>>> web service to decide if case matters, so the web connector does not
>>> view URLs with case differences as being the same. Other connectors
>>> will likely care as well. So I don't think lower-casing the document ID
>>> is a smart thing to do.
>>>
>>> You could add this bit of configuration to the web connector, if that's
>>> what you are using, or to whatever other connector constructs the ID.
>>>
>>> Karl
>>>
>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <[email protected]>
>>> wrote:
>>>
>>>> Thanks Karl.
>>>>
>>>> I'll look into that.
>>>>
>>>> Another note:
>>>> Regarding the ES connector - I have made three additions to it and
>>>> should probably diff them for inclusion after approval:
>>>> 1. Lowercased the _id (the doc URI).
>>>> 2. Removed doubled "/", e.g. "//", in the _id (I have sloppy sources,
>>>>    particularly IIS...)
>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does not
>>>>    allow access to _id in the schema anymore, so no copy_field etc.
>>>>    from _id). Hence "url".
>>>>
>>>> Regards,
>>>> Steph
>>>>
>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we may
>>>>> need to upgrade it.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Olivier
>>>>>> By all means.
>>>>>> The only issue I have seen (totally unrelated) is with Jetty, which
>>>>>> has to be restarted about once a week. Still trying to find the
>>>>>> issue.
>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres10 may
>>>>>> be a bit slower. I have no empirical evidence at the moment as I'm
>>>>>> still delivering the project to UAT. Will keep you posted.
>>>>>> Regards,
>>>>>> Steph
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry for
>>>>>>> the late answer). I will test it soon.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Olivier TAVARD
>>>>>>>
>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> These are the rpm installs:
>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>
>>>>>>> postgresql_version: 10
>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>> postgresql_packages:
>>>>>>>   - postgresql10-libs
>>>>>>>   - postgresql10
>>>>>>>   - postgresql10-server
>>>>>>>   - postgresql10-contrib
>>>>>>>   # - postgresql10-devel
>>>>>>>
>>>>>>> postgresql_hba_entries:
>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>
>>>>>>> postgresql_global_config_options:
>>>>>>>   - option: unix_socket_directories
>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>
>>>>>>>   - option: standard_conforming_strings
>>>>>>>     value: 'on'
>>>>>>>
>>>>>>>   - option: shared_buffers
>>>>>>>     value: '1024MB'
>>>>>>>
>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>   # checkpoint_segments=300
>>>>>>>   - option: max_wal_size
>>>>>>>     value: '14400MB'
>>>>>>>
>>>>>>>   - option: min_wal_size
>>>>>>>     value: '80MB'
>>>>>>>
>>>>>>>   - option: maintenance_work_mem
>>>>>>>     value: '2MB'
>>>>>>>
>>>>>>>   - option: listen_addresses
>>>>>>>     value: '*'
>>>>>>>
>>>>>>>   - option: max_connections
>>>>>>>     value: '400'
>>>>>>>
>>>>>>>   - option: checkpoint_timeout
>>>>>>>     value: '900'
>>>>>>>
>>>>>>>   - option: datestyle
>>>>>>>     value: "iso, mdy"
>>>>>>>
>>>>>>>   - option: autovacuum
>>>>>>>     value: 'off'
>>>>>>>
>>>>>>> # vacuum all databases every night (full vacuum on Sunday night,
>>>>>>> # lazy vacuum every night)
>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>   cron:
>>>>>>>     name: lazy_vacuum
>>>>>>>     hour: 8
>>>>>>>     minute: 0
>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>   cron:
>>>>>>>     name: full_vacuum
>>>>>>>     weekday: 0
>>>>>>>     hour: 10
>>>>>>>     minute: 0
>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>> # re-index all databases once a week
>>>>>>> - name: add postgresql cron reindex
>>>>>>>   cron:
>>>>>>>     name: reindex
>>>>>>>     weekday: 0
>>>>>>>     hour: 12
>>>>>>>     minute: 0
>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>
>>>>>>> This is how I run 2.10.
>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>> @Karl: Any comments please?
>>>>>>> Steph
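
For reference, the _id clean-up discussed at the top of this thread (lower-casing the document URI and collapsing "//") could be sketched roughly as below. This is only an illustration under assumptions, not code from the ES Output Connector: the class name DocumentIdNormalizer, the method normalize and the lowerCase flag (standing in for the proposed optional checkbox) are all hypothetical. As Karl points out, changing the ID can interfere with incremental indexing, so a sketch like this would presumably be opt-in and may belong in the repository connector instead.

```java
import java.util.Locale;

// Hypothetical helper, not part of ManifoldCF: illustrates the two _id
// clean-ups mentioned in the thread.
final class DocumentIdNormalizer {
    private DocumentIdNormalizer() {}

    /**
     * Collapses runs of '/' after the scheme separator and, if requested,
     * lower-cases the whole URI.
     * Limitation: a "//" inside a query string is collapsed as well.
     */
    static String normalize(String documentURI, boolean lowerCase) {
        // Leave the "://" that follows the scheme untouched.
        int schemeEnd = documentURI.indexOf("://");
        int pathStart = (schemeEnd >= 0) ? schemeEnd + 3 : 0;
        String prefix = documentURI.substring(0, pathStart);

        // Replace any run of two or more slashes with a single slash.
        String rest = documentURI.substring(pathStart).replaceAll("/{2,}", "/");

        String result = prefix + rest;
        // Locale.ROOT avoids locale-specific case mappings (e.g. Turkish 'I').
        return lowerCase ? result.toLowerCase(Locale.ROOT) : result;
    }
}
```

With the flag on, normalize("http://Server//Docs/Page.ASPX", true) returns "http://server/docs/page.aspx"; with it off, only the doubled slash is removed and case is preserved.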
