I coded up the web connector feature I think we need. See CONNECTORS-1528; I've attached a patch. Please apply and test it out to see if it solves the case problem for your IIS site.
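
For illustration only, here is roughly what lower-casing at the repository-connector level amounts to. This is a hypothetical sketch, not the attached patch and not ManifoldCF code; the class name, the method name, and the boolean flag (standing in for a per-rule checkbox) are all made up for the example.

```java
import java.util.Locale;

/** Hypothetical helper, not part of ManifoldCF. */
final class UrlCaseNormalizer {
    private UrlCaseNormalizer() {}

    /**
     * Returns the string to use as the document identifier.
     * When lowerCase is true the whole URL is folded to lower case,
     * which is only appropriate for sites that treat URLs as
     * case-insensitive. Locale.ROOT keeps the result independent of
     * the JVM's default locale.
     */
    static String normalize(String url, boolean lowerCase) {
        if (url == null || !lowerCase) {
            return url;
        }
        return url.toLowerCase(Locale.ROOT);
    }
}
```

With the flag on, a fetch of http://site/Docs/Page.aspx and a fetch of http://site/docs/page.aspx yield the same identifier before the framework ever sees them, so the framework's per-URI tracking still holds.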
For the "//" issue, can you be more specific about the mapping you need to do?

Karl

On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <daddy...@gmail.com> wrote:

> Hi Steph,
>
> Right, you wouldn't want to touch the framework.
>
> The effect of lower-casing the documentURI parameter in the
> addOrReplaceDocumentWithException method in an output connector would be
> to map multiple, independently fetched documents that differ only by the
> case of the URL together into one document in the index. The ManifoldCF
> assumption is that a document with a certain URI can be tracked in the
> index using exactly that URI. Mapping the URI to lower case would break
> that assumption, so the framework would make the wrong decision in many
> cases.
>
> If you are picking up documents using the web connector, and you are
> getting duplicate documents because the document URLs are sloppy, it is
> therefore essential that INSTEAD of mapping the document URI to lower
> case in the output connector, you map to lower case in the repository
> connector. Otherwise the framework will not work right.
>
> There is a tab in the web connector that allows you to configure URL
> normalization, called "Canonicalization". This would be a very
> appropriate place to add URL mapping to lower case. It should be as
> simple as adding one more checkbox column in the table, and modifying
> the method that does the URL processing to include lower-casing.
>
> Karl
>
> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <st...@remcam.net> wrote:
>
>> Unless I have a massive misunderstanding somewhere...
>>
>> *Steph van Schalkwyk*
>> Principal, Remcam Search Engines
>> +1.314.452.2896 st...@remcam.net http://remcam.net
>> Skype: svanschalkwyk
>> http://linkedin.com/in/vanschalkwyk
>>
>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <st...@remcam.net> wrote:
>>
>>> Hi Karl
>>> I'm addressing it in the ES Output Connector.
>>> Not touching the framework :)
>>> S
>>>
>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Let's make sure we're talking about the same thing.
>>>>
>>>> Here is the output connector method that receives the ID (as the
>>>> documentURI parameter):
>>>>
>>>> public int addOrReplaceDocumentWithException(String documentURI,
>>>>     VersionContext pipelineDescription, RepositoryDocument document,
>>>>     String authorityNameString, IOutputAddActivity activities)
>>>>   throws ManifoldCFException, ServiceInterruption, IOException;
>>>>
>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive. If
>>>> you make it case insensitive in an output connector, this will
>>>> potentially break a lot of things, for example incremental indexing
>>>> (which organizes the last indexed version by document ID).
>>>>
>>>> I therefore highly recommend that any "sloppiness" in this parameter
>>>> be addressed in the Repository Connector that constructs it. If the
>>>> connector is crawling a repository that believes that URLs are case
>>>> insensitive then it should map these IDs to lower case. If not, then
>>>> it shouldn't.
>>>>
>>>> Karl
>>>>
>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>
>>>>> Hi Karl.
>>>>> The issue is that the ES Output Connector uses the URI to create the
>>>>> _id. When used with IIS, which allows case variation in the URI, it
>>>>> creates multiple documents. Clients on Windows IIS are rarely
>>>>> cognizant of that issue as IIS is so lax in policing that OTB.
>>>>> Currently, every case variation in the URI results in a new doc in
>>>>> the index. This is only in the ES output connector.
>>>>> I can add an optional checkbox to control that particular action if
>>>>> that would help?
>>>>> Regards,
>>>>> Steph
>>>>>
>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the update.
>>>>>> Lower-casing the ID would be fine except there are some connectors
>>>>>> that care about case. The web connector is one such because it's up
>>>>>> to the web service to decide if case matters, so the web connector
>>>>>> does not view URLs with case differences as being the same. Other
>>>>>> connectors will likely care as well. So I don't think lower-casing
>>>>>> the document ID is a smart thing to do.
>>>>>>
>>>>>> You could add this bit of configuration to the web connector, if
>>>>>> that's what you are using, or to whatever other connector
>>>>>> constructs the ID.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>
>>>>>>> Thanks Karl.
>>>>>>>
>>>>>>> I'll look into that.
>>>>>>>
>>>>>>> Another note:
>>>>>>> Regarding the ES connector - I have made three additions to it and
>>>>>>> should probably diff them for inclusion after approval:
>>>>>>> 1. Lowercased _id (the doc URI).
>>>>>>> 2. Removed dual "/", e.g. "//" in the _id (I have sloppy sources,
>>>>>>>    particularly IIS...)
>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does
>>>>>>>    not allow access to _id in the schema anymore, so no copy_field
>>>>>>>    etc. from _id). Hence "url".
>>>>>>>
>>>>>>> Regards,
>>>>>>> Steph
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we
>>>>>>>> may need to upgrade it.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>>>
>>>>>>>>> Olivier
>>>>>>>>> By all means.
>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>> which has to be restarted about once a week. Still trying to
>>>>>>>>> find the issue.
>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with
>>>>>>>>> Postgres10 may be a bit slower. I have no empirical evidence at
>>>>>>>>> the moment as I'm still delivering the project to UAT. Will keep
>>>>>>>>> you posted.
>>>>>>>>> Regards,
>>>>>>>>> Steph
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <olivier.tav...@francelabs.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry
>>>>>>>>>> for the late answer). I will test it soon.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>>
>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>
>>>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>>>>>
>>>>>>>>>> These are the rpm installs:
>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>
>>>>>>>>>> postgresql_version: 10
>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>> postgresql_packages:
>>>>>>>>>> - postgresql10-libs
>>>>>>>>>> - postgresql10
>>>>>>>>>> - postgresql10-server
>>>>>>>>>> - postgresql10-contrib
>>>>>>>>>> # - postgresql10-devel
>>>>>>>>>>
>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>> - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>> - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>> - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>> - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>> - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>> - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>
>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>> - option: unix_socket_directories
>>>>>>>>>>   value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>
>>>>>>>>>> - option: standard_conforming_strings
>>>>>>>>>>   value: 'on'
>>>>>>>>>>
>>>>>>>>>> - option: shared_buffers
>>>>>>>>>>   value: '1024MB'
>>>>>>>>>>
>>>>>>>>>> # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>> # checkpoint_segments=300
>>>>>>>>>> - option: max_wal_size
>>>>>>>>>>   value: '14400MB'
>>>>>>>>>>
>>>>>>>>>> - option: min_wal_size
>>>>>>>>>>   value: '80MB'
>>>>>>>>>>
>>>>>>>>>> - option: maintenance_work_mem
>>>>>>>>>>   value: '2MB'
>>>>>>>>>>
>>>>>>>>>> - option: listen_addresses
>>>>>>>>>>   value: '*'
>>>>>>>>>>
>>>>>>>>>> - option: max_connections
>>>>>>>>>>   value: '400'
>>>>>>>>>>
>>>>>>>>>> - option: checkpoint_timeout
>>>>>>>>>>   value: '900'
>>>>>>>>>>
>>>>>>>>>> - option: datestyle
>>>>>>>>>>   value: "iso, mdy"
>>>>>>>>>>
>>>>>>>>>> - option: autovacuum
>>>>>>>>>>   value: 'off'
>>>>>>>>>>
>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night,
>>>>>>>>>> # lazy vacuum every night)
>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>   cron:
>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>     hour: 8
>>>>>>>>>>     minute: 0
>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>   cron:
>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>     weekday: 0
>>>>>>>>>>     hour: 10
>>>>>>>>>>     minute: 0
>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>   cron:
>>>>>>>>>>     name: reindex
>>>>>>>>>>     weekday: 0
>>>>>>>>>>     hour: 12
>>>>>>>>>>     minute: 0
>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>>
>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>> Steph