yes

On Wed, Sep 5, 2018 at 12:10 PM Steph van Schalkwyk <st...@remcam.net> wrote:
> Thank you. So I'll stop for now?
> Steph
>
> *Steph van Schalkwyk*
> Principal, Remcam Search Engines
> +1.314.452.2896 st...@remcam.net
> http://remcam.net Skype: svanschalkwyk
> http://linkedin.com/in/vanschalkwyk
>
> On Wed, Sep 5, 2018 at 11:05 AM, Karl Wright <daddy...@gmail.com> wrote:
>
>> I'm already working on the Web Connector. The UI has problems that predate this change and I've alerted Kishore about them -- he'll look into them later today.
>>
>> Karl
>>
>> On Wed, Sep 5, 2018 at 11:55 AM Steph van Schalkwyk <st...@remcam.net> wrote:
>>
>>> Thank you Karl.
>>> You are of course correct in that the incremental crawl is now broken in that it does a full crawl every time.
>>> I'll jump on the Web Connector and add that functionality.
>>> Thanks for this excellent application and all the help over the years.
>>> Steph
>>>
>>> On Wed, Sep 5, 2018 at 6:33 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> The patch I uploaded doesn't work because the entire tab is broken; looks like the UI refactoring broke it and it was never reported. Fixing now.
>>>> Karl
>>>>
>>>> On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <daddy...@gmail.com> wrote:
>>>>
>>>>> I coded up the web connector feature I think we need. See CONNECTORS-1528; I've attached a patch. Please apply and test it out to see if it solves the case problem for your IIS site.
>>>>>
>>>>> For the "//" issue, can you be more specific about the mapping you need to do?
>>>>>
>>>>> Karl
>>>>>
>>>>> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>
>>>>>> Hi Steph,
>>>>>>
>>>>>> Right, you wouldn't want to touch the framework.
>>>>>>
>>>>>> The effect of lower-casing the documentURI parameter in the addOrReplaceDocumentWithException method in an output connector would be to map multiple independently fetched documents that differ only by the case of the URL together into one document in the index. The ManifoldCF assumption is that a document with a certain URI can be tracked in the index using exactly that URI. Mapping the URI to lower case would break that assumption, so the framework would make the wrong decision in many cases.
>>>>>>
>>>>>> If you are picking up documents using the web connector, and you are getting duplicate documents because the document URLs are sloppy, it is therefore essential that INSTEAD of mapping the document URI to lower case in the output connector, you map to lower case in the repository connector. Otherwise the framework will not work right.
>>>>>>
>>>>>> There is a tab in the web connector that allows you to configure URL normalization, called "Canonicalization". This would be a very appropriate place to add URL mapping to lower case. It should be as simple as adding one more checkbox column in the table, and modifying the method that does the URL processing to include lower-casing.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>
>>>>>>> Unless I have a massive misunderstanding somewhere...
>>>>>>>
>>>>>>> *Steph van Schalkwyk*
>>>>>>> Principal, Remcam Search Engines
>>>>>>> +1.314.452.
>>>>>>> 2896 st...@remcam.net
>>>>>>> http://remcam.net Skype: svanschalkwyk
>>>>>>> http://linkedin.com/in/vanschalkwyk
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>>
>>>>>>>> Hi Karl
>>>>>>>> I'm addressing it in the ES Output Connector.
>>>>>>>> Not touching the framework :)
>>>>>>>> S
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Let's make sure we're talking about the same thing.
>>>>>>>>>
>>>>>>>>> Here is the output connector method that receives the ID (as the documentURI parameter):
>>>>>>>>>
>>>>>>>>> public int addOrReplaceDocumentWithException(String documentURI,
>>>>>>>>>     VersionContext pipelineDescription, RepositoryDocument document,
>>>>>>>>>     String authorityNameString, IOutputAddActivity activities)
>>>>>>>>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>>>>>
>>>>>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive. If you make it case insensitive in an output connector, this will potentially break a lot of things, for example incremental indexing (which organizes the last indexed version by document ID).
>>>>>>>>>
>>>>>>>>> I therefore highly recommend that any "sloppiness" in this parameter be addressed in the Repository Connector that constructs it.
>>>>>>>>> If the connector is crawling a repository that believes that URLs are case insensitive then it should map these IDs to lower case. If not, then it shouldn't.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl.
>>>>>>>>>> The issue is that the ES Output Connector uses the URI to create the _id. When used with IIS, which allows case variation in the URI, it creates multiple documents. Clients on Windows IIS are rarely cognizant of that issue as IIS is so lax in policing that OTB.
>>>>>>>>>> Currently, every case variation in URI results in a new doc in the index. This is only in the ES output connector.
>>>>>>>>>> I can add an optional checkbox to determine that particular action if that would help?
>>>>>>>>>> Regards,
>>>>>>>>>> Steph
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the update.
>>>>>>>>>>> Lower-casing the ID would be fine except there are some connectors that care about case. The web connector is one such because it's up to the web service to decide if case matters, so the web connector does not view URLs with case differences as being the same. Other connectors will likely care as well. So I don't think lower-casing the document ID is a smart thing to do.
>>>>>>>>>>>
>>>>>>>>>>> You could add this bit of configuration to the web connector, if that's what you are using, or to whatever other connector constructs the ID.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Karl.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll look into that.
>>>>>>>>>>>>
>>>>>>>>>>>> Another note:
>>>>>>>>>>>> Regarding the ES connector - I have made three additions to it and should probably diff them for inclusion after approval:
>>>>>>>>>>>> 1. Lowercased _id (the doc URI).
>>>>>>>>>>>> 2. Removed dual "/", e.g. "//", in the _id (I have sloppy sources, particularly IIS...)
>>>>>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does not allow access to _id in the schema anymore, so no copy_field etc. from _id). Hence "url".
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Steph
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we may need to upgrade it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Olivier
>>>>>>>>>>>>>> By all means.
>>>>>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty, which has to be restarted about once a week. Still trying to find the issue.
>>>>>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres 10 may be a bit slower. I have no empirical evidence at the moment as I'm still delivering the project to UAT. Will keep you posted.
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Steph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <olivier.tav...@francelabs.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry for the late answer). I will test it soon.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <st...@remcam.net> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>>>>>>   - postgresql10-libs
>>>>>>>>>>>>>>>   - postgresql10
>>>>>>>>>>>>>>>   - postgresql10-server
>>>>>>>>>>>>>>>   - postgresql10-contrib
>>>>>>>>>>>>>>> #  - postgresql10-devel
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>>>>>>   - { type: host,
>>>>>>>>>>>>>>>     database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>>>>>>   - option: unix_socket_directories
>>>>>>>>>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: standard_conforming_strings
>>>>>>>>>>>>>>>     value: 'on'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: shared_buffers
>>>>>>>>>>>>>>>     value: '1024MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>>>>>>   # checkpoint_segments=300
>>>>>>>>>>>>>>>   - option: max_wal_size
>>>>>>>>>>>>>>>     value: '14400MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: min_wal_size
>>>>>>>>>>>>>>>     value: '80MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: maintenance_work_mem
>>>>>>>>>>>>>>>     value: '2MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: listen_addresses
>>>>>>>>>>>>>>>     value: '*'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: max_connections
>>>>>>>>>>>>>>>     value: '400'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: checkpoint_timeout
>>>>>>>>>>>>>>>     value: '900'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: datestyle
>>>>>>>>>>>>>>>     value: "iso, mdy"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: autovacuum
>>>>>>>>>>>>>>>     value: 'off'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy vacuum every night)
>>>>>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>>>>>     hour: 8
>>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>>>     hour: 10
>>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>>     name: reindex
>>>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>>>     hour: 12
>>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>>>>>> Steph
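
For reference, the URL clean-up discussed in the replies above (lower-casing and collapsing "//" where the document ID is constructed, i.e. on the repository-connector side, not in the output connector) could look roughly like the sketch below. This is only an illustration: UrlCanonicalizer and canonicalize are made-up names, not ManifoldCF classes or the CONNECTORS-1528 patch, and leaving the query string untouched and dropping any fragment are assumptions made here.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of ManifoldCF.
public class UrlCanonicalizer {

    /**
     * Normalizes a URL before it is used as a document ID.
     * lowerCase: lower-case scheme, authority and path (assumes the site,
     *            e.g. IIS, treats paths as case insensitive).
     * collapseSlashes: replace runs of "/" in the path with a single "/".
     * The query string is left untouched and any fragment is dropped
     * (both are assumptions of this sketch).
     */
    public static String canonicalize(String url, boolean lowerCase, boolean collapseSlashes)
            throws URISyntaxException {
        URI uri = new URI(url);
        String scheme = uri.getScheme();
        String authority = uri.getRawAuthority();
        String path = uri.getRawPath();
        String query = uri.getRawQuery();

        if (collapseSlashes && path != null) {
            // Only the path is touched, so the "//" after the scheme survives.
            path = path.replaceAll("/{2,}", "/");
        }
        if (lowerCase) {
            if (scheme != null) scheme = scheme.toLowerCase(Locale.ROOT);
            if (authority != null) authority = authority.toLowerCase(Locale.ROOT);
            if (path != null) path = path.toLowerCase(Locale.ROOT);
        }

        StringBuilder sb = new StringBuilder();
        if (scheme != null) sb.append(scheme).append(':');
        if (authority != null) sb.append("//").append(authority);
        if (path != null) sb.append(path);
        if (query != null) sb.append('?').append(query);
        return sb.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // prints http://example.com/docs/page.aspx?Id=7
        System.out.println(canonicalize("http://Example.com/Docs//Page.ASPX?Id=7", true, true));
    }
}
```

Doing this where the ID is built means every case variant of a URL maps to one ID before the framework records a version for it, which is the property incremental indexing depends on.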