yes

On Wed, Sep 5, 2018 at 12:10 PM Steph van Schalkwyk <st...@remcam.net>
wrote:

> Thank you. So I'll stop for now?
> Steph
>
>
>
>
> *Steph van Schalkwyk*
> Principal, Remcam Search Engines
> +1.314.452.2896    st...@remcam.net   http://remcam.net
> Skype: svanschalkwyk
> <http://linkedin.com/in/vanschalkwyk>
>
> On Wed, Sep 5, 2018 at 11:05 AM, Karl Wright <daddy...@gmail.com> wrote:
>
>> I'm already working on the Web Connector.  The UI has problems that
>> predate this change and I've alerted Kishore about them -- he'll look into
>> them later today.
>>
>> Karl
>>
>>
>> On Wed, Sep 5, 2018 at 11:55 AM Steph van Schalkwyk <st...@remcam.net>
>> wrote:
>>
>>> Thank you Karl.
>>> You are of course correct: the incremental crawl is now broken, because
>>> it does a full crawl every time.
>>> I'll jump on the Web Connector and add that functionality.
>>> Thanks for this excellent application and all the help over the years.
>>> Steph
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Sep 5, 2018 at 6:33 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> The patch I uploaded doesn't work because the entire tab is broken;
>>>> looks like the UI refactoring broke it and it was never reported.  Fixing
>>>> now.
>>>> Karl
>>>>
>>>>
>>>> On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <daddy...@gmail.com> wrote:
>>>>
>>>>> I coded up the web connector feature I think we need.  See
>>>>> CONNECTORS-1528; I've attached a patch.  Please apply and test it out to
>>>>> see if it solves the case problem for your IIS site.
>>>>>
>>>>> For the "//" issue, can you be more specific about the mapping you
>>>>> need to do?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>
>>>>>> Hi Steph,
>>>>>>
>>>>>> Right, you wouldn't want to touch the framework.
>>>>>>
>>>>>> The effect of lower-casing the documentURI parameter in the
>>>>>> addOrReplaceDocumentWithException method in an output connector would be
>>>>>> to map multiple independently fetched documents that differ only by the
>>>>>> case of the URL together into one document in the index.  The ManifoldCF
>>>>>> assumption is that a document with a certain URI can be tracked in the
>>>>>> index using exactly that URI.  Mapping the URI to lower case would break
>>>>>> that assumption, so the framework would make the wrong decision in many
>>>>>> cases.
>>>>>>
>>>>>> If you are picking up documents using the web connector, and you are
>>>>>> getting duplicate documents because the document URLs are sloppy, it is
>>>>>> essential that INSTEAD of mapping the document URI to lower case in the
>>>>>> output connector, you map it to lower case in the repository connector.
>>>>>> Otherwise the framework will not work right.
>>>>>>
>>>>>> There is a tab in the web connector that allows you to configure URL
>>>>>> normalization, called "Canonicalization".  This would be a very
>>>>>> appropriate place to add URL mapping to lower case.  It should be as
>>>>>> simple as adding one more checkbox column in the table, and modifying the
>>>>>> method that does the URL processing to include lower-casing.
>>>>>>
>>>>>> Karl
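
The repository-side normalization described above could look roughly like the following. This is only a minimal sketch, not the code from the CONNECTORS-1528 patch and not the web connector's actual canonicalization method. The class name UrlCaseNormalizer, the method normalize, and its two boolean flags are hypothetical, and the choice to lower-case scheme, authority, and path while leaving the query string untouched is an assumption. The second flag covers the "//" cleanup mentioned elsewhere in the thread.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: normalizes a URL before it is used as a document ID.
public class UrlCaseNormalizer {

  // lowerCasePath:   fold the path to lower case (for servers such as IIS
  //                  that treat paths as case insensitive).
  // collapseSlashes: replace runs of "/" in the path with a single "/".
  public static String normalize(String url, boolean lowerCasePath, boolean collapseSlashes) {
    URI uri = URI.create(url); // throws IllegalArgumentException on a malformed URL

    String path = uri.getRawPath();
    if (path == null) {
      path = "";
    }
    if (collapseSlashes) {
      path = path.replaceAll("/{2,}", "/");
    }
    if (lowerCasePath) {
      path = path.toLowerCase(Locale.ROOT);
    }

    StringBuilder sb = new StringBuilder();
    if (uri.getScheme() != null) {
      // Schemes are case insensitive, so folding them is always safe.
      sb.append(uri.getScheme().toLowerCase(Locale.ROOT)).append(':');
    }
    if (uri.getRawAuthority() != null) {
      // Assumes the authority carries no user-info part; host names are case insensitive.
      sb.append("//").append(uri.getRawAuthority().toLowerCase(Locale.ROOT));
    }
    sb.append(path);
    if (uri.getRawQuery() != null) {
      // Assumption: the query string is left as-is, since its case may matter.
      sb.append('?').append(uri.getRawQuery());
    }
    // The fragment, if any, is dropped.
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(normalize("http://Example.com/Docs//Page.ASPX?Id=1", true, true));
  }
}
```

Because the mapping happens before the document ID is constructed, the framework and the index both see the same, already-normalized URI.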
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <st...@remcam.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Unless I have a massive misunderstanding somewhere...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <
>>>>>>> st...@remcam.net> wrote:
>>>>>>>
>>>>>>>> Hi Karl
>>>>>>>> I'm addressing it in the ES Output Connector.
>>>>>>>> Not touching the framework :)
>>>>>>>> S
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddy...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Let's make sure we're talking about the same thing.
>>>>>>>>>
>>>>>>>>> Here is the output connector method that receives the ID (as the
>>>>>>>>> documentURI parameter):
>>>>>>>>>
>>>>>>>>>   public int addOrReplaceDocumentWithException(String documentURI,
>>>>>>>>>       VersionContext pipelineDescription, RepositoryDocument document,
>>>>>>>>>       String authorityNameString, IOutputAddActivity activities)
>>>>>>>>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>>>>>
>>>>>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.
>>>>>>>>> If you make it case insensitive in an output connector, this will
>>>>>>>>> potentially break a lot of things, for example incremental indexing
>>>>>>>>> (which organizes the last indexed version by document ID).
>>>>>>>>>
>>>>>>>>> I therefore highly recommend that any "sloppiness" in this
>>>>>>>>> parameter be addressed in the Repository Connector that constructs it.
>>>>>>>>> If the connector is crawling a repository that believes that URLs are
>>>>>>>>> case insensitive then it should map these IDs to lower case.  If not,
>>>>>>>>> then it shouldn't.
>>>>>>>>>
>>>>>>>>> Karl
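
To make the concern above concrete, here is a toy model of what can go wrong when only the output side folds case. It is purely illustrative and is not ManifoldCF code: the class IdFoldingDemo and its two maps are hypothetical stand-ins for the framework's per-document tracking and for the search index.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model: the framework tracks exact URIs, the output connector folds case.
public class IdFoldingDemo {
  // What the framework believes is indexed: exact document URI -> version.
  public final Map<String, String> tracked = new HashMap<>();
  // What the index actually holds: lower-cased ID -> document body.
  public final Map<String, String> index = new HashMap<>();

  public void addOrReplace(String documentURI, String version, String body) {
    tracked.put(documentURI, version);
    index.put(documentURI.toLowerCase(Locale.ROOT), body);
  }

  public void remove(String documentURI) {
    tracked.remove(documentURI);
    index.remove(documentURI.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    IdFoldingDemo demo = new IdFoldingDemo();
    demo.addOrReplace("http://host/Page.aspx", "v1", "body A");
    demo.addOrReplace("http://host/page.aspx", "v1", "body B");
    // Two documents are tracked, but they share a single index entry.
    System.out.println(demo.tracked.size() + " tracked, " + demo.index.size() + " indexed");

    // Removing one tracked URI deletes the shared entry, although the
    // framework still believes the other document is in the index.
    demo.remove("http://host/Page.aspx");
    System.out.println(demo.tracked.size() + " tracked, " + demo.index.size() + " indexed");
  }
}
```

In this model the tracking state and the index disagree as soon as two URIs differ only by case, which is why the folding belongs where the ID is constructed.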
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <
>>>>>>>>> st...@remcam.net> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl.
>>>>>>>>>> The issue is that the ES Output Connector uses the URI to create
>>>>>>>>>> the _id. When used with IIS, which allows case variation in the URI,
>>>>>>>>>> it creates multiple documents. Clients on Windows IIS are rarely
>>>>>>>>>> cognizant of that issue, as IIS is so lax in policing that out of the
>>>>>>>>>> box.
>>>>>>>>>> Currently, every case variation in the URI results in a new doc in
>>>>>>>>>> the index. This is only in the ES output connector.
>>>>>>>>>> I can add an optional checkbox to control that particular
>>>>>>>>>> action if that would help?
>>>>>>>>>> Regards,
>>>>>>>>>> Steph
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddy...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the update.
>>>>>>>>>>> Lower-casing the ID would be fine except there are some
>>>>>>>>>>> connectors that care about case.  The web connector is one such,
>>>>>>>>>>> because it's up to the web service to decide if case matters, so the
>>>>>>>>>>> web connector does not view URLs with case differences as being the
>>>>>>>>>>> same.  Other connectors will likely care as well. So I don't think
>>>>>>>>>>> lower-casing the document ID is a smart thing to do.
>>>>>>>>>>>
>>>>>>>>>>> You could add this bit of configuration to the web connector, if
>>>>>>>>>>> that's what you are using, or to whatever other connector
>>>>>>>>>>> constructs the ID.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <
>>>>>>>>>>> st...@remcam.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Karl.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll look into that.
>>>>>>>>>>>>
>>>>>>>>>>>> Another note:
>>>>>>>>>>>> Regarding the ES connector - I have made three additions to it
>>>>>>>>>>>> and should probably diff them for inclusion after approval:
>>>>>>>>>>>> 1. Lowercased the _id (the doc URI).
>>>>>>>>>>>> 2. Removed duplicate "/", e.g. "//", in the _id (I have sloppy
>>>>>>>>>>>> sources, particularly IIS...)
>>>>>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x
>>>>>>>>>>>> does not allow access to _id in the schema anymore, so no
>>>>>>>>>>>> copy_field etc. from _id). Hence "url".
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Steph
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <
>>>>>>>>>>>> daddy...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and
>>>>>>>>>>>>> we may need to upgrade it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <
>>>>>>>>>>>>> st...@remcam.net> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Olivier
>>>>>>>>>>>>>> By all means.
>>>>>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>>>>>>> which has to be restarted about once a week. Still trying to
>>>>>>>>>>>>>> find the issue.
>>>>>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with
>>>>>>>>>>>>>> Postgres 10 may be a bit slower. I have no empirical evidence at
>>>>>>>>>>>>>> the moment as I'm still delivering the project to UAT. Will keep
>>>>>>>>>>>>>> you posted.
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Steph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <
>>>>>>>>>>>>>> olivier.tav...@francelabs.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration
>>>>>>>>>>>>>>> (sorry for the late answer). I will test it soon.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 23 Aug 2018 at 19:20, Steph van Schalkwyk <
>>>>>>>>>>>>>>> st...@remcam.net> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>>>>>> - postgresql10-libs
>>>>>>>>>>>>>>> - postgresql10
>>>>>>>>>>>>>>> - postgresql10-server
>>>>>>>>>>>>>>> - postgresql10-contrib
>>>>>>>>>>>>>>> # - postgresql10-devel
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>>>>>> - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>>>>>> - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>>>>>> - option: unix_socket_directories
>>>>>>>>>>>>>>>   value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - option: standard_conforming_strings
>>>>>>>>>>>>>>>   value: 'on'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - option: shared_buffers
>>>>>>>>>>>>>>>   value: '1024MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>>>>>> # checkpoint_segments=300
>>>>>>>>>>>>>>> - option: max_wal_size
>>>>>>>>>>>>>>>   value: '14400MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - option: min_wal_size
>>>>>>>>>>>>>>>   value: '80MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - option: maintenance_work_mem
>>>>>>>>>>>>>>>   value: '2MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - option: listen_addresses
>>>>>>>>>>>>>>>   value: '*'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - option: max_connections
>>>>>>>>>>>>>>>   value: '400'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - option: checkpoint_timeout
>>>>>>>>>>>>>>>   value: '900'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - option: datestyle
>>>>>>>>>>>>>>>   value: "iso, mdy"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - option: autovacuum
>>>>>>>>>>>>>>>   value: 'off'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy vacuum every night)
>>>>>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>>>>>     hour: 8
>>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>>>     hour: 10
>>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>>     name: reindex
>>>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>>>     hour: 12
>>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>>>>>> Steph
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>
>
