[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710279#comment-16710279
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi [~SteenTi], you will still not get unreachable documents deleted if you run 
your job using the "minimal" cycle.  Please be sure you are using the "full" 
cycle.

If you need very short cycles, you will need to make a tradeoff 
between getting new content in and removing old content.  Typically we 
recommend that you schedule your job to use "minimal" crawls most of the time, 
but use "full" runs periodically to clean out unreachable documents.

If you believe you are running "full" crawls and there is still no 
cleanup, I can assure you that the Web Connector has automated tests that 
verify it cleans up unreachable documents properly.  So there are 
two possibilities: (1) this is specific to changes in seeds, or (2) the 
Elastic Search Connector is transmitting deletes that are failing silently for 
some reason.  To figure out which it is, please run a cycle manually 
and look at the Simple History report to see if deletions are logged.
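
The "minimal most of the time, full periodically" recommendation can be sketched as a small planning function. This is only an illustration of the scheduling idea, not a ManifoldCF API: the function name choose_cycle and the parameter full_every are hypothetical, and in practice the choice is made in the job's schedule settings.

```python
def choose_cycle(run_number: int, full_every: int = 12) -> str:
    """Return "full" for every `full_every`-th run, otherwise "minimal".

    Hypothetical helper: it only models the tradeoff described above,
    where frequent "minimal" runs pick up new content and an occasional
    "full" run cleans out unreachable documents.
    """
    if full_every < 1:
        raise ValueError("full_every must be at least 1")
    return "full" if run_number % full_every == 0 else "minimal"


# Twelve consecutive runs (for example one hour at a 5-minute cadence):
# run 0 is a full crawl, runs 1..11 are minimal crawls.
plan = [choose_cycle(n) for n in range(12)]
```

A smaller full_every removes stale documents sooner, at the cost of more time spent in full crawls.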


> Document removal Elastic
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
>                      Elasticsearch 6.3.2
>                      Web input connector
>                      Elastic output connector
>                      Job crawls website input and outputs content to Elastic
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>         Attachments: Screenshot from 2018-12-05 09-01-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> job with the changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709986#comment-16709986
 ] 

Tim Steenbeke commented on CONNECTORS-1562:
---

The documentation states:
{code:java}
A typical non-continuous run of a job has the following stages of execution:

1) Adding the job's new, changed, or deleted starting points to the queue 
   ("seeding")
2) Fetching documents, discovering new documents, and detecting deletions
3) Removing no-longer-included documents from the queue

Jobs can also be run "continuously", which means that the job never completes, 
unless it is aborted. A continuous run has different stages of execution:

1) Adding the job's new, changed, or deleted starting points to the queue 
   ("seeding")
2) Fetching documents, discovering new documents, and detecting deletions, while 
   reseeding periodically

Note that continuous jobs cannot remove no-longer-included documents from the 
queue. They can only remove documents that have been deleted from the 
repository.{code}
Both should detect deletions, but only a non-continuous job should delete the 
unreachable documents.
Knowing this, I changed the job to a non-continuous job that starts every 5 
minutes for testing.
Even when the job is non-continuous, it doesn't delete the unreachable documents.
It keeps all documents indexed in Elastic.
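
The third stage of a non-continuous run quoted above (removing no-longer-included documents) can be illustrated with a toy model. This is a sketch under stated assumptions, not ManifoldCF code: the link graph is a plain dictionary, and the names reachable_from and full_run are hypothetical. It shows why documents that can only be reached from a removed seed are expected to be deleted after a complete run.

```python
from collections import deque


def reachable_from(seeds, links):
    """Breadth-first walk of the link graph, starting at the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def full_run(indexed, seeds, links):
    """Simulate one non-continuous run.

    Returns (new_index, deleted): everything reachable from the seeds is
    kept or added, and anything indexed earlier but no longer reachable
    is reported as deleted.
    """
    reachable = reachable_from(seeds, links)
    deleted = set(indexed) - reachable
    return reachable, deleted


# Toy site: a -> b -> c, and a separate branch x -> y.
links = {"a": ["b"], "b": ["c"], "x": ["y"]}
index, deleted = full_run(set(), ["a", "x"], links)  # first run, both seeds
index, deleted = full_run(index, ["a"], links)       # seed "x" removed
```

After the second run, the documents under the removed seed are the ones reported as deleted, which is the behaviour the issue expects to see in the index.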



Re: ManifoldCF and PostgradeSQL Version Compatability

2018-12-05 Thread Karl Wright
We know that the 9.x series works fine with ManifoldCF.
If you go with the 10.x or 11.x series, you will need to get a different
PostgreSQL driver.  I doubt there is anything that would keep it from
working, but you would need to verify that.
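
As a rough summary of the statement above, the following sketch classifies a PostgreSQL version string by its major series. It only restates what is said in this thread (9.x known to work; 10.x and 11.x need a different driver and are unverified). The function name postgres_status and the returned labels are hypothetical.

```python
def postgres_status(version: str) -> str:
    """Classify a PostgreSQL version string such as "9.6.11" or "11.1".

    The categories mirror this thread only; they are not an official
    compatibility matrix.
    """
    major = int(version.split(".")[0])
    if major == 9:
        return "known to work"
    if major in (10, 11):
        return "needs a different driver; verify yourself"
    return "not discussed"


# The releases listed on the PostgreSQL versioning page quoted below.
releases = ["11.1", "10.6", "9.6.11", "9.5.15", "9.4.20", "9.3.25"]
summary = {v: postgres_status(v) for v in releases}
```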

Karl


On Wed, Dec 5, 2018 at 6:46 AM Singh,Jasvinder 
wrote:

> Karl - My question was which version of PostgreSQL is compatible
> with ManifoldCF. According to the PostgreSQL site, versions go up to 11.1:
>
> https://www.postgresql.org/support/versioning/
> PostgreSQL 11.1, 10.6, 9.6.11, 9.5.15, 9.4.20, and 9.3.25 Released!
>
> So my question is: which version of PostgreSQL from that site should I
> download so that it is compatible with the PostgreSQL driver 42.1.3 you
> mentioned below?


RE: ManifoldCF and PostgradeSQL Version Compatability

2018-12-05 Thread Singh,Jasvinder
Karl - My question was which version of PostgreSQL is compatible with 
ManifoldCF. According to the PostgreSQL site, versions go up to 11.1:

https://www.postgresql.org/support/versioning/
PostgreSQL 11.1, 10.6, 9.6.11, 9.5.15, 9.4.20, and 9.3.25 Released!

So my question is: which version of PostgreSQL from that site should I download 
so that it is compatible with the PostgreSQL driver 42.1.3 you mentioned below?




-Original Message-
From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Wednesday, December 05, 2018 4:43 PM
To: dev
Subject: Re: ManifoldCF and PostgradeSQL Version Compatability

Hi Jasvinder,

The Postgresql driver we currently download for testing and when customers
build ManifoldCF themselves is:




Karl




CEB is now Gartner. Learn 
more.
CEB India Private Ltd. Registration No: U741040HR2004PTC035324. Registered 
office: 6th Floor, Tower-B, Building No.10, DLF Cyber City, Gurgaon 122002, 
Haryana India If you are not the intended recipient or have received this 
message in error, please notify the sender and permanently delete this message 
and any attachments.


Re: Deployment model choice

2018-12-05 Thread Karl Wright
Hi Jasvinder,

The multiprocess models use an application server (Tomcat or Jetty) to
deliver the ManifoldCF UI services, and a separate family of one or more
agents processes that do the actual crawling.  The two models are distinct
because one uses files for process synchronization (deprecated), and the
other uses ZooKeeper.

Choice (4) is there for people who want to develop their own custom
deployment strategy.  It is not recommended for people who are not
familiar with ManifoldCF.

Karl


On Wed, Dec 5, 2018 at 4:49 AM jasvinder.si...@gartner.com <
jasvinder.si...@gartner.com> wrote:

> Below are the deployment models listed in the documentation:
>
> 1)Quick-start single process model
> 2)Single-process deployable war
> 3)Simplified multi-process model
>    a)Simplified multi-process model using file-based synchronization
>    b)Simplified multi-process model using ZooKeeper-based synchronization
>
> 4)Command-driven multi-process model
>
> Can you please elaborate on the use cases for models 3 (both a and b) and 4
> listed above?
>
> From the documentation, I could not make out in which situations to use
> models 3 and 4. And if 3 (b) is the recommended one in most places, what is
> the exact purpose of model 4?
>
>
>
>


Re: ManifoldCF and PostgradeSQL Version Compatability

2018-12-05 Thread Karl Wright
Hi Jasvinder,

The Postgresql driver we currently download for testing and when customers
build ManifoldCF themselves is:




Karl


On Wed, Dec 5, 2018 at 4:43 AM jasvinder.si...@gartner.com <
jasvinder.si...@gartner.com> wrote:

> The documentation states:
>
> The PostgreSQL JDBC driver included with ManifoldCF is known to work with
> version 9.1, so that version is the currently recommended one.
>
> But according to https://www.postgresql.org/download/, version 9.1 is
> not downloadable because it falls under the unsupported versions. Can you
> suggest which version you recommend today? I have to use PostgreSQL for
> the simplified multi-process model using ZooKeeper-based synchronization.
>
> Also, please confirm that I have to update the configuration in the
> properties-global.xml file, and not the properties.xml file, in
> D:\manifoldcftunk\dist\multiprocess-zk-example-proprietary.
> The single-process example in the application uses properties.xml
> to override values, but nowhere in the documentation do I see where to
> update the values for the simplified multi-process model using
> ZooKeeper-based synchronization.
> Looking at the properties-global.xml file, it has the config below, so I
> am assuming it needs to be updated:
>
>  value="org.apache.manifoldcf.core.database.DBInterfaceHSQLDB"/>
>
> Thanks in advance for help
>


Re: WebSite Crawling with Customheader Info

2018-12-05 Thread Karl Wright
Hi Jasvinder,

That sounds like a customization you would have to make.  The Web Connector
is designed for generic web crawling, not as the basis for a custom
connector.  Indeed, I would strongly suggest that you not try to use the
web connector to retrieve your repository content but rather develop a
connector of your own meant to work with the system you are trying to get
data out of.

Thanks,
Karl


On Wed, Dec 5, 2018 at 4:54 AM jasvinder.si...@gartner.com <
jasvinder.si...@gartner.com> wrote:

> Can you please suggest whether there is a way to pass custom header info
> with the seed URL, so that the target application can determine that the
> request is coming from a crawl?
> For example, if my site is xxx.yy.com, then when a request from ManifoldCF
> hits it for crawling, can I pass some header that I can parse in my
> xxx.yy.com site to determine that the request is for crawling? I could then
> customize my code for purposes such as bypassing authentication. The reason
> I am asking is that I could not map my login sequence in ManifoldCF, so I
> am looking for alternatives.
> (Since it's my Internet site, it's OK for me to bypass authentication
> based on some custom header.)
>
> Thanks in advance for help
>


WebSite Crawling with Customheader Info

2018-12-05 Thread jasvinder . singh
Can you please suggest whether there is a way to pass custom header info with 
the seed URL, so that the target application can determine that the request is 
coming from a crawl?
For example, if my site is xxx.yy.com, then when a request from ManifoldCF hits 
it for crawling, can I pass some header that I can parse in my xxx.yy.com site 
to determine that the request is for crawling? I could then customize my code 
for purposes such as bypassing authentication. The reason I am asking is that 
I could not map my login sequence in ManifoldCF, so I am looking for 
alternatives.
(Since it's my Internet site, it's OK for me to bypass authentication based on 
some custom header.)

Thanks In Advance for help


Deployment model choice

2018-12-05 Thread jasvinder . singh
Below are the deployment models listed in the documentation:

1)Quick-start single process model
2)Single-process deployable war
3)Simplified multi-process model
   a)Simplified multi-process model using file-based synchronization
   b)Simplified multi-process model using ZooKeeper-based synchronization

4)Command-driven multi-process model

Can you please elaborate on the use cases for models 3 (both a and b) and 4 
listed above?

From the documentation, I could not make out in which situations to use 
models 3 and 4. And if 3 (b) is the recommended one in most places, what is 
the exact purpose of model 4?





ManifoldCF and PostgradeSQL Version Compatability

2018-12-05 Thread jasvinder . singh
The documentation states:

The PostgreSQL JDBC driver included with ManifoldCF is known to work with 
version 9.1, so that version is the currently recommended one. 

But according to https://www.postgresql.org/download/, version 9.1 is not 
downloadable because it falls under the unsupported versions. Can you suggest 
which version you recommend today? I have to use PostgreSQL for the simplified 
multi-process model using ZooKeeper-based synchronization.

Also, please confirm that I have to update the configuration in the 
properties-global.xml file, and not the properties.xml file, in 
D:\manifoldcftunk\dist\multiprocess-zk-example-proprietary.
The single-process example in the application uses properties.xml to override 
values, but nowhere in the documentation do I see where to update the values 
for the simplified multi-process model using ZooKeeper-based synchronization. 
Looking at the properties-global.xml file, it has the config below, so I am 
assuming it needs to be updated:



Thanks in advance for help
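
For illustration only, here is a minimal sketch of swapping the HSQLDB database class for a PostgreSQL one in a properties-style XML document. Several things are assumptions, not facts from this thread: the PostgreSQL class name DBInterfacePostgreSQL should be confirmed against your ManifoldCF release, the property name in the sample is a placeholder (the real name is cut off in the archived message), and the root element name is hypothetical. The sketch matches on the value attribute so it does not need to know the property name.

```python
import xml.etree.ElementTree as ET

# Value seen in the thread.
HSQLDB = "org.apache.manifoldcf.core.database.DBInterfaceHSQLDB"
# Assumption: the PostgreSQL counterpart; confirm against your release.
POSTGRESQL = "org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"


def swap_property_value(xml_text: str, old: str, new: str) -> str:
    """Replace the value of every <property> whose value equals `old`."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.get("value") == old:
            prop.set("value", new)
    return ET.tostring(root, encoding="unicode")


# Hypothetical file content: "example.database.class" is a placeholder
# name, and <configuration> is an assumed root element.
sample = (
    "<configuration>"
    '<property name="example.database.class" value="' + HSQLDB + '"/>'
    "</configuration>"
)
updated = swap_property_value(sample, HSQLDB, POSTGRESQL)
```

Note that ElementTree's default parser drops XML comments, so for a real configuration file it may be safer to edit the one line by hand and keep a backup.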


[jira] [Resolved] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1562.
-
Resolution: Not A Problem



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709779#comment-16709779
 ] 

Karl Wright commented on CONNECTORS-1562:
-

"Dynamic rescan" is the same thing as "continuous crawling".  You don't want 
that if you want document deletions to be detected on a schedule.  In fact, 
jobs never end in this mode; they run indefinitely.  There's a whole book 
chapter on this and the user guide also mentions this:

http://manifoldcf.apache.org/release/release-2.11/en_US/end-user-documentation.html#jobs




[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709712#comment-16709712
 ] 

Tim Steenbeke commented on CONNECTORS-1562:
---

The crawling is scheduled as a dynamic rescan of the documents.


!Screenshot from 2018-12-05 09-01-46.png!



[jira] [Updated] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Tim Steenbeke (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Steenbeke updated CONNECTORS-1562:
--
Attachment: Screenshot from 2018-12-05 09-01-46.png
