[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711866#comment-16711866
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi [~SteenTi], the only thing I have not been able to verify is whether the ES 
connector is working properly.  What I'd like you to do is set up your 
sample job so that it is small enough to crawl quickly, and use the Null 
output connector rather than the ES one.  Please then make sure you know how 
to execute the web crawl jobs, and confirm that you see the same things I saw 
above.  Once you get to that point, we can verify whether or not ES is doing 
the right thing.

Thanks again.

> Document removal Elastic
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
> Issue Type: Bug
> Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
>   Elasticsearch 6.3.2
>   Web input connector
>   Elastic output connector
>   Job crawls website input and outputs content to Elastic
> Reporter: Tim Steenbeke
> Assignee: Karl Wright
> Priority: Critical
> Labels: starter
> Attachments: Screenshot from 2018-12-05 09-01-46.png
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> job with changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711871#comment-16711871
 ] 

Karl Wright commented on CONNECTORS-1562:
-

[~DonaldVdD], please see above.



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711862#comment-16711862
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Next, I modified the job as follows:

- Added the "http://manifoldcf.apache.org" URL to the seeds again
- Went to the "Schedule" tab
- Created a schedule record that had the 48-minute value and no other minute 
value, and clicked the "Add" button for schedule records
- Clicked on the "Connection" tab and selected the "Start when schedule window 
starts" option
- Clicked "save"
- Went to the Job Status page and refreshed until 1:48 PM
- Saw that the job started at 1:48 PM

I conclude that the scheduler works properly too.
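
As a rough illustration of the schedule record described above (not ManifoldCF's actual scheduler code), the following Python sketch computes the next time whose minute field matches a single configured value, such as 48. The function name `next_window_start`, its parameters, and the sample "saved at" time are hypothetical, and the sketch assumes a record that constrains only the minute.

```python
from datetime import datetime, timedelta


def next_window_start(now: datetime, minute: int) -> datetime:
    """Return the first time at or after `now` whose minute equals `minute`.

    Hypothetical helper: models a schedule record that constrains only the minute.
    """
    if not 0 <= minute <= 59:
        raise ValueError("minute must be in 0..59")
    # Candidate in the current hour, truncated to the whole minute.
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate < now:
        # The matching minute has already passed in this hour; use the next hour.
        candidate += timedelta(hours=1)
    return candidate


if __name__ == "__main__":
    saved_at = datetime(2018, 12, 6, 13, 40)  # assumed time the job was saved
    print(next_window_start(saved_at, 48))
```

With a single minute value the window recurs once per hour, which is why refreshing the Job Status page until the 48th minute is enough to observe a start.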




[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711846#comment-16711846
 ] 

Karl Wright commented on CONNECTORS-1562:
-

I just did a test run as follows:

(1) Created a web repository connection (using all defaults except the required 
email address)
(2) Created a null output connection (again, all defaults)
(3) Created a job that used these two connections, with a maximum link count of 
2, no maximum redirection count, and a seed of "http://manifoldcf.apache.org"
(4) Ran the job manually to completion
(5) Immediately got a simple history report for the web connection:

{code}
Start Time | Activity | Identifier | Result Code | Bytes | Time | Result Description
12/6/18 1:33:10 PM | output notification (Null) |  | OK | 0 | 1 |
12/6/18 1:33:00 PM | job end | 1544121003866(test) |  | 0 | 1 |
12/6/18 1:32:54 PM | document ingest (Null) | http://manifoldcf.apache.org/en_US/mail.html | OK | 11212 | 1 | "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:54 PM | process | http://manifoldcf.apache.org/en_US/mail.html | OK | 11212 | 26 |
12/6/18 1:32:53 PM | fetch | http://manifoldcf.apache.org/en_US/mail.html | 200 | 11212 | 365 |
12/6/18 1:32:49 PM | document ingest (Null) | http://manifoldcf.apache.org/en_US/who.html | OK | 9634 | 1 | "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:49 PM | process | http://manifoldcf.apache.org/en_US/who.html | OK | 9634 | 17 |
12/6/18 1:32:48 PM | fetch | http://manifoldcf.apache.org/en_US/who.html | 200 | 9634 | 339 |
12/6/18 1:32:44 PM | document ingest (Null) | http://manifoldcf.apache.org/en_US/release-documentation.html | OK | 9349 | 1 | "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:44 PM | process | http://manifoldcf.apache.org/en_US/release-documentation.html | OK | 9349 | 10 |
12/6/18 1:32:43 PM | fetch | http://manifoldcf.apache.org/en_US/release-documentation.html | 200 | 9349 | 338 |
12/6/18 1:32:39 PM | document ingest (Null) | http://manifoldcf.apache.org/en_US/security.html | OK | 13725 | 1 | "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:39 PM | process | http://manifoldcf.apache.org/en_US/security.html | OK | 13725 | 15 |
12/6/18 1:32:38 PM | fetch | http://manifoldcf.apache.org/en_US/security.html | 200 | 13725 | 417 |
12/6/18 1:32:34 PM | document ingest (Null) | http://manifoldcf.apache.org/en_US/books-and-presentations.html | OK | 11419 | 1 | "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:34 PM | process | http://manifoldcf.apache.org/en_US/books-and-presentations.html | OK | 11419 | 14 |
12/6/18 1:32:33 PM | fetch | http://manifoldcf.apache.org/en_US/books-and-presentations.html | 200 | 11419 | 371 |
12/6/18 1:32:31 PM | document ingest (Null) | http://manifoldcf.apache.org/en_US/download.html | OK | 144128 | 1 | "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:31 PM | process | http://manifoldcf.apache.org/en_US/download.html | OK | 144128 | 8 |
12/6/18 1:32:28 PM | fetch | http://manifoldcf.apache.org/en_US/download.html | 200 | 144128 | 2443 |
{code}
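
To compare a report like this against what an output connector actually holds, one option is to reduce the history to the set of documents that were ingested and not later deleted. The sketch below is a minimal Python illustration, assuming the report rows have already been parsed into (activity, identifier) pairs and put in chronological order (the report itself lists the newest entries first). The function name `indexed_documents` and the sample rows are hypothetical.

```python
from typing import Iterable, Set, Tuple


def indexed_documents(events: Iterable[Tuple[str, str]]) -> Set[str]:
    """Replay (activity, identifier) pairs, oldest first, and return the
    identifiers that should still be present in the output index."""
    present: Set[str] = set()
    for activity, identifier in events:
        if activity.startswith("document ingest"):
            present.add(identifier)
        elif activity.startswith("document deletion"):
            present.discard(identifier)
        # Other activities (fetch, process, job end, ...) do not change the index.
    return present


if __name__ == "__main__":
    # Hypothetical sample: two ingests from a first run, one deletion from a second.
    sample = [
        ("fetch", "http://manifoldcf.apache.org/en_US/who.html"),
        ("document ingest (Null)", "http://manifoldcf.apache.org/en_US/who.html"),
        ("document ingest (Null)", "http://manifoldcf.apache.org/en_US/mail.html"),
        ("document deletion (Null)", "http://manifoldcf.apache.org/en_US/who.html"),
    ]
    print(sorted(indexed_documents(sample)))
```

If the resulting set differs from what the target index contains after the same runs, the discrepancy points at the output side rather than at the crawl.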

Next:

(1) I modified the job to remove the one seed I had, and saved it
(2) Ran the job again
(3) Immediately retrieved a Simple History report:

{code}
12/6/18 1:35:20 PM | output notification (Null) |  | OK | 0 | 1 |
12/6/18 1:35:10 PM | job end | 1544121003866(test) |  | 0 | 1 |
12/6/18 1:35:00 PM | document deletion (Null) | http://manifoldcf.apache.org/ja_JP/release-documentation.html | OK | 0 | 1 |
12/6/18 1:35:00 PM | document deletion (Null) | http://manifoldcf.apache.org/skin/profile.css | OK | 0 | 1 |
12/6/18 1:35:00 PM | document deletion (Null) | http://manifoldcf.apache.org/ja_JP/download.html | OK | 0 | 1 |
12/6/18 1:35:00 PM | document deletion (Null) | http://manifoldcf.apache.org/en_US/developer-resources.html | OK | 0 | 1 |
12/6/18 1:35:00 PM | document deletion (Null) | http://manifoldcf.apache.org/en_US/who.html | OK | 0 | 1 |
12/6/18 1:35:00 PM | document deletion (Null) | http://manifoldcf.apache.org/ja_JP/developer-resources.html | OK | 0 | 1 |
12/6/18 1:35:00 PM | document deletion (Null) | http://manifoldcf.apache.org/ja_JP/index.html | OK | 0 | 1 |
12/6/18 1:35:00 PM   

[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711461#comment-16711461
 ] 

Donald Van den Driessche commented on CONNECTORS-1562:
--

Hi Karl

I'm a colleague of Tim; we work together on this project.
Thank you for your time. We are interested to see what the results show.



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711409#comment-16711409
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi Tim,
All the functionality you say doesn't work is exercised by integration tests.  
I will happily do a walkthrough today at some point to confirm this.  It is an 
extremely busy day for me, however, so please be patient.



Re: Deployment model choice - Tomcat Version

2018-12-06 Thread Karl Wright
There are no particular version requirements for Tomcat.  Please let us know
of any problems you run into.

Karl

On Thu, Dec 6, 2018, 6:01 AM Singh,Jasvinder wrote:

> Karl - Thanks for the reply. I have a question about Tomcat: is there any
> version restriction? I don't see anything mentioned in the documentation
> about this. The latest version is Tomcat 9.
>
>
> -Original Message-
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: Wednesday, December 05, 2018 4:47 PM
> To: dev
> Subject: Re: Deployment model choice
>
> Hi Jasvinder,
>
> The multiprocess model(s) use an application server (Tomcat or Jetty) to
> deliver the ManifoldCF UI services, and a separate family of one or more
> agents processes that do the actual crawling.  The two models are distinct
> because one uses files for process synchronization (deprecated), and the
> other uses ZooKeeper.
>
> Choice (4) is there for people who want to develop their own custom
> deployment strategy.  It is not recommended for people who are not at all
> familiar with ManifoldCF.
>
> Karl
>
>
> On Wed, Dec 5, 2018 at 4:49 AM jasvinder.si...@gartner.com <
> jasvinder.si...@gartner.com> wrote:
>
> > Below are the deployment models listed in the documentation:
> >
> > 1) Quick-start single process model
> > 2) Single-process deployable war
> > 3) Simplified multi-process model
> >    a) Simplified multi-process model using file-based synchronization
> >    b) Simplified multi-process model using ZooKeeper-based synchronization
> > 4) Command-driven multi-process model
> >
> > Can you please elaborate on the use cases for model 3 (both a and b) and
> > model 4 listed above?
> >
> > From the documentation, I could not make out in which situations to use
> > models 3 and 4. And if 3 (b) is the recommended one in most places, what
> > is the exact purpose of model 4?
> >
> >
> >
> >
>
> 
>
> CEB is now Gartner. Learn more.<
> https://www.gartner.com/technology/about/ceb-acquisition.jsp>
> CEB India Private Ltd. Registration No: U741040HR2004PTC035324. Registered
> office: 6th Floor, Tower-B, Building No.10, DLF Cyber City, Gurgaon 122002,
> Haryana India If you are not the intended recipient or have received this
> message in error, please notify the sender and permanently delete this
> message and any attachments.
>


RE: Deployment model choice - Tomcat Version

2018-12-06 Thread Singh,Jasvinder
Karl - Thanks for the reply. I have a question about Tomcat: is there any 
version restriction? I don't see anything mentioned in the documentation 
about this. The latest version is Tomcat 9.


-Original Message-
From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Wednesday, December 05, 2018 4:47 PM
To: dev
Subject: Re: Deployment model choice

Hi Jasvinder,

The multiprocess model(s) use an application server (Tomcat or Jetty) to
deliver the ManifoldCF UI services, and a separate family of one or more
agents processes that do the actual crawling.  The two models are distinct
because one uses files for process synchronization (deprecated), and the
other uses ZooKeeper.

Choice (4) is there for people who want to develop their own custom
deployment strategy.  It is not recommended for people who are not at all
familiar with ManifoldCF.
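
The distinctions above can be summarized in a small lookup table. This is only an illustrative Python sketch: the class, field, and function names are made up for this message, and the entries record just what is stated in this thread, not the full documentation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeploymentModel:
    name: str
    synchronization: Optional[str]  # None where this thread does not say
    deprecated: bool = False
    custom_strategy_only: bool = False  # intended for hand-built deployments


# Entries reflect only the statements made in this thread.
MODELS: List[DeploymentModel] = [
    DeploymentModel("Simplified multi-process (file-based)", "files",
                    deprecated=True),
    DeploymentModel("Simplified multi-process (ZooKeeper-based)", "ZooKeeper"),
    DeploymentModel("Command-driven multi-process", None,
                    custom_strategy_only=True),
]


def candidates_for_newcomers(models: List[DeploymentModel]) -> List[str]:
    """Drop models that are deprecated or meant for custom deployment strategies."""
    return [m.name for m in models
            if not m.deprecated and not m.custom_strategy_only]


if __name__ == "__main__":
    print(candidates_for_newcomers(MODELS))
```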

Karl


On Wed, Dec 5, 2018 at 4:49 AM jasvinder.si...@gartner.com <
jasvinder.si...@gartner.com> wrote:

> Below are the deployment models listed in the documentation:
>
> 1) Quick-start single process model
> 2) Single-process deployable war
> 3) Simplified multi-process model
>    a) Simplified multi-process model using file-based synchronization
>    b) Simplified multi-process model using ZooKeeper-based synchronization
> 4) Command-driven multi-process model
>
> Can you please elaborate on the use cases for model 3 (both a and b) and
> model 4 listed above?
>
> From the documentation, I could not make out in which situations to use
> models 3 and 4. And if 3 (b) is the recommended one in most places, what
> is the exact purpose of model 4?
>
>
>
>


