Jenkins build is back to normal : ManifoldCF-ant #727

2020-03-09 Thread Apache Jenkins Server
See 




Build failed in Jenkins: ManifoldCF-mvn #748

2020-03-09 Thread Apache Jenkins Server
See 


Changes:

[kwright] Merge CONNECTORS-1637 branch to trunk.  Thanks to Julien Massiera for 
the contribution!


--
[...truncated 3.23 MB...]
[...repeated javadoc "Generating" lines omitted...]
Building index for all the packages and classes...
Building index for all classes...
5 errors
39 warnings
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] ManifoldCF 2.16-SNAPSHOT ... SUCCESS [  3.611 s]
[INFO] ManifoldCF - Framework . SUCCESS [  0.053 s]
[INFO] ManifoldCF - Framework - Less Compiler . SUCCESS [  2.819 s]
[INFO] ManifoldCF - Framework - Core .. SUCCESS [ 11.419 s]
[INFO] ManifoldCF - Connector-Common .. SUCCESS [  7.293 s]
[INFO] ManifoldCF - Framework - UI Core ... SUCCESS [  3.686 s]
[INFO] ManifoldCF - Framework - Agents  SUCCESS [  7.633 s]
[INFO] ManifoldCF - Framework - Pull Agent  SUCCESS [ 15.180 s]
[INFO] ManifoldCF - Framework - Authority Servlet . SUCCESS [  2.658 s]
[INFO] ManifoldCF - Framework - API Servlet ... SUCCESS [  2.715 s]
[INFO] ManifoldCF - Framework - Authority Service . SUCCESS [  4.650 s]

[jira] [Created] (CONNECTORS-1638) JCIFS connector optional hidden files

2020-03-09 Thread Cihad Guzel (Jira)
Cihad Guzel created CONNECTORS-1638:
---

 Summary: JCIFS connector optional hidden files
 Key: CONNECTORS-1638
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1638
 Project: ManifoldCF
  Issue Type: Improvement
  Components: JCIFS connector
Affects Versions: ManifoldCF 2.15
Reporter: Cihad Guzel
Assignee: Cihad Guzel
 Fix For: ManifoldCF 2.16


The JCIFS connector should optionally index hidden files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


MongoDB Connector

2020-03-09 Thread Furkan KAMACI
Hi All,

Is anybody working on developing a MongoDB connector to fetch data
(not an output connector)?

Kind Regards,
Furkan KAMACI


Re: Delta deletion

2020-03-09 Thread Julien Massiera

Yes, I consider the Confluence connector complete.

As you suggest, I will try to use the "carrydown" mechanism to do what I 
want.


Thanks,

Julien

On 09/03/2020 13:59, Karl Wright wrote:

Do you consider the confluence connector in the branch complete?
If so I'll look at it as time permits later today.

As far as your proposal is concerned, maintaining lists of dependencies for
all documents is quite expensive.  We do this for hop counting and we
basically tell people to only use it if they must, because of the huge
amount of database overhead involved.  We also maintain "carrydown" data
which is accessible during document processing.  It is typically used for
ingestion, but maybe you could use that for a signal that child documents
should delete themselves or something.

Major crawling model changes are a gigantic effort; there are always many
things to consider and many problems encountered that need to be worked
around.  If you are concerned simply with the load on your API to handle
deletions, I'd suggest using one of the existing mechanisms for reducing
that.  But I can see no straightforward way to incrementally add dependency
deletion to the current framework.

Karl
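The carrydown mechanism Karl describes above can be sketched as a toy model. This is not the real ManifoldCF API; all names here are illustrative. The idea: the parent records data for each child while the parent is processed, and a child whose carrydown record is gone concludes its parent was deleted and deletes itself.

```java
import java.util.*;

// Toy model (NOT the ManifoldCF API; names are illustrative) of using
// carrydown-style data as a deletion signal for child documents.
public class CarrydownSignal {
    // childId -> data carried down from its parent during the last crawl
    static final Map<String, String> carrydown = new HashMap<>();

    // Called while processing a live parent: record data for a child.
    static void recordCarrydown(String childId, String data) {
        carrydown.put(childId, data);
    }

    // Called when the framework learns a parent is gone: drop its records.
    static void clearCarrydown(Collection<String> childIds) {
        childIds.forEach(carrydown::remove);
    }

    // Child-side check during reprocessing: no carrydown record means the
    // parent never re-emitted it, so the child should delete itself.
    static boolean shouldDeleteSelf(String childId) {
        return !carrydown.containsKey(childId);
    }

    public static void main(String[] args) {
        recordCarrydown("child1", "v1");
        recordCarrydown("child2", "v1");
        clearCarrydown(List.of("child1", "child2")); // parent deleted
        System.out.println(shouldDeleteSelf("child1")); // true
    }
}
```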



Re: Delta deletion

2020-03-09 Thread Karl Wright
Do you consider the confluence connector in the branch complete?
If so I'll look at it as time permits later today.

As far as your proposal is concerned, maintaining lists of dependencies for
all documents is quite expensive.  We do this for hop counting and we
basically tell people to only use it if they must, because of the huge
amount of database overhead involved.  We also maintain "carrydown" data
which is accessible during document processing.  It is typically used for
ingestion, but maybe you could use that for a signal that child documents
should delete themselves or something.

Major crawling model changes are a gigantic effort; there are always many
things to consider and many problems encountered that need to be worked
around.  If you are concerned simply with the load on your API to handle
deletions, I'd suggest using one of the existing mechanisms for reducing
that.  But I can see no straightforward way to incrementally add dependency
deletion to the current framework.

Karl



Re: Delta deletion

2020-03-09 Thread Julien Massiera

Hi Karl,

Now that I have finished the Confluence connector, I am getting back to the 
other one I was working on, and it would greatly help me to have your 
thoughts on my proposal below.


Thanks,
Julien

On 02/03/2020 16:40, julien.massi...@francelabs.com wrote:

Hi Karl,

Thanks for your answer.

Your explanations confirm what I anticipated about the way MCF currently 
implements its model. As you stated, this means that in order to use the 
_DELETE model properly, the seeding process has to provide the complete list of 
deleted documents.

Still, wouldn't it be a useful improvement to update the activities.deleteDocument 
method (or create an additional delete method) so that it automatically, and 
optionally, removes the documents referenced by a document id?

For instance, since the activities.addDocumentReference method already takes the document identifier 
of the "parent" document, couldn't we maintain a list of "child 
ids" in Postgres and use it during the delete process to delete them?

This would be very useful in the use case I already described, but I am sure it 
would also be useful for other types of connectors and/or future connectors. The 
benefits of such a modification increase with the number of crawled documents.

Here is an illustration of the benefits of this MCF modification:

With my current connector, if my first crawl ingests 1M documents and on the 
delta crawl only 1 document that has 2 children is deleted, the connector must rely on the 
processDocuments method to check the version of each of the 1M documents to 
figure out and delete the 3 concerned ones (so at least 1M calls to the API of 
the targeted repository). With the suggested optional modification, the seeding 
process would use the delta API of the targeted repository and declare the 
parent document (only one API call), then the processDocuments method would be 
triggered only once to check the version of the document (one more API 
call), figure out that it no longer exists, and delete it along with its 2 
children. It's 2 API calls vs 1M... even if, on the framework side, we have one more 
request to perform against Postgres, I think it is worth the processing time.

What do you think?

Julien
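The proposal above can be sketched as a toy model (illustrative names only, not ManifoldCF code): when a reference is added, the framework remembers the parent-to-child edge; when a parent is deleted, the delete cascades to the recorded children.

```java
import java.util.*;

// Toy sketch of the proposed framework-side child tracking: record
// parent -> child edges at reference time, cascade deletes through them.
public class CascadeDelete {
    static final Map<String, List<String>> childrenOf = new HashMap<>();
    static final Set<String> deleted = new LinkedHashSet<>();

    // Record the parent -> child edge when a reference is added.
    static void addDocumentReference(String childId, String parentId) {
        childrenOf.computeIfAbsent(parentId, k -> new ArrayList<>()).add(childId);
    }

    // Deleting a parent recursively deletes all recorded descendants.
    static void deleteDocument(String id) {
        deleted.add(id);
        for (String child : childrenOf.getOrDefault(id, List.of())) {
            deleteDocument(child);
        }
        childrenOf.remove(id);
    }

    public static void main(String[] args) {
        addDocumentReference("childA", "parent");
        addDocumentReference("childB", "parent");
        deleteDocument("parent"); // one call removes all three documents
        System.out.println(deleted); // [parent, childA, childB]
    }
}
```

In a real implementation the edges would live in a database table rather than memory, which is exactly the per-document bookkeeping cost Karl warns about.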

-----Original Message-----
From: Karl Wright 
Sent: Saturday, February 29, 2020 15:51
To: dev 
Subject: Re: Delta deletion

Hi Julien,

First, ALL models rely on individual existence checks for documents.  That is, when your 
connector fetches a deleted document, the framework has to be told that the document is 
gone, or it will not be removed.  There is no "discovery" process for deleted 
documents other than seeding (and only when the model includes _DELETE).

The upshot of this is that IF your seeding method does not return all documents 
that have been removed THEN it cannot be a _DELETE model.

I hope this helps.

Karl
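The contract Karl states can be illustrated with a small sketch (not the ManifoldCF seeding API): under a _DELETE-style model, seeding itself must surface every removed document, which requires knowing the id sets from the previous and current crawls and emitting their difference as deletions.

```java
import java.util.*;

// Toy illustration of the _DELETE model contract: deletions reported at
// seeding time are exactly the previous ids minus the current ids.
// (Illustrative code, not the ManifoldCF seeding API.)
public class DeleteModelSeeding {
    static Set<String> deletedSince(Set<String> previousIds, Set<String> currentIds) {
        Set<String> gone = new TreeSet<>(previousIds);
        gone.removeAll(currentIds);
        return gone;
    }

    public static void main(String[] args) {
        Set<String> before = Set.of("doc1", "doc2", "doc3");
        Set<String> now = Set.of("doc1", "doc3");
        System.out.println(deletedSince(before, now)); // [doc2]
    }
}
```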


On Sat, Feb 29, 2020 at 8:10 AM  wrote:


Hi dev community,



I am trying to develop a connector for an API that exposes a
hierarchical tree of documents: each document can have child 
documents.

During the initial crawl, the child documents are referenced in the MCF
connector through the method
activities.addDocumentReference(childDocumentIdentifier,
parentDocumentIdentifier, parentDataNames, parentDataValues)

The API is able to provide delta modifications/deletions from a
provided date but, when a document that has children is deleted, the
API only returns the id of the document, not its children. On the MCF
connector side, I thought that, since I have referenced the children,
deleting the parent document would delete all its children with
it, but it appears that this is not the case.

So my question is: did I miss something? Is there another way to
perform delta deletions? Unfortunately, if I don't find a way to solve
this issue, I will not be able to take advantage of the delta feature,
and thus will have to use the "add_modify" connector type and test
every id on a delta crawl to figure out which ids are missing. This
would be a huge performance loss.



Regards,

Julien Massiera



--
Julien MASSIERA
Directeur développement produit
France Labs – Les experts du Search
Datafari – Vainqueur du trophée Big Data 2018 au Digital Innovation Makers 
Summit
www.francelabs.com