Re: ManifoldCF database model

2018-10-16 Thread Karl Wright
Hi, you can look at ManifoldCF in Action.  There's a link to it on the
ManifoldCF page.

However, you should be aware that we consider it a severe bug if ManifoldCF
doesn't clean up after itself.  The only time that is not expected is when
people write buggy connectors or mess with database tables themselves.  I
would urge you to examine the Simple History report and try to come up with
a reproducible test case rather than trying to reverse engineer MCF.
Should you go directly to the database, we will be unable to give you any
support.

Thanks,
Karl


On Tue, Oct 16, 2018 at 11:51 AM Gustavo Beneitez <
gustavo.benei...@gmail.com> wrote:

> Hi all,
>
> how do you do? I was wondering if there is any technical document that
> explains the meaning of each table in the database and the relationships
> between documents, repositories, jobs and output connectors (some kind of
> database model).
>
> We are facing some "garbage issues": jobs are created, duplicated, related
> to transformations, linked to outputs (Elasticsearch), run and finally
> deleted. In the end, documents that should also have been deleted from the
> output connector are sometimes still there. We don't know whether they
> remain because they point to an existing job, because a job ended
> unexpectedly, or because of some other failure.
>
> We need to understand the database model in order to check when documents
> stored in Elasticsearch can be safely removed because they are no longer
> referenced by any process. This check would be run periodically, every
> week, for example.
>
> Thanks in advance!
>


Re: Create documents from transformation connector

2018-10-16 Thread Karl Wright
Hi Julien,

That is one thing you cannot do with the MCF pipeline. All documents must
originate in a RepositoryConnector.  The repository connector can create
multiple subdocuments itself, if need be, but the rest of the pipeline does
not allow further splitting.

One way around this: If the second document is intended for a second
output, you can write a transformer that just converts the original
document into the "new" one, and then create your pipeline so that your
transformer is in the path to the second output but not the first.
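
The routing described above can be pictured with a small, self-contained
sketch. It is plain Python, not ManifoldCF code: the Output class, the
to_summary transformer and run_pipeline are all hypothetical names invented
for illustration. The point is only that each document goes in once, and the
transformer maps one document to one document on the second path.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]


@dataclass
class Output:
    """Hypothetical stand-in for an output connection; it just records documents."""
    name: str
    received: List[Document] = field(default_factory=list)

    def send(self, doc: Document) -> None:
        self.received.append(doc)


def to_summary(doc: Document) -> Document:
    """Hypothetical transformer: exactly one document in, one document out."""
    return {"id": doc["id"], "body": doc["body"][:20].upper()}


def run_pipeline(
    docs: Iterable[Document],
    first: Output,
    second: Output,
    transformer: Callable[[Document], Document],
) -> None:
    for doc in docs:
        first.send(doc)                # path to the first output: no transformer
        second.send(transformer(doc))  # path to the second output: transformer in front


primary = Output("primary")
secondary = Output("secondary")
run_pipeline([{"id": "doc-1", "body": "hello world"}], primary, secondary, to_summary)
print(primary.received)
print(secondary.received)
```

Note that the transformer never produces more documents than it receives,
which matches the restriction that the pipeline cannot split documents.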

Your description of the problem argues, however, for adding archive
disassembly to the file system connector, frankly.
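
As a rough idea of what archive disassembly at the repository level could
look like, here is a sketch using Python's standard zipfile module. It does
not use any ManifoldCF API; disassemble_archive, the "!" separator in the
identifiers and the sample path are assumptions made for the example. Each
archive member becomes its own (identifier, content) pair, which is the kind
of subdocument a repository connector would hand to the rest of the pipeline.

```python
import io
import zipfile
from typing import Iterator, Tuple


def disassemble_archive(archive_id: str, data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield one (identifier, content) pair per file inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            # Hypothetical identifier scheme: archive path + "!" + member path.
            yield f"{archive_id}!{info.filename}", archive.read(info)


def make_sample_zip() -> bytes:
    """Build a tiny in-memory archive so the sketch runs without files on disk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.txt", "first")
        archive.writestr("docs/b.txt", "second")
    return buffer.getvalue()


subdocuments = list(disassemble_archive("/data/sample.zip", make_sample_zip()))
for identifier, content in subdocuments:
    print(identifier, len(content))
```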

Karl


On Tue, Oct 16, 2018 at 12:09 PM Julien wrote:

> Hi Karl,
>
> I was wondering if there is a simple way to generate multiple documents
> from a transformation connector.
>
> My use case is the following:
> I have some archive files, and I would like to create a transformation
> connector that extracts the files within each archive and creates a new
> MCF document for each extracted file, so that they are processed by the
> next connectors of my job pipeline.
>
> What would be the best approach in your opinion?
>
> Regards,
> Julien
>


Create documents from transformation connector

2018-10-16 Thread Julien
Hi Karl,

I was wondering if there is a simple way to generate multiple documents from a 
transformation connector.

My use case is the following:
I have some archive files, and I would like to create a transformation
connector that extracts the files within each archive and creates a new
MCF document for each extracted file, so that they are processed by the
next connectors of my job pipeline.

What would be the best approach in your opinion?

Regards,
Julien





ManifoldCF database model

2018-10-16 Thread Gustavo Beneitez
Hi all,

how do you do? I was wondering if there is any technical document that
explains the meaning of each table in the database and the relationships
between documents, repositories, jobs and output connectors (some kind of
database model).

We are facing some "garbage issues": jobs are created, duplicated, related
to transformations, linked to outputs (Elasticsearch), run and finally
deleted. In the end, documents that should also have been deleted from the
output connector are sometimes still there. We don't know whether they
remain because they point to an existing job, because a job ended
unexpectedly, or because of some other failure.

We need to understand the database model in order to check when documents
stored in Elasticsearch can be safely removed because they are no longer
referenced by any process. This check would be run periodically, every
week, for example.

Thanks in advance!


[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-10-16 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651950#comment-16651950
 ] 

Karl Wright commented on CONNECTORS-1546:
-

I agree with your decision.


> Optimize Elasticsearch performance by removing 'forcemerge'
> ---
>
> Key: CONNECTORS-1546
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1546
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Elastic Search connector
> Reporter: Hans Van Goethem
> Assignee: Steph van Schalkwyk
> Priority: Major
>
> After crawling with ManifoldCF, forcemerge is applied to optimize the
> Elasticsearch index. This optimization makes Elasticsearch faster for
> read operations but not for write operations. On the contrary, write
> performance becomes worse after every forcemerge.
> Can you remove this forcemerge in ManifoldCF to optimize performance for
> recurrent crawling to Elasticsearch?
> If someone needs this forcemerge, it can be applied manually against
> Elasticsearch directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-10-16 Thread Steph van Schalkwyk (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651942#comment-16651942
 ] 

Steph van Schalkwyk commented on CONNECTORS-1546:
-

Hans is correct. I would remove it. It can mess up merging later if not used 
correctly. It may also take a long time to complete. 

I'm going to upload a patch or two soon and will remove it if you concur.

BTW, from the ES 6.4 doc:

"Force merge should only be called against read-only indices. Running force
merge against a read-write index can cause very large segments to be produced
(>5Gb per segment), and the merge policy will never consider it for merging
again until it mostly consists of deleted docs. This can cause very large
segments to remain in the shards."

But I agree. It isn't up to MCF to decide what to do as it does impact 
ingesting.

Hans may want to try this before ingesting:

PUT /_cluster/settings
{"transient" : {"indices.store.throttle.type" : "none"}}

and after ingesting:

PUT /_cluster/settings
{"transient" : {"indices.store.throttle.type" : "merge"}}
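
The two settings calls above could be scripted around an ingest run roughly
as follows. This is only a sketch: the ES_URL value and the throttle_request
helper are assumptions, and the requests are built but not sent. As far as I
know, the store throttling settings were removed in Elasticsearch 2.0, so a
6.x cluster may reject this setting; please check the documentation for your
version before relying on it.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster


def throttle_request(throttle_type: str) -> urllib.request.Request:
    """Build (but do not send) the PUT /_cluster/settings request quoted above."""
    body = {"transient": {"indices.store.throttle.type": throttle_type}}
    return urllib.request.Request(
        url=f"{ES_URL}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


before_ingest = throttle_request("none")   # disable store throttling
after_ingest = throttle_request("merge")   # restore throttling of merges

# To actually apply a setting: urllib.request.urlopen(before_ingest)
print(before_ingest.get_method(), before_ingest.full_url)
print(before_ingest.data.decode("utf-8"))
```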
 

 



[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-10-16 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651761#comment-16651761
 ] 

Karl Wright commented on CONNECTORS-1546:
-

Hi [~st...@remcam.net], can you comment on this?



[jira] [Assigned] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-10-16 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1546:
---

Assignee: Steph van Schalkwyk



[jira] [Created] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-10-16 Thread Hans Van Goethem (JIRA)
Hans Van Goethem created CONNECTORS-1546:


 Summary: Optimize Elasticsearch performance by removing 'forcemerge'
 Key: CONNECTORS-1546
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1546
 Project: ManifoldCF
 Issue Type: Improvement
 Components: Elastic Search connector
 Reporter: Hans Van Goethem


After crawling with ManifoldCF, forcemerge is applied to optimize the
Elasticsearch index. This optimization makes Elasticsearch faster for
read operations but not for write operations. On the contrary, write
performance becomes worse after every forcemerge.

Can you remove this forcemerge in ManifoldCF to optimize performance for
recurrent crawling to Elasticsearch?

If someone needs this forcemerge, it can be applied manually against
Elasticsearch directly.
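
For anyone who does want to run the merge by hand, a minimal sketch of the
call follows. It targets the Elasticsearch force merge endpoint
(POST /<index>/_forcemerge with the max_num_segments parameter); the base
URL, the index name "crawl-index" and the forcemerge_request helper are
assumptions for illustration. The request is only built here, not sent.

```python
import urllib.parse
import urllib.request


def forcemerge_request(
    base_url: str, index: str, max_num_segments: int = 1
) -> urllib.request.Request:
    """Build (but do not send) a force merge request for a single index."""
    query = urllib.parse.urlencode({"max_num_segments": max_num_segments})
    path = urllib.parse.quote(index, safe="")
    return urllib.request.Request(
        url=f"{base_url}/{path}/_forcemerge?{query}",
        method="POST",
    )


# Assumed cluster address and index name.
request = forcemerge_request("http://localhost:9200", "crawl-index")
print(request.get_method(), request.full_url)

# To actually run the merge: urllib.request.urlopen(request)
```

Running this once crawling has finished and the index is no longer being
written to is in line with the caution about read-write indices in the
Elasticsearch documentation.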


