Re: Solr Repository Connector

2019-08-05 Thread Olivier Tavard
Hello,

We are currently working on this kind of repository connector for a customer. 
We plan to contribute the code to the MCF project if the customer allows us to 
do so legally; we will know by the end of this month or the beginning of next 
month.

For this to work, all the fields of the target Solr index must be stored; this 
condition is mandatory. You can take a look at the Solr entity processor of the 
Data Import Handler component: 
https://lucene.apache.org/solr/guide/8_0/uploading-structured-data-store-data-with-the-data-import-handler.html#entity-processors
We drew inspiration from it when developing the connector.
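For anyone curious about the approach, a minimal data-config.xml using the 
SolrEntityProcessor might look like the sketch below; the source URL, collection 
name and field list are placeholders, and only stored fields can be retrieved 
this way.

<dataConfig>
  <document>
    <!-- Reads documents out of an existing Solr index; only stored fields come back. -->
    <entity name="sourceSolr"
            processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/source_collection"
            query="*:*"
            rows="500"
            fl="id,title,content"/>
  </document>
</dataConfig>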

Best regards,

Olivier



> On 5 Aug 2019, at 16:38, Furkan KAMACI wrote:
> 
> Hi Dileepa,
> 
> Writing a custom repository connector would let you achieve your goal: read 
> from the Solr source and write directly to an output connector. 
> 
> You should check your requirements, i.e. which data sources you will connect 
> to. In your case, MCF may spare you significant integration pain compared to 
> many other ETL tools.
> 
> On the other hand, if you want to achieve federated search, you could search 
> across distributed indexes; otherwise it is a heterogeneous-source indexing 
> architecture. You can federate your search query to Solr without ingesting 
> the data anywhere else. By the way, MCF gives you document-level security; if 
> you federate instead, you have to handle that manually.
> 
> Kind Regards,
> Furkan KAMACI
> 
> On Mon, 5 Aug 2019 at 17:11, Dileepa Jayakody wrote:
> Hi Karl and all,
> 
> In my use-case, one of the data sources is an already populated Solr index 
> for an e-commerce website (customers, products & services). Apart from the 
> Solr index, I need to ingest several other heterogeneous data sources such as 
> PostgreSQL databases, CRM data etc. into the federated search index (the 
> output index will be either Solr or Elasticsearch; we haven't finalized it 
> yet, but I know both are supported in MCF as output connectors).
> 
> @Karl, based on your comments, I would appreciate your opinion on the 
> ingestion flow below:
> Solr repository/data-source > Solr schema transformations > 
> Solr/Elastic-search search-index
> 
> For such a scenario, do you think MCF is not the ideal option as the 
> ETL/ingestion tool? Should I go for a lower-level ETL tool such as Apache 
> NiFi? Or would writing an MCF Solr repository connector be useful to achieve 
> this? WDYT?
> 
> Thanks a lot.
> Regards,
> Dileepa
> 
> 
> 
> On Mon, Aug 5, 2019 at 3:40 PM Karl Wright  > wrote:
> If you are trying to extract data from a Solr index, I know of no way to do 
> that.
> Karl
> 
> 
> On Mon, Aug 5, 2019 at 9:08 AM Dileepa Jayakody  > wrote:
> Hi All,
> 
> Thanks for your replies.
> I'm looking for a repository connector. I've used the Solr output connector 
> before. But now what I need is to connect to a solr index as a repository and 
> retrieve the documents from there. So I need a Solr repository connector.
> 
> @Karl
> I will look at the Solr connector, but this is an output connector, isn't it? 
> Can I use it as a repository connector to retrieve docs?
> 
> Thanks,
> Dileepa 
> 
> On Mon, Aug 5, 2019 at 12:45 PM Cihad Guzel  > wrote:
> Hi Dileepa,
> 
> You can check the list of all MCF connectors at 
> https://manifoldcf.apache.org/release/release-2.13/en_US/included-connectors.html
> 
> MCF has a Solr output connector, but it is not a repository connector. If you 
> want to crawl Solr as a repository, you will need to write a new repository 
> connector.
> 
> Regards,
> Cihad Guzel
> 
> 
> On Mon, 5 Aug 2019 at 13:18, Dileepa Jayakody wrote:
> Hi All, 
> 
> I'm working on a project which needs to implement a federated search solution 
> with heterogeneous data repositories. One repository is a Solr index. I would 
> like to use ManifoldCF as the data ingestion engine in this project as I have 
> worked with MCF before.  
> 
> Does ManifoldCF have a Solr repository connector which I can use here, or 
> will I need to implement a new repository connector for Solr?
> Any guidance here is much appreciated. 
> 
> Thanks,
> Dileepa



Re: Error: Unexpected jobqueue status - record id X, expecting active status, saw 4 (MySQL compatible Database)

2019-05-21 Thread Olivier Tavard
Hi Markus,

We have the same error (with a PostgreSQL database). Did the error occur again 
since your last mail?
Did you change something in your MCF configuration to fix this?

Thanks,
Best regards,

Olivier 


> On 13 Feb 2019, at 13:58, Markus Schuch wrote:
> 
> Hi Karl,
>  
> we set the diagnostics logger to level debug.
>  
> I will get back when the error occurs again.
>  
> Cheers,
> Markus
> 
> 
> 
> On 12 Feb 2019, at 17:41, Karl Wright wrote:
> 
> Hi Marcus,
> 
> There's a properties.xml debugging logger you can enable that will keep track 
> of what's happening with transactions, so that when an error of this kind is 
> reported, information about why the situation is unexpected is dumped to the 
> log.  The logger is called "diagnostics", e.g. 
> "org.apache.manifoldcf.diagnostics".
> 
> Karl
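For reference, enabling the diagnostics logger Karl describes above amounts to 
one property in the logging configuration; a sketch using the logger name quoted 
above (check the exact form against the documentation for your MCF version):

<!-- properties.xml: dump transaction diagnostics at DEBUG level -->
<property name="org.apache.manifoldcf.diagnostics" value="DEBUG"/>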
> 
> 
> On Tue, Feb 12, 2019 at 10:53 AM Markus Schuch wrote:
> Hi,
> 
> we are seeing "Error: Unexpected jobqueue status - record id 1484612513829, 
> expecting active status, saw 4" from time to time.
> I didn’t report it, because we were running on an old MCF version, and there 
> were some bug reports relating to this error that are resolved in newer 
> versions.
> 
> Now we have upgraded to the latest and greatest and the error still occurs, 
> so I want to start tracking this down.
> 
> Our setup is:
> 
> ManifoldCF 2.12, running in a Docker Container based on Redhat Linux, OpenJDK 
> 8
> AWS RDS Database (Aurora MySQL -> 5.6 compatible)
> Single Process Setup
> 
> We run several jobs (run-once mode) overnight with schedule windows and max 
> runtime settings. Some of the time windows overlap.
> On normal days some jobs finish during their time window, some go to WAITING 
> and continue the next night.
> 
> The error mostly hits a SharePoint repository job.
> We have the impression that the error is somehow related to situations in 
> which service interruptions (e.g. connection issues) occur in other jobs.
> 
>  
> 
> Maybe all this is related to fundamental problems with the database or the 
> JDBC driver (Karl expressed his concerns about that in the possibly related 
> https://issues.apache.org/jira/browse/CONNECTORS-1581 ticket).
> 
> 
> On the other hand, there were similar bug reports in the past concerning an 
> unexpected jobqueue status, and they were resolved.
> 
> Maybe there is a chance this can also be analyzed and fixed, but I do not 
> really know where to start.
> 
> 
> Does anybody have ideas on how we can track this down / debug this to get to 
> the bottom of the problem?
> What could be the first step in the analysis?
> 
> Many thanks in advance
> Markus
> 
> May be related bugreports:
> 
>  
> 
> [0] https://issues.apache.org/jira/browse/CONNECTORS-1395 
> 
> [1] https://issues.apache.org/jira/browse/CONNECTORS-1180 
> 
> [2] https://issues.apache.org/jira/browse/CONNECTORS-590 
> 
> [3] https://issues.apache.org/jira/browse/CONNECTORS-246 
> 
>  
> 
> --
> 
> Markus Schuch
> 
> Web Business (T.IPB 26)
> 
>  
> 
> DB Systel GmbH
> 
> Jürgen-Ponto-Platz 1, 60329 Frankfurt a. Main
> 
> 
> 
> 



Re: web connector : links extraction issues

2018-11-15 Thread Olivier Tavard
Hi Karl,

Thanks for your answer. 
Could you elaborate on your answer, please? Just to be sure I understand: you 
mean there is no chance that the special characters could be escaped in the MCF 
code in this case, i.e. the website itself needs to escape the special 
characters, otherwise the extraction will not work in MCF. Am I right?

Best regards,

Olivier



> On 15 Nov 2018, at 12:57, Karl Wright wrote:
> 
> Hi Olivier,
> 
> You can create a ticket but I don't have a good solution for you in any case.
> 
> Karl
> 
> 
>> On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard 
>>  wrote:
>> Hi Karl,
>> 
>> Do you think I need to create a Jira issue for this bug, i.e. that link 
>> extraction does not work if some code inside JavaScript tags contains 
>> special characters like '>' or '<'?
>> 
>> Thanks,
>> Best regards,
>> 
>> Olivier
>> 
>> 
>> 
>>> On 30 Oct 2018, at 12:05, Olivier Tavard wrote:
>>> 
>>> Hi Karl,
>>> 
>>> Thanks for your answer.
>>> I kept looking into this and I found what the problem was. The JavaScript 
>>> code inside the <script></script> tags contained the character '<'. When 
>>> that is the case, link extraction does not work with the web connector.
>>> 
>>> To reproduce it, I created this page hosted on a local Apache server, then 
>>> I indexed it with MCF 2.11 out of the box.
>>> 
>>> In the first example the page was:
>>> <!DOCTYPE html>
>>> <html>
>>> <head>
>>> <title>test</title>
>>> <meta charset="utf-8" />
>>> <script type="text/javascript">
>>> /* some JavaScript without any '<' character */
>>> </script>
>>> </head>
>>> <body>
>>> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>>> </body>
>>> </html>
>>> 
>>> 
>>> The links extraction was correct, in the debug log :
>>> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an 
>>> HttpClient object
>>> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For 
>>> http://localhost:/testjs/test.html, setting virtual host to localhost
>>> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient 
>>> object after 1 ms.
>>> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for 
>>> '/testjs/test.html'
>>>  INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH 
>>> URL|http://localhost:/testjs/test.html|1540896372585+75|200|223|
>>> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 
>>> 'http://localhost:/testjs/test.html' is text, with encoding 'UTF-8'; 
>>> link extraction starting
>>> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 
>>> 'http://localhost:/testjs/test.html', found link to 
>>> 'https://manifoldcf.apache.org/en_US/index.html'
>>> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content 
>>> exclusion rule supplied... returning
>>> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 
>>> 'http://localhost:/testjs/test.html'
>>> —
>>> In the second example, the code was much the same except that I included 
>>> the character '<' in the content of the script tags:
>>> <!DOCTYPE html>
>>> <html>
>>> <head>
>>> <title>test</title>
>>> <meta charset="utf-8" />
>>> <script type="text/javascript">
>>> a<b
>>> </script>
>>> </head>
>>> <body>
>>> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>>> </body>
>>> </html>
>>> 
>>> 
>>> 
>>> The links extraction was not successful, the debug log indicates :
>>> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an 
>>> HttpClient object
>>> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For 
>>> http://localhost:/testjs/test.html, setting virtual host to localhost
>>> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient 
>>> object after 1 ms.
>>> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for 
>>> '/testjs/test.html'
>>>  INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH 
>>> URL|http://localhost:/testjs/test.html|1540896493475+76|200|226|
>>> DEBUG 2018-10-30T11:48:13,552 (Worker thre

Re: web connector : links extraction issues

2018-11-15 Thread Olivier Tavard
Hi Karl,

Do you think I need to create a Jira issue for this bug, i.e. that link 
extraction does not work if some code inside JavaScript tags contains special 
characters like '>' or '<'?

Thanks,
Best regards,

Olivier



> On 30 Oct 2018, at 12:05, Olivier Tavard wrote:
> 
> Hi Karl,
> 
> Thanks for your answer.
> I kept looking into this and I found what the problem was. The JavaScript 
> code inside the <script></script> tags contained the character '<'. When that 
> is the case, link extraction does not work with the web connector.
> 
> To reproduce it, I created this page hosted on a local Apache server, then I 
> indexed it with MCF 2.11 out of the box.
> 
> In the first example the page was:
> <!DOCTYPE html>
> <html>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript">
> /* some JavaScript without any '<' character */
> </script>
> </head>
> <body>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
> </html>
> 
> 
> The links extraction was correct, in the debug log :
> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an 
> HttpClient object
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For 
> http://localhost:/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient 
> object after 1 ms.
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for 
> '/testjs/test.html'
>  INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH 
> URL|http://localhost:/testjs/test.html|1540896372585+75|200|223|
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 
> 'http://localhost:/testjs/test.html' is text, with encoding 'UTF-8'; 
> link extraction starting
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 
> 'http://localhost:/testjs/test.html', found link to 
> 'https://manifoldcf.apache.org/en_US/index.html'
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content 
> exclusion rule supplied... returning
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 
> 'http://localhost:/testjs/test.html'
> —
> In the second example, the code was much the same except that I included the 
> character '<' in the content of the script tags:
> <!DOCTYPE html>
> <html>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript">
> a<b
> </script>
> </head>
> <body>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
> </html>
> 
> 
> 
> The links extraction was not successful, the debug log indicates :
> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an 
> HttpClient object
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For 
> http://localhost:/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient 
> object after 1 ms.
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for 
> '/testjs/test.html'
>  INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH 
> URL|http://localhost:/testjs/test.html|1540896493475+76|200|226|
> DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 
> 'http://localhost:/testjs/test.html' is text, with encoding 'UTF-8'; 
> link extraction starting
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content 
> exclusion rule supplied... returning
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 
> 'http://localhost:/testjs/test.html'
> —
> So special characters like the less than sign should be escaped in the code 

Re: web connector : links extraction issues

2018-10-30 Thread Olivier Tavard
Hi Karl,

Thanks for your answer.
I kept looking into this and I found what the problem was. The JavaScript code 
inside the <script></script> tags contained the character '<'. When that is the 
case, link extraction does not work with the web connector.

To reproduce it, I created this page hosted on a local Apache server, then I 
indexed it with MCF 2.11 out of the box.

In the first example the page was:
<!DOCTYPE html>
<html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">
/* some JavaScript without any '<' character */
</script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
</html>


The links extraction was correct, in the debug log :
DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an 
HttpClient object
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For 
http://localhost:/testjs/test.html, setting virtual host to localhost
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient 
object after 1 ms.
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for 
'/testjs/test.html'
 INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH 
URL|http://localhost:/testjs/test.html|1540896372585+75|200|223|
DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 
'http://localhost:/testjs/test.html' is text, with encoding 'UTF-8'; link 
extraction starting
DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 
'http://localhost:/testjs/test.html', found link to 
'https://manifoldcf.apache.org/en_US/index.html'
DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion 
rule supplied... returning
DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 
'http://localhost:/testjs/test.html'
—
In the second example, the code was much the same except that I included the 
character '<' in the content of the script tags:
<!DOCTYPE html>
<html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">
a<b
</script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
</html>



The links extraction was not successful, the debug log indicates :
DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an 
HttpClient object
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For 
http://localhost:/testjs/test.html, setting virtual host to localhost
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient 
object after 1 ms.
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for 
'/testjs/test.html'
 INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH 
URL|http://localhost:/testjs/test.html|1540896493475+76|200|226|
DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 
'http://localhost:/testjs/test.html' is text, with encoding 'UTF-8'; link 
extraction starting
DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion 
rule supplied... returning
DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 
'http://localhost:/testjs/test.html'
—
So special characters like the less-than sign should be handled (escaped) in 
the web connector code so that link extraction keeps working.

Thanks,
Best regards,


Olivier 

> On 29 Oct 2018, at 19:39, Karl Wright wrote:
> 
> Hi Olivier,
> 
> Javascript inclusion in the Web Connector is not evaluated.  In fact, no 
> Javascript is executed at all.  Therefore it should not matter what is 
> included via javascript.
> 
> Thanks,
> Karl
> 
> 
> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard wrote:
> Hi,
> 
> Regarding the web connector, I noticed that for specific websites, some 
> Javascript code can prevent the web connector to fetch correctly all the 
> links present on the page. Specifically, for websites that contain a 
> deprecated version of New relic web agent as 
> js-agent.newrelic.com/nr-1071.min.js 
> <http://js-agent.newrelic.com/nr-1071.min.js>.
> After downloading the page locally and removing the reference to the new 
> relic agent browser, the links were correctly fetched in the page by the web 
> connector. So it seems that the Javascript injection here caused by the new 
> relic agent was the cause of the links not fetched in the page.
> This case is rare and concerns only old versions of New Relic agent. But in a 
> more generic way, would it be possible to block the javascript injection at 
> the connector level during the indexation ?
>  
> Thanks,
> Best regards,
> Olivier 
> 
> 



web connector : links extraction issues

2018-10-29 Thread Olivier Tavard
Hi,

Regarding the web connector, I noticed that for specific websites some 
JavaScript code can prevent the web connector from correctly fetching all the 
links present on the page. Specifically, this concerns websites that contain a 
deprecated version of the New Relic web agent such as 
js-agent.newrelic.com/nr-1071.min.js.
After downloading the page locally and removing the reference to the New Relic 
browser agent, the links on the page were correctly fetched by the web 
connector. So it seems that the JavaScript injected by the New Relic agent was 
the cause of the links not being fetched.
This case is rare and concerns only old versions of the New Relic agent, but in 
a more generic way, would it be possible to block JavaScript injection at the 
connector level during indexing?
 
Thanks,
Best regards,
Olivier 




Re: Logging and Document filter transformation connector

2018-10-17 Thread Olivier Tavard
Hi Karl,

I opened a ticket in JIRA; it will be simpler to discuss it there: 
https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1547

Thanks,

Olivier 


> On 11 Oct 2018, at 19:25, Karl Wright wrote:
> 
> The fact that the history is different for the two suggests that the 
> mechanism is different.  You can turn on connector logging and that should 
> help figure out why the png is being rejected.  Once we know that it should 
> be possible to consider improvements to the history.
> 
> Karl
> 
> On Thu, Oct 11, 2018, 10:41 AM Olivier Tavard wrote:
> Hello Karl,
> 
> OK thanks for the detailed explanation.
> So I understand that we cannot add a distinct result code if the repository 
> connector has no knowledge of the pipeline.
> My problem is that sometimes we do not have any activity status about an 
> excluded file.
> 
> To be more precise, I created a job that only keeps doc and docx extensions 
> (web repository connector and document filter transformation connector). If 
> you look at the screenshot, you will see that the html and the png files are 
> excluded by the repository connector as expected, but only the html file has 
> a specific activity log entry with an explicit result code (EXCLUDEURL):
> 
> The png file has only a "fetch" activity with a 200 result code. I had to 
> activate debug mode to find a log line about the exclusion of the png file:
> "Removing url 'https://www.datafari.com/assets/img/img_feature_phone_list.png 
> <https://www.datafari.com/assets/img/img_feature_phone_list.png>' because it 
> had the wrong content type ('image/png')"
> The code related to this is located at line 902 in the WebcrawlerConnector 
> and it contains only:
> activityResultCode = null; 
> 
> On the other hand, for the html file the relevant section is at line 1366 and 
> it has explicit code to handle it:
> 
> errorCode = activities.EXCLUDED_URL;
> errorDesc = "Rejected due to URL ('"+documentIdentifier+"')";
> activities.noDocument(documentIdentifier,versionString);
> 
> I do not understand why the activity log entry with a specific result code is 
> present for the html file but not for the png file, for example. Would it be 
> possible to have the same log entry for all files?
> 
> Thanks,
> Best regards,
> 
> Olivier 
> 
>> On 11 Oct 2018, at 16:00, Karl Wright wrote:
>> 
>> Hi Olivier,
>> 
>> The Repository connector has no knowledge of what the pipeline looks like.  
>> It simply asks the framework whether the mime type, length, etc. is 
>> acceptable to the downstream pipeline.  It's the connector's responsibility 
>> to note the reason for the rejection in the simple history, but it does not 
>> have any knowledge whatsoever of which connector rejected the document, and 
>> therefore cannot say which transformer or output rejected the document.
>> 
>> Transformation and output connectors which respond to checks for document 
>> mime type or length checks likewise do not have any knowledge of the 
>> upstream connector that is doing the checking.
>> 
>> Karl
>> 
>> 
>> 
>> On Thu, Oct 11, 2018 at 9:31 AM Olivier Tavard wrote:
>> Hello,
>> 
>> I have a question regarding the Document filter transformation connector and 
>> the log about it.
>> I would like to have a look of all the documents excluded by the rules 
>> configured in the Document filter transformation connector by looking at the 
>> Simple history or by the MCF log but it is not easy so far.
>> 
>> Let’s say that I want to crawl a website and I want to index html pages 
>> only. So I configure a web repository connector with a Document filter 
>> transformation connector and I create the rule with only one allowed mime 
>> type content and one file extension. So far so good, the job works well but 
>> if I want to visualize on the MCF log or by the simple history all the files 
>> that were excluded by the transformation connector it is quickly complicated 
>> : I have to search manually all the files that were fetched but not 
>> processed by Tika transformation connector or ingested by the output 
>> connector.
>> 
>> Of my understanding of the code, the document filter transformation 
>> connector can communicate directly with the repo transformation connector to 
>> indicate the rules of exclusion of the documents and so the document that 
>> need to be excluded 

Logging and Document filter transformation connector

2018-10-11 Thread Olivier Tavard
Hello,

I have a question regarding the Document filter transformation connector and 
its logging.
I would like to see all the documents excluded by the rules configured in the 
Document filter transformation connector, either in the Simple History or in 
the MCF log, but so far this is not easy.

Let’s say that I want to crawl a website and index html pages only. So I 
configure a web repository connector with a Document filter transformation 
connector, and I create a rule with only one allowed content MIME type and one 
file extension. So far so good, the job works well, but if I want to see in the 
MCF log or in the Simple History all the files that were excluded by the 
transformation connector, it quickly gets complicated: I have to manually 
search for all the files that were fetched but neither processed by the Tika 
transformation connector nor ingested by the output connector.

From my understanding of the code, the document filter transformation connector 
can communicate its exclusion rules directly to the repository connector, so 
the documents that need to be excluded are not processed by the Document filter 
transformation connector but are excluded directly by the web repository 
connector.
So in the Simple History I can see that a document that will be excluded has a 
"fetch" activity and that’s it; there is no additional information about it.
Would it be possible to add a log entry with an explicit result code such as 
"excluded by document filter connector", or something similar, when the 
document is excluded by the repository connector?
 
Thank you,
Best regards,
Olivier 



Re: Debug logging properties location

2018-10-11 Thread Olivier Tavard
Hi Karl,

OK, thanks for the answer. So that is its normal location; I just wanted to be 
sure.
As a suggestion for improvement, the properties section of the 
how-to-build-and-deploy page could gain an additional column in the table 
indicating whether each property belongs in the global or the local properties 
file.
Thanks,

Olivier 


> On 11 Oct 2018, at 15:01, Karl Wright wrote:
> 
> Hi Olivier, it sounds like you are using Zookeeper.  Certain properties are 
> global and are imported into Zookeeper.  Other properties are local and found 
> in each local properties.xml file.  The debug properties for logging are, I 
> believe, global.
> 
> Karl
> 
> 
> On Thu, Oct 11, 2018 at 8:39 AM Olivier Tavard wrote:
> Hello,
> 
> I have a question regarding the debug logging properties and their location 
> in the multi process model.
> If I put the properties in the properties.xml file (as 
> org.apache.manifoldcf.connectors for example), it seems that the properties 
> are not taken into account. In the other hand, if I put them in the 
> global-properties.xml file it is OK.
> Is it the normal behaviour ? I thought that global-properties file was only 
> used for shared configuration. For example the property 
> org.apache.manifoldcf.logconfigfile is located in properties.xml and not in 
> global-properties.xml.
> 
> Thanks,
> Best regards,
> 
> Olivier
> 
> 



Debug logging properties location

2018-10-11 Thread Olivier Tavard
Hello,

I have a question regarding the debug logging properties and their location in 
the multi-process model.
If I put the properties in the properties.xml file (org.apache.manifoldcf.connectors, 
for example), it seems that the properties are not taken into account. On the 
other hand, if I put them in the global-properties.xml file, it works.
Is this the normal behaviour? I thought the global-properties file was only 
used for shared configuration. For example, the property 
org.apache.manifoldcf.logconfigfile is located in properties.xml and not in 
global-properties.xml.
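For illustration, the behaviour described above corresponds to putting the 
debug logging property in the shared file rather than the local one; a sketch 
using the property name mentioned above:

<!-- global-properties.xml (shared configuration, read by all processes) -->
<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>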

Thanks,
Best regards,

Olivier




Re: PostgreSQL version to support MCF v2.10

2018-09-04 Thread Olivier Tavard
Hello,

Thanks a lot for sharing your PostgreSQL configuration (sorry for the late 
answer). I will test it soon.

Best regards,


Olivier TAVARD


> On 23 Aug 2018, at 19:20, Steph van Schalkwyk wrote:
> 
> 
> 
> These are the rpm installs:
> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
> - 
> file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
> - 
> file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
> - 
> file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
> 
>   postgresql_version: 10
>   postgresql_data_dir: /var/lib/pgsql/10/data
>   postgresql_bin_path: /usr/pgsql-10/bin
>   postgresql_config_path: /var/lib/pgsql/10/data
>   postgresql_daemon: postgresql-10.service
>   postgresql_packages:
> - postgresql10-libs
> - postgresql10
> - postgresql10-server
> - postgresql10-contrib
> #- postgresql10-devel
> 
>   postgresql_hba_entries:
> - { type: local, database: all, user: postgres, auth_method: peer }
> - { type: local, database: all, user: all, auth_method: peer }
> - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
> - { type: host, database: all, user: all, address: '::1/128', 
> auth_method: md5 } 
> - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
> - { type: host, database: all, user: all, address: '::0/0', 
> auth_method: md5 }
> 
>   postgresql_global_config_options:
> - option: unix_socket_directories
>   value: '{{ postgresql_unix_socket_directories | join(",") }}'
> 
> - option: standard_conforming_strings
>   value: 'on'
> 
> - option: shared_buffers
>   value: '1024MB'
> 
> # max_wal_size = (3 * checkpoint_segments) * 16MB
> # checkpoint_segments=300
> - option: max_wal_size
>   value: '14400MB'
> 
> - option: min_wal_size
>   value: '80MB'
> 
> - option: maintenance_work_mem
>   value: '2MB'
> 
> - option: listen_addresses
>   value: '*'
> 
> - option: max_connections
>   value: '400'
> 
> - option: checkpoint_timeout
>   value: '900'
> 
> - option: datestyle
>   value: "iso, mdy"
> 
> - option: autovacuum
>   value: 'off'
> 
> # vacuum all databases every night (full vacuum on Sunday night, lazy 
> vacuum every night)
> - name: add postgresql cron lazy vacuum
>   cron:
> name: lazy_vacuum
> hour: 8
> minute: 0
> job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
> - name: add postgresql cron full vacuum
>   cron:
> name: full_vacuum
> weekday: 0
> hour: 10
> minute: 0
> job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
> # re-index all databases once a week
> - name: add postgresql cron reindex
>   cron:
> name: reindex
> weekday: 0
> hour: 12
> minute: 0
> job: "su - postgres -c 'psql -t -c \"select datname from pg_database 
> order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex 
> database {};\"' "
> 
> 
> This is how I run 2.10.
> Been running fine for some weeks without user intervention.
> @Karl: Any comments please?
> Steph
> 
> 



Re: PostgreSQL version to support MCF v2.10

2018-08-23 Thread Olivier Tavard
Hi,

I am also interested in the migration to PostgreSQL 10.
@Steph it would be nice if you could specify the changes you made in the 
configuration file when you wrote:
"I'm using 10.4 with no issues.
One or two of the recommended settings for MCF have changed between 9.6 and 10.
Simple to resolve though."

Thanks !
Best regards,

Olivier


> On 6 Aug 2018, at 15:52, Karl Wright wrote:
> 
> It is what is expected with multiple threads active at the same time.
> Karl
> 
> 
> On Mon, Aug 6, 2018 at 7:26 AM Standen Guy  > wrote:
> Hi Karl,
> 
> I haven’t experienced any job aborts, so all seems OK in that respect.
> 
> Is there anything I can do to reduce these errors in the first place, or is 
> it just to be expected given the multiple worker threads and the query types 
> issued by ManifoldCF?
> 
> Best Regards,
> 
>  
> 
> Guy
> 
>  
> 
> From: Karl Wright [mailto:daddy...@gmail.com ] 
> Sent: 06 August 2018 12:16
> To: user@manifoldcf.apache.org 
> Subject: Re: PostgreSQL version to support MCF v2.10
> 
>  
> 
> These are exactly the same kind of issue as the first "error" reported.  They 
> will be retried.  If they did not get retried, they would abort the job 
> immediately.
> 
>  
> 
> Karl
> 
>  
> 
>  
> 
> On Mon, Aug 6, 2018 at 6:57 AM Standen Guy  > wrote:
> 
> Hi Karl,
> 
>Thanks for the prompt response regarding the first  error 
> example.   Do you have a view as to second error  i.e.
> 
> “2018-08-03 15:52:42.855 BST [5272] ERROR:  could not serialize access due to 
> concurrent update
> 
> 2018-08-03 15:52:42.855 BST [5272] STATEMENT:  SELECT id,status,checktime 
> FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
> 
> 2018-08-03 15:52:42.855 BST [7424] ERROR:  could not serialize access due to 
> concurrent update
> 
> 2018-08-03 15:52:42.855 BST [7424] STATEMENT:  SELECT id,status,checktime 
> FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
> 
> 2018-08-03 15:52:42.855 BST [5716] ERROR:  could not serialize access due to 
> concurrent update
> 
> “
> 
>  
> 
> These errors don’t suggest a retry may sort them out  - is this an issue?
> 
>  
> 
> Many Thanks,
> 
>  
> 
> Guy
> 
>  
> 
> From: Karl Wright [mailto:daddy...@gmail.com ] 
> Sent: 06 August 2018 10:52
> To: user@manifoldcf.apache.org 
> Subject: Re: PostgreSQL version to support MCF v2.10
> 
>  
> 
> Ah, the following errors:
> 
> >>
> 
> 2018-08-03 15:52:25.218 BST [4140] ERROR:  could not serialize access due to 
> read/write dependencies among transactions
> 
> 2018-08-03 15:52:25.218 BST [4140] DETAIL:  Reason code: Canceled on 
> identification as a pivot, during conflict in checking.
> 
> 2018-08-03 15:52:25.218 BST [4140] HINT:  The transaction might succeed if 
> retried.
> 
> << 
> 
>  
> 
> ... occur because of concurrent transactions.  The transaction is indeed 
> retried when this occurs, so unless your job aborts, you are fine.
> 
>  
> 
> Karl
> 
>  
> 
>  
> 
> On Mon, Aug 6, 2018 at 5:49 AM Karl Wright  > wrote:
> 
> What errors are these?  Please include them and I can let you know.
> 
>  
> 
> Karl
> 
>  
> 
>  
> 
> On Mon, Aug 6, 2018 at 4:50 AM Standen Guy  > wrote:
> 
> Thank you Karl and Steph,
> 
>  
> 
> Steph, yes I don’t seem to have any issues with running the MCF jobs, but am 
> concerned about the PostgreSQL errors. Do you ( or anyone else)  have a view 
> on the errors I have seen in the PostgreSQL logs  - is this something you 
> have seen with 10.4  and if so was it corrected by changing some settings? 
> 
>  
> 
> Best Regards
> 
>  
> 
> Guy
> 
>  
> 
> From: Steph van Schalkwyk [mailto:st...@remcam.net ] 
> Sent: 03 August 2018 23:21
> To: user@manifoldcf.apache.org 
> Subject: Re: PostgreSQL version to support MCF v2.10
> 
>  
> 
> I'm using 10.4 with no issues. 
> 
> One or two of the recommended settings for MCF have changed between 9.6 and 
> 10. 
> 
> Simple to resolve though.
> 
> Steph
> 
>  
> 
> 
> 
>  
> 
> On Fri, Aug 3, 2018 at 1:29 PM, Karl Wright  > wrote:
> 
> Hi Guy,
> 
>  
> 
> I use Postgresql 9.6 myself and have found no issues with it.  I don't know 
> about v 10 however.
> 
>  
> 
> Karl
> 
>  
> 
>  
> 
> On Fri, Aug 3, 2018 at 11:32 AM Standen Guy  > wrote:
> 
> Hi Karl/All,
> 
>I am upgrading from MCF v2.6  supported by PostgreSQL v 9.3.16 
>   to  MCF v2.10.  I wonder if there is any official advice as to which 
> version of PostgreSQL  will support  MCF v2.10? The  MCF v2.10 build and 
> deployment instructions still suggest that PostgreSQL 9.3 is the latest 
> tested version of PostgreSQL.  Given that PostgreSQL 9.3.x  is go

Sharepoint 2013 indexation time performance

2018-06-13 Thread Olivier Tavard
Hi,

I have a question regarding the performance of the SharePoint repository 
connector.
Recently we did some tests using MCF 2.8.1 to crawl documents on a SharePoint 
2013 server. There were only a few documents: 700, all located in the same 
document list.
For a full crawl, the indexing speed was about 60 docs/min (with security 
activated: SP native authority connector).
For an incremental crawl with one file modified and one file added, the 
indexing speed was 80 docs/min.

I know that performance of course depends on many factors, but if someone has 
already indexed large SharePoint servers and could post the figures obtained, 
it would be great to compare.

Tests were done on two VMs on an ESXi server (CPU: Xeon D-1520, RAM: 64 GB):
- VM for SharePoint 2013:
all SharePoint services installed on the same VM
4 vCPU, 16 GB RAM

- VM for MCF:
4 vCPU, 24 GB RAM
MCF 2.8.1, multi-process with ZooKeeper

Thanks,

Best regards,

Olivier TAVARD



Re: ZK based synchronization questions

2017-11-02 Thread Olivier Tavard
Hi Karl,

OK, it makes sense, thanks for the explanation!

Best regards,

Olivier TAVARD


> On 31 Oct 2017, at 17:53, Karl Wright wrote:
> 
> Hi Olivier,
> 
> Zookeeper connections are pooled, so they are pulled out of the pool at will 
> by ManifoldCF and returned when the lock etc is done.  This means you really 
> should need only a total number of outstanding Zookeeper handles that is on 
> the same order as the number of operating threads in your ManifoldCF cluster. 
>  We do have some users, though, who have hundreds of worker threads in the 
> mistaken assumption that more threads makes the system faster, and when 
> people do stuff like that, we often get tickets because the number of 
> zookeeper handles runs out.  That is why there is such a large number.
> 
> Thanks,
> Karl
> 
> 
> On Tue, Oct 31, 2017 at 12:23 PM, Olivier Tavard wrote:
> Hi all,
> Just to clarify my concern on ZK:
> To my knowledge, best practices concerning ZK connections are to not go 
> beyond 60. Is there any rationale for setting it at 1000 for MCF ? 
> Could this can have side effects on our ZK cluster shared by MCF and 
> SolrCloud ?
> 
> Thanks,
> 
> Olivier
> 
> 
>> On 23 Oct 2017, at 17:19, Olivier Tavard wrote:
>> 
>> Hello,
>> 
>> We configured MCF to use ZK sync instead of file sync. We noticed a huge 
>> improvement regarding the stability of the MCF jobs in every case especially 
>> for large data to index (15M of files using the Windows Share repository 
>> connector). Before that, we had some errors when the job was running 
>> randomly. With that change, we did not notice any error on the job so far.
>> However, after testing that configuration on several servers, we had errors 
>> reported and I would like to know what you suggest for that.
>> We installed MCF on servers that already have Solr 6.6.X on them. I saw on 
>> other threads on the mailing list that it was OK to use existing ZK 
>> installation rather than using a new ZK instance dedicated to MCF so we use 
>> the same ZK for both Solr and MCF.
>> After starting MCF and Solr, we noticed some errors on the MCF log for few 
>> servers : Session 0x0 for server localhost/127.0.0.1:2181 
>> <http://127.0.0.1:2181/>, unexpected error, closing socket connection and 
>> attempting reconnect
>> java.io.IOException: Connection reset by peer
>> Then after checking the ZK log we saw this message : "WARN 2017-10-23 
>> 08:53:35,431 (NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181 
>> <http://0.0.0.0/0.0.0.0:2181>) - 
>> ZooKeeper|ZooKeeper|zookeeper.server.NIOServerCnxnFactory|Too many 
>> connections from /127.0.0.1 <http://127.0.0.1/> - max is 60"
>> Therefore we changed the parameter maxClientCnxns from 60 in our ZK 
>> configuration to 1000 as in the MCF zookeeper.cfg default file and there is 
>> no problem anymore.
>> I  would just like to know why this parameter needs to be so high for MCF 
>> needs and if other people share their ZK cluster with both Solr and MCF too 
>> without any problem. And last question I had : MCF uses Zookeeper 3.4.8 
>> while Solr 6.6+ uses ZK 3.4.10.  As written above our ZK cluster version is 
>> 3.4.10 and we use MCF on it, is it ok to do that or would it be best to use 
>> a ZK installation with version 3.4.8 only for MCF ?  So far we did not see 
>> any problems using it.
>> 
>> Thanks,
>> Best regards, 
>> 
>> Olivier TAVARD
> 
> 



Re: ZK based synchronization questions

2017-10-31 Thread Olivier Tavard
Hi all,
Just to clarify my concern on ZK:
To my knowledge, the best practice for ZK connections is not to go beyond 60. 
Is there any rationale for setting it to 1000 for MCF?
Could this have side effects on our ZK cluster shared by MCF and SolrCloud?

Thanks,

Olivier


> On 23 Oct 2017, at 17:19, Olivier Tavard wrote:
> 
> Hello,
> 
> We configured MCF to use ZK sync instead of file sync. We noticed a huge 
> improvement regarding the stability of the MCF jobs in every case especially 
> for large data to index (15M of files using the Windows Share repository 
> connector). Before that, we had some errors when the job was running 
> randomly. With that change, we did not notice any error on the job so far.
> However, after testing that configuration on several servers, we had errors 
> reported and I would like to know what you suggest for that.
> We installed MCF on servers that already have Solr 6.6.X on them. I saw on 
> other threads on the mailing list that it was OK to use existing ZK 
> installation rather than using a new ZK instance dedicated to MCF so we use 
> the same ZK for both Solr and MCF.
> After starting MCF and Solr, we noticed some errors on the MCF log for few 
> servers : Session 0x0 for server localhost/127.0.0.1:2181, unexpected error, 
> closing socket connection and attempting reconnect
> java.io.IOException: Connection reset by peer
> Then after checking the ZK log we saw this message : "WARN 2017-10-23 
> 08:53:35,431 (NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181) - 
> ZooKeeper|ZooKeeper|zookeeper.server.NIOServerCnxnFactory|Too many 
> connections from /127.0.0.1 - max is 60"
> Therefore we changed the parameter maxClientCnxns from 60 in our ZK 
> configuration to 1000 as in the MCF zookeeper.cfg default file and there is 
> no problem anymore.
> I  would just like to know why this parameter needs to be so high for MCF 
> needs and if other people share their ZK cluster with both Solr and MCF too 
> without any problem. And last question I had : MCF uses Zookeeper 3.4.8 while 
> Solr 6.6+ uses ZK 3.4.10.  As written above our ZK cluster version is 3.4.10 
> and we use MCF on it, is it ok to do that or would it be best to use a ZK 
> installation with version 3.4.8 only for MCF ?  So far we did not see any 
> problems using it.
> 
> Thanks,
> Best regards, 
> 
> Olivier TAVARD



ZK based synchronization questions

2017-10-23 Thread Olivier Tavard
Hello,

We configured MCF to use ZK sync instead of file sync. We noticed a huge 
improvement in the stability of the MCF jobs in every case, especially for 
large volumes of data to index (15M files using the Windows Share repository 
connector). Before that, we randomly had errors while the job was running. 
With that change, we have not noticed any error on the job so far.
However, after testing that configuration on several servers, we had errors 
reported and I would like to know what you suggest.
We installed MCF on servers that already have Solr 6.6.x on them. I saw in 
other threads on the mailing list that it was OK to use an existing ZK 
installation rather than a new ZK instance dedicated to MCF, so we use the same 
ZK for both Solr and MCF.
After starting MCF and Solr, we noticed some errors in the MCF log on a few 
servers: Session 0x0 for server localhost/127.0.0.1:2181, unexpected error, 
closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
Then after checking the ZK log we saw this message : "WARN 2017-10-23 
08:53:35,431 (NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181) - 
ZooKeeper|ZooKeeper|zookeeper.server.NIOServerCnxnFactory|Too many connections 
from /127.0.0.1 - max is 60"
Therefore we changed the maxClientCnxns parameter in our ZK configuration from 
60 to 1000, as in the default MCF zookeeper.cfg file, and there is no problem 
anymore.
I would just like to know why this parameter needs to be so high for MCF's 
needs, and whether other people share their ZK cluster between Solr and MCF 
without any problem. One last question: MCF ships with ZooKeeper 3.4.8 while 
Solr 6.6+ uses ZK 3.4.10. As written above, our ZK cluster version is 3.4.10 
and we run MCF on it; is it OK to do that, or would it be best to use a ZK 
installation with version 3.4.8 just for MCF? So far we have not seen any 
problems using it.

Thanks,
Best regards, 

Olivier TAVARD


Re: Job error during WindowsShare repository connector indexation

2017-10-11 Thread Olivier Tavard
Hi,

Thanks for your answers.
OK, I will definitely use ZooKeeper rather than file-based synchronization, and 
I will let you know.
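For anyone making the same switch, the change is done in properties.xml by 
selecting the ZooKeeper lock manager and pointing it at the ZooKeeper ensemble. 
This is a sketch from memory of the multi-process setup documentation; please 
verify the exact property names against the how-to-build-and-deploy page for 
your MCF version:

<property name="org.apache.manifoldcf.lockmanagerclass"
          value="org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager"/>
<property name="org.apache.manifoldcf.zookeeper.connectstring" value="localhost:2181"/>
<property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="300000"/>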

For information, the syncharea folder was not accessed by any other process 
during our crawl. The server is dedicated to MCF. The OS is Debian 8 and the 
files are on a standard Linux filesystem (ext3). We did not increase the max 
open files on this server (only on the Solr servers); that is a good thing to 
investigate, thanks.
Regardless of the switch to ZK, would it be possible to change this behavior in 
MCF, for example by automatically stopping the job when this exception occurs?

Thanks,

Olivier TAVARD


> On 11 Oct 2017, at 14:15, Karl Wright wrote:
> 
> In this case it's the *directory* that it doesn't find, so it can't create 
> the file.  If the syncharea is in an NFS-mounted filesystem, then you can get 
> problems of this kind, which is why we strongly advise using Zookeeper 
> instead of playing those kinds of games.
> 
> Karl
> 
> 
> On Wed, Oct 11, 2017 at 7:20 AM, Luis Cabaceira wrote:
> I've seen similar errors (that actually seem like the file is not there or 
> has been deleted, while in fact it exists) due to the reasons I wrote about 
> before.
> 
> On 11 October 2017 at 15:12, Karl Wright wrote:
> This error:
> 
> >>>>>>
> WARN 2017-10-09 08:23:56,284 (Idle cleanup thread) - 
> MCF|MCF-agent|apache.manifoldcf.lock|Attempt to set file lock 
> 'mcf/mcf_home/./syncharea/551/442/lock-_POOLTARGET__REPOSITORYCONNECTORPOOL_SmbFileShare.lock'
>  failed: No such file or directory
> java.io.IOException: No such file or directory
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1012)
> at 
> org.apache.manifoldcf.core.lockmanager.FileLockObject.grabFileLock(FileLockObject.java:223)
> at 
> org.apache.manifoldcf.core.lockmanager.FileLockObject.obtainGlobalWriteLockNoWait(FileLockObject.java:78)
> at 
> org.apache.manifoldcf.core.lockmanager.LockObject.obtainGlobalWriteLock(LockObject.java:121)
> at 
> org.apache.manifoldcf.core.lockmanager.LockObject.enterWriteLock(LockObject.java:74)
> at 
> org.apache.manifoldcf.core.lockmanager.LockGate.enterWriteLock(LockGate.java:177)
> at 
> org.apache.manifoldcf.core.lockmanager.BaseLockManager.enterWrite(BaseLockManager.java:1120)
> at 
> org.apache.manifoldcf.core.lockmanager.BaseLockManager.enterWriteLock(BaseLockManager.java:757)
> at 
> org.apache.manifoldcf.core.lockmanager.LockManager.enterWriteLock(LockManager.java:302)
> at 
> org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:585)
> at 
> org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> at 
> org.apache.manifoldcf.crawler.repositoryconnectorpool.RepositoryConnectorPool.pollAllConnectors(RepositoryConnectorPool.java:124)
> at 
> org.apache.manifoldcf.crawlerui.IdleCleanupThread.run(IdleCleanupThread.java:69)
> And the error was repeated indefinitely in the log.
> <<<<<<
> 
> is due to somebody erasing the file-based syncharea while ManifoldCF 
> processes were active.  We strongly suggest using Zookeeper rather than 
> file-based synch, in any case.
> 
> Thanks,
> 
> Karl
> 
> 
> On Wed, Oct 11, 2017 at 6:05 AM, Luis Cabaceira wrote:
> From the look of it, this could be coming from a limit on the number of file 
> handles. Your process may be creating too many file handles and not closing 
> them in time, eventually preventing further file operations. 
> 
> I suggest you check this, in Linux run : cat /proc/sys/fs/file-max
> 
> 
> To see the hard and soft values : 
> 
> # ulimit -Hn
> # ulimit -Sn
> 
> P.S. - Change into the user that is running Manifold first
> 
> 
> On 11 October 2017 at 13:54, Olivier Tavard wrote:
> Hi,
> 
> Thanks for your answer.
> Yes I could reach the samba server from the MCF server. Indeed, the first 
> hours after the MCF job was launched, thousands of documents were correctly 
> accessed and processed by MCF. The mentioned errors appeared only after few 
> hours. Before that, the indexation was done correctly.
> 
> Best regards,
> Olivier TAVARD
> 
> 
>> On 11 Oct 2017, at 11:21, Cihad Guzel wrote:
>> 
>> Hi Olivier,
>> 
>> Did you try to connect to samba server with any samba client app? Check 
>> Iptables on your server. Can you stop iptables on ubuntu server? Maybe, you 
>> can configure iptables.
>> 
>>

Re: Job error during WindowsShare repository connector indexation

2017-10-11 Thread Olivier Tavard
Hi,

Thanks for your answer.
Yes, I could reach the Samba server from the MCF server. Indeed, during the 
first hours after the MCF job was launched, thousands of documents were 
correctly accessed and processed by MCF. The errors mentioned appeared only 
after a few hours; before that, the indexing was running correctly.

Best regards,
Olivier TAVARD


> On 11 Oct 2017, at 11:21, Cihad Guzel wrote:
> 
> Hi Olivier,
> 
> Did you try to connect to the Samba server with any Samba client app? Check 
> iptables on your server. Can you stop iptables on the Ubuntu server? 
> Alternatively, maybe you can configure iptables accordingly.
> 
> Regards,
> Cihad Guzel
> 
> 
> On 2017-10-11 at 12:02 GMT+03:00, Olivier Tavard wrote:
> Hi,
> 
> I had this error during crawling a Samba hosted on Ubuntu Server :
> ERROR 2017-10-05 00:00:14,109 (Idle cleanup thread) - 
> MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Exception tossed: Service 
> '_ANON_0' of type '_REPOSITORYCONNECTORPOOL_SmbFileShare' is not active
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Service '_ANON_0' 
> of type '_REPOSITORYCONNECTORPOOL_SmbFileShare' is not active
> at 
> org.apache.manifoldcf.core.lockmanager.BaseLockManager.updateServiceData(BaseLockManager.java:273)
> at 
> org.apache.manifoldcf.core.lockmanager.LockManager.updateServiceData(LockManager.java:108)
> at 
> org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:654)
> at 
> org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> at 
> org.apache.manifoldcf.crawler.repositoryconnectorpool.RepositoryConnectorPool.pollAllConnectors(RepositoryConnectorPool.java:124)
> at 
> org.apache.manifoldcf.crawler.system.IdleCleanupThread.run(IdleCleanupThread.java:68)
> 
> I used MCF 2.8.1 on Debian 8 with Postgresql 9.5.3, Windows Share repository 
> connector. The job was configured to process about 2 millions of files  (600 
> GB). 
> For text extraction I used a Tika server (on the same server as MCF) and add 
> the Tika external content extractor transformation connector into the job 
> configuration.
> The error was present 9 hours after the job was launched. The status job 
> still indicated that the job was running but there was only 1 document in the 
> active column and the error above was repeated in the MCF log.
> 
> Then I tried to launch the clean-lock.sh script and I obtained this error :
> WARN 2017-10-09 08:23:56,284 (Idle cleanup thread) - 
> MCF|MCF-agent|apache.manifoldcf.lock|Attempt to set file lock 
> 'mcf/mcf_home/./syncharea/551/442/lock-_POOLTARGET__REPOSITORYCONNECTORPOOL_SmbFileShare.lock'
>  failed: No such file or directory
> java.io.IOException: No such file or directory
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1012)
> at 
> org.apache.manifoldcf.core.lockmanager.FileLockObject.grabFileLock(FileLockObject.java:223)
> at 
> org.apache.manifoldcf.core.lockmanager.FileLockObject.obtainGlobalWriteLockNoWait(FileLockObject.java:78)
> at 
> org.apache.manifoldcf.core.lockmanager.LockObject.obtainGlobalWriteLock(LockObject.java:121)
> at 
> org.apache.manifoldcf.core.lockmanager.LockObject.enterWriteLock(LockObject.java:74)
> at 
> org.apache.manifoldcf.core.lockmanager.LockGate.enterWriteLock(LockGate.java:177)
> at 
> org.apache.manifoldcf.core.lockmanager.BaseLockManager.enterWrite(BaseLockManager.java:1120)
> at 
> org.apache.manifoldcf.core.lockmanager.BaseLockManager.enterWriteLock(BaseLockManager.java:757)
> at 
> org.apache.manifoldcf.core.lockmanager.LockManager.enterWriteLock(LockManager.java:302)
> at 
> org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:585)
> at 
> org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> at 
> org.apache.manifoldcf.crawler.repositoryconnectorpool.RepositoryConnectorPool.pollAllConnectors(RepositoryConnectorPool.java:124)
> at 
> org.apache.manifoldcf.crawlerui.IdleCleanupThread.run(IdleCleanupThread.java:69)
> And the error was repeated indefinitely in the log.
> 
> Did it mean that there was a problem with the syncharea folder at some point ?
> 
> Thanks,
> Best regards,
> 
> Olivier TAVARD
> 
> 
> 
> -- 
> Cihad Güzel



Job error during WindowsShare repository connector indexation

2017-10-11 Thread Olivier Tavard
Hi,

I had this error while crawling a Samba share hosted on Ubuntu Server:
ERROR 2017-10-05 00:00:14,109 (Idle cleanup thread) - 
MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Exception tossed: Service 
'_ANON_0' of type '_REPOSITORYCONNECTORPOOL_SmbFileShare' is not active
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Service '_ANON_0' of 
type '_REPOSITORYCONNECTORPOOL_SmbFileShare' is not active
at 
org.apache.manifoldcf.core.lockmanager.BaseLockManager.updateServiceData(BaseLockManager.java:273)
at 
org.apache.manifoldcf.core.lockmanager.LockManager.updateServiceData(LockManager.java:108)
at 
org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:654)
at 
org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
at 
org.apache.manifoldcf.crawler.repositoryconnectorpool.RepositoryConnectorPool.pollAllConnectors(RepositoryConnectorPool.java:124)
at 
org.apache.manifoldcf.crawler.system.IdleCleanupThread.run(IdleCleanupThread.java:68)

I used MCF 2.8.1 on Debian 8 with PostgreSQL 9.5.3 and the Windows Share 
repository connector. The job was configured to process about 2 million files 
(600 GB).
For text extraction I used a Tika server (on the same server as MCF) and added 
the Tika external content extractor transformation connector to the job 
configuration.
The error appeared 9 hours after the job was launched. The job status still 
indicated that the job was running, but there was only 1 document in the 
active column and the error above was repeated in the MCF log.
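
For reference, the Tika server itself can be sanity-checked outside of MCF with 
a quick call like the one below. This is only a minimal sketch, assuming the 
default Tika server port 9998 on localhost and a hypothetical local file 
sample.docx; it is not part of the MCF job configuration.

// Minimal sketch: send one document to a Tika server and print the extracted text.
// Assumes the default Tika server endpoint http://localhost:9998/tika and a
// hypothetical test file sample.docx in the working directory.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class TikaServerCheck {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9998/tika"))
        .header("Accept", "text/plain")  // ask for plain-text extraction
        .PUT(HttpRequest.BodyPublishers.ofFile(Path.of("sample.docx")))
        .build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());  // 200 if extraction succeeded
    System.out.println(response.body());        // the extracted text
  }
}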

Then I tried to launch the clean-lock.sh script and I obtained this error:
WARN 2017-10-09 08:23:56,284 (Idle cleanup thread) - 
MCF|MCF-agent|apache.manifoldcf.lock|Attempt to set file lock 
'mcf/mcf_home/./syncharea/551/442/lock-_POOLTARGET__REPOSITORYCONNECTORPOOL_SmbFileShare.lock'
 failed: No such file or directory
java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:1012)
at 
org.apache.manifoldcf.core.lockmanager.FileLockObject.grabFileLock(FileLockObject.java:223)
at 
org.apache.manifoldcf.core.lockmanager.FileLockObject.obtainGlobalWriteLockNoWait(FileLockObject.java:78)
at 
org.apache.manifoldcf.core.lockmanager.LockObject.obtainGlobalWriteLock(LockObject.java:121)
at 
org.apache.manifoldcf.core.lockmanager.LockObject.enterWriteLock(LockObject.java:74)
at 
org.apache.manifoldcf.core.lockmanager.LockGate.enterWriteLock(LockGate.java:177)
at 
org.apache.manifoldcf.core.lockmanager.BaseLockManager.enterWrite(BaseLockManager.java:1120)
at 
org.apache.manifoldcf.core.lockmanager.BaseLockManager.enterWriteLock(BaseLockManager.java:757)
at 
org.apache.manifoldcf.core.lockmanager.LockManager.enterWriteLock(LockManager.java:302)
at 
org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:585)
at 
org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
at 
org.apache.manifoldcf.crawler.repositoryconnectorpool.RepositoryConnectorPool.pollAllConnectors(RepositoryConnectorPool.java:124)
at 
org.apache.manifoldcf.crawlerui.IdleCleanupThread.run(IdleCleanupThread.java:69)
And the error was repeated indefinitely in the log.

Does this mean that there was a problem with the syncharea folder at some point?
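
For context, the syncharea folder here is the file-based synchronization 
directory configured in properties.xml, roughly as in the excerpt below (the 
property name is given from memory and the path is just an example for this 
setup):

<!-- properties.xml excerpt (illustrative): the directory used for file-based
     lock synchronization; clean-lock.sh operates on the lock files kept there. -->
<property name="org.apache.manifoldcf.synchdirectory" value="/opt/mcf/mcf_home/syncharea"/>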

Thanks,
Best regards,

Olivier TAVARD


Re: Best practices for Postgresql configuration

2017-10-09 Thread Olivier Tavard

Hi,

Thanks for your answer.
Indeed, my estimate was not very precise regarding the data volume! I meant 
roughly 15 million files, for about 3 TB of data.

Thanks,

Olivier



> On 9 Oct 2017, at 18:17, user-h...@manifoldcf.apache.org wrote:
> 
> From: Karl Wright <daddy...@gmail.com>
> Subject: Re: Best practices for Postgresql configuration
> Date: 9 October 2017 at 16:58:25 UTC+2
> To: user@manifoldcf.apache.org
> 
> 
> Hi Olivier,
> 
> We've tried versions of Postgresql beyond 9.3, and they seem to work, but 
> there's always a possibility that the query plans will turn out badly.  But 
> this is unlikely.
> 
> The automatic vacuum operation in Postgresql has gotten much better over 
> time.  You do not need to pause MCF to do it, but you should expect things to 
> take longer while it is running.  If you do a full vacuum, however, most 
> operations will be blocked until it is done.
> 
> For further optimization, please let us know how many documents you are 
> indexing.  How big is "very large"?
> 
> Thanks,
> Karl
> 
> 
> On Mon, Oct 9, 2017 at 10:43 AM, Olivier Tavard 
> <olivier.tav...@francelabs.com> wrote:
> 
> Hi community,
> 
> I have some questions regarding PostgreSQL performance tuning.
> 
> I configured the MCF PostgreSQL database with the recommended parameters from 
> this page: 
> https://manifoldcf.apache.org/release/release-2.8.1/en_US/how-to-build-and-deploy.html#Configuring+a+PostgreSQL+database
> 
> But maybe some sections are outdated, and I would like to know if the 
> recommendations are still valid for the current version of MCF.
> 
> 1) The documentation says that MCF was tested with different versions of 
> PostgreSQL up to 9.3. Is it OK to run MCF with PostgreSQL versions beyond 9.3?
> I know this question comes up often on the mailing list, but it would be good 
> to know if people use newer versions in production without problems.
> There is also version 10.0, released a few days ago; has anyone already tested 
> it with MCF?
> 
> 2) Some parameters recommended for postgresql.conf no longer exist in newer 
> versions of PostgreSQL, for example checkpoint_segments (replaced in version 
> 9.5 by min_wal_size 
> <https://www.postgresql.org/docs/9.5/static/runtime-config-wal.html#GUC-MIN-WAL-SIZE> 
> and max_wal_size 
> <https://www.postgresql.org/docs/9.5/static/runtime-config-wal.html#GUC-MAX-WAL-SIZE>).
> Is there anything new regarding these parameters since the documentation was 
> written? Do you recommend new settings?
> 
> 3) Regarding the full vacuum operation, I imagine it is better to do it when 
> MCF is not busy, i.e. when no job is running? Do we need to pause the jobs in 
> MCF in order to do so, for example?
> And what is the recommended frequency for running it? Some people say once a 
> month, others once a day; I would be interested in your recommendation!
> 
> We have some customers with very large data volumes for the Windows Share 
> repository connector, and we are trying to tweak the PostgreSQL configuration 
> to increase MCF performance.
> So it would be great to know how MCF users optimize their PostgreSQL 
> configuration.
> 
> Thanks,
> 
> Olivier TAVARD



Best practices for Postgresql configuration

2017-10-09 Thread Olivier Tavard

Hi community,

I have some questions regarding PostgreSQL performance tuning.

I configured the MCF PostgreSQL database with the recommended parameters from 
this page: 
https://manifoldcf.apache.org/release/release-2.8.1/en_US/how-to-build-and-deploy.html#Configuring+a+PostgreSQL+database

But maybe some sections are outdated, and I would like to know if the 
recommendations are still valid for the current version of MCF.

1) The documentation says that MCF was tested with different versions of 
PostgreSQL up to 9.3. Is it OK to run MCF with PostgreSQL versions beyond 9.3?
I know this question comes up often on the mailing list, but it would be good 
to know if people use newer versions in production without problems.
There is also version 10.0, released a few days ago; has anyone already tested 
it with MCF?

2) Some parameters recommended for postgresql.conf no longer exist in newer 
versions of PostgreSQL, for example checkpoint_segments (replaced in version 
9.5 by min_wal_size 
<https://www.postgresql.org/docs/9.5/static/runtime-config-wal.html#GUC-MIN-WAL-SIZE> 
and max_wal_size 
<https://www.postgresql.org/docs/9.5/static/runtime-config-wal.html#GUC-MAX-WAL-SIZE>).
Is there anything new regarding these parameters since the documentation was 
written? Do you recommend new settings?
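
For illustration, the kind of replacement settings I have in mind in 
postgresql.conf would be something like this (placeholder values, not tested 
recommendations):

# postgresql.conf (PostgreSQL 9.5+) -- placeholder values, not tested recommendations
min_wal_size = 1GB                    # lower bound on WAL kept for recycling
max_wal_size = 4GB                    # soft limit before a checkpoint is forced
checkpoint_completion_target = 0.9    # spread checkpoint I/O over the checkpoint interval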

3) Regarding the full vacuum operation, I imagine it is better to do it when 
MCF is not busy, i.e. when no job is running? Do we need to pause the jobs in 
MCF in order to do so, for example?
And what is the recommended frequency for running it? Some people say once a 
month, others once a day; I would be interested in your recommendation!
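
By full vacuum I mean running something like the following against the MCF 
database (assuming the default database name "dbname"; adjust to whatever is 
configured in properties.xml):

  VACUUM FULL ANALYZE;                 -- from psql, connected to the MCF database
  vacuumdb --full --analyze dbname     -- equivalent, from the shell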

We have some customers with very large data volumes for the Windows Share 
repository connector, and we are trying to tweak the PostgreSQL configuration 
to increase MCF performance.
So it would be great to know how MCF users optimize their PostgreSQL 
configuration.

Thanks,

Olivier TAVARD


Windows share connector: fetch ACL for an incremental job

2017-05-02 Thread Olivier Tavard
Hi,

I have a question about the Windows Share connector, please.
During an incremental crawl of a file share with security enabled, it seems 
that the getSecurity method is called for each file even if the last modified 
date of the document is unchanged between the two crawls.
Does this mean that the last modified date of a file does not change after a 
modification of its ACLs? So the connector has to fetch the ACLs of the file in 
all cases (even if the date stored with the ingest status in the MCF database 
matches the date of the file), am I correct? Or is it done in two steps: first 
check the last modified date of the document, and only if it differs from the 
date stored in the MCF database, fetch the ACLs of the file and compare them 
with the ACLs stored in the MCF database?
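
In other words, I am trying to understand whether the incremental decision 
conceptually works like the sketch below, where the ACLs are part of the 
per-document version string (purely illustrative Java, not the actual JCIFS 
connector code; the helper methods are made up):

// Illustrative sketch only: if the ACLs are folded into the version string,
// they must be fetched for every file on every crawl, even when the
// modification date has not changed.
import java.util.Objects;

public class VersionStringSketch {

  // Hypothetical stand-ins for the SMB calls the real connector would make.
  static String lastModifiedOf(String path) { return "2017-05-02T10:00:00Z"; }
  static String aclsOf(String path)         { return "S-1-5-21-example:READ"; }

  public static void main(String[] args) {
    // What MCF stored for this document after the previous crawl.
    String previousVersion = "2017-05-02T10:00:00Z+S-1-5-21-example:READ";

    String path = "smb://server/share/doc.pdf";
    String newVersion = lastModifiedOf(path) + "+" + aclsOf(path);

    if (Objects.equals(previousVersion, newVersion)) {
      System.out.println("unchanged: document skipped");
    } else {
      System.out.println("date or ACLs changed: document re-fetched and re-indexed");
    }
  }
}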

Thanks,

Olivier TAVARD