Re: Does the ES plugin work for ES 5.5.x?

2017-09-05 Thread Steph van Schalkwyk
What a nightmare. Between the API changes and the lack of ES
documentation...




*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896  st...@remcam.net  http://remcam.net
Skype: svanschalkwyk  http://linkedin.com/in/vanschalkwyk

On Sun, Sep 3, 2017 at 8:39 AM, S <st...@remcam.net> wrote:

> Thanks Karl.
> I started last night. Will add my changes.
> S
> --
> From: Karl Wright <daddy...@gmail.com>
> Sent: ‎03/‎09/‎2017 04:24
> To: user@manifoldcf.apache.org
> Subject: Re: Does the ES plugin work for ES 5.5.x?
>
> I've set up a project for an es-5.5 plugin and done what I could without
> delving into the changes that were made to the API.  You can check it out
> at:
>
> https://svn.apache.org/repos/asf/manifoldcf/integration/elasticsearch-5.5/trunk
>
> I also created a ticket -- CONNECTORS-1454 -- that interested parties can
> attach patches to.  If you have revisions that might get us closer to
> having a version that builds, I would be happy to commit them.  I'll also
> see what kind of time I can muster over this long weekend to look at it but
> I wouldn't count on much.
>
> Thanks!!
> Karl
>
>
> On Sat, Sep 2, 2017 at 10:14 PM, Steph van Schalkwyk <st...@remcam.net>
> wrote:
>
>> I'll see if I can get it to work.
>> Initial glance at the API resulted in a "just wow" moment.
>> Everything has changed.
>> Steph
>>
>>
>>
>


Re: Question about ManifoldCF 2.8

2017-09-05 Thread Karl Wright
Hi Othman,

Thanks for doing the evaluation of the problem.

Generally, the ManifoldCF project does not have the expertise to diagnose
problems with external systems like Solr or Elasticsearch.  So going to
another newsgroup for those kinds of issues would be a good idea.

Thanks!
Karl


On Tue, Sep 5, 2017 at 4:33 AM, Beelz Ryuzaki  wrote:

> Hi Karl,
>
> I have analyzed the error and found out that it was mainly an
> Elasticsearch problem. I saw in some forums that one of the adopted
> solutions is to modify elasticsearch.yml and set http.max_content_length
> to a greater value. However, the job got stuck on the last two indexable
> files (two pptx files of 22 MB and 2 MB respectively). The job eventually
> ended, but a stack trace showed that Elasticsearch ran out of memory. For
> your information, I have allocated 4 GB for Elasticsearch. Is that enough
> for good performance? You will find the Elasticsearch stack traces
> attached.
>
> Best regards,
>
> Othman BELHAJ.
>
> On Mon, 4 Sep 2017 at 16:40, Beelz Ryuzaki  wrote:
>
>> Hi Karl,
>>
>> I'm sorry to bother you on your holiday. I will try to analyze it today
>> and let you know what I find. Enjoy your day!
>>
>> Best regards,
>>
>> Othman BELHAJ.
>>
>> On Mon, 4 Sep 2017 at 16:06, Karl Wright  wrote:
>>
>>> Hi Othman,
>>>
>>> I won't be able to look at this today; it is a holiday here.  But, the
>>> "socket write" error is coming from ElasticSearch.  If ES is configured to
>>> not accept documents greater than a certain size, that might explain it.
>>> Maybe the ES logs would help?
>>>
>>> I'm afraid you're going to need to do the work to find out what is going
>>> wrong in those cases now.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Mon, Sep 4, 2017 at 4:53 AM, Beelz Ryuzaki 
>>> wrote:
>>>
 Hi Karl,

 This morning I tried the ZooKeeper-based setup and it worked really
 well. However, I still have one error which is bugging me: a socket
 write error. You will find the simple history report attached.
 Surprisingly, I didn't have any stack trace in the ManifoldCF log file.

 Best regards,

 Othman.

 On Fri, 1 Sep 2017 at 19:39, Karl Wright  wrote:

> This is from file locking yet again.
>
> I have uploaded a new RC.  Please download and try out the zookeeper
> locking.
>
> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.8.1
>
> Karl
>
>
> On Fri, Sep 1, 2017 at 1:11 PM, Beelz Ryuzaki 
> wrote:
>
>> There is another issue as well that gives the following stack trace.
>>
>> Othman.
>>
>> On Fri, 1 Sep 2017 at 18:05, Beelz Ryuzaki 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> I took the binary from the ManifoldCF 2.8.1 RC0. It had version 3.9 of
>>> POI, and when I changed the version to 3.15 it worked fine. I really
>>> want to try ZooKeeper if, as you told me, its performance is better
>>> than the file-based example. For the time being I'm using the
>>> file-based setup because it is the only part that works for me, but I
>>> actually need a stable version for my production environment. That is
>>> one point.
>>> Another point is that the Paths tab is still an issue for me: I exclude
>>> some files and the job still crawls them. I want to exclude some
>>> specific file extensions and some specific directories. For instance, I
>>> don't want to index .exe files or files whose names contain a specific
>>> word, so I make the first exclude rule with *.exe and the second one
>>> with *word*. Only the second one doesn't work. How can I solve this
>>> issue, please?
>>>
>>> Thank you very much, have a nice weekend,
>>>
>>> Othman
>>> On Fri, 1 Sep 2017 at 16:46, Karl Wright  wrote:
>>>
 Hi Othman,

 I will respin a new 2.8.1 (RC1) to address the zookeeper issue.

 The failure you are seeing is "NoSuchMethodError".  Therefore, the
 class is being found, but it is the *wrong* class.  When you deployed
 the new release, did you deploy it in a new directory, or did you
 overwrite the previous deployment?  If you overwrote it, you probably
 have multiple versions of the POI jars.

 Karl


 On Fri, Sep 1, 2017 at 9:59 AM, Beelz Ryuzaki 
 wrote:

> Hi Karl,
>
> I have just tried the new release of ManifoldCF. At first, the
> first job ended normally, but on the second run I got a new stack trace
> concerning POI. Moreover, the runzookeeper.bat doesn't run
> properly. It

Re: Question about ManifoldCF 2.8

2017-09-04 Thread Beelz Ryuzaki
Hi Karl,

I'm sorry to bother you on your holiday. I will try to analyze it today and
let you know what I find. Enjoy your day!

Best regards,

Othman BELHAJ.

On Mon, 4 Sep 2017 at 16:06, Karl Wright  wrote:

> Hi Othman,
>
> I won't be able to look at this today; it is a holiday here.  But, the
> "socket write" error is coming from ElasticSearch.  If ES is configured to
> not accept documents greater than a certain size, that might explain it.
> Maybe the ES logs would help?
>
> I'm afraid you're going to need to do the work to find out what is going
> wrong in those cases now.
>
> Thanks,
> Karl
>
>
> On Mon, Sep 4, 2017 at 4:53 AM, Beelz Ryuzaki  wrote:
>
>> Hi Karl,
>>
>> This morning I tried the ZooKeeper-based setup and it worked really
>> well. However, I still have one error which is bugging me: a socket
>> write error. You will find the simple history report attached.
>> Surprisingly, I didn't have any stack trace in the ManifoldCF log file.
>>
>> Best regards,
>>
>> Othman.
>>
>> On Fri, 1 Sep 2017 at 19:39, Karl Wright  wrote:
>>
>>> This is from file locking yet again.
>>>
>>> I have uploaded a new RC.  Please download and try out the zookeeper
>>> locking.
>>>
>>> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.8.1
>>>
>>> Karl
>>>
>>>
>>> On Fri, Sep 1, 2017 at 1:11 PM, Beelz Ryuzaki 
>>> wrote:
>>>
 There is another issue as well that gives the following stack trace.

 Othman.

 On Fri, 1 Sep 2017 at 18:05, Beelz Ryuzaki  wrote:

> Hi Karl,
>
> I took the binary from the ManifoldCF 2.8.1 RC0. It had version 3.9 of
> POI, and when I changed the version to 3.15 it worked fine. I really
> want to try ZooKeeper if, as you told me, its performance is better than
> the file-based example. For the time being I'm using the file-based
> setup because it is the only part that works for me, but I actually need
> a stable version for my production environment. That is one point.
> Another point is that the Paths tab is still an issue for me: I exclude
> some files and the job still crawls them. I want to exclude some specific
> file extensions and some specific directories. For instance, I don't want
> to index .exe files or files whose names contain a specific word, so I
> make the first exclude rule with *.exe and the second one with *word*.
> Only the second one doesn't work. How can I solve this issue, please?
>
> Thank you very much, have a nice weekend,
> please?
>
> Thank you very much, have a nice week-end,
>
> Othman
> On Fri, 1 Sep 2017 at 16:46, Karl Wright  wrote:
>
>> Hi Othman,
>>
>> I will respin a new 2.8.1 (RC1) to address the zookeeper issue.
>>
>> The failure you are seeing is "NoSuchMethodError".  Therefore, the
>> class is being found, but it is the *wrong* class.  When you deployed
>> the new release, did you deploy it in a new directory, or did you
>> overwrite the previous deployment?  If you overwrote it, you probably
>> have multiple versions of the POI jars.
>>
>> Karl
>>
>>
>> On Fri, Sep 1, 2017 at 9:59 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> I have just tried the new release of ManifoldCF. At first, the first
>>> job ended normally, but on the second run I got a new stack trace
>>> concerning POI. Moreover, the runzookeeper.bat doesn't run properly;
>>> it shows me the stack trace attached.
>>>
>>> Ps:
>>> The second attached file contains the POI stack trace.
>>>
>>> Othman.
>>>
>>> On Fri, 1 Sep 2017 at 12:21, Karl Wright  wrote:
>>>
 Hi Othman,

 You do not need a new database instance.

 You can download MCF 2.8.1 RC0 from here:


 https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.8.1

 Karl


 On Fri, Sep 1, 2017 at 5:42 AM, Beelz Ryuzaki 
 wrote:

> Hi Karl,
>
> Thank you very much for your help, I'm going to try out the
> zookeeper example. Should I initialize a new database? And how can I 
> run
> the zookeeper start-agent ?
>
> Othman.
>
> On Fri, 1 Sep 2017 at 11:37, Karl Wright 
> wrote:
>
>> Hi Othman,
>>
>> These exceptions are now coming from file locking and are due to
>> permissions problems.  I suggest you go to Zookeeper for file locking.
>>
>> I am building a 2.8.1 release candidate.  When it is available for
>> download, I'll send you the URL.
>>
>> Thanks,
>> Karl
>>

Re: Question about ManifoldCF 2.8

2017-09-04 Thread Karl Wright
Hi Othman,

I won't be able to look at this today; it is a holiday here.  But, the
"socket write" error is coming from ElasticSearch.  If ES is configured to
not accept documents greater than a certain size, that might explain it.
Maybe the ES logs would help?
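For reference, the ES-side cap mentioned elsewhere in this thread is
http.max_content_length in elasticsearch.yml, and on ES 2.x the heap is set
through the ES_HEAP_SIZE environment variable. A sketch with illustrative
values only, not tuning advice:

# elasticsearch.yml -- raise the HTTP request-size cap (the default is 100mb);
# 200mb here is an illustrative value, not a recommendation
http.max_content_length: 200mb

# ES 2.x reads its heap size from the environment before startup, e.g.:
ES_HEAP_SIZE=4g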

I'm afraid you're going to need to do the work to find out what is going
wrong in those cases now.

Thanks,
Karl


On Mon, Sep 4, 2017 at 4:53 AM, Beelz Ryuzaki  wrote:

> Hi Karl,
>
> This morning I tried the ZooKeeper-based setup and it worked really
> well. However, I still have one error which is bugging me: a socket
> write error. You will find the simple history report attached.
> Surprisingly, I didn't have any stack trace in the ManifoldCF log file.
>
> Best regards,
>
> Othman.
>
> On Fri, 1 Sep 2017 at 19:39, Karl Wright  wrote:
>
>> This is from file locking yet again.
>>
>> I have uploaded a new RC.  Please download and try out the zookeeper
>> locking.
>>
>> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.8.1
>>
>> Karl
>>
>>
>> On Fri, Sep 1, 2017 at 1:11 PM, Beelz Ryuzaki 
>> wrote:
>>
>>> There is another issue as well that gives the following stack trace.
>>>
>>> Othman.
>>>
>>> On Fri, 1 Sep 2017 at 18:05, Beelz Ryuzaki  wrote:
>>>
 Hi Karl,

 I took the binary from the ManifoldCF 2.8.1 RC0. It had version 3.9 of
 POI, and when I changed the version to 3.15 it worked fine. I really
 want to try ZooKeeper if, as you told me, its performance is better than
 the file-based example. For the time being I'm using the file-based
 setup because it is the only part that works for me, but I actually need
 a stable version for my production environment. That is one point.
 Another point is that the Paths tab is still an issue for me: I exclude
 some files and the job still crawls them. I want to exclude some specific
 file extensions and some specific directories. For instance, I don't want
 to index .exe files or files whose names contain a specific word, so I
 make the first exclude rule with *.exe and the second one with *word*.
 Only the second one doesn't work. How can I solve this issue, please?

 Thank you very much, have a nice weekend,

 Othman
 On Fri, 1 Sep 2017 at 16:46, Karl Wright  wrote:

> Hi Othman,
>
> I will respin a new 2.8.1 (RC1) to address the zookeeper issue.
>
> The failure you are seeing is "NoSuchMethodError".  Therefore, the
> class is being found, but it is the *wrong* class.  When you deployed
> the new release, did you deploy it in a new directory, or did you
> overwrite the previous deployment?  If you overwrote it, you probably
> have multiple versions of the POI jars.
>
> Karl
>
>
> On Fri, Sep 1, 2017 at 9:59 AM, Beelz Ryuzaki 
> wrote:
>
>> Hi Karl,
>>
>> I have just tried the new release of ManifoldCF. At first, the first
>> job ended normally, but on the second run I got a new stack trace
>> concerning POI. Moreover, the runzookeeper.bat doesn't run properly;
>> it shows me the stack trace attached.
>>
>> Ps:
>> The second attached file contains the POI stack trace.
>>
>> Othman.
>>
>> On Fri, 1 Sep 2017 at 12:21, Karl Wright  wrote:
>>
>>> Hi Othman,
>>>
>>> You do not need a new database instance.
>>>
>>> You can download MCF 2.8.1 RC0 from here:
>>>
>>> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.8.1
>>>
>>> Karl
>>>
>>>
>>> On Fri, Sep 1, 2017 at 5:42 AM, Beelz Ryuzaki 
>>> wrote:
>>>
 Hi Karl,

 Thank you very much for your help, I'm going to try out the
 zookeeper example. Should I initialize a new database? And how can I 
 run
 the zookeeper start-agent ?

 Othman.

 On Fri, 1 Sep 2017 at 11:37, Karl Wright 
 wrote:

> Hi Othman,
>
> These exceptions are now coming from file locking and are due to
> permissions problems.  I suggest you go to Zookeeper for file locking.
>
> I am building a 2.8.1 release candidate.  When it is available for
> download, I'll send you the URL.
>
> Thanks,
> Karl
>
>
> On Fri, Sep 1, 2017 at 5:27 AM, Beelz Ryuzaki  > wrote:
>
>> Hi Karl,
>>
>> This morning, I have followed the steps you told me to do and I
>> still got stack traces. I have attached the stack traces as well as 
>> the
>> content of my lib repo and option.env.
>> I have installed zookeeper and I'm ready to use 

RE: Does the ES plugin work for ES 5.5.x?

2017-09-03 Thread S
Thanks Karl.
I started last night. Will add my changes.
S

-Original Message-
From: "Karl Wright" <daddy...@gmail.com>
Sent: ‎03/‎09/‎2017 04:24
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: Does the ES plugin work for ES 5.5.x?

I've set up a project for an es-5.5 plugin and done what I could without 
delving into the changes that were made to the API.  You can check it out at:


https://svn.apache.org/repos/asf/manifoldcf/integration/elasticsearch-5.5/trunk


I also created a ticket -- CONNECTORS-1454 -- that interested parties can 
attach patches to.  If you have revisions that might get us closer to having a 
version that builds, I would be happy to commit them.  I'll also see what kind 
of time I can muster over this long weekend to look at it but I wouldn't count 
on much.


Thanks!!
Karl




On Sat, Sep 2, 2017 at 10:14 PM, Steph van Schalkwyk <st...@remcam.net> wrote:

I'll see if I can get it to work.
Initial glance at the API resulted in a "just wow" moment. 
Everything has changed.
Steph

Re: Does the ES plugin work for ES 5.5.x?

2017-09-02 Thread Karl Wright
Hi Steph,

The version of ManifoldCF doesn't matter.

The ManifoldCF Plugin for ES 2.0 was coded to compile against ES 2.0.  It's
pretty easy to see if it compiles against 5.5 -- you just change a version
in the plugin's pom and rebuild.  Having said that, I have no idea what
APIs in ES may have changed.  I *think* 5.5 is really 2.5.5, so it may all
work.  Please give it a try and let us know what happens.
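For illustration, the version bump Karl describes would be a change to the ES
dependency in the plugin's pom.xml. A hypothetical fragment, since the actual
pom may declare the version through a property instead:

<!-- hypothetical pom.xml fragment: bump the compile-time ES dependency -->
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch</artifactId>
  <version>5.5.0</version>
</dependency>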

Thanks,
Karl


On Sat, Sep 2, 2017 at 12:58 PM, Steph van Schalkwyk 
wrote:

> Hi,
> Has anyone used the MCF ES Plugin on ES 5.5.x and MCF 2.8.z?
> Thanks
> Steph
>
>


Re: Question about ManifoldCF 2.8

2017-09-01 Thread Karl Wright
(1) I would create a ticket for the "*word*" exclusion.  It would be
helpful to include a screen shot of the view page of your job as well.
(2) I will be uploading a new ManifoldCF 2.8.1 RC shortly.

Karl
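As background on why *word* is expected to match: wildcard excludes of this
shape are conventionally expanded into regular expressions, with * matching
any run of characters. A minimal Java sketch of that convention (illustrative
only, not ManifoldCF's actual matcher):

import java.util.regex.Pattern;

public class WildcardCheck {
    // Turn a "*"/"?" wildcard into a case-insensitive regex.
    static Pattern toPattern(String wildcard) {
        String regex = ("\\Q" + wildcard + "\\E")
                .replace("*", "\\E.*\\Q")   // * -> any run of characters
                .replace("?", "\\E.\\Q");   // ? -> any single character
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        // The two excludes from the thread, applied to sample file names:
        System.out.println(toPattern("*.exe").matcher("setup.exe").matches());            // true
        System.out.println(toPattern("*word*").matcher("report-word-v2.pptx").matches()); // true
        System.out.println(toPattern("*word*").matcher("budget.xlsx").matches());         // false
    }
}

Under this convention both rules are well-formed, so the behavior Othman
reports looks like a bug or a configuration subtlety rather than a malformed
pattern.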



On Fri, Sep 1, 2017 at 12:05 PM, Beelz Ryuzaki  wrote:

> Hi Karl,
>
> I took the binary from the ManifoldCF 2.8.1 RC0. It had version 3.9 of
> POI, and when I changed the version to 3.15 it worked fine. I really
> want to try ZooKeeper if, as you told me, its performance is better than
> the file-based example. For the time being I'm using the file-based
> setup because it is the only part that works for me, but I actually need
> a stable version for my production environment. That is one point.
> Another point is that the Paths tab is still an issue for me: I exclude
> some files and the job still crawls them. I want to exclude some specific
> file extensions and some specific directories. For instance, I don't want
> to index .exe files or files whose names contain a specific word, so I
> make the first exclude rule with *.exe and the second one with *word*.
> Only the second one doesn't work. How can I solve this issue, please?
>
> Thank you very much, have a nice weekend,
>
> Othman
> On Fri, 1 Sep 2017 at 16:46, Karl Wright  wrote:
>
>> Hi Othman,
>>
>> I will respin a new 2.8.1 (RC1) to address the zookeeper issue.
>>
>> The failure you are seeing is "NoSuchMethodError".  Therefore, the class
>> is being found, but it is the *wrong* class.  When you deployed the new
>> release, did you deploy it in a new directory, or did you overwrite the
>> previous deployment?  If you overwrote it, you probably have multiple
>> versions of the POI jars.
>>
>> Karl
>>
>>
>> On Fri, Sep 1, 2017 at 9:59 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> I have just tried the new release of ManifoldCF. At first, the first job
>>> ended normally, but on the second run I got a new stack trace concerning
>>> POI. Moreover, the runzookeeper.bat doesn't run properly; it shows me the
>>> stack trace attached.
>>>
>>> Ps:
>>> The second attached file contains the POI stack trace.
>>>
>>> Othman.
>>>
>>> On Fri, 1 Sep 2017 at 12:21, Karl Wright  wrote:
>>>
 Hi Othman,

 You do not need a new database instance.

 You can download MCF 2.8.1 RC0 from here:

 https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.8.1

 Karl


 On Fri, Sep 1, 2017 at 5:42 AM, Beelz Ryuzaki 
 wrote:

> Hi Karl,
>
> Thank you very much for your help, I'm going to try out the zookeeper
> example. Should I initialize a new database? And how can I run the
> zookeeper start-agent ?
>
> Othman.
>
> On Fri, 1 Sep 2017 at 11:37, Karl Wright  wrote:
>
>> Hi Othman,
>>
>> These exceptions are now coming from file locking and are due to
>> permissions problems.  I suggest you go to Zookeeper for file locking.
>>
>> I am building a 2.8.1 release candidate.  When it is available for
>> download, I'll send you the URL.
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Sep 1, 2017 at 5:27 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> This morning, I followed the steps you told me to do and I still
>>> got stack traces. I have attached the stack traces as well as the
>>> content of my lib directory and options.env.
>>> I have installed ZooKeeper and I'm ready to use the ZooKeeper
>>> example. Could you guide me through it? I don't know whether, if I
>>> follow the same steps as in the file-based example, I will still get
>>> stack traces.
>>>
>>> Thanks,
>>> Othman
>>>
>>> On Thu, 31 Aug 2017 at 18:19, Karl Wright 
>>> wrote:
>>>
 Please do the following:

 (0) Shut down all ManifoldCF processes.
 (1) Move poi*.jar from connector-common-lib to lib.
 (2) Move dom4j*.jar from connector-common-lib to lib.
 (3) Move commons-collections4*.jar from connector-common-lib to lib.
 (4) Move xmlbeans*.jar from connector-common-lib to lib.
 (5) Move curvesapi*.jar from connector-common-lib to lib.
 (6) Modify your options.env to include all of the jars you moved.
 (7) Start up all ManifoldCF processes.
 (8) If you still get stack traces, please send them to me.

 Karl
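On Windows (the thread's options.env.win and runzookeeper.bat suggest a
Windows deployment), steps (1) through (5) above could be scripted roughly as
follows; the paths are illustrative and assume the commands run from the
example directory:

rem Illustrative sketch of steps (1)-(5); adjust paths to your layout.
move connector-common-lib\poi*.jar lib\
move connector-common-lib\dom4j*.jar lib\
move connector-common-lib\commons-collections4*.jar lib\
move connector-common-lib\xmlbeans*.jar lib\
move connector-common-lib\curvesapi*.jar lib\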


 On Thu, Aug 31, 2017 at 12:12 PM, Beelz Ryuzaki <
 i93oth...@gmail.com> wrote:

> Hi Karl,
>
> By 'other place', do you mean the \lib directory? If so,
> then I have already tried it and it didn't work.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 18:07, Karl Wright 
> wrote:
>
>> Hi 

Re: Question about ManifoldCF 2.8

2017-09-01 Thread Karl Wright
Hi Othman,

I will respin a new 2.8.1 (RC1) to address the zookeeper issue.

The failure you are seeing is "NoSuchMethodError".  Therefore, the class is
being found, but it is the *wrong* class.  When you deployed the new
release, did you deploy it in a new directory, or did you overwrite the
previous deployment?  If you overwrote it, you probably have multiple
versions of the POI jars.

Karl
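A quick way to spot the duplicate-jar condition Karl describes is to list
every POI jar under the deployment root; an illustrative check on Windows:

rem Lists all POI jars under the current directory tree; the same artifact
rem appearing at two different versions is the telltale sign.
dir /s /b poi*.jar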


On Fri, Sep 1, 2017 at 9:59 AM, Beelz Ryuzaki  wrote:

> Hi Karl,
>
> I have just tried the new release of ManifoldCF. At first, the first job
> ended normally, but on the second run I got a new stack trace concerning
> POI. Moreover, the runzookeeper.bat doesn't run properly; it shows me the
> stack trace attached.
>
> Ps:
> The second attached file contains the POI stack trace.
>
> Othman.
>
> On Fri, 1 Sep 2017 at 12:21, Karl Wright  wrote:
>
>> Hi Othman,
>>
>> You do not need a new database instance.
>>
>> You can download MCF 2.8.1 RC0 from here:
>>
>> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.8.1
>>
>> Karl
>>
>>
>> On Fri, Sep 1, 2017 at 5:42 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> Thank you very much for your help, I'm going to try out the zookeeper
>>> example. Should I initialize a new database? And how can I run the
>>> zookeeper start-agent ?
>>>
>>> Othman.
>>>
>>> On Fri, 1 Sep 2017 at 11:37, Karl Wright  wrote:
>>>
 Hi Othman,

 These exceptions are now coming from file locking and are due to
 permissions problems.  I suggest you go to Zookeeper for file locking.

 I am building a 2.8.1 release candidate.  When it is available for
 download, I'll send you the URL.

 Thanks,
 Karl


 On Fri, Sep 1, 2017 at 5:27 AM, Beelz Ryuzaki 
 wrote:

> Hi Karl,
>
> This morning, I followed the steps you told me to do and I still
> got stack traces. I have attached the stack traces as well as the content
> of my lib directory and options.env.
> I have installed ZooKeeper and I'm ready to use the ZooKeeper example.
> Could you guide me through it? I don't know whether, if I follow the same
> steps as in the file-based example, I will still get stack traces.
>
> Thanks,
> Othman
>
> On Thu, 31 Aug 2017 at 18:19, Karl Wright  wrote:
>
>> Please do the following:
>>
>> (0) Shut down all ManifoldCF processes.
>> (1) Move poi*.jar from connector-common-lib to lib.
>> (2) Move dom4j*.jar from connector-common-lib to lib.
>> (3) Move commons-collections4*.jar from connector-common-lib to lib.
>> (4) Move xmlbeans*.jar from connector-common-lib to lib.
>> (5) Move curvesapi*.jar from connector-common-lib to lib.
>> (6) Modify your options.env to include all of the jars you moved.
>> (7) Start up all ManifoldCF processes.
>> (8) If you still get stack traces, please send them to me.
>>
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 12:12 PM, Beelz Ryuzaki 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> By 'other place', do you mean the \lib directory? If so, then
>>> I have already tried it and it didn't work.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 18:07, Karl Wright 
>>> wrote:
>>>
 Hi Othman,

 I used the java dependency inspector to see what the issue is and
 it turns out that poi-ooxml.jar does refer back to poi.jar in the class
 that is failing.  So you will need to move poi-3.15.jar and
 commons-collections4-4.1.jar to the other place as well.

 Let's hope that finally fixes this issue.

 I'm very unhappy about the quality of the POI project code; it is
 definitely not using reasonable engineering practices, and I will be
 opening a ticket with them.

 Thanks,
 Karl


 On Thu, Aug 31, 2017 at 11:57 AM, Beelz Ryuzaki <
 i93oth...@gmail.com> wrote:

> I'm using the file-based example and all the changes you told me
> to do; I reproduced them in the file-based example. I'll try to
> install ZooKeeper and use the ZooKeeper example. Will I need to do
> any configuration in order to run the ZooKeeper example?
>
> Othman.
>
> On Thu, 31 Aug 2017 at 17:46, Karl Wright 
> wrote:
>
>> Are you using the zookeeper example, or the file-based example?
>>
>> If these jars have all been moved, and the options.env includes
>> them, then I have to conclude that Apache POI's pom.xml is incorrect
>> too. It will take a while to figure out what's missing that
>> poi-ooxml.jar needs that is 

Re: Question about ManifoldCF 2.8

2017-09-01 Thread Karl Wright
Hi Othman,

You do not need a new database instance.

You can download MCF 2.8.1 RC0 from here:

https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.8.1

Karl


On Fri, Sep 1, 2017 at 5:42 AM, Beelz Ryuzaki  wrote:

> Hi Karl,
>
> Thank you very much for your help, I'm going to try out the zookeeper
> example. Should I initialize a new database? And how can I run the
> zookeeper start-agent ?
>
> Othman.
>
> On Fri, 1 Sep 2017 at 11:37, Karl Wright  wrote:
>
>> Hi Othman,
>>
>> These exceptions are now coming from file locking and are due to
>> permissions problems.  I suggest you go to Zookeeper for file locking.
>>
>> I am building a 2.8.1 release candidate.  When it is available for
>> download, I'll send you the URL.
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Sep 1, 2017 at 5:27 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> This morning, I followed the steps you told me to do and I still
>>> got stack traces. I have attached the stack traces as well as the content
>>> of my lib directory and options.env.
>>> I have installed ZooKeeper and I'm ready to use the ZooKeeper example.
>>> Could you guide me through it? I don't know whether, if I follow the same
>>> steps as in the file-based example, I will still get stack traces.
>>>
>>> Thanks,
>>> Othman
>>>
>>> On Thu, 31 Aug 2017 at 18:19, Karl Wright  wrote:
>>>
 Please do the following:

 (0) Shut down all ManifoldCF processes.
 (1) Move poi*.jar from connector-common-lib to lib.
 (2) Move dom4j*.jar from connector-common-lib to lib.
 (3) Move commons-collections4*.jar from connector-common-lib to lib.
 (4) Move xmlbeans*.jar from connector-common-lib to lib.
 (5) Move curvesapi*.jar from connector-common-lib to lib.
 (6) Modify your options.env to include all of the jars you moved.
 (7) Start up all ManifoldCF processes.
 (8) If you still get stack traces, please send them to me.

 Karl


 On Thu, Aug 31, 2017 at 12:12 PM, Beelz Ryuzaki 
 wrote:

> Hi Karl,
>
> By 'other place', do you mean the \lib directory? If so, then I
> have already tried it and it didn't work.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 18:07, Karl Wright  wrote:
>
>> Hi Othman,
>>
>> I used the java dependency inspector to see what the issue is and it
>> turns out that poi-ooxml.jar does refer back to poi.jar in the class that
>> is failing.  So you will need to move poi-3.15.jar and
>> commons-collections4-4.1.jar to the other place as well.
>>
>> Let's hope that finally fixes this issue.
>>
>> I'm very unhappy about the quality of the POI project code; it is
>> definitely not using reasonable engineering practices, and I will be
>> opening a ticket with them.
>>
>> Thanks,
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 11:57 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> I'm using the file-based example and all the changes you told me to
>>> do; I reproduced them in the file-based example. I'll try to install
>>> ZooKeeper and use the ZooKeeper example. Will I need to do any
>>> configuration in order to run the ZooKeeper example?
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 17:46, Karl Wright 
>>> wrote:
>>>
 Are you using the zookeeper example, or the file-based example?

 If these jars have all been moved, and the options.env includes
 them, then I have to conclude that Apache POI's pom.xml is incorrect
 too. It will take a while to figure out what's missing that
 poi-ooxml.jar needs that is not listed.

 Karl


 On Thu, Aug 31, 2017 at 11:39 AM, Beelz Ryuzaki <
 i93oth...@gmail.com> wrote:

> All the dependencies you mentioned have already been added in the
> options.env.win file in the multiprocess-file-example repository.
>
> On Thu, 31 Aug 2017 at 17:33, Beelz Ryuzaki 
> wrote:
>
>> Yes, I added it in the options.env.win file. Should it be the one
>> in the multiprocess-zk-example directory or the
>> multiprocess-file-example one?
>>
>> On Thu, 31 Aug 2017 at 17:30, Karl Wright 
>> wrote:
>>
>>> It's not related at all to elasticsearch.
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki <
>>> i93oth...@gmail.com> wrote:
>>>
 Could it be a problem with the Elasticsearch version? I'm actually
 using 2.1.0, which is pretty old for this new version of ManifoldCF.

 Othman.

 On Thu, 31 Aug 2017 

Re: Question about ManifoldCF 2.8

2017-09-01 Thread Beelz Ryuzaki
Hi Karl,

Thank you very much for your help; I'm going to try out the zookeeper
example. Should I initialize a new database? And how can I run the
zookeeper start-agent?

Othman.

On Fri, 1 Sep 2017 at 11:37, Karl Wright  wrote:

> Hi Othman,
>
> These exceptions are now coming from file locking and are due to
> permissions problems.  I suggest you go to Zookeeper for file locking.
>
> I am building a 2.8.1 release candidate.  When it is available for download,
> I'll send you the URL.
>
> Thanks,
> Karl
>
>
> On Fri, Sep 1, 2017 at 5:27 AM, Beelz Ryuzaki  wrote:
>
>> Hi Karl,
>>
>> This morning, I followed the steps you told me to do and I still got
>> stack traces. I have attached the stack traces as well as the content of my
>> lib directory and options.env.
>> I have installed ZooKeeper and I'm ready to use the ZooKeeper example.
>> Could you guide me through it? I don't know whether, if I follow the same
>> steps as in the file-based example, I will still get stack traces.
>>
>> Thanks,
>> Othman
>>
>> On Thu, 31 Aug 2017 at 18:19, Karl Wright  wrote:
>>
>>> Please do the following:
>>>
>>> (0) Shut down all ManifoldCF processes.
>>> (1) Move poi*.jar from connector-common-lib to lib.
>>> (2) Move dom4j*.jar from connector-common-lib to lib.
>>> (3) Move commons-collections4*.jar from connector-common-lib to lib.
>>> (4) Move xmlbeans*.jar from connector-common-lib to lib.
>>> (5) Move curvesapi*.jar from connector-common-lib to lib.
>>> (6) Modify your options.env to include all of the jars you moved.
>>> (7) Start up all ManifoldCF processes.
>>> (8) If you still get stack traces, please send them to me.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 12:12 PM, Beelz Ryuzaki 
>>> wrote:
>>>
 Hi Karl,

 By 'other place', do you mean the \lib directory? If so, then I
 have already tried it and it didn't work.

 Othman.

 On Thu, 31 Aug 2017 at 18:07, Karl Wright  wrote:

> Hi Othman,
>
> I used the java dependency inspector to see what the issue is and it
> turns out that poi-ooxml.jar does refer back to poi.jar in the class that
> is failing.  So you will need to move poi-3.15.jar and
> commons-collections4-4.1.jar to the other place as well.
>
> Let's hope that finally fixes this issue.
>
> I'm very unhappy about the quality of the POI project code; it is
> definitely not using reasonable engineering practices, and I will be
> opening a ticket with them.
>
> Thanks,
> Karl
>
>
> On Thu, Aug 31, 2017 at 11:57 AM, Beelz Ryuzaki 
> wrote:
>
>> I'm using the file-based example and all the changes you told me to
>> do; I reproduced them in the file-based example. I'll try to install
>> ZooKeeper and use the ZooKeeper example. Will I need to do any
>> configuration in order to run the ZooKeeper example?
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 17:46, Karl Wright  wrote:
>>
>>> Are you using the zookeeper example, or the file-based example?
>>>
>>> If these jars have all been moved, and the options.env includes
>>> them, then I have to conclude that Apache POI's pom.xml is incorrect
>>> too. It will take a while to figure out what's missing that
>>> poi-ooxml.jar needs that is not listed.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 11:39 AM, Beelz Ryuzaki >> > wrote:
>>>
 All the dependencies you mentioned have already been added in the
 options.env.win file in the multiprocess-file-example repository.

 On Thu, 31 Aug 2017 at 17:33, Beelz Ryuzaki 
 wrote:

> Yes, I added it in the options.env.win file. Should it be the one
> in the multiprocess-zk-example directory or the multiprocess-file-example one?
>
> On Thu, 31 Aug 2017 at 17:30, Karl Wright 
> wrote:
>
>> It's not related at all to elasticsearch.
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki <
>> i93oth...@gmail.com> wrote:
>>
>>> Could it be a problem with the Elasticsearch version? I'm actually
>>> using 2.1.0, which is pretty old for this new version of ManifoldCF.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki 
>>> wrote:
>>>
 I moved back both the jars you mentioned and a different error is
 showing. You will find the stack trace attached.

 Thanks,
 Othman

 On Thu, 31 Aug 2017 at 17:09, Karl Wright 
 wrote:

> I've looked at the 

Re: Question about ManifoldCF 2.8

2017-09-01 Thread Karl Wright
Hi Othman,

These exceptions are now coming from file locking and are due to
permissions problems.  I suggest you go to Zookeeper for file locking.

I am building a 2.8.1 release candidate.  When it is available for download,
I'll send you the URL.

Thanks,
Karl


On Fri, Sep 1, 2017 at 5:27 AM, Beelz Ryuzaki  wrote:

> Hi Karl,
>
> This morning, I followed the steps you told me to do and I still got
> stack traces. I have attached the stack traces as well as the content of my
> lib directory and options.env.
> I have installed ZooKeeper and I'm ready to use the ZooKeeper example.
> Could you guide me through it? I don't know whether, if I follow the same
> steps as in the file-based example, I will still get stack traces.
>
> Thanks,
> Othman
>
> On Thu, 31 Aug 2017 at 18:19, Karl Wright  wrote:
>
>> Please do the following:
>>
>> (0) Shut down all ManifoldCF processes.
>> (1) Move poi*.jar from connector-common-lib to lib.
>> (2) Move dom4j*.jar from connector-common-lib to lib.
>> (3) Move commons-collections4*.jar from connector-common-lib to lib.
>> (4) Move xmlbeans*.jar from connector-common-lib to lib.
>> (5) Move curvesapi*.jar from connector-common-lib to lib.
>> (6) Modify your options.env to include all of the jars you moved.
>> (7) Start up all ManifoldCF processes.
>> (8) If you still get stack traces, please send them to me.
>>
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 12:12 PM, Beelz Ryuzaki 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> By 'other place', do you mean the \lib directory? If so, then I
>>> have already tried it and it didn't work.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 18:07, Karl Wright  wrote:
>>>
 Hi Othman,

 I used the java dependency inspector to see what the issue is and it
 turns out that poi-ooxml.jar does refer back to poi.jar in the class that
 is failing.  So you will need to move poi-3.15.jar and
 commons-collections4-4.1.jar to the other place as well.

 Let's hope that finally fixes this issue.

 I'm very unhappy about the quality of the POI project code; it is
 definitely not using reasonable engineering practices, and I will be
 opening a ticket with them.

 Thanks,
 Karl


 On Thu, Aug 31, 2017 at 11:57 AM, Beelz Ryuzaki 
 wrote:

> I'm using the file-based example and all the changes you told me to
> do; I reproduced them in the file-based example. I'll try to install
> ZooKeeper and use the ZooKeeper example. Will I need to do any
> configuration in order to run the ZooKeeper example?
>
> Othman.
>
> On Thu, 31 Aug 2017 at 17:46, Karl Wright  wrote:
>
>> Are you using the zookeeper example, or the file-based example?
>>
>> If these jars have all been moved, and the options.env includes them,
>> then I have to conclude that Apache POI's pom.xml is incorrect too.  It
>> will take a while to figure out what's missing that poi-ooxml.jar needs
>> that is not listed.
>>
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 11:39 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> All the dependencies you mentioned have already been added in the
>>> options.env.win file in the multiprocess-file-example repository.
>>>
>>> On Thu, 31 Aug 2017 at 17:33, Beelz Ryuzaki 
>>> wrote:
>>>
 Yes, I added it in the options.env.win file. Should it be the one
 in the multiprocess-zk-example directory or the multiprocess-file-example one?

 On Thu, 31 Aug 2017 at 17:30, Karl Wright 
 wrote:

> It's not related at all to elasticsearch.
> Karl
>
>
> On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki <
> i93oth...@gmail.com> wrote:
>
>> Could it be a problem with the Elasticsearch version? I'm actually
>> using 2.1.0, which is pretty old for this new version of ManifoldCF.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki 
>> wrote:
>>
>>> I moved back both the jars you mentioned and a different error is
>>> showing. You will find the stack trace attached.
>>>
>>> Thanks,
>>> Othman
>>>
>>> On Thu, 31 Aug 2017 at 17:09, Karl Wright 
>>> wrote:
>>>
 I've looked at the dependencies; you should not have moved
 poi-3.15.jar.  Please move that back, and 
 commons-collections4-4.1.jar too.

 You *will* need to move curvesapi-1.04.jar though.

 Thanks,
 Karl


 On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <
 daddy...@gmail.com> wrote:


Re: Question about ManifoldCF 2.8

2017-08-31 Thread Beelz Ryuzaki
Hi Karl,

By 'other place', do you mean the \lib directory? If so, then I have
already tried it and it didn't work.

Othman.

On Thu, 31 Aug 2017 at 18:07, Karl Wright  wrote:

> Hi Othman,
>
> I used the java dependency inspector to see what the issue is and it turns
> out that poi-ooxml.jar does refer back to poi.jar in the class that is
> failing.  So you will need to move poi-3.15.jar and
> commons-collections4-4.1.jar to the other place as well.
>
> Let's hope that finally fixes this issue.
>
> I'm very unhappy about the quality of the POI project code; it is
> definitely not using reasonable engineering practices, and I will be
> opening a ticket with them.
>
> Thanks,
> Karl
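Karl doesn't name the "java dependency inspector" he used; jdeps, which ships
with JDK 8, performs this kind of analysis. An illustrative invocation,
assuming the jar sits in lib\:

rem Prints, class by class, which packages poi-ooxml depends on (JDK 8+).
jdeps -verbose:class lib\poi-ooxml-3.15.jar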
>
>
> On Thu, Aug 31, 2017 at 11:57 AM, Beelz Ryuzaki 
> wrote:
>
>> I'm using the file-based example and all the changes you told me to do; I
>> reproduced them in the file-based example. I'll try to install ZooKeeper
>> and use the ZooKeeper example. Will I need to do any configuration in
>> order to run the ZooKeeper example?
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 17:46, Karl Wright  wrote:
>>
>>> Are you using the zookeeper example, or the file-based example?
>>>
>>> If these jars have all been moved, and the options.env includes them,
>>> then I have to conclude that Apache POI's pom.xml is incorrect too.  It
>>> will take a while to figure out what's missing that poi-ooxml.jar needs
>>> that is not listed.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 11:39 AM, Beelz Ryuzaki 
>>> wrote:
>>>
 All the dependencies you mentioned have already been added in the
 options.env.win file in the multiprocess-file-example repository.

 On Thu, 31 Aug 2017 at 17:33, Beelz Ryuzaki 
 wrote:

> Yes, I added it in the options.env.win file. Should it be the one in
> the multiprocess-zk-example directory or the multiprocess-file-example one?
>
> On Thu, 31 Aug 2017 at 17:30, Karl Wright  wrote:
>
>> It's not related at all to elasticsearch.
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> Could it be a problem with the Elasticsearch version? I'm actually
>>> using 2.1.0, which is pretty old for this new version of ManifoldCF.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki 
>>> wrote:
>>>
 I moved back both the jars you mentioned and a different error is
 showing. You will find the stack trace attached.

 Thanks,
 Othman

 On Thu, 31 Aug 2017 at 17:09, Karl Wright 
 wrote:

> I've looked at the dependencies; you should not have moved
> poi-3.15.jar.  Please move that back, and 
> commons-collections4-4.1.jar too.
>
> You *will* need to move curvesapi-1.04.jar though.
>
> Thanks,
> Karl
>
>
> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright 
> wrote:
>
>> If you include poi.jar, then all dependencies of poi.jar must
>> also be included.  This would mean that curvesapi-1.04.jar and
>> commons-collections4-4.1.jar should also be included.
>>
>> Karl
>>
>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <
>> i93oth...@gmail.com> wrote:
>>
>>> Hi Karl,
>>>
>>> I added the two jars that you have mentioned and another one:
>>> poi-3.15.jar. Unfortunately, there is another error showing. This
>>> time it concerns Excel files. You will find the stack trace attached.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright 
>>> wrote:
>>>
 Hi Othman,

 Yes, this shows that the jar we moved calls back into another
 jar, which will also need to be moved.  *That* jar has yet another
 dependency too.

 The list of jars is thus extended to include:

 poi-ooxml-3.15.jar
 dom4j-1.6.1.jar

 Karl


 On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <
 i93oth...@gmail.com> wrote:

> You will find attached the stack trace. My apologies for the
> bad quality of the image, I'm doing my best to send you the stack 
> trace as
> I don't have the right to send documents outside the company.
>
> Thank you for your time,
>
> Othman
>
> On Thu, 31 Aug 2017 at 15:16, Karl Wright 
> wrote:
>

Re: Question about ManifoldCF 2.8

2017-08-31 Thread Beelz Ryuzaki
All the dependencies you mentioned have already been added in the
options.env.win file in the multiprocess-file-example directory.

On Thu, 31 Aug 2017 at 17:33, Beelz Ryuzaki  wrote:

> Yes, I added it in the options.env.win file. Should it be the one in the
> multiprocess-zk-example directory or the multiprocess-file-example one?
>
> On Thu, 31 Aug 2017 at 17:30, Karl Wright  wrote:
>
>> It's not related at all to elasticsearch.
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> Could it be a problem with the Elasticsearch version? I'm actually using
>>> 2.1.0, which is pretty old for this new version of ManifoldCF.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki  wrote:
>>>
 I moved back both the jars you mentioned and a different error is
 showing. You will find the stack trace attached.

 Thanks,
 Othman

 On Thu, 31 Aug 2017 at 17:09, Karl Wright  wrote:

> I've looked at the dependencies; you should not have moved
> poi-3.15.jar.  Please move that back, and commons-collections4-4.1.jar 
> too.
>
> You *will* need to move curvesapi-1.04.jar though.
>
> Thanks,
> Karl
>
>
> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright 
> wrote:
>
>> If you include poi.jar, then all dependencies of poi.jar must also be
>> included.  This would mean that curvesapi-1.04.jar and
>> commons-collections4-4.1.jar should also be included.
>>
>> Karl
>>
>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> I added the two jars that you have mentioned and another one:
>>> poi-3.15.jar. Unfortunately, there is another error showing. This
>>> time it concerns Excel files. You will find the stack trace attached.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright 
>>> wrote:
>>>
 Hi Othman,

 Yes, this shows that the jar we moved calls back into another jar,
 which will also need to be moved.  *That* jar has yet another
 dependency too.

 The list of jars is thus extended to include:

 poi-ooxml-3.15.jar
 dom4j-1.6.1.jar

 Karl


 On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki  wrote:

> You will find attached the stack trace. My apologies for the bad
> quality of the image, I'm doing my best to send you the stack trace 
> as I
> don't have the right to send documents outside the company.
>
> Thank you for your time,
>
> Othman
>
> On Thu, 31 Aug 2017 at 15:16, Karl Wright 
> wrote:
>
>> Once again, I need a stack trace to diagnose what the problem is.
>>
>> Thanks,
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <
>> i93oth...@gmail.com> wrote:
>>
>>> Oh, actually it didn't solve the problem. I looked into the log
>>> file and saw the following error:
>>>
>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>
>>> Maybe another jar is missing ?
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki 
>>> wrote:
>>>
 I have tried what you told me to do, and as you expected the
 crawling resumed. How about the regular expressions? How can I make
 complex regular expressions in the job's Paths tab?

 Thank you very much for your help.

 Othman.


 On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <
 i93oth...@gmail.com> wrote:

> Ok, I will try it right away and let you know if it works.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 14:15, Karl Wright 
> wrote:
>
>> Oh, and you also may need to edit your options.env files to
>> include them in the classpath for startup.
>>
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <
>> daddy...@gmail.com> wrote:
>>
>>> If you are amenable, there is another workaround you could
>>> try.  Specifically:
>>>
>>> (1) Shut down all MCF processes.
>>> (2) Move the following two files from 

Re: Question about ManifoldCF 2.8

2017-08-31 Thread Karl Wright
These are the five jars that dependency analysis said should be needed:

poi*.jar                     // both poi-ooxml and poi-ooxml-schemas
dom4j*.jar
commons-collections4*.jar
xmlbeans*.jar
curvesapi*.jar

Don't do any other jars than these, but DO make sure all five jars are
moved.

Thanks!
Karl


On Thu, Aug 31, 2017 at 11:30 AM, Karl Wright  wrote:

> It's not related at all to elasticsearch.
> Karl
>
>
> On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki 
> wrote:
>
>> Could it be a problem with the Elasticsearch version? I'm actually using
>> 2.1.0, which is pretty old for this new version of ManifoldCF.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki  wrote:
>>
>>> I moved back both the jars you mentioned and a different error is
>>> showing. You will find the stack trace attached.
>>>
>>> Thanks,
>>> Othman
>>>
>>> On Thu, 31 Aug 2017 at 17:09, Karl Wright  wrote:
>>>
 I've looked at the dependencies; you should not have moved
 poi-3.15.jar.  Please move that back, and commons-collections4-4.1.jar too.

 You *will* need to move curvesapi-1.04.jar though.

 Thanks,
 Karl


 On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright 
 wrote:

> If you include poi.jar, then all dependencies of poi.jar must also be
> included.  This would mean that curvesapi-1.04.jar and
> commons-collections4-4.1.jar should also be included.
>
> Karl
>
> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki 
> wrote:
>
>> Hi Karl,
>>
>> I added the two jars that you have mentioned and another one:
>> poi-3.15.jar. Unfortunately, there is another error showing. This
>> time it concerns Excel files. You will find the stack trace attached.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 15:32, Karl Wright  wrote:
>>
>>> Hi Othman,
>>>
>>> Yes, this shows that the jar we moved calls back into another jar,
>>> which will also need to be moved.  *That* jar has yet another dependency
>>> too.
>>>
>>> The list of jars is thus extended to include:
>>>
>>> poi-ooxml-3.15.jar
>>> dom4j-1.6.1.jar
>>>
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki 
>>> wrote:
>>>
 You will find attached the stack trace. My apologies for the bad
 quality of the image; I'm doing my best to send you the stack trace as
 I don't have the right to send documents outside the company.

 Thank you for your time,

 Othman

 On Thu, 31 Aug 2017 at 15:16, Karl Wright 
 wrote:

> Once again, I need a stack trace to diagnose what the problem is.
>
> Thanks,
> Karl
>
>
> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <
> i93oth...@gmail.com> wrote:
>
>> Oh, actually it didn't solve the problem. I looked into the log
>> file and saw the following error:
>>
>> Error tossed : org/apache/poi/POIXMLTypeLoader
>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>
>> Maybe another jar is missing ?
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki 
>> wrote:
>>
>>> I have tried what you told me to do, and as you expected the
>>> crawling resumed. How about the regular expressions? How can I make
>>> complex regular expressions in the job's Paths tab?
>>>
>>> Thank you very much for your help.
>>>
>>> Othman.
>>>
>>>
>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki 
>>> wrote:
>>>
 Ok, I will try it right away and let you know if it works.

 Othman.

 On Thu, 31 Aug 2017 at 14:15, Karl Wright 
 wrote:

> Oh, and you also may need to edit your options.env files to
> include them in the classpath for startup.
>
> Karl
>
>
> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <
> daddy...@gmail.com> wrote:
>
>> If you are amenable, there is another workaround you could
>> try.  Specifically:
>>
>> (1) Shut down all MCF processes.
>> (2) Move the following two files from connector-common-lib to
>> lib:
>>
>> xmlbeans-2.6.0.jar
>> poi-ooxml-schemas-3.15.jar
>>
>> (3) Restart everything and see if your crawl resumes.
>>
>> Please let me 

Re: Question about ManifoldCF 2.8

2017-08-31 Thread Beelz Ryuzaki
Could it be a problem with the Elasticsearch version? I'm actually using
2.1.0, which is pretty old for this new version of ManifoldCF.

Othman.

On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki  wrote:

> I moved back both the jars you mentioned and a different error is
> showing. You will find the stack trace attached.
>
> Thanks,
> Othman
>
> On Thu, 31 Aug 2017 at 17:09, Karl Wright  wrote:
>
>> I've looked at the dependencies; you should not have moved poi-3.15.jar.
>> Please move that back, and commons-collections4-4.1.jar too.
>>
>> You *will* need to move curvesapi-1.04.jar though.
>>
>> Thanks,
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright  wrote:
>>
>>> If you include poi.jar, then all dependencies of poi.jar must also be
>>> included.  This would mean that curvesapi-1.04.jar and
>>> commons-collections4-4.1.jar should also be included.
>>>
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki 
>>> wrote:
>>>
 Hi Karl,

 I added the two jars that you have mentioned and another one:
 poi-3.15.jar. Unfortunately, there is another error showing. This time it
 concerns Excel files. You will find the stack trace attached.

 Othman.

 On Thu, 31 Aug 2017 at 15:32, Karl Wright  wrote:

> Hi Othman,
>
> Yes, this shows that the jar we moved calls back into another jar,
> which will also need to be moved.  *That* jar has yet another dependency
> too.
>
> The list of jars is thus extended to include:
>
> poi-ooxml-3.15.jar
> dom4j-1.6.1.jar
>
> Karl
>
>
> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki 
> wrote:
>
>> You will find attached the stack trace. My apologies for the bad
>> quality of the image, I'm doing my best to send you the stack trace as I
>> don't have the right to send documents outside the company.
>>
>> Thank you for your time,
>>
>> Othman
>>
>> On Thu, 31 Aug 2017 at 15:16, Karl Wright  wrote:
>>
>>> Once again, I need a stack trace to diagnose what the problem is.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki 
>>> wrote:
>>>
 Oh, actually it didn't solve the problem. I looked into the log
 file and saw the following error:

 Error tossed : org/apache/poi/POIXMLTypeLoader
 java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.

 Maybe another jar is missing ?

 Othman.

 On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki 
 wrote:

> I have tried what you told me to do, and as you expected the crawling
> resumed. How about the regular expressions? How can I make complex
> regular expressions in the job's Paths tab?
>
> Thank you very much for your help.
>
> Othman.
>
>
> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki 
> wrote:
>
>> Ok, I will try it right away and let you know if it works.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 14:15, Karl Wright 
>> wrote:
>>
>>> Oh, and you also may need to edit your options.env files to
>>> include them in the classpath for startup.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright >> > wrote:
>>>
 If you are amenable, there is another workaround you could
 try.  Specifically:

 (1) Shut down all MCF processes.
 (2) Move the following two files from connector-common-lib to
 lib:

 xmlbeans-2.6.0.jar
 poi-ooxml-schemas-3.15.jar

 (3) Restart everything and see if your crawl resumes.

 Please let me know what happens.

 Karl



 On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <
 daddy...@gmail.com> wrote:

> I created a ticket for this: CONNECTORS-1450.
>
> One simple workaround is to use the external Tika server
> transformer rather than the embedded Tika Extractor.  I'm still 
> looking
> into why the jar is not being found.
>
> Karl
>
>
> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <
> i93oth...@gmail.com> wrote:
>
>> Yes, I'm actually using the latest binary version, and my job
>> got stuck on 

Re: Question about ManifoldCF 2.8

2017-08-31 Thread Karl Wright
I've looked at the dependencies; you should not have moved poi-3.15.jar.
Please move that back, and commons-collections4-4.1.jar too.

You *will* need to move curvesapi-1.04.jar though.

Thanks,
Karl


On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright  wrote:

> If you include poi.jar, then all dependencies of poi.jar must also be
> included.  This would mean that curvesapi-1.04.jar and
> commons-collections4-4.1.jar should also be included.
>
> Karl
>
> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki 
> wrote:
>
>> Hi Karl,
>>
>> I added the two jars that you have mentioned and another one:
>> poi-3.15.jar. Unfortunately, there is another error showing. This time it
>> concerns Excel files. You will find the stack trace attached.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 15:32, Karl Wright  wrote:
>>
>>> Hi Othman,
>>>
>>> Yes, this shows that the jar we moved calls back into another jar, which
>>> will also need to be moved.  *That* jar has yet another dependency too.
>>>
>>> The list of jars is thus extended to include:
>>>
>>> poi-ooxml-3.15.jar
>>> dom4j-1.6.1.jar
>>>
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki 
>>> wrote:
>>>
 You will find attached the stack trace. My apologies for the bad
 quality of the image, I'm doing my best to send you the stack trace as I
 don't have the right to send documents outside the company.

 Thank you for your time,

 Othman

 On Thu, 31 Aug 2017 at 15:16, Karl Wright  wrote:

> Once again, I need a stack trace to diagnose what the problem is.
>
> Thanks,
> Karl
>
>
> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki 
> wrote:
>
>> Oh, actually it didn't solve the problem. I looked into the log file
>> and saw the following error:
>>
>> Error tossed : org/apache/poi/POIXMLTypeLoader
>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>
>> Maybe another jar is missing ?
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki 
>> wrote:
>>
>>> I have tried what you told me to do, and as you expected the crawling
>>> resumed. How about the regular expressions? How can I make complex
>>> regular expressions in the job's Paths tab?
>>>
>>> Thank you very much for your help.
>>>
>>> Othman.
>>>
>>>
>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki 
>>> wrote:
>>>
 Ok, I will try it right away and let you know if it works.

 Othman.

 On Thu, 31 Aug 2017 at 14:15, Karl Wright 
 wrote:

> Oh, and you also may need to edit your options.env files to
> include them in the classpath for startup.
>
> Karl
>
>
> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright 
> wrote:
>
>> If you are amenable, there is another workaround you could try.
>> Specifically:
>>
>> (1) Shut down all MCF processes.
>> (2) Move the following two files from connector-common-lib to lib:
>>
>> xmlbeans-2.6.0.jar
>> poi-ooxml-schemas-3.15.jar
>>
>> (3) Restart everything and see if your crawl resumes.
>>
>> Please let me know what happens.
>>
>> Karl
>>
>>
>>
>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright 
>> wrote:
>>
>>> I created a ticket for this: CONNECTORS-1450.
>>>
>>> One simple workaround is to use the external Tika server
>>> transformer rather than the embedded Tika Extractor.  I'm still 
>>> looking
>>> into why the jar is not being found.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <
>>> i93oth...@gmail.com> wrote:
>>>
 Yes, I'm actually using the latest binary version, and my job
 got stuck on that specific file.
 The job status is still Running. You can see it in the attached
 file. For your information, the job started yesterday.

 Thanks,

 Othman

 On Thu, 31 Aug 2017 at 13:04, Karl Wright 
 wrote:

> It looks like a dependency of Apache POI is missing.
> I think we will need a ticket to address this, if you are
> indeed using the binary distribution.
>
> Thanks!
> Karl
>
> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <
> i93oth...@gmail.com> wrote:
>

Re: Question about ManifoldCF 2.8

2017-08-31 Thread Karl Wright
If you include poi.jar, then all dependencies of poi.jar must also be
included.  This would mean that curvesapi-1.04.jar and
commons-collections4-4.1.jar should also be included.

Karl
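
A side note on tracking such transitive requirements: the JDK's jdeps tool
(available since JDK 8) can list the packages a jar's classes depend on,
which helps predict which companion jars have to move together; for example,
"jdeps poi-3.15.jar" would enumerate the needs of the POI core jar.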

On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki  wrote:

> Hi Karl,
>
> I added the two jars that you mentioned and another one: poi-3.15.jar.
> Unfortunately, another error is showing; this time it concerns Excel
> files. You will find attached the stack trace.
>
> Othman.

Re: Question about ManifoldCF 2.8

2017-08-31 Thread Beelz Ryuzaki
And concerning the Paths tab, I will use the Unix/Windows wildcards. I
think that will be enough.

Othman.


Re: Question about ManifoldCF 2.8

2017-08-31 Thread Karl Wright
Hi Othman,

Yes, this shows that the jar we moved calls back into another jar, which
will also need to be moved.  *That* jar has yet another dependency too.

The list of jars is thus extended to include:

poi-ooxml-3.15.jar
dom4j-1.6.1.jar

Karl


On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki  wrote:

> You will find attached the stack trace. My apologies for the bad quality
> of the image, I'm doing my best to send you the stack trace as I don't have
> the right to send documents outside the company.
>
> Thank you for your time,
>
> Othman

Re: Question about ManifoldCF 2.8

2017-08-31 Thread Karl Wright
Once again, I need a stack trace to diagnose what the problem is.

Thanks,
Karl



Re: Question about ManifoldCF 2.8

2017-08-31 Thread Beelz Ryuzaki
Oh, actually it didn't solve the problem. I looked into the log file and
saw the following error:

Error tossed: org/apache/poi/POIXMLTypeLoader
java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.

Maybe another jar is missing?

Othman.
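
A java.lang.NoClassDefFoundError of this kind normally means the jar that
carries the named class is absent from the classpath of the process that hit
the error. A minimal standalone diagnostic, assuming nothing about ManifoldCF
itself (the class name below is simply the one from the log above), could
look like this:

    // ClasspathCheck.java -- report whether a named class resolves on the
    // classpath this JVM was started with.
    public class ClasspathCheck {
        public static void main(String[] args) {
            String name = args.length > 0 ? args[0]
                    : "org.apache.poi.POIXMLTypeLoader";
            try {
                // Resolve without initializing the class.
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
                System.out.println(name + " is resolvable");
            } catch (ClassNotFoundException | NoClassDefFoundError e) {
                System.out.println(name + " is NOT resolvable: " + e);
            }
        }
    }

Run it with the same classpath the agents process uses; if it prints "NOT
resolvable", the jar holding that class still isn't visible to that process.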

On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki  wrote:

> I have tried what you told me to do, and as you expected, the crawling
> resumed. How about the regular expressions? How can I write complex
> regular expressions in the job's Paths tab?
>
> Thank you very much for your help.
>
> Othman.

Re: Question about ManifoldCF 2.8

2017-08-31 Thread Beelz Ryuzaki
Ok, I will try it right away and let you know if it works.

Othman.


Re: Question about ManifoldCF 2.8

2017-08-31 Thread Karl Wright
Oh, and you also may need to edit your options.env files to include them in
the classpath for startup.

Karl



Re: Question about ManifoldCF 2.8

2017-08-31 Thread Karl Wright
If you are amenable, there is another workaround you could try.
Specifically:

(1) Shut down all MCF processes.
(2) Move the following two files from connector-common-lib to lib:

xmlbeans-2.6.0.jar
poi-ooxml-schemas-3.15.jar

(3) Restart everything and see if your crawl resumes.

Please let me know what happens.

Karl




Re: Question about ManifoldCF 2.8

2017-08-31 Thread Karl Wright
I created a ticket for this: CONNECTORS-1450.

One simple workaround is to use the external Tika server transformer rather
than the embedded Tika Extractor.  I'm still looking into why the jar is
not being found.

Karl


On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki  wrote:

> Yes, I'm actually using the latest binary version, and my job got stuck on
> that specific file.
> The job status is still Running. You can see it in the attached file. For
> your information, the job started yesterday.
>
> Thanks,
>
> Othman
>
> On Thu, 31 Aug 2017 at 13:04, Karl Wright  wrote:
>
>> It looks like a dependency of Apache POI is missing.
>> I think we will need a ticket to address this, if you are indeed using
>> the binary distribution.
>>
>> Thanks!
>> Karl
>>
>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> I'm actually using the binary version. For security reasons, I can't
>>> send any files from my computer. I have copied the stack trace and scanned
>>> it with my cellphone. I hope it will be helpful. Meanwhile, I have read the
>>> documentation about how to restrict the crawling, and I don't think the
>>> '|' works as specified. For instance, I would like to restrict the crawl
>>> to documents that contain the word 'sound'. I proceed as follows:
>>> *(SON)*. The document name is in capital letters, and I noticed that the
>>> filter didn't take it into account.
>>>
>>> Thanks,
>>> Othman

Re: Question about ManifoldCF 2.8

2017-08-31 Thread Karl Wright
Hi Othman,

The way you restrict documents with the windows share connector is by
specifying information on the "Paths" tab in jobs that crawl windows
shares.  There is end-user documentation both online and distributed with
all binary distributions that describe how to do this.  Have you found it?

Karl




Re: Question about ManifoldCF 2.8

2017-08-31 Thread Karl Wright
I need the complete stack trace please.

Are you building ManifoldCF yourself, or are you using the distributed
binary?

Karl


On Thu, Aug 31, 2017 at 5:48 AM, Beelz Ryuzaki  wrote:

> I have also encountered the following problem while indexing documents on
> the Windows shares:
>
> Error tossed: com/microsoft/schemas/office/visio/x2012/main/ConnectsType
>
> Is it a problem with Tika?
>
> Thanks in advance,
> Othman.

Re: Question about ManifoldCF 2.8

2017-08-31 Thread Beelz Ryuzaki
Hello Karl,

Thank you for your response; I will start using Zookeeper and will let you
know if it works. I have another question to ask. Actually, I need to apply
some filters while crawling: I don't want to crawl certain files and
folders. Could you give me an example of how to use the regex? Does the
regex allow /i to ignore case?

Thanks,
Othman
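
On the /i question above: ManifoldCF is written in Java, and java.util.regex
has no trailing /i modifier; the equivalent is the embedded (?i) flag or the
Pattern.CASE_INSENSITIVE compile flag. A minimal illustration of the Java
semantics only (whether a given connector field accepts full regex is a
separate question; elsewhere in this thread Othman settles on Unix/Windows
wildcards for the Paths tab):

    import java.util.regex.Pattern;

    public class CaseInsensitiveMatch {
        public static void main(String[] args) {
            // "(?i)" turns on case-insensitive matching inside the pattern.
            Pattern p = Pattern.compile("(?i).*son.*");
            System.out.println(p.matcher("REPORT_SON_2017.pptx").matches()); // true

            // Equivalent, using the explicit compile flag instead of "(?i)".
            Pattern q = Pattern.compile(".*son.*", Pattern.CASE_INSENSITIVE);
            System.out.println(q.matcher("Rapport_Son.docx").matches()); // true
        }
    }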

On Wed, 30 Aug 2017 at 19:53, Karl Wright  wrote:

> Hi Beelz,
>
> File-based sync is deprecated because people often have problems with
> getting file permissions right, and they do not understand how to shut
> processes down cleanly, and zookeeper is resilient against that.  I highly
> recommend using zookeeper sync.
>
> ManifoldCF is engineered to not put files into memory so you do not need
> huge amounts of memory.  The default values are more than enough for 35,000
> files, which is a pretty small job for ManifoldCF.
>
> Thanks,
> Karl


Re: Question about ManifoldCF 2.8

2017-08-30 Thread Steph van Schalkwyk
Thanks Karl.


Re: Question about ManifoldCF 2.8

2017-08-30 Thread Furkan KAMACI
Hi Steph,

Zookeeper is a coordination service for distributed systems. Having a
quorum means that more than half of the nodes are up and running. This
protects against the split-brain issue, since Zookeeper is a distributed
system and any node may be down at any time.

Split brain can be explained like this: assume there are 4 Zookeeper nodes,
and 2 of them can interact with each other but not with the other two, and
vice versa. Some MCF nodes are connected to one pair and some to the other.
For a quorum we need 4 / 2 + 1 = 3 nodes. If a quorum of only 2 were
allowed, both halves could claim a majority (a split brain). The formula
works because, at any time, only one subset of a set can contain more than
half of its members.
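
A minimal sketch of that majority arithmetic (illustration only, not
ZooKeeper code; the node counts are just examples):

    public class QuorumSize {
        public static void main(String[] args) {
            for (int nodes : new int[] {3, 4, 5}) {
                // A quorum is a strict majority; only one such subset can
                // exist at a time, which is what rules out a split brain.
                int quorum = nodes / 2 + 1;
                System.out.println(nodes + " nodes -> quorum of " + quorum);
            }
        }
    }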

Kind Regards,
Furkan KAMACI



Re: Question about ManifoldCF 2.8

2017-08-30 Thread Karl Wright
Hi Steph,

You can configure your zookeeper however you like; there is a sample
configuration file included with MCF that works out of the box.  But yes,
we do recommend a quorum count of 3 or more.

Karl



Re: Question about ManifoldCF 2.8

2017-08-30 Thread Steph van Schalkwyk
Karl,
Is there a requirement for the number of ZK for MCF? I've used ZK with
SOLR, and the minimum quorum count is 3.
Thanks
Steph


Re: Question about ManifoldCF 2.8

2017-08-30 Thread Beelz Ryuzaki
I'm actually not using Zookeeper. I want to know how Zookeeper differs
from file-based sync. I also need guidance on how to manage my PC's memory:
how many GB should I allocate for the start-agent of ManifoldCF? Is 4 GB
enough in order to crawl 35K files?

Othman.



Re: Question about ManifoldCF 2.8

2017-08-30 Thread Karl Wright
Your disk is not writable for some reason, and that's interfering with
ManifoldCF 2.8 locking.

I would suggest two things:

(1) Use Zookeeper for sync instead of file-based sync.
(2) Have a look if you still get failures after that.

Thanks,
Karl


On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki  wrote:

> Hi Mr Karl,
>
> Thank you Mr Karl for your quick response. I have looked into the
> ManifoldCF log file and extracted the following warnings :
>
> - Attempt to set file lock 'D:\\apache_manifoldcf-2.
> 8\multiprocess-file-example\.\.\synch 
> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
> (Lowercase) Synapses.lock' failed : Access is denied.
>
>
> - Couldn't write to lock file; disk may be full. Shutting down process;
> locks may be left dangling. You must cleanup before restarting.
>
> ES (lowercase) synapses being the elasticsearch output connection.
> Moreover, the job uses Tika to extract metadata and a file system as a
> repository connection. During the job, I don't extract the content of the
> documents. I was wandering if the issue comes from elasticsearch ?
>
> Othman.
>
>
>
> On Wed, 30 Aug 2017 at 14:08, Karl Wright  wrote:
>
>> Hi Othman,
>>
>> ManifoldCF aborts a job if there's an error that looks like it might go
>> away on retry, but does not.  It can be either on the repository side or on
>> the output side.  If you look at the Simple History in the UI, or at the
>> manifoldcf.log file, you should be able to get a better sense of what went
>> wrong.  Without further information, I can't say any more.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki 
>> wrote:
>>
>>> Hello,
>>>
>>> I'm Othman Belhaj, a software engineer from société générale in France.
>>> I'm actually using your recent version of manifoldCF 2.8 . I'm working on
>>> an internal search engine. For this reason, I'm using manifoldcf in order
>>> to index documents on windows shares. I encountered a serious problem while
>>> crawling 35K documents. Most of the time, when ManifoldCF starts crawling a
>>> big document (19 MB, for example), it ends the job with the following
>>> error: repeated service interruptions - failure processing document :
>>> software caused connection abort: socket write error.
>>> Can you give me some tips on how to solve this problem, please ?
>>>
>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>> I'm looking forward to your response.
>>>
>>> Best regards,
>>>
>>> Othman BELHAJ
>>>
>>
>>


Re: Question about ManifoldCF 2.8

2017-08-30 Thread Beelz Ryuzaki
Hi Mr Karl,

Thank you Mr Karl for your quick response. I have looked into the
ManifoldCF log file and extracted the following warnings :

- Attempt to set file lock
'D:\\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch
area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase)
Synapses.lock' failed : Access is denied.


- Couldn't write to lock file; disk may be full. Shutting down process;
locks may be left dangling. You must cleanup before restarting.

ES (lowercase) synapses being the elasticsearch output connection.
Moreover, the job uses Tika to extract metadata and a file system as a
repository connection. During the job, I don't extract the content of the
documents. I was wondering if the issue comes from Elasticsearch?

Othman.



On Wed, 30 Aug 2017 at 14:08, Karl Wright  wrote:

> Hi Othman,
>
> ManifoldCF aborts a job if there's an error that looks like it might go
> away on retry, but does not.  It can be either on the repository side or on
> the output side.  If you look at the Simple History in the UI, or at the
> manifoldcf.log file, you should be able to get a better sense of what went
> wrong.  Without further information, I can't say any more.
>
> Thanks,
> Karl
>
>
> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki 
> wrote:
>
>> Hello,
>>
>> I'm Othman Belhaj, a software engineer from société générale in France.
>> I'm actually using your recent version of manifoldCF 2.8 . I'm working on
>> an internal search engine. For this reason, I'm using manifoldcf in order
>> to index documents on windows shares. I encountered a serious problem while
>> crawling 35K documents. Most of the time, when ManifoldCF starts crawling a
>> big document (19 MB, for example), it ends the job with the following
>> error: repeated service interruptions - failure processing document :
>> software caused connection abort: socket write error.
>> Can you give me some tips on how to solve this problem, please ?
>>
>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>> I'm looking forward to your response.
>>
>> Best regards,
>>
>> Othman BELHAJ
>>
>
>


Re: Question about ManifoldCF 2.8

2017-08-30 Thread Karl Wright
Hi Othman,

ManifoldCF aborts a job if there's an error that looks like it might go
away on retry, but does not.  It can be either on the repository side or on
the output side.  If you look at the Simple History in the UI, or at the
manifoldcf.log file, you should be able to get a better sense of what went
wrong.  Without further information, I can't say any more.

Thanks,
Karl


On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki  wrote:

> Hello,
>
> I'm Othman Belhaj, a software engineer from société générale in France.
> I'm actually using your recent version of manifoldCF 2.8 . I'm working on
> an internal search engine. For this reason, I'm using manifoldcf in order
> to index documents on windows shares. I encountered a serious problem while
> crawling 35K documents. Most of the time, when ManifoldCF starts crawling a
> big document (19 MB, for example), it ends the job with the following
> error: repeated service interruptions - failure processing document :
> software caused connection abort: socket write error.
> Can you give me some tips on how to solve this problem, please ?
>
> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
> I'm looking forward to your response.
>
> Best regards,
>
> Othman BELHAJ
>


Re: Alfresco webscript connection problem

2017-08-22 Thread Luis Cabaceira
Hi, I'm currently onsite with a customer, but I can take a look and try to
reproduce the issue next week.

Luis

On 22 August 2017 at 12:12, Maurizio Pillitu  wrote:

> Hi Aurélien,
>
> also adding Luis Cabaceira (from the Alfresco Consultancy team), who can
> probably help and try to reproduce the issue.
>
> I haven't tested the connector yet against 5.2.0 Community, I'll give it a
> try and see if I can reproduce the issue.
>
> In the meantime, can you please confirm that you followed the steps
> described in https://github.com/Alfresco/alfresco-indexer/
> blob/master/MANIFOLD.md ? Note that instructions are based on Manifold
> 2.2, Solr 4.9.1 and Alfresco 5.1.x; you'd need to tweak those values to
> use the versions of your choice.
>
> Thanks,
>   mao
>
> On Tue, Aug 22, 2017 at 12:43 PM Karl Wright  wrote:
>
>> Hi Maurizio and Rafa, do you have any response?
>>
>> Karl
>>
>>
>> On Wed, Aug 9, 2017 at 1:24 PM, Karl Wright  wrote:
>>
>>> It might be the case.  I'm cc'ing the resident Alfresco experts about
>>> this now.
>>>
>>> Karl
>>>
>>>
>>> On Wed, Aug 9, 2017 at 1:17 PM, Aurélien MAZOYER <
>>> aurelien.mazo...@francelabs.com> wrote:
>>>
 Hi community,



 I want to crawl data from an Alfresco Community v.5.2.0 with the
 Alfresco Webscript connector of ManifoldCF 2.7.1.

 I installed the AMP as explained in https://github.com/Alfresco/
 alfresco-indexer

 When I try to set up a repository connection to my Alfresco server, I
 get the exception:



 ERROR 2017-08-09 19:01:27,071 (qtp790722099-425) - Json response is
 missing username.

 com.github.maoo.indexer.client.AlfrescoParseException: Json response
 is missing username.

at com.github.maoo.indexer.client.
 WebScriptsAlfrescoClient.getUsername(WebScriptsAlfrescoClient.java:305)

at com.github.maoo.indexer.client.
 WebScriptsAlfrescoClient.getUser(WebScriptsAlfrescoClient.java:298)

at com.github.maoo.indexer.client.
 WebScriptsAlfrescoClient.userFromHttpEntity(
 WebScriptsAlfrescoClient.java:289)

at com.github.maoo.indexer.client.
 WebScriptsAlfrescoClient.fetchUserAuthorities(
 WebScriptsAlfrescoClient.java:352)



 I read in the MCF documentation that the connector was tested with Alfresco
 5.0.d.

 Do you think the connector is not compliant with Alfresco 5.2 and that
 is why I encounter this exception?



 Thank you,



 Aurélien

>>>
>>>
>> --
> Maurizio Pillitu
> maoo @ keybase /github /
> twitter /apache /linkedIn
> 
>



-- 
Luis Cabaceira


Re: Alfresco webscript connection problem

2017-08-22 Thread Maurizio Pillitu
Hi Aurélien,

also adding Luis Cabaceira (from the Alfresco Consultancy team), who can
probably help and try to reproduce the issue.

I haven't tested the connector yet against 5.2.0 Community, I'll give it a
try and see if I can reproduce the issue.

In the meantime, can you please confirm that you followed the steps
described in
https://github.com/Alfresco/alfresco-indexer/blob/master/MANIFOLD.md ? Note
that the instructions are based on Manifold 2.2, Solr 4.9.1 and Alfresco
5.1.x; you'd need to tweak those values to use the versions of your choice.

Thanks,
  mao

On Tue, Aug 22, 2017 at 12:43 PM Karl Wright  wrote:

> Hi Maurizio and Rafa, do you have any response?
>
> Karl
>
>
> On Wed, Aug 9, 2017 at 1:24 PM, Karl Wright  wrote:
>
>> It might be the case.  I'm cc'ing the resident Alfresco experts about
>> this now.
>>
>> Karl
>>
>>
>> On Wed, Aug 9, 2017 at 1:17 PM, Aurélien MAZOYER <
>> aurelien.mazo...@francelabs.com> wrote:
>>
>>> Hi community,
>>>
>>>
>>>
>>> I want to crawl data from an Alfresco Community v.5.2.0 with the
>>> Alfresco Webscript connector of ManifoldCF 2.7.1.
>>>
>>> I installed the AMP as explained in
>>> https://github.com/Alfresco/alfresco-indexer
>>>
>>> When I try to set up a repository connection to my Alfresco server, I
>>> get the exception:
>>>
>>>
>>>
>>> ERROR 2017-08-09 19:01:27,071 (qtp790722099-425) - Json response is
>>> missing username.
>>>
>>> com.github.maoo.indexer.client.AlfrescoParseException: Json response is
>>> missing username.
>>>
>>>at
>>> com.github.maoo.indexer.client.WebScriptsAlfrescoClient.getUsername(WebScriptsAlfrescoClient.java:305)
>>>
>>>at
>>> com.github.maoo.indexer.client.WebScriptsAlfrescoClient.getUser(WebScriptsAlfrescoClient.java:298)
>>>
>>>at
>>> com.github.maoo.indexer.client.WebScriptsAlfrescoClient.userFromHttpEntity(WebScriptsAlfrescoClient.java:289)
>>>
>>>at
>>> com.github.maoo.indexer.client.WebScriptsAlfrescoClient.fetchUserAuthorities(WebScriptsAlfrescoClient.java:352)
>>>
>>>
>>>
>>> I read in the MCF documentation that the connector was tested with Alfresco
>>> 5.0.d.
>>>
>>> Do you think the connector is not compliant with Alfresco 5.2 and that
>>> is why I encounter this exception?
>>>
>>>
>>>
>>> Thank you,
>>>
>>>
>>>
>>> Aurélien
>>>
>>
>>
> --
Maurizio Pillitu
maoo @ keybase /github /
twitter /apache /linkedIn



Re: Alfresco webscript connection problem

2017-08-22 Thread Karl Wright
Hi Maurizio and Rafa, do you have any response?

Karl


On Wed, Aug 9, 2017 at 1:24 PM, Karl Wright  wrote:

> It might be the case.  I'm cc'ing the resident Alfresco experts about this
> now.
>
> Karl
>
>
> On Wed, Aug 9, 2017 at 1:17 PM, Aurélien MAZOYER <
> aurelien.mazo...@francelabs.com> wrote:
>
>> Hi community,
>>
>>
>>
>> I want to crawl data from an Alfresco Community v.5.2.0 with the Alfresco
>> Webscript connector of ManifoldCF 2.7.1.
>>
>> I installed the AMP as explained in https://github.com/Alfresco/al
>> fresco-indexer
>>
>> When I try to set up a repository connection to my Alfresco server, I get
>> the exception:
>>
>>
>>
>> ERROR 2017-08-09 19:01:27,071 (qtp790722099-425) - Json response is
>> missing username.
>>
>> com.github.maoo.indexer.client.AlfrescoParseException: Json response is
>> missing username.
>>
>>at com.github.maoo.indexer.client
>> .WebScriptsAlfrescoClient.getUsername(WebScriptsAlfrescoClient.java:305)
>>
>>at com.github.maoo.indexer.client
>> .WebScriptsAlfrescoClient.getUser(WebScriptsAlfrescoClient.java:298)
>>
>>at com.github.maoo.indexer.client
>> .WebScriptsAlfrescoClient.userFromHttpEntity(WebScriptsAlfre
>> scoClient.java:289)
>>
>>at com.github.maoo.indexer.client
>> .WebScriptsAlfrescoClient.fetchUserAuthorities(WebScriptsAlf
>> rescoClient.java:352)
>>
>>
>>
>> I read in the MCF documentation that the connector was tested with Alfresco
>> 5.0.d.
>>
>> Do you think the connector is not compliant with Alfresco 5.2 and that is
>> why I encounter this exception?
>>
>>
>>
>> Thank you,
>>
>>
>>
>> Aurélien
>>
>
>


Re: Documentum job stops on error

2017-07-17 Thread Karl Wright
I've attached a third patch to this ticket that should fix both of these
cases.  The patches must be applied in order.

Karl


On Mon, Jul 17, 2017 at 2:46 AM, Tamizh Kumaran Thamizharasan <
tthamizhara...@worldbankgroup.org> wrote:

> Thanks Karl for the patch!!!
>
>
>
> A minor correction is required on the patch
> https://issues.apache.org/jira/secure/attachment/12877287/CONNECTORS-1444-2.patch (file: DCTM.java):
>
> else if (dfe.getType() != DocumentumException.TYPE_CORRUPTEDDOCUMENT)
>
> needs to be modified to
>
> else if (dfe.getType() == DocumentumException.TYPE_CORRUPTEDDOCUMENT)
>
>
>
> After the change it's working fine.
>
>
>
> Also, the observation is that these errors (DM_PLATFORM_E_INTEGER_CONVERSION_ERROR
> and DM_OBJECT_E_LOAD_INVALID_STRING_LEN) are emitted from the
> org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.getObjectByQualification
> method call. So all the changes on
> https://issues.apache.org/jira/secure/attachment/12877287/CONNECTORS-1444-2.patch
> and the DocumentumException.java file change on
> https://issues.apache.org/jira/secure/attachment/12877277/CONNECTORS-1444.patch
> should be sufficient.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Friday, July 14, 2017 5:41 PM
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: Documentum job stops on error
>
>
>
> Ok, I've attached and committed an additional patch.  Please let me know.
>
>
>
> Karl
>
>
>
>
>
> On Fri, Jul 14, 2017 at 7:54 AM, Tamizh Kumaran Thamizharasan <
> tthamizhara...@worldbankgroup.org> wrote:
>
> Hi Karl,
>
>
>
> The patch provided is not working since the error is thrown from
> org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.
> getObjectByQualification
>
>
>
> return new DocumentumObjectImpl(objIDfSession,objIDfSession.
> getObjectByQualification(dql));
>
>
>
> Error log as follows:
>
>
>
> DfException:: THREAD: RMI TCP Connection(1083)-127.0.0.1; MSG:
> [DM_OBJECT_E_LOAD_INVALID_STRING_LEN]error:  "Error loading object:
> invalid string length 0 found in input stream"; ERRORCODE: 100; NEXT: null
>
> at com.documentum.fc.client.impl.docbase.DocbaseExceptionMapper.
> newException(DocbaseExceptionMapper.java:57)
>
> at com.documentum.fc.client.impl.connection.docbase.
> MessageEntry.getException(MessageEntry.java:39)
>
> at com.documentum.fc.client.impl.connection.docbase.
> DocbaseMessageManager.getException(DocbaseMessageManager.java:137)
>
> at com.documentum.fc.client.impl.connection.docbase.netwise.
> NetwiseDocbaseRpcClient.checkForMessages(NetwiseDocbaseRpcClient.java:310)
>
> at com.documentum.fc.client.impl.connection.docbase.netwise.
> NetwiseDocbaseRpcClient.applyForObject(NetwiseDocbaseRpcClient.java:653)
>
> at com.documentum.fc.client.impl.connection.docbase.
> DocbaseConnection$8.evaluate(DocbaseConnection.java:1370)
>
> at com.documentum.fc.client.impl.connection.docbase.
> DocbaseConnection.evaluateRpc(DocbaseConnection.java:1129)
>
> at com.documentum.fc.client.impl.connection.docbase.
> DocbaseConnection.applyForObject(DocbaseConnection.java:1362)
>
> at com.documentum.fc.client.impl.docbase.DocbaseApi.
> parameterizedFetch(DocbaseApi.java:107)
>
> at com.documentum.fc.client.impl.objectmanager.
> PersistentDataManager.fetchFromServer(PersistentDataManager.java:191)
>
> at com.documentum.fc.client.impl.objectmanager.
> PersistentDataManager.getData(PersistentDataManager.java:82)
>
> at com.documentum.fc.client.impl.objectmanager.
> PersistentObjectManager.getObjectFromServer(PersistentObjectManager.java:
> 355)
>
> at com.documentum.fc.client.impl.objectmanager.
> PersistentObjectManager.getObject(PersistentObjectManager.java:311)
>
> at com.documentum.fc.client.impl.session.Session.getObject(
> Session.java:958)
>
> at com.documentum.fc.client.impl.session.Session.
> getObjectByQualificationEx(Session.java:1139)
>
> at com.documentum.fc.client.impl.session.Session.
> getObjectByQualification(Session.java:1117)
>
> at com.documentum.fc.client.impl.session.SessionHandle.
> getObjectByQualification(SessionHandle.java:755)
>
> at org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.
> getObjectByQualification(DocumentumImpl.java:334)
>
>  

RE: Documentum job stops on error

2017-07-17 Thread Tamizh Kumaran Thamizharasan
Thanks Karl for the patch!!!

A minor correction is required on the patch
https://issues.apache.org/jira/secure/attachment/12877287/CONNECTORS-1444-2.patch (file: DCTM.java):
else if (dfe.getType() != DocumentumException.TYPE_CORRUPTEDDOCUMENT)
needs to be modified to
else if (dfe.getType() == DocumentumException.TYPE_CORRUPTEDDOCUMENT)

After the change it's working fine.

Also, the observation is that these errors (DM_PLATFORM_E_INTEGER_CONVERSION_ERROR
and DM_OBJECT_E_LOAD_INVALID_STRING_LEN) are emitted from the
org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.getObjectByQualification
method call. So all the changes on
https://issues.apache.org/jira/secure/attachment/12877287/CONNECTORS-1444-2.patch
and the DocumentumException.java file change on
https://issues.apache.org/jira/secure/attachment/12877277/CONNECTORS-1444.patch
should be sufficient.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Friday, July 14, 2017 5:41 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: Documentum job stops on error

Ok, I've attached and committed an additional patch.  Please let me know.

Karl


On Fri, Jul 14, 2017 at 7:54 AM, Tamizh Kumaran Thamizharasan 
<tthamizhara...@worldbankgroup.org<mailto:tthamizhara...@worldbankgroup.org>> 
wrote:
Hi Karl,

The patch provided is not working since the error is thrown from 
org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.getObjectByQualification

return new 
DocumentumObjectImpl(objIDfSession,objIDfSession.getObjectByQualification(dql));

Error log as follows:

DfException:: THREAD: RMI TCP Connection(1083)-127.0.0.1; MSG: 
[DM_OBJECT_E_LOAD_INVALID_STRING_LEN]error:  "Error loading object: invalid 
string length 0 found in input stream"; ERRORCODE: 100; NEXT: null
at 
com.documentum.fc.client.impl.docbase.DocbaseExceptionMapper.newException(DocbaseExceptionMapper.java:57)
at 
com.documentum.fc.client.impl.connection.docbase.MessageEntry.getException(MessageEntry.java:39)
at 
com.documentum.fc.client.impl.connection.docbase.DocbaseMessageManager.getException(DocbaseMessageManager.java:137)
at 
com.documentum.fc.client.impl.connection.docbase.netwise.NetwiseDocbaseRpcClient.checkForMessages(NetwiseDocbaseRpcClient.java:310)
at 
com.documentum.fc.client.impl.connection.docbase.netwise.NetwiseDocbaseRpcClient.applyForObject(NetwiseDocbaseRpcClient.java:653)
at 
com.documentum.fc.client.impl.connection.docbase.DocbaseConnection$8.evaluate(DocbaseConnection.java:1370)
at 
com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.evaluateRpc(DocbaseConnection.java:1129)
at 
com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.applyForObject(DocbaseConnection.java:1362)
at 
com.documentum.fc.client.impl.docbase.DocbaseApi.parameterizedFetch(DocbaseApi.java:107)
at 
com.documentum.fc.client.impl.objectmanager.PersistentDataManager.fetchFromServer(PersistentDataManager.java:191)
at 
com.documentum.fc.client.impl.objectmanager.PersistentDataManager.getData(PersistentDataManager.java:82)
at 
com.documentum.fc.client.impl.objectmanager.PersistentObjectManager.getObjectFromServer(PersistentObjectManager.java:355)
at 
com.documentum.fc.client.impl.objectmanager.PersistentObjectManager.getObject(PersistentObjectManager.java:311)
at 
com.documentum.fc.client.impl.session.Session.getObject(Session.java:958)
at 
com.documentum.fc.client.impl.session.Session.getObjectByQualificationEx(Session.java:1139)
at 
com.documentum.fc.client.impl.session.Session.getObjectByQualification(Session.java:1117)
at 
com.documentum.fc.client.impl.session.SessionHandle.getObjectByQualification(SessionHandle.java:755)
at 
org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.getObjectByQualification(DocumentumImpl.java:334)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:346)
at sun.rmi.transport.Transport$1.run(Transport.java:200)
at sun.rmi.transport.Transport$1.run(Transport.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
at 
sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda

Re: Documentum job stops on error

2017-07-14 Thread Karl Wright
Ok, I've attached and committed an additional patch.  Please let me know.

Karl


On Fri, Jul 14, 2017 at 7:54 AM, Tamizh Kumaran Thamizharasan <
tthamizhara...@worldbankgroup.org> wrote:

> Hi Karl,
>
>
>
> The patch provided is not working since the error is thrown from
> org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.
> getObjectByQualification
>
>
>
> return new DocumentumObjectImpl(objIDfSession,objIDfSession.
> getObjectByQualification(dql));
>
>
>
> Error log as follows:
>
>
>
> DfException:: THREAD: RMI TCP Connection(1083)-127.0.0.1; MSG:
> [DM_OBJECT_E_LOAD_INVALID_STRING_LEN]error:  "Error loading object:
> invalid string length 0 found in input stream"; ERRORCODE: 100; NEXT: null
>
> at com.documentum.fc.client.impl.docbase.DocbaseExceptionMapper.
> newException(DocbaseExceptionMapper.java:57)
>
> at com.documentum.fc.client.impl.connection.docbase.
> MessageEntry.getException(MessageEntry.java:39)
>
> at com.documentum.fc.client.impl.connection.docbase.
> DocbaseMessageManager.getException(DocbaseMessageManager.java:137)
>
> at com.documentum.fc.client.impl.connection.docbase.netwise.
> NetwiseDocbaseRpcClient.checkForMessages(NetwiseDocbaseRpcClient.java:310)
>
> at com.documentum.fc.client.impl.connection.docbase.netwise.
> NetwiseDocbaseRpcClient.applyForObject(NetwiseDocbaseRpcClient.java:653)
>
> at com.documentum.fc.client.impl.connection.docbase.
> DocbaseConnection$8.evaluate(DocbaseConnection.java:1370)
>
> at com.documentum.fc.client.impl.connection.docbase.
> DocbaseConnection.evaluateRpc(DocbaseConnection.java:1129)
>
> at com.documentum.fc.client.impl.connection.docbase.
> DocbaseConnection.applyForObject(DocbaseConnection.java:1362)
>
> at com.documentum.fc.client.impl.docbase.DocbaseApi.
> parameterizedFetch(DocbaseApi.java:107)
>
> at com.documentum.fc.client.impl.objectmanager.
> PersistentDataManager.fetchFromServer(PersistentDataManager.java:191)
>
> at com.documentum.fc.client.impl.objectmanager.
> PersistentDataManager.getData(PersistentDataManager.java:82)
>
> at com.documentum.fc.client.impl.objectmanager.
> PersistentObjectManager.getObjectFromServer(PersistentObjectManager.java:
> 355)
>
> at com.documentum.fc.client.impl.objectmanager.
> PersistentObjectManager.getObject(PersistentObjectManager.java:311)
>
> at com.documentum.fc.client.impl.session.Session.getObject(
> Session.java:958)
>
> at com.documentum.fc.client.impl.session.Session.
> getObjectByQualificationEx(Session.java:1139)
>
> at com.documentum.fc.client.impl.session.Session.
> getObjectByQualification(Session.java:1117)
>
> at com.documentum.fc.client.impl.session.SessionHandle.
> getObjectByQualification(SessionHandle.java:755)
>
> at org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.
> getObjectByQualification(DocumentumImpl.java:334)
>
> at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:498)
>
> at sun.rmi.server.UnicastServerRef.dispatch(
> UnicastServerRef.java:346)
>
> at sun.rmi.transport.Transport$1.run(Transport.java:200)
>
> at sun.rmi.transport.Transport$1.run(Transport.java:197)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
>
> at sun.rmi.transport.tcp.TCPTransport.handleMessages(
> TCPTransport.java:568)
>
> at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(
> TCPTransport.java:826)
>
> at sun.rmi.transport.tcp.TCPTransport$
> ConnectionHandler.lambda$run$0(TCPTransport.java:683)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(
> TCPTransport.java:682)
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Friday, July 14, 2017 4:32 PM
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: Documentum job stops on error

RE: Documentum job stops on error

2017-07-14 Thread Tamizh Kumaran Thamizharasan
Hi Karl,

The patch provided is not working since the error is thrown from 
org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.getObjectByQualification

return new 
DocumentumObjectImpl(objIDfSession,objIDfSession.getObjectByQualification(dql));

Error log as follows:

DfException:: THREAD: RMI TCP Connection(1083)-127.0.0.1; MSG: 
[DM_OBJECT_E_LOAD_INVALID_STRING_LEN]error:  "Error loading object: invalid 
string length 0 found in input stream"; ERRORCODE: 100; NEXT: null
at 
com.documentum.fc.client.impl.docbase.DocbaseExceptionMapper.newException(DocbaseExceptionMapper.java:57)
at 
com.documentum.fc.client.impl.connection.docbase.MessageEntry.getException(MessageEntry.java:39)
at 
com.documentum.fc.client.impl.connection.docbase.DocbaseMessageManager.getException(DocbaseMessageManager.java:137)
at 
com.documentum.fc.client.impl.connection.docbase.netwise.NetwiseDocbaseRpcClient.checkForMessages(NetwiseDocbaseRpcClient.java:310)
at 
com.documentum.fc.client.impl.connection.docbase.netwise.NetwiseDocbaseRpcClient.applyForObject(NetwiseDocbaseRpcClient.java:653)
at 
com.documentum.fc.client.impl.connection.docbase.DocbaseConnection$8.evaluate(DocbaseConnection.java:1370)
at 
com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.evaluateRpc(DocbaseConnection.java:1129)
at 
com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.applyForObject(DocbaseConnection.java:1362)
at 
com.documentum.fc.client.impl.docbase.DocbaseApi.parameterizedFetch(DocbaseApi.java:107)
at 
com.documentum.fc.client.impl.objectmanager.PersistentDataManager.fetchFromServer(PersistentDataManager.java:191)
at 
com.documentum.fc.client.impl.objectmanager.PersistentDataManager.getData(PersistentDataManager.java:82)
at 
com.documentum.fc.client.impl.objectmanager.PersistentObjectManager.getObjectFromServer(PersistentObjectManager.java:355)
at 
com.documentum.fc.client.impl.objectmanager.PersistentObjectManager.getObject(PersistentObjectManager.java:311)
at 
com.documentum.fc.client.impl.session.Session.getObject(Session.java:958)
at 
com.documentum.fc.client.impl.session.Session.getObjectByQualificationEx(Session.java:1139)
at 
com.documentum.fc.client.impl.session.Session.getObjectByQualification(Session.java:1117)
at 
com.documentum.fc.client.impl.session.SessionHandle.getObjectByQualification(SessionHandle.java:755)
at 
org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.getObjectByQualification(DocumentumImpl.java:334)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:346)
at sun.rmi.transport.Transport$1.run(Transport.java:200)
at sun.rmi.transport.Transport$1.run(Transport.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
at 
sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683)
at java.security.AccessController.doPrivileged(Native Method)
at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Friday, July 14, 2017 4:32 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: Documentum job stops on error

I have created a ticket (CONNECTORS-1444) to track this issue, and attached a 
fix.  I've also committed the fix to trunk.

The fix is not the code change you have done, but instead introduces a new kind 
of DocumentumException: CORRUPTEDDOCUMENT.  This will be thrown whenever 
permanent document corruption is detected, and will cause the document to be 
skipped and not indexed.
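
Sketched as code, the mechanism reads roughly like this (an illustration of
the described behavior only, not the verbatim committed patch; the
processDocument, activities, and versionString names are hypothetical
stand-ins):

try
{
  processDocument(documentIdentifier);  // hypothetical fetch/index step
}
catch (DocumentumException dfe)
{
  if (dfe.getType() == DocumentumException.TYPE_CORRUPTEDDOCUMENT)
  {
    // Permanently corrupted document: record that there is nothing to
    // index for it, and move on instead of aborting the whole job.
    activities.noDocument(documentIdentifier, versionString);
    return;
  }
  // All other exception types keep their existing retry/abort semantics.
  throw dfe;
}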

The "DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED " error should cause the 
connector to retry the document at a later time, so if indeed this is not a 
permanent error, no special fix should be required.

Please let me know if the fix I have committed works for you.

Karl



On Fri, Jul 14, 2017 at 5:41 AM, Tamizh Kumaran Thamizharasan 
<tthamizhara...@worldbankgroup.org<mailto:tthamizhara...@worldbankgroup.org>> 
wrote:
Hi Karl,

Sorry for not explaining the issue in a detailed manner.

RE: Documentum job stops on error

2017-07-14 Thread Tamizh Kumaran Thamizharasan
Hi Karl,

Sorry for not explaining the issue in a detailed manner.

(1)   Is it likely to go away or not on a retry;

The DM_PLATFORM_E_INTEGER_CONVERSION_ERROR and
DM_OBJECT_E_LOAD_INVALID_STRING_LEN errors are not likely to go away on an
immediate retry.

(2)   Does it substantially impact the ability of ManifoldCF to properly 
process the document;

The impact is that someone needs to monitor the indexing and, if it gets stopped on
these issues, use restart-minimal to start the indexing again.
(3) Is it generally acceptable to skip ALL documents where the error occurs.
Yes; those errors occurred for a large number of documents, and it is tough for
the user to restart the indexing each time. Total documents count - 70+
DM_OBJECT_E_LOAD_INVALID_STRING_LEN - 11147
DM_PLATFORM_E_INTEGER_CONVERSION_ERROR - 21708
I'm not sure whether the occurrences of these issues are common on Documentum,
or due to improper Documentum configuration/maintenance. We have encountered
those errors on a couple of the Documentum instances in lower environments
(not validated on production).

The Documentum repository errors DM_PLATFORM_E_INTEGER_CONVERSION_ERROR and
DM_OBJECT_E_LOAD_INVALID_STRING_LEN are of type DfException, thrown from the
getObjectByQualification method in
org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.

We made a fix to print the error to the log (documentum server process) and
return null:
catch (DfException e)
{
  // Log the Documentum error and treat the object as missing instead of failing.
  e.printStackTrace();
  return null;
  //throw new DocumentumException("Documentum error: " + e.getMessage());
}


In the run() method of the ProcessDocumentThread inner class in the
org.apache.manifoldcf.crawler.connectors.DCTM.DCTM file, I did a null check
so that document processing can continue:
try
{
  IDocumentumObject object = session.getObjectByQualification(
    "dm_document where i_chronicle_id='" + documentIdentifier +
    "' and any r_version_label='CURRENT'");
  // Only process the document if the lookup succeeded.
  if (object != null)
  {
    …
  }
}
catch (Throwable e)
{
  this.exception = e;
}

The DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED error occurs very rarely, because
an uploaded document is parked on an interim BOCS server and moved to the
repository a short time later.
If indexing happens in that gap, the properties are accessible but the document
content is not yet available, which causes the error. The fix is not
yet completed.
The code snippet that causes this error is shared below.
From the run() method of the ProcessDocumentThread inner class in
org.apache.manifoldcf.crawler.connectors.DCTM.DCTM:
try
{
  strFilePath = object.getFile(objFileTemp.getCanonicalPath());
}
catch (DocumentumException dfe)
{
  // Fetch failed, so log it
  activityStatus = "NOCONTENT";
  activityMessage = dfe.getMessage();
  if (dfe.getType() != DocumentumException.TYPE_NOTALLOWED)
    throw dfe;
  return;
}

And the getFile method in
org.apache.manifoldcf.crawler.common.DCTM.DocumentumObjectImpl:

catch (DfException dfe)
{
  // Can't decide what to do without looking at the exception text.
  // This is crappy but it's the best we can manage, apparently.
  String errorMessage = dfe.getMessage();
  if (errorMessage.indexOf("[DM_CONTENT_E_CANT_START_PULL]") == -1)
    // Treat it as transient, and retry
    throw new DocumentumException(dfe.getMessage(), DocumentumException.TYPE_SERVICEINTERRUPTION);
  // It's probably not a transient error.  Report it as an access violation,
  // even though it may well not be.  We don't have much info as to what's happening.
  throw new DocumentumException(dfe.getMessage(), DocumentumException.TYPE_NOTALLOWED);
}

The approach of discarding uncrawlable documents and continuing with the indexing
process is meaningful, rather than stalling it. If you feel it is good to
include, kindly make the required coding exception.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Friday, July 14, 2017 12:36 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: Documentum job stops on error

Hi Tamizh,

For any repository  errors, ManifoldCF needs to know the following:
(1) Is it likely to go away or not on a retry;
(2) Does it substantially impact the ability of ManifoldCF to properly process 
the document;
(3) Is it generally acceptable to skip ALL documents where the error occurs.

In this case your underlying error seems quite worrying:

[DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED]error: "The content is temporarily 
parked on a BOCS server host. It will be available when it is moved to a 
permanent storage area."

I could imagine that many or most documents are in fact in that state, in which
case nothing can really be crawled?

Re: Documentum job stops on error

2017-07-14 Thread Karl Wright
Hi Tamizh,

For any repository  errors, ManifoldCF needs to know the following:

(1) Is it likely to go away or not on a retry;
(2) Does it substantially impact the ability of ManifoldCF to properly
process the document;
(3) Is it generally acceptable to skip ALL documents where the error occurs.

In this case your underlying error seems quite worrying:

[DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED]error: "The content is
temporarily parked on a BOCS server host. It will be available when it is
moved to a permanent storage area."

I could imagine that many or most documents are in fact in that state, in
which case nothing can really be crawled?

I'm happy to make coding exceptions in the Documentum connector for
discarding uncrawlable documents, but only if it makes sense to do that.
Here it is not clear at all that we'd want to change MCF to throw away all
documents with this problem.  It sounds instead like there's some
significant Documentum configuration issue to me.

Thanks,
Karl


On Fri, Jul 14, 2017 at 2:39 AM, Tamizh Kumaran Thamizharasan <
tthamizhara...@worldbankgroup.org> wrote:

> Hi Team,
>
>
>
> Below behavior is observed on using ManifoldCF Documentum connector.
>
>
>
> · On any Documentum specific error, the application throws the
> error and the job stops abruptly. If there is any specific reason for this
> approach?
>
> Can we handle these errors by logging the errors, ignoring the document
> and continue the indexing?
>
>
>
> Please find the sample error causing the job to fail.
>
>
>
> Documentum error: [DM_PLATFORM_E_INTEGER_CONVERSION_ERROR]error:  "The
> server was unable to convert the following string (String Unavailable) to
> an integer or long."
>
>
>
> Caused by: org.apache.manifoldcf.crawler.common.DCTM.DocumentumException:
> Documentum error: [DM_OBJECT_E_LOAD_INVALID_STRING_LEN]error:  "Error
> loading object: invalid string length 0 found in input stream"
>
>
>
> Error: Repeated service interruptions - failure processing document:
> [DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED]error: "The content is
> temporarily parked on a BOCS server host. It will be available when it is
> moved to a permanent storage area."
>
>
>
> Kindly provide your suggestion on this.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>


Re: ldap authentication with crawler ui

2017-07-13 Thread Theodor Carp
Hi Karl,

Many thanks for the support! I'll keep looking into this, as this is a
feature I would really like to have functional.

Best
T

-- 
Theodor Carp

From: Karl Wright <daddy...@gmail.com>
Reply: user@manifoldcf.apache.org
Date: 13 July 2017 at 14:17:14
To: user@manifoldcf.apache.org
Subject: Re: ldap authentication with crawler ui

> I wish I was familiar enough with the code for this feature that I could be
> of help.  Nobody seems to have responded either.  It *is* summer and many
> people have vacations.
>
> I think, therefore, you're going to wind up needing to debug this
> yourself.  There's no magic; it's just using the javax packages for LDAP
> communication -- but obviously there's something not set up right and I
> don't know what it is.  It may be a default parameter value or some such.
>
> Thanks,
> Karl
>
>
> On Wed, Jul 12, 2017 at 11:29 AM, Karl Wright <daddy...@gmail.com> wrote:
>
>> Have any users out there made use of LDAP crawler-UI authentication?  If
>> so, can you have a look at Theodor's configuration and setup?
>>
>> Karl
>>
>>
>> On Wed, Jul 12, 2017 at 10:07 AM, Theodor Carp <theodor.c...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Using the below settings:
>>>
>>> <property name="…" value="org.apache.manifoldcf.core.auth.LdapAuthenticator" />
>>> <property name="…" value="LDAP-AUTHENTICATION" />
>>> <property name="…" value="ldap://hdp01.local:389" />
>>> <property name="…" value="simple" />
>>> <property name="…" value="uid=$(userID),ou=Users,dc=local" />
>>> <property name="…" value="(uid=$(userID))" />
>>> <property name="…" value="uid" />
>>>
>>> I'm getting errors like:
>>>
>>> ERROR 2017-07-12 15:20:32,951 (qtp1295083508-17) - User not
>>> authenticated = authenticating_user exception = [LDAP: error code 32 -
>>> No Such Object]
>>> javax.naming.NameNotFoundException: [LDAP: error code 32 - No Such
>>> Object]; remaining name ''
>>> at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3161)
>>> at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3082)
>>> at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2888)
>>> [...]
>>> FATAL 2017-07-12 15:20:32,956 (qtp1295083508-17) - Exception logging in:
>>> User not authenticated: [LDAP: error code 32 - No Such Object]
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: User not
>>> authenticated: [LDAP: error code 32 - No Such Object]
>>> at org.apache.manifoldcf.core.auth.LdapAuthenticator.verifyLogi
>>> n(LdapAuthenticator.java:162)
>>> at org.apache.manifoldcf.core.auth.LdapAuthenticator.verifyUILo
>>> gin(LdapAuthenticator.java:107)
>>> at org.apache.manifoldcf.ui.beans.AdminProfile.login(AdminProfi
>>> le.java:103)
>>> [...]
>>> Caused by: javax.naming.NameNotFoundException: [LDAP: error code 32 -
>>> No Such Object]; remaining name ''
>>> at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3161)
>>> at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3082)
>>> at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2888)
>>>
>>> But if I do a manual ldapsearch, basically using the same settings, e.g.:
>>>
>>> ldapsearch -x -H ldap://hdp01.local -b "dc=local" -s sub
>>> '(uid=authenticating_user)'
>>>
>>> Or
>>>
>>> ldapsearch -x -D "uid=authenticating_user1,ou=Users,dc=local" -W -H
>>> ldap://hdp01.local -b "dc=local" -s sub 'uid=authenticating_user'
>>>
>>> It basically works ok.
>>>
>>> For reference, I'm running ManifoldCF 2.7 on Tomcat, using PostgreSQL for
>>> the database and Zookeeper as the config repo and orchestrator.
>>>
>>> Any ideas?
>>>
>>> Best,
>>> T
>>>
>>
>>
>


Re: ldap authentication with crawler ui

2017-07-12 Thread Karl Wright
Have any users out there made use of LDAP crawler-UI authentication?  If
so, can you have a look at Theodor's configuration and setup?

Karl
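
For anyone debugging this: "[LDAP: error code 32 - No Such Object]" generally
means the DN being bound or searched for does not exist on the server, so the
user DN template and search base are the first things to verify. A minimal,
self-contained JNDI check (the same javax.naming path the LdapAuthenticator
uses; host and DN values copied from the settings quoted below, the password
is a hypothetical placeholder):

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.InitialDirContext;

public class LdapBindCheck {
  public static void main(String[] args) throws Exception {
    Hashtable<String, String> env = new Hashtable<>();
    env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
    env.put(Context.PROVIDER_URL, "ldap://hdp01.local:389");
    env.put(Context.SECURITY_AUTHENTICATION, "simple");
    // A NameNotFoundException with error code 32 means this DN does not exist.
    env.put(Context.SECURITY_PRINCIPAL, "uid=authenticating_user,ou=Users,dc=local");
    env.put(Context.SECURITY_CREDENTIALS, "secret");
    new InitialDirContext(env).close();  // throws if the bind fails
    System.out.println("Bind OK");
  }
}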


On Wed, Jul 12, 2017 at 10:07 AM, Theodor Carp 
wrote:

> Hi,
>
> Using the below settings:
>
> <property name="…" value="org.apache.manifoldcf.core.auth.LdapAuthenticator" />
> <property name="…" value="LDAP-AUTHENTICATION" />
> <property name="…" value="ldap://hdp01.local:389" />
> <property name="…" value="simple" />
> <property name="…" value="uid=$(userID),ou=Users,dc=local" />
> <property name="…" value="(uid=$(userID))" />
> <property name="…" value="uid" />
>
> I'm getting errors like:
>
> ERROR 2017-07-12 15:20:32,951 (qtp1295083508-17) - User not authenticated
> = authenticating_user exception = [LDAP: error code 32 - No Such Object]
> javax.naming.NameNotFoundException: [LDAP: error code 32 - No Such Object];
> remaining name ''
> at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3161)
> at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3082)
> at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2888)
> [...]
> FATAL 2017-07-12 15:20:32,956 (qtp1295083508-17) - Exception logging in:
> User not authenticated: [LDAP: error code 32 - No Such Object]
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: User not
> authenticated: [LDAP: error code 32 - No Such Object]
> at org.apache.manifoldcf.core.auth.LdapAuthenticator.
> verifyLogin(LdapAuthenticator.java:162)
> at org.apache.manifoldcf.core.auth.LdapAuthenticator.verifyUILogin(
> LdapAuthenticator.java:107)
> at org.apache.manifoldcf.ui.beans.AdminProfile.login(
> AdminProfile.java:103)
> [...]
> Caused by: javax.naming.NameNotFoundException: [LDAP: error code 32 - No
> Such Object]; remaining name ''
> at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3161)
> at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3082)
> at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2888)
>
> But if I do a manual ldapsearch, basically using the same settings, e.g.:
>
> ldapsearch -x -H ldap://hdp01.local -b "dc=local" -s sub
> '(uid=authenticating_user)'
>
> Or
>
> ldapsearch -x -D "uid=authenticating_user1,ou=Users,dc=local" -W -H
> ldap://hdp01.local -b "dc=local" -s sub 'uid=authenticating_user'
>
> It basically works ok.
>
> For reference, I'm running ManifoldCF 2.7 on Tomcat, using PostgreSQL for
> the database and Zookeeper as the config repo and orchestrator.
>
> Any ideas?
>
> Best,
> T
>


RE: ManifoldCF slow documentum indexing performance

2017-07-12 Thread Tamizh Kumaran Thamizharasan
Thanks Karl and Furkan!!!

After pointing to a different Documentum instance, the performance issue got
resolved.
So it looks like a Documentum issue.

Regards,
Tamizh Kumaran

From: Furkan KAMACI [mailto:furkankam...@gmail.com]
Sent: Thursday, July 06, 2017 3:22 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF slow documentum indexing performance

Hi Tamizh,

Set Xmx and Xms to the same value for better performance.
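
For example, with the 512 MB maximum heap shown later in this thread, the run
script setting would become:

-Xms512m -Xmx512m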

Kind Regards,
Furkan KAMACI

On Thu, Jul 6, 2017 at 9:10 AM, Karl Wright 
<daddy...@gmail.com<mailto:daddy...@gmail.com>> wrote:
Hi Tamizh,

The Documentum Server Process is a thin shell around DFC and its dependencies.  
In order to get helpful suggestions, you will need to contact Documentum, I'm 
afraid.

Thanks,
Karl



On Thu, Jul 6, 2017 at 1:57 AM, Tamizh Kumaran Thamizharasan 
<tthamizhara...@worldbankgroup.org<mailto:tthamizhara...@worldbankgroup.org>> 
wrote:
Thanks Karl!!

After monitoring the CPU usage of Postgresql, the agents process, and the 
documentum server process, mainly the documentum server process consumes most 
of the CPU and the agent process is the second most CPU consumer.

In the Documentum server run script, the Java heap is configured as below.
-Xmx512m -Xms32m

Is there any way to speed up the indexing through heap configuration or
by adding hardware?
If so, kindly share the details.

Regards,
Tamizh Kumaran

From: Karl Wright [mailto:daddy...@gmail.com<mailto:daddy...@gmail.com>]
Sent: Wednesday, July 05, 2017 6:19 PM
To: user@manifoldcf.apache.org<mailto:user@manifoldcf.apache.org>
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF slow documentum indexing performance

Hi Tamizh,

The likely culprit is Documentum itself.  In my experience it can be quite 
slow, depending on how it is configured.  But you can confirm that by 
monitoring the CPU usage of Postgresql, the agents process, and the documentum 
server process.  If none of these are CPU bound, then Documentum itself is the 
problem.

Thanks,
Karl


On Wed, Jul 5, 2017 at 8:24 AM, Tamizh Kumaran Thamizharasan 
<tthamizhara...@worldbankgroup.org<mailto:tthamizhara...@worldbankgroup.org>> 
wrote:
Hi Team,

PostgreSQL 9.2, Solr 5.3.2 and ManifoldCF 2.7.1 are installed on the same
Linux box. The Documentum server sits on a different Linux box. The indexing
performance is slow (approx. 1000 documents per hour) with the Documentum
crawler. The properties file used is shown below for reference:

<configuration>
  <property name="org.apache.manifoldcf.logconfigfile" value="./logging.ini"/>
  <property name="org.apache.manifoldcf.connectorsconfigurationfile" value="../connectors.xml"/>
  <property name="org.apache.manifoldcf.fileresources" value="../file-resources"/>
  <property name="org.apache.manifoldcf.databaseimplementation" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
  <property name="org.apache.manifoldcf.postgresql.hostname" value="localhost"/>
  <property name="org.apache.manifoldcf.database.username" value="postgres"/>
  <property name="org.apache.manifoldcf.database.name" value="manifoldcf"/>
  <property name="org.apache.manifoldcf.database.password" value="postgres"/>
  <property name="org.apache.manifoldcf.crawler.threads" value="15"/>
  <property name="org.apache.manifoldcf.crawler.repository.store_history" value="false"/>
  <property name="org.apache.manifoldcf.zookeeper.connectstring" value="***:8349"/>
  <property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="5000"/>
</configuration>

Initially org.apache.manifoldcf.crawler.threads was set to 45, and the
observation is that there is a long time gap between each batch of 45
documents during processing.
Can you please point out any changes/recommendations that will speed up the
indexing?
Regards,
Tamizh Kumaran Thamizharasan






Re: ManifoldCF slow documentum indexing performance

2017-07-06 Thread Karl Wright
Hi Tamizh,

The Documentum Server Process is a thin shell around DFC and its
dependencies.  In order to get helpful suggestions, you will need to
contact Documentum, I'm afraid.

Thanks,
Karl



On Thu, Jul 6, 2017 at 1:57 AM, Tamizh Kumaran Thamizharasan <
tthamizhara...@worldbankgroup.org> wrote:

> Thanks Karl!!
>
>
>
> After monitoring the CPU usage of Postgresql, the agents process, and the
> documentum server process, mainly the documentum server process consumes
> most of the CPU and the agent process is the second most CPU consumer.
>
>
>
> In the Documentum server run script, the Java heap is configured as below.
>
> -Xmx512m -Xms32m
>
>
>
> Is there any way to speed up the indexing through heap configuration or
> by adding hardware?
>
> If so, kindly share the details.
>
>
>
> Regards,
>
> Tamizh Kumaran
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, July 05, 2017 6:19 PM
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF slow documentum indexing performance
>
>
>
> Hi Tamizh,
>
>
>
> The likely culprit is Documentum itself.  In my experience it can be quite
> slow, depending on how it is configured.  But you can confirm that by
> monitoring the CPU usage of Postgresql, the agents process, and the
> documentum server process.  If none of these are CPU bound, then Documentum
> itself is the problem.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Wed, Jul 5, 2017 at 8:24 AM, Tamizh Kumaran Thamizharasan <
> tthamizhara...@worldbankgroup.org> wrote:
>
> Hi Team,
>
>
>
> PostgreSQL 9.2, Solr 5.3.2 and ManifoldCF 2.7.1 are installed on the same
> Linux box. The Documentum server sits on a different Linux box. The indexing
> performance is slow (approx. 1000 documents per hour) with the Documentum
> crawler. The properties file used is shown below for reference:
>
> <configuration>
>   <property name="org.apache.manifoldcf.logconfigfile" value="./logging.ini"/>
>   <property name="org.apache.manifoldcf.connectorsconfigurationfile" value="../connectors.xml"/>
>   <property name="org.apache.manifoldcf.fileresources" value="../file-resources"/>
>   <property name="org.apache.manifoldcf.databaseimplementation" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
>   <property name="org.apache.manifoldcf.postgresql.hostname" value="localhost"/>
>   <property name="org.apache.manifoldcf.database.username" value="postgres"/>
>   <property name="org.apache.manifoldcf.database.name" value="manifoldcf"/>
>   <property name="org.apache.manifoldcf.database.password" value="postgres"/>
>   <property name="org.apache.manifoldcf.crawler.threads" value="15"/>
>   <property name="org.apache.manifoldcf.crawler.repository.store_history" value="false"/>
>   <property name="org.apache.manifoldcf.zookeeper.connectstring" value="***:8349"/>
>   <property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="5000"/>
> </configuration>
>
> Initially org.apache.manifoldcf.crawler.threads was set to 45, and the
> observation is that there is a long time gap between each batch of 45
> documents during processing.
>
> Can you please point out any changes/recommendations that will speed up
> the indexing?
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
>
>


RE: ManifoldCF slow documentum indexing performance

2017-07-05 Thread Tamizh Kumaran Thamizharasan
Thanks Karl!!

After monitoring the CPU usage of Postgresql, the agents process, and the 
documentum server process, mainly the documentum server process consumes most 
of the CPU and the agent process is the second most CPU consumer.

In the Documentum server run script, the Java heap is configured as below.
-Xmx512m -Xms32m

Is there any way to speed up the indexing through heap configuration or
by adding hardware?
If so, kindly share the details.

Regards,
Tamizh Kumaran

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Wednesday, July 05, 2017 6:19 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF slow documentum indexing performance

Hi Tamizh,

The likely culprit is Documentum itself.  In my experience it can be quite 
slow, depending on how it is configured.  But you can confirm that by 
monitoring the CPU usage of Postgresql, the agents process, and the documentum 
server process.  If none of these are CPU bound, then Documentum itself is the 
problem.

Thanks,
Karl


On Wed, Jul 5, 2017 at 8:24 AM, Tamizh Kumaran Thamizharasan 
<tthamizhara...@worldbankgroup.org<mailto:tthamizhara...@worldbankgroup.org>> 
wrote:
Hi Team,

PostgreSQL 9.2, Solr 5.3.2 and ManifoldCF 2.7.1 are installed on the same
Linux box. The Documentum server sits on a different Linux box. The indexing
performance is slow (approx. 1000 documents per hour) with the Documentum
crawler. The properties file used is shown below for reference:

<configuration>
  <property name="org.apache.manifoldcf.logconfigfile" value="./logging.ini"/>
  <property name="org.apache.manifoldcf.connectorsconfigurationfile" value="../connectors.xml"/>
  <property name="org.apache.manifoldcf.fileresources" value="../file-resources"/>
  <property name="org.apache.manifoldcf.databaseimplementation" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
  <property name="org.apache.manifoldcf.postgresql.hostname" value="localhost"/>
  <property name="org.apache.manifoldcf.database.username" value="postgres"/>
  <property name="org.apache.manifoldcf.database.name" value="manifoldcf"/>
  <property name="org.apache.manifoldcf.database.password" value="postgres"/>
  <property name="org.apache.manifoldcf.crawler.threads" value="15"/>
  <property name="org.apache.manifoldcf.crawler.repository.store_history" value="false"/>
  <property name="org.apache.manifoldcf.zookeeper.connectstring" value="***:8349"/>
  <property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="5000"/>
</configuration>

Initially org.apache.manifoldcf.crawler.threads was set to 45, and the
observation is that there is a long time gap between each batch of 45
documents during processing.
Can you please point out any changes/recommendations that will speed up the
indexing?

Regards,
Tamizh Kumaran Thamizharasan




run.sh
Description: run.sh


Re: ManifoldCF slow documentum indexing performance

2017-07-05 Thread Karl Wright
Hi Tamizh,

The likely culprit is Documentum itself.  In my experience it can be quite
slow, depending on how it is configured.  But you can confirm that by
monitoring the CPU usage of Postgresql, the agents process, and the
documentum server process.  If none of these are CPU bound, then Documentum
itself is the problem.

Thanks,
Karl


On Wed, Jul 5, 2017 at 8:24 AM, Tamizh Kumaran Thamizharasan <
tthamizhara...@worldbankgroup.org> wrote:

> Hi Team,
>
>
>
> PostgreSQL 9.2, Solr 5.3.2 and ManifoldCF 2.7.1 are installed on the same
> Linux box. The Documentum server sits on a different Linux box. The indexing
> performance is slow (approx. 1000 documents per hour) with the Documentum
> crawler. The properties file used is shown below for reference:
>
> <configuration>
>   <property name="org.apache.manifoldcf.logconfigfile" value="./logging.ini"/>
>   <property name="org.apache.manifoldcf.connectorsconfigurationfile" value="../connectors.xml"/>
>   <property name="org.apache.manifoldcf.fileresources" value="../file-resources"/>
>   <property name="org.apache.manifoldcf.databaseimplementation" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
>   <property name="org.apache.manifoldcf.postgresql.hostname" value="localhost"/>
>   <property name="org.apache.manifoldcf.database.username" value="postgres"/>
>   <property name="org.apache.manifoldcf.database.name" value="manifoldcf"/>
>   <property name="org.apache.manifoldcf.database.password" value="postgres"/>
>   <property name="org.apache.manifoldcf.crawler.threads" value="15"/>
>   <property name="org.apache.manifoldcf.crawler.repository.store_history" value="false"/>
>   <property name="org.apache.manifoldcf.zookeeper.connectstring" value="***:8349"/>
>   <property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="5000"/>
> </configuration>
>
> Initially org.apache.manifoldcf.crawler.threads was set to 45, and the
> observation is that there is a long time gap between each batch of 45
> documents during processing.
>
> Can you please point out any changes/recommendations that will speed up
> the indexing?
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>


Re: Sharepoint Repository Connector: Metadata Changes not causing re-index library or list items

2017-06-30 Thread Karl Wright
If it's computed from other attributes, then don't the other attributes
need to change in order for the lookup attribute's value to change?

Karl


On Fri, Jun 30, 2017 at 9:13 AM, <markus.sch...@deutschebahn.com> wrote:

> Hi Karl,
>
> we found out that the affected metadata comes from a lookup field that
> is computed from attributes of the containing list.
> Such fields do not change the modified date.
>
> We could re-index all list items when the list itself is modified (by
> carrying down the modified date of the list, for example). But this would
> probably lead to too many unnecessary version changes for other lists.
> I will check if there are possibilities to detect lookup fields with the
> API.
>
> Regards,
> Markus
>


Re: ManifoldCF documentum indexing issue

2017-06-22 Thread Karl Wright
I have committed a stop-gap solution to the MCF Solr connector, but the
real problem is in Apache HttpComponents/HttpClient.  I've gotten
permission to suggest a fix for that project as well.

Karl
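
For context, the underlying problem is the filename carried in the multipart
Content-Disposition header: double quotes and backslashes in it must be
escaped, or the receiving end mis-parses the part. A minimal sketch of the
idea (illustrative only, not the committed connector code):

// Escape backslashes and double quotes before placing a filename in a
// multipart Content-Disposition header, e.g. for the test file named
// "dummy" file "name.pdf mentioned earlier in this thread.
static String escapeMultipartFilename(String fileName) {
  return fileName.replace("\\", "\\\\").replace("\"", "\\\"");
}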


On Thu, Jun 22, 2017 at 4:27 AM, Tamizh Kumaran Thamizharasan <
tthamizhara...@worldbankgroup.org> wrote:

> Thanks Karl.
>
>
>
> After installing the patch, filenames with double quotes and backslashes
> were getting indexed to Solr, and the issue is resolved.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, June 21, 2017 5:07 PM
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> I've attached a tentative patch to the ticket CONNECTORS-1434.  Please
> confirm whether or not the patch works for you before I commit it to trunk.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 21, 2017 at 6:49 AM, Tamizh Kumaran Thamizharasan <
> tthamizhara...@worldbankgroup.org> wrote:
>
> Thanks Karl.
>
>
>
> Please find the below steps to recreate the issue on file system
> repository.
>
>
>
> Output connector : Solr
>
> Repository : File system
>
> File name in repository : “dummy” file “name.pdf
>
>
>
> Additional Solr parameter : expandMacros=false
>
>
>
> On starting the job with above configuration, we are getting “missing
> content stream” .
>
> Please find the attached file for complete log trace.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, June 21, 2017 3:35 PM
>
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> I've created a ticket, CONNECTORS-1434, to look at the file name issues.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <daddy...@gmail.com> wrote:
>
> There is no good way to handle a case where Solr doesn't like the file
> name.  About the only thing that could be done would be to encode the
> filename using something like URL encoding.  This might have some effects
> on existing users, but more importantly, we really would need to know what
> characters were legal before adopting that solution.
>
>
>
> I am not entirely sure how the file name is transmitted to Solr when using
> multipart forms, but how that is done is critical to know what to do.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <
> tthamizhara...@worldbankgroup.org> wrote:
>
> Hi Karl,
>
>
>
> Thanks for the update!!!
>
>
>
> As per the response from the Solr team, expandMacros=false is added to the
> output connector as an additional parameter.
>
> After adding expandMacros=false, the indexing job is getting completed
> with a “Missing content stream” error for a few of the documents, and those
> are not indexed into Solr.
>
>
>
> As per our analysis, the pdf document’s file name we are trying to index
> from documentum contains whitespace and special characters like double
> quotes.
>
> This makes the file unreadable, and the missing content stream error is
> thrown.
>
>
>
> If there is any workaround to overcome this issue, kindly share it with
> us.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, June 14, 2017 7:20 PM
>
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> Here's the response:
>
>
>
> >>>>>>
>
> Karl -
>
> There’s expandMacros=false, as covered here: https://cwiki.apache.
> org/confluence/display/solr/Parameter+Substitution
>
> But… what exactly is being sent to Solr?  Is there some kind of “${…”
> being sent as a parameter?   Just curious what’s getting you into this in
> the first place.   But disabling probably is your most desired solution.
>
> Erik
>
> <<<<<<
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <daddy...@gmail.com> wrote:
>
> Here's the question I posted:
>
>
>
> >>>>>>
>
> Hi all,
>
>
>
> I've got a ManifoldCF user who is posting content to Solr 

RE: ManifoldCF documentum indexing issue

2017-06-22 Thread Tamizh Kumaran Thamizharasan
Thanks Karl.

After installing the patch, filenames with double quotes and backslashes were 
getting indexed to Solr, and the issue is resolved.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Wednesday, June 21, 2017 5:07 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

I've attached a tentative patch to the ticket CONNECTORS-1434.  Please confirm 
whether or not the patch works for you before I commit it to trunk.

Karl


On Wed, Jun 21, 2017 at 6:49 AM, Tamizh Kumaran Thamizharasan 
<tthamizhara...@worldbankgroup.org<mailto:tthamizhara...@worldbankgroup.org>> 
wrote:
Thanks Karl.

Please find the below steps to recreate the issue on file system repository.

Output connector : Solr
Repository : File system
File name in repository : “dummy” file “name.pdf

Additional Solr parameter : expandMacros=false

On starting the job with the above configuration, we are getting a “missing content 
stream” error.
Please find the attached file for complete log trace.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddy...@gmail.com<mailto:daddy...@gmail.com>]
Sent: Wednesday, June 21, 2017 3:35 PM

To: user@manifoldcf.apache.org<mailto:user@manifoldcf.apache.org>
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

I've created a ticket, CONNECTORS-1434, to look at the file name issues.

Karl


On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright 
<daddy...@gmail.com<mailto:daddy...@gmail.com>> wrote:
There is no good way to handle a case where Solr doesn't like the file name.  
About the only thing that could be done would be to encode the filename using 
something like URL encoding.  This might have some effects on existing users, 
but more importantly, we really would need to know what characters were legal 
before adopting that solution.

I am not entirely sure how the file name is transmitted to Solr when using 
multipart forms, but how that is done is critical to know what to do.

Karl
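(For illustration only: a minimal, hypothetical sketch of the URL-encoding idea Karl floats above, using the standard java.net.URLEncoder. The eventual stop-gap fix was committed to the MCF Solr connector under CONNECTORS-1434 and may differ from this.)

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FileNameEncodingSketch {
  public static void main(String[] args) throws UnsupportedEncodingException {
    // The problematic test filename from this thread, containing double quotes.
    String fileName = "\"dummy\" file \"name.pdf";
    // URL-encoding makes quotes and backslashes safe to transmit inside a
    // multipart form's Content-Disposition header.
    String encoded = URLEncoder.encode(fileName, "UTF-8");
    System.out.println(encoded); // prints %22dummy%22+file+%22name.pdf
  }
}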


On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan 
<tthamizhara...@worldbankgroup.org<mailto:tthamizhara...@worldbankgroup.org>> 
wrote:
Hi Karl,

Thanks for the update!!!

As per the response from the Solr team, expandMacros=false is added to the output 
connector as an additional parameter.
After adding expandMacros=false, the indexing job is getting completed with 
a “Missing content stream” error for a few of the documents, and those are not 
indexed into Solr.

As per our analysis, the pdf document’s file name we are trying to index from 
documentum contains whitespace and special characters like double quotes.
This makes the file unreadable, and the missing content stream error is thrown.

If there is any workaround to overcome this issue, kindly share it with us.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddy...@gmail.com<mailto:daddy...@gmail.com>]
Sent: Wednesday, June 14, 2017 7:20 PM

To: user@manifoldcf.apache.org<mailto:user@manifoldcf.apache.org>
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Here's the response:

>>>>>>
Karl -

There’s expandMacros=false, as covered here: 
https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution

But… what exactly is being sent to Solr?  Is there some kind of “${…” being 
sent as a parameter?   Just curious what’s getting you into this in the first 
place.   But disabling probably is your most desired solution.

Erik
<<<<<<

Karl
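(A hedged aside: the parameter Erik mentions can be tried outside ManifoldCF with a small SolrJ program; the sketch below assumes SolrJ 6.x and reuses the core name from this thread. In MCF itself it is simply added as an additional parameter on the Solr output connection.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExpandMacrosSketch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/documentum_manifoldcf_stg").build()) {
      SolrQuery q = new SolrQuery("*:*");
      // Disable macro expansion so a literal "${..." in a parameter is not
      // fed to Solr's MacroExpander (the source of the substring exception).
      q.set("expandMacros", "false");
      QueryResponse rsp = solr.query(q);
      System.out.println("Found " + rsp.getResults().getNumFound() + " docs");
    }
  }
}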


On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright 
<daddy...@gmail.com<mailto:daddy...@gmail.com>> wrote:
Here's the question I posted:

>>>>>>
Hi all,

I've got a ManifoldCF user who is posting content to Solr using the MCF Solr 
output connector.  This connector uses SolrJ under the covers -- a fairly 
recent version -- but also has overridden some classes to ensure that multipart 
form posts will be used for most content.

The problem is that, for a specific document, the user is getting an 
ArrayIndexOutOfBounds exception in Solr, as follows:

>>>>>>
2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] - 
{collection=c:documentum_manifoldcf_stg, 
core=x:documentum_manifoldcf_stg_shard1_replica1, 
node_name=n:**:8983_solr, replica=r:core_node1, shard=s:shard1} - 
java.lang.StringIndexOutOfBoundsException: String index out of range: -296
at java.lang.String.substring(String.java:1911)
at 
org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
at 
org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
at 
org.apache.sol

Re: ManifoldCF documentum indexing issue

2017-06-21 Thread Karl Wright
I've attached a tentative patch to the ticket CONNECTORS-1434.  Please
confirm whether or not the patch works for you before I commit it to trunk.

Karl


On Wed, Jun 21, 2017 at 6:49 AM, Tamizh Kumaran Thamizharasan <
tthamizhara...@worldbankgroup.org> wrote:

> Thanks Karl.
>
>
>
> Please find the below steps to recreate the issue on file system
> repository.
>
>
>
> Output connector : Solr
>
> Repository : File system
>
> File name in repository : “dummy” file “name.pdf
>
>
>
> Additional Solr parameter : expandMacros=false
>
>
>
> On starting the job with the above configuration, we are getting a “missing
> content stream” error.
>
> Please find the attached file for complete log trace.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, June 21, 2017 3:35 PM
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> I've created a ticket, CONNECTORS-1434, to look at the file name issues.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <daddy...@gmail.com> wrote:
>
> There is no good way to handle a case where Solr doesn't like the file
> name.  About the only thing that could be done would be to encode the
> filename using something like URL encoding.  This might have some effects
> on existing users, but more importantly, we really would need to know what
> characters were legal before adopting that solution.
>
>
>
> I am not entirely sure how the file name is transmitted to Solr when using
> multipart forms, but how that is done is critical to know what to do.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <
> tthamizhara...@worldbankgroup.org> wrote:
>
> Hi Karl,
>
>
>
> Thanks for the update!!!
>
>
>
> As per the response from the Solr team, expandMacros=false is added to the
> output connector as an additional parameter.
>
> After adding expandMacros=false, the indexing job is getting completed
> with a “Missing content stream” error for a few of the documents, and those
> are not indexed into Solr.
>
>
>
> As per our analysis, the pdf document’s file name we are trying to index
> from documentum contains whitespace and special characters like double
> quotes.
>
> This makes the file unreadable, and the missing content stream error is
> thrown.
>
>
>
> If there is any workaround to overcome this issue, kindly share it with
> us.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, June 14, 2017 7:20 PM
>
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> Here's the response:
>
>
>
> >>>>>>
>
> Karl -
>
> There’s expandMacros=false, as covered here: https://cwiki.apache.
> org/confluence/display/solr/Parameter+Substitution
>
> But… what exactly is being sent to Solr?  Is there some kind of “${…”
> being sent as a parameter?   Just curious what’s getting you into this in
> the first place.   But disabling probably is your most desired solution.
>
> Erik
>
> <<<<<<
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <daddy...@gmail.com> wrote:
>
> Here's the question I posted:
>
>
>
> >>>>>>
>
> Hi all,
>
>
>
> I've got a ManifoldCF user who is posting content to Solr using the MCF
> Solr output connector.  This connector uses SolrJ under the covers -- a
> fairly recent version -- but also has overridden some classes to ensure
> that multipart form posts will be used for most content.
>
>
>
> The problem is that, for a specific document, the user is getting an
> ArrayIndexOutOfBounds exception in Solr, as follows:
>
>
>
> >>>>>>
>
> 2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] -
> {collection=c:documentum_manifoldcf_stg, 
> core=x:documentum_manifoldcf_stg_shard1_replica1,
> node_name=n:**:8983_solr, replica=r:core_node1, shard=s:shard1} -
> java.lang.StringIndexOutOfBoundsException: String index out of range: -296
>
> at java.lang.String.substring(String.java:1911)
>
> at org.apache.solr.request.macro.MacroExpander._expand(
> MacroE

RE: ManifoldCF documentum indexing issue

2017-06-21 Thread Tamizh Kumaran Thamizharasan
Thanks Karl.

Please find the below steps to recreate the issue on file system repository.

Output connector : Solr
Repository : File system
File name in repository : “dummy” file “name.pdf

Additional Solr parameter : expandMacros=false

On starting the job with the above configuration, we are getting a “missing content 
stream” error.
Please find the attached file for complete log trace.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Wednesday, June 21, 2017 3:35 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

I've created a ticket, CONNECTORS-1434, to look at the file name issues.

Karl


On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright 
<daddy...@gmail.com<mailto:daddy...@gmail.com>> wrote:
There is no good way to handle a case where Solr doesn't like the file name.  
About the only thing that could be done would be to encode the filename using 
something like URL encoding.  This might have some effects on existing users, 
but more importantly, we really would need to know what characters were legal 
before adopting that solution.

I am not entirely sure how the file name is transmitted to Solr when using 
multipart forms, but how that is done is critical to know what to do.

Karl


On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan 
<tthamizhara...@worldbankgroup.org<mailto:tthamizhara...@worldbankgroup.org>> 
wrote:
Hi Karl,

Thanks for the update!!!

As per the response from the Solr team, expandMacros=false is added to the output 
connector as an additional parameter.
After adding expandMacros=false, the indexing job is getting completed with 
a “Missing content stream” error for a few of the documents, and those are not 
indexed into Solr.

As per our analysis, the pdf document’s file name we are trying to index from 
documentum contains whitespace and special characters like double quotes.
This makes the file unreadable, and the missing content stream error is thrown.

If there is any workaround to overcome this issue, kindly share it with us.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddy...@gmail.com<mailto:daddy...@gmail.com>]
Sent: Wednesday, June 14, 2017 7:20 PM

To: user@manifoldcf.apache.org<mailto:user@manifoldcf.apache.org>
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Here's the response:

>>>>>>
Karl -

There’s expandMacros=false, as covered here: 
https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution

But… what exactly is being sent to Solr?  Is there some kind of “${…” being 
sent as a parameter?   Just curious what’s getting you into this in the first 
place.   But disabling probably is your most desired solution.

Erik
<<<<<<

Karl


On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright 
<daddy...@gmail.com<mailto:daddy...@gmail.com>> wrote:
Here's the question I posted:

>>>>>>
Hi all,

I've got a ManifoldCF user who is posting content to Solr using the MCF Solr 
output connector.  This connector uses SolrJ under the covers -- a fairly 
recent version -- but also has overridden some classes to ensure that multipart 
form posts will be used for most content.

The problem is that, for a specific document, the user is getting an 
ArrayIndexOutOfBounds exception in Solr, as follows:

>>>>>>
2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] - 
{collection=c:documentum_manifoldcf_stg, 
core=x:documentum_manifoldcf_stg_shard1_replica1, 
node_name=n:**:8983_solr, replica=r:core_node1, shard=s:shard1} - 
java.lang.StringIndexOutOfBoundsException: String index out of range: -296
at java.lang.String.substring(String.java:1911)
at 
org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
at 
org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
at 
org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
at 
org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
at 
org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
at 
org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
at 
org.eclipse.jetty.servlet.Ser

Re: ManifoldCF documentum indexing issue

2017-06-21 Thread Karl Wright
I've created a ticket, CONNECTORS-1434, to look at the file name issues.

Karl


On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <daddy...@gmail.com> wrote:

> There is no good way to handle a case where Solr doesn't like the file
> name.  About the only thing that could be done would be to encode the
> filename using something like URL encoding.  This might have some effects
> on existing users, but more importantly, we really would need to know what
> characters were legal before adopting that solution.
>
> I am not entirely sure how the file name is transmitted to Solr when using
> multipart forms, but how that is done is critical to know what to do.
>
> Karl
>
>
> On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <
> tthamizhara...@worldbankgroup.org> wrote:
>
>> Hi Karl,
>>
>>
>>
>> Thanks for the update!!!
>>
>>
>>
>> As per the response from the Solr team, expandMacros=false is added to the
>> output connector as an additional parameter.
>>
>> After adding expandMacros=false, the indexing job is getting completed
>> with a “Missing content stream” error for a few of the documents, and those
>> are not indexed into Solr.
>>
>>
>>
>> As per our analysis, the pdf document’s file name we are trying to index
>> from documentum contains whitespace and special characters like double
>> quotes.
>>
>> This makes the file unreadable, and the missing content stream error is
>> thrown.
>>
>>
>>
>> If there is any workaround to overcome this issue, kindly share it with
>> us.
>>
>>
>>
>> Regards,
>>
>> Tamizh Kumaran Thamizharasan
>>
>>
>>
>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>> *Sent:* Wednesday, June 14, 2017 7:20 PM
>>
>> *To:* user@manifoldcf.apache.org
>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>> *Subject:* Re: ManifoldCF documentum indexing issue
>>
>>
>>
>> Here's the response:
>>
>>
>>
>> >>>>>>
>>
>> Karl -
>>
>> There’s expandMacros=false, as covered here: https://cwiki.apache.org
>> /confluence/display/solr/Parameter+Substitution
>>
>> But… what exactly is being sent to Solr?  Is there some kind of “${…”
>> being sent as a parameter?   Just curious what’s getting you into this in
>> the first place.   But disabling probably is your most desired solution.
>>
>> Erik
>>
>> <<<<<<
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <daddy...@gmail.com> wrote:
>>
>> Here's the question I posted:
>>
>>
>>
>> >>>>>>
>>
>> Hi all,
>>
>>
>>
>> I've got a ManifoldCF user who is posting content to Solr using the MCF
>> Solr output connector.  This connector uses SolrJ under the covers -- a
>> fairly recent version -- but also has overridden some classes to ensure
>> that multipart form posts will be used for most content.
>>
>>
>>
>> The problem is that, for a specific document, the user is getting an
>> ArrayIndexOutOfBounds exception in Solr, as follows:
>>
>>
>>
>> >>>>>>
>>
>> 2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] -
>> {collection=c:documentum_manifoldcf_stg, 
>> core=x:documentum_manifoldcf_stg_shard1_replica1,
>> node_name=n:**:8983_solr, replica=r:core_node1, shard=s:shard1}
>> - java.lang.StringIndexOutOfBoundsException: String index out of range:
>> -296
>>
>> at java.lang.String.substring(String.java:1911)
>>
>> at org.apache.solr.request.macro.MacroExpander._expand(MacroExp
>> ander.java:143)
>>
>> at org.apache.solr.request.macro.MacroExpander.expand(MacroExpa
>> nder.java:93)
>>
>> at org.apache.solr.request.macro.MacroExpander.expand(MacroExpa
>> nder.java:59)
>>
>> at org.apache.solr.request.macro.MacroExpander.expand(MacroExpa
>> nder.java:45)
>>
>> at org.apache.solr.request.json.RequestUtil.processParams(Reque
>> stUtil.java:157)
>>
>> at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginU
>> tils.java:172)
>>
>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(Req
>> uestHandlerBase.java:152)
>>
>> at org.apache.solr.core.Sol

Re: ManifoldCF documentum indexing issue

2017-06-14 Thread Karl Wright
 Solr itself.
>>
>> I wish there was an easy fix for this.  The problem is *not* an empty
>> stream; it's that Solr is attempting to do something with it that it
>> shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
>> from that.
>>
>> >>>>>>
>> https://**/webtop/component/drl?versionLabel=CURRENT=091e8486805142f5
>> (500)
>> <<<<<<
>>
>> Karl
>>
>>
>>
>>
>> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
>> tthamizhara...@worldbankgroup.org> wrote:
>>
>>> Hi Karl,
>>>
>>>
>>>
>>> After configuring Solr to ignore Tika errors by adding the Tika transformer
>>> in the job, the below behavior is observed.
>>>
>>>
>>>
>>> 1)  ManifoldCF fetches the content from documentum, which contains
>>> null content, and tries to push it to the output connector (Solr).
>>>
>>> 2)  Solr couldn’t accept the null as a value and throws a “Missing
>>> content stream” error.
>>>
>>> 3)  Each agent thread in ManifoldCF is internally held up with
>>> different r_object_ids that don’t have body content and keeps trying to
>>> push the content to Solr after each failure, but Solr couldn’t accept the
>>> content and throws the same error.
>>>
>>> 4)  Over time, the manifold job stops with the error thrown by
>>> Solr.
>>>
>>>
>>>
>>> Please let us know if there is any configuration change which can help us
>>> resolve this issue.
>>>
>>>
>>>
>>> Please find the attached manifoldCF error log, Solr error log, and agent
>>> log.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Tamizh Kumaran.
>>>
>>>
>>>
>>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>>> *Sent:* Tuesday, June 13, 2017 2:23 PM
>>> *To:* user@manifoldcf.apache.org
>>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>>> *Subject:* Re: ManifoldCF documentum indexing issue
>>>
>>>
>>>
>>> Hi Tamizh,
>>>
>>>
>>>
>>> The reported error is 'Error from server at http://localhost:8983/solr/
>>> documentum_manifoldcf_stg: String index out of range: -188'.  The
>>> message seemingly indicates that the error was *received* from the solr
>>> server for one specific document.  ManifoldCF does not recognize the error
>>> as being innocuous and therefore it will retry for a while until it
>>> eventually gives up and halts the job.  However, I cannot find that exact
>>> text anywhere in the Solr output connector code, so I wonder if you
>>> transcribed it correctly?
>>>
>>> There should also be the following:
>>>
>>> (1) A record of the attempts in the manifoldcf.log file, with a MCF
>>> stack trace attached to each one;
>>>
>>> (2) Simple history records for that document that are of the type
>>> INGESTDOCUMENT.
>>>
>>> (3) Solr log entries that have a Solr stack trace.
>>>
>>>
>>>
>>> The last one is the one that would be the most helpful.  It is possible
>>> that you are seeing a problem in Solr Cell (Tika) that is manifesting
>>> itself in this way.  You can (and should) configure your Solr to ignore
>>> Tika errors.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
>>> tthamizhara...@worldbankgroup.org> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> The Manifoldcf 2.7.1 is running in the multiprocess zk model and
>>> integrated with PostgreSQL 9.3. The expected setup is to crawl the
>>> Documentum contents and push them to the output SOLR 5.3.2. The crawler-ui
>>> app is installed on the tomcat, and the startup script is pointed at the MF
>>> properties.xml during server startup. Manifold, along with the bundled ZK
>>> and tomcat, runs on the same host, with Red Hat Enterprise Linux
>>> Server release 6.9 (Santiago) as the OS. The DB is running on a windows box.
>>>
>>> The ZK is integrated with the DB through the properties.xml and
>>> properties-global.xml.
>>>
>>> The ZK and the documentum-related processes (registry and server) are up,
>>> and the two agents (start-agents.sh and start-agents-2.sh) are started,
>>> which produce multiple threads to index the documentum contents into SOLR
>>> through ManifoldCF.
>>>
>>>
>>>
>>> The current number of connections configured on the MF is as below.
>>>
>>> SOLR Output max connection : 25
>>>
>>> Document repository  Max Connections: 25
>>>
>>> Properties.xml:
>>>
>>> 
>>>
>>> 
>>>
>>> Total documentum document count : 0.5 million
>>>
>>>
>>>
>>> After the job is started, it indexes some 2+ documents and gets
>>> terminated with the below error on the Manifold job.
>>>
>>> Error: Repeated service interruptions - failure processing document:
>>> Error from server at http://localhost:8983/solr/doc
>>> umentum_manifoldcf_stg: String index out of range: -188
>>>
>>>
>>>
>>> Please find the attached manifoldCF error log and agent log.
>>>
>>>
>>>
>>> Please let me know the observations on the cause of the issue and the
>>> configuration of the threads used for crawling. Please share your thoughts.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Tamizh Kumaran
>>>
>>>
>>>
>>>
>>>
>>
>>
>


Re: ManifoldCF documentum indexing issue

2017-06-14 Thread Karl Wright
I posted the pertinent question to the solr dev list.  Let's see what they
say.

Thanks,
Karl


On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <daddy...@gmail.com> wrote:

> Hi,
>
> The exception in the solr.log should be reported as a Solr bug.  It is not
> emanating from the Tika extractor (Solr Cell), but is in Solr itself.
>
> I wish there was an easy fix for this.  The problem is *not* an empty
> stream; it's that Solr is attempting to do something with it that it
> shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
> from that.
>
> >>>>>>
> https://**/webtop/component/drl?versionLabel=CURRENT=091e8486805142f5
> (500)
> <<<<<<
>
> Karl
>
>
>
>
> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
> tthamizhara...@worldbankgroup.org> wrote:
>
>> Hi Karl,
>>
>>
>>
>> After configuring Solr to ignore Tika errors by adding the Tika transformer
>> in the job, the below behavior is observed.
>>
>>
>>
>> 1)  ManifoldCF fetches the content from documentum, which contains
>> null content, and tries to push it to the output connector (Solr).
>>
>> 2)  Solr couldn’t accept the null as a value and throws a “Missing
>> content stream” error.
>>
>> 3)  Each agent thread in ManifoldCF is internally held up with
>> different r_object_ids that don’t have body content and keeps trying to
>> push the content to Solr after each failure, but Solr couldn’t accept the
>> content and throws the same error.
>>
>> 4)  Over time, the manifold job stops with the error thrown by
>> Solr.
>>
>>
>>
>> Please let us know if there is any configuration change which can help us
>> resolve this issue.
>>
>>
>>
>> Please find the attached manifoldCF error log, Solr error log, and agent
>> log.
>>
>>
>>
>> Regards,
>>
>> Tamizh Kumaran.
>>
>>
>>
>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>> *Sent:* Tuesday, June 13, 2017 2:23 PM
>> *To:* user@manifoldcf.apache.org
>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>> *Subject:* Re: ManifoldCF documentum indexing issue
>>
>>
>>
>> Hi Tamizh,
>>
>>
>>
>> The reported error is 'Error from server at http://localhost:8983/solr/
>> documentum_manifoldcf_stg: String index out of range: -188'.  The
>> message seemingly indicates that the error was *received* from the solr
>> server for one specific document.  ManifoldCF does not recognize the error
>> as being innocuous and therefore it will retry for a while until it
>> eventually gives up and halts the job.  However, I cannot find that exact
>> text anywhere in the Solr output connector code, so I wonder if you
>> transcribed it correctly?
>>
>> There should also be the following:
>>
>> (1) A record of the attempts in the manifoldcf.log file, with a MCF stack
>> trace attached to each one;
>>
>> (2) Simple history records for that document that are of the type
>> INGESTDOCUMENT.
>>
>> (3) Solr log entries that have a Solr stack trace.
>>
>>
>>
>> The last one is the one that would be the most helpful.  It is possible
>> that you are seeing a problem in Solr Cell (Tika) that is manifesting
>> itself in this way.  You can (and should) configure your Solr to ignore
>> Tika errors.
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
>> tthamizhara...@worldbankgroup.org> wrote:
>>
>> Hi,
>>
>>
>>
>> The Manifoldcf 2.7.1 is running in the multiprocess zk model and
>> integrated with PostgreSQL 9.3. The expected setup is to crawl the
>> Documentum contents and push them to the output SOLR 5.3.2. The crawler-ui
>> app is installed on the tomcat, and the startup script is pointed at the MF
>> properties.xml during server startup. Manifold, along with the bundled ZK
>> and tomcat, runs on the same host, with Red Hat Enterprise Linux
>> Server release 6.9 (Santiago) as the OS. The DB is running on a windows box.
>>
>> The ZK is integrated with the DB through the properties.xml and
>> properties-global.xml.
>>
>> The ZK and the documentum-related processes (registry and server) are up,
>> and the two agents (start-agents.sh and start-agents-2.sh) are started,
>> which produce multiple threads to index the documentum cont

Re: ManifoldCF documentum indexing issue

2017-06-14 Thread Karl Wright
Hi,

The exception in the solr.log should be reported as a Solr bug.  It is not
emanating from the Tika extractor (Solr Cell), but is in Solr itself.

I wish there was an easy fix for this.  The problem is *not* an empty
stream; it's that Solr is attempting to do something with it that it
shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
from that.

>>>>>>
https://**/webtop/component/drl?versionLabel=CURRENT=091e8486805142f5
(500)
<<<<<<

Karl




On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
tthamizhara...@worldbankgroup.org> wrote:

> Hi Karl,
>
>
>
> After configuring Solr to ignore Tika errors by adding the Tika transformer
> in the job, the below behavior is observed.
>
>
>
> 1)  ManifoldCF fetches the content from documentum, which contains
> null content, and tries to push it to the output connector (Solr).
>
> 2)  Solr couldn’t accept the null as a value and throws a “Missing
> content stream” error.
>
> 3)  Each agent thread in ManifoldCF is internally held up with different
> r_object_ids that don’t have body content and keeps trying to push the
> content to Solr after each failure, but Solr couldn’t accept the content
> and throws the same error.
>
> 4)  Over time, the manifold job stops with the error thrown by
> Solr.
>
>
>
> Please let us know if there is any configuration change which can help us
> resolve this issue.
>
>
>
> Please find the attached manifoldCF error log, Solr error log, and agent log.
>
>
>
> Regards,
>
> Tamizh Kumaran.
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Tuesday, June 13, 2017 2:23 PM
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> Hi Tamizh,
>
>
>
> The reported error is 'Error from server at http://localhost:8983/solr/
> documentum_manifoldcf_stg: String index out of range: -188'.  The message
> seemingly indicates that the error was *received* from the solr server for
> one specific document.  ManifoldCF does not recognize the error as being
> innocuous and therefore it will retry for a while until it eventually gives
> up and halts the job.  However, I cannot find that exact text anywhere in
> the Solr output connector code, so I wonder if you transcribed it correctly?
>
> There should also be the following:
>
> (1) A record of the attempts in the manifoldcf.log file, with a MCF stack
> trace attached to each one;
>
> (2) Simple history records for that document that are of the type
> INGESTDOCUMENT.
>
> (3) Solr log entries that have a Solr stack trace.
>
>
>
> The last one is the one that would be the most helpful.  It is possible
> that you are seeing a problem in Solr Cell (Tika) that is manifesting
> itself in this way.  You can (and should) configure your Solr to ignore
> Tika errors.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
>
>
>
>
> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
> tthamizhara...@worldbankgroup.org> wrote:
>
> Hi,
>
>
>
> The Manifoldcf 2.7.1 is running in the multiprocess zk model and
> integrated with PostgreSQL 9.3. The expected setup is to crawl the
> Documentum contents and push them to the output SOLR 5.3.2. The crawler-ui
> app is installed on the tomcat, and the startup script is pointed at the MF
> properties.xml during server startup. Manifold, along with the bundled ZK
> and tomcat, runs on the same host, with Red Hat Enterprise Linux
> Server release 6.9 (Santiago) as the OS. The DB is running on a windows box.
>
> The ZK is integrated with the DB through the properties.xml and
> properties-global.xml.
>
> The ZK and the documentum-related processes (registry and server) are up,
> and the two agents (start-agents.sh and start-agents-2.sh) are started,
> which produce multiple threads to index the documentum contents into SOLR
> through ManifoldCF.
>
>
>
> The current number of connections configured on the MF is as below.
>
> SOLR Output max connection : 25
>
> Document repository  Max Connections: 25
>
> Properties.xml:
>
> 
>
> 
>
> Total documentum document count : 0.5 million
>
>
>
> After the job is started, it indexes some 2+ documents and gets
> terminated with the below error on the Manifold job.
>
> Error: Repeated service interruptions - failure processing document: Error
> from server at http://localhost:8983/solr/documentum_manifoldcf_stg:
> String index out of range: -188
>
>
>
> Please find the attached manifoldCF error log and agent log.
>
>
>
> Please let me know the observations on the cause of the issue and the
> configuration of the threads used for crawling. Please share your thoughts.
>
>
>
> Regards,
>
> Tamizh Kumaran
>
>
>
>
>


Re: ManifoldCF documentum indexing issue

2017-06-13 Thread Karl Wright
Hi Tamizh,

The reported error is 'Error from server at http://localhost:8983/solr/
documentum_manifoldcf_stg: String index out of range: -188'.  The message
seemingly indicates that the error was *received* from the solr server for
one specific document.  ManifoldCF does not recognize the error as being
innocuous and therefore it will retry for a while until it eventually gives
up and halts the job.  However, I cannot find that exact text anywhere in
the Solr output connector code, so I wonder if you transcribed it correctly?

There should also be the following:
(1) A record of the attempts in the manifoldcf.log file, with a MCF stack
trace attached to each one;
(2) Simple history records for that document that are of the type
INGESTDOCUMENT.
(3) Solr log entries that have a Solr stack trace.

The last one is the one that would be the most helpful.  It is possible
that you are seeing a problem in Solr Cell (Tika) that is manifesting
itself in this way.  You can (and should) configure your Solr to ignore
Tika errors.

Thanks,
Karl




On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
tthamizhara...@worldbankgroup.org> wrote:

> Hi,
>
>
>
> The Manifoldcf 2.7.1 is running in the multiprocess zk model and
> integrated with PostgreSQL 9.3. The expected setup is to crawl the
> Documentum contents and push them to the output SOLR 5.3.2. The crawler-ui
> app is installed on the tomcat, and the startup script is pointed at the MF
> properties.xml during server startup. Manifold, along with the bundled ZK
> and tomcat, runs on the same host, with Red Hat Enterprise Linux
> Server release 6.9 (Santiago) as the OS. The DB is running on a windows box.
>
> The ZK is integrated with the DB through the properties.xml and
> properties-global.xml.
>
> The ZK and the documentum-related processes (registry and server) are up,
> and the two agents (start-agents.sh and start-agents-2.sh) are started,
> which produce multiple threads to index the documentum contents into SOLR
> through ManifoldCF.
>
>
>
> The current number of connections configured on the MF is as below.
>
> SOLR Output max connection : 25
>
> Document repository  Max Connections: 25
>
> Properties.xml:
>
> 
>
> 
>
> Total documentum document count : 0.5 million
>
>
>
> After the job is started, it indexes some 2+ documents and gets
> terminated with the below error on the Manifold job.
>
> Error: Repeated service interruptions - failure processing document: Error
> from server at http://localhost:8983/solr/documentum_manifoldcf_stg:
> String index out of range: -188
>
>
>
> Please find the attached manifoldCF error log and agent log.
>
>
>
> Please let me know the observations on the cause of the issue and the
> configuration of the threads used for crawling. Please share your thoughts.
>
>
>
> Regards,
>
> Tamizh Kumaran
>
>
>


Re: UTF-8 Format from Confluence to Solr

2017-06-12 Thread Karl Wright
Committed a fix.
Karl


On Mon, Jun 12, 2017 at 7:27 PM, Karl Wright  wrote:

> There's already a ticket for this, assigned to me.  CONNECTORS-1251.  I'll
> freshen it up.
>
> Karl
>
>
>
>
> On Mon, Jun 12, 2017 at 2:52 PM, Furkan KAMACI 
> wrote:
>
>> Hi Marisol,
>>
>> You can create a ticket from here: https://issues.apache.or
>> g/jira/projects/CONNECTORS
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>>
>> 12 Haz 2017 Pzt, saat 18:25 tarihinde Marisol Redondo <
>> marisol.redondo.gar...@gmail.com> şunu yazdı:
>>
>>> How can I do that?
>>>
>>> On 1 June 2017 at 16:43, Antonio David Pérez Morales <
>>> adperezmora...@gmail.com> wrote:
>>>
 Hi Marisol

 Would you mind creating a ticket and providing a patch?

 This way we can test it on our end and include it in the next
 Manifold release.

 Thanks

 Regards

 2017-06-01 16:28 GMT+02:00 Marisol Redondo <
 marisol.redondo.gar...@gmail.com>:

> I fixed the problem.
>
> The problem is that the Confluence connector is getting the entity of
> the request with the default encoding ("ISO-8859-1"), and not UTF-8.
>
> To fix that, I made a change in the Confluence connector, and each
> time it reads the request's entity I use
> EntityUtils.toString(entity, *"UTF-8"*)
>
> Thanks
>
>
> On 31 May 2017 at 10:13, Marisol Redondo <
> marisol.redondo.gar...@gmail.com> wrote:
>
>> Hi.
>>
>> I'm having problems with the encoding when injecting in Solr 6 in
>> standalone mode from a Confluence wiki.
>>
>> I have Manifold 2.5 with Tomcat-8.
>>
>> The repository connector from the job takes the information from a
>> Confluence wiki, and the output connector is Solr, using the Tika
>> transformation, a custom transformation and a Metadata adjuster.
>>
>> When the document is injected into solr, the content of the document
>> has some characters that shouldn't be there because they are not in the
>> confluence page, mainly a  character.
>>
>> I have checked confluence, the tomcat server where manifold is
>> running, that the http request to confluence has the Accept-Charset header
>> set to UTF-8, and that the solr server is accepting UTF-8.
>>
>> In the log, I have seen that when retrieving the information from
>> confluence, the content is fine, and when it's sending the information to
>> solr, it has the character. I have tried without using any transformer and
>> got the same log entry.
>>
>> Is this a bug or how can I resolve this?
>>
>> Thanks for your help
>>
>>
>>
>>
>>
>

>>>
>


Re: UTF-8 Format from Confluence to Solr

2017-06-12 Thread Karl Wright
There's already a ticket for this, assigned to me.  CONNECTORS-1251.  I'll
freshen it up.

Karl




On Mon, Jun 12, 2017 at 2:52 PM, Furkan KAMACI 
wrote:

> Hi Marisol,
>
> You can create a ticket from here: https://issues.apache.
> org/jira/projects/CONNECTORS
>
> Kind Regards,
> Furkan KAMACI
>
>
> 12 Haz 2017 Pzt, saat 18:25 tarihinde Marisol Redondo <
> marisol.redondo.gar...@gmail.com> şunu yazdı:
>
>> How can I do that?
>>
>> On 1 June 2017 at 16:43, Antonio David Pérez Morales <
>> adperezmora...@gmail.com> wrote:
>>
>>> Hi Marisol
>>>
>>> Would you mind creating a ticket and providing a patch?
>>>
>>> This way we can test it on our end and include it in the next Manifold
>>> release.
>>>
>>> Thanks
>>>
>>> Regards
>>>
>>> 2017-06-01 16:28 GMT+02:00 Marisol Redondo <
>>> marisol.redondo.gar...@gmail.com>:
>>>
 I fixed the problem.

 The problem is that the Confluence connector is getting the entity of
 the request with the default encoding ("ISO-8859-1"), and not UTF-8.

 To fix that, I made a change in the Confluence connector, and each time
 it reads the request's entity I use EntityUtils.toString(entity,
 *"UTF-8"*)

 Thanks


 On 31 May 2017 at 10:13, Marisol Redondo  wrote:

> Hi.
>
> I'm having problems with the encoding when injecting in Solr 6 in
> standalone mode from a Confluence wiki.
>
> I have Manifold 2.5 with Tomcat-8.
>
> The repository connector from the job takes the information from a
> Confluence wiki, and the output connector is Solr, using the Tika
> transformation, a custom transformation and a Metadata adjuster.
>
> When the document is injected into solr, the content of the document
> has some characters that shouldn't be there because they are not in the
> confluence page, mainly a  character.
>
> I have checked confluence, the tomcat server where manifold is
> running, that the http request to confluence has the Accept-Charset header
> set to UTF-8, and that the solr server is accepting UTF-8.
>
> In the log, I have seen that when retrieving the information from
> confluence, the content is fine, and when it's sending the information to
> solr, it has the character. I have tried without using any transformer and
> got the same log entry.
>
> Is this a bug or how can I resolve this?
>
> Thanks for your help
>
>
>
>
>

>>>
>>


Re: UTF-8 Format from Confluence to Solr

2017-06-12 Thread Furkan KAMACI
Hi Marisol,

You can create a ticket from here:
https://issues.apache.org/jira/projects/CONNECTORS

Kind Regards,
Furkan KAMACI


12 Haz 2017 Pzt, saat 18:25 tarihinde Marisol Redondo <
marisol.redondo.gar...@gmail.com> şunu yazdı:

> How can I do that?
>
> On 1 June 2017 at 16:43, Antonio David Pérez Morales <
> adperezmora...@gmail.com> wrote:
>
>> Hi Marisol
>>
>> Would you mind creating a ticket and providing a patch?
>>
>> This way we can test it on our end and include it in the next Manifold
>> release.
>>
>> Thanks
>>
>> Regards
>>
>> 2017-06-01 16:28 GMT+02:00 Marisol Redondo <
>> marisol.redondo.gar...@gmail.com>:
>>
>>> I fixed the problem.
>>>
>>> The problem is that the Confluence connector is getting the entity of
>>> the request with the default encoding ("ISO-8859-1"), and not UTF-8.
>>>
>>> To fix that, I made a change in the Confluence connector, and each time
>>> it reads the request's entity I use EntityUtils.toString(entity,
>>> *"UTF-8"*)
>>>
>>> Thanks
>>>
>>>
>>> On 31 May 2017 at 10:13, Marisol Redondo <
>>> marisol.redondo.gar...@gmail.com> wrote:
>>>
 Hi.

 I'm having problems with the encoding when injecting in Solr 6 in
 standalone mode from a Confluence wiki.

 I have Manifold 2.5 with Tomcat-8.

 The repository connector from the job takes the information from a
 Confluence wiki, and the output connector is Solr, using the Tika
 transformation, a custom transformation and a Metadata adjuster.

 When the document is injected into solr, the content of the document
 has some characters that shouldn't be there because they are not in the
 confluence page, mainly a  character.

 I have checked confluence, the tomcat server where manifold is
 running, that the http request to confluence has the Accept-Charset header
 set to UTF-8, and that the solr server is accepting UTF-8.

 In the log, I have seen that when retrieving the information from
 confluence, the content is fine, and when it's sending the information to
 solr, it has the character. I have tried without using any transformer and
 got the same log entry.

 Is this a bug or how can I resolve this?

 Thanks for your help





>>>
>>
>


Re: UTF-8 Format from Confluence to Solr

2017-06-12 Thread Marisol Redondo
How can I do that?

On 1 June 2017 at 16:43, Antonio David Pérez Morales <
adperezmora...@gmail.com> wrote:

> Hi Marisol
>
> Would you mind creating a ticket and providing a patch?
>
> This way we can test it on our end and include it in the next Manifold
> release.
>
> Thanks
>
> Regards
>
> 2017-06-01 16:28 GMT+02:00 Marisol Redondo  com>:
>
>> I fixed the problem.
>>
>> The problem is that the Confluence connector is getting the entity of the
>> request with the default encoding ("ISO-8859-1"), and not UTF-8.
>>
>> To fix that, I made a change in the Confluence connector, and each time
>> it reads the request's entity I use EntityUtils.toString(entity,
>> *"UTF-8"*)
>>
>> Thanks
>>
>>
>> On 31 May 2017 at 10:13, Marisol Redondo > com> wrote:
>>
>>> Hi.
>>>
>>> I'm having problems with the encoding when injecting in Solr 6 in
>>> standalone mode from a Confluence wiki.
>>>
>>> I have Manifold 2.5 with Tomcat-8.
>>>
>>> The repository connector from the job takes the information from a
>>> Confluence wiki, and the output connector is Solr, using the Tika
>>> transformation, a custom transformation and a Metadata adjuster.
>>>
>>> When the document is injected into solr, the content of the document has
>>> some characters that shouldn't be there because they are not in the
>>> confluence page, mainly a  character.
>>>
>>> I have checked confluence, the tomcat server where manifold is
>>> running, that the http request to confluence has the Accept-Charset header
>>> set to UTF-8, and that the solr server is accepting UTF-8.
>>>
>>> In the log, I have seen that when retrieving the information from
>>> confluence, the content is fine, and when it's sending the information to
>>> solr, it has the character. I have tried without using any transformer and
>>> got the same log entry.
>>>
>>> Is this a bug or how can I resolve this?
>>>
>>> Thanks for your help
>>>
>>>
>>>
>>>
>>>
>>
>


Re: UTF-8 Format from Confluence to Solr

2017-06-01 Thread Antonio David Pérez Morales
Hi Marisol

Would you mind creating a ticket and providing a patch?

This way we can test it on our end and include it in the next Manifold
release.

Thanks

Regards

2017-06-01 16:28 GMT+02:00 Marisol Redondo :

> I fixed the problem.
>
> The problem is that the Confluence connector is getting the entity of the
> request with the default encoding ("ISO-8859-1"), and not UTF-8.
>
> To fix that, I made a change in the Confluence connector, and each time it
> reads the request's entity I use EntityUtils.toString(entity, *"UTF-8"*)
>
> Thanks
>
>
> On 31 May 2017 at 10:13, Marisol Redondo  > wrote:
>
>> Hi.
>>
>> I'm having problems with the encoding when injecting in Solr 6 in
>> standalone mode from a Confluence wiki.
>>
>> I have Manifold 2.5 with Tomcat-8.
>>
>> The repository connector from the job takes the information from a
>> Confluence wiki, and the output connector is Solr, using the Tika
>> transformation, a custom transformation and a Metadata adjuster.
>>
>> When the document is injected into solr, the content of the document has
>> some characters that shouldn't be there because they are not in the
>> confluence page, mainly a  character.
>>
>> I have checked confluence, the tomcat server where manifold is
>> running, that the http request to confluence has the Accept-Charset header
>> set to UTF-8, and that the solr server is accepting UTF-8.
>>
>> In the log, I have seen that when retrieving the information from
>> confluence, the content is fine, and when it's sending the information to
>> solr, it has the character. I have tried without using any transformer and
>> got the same log entry.
>>
>> Is this a bug or how can I resolve this?
>>
>> Thanks for your help
>>
>>
>>
>>
>>
>


Re: UTF-8 Format from Confluence to Solr

2017-06-01 Thread Marisol Redondo
I fixed the problem.

The problem is that the Confluence connector is getting the entity of the
request with the default encoding ("ISO-8859-1"), and not UTF-8.

To fix that, I made a change in the Confluence connector, and each time it
reads the request's entity I use EntityUtils.toString(entity, *"UTF-8"*)

Thanks
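(For context, a minimal, self-contained sketch of the kind of change Marisol describes above, assuming Apache HttpClient 4.x; the Confluence URL below is hypothetical, and this is not the actual connector code.)

import java.nio.charset.StandardCharsets;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ReadEntityUtf8Sketch {
  public static void main(String[] args) throws Exception {
    String url = "http://localhost:8090/rest/api/content"; // hypothetical endpoint
    try (CloseableHttpClient client = HttpClients.createDefault();
         CloseableHttpResponse response = client.execute(new HttpGet(url))) {
      HttpEntity entity = response.getEntity();
      // Without an explicit charset, EntityUtils falls back to ISO-8859-1
      // when the response declares none; forcing UTF-8 avoids stray characters.
      String body = EntityUtils.toString(entity, StandardCharsets.UTF_8);
      System.out.println(body);
    }
  }
}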


On 31 May 2017 at 10:13, Marisol Redondo 
wrote:

> Hi.
>
> I'm having problems with the encoding when injecting in Solr 6 in
> standalone mode from a Confluence wiki.
>
> I have Manifold 2.5 with Tomcat-8.
>
> The repository connector from the job takes the information from a
> Confluence wiki, and the output connector is Solr, using the Tika
> transformation, a custom transformation and a Metadata adjuster.
>
> When the document is injected into solr, the content of the document has
> some characters that shouldn't be there because they are not in the
> confluence page, mainly a  character.
>
> I have checked confluence, the tomcat server where manifold is
> running, that the http request to confluence has the Accept-Charset header
> set to UTF-8, and that the solr server is accepting UTF-8.
>
> In the log, I have seen that when retrieving the information from
> confluence, the content is fine, and when it's sending the information to
> solr, it has the character. I have tried without using any transformer and
> got the same log entry.
>
> Is this a bug or how can I resolve this?
>
> Thanks for your help
>
>
>
>
>


Re: ManifoldCF Indexing and Deletion

2017-05-26 Thread Karl Wright
Hi Tamizh,

What do you mean by "incremental run"?  If you mean what happens when you
click "Start minimal" here:
http://manifoldcf.apache.org/release/release-2.7.1/en_US/end-user-documentation.html#executing,
then this behavior is the way it is supposed to work.  You must click the
"Start" button, not the "Start incremental", for documents to be deleted
with the Documentum connector.  This is a limitation of the Documentum
repository.

Thanks,
Karl




On Fri, May 26, 2017 at 2:07 AM, Tamizh Kumaran Thamizharasan <
tthamizhara...@worldbankgroup.org> wrote:

> Hi,
>
>
>
> Need to add a few points.
>
>
>
> ManifoldCF version tested – 2.5 and 2.7
>
> Solr version – 4.3.0
>
>
>
> 2) The connector is not performing the deletion scenario during an
> incremental run.
>
>
>
> Kindly let me know if any details are required.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Tamizh Kumaran Thamizharasan
> *Sent:* Friday, May 26, 2017 10:55 AM
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira
> *Subject:* ManifoldCF Indexing and Deletion
>
>
>
> Hi,
>
>
>
> We are facing the below issues while using the ManifoldCF.
>
>
>
> Repository used – Documentum
>
> Output connector – Solr
>
>
>
> 1)  After successful completion of a job with a document filter, on
> clearing the documents from the output connector (Solr) and restarting
> the job with the same document filter, the documents are not indexed to the
> output connector.
>
>
>
> In Job status, documents are shown as processed but they are not indexed
> to the output connector. On changing the configuration and restarting the
> job with the same document filter, the documents are successfully indexed
> to the output connector.
>
>
>
> 2)  The connector is not performing the deletion scenario.
>
>
>
> Kindly let me know if any details are required and help us to resolve
> these issues.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>


Re: ManifoldCF

2017-05-03 Thread Karl Wright
Hi Claudiu,

First, it looks like you are running MCF as a single process. That is fine;
if you were running a multiprocess setup you'd want to be sure to increase
the memory size of all the agents processes, and not worry about any other
MCF processes.

Second, when you put Tika in the pipeline, potentially each worker thread
can be using Tika resources at the same time.  MCF uses Tika in a streaming
way.  We don't have any real control over Tika other than that.  But you
can limit this by reducing the number of Tika connections to some lower
number.  The default is 10 but for experimentation sake I'd try reducing
that down to even lower, e.g. 2-5.  That should limit the maximum memory
consumption.

Third, if the problem *continues* even with that restriction, it's worth
trying to find which document it is that is causing Tika to run out of
memory.  The MCF logs will be a big help here.  Each line contains the
thread ID, which should be helpful.  Please bear in mind that because of
the multi-threaded nature of MCF, the actual document causing the problem
might not be the one that finally causes the OOM.  Unless you reduce the
max number of Tika connections to 1, finding the exact document will be
hard.

If the actual failure document can be included in a bug report for the TIKA
team, that would be ideal.

Please let me know what happens.

Karl
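(A hedged example of Karl's first point: in the multiprocess example layout the agents JVM heap is raised via the JVM options file, e.g. options.env.unix; the file name and values below are illustrative, so adjust them to your own deployment.)

-Xms1024m
-Xmx4096m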


On Wed, May 3, 2017 at 5:32 AM, Kishore Kumar 
wrote:

> Looping manifoldcf mailing list.
>
> KK
>
> --
> *From:* Matei Claudiu 
> *Sent:* Wednesday, May 3, 2017 2:57:52 PM
> *To:* kishorejan...@live.com
> *Cc:* Quirynen Jasper
> *Subject:* ManifoldCF
>
>
> Hi Kishore Kumar,
>
>
>
> Thanks for developing ManifoldCF.
>
>
>
> I have a question about it. I am trying to use the Windows Share connector
> together with Tika.
>
> The problem is that after I index some files, I get the following error:
>
>
>
> agents process ran out of memory - shutting down
>
> java.lang.OutOfMemoryError: Java heap space
>
>   at java.util.Arrays.copyOf(Arrays.java:3308)
>
>   at java.util.BitSet.ensureCapacity(BitSet.java:337)
>
>   at java.util.BitSet.expandTo(BitSet.java:352)
>
>   at java.util.BitSet.set(BitSet.java:447)
>
>   at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(
> BoilerpipeHTMLContentHandler.java:267)
>
>   at org.apache.tika.parser.html.BoilerpipeContentHandler.characters(
> BoilerpipeContentHandler.java:155)
>
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(
> ContentHandlerDecorator.java:146)
>
>   at org.apache.tika.sax.SecureContentHandler.characters(
> SecureContentHandler.java:270)
>
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(
> ContentHandlerDecorator.java:146)
>
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(
> ContentHandlerDecorator.java:146)
>
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(
> ContentHandlerDecorator.java:146)
>
>   at org.apache.tika.sax.SafeContentHandler.access$001(
> SafeContentHandler.java:46)
>
>   at org.apache.tika.sax.SafeContentHandler$1.write(
> SafeContentHandler.java:82)
>
>   at org.apache.tika.sax.SafeContentHandler.filter(
> SafeContentHandler.java:140)
>
>   at org.apache.tika.sax.SafeContentHandler.characters(
> SafeContentHandler.java:287)
>
>   at org.apache.tika.sax.XHTMLContentHandler.characters(
> XHTMLContentHandler.java:278)
>
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(
> ContentHandlerDecorator.java:146)
>
>   at org.apache.tika.sax.xpath.MatchingContentHandler.characters(
> MatchingContentHandler.java:85)
>
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(
> ContentHandlerDecorator.java:146)
>
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(
> ContentHandlerDecorator.java:146)
>
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(
> ContentHandlerDecorator.java:146)
>
>   at org.apache.tika.sax.SecureContentHandler.characters(
> SecureContentHandler.java:270)
>
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(
> ContentHandlerDecorator.java:146)
>
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(
> ContentHandlerDecorator.java:146)
>
>   at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:994)
>
>   at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:482)
>
>   at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>
>   at org.apache.tika.parser.code.SourceCodeParser.parse(
> SourceCodeParser.java:120)
>
>   at org.apache.tika.parser.CompositeParser.parse(
> CompositeParser.java:280)
>
>   at org.apache.tika.parser.CompositeParser.parse(
> CompositeParser.java:280)
>
>   at org.apache.tika.parser.AutoDetectParser.parse(
> AutoDetectParser.java:120)
>
>   at org.apache.tika.parser.DelegatingParser.parse(
> DelegatingParser.java:72)
>
> 

Re: Windows share connector : fetch ACL for an incremental job

2017-05-02 Thread Karl Wright
Hi Olivier,

It was a long time ago that the Windows Share Connector was designed, but
at the time it was determined that you could change ACLs that affected
security on a document without changing the document itself, and thus the
document's modified date was insufficient by itself to signal a change that
would require reindexing.

It may not be the ACLs associated with the document itself, but rather the
ACLs associated with the document's share and parent that absolutely would
have to be part of the version string.

Karl
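(A hypothetical sketch of the idea Karl describes: fold the document, share, and parent ACL signatures into the version string next to the modified date, so an ACL-only change still triggers reindexing. None of these helper names come from the real connector.)

import java.util.Arrays;

public class VersionStringSketch {
  static String buildVersionString(long lastModified, String[] documentAcls,
      String[] shareAcls, String[] parentAcls) {
    StringBuilder sb = new StringBuilder();
    sb.append(lastModified).append('+');
    for (String[] acls : new String[][] { documentAcls, shareAcls, parentAcls }) {
      String[] sorted = acls.clone();
      Arrays.sort(sorted); // ordering differences should not force reindexing
      sb.append(String.join(",", sorted)).append('+');
    }
    return sb.toString();
  }
}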


On Tue, May 2, 2017 at 8:44 AM, Olivier Tavard <
olivier.tav...@francelabs.com> wrote:

> Hi,
>
> I have a question about the Windows Share connector please.
> During the incremental job of a file share with security enabled, it seems
> that the getSecurity method is called for each file even if the last
> modified date of the document is unchanged between the two crawls.
> Does it mean that the last modified date of a file is not changed after a
> modification of the ACLs on it? So the connector has to fetch the ACLs on
> the file in all cases (even if the date is the same between the date of the
> ingest status in the MCF database and the date of the file), am I correct?
> Or is it done in two steps: first check the last modified date of the
> document and, after that, only if it is different from the date stored in
> the MCF database, fetch the ACLs of the file and compare them with the ACLs
> stored in the MCF database?
>
> Thanks,
>
> Olivier TAVARD
>
>


Re: email job is down

2017-04-28 Thread Karl Wright
Hi Cihad,

The right thing to do is to capture this exception:

>>
Caused by: javax.mail.MessagingException: * BYE JavaMail Exception:
java.io.IOException: Connection dropped by server?
<<

... and throw a ServiceInterruption when it is seen, instead of a
ManifoldCFException.

Can you create a ticket for this?  I'll try to add the appropriate code
when I am no longer traveling.

Thanks,
Karl
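
A minimal sketch of that handling (hypothetical retry values, not the
committed fix):

  // Inside the connector's handleMessagingException(): treat a dropped
  // connection as transient so the framework retries instead of aborting the job.
  protected static void handleMessagingException(MessagingException e, String context)
    throws ManifoldCFException, ServiceInterruption
  {
    final String msg = e.getMessage();
    if (msg != null && msg.contains("Connection dropped by server"))
    {
      final long now = System.currentTimeMillis();
      // Retry in 60 seconds; give up on the attempt after an hour or 3 failures.
      throw new ServiceInterruption("Email connection dropped: "+msg, e,
        now + 60000L, now + 3600000L, 3, false);
    }
    throw new ManifoldCFException("Error "+context+": "+msg, e);
  }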




On Fri, Apr 28, 2017 at 7:43 PM, Cihad Guzel  wrote:

> Hi,
>
> I have created an email job that runs continuously. An error keeps
> bringing my job down. Since the job needs to run continuously, errors
> like this should be handled without taking the job down. Is there
> anything we can do about it?
>
> My error is as follows:
>
> ERROR 2017-04-25T14:25:44,475 (Seeding thread) - Exception tossed: Error
> finding emails: * BYE JavaMail Exception: java.io.IOException: Connection
> dropped by server?
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Error finding
> emails: * BYE JavaMail Exception: java.io.IOException: Connection dropped
> by server?
> at
> org.apache.manifoldcf.crawler.connectors.email.EmailConnector.
> handleMessagingException(EmailConnector.java:1721)
> ~[?:?]
> at
> org.apache.manifoldcf.crawler.connectors.email.EmailConnector.
> addSeedDocuments(EmailConnector.java:335)
> ~[?:?]
> at
> org.apache.manifoldcf.crawler.system.SeedingThread.run(
> SeedingThread.java:150)
> [classes/:?]
> Caused by: javax.mail.MessagingException: * BYE JavaMail Exception:
> java.io.IOException: Connection dropped by server?
> at com.sun.mail.imap.IMAPFolder.open(IMAPFolder.java:961)
> ~[mail-1.4.5.jar:1.4.5]
> at
> org.apache.manifoldcf.crawler.connectors.email.EmailSession.
> openFolder(EmailSession.java:99)
> ~[?:?]
> at
> org.apache.manifoldcf.crawler.connectors.email.EmailConnector$
> OpenFolderThread.run(EmailConnector.java:1981)
> ~[?:?]
> Caused by: com.sun.mail.iap.ConnectionException: * BYE JavaMail Exception:
> java.io.IOException: Connection dropped by server?
> at com.sun.mail.iap.Protocol.handleResult(Protocol.java:356)
> ~[mail-1.4.5.jar:1.4.5]
> at com.sun.mail.imap.protocol.IMAPProtocol.examine(IMAPProtocol.java:886)
> ~[mail-1.4.5.jar:1.4.5]
> at com.sun.mail.imap.IMAPFolder.open(IMAPFolder.java:925)
> ~[mail-1.4.5.jar:1.4.5]
> at
> org.apache.manifoldcf.crawler.connectors.email.EmailSession.
> openFolder(EmailSession.java:99)
> ~[?:?]
> at
> org.apache.manifoldcf.crawler.connectors.email.EmailConnector$
> OpenFolderThread.run(EmailConnector.java:1981)
> ~[?:?]
>
> --
> Cihad Güzel
> Regards
>


Re: Delete IDs with JDBC connector

2017-04-27 Thread julien . massiera
ing"); 
> errorCode = activities.NULL_URL; 
> errorDesc = "Excluded because document had a null URL"; 
> activities.noDocument(id,version); 
> continue; 
> } 
> 
> // This is not right - url can apparently be a BinaryInput 
> String url = JDBCConnection.readAsString(o); 
> boolean validURL; 
> try 
> { 
> // Check to be sure url is valid 
> new java.net.URI(url); 
> validURL = true; 
> } 
> catch (java.net.URISyntaxException e) 
> { 
> validURL = false; 
> } 
> 
> if (!validURL) 
> { 
> Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: 
> '"+url+"' - skipping"); 
> errorCode = activities.BAD_URL; 
> errorDesc = "Excluded because document had illegal URL ('"+url+"')"; 
> activities.noDocument(id,version); 
> continue; 
> } 
> 
> // Process the document itself 
> Object contents = row.getValue(JDBCConstants.dataReturnColumnName); 
> // Null data is allowed; we just ignore these 
> if (contents == null) 
> { 
> Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - 
> skipping"); 
> errorCode = "NULLDATA"; 
> errorDesc = "Excluded because document had null data"; 
> activities.noDocument(id,version); 
> continue; 
> } 
> 
> // We will ingest something, so remove this id from the map in order that we 
> know what we still 
> // need to delete when all done. 
> map.remove(id); 
> <<<<<< 
> 
> As you see, activities.noDocument() is called for all cases, except the one 
> where the document version is null (which cannot happen since all document 
> versions for this case will be the empty string).  So I am at a loss to 
> understand why the delete is not happening. 
> 
> The only way I can think of is that if you clicked one of the buttons on the 
> output connection's view page that told MCF to "forget" all the history for 
> that connection. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massi...@francelabs.com> wrote:
> 
> Hi Karl, 
> 
> I was manually starting the job for test purpose, but even if I schedule it 
> with job invocation "Complete" and "Scan every document once", the missing 
> IDs from the database are not deleted in my Solr index (no trace of any 
> 'document deletion' event in the history).
> I should mention that I only use the 'Seeding query' and 'Data query' and I 
> am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding query. 
> 
> Julien
> 
> On 26.04.2017 at 16:05, Karl Wright wrote: 
> Hi Julien, 
> 
> How are you starting the job?  If you use "Start minimal", deletion would not 
> take place.  If your job is a continuous one, this is also the case. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massi...@francelabs.com> wrote:
> Hi the MCF community,
> 
> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database and 
> index the data into a Solr server, and it works very well. However, when I 
> perform a delta re-crawl, the new IDs are correctly retrieved from the 
> Database but those who have been deleted are not "detected" by the connector 
> and thus, are still present in my Solr index.
> I would like to know if normally it should work and that I maybe have missed 
> something in the configuration of the job, or if this is not implemented ?
> The only way I found to solve this issue is to reset the seeding of the job, 
> but it is very time and resource consuming.
> 
> Best regards,
> Julien Massiera

 


Re: Delete IDs with JDBC connector

2017-04-27 Thread julien . massiera
> // Null data is allowed; we just ignore these 
> if (contents == null) 
> { 
> Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - 
> skipping"); 
> errorCode = "NULLDATA"; 
> errorDesc = "Excluded because document had null data"; 
> activities.noDocument(id,version); 
> continue; 
> } 
> 
> // We will ingest something, so remove this id from the map in order that we 
> know what we still 
> // need to delete when all done. 
> map.remove(id); 
> <<<<<< 
> 
> As you see, activities.noDocument() is called for all cases, except the one 
> where the document version is null (which cannot happen since all document 
> versions for this case will be the empty string).  So I am at a loss to 
> understand why the delete is not happening. 
> 
> The only way I can think of is that if you clicked one of the buttons on the 
> output connection's view page that told MCF to "forget" all the history for 
> that connection. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massi...@francelabs.com> wrote:
> 
> Hi Karl, 
> 
> I was manually starting the job for test purpose, but even if I schedule it 
> with job invocation "Complete" and "Scan every document once", the missing 
> IDs from the database are not deleted in my Solr index (no trace of any 
> 'document deletion' event in the history).
> I should mention that I only use the 'Seeding query' and 'Data query' and I 
> am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding query. 
> 
> Julien
> 
> On 26.04.2017 at 16:05, Karl Wright wrote: 
> Hi Julien, 
> 
> How are you starting the job?  If you use "Start minimal", deletion would not 
> take place.  If your job is a continuous one, this is also the case. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massi...@francelabs.com> wrote:
> Hi the MCF community,
> 
> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database and 
> index the data into a Solr server, and it works very well. However, when I 
> perform a delta re-crawl, the new IDs are correctly retrieved from the 
> Database but those who have been deleted are not "detected" by the connector 
> and thus, are still present in my Solr index.
> I would like to know if normally it should work and that I maybe have missed 
> something in the configuration of the job, or if this is not implemented ?
> The only way I found to solve this issue is to reset the seeding of the job, 
> but it is very time and resource consuming.
> 
> Best regards,
> Julien Massiera

Re: Delete IDs with JDBC connector

2017-04-26 Thread Karl Wright
>> // We will ingest something, so remove this id from the map
>> in order that we know what we still
>> // need to delete when all done.
>> map.remove(id);
>> <<<<<<
>>
>> As you see, activities.noDocument() is called for all cases, except the
>> one where the document version is null (which cannot happen since all
>> document versions for this case will be the empty string).  So I am at a
>> loss to understand why the delete is not happening.
>>
>> The only way I can think of is that if you clicked one of the buttons on
>> the output connection's view page that told MCF to "forget" all the history
>> for that connection.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massi...@francelabs.com> wrote:
>>
>>> Hi Karl,
>>>
>>> I was manually starting the job for test purpose, but even if I schedule
>>> it with job invocation "Complete" and "Scan every document once", the
>>> missing IDs from the database are not deleted in my Solr index (no trace of
>>> any 'document deletion' event in the history).
>>> I should mention that I only use the 'Seeding query' and 'Data query'
>>> and I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding
>>> query.
>>>
>>> Julien
>>>
>>> On 26.04.2017 at 16:05, Karl Wright wrote:
>>>
>>> Hi Julien,
>>>
>>> How are you starting the job?  If you use "Start minimal", deletion
>>> would not take place.  If your job is a continuous one, this is also the
>>> case.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massi...@francelabs.com> wrote:
>>>
>>>> Hi the MCF community,
>>>>
>>>> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database
>>>> and index the data into a Solr server, and it works very well. However,
>>>> when I perform a delta re-crawl, the new IDs are correctly retrieved from
>>>> the Database but those who have been deleted are not "detected" by the
>>>> connector and thus, are still present in my Solr index.
>>>> I would like to know if normally it should work and that I maybe have
>>>> missed something in the configuration of the job, or if this is not
>>>> implemented ?
>>>> The only way I found to solve this issue is to reset the seeding of the
>>>> job, but it is very time and resource consuming.
>>>>
>>>> Best regards,
>>>> Julien Massiera
>>>
>>>
>>>
>


Re: Delete IDs with JDBC connector

2017-04-26 Thread julien . massiera
Oh OK so I finally don't have to investigate :)

Thanks Karl ! 

Julien 

On 26.04.2017 at 17:20, Karl Wright wrote:

> Oh, never mind.  I see the issue, which is that without the version query, 
> documents that don't appear in the result list *at all* are never removed 
> from the map.  I'll create a ticket. 
> 
> Karl 
> 
> On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <daddy...@gmail.com> wrote:
> 
> Hi Julien, 
> 
> The delete logic in the connector is as follows: 
> 
>>>>>>> 
> 
> // Now, go through the original id's, and see which ones are still in the 
> map.  These 
> // did not appear in the result and are presumed to be gone from the 
> database, and thus must be deleted. 
> for (String documentIdentifier : documentIdentifiers) 
> { 
> if (fetchDocuments.contains(documentIdentifier)) 
> { 
> String documentVersion = map.get(documentIdentifier); 
> if (documentVersion != null) 
> { 
> // This means we did not see it (or data for it) in the result set.  Delete 
> it! 
> activities.noDocument(documentIdentifier,documentVersion); 
> activities.recordActivity(null, ACTIVITY_FETCH, 
> null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing 
> query", null); 
> } 
> } 
> } 
> <<<<<< 
> 
> For a JDBC job without a version query, fetchDocuments contains all the 
> documents.  But map has the entries removed that were actually fetched.  
> Documents that were *not* fetched for whatever reason therefore will not be 
> cleaned up.  Here's the code that determines that: 
> 
>>>>>>> 
> 
> String version = map.get(id); 
> if (version == null) 
> // Does not need refetching 
> continue; 
> 
> // This document was marked as "not scan only", so we expect to find it. 
> if (Logging.connectors.isDebugEnabled()) 
> Logging.connectors.debug("JDBC: Document data result found for '"+id+"'"); 
> o = row.getValue(JDBCConstants.urlReturnColumnName); 
> if (o == null) 
> { 
> Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - 
> skipping"); 
> errorCode = activities.NULL_URL; 
> errorDesc = "Excluded because document had a null URL"; 
> activities.noDocument(id,version); 
> continue; 
> } 
> 
> // This is not right - url can apparently be a BinaryInput 
> String url = JDBCConnection.readAsString(o); 
> boolean validURL; 
> try 
> { 
> // Check to be sure url is valid 
> new java.net.URI(url); 
> validURL = true; 
> } 
> catch (java.net.URISyntaxException e) 
> { 
> validURL = false; 
> } 
> 
> if (!validURL) 
> { 
> Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: 
> '"+url+"' - skipping"); 
> errorCode = activities.BAD_URL; 
> errorDesc = "Excluded because document had illegal URL ('"+url+"')"; 
> activities.noDocument(id,version); 
> continue; 
> } 
> 
> // Process the document itself 
> Object contents = row.getValue(JDBCConstants.dataReturnColumnName); 
> // Null data is allowed; we just ignore these 
> if (contents == null) 
> { 
> Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - 
> skipping"); 
> errorCode = "NULLDATA"; 
> errorDesc = "Excluded because document had null data"; 
> activities.noDocument(id,version); 
> continue; 
> } 
> 
> // We will ingest something, so remove this id from the map in order that we 
> know what we still 
> // need to delete when all done. 
> map.remove(id); 
> <<<<<< 
> 
> As you see, activities.noDocument() is called for all cases, except the one 
> where the document version is null (which cannot happen since all document 
> versions for this case will be the empty string).  So I am at a loss to 
> understand why the delete is not happening. 
> 
> The only way I can think of is that if you clicked one of the buttons on the 
> output connection's view page that told MCF to "forget" all the history for 
> that connection. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massi...@francelabs.com> wrote:
> 
> Hi Karl, 
> 
> I was manually starting the job for test purpose, but even if I schedule it 
> with job invocation "Complete" and "Scan every document once", the missing 
> IDs from the database are not deleted in my Solr index (no trace of any 
> 'document deletion' event in the history).
> I should mention that I only use the 'Seeding query' and 'Data query' and I 
> am not using the $(STARTTIME) and $(ENDTIM

Re: Delete IDs with JDBC connector

2017-04-26 Thread Karl Wright
>> As you see, activities.noDocument() is called for all cases, except the
>> one where the document version is null (which cannot happen since all
>> document versions for this case will be the empty string).  So I am at a
>> loss to understand why the delete is not happening.
>>
>> The only way I can think of is that if you clicked one of the buttons on
>> the output connection's view page that told MCF to "forget" all the history
>> for that connection.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massi...@francelabs.com> wrote:
>>
>>> Hi Karl,
>>>
>>> I was manually starting the job for test purpose, but even if I schedule
>>> it with job invocation "Complete" and "Scan every document once", the
>>> missing IDs from the database are not deleted in my Solr index (no trace of
>>> any 'document deletion' event in the history).
>>> I should mention that I only use the 'Seeding query' and 'Data query'
>>> and I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding
>>> query.
>>>
>>> Julien
>>>
>>> On 26.04.2017 at 16:05, Karl Wright wrote:
>>>
>>> Hi Julien,
>>>
>>> How are you starting the job?  If you use "Start minimal", deletion
>>> would not take place.  If your job is a continuous one, this is also the
>>> case.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massi...@francelabs.com> wrote:
>>>
>>>> Hi the MCF community,
>>>>
>>>> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database
>>>> and index the data into a Solr server, and it works very well. However,
>>>> when I perform a delta re-crawl, the new IDs are correctly retrieved from
>>>> the Database but those who have been deleted are not "detected" by the
>>>> connector and thus, are still present in my Solr index.
>>>> I would like to know if normally it should work and that I maybe have
>>>> missed something in the configuration of the job, or if this is not
>>>> implemented ?
>>>> The only way I found to solve this issue is to reset the seeding of the
>>>> job, but it is very time and resource consuming.
>>>>
>>>> Best regards,
>>>> Julien Massiera
>>>
>>>
>>>
>>
>


Re: Delete IDs with JDBC connector

2017-04-26 Thread Karl Wright
On Wed, Apr 26, 2017 at 10:42 AM, <julien.massi...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> I was manually starting the job for test purpose, but even if I schedule
>> it with job invocation "Complete" and "Scan every document once", the
>> missing IDs from the database are not deleted in my Solr index (no trace of
>> any 'document deletion' event in the history).
>> I should mention that I only use the 'Seeding query' and 'Data query' and
>> I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding
>> query.
>>
>> Julien
>>
>> On 26.04.2017 at 16:05, Karl Wright wrote:
>>
>> Hi Julien,
>>
>> How are you starting the job?  If you use "Start minimal", deletion would
>> not take place.  If your job is a continuous one, this is also the case.
>>
>> Thanks,
>> Karl
>>
>> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massi...@francelabs.com> wrote:
>>
>>> Hi the MCF community,
>>>
>>> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database
>>> and index the data into a Solr server, and it works very well. However,
>>> when I perform a delta re-crawl, the new IDs are correctly retrieved from
>>> the Database but those who have been deleted are not "detected" by the
>>> connector and thus, are still present in my Solr index.
>>> I would like to know if normally it should work and that I maybe have
>>> missed something in the configuration of the job, or if this is not
>>> implemented ?
>>> The only way I found to solve this issue is to reset the seeding of the
>>> job, but it is very time and resource consuming.
>>>
>>> Best regards,
>>> Julien Massiera
>>
>>
>>
>


Re: Delete IDs with JDBC connector

2017-04-26 Thread Karl Wright
Hi Julien,

The delete logic in the connector is as follows:

>>>>>>
// Now, go through the original id's, and see which ones are still in
the map.  These
// did not appear in the result and are presumed to be gone from the
database, and thus must be deleted.
for (String documentIdentifier : documentIdentifiers)
{
  if (fetchDocuments.contains(documentIdentifier))
  {
String documentVersion = map.get(documentIdentifier);
if (documentVersion != null)
{
  // This means we did not see it (or data for it) in the result
set.  Delete it!
  activities.noDocument(documentIdentifier,documentVersion);
  activities.recordActivity(null, ACTIVITY_FETCH,
null, documentIdentifier, "NOTFETCHED", "Document was not seen
by processing query", null);
}
  }
}
<<<<<<

For a JDBC job without a version query, fetchDocuments contains all the
documents.  But map has the entries removed that were actually fetched.
Documents that were *not* fetched for whatever reason therefore will not be
cleaned up.  Here's the code that determines that:

>>>>>>
String version = map.get(id);
if (version == null)
  // Does not need refetching
  continue;

// This document was marked as "not scan only", so we expect to
find it.
if (Logging.connectors.isDebugEnabled())
  Logging.connectors.debug("JDBC: Document data result found
for '"+id+"'");
o = row.getValue(JDBCConstants.urlReturnColumnName);
if (o == null)
{
  Logging.connectors.debug("JDBC: Document '"+id+"' has a null
url - skipping");
  errorCode = activities.NULL_URL;
  errorDesc = "Excluded because document had a null URL";
  activities.noDocument(id,version);
  continue;
}

// This is not right - url can apparently be a BinaryInput
String url = JDBCConnection.readAsString(o);
boolean validURL;
try
{
  // Check to be sure url is valid
  new java.net.URI(url);
  validURL = true;
}
catch (java.net.URISyntaxException e)
{
  validURL = false;
}

if (!validURL)
{
  Logging.connectors.debug("JDBC: Document '"+id+"' has an
illegal url: '"+url+"' - skipping");
  errorCode = activities.BAD_URL;
  errorDesc = "Excluded because document had illegal URL
('"+url+"')";
  activities.noDocument(id,version);
  continue;
}

// Process the document itself
Object contents =
row.getValue(JDBCConstants.dataReturnColumnName);
// Null data is allowed; we just ignore these
if (contents == null)
{
  Logging.connectors.debug("JDBC: Document '"+id+"' seems to
have null data - skipping");
  errorCode = "NULLDATA";
  errorDesc = "Excluded because document had null data";
  activities.noDocument(id,version);
  continue;
}

// We will ingest something, so remove this id from the map in
order that we know what we still
// need to delete when all done.
map.remove(id);
<<<<<<

As you see, activities.noDocument() is called for all cases, except the one
where the document version is null (which cannot happen since all document
versions for this case will be the empty string).  So I am at a loss to
understand why the delete is not happening.

The only way I can think of is that if you clicked one of the buttons on
the output connection's view page that told MCF to "forget" all the history
for that connection.
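
As a side note, supplying a version query gives the connector an explicit
per-document version, which also makes unseen documents detectable. A hedged
example, using the substitution parameters the JDBC connector documents
(table and column names here are placeholders):

  SELECT idfield AS $(IDCOLUMN), modifieddate AS $(VERSIONCOLUMN)
  FROM documenttable
  WHERE idfield IN $(IDLIST)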

Thanks,
Karl



On Wed, Apr 26, 2017 at 10:42 AM, <julien.massi...@francelabs.com> wrote:

> Hi Karl,
>
> I was manually starting the job for test purpose, but even if I schedule
> it with job invocation "Complete" and "Scan every document once", the
> missing IDs from the database are not deleted in my Solr index (no trace of
> any 'document deletion' event in the history).
> I should mention that I only use the 'Seeding query' and 'Data query' and
> I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding
> query.
>
> Julien
>
> On 26.04.2017 at 16:05, Karl Wright wrote:
>
> Hi Julien,
>
> How are you starting the job?  If you use "Start minimal", deletion would
> not take place.  If your job is a continuous one, this is also the case.
>
> Thanks,
> Karl
>

Re: Delete IDs with JDBC connector

2017-04-26 Thread Karl Wright
Hi Julien,

How are you starting the job?  If you use "Start minimal", deletion would
not take place.  If your job is a continuous one, this is also the case.

Thanks,
Karl

On Wed, Apr 26, 2017 at 9:52 AM, <julien.massi...@francelabs.com> wrote:

> Hi the MCF community,
>
> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database and
> index the data into a Solr server, and it works very well. However, when I
> perform a delta re-crawl, the new IDs are correctly retrieved from the
> Database but those who have been deleted are not "detected" by the
> connector and thus, are still present in my Solr index.
> I would like to know if normally it should work and that I maybe have
> missed something in the configuration of the job, or if this is not
> implemented ?
> The only way I found to solve this issue is to reset the seeding of the
> job, but it is very time and resource consuming.
>
> Best regards,
> Julien Massiera
>


Re: Email filtering does not work for Exchange Server

2017-04-24 Thread Karl Wright
Hi Cihad,

The implementation for filtering is pretty generic.  Details are handled by
the javax mail jar, and there's not much visibility into what it is doing.
I think this is something you will need to experiment with to figure out
what the issue is.  It may be, for instance, that it's the field names that
must be changed for Exchange.

Karl
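
For experimenting outside of ManifoldCF, the layer the connector delegates to
can be exercised directly with a few lines of javax.mail (a standalone
sketch; host, credentials, and search criteria are placeholders):

  // Standalone javax.mail search test; server-side filtering happens in folder.search().
  import java.util.Properties;
  import javax.mail.*;
  import javax.mail.search.*;

  public class ImapSearchTest {
    public static void main(String[] args) throws Exception {
      Session session = Session.getInstance(new Properties());
      Store store = session.getStore("imap");
      store.connect("exchange.example.com", "user", "password"); // placeholders
      Folder inbox = store.getFolder("INBOX");
      inbox.open(Folder.READ_ONLY);
      // Combine criteria roughly the way the connector's filter rows are combined.
      SearchTerm term = new AndTerm(
        new FromStringTerm("someone@example.com"),
        new SubjectTerm("report"));
      Message[] hits = inbox.search(term);
      System.out.println("Matched " + hits.length + " messages");
      inbox.close(false);
      store.close();
    }
  }

If Gmail matches but Exchange returns nothing, that points at how the server
interprets the search terms rather than at ManifoldCF itself.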


On Sun, Apr 23, 2017 at 1:02 PM, Cihad Guzel  wrote:

> Hi,
>
> I am trying the email connector for filtering. It runs successfully with
> Gmail. However, if I try it with an Exchange server, no emails get indexed.
>
> Do I have to set any configuration properties on the "Server" tab of the
> email connector's repository connection for Exchange?
>
> --
> Cihad Güzel
> Regards
>


Re: ManifoldCf Documentum Negative ACL

2017-04-06 Thread Karl Wright
Hi Sharnel,

I've attached a patch to the CONNECTORS-1401 ticket.  Please let me know if
it works for you.


Thanks,
Karl


On Thu, Apr 6, 2017 at 5:52 PM, Karl Wright <daddy...@gmail.com> wrote:

> Hi Sharnel,
>
> I've created CONNECTORS-1401 to track this issue; I will try to get to it
> tonight or tomorrow.  As you probably know, we are planning to release MCF
> 2.7 by the end of the month, so once I have a patch ready, I'd greatly
> appreciate you trying it out to be sure it functions as designed.
>
> Thanks,
> Karl
>
>
> On Thu, Apr 6, 2017 at 5:34 PM, Sharnel Merdeck Pereira <
> spere...@worldbankgroup.org> wrote:
>
>> Hi Karl.
>>
>>
>>
>> Thanks for taking the time to check.
>>
>>
>>
>> Below is the implementation in documentum.
>>
>>
>>
>> -  Document
>>
>> o   Each Document has an ACL
>>
>> §  ACL can have Groups and Users
>>
>> §  Groups can further have subgroups and Users
>>
>> §  Access level is given to Group or User *only at ACL level.*
>>
>>
>>
>> o   When a user belongs to a group with r_accessor_permit=1 or
>> r_accessor_permit=2, the user should not have READ access to the acl.
>>
>>
>>
>> Considering the above, answers to the questions in the mail below:
>>
>>
>>
>> 1.  implies that the way 'negative groups' have been added to
>> Documentum is by somehow designating groups as 'negative'. Is this
>> correct?  Or are groups designated negative only within the context of
>> individual ACLs?
>>
>>
>>
>> Answer: Groups or users are given permission only at the ACL level. Yes,
>> groups/users are designated negative only within the context of individual ACLs.
>>
>>
>>
>> As in the example below, permission is given only at the ACL level.
>>
>> Document 1 uses ACL_1. The entries of ACL_1 are:
>>
>>   r_accessor_name   r_accessor_permit   r_is_group
>>   GroupA            3                   T
>>   GroupB            1                   T
>>   GroupC            6                   T
>>   User1             3                   F
>>   User2             1                   F
>>
>> Group memberships:
>>
>>   GroupA contains GroupD, GroupE, User2, User4
>>   GroupD contains User4, User5
>>   GroupB contains User6, User7, GroupF
>>
>>
>> -  User2 is part of Group A, ACL_1 has READ(3) access for GroupA
>> but NONE(1) access for User2. Hence lowest access takes precedence, User2
>> won’t have access to ACL_1.
>>
>>
>>
>> -  User4 is part of Group A and has READ(3) access to ACL_1
>>
>>
>>
>> -  User 5 is part of GroupD, GroupD belongs to GroupA which has
>> READ(3) access to Document, hence User5 has access to ACL_1
>>
>>
>>
>> -  User6 belongs to GroupB, Group B has NONE(1) access to
>> Document and hence User6 has NONE access to ACL_1
>>
>>
>>
>> -  if User/Group/ParentGroup has *r_accessor_name* in ACL, the
>> lowest *r_accessor_permit* takes precedence.
>>
>>
>>
>> The query
>>
>> *select r_accessor_name, r_accessor_permit, r_is_group from dm_acl where
>> object_name =’’ *
>>
>> will retrieve accessor_name and permission for acl.
>>
>>
>>
>> The query
>>
>> *select distinct i_supergroups_names from dm_group where
>> group_name in (select group_name from dm_group where any users_names
>> =’’)*
>>
>> will retrieve user groups. As from above example
>> for User5, query will return both GroupD and GroupA
>>
>>
>>
>> Thanks
>>
>> Sharnel
>>
>>
>>
>>
>>
>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>> *Sent:* Thursday, April 06, 2017 2:49 AM
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: ManifoldCf Documentum Negative ACL
>>
>>
>>
>> Hi

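The "lowest permit wins" rule described in the quoted mail can be summarized
in a small, purely illustrative helper (hypothetical code, not from the
connector):

  // Effective permit for a user = the lowest r_accessor_permit among ACL
  // entries matching the user or any of the user's (transitive) groups.
  public static int effectivePermit(java.util.Set<String> userAndGroupNames,
    java.util.Map<String,Integer> aclEntries)  // accessor name -> r_accessor_permit
  {
    int permit = Integer.MAX_VALUE;
    for (String accessor : userAndGroupNames)
    {
      Integer p = aclEntries.get(accessor);
      if (p != null && p < permit)
        permit = p;
    }
    return permit;  // Integer.MAX_VALUE means no entry matched; READ needs >= 3
  }
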
Re: ManifoldCf Documentum Negative ACL

2017-04-06 Thread Karl Wright
Hi Sharnel,

I've created CONNECTORS-1401 to track this issue; I will try to get to it
tonight or tomorrow.  As you probably know, we are planning to release MCF
2.7 by the end of the month, so once I have a patch ready, I'd greatly
appreciate you trying it out to be sure it functions as designed.

Thanks,
Karl


On Thu, Apr 6, 2017 at 5:34 PM, Sharnel Merdeck Pereira <
spere...@worldbankgroup.org> wrote:

> Hi Karl.
>
>
>
> Thanks for taking the time to check.
>
>
>
> Below is the implementation in documentum.
>
>
>
> -  Document
>
> o   Each Document has an ACL
>
> §  ACL can have Groups and Users
>
> §  Groups can further have subgroups and Users
>
> §  Access level is given to Group or User *only at ACL level.*
>
>
>
> o   When a user belongs to a group with r_accessor_permit=1 or
> r_accessor_permit=2, the user should not have READ access to the acl.
>
>
>
> Considering the above, answers to the questions in the mail below:
>
>
>
> 1.  implies that the way 'negative groups' have been added to
> Documentum is by somehow designating groups as 'negative'. Is this
> correct?  Or are groups designated negative only within the context of
> individual ACLs?
>
>
>
> Answer: Groups or users are given permission only at the ACL level. Yes,
> groups/users are designated negative only within the context of individual ACLs.
>
>
>
> As in the example below, permission is given only at the ACL level.
>
> Document 1 uses ACL_1. The entries of ACL_1 are:
>
>   r_accessor_name   r_accessor_permit   r_is_group
>   GroupA            3                   T
>   GroupB            1                   T
>   GroupC            6                   T
>   User1             3                   F
>   User2             1                   F
>
> Group memberships:
>
>   GroupA contains GroupD, GroupE, User2, User4
>   GroupD contains User4, User5
>   GroupB contains User6, User7, GroupF
>
>
>
>
>
> -  User2 is part of Group A, ACL_1 has READ(3) access for GroupA
> but NONE(1) access for User2. Hence lowest access takes precedence, User2
> won’t have access to ACL_1.
>
>
>
> -  User4 is part of Group A and has READ(3) access to ACL_1
>
>
>
> -  User 5 is part of GroupD, GroupD belongs to GroupA which has
> READ(3) access to Document, hence User5 has access to ACL_1
>
>
>
> -  User6 belongs to GroupB, Group B has NONE(1) access to
> Document and hence User6 has NONE access to ACL_1
>
>
>
> -  if User/Group/ParentGroup has *r_accessor_name* in ACL, the
> lowest *r_accessor_permit* takes precedence.
>
>
>
> The query
>
> *select r_accessor_name, r_accessor_permit, r_is_group from dm_acl where
> object_name =’’ *
>
> will retrieve accessor_name and permission for acl.
>
>
>
> The query
>
> *select distinct i_supergroups_names from dm_group where
> group_name in (select group_name from dm_group where any users_names
> =’’)*
>
> will retrieve user groups. As from above example
> for User5, query will return both GroupD and GroupA
>
>
>
> Thanks
>
> Sharnel
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Thursday, April 06, 2017 2:49 AM
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: ManifoldCf Documentum Negative ACL
>
>
>
> Hi Sharnel, I've done some further research.
>
>
>
> (1) There is currently only one access token ever stored with a Documentum
> document -- it's the name of the ACL associated with that document.
>
> (2) The Documentum connector does not fire off any of its own DQL at this
> time for finding the document's ACL.  This is how it currently does it,
> using DFC methods all the way:
>
>
>
> >>>>>>
> strarrACL[0] = docbaseName + ":" + object.getACLDomain() + "." + object.getACLName();
> <<<<<<
>
> ... where:
>
> >>>>>>
>   /** Get the ACL domain */
>   public String getACLDomain()
>     throws DocumentumException, RemoteException
>   {
>     try
>     {
>       return ((IDfSysObject)object).getACLDomain();
>     }
>     catch (DfException e)
>     {
>       throw new DocumentumException("Documentum exception: "+e.getMessage());
>     }
>   }
>

RE: Multilingual support with manifolds

2017-03-29 Thread Konrad Holl
Hi Sreenivas,

ok – got it. I thought you were going to publish to SharePoint Search.

Solr does have (limited) support for a variety of languages (including German 
and Japanese). You can configure both indexing and search transformations 
(stemming, synonyms, …) individually. For improved language support there are 
Basis Technologies Rosette and (especially for German) IntraFind LiSa – but 
both are commercial with the smaller price tag on IntraFind. You may want to 
try with the limited support and see how far you get before spending any money.
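
For example, the stock Solr schema ships a German field type along these
lines (standard Solr configuration, nothing ManifoldCF-specific; Japanese is
handled analogously by the text_ja type built on the Kuromoji tokenizer):

  <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.GermanNormalizationFilterFactory"/>
      <filter class="solr.GermanLightStemFilterFactory"/>
    </analyzer>
  </fieldType>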

Regards

Konrad.

From: Sreenivas.T [mailto:sree...@gmail.com]
Sent: Tuesday, 28 March 2017 17:43
To: user@manifoldcf.apache.org
Subject: Re: Multilingual support with manifolds

Thanks a lot for your responses.
The reason for asking is that the SharePoint content is in German & Japanese. We would 
like to get the content into Solr. If I understand correctly, using ManifoldCF it 
is possible to get this content & push it to Solr for indexing.

Thanks & regards,
Sreenivas


On Tue, Mar 28, 2017 at 4:52 PM, Karl Wright <daddy...@gmail.com> wrote:
Hi,

ManifoldCF uses utf-8 and binary throughout for its actual function, so it is 
not language specific in any way at that level.  Its UI has been localized 
(more or less) for four languages: English, Spanish, Japanese, and Chinese.

Hope that helps,
Karl


On Tue, Mar 28, 2017 at 6:13 AM, Sreenivas.T <sree...@gmail.com> wrote:
Hi,

I'm new to the ManifoldCF connector framework. I could not find documentation 
regarding multilingual support for the SharePoint, email, and regular web 
connectors. Please let me know whether it supports multilingual content and, 
if so, which languages it supports.

I'm planning to use ManifoldCF instead of Nutch for web crawling purposes too.

Thanks,
Sreenivas



Re: manifoldcf build

2017-03-28 Thread Karl Wright
Hi Cihad,

There are no changes to the build process.  However, there have been
significant changes to the dependencies.

You will need to do the following:

(1) Set your JAVA_HOME to point to JDK 8.  The previous requirement was JDK
7.
(2) ant clean-core-deps make-core-deps
(3) ant clean build
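
In consolidated shell form (the paths are placeholders for your own JDK 8
install and checkout location):

  export JAVA_HOME=/path/to/jdk8
  cd /path/to/manifoldcf/trunk
  ant clean-core-deps make-core-deps
  ant clean build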

Thanks,
Karl


On Tue, Mar 28, 2017 at 11:16 AM, Cihad Guzel  wrote:

> Hi,
>
> I built trunk. I used "ant clean", "ant build", "ant make-deps" and "ant
> make-core-deps".
>
> There aren't any files in dist/example. Also, I couldn't find these
> directories:
>
> connector-lib
> connector-common-lib
> connector-lib-proprietary
>
> Have you made any changes? How to build it?
>
> --
> Cihad Güzel
>


Re: Multilingual support with manifolds

2017-03-28 Thread Cihad Guzel
Hi Sreenivas,

If you mean something like the "language-identifier" plugin of Nutch,
ManifoldCF does not have this kind of thing.

2017-03-28 14:23 GMT+03:00 Konrad Holl <kh...@searchtechnologies.com>:

> Hi Sreenivas,
>
>
>
> the language support will only be relevant in the search engine itself
> (SharePoint). It will detect the languages and apply linguistic processing
> as needed during indexing and search time.
>
>
>
> -Konrad
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Dienstag, 28. März 2017 13:22
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Multilingual support with manifolds
>
>
>
> Hi,
>
>
>
> ManifoldCF uses utf-8 and binary throughout for its actual function, so it
> is not language specific in any way at that level.  Its UI has been
> localized (more or less) for four languages: English, Spanish, Japanese,
> and Chinese.
>
>
>
> Hope that helps,
>
> Karl
>
>
>
>
>
> On Tue, Mar 28, 2017 at 6:13 AM, Sreenivas.T <sree...@gmail.com> wrote:
>
> Hi,
>
>
>
> I'm new to the ManifoldCF connector framework. I could not find documentation
> regarding multilingual support for the SharePoint, email, and regular web
> connectors. Please let me know whether it supports multilingual content and,
> if so, which languages it supports.
>
>
>
> I'm planning to use ManifoldCF instead of Nutch for web crawling purposes
> too.
>
>
>
> Thanks,
>
> Sreenivas
>
>
>




Regards,
Cihad Güzel


RE: Multilingual support with manifolds

2017-03-28 Thread Konrad Holl
Hi Sreenivas,

the language support will only be relevant in the search engine itself 
(SharePoint). It will detect the languages and apply linguistic processing as 
needed during indexing and search time.

-Konrad

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, 28 March 2017 13:22
To: user@manifoldcf.apache.org
Subject: Re: Multilingual support with manifolds

Hi,

ManifoldCF uses utf-8 and binary throughout for its actual function, so it is 
not language specific in any way at that level.  Its UI has been localized 
(more or less) for four languages: English, Spanish, Japanese, and Chinese.

Hope that helps,
Karl


On Tue, Mar 28, 2017 at 6:13 AM, Sreenivas.T <sree...@gmail.com> wrote:
Hi,

I'm new to the ManifoldCF connector framework. I could not find documentation 
regarding multilingual support for the SharePoint, email, and regular web 
connectors. Please let me know whether it supports multilingual content and, 
if so, which languages it supports.

I'm planning to use ManifoldCF instead of Nutch for web crawling purposes too.

Thanks,
Sreenivas



Re: Multilingual support with manifolds

2017-03-28 Thread Karl Wright
Hi,

ManifoldCF uses utf-8 and binary throughout for its actual function, so it
is not language specific in any way at that level.  Its UI has been
localized (more or less) for four languages: English, Spanish, Japanese,
and Chinese.

Hope that helps,
Karl


On Tue, Mar 28, 2017 at 6:13 AM, Sreenivas.T  wrote:

> Hi,
>
> I'm new to the ManifoldCF connector framework. I could not find documentation
> regarding multilingual support for the SharePoint, email, and regular web
> connectors. Please let me know whether it supports multilingual content and,
> if so, which languages it supports.
>
> I'm planning to use ManifoldCF instead of Nutch for web crawling purposes
> too.
>
> Thanks,
> Sreenivas
>


Re: SharePoint crawler ArrayIndexOutOfBoundException in log

2017-03-17 Thread Cihad Guzel
Hi,

I use Oracle JDK 1.8.0_77. I will try the new HttpClient version and get back
to you.

Thanks
Cihad Güzel


2017-03-17 23:38 GMT+03:00 Markus Schuch :

> Hi,
>
> I think this may be caused by
>
>   https://issues.apache.org/jira/browse/HTTPCLIENT-1715
>
> which was fixed in httpclient 4.5.2
>
> There is a very similar stacktrace in
>
>   https://issues.apache.org/jira/browse/HTTPCLIENT-1686
>
> which is also linked to HTTPCLIENT-1715.
>
> Cheers,
> Markus
>
> On 17.03.2017 at 19:27, Karl Wright wrote:
> > Hi Cihad,
> >
> > There are NTLMEngineImpl tests that exercise precisely the case that is
> > failing.  I'm therefore becoming convinced that there is something very
> > odd about your installation.  Are you using a non-standard JVM, for
> > instance?
> >
> > Karl
> >
> >
> > On Fri, Mar 17, 2017 at 10:28 AM, Karl Wright wrote:
> >
> > Hi Cihad,
> >
> > Could you also check out and build the latest 4.5.x httpclient, from
> > this branch?
> >
> > https://svn.apache.org/repos/asf/httpcomponents/httpclient/branches/pull-66
> >
> > You will need maven for this but otherwise you can build it any way
> > you like.  Replace the "httpclient-4.5.1.jar" in the lib directory
> > with the jar you build, and then you can rebuild MCF.  See if you
> > still get the error.  If you do, it should be possible to chase it
> > down more readily.
> >
> > Thanks,
> > Karl
> >
> >
> > On Fri, Mar 17, 2017 at 9:57 AM, Cihad Guzel wrote:
> >
> > No. I don't use any custom library.
> >
> > I am trying with ManifoldCF trunk on my notebook. I installed SharePoint
> > 2013 on MS Server 2012 for testing, with the default configuration.
> >
> > On 17 Mar 2017 at 16:05, "Karl Wright" wrote:
> >
> > Hmm, I can see no way this can happen.  Are you by any
> > chance using a modified version of the HttpClient library?
> > Karl
> >
> >
> > On Fri, Mar 17, 2017 at 8:09 AM, Karl Wright wrote:
> >
> > Hi Cihad,
> >
> > This is very interesting because the problem is coming
> > from Httpclient's NTLM engine.  The allocated packet
> > size for the Type 1 message is being exceeded, which I
> > didn't think was even possible.
> >
> > This may be a result of credentials that you have
> > supplied being strange in some way.  Let me look at the
> > Httpclient code and get back to you.
> >
> > Karl
> >
> >
> > On Fri, Mar 17, 2017 at 7:57 AM, Cihad Guzel wrote:
> >
> > Hi,
> >
> > I try sharepoint connector with Active Directory in
> > debug mode. I saw ArrayIndexOutOfBoundException in
> > manifoldcf.log file. Any bugs?
> >
> > DEBUG 2017-03-17 14:30:48,386 (Worker thread '0') -
> > SharePoint: Getting version of '/Documents2//Step by
> > step Installation of SharePoint 2013 on Windows
> > Server 2012 R2 part 1 - SharePoint Community.pdf'
> > DEBUG 2017-03-17 14:30:48,466 (Worker thread '0') -
> > SharePoint: Checking whether to include document
> > '/Documents2/Step by step Installation of SharePoint
> > 2013 on Windows Server 2012 R2 part 1 - SharePoint
> > Community.pdf'
> > DEBUG 2017-03-17 14:30:48,466 (Worker thread '0') -
> > SharePoint: File '/Documents2/Step by step
> > Installation of SharePoint 2013 on Windows Server
> > 2012 R2 part 1 - SharePoint Community.pdf' exactly
> > matched rule path '/Documents2/*'
> > DEBUG 2017-03-17 14:30:48,467 (Worker thread '0') -
> > SharePoint: Including file '/Documents2/Step by step
> > Installation of SharePoint 2013 on Windows Server
> > 2012 R2 part 1 - SharePoint Community.pdf'
> > DEBUG 2017-03-17 14:30:48,468 (Worker thread '0') -
> > SharePoint: Finding metadata to include for
> > document/item '/Documents2/Step by step Installation
> > of SharePoint 2013 on Windows Server 2012 R2 part 1
> > - SharePoint Community.pdf'.
> > DEBUG 2017-03-17 14:30:48,510 (Worker thread 

Re: SharePoint crawler ArrayIndexOutOfBoundException in log

2017-03-17 Thread Karl Wright
Hi Markus,
Good catch.  Yes, this could do it.
I'm going to update trunk's dependencies and see if that fixes the issue.

Karl


On Fri, Mar 17, 2017 at 4:38 PM, Markus Schuch  wrote:

> Hi,
>
> I think this may be caused by
>
>   https://issues.apache.org/jira/browse/HTTPCLIENT-1715
>
> which was fixed in httpclient 4.5.2
>
> There is a very similar stacktrace in
>
>   https://issues.apache.org/jira/browse/HTTPCLIENT-1686
>
> which is also linked to HTTPCLIENT-1715.
>
> Cheers,
> Markus
>
> On 17.03.2017 at 19:27, Karl Wright wrote:
> > Hi Cihad,
> >
> > There are NTLMEngineImpl tests that exercise precisely the case that is
> > failing.  I'm therefore becoming convinced that there is something very
> > odd about your installation.  Are you using a non-standard JVM, for
> > instance?
> >
> > Karl
> >
> >
> > On Fri, Mar 17, 2017 at 10:28 AM, Karl Wright wrote:
> >
> > Hi Cihad,
> >
> > Could you also check out and build the latest 4.5.x httpclient, from
> > this branch?
> >
> > https://svn.apache.org/repos/asf/httpcomponents/httpclient/branches/pull-66
> >
> > You will need maven for this but otherwise you can build it any way
> > you like.  Replace the "httpclient-4.5.1.jar" in the lib directory
> > with the jar you build, and then you can rebuild MCF.  See if you
> > still get the error.  If you do, it should be possible to chase it
> > down more readily.
> >
> > Thanks,
> > Karl
> >
> >
> > On Fri, Mar 17, 2017 at 9:57 AM, Cihad Guzel wrote:
> >
> > No. I don't use any custom library.
> >
> > I am trying with ManifoldCF trunk on my notebook. I installed SharePoint
> > 2013 on MS Server 2012 for testing, with the default configuration.
> >
> > On 17 Mar 2017 at 16:05, "Karl Wright" wrote:
> >
> > Hmm, I can see no way this can happen.  Are you by any
> > chance using a modified version of the HttpClient library?
> > Karl
> >
> >
> > On Fri, Mar 17, 2017 at 8:09 AM, Karl Wright wrote:
> >
> > Hi Cihad,
> >
> > This is very interesting because the problem is coming
> > from Httpclient's NTLM engine.  The allocated packet
> > size for the Type 1 message is being exceeded, which I
> > didn't think was even possible.
> >
> > This may be a result of credentials that you have
> > supplied being strange in some way.  Let me look at the
> > Httpclient code and get back to you.
> >
> > Karl
> >
> >
> > On Fri, Mar 17, 2017 at 7:57 AM, Cihad Guzel wrote:
> >
> > Hi,
> >
> > I try sharepoint connector with Active Directory in
> > debug mode. I saw ArrayIndexOutOfBoundException in
> > manifoldcf.log file. Any bugs?
> >
> > DEBUG 2017-03-17 14:30:48,386 (Worker thread '0') -
> > SharePoint: Getting version of '/Documents2//Step by
> > step Installation of SharePoint 2013 on Windows
> > Server 2012 R2 part 1 - SharePoint Community.pdf'
> > DEBUG 2017-03-17 14:30:48,466 (Worker thread '0') -
> > SharePoint: Checking whether to include document
> > '/Documents2/Step by step Installation of SharePoint
> > 2013 on Windows Server 2012 R2 part 1 - SharePoint
> > Community.pdf'
> > DEBUG 2017-03-17 14:30:48,466 (Worker thread '0') -
> > SharePoint: File '/Documents2/Step by step
> > Installation of SharePoint 2013 on Windows Server
> > 2012 R2 part 1 - SharePoint Community.pdf' exactly
> > matched rule path '/Documents2/*'
> > DEBUG 2017-03-17 14:30:48,467 (Worker thread '0') -
> > SharePoint: Including file '/Documents2/Step by step
> > Installation of SharePoint 2013 on Windows Server
> > 2012 R2 part 1 - SharePoint Community.pdf'
> > DEBUG 2017-03-17 14:30:48,468 (Worker thread '0') -
> > SharePoint: Finding metadata to include for
> > document/item '/Documents2/Step by step Installation
> > of SharePoint 2013 on Windows Server 2012 R2 part 1
> > - SharePoint Community.pdf'.
> > DEBUG 

Re: SharePoint crawler ArrayIndexOutOfBoundException in log

2017-03-17 Thread Markus Schuch
Hi,

I think this may be caused by

  https://issues.apache.org/jira/browse/HTTPCLIENT-1715

which was fixed in httpclient 4.5.2

There is a very similar stacktrace in

  https://issues.apache.org/jira/browse/HTTPCLIENT-1686

which is also linked to HTTPCLIENT-1715.

Cheers,
Markus
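
For reference, in a Maven-based project the equivalent dependency bump would
be the following; ManifoldCF's own ant build instead bundles the jar under
lib/:

  <dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
  </dependency>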

On 17.03.2017 at 19:27, Karl Wright wrote:
> Hi Cihad,
> 
> There are NTLMEngineImpl tests that exercise precisely the case that is
> failing.  I'm therefore becoming convinced that there is something very
> odd about your installation.  Are you using a non-standard JVM, for
> instance?
> 
> Karl
> 
> 
> On Fri, Mar 17, 2017 at 10:28 AM, Karl Wright wrote:
> 
> Hi Cihad,
> 
> Could you also check out and build the latest 4.5.x httpclient, from
> this branch?
> 
> 
> https://svn.apache.org/repos/asf/httpcomponents/httpclient/branches/pull-66
> 
> 
> 
> You will need maven for this but otherwise you can build it any way
> you like.  Replace the "httpclient-4.5.1.jar" in the lib directory
> with the jar you build, and then you can rebuild MCF.  See if you
> still get the error.  If you do, it should be possible to chase it
> down more readily.
> 
> Thanks,
> Karl
> 
> 
> On Fri, Mar 17, 2017 at 9:57 AM, Cihad Guzel wrote:
> 
> No. I don't use any custom library. 
> 
> I am trying with ManifoldCF trunk on my notebook. I installed SharePoint
> 2013 on MS Server 2012 for testing, with the default configuration.
> 
> On 17 Mar 2017 at 16:05, "Karl Wright" wrote:
> 
> Hmm, I can see no way this can happen.  Are you by any
> chance using a modified version of the HttpClient library?
> Karl
> 
> 
> On Fri, Mar 17, 2017 at 8:09 AM, Karl Wright wrote:
> 
> Hi Cihad,
> 
> This is very interesting because the problem is coming
> from Httpclient's NTLM engine.  The allocated packet
> size for the Type 1 message is being exceeded, which I
> didn't think was even possible.
> 
> This may be a result of credentials that you have
> supplied being strange in some way.  Let me look at the
> Httpclient code and get back to you.
> 
> Karl
> 
> 
> On Fri, Mar 17, 2017 at 7:57 AM, Cihad Guzel wrote:
> 
> Hi,
> 
> I try sharepoint connector with Active Directory in
> debug mode. I saw ArrayIndexOutOfBoundException in
> manifoldcf.log file. Any bugs?
> 
> DEBUG 2017-03-17 14:30:48,386 (Worker thread '0') -
> SharePoint: Getting version of '/Documents2//Step by
> step Installation of SharePoint 2013 on Windows
> Server 2012 R2 part 1 - SharePoint Community.pdf'
> DEBUG 2017-03-17 14:30:48,466 (Worker thread '0') -
> SharePoint: Checking whether to include document
> '/Documents2/Step by step Installation of SharePoint
> 2013 on Windows Server 2012 R2 part 1 - SharePoint
> Community.pdf'
> DEBUG 2017-03-17 14:30:48,466 (Worker thread '0') -
> SharePoint: File '/Documents2/Step by step
> Installation of SharePoint 2013 on Windows Server
> 2012 R2 part 1 - SharePoint Community.pdf' exactly
> matched rule path '/Documents2/*'
> DEBUG 2017-03-17 14:30:48,467 (Worker thread '0') -
> SharePoint: Including file '/Documents2/Step by step
> Installation of SharePoint 2013 on Windows Server
> 2012 R2 part 1 - SharePoint Community.pdf'
> DEBUG 2017-03-17 14:30:48,468 (Worker thread '0') -
> SharePoint: Finding metadata to include for
> document/item '/Documents2/Step by step Installation
> of SharePoint 2013 on Windows Server 2012 R2 part 1
> - SharePoint Community.pdf'.
> DEBUG 2017-03-17 14:30:48,510 (Worker thread '0') -
> SharePoint: In getFieldValues;
> fieldNames=[Ljava.lang.String;@69f1a61a, site='',
> docLibrary='{1B694C45-DF1F-44E7-9814-F5096E85A126}',
> docId='/Documents2/Step by step Installation of
> SharePoint 2013 on Windows Server 2012 R2 part 1 -
> 

Re: SharePoint crawler ArrayIndexOutOfBoundException in log

2017-03-17 Thread Cihad Guzel
No. I don't use any custom library.

I am trying with ManifoldCF trunk on my notebook. I installed SharePoint 2013 on
MS Server 2012 for testing, with the default configuration.

On 17 Mar 2017 at 16:05, "Karl Wright" wrote:

> Hmm, I can see no way this can happen.  Are you by any chance using a
> modified version of the HttpClient library?
> Karl
>
>
> On Fri, Mar 17, 2017 at 8:09 AM, Karl Wright  wrote:
>
>> Hi Cihad,
>>
>> This is very interesting because the problem is coming from Httpclient's
>> NTLM engine.  The allocated packet size for the Type 1 message is being
>> exceeded, which I didn't think was even possible.
>>
>> This may be a result of credentials that you have supplied being strange
>> in some way.  Let me look at the Httpclient code and get back to you.
>>
>> Karl
>>
>>
>> On Fri, Mar 17, 2017 at 7:57 AM, Cihad Guzel  wrote:
>>
>>> Hi,
>>>
>>> I try sharepoint connector with Active Directory in debug mode. I saw
>>> ArrayIndexOutOfBoundException in manifoldcf.log file. Any bugs?
>>>
>>> DEBUG 2017-03-17 14:30:48,386 (Worker thread '0') - SharePoint: Getting
>>> version of '/Documents2//Step by step Installation of SharePoint 2013 on
>>> Windows Server 2012 R2 part 1 - SharePoint Community.pdf'
>>> DEBUG 2017-03-17 14:30:48,466 (Worker thread '0') - SharePoint: Checking
>>> whether to include document '/Documents2/Step by step Installation of
>>> SharePoint 2013 on Windows Server 2012 R2 part 1 - SharePoint Community.pdf'
>>> DEBUG 2017-03-17 14:30:48,466 (Worker thread '0') - SharePoint: File
>>> '/Documents2/Step by step Installation of SharePoint 2013 on Windows Server
>>> 2012 R2 part 1 - SharePoint Community.pdf' exactly matched rule path
>>> '/Documents2/*'
>>> DEBUG 2017-03-17 14:30:48,467 (Worker thread '0') - SharePoint:
>>> Including file '/Documents2/Step by step Installation of SharePoint 2013 on
>>> Windows Server 2012 R2 part 1 - SharePoint Community.pdf'
>>> DEBUG 2017-03-17 14:30:48,468 (Worker thread '0') - SharePoint: Finding
>>> metadata to include for document/item '/Documents2/Step by step
>>> Installation of SharePoint 2013 on Windows Server 2012 R2 part 1 -
>>> SharePoint Community.pdf'.
>>> DEBUG 2017-03-17 14:30:48,510 (Worker thread '0') - SharePoint: In
>>> getFieldValues; fieldNames=[Ljava.lang.String;@69f1a61a, site='',
>>> docLibrary='{1B694C45-DF1F-44E7-9814-F5096E85A126}',
>>> docId='/Documents2/Step by step Installation of SharePoint 2013 on Windows
>>> Server 2012 R2 part 1 - SharePoint Community.pdf', dspStsWorks=false
>>> DEBUG 2017-03-17 14:30:48,539 (Worker thread '5') - SharePoint: Getting
>>> version of '/Documents2//'
>>> DEBUG 2017-03-17 14:30:48,539 (Worker thread '4') - SharePoint: Getting
>>> version of '/Documents2//CXFCA3100080010.pdf'
>>> DEBUG 2017-03-17 14:30:48,539 (Worker thread '4') - SharePoint: Checking
>>> whether to include document '/Documents2/CXFCA3100080010.pdf'
>>> DEBUG 2017-03-17 14:30:48,539 (Worker thread '4') - SharePoint: File
>>> '/Documents2/CXFCA3100080010.pdf' exactly matched rule path
>>> '/Documents2/*'
>>> DEBUG 2017-03-17 14:30:48,539 (Worker thread '4') - SharePoint:
>>> Including file '/Documents2/CXFCA3100080010.pdf'
>>> DEBUG 2017-03-17 14:30:48,539 (Worker thread '5') - SharePoint: Checking
>>> whether to include library '/Documents2'
>>> DEBUG 2017-03-17 14:30:48,539 (Worker thread '4') - SharePoint: Finding
>>> metadata to include for document/item '/Documents2/CXFCA3100080010.pdf'.
>>> DEBUG 2017-03-17 14:30:48,539 (Worker thread '5') - SharePoint: Library
>>> '/Documents2' partially matched file rule path '/Documents2/*' - including
>>> DEBUG 2017-03-17 14:30:48,539 (Worker thread '5') - SharePoint: Document
>>> identifier is a library: '/Documents2'
>>> DEBUG 2017-03-17 14:30:48,539 (Worker thread '5') - SharePoint: In
>>> getDocLibID; parentSite='', parentSiteDecoded='', docLibrary='Documents2'
>>> DEBUG 2017-03-17 14:30:48,540 (Worker thread '2') - SharePoint: Getting
>>> version of '/'
>>> DEBUG 2017-03-17 14:30:48,540 (Worker thread '2') - SharePoint: Checking
>>> whether to include site '/'
>>> DEBUG 2017-03-17 14:30:48,540 (Worker thread '2') - SharePoint: Site '/'
>>> partially matched file rule path '/Documents2/*' - including
>>> DEBUG 2017-03-17 14:30:48,548 (Worker thread '4') - SharePoint: In
>>> getFieldValues; fieldNames=[Ljava.lang.String;@6f447d2e, site='',
>>> docLibrary='{1B694C45-DF1F-44E7-9814-F5096E85A126}',
>>> docId='/Documents2/CXFCA3100080010.pdf', dspStsWorks=false
>>> DEBUG 2017-03-17 14:30:48,560 (Worker thread '2') - SharePoint: Document
>>> identifier is a site: ''
>>> DEBUG 2017-03-17 14:30:48,560 (Worker thread '2') - SharePoint: In
>>> getSites; parentSite=''
>>> DEBUG 2017-03-17 14:30:50,398 (Worker thread '4') - SharePoint: Got a
>>> remote exception getting field values for site  library
>>> {1B694C45-DF1F-44E7-9814-F5096E85A126} document
>>> [/Documents2/CXFCA3100080010.pdf] - retrying
>>> AxisFault
>>>  faultCode: 

Re: SharePoint crawler ArrayIndexOutOfBoundException in log

2017-03-17 Thread Karl Wright
Hmm, I can see no way this can happen.  Are you by any chance using a
modified version of the HttpClient library?
Karl


On Fri, Mar 17, 2017 at 8:09 AM, Karl Wright  wrote:

> Hi Cihad,
>
> This is very interesting because the problem is coming from Httpclient's
> NTLM engine.  The allocated packet size for the Type 1 message is being
> exceeded, which I didn't think was even possible.
>
> This may be a result of credentials that you have supplied being strange
> in some way.  Let me look at the Httpclient code and get back to you.
>
> Karl
>
>
> On Fri, Mar 17, 2017 at 7:57 AM, Cihad Guzel  wrote:
>
>> Hi,
>>
>> I try sharepoint connector with Active Directory in debug mode. I saw
>> ArrayIndexOutOfBoundException in manifoldcf.log file. Any bugs?
>>

Re: SharePoint crawler ArrayIndexOutOfBoundException in log

2017-03-17 Thread Karl Wright
Hi Cihad,

This is very interesting because the problem is coming from Httpclient's
NTLM engine.  The allocated packet size for the Type 1 message is being
exceeded, which I didn't think was even possible.

This may be a result of credentials that you have supplied being strange in
some way.  Let me look at the Httpclient code and get back to you.

Karl
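
The stack trace at the bottom of the quoted message below points at
NTLMEngineImpl's fixed-size message buffer. As a rough illustration of the
failure mode -- this is a simplified sketch, not the actual HttpClient source
-- the Type 1 message is built into a preallocated byte array, and any write
past the computed size surfaces as an ArrayIndexOutOfBoundsException:

    // Illustrative only: a fixed-capacity NTLM-style message buffer.
    class FixedSizeMessage {
        private final byte[] messageContents;
        private int currentOutputPosition = 0;

        FixedSizeMessage(int allocatedSize) {
            messageContents = new byte[allocatedSize];
        }

        void addByte(byte b) {
            // No bounds check: one byte more than allocatedSize throws
            // ArrayIndexOutOfBoundsException, as in the quoted fault.
            messageContents[currentOutputPosition++] = b;
        }

        void addULong(int value) {
            addByte((byte) (value & 0xff));
            addByte((byte) ((value >> 8) & 0xff));
            addByte((byte) ((value >> 16) & 0xff));
            addByte((byte) ((value >> 24) & 0xff));
        }
    }

An unusually long domain or host string in the supplied credentials would be
enough to push the payload past the reserved space, which is why the
credentials are the first suspect.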


On Fri, Mar 17, 2017 at 7:57 AM, Cihad Guzel  wrote:

> Hi,
>
> I am trying the SharePoint connector with Active Directory in debug mode. I saw an
> ArrayIndexOutOfBoundsException in the manifoldcf.log file. Is this a bug?
>
> DEBUG 2017-03-17 14:30:48,386 (Worker thread '0') - SharePoint: Getting
> version of '/Documents2//Step by step Installation of SharePoint 2013 on
> Windows Server 2012 R2 part 1 - SharePoint Community.pdf'
> DEBUG 2017-03-17 14:30:48,466 (Worker thread '0') - SharePoint: Checking
> whether to include document '/Documents2/Step by step Installation of
> SharePoint 2013 on Windows Server 2012 R2 part 1 - SharePoint Community.pdf'
> DEBUG 2017-03-17 14:30:48,466 (Worker thread '0') - SharePoint: File
> '/Documents2/Step by step Installation of SharePoint 2013 on Windows Server
> 2012 R2 part 1 - SharePoint Community.pdf' exactly matched rule path
> '/Documents2/*'
> DEBUG 2017-03-17 14:30:48,467 (Worker thread '0') - SharePoint: Including
> file '/Documents2/Step by step Installation of SharePoint 2013 on Windows
> Server 2012 R2 part 1 - SharePoint Community.pdf'
> DEBUG 2017-03-17 14:30:48,468 (Worker thread '0') - SharePoint: Finding
> metadata to include for document/item '/Documents2/Step by step
> Installation of SharePoint 2013 on Windows Server 2012 R2 part 1 -
> SharePoint Community.pdf'.
> DEBUG 2017-03-17 14:30:48,510 (Worker thread '0') - SharePoint: In
> getFieldValues; fieldNames=[Ljava.lang.String;@69f1a61a, site='',
> docLibrary='{1B694C45-DF1F-44E7-9814-F5096E85A126}',
> docId='/Documents2/Step by step Installation of SharePoint 2013 on Windows
> Server 2012 R2 part 1 - SharePoint Community.pdf', dspStsWorks=false
> DEBUG 2017-03-17 14:30:48,539 (Worker thread '5') - SharePoint: Getting
> version of '/Documents2//'
> DEBUG 2017-03-17 14:30:48,539 (Worker thread '4') - SharePoint: Getting
> version of '/Documents2//CXFCA3100080010.pdf'
> DEBUG 2017-03-17 14:30:48,539 (Worker thread '4') - SharePoint: Checking
> whether to include document '/Documents2/CXFCA3100080010.pdf'
> DEBUG 2017-03-17 14:30:48,539 (Worker thread '4') - SharePoint: File
> '/Documents2/CXFCA3100080010.pdf' exactly matched rule path
> '/Documents2/*'
> DEBUG 2017-03-17 14:30:48,539 (Worker thread '4') - SharePoint: Including
> file '/Documents2/CXFCA3100080010.pdf'
> DEBUG 2017-03-17 14:30:48,539 (Worker thread '5') - SharePoint: Checking
> whether to include library '/Documents2'
> DEBUG 2017-03-17 14:30:48,539 (Worker thread '4') - SharePoint: Finding
> metadata to include for document/item '/Documents2/CXFCA3100080010.pdf'.
> DEBUG 2017-03-17 14:30:48,539 (Worker thread '5') - SharePoint: Library
> '/Documents2' partially matched file rule path '/Documents2/*' - including
> DEBUG 2017-03-17 14:30:48,539 (Worker thread '5') - SharePoint: Document
> identifier is a library: '/Documents2'
> DEBUG 2017-03-17 14:30:48,539 (Worker thread '5') - SharePoint: In
> getDocLibID; parentSite='', parentSiteDecoded='', docLibrary='Documents2'
> DEBUG 2017-03-17 14:30:48,540 (Worker thread '2') - SharePoint: Getting
> version of '/'
> DEBUG 2017-03-17 14:30:48,540 (Worker thread '2') - SharePoint: Checking
> whether to include site '/'
> DEBUG 2017-03-17 14:30:48,540 (Worker thread '2') - SharePoint: Site '/'
> partially matched file rule path '/Documents2/*' - including
> DEBUG 2017-03-17 14:30:48,548 (Worker thread '4') - SharePoint: In
> getFieldValues; fieldNames=[Ljava.lang.String;@6f447d2e, site='',
> docLibrary='{1B694C45-DF1F-44E7-9814-F5096E85A126}', 
> docId='/Documents2/CXFCA3100080010.pdf',
> dspStsWorks=false
> DEBUG 2017-03-17 14:30:48,560 (Worker thread '2') - SharePoint: Document
> identifier is a site: ''
> DEBUG 2017-03-17 14:30:48,560 (Worker thread '2') - SharePoint: In
> getSites; parentSite=''
> DEBUG 2017-03-17 14:30:50,398 (Worker thread '4') - SharePoint: Got a
> remote exception getting field values for site  library
> {1B694C45-DF1F-44E7-9814-F5096E85A126} document
> [/Documents2/CXFCA3100080010.pdf] - retrying
> AxisFault
>  faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.
> userException
>  faultSubcode:
>  faultString: java.lang.ArrayIndexOutOfBoundsException: 41
>  faultActor:
>  faultNode:
>  faultDetail:
> {http://xml.apache.org/axis/}stackTrace:java.lang.
> ArrayIndexOutOfBoundsException: 41
> at org.apache.http.impl.auth.NTLMEngineImpl$NTLMMessage.
> addByte(NTLMEngineImpl.java:911)
> at org.apache.http.impl.auth.NTLMEngineImpl$NTLMMessage.
> addULong(NTLMEngineImpl.java:941)
> at org.apache.http.impl.auth.NTLMEngineImpl$Type1Message.
> getResponse(NTLMEngineImpl.java:1043)
> at 

Re: The job got stuck when JDBC Connector got org.postgresql.util.PSQLException

2017-03-08 Thread Cheng Zeng
Hi Karl,


Thank you very much for your reply. I am not sure why a character zero is 
returned by the MS SQL server. I now use a simple version query, though, and no 
PSQLException is thrown. The job thread works fine now.


Thanks,

Cheng
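
To make the version-query advice in Karl's reply below concrete: a NUL-safe
version query for the JDBC job definition might look like the following
sketch. The table and column names here are hypothetical, it assumes the
connector's usual $(IDCOLUMN)/$(VERSIONCOLUMN)/$(IDLIST) substitution tokens,
and the CONVERT/REPLACE details may need adjusting for your SQL Server
version and collation:

    -- Hypothetical T-SQL: cast the timestamp to text and strip CHAR(0)
    -- so no NUL byte ever reaches PostgreSQL's UTF-8 columns.
    SELECT docid AS $(IDCOLUMN),
           REPLACE(CONVERT(VARCHAR(64), lastmodified, 126), CHAR(0), '')
             AS $(VERSIONCOLUMN)
    FROM documenttable
    WHERE docid IN $(IDLIST)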


From: Karl Wright <daddy...@gmail.com>
Sent: 08 March 2017 16:04
To: user@manifoldcf.apache.org
Subject: Re: The job got stuck when JDBC Connector got 
org.postgresql.util.PSQLException

Hi Cheng,

The issue is that your JDBC connection is generating a version string that has 
a character zero (0x0) in it, and postgresql doesn't allow that.

You get to specify the version string query as part of the job definition -- 
can you look at that and see how you are getting this back?  It might be due to 
the JDBC driver doing something funny, or it might be that your supplied 
version query is querying for something that's really binary, and you'll need 
to do something different to make it work.

Thanks,
Karl


On Wed, Mar 8, 2017 at 10:43 AM, Cheng Zeng 
<ze...@hotmail.co.uk> wrote:

Hi all,


I was trying to use the JDBC connector to index records from MS SQL Server 2012. I 
used ManifoldCF 2.3.


I created a job which needs to index over 80,000 records in the table. However, the 
job worker stopped processing more docs once org.postgresql.util.PSQLException 
was thrown.


Any help would be appreciated.


Thanks,

Cheng


Here is the stack trace.


DEBUG 2017-03-08 22:46:14,738 (Document delete stuffer thread) - Document 
delete stuffer thread found nothing to do
DEBUG 2017-03-08 22:46:14,745 (Worker thread '28') - Deleting {}
DEBUG 2017-03-08 22:46:14,745 (Worker thread '28') - Hopcount removal {}
DEBUG 2017-03-08 22:46:14,745 (Worker thread '28') - Rescanning documents {}
ERROR 2017-03-08 22:46:14,834 (Worker thread '21') - Worker thread aborting and 
restarting due to database connection reset: Database exception: SQLException 
doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: 
SQLException doing query (22021): ERROR: invalid byte sequence for encoding 
"UTF8": 0x00
at 
org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:715)
at 
org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:741)
at 
org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:784)
at 
org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1457)
at 
org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
at 
org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:204)
at 
org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:661)
at 
org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performUpdate(DBInterfacePostgreSQL.java:254)
at 
org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
at 
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.noteDocumentIngest(IncrementalIngester.java:2101)
at 
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$OutputAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3403)
at 
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3072)
at 
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2706)
at 
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
at 
org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
at 
org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
at 
org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:826)
at 
org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for 
encoding "UTF8": 0x00
at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)
at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
at 
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500)
at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:334)
at org.apache.manifoldcf.

Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

2017-03-08 Thread Karl Wright
I responded in the ticket.  Let's continue our conversation there.

It seems to me that we're seeing a catastrophic communication failure
between Zookeeper and the MCF processes.  I have no idea why that
happened.  Are these on the same machine, or different machines?  If
different machines, it seems possible that something hit your network hard
at about that time.

I'd get everything started again and see if it happens again, and if so,
whether it's about at the same time.  If there's a pattern, that's very
suspicious.

Karl


On Wed, Mar 8, 2017 at 11:12 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
wrote:

> Hi Karl,
>
> I have attached a file to CONNECTORS-1395 which shows an
> excerpt from the Zookeeper console log.  Unfortunately I don’t have enough
> history in the console to see much before the event and may have lost some
> important information.
>
>
>
> This Zookeeper issue happened in the middle of the night and no one would
> have manually instigated it.
>
>
>
> Best Regards,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 08 March 2017 13:45
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>
>
>
> Hi Guy,
>
>
>
> If nobody recycled Zookeeper intentionally at 00:26:00, is it possible
> that some automated process "cleaned up" zookeeper temporary files out from
> under it at around this time?  Really, we're seeing a catastrophic failure
> of Zookeeper to retain anything beyond that point, and I have no
> explanation for that behavior at all, and have never seen it before.
>
>
>
> This could also potentially explain the apparent postgresql transaction
> integrity issues, because in some cases we rely on local Zookeeper locks to
> prevent threads from interfering with one another.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Mar 8, 2017 at 8:37 AM, Karl Wright <daddy...@gmail.com> wrote:
>
> Right, sorry, I overlooked this attachment in your original mail.  Have a
> look at the ticket for updated status of the research, or later posts in
> this thread.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Mar 8, 2017 at 8:06 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
> wrote:
>
> Hi Karl,
>
> Attached is the MCF trace that includes all the logging pertaining to the
> FATAL error (the last entry is at 00:36:56,798; there was no further
> logging until 08:00 when I looked at the jobs in the morning). This is the
> same trace I sent on the original mail; I’m afraid I have no more trace
> than this.
>
>
>
> I’ll try and reproduce the problem with forensic logging on and append the
> traces to CONNECTORS-1395.
>
>
>
> Best Regards,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 08 March 2017 12:32
>
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>
>
>
> Hi Guy,
>
>
>
> I've now had a look at everything available to me on your issue.  I've
> created CONNECTORS-1395 to track this issue; please attach any new
> materials to that ticket.
>
>
>
> There's an exception trace you didn't include, specifically for this FATAL
> exception:
>
> FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't 
> release lock we don't hold
>
> java.lang.IllegalStateException: Can't release lock we don't hold
>
>
>
> This is very probably the smoking gun for the stuck lock.  I'd really like
> that exception trace from the log if you wouldn't mind.
>
>
>
> What has happened is this:
>
>
>
> (1) The document fetch cycle runs like this: (a) "stuff" the document (by
> changing its state to "active"), (b) fetch the document, (c) mark the
> document completed.  For a specific document, after it's been fetched, when
> we're trying to mark the document as being "completed" we do not see the
> expected "active" status.  Instead we see a status of "pending purgatory",
> which means that either some other thread changed the document's status out
> from under us, or that document's status was never in fact set to "active"
> during the stuffing phase.  Neither of these is possible given the code
> paths available, but we can prove it one way or another by turning on
> "forensic" debugging, as I described above.
>
>
>
> (2) Once the failure happens, the job in question should abort.  But it
> seems like that abort does not complete because there's a second problem
> with lock management (which generates the FATAL me

Re: The job got stuck when JDBC Connector got org.postgresql.util.PSQLException

2017-03-08 Thread Karl Wright
Hi Cheng,

The issue is that your JDBC connection is generating a version string that
has a character zero (0x0) in it, and postgresql doesn't allow that.

You get to specify the version string query as part of the job definition
-- can you look at that and see how you are getting this back?  It might be
due to the JDBC driver doing something funny, or it might be that your
supplied version query is querying for something that's really binary, and
you'll need to do something different to make it work.

Thanks,
Karl
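
If the queried value really does carry binary data, the other option is to
strip NUL bytes from it before it is used as a version string. A minimal
sketch (a hypothetical helper, not part of ManifoldCF):

    // PostgreSQL rejects 0x00 inside UTF-8 text values -- exactly the
    // "invalid byte sequence for encoding "UTF8": 0x00" error below --
    // so drop NULs from anything headed for a text column.
    public final class VersionStrings {
      private VersionStrings() {}

      public static String stripNuls(String raw) {
        return raw == null ? null : raw.replace("\u0000", "");
      }

      public static void main(String[] args) {
        String fromJdbc = "2017-03-08\u0000T12:00";   // simulated dirty value
        System.out.println(stripNuls(fromJdbc));      // prints 2017-03-08T12:00
      }
    }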


On Wed, Mar 8, 2017 at 10:43 AM, Cheng Zeng  wrote:

> Hi all,
>
>
> I was trying to use the JDBC connector to index records from MS SQL Server
> 2012. I used ManifoldCF 2.3.
>
>
> I created a job which needs to index over 80,000 records in the table.
> However, the job worker stopped processing more docs once
> org.postgresql.util.PSQLException was thrown.
>
>
> Any help would be appreciated.
>
>
> Thanks,
>
> Cheng
>
>
> Here is the stack trace.
>
>
> DEBUG 2017-03-08 22:46:14,738 (Document delete stuffer thread) - Document
> delete stuffer thread found nothing to do
> DEBUG 2017-03-08 22:46:14,745 (Worker thread '28') - Deleting {}
> DEBUG 2017-03-08 22:46:14,745 (Worker thread '28') - Hopcount removal {}
> DEBUG 2017-03-08 22:46:14,745 (Worker thread '28') - Rescanning documents
> {}
> ERROR 2017-03-08 22:46:14,834 (Worker thread '21') - Worker thread
> aborting and restarting due to database connection reset: Database
> exception: SQLException doing query (22021): ERROR: invalid byte sequence
> for encoding "UTF8": 0x00
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
> exception: SQLException doing query (22021): ERROR: invalid byte sequence
> for encoding "UTF8": 0x00
> at org.apache.manifoldcf.core.database.Database$
> ExecuteQueryThread.finishUp(Database.java:715)
> at org.apache.manifoldcf.core.database.Database.
> executeViaThread(Database.java:741)
> at org.apache.manifoldcf.core.database.Database.
> executeUncachedQuery(Database.java:784)
> at org.apache.manifoldcf.core.database.Database$
> QueryCacheExecutor.create(Database.java:1457)
> at org.apache.manifoldcf.core.cachemanager.CacheManager.
> findObjectsAndExecute(CacheManager.java:146)
> at org.apache.manifoldcf.core.database.Database.
> executeQuery(Database.java:204)
> at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.
> performModification(DBInterfacePostgreSQL.java:661)
> at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.
> performUpdate(DBInterfacePostgreSQL.java:254)
> at org.apache.manifoldcf.core.database.BaseTable.
> performUpdate(BaseTable.java:80)
> at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.
> noteDocumentIngest(IncrementalIngester.java:2101)
> at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$
> OutputAddEntryPoint.addOrReplaceDocumentWithExcept
> ion(IncrementalIngester.java:3403)
> at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$
> PipelineAddFanout.sendDocument(IncrementalIngester.java:3072)
> at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$
> PipelineObjectWithVersions.addOrReplaceDocumentWithExcept
> ion(IncrementalIngester.java:2706)
> at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.
> documentIngest(IncrementalIngester.java:756)
> at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.
> ingestDocumentWithException(WorkerThread.java:1583)
> at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.
> ingestDocumentWithException(WorkerThread.java:1548)
> at org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.
> processDocuments(JDBCConnector.java:826)
> at org.apache.manifoldcf.crawler.system.WorkerThread.run(
> WorkerThread.java:399)
> Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte
> sequence for encoding "UTF8": 0x00
> at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(
> QueryExecutorImpl.java:2102)
> at org.postgresql.core.v3.QueryExecutorImpl.processResults(
> QueryExecutorImpl.java:1835)
> at org.postgresql.core.v3.QueryExecutorImpl.execute(
> QueryExecutorImpl.java:257)
> at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(
> AbstractJdbc2Statement.java:500)
> at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(
> AbstractJdbc2Statement.java:388)
> at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(
> AbstractJdbc2Statement.java:334)
> at org.apache.manifoldcf.core.database.Database.execute(
> Database.java:916)
> at org.apache.manifoldcf.core.database.Database$
> ExecuteQueryThread.run(Database.java:696)
> DEBUG 2017-03-08 22:46:15,581 (Set priority thread) - Done reprioritizing
> because exceeded cycle count
> DEBUG 2017-03-08 22:46:15,581 (Set priority thread) - Set priority thread
> woke up
> DEBUG 2017-03-08 22:46:15,586 (Stuffer thread) - 

RE: Advice on which PostgreSQL to use with ManifoldCF 2.6

2017-03-08 Thread Standen Guy
Hi Karl,
I have attached a file to CONNECTORS-1395 which shows an 
excerpt from the Zookeeper console log.  Unfortunately I don’t have enough 
history in the console to see much before the event and may have lost some 
important information.

This Zookeeper issue happened in the middle of the night and no one would have 
manually instigated it.

Best Regards,

Guy

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: 08 March 2017 13:45
To: user@manifoldcf.apache.org
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

Hi Guy,

If nobody recycled Zookeeper intentionally at 00:26:00, is it possible that 
some automated process "cleaned up" zookeeper temporary files out from under it 
at around this time?  Really, we're seeing a catastrophic failure of Zookeeper 
to retain anything beyond that point, and I have no explanation for that 
behavior at all, and have never seen it before.

This could also potentially explain the apparent postgresql transaction 
integrity issues, because in some cases we rely on local Zookeeper locks to 
prevent threads from interfering with one another.

Karl


On Wed, Mar 8, 2017 at 8:37 AM, Karl Wright 
<daddy...@gmail.com> wrote:
Right, sorry, I overlooked this attachment in your original mail.  Have a look 
at the ticket for updated status of the research, or later posts in this thread.

Karl


On Wed, Mar 8, 2017 at 8:06 AM, Standen Guy 
<guy.stan...@uk.fujitsu.com> wrote:
Hi Karl,
Attached is the MCF trace that includes all the logging pertaining to the FATAL 
error (the last entry is at 00:36:56,798; there was no further logging until 
08:00 when I looked at the jobs in the morning). This is the same trace I sent 
on the original mail; I’m afraid I have no more trace than this.

I’ll try and reproduce the problem with forensic logging on and append the 
traces to CONNECTORS-1395.

Best Regards,

Guy

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: 08 March 2017 12:32

To: user@manifoldcf.apache.org
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

Hi Guy,

I've now had a look at everything available to me on your issue.  I've created 
CONNECTORS-1395 to track this issue; please attach any new materials to that 
ticket.

There's an exception trace you didn't include, specifically for this FATAL 
exception:

FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't 
release lock we don't hold

java.lang.IllegalStateException: Can't release lock we don't hold

This is very probably the smoking gun for the stuck lock.  I'd really like that 
exception trace from the log if you wouldn't mind.

What has happened is this:

(1) The document fetch cycle runs like this: (a) "stuff" the document (by 
changing its state to "active"), (b) fetch the document, (c) mark the document 
completed.  For a specific document, after it's been fetched, when we're trying 
to mark the document as being "completed" we do not see the expected "active" 
status.  Instead we see a status of "pending purgatory", which means that 
either some other thread changed the document's status out from under us, or 
that document's status was never in fact set to "active" during the stuffing 
phase.  Neither of these is possible given the code paths available, but we can 
prove it one way or another by turning on "forensic" debugging, as I described 
above.

(2) Once the failure happens, the job in question should abort.  But it seems 
like that abort does not complete because there's a second problem with lock 
management (which generates the FATAL message above).  This should be readily 
fixed if I can get that trace.

Thanks,
Karl

On Wed, Mar 8, 2017 at 6:52 AM, Karl Wright 
<daddy...@gmail.com> wrote:
Hi Guy,

The agents thread dump shows that there's a lock stuck from somewhere; I expect 
it's from the UI.  Next time this happens, could you get a thread dump for the 
UI process as well as from the agents process?  Thanks!!

Karl


On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright 
<daddy...@gmail.com> wrote:
Hi Guy,

See https://issues.apache.org/jira/browse/CONNECTORS-590.

When you see "unexpected" this is not a good sign:

>>>>>>
“ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed: 
Unexpected jobqueue status - record id 1488898668325, expecting active status, 
saw 4
org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 Unexpected jobqueue status - record id 1488898668325, expecting active status, 
saw 4
at 
org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019)
a

Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

2017-03-08 Thread Karl Wright
Hi Guy,

If nobody recycled Zookeeper intentionally at 00:26:00, is it possible that
some automated process "cleaned up" zookeeper temporary files out from
under it at around this time?  Really, we're seeing a catastrophic failure
of Zookeeper to retain anything beyond that point, and I have no
explanation for that behavior at all, and have never seen it before.

This could also potentially explain the apparent postgresql transaction
integrity issues, because in some cases we rely on local Zookeeper locks to
prevent threads from interfering with one another.

Karl
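
For what it's worth, the FATAL message in the traces ("Can't release lock we
don't hold") is the classic owner-check failure you get when lock state
disappears between acquire and release. A toy sketch of the pattern -- not
the MCF lock manager itself:

    // Toy owner-checked lock: release() from a thread that does not hold
    // the lock fails the same way the FATAL log line does. If Zookeeper
    // state is wiped mid-flight, the ownership record is gone and every
    // subsequent release() blows up like this.
    final class LocalLock {
      private Thread owner = null;

      synchronized void acquire() throws InterruptedException {
        while (owner != null) wait();
        owner = Thread.currentThread();
      }

      synchronized void release() {
        if (owner != Thread.currentThread())
          throw new IllegalStateException("Can't release lock we don't hold");
        owner = null;
        notifyAll();
      }
    }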


On Wed, Mar 8, 2017 at 8:37 AM, Karl Wright <daddy...@gmail.com> wrote:

> Right, sorry, I overlooked this attachment in your original mail.  Have a
> look at the ticket for updated status of the research, or later posts in
> this thread.
>
> Karl
>
>
> On Wed, Mar 8, 2017 at 8:06 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
> wrote:
>
>> Hi Karl,
>>
>> Attached is the MCF trace that includes all the logging pertaining to the
>> FATAL error (the last entry is at 00:36:56,798; there was no further
>> logging until 08:00 when I looked at the jobs in the morning). This is the
>> same trace I sent on the original mail; I’m afraid I have no more trace
>> than this.
>>
>>
>>
>> I’ll try and reproduce the problem with forensic logging on and append
>> the traces to CONNECTORS-1395.
>>
>>
>>
>> Best Regards,
>>
>>
>>
>> Guy
>>
>>
>>
>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>> *Sent:* 08 March 2017 12:32
>>
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>>
>>
>>
>> Hi Guy,
>>
>>
>>
>> I've now had a look at everything available to me on your issue.  I've
>> created CONNECTORS-1395 to track this issue; please attach any new
>> materials to that ticket.
>>
>>
>>
>> There's an exception trace you didn't include, specifically for this
>> FATAL exception:
>>
>> FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't 
>> release lock we don't hold
>>
>> java.lang.IllegalStateException: Can't release lock we don't hold
>>
>>
>>
>> This is very probably the smoking gun for the stuck lock.  I'd really
>> like that exception trace from the log if you wouldn't mind.
>>
>>
>>
>> What has happened is this:
>>
>>
>>
>> (1) The document fetch cycle runs like this: (a) "stuff" the document (by
>> changing its state to "active"), (b) fetch the document, (c) mark the
>> document completed.  For a specific document, after it's been fetched, when
>> we're trying to mark the document as being "completed" we do not see the
>> expected "active" status.  Instead we see a status of "pending purgatory",
>> which means that either some other thread changed the document's status out
>> from under us, or that document's status was never in fact set to "active"
>> during the stuffing phase.  Neither of these is possible given the code
>> paths available, but we can prove it one way or another by turning on
>> "forensic" debugging, as I described above.
>>
>>
>>
>> (2) Once the failure happens, the job in question should abort.  But it
>> seems like that abort does not complete because there's a second problem
>> with lock management (which generates the FATAL message above).  This
>> should be readily fixed if I can get that trace.
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>> On Wed, Mar 8, 2017 at 6:52 AM, Karl Wright <daddy...@gmail.com> wrote:
>>
>> Hi Guy,
>>
>>
>>
>> The agents thread dump shows that there's a lock stuck from somewhere; I
>> expect it's from the UI.  Next time this happens, could you get a thread
>> dump for the UI process as well as from the agents process?  Thanks!!
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright <daddy...@gmail.com> wrote:
>>
>> Hi Guy,
>>
>>
>>
>> See https://issues.apache.org/jira/browse/CONNECTORS-590.
>>
>>
>>
>> When you see "unexpected" this is not a good sign:
>>
>>
>>
>> >>>>>>
>>
>> “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed:
>> Unexpected jobqueue status - record id 1488898668325, expecting active
>&

Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

2017-03-08 Thread Karl Wright
Right, sorry, I overlooked this attachment in your original mail.  Have a
look at the ticket for updated status of the research, or later posts in
this thread.

Karl


On Wed, Mar 8, 2017 at 8:06 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
wrote:

> Hi Karl,
>
> Attached is the MCF trace that includes all the logging pertaining to the
> FATAL error (the last entry is at 00:36:56,798; there was no further
> logging until 08:00 when I looked at the jobs in the morning). This is the
> same trace I sent on the original mail; I’m afraid I have no more trace
> than this.
>
>
>
> I’ll try and reproduce the problem with forensic logging on and append the
> traces to CONNECTORS-1395.
>
>
>
> Best Regards,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 08 March 2017 12:32
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>
>
>
> Hi Guy,
>
>
>
> I've now had a look at everything available to me on your issue.  I've
> created CONNECTORS-1395 to track this issue; please attach any new
> materials to that ticket.
>
>
>
> There's an exception trace you didn't include, specifically for this FATAL
> exception:
>
> FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't 
> release lock we don't hold
>
> java.lang.IllegalStateException: Can't release lock we don't hold
>
>
>
> This is very probably the smoking gun for the stuck lock.  I'd really like
> that exception trace from the log if you wouldn't mind.
>
>
>
> What has happened is this:
>
>
>
> (1) The document fetch cycle runs like this: (a) "stuff" the document (by
> changing its state to "active"), (b) fetch the document, (c) mark the
> document completed.  For a specific document, after it's been fetched, when
> we're trying to mark the document as being "completed" we do not see the
> expected "active" status.  Instead we see a status of "pending purgatory",
> which means that either some other thread changed the document's status out
> from under us, or that document's status was never in fact set to "active"
> during the stuffing phase.  Neither of these is possible given the code
> paths available, but we can prove it one way or another by turning on
> "forensic" debugging, as I described above.
>
>
>
> (2) Once the failure happens, the job in question should abort.  But it
> seems like that abort does not complete because there's a second problem
> with lock management (which generates the FATAL message above).  This
> should be readily fixed if I can get that trace.
>
>
>
> Thanks,
>
> Karl
>
>
>
> On Wed, Mar 8, 2017 at 6:52 AM, Karl Wright <daddy...@gmail.com> wrote:
>
> Hi Guy,
>
>
>
> The agents thread dump shows that there's a lock stuck from somewhere; I
> expect it's from the UI.  Next time this happens, could you get a thread
> dump for the UI process as well as from the agents process?  Thanks!!
>
>
>
> Karl
>
>
>
>
>
> On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright <daddy...@gmail.com> wrote:
>
> Hi Guy,
>
>
>
> See https://issues.apache.org/jira/browse/CONNECTORS-590.
>
>
>
> When you see "unexpected" this is not a good sign:
>
>
>
> >>>>>>
>
> “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed:
> Unexpected jobqueue status - record id 1488898668325, expecting active
> status, saw 4
>
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
> jobqueue status - record id 1488898668325, expecting active status, saw 4
>
> at org.apache.manifoldcf.crawler.jobs.JobQueue.
> updateCompletedRecord(JobQueue.java:1019)
>
> at org.apache.manifoldcf.crawler.jobs.JobManager.
> markDocumentCompletedMultiple(JobManager.java:3271)
>
> at org.apache.manifoldcf.crawler.system.WorkerThread.run(
> WorkerThread.java:710)
>
> <<<<<<
>
>
>
> I've spent weeks chasing this issue in the past.  I have not seen it at
> all except in certain installations: 2 of them, counting yours.  I
> introduced transaction forensics into the MCF code to figure out what was
> going on and I definitively proved (on Postgresql 9.1) that we were seeing
> a transactional integrity problem with Postgresql itself.  A ticket against
> Postgresql was not logged because they'd need a reproducible test case and
> also the latest version, and it didn't happen on 9.3 at the time.
>
>
>
> You also seem to be seeing a deadlock in M

RE: Advice on which PostgreSQL to use with ManifoldCF 2.6

2017-03-08 Thread Standen Guy
Hi Karl,
Attached is the MCF trace that includes all the logging pertaining to the FATAL 
error (the last entry is at 00:36:56,798; there was no further logging until 
08:00 when I looked at the jobs in the morning). This is the same trace I sent 
on the original mail; I’m afraid I have no more trace than this.

I’ll try and reproduce the problem with forensic logging on and append the 
traces to CONNECTORS-1395.

Best Regards,

Guy

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: 08 March 2017 12:32
To: user@manifoldcf.apache.org
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

Hi Guy,

I've now had a look at everything available to me on your issue.  I've created 
CONNECTORS-1395 to track this issue; please attach any new materials to that 
ticket.

There's an exception trace you didn't include, specifically for this FATAL 
exception:


FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't 
release lock we don't hold

java.lang.IllegalStateException: Can't release lock we don't hold

This is very probably the smoking gun for the stuck lock.  I'd really like that 
exception trace from the log if you wouldn't mind.

What has happened is this:

(1) The document fetch cycle runs like this: (a) "stuff" the document (by 
changing its state to "active"), (b) fetch the document, (c) mark the document 
completed.  For a specific document, after it's been fetched, when we're trying 
to mark the document as being "completed" we do not see the expected "active" 
status.  Instead we see a status of "pending purgatory", which means that 
either some other thread changed the document's status out from under us, or 
that document's status was never in fact set to "active" during the stuffing 
phase.  Neither of these is possible given the code paths available, but we can 
prove it one way or another by turning on "forensic" debugging, as I described 
above.
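
A compressed sketch of the check in step (c); the constants and method shape
are illustrative rather than the real JobQueue code, with 4 standing in for
the "pending purgatory" status the error message reports:

    // Illustrative only -- the consistency check behind "Unexpected
    // jobqueue status - record id ..., expecting active status, saw 4".
    class JobQueueSketch {
      static final int STATUS_ACTIVE = 2;            // set by (a), required by (c)
      static final int STATUS_PENDINGPURGATORY = 4;  // what was actually seen

      void updateCompletedRecord(long recordID, int currentStatus) {
        // (c) must observe the status (a) committed; anything else means
        // another thread intervened or (a) never took effect.
        if (currentStatus != STATUS_ACTIVE)
          throw new IllegalStateException("Unexpected jobqueue status - record id "
              + recordID + ", expecting active status, saw " + currentStatus);
        // ...otherwise the row is updated to COMPLETED...
      }
    }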

(2) Once the failure happens, the job in question should abort.  But it seems 
like that abort does not complete because there's a second problem with lock 
management (which generates the FATAL message above).  This should be readily 
fixed if I can get that trace.

Thanks,
Karl

On Wed, Mar 8, 2017 at 6:52 AM, Karl Wright 
<daddy...@gmail.com> wrote:
Hi Guy,

The agents thread dump shows that there's a lock stuck from somewhere; I expect 
it's from the UI.  Next time this happens, could you get a thread dump for the 
UI process as well as from the agents process?  Thanks!!

Karl


On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright 
<daddy...@gmail.com> wrote:
Hi Guy,

See https://issues.apache.org/jira/browse/CONNECTORS-590.

When you see "unexpected" this is not a good sign:

>>>>>>
“ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed: 
Unexpected jobqueue status - record id 1488898668325, expecting active status, 
saw 4
org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 Unexpected jobqueue status - record id 1488898668325, expecting active status, 
saw 4
at 
org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019)
at 
org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271)
at 
org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:710)
<<<<<<

I've spent weeks chasing this issue in the past.  I have not seen it at all 
except in certain installations: 2 of them, counting yours.  I introduced 
transaction forensics into the MCF code to figure out what was going on and I 
definitively proved (on Postgresql 9.1) that we were seeing a transactional 
integrity problem with Postgresql itself.  A ticket against Postgresql was not 
logged because they'd need a reproducible test case and also the latest 
version, and it didn't happen on 9.3 at the time.

You also seem to be seeing a deadlock in MCF locks.  When the 
Postgresql bug happens it means that the database is in a state that MCF really 
can't figure out and thus doesn't know how to deal with, so this may just be a 
downstream result of that occurrence.  But I can't be sure without further 
analysis.

I'm very curious now about the details of your setup.  (1) How is your 
postgresql set up?  What version did you decide to use?  (2) How many agents 
processes do you have?  Just one?  (3) What OS? (4) What JDK?

Other things you can try:
(1) Running the postgresql LT tests (ant run-LT-postgresql) against your 
postgresql installation; you will need to change the test code itself to allow 
it to create an instance for testing in that case;
(2) Turning on database transaction forensics (property name 
"org.apache.manifoldcf.diagnostics" value "DEBUG"
