Re: Question about ManifoldCF 2.8

Karl Wright Thu, 31 Aug 2017 08:10:22 -0700

I've looked at the dependencies; you should not have moved poi-3.15.jar.
Please move that back, and commons-collections4-4.1.jar too.


You *will* need to move curvesapi-1.04.jar though.

Thanks,
Karl
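
A minimal sketch of the jar moves discussed in this thread, assuming the five jars named across the messages below (with poi-3.15.jar and commons-collections4-4.1.jar left in place, as the message above says). The `mcf_home` argument and the function name are hypothetical and not part of ManifoldCF. The script defaults to a dry run and should only be used after all MCF processes have been shut down.

```python
import shutil
from pathlib import Path

# Jars this thread names as needing to move from connector-common-lib to lib.
# poi-3.15.jar and commons-collections4-4.1.jar are deliberately NOT listed.
JARS_TO_MOVE = [
    "xmlbeans-2.6.0.jar",
    "poi-ooxml-schemas-3.15.jar",
    "poi-ooxml-3.15.jar",
    "dom4j-1.6.1.jar",
    "curvesapi-1.04.jar",
]


def move_jars(mcf_home, jars=JARS_TO_MOVE, dry_run=True):
    """Move each named jar from connector-common-lib to lib.

    mcf_home is an assumed install root containing both directories.
    Returns the names of the jars that were found (and moved unless dry_run).
    """
    src_dir = Path(mcf_home) / "connector-common-lib"
    dst_dir = Path(mcf_home) / "lib"
    handled = []
    for name in jars:
        src = src_dir / name
        if not src.is_file():
            print(f"skip (not found): {src}")
            continue
        action = "would move" if dry_run else "moving"
        print(f"{action}: {src} -> {dst_dir / name}")
        if not dry_run:
            shutil.move(str(src), str(dst_dir / name))
        handled.append(name)
    return handled
```

Calling `move_jars("/path/to/apache-manifoldcf-2.8")` only prints what would happen; pass `dry_run=False` to perform the move. The classpath in the options.env files may still need editing by hand, as noted further down the thread.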


On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <[email protected]> wrote:

> If you include poi.jar, then all dependencies of poi.jar must also be
> included.  This would mean that curvesapi-1.04.jar and
> commons-collections4-4.1.jar should also be included.
>
> Karl
>
> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> I added the two jars that you mentioned and another one: poi-3.15.jar.
>> Unfortunately, another error is now showing. This time, it concerns
>> Excel files. You will find the stack trace attached.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 15:32, Karl Wright <[email protected]> wrote:
>>
>>> Hi Othman,
>>>
>>> Yes, this shows that the jar we moved calls back into another jar, which
>>> will also need to be moved.  *That* jar has yet another dependency too.
>>>
>>> The list of jars is thus extended to include:
>>>
>>> poi-ooxml-3.15.jar
>>> dom4j-1.6.1.jar
>>>
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <[email protected]>
>>> wrote:
>>>
>>>> You will find attached the stack trace. My apologies for the bad
>>>> quality of the image, I'm doing my best to send you the stack trace as I
>>>> don't have the right to send documents outside the company.
>>>>
>>>> Thank you for your time,
>>>>
>>>> Othman
>>>>
>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Oh, actually it didn't solve the problem. I looked into the log file
>>>>>> and saw the following error:
>>>>>>
>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>>>>
>>>>>> Maybe another jar is missing?
>>>>>>
>>>>>> Othman.
>>>>>>
>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I have tried what you told me to do and, as you expected, the crawl
>>>>>>> resumed. How about the regular expressions? How can I write complex
>>>>>>> regular expressions in the job's Paths tab?
>>>>>>>
>>>>>>> Thank you very much for your help.
>>>>>>>
>>>>>>> Othman.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>>
>>>>>>>> Othman.
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Oh, and you also may need to edit your options.env files to
>>>>>>>>> include them in the classpath for startup.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> If you are amenable, there is another workaround you could try.
>>>>>>>>>> Specifically:
>>>>>>>>>>
>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>>>>>>
>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>
>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>
>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>
>>>>>>>>>>> One simple workaround is to use the external Tika server
>>>>>>>>>>> transformer rather than the embedded Tika Extractor.  I'm still
>>>>>>>>>>> looking into why the jar is not being found.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my job
>>>>>>>>>>>> got stuck on that specific file.
>>>>>>>>>>>> The job status is still Running. You can see it in the attached
>>>>>>>>>>>> file. For your information, the job started yesterday.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Othman
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>> I think we will need a ticket to address this, if you are
>>>>>>>>>>>>> indeed using the binary distribution.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm actually using the binary version. For security reasons,
>>>>>>>>>>>>>> I can't send any files from my computer. I have copied the
>>>>>>>>>>>>>> stack trace and scanned it with my cellphone. I hope it will
>>>>>>>>>>>>>> be helpful.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Meanwhile, I have read the documentation about how to restrict
>>>>>>>>>>>>>> the crawl, and I don't think the '|' works in the
>>>>>>>>>>>>>> specification. For instance, I would like to restrict the
>>>>>>>>>>>>>> crawl to documents that contain the word 'sound'. I proceed
>>>>>>>>>>>>>> as follows: *(SON)*. The document is in capital letters, and
>>>>>>>>>>>>>> I noticed that it was not taken into consideration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The way you restrict documents with the Windows share
>>>>>>>>>>>>>>> connector is by specifying information on the "Paths" tab in
>>>>>>>>>>>>>>> jobs that crawl Windows shares.  There is end-user
>>>>>>>>>>>>>>> documentation, both online and distributed with all binary
>>>>>>>>>>>>>>> distributions, that describes how to do this.  Have you
>>>>>>>>>>>>>>> found it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for your response. I will start using Zookeeper
>>>>>>>>>>>>>>>> and let you know if it works. I have another question:
>>>>>>>>>>>>>>>> I need to apply some filters while crawling, because I
>>>>>>>>>>>>>>>> don't want to crawl certain files and folders. Could you
>>>>>>>>>>>>>>>> give me an example of how to use the regex? Does the regex
>>>>>>>>>>>>>>>> allow /i to ignore case?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>>>>>>> problems getting file permissions right, and they do not
>>>>>>>>>>>>>>>>> understand how to shut processes down cleanly, and
>>>>>>>>>>>>>>>>> Zookeeper is resilient against that.  I highly recommend
>>>>>>>>>>>>>>>>> using Zookeeper sync.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ManifoldCF is engineered not to put files into memory, so
>>>>>>>>>>>>>>>>> you do not need huge amounts of memory.  The default
>>>>>>>>>>>>>>>>> values are more than enough for 35,000 files, which is a
>>>>>>>>>>>>>>>>> pretty small job for ManifoldCF.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm actually not using Zookeeper. How is Zookeeper
>>>>>>>>>>>>>>>>>> different from file-based sync? I also need guidance on
>>>>>>>>>>>>>>>>>> how to manage my PC's memory. How many GB should I
>>>>>>>>>>>>>>>>>> allocate for the start-agent of ManifoldCF? Is 4 GB
>>>>>>>>>>>>>>>>>> enough to crawl 35K files?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you for your quick response. I have looked into
>>>>>>>>>>>>>>>>>>>> the ManifoldCF log file and extracted the following
>>>>>>>>>>>>>>>>>>>> warnings:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8
>>>>>>>>>>>>>>>>>>>> \multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
>>>>>>>>>>>>>>>>>>>> (Lowercase) Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full.
>>>>>>>>>>>>>>>>>>>> Shutting down process; locks may be left dangling. You
>>>>>>>>>>>>>>>>>>>> must cleanup before restarting.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> "ES (lowercase) synapses" is the Elasticsearch output
>>>>>>>>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract
>>>>>>>>>>>>>>>>>>>> metadata and a file system as a repository connection.
>>>>>>>>>>>>>>>>>>>> During the job, I don't extract the content of the
>>>>>>>>>>>>>>>>>>>> documents. I was wondering whether the issue comes from
>>>>>>>>>>>>>>>>>>>> Elasticsearch?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks
>>>>>>>>>>>>>>>>>>>>> like it might go away on retry, but does not.  It can
>>>>>>>>>>>>>>>>>>>>> be either on the repository side or on the output
>>>>>>>>>>>>>>>>>>>>> side.  If you look at the Simple History in the UI, or
>>>>>>>>>>>>>>>>>>>>> at the manifoldcf.log file, you should be able to get
>>>>>>>>>>>>>>>>>>>>> a better sense of what went wrong.  Without further
>>>>>>>>>>>>>>>>>>>>> information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société
>>>>>>>>>>>>>>>>>>>>>> Générale in France. I'm using the recent ManifoldCF
>>>>>>>>>>>>>>>>>>>>>> 2.8 release to build an internal search engine that
>>>>>>>>>>>>>>>>>>>>>> indexes documents on Windows shares. I encountered a
>>>>>>>>>>>>>>>>>>>>>> serious problem while crawling 35K documents. Most of
>>>>>>>>>>>>>>>>>>>>>> the time, when ManifoldCF starts crawling a large
>>>>>>>>>>>>>>>>>>>>>> document (19 MB, for example), it ends the job with
>>>>>>>>>>>>>>>>>>>>>> the following error: "repeated service interruptions -
>>>>>>>>>>>>>>>>>>>>>> failure processing document : software caused
>>>>>>>>>>>>>>>>>>>>>> connection abort: socket write error".
>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this
>>>>>>>>>>>>>>>>>>>>>> problem, please?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>
>>>
>
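
For errors like the `NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader` quoted in this thread, it can help to check which jar actually contains the missing class before moving anything. The following is a minimal sketch using only the Python standard library; the function name and the directory arguments are hypothetical, and it only inspects jar contents, it does not explain classloader behavior.

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, *directories):
    """Return the jars in the given directories that contain class_name.

    class_name may be written with dots or slashes, for example
    "org/apache/poi/POIXMLTypeLoader" as it appears in the log message.
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for directory in directories:
        for jar in sorted(Path(directory).glob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as zf:
                    if entry in zf.namelist():
                        hits.append(jar)
            except zipfile.BadZipFile:
                # Not a readable archive; ignore it and keep scanning.
                continue
    return hits
```

A call such as `jars_containing("org/apache/poi/POIXMLTypeLoader", "connector-common-lib", "lib")` lists every jar in those two directories that ships the class, which shows whether it sits in the directory the failing code can see.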
