If you include poi.jar, then all of poi.jar's dependencies must come along too: curvesapi-1.04.jar and commons-collections4-4.1.jar.
Karl

On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

> Hi Karl,
>
> I added the two jars that you mentioned, plus another one: poi-3.15.jar.
> Unfortunately, another error is showing; this time it concerns Excel
> files. You will find the stack trace attached.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 15:32, Karl Wright <daddy...@gmail.com> wrote:
>
>> Hi Othman,
>>
>> Yes, this shows that the jar we moved calls back into another jar, which
>> will also need to be moved. *That* jar has yet another dependency too.
>>
>> The list of jars is thus extended to include:
>>
>> poi-ooxml-3.15.jar
>> dom4j-1.6.1.jar
>>
>> Karl
>>
>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>
>>> You will find the stack trace attached. My apologies for the bad
>>> quality of the image; I'm doing my best to send you the stack trace, as
>>> I don't have the right to send documents outside the company.
>>>
>>> Thank you for your time,
>>>
>>> Othman
>>>
>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>
>>>>> Oh, actually it didn't solve the problem. I looked into the log file
>>>>> and saw the following error:
>>>>>
>>>>> Error tossed: org/apache/poi/POIXMLTypeLoader
>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader
>>>>>
>>>>> Maybe another jar is missing?
>>>>>
>>>>> Othman.
>>>>>
>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>
>>>>>> I tried what you told me to do and, as you expected, the crawling
>>>>>> resumed. How about the regular expressions? How can I write complex
>>>>>> regular expressions in the job's Paths tab?
>>>>>>
>>>>>> Thank you very much for your help.
>>>>>>
>>>>>> Othman.
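When a `java.lang.NoClassDefFoundError` like the one above appears, it helps to find which jar actually provides the missing class. A minimal sketch (editor's addition, not from the thread; assumes a Unix-like shell run from the ManifoldCF install directory) relies on the fact that zip entry names are stored uncompressed inside jar files, so a plain `grep` over the jars works:

```shell
# Print every jar under the current directory whose entry table
# mentions the missing class name.
find . -name '*.jar' -exec grep -l 'POIXMLTypeLoader' {} +
```

Run against the binary distribution, this should point at the POI jar(s) that still need to be moved onto the startup classpath.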
>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>
>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>
>>>>>>> Othman.
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Oh, and you may also need to edit your options.env files to include
>>>>>>>> them in the classpath for startup.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> If you are amenable, there is another workaround you could try.
>>>>>>>>> Specifically:
>>>>>>>>>
>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>>>>>
>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>
>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>
>>>>>>>>> Please let me know what happens.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>
>>>>>>>>>> One simple workaround is to use the external Tika server
>>>>>>>>>> transformer rather than the embedded Tika extractor. I'm still
>>>>>>>>>> looking into why the jar is not being found.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my job
>>>>>>>>>>> got stuck on that specific file.
>>>>>>>>>>> The job status is still Running; you can see it in the attached
>>>>>>>>>>> file. For your information, the job started yesterday.
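The workaround above (stop the processes, move jars from connector-common-lib to lib, restart) can be sketched as a small shell helper. This is an editor's sketch, not part of the thread: the install path and the exact jar list are assumptions to adapt to your installation (the thread later extends the list with poi-ooxml-3.15.jar and dom4j-1.6.1.jar, among others).

```shell
#!/bin/sh
# Move POI-related jars from connector-common-lib to lib so they land
# on the core startup classpath. Run only after all MCF processes are
# stopped. Jar names and paths are assumptions; adjust as needed.
move_poi_jars() {
    home="$1"
    for jar in xmlbeans-2.6.0.jar poi-ooxml-schemas-3.15.jar \
               poi-ooxml-3.15.jar dom4j-1.6.1.jar; do
        if [ -f "$home/connector-common-lib/$jar" ]; then
            mv "$home/connector-common-lib/$jar" "$home/lib/"
        fi
    done
}

# Example: move_poi_jars /opt/apache-manifoldcf-2.8
```

After the move, the options.env files may also need the jars added to the startup classpath, as Karl notes above.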
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Othman
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>> I think we will need a ticket to address this, if you are
>>>>>>>>>>>> indeed using the binary distribution.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm actually using the binary version. For security reasons, I
>>>>>>>>>>>>> can't send any files from my computer, so I copied the stack
>>>>>>>>>>>>> trace and scanned it with my cellphone. I hope it will be
>>>>>>>>>>>>> helpful. Meanwhile, I have read the documentation on how to
>>>>>>>>>>>>> restrict the crawling, and I don't think the '|' works in the
>>>>>>>>>>>>> specified paths. For instance, I would like to restrict the
>>>>>>>>>>>>> crawl to documents containing the word 'sound'. I proceed as
>>>>>>>>>>>>> follows: *(SON)*. The document name is in capital letters, and
>>>>>>>>>>>>> I noticed that it wasn't taken into consideration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The way you restrict documents with the Windows share
>>>>>>>>>>>>>> connector is by specifying information on the "Paths" tab of
>>>>>>>>>>>>>> jobs that crawl Windows shares. There is end-user
>>>>>>>>>>>>>> documentation, both online and distributed with all binary
>>>>>>>>>>>>>> distributions, that describes how to do this.
>>>>>>>>>>>>>> Have you found it?
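On the case-sensitivity question above: ManifoldCF is Java-based, and Java regular expressions have no trailing `/i` modifier; the inline `(?i)` flag serves that role. A small sketch (editor's addition; the sample file names are invented, and it assumes the Paths tab field accepts Java regular expressions) showing case-insensitive matching together with `|` alternation:

```java
import java.util.regex.Pattern;

public class CaseInsensitiveMatch {
    public static void main(String[] args) {
        // (?i) makes the whole pattern case-insensitive; '|' gives
        // alternation, so this matches names containing SON or SOUND
        // in any letter case.
        Pattern p = Pattern.compile("(?i).*(son|sound).*");
        System.out.println(p.matcher("REUNION_SON_2017.docx").matches()); // true
        System.out.println(p.matcher("budget_2017.xlsx").matches());      // false
    }
}
```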
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for your response. I will start using Zookeeper
>>>>>>>>>>>>>>> and will let you know if it works. I have another question:
>>>>>>>>>>>>>>> I need to apply some filters while crawling, because I don't
>>>>>>>>>>>>>>> want to crawl certain files and folders. Could you give me
>>>>>>>>>>>>>>> an example of how to use the regex? Does the regex allow
>>>>>>>>>>>>>>> using /i to ignore case?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>>>>>> problems getting file permissions right and do not
>>>>>>>>>>>>>>>> understand how to shut processes down cleanly; Zookeeper is
>>>>>>>>>>>>>>>> resilient against that. I highly recommend using Zookeeper
>>>>>>>>>>>>>>>> sync.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ManifoldCF is engineered not to load files into memory, so
>>>>>>>>>>>>>>>> you do not need huge amounts of memory. The default values
>>>>>>>>>>>>>>>> are more than enough for 35,000 files, which is a pretty
>>>>>>>>>>>>>>>> small job for ManifoldCF.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm actually not using Zookeeper. I want to know: how is
>>>>>>>>>>>>>>>>> Zookeeper sync different from file-based sync?
>>>>>>>>>>>>>>>>> I also need guidance on how to manage my PC's memory. How
>>>>>>>>>>>>>>>>> many GB should I allocate to the ManifoldCF agents
>>>>>>>>>>>>>>>>> process? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>> (2) See whether you still get failures after that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for your quick response. I have looked into
>>>>>>>>>>>>>>>>>>> the ManifoldCF log file and extracted the following
>>>>>>>>>>>>>>>>>>> warnings:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full.
>>>>>>>>>>>>>>>>>>> Shutting down process; locks may be left dangling. You
>>>>>>>>>>>>>>>>>>> must clean up before restarting.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> "ES (lowercase) synapses" being the Elasticsearch output
>>>>>>>>>>>>>>>>>>> connection.
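On the memory question above: in the multiprocess-file-example layout mentioned in the lock path, the process heap is set via the options.env.win / options.env.unix files, one JVM option per line. A sketch (editor's addition; exact file names and defaults vary by release, so verify against your distribution):

```
-Xms512m
-Xmx512m
```

As Karl notes, the defaults are ample for 35K documents, since ManifoldCF streams documents rather than buffering them in memory; 4 GB should not be necessary.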
>>>>>>>>>>>>>>>>>>> Moreover, the job uses Tika to extract metadata and a
>>>>>>>>>>>>>>>>>>> file system as a repository connection. During the job,
>>>>>>>>>>>>>>>>>>> I don't extract the content of the documents. I was
>>>>>>>>>>>>>>>>>>> wondering whether the issue comes from Elasticsearch?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks
>>>>>>>>>>>>>>>>>>>> like it might go away on retry, but does not. It can
>>>>>>>>>>>>>>>>>>>> be on either the repository side or the output side.
>>>>>>>>>>>>>>>>>>>> If you look at the Simple History in the UI, or at the
>>>>>>>>>>>>>>>>>>>> manifoldcf.log file, you should be able to get a
>>>>>>>>>>>>>>>>>>>> better sense of what went wrong. Without further
>>>>>>>>>>>>>>>>>>>> information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société
>>>>>>>>>>>>>>>>>>>>> Générale in France. I'm actually using your recent
>>>>>>>>>>>>>>>>>>>>> version of ManifoldCF, 2.8. I'm working on an
>>>>>>>>>>>>>>>>>>>>> internal search engine, and for this reason I'm using
>>>>>>>>>>>>>>>>>>>>> ManifoldCF to index documents on Windows shares. I
>>>>>>>>>>>>>>>>>>>>> encountered a serious problem while crawling 35K
>>>>>>>>>>>>>>>>>>>>> documents.
>>>>>>>>>>>>>>>>>>>>> Most of the time, when ManifoldCF starts crawling a
>>>>>>>>>>>>>>>>>>>>> large document (19 MB, for example), it ends the job
>>>>>>>>>>>>>>>>>>>>> with the following error: repeated service
>>>>>>>>>>>>>>>>>>>>> interruptions - failure processing document: software
>>>>>>>>>>>>>>>>>>>>> caused connection abort: socket write error.
>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this
>>>>>>>>>>>>>>>>>>>>> problem, please?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ