Re: Question about ManifoldCF 2.8

Karl Wright Thu, 31 Aug 2017 06:16:53 -0700

Once again, I need a stack trace to diagnose what the problem is.

Thanks,
Karl



On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

> Oh, actually it didn't solve the problem. I looked into the log file and
> saw the following error:
>
> Error tossed : org/apache/poi/POIXMLTypeLoader
> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>
> Maybe another jar is missing?
>
> Othman.
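
For anyone hitting the same NoClassDefFoundError, one way to check which jars (if any) provide the missing class is to scan the jar directories for the corresponding .class entry. The following is a minimal sketch and not part of ManifoldCF; the function name find_class_in_jars is hypothetical, and the directory names are simply the ones mentioned in this thread.

```python
import zipfile
from pathlib import Path


def find_class_in_jars(class_name, directories):
    """Return paths of jars under `directories` that contain `class_name`.

    `class_name` may use dots or slashes, e.g. "org/apache/poi/POIXMLTypeLoader".
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for directory in directories:
        base = Path(directory)
        if not base.is_dir():
            # Ignore directories that do not exist in this installation.
            continue
        for jar_path in sorted(base.glob("*.jar")):
            try:
                with zipfile.ZipFile(jar_path) as jar:
                    if entry in jar.namelist():
                        hits.append(str(jar_path))
            except zipfile.BadZipFile:
                # Skip files that are not valid jar (zip) archives.
                continue
    return hits


if __name__ == "__main__":
    # Directory names are assumptions taken from the thread; adjust as needed.
    for hit in find_class_in_jars(
        "org/apache/poi/POIXMLTypeLoader", ["lib", "connector-common-lib"]
    ):
        print(hit)
```

If the script prints nothing, no jar in those directories contains the class, which would be consistent with a missing dependency.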
>
> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>
>> I have tried what you told me to do and, as you expected, the crawl
>> resumed. What about regular expressions? How can I write complex regular
>> expressions in the job's Paths tab?
>>
>> Thank you very much for your help.
>>
>> Othman.
>>
>>
>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>
>>> Ok, I will try it right away and let you know if it works.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Oh, and you also may need to edit your options.env files to include
>>>> them in the classpath for startup.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddy...@gmail.com>
>>>> wrote:
>>>>
>>>>> If you are amenable, there is another workaround you could try.
>>>>> Specifically:
>>>>>
>>>>> (1) Shut down all MCF processes.
>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>
>>>>> xmlbeans-2.6.0.jar
>>>>> poi-ooxml-schemas-3.15.jar
>>>>>
>>>>> (3) Restart everything and see if your crawl resumes.
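
Step (2) above could be scripted roughly as follows. This is a sketch only: it assumes all MCF processes are already stopped and that both directories sit directly under the installation directory. The function name move_jars and the install_dir argument are hypothetical; only the two jar names and the two directory names come from the message.

```python
import shutil
from pathlib import Path

# Jar names as listed in step (2) of the message.
JARS = ["xmlbeans-2.6.0.jar", "poi-ooxml-schemas-3.15.jar"]


def move_jars(install_dir, jars=JARS):
    """Move each jar from connector-common-lib to lib under `install_dir`.

    Returns the names of the jars actually moved. Jars that are not present
    in connector-common-lib are skipped. The lib directory must already exist.
    """
    source = Path(install_dir) / "connector-common-lib"
    target = Path(install_dir) / "lib"
    moved = []
    for name in jars:
        src = source / name
        if src.is_file():
            shutil.move(str(src), str(target / name))
            moved.append(name)
    return moved
```

Calling move_jars with the installation directory returns the list of jars it moved, so an empty list means neither file was found where the message expects it.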
>>>>>
>>>>> Please let me know what happens.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddy...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>
>>>>>> One simple workaround is to use the external Tika server transformer
>>>>>> rather than the embedded Tika Extractor.  I'm still looking into why the
>>>>>> jar is not being found.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, I'm actually using the latest binary version, and my job got
>>>>>>> stuck on that specific file.
>>>>>>> The job status is still Running. You can see it in the attached
>>>>>>> file. For your information, the job started yesterday.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Othman
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddy...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>> I think we will need a ticket to address this, if you are indeed
>>>>>>>> using the binary distribution.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I'm actually using the binary version. For security reasons, I
>>>>>>>>> can't send any files from my computer. I have copied the stack trace
>>>>>>>>> and scanned it with my cellphone. I hope it will be helpful.
>>>>>>>>> Meanwhile, I have read the documentation about how to restrict the
>>>>>>>>> crawl, and I don't think '|' works in the specification. For
>>>>>>>>> instance, I would like to restrict the crawl to documents that
>>>>>>>>> contain the word 'sound'. I proceeded as follows: *(SON)*. The
>>>>>>>>> document is in capital letters, and I noticed that it wasn't taken
>>>>>>>>> into consideration.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Othman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddy...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Othman,
>>>>>>>>>>
>>>>>>>>>> The way you restrict documents with the Windows share connector
>>>>>>>>>> is by specifying information on the "Paths" tab in jobs that crawl
>>>>>>>>>> Windows shares.  There is end-user documentation, both online and
>>>>>>>>>> distributed with all binary distributions, that describes how to do
>>>>>>>>>> this.  Have you found it?
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <
>>>>>>>>>> i93oth...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>
>>>>>>>>>>> Thank you for your response. I will start using ZooKeeper and
>>>>>>>>>>> let you know if it works. I have another question: I need to apply
>>>>>>>>>>> some filters while crawling, because I don't want to crawl certain
>>>>>>>>>>> files and folders. Could you give me an example of how to use the
>>>>>>>>>>> regex? Does the regex syntax allow /i to ignore case?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Othman
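
As a general illustration of the /i question, and not a statement about what the Paths tab accepts (this thread does not settle that): in several regex engines, including Java's and Python's, case-insensitive matching is requested with an inline (?i) flag rather than a trailing /i. A minimal Python sketch with hypothetical file names:

```python
import re

# Hypothetical file names, for illustration only.
names = ["SON_2017.docx", "son_notes.txt", "report.pdf"]

# (?i) at the start of the pattern turns on case-insensitive matching.
pattern = re.compile(r"(?i).*son.*")

# Keep only the names that match the whole pattern, ignoring case.
matches = [name for name in names if pattern.fullmatch(name)]
print(matches)
```

Both the upper-case and lower-case names are selected, while the name without "son" is not.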
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddy...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>
>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>> problems getting file permissions right and do not understand
>>>>>>>>>>>> how to shut processes down cleanly; ZooKeeper is resilient
>>>>>>>>>>>> against those problems.  I highly recommend using ZooKeeper sync.
>>>>>>>>>>>>
>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory, so you do
>>>>>>>>>>>> not need huge amounts of memory.  The default values are more than
>>>>>>>>>>>> enough for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <
>>>>>>>>>>>> i93oth...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm actually not using ZooKeeper. How is ZooKeeper different
>>>>>>>>>>>>> from file-based sync? I also need guidance on how to manage my
>>>>>>>>>>>>> PC's memory. How many GB should I allocate for the start-agent of
>>>>>>>>>>>>> ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddy...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>> i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you, Mr Karl, for your quick response. I have looked
>>>>>>>>>>>>>>> into the ManifoldCF log file and extracted the following
>>>>>>>>>>>>>>> warnings:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.
>>>>>>>>>>>>>>> 8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
>>>>>>>>>>>>>>> (Lowercase) Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting
>>>>>>>>>>>>>>> down process; locks may be left dangling. You must cleanup
>>>>>>>>>>>>>>> before restarting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "ES (lowercase) synapses" is the Elasticsearch output
>>>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract metadata and
>>>>>>>>>>>>>>> a file system as the repository connection. During the job, I
>>>>>>>>>>>>>>> don't extract the content of the documents. I was wondering
>>>>>>>>>>>>>>> whether the issue comes from Elasticsearch.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <
>>>>>>>>>>>>>>> daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like
>>>>>>>>>>>>>>>> it might go away on retry, but does not.  It can be either on
>>>>>>>>>>>>>>>> the repository side or on the output side.  If you look at the
>>>>>>>>>>>>>>>> Simple History in the UI, or at the manifoldcf.log file, you
>>>>>>>>>>>>>>>> should be able to get a better sense of what went wrong.
>>>>>>>>>>>>>>>> Without further information, I can't say any more.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>> i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société
>>>>>>>>>>>>>>>>> Générale in France. I'm using the recent 2.8 release of
>>>>>>>>>>>>>>>>> ManifoldCF to build an internal search engine that indexes
>>>>>>>>>>>>>>>>> documents on Windows shares. I encountered a serious problem
>>>>>>>>>>>>>>>>> while crawling 35K documents. Most of the time, when
>>>>>>>>>>>>>>>>> ManifoldCF starts crawling a large document (19 MB, for
>>>>>>>>>>>>>>>>> example), it ends the job with the following error: repeated
>>>>>>>>>>>>>>>>> service interruptions - failure processing document :
>>>>>>>>>>>>>>>>> software caused connection abort: socket write error.
>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this problem,
>>>>>>>>>>>>>>>>> please?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
