I need the complete stack trace please. Are you building ManifoldCF yourself, or are you using the distributed binary?
Karl

On Thu, Aug 31, 2017 at 5:48 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

> I have also encountered the following problem while indexing documents in
> the Windows shares:
>
> Error tossed: com/microsoft/schemas/office/visio/x2012/main/ConnectsType
>
> Is this a Tika problem?
>
> Thanks in advance,
> Othman.
>
> On Thu, 31 Aug 2017 at 11:25, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>
>> Hello Karl,
>>
>> Thank you for your response. I will start using zookeeper and I will let
>> you know if it works. I have another question to ask. I need to apply
>> some filters while crawling: I don't want to crawl some files and some
>> folders. Could you give me an example of how to use the regex? Does the
>> regex syntax allow /i to ignore case?
>>
>> Thanks,
>> Othman
>>
>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddy...@gmail.com> wrote:
>>
>>> Hi Beelz,
>>>
>>> File-based sync is deprecated because people often have problems with
>>> getting file permissions right, and they do not understand how to shut
>>> processes down cleanly, and zookeeper is resilient against that. I highly
>>> recommend using zookeeper sync.
>>>
>>> ManifoldCF is engineered to not put files into memory so you do not need
>>> huge amounts of memory. The default values are more than enough for 35,000
>>> files, which is a pretty small job for ManifoldCF.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>> wrote:
>>>
>>>> I'm actually not using zookeeper. I want to know how zookeeper is
>>>> different from file-based sync. I also need guidance on how to manage my
>>>> PC's memory. How many GB should I allocate for the start-agent of
>>>> ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>
>>>> Othman.
>>>>
>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddy...@gmail.com> wrote:
>>>>
>>>>> Your disk is not writable for some reason, and that's interfering with
>>>>> ManifoldCF 2.8 locking.
>>>>>
>>>>> I would suggest two things:
>>>>>
>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>> (2) Have a look if you still get failures after that.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Mr Karl,
>>>>>>
>>>>>> Thank you for your quick response. I have looked into the
>>>>>> ManifoldCF log file and extracted the following warnings:
>>>>>>
>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.
>>>>>> 8\multiprocess-file-example\.\.\synch
>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
>>>>>> (Lowercase) Synapses.lock' failed : Access is denied.
>>>>>>
>>>>>>
>>>>>> - Couldn't write to lock file; disk may be full. Shutting down
>>>>>> process; locks may be left dangling. You must cleanup before restarting.
>>>>>>
>>>>>> "ES (lowercase) synapses" is the elasticsearch output connection.
>>>>>> Moreover, the job uses Tika to extract metadata and a file system as a
>>>>>> repository connection. During the job, I don't extract the content of the
>>>>>> documents. I was wondering whether the issue comes from elasticsearch?
>>>>>>
>>>>>> Othman.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Othman,
>>>>>>>
>>>>>>> ManifoldCF aborts a job if there's an error that looks like it might
>>>>>>> go away on retry, but does not. It can be either on the repository
>>>>>>> side or on the output side. If you look at the Simple History in the
>>>>>>> UI, or at the manifoldcf.log file, you should be able to get a better
>>>>>>> sense of what went wrong. Without further information, I can't say
>>>>>>> any more.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm Othman Belhaj, a software engineer from Société Générale in
>>>>>>>> France. I'm currently using your recent version, ManifoldCF 2.8. I'm
>>>>>>>> working on an internal search engine, and I'm using ManifoldCF to
>>>>>>>> index documents on Windows shares. I encountered a serious problem
>>>>>>>> while crawling 35K documents. Most of the time, when ManifoldCF
>>>>>>>> starts crawling a large document (19 MB, for example), it ends the
>>>>>>>> job with the following error: repeated service interruptions - failure
>>>>>>>> processing document : software caused connection abort: socket write
>>>>>>>> error.
>>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>>
>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>> I'm looking forward to your response.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Othman BELHAJ
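
On the /i question raised in the thread: the following is a minimal sketch, assuming the filter expression ends up being evaluated by Java's java.util.regex. ManifoldCF is written in Java, but this thread does not say which connector fields accept a regex as opposed to a wildcard file spec, so treat that as an assumption. Java has no trailing /i; case-insensitivity is requested with the inline flag (?i) or with Pattern.CASE_INSENSITIVE. The class name, the exclusion pattern, and the sample paths are hypothetical and only for illustration.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveFilter {
    // Hypothetical exclusion rule: skip .tmp/.bak files and anything
    // below a folder named "archive". The leading (?i) makes the whole
    // pattern case-insensitive (Java's equivalent of a trailing /i).
    static final Pattern EXCLUDE =
        Pattern.compile("(?i)(.*\\.(tmp|bak)|.*/archive/.*)");

    // Same rule, using a compile-time flag instead of the inline (?i).
    static final Pattern EXCLUDE_WITH_FLAG =
        Pattern.compile("(.*\\.(tmp|bak)|.*/archive/.*)", Pattern.CASE_INSENSITIVE);

    // True when the full path matches the exclusion rule.
    static boolean isExcluded(String path) {
        return EXCLUDE.matcher(path).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "/share/docs/Report.TMP",
            "/share/Archive/old.docx",
            "/share/docs/plan.docx"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + (isExcluded(s) ? "skip" : "crawl"));
        }
    }
}
```

If the field in question only accepts a pattern string, the inline (?i) form is the one that can be typed directly into it; the flag form needs access to the code that compiles the pattern.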