Oh, and you may also need to edit your options.env files so that those jars are included in the classpath at startup.
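Karl's three-step workaround (quoted below) can also be scripted. This is an editorial sketch, not part of the thread: the install-root argument and the assumption that the two jars sit directly under connector-common-lib are mine, and the options.env edit above still has to be done by hand.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the workaround: move the two POI dependencies from
// connector-common-lib into lib, then restart all MCF processes.
// Pass your ManifoldCF install root as the first argument.
public class MoveTikaDeps {
    static final String[] JARS = {"xmlbeans-2.6.0.jar", "poi-ooxml-schemas-3.15.jar"};

    public static void main(String[] args) throws IOException {
        Path home = Paths.get(args.length > 0 ? args[0] : ".");
        Path src = home.resolve("connector-common-lib");
        Path dst = home.resolve("lib");
        Files.createDirectories(dst);
        for (String jar : JARS) {
            Path from = src.resolve(jar);
            if (Files.exists(from)) {
                // Relocate the jar so the startup classpath picks it up from lib/
                Files.move(from, dst.resolve(jar), StandardCopyOption.REPLACE_EXISTING);
                System.out.println("moved " + jar);
            } else {
                System.out.println("not found: " + from);
            }
        }
    }
}
```

Remember to shut down all MCF processes before moving the jars and restart everything afterwards, as the quoted steps say.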
Karl

On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddy...@gmail.com> wrote:

If you are amenable, there is another workaround you could try. Specifically:

(1) Shut down all MCF processes.
(2) Move the following two files from connector-common-lib to lib:

    xmlbeans-2.6.0.jar
    poi-ooxml-schemas-3.15.jar

(3) Restart everything and see if your crawl resumes.

Please let me know what happens.

Karl

On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddy...@gmail.com> wrote:

I created a ticket for this: CONNECTORS-1450.

One simple workaround is to use the external Tika server transformer rather than the embedded Tika extractor. I'm still looking into why the jar is not being found.

Karl

On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

Yes, I'm using the latest binary version, and my job got stuck on that specific file. The job status is still Running; you can see it in the attached file. For your information, the job started yesterday.

Thanks,

Othman

On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddy...@gmail.com> wrote:

It looks like a dependency of Apache POI is missing. I think we will need a ticket to address this, if you are indeed using the binary distribution.

Thanks!
Karl

On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

I'm actually using the binary version. For security reasons, I can't send any files from my computer, so I copied the stack trace and scanned it with my cellphone; I hope it will be helpful. Meanwhile, I have read the documentation about how to restrict the crawling, and I don't think the '|' works in the specification. For instance, I would like to restrict the crawl to documents containing the word 'sound'. I proceeded as follows: *(SON)*.
The document name is in capital letters, and I noticed that the filter didn't take it into consideration.

Thanks,
Othman

On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddy...@gmail.com> wrote:

Hi Othman,

The way you restrict documents with the Windows Share connector is by specifying information on the "Paths" tab in jobs that crawl Windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?

Karl

On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

Hello Karl,

Thank you for your response. I will start using ZooKeeper and will let you know if it works. I have another question: I need to apply some filters while crawling, because I don't want to crawl certain files and folders. Could you give me an example of how to use the regex? Does the regex allow /i to ignore case?

Thanks,
Othman

On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddy...@gmail.com> wrote:

Hi Beelz,

File-based sync is deprecated because people often have problems getting file permissions right and do not understand how to shut processes down cleanly; ZooKeeper is resilient against that. I highly recommend using ZooKeeper sync.

ManifoldCF is engineered not to hold files in memory, so you do not need huge amounts of memory. The default values are more than enough for 35,000 files, which is a pretty small job for ManifoldCF.
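On the /i question above: the trailing /i modifier is a Perl/JavaScript convention. Assuming the connector compiles its include/exclude expressions with java.util.regex (the standard Java mechanism), case-insensitivity is requested with the inline flag `(?i)` or `Pattern.CASE_INSENSITIVE`, which would also cover the all-uppercase document names mentioned earlier. A minimal sketch:

```java
import java.util.regex.Pattern;

// Demonstrates case-insensitive matching in Java regular expressions.
// There is no trailing /i; use the inline (?i) flag instead, which makes
// the rest of the expression match SON, Son, and son alike.
public class CaseInsensitiveMatch {
    public static boolean containsSon(String name) {
        return Pattern.compile("(?i)son").matcher(name).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSon("RAISON_SOCIALE.docx")); // true
        System.out.println(containsSon("budget.xlsx"));         // false
    }
}
```

`Pattern.compile("son", Pattern.CASE_INSENSITIVE)` is equivalent; the inline flag has the advantage of fitting into a UI field that accepts only the expression string.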
Thanks,
Karl

On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

I'm actually not using ZooKeeper. How is ZooKeeper different from file-based sync? I also need some guidance on how to manage my PC's memory: how many GB should I allocate to the ManifoldCF start-agents process? Is 4 GB enough to crawl 35K files?

Othman.

On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddy...@gmail.com> wrote:

Your disk is not writable for some reason, and that's interfering with ManifoldCF 2.8 locking.

I would suggest two things:

(1) Use ZooKeeper for sync instead of file-based sync.
(2) See whether you still get failures after that.

Thanks,
Karl

On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

Hi Mr Karl,

Thank you for your quick response. I have looked into the ManifoldCF log file and extracted the following warnings:

- Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.

- Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must cleanup before restarting.

"ES (lowercase) synapses" is the Elasticsearch output connection. Moreover, the job uses Tika to extract metadata and a file system as a repository connection.
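As an aside on Karl's suggestion (1), switching from file-based sync to ZooKeeper is configured in properties.xml. The fragment below is recalled from the multiprocess-zk-example shipped with the binary distribution, not quoted from the thread; verify the exact property names and values against the properties.xml in that directory before using it.

```xml
<!-- properties.xml fragment: use the ZooKeeper lock manager instead of
     file-based sync (property names assumed from multiprocess-zk-example) -->
<property name="org.apache.manifoldcf.lockmanagerclass"
          value="org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager"/>
<property name="org.apache.manifoldcf.zookeeper.connectstring" value="localhost:2181"/>
<property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="300000"/>
```

A ZooKeeper instance must be running at the connect string before the MCF processes start; the binary distribution's zk example includes scripts for a single-node setup.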
During the job, I don't extract the content of the documents. I was wondering whether the issue comes from Elasticsearch.

Othman.

On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddy...@gmail.com> wrote:

Hi Othman,

ManifoldCF aborts a job if there's an error that looks like it might go away on retry but does not. It can be either on the repository side or on the output side. If you look at the Simple History in the UI, or at the manifoldcf.log file, you should be able to get a better sense of what went wrong. Without further information, I can't say any more.

Thanks,
Karl

On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

Hello,

I'm Othman Belhaj, a software engineer at Société Générale in France. I'm using your recent version, ManifoldCF 2.8, to build an internal search engine; for this reason I use ManifoldCF to index documents on Windows shares. I encountered a serious problem while crawling 35K documents. Most of the time, when ManifoldCF starts crawling a large document (19 MB, for example), it ends the job with the following error: repeated service interruptions - failure processing document: software caused connection abort: socket write error.
Can you give me some tips on how to solve this problem, please?

I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
I'm looking forward to your response.

Best regards,

Othman BELHAJ
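On the memory question earlier in the thread: the heap for the agents process is set via JVM options in the options.env files mentioned at the top (options.env.unix / options.env.win in the multiprocess examples). A sketch, assuming the stock one-option-per-line format; exact defaults vary by release, and per Karl's reply the defaults are already sufficient for a 35K-file job:

```
-Xms1024m
-Xmx1024m
```

Raising -Xmx well beyond this is rarely needed, since ManifoldCF streams documents rather than buffering them in memory.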