Once again, I need a stack trace to diagnose what the problem is.

Thanks,
Karl

On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

> Oh, actually it didn't solve the problem. I looked into the log file and
> saw the following error:
>
> Error tossed : org/apache/poi/POIXMLTypeLoader
> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>
> Maybe another jar is missing?
>
> Othman.
>
> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>
>> I have tried what you told me to do and, as you expected, the crawling
>> resumed. How about the regular expressions? How can I write complex
>> regular expressions in the job's Paths tab?
>>
>> Thank you very much for your help.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>
>>> Ok, I will try it right away and let you know if it works.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Oh, and you also may need to edit your options.env files to include
>>>> them in the classpath for startup.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>
>>>>> If you are amenable, there is another workaround you could try.
>>>>> Specifically:
>>>>>
>>>>> (1) Shut down all MCF processes.
>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>
>>>>> xmlbeans-2.6.0.jar
>>>>> poi-ooxml-schemas-3.15.jar
>>>>>
>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>
>>>>> Please let me know what happens.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>
>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>
>>>>>> One simple workaround is to use the external Tika server transformer
>>>>>> rather than the embedded Tika Extractor. I'm still looking into why the
>>>>>> jar is not being found.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, I'm actually using the latest binary version, and my job got
>>>>>>> stuck on that specific file.
>>>>>>> The job status is still Running. You can see it in the attached
>>>>>>> file. For your information, the job started yesterday.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Othman
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>> I think we will need a ticket to address this, if you are indeed
>>>>>>>> using the binary distribution.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm actually using the binary version. For security reasons, I
>>>>>>>>> can't send any files from my computer. I have copied the stack
>>>>>>>>> trace and scanned it with my cellphone. I hope it will be helpful.
>>>>>>>>> Meanwhile, I have read the documentation about how to restrict the
>>>>>>>>> crawling and I don't think the '|' works in the specified. For
>>>>>>>>> instance, I would like to restrict the crawling to the documents
>>>>>>>>> that contain the word 'sound'. I proceed as follows: *(SON)* . The
>>>>>>>>> document is in capital letters, and I noticed that it wasn't taken
>>>>>>>>> into consideration.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Othman
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Othman,
>>>>>>>>>>
>>>>>>>>>> The way you restrict documents with the windows share connector
>>>>>>>>>> is by specifying information on the "Paths" tab in jobs that crawl
>>>>>>>>>> windows shares.
>>>>>>>>>> There is end-user documentation both online and distributed with
>>>>>>>>>> all binary distributions that describes how to do this. Have you
>>>>>>>>>> found it?
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>
>>>>>>>>>>> Thank you for your response. I will start using zookeeper and I
>>>>>>>>>>> will let you know if it works. I have another question to ask.
>>>>>>>>>>> Actually, I need to apply some filters while crawling. I don't
>>>>>>>>>>> want to crawl some files and some folders. Could you give me an
>>>>>>>>>>> example of how to use the regex? Does the regex allow using /i
>>>>>>>>>>> to ignore case?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Othman
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>
>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>> problems with getting file permissions right, and they do not
>>>>>>>>>>>> understand how to shut processes down cleanly, and zookeeper is
>>>>>>>>>>>> resilient against that. I highly recommend using zookeeper sync.
>>>>>>>>>>>>
>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory so you do
>>>>>>>>>>>> not need huge amounts of memory. The default values are more
>>>>>>>>>>>> than enough for 35,000 files, which is a pretty small job for
>>>>>>>>>>>> ManifoldCF.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how zookeeper
>>>>>>>>>>>>> is different from file-based sync. I also need guidance on how
>>>>>>>>>>>>> to manage my PC's memory.
>>>>>>>>>>>>> How many GB should I allocate for the start-agent of
>>>>>>>>>>>>> ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for your quick response. I have looked into the
>>>>>>>>>>>>>>> ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\syncharea\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock'
>>>>>>>>>>>>>>> failed : Access is denied.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting
>>>>>>>>>>>>>>> down process; locks may be left dangling. You must cleanup
>>>>>>>>>>>>>>> before restarting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ES (lowercase) synapses being the elasticsearch output
>>>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract metadata
>>>>>>>>>>>>>>> and a file system as a repository connection. During the job,
>>>>>>>>>>>>>>> I don't extract the content of the documents.
>>>>>>>>>>>>>>> I was wondering whether the issue comes from elasticsearch?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like
>>>>>>>>>>>>>>>> it might go away on retry, but does not. It can be either on
>>>>>>>>>>>>>>>> the repository side or on the output side. If you look at the
>>>>>>>>>>>>>>>> Simple History in the UI, or at the manifoldcf.log file, you
>>>>>>>>>>>>>>>> should be able to get a better sense of what went wrong.
>>>>>>>>>>>>>>>> Without further information, I can't say any more.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société
>>>>>>>>>>>>>>>>> Générale in France. I'm currently using your recent version
>>>>>>>>>>>>>>>>> of ManifoldCF, 2.8. I'm working on an internal search
>>>>>>>>>>>>>>>>> engine. For this reason, I'm using ManifoldCF to index
>>>>>>>>>>>>>>>>> documents on windows shares. I encountered a serious problem
>>>>>>>>>>>>>>>>> while crawling 35K documents. Most of the time, when
>>>>>>>>>>>>>>>>> ManifoldCF starts crawling a big document (19 MB, for
>>>>>>>>>>>>>>>>> example), it ends the job with the following error: repeated
>>>>>>>>>>>>>>>>> service interruptions - failure processing document :
>>>>>>>>>>>>>>>>> software caused connection abort: socket write error.
>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this problem,
>>>>>>>>>>>>>>>>> please?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Othman BELHAJ
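
For anyone else chasing the java.lang.NoClassDefFoundError quoted in this thread: one way to check whether a class is actually shipped in the jars under a given directory is to scan those jars for the class file. The Python sketch below is not part of ManifoldCF, and find_class_in_jars is a hypothetical helper name. It only reports which jars contain the class; it says nothing about which classloader ManifoldCF would use to load it.

```python
import zipfile
from pathlib import Path


def find_class_in_jars(class_name, directories):
    """Return the paths of jars in `directories` that contain `class_name`.

    `class_name` may be written with dots or slashes, for example
    org/apache/poi/POIXMLTypeLoader. Only jars directly inside each
    directory are inspected (no recursion).
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for directory in directories:
        for jar in sorted(Path(directory).glob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as archive:
                    if entry in archive.namelist():
                        hits.append(str(jar))
            except zipfile.BadZipFile:
                # Skip files that end in .jar but are not valid archives.
                continue
    return hits
```

Run from the ManifoldCF installation directory, a call such as find_class_in_jars("org/apache/poi/POIXMLTypeLoader", ["lib", "connector-common-lib"]) would list the jars, if any, in the two directories named in the workaround above that contain the class. An empty list suggests the jar is missing from both.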