Could it be a problem with the Elasticsearch version? I'm actually using 2.1.0, which is pretty old for this new version of ManifoldCF.
Othman.

On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

> I moved back both the jars you mentioned, and a different error is showing. You will find the stack trace attached.
>
> Thanks,
> Othman
>
> On Thu, 31 Aug 2017 at 17:09, Karl Wright <daddy...@gmail.com> wrote:
>
>> I've looked at the dependencies; you should not have moved poi-3.15.jar. Please move that back, and commons-collections4-4.1.jar too.
>>
>> You *will* need to move curvesapi-1.04.jar though.
>>
>> Thanks,
>> Karl
>>
>> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <daddy...@gmail.com> wrote:
>>
>>> If you include poi.jar, then all dependencies of poi.jar must also be included. This would mean that curvesapi-1.04.jar and commons-collections4-4.1.jar should also be included.
>>>
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> I added the two jars that you mentioned and another one: poi-3.15.jar. Unfortunately, there is another error showing. This time, it concerns Excel files. You will find the stack trace attached.
>>>>
>>>> Othman.
>>>>
>>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright <daddy...@gmail.com> wrote:
>>>>
>>>>> Hi Othman,
>>>>>
>>>>> Yes, this shows that the jar we moved calls back into another jar, which will also need to be moved. *That* jar has yet another dependency too.
>>>>>
>>>>> The list of jars is thus extended to include:
>>>>>
>>>>> poi-ooxml-3.15.jar
>>>>> dom4j-1.6.1.jar
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>
>>>>>> You will find the stack trace attached. My apologies for the bad quality of the image; I'm doing my best to send you the stack trace, as I don't have the right to send documents outside the company.
>>>>>>
>>>>>> Thank you for your time,
>>>>>>
>>>>>> Othman
>>>>>>
>>>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>
>>>>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Oh, actually it didn't solve the problem. I looked into the log file and saw the following error:
>>>>>>>>
>>>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>>>>>>
>>>>>>>> Maybe another jar is missing?
>>>>>>>>
>>>>>>>> Othman.
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have tried what you told me to do and, as you expected, the crawling resumed. How about the regular expressions? How can I write complex regular expressions in the job's Paths tab?
>>>>>>>>>
>>>>>>>>> Thank you very much for your help.
>>>>>>>>>
>>>>>>>>> Othman.
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>>>>
>>>>>>>>>> Othman.
>>>>>>>>>>
>>>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Oh, and you also may need to edit your options.env files to include them in the classpath for startup.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If you are amenable, there is another workaround you could try. Specifically:
>>>>>>>>>>>>
>>>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>>>>>>>>
>>>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>>>
>>>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>>>
>>>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One simple workaround is to use the external Tika server transformer rather than the embedded Tika Extractor. I'm still looking into why the jar is not being found.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my job got stuck on that specific file.
>>>>>>>>>>>>>> The job status is still Running. You can see it in the attached file. For your information, the job started yesterday.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>>>> I think we will need a ticket to address this, if you are indeed using the binary distribution.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm actually using the binary version. For security reasons, I can't send any files from my computer.
>>>>>>>>>>>>>>>> I have copied the stack trace and scanned it with my cellphone. I hope it will be helpful.
>>>>>>>>>>>>>>>> Meanwhile, I have read the documentation about how to restrict the crawling and I don't think the '|' works in the specified. For instance, I would like to restrict the crawling to the documents that contain the word 'sound'. I proceed as follows: *(SON)*. The document is in capital letters and I noticed that it didn't take it into consideration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The way you restrict documents with the windows share connector is by specifying information on the "Paths" tab in jobs that crawl windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you for your response. I will start using zookeeper and I will let you know if it works. I have another question to ask. Actually, I need to apply some filters while crawling. I don't want to crawl some files and some folders.
>>>>>>>>>>>>>>>>>> Could you give me an example of how to use the regex? Does the regex allow the use of /i to ignore case?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> File-based sync is deprecated because people often have problems with getting file permissions right, and they do not understand how to shut processes down cleanly, and zookeeper is resilient against that. I highly recommend using zookeeper sync.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory, so you do not need huge amounts of memory. The default values are more than enough for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how zookeeper is different from file-based sync. I also need guidance on how to manage my PC's memory. How many GB should I allocate for the start-agent of ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you Mr Karl for your quick response. I have looked into the ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must cleanup before restarting.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ES (lowercase) synapses being the elasticsearch output connection. Moreover, the job uses Tika to extract metadata and a file system as a repository connection.
>>>>>>>>>>>>>>>>>>>>>> During the job, I don't extract the content of the documents. I was wondering if the issue comes from elasticsearch?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like it might go away on retry, but does not. It can be either on the repository side or on the output side. If you look at the Simple History in the UI, or at the manifoldcf.log file, you should be able to get a better sense of what went wrong. Without further information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société Générale in France. I'm actually using your recent version of ManifoldCF, 2.8. I'm working on an internal search engine. For this reason, I'm using manifoldcf in order to index documents on windows shares. I encountered a serious problem while crawling 35K documents.
>>>>>>>>>>>>>>>>>>>>>>>> Most of the time, when manifoldcf starts crawling a large document (19 MB for example), it ends the job with the following error: repeated service interruptions - failure processing document : software caused connection abort: socket write error.
>>>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
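
For reference, the jar-moving workaround discussed in the quoted thread could be scripted roughly as follows. This is only a sketch under assumptions: the jar names are the ones mentioned in the thread, but the helper name `move_jars`, the idea of passing the install root as an argument, and the assumption that `connector-common-lib` and `lib` sit directly under that root are all hypothetical. All MCF processes should be shut down first, and the options.env files may still need to be edited by hand.

```python
from pathlib import Path
import shutil

# Jars the thread says should be moved from connector-common-lib to lib.
JARS_TO_MOVE = [
    "xmlbeans-2.6.0.jar",
    "poi-ooxml-schemas-3.15.jar",
    "poi-ooxml-3.15.jar",
    "dom4j-1.6.1.jar",
    "curvesapi-1.04.jar",
]

# Jars the thread says should stay where they are (listed only as a reminder).
JARS_TO_KEEP = ["poi-3.15.jar", "commons-collections4-4.1.jar"]


def move_jars(mcf_home, jars=JARS_TO_MOVE):
    """Hypothetical helper: move the listed jars from connector-common-lib to lib.

    Returns a pair (moved, missing) of jar-name lists.
    """
    src = Path(mcf_home) / "connector-common-lib"
    dst = Path(mcf_home) / "lib"
    if not src.is_dir() or not dst.is_dir():
        # Assumed layout: both directories directly under the install root.
        raise FileNotFoundError("expected connector-common-lib and lib under " + str(mcf_home))
    moved, missing = [], []
    for name in jars:
        source_file = src / name
        if source_file.is_file():
            shutil.move(str(source_file), str(dst / name))
            moved.append(name)
        else:
            # Already moved, or a different version shipped with this release.
            missing.append(name)
    return moved, missing
```

Calling `move_jars` with the install root returns which jars were moved and which were not found, so a missing or differently versioned jar shows up before the agents are restarted.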