I've looked at the dependencies; you should not have moved poi-3.15.jar. Please move that back, and commons-collections4-4.1.jar too.
You *will* need to move curvesapi-1.04.jar though.

Thanks,
Karl

On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <daddy...@gmail.com> wrote:

> If you include poi.jar, then all dependencies of poi.jar must also be
> included. This would mean that curvesapi-1.04.jar and
> commons-collections4-4.1.jar should also be included.
>
> Karl
>
> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <i93oth...@gmail.com>
> wrote:
>
>> Hi Karl,
>>
>> I added the two jars that you mentioned and another one:
>> poi-3.15.jar. Unfortunately, there is another error showing. This
>> time, it concerns Excel files. You will find attached the stack trace.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 15:32, Karl Wright <daddy...@gmail.com> wrote:
>>
>>> Hi Othman,
>>>
>>> Yes, this shows that the jar we moved calls back into another jar,
>>> which will also need to be moved. *That* jar has yet another
>>> dependency too.
>>>
>>> The list of jars is thus extended to include:
>>>
>>> poi-ooxml-3.15.jar
>>> dom4j-1.6.1.jar
>>>
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>> wrote:
>>>
>>>> You will find attached the stack trace. My apologies for the bad
>>>> quality of the image; I'm doing my best to send you the stack trace,
>>>> as I don't have the right to send documents outside the company.
>>>>
>>>> Thank you for your time,
>>>>
>>>> Othman
>>>>
>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <daddy...@gmail.com> wrote:
>>>>
>>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Oh, actually it didn't solve the problem. I looked into the log
>>>>>> file and saw the following error:
>>>>>>
>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>>>>
>>>>>> Maybe another jar is missing?
>>>>>>
>>>>>> Othman.
>>>>>>
>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I have tried what you told me to do and, as you expected, the
>>>>>>> crawling resumed. How about the regular expressions? How can I
>>>>>>> write complex regular expressions in the job's Paths tab?
>>>>>>>
>>>>>>> Thank you very much for your help.
>>>>>>>
>>>>>>> Othman.
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>>
>>>>>>>> Othman.
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddy...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Oh, and you also may need to edit your options.env files to
>>>>>>>>> include them in the classpath for startup.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddy...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> If you are amenable, there is another workaround you could try.
>>>>>>>>>> Specifically:
>>>>>>>>>>
>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>>>>>>
>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>
>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>
>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddy...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>
>>>>>>>>>>> One simple workaround is to use the external Tika server
>>>>>>>>>>> transformer rather than the embedded Tika Extractor. I'm still
>>>>>>>>>>> looking into why the jar is not being found.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my job
>>>>>>>>>>>> got stuck on that specific file.
>>>>>>>>>>>> The job status is still Running. You can see it in the
>>>>>>>>>>>> attached file. For your information, the job started yesterday.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Othman
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddy...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>> I think we will need a ticket to address this, if you are
>>>>>>>>>>>>> indeed using the binary distribution.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm actually using the binary version. For security reasons,
>>>>>>>>>>>>>> I can't send any files from my computer. I have copied the
>>>>>>>>>>>>>> stack trace and scanned it with my cellphone. I hope it will
>>>>>>>>>>>>>> be helpful. Meanwhile, I have read the documentation about
>>>>>>>>>>>>>> how to restrict the crawling and I don't think the '|' works
>>>>>>>>>>>>>> in the specified. For instance, I would like to restrict the
>>>>>>>>>>>>>> crawling to the documents that contain the word 'sound'. I
>>>>>>>>>>>>>> proceed as follows: *(SON)*. The document is in capital
>>>>>>>>>>>>>> letters, and I noticed that it wasn't taken into
>>>>>>>>>>>>>> consideration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddy...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The way you restrict documents with the windows share
>>>>>>>>>>>>>>> connector is by specifying information on the "Paths" tab
>>>>>>>>>>>>>>> in jobs that crawl windows shares. There is end-user
>>>>>>>>>>>>>>> documentation, both online and distributed with all binary
>>>>>>>>>>>>>>> distributions, that describes how to do this.
>>>>>>>>>>>>>>> Have you found it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for your response. I will start using zookeeper
>>>>>>>>>>>>>>>> and I will let you know if it works. I have another
>>>>>>>>>>>>>>>> question to ask. Actually, I need to apply some filters
>>>>>>>>>>>>>>>> while crawling. I don't want to crawl some files and some
>>>>>>>>>>>>>>>> folders. Could you give me an example of how to use the
>>>>>>>>>>>>>>>> regex? Does the regex allow the use of /i to ignore case?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddy...@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>>>>>>> problems with getting file permissions right, and they do
>>>>>>>>>>>>>>>>> not understand how to shut processes down cleanly, and
>>>>>>>>>>>>>>>>> zookeeper is resilient against that. I highly recommend
>>>>>>>>>>>>>>>>> using zookeeper sync.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory so
>>>>>>>>>>>>>>>>> you do not need huge amounts of memory. The default values
>>>>>>>>>>>>>>>>> are more than enough for 35,000 files, which is a pretty
>>>>>>>>>>>>>>>>> small job for ManifoldCF.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how
>>>>>>>>>>>>>>>>>> zookeeper is different from file-based sync. I also need
>>>>>>>>>>>>>>>>>> guidance on how to manage my PC's memory. How many GB
>>>>>>>>>>>>>>>>>> should I allocate for the start-agent of ManifoldCF? Is
>>>>>>>>>>>>>>>>>> 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddy...@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you Mr Karl for your quick response. I have
>>>>>>>>>>>>>>>>>>>> looked into the ManifoldCF log file and extracted the
>>>>>>>>>>>>>>>>>>>> following warnings:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>>>>>>>>   'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\syncharea\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock'
>>>>>>>>>>>>>>>>>>>>   failed : Access is denied.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full.
>>>>>>>>>>>>>>>>>>>>   Shutting down process; locks may be left dangling.
>>>>>>>>>>>>>>>>>>>>   You must cleanup before restarting.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ES (lowercase) synapses being the elasticsearch output
>>>>>>>>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract
>>>>>>>>>>>>>>>>>>>> metadata and a file system as a repository connection.
>>>>>>>>>>>>>>>>>>>> During the job, I don't extract the content of the
>>>>>>>>>>>>>>>>>>>> documents. I was wondering if the issue comes from
>>>>>>>>>>>>>>>>>>>> elasticsearch?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddy...@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks
>>>>>>>>>>>>>>>>>>>>> like it might go away on retry, but does not. It can
>>>>>>>>>>>>>>>>>>>>> be either on the repository side or on the output
>>>>>>>>>>>>>>>>>>>>> side. If you look at the Simple History in the UI, or
>>>>>>>>>>>>>>>>>>>>> at the manifoldcf.log file, you should be able to get
>>>>>>>>>>>>>>>>>>>>> a better sense of what went wrong. Without further
>>>>>>>>>>>>>>>>>>>>> information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93oth...@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société
>>>>>>>>>>>>>>>>>>>>>> Générale in France. I'm actually using your recent
>>>>>>>>>>>>>>>>>>>>>> version of manifoldCF 2.8. I'm working on an internal
>>>>>>>>>>>>>>>>>>>>>> search engine. For this reason, I'm using manifoldcf
>>>>>>>>>>>>>>>>>>>>>> in order to index documents on windows shares. I
>>>>>>>>>>>>>>>>>>>>>> encountered a serious problem while crawling 35K
>>>>>>>>>>>>>>>>>>>>>> documents. Most of the time, when manifoldcf starts
>>>>>>>>>>>>>>>>>>>>>> crawling a large document (19 MB, for example), it
>>>>>>>>>>>>>>>>>>>>>> ends the job with the following error: repeated
>>>>>>>>>>>>>>>>>>>>>> service interruptions - failure processing document :
>>>>>>>>>>>>>>>>>>>>>> software caused connection abort: socket write error.
>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this
>>>>>>>>>>>>>>>>>>>>>> problem, please?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
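
For readers following along, the jar moves requested over the course of this thread can be sketched as a small Python script. This is only an illustration, not part of ManifoldCF: the jar names and the connector-common-lib and lib directory names come from the messages above, while the function names, the dry-run flag, and the assumption that both directories sit directly under one installation directory are hypothetical. It should only be run with all MCF processes shut down, and options.env may still need its classpath edited by hand, as mentioned earlier in the thread.

```python
import shutil
from pathlib import Path

# Jars the thread asks to move from connector-common-lib to lib.
TO_LIB = [
    "xmlbeans-2.6.0.jar",
    "poi-ooxml-schemas-3.15.jar",
    "poi-ooxml-3.15.jar",
    "dom4j-1.6.1.jar",
    "curvesapi-1.04.jar",
]

# Jars the final message asks to move back to connector-common-lib.
BACK_TO_COMMON = [
    "poi-3.15.jar",
    "commons-collections4-4.1.jar",
]


def plan_moves(mcf_home):
    """Return (source, destination) pairs; nothing is touched on disk.

    Assumption: mcf_home contains both 'connector-common-lib' and 'lib'.
    """
    home = Path(mcf_home)
    common = home / "connector-common-lib"
    lib = home / "lib"
    moves = [(common / name, lib / name) for name in TO_LIB]
    moves += [(lib / name, common / name) for name in BACK_TO_COMMON]
    return moves


def apply_moves(moves, dry_run=True):
    """Move each jar found at its source; return the names handled."""
    handled = []
    for src, dst in moves:
        if not src.exists():
            continue  # already in place, or absent from this distribution
        if not dry_run:
            shutil.move(str(src), str(dst))
        handled.append(src.name)
    return handled
```

Calling apply_moves(plan_moves(home)) with the default dry_run=True only reports which jars would be moved; passing dry_run=False performs the moves. Jars that are already where they belong are skipped, so the script can be re-run safely.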