If you include poi.jar, then all of poi.jar's dependencies must come along too: curvesapi-1.04.jar and commons-collections4-4.1.jar.
Karl

On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:

> Hi Karl,
>
> I added the two jars that you mentioned, plus another one: poi-3.15.jar.
> Unfortunately, another error is showing; this time it concerns Excel
> files. You will find the stack trace attached.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 15:32, Karl Wright <daddy...@gmail.com> wrote:
>
>> Hi Othman,
>>
>> Yes, this shows that the jar we moved calls back into another jar, which
>> will also need to be moved. *That* jar has yet another dependency too.
>>
>> The list of jars is thus extended to include:
>>
>> poi-ooxml-3.15.jar
>> dom4j-1.6.1.jar
>>
>> Karl
>>
>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>
>>> You will find the stack trace attached. My apologies for the bad
>>> quality of the image; I'm doing my best to send you the stack trace, as
>>> I don't have the right to send documents outside the company.
>>>
>>> Thank you for your time,
>>>
>>> Othman
>>>
>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>
>>>>> Oh, actually it didn't solve the problem. I looked into the log file
>>>>> and saw the following error:
>>>>>
>>>>> Error tossed: org/apache/poi/POIXMLTypeLoader
>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader
>>>>>
>>>>> Maybe another jar is missing?
>>>>>
>>>>> Othman.
>>>>>
>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>
>>>>>> I tried what you told me to do and, as you expected, the crawling
>>>>>> resumed. How about the regular expressions? How can I write complex
>>>>>> regular expressions in the job's Paths tab?
>>>>>>
>>>>>> Thank you very much for your help.
>>>>>>
>>>>>> Othman.
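When a `java.lang.NoClassDefFoundError` like the one above appears, it helps to find which jar actually provides the missing class. A minimal sketch (editor's addition, not from the thread; assumes a Unix-like shell run from the ManifoldCF install directory) relies on the fact that zip entry names are stored uncompressed inside jar files, so a plain `grep` over the jars works:

```shell
# Print every jar under the current directory whose entry table
# mentions the missing class name.
find . -name '*.jar' -exec grep -l 'POIXMLTypeLoader' {} +
```

Run against the binary distribution, this should point at the POI jar(s) that still need to be moved onto the startup classpath.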
>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>
>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>
>>>>>>> Othman.
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Oh, and you may also need to edit your options.env files to include
>>>>>>>> them in the classpath for startup.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> If you are amenable, there is another workaround you could try.
>>>>>>>>> Specifically:
>>>>>>>>>
>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>>>>>
>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>
>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>
>>>>>>>>> Please let me know what happens.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>
>>>>>>>>>> One simple workaround is to use the external Tika server
>>>>>>>>>> transformer rather than the embedded Tika extractor. I'm still
>>>>>>>>>> looking into why the jar is not being found.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my job
>>>>>>>>>>> got stuck on that specific file.
>>>>>>>>>>> The job status is still Running; you can see it in the attached
>>>>>>>>>>> file. For your information, the job started yesterday.
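The workaround above (stop the processes, move jars from connector-common-lib to lib, restart) can be sketched as a small shell helper. This is an editor's sketch, not part of the thread: the install path and the exact jar list are assumptions to adapt to your installation (the thread later extends the list with poi-ooxml-3.15.jar and dom4j-1.6.1.jar, among others).

```shell
#!/bin/sh
# Move POI-related jars from connector-common-lib to lib so they land
# on the core startup classpath. Run only after all MCF processes are
# stopped. Jar names and paths are assumptions; adjust as needed.
move_poi_jars() {
    home="$1"
    for jar in xmlbeans-2.6.0.jar poi-ooxml-schemas-3.15.jar \
               poi-ooxml-3.15.jar dom4j-1.6.1.jar; do
        if [ -f "$home/connector-common-lib/$jar" ]; then
            mv "$home/connector-common-lib/$jar" "$home/lib/"
        fi
    done
}

# Example: move_poi_jars /opt/apache-manifoldcf-2.8
```

After the move, the options.env files may also need the jars added to the startup classpath, as Karl notes above.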
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Othman
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>> I think we will need a ticket to address this, if you are
>>>>>>>>>>>> indeed using the binary distribution.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm actually using the binary version. For security reasons, I
>>>>>>>>>>>>> can't send any files from my computer, so I copied the stack
>>>>>>>>>>>>> trace and scanned it with my cellphone. I hope it will be
>>>>>>>>>>>>> helpful. Meanwhile, I have read the documentation on how to
>>>>>>>>>>>>> restrict the crawling, and I don't think the '|' works in the
>>>>>>>>>>>>> specified paths. For instance, I would like to restrict the
>>>>>>>>>>>>> crawl to documents containing the word 'sound'. I proceed as
>>>>>>>>>>>>> follows: *(SON)*. The document name is in capital letters, and
>>>>>>>>>>>>> I noticed that it wasn't taken into consideration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The way you restrict documents with the Windows share
>>>>>>>>>>>>>> connector is by specifying information on the "Paths" tab of
>>>>>>>>>>>>>> jobs that crawl Windows shares. There is end-user
>>>>>>>>>>>>>> documentation, both online and distributed with all binary
>>>>>>>>>>>>>> distributions, that describes how to do this.
>>>>>>>>>>>>>> Have you found it?
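On the case-sensitivity question above: ManifoldCF is Java-based, and Java regular expressions have no trailing `/i` modifier; the inline `(?i)` flag serves that role. A small sketch (editor's addition; the sample file names are invented, and it assumes the Paths tab field accepts Java regular expressions) showing case-insensitive matching together with `|` alternation:

```java
import java.util.regex.Pattern;

public class CaseInsensitiveMatch {
    public static void main(String[] args) {
        // (?i) makes the whole pattern case-insensitive; '|' gives
        // alternation, so this matches names containing SON or SOUND
        // in any letter case.
        Pattern p = Pattern.compile("(?i).*(son|sound).*");
        System.out.println(p.matcher("REUNION_SON_2017.docx").matches()); // true
        System.out.println(p.matcher("budget_2017.xlsx").matches());      // false
    }
}
```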
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for your response. I will start using Zookeeper
>>>>>>>>>>>>>>> and will let you know if it works. I have another question:
>>>>>>>>>>>>>>> I need to apply some filters while crawling, because I don't
>>>>>>>>>>>>>>> want to crawl certain files and folders. Could you give me
>>>>>>>>>>>>>>> an example of how to use the regex? Does the regex allow
>>>>>>>>>>>>>>> using /i to ignore case?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>>>>>> problems getting file permissions right and do not
>>>>>>>>>>>>>>>> understand how to shut processes down cleanly; Zookeeper is
>>>>>>>>>>>>>>>> resilient against that. I highly recommend using Zookeeper
>>>>>>>>>>>>>>>> sync.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ManifoldCF is engineered not to load files into memory, so
>>>>>>>>>>>>>>>> you do not need huge amounts of memory. The default values
>>>>>>>>>>>>>>>> are more than enough for 35,000 files, which is a pretty
>>>>>>>>>>>>>>>> small job for ManifoldCF.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm actually not using Zookeeper. I want to know: how is
>>>>>>>>>>>>>>>>> Zookeeper sync different from file-based sync?
>>>>>>>>>>>>>>>>> I also need guidance on how to manage my PC's memory. How
>>>>>>>>>>>>>>>>> many GB should I allocate to the ManifoldCF agents
>>>>>>>>>>>>>>>>> process? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>> (2) See whether you still get failures after that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for your quick response. I have looked into
>>>>>>>>>>>>>>>>>>> the ManifoldCF log file and extracted the following
>>>>>>>>>>>>>>>>>>> warnings:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full.
>>>>>>>>>>>>>>>>>>> Shutting down process; locks may be left dangling. You
>>>>>>>>>>>>>>>>>>> must clean up before restarting.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> "ES (lowercase) synapses" being the Elasticsearch output
>>>>>>>>>>>>>>>>>>> connection.
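On the memory question above: in the multiprocess-file-example layout mentioned in the lock path, the process heap is set via the options.env.win / options.env.unix files, one JVM option per line. A sketch (editor's addition; exact file names and defaults vary by release, so verify against your distribution):

```
-Xms512m
-Xmx512m
```

As Karl notes, the defaults are ample for 35K documents, since ManifoldCF streams documents rather than buffering them in memory; 4 GB should not be necessary.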
>>>>>>>>>>>>>>>>>>> Moreover, the job uses Tika to extract metadata and a
>>>>>>>>>>>>>>>>>>> file system as a repository connection. During the job,
>>>>>>>>>>>>>>>>>>> I don't extract the content of the documents. I was
>>>>>>>>>>>>>>>>>>> wondering whether the issue comes from Elasticsearch?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks
>>>>>>>>>>>>>>>>>>>> like it might go away on retry, but does not. It can
>>>>>>>>>>>>>>>>>>>> be on either the repository side or the output side.
>>>>>>>>>>>>>>>>>>>> If you look at the Simple History in the UI, or at the
>>>>>>>>>>>>>>>>>>>> manifoldcf.log file, you should be able to get a
>>>>>>>>>>>>>>>>>>>> better sense of what went wrong. Without further
>>>>>>>>>>>>>>>>>>>> information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société
>>>>>>>>>>>>>>>>>>>>> Générale in France. I'm actually using your recent
>>>>>>>>>>>>>>>>>>>>> version of ManifoldCF, 2.8. I'm working on an
>>>>>>>>>>>>>>>>>>>>> internal search engine, and for this reason I'm using
>>>>>>>>>>>>>>>>>>>>> ManifoldCF to index documents on Windows shares. I
>>>>>>>>>>>>>>>>>>>>> encountered a serious problem while crawling 35K
>>>>>>>>>>>>>>>>>>>>> documents.
>>>>>>>>>>>>>>>>>>>>> Most of the time, when ManifoldCF starts crawling a
>>>>>>>>>>>>>>>>>>>>> large document (19 MB, for example), it ends the job
>>>>>>>>>>>>>>>>>>>>> with the following error: repeated service
>>>>>>>>>>>>>>>>>>>>> interruptions - failure processing document: software
>>>>>>>>>>>>>>>>>>>>> caused connection abort: socket write error.
>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this
>>>>>>>>>>>>>>>>>>>>> problem, please?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ