Update Tika

Julien Thu, 14 Sep 2017 07:12:01 -0700

Hi Karl,

I want to put the latest version of Tika (1.16) into MCF 2.8.1. I already did 
this with MCF 2.6 and Tika 1.15, but not in a clean way: I put all the Tika 
libs in the ‘lib’ and ‘connector-common-lib’ folders of MCF and removed the MCF 
libs that were an older version. Although it was not clean, it worked.
This time I want to do things properly, but I don’t really understand how 
the libs are split between the folders ‘lib’, ‘connector-common-lib’ and 
‘connector-lib’, knowing that the Tika libs are spread between them.
I have to be careful because, for example, I noticed that MCF 2.8.1 uses 
version 21 of the guava lib whereas Tika uses version 17, and there have been 
some major changes since version 19 that could break something in Tika.
Can you help me a little with this, please?


Thanks,
Julien
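
A minimal sketch of the kind of check described above: compare the major 
version of each bundled jar with the version another component expects. The 
guava versions (21 and 17) come from the message; the exact file names, the 
`required_by_tika` mapping and the function names are hypothetical, and the 
naming pattern `artifact-version.jar` is an assumption.

```python
import re

# Matches names like "guava-21.0.jar" -> ("guava", "21.0").
# Assumes jars follow the common "artifact-version.jar" naming pattern.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def parse_jar(filename):
    """Split a jar file name into (artifact, version), or return None."""
    match = JAR_NAME.match(filename)
    if match is None:
        return None
    return match.group("artifact"), match.group("version")


def major(version):
    """Return the leading integer of a version string such as '21.0'."""
    return int(re.match(r"\d+", version).group())


def find_major_mismatches(bundled_jars, required):
    """Report artifacts whose bundled major version differs from the required one."""
    mismatches = {}
    for name in bundled_jars:
        parsed = parse_jar(name)
        if parsed is None:
            continue
        artifact, version = parsed
        wanted = required.get(artifact)
        if wanted is not None and major(version) != major(wanted):
            mismatches[artifact] = (version, wanted)
    return mismatches


# Hypothetical inputs for illustration only.
bundled = ["guava-21.0.jar", "poi-3.15.jar"]
required_by_tika = {"guava": "17.0", "poi": "3.15"}
print(find_major_mismatches(bundled, required_by_tika))
```

A major-version difference is only a hint that something may break; it does 
not prove an incompatibility.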

From: Karl Wright
Sent: Friday, 8 September 2017 21:43
To: user@manifoldcf.apache.org
Subject: Re: Question about ManifoldCF 2.8

Hi Othman,

There are two properties files for zookeeper: the global properties, and the 
local (zookeeper managed) properties.  The database configuration is in the 
zookeeper managed properties.

Please examine the following page for setting up Postgresql properties:

https://manifoldcf.apache.org/release/release-2.8.1/en_US/how-to-build-and-deploy.html

Indexable files are those that the output connector says can be indexed.  
It's a function of the output connector and its configuration.

Thanks,
Karl



On Fri, Sep 8, 2017 at 2:07 PM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Sorry to bother you again, but what is the difference between indexable files 
and files in the path tab of a job ? 

Thanks,

Othman BELHAJ

On Fri, 8 Sep 2017 at 19:27, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl,

My zookeeper is still pointing to the HSQL database. What should I do in order 
to change it so that it points to my PostgreSQL database ?

Best regards,

Othman Belhaj .

On Wed, 6 Sep 2017 at 15:34, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Thank you, Karl. I will try to combine Postgresql with zookeeper and let you 
know.

Othman.

On Wed, 6 Sep 2017 at 13:18, Karl Wright <daddy...@gmail.com> wrote:
No, you can use whatever supported database you like.

Karl


On Wed, Sep 6, 2017 at 6:58 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
As far as I know, when you use zookeeper, you are obliged to use HSQLDB to go 
with it, right?

Thanks,
Othman 

On Wed, 6 Sep 2017 at 12:56, Karl Wright <daddy...@gmail.com> wrote:
Hi Othman,

HSQLDB stores all tables in memory so you need to size it accordingly.  That is 
one reason we prefer Postgresql for production deployments.

Thanks,
Karl


On Wed, Sep 6, 2017 at 6:21 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl, 

I resolved the elasticsearch problem; however, the application doesn't seem to 
work after I have run a job to crawl over 500k documents. I get a "GC overhead 
limit exceeded" error in the hsql database. How much memory should I allocate 
for it? 

Best regards,

Othman

On Tue, 5 Sep 2017 at 12:43, Karl Wright <daddy...@gmail.com> wrote:
Hi Othman,

Thanks for doing the evaluation of the problem.

Generally, the ManifoldCF project does not have the expertise to diagnose 
problems with external systems like Solr or Elasticsearch.  So going to another 
newsgroup for those kinds of issues would be a good idea.

Thanks!
Karl


On Tue, Sep 5, 2017 at 4:33 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl, 

I have analyzed the error and found out that it was mainly an elasticsearch 
problem. I saw in some forums that one of the adopted solutions is to modify 
elasticsearch.yml and set http.max_content_length to a greater value. 
However, the job got stuck on the last two indexable files (two pptx files 
of 22 MB and 2 MB respectively). The job eventually ended, but a stack trace 
showed that elasticsearch ran out of memory. For your information, I have 
allocated 4 GB for elasticsearch. Is that enough to get good performance? 
You will find attached the stack traces of elasticsearch. 

Best regards,

Othman BELHAJ.
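
A small sketch for listing the files that are larger than a given size before 
a crawl, which may help to spot documents likely to hit a size limit such as 
http.max_content_length. The function name and the limit value are 
hypothetical, and the size of the request actually sent to elasticsearch is 
not necessarily the same as the file size on disk.

```python
import os


def files_over_limit(root, limit_bytes):
    """Yield (path, size) for every file under root larger than limit_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file
            if size > limit_bytes:
                yield path, size


if __name__ == "__main__":
    # Hypothetical threshold of 20 MB; adjust to the limit you configured.
    for path, size in files_over_limit(".", 20 * 1024 * 1024):
        print(f"{size:>12}  {path}")
```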

On Mon, 4 Sep 2017 at 16:40, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl,

I'm sorry to bother you on your holiday. I will try to analyze it today and let 
you know what I have found. Enjoy your day !

Best regards,

Othman BELHAJ.

On Mon, 4 Sep 2017 at 16:06, Karl Wright <daddy...@gmail.com> wrote:
Hi Othman,

I won't be able to look at this today; it is a holiday here.  But, the "socket 
write" error is coming from ElasticSearch.  If ES is configured to not accept 
documents greater than a certain size, that might explain it.  Maybe the ES 
logs would help?

I'm afraid you're going to need to do the work to find out what is going wrong 
in those cases now.

Thanks,
Karl


On Mon, Sep 4, 2017 at 4:53 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl,

This morning, I tried the zookeeper-based example and it worked really well. 
However, I still have one error which is bugging me. It is a socket write 
error. You will find attached the simple history report. Surprisingly, I didn't 
have any stack trace in the ManifoldCF log file. 

Best regards,

Othman.

On Fri, 1 Sep 2017 at 19:39, Karl Wright <daddy...@gmail.com> wrote:
This is from file locking yet again.

I have uploaded a new RC.  Please download and try out the zookeeper locking.

https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.8.1

Karl


On Fri, Sep 1, 2017 at 1:11 PM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
There is another issue as well that gives the following stack trace.

Othman. 

On Fri, 1 Sep 2017 at 18:05, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl, 

I took the binary from the ManifoldCF 2.8.1 RC0. It had the version 3.9 of POI 
and when I changed the version to 3.15 it worked fine. I really want to try the 
zookeeper if as you told me its performance is better than the file-based 
example. For the time being, I'm using the file-based because it is the only 
part that works for me but I actually need a stable version for my production 
environment. That is one point. 
Another point: the Paths tab is still an issue for me because I exclude 
some files and it still crawls them. I want to exclude some specific file 
extensions and some specific directories. For instance, I don't want to index 
.exe files or anything that contains a specific word. I proceed as follows: I 
make the first exclude with *.exe and the second one with *word*. Only the 
second one doesn't work. How can I solve this issue, please?

Thank you very much, have a nice week-end,

Othman 
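
To illustrate how two such wildcard excludes behave, and how letter case can 
make the second one miss, here is a sketch using Python's generic `fnmatch` 
wildcards. This is not ManifoldCF's own matcher, whose rules may differ; the 
file names and the `is_excluded` helper are hypothetical.

```python
from fnmatch import fnmatchcase

EXCLUDES = ["*.exe", "*word*"]  # the two patterns from the message above


def is_excluded(path, patterns=EXCLUDES, ignore_case=False):
    """Return True if path matches any exclude pattern."""
    if ignore_case:
        # Lower-case both sides so the comparison ignores letter case.
        path = path.lower()
        patterns = [p.lower() for p in patterns]
    return any(fnmatchcase(path, p) for p in patterns)


print(is_excluded("setup.exe"))                            # True
print(is_excluded("My WORD notes.txt"))                    # False
print(is_excluded("My WORD notes.txt", ignore_case=True))  # True
```
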
On Fri, 1 Sep 2017 at 16:46, Karl Wright <daddy...@gmail.com> wrote:
Hi Othman,

I will respin a new 2.8.1 (RC1) to address the zookeeper issue.

The failure you are seeing is "NoSuchMethodError".  Therefore, the class is 
being found, but it is the *wrong* class.  When you deployed the new release, 
did you deploy it in a new directory, or did you overwrite the previous 
deployment?  If you overwrote it, you probably have multiple versions of the 
POI jars.

Karl
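
One way to check for multiple versions of the same jar is sketched below. It 
scans the three library folders mentioned in this thread and reports every 
artifact present in more than one version. The function name and the 
`artifact-version.jar` naming assumption are mine, not part of ManifoldCF.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumes jars follow the common "artifact-version.jar" naming pattern.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")

FOLDERS = ("lib", "connector-common-lib", "connector-lib")


def find_duplicate_artifacts(install_dir, folders=FOLDERS):
    """Map each artifact found in more than one version to its jar paths."""
    seen = defaultdict(dict)  # artifact -> {version: [relative paths]}
    root = Path(install_dir)
    for folder in folders:
        if not (root / folder).is_dir():
            continue
        for jar in sorted((root / folder).glob("*.jar")):
            match = JAR_NAME.match(jar.name)
            if match:
                paths = seen[match["artifact"]].setdefault(match["version"], [])
                paths.append(f"{folder}/{jar.name}")
    return {a: v for a, v in seen.items() if len(v) > 1}
```

Calling `find_duplicate_artifacts` on the deployment directory returns an empty 
dict when every artifact appears in a single version.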


On Fri, Sep 1, 2017 at 9:59 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl, 

I have just tried the new release of ManifoldCF. At first, the first job ended 
normally, but in the second I got a new stack trace concerning the POI. 
Moreover, the runzookeeper.bat doesn't run properly. It shows me the stack 
trace attached.

Ps:
The second attached file contains the POI stack trace. 

Othman.

On Fri, 1 Sep 2017 at 12:21, Karl Wright <daddy...@gmail.com> wrote:
Hi Othman,

You do not need a new database instance.

You can download MCF 2.8.1 RC0 from here:

https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.8.1

Karl


On Fri, Sep 1, 2017 at 5:42 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl,

Thank you very much for your help, I'm going to try out the zookeeper example. 
Should I initialize a new database? And how can I run the zookeeper start-agent 
? 

Othman.

On Fri, 1 Sep 2017 at 11:37, Karl Wright <daddy...@gmail.com> wrote:
Hi Othman,

These exceptions are now coming from file locking and are due to permissions 
problems.  I suggest you go to Zookeeper for file locking.

I am building a 2.8.1 release candidate.  When it is available for download, 
I'll send you the URL.

Thanks,
Karl


On Fri, Sep 1, 2017 at 5:27 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl,

This morning, I have followed the steps you told me to do and I still got stack 
traces. I have attached the stack traces as well as the content of my lib repo 
and option.env.
I have installed zookeeper and I'm ready to use the zookeeper example. Could 
you guide me through it? I don't know whether following the same steps as in 
the file-based example will avoid the stack traces. 

Thanks,
Othman 

On Thu, 31 Aug 2017 at 18:19, Karl Wright <daddy...@gmail.com> wrote:
Please do the following:

(0) Shut down all ManifoldCF processes.
(1) Move poi*.jar from connector-common-lib to lib.
(2) Move dom4j*.jar from connector-common-lib to lib.
(3) Move commons-collections4*.jar from connector-common-lib to lib.
(4) Move xmlbeans*.jar from connector-common-lib to lib.
(5) Move curvesapi*.jar from connector-common-lib to lib.
(6) Modify your options.env to include all of the jars you moved.
(7) Start up all ManifoldCF processes.
(8) If you still get stack traces, please send them to me.

Karl
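
The jar moves in steps (1) to (5) can be sketched as a small script. This is 
only an illustration under assumptions: the function name and the `dry_run` 
switch are hypothetical, it must be run with all ManifoldCF processes stopped, 
and it does not edit options.env for you.

```python
import shutil
from pathlib import Path

# Patterns follow steps (1) to (5); step (4) is taken to mean the xmlbeans jar.
PATTERNS = ["poi*.jar", "dom4j*.jar", "commons-collections4*.jar",
            "xmlbeans*.jar", "curvesapi*.jar"]


def move_jars(install_dir, patterns=PATTERNS, dry_run=True):
    """Move matching jars from connector-common-lib to lib; return their names."""
    source = Path(install_dir) / "connector-common-lib"
    target = Path(install_dir) / "lib"
    moved = []
    for pattern in patterns:
        for jar in sorted(source.glob(pattern)):
            if not dry_run:
                shutil.move(str(jar), str(target / jar.name))
            moved.append(jar.name)
    return moved
```

With `dry_run=True` (the default) nothing is moved and the function only 
returns the names it would move, which can then be used for step (6).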


On Thu, Aug 31, 2017 at 12:12 PM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl, 

By 'other place', do you mean the \lib repository? If so, then I have 
already tried it and it didn't work.

Othman.

On Thu, 31 Aug 2017 at 18:07, Karl Wright <daddy...@gmail.com> wrote:
Hi Othman,

I used the java dependency inspector to see what the issue is and it turns out 
that poi-ooxml.jar does refer back to poi.jar in the class that is failing.  So 
you will need to move poi-3.15.jar and commons-collections4-4.1.jar to the 
other place as well.

Let's hope that finally fixes this issue.

I'm very unhappy about the quality of the POI project code; it is definitely 
not using reasonable engineering practices, and I will be opening a ticket with 
them.

Thanks,
Karl


On Thu, Aug 31, 2017 at 11:57 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
I'm using the file-based example, and I reproduced all the changes you told me 
to make in it. I'll try to install zookeeper and use the zookeeper example. 
Will I need to do any configuration in order to run the zookeeper example ? 

Othman.

On Thu, 31 Aug 2017 at 17:46, Karl Wright <daddy...@gmail.com> wrote:
Are you using the zookeeper example, or the file-based example?

If these jars have all been moved, and the options.env includes them, then I 
have to conclude that Apache POI's pom.xml is incorrect too.  It will take a 
while to figure out what's missing that poi-ooxml.jar needs that is not listed.

Karl


On Thu, Aug 31, 2017 at 11:39 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
All the dependencies you mentioned have already been added in the 
options.env.win file in the multiprocess-file-example repository. 

On Thu, 31 Aug 2017 at 17:33, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Yes, I added it in the options.env.win file. Should it be the one in the 
multiprocess-zk-example document or multiprocess-file-example ? 

On Thu, 31 Aug 2017 at 17:30, Karl Wright <daddy...@gmail.com> wrote:
It's not related at all to elasticsearch.
Karl


On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Could it be a problem of elasticsearch's version ? I'm actually using 2.1.0 
which is pretty old for this new version of ManifoldCF?

Othman.

On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
I moved back both the jars you mentioned and a different error is showing. You 
will find the stack trace attached. 

Thanks,
Othman 

On Thu, 31 Aug 2017 at 17:09, Karl Wright <daddy...@gmail.com> wrote:
I've looked at the dependencies; you should not have moved poi-3.15.jar.  
Please move that back, and commons-collections4-4.1.jar too.

You *will* need to move curvesapi-1.04.jar though.

Thanks,
Karl


On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <daddy...@gmail.com> wrote:
If you include poi.jar, then all dependencies of poi.jar must also be included. 
 This would mean that curvesapi-1.04.jar and commons-collections4-4.1.jar 
should also be included.

Karl

On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Karl, 

I added the two jars that you have mentioned and another one : poi-3.15.jar . 
Unfortunately, there is another error showing. This time, it concerns excel 
files. You will find attached the stack trace. 

Othman.

On Thu, 31 Aug 2017 at 15:32, Karl Wright <daddy...@gmail.com> wrote:
Hi Othman,

Yes, this shows that the jar we moved calls back into another jar, which will 
also need to be moved.  *That* jar has yet another dependency too.

The list of jars is thus extended to include:

poi-ooxml-3.15.jar
dom4j-1.6.1.jar

Karl


On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
You will find attached the stack trace. My apologies for the bad quality of the 
image, I'm doing my best to send you the stack trace as I don't have the right 
to send documents outside the company.

Thank you for your time,

Othman 

On Thu, 31 Aug 2017 at 15:16, Karl Wright <daddy...@gmail.com> wrote:
Once again, I need a stack trace to diagnose what the problem is.

Thanks,
Karl


On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Oh, actually it didn't solve the problem. I looked into the log file and saw 
the following error:

Error tossed : org/apache/poi/POIXMLTypeLoader
java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.

Maybe another jar is missing ?

Othman. 
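
To find out which jar, if any, provides a class named in a 
NoClassDefFoundError, the jars can be scanned as ordinary zip archives. A 
minimal sketch follows; the function name is hypothetical and the folder list 
is whatever library folders you want to search.

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, folders):
    """Return the jars under the given folders that contain the class."""
    # Accept both "org.apache.poi.X" and "org/apache/poi/X" spellings.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for folder in folders:
        for jar in sorted(Path(folder).glob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as archive:
                    if entry in archive.namelist():
                        hits.append(str(jar))
            except zipfile.BadZipFile:
                continue  # skip files that are not valid archives
    return hits
```

For example, `jars_containing("org/apache/poi/POIXMLTypeLoader", ["lib", 
"connector-common-lib"])` lists the jars in those two folders that contain the 
class from the error above. An empty result means the class is in neither 
folder.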

On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
I have tried what you told me to do and, as you expected, the crawling resumed. 
How about the regular expressions? How can I write complex regular expressions 
in the job's Paths tab ?

Thank you very much for your help.

Othman. 


On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Ok, I will try it right away and let you know if it works. 

Othman.

On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddy...@gmail.com> wrote:
Oh, and you also may need to edit your options.env files to include them in the 
classpath for startup.

Karl


On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddy...@gmail.com> wrote:
If you are amenable, there is another workaround you could try.  Specifically:

(1) Shut down all MCF processes.
(2) Move the following two files from connector-common-lib to lib:

xmlbeans-2.6.0.jar
poi-ooxml-schemas-3.15.jar

(3) Restart everything and see if your crawl resumes.

Please let me know what happens.

Karl



On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddy...@gmail.com> wrote:
I created a ticket for this: CONNECTORS-1450.

One simple workaround is to use the external Tika server transformer rather 
than the embedded Tika Extractor.  I'm still looking into why the jar is not 
being found.

Karl


On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Yes, I'm actually using the latest binary version, and my job got stuck on that 
specific file. 
The job status is still Running. You can see it in the attached file. For your 
information, the job started yesterday. 

Thanks, 

Othman

On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddy...@gmail.com> wrote:
It looks like a dependency of Apache POI is missing.
I think we will need a ticket to address this, if you are indeed using the 
binary distribution.

Thanks!
Karl

On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
I'm actually using the binary version. For security reasons, I can't send any 
files from my computer. I have copied the stack trace and scanned it with my 
cellphone. I hope it will be helpful. Meanwhile, I have read the documentation 
about how to restrict the crawling and I don't think the '|' works in the 
specified. For instance, I would like to restrict the crawling for the 
documents that contain the word 'sound'. I proceed as follows: *(SON)* . The 
document is in capital letters and I noticed that it wasn't taken into 
consideration. 

Thanks, 
Othman



On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddy...@gmail.com> wrote:
Hi Othman,

The way you restrict documents with the windows share connector is by 
specifying information on the "Paths" tab in jobs that crawl windows shares.  
There is end-user documentation both online and distributed with all binary 
distributions that describe how to do this.  Have you found it?

Karl


On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hello Karl, 

Thank you for your response. I will start using zookeeper and I will let you 
know if it works. I have another question to ask. I need to apply some 
filters while crawling: I don't want to crawl some files and some folders. 
Could you give me an example of how to use the regex? Does the regex allow 
using /i to ignore case ? 

Thanks, 
Othman
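
For reference, a sketch of case-insensitive matching in a regex engine that 
has no trailing /i modifier. It uses Python's `re` module: case is ignored 
either through a flag passed at compile time or through the inline `(?i)` 
prefix (Java's regex engine also accepts `(?i)`). Whether the Paths tab 
accepts either form is not stated in this thread, and the file name is 
hypothetical.

```python
import re

pattern_text = r".*son.*"
strict = re.compile(pattern_text)                  # case-sensitive
relaxed = re.compile(pattern_text, re.IGNORECASE)  # flag at compile time
inline = re.compile(r"(?i).*son.*")                # flag inside the pattern

name = "RAPPORT_SON.DOC"
print(bool(strict.fullmatch(name)))   # False
print(bool(relaxed.fullmatch(name)))  # True
print(bool(inline.fullmatch(name)))   # True
```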

On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddy...@gmail.com> wrote:
Hi Beelz,

File-based sync is deprecated because people often have problems with getting 
file permissions right, and they do not understand how to shut processes down 
cleanly, and zookeeper is resilient against that.  I highly recommend using 
zookeeper sync.

ManifoldCF is engineered to not put files into memory so you do not need huge 
amounts of memory.  The default values are more than enough for 35,000 files, 
which is a pretty small job for ManifoldCF.

Thanks,
Karl


On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
I'm actually not using zookeeper. I want to know how zookeeper is different 
from file-based sync. I also need guidance on how to manage my PC's memory. 
How many GB should I allocate for the start-agent of ManifoldCF? Is 4 GB enough 
to crawl 35K files ?

Othman. 

On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddy...@gmail.com> wrote:
Your disk is not writable for some reason, and that's interfering with 
ManifoldCF 2.8 locking.

I would suggest two things:

(1) Use Zookeeper for sync instead of file-based sync.
(2) Have a look if you still get failures after that.

Thanks,
Karl
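
A quick way to test whether a directory is writable, sketched under the 
assumption that it is run as the same user account that runs the ManifoldCF 
processes. The function name is hypothetical; pass it the path of the synch 
directory reported in the log.

```python
import tempfile


def is_writable(directory):
    """Return True if a temporary file can be created and removed in directory."""
    try:
        # The file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError:
        return False
    return True
```

A `False` result covers both a missing directory and a permissions problem, so 
check that the path exists before concluding that access is denied.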


On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hi Mr Karl, 

Thank you Mr Karl for your quick response. I have looked into the ManifoldCF 
log file and extracted the following warnings :

- Attempt to set file lock 
'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch 
area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' 
failed : Access is denied.


- Couldn't write to lock file; disk may be full. Shutting down process; locks 
may be left dangling. You must cleanup before restarting.

ES (lowercase) synapses being the elasticsearch output connection. Moreover, 
the job uses Tika to extract metadata and a file system as a repository 
connection. During the job, I don't extract the content of the documents. I was 
wondering if the issue comes from elasticsearch ?

Othman. 



On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddy...@gmail.com> wrote:
Hi Othman,

ManifoldCF aborts a job if there's an error that looks like it might go away on 
retry, but does not.  It can be either on the repository side or on the output 
side.  If you look at the Simple History in the UI, or at the manifoldcf.log 
file, you should be able to get a better sense of what went wrong.  Without 
further information, I can't say any more.

Thanks,
Karl


On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93oth...@gmail.com> wrote:
Hello,

I'm Othman Belhaj, a software engineer at Société Générale in France. I'm 
currently using your recent version, ManifoldCF 2.8, to build an internal 
search engine, and I use it to index documents on windows shares. I 
encountered a serious problem while crawling 35K documents. Most of the time, 
when ManifoldCF starts crawling a large document (19 MB for example), it ends 
the job with the following error: 
repeated service interruptions - failure processing document : software caused 
connection abort: socket write error. 
Can you give me some tips on how to solve this problem, please ? 

I use PostgreSQL 9.3.x and elasticsearch 2.1.0 .
I'm looking forward to your response.

Best regards, 

Othman BELHAJ