Re: web crawler not sharing cookies

2018-07-25 Thread Gustavo Beneitez
Hi again, Thanks Karl, I was able to do that after defining a "login sequence", but also after filling the database (cookiedata table) with certain values due to "domain constrictions". Before every web call, I suspect Manifold only takes cookies from the URL's exact subdomain (i.e. x.y.z.com), so if
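
The distinction Gustavo suspects here (cookies looked up by exact host rather than by parent domain) can be illustrated with a small sketch. This is not ManifoldCF's code; the class and method names are hypothetical, and the suffix rule is a simplified form of ordinary cookie domain matching that ignores IP addresses and public suffixes.

```java
import java.util.Locale;

// Hypothetical illustration only; not taken from the ManifoldCF web connector.
final class CookieDomainSketch {
    private CookieDomainSketch() {}

    // Strict lookup: the cookie applies only to the exact host it was stored for.
    static boolean exactHostMatch(String requestHost, String cookieDomain) {
        return normalize(requestHost).equals(normalize(cookieDomain));
    }

    // Looser lookup: the cookie also applies to subdomains of its domain.
    static boolean suffixDomainMatch(String requestHost, String cookieDomain) {
        String host = normalize(requestHost);
        String domain = normalize(cookieDomain);
        if (domain.isEmpty()) {
            return false;
        }
        return host.equals(domain) || host.endsWith("." + domain);
    }

    // Lower-case the name and drop a single leading dot (".z.com" -> "z.com").
    private static String normalize(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.startsWith(".") ? lower.substring(1) : lower;
    }
}
```

Under the strict rule, a cookie stored for z.com would not be sent to x.y.z.com, which would be consistent with having to store values for the exact subdomain.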

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
oduction it is not terrible as behavior. > > > > Thanks > > Maxence, > > > > > > *De :* Karl Wright [mailto:daddy...@gmail.com] > *Envoyé :* mardi 24 juillet 2018 17:53 > *À :* user@manifoldcf.apache.org > *Objet :* Re: Out of memory, one file bug i think

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
ence, >> >> >> >> >> >> *De :* Karl Wright [mailto:daddy...@gmail.com] >> *Envoyé :* mardi 24 juillet 2018 17:53 >> *À :* user@manifoldcf.apache.org >> *Objet :* Re: Out of memory, one file bug i think >> >> >> >> The problem isn't with image

Re: Out of memory, one file bug i think

2018-07-24 Thread Karl Wright
index without errors. Images are necessary > for this job. I try to recreate my job and test. > > > > Thanks, > > Maxence, > > > > > > > > > > *De :* Karl Wright [mailto:daddy...@gmail.com] > *Envoyé :* mardi 24 juillet 2018 17:32 > *À :* user@mani

Re: Out of memory, one file bug i think

2018-07-24 Thread Karl Wright
BLE}{/opt/manifoldcf-trunk/bin/./../web-proprietary/war/mcf-api-service.war} > > [Thread-490] INFO org.eclipse.jetty.server.handler.ContextHandler - > Stopped o.e.j.w.WebAppContext@60410cd > {/mcf-authority-service,file:/tmp/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-s

Re: Out of memory, one file bug i think

2018-07-24 Thread Karl Wright
n - Opening socket connection to server > kemp-formation-solr.citya.local/192.168.37.107:2181. Will not attempt to > authenticate using SASL (unknown error) > > [Thread-7602-SendThread(kemp-formation-solr.citya.local:2181)] INFO > org.apache.zookeeper.ClientCnxn - Socket connecti

RE: Out of memory, one file bug i think

2018-07-24 Thread msaunier
3:15 À : user@manifoldcf.apache.org Objet : Re: Out of memory, one file bug i think I've opened CONNECTORS-1516 to track the Class Not Found issue, and also created an Apache POI bugzilla ticket, which is referenced. Karl On Tue, Jul 24, 2018 at 6:15 AM Karl Wright mailto:daddy...@gma

Re: Out of memory, one file bug i think

2018-07-24 Thread Karl Wright
the exception, and look at what Worker Thread 1 was doing. > > Karl > > > On Tue, Jul 24, 2018 at 5:58 AM msaunier wrote: > >> Re Karl, >> >> >> >> I have an Out of Memory Error toda

Re: Optimized memory used

2018-07-24 Thread Karl Wright
ManifoldCF's usage of memory is bounded per thread, but obviously scales with the number of worker threads you have. If you are using Tika, the amount of memory that may be used varies a lot, however, because Tika's streaming document memory behavior is quite variable, depending on the kind of

Re: Out of memory, one file bug i think

2018-07-24 Thread Karl Wright
t document it came from from that. If not, then look at the log prior to the exception, and look at what Worker Thread 1 was doing. Karl On Tue, Jul 24, 2018 at 5:58 AM msaunier wrote: > Re Karl, > > > > I have an Out of Memory Error today. I think I have an error with a > d

Re: web crawler not sharing cookies

2018-07-20 Thread Gustavo Beneitez
Hi, thanks a lot, please let me check then the documentation for an example of that. Regards! El jue., 19 jul. 2018 a las 21:54, Karl Wright () escribió: > You are correct that cookies are not shared among threads. That is by > design. > > The only way to set cookies for the WebConnector is

Re: web crawler not sharing cookies

2018-07-19 Thread Karl Wright
You are correct that cookies are not shared among threads. That is by design. The only way to set cookies for the WebConnector is to have there be a "login sequence". The login sequence sets cookies that are then used by all subsequent fetches. Thanks, Karl On Thu, Jul 19, 2018 at 3:38 PM

Re: Error while crawling Infopath Forms in Sharepoint 2013

2018-07-06 Thread Karl Wright
Hi Nikita, There are no "plugins" available for the SharePoint connector. It only crawls libraries and attachments. In theory more supported types can be added but only if the (deprecated) SharePoint aspx services allow access to them. Karl On Fri, Jul 6, 2018 at 10:06 AM Nikita Ahuja

Re: Error while crawling Infopath Forms in Sharepoint 2013

2018-07-06 Thread Nikita Ahuja
Hi Karl, Thanks for your response. The InfoPath forms store the data and required information, and they show up as XML files. Can this work by using a plugin? On Fri 6 Jul, 2018, 6:38 PM Karl Wright, wrote: > Sharepoint has a number of data types that ManifoldCF doesn't know how to > crawl.

Re: Error while crawling Infopath Forms in Sharepoint 2013

2018-07-06 Thread Karl Wright
Sharepoint has a number of data types that ManifoldCF doesn't know how to crawl. Sounds like infopath forms are one such data type. It's not clear that crawling a form is a good idea in any case. What content do you expect this to yield? Karl On Fri, Jul 6, 2018 at 7:59 AM Nikita Ahuja

Re: ManifoldCF 2.10 & Sharepoint 2013 - Configuration assistance

2018-06-27 Thread Karl Wright
Hi Arjan, The ManifoldCF Sharepoint 2013 connector expects to be given either the root of the whole SharePoint site, or the root of a virtual site. The error message displayed shows an authorization error, and not for the root but rather for http://gocnavigator.com/projects/5277. If this is a virtual

Re: List all jobs page not working

2018-06-21 Thread Karl Wright
Works fine here. Karl On Thu, Jun 21, 2018 at 10:25 AM VINAY Bengaluru wrote: > Hi Karl, > The /json/jobs API request is not returning any results. > Also the list all jobs page isn't displaying in the front end. All other > pages work fine. We don't see any errors in the logs

RE: Documents blocked sometimes without errors

2018-06-21 Thread msaunier
Hello Karl, Ok I build and test this version. Thanks Maxence, De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : jeudi 21 juin 2018 02:43 À : user@manifoldcf.apache.org Objet : Re: Documents blocked sometimes without errors Patch attached, and fix committed to trunk. Karl

Re: Documents blocked sometimes without errors

2018-06-20 Thread Karl Wright
>> >> >> Maxence, >> >> >> >> >> >> >> >> >> >> *De :* Karl Wright [mailto:daddy...@gmail.com] >> *Envoyé :* lundi 18 juin 2018 14:42 >> *À :* user@manifoldcf.apache.org >> *Objet :* Re: Documents blocked so

Re: Documents blocked sometimes without errors

2018-06-20 Thread Karl Wright
est to reproduce the problem again and see whether they are the > same documents, or whether there is a pattern or other similarities. > > > > Maxence, > > > > > > > > > > *De :* Karl Wright [mailto:daddy...@gmail.com] > *Envoyé :* lundi 18 juin 2018 14:4

Re: script to schedule MCF Jobs by crontab login unauthorized

2018-06-19 Thread Karl Wright
There have been no security changes for many releases. Karl On Tue, Jun 19, 2018 at 9:25 AM Bisonti Mario wrote: > Hello, I used a script to remotely start a job from crontab on MCF 2.9.1 > and it worked > > The same script now does not work in MCF 2.10. > > > > Now, I tried this command: > > > >

Re: FATAL 2018-06-18T18:29:23,676 (Worker thread '36') - Error tossed: null

2018-06-19 Thread Karl Wright
') - Error tossed: null > > java.lang.NullPointerException > > FATAL 2018-06-19T09:29:34,742 (Worker thread '3') - Error tossed: null > > java.lang.NullPointerException > > FATAL 2018-06-19T09:29:34,797 (Worker thread '28') - Error tossed: null > > java.lang.NullPointe

Re: FATAL 2018-06-18T18:29:23,676 (Worker thread '36') - Error tossed: null

2018-06-18 Thread Karl Wright
Created CONNECTORS-1510 and committed a fix. Karl On Mon, Jun 18, 2018 at 2:33 PM Karl Wright wrote: > It certainly is a particular file -- the mime type is null, and that's > causing this line to blow up: > > final String lowerMimeType = mimeType.toLowerCase(Locale.ROOT); > > > That code

Re: FATAL 2018-06-18T18:29:23,676 (Worker thread '36') - Error tossed: null

2018-06-18 Thread Karl Wright
It certainly is a particular file -- the mime type is null, and that's causing this line to blow up: final String lowerMimeType = mimeType.toLowerCase(Locale.ROOT); That code was added a couple of revs back to address a different problem; it's a trivial fix: final String lowerMimeType
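
The replacement line is cut off in this listing, so the committed fix (CONNECTORS-1510) is not reproduced here. As a minimal sketch, assuming the intent is simply to tolerate a null MIME type, a null-safe variant could look like the following; the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper; the actual ManifoldCF fix is not shown in this listing.
final class MimeTypeNormalizer {
    private MimeTypeNormalizer() {}

    // Returns the lower-cased MIME type, or null when no MIME type is available,
    // instead of throwing a NullPointerException on a null input.
    static String lowerOrNull(final String mimeType) {
        return (mimeType == null) ? null : mimeType.toLowerCase(Locale.ROOT);
    }
}
```

Callers then have to handle a null result explicitly, for example by treating the document as having an unknown type.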

Re: FATAL 2018-06-18T18:29:23,676 (Worker thread '36') - Error tossed: null

2018-06-18 Thread Steph van Schalkwyk
Looks like a particular file may be causing this. Try to find the filename it crashes on and copy that to a small crawl directory. Repeat crawl. On Mon, Jun 18, 2018 at 11:34 AM, Bisonti Mario wrote: > Hello > > > > I configured ManifoldCF 2.10 with Tomcat 9.0.8 and Postgres 9.3 > > > > I

RE: Documents blocked sometimes without errors

2018-06-18 Thread msaunier
Okay. I will try to reproduce the problem again and see whether they are the same documents, or whether there is a pattern or other similarities. Maxence, De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : lundi 18 juin 2018 14:42 À : user@manifoldcf.apache.org Objet : Re: Documents

Re: Documents blocked sometimes without errors

2018-06-18 Thread Karl Wright
indexing and if it > happens again, I'll tell you. > > > > Maxence, > > > > *De :* Karl Wright [mailto:daddy...@gmail.com] > *Envoyé :* lundi 18 juin 2018 14:25 > *À :* user@manifoldcf.apache.org > *Objet :* Re: Documents blocked sometimes without errors > > > >

RE: Documents blocked sometimes without errors

2018-06-18 Thread msaunier
[mailto:daddy...@gmail.com] Envoyé : lundi 18 juin 2018 14:25 À : user@manifoldcf.apache.org Objet : Re: Documents blocked sometimes without errors My concern is that you upgraded the code but DID NOT do the pause/resume after you did that. If that was the sequence, you were left with old, un

Re: Documents blocked sometimes without errors

2018-06-18 Thread Karl Wright
With the trunk version, I feel it's less common but the problem is still > here. > > > > > > *De :* Karl Wright [mailto:daddy...@gmail.com] > *Envoyé :* lundi 18 juin 2018 12:14 > *À :* user@manifoldcf.apache.org > *Objet :* Re: Documents blocked sometimes without err

RE: Documents blocked sometimes without errors

2018-06-18 Thread msaunier
Yes, my solution is to pause the job and resume it. With the trunk version, I feel it's less common but the problem is still here. De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : lundi 18 juin 2018 12:14 À : user@manifoldcf.apache.org Objet : Re: Documents blocked sometimes without

Re: Documents blocked sometimes without errors

2018-06-18 Thread Karl Wright
> > > > > > *De :* msaunier [mailto:msaun...@citya.com] > *Envoyé :* lundi 18 juin 2018 11:13 > *À :* 'user@manifoldcf.apache.org' > *Objet :* RE: Documents blocked sometimes without errors > > > > OK, I missed my ln -s, so my link pointed to 2.9.1. Sorry for this er

RE: Documents blocked sometimes without errors

2018-06-18 Thread msaunier
OK, I missed my ln -s, so my link pointed to 2.9.1. Sorry for this error. Your corrections are okay. De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : lundi 18 juin 2018 10:43 À : user@manifoldcf.apache.org Objet : Re: Documents blocked sometimes without errors If there's any chance

RE: Documents blocked sometimes without errors

2018-06-18 Thread msaunier
OK. I have paused and restarted the job. I have stopped the agent and restarted it. I am continuing the tests. I have many millions of documents, so it will take time. Maxence, De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : lundi 18 juin 2018 10:43 À : user@manifoldcf.apache.org Objet : Re

Re: Documents blocked sometimes without errors

2018-06-18 Thread Karl Wright
to you later. > > Karl > > On Mon, Jun 18, 2018 at 4:07 AM msaunier wrote: > >> CSV attached. >> >> >> >> Thanks, >> >> Maxence, >> >> >> >> >> >> >> >> *De :* Karl Wright [mailto:daddy...@gmail.com]

RE: Documents blocked sometimes without errors

2018-06-18 Thread msaunier
Okay, if you need details I am available. Thanks, Maxence, De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : lundi 18 juin 2018 10:35 À : user@manifoldcf.apache.org Objet : Re: Documents blocked sometimes without errors These are still indeed blocked. Unfortunately I don't

Re: Documents blocked sometimes without errors

2018-06-18 Thread Karl Wright
, > > > > > > > > *De :* Karl Wright [mailto:daddy...@gmail.com] > *Envoyé :* lundi 18 juin 2018 10:02 > *À :* user@manifoldcf.apache.org > *Objet :* Re: Documents blocked sometimes without errors > > > > The only way to know if these are truly blocked is to

RE: Documents blocked sometimes without errors

2018-06-18 Thread msaunier
CSV attached. Thanks, Maxence, De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : lundi 18 juin 2018 10:02 À : user@manifoldcf.apache.org Objet : Re: Documents blocked sometimes without errors The only way to know if these are truly blocked is to find the document records

Re: Documents blocked sometimes without errors

2018-06-18 Thread Karl Wright
n > I verify my trunk version after the build? > > > > Thanks, > > Maxence , > > > > > > *De :* msaunier [mailto:msaun...@citya.com] > *Envoyé :* mardi 5 juin 2018 14:54 > *À :* 'user@manifoldcf.apache.org' > *Objet :* RE: Documents block

RE: connectors.xml modified: new repository not in the list

2018-06-15 Thread msaunier
Hello Mario, Is your jcifs named jcifs.jar or jcifs-1.3.19.jar? What is your ManifoldCF version? Maxence, De : Bisonti Mario [mailto:mario.biso...@vimar.com] Envoyé : vendredi 15 juin 2018 11:39 À : user@manifoldcf.apache.org Objet : connectors.xml modified: new repository

Re: Job in aborting status

2018-06-13 Thread Karl Wright
> > > > > > > > > > > > For the debug, perhaps isn’t it the right mode? > > Thanks a lot > > Mario > > > > > > > > *Da:* Karl Wright > *Inviato:* martedì 12 giugno 2018 17:40 > *A:* user@manifoldcf.apache.org

Re: Job in aborting status

2018-06-12 Thread Karl Wright
> > > *Da:* Karl Wright > *Inviato:* martedì 12 giugno 2018 16:22 > *A:* user@manifoldcf.apache.org > *Oggetto:* Re: Job in aborting status > > > > Hi Mario, > > > > What repository connector are you using for Job "B"? Is it your own > connector?

Re: Job in aborting status

2018-06-12 Thread Karl Wright
re > > > > So, I think that there could be a lock situation in the internal HSQLDB > that I am not able to solve. > > > > > > > > > > *Da:* Karl Wright > *Inviato:* martedì 12 giugno 2018 15:46 > *A:* user@manifoldcf.apache.org > *Ogge

Re: Job in aborting status

2018-06-12 Thread Karl Wright
o the > /usr/share/manifoldcf/example to try to clean-up my situation, but perhaps > the script isn’t good for me because I am using jetty on the example > directory? > > > > Thanks > > > > > > > > > > *Da:* Karl Wright > *Inviato:* martedì 12 g

Re: Job in aborting status

2018-06-12 Thread Karl Wright
at > org.apache.manifoldcf.core.system.ManifoldCF$DatabaseShutdown.doCleanup(ManifoldCF.java:1664) > > at > org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1540) > > at > org.apache.manifoldcf.core.system.ManifoldCF$Sh

Re: Job in aborting status

2018-06-12 Thread Karl Wright
Hi Mario, Two things you should know. First, if you have very large jobs, it can take a while to abort them. This is because the documents need to have their document priority cleared, and that can take a while for a large job. Second, what you describe sounds like you may have stuck locks.

Re: locale fr ERROR

2018-06-05 Thread Steph van Schalkwyk
gain a lot of time if > you want to translate the MCF ressource bundle. > > > > Regards, > > > > Cedric > > > > *De :* msaunier [mailto:msaun...@citya.com] > *Envoyé :* lundi 4 juin 2018 16:14 > *À :* user@manifoldcf.apache.org > *Objet :* RE:

RE: Documents blocked sometimes without errors

2018-06-05 Thread msaunier
OK. I have built and deployed. The tests are in progress. Thanks, Maxence De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : lundi 4 juin 2018 19:55 À : user@manifoldcf.apache.org Objet : Re: Documents blocked sometimes without errors I attached a patch to the ticket

Re: Zk ManifoldCF just questions

2018-06-05 Thread Karl Wright
> > > > > > > *De :* Karl Wright [mailto:daddy...@gmail.com] > *Envoyé :* mardi 5 juin 2018 11:11 > *À :* user@manifoldcf.apache.org > *Objet :* Re: Zk ManifoldCF just questions > > > > Hi Maxence, > > > > I think this will answer your questions:

RE: Zk ManifoldCF just questions

2018-06-05 Thread msaunier
Last question: Can I migrate to Zk Multiprocess and have the same Database? (in order not to lose the current data) Thanks De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : mardi 5 juin 2018 11:11 À : user@manifoldcf.apache.org Objet : Re: Zk ManifoldCF just questions

RE: Zk ManifoldCF just questions

2018-06-05 Thread msaunier
OK, so I will test this functionality. Thank you. Maxence, De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : mardi 5 juin 2018 11:11 À : user@manifoldcf.apache.org Objet : Re: Zk ManifoldCF just questions Hi Maxence, I think this will answer your questions: (1

Re: Zk ManifoldCF just questions

2018-06-05 Thread Karl Wright
Hi Maxence, I think this will answer your questions: (1) Multiprocess MCF is stable, yes. Zookeeper is the recommended configuration; shared files are deprecated. Zookeeper is used to coordinate cluster processes and store global configuration. (2) Multiprocess MCF is best viewed as a cluster.

RE: locale fr ERROR

2018-06-05 Thread msaunier
@manifoldcf.apache.org Objet : RE: locale fr ERROR Hi Steph and Maxence, we started to work on it somewhere in 2016, but due to reprioritization, we kind of froze our translation task, and it ended up as a sleeping beauty in our shared drive. But thanks to your question, we did some homework

RE: Documents blocked sometimes without errors

2018-06-05 Thread msaunier
OK. I will test today. De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : lundi 4 juin 2018 19:55 À : user@manifoldcf.apache.org Objet : Re: Documents blocked sometimes without errors I attached a patch to the ticket that is a tentative fix. Please let me know if you still see

Re: Documents blocked sometimes without errors

2018-06-04 Thread Karl Wright
>>> On Mon, Jun 4, 2018 at 8:43 AM msaunier wrote: >>> >>>> Thanks for your answers. >>>> >>>> >>>> >>>> So, I am attaching the interface screen and the CSV result to this email. >>>> >>>> >>>> >

Re: Documents blocked sometimes without errors

2018-06-04 Thread Karl Wright
>> >>> Thanks for your answers. >>> >>> >>> >>> So, I am attaching the interface screen and the CSV result to this email. >>> >>> >>> >>> Thanks, >>> >>> Maxence >>> >>> >>> >>

Re: Documents blocked sometimes without errors

2018-06-04 Thread Karl Wright
> > > On Mon, Jun 4, 2018 at 8:43 AM msaunier wrote: > >> Thanks for your answers. >> >> >> >> So, I am attaching the interface screen and the CSV result to this email. >> >> >> >> Thanks, >> >> Maxence >> >> >> >>

Re: Documents blocked sometimes without errors

2018-06-04 Thread Karl Wright
sv result. > > > > Thanks, > > Maxence > > > > > > > > *De :* Karl Wright [mailto:daddy...@gmail.com] > *Envoyé :* lundi 4 juin 2018 11:36 > *À :* user@manifoldcf.apache.org > *Objet :* Re: Documents blocked sometimes without errors > > > >

RE: locale fr ERROR

2018-06-04 Thread msaunier
Hello, The latest news was that they did not implement this feature. I will ask them. Thank you. De : Steph van Schalkwyk [mailto:st...@remcam.net] Envoyé : lundi 4 juin 2018 16:12 À : user@manifoldcf.apache.org Objet : Re: locale fr ERROR Take a look at FranceLabs' Datafari

Re: locale fr ERROR

2018-06-04 Thread Steph van Schalkwyk
daddy...@gmail.com] > *Envoyé :* vendredi 1 juin 2018 16:29 > *À :* user@manifoldcf.apache.org > *Objet :* Re: locale fr ERROR > > > > Yes, of course you can. > > > > Have a look in the manifoldcf source tree (root here: > https://svn.apache.org/repos/asf/

RE: Documents blocked sometimes without errors

2018-06-04 Thread msaunier
Thanks for your answers. So, I am attaching the interface screen and the CSV result to this email. Thanks, Maxence De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : lundi 4 juin 2018 11:36 À : user@manifoldcf.apache.org Objet : Re: Documents blocked sometimes without errors

Re: Documents blocked sometimes without errors

2018-06-04 Thread Karl Wright
Oh, and it should be unnecessary to pause/resume jobs when you bring down ManifoldCF for database maintenance. Stop the agents service, and start it again, and you should pick up exactly where you left off. Karl On Mon, Jun 4, 2018 at 5:33 AM Karl Wright wrote: > Hi Maxence, > > Pausing and

Re: Documents blocked sometimes without errors

2018-06-04 Thread Karl Wright
Hi Maxence, Pausing and restarting a job causes all of its documents to have their docpriority field be recalculated. It should not be necessary to do this in order to have job complete, though. All documents that are queued have their docpriority set at the time they are added to the queue,

Re: Error handling configuration

2018-06-03 Thread Karl Wright
Hi Yasufumi, Connector writers are required to do the following when they write connectors: (1) List the kinds of errors the connector may encounter that might potentially be resolved if a document is fetched another time; (2) Come up with a way of detecting each such error; (3) Decide on a

Re: Error handling configuration

2018-06-03 Thread Yasufumi Mizoguchi
Hi Karl, Thank you for your reply. I am using "org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector" for indexing my local filesystem before indexing my file servers. And I want ManifoldCF to retry at least once when facing any errors. Now, I am trying to generate errors by file

Re: locale fr ERROR

2018-06-01 Thread Karl Wright
> > > > > *De :* Karl Wright [mailto:daddy...@gmail.com] > *Envoyé :* vendredi 1 juin 2018 14:31 > *À :* user@manifoldcf.apache.org > *Objet :* Re: locale fr ERROR > > > > That just means that there are no French translations available. > > > > Karl

Re: locale fr ERROR

2018-06-01 Thread Karl Wright
That just means that there are no French translations available. Karl On Fri, Jun 1, 2018 at 4:07 AM msaunier wrote: > Hello Karl, > > > > What is the reason of this ERROR : > > > > ./manifoldcf.log:17656:ERROR 2018-06-01T09:57:16,715 (qtp1601687801-508) - > Missing resource bundle

Re: Error handling configuration

2018-06-01 Thread Karl Wright
Hi Yasufumi, Individual connectors determine what happens on specific kinds of errors that they receive. The connector can determine the pattern of behavior based on what kind of ServiceInterruption exception it throws when the error occurs. So this is not "configurable"; the logic for
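
The idea that the connector, not the job configuration, decides how an error is retried can be sketched as follows. `RetryLater` is a hypothetical stand-in written for this example; it is not ManifoldCF's `ServiceInterruption` class, and the retry and give-up intervals are arbitrary values chosen for illustration.

```java
// Hypothetical sketch of a connector classifying an error as retryable.
final class RetryPolicySketch {
    private RetryPolicySketch() {}

    // Stand-in for a "try again later" signal raised by a connector.
    static final class RetryLater extends Exception {
        final long retryTimeMillis; // earliest time the fetch may be retried
        final long failTimeMillis;  // time after which the document is given up on

        RetryLater(String message, long retryTimeMillis, long failTimeMillis) {
            super(message);
            this.retryTimeMillis = retryTimeMillis;
            this.failTimeMillis = failTimeMillis;
        }
    }

    // The connector code itself decides which failures are worth retrying.
    static String fetch(boolean transientFailure, long nowMillis) throws RetryLater {
        if (transientFailure) {
            // Assumed policy: retry in 5 minutes, give up after 3 hours.
            throw new RetryLater("temporary I/O problem",
                nowMillis + 5L * 60L * 1000L,
                nowMillis + 3L * 60L * 60L * 1000L);
        }
        return "document content";
    }
}
```

Because the policy is expressed in code like this, changing it means changing the connector rather than a setting.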

Re: Regarding skip limit

2018-05-30 Thread Karl Wright
Hi Vinay, I don't have complete information, but offhand it looks to me like the tar is being extracted more than once because the ingestion fails and is being retried. The retries are happening every 7-8 minutes, which is exactly what one expects for error retries. Please note that the number

Re: ManifoldCF API file system exclusion list

2018-05-30 Thread Shashank Raj
> order does not get preserved when it is re-imported. The fix changes the > exported form to the less-appealing-but-order-preserving equivalent. > > Thanks, > Karl > > > On Wed, May 30, 2018 at 3:01 AM Shashank Raj > wrote: > >> Hi Karl, >>We

Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour

2018-05-30 Thread Karl Wright
ailto:daddy...@gmail.com] > *Envoyé :* lundi 28 mai 2018 18:47 > *À :* user@manifoldcf.apache.org > *Objet :* Re: > org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) > error SPAM 10Go/hour > > > > This sounds potentially like a problem in Tika, but i

RE: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour

2018-05-30 Thread msaunier
servers I check and I will get back to you. Thanks, Maxence. De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : lundi 28 mai 2018 18:47 À : user@manifoldcf.apache.org Objet : Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM

Re: ManifoldCF API file system exclusion list

2018-05-30 Thread Karl Wright
Hi Shashank, This question has come up recently from another source as well, and there was a ticket and a fix committed. I believe it went out in 2.10. The problem is that the *exported* JSON is not in the proper form and so order does not get preserved when it is re-imported. The fix changes

Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour

2018-05-29 Thread Karl Wright
> > at > org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503) > > at > org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238) > > at > org.apache.pdfbox.contentst

RE: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour

2018-05-29 Thread msaunier
apache.org Objet : Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour This sounds potentially like a problem in Tika, but in order to be sure I would need a complete stack trace, not just a piece of one. If it is a Tika issue,

Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour

2018-05-28 Thread Karl Wright
This sounds potentially like a problem in Tika, but in order to be sure I would need a complete stack trace, not just a piece of one. If it is a Tika issue, it should appear reliably on the same document, again and again. Is there any way you can crawl ONLY one of the documents that got blocked?

Re: Manifold CF job hangs

2018-05-24 Thread Karl Wright
Hi Vinay, If you know which documents these are, it would be great to get hold of one of them. Alternatively, it might be helpful to provide a thread dump of the ManifoldCF agents process once it's finished all the other documents and is stuck only on ones that are "hung" inside Tika. This

Re: Database for Manifoldcf

2018-05-17 Thread Beelz Ryuzaki
Hi Karl, Thank you for your quick response, I will do as you say. On Thu 17 May 2018 at 14:59, Karl Wright wrote: > Hi Beelz, > > ManifoldCF already supports MySQL. > > Please read the "how to build and deploy" page for how to set it up. > > Karl > > > On Thu, May 17, 2018

Re: Database for Manifoldcf

2018-05-17 Thread Karl Wright
Hi Beelz, ManifoldCF already supports MySQL. Please read the "how to build and deploy" page for how to set it up. Karl On Thu, May 17, 2018 at 8:37 AM Beelz Ryuzaki wrote: > Hello Everyone, > > Manifoldcf uses either postgresql or hsql for its crawler ui database. I >

Re: Issues in crawling contents from Documentum Repository Connector to ElasticSearch Output Connector

2018-05-16 Thread Karl Wright
A 2.10 version was released middle of April. Karl On Wed, May 16, 2018, 3:01 AM Shashank Saurabh LNU < sln...@worldbankgroup.org> wrote: > Hi Karl, > Thanks for the quick reply and update. > > I'm facing this issue in the ManifoldCF version 2.9.1 which I'm using and > think is the latest

Re: Issues in crawling contents from Documentum Repository Connector to ElasticSearch Output Connector

2018-05-16 Thread Shashank Saurabh LNU
Hi Karl, Thanks for the quick reply and update. I'm facing this issue in the ManifoldCF version 2.9.1 which I'm using and think is the latest version. Please suggest the latest version to go with or the version you are referring to in which this issue of ElasticSearch was fixed. Would be a

RE: Issues in crawling contents from Documentum Repository Connector to ElasticSearch Output Connector

2018-05-16 Thread Shashank Saurabh LNU
gards, Shashank Saurabh From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, May 15, 2018 8:32 PM To: user@manifoldcf.apache.org Subject: Re: Issues in crawling contents from Documentum Repository Connector to ElasticSearch Output Connector I believe a fix was made to the ElasticS

Re: Issues in crawling contents from Documentum Repository Connector to ElasticSearch Output Connector

2018-05-15 Thread Karl Wright
I believe a fix was made to the ElasticSearch connector a release or two ago that addresses this problem. ElasticSearch once again changed their API without notice and this was necessary. Please consider updating to the latest release of MCF. Karl On Tue, May 15, 2018 at 9:48 AM Shashank

Re: Documentum - metadata crawl

2018-05-15 Thread Karl Wright
Hi Radko, This is something that could be added to the Documentum connector. If you want this functionality, please submit a Jira ticket describing what you want. I cannot guarantee it will be worked on immediately, however. Thanks, Karl On Tue, May 15, 2018 at 10:56 AM Najman, Radko

Re: Job stuck at aborted/starting up

2018-05-14 Thread Karl Wright
Hi Shashank, Bashing the database is not the right way to do this, at all. You will cause carnage. What you need to do is diagnose what is happening. First of all, large jobs take a long time to abort or start up, because they need to set document priorities for all the crawlable documents in

Re: Using File System Repository Connector for a Sample Crawl in Windows Environment

2018-05-01 Thread Markus Schuch
Hi Irindu, I suppose you left the text field in the "root path" column (next to the "Add" Button) empty, so the root path is the execution directory of your ManifoldCF instance. Instead of using the match rules fields you need to enter your desired crawl root directory as "root path". The

Re: 2.11 ManifoldCF version

2018-04-30 Thread Karl Wright
Hi Maxence, The 2.11 release is scheduled for August 31. Karl On Mon, Apr 30, 2018 at 6:03 AM, msaunier wrote: > Hello Karl, > > > > Do you have a date for the 2.11 of ManifoldCF ? Or an idea ? > > Because I wont use the trunk to push in production my project and I use >

Re: Alfresco connector authentication fail

2018-04-27 Thread Piergiorgio Lucidi
mail.com] > *Envoyé :* vendredi 27 avril 2018 16:43 > *À :* user@manifoldcf.apache.org > *Objet :* Re: Alfresco connector authentication fail > > > > Hi Maxence, > > > > See: > > " Caused by: org.apache.axis.AxisFault: (404)Not Found" > > >

RE: Alfresco connector authentication fail

2018-04-27 Thread msaunier
I think not; I don’t know what it is. Isn’t it just for the webscript repository? Do I need to install this on Alfresco or on ManifoldCF? De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : vendredi 27 avril 2018 16:43 À : user@manifoldcf.apache.org Objet : Re: Alfresco connector

Re: Alfresco connector authentication fail

2018-04-27 Thread Karl Wright
Hi Maxence, See: " Caused by: org.apache.axis.AxisFault: (404)Not Found" The alfresco page it is trying to reach is not there. Have you installed the alfresco indexer plugin? Karl On Fri, Apr 27, 2018 at 9:21 AM, msaunier wrote: > Hello Karl, > > > > I have an error

RE: UpdateProcessor SolrCloud and ManifoldCF

2018-04-19 Thread msaunier
url. Thanks, Maxence De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : jeudi 19 avril 2018 14:59 À : user@manifoldcf.apache.org Objet : Re: UpdateProcessor SolrCloud and ManifoldCF The Arguments tab is not supposed to add a field. It is for arguments to be sent only

Re: UpdateProcessor SolrCloud and ManifoldCF

2018-04-19 Thread Karl Wright
> > I think the arguments add a field rather than a parameter on the URL. > > > > Thanks. > > Maxence, > > > > *De :* Karl Wright [mailto:daddy...@gmail.com] > *Envoyé :* jeudi 19 avril 2018 13:44 > *À :* user@manifoldcf.apache.org > *Objet :* Re: UpdateProcessor SolrCloud and

RE: UpdateProcessor SolrCloud and ManifoldCF

2018-04-19 Thread msaunier
/201102081135_ENVOIDEVISPP.doc] unknown field 'processor' I think the arguments add a field rather than a parameter on the URL. Thanks. Maxence, De : Karl Wright [mailto:daddy...@gmail.com] Envoyé : jeudi 19 avril 2018 13:44 À : user@manifoldcf.apache.org Objet : Re: UpdateProcessor SolrCloud

Re: UpdateProcessor SolrCloud and ManifoldCF

2018-04-19 Thread Karl Wright
Hi Maurice, You're not supposed to add arguments to the handler paths. They're just paths and not full URLs. You can add arbitrary URL arguments to be sent to Solr elsewhere in the configuration. Look at the "Arguments" tab. For commits, look at the "Commits" tab. Karl On Thu, Apr 19, 2018

Re: Connector to use Aconex API

2018-04-18 Thread Nikita Ahuja
Hi Rafa, I have tried to enable remote Java debugging in the manifold \example\start.jar file, but it does not work properly. It shows that the port is still in use. Can you please help? Thanks, Nikita On Tue, Apr 17, 2018 at 1:14 PM, Rafa Haro wrote: > Hi

Re: Connector to use Aconex API

2018-04-17 Thread Rafa Haro
Hi Nikita, For debugging, as any other java application, you can just run manifoldcf jar enabling remote java debugging and then configure Remote Debugging at your favourite IDE. On Tue, Apr 17, 2018 at 9:13 AM Karl Wright wrote: > Hi Nikita, > > Debugging connectors I

Re: Connector to use Aconex API

2018-04-17 Thread Karl Wright
Hi Nikita, Debugging connectors I can't give you any advice for, other than they usually are not terribly complicated. The Tika transformation connector can be configured using the general parameters you provide to the Tika engine. I am not sure what you mean beyond that. Tika's output is also

Re: Connector to use Aconex API

2018-04-17 Thread Nikita Ahuja
Thanks Karl, I will start creating the repository connector, but I have a doubt related to debugging, because whenever a single change is made to the present code we need to publish it, then check the ManifoldCF interface, test it, and go through the log files. Is there any better way to

Re: Connector to use Aconex API

2018-04-16 Thread Karl Wright
Hi Nikita, You will need to write a Repository Connector. There is a book you can look at describing how to do that -- ManifoldCF In Action. It's free and you can see the PDFs here: https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs Karl On Mon, Apr 16, 2018 at 4:48 AM, Nikita

Re: Needed information on multi-process-zookeeper setup

2018-03-28 Thread Karl Wright
Hi Vinay, The agents process is the one that does the crawling. The tomcat serves the UI, API, and Authority Service web applications. MCF requires its connectors to be bounded in memory consumption. That means that once you determine how much memory you need, the consumption isn't going to

Re: ManifoldCF two server setup

2018-03-23 Thread Karl Wright
Hi Shashank, As I mentioned earlier, file-based synchronization has been deprecated. We strongly recommend that you use Zookeeper-based synchronization. I am very confused that you claim you can run jobs on specific cluster members. Job work is distributed among all cluster members, and you
