Re: Exporting crawler configuration easier?
The fact that the export is a zip does not mean it is meant to be edited directly. It sounds like the reason you want to edit it is to remove the passwords from the file. Perhaps we should look at it from that point of view, and allow an export option that does not include any passwords, or something like that?

Karl

On Wed, Jun 27, 2012 at 7:27 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

We have all configuration files for our search project stored in SVN, even our MCF crawler configuration. Each time we change our MCF settings, e.g. add something to the seed list, we usually export the configuration and commit that change to SVN.

This can be a time-consuming process since we have to unzip the generated export file in order to edit the files within it. We need to edit the output file, which includes the password to our Solr server. Then we must zip all these files again in order to create a similar export file. The order of the files is very important: you cannot just create a zip file without being aware of the order of the included files, otherwise MCF will complain when you try to import that file later.

Any suggestions for a smoother way to have a version-controlled configuration? Perhaps I should create a script which does all the steps mentioned above? As far as I know, it's not possible to edit the files directly inside a zip file from a terminal on UNIX.

Thanks,
Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
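Until such an export option exists, the unzip/edit/rezip round trip Erlend describes can be scripted. A minimal sketch in Python (the member name "outputconnections.xml" and the password string are hypothetical placeholders, not actual MCF export contents): it rewrites the archive member by member, preserving the original entry order that the MCF import depends on.

```python
import io
import zipfile

def strip_password(zip_bytes, target_name, secret):
    """Rewrite an export zip, blanking out a secret string in one member
    while preserving the original entry order of the archive."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        # infolist() yields entries in the order they appear in the
        # archive, which matters because MCF rejects re-ordered exports.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == target_name:
                data = data.replace(secret, b"")
            dst.writestr(info.filename, data)
    return out.getvalue()
```

Because every member is copied in `infolist()` order, the rewritten archive keeps the same file sequence as the original export, which is the property the import check cares about.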
Re: Crawling behind an ISA proxy (iis 7.5)
I was wondering if you'd picked up and tried the patch for CONNECTORS-483. This patch adds official proxy support for the Web Connector. Alternatively, you could try to build and run with trunk code.

Karl

On Wed, May 16, 2012 at 12:12 PM, Karl Wright daddy...@gmail.com wrote:

Hi Rene,

The URL that is causing the RFC2617 challenge/response is being authenticated with basic auth, not NTLM. This could yield a 401. You may want to check the URL in a browser other than IE (Firefox, for instance) to see if basic auth is being used for this URL rather than NTLM.

The redirection you describe to GetLogon is pretty standard practice. You can easily tell the web connector that that is part of the logon sequence by following the steps I laid out in the earlier email. Once you have set up what you think is the right set of logon pages, it's very helpful to attempt a crawl and then see what the simple history shows. There are specific activities logged when logon begins and ends, so this is enormously helpful as a diagnostic aid. If you see a continuous loop (entering logon sequence, doing stuff, exiting logon sequence, and repeating) then it is clear that the cookie has not been set.

I won't be able to look at your packet log for a while, probably at least a week.

Karl

On Wed, May 16, 2012 at 10:23 AM, Rene Nederhand r...@nederhand.net wrote:

Hi Karl,

Thank you so much for putting so much time into educating a newbie. I appreciate your help enormously. I've tried to follow each of the steps below. So far, it doesn't work, but I will continue this evening to see if I can get this thing going. In the meantime, I have switched the log level of the crawling process to INFO and found something interesting in the logs.
Perhaps this could shed some light on my issues:

ERROR 2012-05-16 16:04:13,581 (Thread-1019) - Invalid challenge: Basic
org.apache.commons.httpclient.auth.MalformedChallengeException: Invalid challenge: Basic
  at org.apache.commons.httpclient.auth.AuthChallengeParser.extractParams(Unknown Source)
  at org.apache.commons.httpclient.auth.RFC2617Scheme.processChallenge(Unknown Source)
  at org.apache.commons.httpclient.auth.BasicScheme.processChallenge(Unknown Source)
  at org.apache.commons.httpclient.auth.AuthChallengeProcessor.processChallenge(Unknown Source)
  at org.apache.commons.httpclient.HttpMethodDirector.processWWWAuthChallenge(Unknown Source)
  at org.apache.commons.httpclient.HttpMethodDirector.processAuthenticationResponse(Unknown Source)
  at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
  at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
  at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)

Please note that I have set NTLM (not BASIC) authentication on bb.helo.hanze.nl and nothing else. The error does not occur when I try to crawl our intranet (also with NTLM). Does this mean something? At least, I think it is the source of the 401 I see when looking at the simple report, isn't it?

In addition, I've used Charles proxy to monitor all interaction between my browser and the server. I have found that it doesn't matter which URL I use to enter Blackboard; they all get redirected to https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon. Shouldn't page-based authentication handle this?

To make the information complete, I've attached the HAR file with the Charles proxy output. It can be displayed at http://www.softwareishard.com/har/viewer/ for example. You'll be able to see all requests/responses when I start with a clean browser (cookies removed) and enter https://bb.helo.hanze.nl. Maybe this helps.

Again, thanks a lot for your help!
René

On Tue, May 15, 2012 at 5:59 PM, Karl Wright daddy...@gmail.com wrote:

Hi Rene,

You will need both NTLM auth (page auth, which you have already set up) and session auth (which you haven't yet set up).

In order to set up session-based auth, you should first identify the set of pages that you want access to that are protected by a cookie requirement. You will need to write a regular expression that matches these pages and ONLY these pages. This gets entered as the URL regular expression in the Session-based Access Credentials part of the Access Credentials tab. Then, click the Add button.

The next thing you will need is to specify how the connector recognizes pages that belong to the logon sequence. The actual sequence you need to understand is what happens in the browser when you try to access a specific protected URL and you don't have the right cookie. You did not actually specify that; I think you are presuming that you'd be entering directly through the logon page, but that is not how it works. The crawler will have a URL in mind and will need access to the content of that URL. It will fetch the URL
RE: How to increase cache settings for ManifoldCF Authority Service
It would be great if you could open a ticket to request that this cache value be configurable, like it is in the Active Directory authority.

Karl

Sent from my Windows Phone

--
From: Anupam Bhattacharya
Sent: 7/3/2012 10:13 AM
To: user@manifoldcf.apache.org
Subject: Re: How to increase cache settings for ManifoldCF Authority Service

Many thanks!! I changed the value and rebuilt ManifoldCF, which solved the issue.

Regards
Anupam

On Tue, Jul 3, 2012 at 4:24 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi,

I think this is the line you'd change to configure the cache lifetime.

source: org.apache.manifoldcf.crawler.authorities.DCTM.AuthorityConnector.java

protected static long responseLifetime = 60000L; // <-- this value

I think the ActiveDirectoryAuthority.java code helps here.

Regards,
Shinichiro Abe

On 2012/07/03, at 19:44, Anupam Bhattacharya wrote:

Sorry, I didn't mention that clearly. I was just trying to figure out from the SVN code where the 1-minute timeout is kept. By my best guess, I can see a line which must be doing this 1-minute timeout in http://svn.apache.org/repos/asf/manifoldcf/trunk/framework/pull-agent/src/main/java/org/apache/manifoldcf/crawler/system/ExpireStufferThread.java

// If there are no documents at all, then we can sleep for a while.
// The theory is that we need to allow stuff to accumulate.
if (descs.length == 0)
{
  ManifoldCF.sleep(60000L);  // 1 minute
  continue;
}

Please confirm whether I am in the right direction.

Thanks
Anupam

On Tue, Jul 3, 2012 at 3:51 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi,

Oh sorry, I was talking about the Active Directory authority service. There is currently no place to configure the cache lifetime in the Documentum authority service.

Shinichiro Abe

On 2012/07/03, at 19:07, Anupam Bhattacharya wrote:

Hi,

I am using ManifoldCF for a Documentum repository with the Documentum authority service. How can I configure the cache lifetime settings in this case, when Active Directory is not present?
Regards
Anupam

On Tue, Jul 3, 2012 at 3:32 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi,

> Can I configure these timeout settings to something like 60 min or 1 day etc.?

Yes, you can configure the cache lifetime, during which tokens are cached after the user's last access to Active Directory. I think this value might as well be set to about 60 min; 1 day is too long.

Regards,
Shinichiro Abe

On 2012/07/03, at 18:15, Anupam Bhattacharya wrote:

Hello Karl,

First of all, congratulations on ManifoldCF's graduation to a top-level Apache project, and thanks for all the help you provided during my development previously through this forum.

I have recently come across a performance problem due to the ManifoldCF authority service. After including the authority service, the query response times increase a lot. After doing some inspection I found that ManifoldCF doesn't cache user tokens for more than 1 min (http://search-lucene.com/m/YqXPHki0Dv/v=threaded). Can I configure these timeout settings to something like 60 min or 1 day etc.?

Regards
Anupam
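The configurable lifetime Karl suggests filing a ticket for amounts to a time-to-live on cached access tokens. A small sketch of the idea in Python (class and method names are illustrative only, not the actual ManifoldCF classes; the 60000 ms default mirrors the hardcoded one-minute responseLifetime discussed above):

```python
import time

class TokenCache:
    """Sketch of a per-user token cache with a configurable lifetime.

    Entries older than lifetime_ms are treated as misses, forcing a
    fresh lookup against the authority (e.g. Active Directory)."""

    def __init__(self, lifetime_ms=60000, clock=time.monotonic):
        self.lifetime_ms = lifetime_ms
        self.clock = clock          # injectable clock, seconds
        self._entries = {}

    def put(self, user, tokens):
        self._entries[user] = (self.clock(), tokens)

    def get(self, user):
        hit = self._entries.get(user)
        if hit is None:
            return None
        stored_at, tokens = hit
        if (self.clock() - stored_at) * 1000.0 > self.lifetime_ms:
            del self._entries[user]  # expired: caller must re-fetch
            return None
        return tokens
```

Raising `lifetime_ms` trades freshness of access tokens for fewer round trips to the authority, which is exactly the performance trade-off Anupam is describing.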
[ANNOUNCE] ManifoldCF 0.6 is released!
I'd like to announce the release of ManifoldCF 0.6. The list of changes can be found at https://svn.apache.org/repos/asf/manifoldcf/branches/release-0.6-branch/CHANGES.txt. Congratulations to all involved! Karl
Re: How to import data from Oracle to Solr
Hi Wolfgang,

ManifoldCF is meant to handle a binary document and its metadata. You must provide the document; metadata is optional. The JDBC connector does not currently support metadata. In order to index this, therefore, you will need to decide what should go into your binary document from your database fields. You can append multiple fields together into one document by means of SQL, e.g. the CONCAT operator or its Oracle equivalent. This would then go into one field in Solr, which is what you'd search on.

Alternatively, if you really need separate indexed fields in Solr for search reasons, you can request a JDBC connector enhancement to add metadata support. You'd still need a binary document, although you could return a blank value for that.

So I guess the answer depends on what you are trying to do on the whole.

Karl

On Tue, Jul 17, 2012 at 6:27 AM, Wolfgang Schreiber wolfgang.schrei...@isb-ag.de wrote:

Hello,

we are trying to ingest data from an Oracle database into Solr. We managed to insert docs into Solr, but only document IDs are inserted and no other data fields. Can you provide an example of how to set up the import job in ManifoldCF?

Assume we have the following initial situation:

1) Our Oracle table looks something like:

ADDRESS
--------
ID      NUMBER
ZIP     NUMBER
CITY    VARCHAR(2)
STREET  VARCHAR(2)

2) In Solr's schema.xml we added the following fields for the database columns:

...
<field name="ZIP" type="int" indexed="true" stored="true" />
<field name="City" type="string" indexed="true" stored="true" />
<field name="Street" type="string" indexed="true" stored="true" />
...

So here are our questions:

* How do we have to set up the queries for the ManifoldCF job? In particular, how exactly must the seeding query and the data query look?
* How do the Solr field mappings look?

We read your online documentation as well as your MEAP book but could not find a working example for a successful import from Oracle to Solr. Any help is welcome!

Best regards
Wolfgang
Re: How to import data from Oracle to Solr
The way you create an enhancement request is through Jira, at https://issues.apache.org/jira. Just create a request for an improvement, and be sure to list any specific details that are important to you.

Thanks,
Karl

On Wed, Jul 18, 2012 at 5:46 AM, Wolfgang Schreiber wolfgang.schrei...@isb-ag.de wrote:

Hi Karl, hi ManifoldCF team members,

Using Solr's copyField element we managed to create separate fields for the different database columns:

<field name="city" type="cityType" indexed="true" stored="true" />
...
<copyField source="text" dest="city"/>
...
<fieldType name="cityType" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=".+city:(.+);.*" group="1" />
  </analyzer>
</fieldType>

Anyhow, this solution has some drawbacks; e.g. the newly created fields are all text fields. In particular, numeric and date fields are also copied to text fields, and we cannot use the type-specific functions of Solr.

So coming back to the offer in your first mail: Is it possible that you create a JDBC connector enhancement to support metadata? Is there a special request process we must follow?

Best regards
Wolfgang

-----Original Message-----
From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tue 17.07.2012 15:13
To: user@manifoldcf.apache.org
Subject: Re: How to import data from Oracle to Solr

So if I understand correctly ...

1) ... all mappings added to the Solr Field Mapping tab are ignored in the case of a JDBC resource connector?

Not exactly - the mappings aren't ignored; there just isn't any metadata associated with a JDBC connector document, so the mappings never apply.

Regardless, I am glad you got the rest worked out.

Karl

On Tue, Jul 17, 2012 at 9:09 AM, Wolfgang Schreiber wolfgang.schrei...@isb-ag.de wrote:

Hello Karl,

thank you very much for your quick answer! So if I understand correctly ...

1) ... all mappings added to the Solr Field Mapping tab are ignored in the case of a JDBC resource connector?
2) Our data query must look something like this (given that || is Oracle's concatenation operator):

SELECT ID AS $(IDCOLUMN),
       ADDRESS_URL AS $(URLCOLUMN),
       'ZIP:' || ZIP || ';city:' || CITY || ';street:' || STREET AS $(DATACOLUMN)
FROM ADDRESS WHERE ID IN $(IDLIST)

This would result in DATACOLUMN values like:

ZIP:70173;city:Stuttgart;street:Heilbronner

We tried this statement and we got the data into the text field of our Solr index. It seems we are one step further! Thank you for your help!

Best regards
Wolfgang

-----Original Message-----
From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tue 17.07.2012 12:42
To: user@manifoldcf.apache.org
Subject: Re: How to import data from Oracle to Solr

Hi Wolfgang,

ManifoldCF is meant to handle a binary document and its metadata. You must provide the document; metadata is optional. The JDBC connector does not currently support metadata. In order to index this, therefore, you will need to decide what should go into your binary document from your database fields. You can append multiple fields together into one document by means of SQL, e.g. the CONCAT operator or its Oracle equivalent. This would then go into one field in Solr, which is what you'd search on.

Alternatively, if you really need separate indexed fields in Solr for search reasons, you can request a JDBC connector enhancement to add metadata support. You'd still need a binary document, although you could return a blank value for that.

So I guess the answer depends on what you are trying to do on the whole.

Karl

On Tue, Jul 17, 2012 at 6:27 AM, Wolfgang Schreiber wolfgang.schrei...@isb-ag.de wrote:

Hello,

we are trying to ingest data from an Oracle database into Solr. We managed to insert docs into Solr, but only document IDs are inserted and no other data fields. Can you provide an example of how to set up the import job in ManifoldCF?
Assume we have the following initial situation:

1) Our Oracle table looks something like:

ADDRESS
--------
ID      NUMBER
ZIP     NUMBER
CITY    VARCHAR(2)
STREET  VARCHAR(2)

2) In Solr's schema.xml we added the following fields for the database columns:

...
<field name="ZIP" type="int" indexed="true" stored="true" />
<field name="City" type="string" indexed="true" stored="true" />
<field name="Street" type="string" indexed="true" stored="true" />
...

So here are our questions:

* How do we have to set up the queries for the ManifoldCF job? In particular, how exactly must the seeding query and the data query look?
* How do the Solr field mappings look?

We read your online documentation as well as your MEAP book but could not find a working example for a successful import from Oracle to Solr. Any help is welcome!

Best regards
Wolfgang
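The PatternTokenizerFactory approach from earlier in this thread pulls a single field out of the concatenated DATACOLUMN value with a regular expression. That pattern can be sanity-checked outside Solr; here is a quick Python check against the sample value from the thread (for this simple pattern, Python's re semantics match the Java regex behavior Solr uses):

```python
import re

# Pattern from the cityType tokenizer: ".+city:(.+);.*" with group=1.
# Greedy backtracking makes the capture group end at the last ";"
# that still leaves the literal ";" matchable, i.e. right after the city.
value = "ZIP:70173;city:Stuttgart;street:Heilbronner"
match = re.match(r".+city:(.+);.*", value)
city = match.group(1) if match else None
print(city)  # → Stuttgart
```

Note the pattern is case-sensitive: it only works if the data query emits the lowercase `city:` label that the regex expects.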
Re: Repeated service interruptions
Hi Abe-san,

Sometimes what looks like a server error can actually be due to the domain controller. I wonder if the domain controller needs to be rebooted?

Karl

On Thu, Jul 19, 2012 at 5:12 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi Karl,

Thank you for the reply. I tried to reduce the maximum number of connections from 10 to 5, but that didn't avoid the busy error. I'll try to reduce it more. Thank you.

Shinichiro Abe

On 2012/07/19, at 15:55, Karl Wright wrote:

Hi Abe-san,

The "all pipe instances are busy" error is coming from the Windows server you are trying to crawl. I don't know what is happening there, but here are some possibilities:

(1) The Windows server is just overloaded; you can try reducing the maximum number of connections to 2 or 3 to see if that helps.
(2) The Windows server needs rebooting.

Thanks,
Karl

On Wed, Jul 18, 2012 at 10:09 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi,

I use the Windows Shares connector and ran a job. The job aborted without completing normally, and the job's status said:

Error: Repeated service interruptions - failure processing document: Read timed out

Why was the job aborted? I use ManifoldCF 0.5.1 and the latest version of jcifs.jar. Is the crawled server busy? I think the server MCF is installed on is not busy; the other servers which MCF crawls seem to be busy. How can I run the job without errors? What's wrong?

The logs of the connector:

WARN 2012-07-12 16:28:52,648 (Worker thread '19') - JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy.
  at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
  at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
..
WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: Possibly transient exception detected on attempt 3 while getting share security: All pipe instances are busy.
..
WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: 'Busy' response when getting document version for smb://XX.XX.XX.XX/D$/abcde/1234/123456789/e123456789a.pdf: retrying...
..
WARN 2012-07-12 16:36:37,585 (Worker thread '19') - Pre-ingest service interruption reported for job 1342076182624 connection 'Windows shares': Timeout or other service interruption: All pipe instances are busy.
..
WARN 2012-07-12 19:14:30,335 (Worker thread '19') - Service interruption reported for job 1342076182624 connection 'Windows shares': Ingestion API socket timeout exception waiting for response code: Read timed out; ingestion will be retried again later
..
WARN 2012-07-12 20:43:50,210 (Worker thread '19') - Service interruption reported for job 1342076182624 connection 'Windows shares': Ingestion API socket timeout exception waiting for response code: Read timed out; ingestion will be retried again later
..
ERROR 2012-07-12 20:43:50,210 (Worker thread '19') - Exception tossed: Repeated service interruptions - failure processing document: Read timed out
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Read timed out
  at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:606)
Caused by: java.net.SocketTimeoutException: Read timed out
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.read(Unknown Source)
  at java.net.SocketInputStream.read(Unknown Source)
  at org.apache.manifoldcf.agents.output.solr.HttpPoster.readLine(HttpPoster.java:571)
  at org.apache.manifoldcf.agents.output.solr.HttpPoster.getResponse(HttpPoster.java:598)

Thanks in advance,
Shinichiro Abe
RE: How to import data from Oracle to Solr
Hi Wolfgang,

Looking at the code, it turns out I was wrong about metadata support being absent from the connector. Sorry for the confusion. The way it works is that any column returned by the data query that is not a required return column is considered to be metadata, with a field name corresponding to the return column name. So in addition to returning a URL and a binary value, you can also return any single-valued metadata you need.

Please let me know if this works for you.

Karl

Sent from my Windows Phone

-----Original Message-----
From: Wolfgang Schreiber
Sent: 7/20/2012 7:47 AM
To: user@manifoldcf.apache.org
Subject: AW: How to import data from Oracle to Solr
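Following Karl's description, a data query with extra (non-required) return columns might look like the sketch below. The table and column names are taken from earlier in the thread; whether this exact shape works is something to verify against the connector, so treat it as an illustration rather than a tested configuration:

```sql
SELECT ID AS $(IDCOLUMN),
       ADDRESS_URL AS $(URLCOLUMN),
       CITY || ', ' || STREET AS $(DATACOLUMN),
       ZIP,      -- extra columns: passed along as metadata,
       CITY,     -- with field names matching the column names
       STREET
FROM ADDRESS WHERE ID IN $(IDLIST)
```

If this works as described, each extra column should then be mappable to a typed Solr field via the output connection's field mapping, avoiding the copyField/PatternTokenizer workaround and its all-text-fields drawback.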
Re: Repeated service interruptions
Hi Abe-san,

Did you figure out what the problem was?

Karl

On Thu, Jul 19, 2012 at 5:52 AM, Karl Wright daddy...@gmail.com wrote:

Hi Abe-san,

Sometimes what looks like a server error can actually be due to the domain controller. I wonder if the domain controller needs to be rebooted?

Karl

On Thu, Jul 19, 2012 at 5:12 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi Karl,

Thank you for the reply. I tried to reduce the maximum number of connections from 10 to 5, but that didn't avoid the busy error. I'll try to reduce it more. Thank you.

Shinichiro Abe

On 2012/07/19, at 15:55, Karl Wright wrote:

Hi Abe-san,

The "all pipe instances are busy" error is coming from the Windows server you are trying to crawl. I don't know what is happening there, but here are some possibilities:

(1) The Windows server is just overloaded; you can try reducing the maximum number of connections to 2 or 3 to see if that helps.
(2) The Windows server needs rebooting.

Thanks,
Karl

On Wed, Jul 18, 2012 at 10:09 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi,

I use the Windows Shares connector and ran a job. The job aborted without completing normally, and the job's status said:

Error: Repeated service interruptions - failure processing document: Read timed out

Why was the job aborted? I use ManifoldCF 0.5.1 and the latest version of jcifs.jar. Is the crawled server busy? I think the server MCF is installed on is not busy; the other servers which MCF crawls seem to be busy. How can I run the job without errors? What's wrong?

The logs of the connector:

WARN 2012-07-12 16:28:52,648 (Worker thread '19') - JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy.
  at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
  at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
..
WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: Possibly transient exception detected on attempt 3 while getting share security: All pipe instances are busy.
..
WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: 'Busy' response when getting document version for smb://XX.XX.XX.XX/D$/abcde/1234/123456789/e123456789a.pdf: retrying...
..
WARN 2012-07-12 16:36:37,585 (Worker thread '19') - Pre-ingest service interruption reported for job 1342076182624 connection 'Windows shares': Timeout or other service interruption: All pipe instances are busy.
..
WARN 2012-07-12 19:14:30,335 (Worker thread '19') - Service interruption reported for job 1342076182624 connection 'Windows shares': Ingestion API socket timeout exception waiting for response code: Read timed out; ingestion will be retried again later
..
WARN 2012-07-12 20:43:50,210 (Worker thread '19') - Service interruption reported for job 1342076182624 connection 'Windows shares': Ingestion API socket timeout exception waiting for response code: Read timed out; ingestion will be retried again later
..
ERROR 2012-07-12 20:43:50,210 (Worker thread '19') - Exception tossed: Repeated service interruptions - failure processing document: Read timed out
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Read timed out
  at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:606)
Caused by: java.net.SocketTimeoutException: Read timed out
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.read(Unknown Source)
  at java.net.SocketInputStream.read(Unknown Source)
  at org.apache.manifoldcf.agents.output.solr.HttpPoster.readLine(HttpPoster.java:571)
  at org.apache.manifoldcf.agents.output.solr.HttpPoster.getResponse(HttpPoster.java:598)

Thanks in advance,
Shinichiro Abe
Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5
There should be no differences between crawling using MySQL as the database and PostgreSQL, on the same version of ManifoldCF. We include an RSS crawling test which finds exactly the expected number of documents on MySQL; this is a 100,000 document crawl. There are no back-end-specific logic differences in the web connector that would be expected to yield different results based on the back-end database.

If you believe you have found a difference between MySQL and PostgreSQL, I suggest the following:

(1) Make sure that the repository connections and job definitions are indeed identical between MySQL and PostgreSQL.
(2) See if you can locate an example document that was crawled with PostgreSQL but not crawled with MySQL.
(3) If you create a second web connection and job under MySQL, and run the job to completion, does the document that was not included get skipped again? Or does it seem random which documents are skipped on each run?

Thanks,
Karl

On Sun, Jul 29, 2012 at 9:51 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

Aren't there some differences in crawling logic between MySQL and PostgreSQL? I did some tests on web crawling using both MySQL and PostgreSQL. MCF 0.5 running on MySQL indexed around 6000 documents; meanwhile, MCF 0.5 running on PostgreSQL indexed over 12000 documents. MCF 0.6 running on MySQL indexed around 6000. MCF 0.4 running on PostgreSQL indexed over 12000 documents.

Each number of indexed documents above is the result of a first crawl after deleting the indexing history from the DB. It seems that changing the DB affects crawling and indexing.

Regards,
Shigeki

2012/7/27 Karl Wright daddy...@gmail.com

There was a bug fixed in the way hopcount was being computed. See CONNECTORS-464. This means that fewer documents are left in the queue, but the number of indexed documents should be the same.

Karl

On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

Hi guys.
I wonder if anyone has ever seen the number of crawled documents on a web crawl differ between MCF 0.4 and MCF 0.5. I crawled some portal sites on our intranet using MCF 0.4 and MCF 0.5. MCF 0.4 crawled over 12000 documents; meanwhile, MCF 0.5 crawled only around half of them. I ran MCF 0.4 on PostgreSQL and MCF 0.5 on MySQL. I hope changing the DB does not affect the crawling results:

MCF0.4:
- Crawled Counts: 12000 and over
- Solr 3.5
- PostgreSQL 9.1.3
- Tomcat 6
- Max Hop on Links: 15
- Max Hop on Redirects: 10
- Include only hosts matching seeds: checked
- org.apache.manifoldcf.crawler.threads: 50
- org.apache.manifoldcf.database.maxhandles: 100

MCF0.5:
- Crawled Counts: around 6000
- Solr 3.5
- MySQL 5.5
- Tomcat 6
- Max Hop on Links: 15
- Max Hop on Redirects: 10
- Include only hosts matching seeds: checked
- org.apache.manifoldcf.crawler.threads: 50
- org.apache.manifoldcf.database.maxhandles: 100

Does anyone have any ideas?

--
SoftBank Mobile Corp.
Information Systems Division, System Services Department, Service Planning Section
Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp
Re: Repeated service interruptions
On Wed, Aug 1, 2012 at 5:48 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi Karl,

I still have a problem. I reduced the maximum number of connections to 2. I rebooted the file server, not the domain controller. When I configured the paths[1], the log showed no error and the share drive connector crawled the files successfully. When I left the path config at the default (matching *), the log showed the "all pipe instances are busy" error. Both path configs pointed at the same location.

Also, when this error occurred, watching the ingestion log, HttpPoster was waiting for the response stream, couldn't get a response from Solr, and threw SocketTimeoutException. I increased jcifs.smb.client.responseTimeout but it still threw the exception. On Solr, Jetty threw SocketException (socket write error). I'm working on checking the Solr logs. Solr may be doing something wrong when running /update/extract.

If Solr threw the exception this sounds likely.

Have you seen something like this? Does the path-matching config affect these errors?

[1] Paths tab: Include directory(s) matching /01*

This should have nothing to do with socket exceptions, except possibly that the crawler winds up trying to read a file that isn't actually a file but is something else, like a named pipe or something. This typically doesn't happen if the server is a Windows machine, but if it is a Samba server I could imagine something like that happening.

Karl

P.S. Thank you for fixing CONNECTORS-494. I checked the trunk code; it worked well.

Thank you,
Shinichiro Abe

On 2012/07/24, at 22:13, Karl Wright wrote:

Hi Abe-san,

Did you figure out what the problem was?

Karl

On Thu, Jul 19, 2012 at 5:52 AM, Karl Wright daddy...@gmail.com wrote:

Hi Abe-san,

Sometimes what looks like a server error can actually be due to the domain controller. I wonder if the domain controller needs to be rebooted?

Karl

On Thu, Jul 19, 2012 at 5:12 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi Karl,

Thank you for the reply.
I tried to reduce the maximum number of connections from 10 to 5, but that didn't avoid the busy error. I'll try to reduce it more. Thank you.

Shinichiro Abe

On 2012/07/19, at 15:55, Karl Wright wrote:

Hi Abe-san,

The "all pipe instances are busy" error is coming from the Windows server you are trying to crawl. I don't know what is happening there, but here are some possibilities:

(1) The Windows server is just overloaded; you can try reducing the maximum number of connections to 2 or 3 to see if that helps.
(2) The Windows server needs rebooting.

Thanks,
Karl

On Wed, Jul 18, 2012 at 10:09 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi,

I use the Windows Shares connector and ran a job. The job aborted without completing normally, and the job's status said:

Error: Repeated service interruptions - failure processing document: Read timed out

Why was the job aborted? I use ManifoldCF 0.5.1 and the latest version of jcifs.jar. Is the crawled server busy? I think the server MCF is installed on is not busy; the other servers which MCF crawls seem to be busy. How can I run the job without errors? What's wrong?

The logs of the connector:

WARN 2012-07-12 16:28:52,648 (Worker thread '19') - JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy.
  at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
  at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
..
WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: Possibly transient exception detected on attempt 3 while getting share security: All pipe instances are busy.
..
WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: 'Busy' response when getting document version for smb://XX.XX.XX.XX/D$/abcde/1234/123456789/e123456789a.pdf: retrying...
..
WARN 2012-07-12 16:36:37,585 (Worker thread '19') - Pre-ingest service interruption reported for job 1342076182624 connection 'Windows shares': Timeout or other service interruption: All pipe instances are busy.
..
WARN 2012-07-12 19:14:30,335 (Worker thread '19') - Service interruption reported for job 1342076182624 connection 'Windows shares': Ingestion API socket timeout exception waiting for response code: Read timed out; ingestion will be retried again later
..
WARN 2012-07-12 20:43:50,210 (Worker thread '19') - Service interruption reported for job 1342076182624 connection 'Windows shares': Ingestion API socket timeout exception waiting for response code: Read timed out; ingestion will be retried again later
..
ERROR 2012-07-12 20:43:50,210 (Worker thread '19') - Exception tossed: Repeated service interruptions - failure processing document: Read timed out
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Read timed out
  at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:606)
Re: SharePoint Library consist of folders
In that case, you will need to wait until CONNECTORS-492 is resolved. Because of SharePoint's lack of support for accessing large libraries via the Lists service, we're having to write our own. But this is not yet ready, although we are getting closer to trying it out soon. Karl On Thu, Aug 2, 2012 at 9:57 AM, Ahmet Arslan iori...@yahoo.com wrote: The DspSts service lists all the files in a library, including those that have a folder path. I believe the Lists service does the same. So you should see all the files crawled, including those that are within folders. Please let me know if this seems not to be the case. It seems that the Lists service does not list files that have a folder path. Instead it lists folder paths. Currently MCF injects one SolrDocument per folder, and sends all files under it to the extracting update handler. I was reading this: http://sympmarc.com/2011/03/28/listing-folders-in-a-sharepoint-list-or-library-with-spservices/ I am attaching a response example that contains 9 items (1 file and 8 folders)
Re: SharePoint Library consist of folders
I checked this change into trunk, and also added corresponding code in the place where fields and metadata are fetched. This may work for you in the interim while we're finishing up CONNECTORS-492. Karl On Fri, Aug 3, 2012 at 8:11 AM, Ahmet Arslan iori...@yahoo.com wrote: Hello, I found that there is a queryOptions parameter for this: <ViewAttributes Scope="Recursive" />. http://msdn.microsoft.com/en-us/library/lists.lists.getlistitems.aspx If I add these three lines to SPSProxyHelper#buildPagingQueryOptions():
MessageElement viewAttributesNode = new MessageElement((String)null, "ViewAttributes");
queryOptionsNode.addChild(viewAttributesNode);
viewAttributesNode.addAttribute(null, "Scope", "Recursive");
return rval;
SPSProxyHelper#getDocuments() returns the expected results: /Documents/Vekaletname.pdf, /Documents/ik_docs/diger/diger_dilekceler/aile_yardimi_almaz_dilecesi.doc, /Documents/ik_docs/diger/fonksiyonel_ekipman_talep_formu.doc, ... But the SPSProxyHelper#getFieldValues() method works only for docId=/Documents/Vekaletname.pdf and returns an empty map for the others. Therefore only that one is injected. --- On Thu, 8/2/12, Ahmet Arslan iori...@yahoo.com wrote: From: Ahmet Arslan iori...@yahoo.com Subject: RE: SharePoint Library consist of folders To: user@manifoldcf.apache.org Date: Thursday, August 2, 2012, 10:38 PM This replaces the getlistitems call in SPSProxyHelper with a custom method call. Once 492 is in place, is it going to list files that have a folder path too? Without checking the value of the ows_FSObjType attribute?
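The MessageElement calls quoted above build a SOAP queryOptions fragment for the Lists.GetListItems call. Purely for illustration, this sketch produces the equivalent XML with the JDK's own DOM API (element and attribute names follow the MSDN page linked in the thread; this is not the Axis code itself):

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Main {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        // Equivalent of the three MessageElement lines in the thread:
        Element queryOptions = doc.createElement("QueryOptions");
        Element viewAttributes = doc.createElement("ViewAttributes");
        viewAttributes.setAttribute("Scope", "Recursive");  // descend into folders
        queryOptions.appendChild(viewAttributes);
        doc.appendChild(queryOptions);

        // Serialize to see the fragment that gets sent in the SOAP request.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out);
    }
}
```

With `Scope="Recursive"`, GetListItems returns items inside folders rather than the folder entries themselves, which is why getDocuments() starts seeing paths like /Documents/ik_docs/... above.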
Re: Document Security Modification Requirement during Indexing
Well, you can either modify the document's ACLs in the Tika pipeline (which I think would be easiest), or you can hack up the Apache ManifoldCF Solr Plugin. Those seem like your only real choices to me. I would choose the former, since Tika is meant to be configured in this way. Karl On Tue, Aug 14, 2012 at 12:44 AM, Anupam Bhattacharya anupam...@gmail.com wrote: In our application there is a requirement to change the security on documents in the index/search app vs. the Documentum repository, so that users who don't have login access to the Documentum system can also view certain documents in the world-browse-permission scenario. An additional constraint is that we cannot change the ACLs on the Documentum repository; the ManifoldCF Authority service should work as it is. I can think of 2 options to approach this case. 1. I have a separate Solr servlet which is indexing documents via ManifoldCF to Solr, so this is one place where I can make some modifications to add read security tokens to the special documents. 2. Make some modifications in the ManifoldCF Authority Service connector so that those special documents don't get filtered. Thanks for any help on this requirement. Regards Anupam
Re: Crawling MySQL with latest MySQL connector fails
There's some online chatter about this. Apparently the JDBC 4.0 specification was clarified in this regard, and MySQL's implementation follows the clarified meaning. The recommended approach during this transition period is to allow the user to select which method they want to use to get the column name. See CONNECTORS-509. Karl On Mon, Aug 20, 2012 at 8:00 AM, Karl Wright daddy...@gmail.com wrote: Here's some additional info. The JDBC class ResultSetMetaData has two methods: getColumnName(), and getColumnLabel(). For all supported databases, getColumnName() returns the right thing, EXCEPT for MySQL, where you have to use getColumnLabel() instead. I've abstracted the logic that does this in the main database classes that underlie the framework, but for the JDBC Connector I tried to make the connector be independent of any special logic, and instead make it the responsibility of the query writer to know how to work with their database. Unfortunately, that strategy is failing when it comes to MySQL because the JDBC driver is implemented in a way that is inconsistent with the specification. So, we have two ways forward: (1) Change the logic to use getColumnLabel() always. If we do this, it will be necessary to test the JDBC connector against a PostgreSQL, MySQL, and MSSQL database before we know what the effects are. It is possible everything will just work, but it is also possible that such a change would break other people's jobs, and that would be no good. (2) Try to conditionalize the logic so that only for MySQL is getColumnLabel() used. This is less risky but results in messy code that sooner or later would become unmaintainable. Karl On Mon, Aug 20, 2012 at 6:22 AM, Karl Wright daddy...@gmail.com wrote: Hi Shigeki, This is critical functionality for ManifoldCF. Quite a lot of ManifoldCF stuff won't work on MySQL if this is broken - not just crawling using the JDBC connector. Are you successfully crawling with MySQL as the back-end? 
If you are, that means that there is a way to do this right but the JDBC connector is not using it. I am testing with MySQL JDBC connector 5.1.18 here, which would indicate that that is the case. Could you open a ticket describing the problem, and I will look into this in some detail tonight? Thanks, Karl On Mon, Aug 20, 2012 at 4:21 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi guys. I am not sure if everyone has already noticed this, but this is to share an experimental fact about using MySQL connectors to crawl MySQL data. Using AS in SELECT queries in the SeedQuery and DataQuery causes an error depending on the version of the MySQL connector. Env: - ManifoldCF 0.5 - Solr 3.6 - MySQL 5.5 Example: SeedQuery: SELECT idfield AS $(IDCOLUMN) FROM documenttable Error Message: Bad seed query; doesn't return $(IDCOLUMN) column. Try using quotes around the $(IDCOLUMN) variable, e.g. "$(IDCOLUMN)". Cause of Error: MySQL connectors above version 5.1 seem to have a bug that causes an error when you use AS in a SELECT to put an alias on a column. Versions of MySQL Connector: mysql-connector-java-5.0.8.jar - OK mysql-connector-java-5.1.18.jar - No Good mysql-connector-java-5.1.21.jar - No Good Exception: a function (e.g. sysdate() AS ...) or a fixed string (e.g. 'fixed string' AS ...) followed by AS does not cause the error. Regards, Shigeki
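Option (2) from Karl's earlier message — using getColumnLabel() only for drivers that need it — can be sketched as below. The helper name and the Proxy-based fake metadata are hypothetical, used only to show why a query like SELECT idfield AS $(IDCOLUMN) breaks when the driver's getColumnName() returns the underlying column name instead of the alias:

```java
import java.lang.reflect.Proxy;
import java.sql.ResultSetMetaData;

public class Main {
    // Hypothetical helper illustrating the conditional approach: consult
    // getColumnLabel() (the AS alias) only when the driver requires it.
    static String columnIdentifier(ResultSetMetaData md, int col, boolean useLabel) throws Exception {
        return useLabel ? md.getColumnLabel(col) : md.getColumnName(col);
    }

    // Fake metadata for "SELECT idfield AS id", behaving like MySQL
    // Connector/J >= 5.1: name = real column, label = alias.
    static ResultSetMetaData fakeMySqlMetaData() {
        return (ResultSetMetaData) Proxy.newProxyInstance(
            ResultSetMetaData.class.getClassLoader(),
            new Class<?>[]{ResultSetMetaData.class},
            (proxy, method, args) -> {
                if (method.getName().equals("getColumnName")) return "idfield"; // underlying column
                if (method.getName().equals("getColumnLabel")) return "id";     // the AS alias
                return null;
            });
    }

    public static void main(String[] args) throws Exception {
        ResultSetMetaData md = fakeMySqlMetaData();
        // Looking up the alias via getColumnName() fails to find $(IDCOLUMN),
        // which is what produces the "Bad seed query" error above.
        System.out.println(columnIdentifier(md, 1, false));  // idfield
        System.out.println(columnIdentifier(md, 1, true));   // id
    }
}
```

Older drivers (e.g. 5.0.8) returned the alias from getColumnName() as well, which is why the same query works there.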
Re: Job crawling SharePoint repository does not end
You will need the SharePoint-2010 plugin, also. You can check that out at: https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2010/trunk ... and follow the README.txt directions. Thanks! Karl On Tue, Sep 4, 2012 at 6:31 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Yes, this is SharePoint 2010 OK, then I'll try switching to trunk and start working with it. Thanks for the information, Karl. Thanks and Regards, Swapna. On Tue, Sep 4, 2012 at 3:44 PM, Karl Wright daddy...@gmail.com wrote: Hi - What version of SharePoint are you trying to crawl? If this is SharePoint 2010, development is underway and you will have to use trunk. Karl On Tue, Sep 4, 2012 at 5:26 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, Am trying to use SharePoint connector of ManifoldCF for the first time and am having couple of issues. Can someone please help me in successfully crawling these repositories ? Am using ManifoldCF version 0.6 and I see that the SharePoint connector is readily available for use. I have defined a Repository Connection of SharePoint type for the URL https://mysite.arup.com/personal/swapna_vuppala/default.aspx; and the connection status shows Connection working. I have got a couple of documents in the libraries Shared Documents and Personal Documents and am interested in indexing them into Solr. Now when I try to define a job using the above created repository connection and a Solr output connection, am able to add rules to include the libraries I have got. When I start the job, the number listed in Documents column is coming correctly, but the job never ends. It is always in the Running state. I cannot see anything in Simple History except the Job Start. The manifoldcf log file shows something like WARN 2012-09-04 14:39:05,204 (Worker thread '1') - Service interruption reported for job 1346736412103 connection 'Test SharePoint': Remote procedure exception: Request is empty. 
Can someone please tell me if I am missing some steps or some configuration? Thanks and Regards, Swapna.
Re: Job crawling SharePoint repository does not end
Also, please be certain to look at CONNECTORS-492, which applies to SharePoint 2010. It may not affect you, but if it does, bear in mind we have not completed development on it yet. Karl On Tue, Sep 4, 2012 at 6:48 AM, Karl Wright daddy...@gmail.com wrote: You will need the SharePoint-2010 plugin, also. You can check that out at: https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2010/trunk ... and follow the README.txt directions. Thanks! Karl On Tue, Sep 4, 2012 at 6:31 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Yes, this is SharePoint 2010 OK, then I'll try switching to trunk and start working with it. Thanks for the information, Karl. Thanks and Regards, Swapna. On Tue, Sep 4, 2012 at 3:44 PM, Karl Wright daddy...@gmail.com wrote: Hi - What version of SharePoint are you trying to crawl? If this is SharePoint 2010, development is underway and you will have to use trunk. Karl On Tue, Sep 4, 2012 at 5:26 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, Am trying to use SharePoint connector of ManifoldCF for the first time and am having couple of issues. Can someone please help me in successfully crawling these repositories ? Am using ManifoldCF version 0.6 and I see that the SharePoint connector is readily available for use. I have defined a Repository Connection of SharePoint type for the URL https://mysite.arup.com/personal/swapna_vuppala/default.aspx; and the connection status shows Connection working. I have got a couple of documents in the libraries Shared Documents and Personal Documents and am interested in indexing them into Solr. Now when I try to define a job using the above created repository connection and a Solr output connection, am able to add rules to include the libraries I have got. When I start the job, the number listed in Documents column is coming correctly, but the job never ends. It is always in the Running state. I cannot see anything in Simple History except the Job Start. 
The manifoldcf log file shows something like WARN 2012-09-04 14:39:05,204 (Worker thread '1') - Service interruption reported for job 1346736412103 connection 'Test SharePoint': Remote procedure exception: Request is empty. Can someone please tell me if am missing some steps or configuration of something ?? Thanks and Regards, Swapna.
Re: Job crawling SharePoint repository does not end
There is a SharePoint-2010 plugin 0.1 release candidate available now on http://people.apache.org/~kwright . This might save you some time. Karl On Thu, Sep 6, 2012 at 12:47 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Thanks Karl, I'll try and get the new build and use it shortly. Thanks and Regards, Swapna. On Wed, Sep 5, 2012 at 11:01 PM, Karl Wright daddy...@gmail.com wrote: FWIW, CONNECTORS-492 was just completed, and merged into trunk. You will need a new build of the SharePoint-2010 plugin to use it. Thanks, Karl On Tue, Sep 4, 2012 at 7:34 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I'll make sure to look at the things you had mentioned. Thanks again for the information. Thanks and Regards, Swapna. On Tue, Sep 4, 2012 at 4:19 PM, Karl Wright daddy...@gmail.com wrote: Also, please be certain to look at CONNECTORS-492, which applies to SharePoint 2010. It may not affect you, but if it does, bear in mind we have not completed development on it yet. Karl On Tue, Sep 4, 2012 at 6:48 AM, Karl Wright daddy...@gmail.com wrote: You will need the SharePoint-2010 plugin, also. You can check that out at: https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2010/trunk ... and follow the README.txt directions. Thanks! Karl On Tue, Sep 4, 2012 at 6:31 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Yes, this is SharePoint 2010 OK, then I'll try switching to trunk and start working with it. Thanks for the information, Karl. Thanks and Regards, Swapna. On Tue, Sep 4, 2012 at 3:44 PM, Karl Wright daddy...@gmail.com wrote: Hi - What version of SharePoint are you trying to crawl? If this is SharePoint 2010, development is underway and you will have to use trunk. Karl On Tue, Sep 4, 2012 at 5:26 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, Am trying to use SharePoint connector of ManifoldCF for the first time and am having couple of issues. Can someone please help me in successfully crawling these repositories ? 
Am using ManifoldCF version 0.6 and I see that the SharePoint connector is readily available for use. I have defined a Repository Connection of SharePoint type for the URL https://mysite.arup.com/personal/swapna_vuppala/default.aspx; and the connection status shows Connection working. I have got a couple of documents in the libraries Shared Documents and Personal Documents and am interested in indexing them into Solr. Now when I try to define a job using the above created repository connection and a Solr output connection, am able to add rules to include the libraries I have got. When I start the job, the number listed in Documents column is coming correctly, but the job never ends. It is always in the Running state. I cannot see anything in Simple History except the Job Start. The manifoldcf log file shows something like WARN 2012-09-04 14:39:05,204 (Worker thread '1') - Service interruption reported for job 1346736412103 connection 'Test SharePoint': Remote procedure exception: Request is empty. Can someone please tell me if am missing some steps or configuration of something ?? Thanks and Regards, Swapna.
RE: Job crawling SharePoint repository does not end
The difference is SharePoint 2010, which disabled a number of key features that were necessary for crawling. For SharePoint 2010, the plugin is indeed mandatory. Karl Sent from my Windows Phone -- From: Swapna Vuppala Sent: 9/10/2012 7:54 AM To: user@manifoldcf.apache.org Subject: Re: Job crawling SharePoint repository does not end Hi Karl, I have got the SharePoint-2010 plugin, but I have a couple of doubts before using this. When I was using ManifoldCF version 0.6, I tried defining repository connections and crawling documents on them by running jobs without installing anything on the SharePoint server. I thought I was just using the connector mcf-sharepoint-connector.jar, which is on the machine running ManifoldCF, and I was of the assumption that I would be able to crawl documents on any SharePoint server for which I have access permissions. I was of the opinion that I don't have to be a SharePoint administrator and also that I don't have to install anything on the SharePoint server. But looking at this plug-in, I think I have been of the wrong opinion. Can you please clarify whether installation of these web services on the SharePoint server is mandatory, just for being able to crawl them and index into Solr? Why is it different from the connector I was using in ManifoldCF 0.6? Thanks and Regards, Swapna. On Thu, Sep 6, 2012 at 7:17 PM, Karl Wright daddy...@gmail.com wrote: There is a SharePoint-2010 plugin 0.1 release candidate available now on http://people.apache.org/~kwright . This might save you some time. Karl On Thu, Sep 6, 2012 at 12:47 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Thanks Karl, I'll try and get the new build and use it shortly. Thanks and Regards, Swapna. On Wed, Sep 5, 2012 at 11:01 PM, Karl Wright daddy...@gmail.com wrote: FWIW, CONNECTORS-492 was just completed, and merged into trunk. You will need a new build of the SharePoint-2010 plugin to use it.
Thanks, Karl On Tue, Sep 4, 2012 at 7:34 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I'll make sure to look at the things you had mentioned. Thanks again for the information. Thanks and Regards, Swapna. On Tue, Sep 4, 2012 at 4:19 PM, Karl Wright daddy...@gmail.com wrote: Also, please be certain to look at CONNECTORS-492, which applies to SharePoint 2010. It may not affect you, but if it does, bear in mind we have not completed development on it yet. Karl On Tue, Sep 4, 2012 at 6:48 AM, Karl Wright daddy...@gmail.com wrote: You will need the SharePoint-2010 plugin, also. You can check that out at: https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2010/trunk ... and follow the README.txt directions. Thanks! Karl On Tue, Sep 4, 2012 at 6:31 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Yes, this is SharePoint 2010 OK, then I'll try switching to trunk and start working with it. Thanks for the information, Karl. Thanks and Regards, Swapna. On Tue, Sep 4, 2012 at 3:44 PM, Karl Wright daddy...@gmail.com wrote: Hi - What version of SharePoint are you trying to crawl? If this is SharePoint 2010, development is underway and you will have to use trunk. Karl On Tue, Sep 4, 2012 at 5:26 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, Am trying to use SharePoint connector of ManifoldCF for the first time and am having couple of issues. Can someone please help me in successfully crawling these repositories ? Am using ManifoldCF version 0.6 and I see that the SharePoint connector is readily available for use. I have defined a Repository Connection of SharePoint type for the URL https://mysite.arup.com/personal/swapna_vuppala/default.aspx; and the connection status shows Connection working. I have got a couple of documents in the libraries Shared Documents and Personal Documents and am interested in indexing them into Solr. 
Now when I try to define a job using the above created repository connection and a Solr output connection, I am able to add rules to include the libraries I have got. When I start the job, the number listed in the Documents column is coming correctly, but the job never ends. It is always in the Running state. I cannot see anything in Simple History except the Job Start. The manifoldcf log file shows something like WARN 2012-09-04 14:39:05,204 (Worker thread '1') - Service interruption reported for job 1346736412103 connection 'Test SharePoint': Remote procedure exception: Request is empty. Can someone please tell me if I am missing some steps or some configuration? Thanks and Regards, Swapna.
Re: Does anyone use MOSS?
I don't know of any difference from a SharePoint standpoint between MOSS and WSS, except for additional Office-related plugins on MOSS. "Connection working" means you could get to SharePoint at least. Can you look in the log and find the exception associated with the "Cannot open the requested Sharepoint Site" error? It should give a clue as to what the connector is trying to do at that time. Thanks, Karl On Wed, Oct 10, 2012 at 1:50 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi, I think MCF supports Windows SharePoint Services (WSS); does MCF also support Microsoft Office SharePoint Server (MOSS)? I tried to crawl MOSS but couldn't, and got an error. I'm using MOSS 2007 out of the box. I have only an Administrator user. On the repository connection, the config said "connection working", but when crawling, the log said "Cannot open the requested Sharepoint Site." I couldn't find a server event log entry for this error. Any help please. Regards, Shinichiro Abe
Re: Web crawling causes Socket Timeout after Database Exception
Hi Shigeki, The socket timeout exception is only a warning. It means that some site you are crawling did not accept a socket connection within the allowed time (5 minutes I think). The Web Connector will retry the connection a few times, and if it is still rejected, it will eventually give up on that page. One thing you want to check, though, is that you are using proper throttling, because if you aren't, then one cause of this problem is that the webmaster of the site you are trying to crawl may have blocked you from accessing it. The database exception is more problematic. It means that MySQL thinks it took too long for a specific transaction to complete, and the database aborted the transaction due to a timeout. There are two ways of dealing with this issue. One way is to modify your MySQL configuration to increase the transaction timeout value to some high number. The second way is to modify ManifoldCF to recognize the timeout error specifically, and cause a retry. But in order to do the latter, I would need to know what SQL error code MySQL returns for this situation, which will mean we either need to look it up (if we can), or modify a ManifoldCF instance to log it when this problem occurs. Please let me know how you would like to proceed. Karl On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi, I am having trouble crawling the web using MCF 1.0. I run MCF with MySQL 5.5 and Tomcat 6.0. It should keep crawling contents, but MCF prints the following database exception log, then hangs. After the DB exception, a socket timeout exception occurs. Has anyone faced this problem?
--Database Exception log:
ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:6049)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessAcivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:6159)
at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:52)
at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:225)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:7047)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:6011)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1282)
at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
at ..
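The first remedy Karl describes above (raising the transaction timeout in the MySQL configuration) corresponds, assuming the InnoDB storage engine is in use, to the innodb_lock_wait_timeout server variable that triggers the "Lock wait timeout exceeded" error; its default is 50 seconds. A sketch of the change:

```ini
# my.cnf -- raise InnoDB's lock wait timeout (default is 50 seconds);
# the value 300 here is an illustrative choice, not a recommendation.
[mysqld]
innodb_lock_wait_timeout = 300
```

The server (or at least the session) must be restarted for the change to take effect.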
Re: Strange behaviour on internet free server
I take it that by "internet free" you mean a local network that is not connected to the internet? There should be no reason why ManifoldCF would not operate in such an environment. Can you describe the strange behavior you have been seeing? Karl On Mon, Oct 15, 2012 at 12:28 PM, Johan Persson perzzon.jo...@gmail.com wrote: I'm getting strange behaviour from a test-server I've set up in an Internet-free environment. Could this hinder Manifold from working properly? I'm using SharePoint, AD and Solr connections. Best Regards /J
Re: Strange behaviour on internet free server
:-) Until somebody starts selling support for ManifoldCF, I'm afraid it is just us volunteers. Karl On Tue, Oct 16, 2012 at 2:54 AM, Johan Persson perzzon.jo...@gmail.com wrote: Hmmm... Stupid me. I just restarted with a new Manifold installation and seem to have missed setting the SharePoint version correctly. Thanks for pointing that out. BTW, is there by any chance a paid support number to call? / Johan 2012/10/16 Karl Wright daddy...@gmail.com: If this is SharePoint 2010, you need to select SharePoint 4.0 (2010) in the pulldown. It looks like you have not done this, since you seem to be trying to use the SharePoint dspsts service, which does not work on SharePoint 2010. I don't know if this is the cause of your Solr problem, but it would certainly prevent progress on a SharePoint crawl. Thanks, Karl On Tue, Oct 16, 2012 at 2:31 AM, Johan Persson perzzon.jo...@gmail.com wrote: Didn't get anything into Solr. Also the job seems to end up in some kind of gridlock with Solr. (Never terminating until the solr-process is ) Read the Manifold log and found this at the bottom. Just thought that the reason for the strange behaviour was that a DTD or similar was not obtained. DEBUG 2012-10-15 09:23:50,531 (Worker thread '1') - Mapping Exception to AxisFault AxisFault faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Client.Dsp.Syntax faultSubcode: faultString: Request is empty. faultActor: faultNode: faultDetail: {http://xml.apache.org/axis/}stackTrace:Request is empty. ... ... WARN 2012-10-15 09:23:50,551 (Worker thread '1') - Service interruption reported for job 1350317841332 connection 'Sharepoint': Remote procedure exception: Request is empty. DEBUG 2012-10-15 09:23:58,499 (Idle cleanup thread) - Checking for connections, idleTimeout: 1350318178499
Re: Web crawling causes Socket Timeout after Database Exception
So, what was the resolution of this problem? Any news? Karl On Thu, Oct 11, 2012 at 2:28 AM, Karl Wright daddy...@gmail.com wrote: The only change is that the MySQL driver now performs ANALYZE operations on the fly in order to keep the database operating at high efficiency. This is CONNECTORS-510. It is possible that, on a large database table, these operations will cause others to wait long enough so that their timeout is exceeded. Such an event does not take place while the load tests run, however. If you want to turn off the analyze operation, you can do that by setting a per-table property to override the analyze default of 1 operations: analyzeThreshold = ManifoldCF.getIntProperty("org.apache.manifold.db.mysql.analyze."+tableName,1); The table in question is jobqueue. If you set this value to something like 10 and you still see MySQL timeouts, then this new code is not the problem. And, like I said, the best solution is to recognize the error and retry, but first I would need the error code. Adding an appropriate output of sqlState around line 123 of framework/core/src/main/java/org/apache/manifoldcf/core/database/DBInterfaceMySQL.java would allow us to see what code to catch, when it happened again. For the Web connector, the only modifications have been in regards to how it handles 500 errors, which now correctly avoid an IndexOutOfBoundsException. This has nothing to do with socket exceptions, which are caused for external reasons only. Karl On Wed, Oct 10, 2012 at 10:32 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi Karl, I was comparing version 1.0 with an old trunk based on version 0.6 implementing CONNECTORS-501 (Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents). Running each version with the same MySQL setting and the same throttling, somehow version 1.0 hangs with the error. Since the old trunk completes crawling, I wonder if something has changed.
Just to make sure I will recheck if there are any wrong settings in MCF. Thanks. Regards, Shigeki 2012/10/10 Karl Wright daddy...@gmail.com Hi Shigeki, The socket timeout exception is only a warning. It means that some site you are crawling did not accept a socket connection within the allowed time (5 minutes I think). The Web Connector will retry the connection a few times, and if it is still rejected, it will eventually give up on that page. One thing you want to check, though, is that you are using proper throttling, because if you aren't then one cause of this problem is that the webmaster of the site you are trying to crawl may have blocked you from accessing it. The database exception is more problematic. It means that MySQL thinks it took too long for a specific transaction to complete, and the database aborted the transaction due to a timeout. There are two ways of dealing with this issue. One way is to modify your MySQL configuration to increase the transaction timeout value to some high number. The second way is to modify ManifoldCF to recognize the timeout error specifically, and cause a retry. But in order to do the latter, I would need to know what SQL error code MySQL returns for this situation, which will mean we either need to look it up (if we can), or modify a ManifoldCF instance to log it when this problem occurs. Please let me know how you would like to proceed. Karl On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi I am having a trouble with crawling web using MCF1.0. I run MCF with MySQL 5.5 and Tomcat 6.0. It should keep crawling contents, but MCF prints the following Database exception log, then hangs. After DB Exception, Socket Time Exception occurs. Anyone has faced this problem? 
--Database Exception log:
ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852
Re: Web crawling causes Socket Timeout after Database Exception
I just looked in the code with svn for differences in the web connector from release 0.6. There is a change to the html parser to allow for handling default values for option tags, and a change that fixes an IndexOutOfBoundsException. Neither of these can possibly affect socket timeouts. I also looked at the solr connector (presuming that is what you are using as an output connector). No changes at all since 0.6. So honestly, I can see no significant changes whatsoever in how a web crawler indexing into Solr would behave. If you are seeing differences, therefore, I simply cannot account for them. Karl On Fri, Oct 19, 2012 at 5:01 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Due to the error, I had to downgrade to a lower version, so I haven't found the MySQL error code yet. I installed MCF 1.0 in a different environment where the crawlable contents are different from the above environment. I could not reproduce the database exception, but a socket timeout occurred. In the same environment, I ran MCF 0.6 and it completed crawling without a socket timeout. Like you said, the socket timeout seems to be a different problem from the database exception. 2012/10/18 Karl Wright daddy...@gmail.com So, what was the resolution of this problem? Any news? Karl On Thu, Oct 11, 2012 at 2:28 AM, Karl Wright daddy...@gmail.com wrote: The only change is that the MySQL driver now performs ANALYZE operations on the fly in order to keep the database operating at high efficiency. This is CONNECTORS-510. It is possible that, on a large database table, these operations will cause others to wait long enough so that their timeout is exceeded. Such an event does not take place while the load tests run, however.
If you want to turn off the analyze operation, you can do that by setting a per-table property to override the analyze default:

analyzeThreshold = ManifoldCF.getIntProperty("org.apache.manifold.db.mysql.analyze."+tableName,1);

The table in question is jobqueue. If you set this value to something like 10 and you still see MySQL timeouts, then this new code is not the problem. And, like I said, the best solution is to recognize the error and retry, but first I would need the error code. Adding an appropriate output of sqlState around line 123 of framework/core/src/main/java/org/apache/manifoldcf/core/database/DBInterfaceMySQL.java would allow us to see what code to catch, when it happened again. For the Web connector, the only modifications have been in regard to how it handles 500 errors, which are now correctly coded to avoid an IndexOutOfBoundsException. This has nothing to do with socket exceptions, which are caused for external reasons only. Karl On Wed, Oct 10, 2012 at 10:32 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi Karl, I was comparing version 1.0 with old trunk based on version 0.6 implementing CONNECTORS-501 (Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents). Running each version with the same MySQL setting and the same throttling, somehow version 1.0 hangs with the error. Since the old trunk completes crawling, I wonder if something has changed. Just to make sure I will recheck if there are any wrong settings in MCF. Thanks. Regards, Shigeki 2012/10/10 Karl Wright daddy...@gmail.com Hi Shigeki, The socket timeout exception is only a warning. It means that some site you are crawling did not accept a socket connection within the allowed time (5 minutes I think). The Web Connector will retry the connection a few times, and if it is still rejected, it will eventually give up on that page.
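Karl's second option above, teaching the driver to recognize the timeout and retry, hinges on knowing the SQL error code. MySQL reports "Lock wait timeout exceeded; try restarting transaction" as vendor error code 1205 (ER_LOCK_WAIT_TIMEOUT). A minimal sketch of such a check follows; the class and method names are illustrative, not actual ManifoldCF code:

```java
import java.sql.SQLException;

// Illustrative helper (not ManifoldCF code): classify a SQLException so
// that a caller could decide to retry the aborted transaction.
public class MySqlErrorClassifier {
    // MySQL's vendor error code for "Lock wait timeout exceeded;
    // try restarting transaction" (ER_LOCK_WAIT_TIMEOUT).
    private static final int ER_LOCK_WAIT_TIMEOUT = 1205;

    public static boolean isLockWaitTimeout(SQLException e) {
        return e.getErrorCode() == ER_LOCK_WAIT_TIMEOUT;
    }

    public static void main(String[] args) {
        // Simulate the exception the MySQL driver would raise.
        SQLException timeout = new SQLException(
            "Lock wait timeout exceeded; try restarting transaction",
            "HY000", 1205);
        System.out.println(isLockWaitTimeout(timeout)); // prints "true"
    }
}
```

In a real retry loop the caller would re-run the whole transaction when this predicate returns true, typically with a bounded number of attempts.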
Re: Problem with reading files from Sharepoint 2010 to manifldcf 1.0.1
I finally was able to look at the logs. The exception that stops the job is in fact coming from the GetListItems call:

  at org.apache.axis.client.Call.invoke(Call.java:1812)
  at com.microsoft.sharepoint.webpartpages.PermissionsSoapStub.getListItems(PermissionsSoapStub.java:234)
  at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getChildren(SPSProxyHelper.java:619)
  at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1303)
  at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
  at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

Clearly certain entities are expected to have children, but we're either not invoking the service correctly for those, OR we're invoking the service for entities that don't have the ability to get children at all. I don't see any evidence in this log that ANY getListItems calls are succeeding. In fact, it is the first such call that fails. Why do you think that discovery is working? There seems to be no evidence of that. The headers etc.
all look good too:

DEBUG 2012-10-30 14:04:35,223 (Thread-439) - HttpConnectionManager.getConnection: config = HostConfiguration[host=http://16.59.60.113], timeout = 0
DEBUG 2012-10-30 14:04:35,223 (Thread-439) - Getting free connection, hostConfig=HostConfiguration[host=http://16.59.60.113]
DEBUG 2012-10-30 14:04:35,224 (Thread-439) - POST /_vti_bin/MCPermissions.asmx HTTP/1.1[\r][\n]

Karl On Tue, Oct 30, 2012 at 8:39 AM, Fridler, Oren oren.frid...@hp.com wrote: Hi I’m using apache-manifoldcf-1.0.1-bin I installed apache-manifoldcf-sharepoint-2010-plugin-0.1 on top of Sharepoint 2010 On mcf I managed to create a Sharepoint repository connection and saw the status is “Connection Working” Also when I create the “Sharepoint to Solr” Job I can see some of the wiki libraries that I created on SP are available for selection so I assume MCF is getting this data from SP. But when I start the job it is getting stuck in status “running” forever, the mcf UI shows documents are discovered, some are processed and some are active, but on Solr side no document is received. On mcf logs I see the error at the end of this email. On my browser I can open http://16.59.60.113 - getting to SP site, and also http://16.59.60.113/_vti_bin/MCPermissions.asmx - getting to a page that lists these 2 services - GetListItems and GetPermissionCollection Attached are the mcf logs with DEBUG level. Any help or idea what can I do would be highly appreciated. Thanks Oren.

AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Client
 faultSubcode:
 faultString: The Web application at http://16.59.60.113 could not be found. Verify that you have typed the URL correctly. If the URL should be serving existing content, the system administrator may need to add a new request URL mapping to the intended application.
 faultActor: http://16.59.60.113/_vti_bin/MCPermissions.asmx
 faultNode:
 faultDetail:
	{}Error: <ErrorNumber>1010</ErrorNumber><ErrorMessage>The Web application at http://16.59.60.113 could not be found. Verify that you have typed the URL correctly. If the URL should be serving existing content, the system administrator may need to add a new request URL mapping to the intended application.</ErrorMessage><ErrorSource>Microsoft.SharePoint</ErrorSource>
Re: Problem with reading files from Sharepoint 2010 to manifldcf
Seeing the existence of the service in the browser does not mean it will work. It only means that the wsdl is coming back from the service. What can be the reason for this? Unfortunately that is very difficult to determine. SharePoint tends to return catchall errors which are not very meaningful. The server-side event logs may be helpful in figuring out what is going wrong. Can there be a mismatch between the sharepoint driver on MCF and the sharepoint server? This is possible if (for instance) you deployed a SharePoint 2010 plugin on a SharePoint 2007 server, but if you had a version of SharePoint which was incompatible with the plugin you deployed, I would expect you would have seen errors reported during the plugin installation. The plugins are built against specific SharePoint dlls with specific version numbers, and .NET enforces a match. The .bat deployment files though are not very good at telling you that stuff is broken; they don't actually catch the reported errors and stop, so it is possible you may have missed such errors. If there were no errors, I would guess that the problem is probably permissions related. That is, the plugin may not have permissions to do what it needs to do. The permissions are granted (as I understand it) based on the user that installs the plugin, so that may be what the issue is. Karl On Tue, Oct 30, 2012 at 11:19 AM, Fridler, Oren oren.frid...@hp.com wrote: Discovery is not working indeed (sorry I was not clear on this); I just saw on the sharepoint repository connector UI the status "connection working". So if I understand you correctly, the soap call to com.microsoft.sharepoint.webpartpages.PermissionsSoapStub.getListItems(PermissionsSoapStub.java:234) is failing? Although I can see the GetListItems operation supported in the browser. What can be the reason for this? Can there be a mismatch between the sharepoint driver on MCF and the sharepoint server? How do you suggest I continue to investigate? Thanks Oren.
-Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, October 30, 2012 17:05 To: Fridler, Oren Subject: Re: Problem with reading files from Sharepoint 2010 to manifldcf I responded to user@manifoldcf.a.o. The log disagrees with the idea that discovery is working. It seems like the getListItems() part of the service is failing, and on the very first call too. Karl On Tue, Oct 30, 2012 at 10:39 AM, Fridler, Oren oren.frid...@hp.com wrote: I selected SharePoint 2010. There is only one user I used for the SharePoint Server install, and this user is used on the MCF SharePoint connection. Is there a way to disable permission checking altogether in the connector and just ask for all documents with the user credentials I entered on the SharePoint connection? I tried to select security=disabled in the job details but it didn't help. -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, October 30, 2012 16:26 To: Fridler, Oren Cc: user@manifoldcf.apache.org Subject: Re: Problem with reading files from Sharepoint 2010 to manifldcf Hi Oren, Here's my reasoning: (1) You would not get connection working if you could not access the MCPermissions service, unless you selected SharePoint 2003, which would then conflict with other data. (2) You said that it discovered documents. That means that the GetListItems part of the service is working. (3) You said that you couldn't index any documents, and got an AXIS exception which terminated the job. That means you could not retrieve document permissions (which is what the GetPermissionCollection part of the service does). (4) The GetPermissionCollection operation uses only one other service, and it is Permissions.asmx. So I figured that the problem was likely in reaching that service, since the complaint was that it couldn't find a service. I did not have internet service back until 10 minutes ago, but I will confirm this picture in your logs shortly.
The Permissions.asmx service you identify is the correct one; the question seems to be why the MCPermissions service can't talk to it. Could be a permission problem I suppose - perhaps the user you were logged in as when you installed the service had insufficient permissions or some such? Just guessing here... Karl On Tue, Oct 30, 2012 at 9:19 AM, Fridler, Oren oren.frid...@hp.com wrote: Hi Karl Thank you for your prompt reply, By SharePoint permissions service do you refer to this? http://16.59.60.113/_vti_bin/Permissions.asmx I was able to open this service, getting the following operations: AddPermission AddPermissionCollection GetPermissionCollection RemovePermission RemovePermissionCollection UpdatePermission BTW, how can you tell from the logs the mcpermissions server is having trouble reaching SharePoint permissions service? Thanks in advance Oren. -Original Message- From
Re: Problem with reading files from Sharepoint 2010 to manifldcf
Hi Oren, I've been thinking further about your issue, and about how many recent posts we've been getting which basically amount to people trying to get the manifoldcf-sharepoint-2010 plugin working on their particular SharePoint instance, which has no doubt been installed and (mis?)configured by someone else at some point in the past. I think we're going to need a how-to-debug page where we can gather everyone's experiences together, including diagnostic approaches and advice. There is already a page that anyone can edit in the ManifoldCF wiki, which is a fine starting point: https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections . I hope you will be willing to contribute to this effort. In the meantime, let's go back over your questions below and try to eliminate them one at a time, in a more systematic fashion. (1) Version of SharePoint. To rule out any funkiness here, the obvious thing to do is to find the version of your sharepoint.dll. The dll should be in one of the standard locations where assembly dlls are deployed on your server. The assembly name is Microsoft.SharePoint.dll - nothing else, not MicrosoftOffice, or anything else. There are a number of tools for determining the .NET version of such DLLs; here's a link that might help: http://stackoverflow.com/questions/227886/how-do-i-determine-the-dependencies-of-a-net-application . The ManifoldCF-SharePoint-2010 plugin is built against:

<Reference Include="Microsoft.SharePoint, Version=14.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c, processorArchitecture=MSIL" />

... which can be found in the webservice/MCPermissionsService.csproj file in the source package for the service.
The ManifoldCF-SharePoint-2007 plugin is, obviously, built against a different version:

<Reference Include="Microsoft.SharePoint, Version=12.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c, processorArchitecture=MSIL" />

(2) Meaning of error Here's the error again:

	{}Error: <ErrorNumber>1010</ErrorNumber><ErrorMessage>The Web application at http://16.59.60.113 could not be found. Verify that you have typed the URL correctly. If the URL should be serving existing content, the system administrator may need to add a new request URL mapping to the intended application.</ErrorMessage><ErrorSource>Microsoft.SharePoint</ErrorSource>

The error code 1010 comes from the plugin, specifically from the GetListItems method:

catch (Exception ex)
{
    EventLog.WriteEntry("MCPermissions.asmx", ex.Message);
    throw RaiseException(ex.Message, 1010, ex.Source);
}

So, we know we are getting into the plugin correctly, but we furthermore know that something that is happening in there is not working. The ErrorSource tags include the assembly from which the error is coming: Microsoft.SharePoint. The error message, as I pointed out before, is pretty useless where SharePoint is concerned - there are quite a number of catchall errors which are more likely to mislead you than help you. So you have to look at the source code, which is actually rather small and simple. Looking at the code itself, and what it is doing, the likely place that the problem comes from is this:

using (SPSite site = new SPSite(SPContext.Current.Web.Url))
{
    using (SPWeb oWebsiteRoot = site.OpenWeb())
    {
        ...

It seems clear that for some reason your SharePoint instance does not have a valid SPContext.Current.Web.Url which will permit the plugin to reach the actual sharepoint logic. I don't know the reason for that; this is happening internal to SharePoint on that server. Possibilities include a URL redirection, I suppose? My knowledge of .NET, and what SharePoint is doing under the covers, is not that strong.
But this is the avenue I'd pursue. If you do find that there's a redirection taking place to reach your _vti_bin directory, try using the final target of the redirection instead of the initial URL, and see if that helps... Karl
Re: Problem with reading files from Sharepoint 2010 to manifldcf
Please see below... On Wed, Oct 31, 2012 at 8:59 AM, Fridler, Oren oren.frid...@hp.com wrote: Thanks Karl I'll be happy to contribute to the debugging wiki once I have some helpful insights. I'm following your advice and sharing the info in case someone encounters the same issues: (1) SharePoint version - I've found 2 copies of Microsoft.SharePoint.dll (see below). I opened them with .NET Reflector; the first dll's version is 14.0.0.0 and the second is 14.900.0.0.

C:\> dir /s /b Microsoft.SharePoint.dll
C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\ISAPI\Microsoft.SharePoint.dll
C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\UserCode\assemblies\Microsoft.SharePoint.dll

I don't know which dll is used by my SharePoint 2010, so I uninstalled SharePoint - both dlls were removed, and after reinstalling they were back again :( I installed the manifold sharepoint plugin (setup output attached) and it went ok without errors. I think I've seen the 14.900.0.0 - it is the Microsoft Office extensions to SharePoint. But as long as the 14.0.0.0 one is available that is probably fine. (2) Meaning of error - I followed your idea that maybe redirects are causing the problem; since ManifoldCF is running on the same server as SharePoint, I changed the URL and replaced the server IP with localhost or 127.0.0.1. Now I don't get the 1010 error with Web Application cannot be found, but still no files are imported, and the logs (attached) contain these 2 errors:

org.apache.axis.ConfigurationException: No service named PermissionsSoap is available
...
org.apache.axis.ConfigurationException: No service named http://microsoft.com/sharepoint/webpartpages/GetListItems is available

These are just warnings. They seem to be due to some kind of mismatch between the wsdl and what the services actually look like. But just ignore these for now. I'll have a look at your logs shortly and get back to you with an idea what they are telling us.
Karl I'll continue to investigate; if anyone has an idea or can help, that would be great. Thanks Oren.
Re: Problem with reading files from Sharepoint 2010 to manifldcf
I have good news - it is apparently now working. Check your path rules. You need to have a path that matches the document part of the path, e.g. xxx/yyy/*. The end user documentation explains how to set one of these up. Karl On Wed, Oct 31, 2012 at 1:12 PM, Fridler, Oren oren.frid...@hp.com wrote: Sorry, my bad, I attached the wrong file. Attached is the manifoldcf log when 127.0.0.1 is used for the sharepoint server Oren -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Wednesday, October 31, 2012 15:25 To: Fridler, Oren Cc: user@manifoldcf.apache.org Subject: Re: Problem with reading files from Sharepoint 2010 to manifldcf The logs you attached have no entries that are dated later than 10/30, so I am uncertain they are the right ones. I still see the same error when MCPermissions.asmx is invoked. Karl On Wed, Oct 31, 2012 at 9:16 AM, Karl Wright daddy...@gmail.com wrote: Please see below...
Re: Problem with manifold
Actually, from your log it is clear that ManifoldCF can be reached fine from your Solr instance, so please disregard that question. The only other potential issue has to do with Solr search component ordering. This is a bit of black magic, because other Solr components may modify the request in ways which are potentially incompatible with the ManifoldCF plugin. So if you are sure your fields are all correct, you might want to play around with the ordering of your components to see if that makes any difference. There used to be a debug component you could also use which would print out the (full) query and the results returned - that may also be useful. Thanks, Karl On Fri, Nov 2, 2012 at 6:25 AM, Karl Wright daddy...@gmail.com wrote: Hi Pablo, The first thing that I notice is that, as you have this configured, you need four fields declared in your schema as indexable fields: allow_token_document, deny_token_document, allow_token_share, deny_token_share. Do you have these fields declared, and did you have them all declared when you performed the crawl? Second, the way it is configured, the machine that is running Solr must be the same as the machine running ManifoldCF (because you used a localhost url). Is this true? Thanks, Karl On Fri, Nov 2, 2012 at 5:43 AM, Gonzalez, Pablo pablo.gonzalez.do...@hp.com wrote: Hello, Mr Wright, and thank you for such a fast response. Well, the way I am trying to get mcf and solr to communicate is via a SearchComponent. For this I added the apache-solr-mcf-3.6-SNAPSHOT.jar that comes in the file solr-integration to the lib folder of the deployment of the solr webapp in tomcat.
Then I changed solrconfig.xml, adding this piece of code:

<!-- LCF document security enforcement component -->
<searchComponent name="mcfSecurity" class="org.apache.solr.mcf.ManifoldCFSearchComponent">
  <str name="AuthorityServiceBaseURL">http://localhost:8345/mcf</str>
</searchComponent>

<requestHandler name="/search" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters can be specified, these will be overridden by parameters in the request -->
  <!--
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
  </lst>
  -->
  <arr name="last-components">
    <str>mcfSecurity</str>
  </arr>
  <!-- a bunch of comments -->
</requestHandler>

Last thing, I didn't write any additional Java code. I thought it wasn't necessary. Thanks, Pablo -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Friday, November 02, 2012 10:21 To: user@manifoldcf.apache.org Subject: Re: Problem with manifold The ManifoldCF Solr plugin operates by requesting access tokens from ManifoldCF (which seems to be working fine), and using those to modify the incoming Solr search expression to limit the results according to those access tokens. There are two ways (and two independent classes) you can configure to perform this modification. One of these classes functions as a query parser plugin. The other functions as a search component. Obviously, for either one to work right, the Solr configuration has to work properly too. Can you provide details as to (a) which one you are using, and (b) what the configuration details are, e.g. the appropriate clauses from solrconfig.xml? Thanks, Karl On Fri, Nov 2, 2012 at 4:57 AM, Gonzalez, Pablo pablo.gonzalez.do...@hp.com wrote: Hello, I don't know if you already got this message, but anyway here I go: I have been trying to connect ManifoldCF to Solr. I have a file system in a remote server, protected by active directory. I have configured a manifold job to import only a part of the documents under the file system.
In fact, I do the importing process from a file which only contains 2 documents, in order to make it easier to see what is happening and get conclusions. Afterwards the documents are output to the solr server. I have created a request handler called selectManifold to connect manifold and solr. Then I call it via http://[host]:8080/solr/selectManifold?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=&AuthenticatedUserName=user@domain . When doing this, tomcat's log (catalina.out) writes this:

oct 31, 2012 2:40:33 PM org.apache.solr.mcf.ManifoldCFSearchComponent prepare
Información: Trying to match docs for user 'user@domain'
oct 31, 2012 2:40:33 PM org.apache.solr.mcf.ManifoldCFSearchComponent getAccessTokens
Información: For user 'user@domain', saw authority response AUTHORIZED:Auth+active+directory+para+el+file+system (this one is the active directory I'm currently using for the job)
oct 31, 2012 2:40:33 PM org.apache.solr.mcf.ManifoldCFSearchComponent getAccessTokens
Información: For user 'user@domain', saw authority response AUTHORIZED:ad (this one isn't)
oct 31, 2012 2:40:33 PM org.apache.solr.core.SolrCore execute
Información
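Karl's first question in this thread is whether the four security fields exist in the Solr schema. A sketch of what such declarations might look like in schema.xml follows; the field type and attributes here are assumptions for illustration, so the plugin's own example schema should be treated as authoritative:

```xml
<!-- Assumed declarations for the ManifoldCF security fields.
     Multi-valued, indexed string fields; they need not be stored. -->
<field name="allow_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="deny_token_document"  type="string" indexed="true" stored="false" multiValued="true"/>
<field name="allow_token_share"    type="string" indexed="true" stored="false" multiValued="true"/>
<field name="deny_token_share"     type="string" indexed="true" stored="false" multiValued="true"/>
```

Note Karl's follow-up point: the fields must have been declared before the crawl ran, since documents indexed without them carry no access tokens to filter on.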
Re: ManifoldCF 1.0.1 MySQL setup : Error getting connection: Access denied for user
Hi Nigel, I'm not a MySQL expert, but I seem to recall there was something interesting about the way MySQL authenticated remote connections. There are two properties that the MySQL driver looks at:

/** MySQL server property */
public static final String mysqlServerProperty = "org.apache.manifoldcf.mysql.server";
/** Source system name or IP */
public static final String mysqlClientProperty = "org.apache.manifoldcf.mysql.client";

I think you may need to set both of these for the auth to succeed. Also, make sure your MySQL server is configured to permit connections from the source system you are trying to connect from. Thanks, Karl On Fri, Nov 2, 2012 at 11:21 AM, Nigel Thomas nigel.tho...@york.ac.uk wrote: Hello, I am having some problems configuring 1.0.1 to use a MySQL database. I have followed the steps here: http://manifoldcf.apache.org/release/release-1.0.1/en_US/how-to-build-and-deploy.html#Configuring+a+MySQL+database I have set the following db-related properties in properties.xml:

<property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfaceMySQL"/>
<property name="org.apache.manifoldcf.mysql.server" value="mysql.example.com"/>
<property name="org.apache.manifoldcf.dbsuperusername" value="root"/>
<property name="org.apache.manifoldcf.dbsuperuserpassword" value="password"/>
<property name="org.apache.manifoldcf.database.name" value="manfold_db"/>
<property name="org.apache.manifoldcf.database.username" value="root"/>
<property name="org.apache.manifoldcf.database.password" value="password"/>
<property name="org.apache.manifoldcf.database.maxhandles" value="100"/>

On running initialise.sh, the following exception is thrown: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Error getting connection: Access denied for user 'root'@'%' to database 'mysql' at org.apache.manifoldcf.core.database.DBInterfaceMySQL.createUserAndDatabase(DBInterfaceMySQL.java:624) at org.apache.manifoldcf.core.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:700) at 
org.apache.manifoldcf.crawler.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:168) at org.apache.manifoldcf.crawler.InitializeAndRegister.doExecute(InitializeAndRegister.java:37) at org.apache.manifoldcf.crawler.InitializeAndRegister.main(InitializeAndRegister.java:60) I am able to connect to the MySQL instance using a command line MySQL client from the same machine using the same credentials, this rules out networking and credentials related issues. Am not sure what I am missing with the setup, I have tried the equivalent with a postgres setup, this seems to work just fine. Thanks, Nigel Thomas
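Karl's reply notes that the MySQL driver reads two properties, while the configuration above sets only the server one. If the missing client property matters in your setup, the extra line (with markup restored; the host value here is a made-up example for the machine ManifoldCF connects from) would look something like:

```xml
<property name="org.apache.manifoldcf.mysql.server" value="mysql.example.com"/>
<property name="org.apache.manifoldcf.mysql.client" value="crawler.example.com"/>
```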
Re: Problem with manifold
Just reran the tests on the trunk version of the ManifoldCF solr 3.x plugin - looked good: [junit] Testsuite: org.apache.solr.mcf.ManifoldCFQParserPluginTest [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 10.56 sec [junit] [junit] - Standard Error - [junit] WARNING: test class left thread running: Thread[MultiThreadedHttpCon nectionManager cleanup,5,main] [junit] RESOURCE LEAK: test class left 1 thread(s) running [junit] - --- [junit] Testsuite: org.apache.solr.mcf.ManifoldCFSearchComponentTest [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.096 sec [junit] [junit] - Standard Error - [junit] WARNING: test class left thread running: Thread[MultiThreadedHttpCon nectionManager cleanup,5,main] [junit] RESOURCE LEAK: test class left 1 thread(s) running [junit] - --- [junit] Testsuite: org.apache.solr.mcf.ManifoldCFSCLoadTest [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 40.486 sec [junit] [junit] - Standard Output --- [junit] Query time = 24352 [junit] - --- [junit] - Standard Error - [junit] WARNING: test class left thread running: Thread[MultiThreadedHttpCon nectionManager cleanup,5,main] [junit] RESOURCE LEAK: test class left 1 thread(s) running [junit] - --- The components that this test uses are simple: ?xml version=1.0 ? !-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the License); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. -- !-- $Id: solrconfig-auth.xml 1176500 2011-09-27 18:19:59Z kwright $ $Source$ $Name$ -- config luceneMatchVersion${tests.luceneMatchVersion:LUCENE_CURRENT}/luceneMatchVersion jmx / dataDir${solr.data.dir:}/dataDir directoryFactory name=DirectoryFactory class=${solr.directoryFactory:solr.RAMDirectoryFactory}/ updateHandler class=solr.DirectUpdateHandler2 /updateHandler requestHandler name=/update class=solr.XmlUpdateRequestHandler / !-- test MCF Security Filter settings -- searchComponent name=mcf-param class=org.apache.solr.mcf.ManifoldCFSearchComponent str name=AuthorityServiceBaseURLhttp://localhost:8345/mcf-as/str int name=SocketTimeOut3000/int str name=AllowAttributePrefixaap-/str str name=DenyAttributePrefixdap-/str /searchComponent searchComponent name=mcf class=org.apache.solr.mcf.ManifoldCFSearchComponent /searchComponent requestHandler name=/mcf class=solr.SearchHandler startup=lazy lst name=invariants bool name=mcftrue/bool /lst lst name=defaults str name=echoParamsall/str /lst arr name=components strquery/str strmcf/str /arr /requestHandler /config On Mon, Nov 5, 2012 at 5:42 AM, Karl Wright daddy...@gmail.com wrote: No - I mean modifying ManifoldCFSearchComponent itself, and rebuilding the component yourself. You can download the sources that correspond to the release from the ManifoldCF download page, http://manifoldcf.apache.org/en_US/download.html . Karl On Mon, Nov 5, 2012 at 4:13 AM, Gonzalez, Pablo pablo.gonzalez.do...@hp.com wrote: Hello, By 'modifying the component itself' do you mean to write a subclass of ManifoldCFSearchComponent? 
-Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: viernes, 02 de noviembre de 2012 14:47 To: user@manifoldcf.apache.org Subject: Re: Problem with manifold If you don't get anywhere with the debug component, you can try modifying the component itself to print the incoming query and the modified query. You might also want to look at the ManifoldCF component tests, which create a handler internally and executed successfully when the component was released. If you create a similar handler and that works, then you can try to figure out what the differences are. Thanks, Karl On Fri, Nov 2, 2012 at 8:29 AM, Gonzalez, Pablo pablo.gonzalez.do...@hp.com wrote: Well, it went wrong. I will crawl again just in case, and if it doesn't go well, I will search on Internet about that debug component
RE: ManifoldCF 1.0.1 MySQL setup : Error getting connection:
Access denied for user Hi Nigel, Existence checking is already present, and must indeed work if the system is to function properly in single-user mode. It may be that you simply need to provide a superuser with enough privileges to do the checks. Karl Sent from my Windows Phone From: Nigel Thomas Sent: 11/5/2012 7:01 AM To: user@manifoldcf.apache.org Subject: Re: ManifoldCF 1.0.1 MySQL setup : Error getting connection: Access denied for user Hi Karl, Thank you for the prompt response. I had run a tcpdump on the connection to get more details on the error, which is a MySQL 42000 "Access denied for user" error, and looking at the source code, the problem is not that it isn't connecting to the database, but that it doesn't have the privileges to create the database and grant access to users. We run a shared MySQL service, where the user, database and privileges are granted separately, and unfortunately superuser access isn't permitted. I guess in this context the initialise script will not work without some modification to check if the database and users already exist. I have reverted to using a Postgres instance with full privileges for the moment and may revisit the code later. Thanks for the help. Nigel Thomas On 2 November 2012 15:35, Karl Wright daddy...@gmail.com wrote: Hi Nigel, I'm not a MySQL expert, but I seem to recall there was something interesting about the way MySQL authenticated remote connections. There are two properties that the MySQL driver looks at: /** MySQL server property */ public static final String mysqlServerProperty = "org.apache.manifoldcf.mysql.server"; /** Source system name or IP */ public static final String mysqlClientProperty = "org.apache.manifoldcf.mysql.client"; I think you may need to set both of these for the auth to succeed. Also, make sure your MySQL server is configured to permit connections from the source system you are trying to connect from. 
Thanks, Karl On Fri, Nov 2, 2012 at 11:21 AM, Nigel Thomas nigel.tho...@york.ac.uk wrote: Hello, I am having some problems configuring 1.0.1 to use a MySQL database, I have followed steps here : http://manifoldcf.apache.org/release/release-1.0.1/en_US/how-to-build-and-deploy.html#Configuring+a+MySQL+database I have set the following db related properties in properties.xml: property name=org.apache.manifoldcf.databaseimplementationclass value=org.apache.manifoldcf.core.database.DBInterfaceMySQL/ property name=org.apache.manifoldcf.mysql.server value=mysql.example.com/ property name=org.apache.manifoldcf.dbsuperusername value=root/ property name=org.apache.manifoldcf.dbsuperuserpassword value=password/ property name=org.apache.manifoldcf.database.name value=manfold_db/ property name=org.apache.manifoldcf.database.username value=root/ property name=org.apache.manifoldcf.database.password value=password/ property name=org.apache.manifoldcf.database.maxhandles value=100/ On running initialise.sh, the following exception is thrown: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Error getting connection: Access denied for user 'root'@'%' to database 'mysql' at org.apache.manifoldcf.core.database.DBInterfaceMySQL.createUserAndDatabase(DBInterfaceMySQL.java:624) at org.apache.manifoldcf.core.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:700) at org.apache.manifoldcf.crawler.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:168) at org.apache.manifoldcf.crawler.InitializeAndRegister.doExecute(InitializeAndRegister.java:37) at org.apache.manifoldcf.crawler.InitializeAndRegister.main(InitializeAndRegister.java:60) I am able to connect to the MySQL instance using a command line MySQL client from the same machine using the same credentials, this rules out networking and credentials related issues. Am not sure what I am missing with the setup, I have tried the equivalent with a postgres setup, this seems to work just fine. Thanks, Nigel Thomas
Re: ManifoldCF 1.0.1 MySQL setup : Error getting connection: Access denied for user
The check-for-existence logic is already there, and you can control the superuser name and password. But you can't control the instance name, which is the MySql root instance name mysql. Karl On Mon, Nov 5, 2012 at 7:00 AM, Nigel Thomas nigel.tho...@york.ac.uk wrote: Hi Karl, Thank you for the prompt response. I had run a tcp dump on the connection to get more details on the error, which is a mysql :42000 - Access denied for user error and looking at the source code, the problem is not that it isn't connecting to the database but, it doesn't have the privileges to create the database and grant access to users. We run a shared mysql services, where the user, database and privileges are granted separately, and unfortunately super user access isn't permitted. I guess in this context the initialise script will not work without some modification to check if database and users already exist. I have reverted to using postgres instance with full privileges for the moment and may revisit the code later. Thanks for the help. Nigel Thomas On 2 November 2012 15:35, Karl Wright daddy...@gmail.com wrote: Hi Nigel, I'm not a MySQL expert, but I seem to recall there was something interesting about the way MySQL authenticated remote connections. There are two properties that the MySQL driver looks at: /** MySQL server property */ public static final String mysqlServerProperty = org.apache.manifoldcf.mysql.server; /** Source system name or IP */ public static final String mysqlClientProperty = org.apache.manifoldcf.mysql.client; I think you may need to set both of these for the auth to succeed. Also, make sure your MySQL server is configured to permit connections from the source system you are trying to connect from. 
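If the initialize code were modified to skip creation when the database and user already exist (as Nigel suggests), a DBA on the shared MySQL server could pre-create them. A rough sketch in standard MySQL SQL, reusing the database name from the earlier properties.xml and a hypothetical non-root account (the exact privileges ManifoldCF needs may differ by version, and the root 'mysql' instance access Karl mentions may still be required by stock code):

```sql
-- Run by a DBA on the shared MySQL server; account names are illustrative.
CREATE DATABASE manfold_db CHARACTER SET utf8;
CREATE USER 'mcfuser'@'crawler.example.com' IDENTIFIED BY 'secret';
GRANT ALL PRIVILEGES ON manfold_db.* TO 'mcfuser'@'crawler.example.com';
FLUSH PRIVILEGES;
```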
Thanks, Karl On Fri, Nov 2, 2012 at 11:21 AM, Nigel Thomas nigel.tho...@york.ac.uk wrote: Hello, I am having some problems configuring 1.0.1 to use a MySQL database, I have followed steps here : http://manifoldcf.apache.org/release/release-1.0.1/en_US/how-to-build-and-deploy.html#Configuring+a+MySQL+database I have set the following db related properties in properties.xml: property name=org.apache.manifoldcf.databaseimplementationclass value=org.apache.manifoldcf.core.database.DBInterfaceMySQL/ property name=org.apache.manifoldcf.mysql.server value=mysql.example.com/ property name=org.apache.manifoldcf.dbsuperusername value=root/ property name=org.apache.manifoldcf.dbsuperuserpassword value=password/ property name=org.apache.manifoldcf.database.name value=manfold_db/ property name=org.apache.manifoldcf.database.username value=root/ property name=org.apache.manifoldcf.database.password value=password/ property name=org.apache.manifoldcf.database.maxhandles value=100/ On running initialise.sh, the following exception is thrown: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Error getting connection: Access denied for user 'root'@'%' to database 'mysql' at org.apache.manifoldcf.core.database.DBInterfaceMySQL.createUserAndDatabase(DBInterfaceMySQL.java:624) at org.apache.manifoldcf.core.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:700) at org.apache.manifoldcf.crawler.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:168) at org.apache.manifoldcf.crawler.InitializeAndRegister.doExecute(InitializeAndRegister.java:37) at org.apache.manifoldcf.crawler.InitializeAndRegister.main(InitializeAndRegister.java:60) I am able to connect to the MySQL instance using a command line MySQL client from the same machine using the same credentials, this rules out networking and credentials related issues. Am not sure what I am missing with the setup, I have tried the equivalent with a postgres setup, this seems to work just fine. Thanks, Nigel Thomas
Re: Cannot connect to SharePoint 2010 instance
I've seen situations where a SharePoint site is configured to perform a redirection, and this is messing things up internally. Does your connection server name etc. match precisely the URL you see when you are in the SharePoint user interface? Karl On Tue, Nov 6, 2012 at 8:47 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, After further review it appears the MCPermissions.asmx was installed globally in SharePoint. I am able to access it from within my SharePoint site as well as all other valid SharePoint sub-sites. So this connection http://server/sitepath/_vti_bin works with any valid site in sitepath including the previously mentioned _admin site. That said, do you have any thoughts on why I would be getting the 404 error? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Monday, November 05, 2012 2:45 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance The 404 error indicates that your MCPermissions service is not properly deployed. The _admin in your path is a clue that something might not be right. The place you want to see the MCPermissions.asmx is in the following location: http[s]://server/sitepath/_vti_bin ... where the server is your server name, and the sitepath is your site path. The best way to get this is to enter the SharePoint UI (NOT the admin UI, but the SharePoint end-user UI), and log into the root site. Then make note of the URL in your browser. If the MCPermissions.asmx service appears under that URL, look at your IIS settings and make sure that the MCPermissions.asmx service can be executed. Also, this may be of some help: https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections The end user documentation is also extremely helpful in describing how to properly set up connections. You can uninstall the MCPermissions.asmx service using the .bat files that are included with the plugin. 
When you re-install, please make sure that you are logged in as a user with full admin privileges, or the service will not work properly. Thanks, Karl On Mon, Nov 5, 2012 at 2:33 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Hello, I have installed apache-manifoldcf-1.0.1 on my Linux server and apache-manifoldcf-sharepoint-2010-plugin-0.1-bin on my SharePoint 2010 server. On my SharePoint server I can see the Permissions Page when I enter http://x:x/_admin/_vti_bin/MCPermissions.asmx in my browser. When I try to make a SharePoint Services 4.0 (2010) connection to my SharePoint 2010 server in the ManifoldCF interface I get this error. Got an unknown remote exception accessing site - axis fault = Client, detail = The request failed with HTTP status 404: Not Found. I can connect using SharePoint Services 2.0 (2003) but when I try a crawl it does not work properly and aborts. The SharePoint Services 3.0 (2007) connection fails the same as the above 2010 connection. Can you please give some direction on how best to resolve this issue. Thanks Bob Robert P. Iannetti Application Architect Novartis Institute for BioMedical Research 186 Massachusetts Avenue Cambridge, MA 02139 Phone: +1 (617) 871-5414 robert.ianne...@novartis.com
Re: The Schedulers are not starting automatically
Hi Anupam, I'm having difficulty understanding what you posted here, but I will try to explain the difference between rescan dynamically and scan every document once. You may find more help also in ManifoldCF in Action, at http://www.manning.com/wright . The first option causes your job to run forever. The job runs only in the schedule windows allotted for it. It periodically discovers new documents, and (depending on the crawling model of the connector) may check for existence or modification of an already-crawled document. Each document has its own schedule for doing this. The second option is more likely to be what you want. Each job starts, runs, and completes, being sure to run only in the scheduling windows you provide. You then run it again, and again (or your job schedule makes that happen). It will do the minimal work to keep your index up to date. There are significant differences between how you would set up a job using one model vs. the other. I strongly suggest you read at least the first few chapters of the book. Karl On Tue, Nov 6, 2012 at 12:35 PM, Anupam Bhattacharya anupam...@gmail.com wrote: My incremental indexing was working previously, but I have messed up a few settings, due to which the documents indexed the previous day get deleted and only the new ones show up. I suspect that it is due to the settings under List all Jobs > Edit selected job > Scheduling > Schedule type: Rescan documents dynamically OR Scan every document once? Please let me know the appropriate settings to index only the new documents in the repository. After deleting the SOLR index data folder and clearing the table records in jobqueue, repohistory and ingeststatus, I found that ManifoldCF scans only the remaining new documents, until I go to List Output Connections, click View for the SOLR connection, and click OK on "Re-ingest all associated documents". 
How does it keep track of which documents were ingested previously and then fetch only the new documents? Regards Anupam On Tue, Aug 14, 2012 at 10:01 AM, Anupam Bhattacharya anupam...@gmail.com wrote: Thanks. There is an option to set the Start Method on the Connection tab in the Job settings. I changed it to "Start when schedule window starts" and the problem got resolved. Regards Anupam On Thu, Aug 2, 2012 at 10:59 PM, Karl Wright daddy...@gmail.com wrote: The incremental will work the same whether the job is run manually or started automatically. If you have added the appropriate schedule record to your job, you also have to select the run job automatically radio button on one of the other job tabs for automatic runs to take place. I suspect that is what you are missing. Karl On Thu, Aug 2, 2012 at 1:12 PM, Anupam Bhattacharya anupam...@gmail.com wrote: I have a job which is indexing properly, even the incremental indexing, if run manually. However, even after adding a specific time to run, the scheduled job is not starting on its own. What is the ideal configuration for a job which runs automatically every day at 12 AM and does an incremental re-indexing (only looking for those documents which are new or modified after the last crawl) of the repository? Is it necessary to give the total run time details when adding a specific schedule time? Regards Anupam -- Thanks Regards Anupam Bhattacharya
Re: Cannot connect to SharePoint 2010 instance
Yes, this can be somewhat tricky. There are a lot of potential configurations that could affect this. First, you want to verify that your IIS is using NTLM authentication, and that all the web services directories are executable. This is critical. Second, the credentials, in the form of domain\user, may be sensitive to whether you use a fully-qualified domain name or a shortcut domain name, e.g. mydomain.novartis.com or just mydomain. I suggest you try some combinations. The other thing you may want to check is whether the machine you are running ManifoldCF on is known by your domain controller; you may not be able to authenticate if it is not. If this doesn't help, and you want to eliminate ManifoldCF's NTLM implementation from the list of possibilities, I suggest downloading the curl utility, and trying to fetch a web service listing or wsdl using it (specifying NTLM of course as the authentication method). If that also doesn't work, it's a server-side configuration problem of some kind. You can also refer to the server-side IIS logs for some additional info. But I've found these are not very helpful for authentication issues. Let me know if you are still stuck after this; there are other diagnostics available but they start to get ugly. Karl On Tue, Nov 6, 2012 at 2:35 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, I turned on the additional debugging and was able to resolve the 404 issue. Now I am getting: Crawl user did not authenticate properly, or has insufficient permissions to access http://.xxx.xxx: (401)Unauthorized I can log into the SharePoint site from the browser using the same credentials. Any thoughts? 
Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 10:05 AM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Well, you can turn on httpclient wire debugging, as I believe is described in the article URL I sent you before, and then you can see precisely what URL the connector is trying to reach when it accesses the MCPermissions service. There's no magic here. If the connector gets a 404 error back from IIS, either its URL is wrong, or IIS has decided it's not going to serve that page to the client. Karl On Tue, Nov 6, 2012 at 8:58 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Yes, The URL and what I enter in the ManifoldCF interface are a match. -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 8:52 AM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance I've seen situations where a SharePoint site is configured to perform a redirection, and this is messing things up internally. Does the your connection server name etc. match precisely the URL you see when you are in the SharePoint user interface? Karl On Tue, Nov 6, 2012 at 8:47 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, After further review it appears the MCpermissions.asmx was installed globally in SharePoint. I am able to access it from within my SharePoint site as well as all other valid SharePoint sub-sites. So this connection http://server/sitepath/_vti_bin works with any valid site in sitepath including the previously mentioned _admin site. That said do you have any thoughts on why I would be getting the 404 error? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Monday, November 05, 2012 2:45 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance The 404 error indicates that your MCPermissions service is not properly deployed. 
The _admin in your path is a clue that something might not be right. The place you want to see the MCPermissions.asmx is in the following location: http[s]://server/sitepath/_vti_bin ... where the server is your server name, and the sitepath is your site path. The best way to get this is to enter the SharePoint UI (NOT the admin UI, but the SharePoint end-user UI), and log into the root site. Then make note of the URL in your browser. If the MCPermissions.asmx service appears under that URL, look at your IIS settings and make sure that the MCPermissions.asmx service can be executed. Also, this may be of some help: https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections The end user documentation is also extremely helpful in describing how to properly set up connections. You can uninstall the MCPermissions.asmx service using the .bat files that are included with the plugin. When you re-install, please make sure that you are logged in as a user with full admin privileges, or the service will not work properly. Thanks, Karl On Mon, Nov 5, 2012 at 2:33 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Hello, I have
Re: Cannot connect to SharePoint 2010 instance
No, Kerberos is not supported. This is a limitation of the Apache commons-httpclient library that we use for communicating with SharePoint. It is possible to set up IIS to serve a different port with different authentication that goes to the same SharePoint instance but is NTLM protected, not Kerberos protected. Perhaps you can do this and limit access to that port to only the ManifoldCF machine. Karl On Tue, Nov 6, 2012 at 3:03 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, Our SharePoint sites use Kerberos authentication is this supported in ManifoldCF? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 2:50 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Yes, this can be somewhat tricky. There are a lot of potential configurations that could affect this. First, you want to verify that your IIS is using NTLM authentication, and that all the web services directories are executable. This is critical. Second, the credentials, in the form of domain\user, may be sensitive to whether you use a fully-qualified domain name or a shortcut domain name, e.g. mydomain.novartis.com or just mydomain. I suggest you try some combinations. The other thing you may want to check is whether the machine you are running ManifoldCF on is known by your domain controller; you may not be able to authenticate if it is not. If this doesn't help, and you want to eliminate ManifoldCF's NTLM implementation from the list of possibilities, I suggest downloading the curl utility, and trying to fetch a web service listing or wsdl using it (specifying NTLM of course as the authentication method). If that also doesn't work, it's a server-side configuration problem of some kind. You can also refer to the server-side IIS logs for some additional info. But I've found these are not very helpful for authentication issues. 
Let me know if you are still stuck after this; there are other diagnostics available but they start to get ugly. Kral On Tue, Nov 6, 2012 at 2:35 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, I turned on the additional debugging and was able to resolve the 404 issue. Now I am getting: Crawl user did not authenticate properly, or has insufficient permissions to access http://.xxx.xxx: (401)Unauthorized I can log into the SharePoint site from the browser using the same credentials. Any Thoughts? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 10:05 AM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Well, you can turn on httpclient wire debugging, as I believe is described in the article URL I sent you before, and then you can see precisely what URL the connector is trying to reach when it accesses the MCPermissions service. There's no magic here. If the connector gets a 404 error back from IIS, either its URL is wrong, or IIS has decided it's not going to serve that page to the client. Karl On Tue, Nov 6, 2012 at 8:58 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Yes, The URL and what I enter in the ManifoldCF interface are a match. -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 8:52 AM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance I've seen situations where a SharePoint site is configured to perform a redirection, and this is messing things up internally. Does the your connection server name etc. match precisely the URL you see when you are in the SharePoint user interface? Karl On Tue, Nov 6, 2012 at 8:47 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, After further review it appears the MCpermissions.asmx was installed globally in SharePoint. 
I am able to access it from within my SharePoint site as well as all other valid SharePoint sub-sites. So this connection http://server/sitepath/_vti_bin works with any valid site in sitepath including the previously mentioned _admin site. That said do you have any thoughts on why I would be getting the 404 error? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Monday, November 05, 2012 2:45 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance The 404 error indicates that your MCPermissions service is not properly deployed. The _admin in your path is a clue that something might not be right. The place you want to see the MCPermissions.asmx is in the following location: http[s]://server/sitepath/_vti_bin ... where the server is your server name, and the sitepath is your site path. The best way to get this is to enter the SharePoint UI (NOT the admin UI
Re: Cannot connect to SharePoint 2010 instance
Hi Bob, The only comparable products I know of have similar limitations. The only one I know well is the Google Search Appliance SharePoint connector, which, when I last looked, had exactly the same restriction. It also has other limitations, some severe, such as limiting the number of documents you can crawl to no more than 5000 per library. We are willing to do a reasonable amount of work to upgrade ManifoldCF to be able to support Kerberos. Here's a link which describes the situation: http://old.nabble.com/Support-for-Kerberos-SPNEGO-td14564857.html We currently use a significantly-patched version of 3.1, which supplied the NTLM implementation for 4.0 that is currently in use. Our issue is similar to the commons-httpclient team's, which is that we have no good way of testing all of this, and none of us are security protocol experts. If you have (or know somebody with) such expertise, who would be willing/able to donate their time, this problem could be tackled, I think, without too much pain. So at least httpclient, given the right tickets, would be able to connect. The other issue with Kerberos auth is that I believe it will require a significant amount of work to allow anything using it to obtain the tickets from the AD domain controller. This would obviously require UI work for all connectors that would support Kerberos. But that is something I am willing to attempt if everything else is in place. Karl On Tue, Nov 6, 2012 at 3:11 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, If this is not possible, can you recommend any other products to crawl SharePoint content and index it in Solr? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 3:10 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance No, Kerberos is not supported. This is a limitation of the Apache commons-httpclient library that we use for communicating with SharePoint. 
It is possible to set up IIS to serve a different port with different authentication that goes to the same SharePoint instance but is NTLM protected, not Kerberos protected. Perhaps you can do this and limit access to that port to only the ManifoldCF machine. Karl On Tue, Nov 6, 2012 at 3:03 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, Our SharePoint sites use Kerberos authentication is this supported in ManifoldCF? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 2:50 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Yes, this can be somewhat tricky. There are a lot of potential configurations that could affect this. First, you want to verify that your IIS is using NTLM authentication, and that all the web services directories are executable. This is critical. Second, the credentials, in the form of domain\user, may be sensitive to whether you use a fully-qualified domain name or a shortcut domain name, e.g. mydomain.novartis.com or just mydomain. I suggest you try some combinations. The other thing you may want to check is whether the machine you are running ManifoldCF on is known by your domain controller; you may not be able to authenticate if it is not. If this doesn't help, and you want to eliminate ManifoldCF's NTLM implementation from the list of possibilities, I suggest downloading the curl utility, and trying to fetch a web service listing or wsdl using it (specifying NTLM of course as the authentication method). If that also doesn't work, it's a server-side configuration problem of some kind. You can also refer to the server-side IIS logs for some additional info. But I've found these are not very helpful for authentication issues. Let me know if you are still stuck after this; there are other diagnostics available but they start to get ugly. 
Karl On Tue, Nov 6, 2012 at 2:35 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, I turned on the additional debugging and was able to resolve the 404 issue. Now I am getting: Crawl user did not authenticate properly, or has insufficient permissions to access http://.xxx.xxx: (401)Unauthorized I can log into the SharePoint site from the browser using the same credentials. Any thoughts? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 10:05 AM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Well, you can turn on httpclient wire debugging, as I believe is described in the article URL I sent you before, and then you can see precisely what URL the connector is trying to reach when it accesses the MCPermissions service. There's no magic here. If the connector gets a 404 error back from IIS, either its URL is wrong, or IIS has decided it's not going
Re: Cannot connect to SharePoint 2010 instance
Hi Bob, That depends very strongly on whether SharePoint 2013 continues the Microsoft tradition of breaking web services that used to work. :-) Seriously, we need three things to develop a SharePoint 2013 solution: (1) A stable release (a beta is not sufficient because Microsoft is famous for changing things in a major way between beta and release); (2) a benevolent client with sufficient patience to try things out that we develop in their environment, and (3) enough time so that we're not on the bleeding edge and that other people have run into most of the sticky problems first. We're volunteers here and we all have day jobs, so we mostly can't afford to be pounding away at brick walls on our own. It could be the case that everything just works, in which case the development is trivial. We'll have to see. Karl On Tue, Nov 6, 2012 at 3:37 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, On another topic is there a roadmap for supporting SharePoint 2013 ? We are in the process of migrating and were wondering when your ManifoldCF product would be available to support it. Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 3:34 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Hi Bob, The only products I know have a similar limitations. The only one I know is the SharePoint google appliance connector, which when I looked last had exactly the same restriction. It also has other limitations, some severe, such as limiting the number of documents you can crawl to no more than 5000 per library. We are willing to do a reasonable amount of work to upgrade ManifoldCF to be able to support Kerberos. Here's a link which describes the situation: http://old.nabble.com/Support-for-Kerberos-SPNEGO-td14564857.html We currently use a significantly-patched version of 3.1, which supplied the NTLM implementation for 4.0 that is currently in use. 
Our issue is similar to the commons-httpclient team's, which is we have no good way of testing all of this, and none of us are security protocol experts. If you have (or know somebody with) such expertise, who would be willing/able to donate their time, this problem could be tackled I think without too much pain. So at least httpclient, given the right tickets, would be able to connect. The other issue with Kerberos auth is that I believe it will require a significant amount of work to allow anything using it to obtain the tickets from the AD domain controller. This would obviously require UI work for all connectors that would support Kerberos. But that is something I am willing to attempt if everything else is in place. Karl On Tue, Nov 6, 2012 at 3:11 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, If this is not possible can you recommend any other products to crawl SharePoint content and index it in Solr? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 3:10 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance No, Kerberos is not supported. This is a limitation of the Apache commons-httpclient library that we use for communicating with SharePoint. It is possible to set up IIS to serve a different port with different authentication that goes to the same SharePoint instance but is NTLM protected, not Kerberos protected. Perhaps you can do this and limit access to that port to only the ManifoldCF machine. Karl On Tue, Nov 6, 2012 at 3:03 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, Our SharePoint sites use Kerberos authentication is this supported in ManifoldCF? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 2:50 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Yes, this can be somewhat tricky. 
There are a lot of potential configurations that could affect this. First, you want to verify that your IIS is using NTLM authentication, and that all the web services directories are executable. This is critical. Second, the credentials, in the form of domain\user, may be sensitive to whether you use a fully-qualified domain name or a shortcut domain name, e.g. mydomain.novartis.com or just mydomain. I suggest you try some combinations. The other thing you may want to check is whether the machine you are running ManifoldCF on is known by your domain controller; you may not be able to authenticate if it is not. If this doesn't help, and you want to eliminate ManifoldCF's NTLM implementation from the list of possibilities, I suggest downloading the curl utility, and trying to fetch a web service listing or wsdl using it (specifying NTLM of course as the authentication method
Re: Cannot connect to SharePoint 2010 instance
If you want, we can create a ticket to cover SharePoint 2013 work. If you want to attempt a sanity check: if you email me (personally, to daddy...@gmail.com) the Microsoft.SharePoint.dll, I can set up a ManifoldCF-Sharepoint-2013 plugin. If I can build that, then the next step would be just trying it all out and seeing where it fails. Karl On Tue, Nov 6, 2012 at 3:49 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, That sounds reasonable. I am having my SP Admin set up the NTLM SharePoint instance described below; I will let you know how it works. BTW SP 2013 RTM has been released so we can cross #1 off the list :) Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 3:47 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Hi Bob, That depends very strongly on whether SharePoint 2013 continues the Microsoft tradition of breaking web services that used to work. :-) Seriously, we need three things to develop a SharePoint 2013 solution: (1) A stable release (a beta is not sufficient because Microsoft is famous for changing things in a major way between beta and release); (2) a benevolent client with sufficient patience to try things out that we develop in their environment, and (3) enough time so that we're not on the bleeding edge and that other people have run into most of the sticky problems first. We're volunteers here and we all have day jobs, so we mostly can't afford to be pounding away at brick walls on our own. It could be the case that everything just works, in which case the development is trivial. We'll have to see. Karl On Tue, Nov 6, 2012 at 3:37 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, On another topic, is there a roadmap for supporting SharePoint 2013? We are in the process of migrating and were wondering when your ManifoldCF product would be available to support it. 
Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 3:34 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Hi Bob, The only products I know have a similar limitations. The only one I know is the SharePoint google appliance connector, which when I looked last had exactly the same restriction. It also has other limitations, some severe, such as limiting the number of documents you can crawl to no more than 5000 per library. We are willing to do a reasonable amount of work to upgrade ManifoldCF to be able to support Kerberos. Here's a link which describes the situation: http://old.nabble.com/Support-for-Kerberos-SPNEGO-td14564857.html We currently use a significantly-patched version of 3.1, which supplied the NTLM implementation for 4.0 that is currently in use. Our issue is similar to the commons-httpclient team's, which is we have no good way of testing all of this, and none of us are security protocol experts. If you have (or know somebody with) such expertise, who would be willing/able to donate their time, this problem could be tackled I think without too much pain. So at least httpclient, given the right tickets, would be able to connect. The other issue with Kerberos auth is that I believe it will require a significant amount of work to allow anything using it to obtain the tickets from the AD domain controller. This would obviously require UI work for all connectors that would support Kerberos. But that is something I am willing to attempt if everything else is in place. Karl On Tue, Nov 6, 2012 at 3:11 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, If this is not possible can you recommend any other products to crawl SharePoint content and index it in Solr? 
Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 3:10 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance No, Kerberos is not supported. This is a limitation of the Apache commons-httpclient library that we use for communicating with SharePoint. It is possible to set up IIS to serve a different port with different authentication that goes to the same SharePoint instance but is NTLM protected, not Kerberos protected. Perhaps you can do this and limit access to that port to only the ManifoldCF machine. Karl On Tue, Nov 6, 2012 at 3:03 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, Our SharePoint sites use Kerberos authentication is this supported in ManifoldCF? Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 06, 2012 2:50 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Yes, this can be somewhat tricky. There are a lot
Re: Problem with manifold
-2050932820- -deny_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820- allow_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-513 -deny_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-513 allow_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1113 -deny_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1113 allow_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1110 -deny_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1110 allow_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1107 -deny_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1107 allow_token_document:active_dir:S-1-1-0 -deny_token_document:active_dir:S-1-1-0 allow_token_document:ad:S-1-5-32-545 -deny_token_document:ad:S-1-5-32-545 allow_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820- -deny_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820- allow_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-513 -deny_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-513 allow_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1113 -deny_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1113 allow_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1110 -deny_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1110 allow_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1107 -deny_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1107 allow_token_document:ad:S-1-1-0 -deny_token_document:ad:S-1-1-0) This is the _document security chunk of the BooleanQuery (quoting all the SIDs with so it doesn't think active_dir is a field only for having a : after it). The query gives the expected results. Thinking about it, the truth is that when we configured our security policies by means of ActiveDirectory we did not take into consideration share-level policies. 
Our users are authenticated only at a document level. Anyway, I don't think this gives us any clue on why my handler isn't working. But, now I could modify my own component to take care of the _document-level security alone, forgetting about the _share-level. I think it would work and that's what I will try for now, but I seriously think there must be another way to do it, so if this data makes you have any idea please let me know. I will anyway tell you whether it worked or not. Thanks, Pablo -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: lunes, 05 de noviembre de 2012 11:57 To: user@manifoldcf.apache.org Subject: Re: Problem with manifold Just reran the tests on the trunk version of the ManifoldCF solr 3.x plugin - looked good: [junit] Testsuite: org.apache.solr.mcf.ManifoldCFQParserPluginTest [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 10.56 sec [junit] [junit] - Standard Error - [junit] WARNING: test class left thread running: Thread[MultiThreadedHttpCon nectionManager cleanup,5,main] [junit] RESOURCE LEAK: test class left 1 thread(s) running [junit] - --- [junit] Testsuite: org.apache.solr.mcf.ManifoldCFSearchComponentTest [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.096 sec [junit] [junit] - Standard Error - [junit] WARNING: test class left thread running: Thread[MultiThreadedHttpCon nectionManager cleanup,5,main] [junit] RESOURCE LEAK: test class left 1 thread(s) running [junit] - --- [junit] Testsuite: org.apache.solr.mcf.ManifoldCFSCLoadTest [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 40.486 sec [junit] [junit] - Standard Output --- [junit] Query time = 24352 [junit] - --- [junit] - Standard Error - [junit] WARNING: test class left thread running: Thread[MultiThreadedHttpCon nectionManager cleanup,5,main] [junit] RESOURCE LEAK: test class left 1 thread(s) running [junit] - --- The components that this test uses are simple: ?xml version=1.0 ? 
&lt;!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed
RE: Problem with manifold
Hi Pablo, Yes, I don't think you included the schema before. Having a default of __nosecurity__ is critical. Were the instructions unclear? And yes, this is safe, because all that it does is effectively guarantee that Solr fields without any value get one that can be queried on. Karl Sent from my Windows Phone From: Gonzalez, Pablo Sent: 11/7/2012 6:08 AM To: user@manifoldcf.apache.org Subject: RE: Problem with manifold Well, I did two things: -first I did what I told you in the last message: I changed my component only to care about the document-level security, and that way the query worked -then I realized that the documents that I indexed only had _document tokens, not _share tokens at all. THAT is the real problem. So, what I did was to change the definition of the fields in this way: <field name="allow_token_document" type="string" indexed="true" stored="true" multiValued="true" required="true" default="__nosecurity__"/> <field name="deny_token_document" type="string" indexed="true" stored="true" multiValued="true" required="true" default="__nosecurity__"/> <field name="allow_token_share" type="string" indexed="true" stored="true" multiValued="true" required="true" default="__nosecurity__"/> <field name="deny_token_share" type="string" indexed="true" stored="true" multiValued="true" required="true" default="__nosecurity__"/> Then I used the default /select handler and it worked. But my question is: is this safe? What I think this means is: if the document that I'm indexing has no share security restrictions, then set it to no security and let the user access it only if its document-level policies allow him to do so. Thinking about the system-not-indexing-share-tokens issue, I am wondering what could be the cause. Maybe it is an error in my ManifoldCF or Solr configuration that strips all the share tokens, or perhaps we should do something at the machine that contains the documents that we are indexing, to configure share-level security as we did at the document level. 
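[Editor's note] To see why the __nosecurity__ defaults make the stock /select handler behave, here is a toy Python sketch of the visibility rule that the security BooleanQuery discussed in this thread expresses. It is illustrative only, not the ManifoldCF Solr plugin's actual code; the function and field names are invented for the example.

```python
NOSECURITY = "__nosecurity__"

def level_visible(user_tokens, allow, deny):
    """True if the user may see a document at one security level.

    Mirrors the query shape from this thread:
      (+allow:__nosecurity__ +deny:__nosecurity__)
      OR, for each user token T: (allow:T AND NOT deny:T)
    """
    # Documents with no security info at this level are open to everyone;
    # the schema default guarantees the fields are never empty.
    if NOSECURITY in allow and NOSECURITY in deny:
        return True
    return any(t in allow and t not in deny for t in user_tokens)

def document_visible(user_tokens, doc):
    # Both the share level and the document level must grant access.
    return (level_visible(user_tokens, doc["allow_token_share"], doc["deny_token_share"])
            and level_visible(user_tokens, doc["allow_token_document"], doc["deny_token_document"]))

if __name__ == "__main__":
    user = ["active_dir:S-1-5-32-545", "active_dir:S-1-1-0"]
    # A document indexed with only document-level tokens: the schema
    # default fills the empty share-level fields with __nosecurity__.
    doc = {
        "allow_token_share": [NOSECURITY],
        "deny_token_share": [NOSECURITY],
        "allow_token_document": ["active_dir:S-1-5-32-545"],
        "deny_token_document": [],
    }
    print(document_visible(user, doc))  # True
```

Under this rule, a document indexed with only document-level tokens is still searchable, because the defaulted share-level fields match the __nosecurity__ clause, which is exactly the guarantee Karl describes above.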
-Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: miércoles, 07 de noviembre de 2012 11:42 To: user@manifoldcf.apache.org Subject: Re: Problem with manifold So, can you look at one document, and tell me what the allow and deny tokens are for both document and share levels? Just taking the share part of the clause away means that you will be allowing people to see search results when they cannot see within the corresponding Windows share (according to Active Directory). I'm hoping that you are just crawling through a different share than the one your users use to access the document. But in any case the URLs that are indexed will also not work to reach the files in question because the share restrictions. Karl On Wed, Nov 7, 2012 at 4:20 AM, Gonzalez, Pablo pablo.gonzalez.do...@hp.com wrote: Hello Karl, this is what I've done: -I've modified the class so that it prints out the BooleanQuery that it creates. -I've rerun the query (with my handler), and this is what it pumps out: +((+allow_token_share:__nosecurity__ +deny_token_share:__nosecurity__) allow_token_share:active_dir:S-1-5-32-545 -deny_token_share:active_dir:S-1-5-32-545 allow_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820 - -deny_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820 - allow_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820 -513 -deny_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820 -513 allow_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820 -1113 -deny_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820 -1113 allow_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820 -1110 -deny_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820 -1110 allow_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820 -1107 -deny_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820 -1107 allow_token_share:active_dir:S-1-1-0 
-deny_token_share:active_dir:S-1-1-0 allow_token_share:ad:S-1-5-32-545 -deny_token_share:ad:S-1-5-32-545 allow_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820- -deny_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820- allow_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-513 -deny_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-513 allow_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1113 -deny_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1113 allow_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1110 -deny_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1110 allow_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1107 -deny_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1107 allow_token_share:ad:S-1-1-0 -deny_token_share:ad:S-1-1-0) +((+allow_token_document:__nosecurity__ +deny_token_document:__nosecurity__) allow_token_document:active_dir:S-1-5-32-545
Re: value of DATACOLUMN
Since it works on 3.6, it is definitely not a JDBC issue. What content-type are you referring to? The Solr connector does not change what content type it posts based on the version of Solr it is posting to, so the content-type you are talking about sounds like something Solr is detecting rather than receiving. Can you confirm? Karl On Mon, Nov 12, 2012 at 8:40 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Thank you for the reply. I tried to set -Dfile.encoding but it didn't resolve the issue. What I can tell now is that the euro symbol in DATACOLUMN can be indexed on Solr 3.6 but cannot be indexed on Solr 4.0. On Solr 3.6 the content type was text/plain; on Solr 4.0 the content type was application/octet-stream. Is this a Solr issue, not the database's encoding? On 2012/11/12, at 20:36, Karl Wright wrote: It looks like the Postgresql JDBC driver sets the encoding itself, from what I can find. So I would guess that it is setting the character encoding based on the database you are connected to. So if the euro symbol is not handled by the database's encoding, there would be no way to include it in the query string. I think... Karl On Mon, Nov 12, 2012 at 6:22 AM, Karl Wright daddy...@gmail.com wrote: To clarify, we pass every string to the JDBC driver as a unicode string, but it is up to the JDBC driver to decide how to interpret it. I don't know what exactly the PostgreSQL 9.1 driver does here. It would be interesting to see what is posted to Solr, if you have those logs. It may be that it is picking an encoding that is based on your machine's default encoding, which would be unfortunate. This page apparently indicates that there is somehow a way to set the encoding that JDBC communicates with the database with: http://stackoverflow.com/questions/3040597/jdbc-character-encoding I don't know if this is applicable to us at all though. You can try: java -Dfile.encoding=utf8 start.jar ...and see if that changes things - it would be a good hint. 
Karl On Mon, Nov 12, 2012 at 6:12 AM, Karl Wright daddy...@gmail.com wrote: Hi Abe-san, Quoted strings in SQL queries are not necessarily unicode. See this page for details: http://www.postgresql.org/docs/7.3/static/functions-string.html There is nothing you can do in JDBC invocations to control the character set. This must be done in the query itself, or in the database itself. Karl On Mon, Nov 12, 2012 at 6:03 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi, I'm using Solr 4.0 and a JDBC connection to PostgreSQL. The dataQuery is configured as below: SELECT idfield AS $(IDCOLUMN), 'http://server?id=' || idfield AS $(URLCOLUMN), '12345' AS $(DATACOLUMN) FROM album WHERE idfield IN $(IDLIST) On the Solr side, '12345' could be indexed and stored. But when a non-ASCII character was configured, SELECT idfield AS $(IDCOLUMN), 'http://server?id=' || idfield AS $(URLCOLUMN), '€€€' AS $(DATACOLUMN) FROM album WHERE idfield IN $(IDLIST) On the Solr side, '€€€' was not indexed or stored. In reality, I configure a column which contains non-ASCII characters as DATACOLUMN. It seems the content type differs between the two cases. Can the JDBC connection control the content type? Regards, Shinichiro Abe
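[Editor's note] The euro sign is a good canary for the default-encoding hypothesis discussed above, because Latin-1 (a common platform default) cannot represent it at all. This standalone Python sketch, unrelated to the ManifoldCF code itself, demonstrates the two failure modes being discussed:

```python
euro = "€"  # U+20AC EURO SIGN

# UTF-8 can represent the euro sign as three bytes.
print(euro.encode("utf-8"))  # b'\xe2\x82\xac'

# Latin-1 (ISO-8859-1) has no euro sign at all, so encoding fails outright.
try:
    euro.encode("latin-1")
except UnicodeEncodeError as e:
    print("latin-1 cannot encode the euro sign:", e.reason)

# Decoding UTF-8 bytes under the wrong (default) encoding silently garbles
# the text instead of failing -- the kind of mismatch that setting the JVM's
# file.encoding is meant to avoid.
print(euro.encode("utf-8").decode("latin-1"))
```

Either way, a platform default encoding that cannot round-trip the euro sign would explain data that indexes fine in one setup and disappears or garbles in another.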
Re: Process behavior of executing multiple jobs
Hi Shigeki, This is a complex question, which is actually at the center of what ManifoldCF does. There are two different kinds of scheduling that MCF does. The first is scheduling documents within a single connection. The second is scheduling documents across connections. Let's start with the first. Every connector, given a document, has the ability to determine what throttling bins it belongs in. A throttling bin is an arbitrary grouping of documents that should be treated together for the purposes of throttling. For example, the web connector uses a document's server name as a throttling bin, which means that any new document from the same server will be rate-limited relative to other documents from that server. This grouping allows the ManifoldCF document queue to be prioritized (which means that a priority number is set) in such a way that documents from all bins have an equal probability of being scheduled in a given time interval. Then, the query that finds the next set of documents to crawl can do mostly the right thing if it just orders the query based on the priority number. The second layer adjusts for differences in performance between bins and between connections. ManifoldCF keeps track of the performance statistics of each connector and each throttle bin. If the statistics show that processing a document for one bin in one connector is significantly slower than for the others, it will take that into account and learn to give fewer documents from that bin or connection to the worker threads during any given time interval. If the statistics change, it will obviously be a little while before ManifoldCF adjusts its behavior. But eventually it should adjust. If you are seeing a specific long-term behavior that is not optimal, please let us know. It's been quite a while since anyone has had questions/issues in this area. Thanks, Karl On Sun, Nov 18, 2012 at 10:55 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi. 
I have a question about the process behavior when executing multiple jobs. I run MCF 1.0 on Tomcat, crawl files on Windows file servers, and index them into Solr 3.6. When I set up multiple jobs and execute them at the same time, I notice that the number of documents processed by each job seems to be uneven. For example, while one job has processed 100 documents, another job has only processed 5 documents so far. In the end, all of the jobs complete processing, but I wonder how those jobs can process documents evenly at the same time. I also wonder how MCF determines the priority of each document of each job to crawl and index. Regards, Shigeki
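[Editor's note] The first scheduling layer Karl describes above, assigning priority numbers so that documents from all throttling bins have an equal chance of being picked in a given interval, can be sketched roughly as follows. This is a toy model, not ManifoldCF's actual implementation (which also folds in the per-bin performance statistics of the second layer); all names here are invented:

```python
import heapq
from collections import defaultdict

def assign_priorities(docs_with_bins):
    """Toy bin-fair prioritization: the Nth document seen for a bin gets
    priority N, so ordering the whole queue by priority interleaves bins
    evenly rather than draining one bin before the next."""
    seen = defaultdict(int)
    queue = []
    for order, (doc, bin_name) in enumerate(docs_with_bins):
        seen[bin_name] += 1
        # (priority, arrival order) keeps the pop order deterministic.
        heapq.heappush(queue, (seen[bin_name], order, doc))
    return [heapq.heappop(queue)[2] for _ in range(len(queue))]

if __name__ == "__main__":
    # Three documents from serverA and one from serverB: B is not starved.
    docs = [("a1", "serverA"), ("a2", "serverA"),
            ("a3", "serverA"), ("b1", "serverB")]
    print(assign_priorities(docs))  # ['a1', 'b1', 'a2', 'a3']
```

The point of the sketch is the ordering: a single "next documents" query sorted by the priority number naturally alternates between bins, which is why one heavily-seeded job need not monopolize the worker threads.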
Re: Cannot connect to SharePoint 2010 instance
I've done further research on HttpComponents' support for Kerberos. It turns out that HttpComponents claims they can successfully use tickets from the local machine's ticket store. I haven't tried this here (don't have the setup for it), but it looks like it could conceivably work with MCF trunk at this point. Read up on it here: http://hc.apache.org/httpcomponents-client-ga/tutorial/html/authentication.html Ideally, of course, we'd really want to add the ability for ManifoldCF to handle its own ticket cache, one per connection, so that each connection looks like its own independent client. In order for that to happen, connectors that support Kerberos would need to be able to kerberos authenticate. But, for right now, this may work for people needing Kerberos. Karl On Sun, Nov 11, 2012 at 8:42 AM, Karl Wright daddy...@gmail.com wrote: The port of the SharePoint connector to httpcomponents 4.2.2 is complete. I don't know whether it will help you or not, but if you check out ManifoldCF trunk (from https://svn.apache.org/repos/asf/manifoldcf/trunk) and run: ant make-core-deps build ... you will be running the latest code. It has been tried against a plain-vanilla SharePoint system using standard NTLM and found to work. If you try the new code and it works for you, that would be very interesting to know; it looks like httpcomponents has developed some support for SPNEGO, which may be what is missing in the current ManifoldCF release. Thanks, Karl On Wed, Nov 7, 2012 at 4:47 PM, Karl Wright daddy...@gmail.com wrote: MCPermissions.asmx and Lists.asmx are two different services, and the Lists.asmx is likely failing before the MCPermissions.asmx is even needed. If, for instance, you are just trying with the UI to see if you get back Connection working, this makes sense since the Lists service is called first and then the MCPermissions service is called after. 
FWIW, I'm starting to look into porting ManifoldCF to the httpcomponent libraries from the older httpclient 3.1 world. This will make it easier, I think, to incorporate newer additions. Thanks, Karl On Wed, Nov 7, 2012 at 3:44 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, It looks like I am failing connecting to the /_vti_bin/lists.asmx service but I never see the MCPermissions.asmx in any of my trace logs. Why is that? Thanks Bob -Original Message- From: Iannetti, Robert Sent: Wednesday, November 07, 2012 10:37 AM To: user@manifoldcf.apache.org Subject: RE: Cannot connect to SharePoint 2010 instance Karl, The X's you see are me trying to make the log look generic there were valid guids present in the real log. I will try WireShark and let you know the results. Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Wednesday, November 07, 2012 10:32 AM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance This in general looks like a proper NTLM authorization sequence, except for the lack of confirmation at the end. The only thing I see that I don't recognize is this: DEBUG 2012-11-07 09:56:11,212 (Thread-441) - SPRequestGuid: xxx[\r][\n] If SharePoint is expecting this GUID to be returned somehow then that would explain it, but frankly we've got a number of SP 2010 installations and that hasn't been an issue anywhere else. And, I don't expect curl would work if that was the case. It's worth a shot using a tool like WireShark to see if you can find any difference in headers etc. between curl and ManifoldCF. We've noticed in the past that the exact Host header seems to be the critical issue, so any differences there would be of interest. Karl On Wed, Nov 7, 2012 at 10:08 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, We have created the NTLM SharePoint instance as instructed. The Curl command is now responding when before it would not. 
curl --ntlm -u domain\\username http://xxx.xxx.xxx.xxx/_vti_bin/MCPermissions.asmx -v But we are still getting an error when issuing the connection request from the ManifoldCF GUI Crawl user did not authenticate properly, or has insufficient permissions to access http://XXX.XXX.XXX.XXX: (401)Unauthorized From the log file DEBUG 2012-11-07 09:56:11,126 (Thread-441) - POST /_vti_bin/lists.asmx HTTP/1.1[\r][\n] DEBUG 2012-11-07 09:56:11,151 (Thread-441) - Content-Type: text/xml; charset=utf-8[\r][\n] DEBUG 2012-11-07 09:56:11,152 (Thread-441) - SOAPAction: http://schemas.microsoft.com/sharepoint/soap/GetListCollection[\r][\n]; DEBUG 2012-11-07 09:56:11,152 (Thread-441) - User-Agent: Axis/1.4[\r][\n] DEBUG 2012-11-07 09:56:11,152 (Thread-441) - Host: x...[\r][\n] DEBUG 2012-11-07 09:56:11,152 (Thread-441) - Transfer-Encoding: chunked[\r][\n] DEBUG 2012-11-07 09:56:11,152 (Thread-441) - [\r][\n] DEBUG 2012-11-07 09:56:11,153 (Thread-441) - 14f
Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized
Hi Luigi, The Negotiate is clearly part of the problem; please leave that out. The log entries you mention are indeed harmless warnings that we don't have an Italian localization yet. When you view the connection in the UI, what do you see now? Karl On Tue, Nov 27, 2012 at 8:25 AM, Luigi D'Addario luigi.dadda...@googlemail.com wrote: Hi Karl, thanks for your reply. *(1) Are you sure that your SharePoint IIS is not configured to use* *Kerberos auth?* On the SharePoint server, in the MetaBase.xml I have IIsWebVirtualDir Location =/LM/W3SVC/662429156/Root AccessFlags=AccessExecute | AccessRead | AccessWrite | AccessScript AppFriendlyName=Root AppIsolated=2 AppPoolId=SharePoint - 80 AppRoot=/LM/W3SVC/662429156/Root AuthFlags=*AuthNTLM* ContentIndexed=FALSE DefaultLogonDomain=services-kirey.lan DoDynamicCompression=TRUE DoStaticCompression=TRUE HttpCustomHeaders=X-Powered-By: ASP.NET MicrosoftSharePointTeamServices: 12.0.0.6421 *NTAuthenticationProviders=Negotiate,NTLM* Path=C:\Inetpub\wwwroot\wss\VirtualDirectories\80 OK, Negotiate comes first, but if I force only NTLM (*NTAuthenticationProviders=NTLM*), manifoldcf.log *does not record any messages*! With a simple ASP script running on my SharePoint server page I tried to get the authentication mode via HTTP, and this is the result: with *NTAuthenticationProviders=NTLM:* *User Id = VM-SHPT2K7\Administrator The user was logged in using the NTLM authentication method.* with *NTAuthenticationProviders=Negotiate,NTLM:* *User Id = VM-SHPT2K7\Administrator The Negotiate method was used! 
The user was logged on using NTLM* In manifoldcf.log I found this error, but I think it is not related to the 401: ERROR 2012-11-27 10:56:49,828 (qtp17632942-166) - Missing resource bundle 'org.apache.manifoldcf.crawler.connectors.sharepoint.common' for locale 'it': Can't find bundle for base name org.apache.manifoldcf.crawler.connectors.sharepoint.common, locale it; trying it java.util.MissingResourceException: Can't find bundle for base name org.apache.manifoldcf.crawler.connectors.sharepoint.common, locale it 2012/11/27 Karl Wright daddy...@gmail.com Hi Luigi, The warning is coming from the part of commons-httpclient that is trying to set up communication with your SharePoint instance. It thinks it needs to use SPNEGO to figure out the authentication mechanism, and it seems to be trying to load Kerberos 5 configuration information, which means that it thinks Kerberos is the authentication mechanism of choice. (1) Are you sure that your SharePoint IIS is not configured to use Kerberos auth? (2) What command-line arguments are you giving to the JVM that is running ManifoldCF? Karl On Tue, Nov 27, 2012 at 7:44 AM, Luigi D'Addario luigi.dadda...@googlemail.com wrote: Hello, I have installed apache-manifoldcf-1.0.1 on my Windows XP and the apache-manifoldcf-sharepoint-2007-plugin on my SharePoint 2007 server (a virtual machine). I can see the Permissions Page when I enter http://x:x/sub_directory/_vti_bin/MCPermissions.asmx in my browser. 
When I try to make a SharePoint Services 3.0 (2007) connection to my SharePoint 2007 server in the ManifoldCF interface I get this error: Crawl user did not authenticate properly, or has insufficient permissions to accesshttp://vm-shpt2k7/KireyRep: (401)HTTP/1.1 401 Unauthorized Via curl i get first a 401 and then a 200 status: curl --ntlm -u vm-shpt2k7\\administrator http://vm-shpt2k7/KireyRep/_vti_bin/MCPermissions.asmx -v Enter host password for user 'vm-shpt2k7\\administrator': * About to connect() to vm-shpt2k7 port 80 (#0) * Trying 192.168.30.42... * connected * Connected to vm-shpt2k7 (192.168.30.42) port 80 (#0) * Server auth using NTLM with user 'vm-shpt2k7\\administrator' GET /KireyRep/_vti_bin/MCPermissions.asmx HTTP/1.1 Authorization: NTLM TlRMTVNTUAABt4II4gAFASgKDw== User-Agent: curl/7.25.0 (i386-pc-win32) libcurl/7.25.0 OpenSSL/0.9.8u zlib/1.2 .6 libssh2/1.4.0 Host: vm-shpt2k7 Accept: */* HTTP/1.1 401 Unauthorized Content-Length: 1539 Content-Type: text/html Server: Microsoft-IIS/6.0 WWW-Authenticate: NTLM TlRMTVNTUAACHAAcADg1goniwKcRCkDsTOwAAMo AygBUBQLODg9TAEUAUgBWAEkAQwBFAFMALQBLAEkAUgBFAFkAAgAcAFMARQBSAFYASQBDAEU AUwAtAEsASQBSAEUAWQABABQAVgBNAC0AUwBIAFAAVAAyAEsANwAEACQAcwBlAHIAdgBpAGMAZQBzAC0 AawBpAHIAZQB5AC4AbABhAG4AAwA6AHYAbQAtAHMAaABwAHQAMgBrADcALgBzAGUAcgB2AGkAYwBlAHM ALQBrAGkAcgBlAHkALgBsAGEAbgAFACQAcwBlAHIAdgBpAGMAZQBzAC0AawBpAHIAZQB5AC4AbABhAG4 AAA== X-Powered-By: ASP.NET MicrosoftSharePointTeamServices: 12.0.0.6421 Date: Mon, 26 Nov 2012 21:47:30 GMT * Ignoring the response-body * Connection #0 to host vm-shpt2k7 left intact * Issue another request to this URL: 'http://vm-shpt2k7/KireyRep/_vti_bin/MCPerm issions.asmx' * Re-using existing
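[Editor's note: the long WWW-Authenticate: NTLM value in the curl trace above is a base64-encoded NTLM Type 2 (challenge) message; decoding it shows which target name the server is authenticating for, which is handy when chasing domain-name mismatches like the ones in this thread. A stdlib-only sketch; the token built at the bottom is synthetic, not the real blob from the trace.]

```python
import base64
import struct

def ntlm_type2_target(b64_token):
    """Decode an NTLM Type 2 (challenge) token and return its target name.

    Layout: 8-byte "NTLMSSP\0" signature, 4-byte little-endian message
    type, then a security buffer (len, maxlen, offset) pointing at the
    UTF-16LE target name elsewhere in the message.
    """
    raw = base64.b64decode(b64_token)
    if raw[:8] != b"NTLMSSP\x00":
        raise ValueError("not an NTLMSSP token")
    (msg_type,) = struct.unpack_from("<I", raw, 8)
    if msg_type != 2:
        raise ValueError("not a Type 2 (challenge) message")
    length, _maxlen, offset = struct.unpack_from("<HHI", raw, 12)
    return raw[offset:offset + length].decode("utf-16-le")

# Synthetic challenge token for demonstration only:
name = "SERVICES-KIREY".encode("utf-16-le")
token = base64.b64encode(
    b"NTLMSSP\x00"
    + struct.pack("<I", 2)                           # message type 2
    + struct.pack("<HHI", len(name), len(name), 32)  # target name buffer
    + struct.pack("<I", 0)                           # negotiate flags
    + b"\x00" * 8                                    # server challenge
    + name)

print(ntlm_type2_target(token))  # SERVICES-KIREY
```

If the decoded target name disagrees with the domain used in the credentials, that is worth investigating before anything else.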
Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized
Ok, can you try a fully-qualified domain name, rather than the abbreviated one you have given, for the credentials? Also, you might want to look at the server-side event logs for the reason for the authentication failure. Thanks, Karl On Tue, Nov 27, 2012 at 9:04 AM, Luigi D'Addario luigi.dadda...@googlemail.com wrote: well, on SharePoint Server: *NTAuthenticationProviders=NTLM* on ManifoldCF UI interface, error: Parameters: serverLocation=/KireyRep serverPort=80 serverVersion=3.0 userName=VM-SHPT2K7\Administrator serverProtocol=http serverName=vm-shpt2k7.services-kirey.lan password= Connection status:Crawl user did not authenticate properly, or has insufficient permissions to access http://vm-shpt2k7.services-kirey.lan/KireyRep: *(401)HTTP/1.1 401 Unauthorized* on manifoldcf.log *no error trace !* [earlier quoted messages trimmed] 
Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized
Just on a whim, can you try POST with curl also? It is possible that POSTs are blocked in some way. If that doesn't work, then your security settings are prohibiting post. If that DOES work, then I'd like you to download a ManifoldCF 1.1-dev image from http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev , and try that. This uses httpcomponents rather than our special commons-httpclient version. If none of this helps, getting a packet capture of both a curl POST and the comparable ManifoldCF attempt may well show us what the key issue is. It's possible that there is a header or something your IIS is rejecting, for instance. Thanks, Karl On Tue, Nov 27, 2012 at 11:06 AM, Luigi D'Addario luigi.dadda...@googlemail.com wrote: Karl, I tried many credential combination .. always 401 .. From server log, with ManifoldCF UI interface (in POST), 401 error: #Software: Microsoft Internet Information Services 6.0 #Version: 1.0 #Date: 2012-11-27 15:38:37 #Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status 2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 *POST*/KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62 Axis/1.4 *401* 2 2148074254 2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 *POST */KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62 Axis/1.4 *401* 1 0 2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 *POST */KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62 Axis/1.4 *401* 1 2148074252 With direct call via http (http://vm-shpt2k7/KireyRep/_vti_bin/lists.asmx), (in GET): 2012-11-27 15:43:48 W3SVC662429156 192.168.30.42 GET /KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727;+.NET+CLR+1.1.4322;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729) *401* 2 2148074254 2012-11-27 15:43:48 W3SVC662429156 192.168.30.42 GET /KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62 
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727;+.NET+CLR+1.1.4322;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729) *401 *1 0 2012-11-27 15:43:48 W3SVC662429156 192.168.30.42 GET /KireyRep/_vti_bin/lists.asmx - 80 vm-shpt2k7\administrator 192.168.49.62 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727;+.NET+CLR+1.1.4322;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729) *200* 0 0 It's quite a conundrum ... [earlier quoted messages trimmed] 
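[Editor's note: the server-side IIS logs quoted above are easier to read once the #Fields directive is paired with each entry; the sc-substatus column is the interesting part, since (per Microsoft's IIS documentation) 401.1 means the logon itself failed while 401.2 means it was denied by server configuration. A small sketch; the sample lines are abridged from the ones quoted above, with the emphasis markers removed.]

```python
def parse_w3c(lines):
    """Parse IIS W3C extended log lines into dicts keyed by field name."""
    fields, records = [], []
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]          # field names follow the directive
        elif line.startswith("#") or not line.strip():
            continue                           # skip other directives and blanks
        else:
            records.append(dict(zip(fields, line.split())))
    return records

log = [
    "#Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query "
    "s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus "
    "sc-win32-status",
    "2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 POST "
    "/KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62 Axis/1.4 "
    "401 2 2148074254",
    "2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 POST "
    "/KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62 Axis/1.4 401 1 0",
]

for r in parse_w3c(log):
    print(r["cs-method"], r["cs-uri-stem"],
          r["sc-status"] + "." + r["sc-substatus"])
```

Here the 401.2 followed by 401.1 is the normal NTLM handshake pattern; the question is why it never reaches a final 200 the way the browser request does.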
Re: Cannot connect to SharePoint 2010 instance
Hi Bob, This is really beginning to sound like there is a header problem of some kind. This is what I'd like to try. (1) Turn on wire debugging for SharePoint, as described here: https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections (2) Using curl, try to use post and the proper credentials, using the -vvv switch. If you successfully connect, save that output. Then try to EXACTLY mimic the request that ManifoldCF does, and if that FAILS record that output and send it all to me. Thanks! Karl On Tue, Nov 27, 2012 at 11:22 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Hi Karl, I have installed the dev version of the connector from below and am having an issue connecting to my SharePoint 2010 site. It actually seems similar to what is happening in your thread with Luigi. I try to log in to the sharepoint site as a user with full control and I get this error Crawl user did not authenticate properly, or has insufficient permissions to access http://...: (401)HTTP/1.1 401 Unauthorized Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Monday, November 26, 2012 6:38 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Ok, you can download a dev build at: http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev It takes me about an hour to put one of these together, so if you can possibly build ManifoldCF yourself that would be a huge help. Karl On Mon, Nov 26, 2012 at 11:12 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: That would be great please let me know when it is available Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Monday, November 26, 2012 10:59 AM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Hi Robert, I can build a binary version you can download, but not until tonight. It may be easier to talk through getting a build environment set up on your Linux machine. 
Is this Debian or Ubuntu linux, by any chance? If so, the setup is trivial and I can help you with that. Karl On Mon, Nov 26, 2012 at 10:12 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, Is there a binary release (pre -compiled version) of the manifold trunk mentioned below https://svn.apache.org/repos/asf/manifoldcf/trunk that you can point me to I am new to Linux and don't have any experience with ANT. Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Monday, November 26, 2012 4:32 AM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance I've done further research on HttpComponents' support for Kerberos. It turns out that HttpComponents claims they can successfully use tickets from the local machine's ticket store. I haven't tried this here (don't have the setup for it), but it looks like it could conceivably work with MCF trunk at this point. Read up on it here: http://hc.apache.org/httpcomponents-client-ga/tutorial/html/authentic a tion.html Ideally, of course, we'd really want to add the ability for ManifoldCF to handle its own ticket cache, one per connection, so that each connection looks like its own independent client. In order for that to happen, connectors that support Kerberos would need to be able to kerberos authenticate. But, for right now, this may work for people needing Kerberos. Karl On Sun, Nov 11, 2012 at 8:42 AM, Karl Wright daddy...@gmail.com wrote: The port of the SharePoint connector to httpcomponents 4.2.2 is complete. I don't know whether it will help you or not, but if you check out ManifoldCF trunk (from https://svn.apache.org/repos/asf/manifoldcf/trunk) and run: ant make-core-deps build ... you will be running the latest code. It has been tried against a plain-vanilla SharePoint system using standard NTLM and found to work. 
If you try the new code and it works for you, that would be very interesting to know; it looks like httpcomponents has developed some support for SPNEGO, which may be what is missing in the current ManifoldCF release. Thanks, Karl On Wed, Nov 7, 2012 at 4:47 PM, Karl Wright daddy...@gmail.com wrote: MCPermissions.asmx and Lists.asmx are two different services, and the Lists.asmx is likely failing before the MCPermissions.asmx is even needed. If, for instance, you are just trying with the UI to see if you get back Connection working, this makes sense since the Lists service is called first and then the MCPermissions service is called after. [earlier quoted messages trimmed] 
Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized
You need to use the --data option, not -X. Karl On Tue, Nov 27, 2012 at 11:37 AM, Luigi D'Addario luigi.dadda...@googlemail.com wrote: Karl, via curl with POST I get an HTTP/1.1 *411 Length Required*. Does that mean POST is blocked? curl -X POST --ntlm -u vm-shpt2k7\\administrator http://vm-shpt2k7/KireyRep/_vti_bin/MCPermissions.asmx -v Enter host password for user 'vm-shpt2k7\\administrator': * About to connect() to vm-shpt2k7 port 80 (#0) * Trying 192.168.30.42... * connected * Connected to vm-shpt2k7 (192.168.30.42) port 80 (#0) * Server auth using NTLM with user 'vm-shpt2k7\\administrator' POST /KireyRep/_vti_bin/MCPermissions.asmx HTTP/1.1 Authorization: NTLM TlRMTVNTUAABt4II4gAFASgKDw== User-Agent: curl/7.25.0 (i386-pc-win32) libcurl/7.25.0 OpenSSL/0.9.8u zlib/1.2.6 libssh2/1.4.0 Host: vm-shpt2k7 Accept: */* HTTP/1.1 *411 Length Required* Content-Type: text/html Date: Tue, 27 Nov 2012 16:32:06 GMT Connection: close Content-Length: 24 h1Length Required/h1* Closing connection #0 [earlier quoted messages trimmed] 
Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized
Curl with POST then works. So the next step is to turn on wire debugging. See https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections. Repeat the connection attempt with ManifoldCF, and send me the output. I want to verify that the headers (apart from the NTLM www-authenticate headers) are the same. Thanks! Karl On Tue, Nov 27, 2012 at 11:54 AM, Luigi D'Addario luigi.dadda...@googlemail.com wrote: Thanks :o] *HTTP/1.1 401 Unauthorized* + *HTTP/1.1 500 Internal Server Error* curl --data POST --ntlm -u vm-shpt2k7\\administrator http://vm-shpt2k7/KireyRep/_vti_bin/MCPermissions.asmx -v Enter host password for user 'vm-shpt2k7\\administrator': * About to connect() to vm-shpt2k7 port 80 (#0) * Trying 192.168.30.42... * connected * Connected to vm-shpt2k7 (192.168.30.42) port 80 (#0) * Server auth using NTLM with user 'vm-shpt2k7\\administrator' POST /KireyRep/_vti_bin/MCPermissions.asmx HTTP/1.1 Authorization: NTLM TlRMTVNTUAABt4II4gAFASgKDw== User-Agent: curl/7.25.0 (i386-pc-win32) libcurl/7.25.0 OpenSSL/0.9.8u zlib/1.2 .6 libssh2/1.4.0 Host: vm-shpt2k7 Accept: */* Content-Length: 0 Content-Type: application/x-www-form-urlencoded *HTTP/1.1 401 Unauthorized* Content-Length: 1539 Content-Type: text/html Server: Microsoft-IIS/6.0 WWW-Authenticate: NTLM TlRMTVNTUAACHAAcADg1goniAo2Exi/3+LAAAMo AygBUBQLODg9TAEUAUgBWAEkAQwBFAFMALQBLAEkAUgBFAFkAAgAcAFMARQBSAFYASQBDAEU AUwAtAEsASQBSAEUAWQABABQAVgBNAC0AUwBIAFAAVAAyAEsANwAEACQAcwBlAHIAdgBpAGMAZQBzAC0 AawBpAHIAZQB5AC4AbABhAG4AAwA6AHYAbQAtAHMAaABwAHQAMgBrADcALgBzAGUAcgB2AGkAYwBlAHM ALQBrAGkAcgBlAHkALgBsAGEAbgAFACQAcwBlAHIAdgBpAGMAZQBzAC0AawBpAHIAZQB5AC4AbABhAG4 AAA== X-Powered-By: ASP.NET MicrosoftSharePointTeamServices: 12.0.0.6421 Date: Tue, 27 Nov 2012 16:44:14 GMT * Ignoring the response-body * Connection #0 to host vm-shpt2k7 left intact * Issue another request to this URL: ' http://vm-shpt2k7/KireyRep/_vti_bin/MCPermissions.asmx' * Re-using existing connection! 
(#0) with host (nil) * Connected to (nil) (192.168.30.42) port 80 (#0) * Server auth using NTLM with user 'vm-shpt2k7\\administrator' POST /KireyRep/_vti_bin/MCPermissions.asmx HTTP/1.1 Authorization: NTLM TlRMTVNTUAADGAAYAJAYABgAqBQAFABIHAAcAFwAAA AYABgAeBAAEADANYKI4gUBKAoPdgBtAC0AcwBoAHAAdAAyAGsANwBcAGEAZABtAGkAbg BpAHMAdAByAGEAdABvAHIAUgBNAC0ARABBAEQARABBAFIASQBPAEwA/F/Dh4QXrhYAAA AADC8nQrN/CSjOrtwdcX5eneq+k+ZoTa0H5pip2sZd+GXoCE/Z+1QHfg== User-Agent: curl/7.25.0 (i386-pc-win32) libcurl/7.25.0 OpenSSL/0.9.8u zlib/1.2 .6 libssh2/1.4.0 Host: vm-shpt2k7 Accept: */* Content-Length: 4 Content-Type: application/x-www-form-urlencoded * upload completely sent off: 4 out of 4 bytes *HTTP/1.1 500 Internal Server Error* Date: Tue, 27 Nov 2012 16:44:15 GMT Server: Microsoft-IIS/6.0 X-Powered-By: ASP.NET MicrosoftSharePointTeamServices: 12.0.0.6421 X-AspNet-Version: 2.0.50727 Cache-Control: private Content-Type: application/soap+xml; charset=utf-8 Content-Length: 521 ?xml version=1.0 encoding=utf-8?soap:Envelope xmlns:soap= http://www.w3.o rg/2003/05/soap-envelope xmlns:xsi= http://www.w3.org/2001/XMLSchema-instance; xmlns:xsd=http://www.w3.org/2001/XMLSchema soap:Bodysoap:Faultsoap:Code soap:Valuesoap:Receiver/soap:Value/soap:Codesoap:Reasonsoap:Text xml:lan g=itImpossibile elaborare la richiesta. ---gt; Rilevati dati non validi al l ivello principale. 
Riga 1, posizione 1. [Italian for: "Unable to process the request. ---> Invalid data detected at the root level. Line 1, position 1."] /soap:Text/soap:Reasonsoap:Detail / /soap:Fault/soap:Body/soap:Envelope* Connection #0 to host (nil) left intact * Closing connection #0 And from the server log: 2012-11-27 16:44:14 W3SVC662429156 192.168.30.42 *POST */KireyRep/_vti_bin/MCPermissions.asmx - 80 - 192.168.49.65 curl/7.25.0+(i386-pc-win32)+libcurl/7.25.0+OpenSSL/0.9.8u+zlib/1.2.6+libssh2/1.4.0 *401* 1 0 2012-11-27 16:44:14 W3SVC662429156 192.168.30.42 *POST */KireyRep/_vti_bin/MCPermissions.asmx - 80 vm-shpt2k7\Administrator 192.168.49.65 curl/7.25.0+(i386-pc-win32)+libcurl/7.25.0+OpenSSL/0.9.8u+zlib/1.2.6+libssh2/1.4.0 *500* 0 0 [earlier quoted messages trimmed] 
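[Editor's note: the 500 "invalid data at the root level, line 1, position 1" is actually the expected outcome of this test. `--data POST` sends the four literal characters "POST" as the request body, which is not XML, so authentication clearly succeeded and the web service then rejected the payload. A real request would carry a SOAP envelope; a minimal stdlib sketch of building one follows, where the operation and service namespace names are illustrative only, not the exact schema MCPermissions.asmx expects.]

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace, for illustration only:
SVC_NS = "http://example.com/sharepoint/permissions/"

def soap_envelope(operation, ns):
    """Build a minimal SOAP 1.1 envelope invoking a single operation."""
    env = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(env, f"{{{SOAP_NS}}}Body")
    ET.SubElement(body, f"{{{ns}}}{operation}")
    return ET.tostring(env, encoding="unicode")

payload = soap_envelope("GetPermissionCollection", SVC_NS)
# Unlike the literal body "POST", this payload parses as XML:
ET.fromstring(payload)
print(payload)
```

So the curl result above confirms that POSTs are not blocked and NTLM auth works; the remaining question is purely why ManifoldCF's authenticated POST is refused when curl's is accepted.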
Re: Cannot connect to SharePoint 2010 instance
Hi Bob, If the headers all check out, then maybe this is the cause: http://technet.microsoft.com/en-us/library/dd566199%28v=ws.10%29.aspx I will have to check the httpcomponents code to verify that it uses at least 128-bit encryption. I won't be able to do that until tonight or tomorrow though. Karl [earlier quoted messages trimmed] 
Re: Cannot connect to SharePoint 2010 instance
The file is usually called logging.ini, and is referenced by the main manifoldcf properties file. Karl On Tue, Nov 27, 2012 at 1:06 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, Where is the logging properties file where I would add the debugging commands located? Thanks Bob 
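For readers hunting for the file Karl mentions: in the example single-process deployment, properties.xml carries a property entry pointing at the logging configuration. A minimal sketch of that reference (the property name and path here are recalled from the ManifoldCF documentation and should be verified against your own properties.xml):

```
<property name="org.apache.manifoldcf.logconfigfile" value="./logging.ini"/>
```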
Re: Cannot connect to SharePoint 2010 instance
The wire debugging setup you are using will only work with commons-httpclient, not the new httpcomponent package. I'll have to do some research and see if there's a comparable logger setting for that package. Karl On Tue, Nov 27, 2012 at 2:01 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, It's odd I added this to the properties.xml file property name=org.apache.manifoldcf.connectors value=DEBUG/ And this to the logging.ini file log4j.logger.httpclient.wire=DEBUG I restarted manifold but nothing is being written to the manifoldcf.log file Any thoughts? Here is the curl data [iannero1@ip-10-145-32-121 logs]$ curl --data POST --ntlm -u nanet\\iannero1 http://searchpoc.testprojects.nibr.novartis.intra/_vti_bin/MCPermissions.asmx -v Enter host password for user 'nanet\iannero1': * About to connect() to searchpoc.testprojects.nibr.novartis.intra port 80 (#0) * Trying 160.62.169.185... connected * Connected to searchpoc.testprojects.nibr.novartis.intra (160.62.169.185) port 80 (#0) * Initializing NSS with certpath: sql:/etc/pki/nssdb * Server auth using NTLM with user 'nanet\iannero1' POST /_vti_bin/MCPermissions.asmx HTTP/1.1 Authorization: NTLM TlRMTVNTUAABBoIIAAA= User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2 Host: searchpoc.testprojects.nibr.novartis.intra Accept: */* Content-Length: 0 Content-Type: application/x-www-form-urlencoded HTTP/1.1 401 Unauthorized Server: Microsoft-IIS/7.5 SPRequestGuid: f7b8f5a5-1de4-43d1-9b70-7adf3b7d5987 WWW-Authenticate: NTLM TlRMTVNTUAACBwAHADgGgokClVhQpcbj++YAAMoAygA/BgGxHQ9OSUJSTkVUAgAOAE4ASQBCAFIATgBFAFQAAQAYAE4AUgBVAFMAQwBBAC0AUwBEADAANwA5AAQAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQAAwA8AE4AUgBVAFMAQwBBAC0AUwBEADAANwA5AC4AbgBpAGIAcgAuAG4AbwB2AGEAcgB0AGkAcwAuAG4AZQB0AAUAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQABwAIAA9MVPbLzM0BAA== X-Powered-By: ASP.NET MicrosoftSharePointTeamServices: 14.0.0.6123 X-MS-InvokeApp: 1; RequireReadOnly 
Date: Tue, 27 Nov 2012 18:21:04 GMT Content-Length: 0 * Connection #0 to host searchpoc.testprojects.nibr.novartis.intra left intact * Issue another request to this URL: 'http://searchpoc.testprojects.nibr.novartis.intra/_vti_bin/MCPermissions.asmx' * Re-using existing connection! (#0) with host searchpoc.testprojects.nibr.novartis.intra * Connected to searchpoc.testprojects.nibr.novartis.intra (160.62.169.185) port 80 (#0) * Server auth using NTLM with user 'nanet\iannero1' POST /_vti_bin/MCPermissions.asmx HTTP/1.1 Authorization: NTLM TlRMTVNTUAADGAAYAEAYABgAWAUABQBwCAAIAHUQABAAfQAABoKJApbX61a3hdN3ANfyrXuxF91dkEOBT5GMXTvsPdHWkjT6rm5hbmV0aWFubmVybzFpcC0xMC0xNDUtMzItMTIx User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2 Host: searchpoc.testprojects.nibr.novartis.intra Accept: */* Content-Length: 4 Content-Type: application/x-www-form-urlencoded HTTP/1.1 500 Internal Server Error Cache-Control: private Content-Type: application/soap+xml; charset=utf-8 Server: Microsoft-IIS/7.5 X-AspNet-Version: 2.0.50727 Persistent-Auth: true X-Powered-By: ASP.NET MicrosoftSharePointTeamServices: 14.0.0.6123 X-MS-InvokeApp: 1; RequireReadOnly Date: Tue, 27 Nov 2012 18:21:04 GMT Content-Length: 509 * Connection #0 to host searchpoc.testprojects.nibr.novartis.intra left intact * Closing connection #0 ?xml version=1.0 encoding=utf-8?soap:Envelope xmlns:soap=http://www.w3.org/2003/05/soap-envelope; xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance; xmlns:xsd=http://www.w3.org/2001/XMLSchema;soap:Bodysoap:Faultsoap:Codesoap:Valuesoap:Receiver/soap:Value/soap:Codesoap:Reasonsoap:Text xml:lang=enServer was unable to process request. ---gt; Data at the root level is invalid. 
Line 1, position 1./soap:Text/soap:Reasonsoap:Detail //soap:Fault/soap:Body/soap:Envelope[iannero1@ip-10-145-32-121 logs]$ 
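A side note on the transcript above: `--data POST` makes curl send the literal four-byte string "POST" as the request body (hence "Content-Length: 4"), and that non-XML body is what triggers the "Data at the root level is invalid" SOAP fault. The 401-then-500 sequence actually shows NTLM authentication succeeding (note "Persistent-Auth: true" on the second response). A sketch of a cleaner probe that exercises only the authentication, using `-X POST` with an empty body (host and credentials taken from the transcript):

```
curl -X POST --ntlm -u 'nanet\iannero1' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  http://searchpoc.testprojects.nibr.novartis.intra/_vti_bin/MCPermissions.asmx -v
```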
Re: Cannot connect to SharePoint 2010 instance
Here we go: Header logging: org.apache.http.headers=DEBUG Wire logging (which we probably don't need): org.apache.http.wire=DEBUG Karl 
Re: Cannot connect to SharePoint 2010 instance
Yes. Karl On Tue, Nov 27, 2012 at 2:14 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: So would the org.apache.http.headers=DEBUG replace the log4j.logger.httpclient.wire=DEBUG entry in the logging.ini file? 
Re: Web crawling causes Socket Timeout after Database Exception
Ok, fix has been checked in. Karl On Wed, Nov 28, 2012 at 3:19 AM, Karl Wright daddy...@gmail.com wrote: The ticket is CONNECTORS-571. Karl On Wed, Nov 28, 2012 at 3:12 AM, Karl Wright daddy...@gmail.com wrote: Hi Shigeki, This confirms my theory that our MySQL driver is not detecting all cases where MySQL gives up on a transaction. We need to correct this, but in order to do that we need the SQL error code that MySQL throws in this case: Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction It looks like somebody actually posted the SQL error code that MySQL sends out with this online: ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction Are you able to build ManifoldCF? I will check in a fix to trunk for this problem shortly; it would be great if you could try it out. Thanks, Karl On Wed, Nov 28, 2012 at 2:30 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi Karl, Here is a log of a Database Exception that occurred while crawling the Web. This time, the socket timeout exception did not happen, so it might be a different matter. Even though the job status remains Running, it seems that MCF stopped crawling (the job was not aborted). 
ERROR 2012-11-22 19:36:28,593 (Worker thread '16') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681) at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709) at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394) at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144) at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186) at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performModification(DBInterfaceMySQL.java:678) at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performUpdate(DBInterfaceMySQL.java:275) at org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80) at org.apache.manifoldcf.crawler.jobs.HopCount.markForDelete(HopCount.java:1426) at org.apache.manifoldcf.crawler.jobs.HopCount.doDeleteInvalidation(HopCount.java:1356) at org.apache.manifoldcf.crawler.jobs.HopCount.doFinish(HopCount.java:1057) at org.apache.manifoldcf.crawler.jobs.HopCount.finishParents(HopCount.java:389) at org.apache.manifoldcf.crawler.jobs.JobManager.finishDocuments(JobManager.java:4309) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:557) Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163) at 
com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624) at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2345) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2330) at org.apache.manifoldcf.core.database.Database.execute(Database.java:840) at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641) Here is a log of Database Exception that is occurred while crawling files using Windows shares connection: 2012/11/22 23:39:28 ERROR (Job start thread) - Job start thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681) at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709) at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394
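The fix Karl describes amounts to recognizing MySQL's "Lock wait timeout exceeded" error (vendor error code 1205) as a transaction-abort condition that should trigger a retry, rather than surfacing as a fatal exception. A minimal, self-contained sketch of that detection — not the actual ManifoldCF code; the class and method names here are illustrative:

```java
import java.sql.SQLException;

public class LockWaitTimeoutCheck {
    // MySQL reports "Lock wait timeout exceeded; try restarting transaction"
    // as vendor error code 1205 (ER_LOCK_WAIT_TIMEOUT), SQLSTATE HY000.
    static final int ER_LOCK_WAIT_TIMEOUT = 1205;

    // Returns true when the exception means the transaction gave up on a lock
    // and should be retried by the caller.
    static boolean isTransactionRetryable(SQLException e) {
        return e.getErrorCode() == ER_LOCK_WAIT_TIMEOUT;
    }

    public static void main(String[] args) {
        SQLException timeout = new SQLException(
            "Lock wait timeout exceeded; try restarting transaction", "HY000", 1205);
        System.out.println(isTransactionRetryable(timeout)); // prints "true"
    }
}
```

In practice a JDBC layer would perform this check in its query-execution path and re-run the enclosing transaction, which is what the CONNECTORS-571 checkin set out to do.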
Re: Cannot connect to SharePoint 2010 instance
BAQAAAOCqZ5Zwzc0BgVBsO4H03jQAAgAOAE4ASQBCAFIATgBFAFQAAQAYAE4AUgBVAFMAQwBBAC0AUwBEADAANwA5AAQAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQAAwA8AE4AUgBVAFMAQwBBAC0AUwBEADAANwA5AC4AbgBpAGIAcgAuAG4Abw B2AGEAcgB0AGkAcwAuAG4AZQB0AAUAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQABwAIAOdjFpZwzc0BAE4AQQBOAEUAVABpAGEAbgBuAGUAcgBvADEASQBQAC0AMQAwAC0AMQA0ADUALQAzADIALQAxADIAMQA= DEBUG 2012-11-28 08:59:31,678 (Thread-479) - HTTP/1.1 401 Unauthorized DEBUG 2012-11-28 08:59:31,678 (Thread-479) - Server: Microsoft-IIS/7.5 DEBUG 2012-11-28 08:59:31,678 (Thread-479) - SPRequestGuid: cfac18c9-3870-4854-bb2d-816f3dc8c2f3 DEBUG 2012-11-28 08:59:31,678 (Thread-479) - WWW-Authenticate: NTLM DEBUG 2012-11-28 08:59:31,678 (Thread-479) - X-Powered-By: ASP.NET DEBUG 2012-11-28 08:59:31,678 (Thread-479) - MicrosoftSharePointTeamServices: 14.0.0.6123 DEBUG 2012-11-28 08:59:31,678 (Thread-479) - X-MS-InvokeApp: 1; RequireReadOnly DEBUG 2012-11-28 08:59:31,678 (Thread-479) - Date: Wed, 28 Nov 2012 13:59:31 GMT DEBUG 2012-11-28 08:59:31,678 (Thread-479) - Content-Length: 0 -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, November 27, 2012 5:25 PM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance no, you need: log4j.logger.logger_name=DEBUG in this case: log4j.logger.org.apache.http.headers=DEBUG Thanks, Karl On Tue, Nov 27, 2012 at 3:30 PM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, I added the parameter to the logging.ini file and I am still not seeing any data written to the log # Licensed to the Apache Software Foundation (ASF) under one or more # contributor license agreements. See the NOTICE file distributed with # this work for additional information regarding copyright ownership. # The ASF licenses this file to You under the Apache License, Version 2.0 # (the License); you may not use this file except in compliance with # the License. 
You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an AS IS BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. log4j.appender.MAIN.File=logs/manifoldcf.log log4j.rootLogger=WARN, MAIN log4j.appender.MAIN=org.apache.log4j.RollingFileAppender log4j.appender.MAIN.layout=org.apache.log4j.PatternLayout log4j.appender.MAIN.layout.ConversionPattern=%5p %d{ISO8601} (%t) - %m%n # add additional logging org.apache.http.headers=DEBUG 
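Karl's correction is that log4j property-file entries need the `log4j.logger.` prefix; a bare `org.apache.http.headers=DEBUG` line is silently ignored. Bob's logging.ini, with only the last line changed per that correction, would end like this (all other lines as in his paste):

```
log4j.appender.MAIN.File=logs/manifoldcf.log
log4j.rootLogger=WARN, MAIN
log4j.appender.MAIN=org.apache.log4j.RollingFileAppender
log4j.appender.MAIN.layout=org.apache.log4j.PatternLayout
log4j.appender.MAIN.layout.ConversionPattern=%5p %d{ISO8601} (%t) - %m%n
# add additional logging
log4j.logger.org.apache.http.headers=DEBUG
```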
Re: Web crawling causes Socket Timeout after Database Exception
Hi Shigeki, I noticed that your crawl is using hopcount filtering. This feature is costly performance-wise. If you can crawl with hopcount filtering disabled, your crawl will be much faster. To disable completely, select the radio button titled 読込めないコンテンツ情報は永久保存 (roughly, "keep unreachable content information permanently"), and leave the hopcount fields blank. Thanks, Karl On Fri, Nov 30, 2012 at 1:57 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi, Karl I think handling the MySQL exception keeps MCF crawling content. However, because of deadlocks, crawling speed would remain slow. I think the fundamental solution of the problem is to reduce deadlocks in MySQL. I am not sure if this could be solved by MCF, but this is a task that people using MySQL need to know about. Regards, Shigeki 2012/11/28 Karl Wright daddy...@gmail.com Yes, the SQL code will be output to the manifoldcf.log as part of the exception text. However I hope that this checkin will already fix your problem. Thanks, Karl On Wed, Nov 28, 2012 at 3:44 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi Karl, I can try. To obtain the error code, could you let me know what code to put in which line of which file? I suppose the error code will be output into manifoldcf.log, is this right? Regards, Shigeki 
Thanks, Karl On Wed, Nov 28, 2012 at 2:30 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi Karl, Here is a log of Database Exception that is occurred while crawling Web. This time, socket timeout exception did not happen so it might be a different matter. Even though the job status remain Running, it seems that MCF stopped crawling (The job was not aborted). ERROR 2012-11-22 19:36:28,593 (Worker thread '16') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681) at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709) at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394) at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144) at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186) at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performModification(DBInterfaceMySQL.java:678) at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performUpdate(DBInterfaceMySQL.java:275) at org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80) at org.apache.manifoldcf.crawler.jobs.HopCount.markForDelete(HopCount.java:1426) at org.apache.manifoldcf.crawler.jobs.HopCount.doDeleteInvalidation(HopCount.java:1356) at org.apache.manifoldcf.crawler.jobs.HopCount.doFinish(HopCount.java:1057) at org.apache.manifoldcf.crawler.jobs.HopCount.finishParents(HopCount.java:389) at org.apache.manifoldcf.crawler.jobs.JobManager.finishDocuments(JobManager.java:4309) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:557) Caused by: 
java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624
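Karl's fix hinges on the JDBC layer recognizing MySQL's vendor error code 1205 ("Lock wait timeout exceeded") as a transient condition whose transaction can simply be retried. The sketch below is not MCF's actual driver code; the class and method names are made up for illustration, but the error codes (1205 for lock wait timeout, 1213 for deadlock) are MySQL's documented values:

```java
import java.sql.SQLException;

public class MySqlRetryCheck {
    // MySQL: "Lock wait timeout exceeded; try restarting transaction"
    static final int ER_LOCK_WAIT_TIMEOUT = 1205;
    // MySQL: "Deadlock found when trying to get lock; try restarting transaction"
    static final int ER_LOCK_DEADLOCK = 1213;

    /** True if the failure is transient and the whole transaction should be retried. */
    static boolean isTransientTransactionError(SQLException e) {
        int code = e.getErrorCode();
        return code == ER_LOCK_WAIT_TIMEOUT || code == ER_LOCK_DEADLOCK;
    }

    public static void main(String[] args) {
        SQLException lockTimeout = new SQLException(
            "Lock wait timeout exceeded; try restarting transaction", "HY000", 1205);
        SQLException syntaxError = new SQLException(
            "You have an error in your SQL syntax", "42000", 1064);
        System.out.println(isTransientTransactionError(lockTimeout)); // true
        System.out.println(isTransientTransactionError(syntaxError)); // false
    }
}
```

A caller that gets true from such a check would roll back and re-run the transaction rather than abort the worker thread.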
Re: Running multiple MCFs on one Tomcat
Hi Shigeki, Each MCF instance should have its own properties.xml file. Since the way you tell MCF where the properties.xml file is located is with a -D switch, I don't think you can run multiple instances properly in one JVM. If this is important to you, please let us know, and also please describe what you are trying to do this for. Thanks, Karl On Thu, Nov 29, 2012 at 8:05 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi everyone, Just wondering if anyone has tried running multiple MCFs on one Tomcat (not multiple jobs in one MCF). If that's possible, I'd like to try testing crawling performance using multiple MCFs. Regards, Shigeki
Re: Running multiple MCFs on one Tomcat
CPU usage is a function of the crawling task, and to some extent the database. When you run an open crawl on PostgreSQL, CPU usage is very high. If you are running a constrained, throttled crawl, CPU usage is low. Karl On Tue, Dec 4, 2012 at 4:07 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi Karl, I noticed MCF does not use much CPU. I was wondering if running multiple MCFs could increase the CPU usage. Regards, Shigeki 2012/11/30 Karl Wright daddy...@gmail.com Hi Shigeki, Each MCF instance should have its own properties.xml file. Since the way you tell MCF where the properties.xml file is located is with a -D switch, I don't think you can run multiple instances properly in one JVM. If this is important to you, please let us know, and also please describe what you are trying to do this for. Thanks, Karl On Thu, Nov 29, 2012 at 8:05 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi everyone, Just wondering if anyone has tried running multiple MCFs on one Tomcat (not multiple jobs in one MCF). If that's possible, I'd like to try testing crawling performance using multiple MCFs. Regards, Shigeki
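The -D constraint Karl describes can be demonstrated directly: a Java system property is global to the JVM, so two MCF webapps deployed in the same Tomcat would both read whichever properties.xml path was set last. A tiny demonstration (the property key is the one passed via the -D switch; the file paths are made-up examples):

```java
public class OnePropertiesFilePerJvm {
    public static void main(String[] args) {
        String key = "org.apache.manifoldcf.configfile";
        // Simulate two MCF "instances" each trying to point at its own properties.xml:
        System.setProperty(key, "/opt/mcf1/properties.xml");
        System.setProperty(key, "/opt/mcf2/properties.xml"); // silently overwrites the first
        // Every webapp in this JVM now sees only the last value set:
        System.out.println(System.getProperty(key)); // /opt/mcf2/properties.xml
    }
}
```

The practical consequence is one MCF instance per JVM: to test several instances on one machine, you would run separate Tomcat processes on different ports, each started with its own -D value.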
Re: Cannot connect to SharePoint 2010 instance
Hi Robert, I've solved Luigi's problem - and now I want to know if it solves yours. Unfortunately, you WILL have to build ManifoldCF for this step, since I cannot modify the build process easily to accommodate the patched httpcomponents dependencies. Can you do the following: (1) Check out a trunk copy of the manifoldcf sources, e.g. svn co https://svn.apache.org/repos/asf/manifoldcf/trunk . (2) Download the lib package from http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev, unpack it, and install it in the lib directory as per the instructions in the lib package. (3) Run ant build to be sure you can actually build the project. If that works, download the two patched httpcomponents jars from http://people.apache.org/~kwright , and use them to overwrite lib/httpcore.jar and lib/httpclient.jar. (4) Run ant build clean (5) Start manifoldcf (it's under the dist directory), and see if you can connect to your SharePoint instance. Thanks! Karl On Thu, Nov 29, 2012 at 8:56 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Hi Karl, I have been following your thread with Luigi; I look forward to testing the new release. Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Thursday, November 29, 2012 3:28 AM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Hi Robert, Luigi and I think we've discovered the issue, which we're going to see if we can confirm today. There is a ticket tracking it, which is CONNECTORS-572. If correct, it appears that Windows may have changed what it considers to be the name of the user at some recent time, and the httpcomponents and commons-httpclient implementations of NTLM are not resilient to this change - which isn't surprising since they are basically reverse-engineered. If correct, httpcomponents will likely need to release a patch, so the schedule will be, in part, up to them. 
Alternatively, we can build and patch httpcomponents as part of the ManifoldCF release process, but it would require us to have a new Maven dependency for the make-core-deps part of our release. Karl On Wed, Nov 28, 2012 at 9:01 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, Here is my debug output DEBUG 2012-11-28 08:59:25,884 (Thread-479) - POST /_vti_bin/lists.asmx HTTP/1.1 DEBUG 2012-11-28 08:59:25,899 (Thread-479) - Content-Type: text/xml; charset=utf-8 DEBUG 2012-11-28 08:59:25,899 (Thread-479) - SOAPAction: http://schemas.microsoft.com/sharepoint/soap/GetListCollection DEBUG 2012-11-28 08:59:25,899 (Thread-479) - User-Agent: Axis/1.4 DEBUG 2012-11-28 08:59:25,899 (Thread-479) - Content-Length: 335 DEBUG 2012-11-28 08:59:25,899 (Thread-479) - Host: searchpoc.testprojects.nibr.novartis.intra DEBUG 2012-11-28 08:59:25,899 (Thread-479) - Connection: Keep-Alive DEBUG 2012-11-28 08:59:30,629 (Thread-479) - HTTP/1.1 401 Unauthorized DEBUG 2012-11-28 08:59:30,629 (Thread-479) - Server: Microsoft-IIS/7.5 DEBUG 2012-11-28 08:59:30,629 (Thread-479) - SPRequestGuid: 56647ed0-9bac-4a2e-b61a-2d2e76ae8db0 DEBUG 2012-11-28 08:59:30,629 (Thread-479) - WWW-Authenticate: NTLM DEBUG 2012-11-28 08:59:30,629 (Thread-479) - X-Powered-By: ASP.NET DEBUG 2012-11-28 08:59:30,630 (Thread-479) - MicrosoftSharePointTeamServices: 14.0.0.6123 DEBUG 2012-11-28 08:59:30,630 (Thread-479) - X-MS-InvokeApp: 1; RequireReadOnly DEBUG 2012-11-28 08:59:30,630 (Thread-479) - Date: Wed, 28 Nov 2012 13:59:30 GMT DEBUG 2012-11-28 08:59:30,630 (Thread-479) - Content-Length: 0 DEBUG 2012-11-28 08:59:30,663 (Thread-479) - POST /_vti_bin/lists.asmx HTTP/1.1 DEBUG 2012-11-28 08:59:30,663 (Thread-479) - Content-Type: text/xml; charset=utf-8 DEBUG 2012-11-28 08:59:30,663 (Thread-479) - SOAPAction: http://schemas.microsoft.com/sharepoint/soap/GetListCollection DEBUG 2012-11-28 08:59:30,663 (Thread-479) - User-Agent: Axis/1.4 DEBUG 2012-11-28 08:59:30,663 (Thread-479) - Content-Length: 335 DEBUG 2012-11-28 08:59:30,663 (Thread-479) - Host: searchpoc.testprojects.nibr.novartis.intra DEBUG 2012-11-28 08:59:30,663 (Thread-479) - Connection: Keep-Alive DEBUG 2012-11-28 08:59:30,663 (Thread-479) - Authorization: NTLM TlRMTVNTUAABNQIIIAoACgBAIAAgACBJAFAALQAxADAALQAxADQANQAtADMAMgAtADEAMgAxAE4AQQBOAEUAVAA= DEBUG 2012-11-28 08:59:30,680 (Thread-479) - HTTP/1.1 401 Unauthorized DEBUG 2012-11-28 08:59:30,680 (Thread-479) - Server: Microsoft-IIS/7.5 DEBUG 2012-11-28 08:59:30,680 (Thread-479) - SPRequestGuid: 208f5c66-7d26-4761-b578-d01645f042ed DEBUG 2012-11-28 08:59:30,680 (Thread-479) - WWW-Authenticate: NTLM TlRMTVNTUAACDgAOADg1Aoki47BOSwwS+moAAMoAygBGBgGxHQ9OAEkAQgBSAE4ARQBUAAIADgBOAEkAQgBSAE4ARQBUAAEAGABOAFIAVQBTAEMAQQAtAFMARAAwADcAOQAEACIAbgBpAGIAcgAuAG4AbwB2AGEAcgB0AGkAcwAuAG4AZQB0AAMAPABOAFIAVQBTAEMAQQAtAFMARAAwADcAOQAuAG4AaQBiAHIALgBuAG8AdgBhAHIAdABp
Re: Cannot connect to SharePoint 2010 instance
I actually did decide to modify the build to pull the changed jars down automatically. So you can just download the artifacts under http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev and you should get updated binaries. Karl On Wed, Dec 5, 2012 at 6:03 PM, Karl Wright daddy...@gmail.com wrote: Hi Robert, I've solved Luigi's problem - and now I want to know if it solves yours. Unfortunately, you WILL have to build ManifoldCF for this step, since I cannot modify the build process easily to accommodate the patched httpcomponents dependencies. Can you do the following: (1) Check out a trunk copy of manifoldcf sources, e.g svn co https://svn.apache.org/repos/asf/manifoldcf/trunk; . (2) Download the lib package from http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev, unpack it, and install it in the lib directory as per the instructions in the lib package. (3) Run ant build to be sure you can actually build the project. If that works, download the two patched httpcomponents jars from http://people.apache.org/~kwright , and use them to overwrite lib/httpcore.jar and lib/httpclient.jar. (4) Run ant build clean (5) Start manifoldcf (it's under the dist directory), and see if you can connect to your sharepoint instance. Thanks! Karl On Thu, Nov 29, 2012 at 8:56 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Hi Karl, I have been following your thread with Luigi I look forward to testing the new release. Thanks Bob -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Thursday, November 29, 2012 3:28 AM To: user@manifoldcf.apache.org Subject: Re: Cannot connect to SharePoint 2010 instance Hi Robert, Luigi and I think we've discovered the issue, which we're going to see if we can confirm today. There is a ticket tracking it, which is CONNECTORS-572. 
If correct, it appears that Windows may have changed what it considers to be the name of the user at some recent time, and the httpcomponents and commons-httpclient implementations of NTLM are not resilient to this change - which isn't surprising since they are basically reverse-engineered. If correct, httpcomponents will likely need to release a patch, so the schedule will be, in part, up to them. Alternatively, we can build and patch httpcomponents as part of the ManifoldCF release process, but it would require us to have a new Maven dependency for the make-core-deps part of our release. Karl On Wed, Nov 28, 2012 at 9:01 AM, Iannetti, Robert robert.ianne...@novartis.com wrote: Karl, Here is my debug output DEBUG 2012-11-28 08:59:25,884 (Thread-479) - POST /_vti_bin/lists.asmx HTTP/1 .1 DEBUG 2012-11-28 08:59:25,899 (Thread-479) - Content-Type: text/xml; charset= utf-8 DEBUG 2012-11-28 08:59:25,899 (Thread-479) - SOAPAction: http://schemas.micr osoft.com/sharepoint/soap/GetListCollection DEBUG 2012-11-28 08:59:25,899 (Thread-479) - User-Agent: Axis/1.4 DEBUG 2012-11-28 08:59:25,899 (Thread-479) - Content-Length: 335 DEBUG 2012-11-28 08:59:25,899 (Thread-479) - Host: searchpoc.testprojects.nib r.novartis.intra DEBUG 2012-11-28 08:59:25,899 (Thread-479) - Connection: Keep-Alive DEBUG 2012-11-28 08:59:30,629 (Thread-479) - HTTP/1.1 401 Unauthorized DEBUG 2012-11-28 08:59:30,629 (Thread-479) - Server: Microsoft-IIS/7.5 DEBUG 2012-11-28 08:59:30,629 (Thread-479) - SPRequestGuid: 56647ed0-9bac-4a2 e-b61a-2d2e76ae8db0 DEBUG 2012-11-28 08:59:30,629 (Thread-479) - WWW-Authenticate: NTLM DEBUG 2012-11-28 08:59:30,629 (Thread-479) - X-Powered-By: ASP.NET DEBUG 2012-11-28 08:59:30,630 (Thread-479) - MicrosoftSharePointTeamServices: 14.0.0.6123 DEBUG 2012-11-28 08:59:30,630 (Thread-479) - X-MS-InvokeApp: 1; RequireReadOn ly DEBUG 2012-11-28 08:59:30,630 (Thread-479) - Date: Wed, 28 Nov 2012 13:59:30 GMT DEBUG 2012-11-28 08:59:30,630 (Thread-479) - Content-Length: 0 DEBUG 
2012-11-28 08:59:30,663 (Thread-479) - POST /_vti_bin/lists.asmx HTTP/1.1 DEBUG 2012-11-28 08:59:30,663 (Thread-479) - Content-Type: text/xml; charset=utf-8 DEBUG 2012-11-28 08:59:30,663 (Thread-479) - SOAPAction: http://schemas.microsoft.com/sharepoint/soap/GetListCollection; DEBUG 2012-11-28 08:59:30,663 (Thread-479) - User-Agent: Axis/1.4 DEBUG 2012-11-28 08:59:30,663 (Thread-479) - Content-Length: 335 DEBUG 2012-11-28 08:59:30,663 (Thread-479) - Host: searchpoc.testprojects.nibr.novartis.intra DEBUG 2012-11-28 08:59:30,663 (Thread-479) - Connection: Keep-Alive DEBUG 2012-11-28 08:59:30,663 (Thread-479) - Authorization: NTLM TlRMTVNTUAABNQIIIAoACgBAIAAgACBJAFAALQAxADAALQAxADQANQAtAD MAMgAtADEAMgAxAE4AQQBOAEUAVAA= DEBUG 2012-11-28 08:59:30,680 (Thread-479) - HTTP/1.1 401 Unauthorized DEBUG 2012-11-28 08:59:30,680 (Thread-479) - Server: Microsoft-IIS/7.5 DEBUG 2012-11-28 08:59:30,680 (Thread-479) - SPRequestGuid: 208f5c66-7d26-4761-b578-d01645f042ed DEBUG 2012-11-28 08:59:30,680 (Thread-479
Re: Web crawl exited with an unexpected jobqueue status error under MySQL
Actually, I just noticed this: I ran MCF 0.6. MCF 0.6 runs MySQL in the wrong mode, so this was a problem. It was fixed in ManifoldCF 1.0. Can you upgrade to MCF 1.0.1 and see if this still happens for you? Karl On Wed, Dec 5, 2012 at 9:46 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hello Karl. MySQL: 5.5.24 Tomcat: 6.0.35 CentOS: 6.3 Regards, Shigeki 2012/12/5 Karl Wright daddy...@gmail.com Yes, I believe it is related, in the sense that the fix for CONNECTORS-246 was a fix to the HSQLDB database. This error makes it clear that MySQL has a similar problem with its MVCC model, and will also require a fix. However, I do not have the same kinds of leverage in the MySQL community that I do with HSQLDB. Can you give some details about the version of MySQL you are running, and on what platform? I will capture that and then maybe figure out how to open a MySQL ticket. Karl On Wed, Dec 5, 2012 at 6:57 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi. I ran MCF 0.6 under MySQL 5.5. I crawled the Web and the following error occurred, then MCF stopped the job: 2012/12/04 18:50:07 ERROR (Worker thread '0') - Exception tossed: Unexpected jobqueue status - record id 1354608871138, expecting active status, saw 3 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected jobqueue status - record id 1354608871138, expecting active status, saw 3 at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:711) at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:2435) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:745) There was a similar ticket, "A file crawl exited with an unexpected jobqueue status error under HSQLDB": https://issues.apache.org/jira/browse/CONNECTORS-246 Wondering if this is related. Regards, Shigeki
Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized
Hi Luigi, Others have also run into this exception, from one or more SharePoint web services. It is a server-side catch-all exception which tells us very little. You may get more details by looking at the server's event logs. SharePoint also has a log you can look at which may be even more helpful. In my experience, this is often the result of administrators changing the system's permissions in ways that cause SharePoint's web services to stop functioning correctly. At MetaCarta we would never see this on fresh SharePoint installations, but only on those where SharePoint was first installed, and then afterwards people made adjustments to the system permissions. I hope you have access to a competent SharePoint system administrator, because without that, it will be very hard to resolve this problem. Thanks, Karl On Thu, Dec 6, 2012 at 5:12 AM, Luigi D'Addario luigi.dadda...@googlemail.com wrote: Karl, I'm trying to put my SharePoint documents from Shared Documents into Solr. What do you think about this exception? Permission problems again, or something else? 
DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Getting version of '/Shared Documents//' DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Checking whether to include library '/Shared Documents' DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Library '/Shared Documents' exactly matched rule path '/Shared Documents' DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Including library '/Shared Documents' DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Processing: '/Shared Documents//' DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Document identifier is a library: '/Shared Documents' DEBUG 2012-12-06 11:02:09,515 (Worker thread '3') - Enter: CommonsHTTPSender::invoke DEBUG 2012-12-06 11:02:10,000 (Worker thread '3') - Exit: CommonsHTTPSender::invoke DEBUG 2012-12-06 11:02:10,031 (Worker thread '3') - Enter: CommonsHTTPSender::invoke DEBUG 2012-12-06 11:02:10,406 (Worker thread '3') - Exit: CommonsHTTPSender::invoke DEBUG 2012-12-06 11:02:10,421 (Worker thread '3') - SharePoint: Got an unknown remote exception getting child documents for site guid {CC072748-E1EE-4F34-B120-FAF33273A616} - axis fault = Server.Dsp.Connect, detail = Cannot open the requested Sharepoint Site. - retrying AxisFault faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.Dsp.Connect faultSubcode: faultString: Cannot open the requested Sharepoint Site. faultActor: faultNode: faultDetail: {http://schemas.microsoft.com/sharepoint/dsp}queryResponse:dsQueryResponse status=failure/ Cannot open the requested Sharepoint Site. I'm sending you manifoldcf.log. Thanks. Luigi 2012/12/5 Luigi D'Addario luigi.dadda...@googlemail.com ...and tomorrow I will finally try to put my SharePoint documents into Solr! 2012/12/5 Karl Wright daddy...@gmail.com I'll have to figure out how to get this patched httpcomponents release into the field
Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized
If you have access to the SharePoint installation media itself, one approach would be to try to install your own version of SharePoint on a similar environment. Prove to yourself (and others) that you can actually crawl on that SharePoint. Then, based on what the target system's event logs and SharePoint logs tell you, you can start modifying settings and module permissions to match the fresh installation's, until it works. You can also save yourself some time by getting the actual request being done using http wire debugging in ManifoldCF, and then trying that request over and over with curl until you get it to not fail. Thanks, Karl On Thu, Dec 6, 2012 at 6:29 AM, Karl Wright daddy...@gmail.com wrote: Hi Luigi, Others have also run into this exception, from one or more SharePoint web services. It is a server side catch-all exception which tells us very little. You may get more details by looking at the server's event logs. SharePoint also has a log you can look at which may be even more helpful. In my experience, this is often the result of administrators changing the system's permissions in ways that cause SharePoint's web services to stop functioning correctly. At MetaCarta we never would see this on fresh SharePoint installations, but only on those where SharePoint was first installed, and then afterwards people made adjustments to the system permissions. I hope you have access to a competent SharePoint system administrator, because without that, it will be very hard to resolve this problem. Thanks, Karl On Thu, Dec 6, 2012 at 5:12 AM, Luigi D'Addario luigi.dadda...@googlemail.com wrote: Karl, I'm trying to put into Solr my SharPoint documents from Shared Documents. What do you think about this exception ? Permission problems again or ? 
DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Getting version of '/Shared Documents//' DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Checking whether to include library '/Shared Documents' DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Library '/Shared Documents' exactly matched rule path '/Shared Documents' DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Including library '/Shared Documents' DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Processing: '/Shared Documents//' DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Document identifier is a library: '/Shared Documents' DEBUG 2012-12-06 11:02:09,515 (Worker thread '3') - Enter: CommonsHTTPSender::invoke DEBUG 2012-12-06 11:02:10,000 (Worker thread '3') - Exit: CommonsHTTPSender::invoke DEBUG 2012-12-06 11:02:10,031 (Worker thread '3') - Enter: CommonsHTTPSender::invoke DEBUG 2012-12-06 11:02:10,406 (Worker thread '3') - Exit: CommonsHTTPSender::invoke DEBUG 2012-12-06 11:02:10,421 (Worker thread '3') - SharePoint: Got an unknown remote exception getting child documents for site guid {CC072748-E1EE-4F34-B120-FAF33273A616} - axis fault = Server.Dsp.Connect, detail = Cannot open the requested Sharepoint Site. - retrying AxisFault faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.Dsp.Connect faultSubcode: faultString: Cannot open the requested Sharepoint Site. faultActor: faultNode: faultDetail: {http://schemas.microsoft.com/sharepoint/dsp}queryResponse:dsQueryResponse status=failure/ Cannot open the requested Sharepoint Site. I send you manifoldcf.log. Thanks. Luigi 2012/12/5 Luigi D'Addario luigi.dadda...@googlemail.com ..and I, finally, tomorrow will try to put into Solr my SharPoint documents ! 2012/12/5 Karl Wright daddy...@gmail.com I'll have to figure out how to get this patched httpcomponents release into the field
Re: Too many slow queries caused by MCF running MySQL 5.5
Hi Shigeki, The rules for when a database will use an index for an ORDER BY clause differ significantly from database to database. The current logic seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL. I will see if I can find a solution. The ticket for this is CONNECTORS-584. Karl On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi. I downloaded MCF 1.1-dev on Nov 29th and ran it using MySQL. I tried to crawl 10 million files using a Windows share connection and index them into Solr. As MCF reached over 1 million files, the crawling speed started getting slower. So I checked slow queries and found out that too many slow queries occurred, especially the following kinds: # Time: 121204 16:25:40 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1] # Query_time: 7.240532 Lock_time: 0.000204 Rows_sent: 1200 Rows_examined: 611091 SET timestamp=1354605940; SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200; # Time: 121204 16:25:44 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1] # Query_time: 3.064339 Lock_time: 0.84 Rows_sent: 1 Rows_examined: 406359 SET timestamp=1354605944; SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE status IN ('P','G') AND checkaction='R' AND checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid) ORDER BY docpriority ASC,status ASC,checkaction 
ASC,checktime ASC LIMIT 1; --- I wonder if these queries use the table's indexes appropriately. When I ran EXPLAIN against the slow query, there was a filesort. There seem to be some conditions under which MySQL will not use an index for ORDER BY: - executing ORDER BY against multiple keys - when the keys selected from records are different from the keys used by ORDER BY Since a filesort was happening, the resulting full scan of records is probably what is slowing MCF down. Do you think this could happen even in PostgreSQL or HSQLDB? Do you think the queries could be modified to use the index appropriately? Regards, Shigeki
Re: Too many slow queries caused by MCF running MySQL 5.5
Since you have a large table, can you try an EXPLAIN for the following query, which should match the explanation given here: http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ? Does it use the index? SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200 Thanks! Karl On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote: Hi Shigeki, The rules for when a database will use an index for an ORDER BY clause differ significantly from database to database. The current logic seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL. I will see if I can find a solution. The ticket for this CONNECTORS-584. Karl On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi. I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL I tried to crawl 10 million files using Windows share connection and index them into Solr. As MCF reached over 1 million files, the crawling speed started getting slower. 
So I checked slow queries and found out that too many slow queries occurred, especially the following kinds: # Time: 121204 16:25:40 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1] # Query_time: 7.240532 Lock_time: 0.000204 Rows_sent: 1200 Rows_examined: 611091 SET timestamp=1354605940; SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200; # Time: 121204 16:25:44 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1] # Query_time: 3.064339 Lock_time: 0.84 Rows_sent: 1 Rows_examined: 406359 SET timestamp=1354605944; SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE status IN ('P','G') AND checkaction='R' AND checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid) ORDER BY docpriority ASC,status ASC,checkaction ASC,checktime ASC LIMIT 1; --- I wonder if the queries appropriately use index of the table. As a result of EXPLAIN against the slow query, there was filesort. There seems to be some conditions that MySQL does not use index depending on ORDER BY: - Executing ORDER BY against multiple keys - When keys selected from records are different from keys used by ORDER BY Since filesort was happening, fully scanning records should be having MCF slower. Do you think this could happen even in PostgreSQL or HSQLDB? Do you think queries could be modified to use index appropriately? Regards, Shigeki
Re: latest trunk BUILD FAILED
Ok, I fixed this. Karl On Sun, Dec 9, 2012 at 8:57 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi, I couldn't build the latest trunk. It seemed that MeridioConnector could not be compiled. compile-connector: [javac] /Users/abe/mcf/trunk/connectors/connector-build.xml:420: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 10 source files to /Users/abe/mcf/trunk/connectors/meridio/build/connector/classes [javac] /Users/abe/mcf/trunk/connectors/meridio/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/meridio/MeridioConnector.java:1398: package org.apache.commons.httpclient does not exist [javac] catch (org.apache.commons.httpclient.ConnectTimeoutException ioex) [javac] ^ [javac] Note: /Users/abe/mcf/trunk/connectors/meridio/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/meridio/CommonsHTTPSender.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 1 error BUILD FAILED Regards, Shinichiro Abe
Re: Too many slow queries caused by MCF running MySQL 5.5
Ok, that is unfortunate. I will do some further MySQL research here. There is a FORCE INDEX MySQL construct that may help, e.g. SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ... which we can also try. In this case that would be: FORCE INDEX (docpriority,status,checkaction,checktime) or FORCE INDEX (docpriority_status_checkaction_checktime) - unclear what the right syntax actually is. Maybe you can try an explain with that in the query? FWIW, PostgreSQL should always use the index for this situation. Karl On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi Karl, Thanks for the reply. I ran EXPLAIN as follows:
mysql> explain SELECT
    -> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
    -> FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
    -> AND t0.checkaction='R' AND
    -> t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
    -> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
    -> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status
    -> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
    -> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
    -> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
    -> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
+----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
| id | select_type        | table | type   | possible_keys                                | key            | key_len | ref                     | rows   | Extra                       |
+----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
|  1 | PRIMARY            | t0    | range  | I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25      | NULL                    | 151494 | Using where; Using filesort |
|  4 | DEPENDENT SUBQUERY | t3    | ref    | I1354241297077                               | I1354241297077 | 8       | manifoldcf.t0.id        |      1 |                             |
|  4 | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY                                      | PRIMARY        | 767     | manifoldcf.t3.eventname |      1 | Using index                 |
|  3 | DEPENDENT SUBQUERY | t2    | ref    | I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122     | manifoldcf.t0.dochash   |      1 | Using where                 |
|  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1354241297080                       | PRIMARY        | 8       | manifoldcf.t0.jobid     |      1 | Using where                 |
+----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
As you can see from "Using filesort", I do not think it uses the index. By the way, which database do you recommend for the case of crawling a huge number of files for now? PostgreSQL? Regards, Shigeki 2012/12/10 Karl Wright daddy...@gmail.com Since you have a large table, can you try an EXPLAIN for the following query, which should match the explanation given here: http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ? Does it use the index? SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200 Thanks! Karl On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote: Hi Shigeki, The rules for when a database will use an index for an ORDER BY clause differ significantly from database to database. The current logic seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL. I will see if I can find a solution. The ticket for this is CONNECTORS-584. Karl On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi. I downloaded MCF 1.1-dev on Nov 29th and ran it using MySQL. I tried to crawl 10 million files using a Windows share connection and index them into Solr. As MCF reached over 1 million files, the crawling speed started getting slower. 
So I checked slow queries and found out that too many slow queries occurred, especially the following kinds: # Time: 121204 16:25
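The EXPLAIN check discussed in this thread can be sketched as follows. This is a simplified form of the stuffing query shown above (the WHERE clause is abbreviated; table and column names are the ones from the thread), and the thing to look for is "Using filesort" in the Extra column, which means MySQL is sorting rows itself instead of reading them in composite-index order:

```sql
-- Sketch: run EXPLAIN on a simplified form of the stuffing query and
-- inspect the Extra column. "Using filesort" means the composite
-- (docpriority, status, checkaction, checktime) index is not driving
-- the ORDER BY, which is what makes the query slow on a large table.
EXPLAIN SELECT t0.id, t0.jobid, t0.dochash, t0.docid, t0.status
FROM jobqueue t0
WHERE t0.status IN ('P','G') AND t0.checkaction = 'R'
ORDER BY t0.docpriority ASC, t0.status ASC,
         t0.checkaction ASC, t0.checktime ASC
LIMIT 1200;
```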
Re: Too many slow queries caused by MCF running MySQL 5.5
Sorry, the FORCE INDEX hint requires the name of the index. Since ManifoldCF does not assign fixed names to its indexes, you will need to find the right one by first using the SHOW INDEX command to get the index's name.

Apologies,

Karl

On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com wrote:

Ok, that is unfortunate. I will do some further MySQL research here. There is a FORCE INDEX MySQL construct that may help, e.g. SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ..., which we can also try.

On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

[quoted EXPLAIN output and earlier messages trimmed; they repeat the previous messages in this thread verbatim]
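The two-step procedure Karl describes above (SHOW INDEX first, then FORCE INDEX with the discovered name) might look like this. The index name I1354241297071 is only the one reported earlier in this thread; each installation will have a different generated name, so substitute whatever SHOW INDEX actually returns:

```sql
-- Step 1: list the indexes on jobqueue to find the generated name of
-- the composite (docpriority, status, checkaction, checktime) index.
SHOW INDEX FROM jobqueue;

-- Step 2: force that index by name. The index hint goes right after
-- the table reference. (I1354241297071 is the example name from this
-- thread, not a fixed value; the WHERE clause is abbreviated here.)
SELECT t0.id, t0.jobid, t0.dochash, t0.docid, t0.status
FROM jobqueue t0 FORCE INDEX (I1354241297071)
WHERE t0.status IN ('P','G') AND t0.checkaction = 'R'
ORDER BY t0.docpriority ASC, t0.status ASC,
         t0.checkaction ASC, t0.checktime ASC
LIMIT 1200;
```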
Re: Too many slow queries caused by MCF running MySQL 5.5
Experiments here indicate that FORCE INDEX seems to do what we need. I'm going to think about it a bit and then come up with a fix that should use FORCE INDEX in this situation. Then we can see whether it actually helps for you.

Karl

On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com wrote:

Sorry, the FORCE INDEX hint requires the name of the index. Since ManifoldCF does not assign fixed names to its indexes, you will need to find the right one by first using the SHOW INDEX command to get the index's name.

On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com wrote:

[quoted EXPLAIN output and earlier messages trimmed; they repeat the previous messages in this thread verbatim]
Re: Too many slow queries caused by MCF running MySQL 5.5
Hi Shigeki,

I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev . This has a good chance of fixing the query performance problem. Please try it out, and let me know if you still get slow queries in the log. You should be able to use the existing database instance.

Thanks,

Karl

On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright daddy...@gmail.com wrote:

Experiments here indicate that FORCE INDEX seems to do what we need. I'm going to think about it a bit and then come up with a fix that should use FORCE INDEX in this situation.

On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com wrote:

[quoted messages trimmed; they repeat the previous messages in this thread verbatim]
RE: Too many slow queries caused by MCF running MySQL 5.5
You just need to run ant make-deps too before building.

Karl

Sent from my Windows Phone

From: Shigeki Kobayashi
Sent: 12/11/2012 3:58 AM
To: user@manifoldcf.apache.org
Subject: Re: Too many slow queries caused by MCF running MySQL 5.5

Hi Karl.

I could build the source OK, but the following line is missing from connectors.xml. Does this mean I built it incorrectly, or is this on purpose? Do I just have to add this line to enable the Windows share connection?

<repositoryconnector name="Windows shares" class="org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector"/>

Regards,

Shigeki

2012/12/11 Karl Wright daddy...@gmail.com

Hi Shigeki,

I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev . This has a good chance of fixing the query performance problem. Please try it out, and let me know if you still get slow queries in the log. You should be able to use the existing database instance.

On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright daddy...@gmail.com wrote:

[quoted messages trimmed; they repeat the previous messages in this thread verbatim]
Re: How to crawl from the point where the job is stopped by errors
ManifoldCF is incremental and will do as little work as possible when a job is restarted. The details of what that means depend on the actual connector involved. For Windows share connections, the document's modify date is checked again, but the document does not need to be re-indexed if that has not changed.

Karl

On Wed, Dec 12, 2012 at 12:11 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

Hi,

Sometimes errors occur that stop jobs crawling files using a Windows share connection. In this case, when starting the stopped job again by clicking 'Start', I suppose that MCF crawls from the beginning again. If that's right, is there any way to have MCF crawl from the point where the job was stopped?

Regards, Shigeki
Re: Many sleep process in MySQL while crawling files using Window share connection
The MySQL threads correspond to handles in the ManifoldCF handle pool. Since a worker thread can use only one handle at a time, one expects that at best the number of MySQL processes active during a crawl is about equal to the number of ManifoldCF worker threads. If this is not true, it indicates low database use - which may be OK, depending on your crawl, because of throttle settings. For example, if you are crawling only N domains and you have more than N worker threads, some of those threads will have to wait.

However, if your CPU is at 100%, and all of that is going into ONE MySQL process, it means that one query is blocking all the rest. This would usually be the stuffing query, which is the one we have been looking at over the last couple of days. This query must be fast for ManifoldCF to use its resources well; if it takes a long time to run, the rest of the worker threads get nothing to do.

A good way of assessing the state of ManifoldCF under these conditions is to get a thread dump (which can be obtained with kill -QUIT on Linux systems). Look at the worker threads and see what they are doing. If you send me a dump, I will interpret it for you.

Thanks, Karl

On Wed, Dec 12, 2012 at 2:52 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

Hi,

I am running MCF 1.1-dev, downloaded on Dec. 11th, with MySQL 5.5. While crawling, I listed the processes in MySQL and realized that many of them are sleeping. I set org.apache.manifoldcf.database.maxhandles to 100. Does this mean that MCF does not handle MySQL processes appropriately? It seems strange that even though many processes are created, they are not used much. I see 100% CPU usage in mysql, but the process state is shown as sleep. Do you think this is related to the sleeping processes in MySQL? Is this correct behavior?
mysql> SHOW PROCESSLIST;

| Id | User | Host | db | Command | Time | State | Info |
| 1 | manifoldcf | localhost:37683 | manifoldcf | Sleep | 279 | | NULL |
| 2 | manifoldcf | localhost:37684 | manifoldcf | Query | 0 | update | INSERT INTO ingeststatus (id,changecount,dockey,firstingest,connectionname,authorityname,urihash,las |
| 3 | manifoldcf | localhost:37685 | manifoldcf | Sleep | 279 | | NULL |
| 4 | manifoldcf | localhost:37686 | manifoldcf | Sleep | 24 | | NULL |
| 5 | manifoldcf | localhost:37687 | manifoldcf | Sleep | 24 | | NULL |
| 6 | manifoldcf | localhost:37688 | manifoldcf | Sleep | 217 | | NULL |
| 7 | manifoldcf | localhost:37689 | manifoldcf | Sleep | 279 | | NULL |
| 8 | manifoldcf | localhost:37690 | manifoldcf | Sleep | 12 | | NULL |
| 9 | manifoldcf | localhost:37694 | manifoldcf | Sleep | 279 | | NULL |
| 10 | manifoldcf | localhost:37695 | manifoldcf | Sleep | 0 | | NULL |
| 11 | manifoldcf | localhost:37696 | manifoldcf | Sleep | 24 | | NULL |
| 12 | manifoldcf | localhost:37697 | manifoldcf | Sleep | 279 | | NULL |
| 13 | manifoldcf | localhost:37698 | manifoldcf | Sleep | 279 | | NULL |
| 14 | manifoldcf | localhost:37699 | manifoldcf | Sleep | 279 | | NULL |
| 15 | manifoldcf | localhost:37700 | manifoldcf | Sleep | 217 | | NULL |
| 16 | manifoldcf | localhost:37701 | manifoldcf | Sleep | 24 | | NULL |
| 17 | manifoldcf | localhost:37703 | manifoldcf | Sleep | 24 | | NULL |
| 18 | manifoldcf | localhost:37732 | manifoldcf | Sleep | 217 | | NULL |
| 19 | manifoldcf | localhost:37733 | manifoldcf | Sleep | 5 | | NULL |
| 20 | manifoldcf | localhost:37734 | manifoldcf | Sleep | 24 | | NULL |
| 21 | manifoldcf | localhost:37735 | manifoldcf | Sleep | 217 | | NULL |
| 22 | manifoldcf | localhost:37736 | manifoldcf | Sleep | 0 | | NULL |
| 23 | manifoldcf | localhost:37737 | manifoldcf | Sleep | 217 | | NULL |
| 24 | manifoldcf | localhost:37738 | manifoldcf | Sleep | 217 | | NULL |
| 25 | manifoldcf | localhost:37739 | manifoldcf | Sleep | 24 | | NULL |
| 26 | manifoldcf | localhost:37740 | manifoldcf | Sleep | 24 | | NULL |
| 27 | manifoldcf | localhost:39340 | manifoldcf | Sleep | 279 | | NULL |
| 28 | manifoldcf | localhost:39341 | manifoldcf | Sleep | 3 | | NULL |
| 29 | manifoldcf | localhost:39342 | manifoldcf | Sleep | 0 | | NULL |
| 30 | manifoldcf | localhost:39343 | manifoldcf | Query | 0 |
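A quick way to quantify what the listing above shows - how many of the pooled connections are actually doing work at a given moment - is to group the process list by command state. This is a sketch against MySQL's information_schema; the manifoldcf user name is the one visible in the output above:

```sql
-- Count connections per command state for the crawler's database user.
-- During a busy crawl, the number of non-Sleep threads should approach
-- the ManifoldCF worker thread count; mostly-Sleep output with one busy
-- Query thread suggests a single query is blocking the rest.
SELECT command, COUNT(*) AS connections
FROM information_schema.processlist
WHERE user = 'manifoldcf'
GROUP BY command;
```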
Re: Build failure on Java7
I created a ticket, CONNECTORS-586, to track this problem.

Karl

On Wed, Dec 12, 2012 at 4:52 PM, Karl Wright daddy...@gmail.com wrote:

Native2Ascii is a Maven plugin, but it may well not be compatible with Java 7, or you might be using a non-Oracle JDK. Generally we recommend OpenJDK or Oracle. I suggest you try the ant build; I believe ant implemented its own native2ascii converter without relying on the sun/oracle proprietary classes.

Karl

On Wed, Dec 12, 2012 at 3:48 PM, Arcadius Ahouansou arcad...@menelic.com wrote:

Hello. I have tried to build both trunk and the 1.0.1 tag on Java 1.7.0_04-b22 without luck. The error is:

[INFO] Building ManifoldCF - Framework - UI Core 1.0
[INFO] --- maven-remote-resources-plugin:1.1:process (default) @ mcf-ui-core ---
[INFO] --- native2ascii-maven-plugin:1.0-alpha-1:native2ascii (native2ascii-utf8) @ mcf-ui-core ---
[INFO] Reactor Summary:
[INFO] ManifoldCF ................................ SUCCESS [4.024s]
[INFO] ManifoldCF - Framework .................... SUCCESS [0.061s]
[INFO] ManifoldCF - Framework - Core ............. SUCCESS [6.795s]
[INFO] ManifoldCF - Framework - UI Core .......... FAILURE [1.453s]
[INFO] ManifoldCF - Framework - Agents ........... SKIPPED
[INFO] ManifoldCF - Framework - Pull Agent ....... SKIPPED
[INFO] ManifoldCF - Framework - Authority Servlet  SKIPPED
[INFO] ManifoldCF - Framework - API Servlet ...... SKIPPED
[INFO] ManifoldCF - Framework - Authority Service  SKIPPED
[INFO] ManifoldCF - Framework - API Service ...... SKIPPED
[INFO] ManifoldCF - Framework - Crawler UI ....... SKIPPED
[INFO] ManifoldCF - Framework - Script Engine .... SKIPPED
[INFO] ManifoldCF - Connectors ................... SKIPPED
[INFO] ManifoldCF - Connectors - Active Directory  SKIPPED
[INFO] ManifoldCF - Connectors - Filesystem ...... SKIPPED
[INFO] ManifoldCF - Connectors - MetaCarta GTS ... SKIPPED
[INFO] ManifoldCF - Connectors - jCIFS ........... SKIPPED
[INFO] ManifoldCF - Connectors - JDBC ............ SKIPPED
[INFO] ManifoldCF - Connectors - Null Authority .. SKIPPED
[INFO] ManifoldCF - Connectors - Null Output ..... SKIPPED
[INFO] ManifoldCF - Connectors - RSS ............. SKIPPED
[INFO] ManifoldCF - Connectors - SharePoint ...... SKIPPED
[INFO] ManifoldCF - Connectors - Solr ............ SKIPPED
[INFO] ManifoldCF - Connectors - Web ............. SKIPPED
[INFO] ManifoldCF - Connectors - CMIS ............ SKIPPED
[INFO] ManifoldCF - Connectors - OpenSearchServer  SKIPPED
[INFO] ManifoldCF - Connectors - Wiki ............ SKIPPED
[INFO] ManifoldCF - Connectors - Alfresco ........ SKIPPED
[INFO] ManifoldCF - Connectors - ElasticSearch ... SKIPPED
[INFO] ManifoldCF - Test materials ............... SKIPPED
[INFO] ManifoldCF - Test Materials - Alfresco WAR  SKIPPED
[INFO] ManifoldCF - Framework - Jetty Runner ..... SKIPPED
[INFO] ManifoldCF - Tests ........................ SKIPPED
[INFO] ManifoldCF - Test - ElasticSearch ......... SKIPPED
[INFO] ManifoldCF - Test - Alfresco .............. SKIPPED
[INFO] ManifoldCF - Test - Wiki .................. SKIPPED
[INFO] ManifoldCF - Test - CMIS .................. SKIPPED
[INFO] ManifoldCF - Test - Filesystem ............ SKIPPED
[INFO] ManifoldCF - Test - Sharepoint ............ SKIPPED
[INFO] ManifoldCF - Test - RSS ................... SKIPPED
[INFO] BUILD FAILURE
[INFO] Total time: 15.711s
[INFO] Finished at: Wed Dec 12 20:34:41 GMT 2012
[INFO] Final Memory: 14M/34M
[ERROR] Failed to execute goal org.codehaus.mojo:native2ascii-maven-plugin:1.0-alpha-1:native2ascii (native2ascii-utf8) on project mcf-ui-core: Execution native2ascii-utf8 of goal org.codehaus.mojo:native2ascii-maven-plugin:1.0-alpha-1:native2ascii failed: Error starting Sun's native2ascii: sun.tools.native2ascii.Main - [Help 1]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence
Re: File crawl using exited with an unexpected jobqueue status error under MySQL
Yes, it is the same cause - a transactional integrity bug in the database, MySQL in this case. I can open a ManifoldCF ticket, but the real fix has to come from the MySQL team.

Karl

On Thu, Dec 20, 2012 at 8:59 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

Hi,

I am running MCF 1.1-dev trunk, downloaded on Dec. 22nd, and crawling files using a Windows share connection under MySQL 5.5.28 for Linux (x86_64). The following error occurred and then the job exited:

2012/12/21 10:09:37 ERROR (Worker thread '78') - Exception tossed: Unexpected jobqueue status - record id 1356045273314, expecting active status, saw 0
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected jobqueue status - record id 1356045273314, expecting active status, saw 0
at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:742)
at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:2438)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:765)

Do you think this is related to https://issues.apache.org/jira/browse/CONNECTORS-246?

Regards, Shigeki
Re: Timeout values to be configurable
FWIW, the newest version of the Solr connector now has configurable timeout values. But my original comment still stands; you really should not find yourself in a position where you need this.

Karl

On Wed, Dec 26, 2012 at 6:19 AM, Karl Wright daddy...@gmail.com wrote:

Hi Shigeki,

While timeout values for Solr could theoretically be configured as connection parameters, the timeout values for jCIFS are currently only settable globally. Therefore, to make these changes configurable per connection, the jCIFS library would need to change. I've already approached the jCIFS developer about changes of this kind, and he was unreceptive to the request. Part of the reason is the nature of the CIFS protocol, which multiplexes many simultaneous requests over the same connection. So this cannot be solved in the manner you suggest, in any case.

Furthermore, on a properly set-up system, it should be unnecessary to adjust either the jCIFS timeout parameters or the Solr timeout parameters. If you are consistently getting timeouts from jCIFS, it is a strong sign that you are overloading the Windows servers you are trying to crawl, and you should take steps immediately to reduce the maximum number of connections you are crawling with. Similarly, chronically exceeding the Solr timeout parameters indicates you are pushing documents into a Solr that is either insufficiently powered or has too few available threads. Cutting back on the maximum number of connections is indicated here as well.

Since ManifoldCF retries failures, occasional failures due to other loads on either the Windows servers or on Solr are expected and will not cause problems. But chronic failures indicate serious configuration problems, for which increasing the timeouts is the wrong solution. So I hesitate to add features of the kind you request, unless you can convince me that there is a fundamental reason why it should be necessary to change these parameters.
Thanks,

Karl

On Wed, Dec 26, 2012 at 2:18 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

Hi.

Having used MCF for a while, I have faced timeout errors many times while crawling and indexing files to Solr. I would like to propose making the following timeout values configurable in properties.xml. Timeout errors often occur depending on the files and the environments (machines), so it would be nice to be able to change the timeout values without rebuilding the whole source.

$MCF_HOME\connectors\solr\connector\src\main\java\org\apache\manifoldcf\agents\output\solr\HttpPoster.java:
int responseRetries = 9000; // Long basic wait: 3 minutes. This will also be added to by a term based on the size of the request.

$MCF_HOME\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:
System.setProperty("jcifs.smb.client.soTimeout","15");
System.setProperty("jcifs.smb.client.responseTimeout","12");

Regards, Shigeki
Re: Http status code 302
When I try the URL you gave using curl and no special arguments, I get this:

C:\Users\Karl> curl -vvv "http://lucene.jugem.jp/?eid=39"
* About to connect() to lucene.jugem.jp port 80 (#0)
* Trying 210.172.160.170... connected
* Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
> GET /?eid=39 HTTP/1.1
> User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3
> Host: lucene.jugem.jp
> Accept: */*
< HTTP/1.1 200 OK
< Date: Wed, 09 Jan 2013 08:47:52 GMT
< Server: Apache/2.0.59 (Unix)
< Vary: User-Agent,Host,Accept-Encoding
< Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
< Accept-Ranges: bytes
< Content-Length: 22594
< Cache-Control: private
< Pragma: no-cache
< Connection: close
< Content-Type: text/html

There's no 302 from here. Are you trying to crawl through a proxy? If so, that might be where the problem lies.

Karl

On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright daddy...@gmail.com wrote:

It sounds like the httpclient upgrade definitely broke something. We should open a ticket. But first, can you confirm which connector this is? Is it the web connector? If so, I am puzzled, because the web connector has always logged any 302 return, but it then queues a second document which it subsequently fetches.

Karl

On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote:

Hi,

I'm using trunk code and crawling a web site with seeds that include http://lucene.jugem.jp/?eid=39 (Koji's blog - I don't obey robots.txt). Looking at the Simple History, it shows a 302 result code for the fetch activity and does not ingest the document. When I used MCF 1.0.1 in the same situation, the Simple History showed a 200 result code and MCF could ingest documents. Why does trunk show a 302 status? Is it related to the httpclient upgrade?

Thanks in advance, Shinichiro Abe
Re: Http status code 302
Odd that curl would yield a 200 while ManifoldCF gets a 302. Maybe Koji's blog site does not like one of the headers, crawler-agent perhaps? I am behind a firewall now but I will explore this later today. In the meantime, if you want to research the problem, could you turn on wire debugging? You do this in the logging.ini file following these instructions: http://hc.apache.org/httpcomponents-client-ga/logging.html You should see everything happening in the log then, and you can then compare against curl using -vvv. Please let me know what you find. Thanks! Karl On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: I'm using web connector. Are you trying to crawl through a proxy? No. I just set seeds that url without a proxy. (Also I didn't obey robots.txt) Using curl, it is the same as your result. Could you reproduce that? Shinichiro On 2013/01/09, at 17:49, Karl Wright wrote: When I try the URL you gave using curl and no special arguments, I get this: C:\Users\Karlcurl -vvv http://lucene.jugem.jp/?eid=39; * About to connect() to lucene.jugem.jp port 80 (#0) * Trying 210.172.160.170... connected * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0) GET /?eid=39 HTTP/1.1 User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2 .5 librtmp/2.3 Host: lucene.jugem.jp Accept: */* HTTP/1.1 200 OK Date: Wed, 09 Jan 2013 08:47:52 GMT Server: Apache/2.0.59 (Unix) Vary: User-Agent,Host,Accept-Encoding Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT Accept-Ranges: bytes Content-Length: 22594 Cache-Control: private Pragma: no-cache Connection: close Content-Type: text/html There's no 302 from here. Are you trying to crawl through a proxy? If so, that might be where the problem lies. Karl On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright daddy...@gmail.com wrote: It sounds like the httpclient upgrade definitely broke something. We should open a ticket. But first, can you confirm what connector this is? Is it the web connector? 
If so, I am puzzled because the web connector has always logged any 302 return, but then queued a second document which it subsequently fetches. Karl On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi, I'm using trunk code and crawling a web site with seeds that include http://lucene.jugem.jp/?eid=39 (Koji's blog -- I don't obey robots.txt). When I look at the Simple History, it shows a 302 result code for the fetch activity and does not ingest the document. When I used MCF 1.0.1 in the same situation, the Simple History showed a 200 result code and MCF could ingest documents. Why does trunk show a 302 status? Is it related to the httpclient upgrade? Thanks in advance, Shinichiro Abe
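For reference, the wire debugging Karl suggests is a log4j configuration change. A minimal sketch of the logging.ini additions for HttpComponents 4.x follows; the exact logger category names should be verified against the HttpComponents logging page linked above for your version:

```ini
# Hypothetical logging.ini additions for httpclient wire-level debugging
# (log4j properties syntax; category names per the HttpComponents logging guide)
log4j.logger.org.apache.http=DEBUG
log4j.logger.org.apache.http.wire=DEBUG
log4j.logger.org.apache.http.headers=DEBUG
```

With these in place, every request and response byte appears in manifoldcf.log, which is what makes the curl comparison possible.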
Re: Http status code 302
There seems to be only two differences. The Host header value is different, and there is an Accept header in the one that works. (Accept: */*) I will experiment with curl this evening to see which of these is causing the problem. Or, if you don't want to wait, you can use curl and explicitly set these headers to see which one causes it to fail. Thanks, Karl On Wed, Jan 9, 2013 at 9:56 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Thank you for your navigation. I got a log from MCF 1.0.1. A) a log from curl curl -vvv http://lucene.jugem.jp/?eid=39; * About to connect() to lucene.jugem.jp port 80 (#0) * Trying 210.172.160.170... connected * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0) GET /?eid=39 HTTP/1.1 User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3 Host: lucene.jugem.jp Accept: */* HTTP/1.1 200 OK Date: Wed, 09 Jan 2013 13:23:15 GMT Server: Apache/2.0.59 (Unix) Vary: User-Agent,Host,Accept-Encoding Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT Accept-Ranges: bytes Content-Length: 22594 Cache-Control: private Pragma: no-cache Connection: close Content-Type: text/html B) a log from MCF 1.0.1 DEBUG 2013-01-09 23:40:11,313 (Thread-472) - Open connection to 210.172.160.170:80 DEBUG 2013-01-09 23:40:11,436 (Thread-472) - GET /?eid=39 HTTP/1.1[\r][\n] DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Using virtual host name: lucene.jugem.jp DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Adding Host request header DEBUG 2013-01-09 23:40:11,447 (Thread-472) - User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)[\r][\n] DEBUG 2013-01-09 23:40:11,447 (Thread-472) - From: shinichiro.ab...@gmail.com[\r][\n] DEBUG 2013-01-09 23:40:11,447 (Thread-472) - Host: lucene.jugem.jp[\r][\n] DEBUG 2013-01-09 23:40:11,447 (Thread-472) - [\r][\n] DEBUG 2013-01-09 23:40:11,629 (Thread-472) - HTTP/1.1 200 OK[\r][\n] DEBUG 2013-01-09 23:40:11,632 (Thread-472) - Date: Wed, 09 Jan 2013 14:39:24 GMT[\r][\n] 
DEBUG 2013-01-09 23:40:11,632 (Thread-472) - Server: Apache/2.0.59 (Unix)[\r][\n] DEBUG 2013-01-09 23:40:11,632 (Thread-472) - Vary: User-Agent,Host,Accept-Encoding[\r][\n] DEBUG 2013-01-09 23:40:11,632 (Thread-472) - Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Accept-Ranges: bytes[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Content-Length: 22594[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Cache-Control: private[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Pragma: no-cache[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Connection: close[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Content-Type: text/html[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - [\r][\n] DEBUG 2013-01-09 23:40:12,054 (Worker thread '0') - Should close connection in response to directive: close Is it enough to diagnose? Thank you very much, Shinichiro On 2013/01/09, at 23:12, Karl Wright wrote: Wire debugging with MCF 1.0.1 requires different logging.ini parameters, because it uses commons-httpclient instead. That's described here: http://hc.apache.org/httpclient-3.x/logging.html I will need a working comparison to diagnose what is happening, so please either get a log from curl, or better yet from MCF 1.0.1. Thanks! Karl On Wed, Jan 9, 2013 at 9:04 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi, I did wire debugging: curl yielded a 200 while ManifoldCF trunk got a 302, ManifoldCF 1.0.1 got a 200. The manifoldcf.log of trunk showed logs[1] but one of 1.0.1 showed no logs. 
[1]
DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 HTTP/1.1
DEBUG 2013-01-09 22:07:26,495 (Thread-474) - GET /?eid=39 HTTP/1.1[\r][\n]
DEBUG 2013-01-09 22:07:26,496 (Thread-474) - User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)[\r][\n]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - From: shinichiro.ab...@gmail.com[\r][\n]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - Host: lucene.jugem.jp:80[\r][\n]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - Connection: Keep-Alive[\r][\n]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - [\r][\n]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - GET /?eid=39 HTTP/1.1
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - From: shinichiro.ab...@gmail.com
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - Host: lucene.jugem.jp:80
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - Connection: Keep-Alive
DEBUG 2013-01-09 22:07:26,556 (Thread-474) - HTTP/1.1 302 Found[\r][\n]
DEBUG 2013-01-09 22:07:26,561 (Thread-474) - Date: Wed, 09 Jan 2013 13:06:39 GMT[\r][\n]
DEBUG 2013-01-09 22:07:26,561 (Thread-474) - Server: Apache
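The header comparison Karl walks through above can be made mechanical. A small sketch that diffs the two request-header sets transcribed from the logs (the From address is a placeholder; per the logs, Host and Accept differ, and the Connection header is also only sent by trunk):

```python
def diff_headers(a, b):
    """Return {header: (value_in_a, value_in_b)} for every header that differs."""
    return {k: (a.get(k), b.get(k))
            for k in set(a) | set(b)
            if a.get(k) != b.get(k)}

# Headers transcribed from the logs above (curl got a 200; ManifoldCF trunk got a 302)
curl_ok = {"User-Agent": "curl/7.19.7",
           "Host": "lucene.jugem.jp",
           "Accept": "*/*"}
trunk_302 = {"User-Agent": "Mozilla/5.0 (ApacheManifoldCFWebCrawler)",
             "From": "user@example.com",   # placeholder, elided in the logs
             "Host": "lucene.jugem.jp:80",
             "Connection": "Keep-Alive"}

print(diff_headers(curl_ok, trunk_302))
```

Each entry in the result is one candidate to toggle in a curl experiment, exactly as Karl suggests.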
Re: Http status code 302
I created CONNECTORS-604 to track this problem. Karl On Wed, Jan 9, 2013 at 10:02 AM, Karl Wright daddy...@gmail.com wrote: There seems to be only two differences. The Host header value is different, and there is an Accept header in the one that works. (Accept: */*) I will experiment with curl this evening to see which of these is causing the problem. Or, if you don't want to wait, you can use curl and explicitly set these headers to see which one causes it to fail. Thanks, Karl On Wed, Jan 9, 2013 at 9:56 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Thank you for your navigation. I got a log from MCF 1.0.1. A) a log from curl curl -vvv http://lucene.jugem.jp/?eid=39; * About to connect() to lucene.jugem.jp port 80 (#0) * Trying 210.172.160.170... connected * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0) GET /?eid=39 HTTP/1.1 User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3 Host: lucene.jugem.jp Accept: */* HTTP/1.1 200 OK Date: Wed, 09 Jan 2013 13:23:15 GMT Server: Apache/2.0.59 (Unix) Vary: User-Agent,Host,Accept-Encoding Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT Accept-Ranges: bytes Content-Length: 22594 Cache-Control: private Pragma: no-cache Connection: close Content-Type: text/html B) a log from MCF 1.0.1 DEBUG 2013-01-09 23:40:11,313 (Thread-472) - Open connection to 210.172.160.170:80 DEBUG 2013-01-09 23:40:11,436 (Thread-472) - GET /?eid=39 HTTP/1.1[\r][\n] DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Using virtual host name: lucene.jugem.jp DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Adding Host request header DEBUG 2013-01-09 23:40:11,447 (Thread-472) - User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)[\r][\n] DEBUG 2013-01-09 23:40:11,447 (Thread-472) - From: shinichiro.ab...@gmail.com[\r][\n] DEBUG 2013-01-09 23:40:11,447 (Thread-472) - Host: lucene.jugem.jp[\r][\n] DEBUG 2013-01-09 23:40:11,447 (Thread-472) - [\r][\n] DEBUG 2013-01-09 23:40:11,629 
(Thread-472) - HTTP/1.1 200 OK[\r][\n] DEBUG 2013-01-09 23:40:11,632 (Thread-472) - Date: Wed, 09 Jan 2013 14:39:24 GMT[\r][\n] DEBUG 2013-01-09 23:40:11,632 (Thread-472) - Server: Apache/2.0.59 (Unix)[\r][\n] DEBUG 2013-01-09 23:40:11,632 (Thread-472) - Vary: User-Agent,Host,Accept-Encoding[\r][\n] DEBUG 2013-01-09 23:40:11,632 (Thread-472) - Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Accept-Ranges: bytes[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Content-Length: 22594[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Cache-Control: private[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Pragma: no-cache[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Connection: close[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - Content-Type: text/html[\r][\n] DEBUG 2013-01-09 23:40:11,633 (Thread-472) - [\r][\n] DEBUG 2013-01-09 23:40:12,054 (Worker thread '0') - Should close connection in response to directive: close Is it enough to diagnose? Thank you very much, Shinichiro On 2013/01/09, at 23:12, Karl Wright wrote: Wire debugging with MCF 1.0.1 requires different logging.ini parameters, because it uses commons-httpclient instead. That's described here: http://hc.apache.org/httpclient-3.x/logging.html I will need a working comparison to diagnose what is happening, so please either get a log from curl, or better yet from MCF 1.0.1. Thanks! Karl On Wed, Jan 9, 2013 at 9:04 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi, I did wire debugging: curl yielded a 200 while ManifoldCF trunk got a 302, ManifoldCF 1.0.1 got a 200. The manifoldcf.log of trunk showed logs[1] but one of 1.0.1 showed no logs. 
[1] DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 HTTP/1.1 DEBUG 2013-01-09 22:07:26,495 (Thread-474) - GET /?eid=39 HTTP/1.1[\r][\n] DEBUG 2013-01-09 22:07:26,496 (Thread-474) - User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)[\r][\n] DEBUG 2013-01-09 22:07:26,497 (Thread-474) - From: shinichiro.ab...@gmail.com[\r][\n] DEBUG 2013-01-09 22:07:26,497 (Thread-474) - Host: lucene.jugem.jp:80[\r][\n] DEBUG 2013-01-09 22:07:26,497 (Thread-474) - Connection: Keep-Alive[\r][\n] DEBUG 2013-01-09 22:07:26,497 (Thread-474) - [\r][\n] DEBUG 2013-01-09 22:07:26,497 (Thread-474) - GET /?eid=39 HTTP/1.1 DEBUG 2013-01-09 22:07:26,497 (Thread-474) - User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com) DEBUG 2013-01-09 22:07:26,497 (Thread-474) - From: shinichiro.ab...@gmail.com DEBUG 2013-01-09 22:07:26,497 (Thread-474) - Host: lucene.jugem.jp:80 DEBUG 2013-01-09 22:07:26,497 (Thread-474) - Connection: Keep-Alive DEBUG 2013-01-09 22:07:26,556 (Thread-474) - HTTP/1.1 302 Found[\r][\n] DEBUG 2013-01
Re: Monitoring Manifold CF
Hi, The REST API can give you the job status. Karl On Wed, Jan 16, 2013 at 6:12 AM, Christian Hepworth christian.hepwo...@york.ac.uk wrote: Hello We are using Manifold CF to index Solr, via an Oracle connection. Our job is currently scheduled to run every evening, but we have had a few failed jobs. Ideally, we would like to use a tool such as Nagios to monitor the logs (or something else) and report on the success/failure of the job. We are struggling to find any output in the logs which would indicate the status of a job. Has anyone else put this sort of monitoring in place? Any advice would be much appreciated. Many thanks Christian
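The REST API route Karl mentions can feed a Nagios-style check directly. A sketch of parsing a job-status payload follows; the JSON field names (`jobstatus`, `job_id`, `status`) and the failure status values are assumptions modeled on ManifoldCF's REST API conventions, so verify them against the API documentation for your version:

```python
import json

def failed_jobs(payload_text):
    """Given the JSON body of a job-status query, return the ids of jobs
    whose status looks like a failure. Field names are assumptions."""
    payload = json.loads(payload_text)
    statuses = payload.get("jobstatus", [])
    if isinstance(statuses, dict):  # a single job may come back unwrapped
        statuses = [statuses]
    return [j["job_id"] for j in statuses
            if j.get("status") in ("error", "aborted")]

# Illustrative payload shaped like a jobstatuses response (not captured output)
sample = ('{"jobstatus": ['
          '{"job_id": "1352455005553", "status": "error"},'
          '{"job_id": "1352455005554", "status": "done"}]}')
print(failed_jobs(sample))
```

A monitoring check would fetch the payload over HTTP, call something like `failed_jobs`, and alert whenever the returned list is non-empty.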
Re: Crawling new/updated files using Windows share connection takes too long
Hi Shigeki, What database is ManifoldCF configured to use in this case? Do you see any indication of slow queries in the ManifoldCF log? Karl On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hello, I would like some advice on improving the crawling time of new/updated files using a Windows share connection. I crawl files on a Windows server and index them into Solr. Currently, the second crawl of two hundred thousand files takes over 5 hours, even though no files were updated, created, or deleted. I assume MCF does the following (let me know if I am wrong): - obtain the updated time of a file - compare that updated time with the one MCF obtained on the last crawl (probably stored in the DB) - if they differ, MCF recognizes the file needs to be indexed. If the above steps are done for two hundred thousand files, which part could take the most time? Obtaining the updated time? Reading data from the DB? And what could be done to reduce the crawling time, do you think? Please give me some advice. Regards, Shigeki
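The incremental-crawl logic Shigeki outlines can be sketched roughly as follows. This is a simplification for illustration, not ManifoldCF's actual code: MCF tracks a version string per document and skips reprocessing when it is unchanged.

```python
def needs_processing(doc_id, modified_time, version_store):
    """Decide whether a document must be re-fetched/re-indexed by comparing
    its current version (here, just its modified time) against the version
    recorded on the previous crawl."""
    new_version = str(modified_time)
    if version_store.get(doc_id) == new_version:
        return False          # unchanged since last crawl: skip it
    version_store[doc_id] = new_version
    return True               # new or updated: fetch and index

store = {}
assert needs_processing("//server/share/a.doc", 1000, store) is True   # first crawl
assert needs_processing("//server/share/a.doc", 1000, store) is False  # unchanged
assert needs_processing("//server/share/a.doc", 2000, store) is True   # updated
```

In a real crawl the expensive parts are fetching the modified time over SMB for every file and the database queries behind the version lookup, which is why Karl asks about slow queries first.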
Re: XML parsing error quits file crawling using Windows share connection
This means that the Solr you are talking to has returned an unintelligible (non-XML) response. When this happens I believe the actual return text is included in the Simple History, so I'd look there first to see what the problem might be. You may also eventually want to update to the current ManifoldCF 1.1 release candidate, which has a revised Solr connector based on SolrJ. I don't think that will help but at least you'll know it isn't our code. ;-) Karl On Sun, Jan 20, 2013 at 11:56 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi, I use trunk 1.1dev downloaded on Dec 12th and crawl files using the Windows Share Connection to index them into Solr 4.0. There was the following error, and it quit the crawling job:

2013/01/19 20:59:04 ERROR (Worker thread '8') - Exception tossed: XML parsing error on response
org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error on response
 at org.apache.manifoldcf.agents.output.solr.HttpPoster$CodeDetails.parseIngestionResponse(HttpPoster.java:2059)
 at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1365)

Does anyone have any ideas? Regards, Shigeki
Re: Job hanging on Starting up with never ending external query.
Hi Anthony, What happens between the framework recognizing that the job should be started (which it does fine in both cases), and actually achieving a correct job start, is the seeding phase, which is going to try to execute the seeding query against your Oracle database. If something happens at that time to hang the JDBC connection's seeding query, then it precisely explains the behavior you are seeing. It is also the case that the timeout on the queries that the JDBC connector does is effectively infinite. This makes me suspicious that what is happening is an Oracle query is going out but there is no response ever coming back. The other possibility is that the JDBC connector is in fact correctly throwing a ServiceInterruption, but that the ManifoldCF code is either not handling it properly, or the connector is not forming it properly. In that case, when you notice a hung job, the startup thread will be a particular place in the code, and I can diagnose it that way. The first order of business is therefore to get a thread dump when the system is hung. That will help confirm the picture. There are a number of additional questions here. (1) Why is this happening? Is there any possibility that the Oracle database you are crawling is (very occasionally) not able to properly respond to a JDBC query? I can imagine that, under some network conditions, it might be possible for the Oracle JDBC driver to wind up waiting indefinitely for a response that never comes. (2) Given that we can't always control the infrastructure we're trying to crawl through, should we attempt to provide a reasonable workaround? For example, a timeout on JDBC connector queries, where we throw a ServiceInterruption if the timeout is exceeded? Karl On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard anthony.leon...@york.ac.uk wrote: Hi there, We have recently started running a nightly job 2AM in ManifoldCF to extract data from an Oracle repository and populate a Solr index. 
Most nights this works fine, but occasionally the job has been hanging at the Starting up phase. We have observed this on our test setup also occasionally. A restart of ManifoldCF usually solves this. Using the simple history reports today I looked up all records and sorted them by the Time column, largest first, and found the following: Start Time,Activity,Identifier,Result Code,Bytes,Time,Result Description 11-12-2012 05:00:05.941,external query ... SQL QUERY ...,ERROR,0,1926607529,Interrupted: null 01-21-2013 02:00:11.843,external query ... SQL QUERY ...,ERROR,0,31644956,Interrupted: null 01-17-2013 02:00:03.600,external query ... SQL QUERY ...,ERROR,0,31637594,Interrupted: null 12-04-2012 12:12:19.860,external query ... SQL QUERY ...,OK,0,17511, ... etc ... If the Time column is in millis that means the first query was hanging for 22 days! (This was in the period before we went live when our live server was sitting idle for a while.) The other two occasions it was hanging for about 8 hours until we arrived to restart the job in the morning. I have confirmed that the Oracle database we are connecting to was available throughout these periods. These times are also too long for any network or database timeouts, which makes me suspect that it's a problem with the application. 
We have the following logging config in properties.xml:

<property name="org.apache.manifoldcf.jobs" value="ALL"/>
<property name="org.apache.manifoldcf.connectors" value="ALL"/>
<property name="org.apache.manifoldcf.agents" value="ALL"/>
<property name="org.apache.manifoldcf.misc" value="ALL"/>

The job failed again last night and when I checked at 10:40 AM this morning the last few lines of manifoldcf.log were:

DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job 1352455005553 needs to be started; it was last checked at 1358733575454, and now it is 1358733585635
DEBUG 2013-01-21 01:59:45,654 (Job start thread) - No time match found within interval 1358733575454 to 1358733585635
DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job 1352455005553 needs to be started; it was last checked at 1358733585636, and now it is 1358733595662
DEBUG 2013-01-21 01:59:55,805 (Job start thread) - No time match found within interval 1358733585636 to 1358733595662
DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job 1352455005553 needs to be started; it was last checked at 1358733595663, and now it is 1358733605813
DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Time match FOUND within interval 1358733595663 to 1358733605813
DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553' is within run window at 1358733605813 ms. (which starts at 135873360 ms.)
DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job start for job 1352455005553
DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job 1352455005553 for startup
DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The
Re: Crawling new/updated files using Windows share connection
CONNECTORS-618 Karl On Mon, Jan 21, 2013 at 9:08 AM, Karl Wright daddy...@gmail.com wrote: Bad news, I am afraid. MySQL seems to always put null values at the front of the index, and that cannot be changed through any means I can find. This is different from all other databases I know of. The only possible fixes for the problem are as follows: (1) Not use a null doc priority but instead use an actual special number that is guaranteed to always sort to the end. This would be non-trivial because this column is a FLOAT value, and round-off errors will prevent the ManifoldCF code from reliably using a special number like that on all databases. (2) Use docpriorities that are ordered in the opposite way - which would work ONLY for MySQL and would break all other databases. I'll create a ticket and think about the problem some more. Karl On Mon, Jan 21, 2013 at 8:48 AM, Karl Wright daddy...@gmail.com wrote: Hi Shigeki, I reviewed the code in detail. At the time CONNECTORS-290 was fixed, all document priorities were set to null whenever a job was paused or aborted, so what I suspected might be the problem cannot in fact happen. The most likely possible explanation for MySQL's behavior, therefore, is that MySQL orders null docpriority values BEFORE all other rows in the index it is using for queue stuffing. I have no other way of explaining why it thinks it needs to go through 6.5 million rows before it gets to the ones that are active. If this is the case, it may be possible to tell MySQL to order null column values to the END instead of the beginning of the index. I'll do some research on this later and get back to you. Thanks, Karl On Mon, Jan 21, 2013 at 6:21 AM, Karl Wright daddy...@gmail.com wrote: Are there any large paused or aborted jobs present on the same ManifoldCF? If so, can you tell me whether the job is paused, or aborted? (I am betting paused...) 
Karl On Mon, Jan 21, 2013 at 5:59 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi Karl, Here is the explain. There isn't such a sort...

mysql> explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC LIMIT 4800;

| id | select_type        | table | type   | possible_keys                                | key            | key_len | ref                     | rows | Extra       |
| 1  | PRIMARY            | t0    | index  | NULL                                         | I1358228295210 | 25      | NULL                    | 4800 | Using where |
| 4  | DEPENDENT SUBQUERY | t3    | ref    | I1358228295216                               | I1358228295216 | 8       | manifoldcf.t0.id        | 1    |             |
| 4  | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY                                      | PRIMARY        | 767     | manifoldcf.t3.eventname | 1    | Using index |
| 3  | DEPENDENT SUBQUERY | t2    | ref    | I1358228295209,I1358228295212,I1358228295211 | I1358228295209 | 122     | manifoldcf.t0.dochash   | 1    | Using where |
| 2  | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1358228295219                       | PRIMARY        | 8       | manifoldcf.t0.jobid     | 1    | Using where |

5 rows in set (0.00 sec)

Regards, Shigeki 2013/1/21 Karl Wright daddy...@gmail.com wrote: Can you get an EXPLAIN for this query? It sounds like it is disregarding the hint for some reason.
Karl Sent from my Windows Phone From: Shigeki Kobayashi Sent: 1/20/2013 9:37 PM To: user@manifoldcf.apache.org Subject: Re: Crawling new/updated files using Windows share connection takes too long Hi Karl. I configured MySQL 5.5 to run MCF this time. The version of MCF is trunk 1.1dev downloaded on Dec 12th, in which you fixed the slow query using FORCE INDEX. Solr is 4.0. I thought it was fixed, but the log shows that the following are slow queries
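The NULL-ordering behavior discussed in this thread is easy to reproduce. The sketch below uses SQLite (which, like the MySQL behavior Karl describes, sorts NULLs first in ascending order) and shows the common nulls-last idiom of sorting on an IS NULL flag first; whether this matches the fix actually committed for CONNECTORS-618 is a separate question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobqueue (id INTEGER, docpriority REAL)")
conn.executemany("INSERT INTO jobqueue VALUES (?, ?)",
                 [(1, None), (2, 0.5), (3, 0.1)])

# Plain ascending sort: the NULL docpriority comes FIRST,
# so an index scan must wade through NULL rows before active ones.
nulls_first = [r[0] for r in conn.execute(
    "SELECT id FROM jobqueue ORDER BY docpriority ASC")]

# Workaround: order on an IS NULL flag first so NULLs land at the END.
nulls_last = [r[0] for r in conn.execute(
    "SELECT id FROM jobqueue ORDER BY (docpriority IS NULL), docpriority ASC")]

print(nulls_first, nulls_last)
```

The first query returns the NULL-priority row before the real ones; the second pushes it to the end, which is the ordering the queue-stuffing query wants.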
Re: Job hanging on Starting up with never ending external query.
kill -QUIT should not abort the agents process, just cause a thread dump. kill -9 is a different story. You can also do the same thing by using jstack, in the jvm bin directory. Karl On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard anthony.leon...@york.ac.uk wrote: Dear Karl, Many thanks for your insights. I'll do a kill -QUIT next time we have this issue which should hopefully give me the thread dump. However we've noticed that killing processes means we have to run the locks-clean script so it's not our favourite way of doing it. Also I definitely think a timeout for queries would be a good thing. I guess we go back to checking that the connection to the database should have been ok last night... Best wishes, Anthony. -- Dr Anthony Leonard System Integrator, Information Directorate University of York, Heslington, York, UK, YO10 5DD Tel: +44 (0)1904 434350 http://twitter.com/apbleonard Times Higher Education University of the Year 2010 On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright daddy...@gmail.com wrote: Hi Anthony, What happens between the framework recognizing that the job should be started (which it does fine in both cases), and actually achieving a correct job start, is the seeding phase, which is going to try to execute the seeding query against your Oracle database. If something happens at that time to hang the JDBC connection's seeding query, then it precisely explains the behavior you are seeing. It is also the case that the timeout on the queries that the JDBC connector does is effectively infinite. This makes me suspicious that what is happening is an Oracle query is going out but there is no response ever coming back. The other possibility is that the JDBC connector is in fact correctly throwing a ServiceInterruption, but that the ManifoldCF code is either not handling it properly, or the connector is not forming it properly. 
In that case, when you notice a hung job, the startup thread will be a particular place in the code, and I can diagnose it that way. The first order of business is therefore to get a thread dump when the system is hung. That will help confirm the picture. There are a number of additional questions here. (1) Why is this happening? Is there any possibility that the Oracle database you are crawling is (very occasionally) not able to properly respond to a JDBC query? I can imagine that, under some network conditions, it might be possible for the Oracle JDBC driver to wind up waiting indefinitely for a response that never comes. (2) Given that we can't always control the infrastructure we're trying to crawl through, should we attempt to provide a reasonable workaround? For example, a timeout on JDBC connector queries, where we throw a ServiceInterruption if the timeout is exceeded? Karl On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard anthony.leon...@york.ac.uk wrote: Hi there, We have recently started running a nightly job 2AM in ManifoldCF to extract data from an Oracle repository and populate a Solr index. Most nights this works fine, but occasionally the job has been hanging at the Starting up phase. We have observed this on our test setup also occasionally. A restart of ManifoldCF usually solves this. Using the simple history reports today I looked up all records and sorted them by the Time column, largest first, and found the following: Start Time,Activity,Identifier,Result Code,Bytes,Time,Result Description 11-12-2012 05:00:05.941,external query ... SQL QUERY ...,ERROR,0,1926607529,Interrupted: null 01-21-2013 02:00:11.843,external query ... SQL QUERY ...,ERROR,0,31644956,Interrupted: null 01-17-2013 02:00:03.600,external query ... SQL QUERY ...,ERROR,0,31637594,Interrupted: null 12-04-2012 12:12:19.860,external query ... SQL QUERY ...,OK,0,17511, ... etc ... If the Time column is in millis that means the first query was hanging for 22 days! 
(This was in the period before we went live when our live server was sitting idle for a while.) The other two occasions it was hanging for about 8 hours until we arrived to restart the job in the morning. I have confirmed that the Oracle database we are connecting to was available throughout these periods. These times are also too long for any network or database timeouts, which makes me suspect that it's a problem with the application. We have the following logging config in properties.xml property name=org.apache.manifoldcf.jobs value=ALL/ property name=org.apache.manifoldcf.connectors value=ALL/ property name=org.apache.manifoldcf.agents value=ALL/ property name=org.apache.manifoldcf.misc value=ALL/ The job failed again last night and when I checked at 10:40 AM this morning the last few lines of manifoldcf.log were: DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job 1352455005553 needs to be started; it was last checked
Re: Crawling new/updated files using Windows share connection
I checked a fix for this into trunk. Please sync up with trunk and see if this fixes your problem. If it does, I will gladly include the fix in MCF 1.1. Karl On Mon, Jan 21, 2013 at 9:14 AM, Karl Wright daddy...@gmail.com wrote: CONNECTORS-618 Karl On Mon, Jan 21, 2013 at 9:08 AM, Karl Wright daddy...@gmail.com wrote: Bad news, I am afraid. MySQL seems to always put null values at the front of the index, and that cannot be changed through any means I can find. This is different from all other databases I know of. The only possible fixes for the problem are as follows: (1) Not use a null doc priority but instead use an actual special number that is guaranteed to always sort to the end. This would be non-trivial because this column is a FLOAT value, and round-off errors will prevent the ManifoldCF code from reliably using a special number like that on all databases. (2) Use docpriorities that are ordered in the opposite way - which would work ONLY for MySQL and would break all other databases. I'll create a ticket and think about the problem some more. Karl On Mon, Jan 21, 2013 at 8:48 AM, Karl Wright daddy...@gmail.com wrote: Hi Shigeki, I reviewed the code in detail. At the time CONNECTORS-290 was fixed, all document priorities were set to null whenever a job was paused or aborted, so what I suspected might be the problem cannot in fact happen. The most likely possible explanation for MySQL's behavior, therefore, is that MySQL orders null docpriority values BEFORE all other rows in the index it is using for queue stuffing. I have no other way of explaining why it thinks it needs to go through 6.5 million rows before it gets to the ones that are active. If this is the case, it may be possible to tell MySQL to order null column values to the END instead of the beginning of the index. I'll do some research on this later and get back to you. 
Thanks, Karl On Mon, Jan 21, 2013 at 6:21 AM, Karl Wright daddy...@gmail.com wrote: Are there any large paused or aborted jobs present on the same ManifoldCF? If so, can you tell me whether the job is paused, or aborted? (I am betting paused...) Karl On Mon, Jan 21, 2013 at 5:59 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote: Hi Karl, Here is the explain. There isn't such a sort...

mysql> explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC LIMIT 4800;

| id | select_type        | table | type   | possible_keys                                | key            | key_len | ref                     | rows | Extra       |
| 1  | PRIMARY            | t0    | index  | NULL                                         | I1358228295210 | 25      | NULL                    | 4800 | Using where |
| 4  | DEPENDENT SUBQUERY | t3    | ref    | I1358228295216                               | I1358228295216 | 8       | manifoldcf.t0.id        | 1    |             |
| 4  | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY                                      | PRIMARY        | 767     | manifoldcf.t3.eventname | 1    | Using index |
| 3  | DEPENDENT SUBQUERY | t2    | ref    | I1358228295209,I1358228295212,I1358228295211 | I1358228295209 | 122     | manifoldcf.t0.dochash   | 1    | Using where |
| 2  | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1358228295219                       | PRIMARY        | 8       | manifoldcf.t0.jobid     | 1    | Using where |

5 rows in set (0.00 sec)

Regards, Shigeki 2013/1/21 Karl Wright daddy...@gmail.com wrote: Can you get an EXPLAIN for this query? It sounds like it is disregarding the hint for some reason. Karl Sent from my Windows Phone From: Shigeki Kobayashi Sent: 1/20/2013 9:37 PM To: user@manifoldcf.apache.org Subject: Re: Crawling new/updated files using Windows share connection takes too long Hi Karl. I configured MySQL 5.5 to run MCF
Re: Job hanging on Starting up with never ending external query.
Hmm. The following threads are of interest here:

Thread 29975: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
 - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
 - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection, java.lang.String, java.util.ArrayList, int) @bci=39, line=1058 (Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String, java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long, long, int) @bci=106, line=246 (Interpreted frame)
 - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636, line=179 (Interpreted frame)

... which is probably waiting in this one:

Thread 24457: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
 - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
 - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection() @bci=80, line=80 (Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String) @bci=433, line=128 (Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run() @bci=36, line=1212 (Interpreted frame)

... which is waiting to obtain a JDBC connection, and the reason it can't is because it thinks that the only available JDBC connection is currently in use.
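The hang Karl diagnoses — one pooled connection, a leaked handle, and getConnection() blocking forever — can be modeled with a toy pool. This is an illustration only, not ManifoldCF's actual ConnectionPool:

```python
import queue

class ToyPool:
    """Minimal blocking connection pool to illustrate the leak scenario."""
    def __init__(self, size=1):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put("conn-%d" % i)

    def get_connection(self, timeout=None):
        # Blocks (forever if timeout is None) once the pool is exhausted --
        # the same wait seen in the BLOCKED getConnection() frame above.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        # A code path that skips this (no try/finally) leaks the handle.
        self._free.put(conn)

pool = ToyPool(size=1)
c = pool.get_connection()
# If c is never released, every subsequent get_connection() waits indefinitely.
```

This is why the try/finally discipline around freeing connections matters: one missed release with a max-connections setting of 1 deadlocks the whole seeding phase.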
Since you have only a single connection around, and nothing else is active, it stands to reason that a JDBC connection handle has somehow been leaked, which is a challenge since connections are typically freed in a try/finally block through ManifoldCF. I notice that your stack frames are pretty unusual - what JDK is this that you are using? Karl On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard anthony.leon...@york.ac.uk wrote: Dear Karl, Our DBA noticed that each time our job was run 10 Oracle connections were created. So, we dropped the Max connections parameter on the repository connection config to 1 and re-ran the job with the DBA watching. The job worked fine but the DBA reported that 1 connection was created and then 10 more briefly ... Out of curiosity we re-ran the job again with no further changes and this time got the following results: * the job hung in the Starting Up phase again, with the same logging and symptoms as detailed before on this thread. * the DBA reported seeing no connections at all this time. * I have attached a thread dump created by jstack -F pid. This is reporting all threads as blocked. Any ideas? Any help with this would certainly be very gratefully received. Best wishes, Anthony. -- Dr Anthony Leonard System Integrator, Information Directorate University of York, Heslington, York, UK, YO10 5DD Tel: +44 (0)1904 434350 http://twitter.com/apbleonard Times Higher Education University of the Year 2010 On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright daddy...@gmail.com wrote: kill -QUIT should not abort the agents process, just cause a thread dump. kill -9 is a different story. You can also do the same thing by using jstack, in the jvm bin directory. Karl On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard anthony.leon...@york.ac.uk wrote: Dear Karl, Many thanks for your insights. I'll do a kill -QUIT next time we have this issue which should hopefully give me the thread dump.
However we've noticed that killing processes means we have to run the locks-clean script so it's not our favourite way of doing it. Also I definitely think a timeout for queries would be a good thing. I guess we go back to checking that the connection to the database should have been ok last night... Best wishes, Anthony. -- Dr Anthony Leonard System Integrator, Information Directorate University of York, Heslington, York, UK, YO10 5DD Tel: +44 (0)1904 434350 http://twitter.com/apbleonard Times Higher Education University of the Year 2010 On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright daddy...@gmail.com wrote: Hi Anthony, What happens between the framework recognizing that the job should be started (which it does fine in both cases), and actually achieving a correct job start, is the seeding phase, which is going to try to execute the seeding query against your Oracle database. If something happens at that time to hang the JDBC connection's seeding query, then it precisely explains the behavior you are seeing. It is also
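The leaked-handle scenario in the stack traces above can be sketched in a few lines. This is an illustrative toy pool, not ManifoldCF's actual org.apache.manifoldcf.core.jdbcpool.ConnectionPool: it shows why, with a single pooled connection, one handle that is never returned starves the next getConnection() caller, and why the try/finally discipline prevents it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy single-connection pool (NOT the real ConnectionPool): tryGet() returns
// null when the pool is empty, where the real pool blocks in Object.wait() --
// exactly the state shown in thread 24457's stack above.
public class PoolSketch {
    static class Pool {
        private final Deque<String> free = new ArrayDeque<>();
        Pool(int size) { for (int i = 0; i < size; i++) free.push("conn-" + i); }
        synchronized String tryGet() { return free.poll(); }
        synchronized void release(String c) { free.push(c); }
    }

    public static void main(String[] args) {
        // Correct pattern: release the handle in finally, as ManifoldCF does.
        Pool pool = new Pool(1);
        String handle = pool.tryGet();
        try {
            // ... run the seeding query here ...
        } finally {
            pool.release(handle);
        }
        System.out.println("after finally: " + (pool.tryGet() != null));

        // Leak: the handle is taken but never released, so the next caller
        // gets nothing (the real pool would wait forever instead).
        Pool leaky = new Pool(1);
        String leaked = leaky.tryGet(); // never released
        System.out.println("after a leak: " + (leaky.tryGet() != null));
    }
}
```

The leak path that CONNECTORS-620 closed is precisely an exception escaping between the acquire and the finally-protected release.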
Re: Job hanging on Starting up with never ending external query.
I've looked into the code in some detail. There is indeed a place where it is possible for a JDBC connection handle to be leaked, I believe. However, it's not clear whether this is the circumstance you are encountering or not, since it does involve an exception getting thrown doing something not terribly likely to cause exceptions. I've opened a ticket - CONNECTORS-620. Karl On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright daddy...@gmail.com wrote: Hmm. The following threads are of interest here: Thread 29975: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame) - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame) - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.init(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection, java.lang.String, java.util.ArrayList, int) @bci=39, line=1058 (Interpreted frame) - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String, java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame) - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long, long, int) @bci=106, line=246 (Interpreted frame) - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636, line=179 (Interpreted frame) ...
which is probably waiting in this one: Thread 24457: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame) - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection() @bci=80, line=80 (Interpreted frame) - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String) @bci=433, line=128 (Interpreted frame) - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run() @bci=36, line=1212 (Interpreted frame) ... which is waiting to obtain a JDBC connection, and the reason it can't is because it thinks that the only available JDBC connection is currently in use. Since you have only a single connection around, and nothing else is active, it stands to reason that a JDBC connection handle has somehow been leaked, which is a challenge since connections are typically freed in a try/finally block through ManifoldCF. I notice that your stack frames are pretty unusual - what JDK is this that you are using? Karl On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard anthony.leon...@york.ac.uk wrote: Dear Karl, Our DBA noticed that each time our job was run 10 Oracle connections were created. So, we dropped the Max connections parameter on the repository connection config to 1 and re-ran the job with the DBA watching. The job worked fine but the DBA reported that 1 connection was created and then 10 more briefly ... Out of curiosity we re-ran the job again with no further changes and this time got the following results: * the job hung in the Starting Up phase again, with the same logging and symptoms as detailed before on this thread. * the DBA reported seeing no connections at all this time. * I have attached a thread dump created by jstack -F pid. This is reporting all threads as blocked. Any ideas?
Any help with this would certainly be very gratefully received. Best wishes, Anthony. -- Dr Anthony Leonard System Integrator, Information Directorate University of York, Heslington, York, UK, YO10 5DD Tel: +44 (0)1904 434350 http://twitter.com/apbleonard Times Higher Education University of the Year 2010 On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright daddy...@gmail.com wrote: kill -QUIT should not abort the agents process, just cause a thread dump. kill -9 is a different story. You can also do the same thing by using jstack, in the jvm bin directory. Karl On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard anthony.leon...@york.ac.uk wrote: Dear Karl, Many thanks for your insights. I'll do a kill -QUIT next time we have this issue which should hopefully give me the thread dump. However we've noticed that killing processes means we have to run the locks-clean script so it's not our favourite way of doing it. Also I definitely think a timeout for queries would be a good thing. I guess we go back to checking that the connection to the database should have been ok last night... Best wishes, Anthony. -- Dr Anthony Leonard System Integrator, Information Directorate University of York, Heslington, York, UK, YO10 5DD Tel: +44 (0)1904 434350 http://twitter.com/apbleonard Times Higher Education University of the Year 2010 On Mon
Re: max_pred_locks_per_transaction
Hi Erlend, Leaving logging at the default values would have shown the ERROR message you have below. So the cause for the pause must have been something else. When ManifoldCF seems to make no progress, the first thing to do is look at the simple history and see if it is retrying on something for some reason. If that is not helpful, get a thread dump. You can use jstack for that purpose. As for the PostgreSQL parameters, max_pred_locks_per_transaction seems to be PostgreSQL 9 black magic. Here's the documentation: http://www.postgresql.org/docs/9.1/static/runtime-config-locks.html Default is 64, but they don't say how it is allocated. I'd guess therefore that you should try 50% higher and see if that works, e.g. 96. I guess the limit is the amount of shared memory your OS allows you to allocate. Karl On Fri, Jan 25, 2013 at 5:26 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: After we started to crawl journals, the crawler just stopped after a couple of hours. Running MCF in debug mode gave me the stack trace shown below. I think we need to adjust some PG parameters, perhaps max_pred_locks_per_transaction. The database admins are now asking me about which value to set. They have increased it, but I don't know whether it is sufficient. Another thing. I did not see this message until I changed the log level to debug, but log4j should catch these error messages with warn level enabled. So maybe it is a dead end, i.e. a totally different cause, and this just occurred as a coincidence. ERROR 2013-01-19 03:47:15,049 (Worker thread '49') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: ERROR: out of shared memory Hint: You might need to increase max_pred_locks_per_transaction. org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: out of shared memory Hint: You might need to increase max_pred_locks_per_transaction.
at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681) at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709) at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394) at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144) at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186) at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:803) at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554) Caused by: org.postgresql.util.PSQLException: ERROR: out of shared memory Hint: You might need to increase max_pred_locks_per_transaction. at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) at org.apache.manifoldcf.core.database.Database.execute(Database.java:826) at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641) DEBUG 2013-01-19 03:47:23,386 (Idle cleanup thread) - Checking for connections, idleTimeout: 1358563583386 -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. 
Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
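For reference, the setting Karl points at lives in postgresql.conf; the value below is only a starting point along the lines of his "try 50% higher" suggestion, not a tuned recommendation:

```
# postgresql.conf -- changing this setting requires a server restart
max_pred_locks_per_transaction = 96   # default is 64; raise further if
                                      # "out of shared memory" persists
```

The current value can be checked from psql with SHOW max_pred_locks_per_transaction; the parameter only exists on PostgreSQL 9.1 and later, where SERIALIZABLE transactions take predicate locks.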
Re: Job hanging on Starting up with never ending external query.
You can download the current release candidate for 1.1 (RC6) from http://people.apache.org/~kwright/apache-manifoldcf-1.1 . Karl On Fri, Jan 25, 2013 at 12:02 PM, Anthony Leonard anthony.leon...@york.ac.uk wrote: Hi Karl, Thank you so much for this. Sorry for the lack of response as we've been working on other things. One question - would we have to build ManifoldCF ourselves to get the new code you've checked in or would it already be part of a binary distribution somewhere? Best wishes, Anthony. -- Dr Anthony Leonard System Integrator, Information Directorate University of York, Heslington, York, UK, YO10 5DD Tel: +44 (0)1904 434350 http://twitter.com/apbleonard Times Higher Education University of the Year 2010 On Tue, Jan 22, 2013 at 6:51 PM, Karl Wright daddy...@gmail.com wrote: I've checked in code in both trunk and the release branch for this issue. It would be good if you could try this again in your environment. The fix simply prevents some kinds of exceptions from causing a handle leak. Please try this with only 1 JDBC Connection connection handle per JVM and let me know if you see any hangs. Thanks, Karl On Tue, Jan 22, 2013 at 1:11 PM, Karl Wright daddy...@gmail.com wrote: I've looked into the code in some detail. There is indeed a place where it is possible for a JDBC connection handle to be leaked, I believe. However, it's not clear whether this is the circumstance you are encountering or not, since it does involve an exception getting thrown doing something not terribly likely to cause exceptions. I've opened a ticket - CONNECTORS-620. Karl On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright daddy...@gmail.com wrote: Hmm.
The following threads are of interest here: Thread 29975: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame) - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame) - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.init(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection, java.lang.String, java.util.ArrayList, int) @bci=39, line=1058 (Interpreted frame) - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String, java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame) - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long, long, int) @bci=106, line=246 (Interpreted frame) - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636, line=179 (Interpreted frame) ... which is probably waiting in this one: Thread 24457: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame) - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection() @bci=80, line=80 (Interpreted frame) - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String) @bci=433, line=128 (Interpreted frame) - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run() @bci=36, line=1212 (Interpreted frame) ... which is waiting to obtain a JDBC connection, and the reason it can't is because it thinks that the only available JDBC connection is currently in use. 
Since you have only a single connection around, and nothing else is active, it stands to reason that a JDBC connection handle has somehow been leaked, which is a challenge since connections are typically freed in a try/finally block through ManifoldCF. I notice that your stack frames are pretty unusual - what JDK is this that you are using? Karl On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard anthony.leon...@york.ac.uk wrote: Dear Karl, Our DBA noticed that each time our job was run 10 Oracle connections were created. So, we dropped the Max connections parameter on the repository connection config to 1 and re-ran the job with the DBA watching. The job worked fine but the DBA reported that 1 connection was created and then 10 more briefly ... Out of curiosity we re-ran the job again with no further changes and this time got the following results: * the job hung in the Starting Up phase again, with the same logging and symptoms as detailed before on this thread. * the DBA reported seeing no connections at all this time. * I have attached a thread dump created by jstack -F pid. This is reporting all threads as blocked. Any ideas? Any help with this would certainly be very gratefully received
Re: Diagnosing REJECTED documents in job history
Ok, so let's back up a bit. First, which version of ManifoldCF is this? I need to know that before I can interpret the stack trace. Second, what do you see when you view the connection in the crawler UI? Does it say Connection working, or something else, and if so, what? I've created a ticket for better error reporting in this connector - it was a contribution and AFAIK the error handling is not very robust at this point, but I can fix that quickly with your help. ;-) Karl On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com wrote: On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote: So you saw events in the history which correspond to these documents and which are of type Indexation that say success? If that is the case, then the ElasticSearch connector thinks it handed the documents successfully to the ElasticSearch server. Ah, no, the activity is fetch rather than indexation. e.g. 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361 I don't see any history entries relating to indexing as a specific activity in its own right. Sorry, that was probably a red herring, I don't think it's getting that far. 
I just noticed that above all the service interruption reported warnings are some errors like this: ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed: org.apache.manifoldcf.core.interfaces.ManifoldCFException: at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97) at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138) at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652) at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551) Sadly there's no description, just a stacktrace. I know the ES server is visible from the MCF server -- actually they're the same machine, and it's configured to use http://127.0.0.1:9200/ as the server URL. And I can go to the command line on that server and curl that URL successfully.
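The empty exception message above is the crux of this thread: the connector throws ManifoldCFException with whatever getResultDescription() returned, which here carried no text. A hedged sketch, using a hypothetical helper rather than the connector's actual code, of the kind of message that would make such failures diagnosable by including both the HTTP status and the response body:

```java
// Hypothetical helper: build a readable error message from an HTTP result.
// The status code and body strings below are illustrative, not real log data.
public class EsErrorSketch {
    static String describe(int statusCode, String responseBody) {
        if (responseBody == null || responseBody.isEmpty()) {
            return "ElasticSearch returned HTTP " + statusCode + " (empty response body)";
        }
        return "ElasticSearch returned HTTP " + statusCode + ": " + responseBody;
    }

    public static void main(String[] args) {
        System.out.println(describe(400, "index name must be lowercase"));
        System.out.println(describe(500, ""));
    }
}
```

Logging the body matters as much as the code: later in this thread the body is what revealed the actual index-name problem.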
Re: Diagnosing REJECTED documents in job history
I agree that the Elastic Search connector needs far better logging and error handling. CONNECTORS-629. Karl On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Nailed it with the help of wireshark! Turns out it was my fault -- I had set it up to use (i.e. create) an index called DocumentumRoW but it turns out ES index names must be all lowercase. Never knew that before. Slightly annoyed that ES didn't log that... Thanks again for your help Karl :-) My only request on the MCF front would be that it would be nice for the output connector to log the actual status code and content of a non-successful HTTP response. On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote: That information isn't being recorded in manifoldcf.log unfortunately -- I included all that was there. And there are no exceptions in elasticsearch.log either... I'll try running wireshark to see if I can follow the TCP stream. On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote: Ok, ElasticSearch is not happy about something when the document is being posted. The connector is seeing a non-200 HTTP response, and throwing an exception as a result: if (!checkResultCode(method.getStatusCode())) throw new ManifoldCFException(getResultDescription()); Presumably the exception message in the log tells us what that HTTP code is, but you did not include that key info. Karl On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Thanks for all your help Karl! It's 1.0.1 from the binary distro. And yes, it says Connection working when I view it. On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote: Ok, so let's back up a bit. First, which version of ManifoldCF is this? I need to know that before I can interpret the stack trace. Second, what do you see when you view the connection in the crawler UI? Does it say Connection working, or something else, and if so, what? 
I've created a ticket for better error reporting in this connector - it was a contribution and AFAIK the error handling is not very robust at this point, but I can fix that quickly with your help. ;-) Karl On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com wrote: On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote: So you saw events in the history which correspond to these documents and which are of type Indexation that say success? If that is the case, then the ElasticSearch connector thinks it handed the documents successfully to the ElasticSearch server. Ah, no, the activity is fetch rather than indexation. e.g. 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361 I don't see any history entries relating to indexing as a specific activity in its own right. Sorry, that was probably a red herring, I don't think it's getting that far. I just noticed that above all the service interruption reported warnings are some errors like this: ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed: org.apache.manifoldcf.core.interfaces.ManifoldCFException: at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97) at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138) at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652) at 
org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551) Sadly there's no description, just a stacktrace. I know the ES server is visible from the MCF server -- actually they're the same machine, and it's configured to use http://127.0.0.1:9200/ as the server URL. And I can go to the command line on that server and curl that URL successfully. -- http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg -- http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg -- http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
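The lowercase rule Andrew tripped over can be guarded client-side before a job ever runs. A minimal sketch, checking only the lowercase constraint mentioned in this thread (Elasticsearch imposes further index-name rules not covered here):

```java
import java.util.Locale;

public class IndexNameCheck {
    // True only if the name is already all-lowercase -- the constraint that
    // caused the silent rejection of "DocumentumRoW" in this thread.
    static boolean isLowercaseIndexName(String name) {
        return name.equals(name.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isLowercaseIndexName("documentumrow"));
        System.out.println(isLowercaseIndexName("DocumentumRoW"));
    }
}
```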
Re: Diagnosing REJECTED documents in job history
I just checked in a refactoring to trunk that should improve Elastic Search error reporting significantly. Karl On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright daddy...@gmail.com wrote: I agree that the Elastic Search connector needs far better logging and error handling. CONNECTORS-629. Karl On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Nailed it with the help of wireshark! Turns out it was my fault -- I had set it up to use (i.e. create) an index called DocumentumRoW but it turns out ES index names must be all lowercase. Never knew that before. Slightly annoyed that ES didn't log that... Thanks again for your help Karl :-) My only request on the MCF front would be that it would be nice for the output connector to log the actual status code and content of a non-successful HTTP response. On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote: That information isn't being recorded in manifoldcf.log unfortunately -- I included all that was there. And there are no exceptions in elasticsearch.log either... I'll try running wireshark to see if I can follow the TCP stream. On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote: Ok, ElasticSearch is not happy about something when the document is being posted. The connector is seeing a non-200 HTTP response, and throwing an exception as a result: if (!checkResultCode(method.getStatusCode())) throw new ManifoldCFException(getResultDescription()); Presumably the exception message in the log tells us what that HTTP code is, but you did not include that key info. Karl On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Thanks for all your help Karl! It's 1.0.1 from the binary distro. And yes, it says Connection working when I view it. On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote: Ok, so let's back up a bit. First, which version of ManifoldCF is this? I need to know that before I can interpret the stack trace. 
Second, what do you see when you view the connection in the crawler UI? Does it say Connection working, or something else, and if so, what? I've created a ticket for better error reporting in this connector - it was a contribution and AFAIK the error handling is not very robust at this point, but I can fix that quickly with your help. ;-) Karl On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com wrote: On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote: So you saw events in the history which correspond to these documents and which are of type Indexation that say success? If that is the case, then the ElasticSearch connector thinks it handed the documents successfully to the ElasticSearch server. Ah, no, the activity is fetch rather than indexation. e.g. 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361 I don't see any history entries relating to indexing as a specific activity in its own right. Sorry, that was probably a red herring, I don't think it's getting that far. 
I just noticed that above all the service interruption reported warnings are some errors like this: ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed: org.apache.manifoldcf.core.interfaces.ManifoldCFException: at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97) at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138) at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652) at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551) Sadly there's no description, just a stacktrace. I know the ES server is visible from the MCF server -- actually they're the same machine, and it's configured to use http://127.0.0.1:9200/ as the server URL. And I can go to the command line on that server and curl that URL successfully. -- http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg -- http://tinyurl.com/andrew-clegg-linkedin | http
Re: Diagnosing REJECTED documents in job history
The problem is that there are some documents you are indexing that have no mime type set at all. The ElasticSearch connector is not handling that case properly. I've opened ticket CONNECTORS-637, and will fix it shortly. Karl On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Hi Karl, The extended logging has helped me find the next problem :-) Now I'm seeing hundreds of exceptions like this in the manifold log: FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null java.lang.NullPointerException at java.util.TreeMap.getEntry(TreeMap.java:324) at java.util.TreeMap.containsKey(TreeMap.java:209) at java.util.TreeSet.contains(TreeSet.java:217) at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164) at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091) at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556) There'll be a whole batch, then a pause, then another batch. I suspect this is because MCF is retrying? My theory about this is that Documentum is returning the mime type as just pdf instead of application/pdf -- although I did add pdf as an allowed mime type in the ElasticSearch page of the job config, just to see if it would parse this ok. Do you know if there's any way to map from a source's content type to a destination's content type? 
On 31 January 2013 23:09, Karl Wright daddy...@gmail.com wrote: I just chased down and fixed a problem in trunk. ElasticSearch is now returning a 201 code for successful indexing in some cases, and the connector was not handling that as 'success'. Karl On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright daddy...@gmail.com wrote: Please let me know if you see any problems. I'll fix anything you find as quickly as I can. Karl On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Great, thanks, I'll give it a try. On 30 January 2013 18:52, Karl Wright daddy...@gmail.com wrote: I just checked in a refactoring to trunk that should improve Elastic Search error reporting significantly. Karl On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright daddy...@gmail.com wrote: I agree that the Elastic Search connector needs far better logging and error handling. CONNECTORS-629. Karl On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Nailed it with the help of wireshark! Turns out it was my fault -- I had set it up to use (i.e. create) an index called DocumentumRoW but it turns out ES index names must be all lowercase. Never knew that before. Slightly annoyed that ES didn't log that... Thanks again for your help Karl :-) My only request on the MCF front would be that it would be nice for the output connector to log the actual status code and content of a non-successful HTTP response. On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote: That information isn't being recorded in manifoldcf.log unfortunately -- I included all that was there. And there are no exceptions in elasticsearch.log either... I'll try running wireshark to see if I can follow the TCP stream. On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote: Ok, ElasticSearch is not happy about something when the document is being posted. 
The connector is seeing a non-200 HTTP response, and throwing an exception as a result: if (!checkResultCode(method.getStatusCode())) throw new ManifoldCFException(getResultDescription()); Presumably the exception message in the log tells us what that HTTP code is, but you did not include that key info. Karl On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Thanks for all your help Karl! It's 1.0.1 from the binary distro. And yes, it says Connection working when I view it. On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote: Ok, so let's back up a bit. First, which version of ManifoldCF is this? I need to know that before I can interpret the stack trace. Second, what do you see when you view the connection in the crawler UI? Does it say Connection working, or something else, and if so, what? I've created a ticket for better error reporting in this connector - it was a contribution and AFAIK the error handling is not very robust
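The NullPointerException in the CONNECTORS-637 thread above comes from calling TreeSet.contains(null): with natural ordering, TreeMap.getEntry throws NPE on a null key. A null-safe sketch of the mime-type check follows; it is illustrative, not the actual patch, and whether a document with no mime type should pass or fail the filter is a policy choice assumed here to be "fail":

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class MimeCheckSketch {
    private final Set<String> allowed = new TreeSet<>();

    MimeCheckSketch(String... types) {
        for (String t : types) allowed.add(t.toLowerCase(Locale.ROOT));
    }

    // Guard against null before touching the TreeSet, which would otherwise
    // throw NullPointerException from TreeMap.getEntry on a null key.
    boolean checkMimeType(String mimeType) {
        if (mimeType == null) return false; // assumed policy: reject
        return allowed.contains(mimeType.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        MimeCheckSketch specs = new MimeCheckSketch("application/pdf", "pdf");
        System.out.println(specs.checkMimeType("application/pdf"));
        System.out.println(specs.checkMimeType("pdf"));
        System.out.println(specs.checkMimeType(null));
    }
}
```

Registering the bare "pdf" alias, as Andrew did in the job config, also covers sources like Documentum that report short-form content types.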