Re: Exporting crawler configuration easier?

2012-06-27 Thread Karl Wright
The fact that the export is a zip file does not mean it is meant to be
edited directly to change the stored information.

It sounds like the reason that you want to edit it is to remove the
passwords from the file.  Perhaps we should look at it from that point
of view and allow an export option that does not include any passwords
or something?

Karl

On Wed, Jun 27, 2012 at 7:27 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

 We have all configuration files for our search project stored in SVN, even
 our MCF crawler configuration. Each time we change our MCF settings, e.g.
 add something to the seed list, we usually export the configuration and
 commit that change to SVN.

 This can be a time-consuming process, since we have to unzip the generated
 export file in order to edit the files within it. We need to edit the output
 file, which includes the password for our Solr server.

 Then we must zip all these files in order to create a similar export file.
 The order of the files is very important. You cannot simply create a zip file
 without paying attention to the order of the included files.
 Otherwise, MCF will complain when you try to import that file later.

 Any suggestions for a smoother way to have a version-controlled
 configuration? Perhaps I should create a script which does all the steps
 mentioned above? As far as I know, it's not possible to edit the files
 directly inside a zip file from a terminal on UNIX.
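A script along those lines is straightforward. Below is a minimal Python sketch of the round trip (the password-stripping regex is hypothetical; adjust it to how the actual export stores credentials). It relies on zipfile's infolist() preserving the archive's entry order, which matters since MCF is order-sensitive on import:

```python
import re
import zipfile

def strip_passwords(src_zip, dst_zip):
    """Copy a ManifoldCF export zip, blanking password values,
    while preserving the original entry order (MCF is order-sensitive)."""
    with zipfile.ZipFile(src_zip) as zin, \
         zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():        # infolist() preserves archive order
            data = zin.read(info.filename)
            # Hypothetical rule: blank out password attributes in the XML.
            data = re.sub(rb'password="[^"]*"', rb'password=""', data)
            zout.writestr(info, data)
```

The sanitized copy could then be committed to SVN instead of the raw export.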

 Thanks,
 Erlend

 --
 Erlend Garåsen
 Center for Information Technology Services
 University of Oslo
 P.O. Box 1086 Blindern, N-0317 OSLO, Norway
 Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050



Re: Crawling behind an ISA proxy (iis 7.5)

2012-06-28 Thread Karl Wright
I was wondering if you'd picked up and tried the patch for
CONNECTORS-483.  This patch adds official proxy support for the Web
Connector.  Alternatively, you could try to build and run with trunk
code.

Karl

On Wed, May 16, 2012 at 12:12 PM, Karl Wright daddy...@gmail.com wrote:
 Hi Rene,

 The URL that is causing the RFC2617 challenge/response is being
 authenticated with basic auth, not NTLM.  This could yield a 401.  You
 may want to check the URL in a browser other than IE (Firefox, for
 instance) to see if basic auth is being used for this URL rather than
 NTLM.

 The redirection you describe to GetLogon is pretty standard practice.
 You can easily tell the web connector that that is part of the logon
 sequence by following the steps I laid out in the earlier email.

 Once you have set up what you think is the right set of logon pages,
 it's very helpful to attempt a crawl and then see what the simple
 history shows.  There are specific activities logged when logon begins
 and ends, so this is enormously helpful as a diagnostic aid.  If you
 see a continuous loop (entering logon sequence, doing stuff, exiting
 logon sequence, and repeating) then it is clear that the cookie has
 not been set.

 I won't be able to look at your packet log for a while, probably at
 least a week.

 Karl



 On Wed, May 16, 2012 at 10:23 AM, Rene Nederhand r...@nederhand.net wrote:
 Hi Karl,

 Thank you so much for putting so much time into educating a newbie. I
 appreciate your help enormously.

 I have tried to follow each of the steps below. So far it doesn't work, but I
 will continue this evening to see if I can get this thing going.

 In the meantime, I have switched the log level of the crawling process to INFO
 and found something interesting in the logs. Perhaps this could shed some
 light on my issues:

 ERROR 2012-05-16 16:04:13,581 (Thread-1019) - Invalid challenge: Basic
 org.apache.commons.httpclient.auth.MalformedChallengeException: Invalid
 challenge: Basic
 at
 org.apache.commons.httpclient.auth.AuthChallengeParser.extractParams(Unknown
 Source)
 at org.apache.commons.httpclient.auth.RFC2617Scheme.processChallenge(Unknown
 Source)
 at org.apache.commons.httpclient.auth.BasicScheme.processChallenge(Unknown
 Source)
 at
 org.apache.commons.httpclient.auth.AuthChallengeProcessor.processChallenge(Unknown
 Source)
 at
 org.apache.commons.httpclient.HttpMethodDirector.processWWWAuthChallenge(Unknown
 Source)
 at
 org.apache.commons.httpclient.HttpMethodDirector.processAuthenticationResponse(Unknown
 Source)
 at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown
 Source)
 at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
 at
 org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)

 Please note that I have set NTLM (not Basic) authentication on
 bb.helo.hanze.nl and nothing else. The error does not occur when I try to
 crawl our intranet (also with NTLM). Does this mean something? At least, I
 think it is the source of the 401 I get when looking at the simple report,
 isn't it?

 In addition, I've used Charles Proxy to monitor all interaction between my
 browser and the server. I have found that it doesn't matter which URL I use
 to enter Blackboard; they all get redirected to
 https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon. Shouldn't page-based
 authentication handle this?

 To make the information complete, I've attached the HAR file with the
 Charles Proxy output. It can be displayed
 at http://www.softwareishard.com/har/viewer/, for example. You'll be able to
 see all requests/responses when I start with a clean browser (cookies
 removed) and enter https://bb.helo.hanze.nl. Maybe this helps.

 Again, thanks a lot for your help!

 René





 On Tue, May 15, 2012 at 5:59 PM, Karl Wright daddy...@gmail.com wrote:

 Hi Rene,

 You will need both NTLM auth (page auth, which you have already set
 up), and Session auth (which you haven't yet set up).

 In order to set up session-based auth, you should first identify the
 set of pages that you want access to that are protected by a cookie
 requirement.  You will need to write a regular expression that matches
 these pages and ONLY these pages.  This regular expression gets entered as
 the URL regular expression in the Session-based Access Credentials section
 of the Access Credentials tab.  Then click the Add button.
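As an illustration of "these pages and ONLY these pages" (the host and paths below are hypothetical, and while MCF itself uses Java regular expressions, the semantics of a pattern like this are the same):

```python
import re

# Hypothetical: protect everything under /webapps/ on the Blackboard host,
# but NOT the logon pages themselves.
protected = re.compile(r"^https://bb\.example\.edu/webapps/.*$")

# A content page matches; the logon redirect target does not.
assert protected.match("https://bb.example.edu/webapps/portal/frameset.jsp")
assert not protected.match("https://bb.example.edu/CookieAuth.dll?GetLogon")
```

The key design point is that the pattern must exclude the logon pages, so the connector knows those belong to the logon sequence rather than the protected content.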

 The next thing you will need is to specify how the connector
 recognizes pages that belong to the logon sequence.  The actual
 sequence you need to understand is what happens in the browser when
 you try to access a specific protected URL and you don't have the
 right cookie.  You did not actually specify that; I think you are
 presuming that you'd be entering directly through the logon page, but
 that is not how it works.  The crawler will have a URL in mind and
 will need access to the content of that URL.  It will fetch the URL

RE: How to increase cache settings for ManifoldCF Authority Service

2012-07-04 Thread Karl Wright
It would be great if you could open a ticket to request that this cache
value be configurable, as it is in the Active Directory authority.

Karl

Sent from my Windows Phone
--
From: Anupam Bhattacharya
Sent: 7/3/2012 10:13 AM
To: user@manifoldcf.apache.org
Subject: Re: How to increase cache settings for ManifoldCF Authority Service

Many Thanks!! I changed the value and rebuilt ManifoldCF, which helped
to solve the issue.

Regards
Anupam

On Tue, Jul 3, 2012 at 4:24 PM, Shinichiro Abe
shinichiro.ab...@gmail.comwrote:

 Hi,

 I think the following is the line you would change to configure the cache lifetime.

 source:
 org.apache.manifoldcf.crawler.authorities.DCTM.AuthorityConnector.java

 protected static long responseLifetime = 60000L;  // <-- this value (1 minute, in milliseconds)


 I think the ActiveDirectoryAuthority.java code can serve as a model for this.
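What is being configured here is a simple time-based cache: an authorization response is reused until its lifetime (hard-coded to 60000 ms in the Documentum authority) expires. A minimal Python sketch of the idea, with the lifetime made configurable as requested (the class and method names are illustrative, not MCF's actual API):

```python
import time

class ResponseCache:
    """Cache authority responses for a configurable lifetime (in seconds)."""
    def __init__(self, lifetime=60.0, clock=time.monotonic):
        self.lifetime = lifetime
        self.clock = clock
        self._entries = {}              # user -> (expiry_time, tokens)

    def get(self, user):
        entry = self._entries.get(user)
        if entry and entry[0] > self.clock():
            return entry[1]             # still fresh: reuse cached tokens
        return None                     # expired or absent: caller must refetch

    def put(self, user, tokens):
        self._entries[user] = (self.clock() + self.lifetime, tokens)
```

Raising the lifetime (e.g. to 3600 seconds) trades token freshness for fewer round trips to the repository, which is exactly the performance tradeoff discussed in this thread.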

 Regards,
 Shinichiro Abe

 On 2012/07/03, at 19:44, Anupam Bhattacharya wrote:

  Sorry i didn't mention that clearly.
 
  I was just trying to figure out from the SVN code where the 1-minute
 timeout is implemented.
  My best guess is that the following line is what implements the 1-minute
 timeout, in
 http://svn.apache.org/repos/asf/manifoldcf/trunk/framework/pull-agent/src/main/java/org/apache/manifoldcf/crawler/system/ExpireStufferThread.java
 
// If there are no documents at all, then we can sleep for a while.
  // The theory is that we need to allow stuff to
 accumulate.
  if (descs.length == 0)
  {
      ManifoldCF.sleep(60000L);  // 1 minute
continue;
}
 
  Please confirm whether I am heading in the right direction.
 
  Thanks
  Anupam
 
  On Tue, Jul 3, 2012 at 3:51 PM, Shinichiro Abe 
 shinichiro.ab...@gmail.com wrote:
  Hi,
  Oh sorry, I was talking about the Active Directory authority service.
  Currently there is no place to configure the cache lifetime
  in the Documentum authority service.
 
  Shinichiro Abe
  On 2012/07/03, at 19:07, Anupam Bhattacharya wrote:
 
   Hi,
  
   I am using ManifoldCF for the Documentum repository and Documentum authority
  services. How can I configure the cache lifetime settings in this case, when
  Active Directory is not present?
  
   Regards
   Anupam
  
   On Tue, Jul 3, 2012 at 3:32 PM, Shinichiro Abe 
 shinichiro.ab...@gmail.com wrote:
   Hi,
  
Can I configure these timeout settings value to anything like 60min
 or 1 day etc ?
    Yes, you can configure the cache lifetime,
 i.e. how long tokens are cached after the user's last access to
  Active Directory.
    I think this value might as well be set to about 60 min; 1 day is too
  long.
  
   Regards,
   Shinichiro Abe
  
   On 2012/07/03, at 18:15, Anupam Bhattacharya wrote:
  
Hello Karl,
   
 First of all, congratulations on ManifoldCF's graduation to an Apache
  top-level project, and thanks for all the help you provided previously,
  during my development, through this forum.
   
 I have recently come across a performance problem due to the ManifoldCF
  authority service. After including the authority service, the query response
  times increase a lot. After doing some inspection I found that ManifoldCF
  doesn't cache user tokens for longer than 1 minute. (
  http://search-lucene.com/m/YqXPHki0Dv/v=threaded).
   
 Can I configure this timeout setting to something like 60 minutes
  or 1 day, etc.?
   
Regards
Anupam
   
   
   
  
 



[ANNOUNCE] ManifoldCF 0.6 is released!

2012-07-16 Thread Karl Wright
I'd like to announce the release of ManifoldCF 0.6.  The list of
changes can be found at
https://svn.apache.org/repos/asf/manifoldcf/branches/release-0.6-branch/CHANGES.txt.
 Congratulations to all involved!

Karl


Re: How to import data from Oracle to Solr

2012-07-17 Thread Karl Wright
Hi Wolfgang,

ManifoldCF is meant to handle a binary document and its metadata.  You
must provide the document.  Metadata is optional.

The JDBC connector does not currently support metadata.  In order to
index this, therefore, you will need to decide what should go into
your binary document from your database fields.  You can append
together multiple fields into one document by means of SQL, e.g. the
CONCAT operator or its Oracle equivalent.  This would go into one
field in Solr, then, which is what you'd search on.

Alternatively, if you really need separate indexed fields in Solr for
search reasons, you can request a JDBC connector enhancement to add
metadata support.  You'd still need a binary document, although you
could return a blank value for that.

So I guess the answer depends on what you are trying to do on the whole.

Karl


On Tue, Jul 17, 2012 at 6:27 AM, Wolfgang Schreiber
wolfgang.schrei...@isb-ag.de wrote:
 Hello,

 we are trying to ingest data from an Oracle database into Solr.
 We managed to insert docs into Solr, but only the document IDs are inserted;
 no other data fields come through.

 Can you provide an example of how to set up the import job in ManifoldCF?


 Assume we have the following initial situation:

 1) Our Oracle table looks something like:

 ADDRESS
 --
 ID  NUMBER
 ZIP NUMBER
 CITY    VARCHAR(2)
 STREET  VARCHAR(2)


 2) In Solr's schema.xml we added the following fields for the database
 columns
 ...
 <field name="ZIP" type="int" indexed="true" stored="true" />
 <field name="City" type="string" indexed="true" stored="true" />
 <field name="Street" type="string" indexed="true" stored="true" />
 ...


 So here are our questions:

 * How do we have to set up the queries for the ManifoldCF job?
   In particular, what exactly must the seeding query and the data query look
 like?

 * What do the Solr field mappings look like?


 We read your online documentation as well as your MEAP book, but could not
 find a working example of a successful import from Oracle to Solr.
 Any help is welcome!

 Best regards
 Wolfgang


Re: How to import data from Oracle to Solr

2012-07-18 Thread Karl Wright
The way you create an enhancement request is through Jira, at
https://issues.apache.org/jira.  Just create a request for an
improvement, and be sure to list any specific details that are
important to you.

Thanks,
Karl

On Wed, Jul 18, 2012 at 5:46 AM, Wolfgang Schreiber
wolfgang.schrei...@isb-ag.de wrote:
 Hi Karl,
 hi ManifoldCF team members,


 Using Solr's copyField element we managed to create separate fields for the
 different database columns:

 <field name="city" type="cityType" indexed="true" stored="true" />
 ...
 <copyField source="text" dest="city"/>
 ...
 <fieldType name="cityType" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.PatternTokenizerFactory"
                pattern=".+city:(.+);.*" group="1" />
   </analyzer>
 </fieldType>

 Anyhow, this solution has some drawbacks; e.g., the newly created fields are
 all text fields.
 In particular, numeric and date fields are also copied to text fields, so we
 cannot use Solr's type-specific functions.

 So coming back to the offer in your first mail: Is it possible that you
 create a JDBC connector enhancement to support metadata?
 Is there a special request process we must follow?

 Best regards
 Wolfgang




 -Ursprüngliche Nachricht-
 Von: Karl Wright [mailto:daddy...@gmail.com]
 Gesendet: Di 17.07.2012 15:13
 An: user@manifoldcf.apache.org
 Betreff: Re: How to import data from Oracle to Solr

 So if I understand correctly ...

 1) ... all mappings added to the Solr Field Mapping tab are ignored in case
 of a JDBC resource connector?

 Not exactly - the mappings aren't ignored, there just isn't any
 metadata associated with a JDBC connector document, so the mappings
 never apply.

 Regardless, I am glad you got the rest worked out.

 Karl


 On Tue, Jul 17, 2012 at 9:09 AM, Wolfgang Schreiber
 wolfgang.schrei...@isb-ag.de wrote:
 Hello Karl,

 thank you very much for your quick answer!

 So if I understand correctly ...

 1) ... all mappings added to the Solr Field Mapping tab are ignored in
 case
 of a JDBC resource connector?

  2) Our data query must look something like this (given that || is Oracle's
  concatenation operator):
SELECT ID AS $(IDCOLUMN), ADDRESS_URL AS $(URLCOLUMN),
'ZIP:' || ZIP || ';city:' || CITY || ';street:' || STREET
AS $(DATACOLUMN) FROM ADDRESS WHERE ID IN $(IDLIST)

    This would result in DATACOLUMN values like:
ZIP:70173;City:Stuttgart;Street:Heilbronner

 We tried this statement and we got the data into the text field of our Solr
 index.
 It seems we are one step further!

 Thank you for your help! Best regards
 Wolfgang


 -Ursprüngliche Nachricht-
 Von: Karl Wright [mailto:daddy...@gmail.com]
 Gesendet: Di 17.07.2012 12:42
 An: user@manifoldcf.apache.org
 Betreff: Re: How to import data from Oracle to Solr

 Hi Wolfgang,

 ManifoldCF is meant to handle a binary document and its metadata.  You
 must provide the document.  Metadata is optional.

 The JDBC connector does not currently support metadata.  In order to
 index this, therefore, you will need to decide what should go into
 your binary document from your database fields.  You can append
 together multiple fields into one document by means of SQL, e.g. the
 CONCAT operator or its Oracle equivalent.  This would go into one
 field in Solr, then, which is what you'd search on.

 Alternatively, if you really need separate indexed fields in Solr for
 search reasons, you can request a JDBC connector enhancement to add
 metadata support.  You'd still need a binary document, although you
 could return a blank value for that.

 So I guess the answer depends on what you are trying to do on the whole.

 Karl


 On Tue, Jul 17, 2012 at 6:27 AM, Wolfgang Schreiber
 wolfgang.schrei...@isb-ag.de wrote:
 Hello,

 we are trying to ingest data from an Oracle database into Solr.
 We managed to insert docs into Solr but only document IDs are inserted and
 no
 other data fields.

 Can you provide an example how to setup the import job in ManifoldCF ?


 Assume we have the following initial situation:

 1) Our Oracle table looks something like:

 ADDRESS
 --
 ID  NUMBER
 ZIP NUMBER
 CITYVARCHAR(2)
 STREET  VARCHAR(2)


 2) In Solr's schema.xml we added the following fields for the database
 columns
 ...
 field name=ZIP type=int indexed=true stored=true /
 field name=City type=string indexed=true stored=true /
 field name=Street type=string indexed=true stored=true /
 ...


 So here are our questions:

 * How do we have to setup the queries for the ManifoldCF job?
   In particular how exactly must the seeding query and the data query look
 like?

 * How do the Solr field mappings look like?


 We read your online documentation as well as your MEAP book but could not
 find a workíng example for a successful import between Oracle and Solr.
 Any help is welcome!

 Best regards
 Wolfgang




Re: Repeated service interruptions

2012-07-19 Thread Karl Wright
Hi Abe-san,

Sometimes what looks like a server error can actually be due to the
domain controller.  I wonder if the domain controller needs to be
rebooted?

Karl

On Thu, Jul 19, 2012 at 5:12 AM, Shinichiro Abe
shinichiro.ab...@gmail.com wrote:
 Hi Karl,
 Thank you for the reply.
  I tried reducing the maximum number of connections from 10
  to 5, but that didn't avoid the busy error. I'll try reducing it further.
 Thank you.
 Shinichiro Abe

 On 2012/07/19, at 15:55, Karl Wright wrote:

 Hi Abe-san,

  The "all pipe instances are busy" error is coming from the Windows
  server you are trying to crawl.  I don't know what is happening there
 but here are some possibilities:

 (1) The Windows server is just overloaded; you can try reducing the
 maximum number of connections to 2 or 3 to see if that helps.
 (2) The Windows server needs rebooting.

 Thanks,
 Karl

 On Wed, Jul 18, 2012 at 10:09 PM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Hi,

  I used the Windows Shares connector and ran a job.
  The job aborted without completing normally, and the job's status said:
  Error: Repeated service interruptions - failure processing document: Read
  timed out

  Why was the job aborted? I use ManifoldCF 0.5.1 and the latest version's
  jcifs.jar.
  Is the crawled server busy? The server on which MCF is installed does not
  seem to be busy, but the other servers that MCF crawls do seem to be busy.
  How can I run the job without error? What's wrong?


 the logs of connector:

 WARN 2012-07-12 16:28:52,648 (Worker thread '19') - JCIFS: Possibly 
 transient exception detected on attempt 1 while getting share security: All 
 pipe instances are busy.
at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
 ..
 WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: Possibly 
 transient exception detected on attempt 3 while getting share security: All 
 pipe instances are busy.
 ..
 WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: 'Busy' response 
 when getting document version for 
 smb://XX.XX.XX.XX/D$/abcde/1234/123456789/e123456789a.pdf: retrying...
 ..
 WARN 2012-07-12 16:36:37,585 (Worker thread '19') - Pre-ingest service 
 interruption reported for job 1342076182624 connection 'Windows shares': 
 Timeout or other service interruption: All pipe instances are busy.
 ..
 WARN 2012-07-12 19:14:30,335 (Worker thread '19') - Service interruption 
 reported for job 1342076182624 connection 'Windows shares': Ingestion API 
 socket timeout exception waiting for response code: Read timed out; 
 ingestion will be retried again later
 ..
 WARN 2012-07-12 20:43:50,210 (Worker thread '19') - Service interruption 
 reported for job 1342076182624 connection 'Windows shares': Ingestion API 
 socket timeout exception waiting for response code: Read timed out; 
 ingestion will be retried again later
 ..
 ERROR 2012-07-12 20:43:50,210 (Worker thread '19') - Exception tossed: 
 Repeated service interruptions - failure processing document: Read timed out
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service 
 interruptions - failure processing document: Read timed out
at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:606)
 Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at 
 org.apache.manifoldcf.agents.output.solr.HttpPoster.readLine(HttpPoster.java:571)
at 
 org.apache.manifoldcf.agents.output.solr.HttpPoster.getResponse(HttpPoster.java:598)

 Thanks in advance,
 Shinichiro Abe








RE: How to import data from Oracle to Solr

2012-07-21 Thread Karl Wright
Hi Wolfgang,

Looking at the code, it turns out I was wrong about metadata support
not being there in the connector.  Sorry for the confusion.

The way it works is that any column returned by the data query that is
not a required return column is considered to be metadata, with a field
name corresponding to the return column name.  So in addition to
returning a URL and a binary value, you can also return any
single-valued metadata you need.

Please let me know if this works for you.

Karl

Sent from my Windows Phone

-Original Message-
From: Wolfgang Schreiber
Sent: 7/20/2012 7:47 AM
To: user@manifoldcf.apache.org
Subject: AW: How to import data from Oracle to Solr


Re: Repeated service interruptions

2012-07-24 Thread Karl Wright
Hi Abe-san,

Did you figure out what the problem was?

Karl

On Thu, Jul 19, 2012 at 5:52 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Abe-san,

 Sometimes what looks like a server error can actually be due to the
 domain controller.  I wonder if the domain controller needs to be
 rebooted?

 Karl

 On Thu, Jul 19, 2012 at 5:12 AM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Hi Karl,
 Thank you for the reply.
 I tried to reduce maximum number of connections from 10
 to 5, but didn't  avoid busy error. I'll try to reduce more.
 Thank you.
 Shinichiro Abe

 On 2012/07/19, at 15:55, Karl Wright wrote:

 Hi Abe-san,

 The all pipe instances are busy error is coming from the Windows
 server you are trying to crawl.  I don't know what is happening there
 but here are some possibilities:

 (1) The Windows server is just overloaded; you can try reducing the
 maximum number of connections to 2 or 3 to see if that helps.
 (2) The Windows server needs rebooting.

 Thanks,
 Karl

 On Wed, Jul 18, 2012 at 10:09 PM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Hi,

 I use windows shares connector and ran a job.
 The job was aborted without done normally and the job's status said:
 Error: Repeated service interruptions - failure processing document: Read 
 timed out

 Why was the job aborted? I use ManifoldCF 0.5.1 and the latest version's 
 jcifs.jar.
 Is the crawled server busy? I think the server MCF is installed seems not 
 to be busy,
 the other servers in which MCF will crawls seem to be busy.
 How can I run the job without error? What's wrong?


 the logs of connector:

 WARN 2012-07-12 16:28:52,648 (Worker thread '19') - JCIFS: Possibly 
 transient exception detected on attempt 1 while getting share security: 
 All pipe instances are busy.
at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
 ..
 WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: Possibly 
 transient exception detected on attempt 3 while getting share security: 
 All pipe instances are busy.
 ..
 WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: 'Busy' response 
 when getting document version for 
 smb://XX.XX.XX.XX/D$/abcde/1234/123456789/e123456789a.pdf: retrying...
 ..
 WARN 2012-07-12 16:36:37,585 (Worker thread '19') - Pre-ingest service 
 interruption reported for job 1342076182624 connection 'Windows shares': 
 Timeout or other service interruption: All pipe instances are busy.
 ..
 WARN 2012-07-12 19:14:30,335 (Worker thread '19') - Service interruption 
 reported for job 1342076182624 connection 'Windows shares': Ingestion API 
 socket timeout exception waiting for response code: Read timed out; 
 ingestion will be retried again later
 ..
 WARN 2012-07-12 20:43:50,210 (Worker thread '19') - Service interruption 
 reported for job 1342076182624 connection 'Windows shares': Ingestion API 
 socket timeout exception waiting for response code: Read timed out; 
 ingestion will be retried again later
 ..
 ERROR 2012-07-12 20:43:50,210 (Worker thread '19') - Exception tossed: 
 Repeated service interruptions - failure processing document: Read timed 
 out
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated 
 service interruptions - failure processing document: Read timed out
at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:606)
 Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at 
 org.apache.manifoldcf.agents.output.solr.HttpPoster.readLine(HttpPoster.java:571)
at 
 org.apache.manifoldcf.agents.output.solr.HttpPoster.getResponse(HttpPoster.java:598)

 Thanks in advance,
 Shinichiro Abe








Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5

2012-07-29 Thread Karl Wright
There should be no differences between crawling using MySQL as the
database and PostgreSQL, on the same version of ManifoldCF.

We include an RSS crawling test which finds exactly the expected
number of documents on MySQL.  This is a 100,000 document crawl.
There are no back-end-specific logic differences in the web connector
that would be expected to yield different results based on the
back-end database.

If you believe you have found a difference between MySQL and
PostgreSQL, I suggest the following:

(1) Make sure that the repository connections and job definitions are
indeed identical between MySQL and PostgreSQL.
(2) See if you can locate an example document that was crawled with
PostgreSQL but not crawled with MySQL.
(3) If you create a second web connection and job under MySQL, and run
the job to completion, does the document that was not included get
skipped again?  Or does it seem random which documents are skipped on
each run?
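Step (2) above is essentially a set difference over the two document reports. A sketch, assuming you export the crawled URLs from each run into a text file, one URL per line (the file names are hypothetical):

```python
def crawled_only_in_first(report_a, report_b):
    """Return URLs present in crawl A's document report but missing from B's."""
    with open(report_a) as fa, open(report_b) as fb:
        return sorted(set(fa.read().split()) - set(fb.read().split()))
```

Inspecting a few of the URLs this returns (e.g. from the PostgreSQL run but not the MySQL run) would show whether the missing documents share a pattern or look random.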

Thanks,
Karl



On Sun, Jul 29, 2012 at 9:51 PM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:
  Aren't there some differences in crawling logic between MySQL and
  PostgreSQL?



 I did some tests on web crawling using both of MySQL and PostgreSQL.





  MCF0.5 running on MySQL indexed around 6000 documents, while MCF0.5 running on
  PostgreSQL indexed over 12000 documents.

  MCF0.6 running on MySQL indexed around 6000. MCF0.4 running on PostgreSQL
  indexed over 12000 documents.





  Each indexed-document count above is the result of a first crawl after
  deleting the indexing history from the DB.

  It seems that changing the DB affects crawling and indexing.



 Regards,

 Shigeki

 2012/7/27 Karl Wright daddy...@gmail.com

 There was a bug fixed in the way hopcount was being computed.  See
 CONNECTORS-464.

 This means that fewer documents are left in the queue, but the number
 of indexed documents should be the same.

 Karl

 On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 
  Hi guys.
 
 
   I wonder if anyone has ever seen, when web crawling, that the
   number of crawled documents differs between MCF0.4 and MCF0.5.
 
 
   I crawled some portal sites on an intranet using MCF0.4 and MCF0.5.
  MCF0.4 crawled over 12000 contents, and meanwhile, MCF0.5 crawled only
  around half of the contents.
  I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
  I hope changing DB does not affect the crawling results:
 
 
  MCF0.4:
- Crawled Counts: 12000 and over
- Solr3.5
- PostgreSQL 9.1.3
- Tomcat6
- Max Hop on Links: 15
- Max Hop on Redirects: 10
- Include only hosts matching seeds: Checked
- org.apache.manifoldcf.crawler.threads: 50
- org.apache.manifoldcf.database.maxhandles: 100
 
 
  MCF0.5:
- Crawled Counts: around 6000
- Solr3.5
- MySQL5.5
- Tomcat6
- Max Hop on Links: 15
- Max Hop on Redirects: 10
- Include only hosts matching seeds: Checked
- org.apache.manifoldcf.crawler.threads: 50
- org.apache.manifoldcf.database.maxhandles: 100
 
 
  Does anyone have any ideas?
 




 --
 
   SoftBank Mobile Corp.
   Information Systems Division
   System Services Business Unit
   Service Planning Department

   Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp
 





Re: Repeated service interruptions

2012-08-01 Thread Karl Wright
On Wed, Aug 1, 2012 at 5:48 AM, Shinichiro Abe
shinichiro.ab...@gmail.com wrote:
 Hi Karl,

 I still have a problem.
 I reduced the maximum number of connections to 2.
 I rebooted the file server, not the domain controller.
 When I configured the paths[1], the log showed no errors
 and the ShareDrive connector crawled the files successfully.
 When I left the path configuration at the default (matching *),
 the log showed the "all pipe instances are busy" error.
 Both path configurations pointed to the same location.

 Also, when this error occurred, the ingestion log showed that
 HttpPoster was waiting for the response stream, couldn't get a
 response from Solr, and threw a SocketTimeoutException.
 I increased jcifs.smb.client.responseTimeout,
 but the exception was still thrown.
 On the Solr side, Jetty threw a SocketException (socket write error).
 I'm working on checking the Solr logs.
 Solr may be doing something wrong when running /update/extract.


If Solr threw the exception this sounds likely.

 Do you know something like this?
 Does path's matching config affect those errors?

 [1]Paths Tab:
 Include  directory(s)  matching  /01*


This should have nothing to do with socket exceptions, except possibly
that the crawler winds up trying to read a file that isn't actually a
file but is something else, like a named pipe or something.  This
typically doesn't happen if the server is a Windows machine but if it
is a Samba server I could imagine something like that happening.
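One way to check that theory is to test the file type of each entry before reading it. A local-filesystem sketch in Python (for SMB paths, jcifs's SmbFile offers an analogous isFile() check):

```python
import os
import stat

def is_regular_file(path):
    """True only for plain files; not directories, sockets, or named pipes."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False                    # unreadable or nonexistent entry
    return stat.S_ISREG(mode)
```

Walking the share and listing entries for which this returns False would show whether the crawler is indeed stumbling over something that isn't a regular file.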

Karl

 P.S.
  Thank you for fixing CONNECTORS-494.
  I checked the trunk code and it worked well.

 Thank you,
 Shinichiro Abe

 On 2012/07/24, at 22:13, Karl Wright wrote:

 Hi Abe-san,

 Did you figure out what the problem was?

 Karl

 On Thu, Jul 19, 2012 at 5:52 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Abe-san,

 Sometimes what looks like a server error can actually be due to the
 domain controller.  I wonder if the domain controller needs to be
 rebooted?

 Karl

 On Thu, Jul 19, 2012 at 5:12 AM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Hi Karl,
 Thank you for the reply.
 I tried to reduce maximum number of connections from 10
 to 5, but didn't  avoid busy error. I'll try to reduce more.
 Thank you.
 Shinichiro Abe

 On 2012/07/19, at 15:55, Karl Wright wrote:

 Hi Abe-san,

 The all pipe instances are busy error is coming from the Windows
 server you are trying to crawl.  I don't know what is happening there
 but here are some possibilities:

 (1) The Windows server is just overloaded; you can try reducing the
 maximum number of connections to 2 or 3 to see if that helps.
 (2) The Windows server needs rebooting.

 Thanks,
 Karl

 On Wed, Jul 18, 2012 at 10:09 PM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Hi,

 I use windows shares connector and ran a job.
 The job was aborted without done normally and the job's status said:
 Error: Repeated service interruptions - failure processing document: 
 Read timed out

 Why was the job aborted? I use ManifoldCF 0.5.1 and the latest version of 
 jcifs.jar.
 Is the crawled server busy? The server on which MCF is installed does not 
 seem to be busy, but the other servers that MCF crawls seem to be busy.
 How can I run the job without this error? What's wrong?


 the logs of connector:

 WARN 2012-07-12 16:28:52,648 (Worker thread '19') - JCIFS: Possibly 
 transient exception detected on attempt 1 while getting share security: 
 All pipe instances are busy.
   at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
   at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
 ..
 WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: Possibly 
 transient exception detected on attempt 3 while getting share security: 
 All pipe instances are busy.
 ..
 WARN 2012-07-12 16:36:37,585 (Worker thread '19') - JCIFS: 'Busy' 
 response when getting document version for 
 smb://XX.XX.XX.XX/D$/abcde/1234/123456789/e123456789a.pdf: retrying...
 ..
 WARN 2012-07-12 16:36:37,585 (Worker thread '19') - Pre-ingest service 
 interruption reported for job 1342076182624 connection 'Windows shares': 
 Timeout or other service interruption: All pipe instances are busy.
 ..
 WARN 2012-07-12 19:14:30,335 (Worker thread '19') - Service interruption 
 reported for job 1342076182624 connection 'Windows shares': Ingestion 
 API socket timeout exception waiting for response code: Read timed out; 
 ingestion will be retried again later
 ..
 WARN 2012-07-12 20:43:50,210 (Worker thread '19') - Service interruption 
 reported for job 1342076182624 connection 'Windows shares': Ingestion 
 API socket timeout exception waiting for response code: Read timed out; 
 ingestion will be retried again later
 ..
 ERROR 2012-07-12 20:43:50,210 (Worker thread '19') - Exception tossed: 
 Repeated service interruptions - failure processing document: Read timed 
 out
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated 
 service interruptions - failure processing document: Read timed out
   at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:606

Re: SharePoint Library consist of folders

2012-08-02 Thread Karl Wright
In that case, you will need to wait until CONNECTORS-492 is resolved.
Because of SharePoint's lack of support for accessing large libraries
via the Lists service, we're having to write our own.  But this is not
yet ready, although we are getting closer to trying it out soon.

Karl

On Thu, Aug 2, 2012 at 9:57 AM, Ahmet Arslan iori...@yahoo.com wrote:
 The DspSts service lists all the files in a library,
 including those
 that have a folder path.  I believe the lists service
 does the same.
 So you should see all the files crawled including those that
 are
 within folders.  Please let me know if this seems not
 to be the case.

 It seems that the lists service does not list files that have a folder path. 
 Instead it lists folder paths. Currently MCF injects one SolrDocument per 
 folder, and sends all files under it to the extracting update handler.

 I was reading this : 
 http://sympmarc.com/2011/03/28/listing-folders-in-a-sharepoint-list-or-library-with-spservices/

 I am attaching a response example that contains 9 items (1 file and 8 
 folders).


Re: SharePoint Library consist of folders

2012-08-03 Thread Karl Wright
I checked this change into trunk, and added also corresponding code in
the place where fields and metadata are fetched.  This may work for
you in the interim while we're finishing up CONNECTORS-492.

Karl

On Fri, Aug 3, 2012 at 8:11 AM, Ahmet Arslan iori...@yahoo.com wrote:
 Hello,

 I found that there is a queryOptions parameter for this: 
 <ViewAttributes Scope="Recursive" />.

 http://msdn.microsoft.com/en-us/library/lists.lists.getlistitems.aspx

 If I add these three lines to SPSProxyHelper#buildPagingQueryOptions()

 MessageElement viewAttributesNode = new 
 MessageElement((String)null, "ViewAttributes");
 queryOptionsNode.addChild(viewAttributesNode);
 viewAttributesNode.addAttribute(null, "Scope", "Recursive");

 return rval;

 SPSProxyHelper#getDocuments() returns expected results:

 /Documents/Vekaletname.pdf, 
 /Documents/ik_docs/diger/diger_dilekceler/aile_yardimi_almaz_dilecesi.doc, 
 /Documents/ik_docs/diger/fonksiyonel_ekipman_talep_formu.doc, ...

 But the SPSProxyHelper#getFieldValues() method works only for 
 docId=/Documents/Vekaletname.pdf and returns an empty map for the others. Therefore only 
 that one was injected.

 --- On Thu, 8/2/12, Ahmet Arslan iori...@yahoo.com wrote:

 From: Ahmet Arslan iori...@yahoo.com
 Subject: RE: SharePoint Library consist of folders
 To: user@manifoldcf.apache.org
 Date: Thursday, August 2, 2012, 10:38 PM
  This replaces the getlistitems
 call
  in spsproxyhelper with a custom
  method call.

 Once 492 is in place, is it going to list files that have a folder
 path too, without checking the value of the ows_FSObjType
 attribute?





Re: Document Security Modification Requirement during Indexing

2012-08-13 Thread Karl Wright
Well, you can either modify the document's acls in the Tika pipeline
(which I think would be easiest), or you can hack up the Apache
ManifoldCF Solr Plugin.  Those seem like your only real choices to me.
 I would choose the former since Tika is meant to be configured in
this way.

Karl

On Tue, Aug 14, 2012 at 12:44 AM, Anupam Bhattacharya
anupam...@gmail.com wrote:
 In our application there is a requirement to change the security on the
 document in the index/search app vs. the Documentum repository, so that users who
 don't have login access to the Documentum system can also view certain
 documents in the world browse permission scenario.

 An additional constraint is that we cannot change the ACLs on the Documentum
 Repository, and the ManifoldCF Authority service should work as it is.

 I can think of 2 options to approach this case.

 1. I have a separate SOLR servlet which is indexing documents via
 ManifoldCF to SOLR. This is one place where I can do some modifications
 to add read security tokens to the special documents.
 2. I need to do some modifications in the ManifoldCF Authority Service
 Connector so that those special documents don't get filtered.

 Thanks for any help on this requirement.

 Regards
 Anupam




Re: Crawling MySQL with latest MySQL connector fails

2012-08-20 Thread Karl Wright
There's some online chatter about this.  Apparently the JDBC 4.0
specification was clarified in this regard, and MySQL's implementation
follows the clarified meaning.  The recommended approach during this
transition period is to allow the user to select which method they
want to use to get the column name.  See CONNECTORS-509.

Karl

On Mon, Aug 20, 2012 at 8:00 AM, Karl Wright daddy...@gmail.com wrote:
 Here's some additional info.

 The JDBC class ResultSetMetaData has two methods: getColumnName(),
 and getColumnLabel().  For all supported databases, getColumnName()
 returns the right thing, EXCEPT for MySQL, where you have to use
 getColumnLabel() instead.
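The distinction can be demonstrated without a live database by stubbing ResultSetMetaData with a dynamic proxy (a sketch only; the class names and the useLabel flag are hypothetical, not ManifoldCF's actual code):

```java
import java.lang.reflect.Proxy;
import java.sql.ResultSetMetaData;

public class ColumnLookupDemo {
    // A stub that mimics a post-JDBC-4.0 MySQL driver: getColumnName()
    // returns the underlying column, getColumnLabel() returns the alias.
    static ResultSetMetaData mysqlStyleMetaData(String name, String label) {
        return (ResultSetMetaData) Proxy.newProxyInstance(
            ResultSetMetaData.class.getClassLoader(),
            new Class<?>[]{ResultSetMetaData.class},
            (proxy, method, args) -> {
                switch (method.getName()) {
                    case "getColumnName":  return name;
                    case "getColumnLabel": return label;
                    case "getColumnCount": return 1;
                    default: throw new UnsupportedOperationException(method.getName());
                }
            });
    }

    // The conditional lookup described above: let the caller choose
    // which accessor to trust.
    static String resultColumn(ResultSetMetaData md, int i, boolean useLabel)
            throws Exception {
        return useLabel ? md.getColumnLabel(i) : md.getColumnName(i);
    }

    public static void main(String[] args) throws Exception {
        // Simulates "SELECT idfield AS idcol": the connector needs "idcol".
        ResultSetMetaData md = mysqlStyleMetaData("idfield", "idcol");
        System.out.println(resultColumn(md, 1, false)); // idfield -- alias lost
        System.out.println(resultColumn(md, 1, true));  // idcol   -- alias found
    }
}
```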

 I've abstracted the logic that does this in the main database classes
 that underlie the framework, but for the JDBC Connector I tried to
 make the connector be independent of any special logic, and instead
 make it the responsibility of the query writer to know how to work
 with their database.  Unfortunately, that strategy is failing when it
 comes to MySQL because the JDBC driver is implemented in a way that is
 inconsistent with the specification.

 So, we have two ways forward:

 (1) Change the logic to use getColumnLabel() always.  If we do this,
 it will be necessary to test the JDBC connector against a PostgreSQL,
 MySQL, and MSSQL database before we know what the effects are.  It is
 possible everything will just work, but it is also possible that such
 a change would break other people's jobs, and that would be no good.

 (2) Try to conditionalize the logic so that only for MySQL is
 getColumnLabel() used.  This is less risky but results in messy code
 that sooner or later would become unmaintainable.

 Karl


 On Mon, Aug 20, 2012 at 6:22 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Shigeki,

 This is critical functionality for ManifoldCF.  Quite a lot of
 ManifoldCF stuff won't work on MySQL if this is broken - not just
 crawling using the JDBC connector.  Are you successfully crawling with
 MySQL as the back-end?  If you are, that means that there is a way to
 do this right but the JDBC connector is not using it.

 I am testing with MySQL JDBC connector 5.1.18 here, which would
 indicate that that is the case.

 Could you open a ticket describing the problem, and I will look into
 this in some detail tonight?  Thanks,
 Karl


 On Mon, Aug 20, 2012 at 4:21 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi guys.


 I am not sure if everyone has already noticed this, but this is to share an
 experimental fact of using MySQL connectors to crawl MySQL data.

 Using AS in SELECT queries in SeedQuery and DataQuery causes an error
 depending on the version of the MySQL connector.

 Env:
 - ManifoldCF0.5
 - Solr3.6
 - MySQL5.5

 Example:

  SeedQuery:SELECT idfield AS $(IDCOLUMN) FROM documenttable

 Error Message:
   Bad seed query; doesn't return $(IDCOLUMN) column. Try using quotes around
 the $(IDCOLUMN) variable, e.g. "$(IDCOLUMN)".
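Taking the error message's own advice, the quoted form of the seed query would look like this (illustrative; whether MySQL accepts a double-quoted alias depends on the ANSI_QUOTES SQL mode):

```sql
SELECT idfield AS "$(IDCOLUMN)" FROM documenttable
```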

 Cause of Error:
  MySQL connectors above version 5.1 seem to have a bug that causes an error
 when you use AS in SELECT to put an alias on a column.

 Versions of MySQL Connector:
  mysql-connector-java-5.0.8.jar  - OK
  mysql-connector-java-5.1.18.jar - No Good
  mysql-connector-java-5.1.21.jar - No Good

 Exception:
 Using a function (e.g. sysdate() AS ...) or a fixed string (e.g. 'fixed string'
 AS ...) does not cause the error.

 Regards,

 Shigeki


Re: Job crawling SharePoint repository does not end

2012-09-04 Thread Karl Wright
You will need the SharePoint-2010 plugin, also.  You can check that out at:

https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2010/trunk

... and follow the README.txt directions.

Thanks!
Karl


On Tue, Sep 4, 2012 at 6:31 AM, Swapna Vuppala
swapna.kollip...@gmail.com wrote:
 Hi Karl,

 Yes, this is SharePoint 2010
 OK, then I'll try switching to trunk and start working with it. Thanks for
 the information, Karl.

 Thanks and Regards,
 Swapna.


 On Tue, Sep 4, 2012 at 3:44 PM, Karl Wright daddy...@gmail.com wrote:

 Hi - What version of SharePoint are you trying to crawl?

 If this is SharePoint 2010, development is underway and you will have
 to use trunk.

 Karl

 On Tue, Sep 4, 2012 at 5:26 AM, Swapna Vuppala
 swapna.kollip...@gmail.com wrote:
  Hi,
 
  Am trying to use SharePoint connector of ManifoldCF for the first time
  and
  am having couple of issues. Can someone please help me in successfully
  crawling these repositories ?
 
  Am using ManifoldCF version 0.6 and I see that the SharePoint connector
  is
  readily available for use. I have defined a Repository Connection of
  SharePoint type for the URL
  https://mysite.arup.com/personal/swapna_vuppala/default.aspx; and the
  connection status shows Connection working.
 
  I have got a couple of documents in the libraries Shared Documents and
  Personal Documents and am interested in indexing them into Solr. Now
  when
  I try to define a job using the above created repository connection and
  a
  Solr output connection, am able to add rules to include the libraries I
  have
  got. When I start the job, the number listed in Documents column is
  coming
  correctly, but the job never ends. It is always in the Running state.
  I
  cannot see anything in Simple History except the Job Start.
 
  The manifoldcf log file shows something like WARN 2012-09-04
  14:39:05,204
  (Worker thread '1') - Service interruption reported for job
  1346736412103
  connection 'Test SharePoint': Remote procedure exception: Request is
  empty.
 
  Can someone please tell me if am missing some steps or configuration of
  something ??
 
  Thanks and Regards,
  Swapna.




Re: Job crawling SharePoint repository does not end

2012-09-04 Thread Karl Wright
Also, please be certain to look at CONNECTORS-492, which applies to
SharePoint 2010.  It may not affect you, but if it does, bear in mind
we have not completed development on it yet.

Karl

On Tue, Sep 4, 2012 at 6:48 AM, Karl Wright daddy...@gmail.com wrote:
 You will need the SharePoint-2010 plugin, also.  You can check that out at:

 https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2010/trunk

 ... and follow the README.txt directions.

 Thanks!
 Karl


 On Tue, Sep 4, 2012 at 6:31 AM, Swapna Vuppala
 swapna.kollip...@gmail.com wrote:
 Hi Karl,

 Yes, this is SharePoint 2010
 OK, then I'll try switching to trunk and start working with it. Thanks for
 the information, Karl.

 Thanks and Regards,
 Swapna.


 On Tue, Sep 4, 2012 at 3:44 PM, Karl Wright daddy...@gmail.com wrote:

 Hi - What version of SharePoint are you trying to crawl?

 If this is SharePoint 2010, development is underway and you will have
 to use trunk.

 Karl

 On Tue, Sep 4, 2012 at 5:26 AM, Swapna Vuppala
 swapna.kollip...@gmail.com wrote:
  Hi,
 
  Am trying to use SharePoint connector of ManifoldCF for the first time
  and
  am having couple of issues. Can someone please help me in successfully
  crawling these repositories ?
 
  Am using ManifoldCF version 0.6 and I see that the SharePoint connector
  is
  readily available for use. I have defined a Repository Connection of
  SharePoint type for the URL
  https://mysite.arup.com/personal/swapna_vuppala/default.aspx; and the
  connection status shows Connection working.
 
  I have got a couple of documents in the libraries Shared Documents and
  Personal Documents and am interested in indexing them into Solr. Now
  when
  I try to define a job using the above created repository connection and
  a
  Solr output connection, am able to add rules to include the libraries I
  have
  got. When I start the job, the number listed in Documents column is
  coming
  correctly, but the job never ends. It is always in the Running state.
  I
  cannot see anything in Simple History except the Job Start.
 
  The manifoldcf log file shows something like WARN 2012-09-04
  14:39:05,204
  (Worker thread '1') - Service interruption reported for job
  1346736412103
  connection 'Test SharePoint': Remote procedure exception: Request is
  empty.
 
  Can someone please tell me if am missing some steps or configuration of
  something ??
 
  Thanks and Regards,
  Swapna.




Re: Job crawling SharePoint repository does not end

2012-09-06 Thread Karl Wright
There is a SharePoint-2010 plugin 0.1 release candidate available now
on http://people.apache.org/~kwright .  This might save you some time.

Karl


On Thu, Sep 6, 2012 at 12:47 AM, Swapna Vuppala
swapna.kollip...@gmail.com wrote:
 Thanks Karl, I'll try and get the new build and use it shortly.

 Thanks and Regards,
 Swapna.

 On Wed, Sep 5, 2012 at 11:01 PM, Karl Wright daddy...@gmail.com wrote:

 FWIW, CONNECTORS-492 was just completed, and merged into trunk.

 You will need a new build of the SharePoint-2010 plugin to use it.

 Thanks,
 Karl

 On Tue, Sep 4, 2012 at 7:34 AM, Swapna Vuppala
 swapna.kollip...@gmail.com wrote:
  Hi Karl,
 
  I'll make sure to look at the things you had mentioned. Thanks again for
  the
  information.
 
  Thanks and Regards,
  Swapna.
 
 
  On Tue, Sep 4, 2012 at 4:19 PM, Karl Wright daddy...@gmail.com wrote:
 
  Also, please be certain to look at CONNECTORS-492, which applies to
  SharePoint 2010.  It may not affect you, but if it does, bear in mind
  we have not completed development on it yet.
 
  Karl
 
  On Tue, Sep 4, 2012 at 6:48 AM, Karl Wright daddy...@gmail.com wrote:
   You will need the SharePoint-2010 plugin, also.  You can check that
   out
   at:
  
  
  
   https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2010/trunk
  
   ... and follow the README.txt directions.
  
   Thanks!
   Karl
  
  
   On Tue, Sep 4, 2012 at 6:31 AM, Swapna Vuppala
   swapna.kollip...@gmail.com wrote:
   Hi Karl,
  
   Yes, this is SharePoint 2010
   OK, then I'll try switching to trunk and start working with it.
   Thanks
   for
   the information, Karl.
  
   Thanks and Regards,
   Swapna.
  
  
   On Tue, Sep 4, 2012 at 3:44 PM, Karl Wright daddy...@gmail.com
   wrote:
  
   Hi - What version of SharePoint are you trying to crawl?
  
   If this is SharePoint 2010, development is underway and you will
   have
   to use trunk.
  
   Karl
  
   On Tue, Sep 4, 2012 at 5:26 AM, Swapna Vuppala
   swapna.kollip...@gmail.com wrote:
Hi,
   
Am trying to use SharePoint connector of ManifoldCF for the first
time
and
am having couple of issues. Can someone please help me in
successfully
crawling these repositories ?
   
Am using ManifoldCF version 0.6 and I see that the SharePoint
connector
is
readily available for use. I have defined a Repository Connection
of
SharePoint type for the URL
https://mysite.arup.com/personal/swapna_vuppala/default.aspx;
and
the
connection status shows Connection working.
   
I have got a couple of documents in the libraries Shared
Documents
and
Personal Documents and am interested in indexing them into
Solr.
Now
when
I try to define a job using the above created repository
connection
and
a
Solr output connection, am able to add rules to include the
libraries I
have
got. When I start the job, the number listed in Documents
column
is
coming
correctly, but the job never ends. It is always in the Running
state.
I
cannot see anything in Simple History except the Job Start.
   
The manifoldcf log file shows something like WARN 2012-09-04
14:39:05,204
(Worker thread '1') - Service interruption reported for job
1346736412103
connection 'Test SharePoint': Remote procedure exception: Request
is
empty.
   
Can someone please tell me if am missing some steps or
configuration
of
something ??
   
Thanks and Regards,
Swapna.
  
  
 
 




RE: Job crawling SharePoint repository does not end

2012-09-10 Thread Karl Wright
The difference is SharePoint 2010, which disabled a number of key features
that were necessary for crawling.  For SharePoint 2010, the plugin is
indeed mandatory.

Karl

Sent from my Windows Phone
--
From: Swapna Vuppala
Sent: 9/10/2012 7:54 AM
To: user@manifoldcf.apache.org
Subject: Re: Job crawling SharePoint repository does not end

Hi Karl,

I have got the SharePoint-2010 plugin but I have got couple of doubts
before using this.

When I was using ManifoldCF version 0.6, I tried defining repository
connections and crawling documents on them by running jobs without
installing anything on the SharePoint server. I thought I was just using
the connector mcf-sharepoint-connector.jar, which is on the machine
running ManifoldCF, and I was under the assumption that I would be able to
crawl documents on any SharePoint server for which I have access
permissions.
I was of the opinion that I don't have to be a SharePoint administrator and
also don't have to install anything on the SharePoint server.

But looking at this plug-in, I think I have been of a wrong opinion. Can
you please clarify if installation of these web services on the SharePoint
server is mandatory, just for being able to crawl them and index into Solr ?
Why is it different from the connector I was using in ManifoldCF 0.6 ?

Thanks and Regards,
Swapna.

On Thu, Sep 6, 2012 at 7:17 PM, Karl Wright daddy...@gmail.com wrote:

 There is a SharePoint-2010 plugin 0.1 release candidate available now
 on http://people.apache.org/~kwright .  This might save you some time.

 Karl


 On Thu, Sep 6, 2012 at 12:47 AM, Swapna Vuppala
 swapna.kollip...@gmail.com wrote:
  Thanks Karl, I'll try and get the new build and use it shortly.
 
  Thanks and Regards,
  Swapna.
 
  On Wed, Sep 5, 2012 at 11:01 PM, Karl Wright daddy...@gmail.com wrote:
 
  FWIW, CONNECTORS-492 was just completed, and merged into trunk.
 
  You will need a new build of the SharePoint-2010 plugin to use it.
 
  Thanks,
  Karl
 
  On Tue, Sep 4, 2012 at 7:34 AM, Swapna Vuppala
  swapna.kollip...@gmail.com wrote:
   Hi Karl,
  
   I'll make sure to look at the things you had mentioned. Thanks again
 for
   the
   information.
  
   Thanks and Regards,
   Swapna.
  
  
   On Tue, Sep 4, 2012 at 4:19 PM, Karl Wright daddy...@gmail.com
 wrote:
  
   Also, please be certain to look at CONNECTORS-492, which applies to
   SharePoint 2010.  It may not affect you, but if it does, bear in mind
   we have not completed development on it yet.
  
   Karl
  
   On Tue, Sep 4, 2012 at 6:48 AM, Karl Wright daddy...@gmail.com
 wrote:
You will need the SharePoint-2010 plugin, also.  You can check that
out
at:
   
   
   
   
 https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2010/trunk
   
... and follow the README.txt directions.
   
Thanks!
Karl
   
   
On Tue, Sep 4, 2012 at 6:31 AM, Swapna Vuppala
swapna.kollip...@gmail.com wrote:
Hi Karl,
   
Yes, this is SharePoint 2010
OK, then I'll try switching to trunk and start working with it.
Thanks
for
the information, Karl.
   
Thanks and Regards,
Swapna.
   
   
On Tue, Sep 4, 2012 at 3:44 PM, Karl Wright daddy...@gmail.com
wrote:
   
Hi - What version of SharePoint are you trying to crawl?
   
If this is SharePoint 2010, development is underway and you will
have
to use trunk.
   
Karl
   
On Tue, Sep 4, 2012 at 5:26 AM, Swapna Vuppala
swapna.kollip...@gmail.com wrote:
 Hi,

 Am trying to use SharePoint connector of ManifoldCF for the
 first
 time
 and
 am having couple of issues. Can someone please help me in
 successfully
 crawling these repositories ?

 Am using ManifoldCF version 0.6 and I see that the SharePoint
 connector
 is
 readily available for use. I have defined a Repository
 Connection
 of
 SharePoint type for the URL
 https://mysite.arup.com/personal/swapna_vuppala/default.aspx;
 and
 the
 connection status shows Connection working.

 I have got a couple of documents in the libraries Shared
 Documents
 and
 Personal Documents and am interested in indexing them into
 Solr.
 Now
 when
 I try to define a job using the above created repository
 connection
 and
 a
 Solr output connection, am able to add rules to include the
 libraries I
 have
 got. When I start the job, the number listed in Documents
 column
 is
 coming
 correctly, but the job never ends. It is always in the
 Running
 state.
 I
 cannot see anything in Simple History except the Job Start.

 The manifoldcf log file shows something like WARN 2012-09-04
 14:39:05,204
 (Worker thread '1') - Service interruption reported for job
 1346736412103
 connection 'Test SharePoint': Remote procedure exception:
 Request
 is
 empty.

 Can someone please

Re: Does anyone use MOSS?

2012-10-10 Thread Karl Wright
I don't know of any difference from a SharePoint standpoint between
MOSS and WSS, except for additional Office-related plugins on MOSS.

Connection working means you could get to SharePoint at least.

Can you look in the log and find the exception associated with the
Cannot open the requested Sharepoint Site error?  It should give a
clue as to what the connector is trying to do at that time.

Thanks,
Karl


On Wed, Oct 10, 2012 at 1:50 AM, Shinichiro Abe
shinichiro.ab...@gmail.com wrote:
 Hi,
 I think MCF supports Windows SharePoint Services(WSS) though,
 does MCF support Microsoft Office SharePoint Server(MOSS)?

 I tried to crawl MOSS but I couldn't crawl and got the error.

 I'm using MOSS 2007 out of the box.
 I have only the Administrator user.
 On the repository connection, the config said connection working
 but when crawling the log said that
 Cannot open the requested Sharepoint Site..
 I couldn't find the server event log that specifies this error.

 Any help please.
 Regards,
 Shinichiro Abe


Re: Web crawling causes Socket Timeout after Database Exception

2012-10-10 Thread Karl Wright
Hi Shigeki,

The socket timeout exception is only a warning.  It means that some
site you are crawling did not accept a socket connection within the
allowed time (5 minutes I think).  The Web Connector will retry the
connection a few times, and if it is still rejected, it will
eventually give up on that page.  One thing you want to check, though,
is that you are using proper throttling, because if you aren't then
one cause of this problem is that the webmaster of the site you are
trying to crawl may have blocked you from accessing it.

The database exception is more problematic.  It means that MySQL
thinks it took too long for a specific transaction to complete, and
the database aborted the transaction due to a timeout.  There are two
ways of dealing with this issue.  One way is to modify your MySQL
configuration to increase the transaction timeout value to some high
number.  The second way is to modify ManifoldCF to recognize the
timeout error specifically, and cause a retry.  But in order to do the
latter, I would need to know what SQL error code MySQL returns for
this situation, which will mean we either need to look it up (if we
can), or modify a ManifoldCF instance to log it when this problem
occurs.
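For the first option, the relevant server setting is InnoDB's lock wait timeout (a hypothetical my.cnf fragment; the value is illustrative):

```ini
[mysqld]
# Default is 50 seconds; raise it so long-running ManifoldCF
# transactions are not aborted with "Lock wait timeout exceeded".
# Requires a server restart (or SET GLOBAL at runtime).
innodb_lock_wait_timeout = 600
```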

Please let me know how you would like to proceed.

Karl

On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:

 Hi

 I am having a trouble with crawling web using MCF1.0.
 I run MCF with MySQL 5.5 and Tomcat 6.0.
 It should keep crawling contents, but MCF prints the following Database
 exception log, then hangs.
 After the DB exception, a Socket Timeout Exception occurs.

 Anyone has faced this problem?

 --Database Exception log:

 ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread aborting
 and restarting due to database connection reset: Database exception:
 Exception doing query: Lock wait timeout exceeded; try restarting
 transaction
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
 exception: Exception doing query: Lock wait timeout exceeded; try restarting
 transaction
 at
 org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
 at
 org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
 at
 org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
 at
 org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
 at
 org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
 at
 org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
 at
 org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
 at
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
 at
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
 at
 org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:6049)
 at
 org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessAcivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:6159)
 at
 org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
 at
 org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:52)
 at
 org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
 at
 org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:225)
 at
 org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:7047)
 at
 org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:6011)
 at
 org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1282)
 at
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
 Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting
 transaction
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
 at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
 at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
 at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
 at
 

Re: Strange behaviour on internet free server

2012-10-15 Thread Karl Wright
I take it by internet free you mean a local network that is not
connected to the internet?

There should be no reason why ManifoldCF would not operate in such an
environment.  Can you describe the strange behavior you have been
seeing?

Karl

On Mon, Oct 15, 2012 at 12:28 PM, Johan Persson perzzon.jo...@gmail.com wrote:
 I'm getting strange behaviour from a test server I've set up in an
 Internet-free environment.
 Could this hinder Manifold from working properly?

 Im using Sharepoint, AD and Solr-connections

 Best Regards /J


Re: Strange behaviour on internet free server

2012-10-16 Thread Karl Wright
:-)  Until somebody starts selling support for ManifoldCF, I'm afraid
it is just us volunteers.

Karl

On Tue, Oct 16, 2012 at 2:54 AM, Johan Persson perzzon.jo...@gmail.com wrote:
 Hmmm... Stupid me.
 I just restarted with a new Manifold-installation and seem to have
 missed to set the Sharepoint-version correctly. Thanks for pointing
 that out.

 BTW, is there by any chance a paid support number to call?

 / Johan

 2012/10/16 Karl Wright daddy...@gmail.com:
 If this is SharePoint 2010, you need to select SharePoint 4.0 (2010)
 in the pulldown.  It looks like you have not done this, since you
 seem to be trying to use the SharePoint dspsts service, which does not
 work on SharePoint 2010.

 I don't know if this is the cause of your Solr problem, but it would
 certainly prevent progress on a SharePoint crawl.

 Thanks,
 Karl

 On Tue, Oct 16, 2012 at 2:31 AM, Johan Persson perzzon.jo...@gmail.com 
 wrote:
 Didn't get anything into Solr. Also the job seems to end up in some
 kind of gridlock with Solr. (Never terminating until the solr-process
 is )

 Read the Manifold log and found this at the bottom.
 Just thought that the reason for the strange behaviour was that a DTD
 or similar was not obtained.

 DEBUG 2012-10-15 09:23:50,531 (Worker thread '1') - Mapping Exception
 to AxisFault
 AxisFault
  faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Client.Dsp.Syntax
  faultSubcode:
  faultString: Request is empty.
  faultActor:
  faultNode:
  faultDetail:
 {http://xml.apache.org/axis/}stackTrace:Request is empty.
 ...
 ...

  WARN 2012-10-15 09:23:50,551 (Worker thread '1') - Service
 interruption reported for job 1350317841332 connection 'Sharepoint':
 Remote procedure exception: Request is empty.
 DEBUG 2012-10-15 09:23:58,499 (Idle cleanup thread) - Checking for
 connections, idleTimeout: 1350318178499
 


Re: Web crawling causes Socket Timeout after Database Exception

2012-10-18 Thread Karl Wright
So, what was the resolution of this problem?  Any news?
Karl

On Thu, Oct 11, 2012 at 2:28 AM, Karl Wright daddy...@gmail.com wrote:
 The only change is that the MySQL driver now performs ANALYZE
 operations on the fly in order to keep the database operating at high
 efficiency.  This is CONNECTORS-510.  It is possible that, on a large
 database table, these operations will cause others to wait long enough
 so that their timeout is exceeded.  Such an event does not take place
 while the load tests run, however.  If you want to turn off the
 analyze operation, you can do that by setting a per-table property to
 override the analyze default of 1 operations:

 analyzeThreshold =
 ManifoldCF.getIntProperty(org.apache.manifold.db.mysql.analyze.+tableName,1);

 The table in question is jobqueue.  If you set this value to
 something like 10 and you still see MySQL timeouts, then this
 new code is not the problem.  And, like I said, the best solution is
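Assuming the standard properties.xml mechanism, the per-table override would look something like this (property name taken from the code above; the value is illustrative):

```xml
<!-- Raise the ANALYZE threshold for the jobqueue table -->
<property name="org.apache.manifold.db.mysql.analyze.jobqueue" value="100000"/>
```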
 to recognize the error and retry, but first I would need the error
 code.  Adding an appropriate output of sqlState around line 123 of
 framework/core/src/main/java/org/apache/manifoldcf/core/database/DBInterfaceMySQL.java
 would allow us to see what code to catch, when it happened again.
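As a sketch of the recognize-and-retry idea Karl describes (class and method names here are illustrative, not ManifoldCF's actual DBInterfaceMySQL code): MySQL reports "Lock wait timeout exceeded" as vendor error code 1205 (ER_LOCK_WAIT_TIMEOUT) and deadlocks as 1213 (ER_LOCK_DEADLOCK), both of which are generally safe to retry.

```java
import java.sql.SQLException;

// Illustrative sketch only -- not ManifoldCF's actual code.
public class MySQLRetrySketch {
    // MySQL vendor codes: 1205 = ER_LOCK_WAIT_TIMEOUT ("Lock wait timeout
    // exceeded; try restarting transaction"), 1213 = ER_LOCK_DEADLOCK.
    static boolean isTransientLockError(SQLException e) {
        int code = e.getErrorCode();
        return code == 1205 || code == 1213;
    }

    public static void main(String[] args) {
        // Simulate the exception from the reported log
        // (message, SQLSTATE, vendor code).
        SQLException lockTimeout = new SQLException(
            "Lock wait timeout exceeded; try restarting transaction",
            "HY000", 1205);
        // Logging the SQLSTATE and vendor code, as suggested above, shows
        // exactly which code a retry handler would need to catch.
        System.out.println("sqlState=" + lockTimeout.getSQLState()
            + " vendorCode=" + lockTimeout.getErrorCode()
            + " retryable=" + isTransientLockError(lockTimeout));
    }
}
```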

 For the Web connector, the only modifications have been in regards to
 how it handles 500 errors, which is now correctly coded to avoid an
 IndexOutOfBoundsException.  This has nothing to do with
 socket exceptions, which are caused for external reasons only.

 Karl


 On Wed, Oct 10, 2012 at 10:32 PM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,


 I was comparing version 1.0 with an old trunk build based on version 0.6
 implementing CONNECTORS-501 (Medium-scale web crawl with hopcount-based
 filtering fails to find the correct number of documents).

 Running each version with the same MySQL setting and the same throttling,
 somehow the version 1.0 hangs with the error.
 Since the old trunk completes crawling, I wonder if something has changed.

 Just to make sure I will recheck if there are any wrong settings in MCF.

 Thanks.

 Regards,

 Shigeki

 2012/10/10 Karl Wright daddy...@gmail.com

 Hi Shigeki,

 The socket timeout exception is only a warning.  It means that some
 site you are crawling did not accept a socket connection within the
 allowed time (5 minutes I think).  The Web Connector will retry the
 connection a few times, and if it is still rejected, it will
 eventually give up on that page.  One thing you want to check, though,
 is that you are using proper throttling, because if you aren't then
 one cause of this problem is that the webmaster of the site you are
 trying to crawl may have blocked you from accessing it.

 The database exception is more problematic.  It means that MySQL
 thinks it took too long for a specific transaction to complete, and
 the database aborted the transaction due to a timeout.  There are two
 ways of dealing with this issue.  One way is to modify your MySQL
 configuration to increase the transaction timeout value to some high
 number.  The second way is to modify ManifoldCF to recognize the
 timeout error specifically, and cause a retry.  But in order to do the
 latter, I would need to know what SQL error code MySQL returns for
 this situation, which will mean we either need to look it up (if we
 can), or modify a ManifoldCF instance to log it when this problem
 occurs.
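For the first option, the relevant InnoDB knob is innodb_lock_wait_timeout, whose default is 50 seconds. A my.cnf sketch, with a purely illustrative value:

```ini
# my.cnf sketch (value illustrative): give long-running crawler
# transactions more time before InnoDB aborts them with
# "Lock wait timeout exceeded; try restarting transaction"
[mysqld]
innodb_lock_wait_timeout = 300
```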

 Please let me know how you would like to proceed.

 Karl

 On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 
  Hi
 
   I am having trouble crawling the web using MCF 1.0.
  I run MCF with MySQL 5.5 and Tomcat 6.0.
  It should keep crawling contents, but MCF prints the following Database
  exception log, then hangs.
   After the DB exception, a Socket Timeout Exception occurs.
 
  Anyone has faced this problem?
 
  --Database Exception log:
 
  ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread
  aborting
  and restarting due to database connection reset: Database exception:
  Exception doing query: Lock wait timeout exceeded; try restarting
  transaction
  org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
  exception: Exception doing query: Lock wait timeout exceeded; try
  restarting
  transaction
  at
 
  org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
  at
 
  org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
  at
 
  org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
  at
 
  org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
  at
 
  org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
  at
 
  org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852

Re: Web crawling causes Socket Timeout after Database Exception

2012-10-19 Thread Karl Wright
I just looked in the code with svn for differences in the web
connector from release 0.6.  There is a change to the html parser to
allow for handling default values for option tags, and a change that
fixes an IndexOutOfBounds exception.  Neither of these can possibly
affect socket timeouts.

I also looked at the solr connector (presuming that is what you are
using as an output connector).  No changes at all since 0.6.

So honestly, I can see no significant changes whatsoever in the
behavior of how a web crawler indexing into Solr would behave.  If you
are seeing differences, therefore, I simply cannot account for them.

Karl


On Fri, Oct 19, 2012 at 5:01 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:
 Due to the error, I had to downgrade to a lower version so I haven't found
 the MySQL error code yet.

 I installed MCF1.0 in a different environment where crawlable contents are
 different from the above environment.
 I could not reproduce the Database exception, but socket timeout occurred. In
 the same environment, I ran MCF 0.6 and it completed crawling without socket
 timeouts.
 Like you said, the socket timeout seems to be a different problem from the
 Database exception.

 2012/10/18 Karl Wright daddy...@gmail.com

 So, what was the resolution of this problem?  Any news?
 Karl

 On Thu, Oct 11, 2012 at 2:28 AM, Karl Wright daddy...@gmail.com wrote:
  The only change is that the MySQL driver now performs ANALYZE
  operations on the fly in order to keep the database operating at high
  efficiency.  This is CONNECTORS-510.  It is possible that, on a large
  database table, these operations will cause others to wait long enough
  so that their timeout is exceeded.  Such an event does not take place
  while the load tests run, however.  If you want to turn off the
  analyze operation, you can do that by setting a per-table property to
  override the analyze default of 1 operations:
 
  analyzeThreshold =
  ManifoldCF.getIntProperty("org.apache.manifold.db.mysql.analyze."+tableName,1);
 
  The table in question is jobqueue.  If you set this value to
  something like 10 and you still see MySQL timeouts, then this
  new code is not the problem.  And, like I said, the best solution is
  to recognize the error and retry, but first I would need the error
  code.  Adding an appropriate output of sqlState around line 123 of
 
  framework/core/src/main/java/org/apache/manifoldcf/core/database/DBInterfaceMySQL.java
  would allow us to see what code to catch, when it happened again.
 
  For the Web connector, the only modifications have been in regards to
  how it handles 500 errors, which is now correctly coded to avoid an
  IndexOutOfBoundsException.  This has nothing to do with
  socket exceptions, which are caused for external reasons only.
 
  Karl
 
 
  On Wed, Oct 10, 2012 at 10:32 PM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
  Hi Karl,
 
 
  I was comparing version 1.0 with an old trunk build based on version 0.6
  implementing CONNECTORS-501 (Medium-scale web crawl with hopcount-based
  filtering fails to find the correct number of documents).
 
  Running each version with the same MySQL setting and the same
  throttling,
  somehow the version 1.0 hangs with the error.
  Since the old trunk completes crawling, I wonder if something has
  changed.
 
  Just to make sure I will recheck if there are any wrong settings in
  MCF.
 
  Thanks.
 
  Regards,
 
  Shigeki
 
  2012/10/10 Karl Wright daddy...@gmail.com
 
  Hi Shigeki,
 
  The socket timeout exception is only a warning.  It means that some
  site you are crawling did not accept a socket connection within the
  allowed time (5 minutes I think).  The Web Connector will retry the
  connection a few times, and if it is still rejected, it will
  eventually give up on that page.  One thing you want to check, though,
  is that you are using proper throttling, because if you aren't then
  one cause of this problem is that the webmaster of the site you are
  trying to crawl may have blocked you from accessing it.
 
  The database exception is more problematic.  It means that MySQL
  thinks it took too long for a specific transaction to complete, and
  the database aborted the transaction due to a timeout.  There are two
  ways of dealing with this issue.  One way is to modify your MySQL
  configuration to increase the transaction timeout value to some high
  number.  The second way is to modify ManifoldCF to recognize the
  timeout error specifically, and cause a retry.  But in order to do the
  latter, I would need to know what SQL error code MySQL returns for
  this situation, which will mean we either need to look it up (if we
  can), or modify a ManifoldCF instance to log it when this problem
  occurs.
 
  Please let me know how you would like to proceed.
 
  Karl
 
  On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
  
   Hi
  
   I am having trouble with crawling

Re: Problem with reading files from Sharepoint 2010 to manifldcf 1.0.1

2012-10-30 Thread Karl Wright
I finally was able to look at the logs.

The exception that stops the job is in fact coming from the GetListItems call:

at org.apache.axis.client.Call.invoke(Call.java:1812)
at 
com.microsoft.sharepoint.webpartpages.PermissionsSoapStub.getListItems(PermissionsSoapStub.java:234)
at 
org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getChildren(SPSProxyHelper.java:619)
at 
org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1303)
at 
org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

Clearly certain entities are expected to have children, but we're
either not invoking the service correctly for those, OR we're invoking
the service for entities that don't have the ability to get children
at all.

I don't see any evidence in this log that ANY getListItems calls are
succeeding.  In fact, it is the first such call that fails.  Why do
you think that discovery is working?  There seems to be no evidence of
that.  The headers etc. all look good too:

DEBUG 2012-10-30 14:04:35,223 (Thread-439) -
HttpConnectionManager.getConnection:  config =
HostConfiguration[host=http://16.59.60.113], timeout = 0
DEBUG 2012-10-30 14:04:35,223 (Thread-439) - Getting free connection,
hostConfig=HostConfiguration[host=http://16.59.60.113]
DEBUG 2012-10-30 14:04:35,224 (Thread-439) -  POST
/_vti_bin/MCPermissions.asmx HTTP/1.1[\r][\n]

Karl
On Tue, Oct 30, 2012 at 8:39 AM, Fridler, Oren oren.frid...@hp.com wrote:
 Hi

 I’m using apache-manifoldcf-1.0.1-bin

 I installed apache-manifoldcf-sharepoint-2010-plugin-0.1  on top of
 Sharepoint 2010



 On mcf I managed to create a Sharepoint repository connection and saw the
 status is “Connection Working”

 Also when I create the “Sharepoint to Solr” Job I can see some of the wiki
 libraries that I created on SP are available for selection so I assume MCF
 is getting this data from SP.

 But when I start the job it is getting stuck in status “running” forever,
 the mcf UI shows documents are discovered, some are processed and some are
 active, but on Solr side no document is received.

 On mcf logs I see the error at the end of this email.

 On my browser I can open http://16.59.60.113 - getting to SP site, and also
 http://16.59.60.113/_vti_bin/MCPermissions.asmx  - getting to a page that
 lists these 2 services - GetListItems and GetPermissionCollection

 Attached are the mcf logs with DEBUG level.

 Any help or idea what can I do would be highly appreciated.

 Thanks

 Oren.





 AxisFault

 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Client

 faultSubcode:

  faultString: The Web application at http://16.59.60.113 could not be found.
 Verify that you have typed the URL correctly. If the URL should be serving
 existing content, the system administrator may need to add a new request URL
 mapping to the intended application.

 faultActor: http://16.59.60.113/_vti_bin/MCPermissions.asmx

 faultNode:

  faultDetail:

{}Error:<ErrorNumber>1010</ErrorNumber><ErrorMessage>The Web
 application at http://16.59.60.113 could not be found. Verify that you have
 typed the URL correctly. If the URL should be serving existing content, the
 system administrator may need to add a new request URL mapping to the
 intended
 application.</ErrorMessage><ErrorSource>Microsoft.SharePoint</ErrorSource>












Re: Problem with reading files from Sharepoint 2010 to manifldcf

2012-10-30 Thread Karl Wright
Seeing the existence of the service in the browser does not mean it
will work.  It only means that the wsdl is coming back from the
service.

 What can be the reason for this?

Unfortunately that is very difficult to determine.  SharePoint tends
to return catchall errors which are not very meaningful.  The
server-side event logs may be helpful in figuring out what is going
wrong.

 Can there be a mismatch between the sharepoint driver on MCF and the 
 sharepoint server?

This is possible if (for instance) you deployed a SharePoint 2010
plugin on a SharePoint 2007 server, but if you had a version of
SharePoint which was incompatible with the plugin you deployed, I
would expect you would have seen errors reported during the plugin
installation.  The plugins are built against specific SharePoint dlls
with specific version numbers, and .NET enforces a match.  The .bat
deployment files though are not very good at telling you that stuff is
broken; they don't actually catch the reported errors and stop, so it
is possible you may have missed such errors.

If there were no errors, I would guess that the problem is probably
permissions related.  That is, the plugin may not have permissions to
do what it needs to do.  The permissions are granted (as I understand
it) based on the user that installs the plugin, so that may be what
the issue is.

Karl


On Tue, Oct 30, 2012 at 11:19 AM, Fridler, Oren oren.frid...@hp.com wrote:
 Discovery is indeed not working (sorry I was not clear on this); I just saw
 the status "Connection working" on the SharePoint repository connector UI.

 So if I understand you correctly, the SOAP call to
 com.microsoft.sharepoint.webpartpages.PermissionsSoapStub.getListItems(PermissionsSoapStub.java:234)
   is failing? Although I can see the GetListItems operation supported in the 
 browser.
 What can be the reason for this?
 Can there be a mismatch between the sharepoint driver on MCF and the 
 sharepoint server?
 How do you suggest I continue to investigate?
 Thanks
 Oren.


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, 30 October 2012 17:05
 To: Fridler, Oren
 Subject: Re: Problem with reading files from Sharepoint 2010 to manifldcf

 I responded to user@manifoldcf.a.o.  The log disagrees with the idea that 
 discovery is working.  It seems like the getListItems() part of the service 
 is failing, and on the very first call too.

 Karl

 On Tue, Oct 30, 2012 at 10:39 AM, Fridler, Oren oren.frid...@hp.com wrote:
 I selected SharePoint 2010.
 There is only one user I used for the SharePoint Server install and this 
 user is used on MCF SharePoint connection.
 Is there a way to disable permission checking altogether in the connector 
 and just ask for all documents with the user credentials I entered on the 
 sharepoint connection? I tried to select secutiry=disabled on the job 
 details but it didn't help.


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, 30 October 2012 16:26
 To: Fridler, Oren
 Cc: user@manifoldcf.apache.org
 Subject: Re: Problem with reading files from Sharepoint 2010 to
 manifldcf

 Hi Oren,

 Here's my reasoning:

 (1) You would not get connection working if you could not access the 
 MCPermissions service, unless you selected SharePoint 2003, which would then 
 conflict with other data.

 (2) You said that it discovered documents.  That means that the GetListItems 
 part of the service is working.

 (3) You said that you couldn't index any documents, and got an AXIS 
 exception which terminated the job.  That means you could not retrieve 
 document permissions (which is what the GetPermissionCollection part of the 
 service does).

 (4) The GetPermissionCollection operation uses only one other service, and 
 it is Permissions.asmx.  So it figured that the problem was likely in 
 reaching that service, since the complaint was that it couldn't find a 
 service.

 I only got internet service back 10 minutes ago, but I will confirm
 this picture in your logs shortly.

 The Permissions.asmx service you identify is the correct one; the question 
 seems to be why the MCPermissions service can't talk to it.
 Could be a permission problem I suppose - perhaps the user you were logged 
 in as when you installed the service had insufficient permissions or some 
 such?  Just guessing here...

 Karl


 On Tue, Oct 30, 2012 at 9:19 AM, Fridler, Oren oren.frid...@hp.com wrote:
 Hi Karl
 Thank you for your prompt reply,

 By SharePoint permissions service do you refer to this?   
 http://16.59.60.113/_vti_bin/Permissions.asmx
 I was able to open this service, getting the following operations:
 AddPermission
 AddPermissionCollection
 GetPermissionCollection
 RemovePermission
 RemovePermissionCollection
 UpdatePermission

 BTW, how can you tell from the logs that the MCPermissions service is having
 trouble reaching the SharePoint permissions service?
 Thanks in advance
 Oren.

 -Original Message-
 From

Re: Problem with reading files from Sharepoint 2010 to manifldcf

2012-10-31 Thread Karl Wright
Hi Oren,

I've been thinking further about your issue, and how many recent kinds
of posts we've been getting which basically amount to people trying to
get the manifoldcf-sharepoint-2010 plugin working on their particular
SharePoint instance, which has no doubt been installed and
(mis?)configured by someone else at some point in the past.  I think
we're going to need a how-to-debug page where we can gather everyone's
experiences together, including diagnostic approaches and advice.
There is already a page that anyone can edit in the ManifoldCF wiki,
which is a fine starting point:
https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
.  I hope you will be willing to contribute to this effort.

In the meantime, let's go back over your questions below and try to
eliminate them one at a time, in a more systematic fashion.

(1) Version of SharePoint.

To rule out any funkiness here, the obvious thing to do is to find the
version of your sharepoint.dll.  The dll should be in one of the
standard locations where assembly dlls are deployed on your server.
The assembly name is Microsoft.SharePoint.dll - nothing else, not
MicrosoftOffice, or anything else.  There are a number of tools for
determining the .NET version of such DLLs; here's a link that might
help: 
http://stackoverflow.com/questions/227886/how-do-i-determine-the-dependencies-of-a-net-application
.  The ManifoldCF-SharePoint-2010 plugin is built against:

<Reference Include="Microsoft.SharePoint, Version=14.0.0.0,
Culture=neutral, PublicKeyToken=71e9bce111e9429c,
processorArchitecture=MSIL" />

... which can be found in the webservice/MCPermissionsService.csproj
file in the source package for the service.  The
ManifoldCF-SharePoint-2007 plugin is, obviously, built against a
different version:

<Reference Include="Microsoft.SharePoint, Version=12.0.0.0,
Culture=neutral, PublicKeyToken=71e9bce111e9429c,
processorArchitecture=MSIL" />

(2) Meaning of error

Here's the error again:
{}Error:<ErrorNumber>1010</ErrorNumber><ErrorMessage>The Web
application at http://16.59.60.113 could not be found. Verify that you
have typed the URL correctly. If the URL should be serving existing
content, the system administrator may need to add a new request URL
mapping to the intended
application.</ErrorMessage><ErrorSource>Microsoft.SharePoint</ErrorSource>

The error code 1010 comes from the plugin, specifically from the
GetListItems method:

catch (Exception ex)
{
EventLog.WriteEntry("MCPermissions.asmx", ex.Message);
throw RaiseException(ex.Message, 1010, ex.Source);
}

So, we know we are getting into the plugin correctly, but we
furthermore know that something that is happening in there is not
working.  The ErrorSource tags include the assembly from which the
error is coming:

Microsoft.SharePoint

The error message, as I pointed out before, is pretty useless when
SharePoint is concerned - there are quite a number of catchall
errors which are more likely to mislead you than help you.  So you
have to look at the source code, which is actually rather small and
simple.

Looking at the code itself, and what it is doing, the likely place
that the problem comes from is this:

using (SPSite site = new SPSite(SPContext.Current.Web.Url))
{
using (SPWeb oWebsiteRoot = site.OpenWeb())
{
...

It seems clear that for some reason your SharePoint instance does not
have a valid SPContext.Current.Web.Url which will permit the plugin
reaching the actual sharepoint logic.  I don't know the reason for
that; this is happening internal to SharePoint on that server.
Possibilities include a URL redirection, I suppose?  My knowledge of
.NET, and what SharePoint is doing under the covers, is not that
strong.  But this is the avenue I'd pursue.  If you do find that
there's a redirection taking place to reach your _vti_bin directory,
try using the final target of the redirection instead of the initial
URL, and see if that helps...

Karl


On Tue, Oct 30, 2012 at 11:44 AM, Karl Wright daddy...@gmail.com wrote:
 Seeing the existence of the service in the browser does not mean it
 will work.  It only means that the wsdl is coming back from the
 service.

 What can be the reason for this?

 Unfortunately that is very difficult to determine.  SharePoint tends
 to return catchall errors which are not very meaningful.  The
 server-side event logs may be helpful in figuring out what is going
 wrong.

 Can there be a mismatch between the sharepoint driver on MCF and the 
 sharepoint server?

 This is possible if (for instance) you deployed a SharePoint 2010
 plugin on a SharePoint 2007 server, but if you had a version of
 SharePoint which was incompatible with the plugin you deployed, I
 would expect you would have seen errors reported during the plugin
 installation.  The plugins are built against specific SharePoint dlls
 with specific version numbers, and .NET

Re: Problem with reading files from Sharepoint 2010 to manifldcf

2012-10-31 Thread Karl Wright
Please see below...

On Wed, Oct 31, 2012 at 8:59 AM, Fridler, Oren oren.frid...@hp.com wrote:
 Thanks Karl
 I'll be happy to contribute to the debugging wiki once I have some helpful 
 insights.

 I'm following your advice and sharing the info in case someone encounters the
 same issues:

 (1) SharePoint version - I've found 2 copies of Microsoft.SharePoint.dll (see
 below). I opened them with .NET Reflector; the first dll's version is
 14.0.0.0 and the second is 14.900.0.0.
 C:\>dir /s /b Microsoft.SharePoint.dll
 C:\Program Files\Common Files\Microsoft Shared\Web Server 
 Extensions\14\ISAPI\Microsoft.SharePoint.dll
 C:\Program Files\Common Files\Microsoft Shared\Web Server 
 Extensions\14\UserCode\assemblies\Microsoft.SharePoint.dll

 I don't know which dll is used by my SharePoint 2010, so I uninstalled
 SharePoint - both dlls were removed, and after I re-installed, they were back
 again :(

 I installed the manifold sharepoint plugin (setup output attached) and it 
 went ok without errors.


I think I've seen the 14.900.0.0 - it is the Microsoft Office
extensions to SharePoint.  But as long as the 14.0.0.0 one is
available that is probably fine.

 (2) Meaning of error - I followed your idea that maybe redirects are causing
 the problem. Since ManifoldCF is running on the same server where SharePoint
 is, I changed the URL and replaced the server IP with localhost or
 127.0.0.1.
 Now I don't get the 1010 error saying the Web application cannot be found, but
 still no files are imported, and the logs (attached) contain these 2 errors:

 org.apache.axis.ConfigurationException: No service named PermissionsSoap is 
 available
 ...
 org.apache.axis.ConfigurationException: No service named 
 http://microsoft.com/sharepoint/webpartpages/GetListItems is available


These are just warnings.  They seem to be due to some kind of mismatch
between the wsdl and what the services actually look like.  But just
ignore these for now.

I'll have a look at your logs shortly and get back to you with an idea
what they are telling us.

Karl

 I'll continue to investigate, if someone have any idea/help it would be great
 Thanks
 Oren.


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Wednesday, 31 October 2012 09:39
 To: Fridler, Oren; user@manifoldcf.apache.org
 Subject: Re: Problem with reading files from Sharepoint 2010 to manifldcf

 Hi Oren,

 I've been thinking further about your issue, and how many recent kinds of 
 posts we've been getting which basically amount to people trying to get the 
 manifoldcf-sharepoint-2010 plugin working on their particular SharePoint 
 instance, which has no doubt been installed and (mis?)configured by someone 
 else at some point in the past.  I think we're going to need a how-to-debug 
 page where we can gather everyone's experiences together, including 
 diagnostic approaches and advice.
 There is already a page that anyone can edit in the ManifoldCF wiki, which is 
 a fine starting point:
 https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
 .  I hope you will be willing to contribute to this effort.

 In the meantime, let's go back over your questions below and try to eliminate 
 them one at a time, in a more systematic fashion.

 (1) Version of SharePoint.

 To rule out any funkiness here, the obvious thing to do is to find the 
 version of your sharepoint.dll.  The dll should be in one of the standard 
 locations where assembly dlls are deployed on your server.
 The assembly name is Microsoft.SharePoint.dll - nothing else, not 
 MicrosoftOffice, or anything else.  There are a number of tools for 
 determining the .NET version of such DLLs; here's a link that might
 help: 
 http://stackoverflow.com/questions/227886/how-do-i-determine-the-dependencies-of-a-net-application
 .  The ManifoldCF-SharePoint-2010 plugin is built against:

 <Reference Include="Microsoft.SharePoint, Version=14.0.0.0, Culture=neutral,
 PublicKeyToken=71e9bce111e9429c, processorArchitecture=MSIL" />

 ... which can be found in the webservice/MCPermissionsService.csproj
 file in the source package for the service.  The
 ManifoldCF-SharePoint-2007 plugin is, obviously, built against a different 
 version:

 <Reference Include="Microsoft.SharePoint, Version=12.0.0.0, Culture=neutral,
 PublicKeyToken=71e9bce111e9429c, processorArchitecture=MSIL" />

 (2) Meaning of error

 Here's the error again:
 {}Error:<ErrorNumber>1010</ErrorNumber><ErrorMessage>The Web application at
 http://16.59.60.113 could not be found. Verify that you have typed the URL
 correctly. If the URL should be serving existing content, the system
 administrator may need to add a new request URL mapping to the intended
 application.</ErrorMessage><ErrorSource>Microsoft.SharePoint</ErrorSource>

 The error code 1010 comes from the plugin, specifically from the GetListItems 
 method:

 catch (Exception ex)
 {
 EventLog.WriteEntry("MCPermissions.asmx", ex.Message

Re: Problem with reading files from Sharepoint 2010 to manifldcf

2012-10-31 Thread Karl Wright
I have good news - it is apparently now working.

Check your path rules.  You need to have a path that matches the
document part of the path, e.g. xxx/yyy/*.  The end user documentation
explains how to set one of these up.

Karl

On Wed, Oct 31, 2012 at 1:12 PM, Fridler, Oren oren.frid...@hp.com wrote:
 Sorry, my bad, I attached the wrong file.
 Attached is manifoldcf log when 127.0.0.1 is used for sharepoint server
 Oren
 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Wednesday, 31 October 2012 15:25
 To: Fridler, Oren
 Cc: user@manifoldcf.apache.org
 Subject: Re: Problem with reading files from Sharepoint 2010 to manifldcf

 The logs you attached have no entries that are dated later than 10/30, so I 
 am uncertain they are the right ones.

 I still see the same error when MCPermissions.asmx is invoked.

 Karl

 On Wed, Oct 31, 2012 at 9:16 AM, Karl Wright daddy...@gmail.com wrote:
 Please see below...

 On Wed, Oct 31, 2012 at 8:59 AM, Fridler, Oren oren.frid...@hp.com wrote:
 Thanks Karl
 I'll be happy to contribute to the debugging wiki once I have some helpful 
 insights.

 I'm following your advice and sharing the info in case someone encounters
 the same issues:

 (1) SharePoint version - I've found 2 copies of
 Microsoft.SharePoint.dll (see below). I opened them with .NET
 Reflector; the first dll's version is 14.0.0.0 and the second is
 14.900.0.0. C:\>dir /s /b Microsoft.SharePoint.dll C:\Program
 Files\Common Files\Microsoft Shared\Web Server
 Extensions\14\ISAPI\Microsoft.SharePoint.dll
 C:\Program Files\Common Files\Microsoft Shared\Web Server
 Extensions\14\UserCode\assemblies\Microsoft.SharePoint.dll

 I don't know which dll is used by my SharePoint 2010, so I
 uninstalled SharePoint - both dlls were removed and after
 re-installed they were back again :(

 I installed the manifold sharepoint plugin (setup output attached) and it 
 went ok without errors.


 I think I've seen the 14.900.0.0 - it is the Microsoft Office
 extensions to SharePoint.  But as long as the 14.0.0.0 one is
 available that is probably fine.

 (2) Meaning of error - I followed your idea that maybe redirects are
 causing the problem. Since ManifoldCF is running on the same server where
 SharePoint is, I changed the URL and replaced the server IP with localhost
 or 127.0.0.1.
 Now I don't get the 1010 error saying the Web application cannot be found, but
 still no files are imported, and the logs (attached) contain these 2 errors:

 org.apache.axis.ConfigurationException: No service named
 PermissionsSoap is available ...
 org.apache.axis.ConfigurationException: No service named
 http://microsoft.com/sharepoint/webpartpages/GetListItems is
 available


 These are just warnings.  They seem to be due to some kind of mismatch
 between the wsdl and what the services actually look like.  But just
 ignore these for now.

 I'll have a look at your logs shortly and get back to you with an idea
 what they are telling us.

 Karl

 I'll continue to investigate, if someone have any idea/help it would
 be great Thanks Oren.


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Wednesday, 31 October 2012 09:39
 To: Fridler, Oren; user@manifoldcf.apache.org
 Subject: Re: Problem with reading files from Sharepoint 2010 to
 manifldcf

 Hi Oren,

 I've been thinking further about your issue, and how many recent kinds of 
 posts we've been getting which basically amount to people trying to get the 
 manifoldcf-sharepoint-2010 plugin working on their particular SharePoint 
 instance, which has no doubt been installed and (mis?)configured by someone 
 else at some point in the past.  I think we're going to need a how-to-debug 
 page where we can gather everyone's experiences together, including 
 diagnostic approaches and advice.
 There is already a page that anyone can edit in the ManifoldCF wiki, which 
 is a fine starting point:
 https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Conn
 ections .  I hope you will be willing to contribute to this effort.

 In the meantime, let's go back over your questions below and try to 
 eliminate them one at a time, in a more systematic fashion.

 (1) Version of SharePoint.

 To rule out any funkiness here, the obvious thing to do is to find the 
 version of your sharepoint.dll.  The dll should be in one of the standard 
 locations where assembly dlls are deployed on your server.
 The assembly name is Microsoft.SharePoint.dll - nothing else, not
 MicrosoftOffice, or anything else.  There are a number of tools for
 determining the .NET version of such DLLs; here's a link that might
 help:
 http://stackoverflow.com/questions/227886/how-do-i-determine-the-dependencies-of-a-net-application .
 The ManifoldCF-SharePoint-2010 plugin is built against:

 <Reference Include="Microsoft.SharePoint, Version=14.0.0.0,
  Culture=neutral, PublicKeyToken=71e9bce111e9429c,
  processorArchitecture=MSIL" />

 ... which can be found in the webservice

Re: Problem with manifold

2012-11-02 Thread Karl Wright
Actually, from your log it is clear that ManifoldCF can be reached
fine from your Solr instance, so please disregard that question.

The only other potential issue has to do with Solr search component
ordering.  This is a bit of black magic, because other Solr components
may modify the request in ways which are potentially incompatible with
the ManifoldCF plugin.  So if you are sure your fields are all
correct, you might want to play around with the ordering of your
components to see if that makes any difference.

There used to be a debug component you could also use, which would print
out the (full) query and the results returned - that may also be
useful.

Thanks,
Karl

On Fri, Nov 2, 2012 at 6:25 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Pablo,

 The first thing that I notice is that, as you have this configured,
 you need four fields declared in your schema as indexable fields:

 allow_token_document
 deny_token_document
 allow_token_share
 deny_token_share


 Do you have these fields declared, and did you have them all declared
 when you performed the crawl?

 Second, the way it is configured, the machine that is running Solr
 must be the same as the machine running ManifoldCF (because you used a
 localhost url).  Is this true?

 Thanks,
 Karl


 On Fri, Nov 2, 2012 at 5:43 AM, Gonzalez, Pablo
 pablo.gonzalez.do...@hp.com wrote:
 Hello, Mr Wright, and thank you for such a fast response. Well, the way I am
 trying to connect MCF and Solr is via a SearchComponent. For this I added
 the apache-solr-mcf-3.6-SNAPSHOT.jar that comes in the solr-integration
 package to the lib folder of the Solr webapp deployment in Tomcat. Then I
 changed solrconfig.xml, adding this piece of code:


  <!-- LCF document security enforcement component -->
  <searchComponent name="mcfSecurity"
                   class="org.apache.solr.mcf.ManifoldCFSearchComponent">
    <str name="AuthorityServiceBaseURL">http://localhost:8345/mcf</str>
  </searchComponent>


  <requestHandler name="/search" class="solr.SearchHandler" default="true">

    <!-- default values for query parameters can be specified, these
         will be overridden by parameters in the request -->

    <!-- <lst name="defaults">
           <str name="echoParams">explicit</str>
           <int name="rows">10</int>
           <str name="df">text</str>
         </lst> -->

    <arr name="last-components">
      <str>mcfSecurity</str>
    </arr>
    <!-- a bunch of comments -->
  </requestHandler>

 Last thing, I didn't write any additional Java code. I thought it wasn't 
 necessary.

 Thanks,

 Pablo


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: viernes, 02 de noviembre de 2012 10:21
 To: user@manifoldcf.apache.org
 Subject: Re: Problem with manifold

 The ManifoldCF Solr plugin operates by requesting access tokens from 
 ManifoldCF (which seems to be working fine), and using those to modify the 
 incoming Solr search expression to limit the results according to those 
 access tokens.

 There are two ways (and two independent classes) you can configure to 
 perform this modification.  One of these classes functions as a query parser 
 plugin.  The other functions as a search component.  Obviously, for either 
 one to work right, the Solr configuration has to work properly too.  Can you 
 provide details as to (a) which one you are using, and (b) what the 
 configuration details are, e.g. the appropriate clauses from solrconfig.xml?
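
To illustrate what "modify the incoming Solr search expression" can mean in practice, here is a minimal, hypothetical sketch that builds a Solr-style filter clause from a user's access tokens. The field names mirror the allow/deny fields discussed in this thread, but the class name and the exact clause shape are illustrative assumptions, not the plugin's actual output:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: NOT the plugin's real code.  Shows the kind of
// filter clause a security component could append to a query so that
// only documents carrying one of the user's allow tokens (and none of
// the deny tokens) match.
public class TokenFilterSketch {
    static String buildClause(String allowField, String denyField,
                              List<String> tokens) {
        // OR together the allow-token matches...
        String allows = tokens.stream()
            .map(t -> allowField + ":\"" + t + "\"")
            .collect(Collectors.joining(" OR "));
        // ...and exclude any document carrying a matching deny token.
        String denies = tokens.stream()
            .map(t -> "-" + denyField + ":\"" + t + "\"")
            .collect(Collectors.joining(" "));
        return "(" + allows + ") " + denies;
    }

    public static void main(String[] args) {
        System.out.println(buildClause("allow_token_document",
                                       "deny_token_document",
                                       List.of("TOKEN_ALL_USERS")));
    }
}
```

The real component obtains the tokens from the ManifoldCF authority service and applies a clause like this as a filter; the sketch only shows the string-building side of that idea.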

 Thanks,
 Karl

 On Fri, Nov 2, 2012 at 4:57 AM, Gonzalez, Pablo 
 pablo.gonzalez.do...@hp.com wrote:
 Hello,
 I don't know if you already got this message, but anyway here I go:
 I have been trying to connect ManifoldCF to Solr. I have a file system
 in a remote server, protected by active directory.
 I have configured a manifold job to import only a part of the
 documents under the file system. In fact, I do the importing process
 from a file which only contains 2 documents, in order to make it
 easier to see what is happening and get conclusions. Afterwards the
 documents are output to the solr server.
 I have created a request handler called selectManifold to connect
 manifold and solr. Then I call it via
 http://[host]:8080/solr/selectManifold?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=&AuthenticatedUserName=user@domain .
 When doing this, tomcat's log (catalina.out) writes this:
 oct 31, 2012 2:40:33 PM org.apache.solr.mcf.ManifoldCFSearchComponent prepare
 Información: Trying to match docs for user 'user@domain'
 oct 31, 2012 2:40:33 PM org.apache.solr.mcf.ManifoldCFSearchComponent getAccessTokens
 Información: For user 'user@domain', saw authority response
 AUTHORIZED:Auth+active+directory+para+el+file+system (this one is the
 active directory I'm currently using for the job)
 oct 31, 2012 2:40:33 PM org.apache.solr.mcf.ManifoldCFSearchComponent getAccessTokens
 Información: For user 'user@domain', saw authority response
 AUTHORIZED:ad (this one isn't)
 oct 31, 2012 2:40:33 PM org.apache.solr.core.SolrCore execute
 Información

Re: ManifoldCF 1.0.1 MySQL setup : Error getting connection: Access denied for user

2012-11-02 Thread Karl Wright
Hi Nigel,

I'm not a MySQL expert, but I seem to recall there was something
interesting about the way MySQL authenticated remote connections.
There are two properties that the MySQL driver looks at:

  /** MySQL server property */
  public static final String mysqlServerProperty =
    "org.apache.manifoldcf.mysql.server";
  /** Source system name or IP */
  public static final String mysqlClientProperty =
    "org.apache.manifoldcf.mysql.client";


I think you may need to set both of these for the auth to succeed.
Also, make sure your MySQL server is configured to permit connections
from the source system you are trying to connect from.

Thanks,
Karl
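
For reference, setting both of those properties in properties.xml might look like the fragment below. This is a hedged sketch: the property names come from the driver constants quoted in this thread, but both host names are placeholders you would replace with your own:

```xml
<!-- Hypothetical properties.xml fragment: both MySQL driver properties set.
     mysql.example.com and crawler.example.com are placeholder host names. -->
<property name="org.apache.manifoldcf.mysql.server" value="mysql.example.com"/>
<property name="org.apache.manifoldcf.mysql.client" value="crawler.example.com"/>
```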

On Fri, Nov 2, 2012 at 11:21 AM, Nigel Thomas nigel.tho...@york.ac.uk wrote:
 Hello,

 I am having some problems configuring 1.0.1 to use a MySQL database, I
 have followed steps here :
 http://manifoldcf.apache.org/release/release-1.0.1/en_US/how-to-build-and-deploy.html#Configuring+a+MySQL+database

 I have set the following db related properties in properties.xml:

   <property name="org.apache.manifoldcf.databaseimplementationclass"
             value="org.apache.manifoldcf.core.database.DBInterfaceMySQL"/>
   <property name="org.apache.manifoldcf.mysql.server"
             value="mysql.example.com"/>
   <property name="org.apache.manifoldcf.dbsuperusername" value="root"/>
   <property name="org.apache.manifoldcf.dbsuperuserpassword"
             value="password"/>
   <property name="org.apache.manifoldcf.database.name" value="manfold_db"/>
   <property name="org.apache.manifoldcf.database.username" value="root"/>
   <property name="org.apache.manifoldcf.database.password" value="password"/>
   <property name="org.apache.manifoldcf.database.maxhandles" value="100"/>

 On running initialise.sh, the following exception is thrown:

  org.apache.manifoldcf.core.interfaces.ManifoldCFException: Error
 getting connection: Access denied for user 'root'@'%' to database
 'mysql'
 at 
 org.apache.manifoldcf.core.database.DBInterfaceMySQL.createUserAndDatabase(DBInterfaceMySQL.java:624)
 at 
 org.apache.manifoldcf.core.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:700)
 at 
 org.apache.manifoldcf.crawler.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:168)
 at 
 org.apache.manifoldcf.crawler.InitializeAndRegister.doExecute(InitializeAndRegister.java:37)
 at 
 org.apache.manifoldcf.crawler.InitializeAndRegister.main(InitializeAndRegister.java:60)

 I am able to connect to the MySQL instance using a command line MySQL
 client from the same machine using the same credentials, this rules
 out networking and credentials related issues.

 Am not sure what I am missing with the setup, I have tried the
 equivalent with a postgres setup, this seems to work just fine.

 Thanks,


 Nigel Thomas


Re: Problem with manifold

2012-11-05 Thread Karl Wright
Just reran the tests on the trunk version of the ManifoldCF solr 3.x
plugin - looked good:

[junit] Testsuite: org.apache.solr.mcf.ManifoldCFQParserPluginTest
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 10.56 sec
[junit]
[junit] - Standard Error -
[junit] WARNING: test class left thread running: Thread[MultiThreadedHttpConnectionManager cleanup,5,main]
[junit] RESOURCE LEAK: test class left 1 thread(s) running
[junit] -  ---
[junit] Testsuite: org.apache.solr.mcf.ManifoldCFSearchComponentTest
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.096 sec
[junit]
[junit] - Standard Error -
[junit] WARNING: test class left thread running: Thread[MultiThreadedHttpConnectionManager cleanup,5,main]
[junit] RESOURCE LEAK: test class left 1 thread(s) running
[junit] -  ---
[junit] Testsuite: org.apache.solr.mcf.ManifoldCFSCLoadTest
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 40.486 sec
[junit]
[junit] - Standard Output ---
[junit] Query time = 24352
[junit] -  ---
[junit] - Standard Error -
[junit] WARNING: test class left thread running: Thread[MultiThreadedHttpConnectionManager cleanup,5,main]
[junit] RESOURCE LEAK: test class left 1 thread(s) running
[junit] -  ---


The components that this test uses are simple:

<?xml version="1.0" ?>

<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!-- $Id: solrconfig-auth.xml 1176500 2011-09-27 18:19:59Z kwright $
     $Source$
     $Name$
  -->

<config>

  <luceneMatchVersion>${tests.luceneMatchVersion:LUCENE_CURRENT}</luceneMatchVersion>
  <jmx />

  <dataDir>${solr.data.dir:}</dataDir>

  <directoryFactory name="DirectoryFactory"
                    class="${solr.directoryFactory:solr.RAMDirectoryFactory}"/>

  <updateHandler class="solr.DirectUpdateHandler2">
  </updateHandler>

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

  <!-- test MCF Security Filter settings -->
  <searchComponent name="mcf-param"
                   class="org.apache.solr.mcf.ManifoldCFSearchComponent">
    <str name="AuthorityServiceBaseURL">http://localhost:8345/mcf-as</str>
    <int name="SocketTimeOut">3000</int>
    <str name="AllowAttributePrefix">aap-</str>
    <str name="DenyAttributePrefix">dap-</str>
  </searchComponent>

  <searchComponent name="mcf"
                   class="org.apache.solr.mcf.ManifoldCFSearchComponent">
  </searchComponent>

  <requestHandler name="/mcf" class="solr.SearchHandler" startup="lazy">
    <lst name="invariants">
      <bool name="mcf">true</bool>
    </lst>
    <lst name="defaults">
      <str name="echoParams">all</str>
    </lst>
    <arr name="components">
      <str>query</str>
      <str>mcf</str>
    </arr>
  </requestHandler>

</config>



On Mon, Nov 5, 2012 at 5:42 AM, Karl Wright daddy...@gmail.com wrote:
 No - I mean modifying ManifoldCFSearchComponent itself, and rebuilding
 the component yourself.  You can download the sources that correspond
 to the release from the ManifoldCF download page,
 http://manifoldcf.apache.org/en_US/download.html .

 Karl

 On Mon, Nov 5, 2012 at 4:13 AM, Gonzalez, Pablo
 pablo.gonzalez.do...@hp.com wrote:
 Hello,

 By 'modifying the component itself' do you mean to write a subclass of 
 ManifoldCFSearchComponent?

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: viernes, 02 de noviembre de 2012 14:47
 To: user@manifoldcf.apache.org
 Subject: Re: Problem with manifold

 If you don't get anywhere with the debug component, you can try modifying 
 the component itself to print the incoming query and the modified query.  
 You might also want to look at the ManifoldCF component tests, which create 
 a handler internally and executed successfully when the component was 
 released.  If you create a similar handler and that works, then you can try 
 to figure out what the differences are.

 Thanks,
 Karl

 On Fri, Nov 2, 2012 at 8:29 AM, Gonzalez, Pablo 
 pablo.gonzalez.do...@hp.com wrote:
 Well, it went wrong. I will crawl again just in case, and if it doesn't go 
 well, I will search on Internet about that debug component

RE: ManifoldCF 1.0.1 MySQL setup : Error getting connection: Access denied for user

2012-11-05 Thread Karl Wright

Hi Nigel,

Existence checking is already present, and must indeed work if the
system is to function properly in single user mode. It may be that you
simply need to provide a superuser with enough privs to do the checks.

Karl

Sent from my Windows Phone
From: Nigel Thomas
Sent: 11/5/2012 7:01 AM
To: user@manifoldcf.apache.org
Subject: Re: ManifoldCF 1.0.1 MySQL setup : Error getting connection:
Access denied for user
Hi Karl,

Thank you for the prompt response.

I had run a tcp dump on the connection to get more details on the
error, which is a MySQL error 42000 ('Access denied for user'), and
looking at the source code, the problem is not that it isn't
connecting to the database but that it doesn't have the privileges to
create the database and grant access to users. We run a shared mysql
services, where the user, database and privileges are granted
separately, and unfortunately super user access isn't permitted.

I guess in this context the initialise script will not work without
some modification to check if database and users already exist.

I have reverted to using postgres instance with full privileges for
the moment and may revisit the code later.

Thanks for the help.

Nigel Thomas


On 2 November 2012 15:35, Karl Wright daddy...@gmail.com wrote:
 Hi Nigel,

 I'm not a MySQL expert, but I seem to recall there was something
 interesting about the way MySQL authenticated remote connections.
 There are two properties that the MySQL driver looks at:

    /** MySQL server property */
    public static final String mysqlServerProperty =
      "org.apache.manifoldcf.mysql.server";
    /** Source system name or IP */
    public static final String mysqlClientProperty =
      "org.apache.manifoldcf.mysql.client";


 I think you may need to set both of these for the auth to succeed.
 Also, make sure your MySQL server is configured to permit connections
 from the source system you are trying to connect from.

 Thanks,
 Karl

 On Fri, Nov 2, 2012 at 11:21 AM, Nigel Thomas nigel.tho...@york.ac.uk wrote:
 Hello,

 I am having some problems configuring 1.0.1 to use a MySQL database, I
 have followed steps here :
 http://manifoldcf.apache.org/release/release-1.0.1/en_US/how-to-build-and-deploy.html#Configuring+a+MySQL+database

 I have set the following db related properties in properties.xml:

   <property name="org.apache.manifoldcf.databaseimplementationclass"
             value="org.apache.manifoldcf.core.database.DBInterfaceMySQL"/>
   <property name="org.apache.manifoldcf.mysql.server"
             value="mysql.example.com"/>
   <property name="org.apache.manifoldcf.dbsuperusername" value="root"/>
   <property name="org.apache.manifoldcf.dbsuperuserpassword"
             value="password"/>
   <property name="org.apache.manifoldcf.database.name" value="manfold_db"/>
   <property name="org.apache.manifoldcf.database.username" value="root"/>
   <property name="org.apache.manifoldcf.database.password" value="password"/>
   <property name="org.apache.manifoldcf.database.maxhandles" value="100"/>

 On running initialise.sh, the following exception is thrown:

  org.apache.manifoldcf.core.interfaces.ManifoldCFException: Error
 getting connection: Access denied for user 'root'@'%' to database
 'mysql'
 at 
 org.apache.manifoldcf.core.database.DBInterfaceMySQL.createUserAndDatabase(DBInterfaceMySQL.java:624)
 at 
 org.apache.manifoldcf.core.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:700)
 at 
 org.apache.manifoldcf.crawler.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:168)
 at 
 org.apache.manifoldcf.crawler.InitializeAndRegister.doExecute(InitializeAndRegister.java:37)
 at 
 org.apache.manifoldcf.crawler.InitializeAndRegister.main(InitializeAndRegister.java:60)

 I am able to connect to the MySQL instance using a command line MySQL
 client from the same machine using the same credentials, this rules
 out networking and credentials related issues.

 Am not sure what I am missing with the setup, I have tried the
 equivalent with a postgres setup, this seems to work just fine.

 Thanks,


 Nigel Thomas


Re: ManifoldCF 1.0.1 MySQL setup : Error getting connection: Access denied for user

2012-11-05 Thread Karl Wright
The check-for-existence logic is already there, and you can control
the superuser name and password.  But you can't control the instance
name, which is the MySQL root instance name, 'mysql'.

Karl

On Mon, Nov 5, 2012 at 7:00 AM, Nigel Thomas nigel.tho...@york.ac.uk wrote:
 Hi Karl,

 Thank you for the prompt response.

 I had run a tcp dump on the connection to get more details on the
 error, which is a mysql :42000 - Access denied for user error and
 looking at the source code, the problem is not that it isn't
 connecting to the database but, it doesn't have the privileges to
 create the database and grant access to users. We run a shared mysql
 services, where the user, database and privileges are granted
 separately, and unfortunately super user access isn't permitted.

 I guess in this context the initialise script will not work without
 some modification to check if database and users already exist.

 I have reverted to using postgres instance with full privileges for
 the moment and may revisit the code later.

 Thanks for the help.

 Nigel Thomas


 On 2 November 2012 15:35, Karl Wright daddy...@gmail.com wrote:
 Hi Nigel,

 I'm not a MySQL expert, but I seem to recall there was something
 interesting about the way MySQL authenticated remote connections.
 There are two properties that the MySQL driver looks at:

    /** MySQL server property */
    public static final String mysqlServerProperty =
      "org.apache.manifoldcf.mysql.server";
    /** Source system name or IP */
    public static final String mysqlClientProperty =
      "org.apache.manifoldcf.mysql.client";


 I think you may need to set both of these for the auth to succeed.
 Also, make sure your MySQL server is configured to permit connections
 from the source system you are trying to connect from.

 Thanks,
 Karl

 On Fri, Nov 2, 2012 at 11:21 AM, Nigel Thomas nigel.tho...@york.ac.uk 
 wrote:
 Hello,

 I am having some problems configuring 1.0.1 to use a MySQL database, I
 have followed steps here :
 http://manifoldcf.apache.org/release/release-1.0.1/en_US/how-to-build-and-deploy.html#Configuring+a+MySQL+database

 I have set the following db related properties in properties.xml:

  <property name="org.apache.manifoldcf.databaseimplementationclass"
            value="org.apache.manifoldcf.core.database.DBInterfaceMySQL"/>
  <property name="org.apache.manifoldcf.mysql.server"
            value="mysql.example.com"/>
  <property name="org.apache.manifoldcf.dbsuperusername" value="root"/>
  <property name="org.apache.manifoldcf.dbsuperuserpassword"
            value="password"/>
  <property name="org.apache.manifoldcf.database.name" value="manfold_db"/>
  <property name="org.apache.manifoldcf.database.username" value="root"/>
  <property name="org.apache.manifoldcf.database.password" value="password"/>
  <property name="org.apache.manifoldcf.database.maxhandles" value="100"/>

 On running initialise.sh, the following exception is thrown:

  org.apache.manifoldcf.core.interfaces.ManifoldCFException: Error
 getting connection: Access denied for user 'root'@'%' to database
 'mysql'
 at 
 org.apache.manifoldcf.core.database.DBInterfaceMySQL.createUserAndDatabase(DBInterfaceMySQL.java:624)
 at 
 org.apache.manifoldcf.core.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:700)
 at 
 org.apache.manifoldcf.crawler.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:168)
 at 
 org.apache.manifoldcf.crawler.InitializeAndRegister.doExecute(InitializeAndRegister.java:37)
 at 
 org.apache.manifoldcf.crawler.InitializeAndRegister.main(InitializeAndRegister.java:60)

 I am able to connect to the MySQL instance using a command line MySQL
 client from the same machine using the same credentials, this rules
 out networking and credentials related issues.

 Am not sure what I am missing with the setup, I have tried the
 equivalent with a postgres setup, this seems to work just fine.

 Thanks,


 Nigel Thomas


Re: Cannot connect to SharePoint 2010 instance

2012-11-06 Thread Karl Wright
I've seen situations where a SharePoint site is configured to perform
a redirection, and this is messing things up internally.  Does
your connection server name etc. match precisely the URL you see when
you are in the SharePoint user interface?

Karl

On Tue, Nov 6, 2012 at 8:47 AM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 Karl,

 After further review it appears the MCpermissions.asmx was installed globally 
 in SharePoint. I am able to access it from within my SharePoint site as well 
 as all other valid SharePoint sub-sites.
 So this connection http://server/sitepath/_vti_bin works with any valid 
 site in sitepath including the previously mentioned _admin site.

 That said do you have any thoughts on why I would be getting the 404 error?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 05, 2012 2:45 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 The 404 error indicates that your MCPermissions service is not properly 
 deployed.  The _admin in your path is a clue that something might not be 
 right.  The place you want to see the MCPermissions.asmx is in the following 
 location:

 http[s]://server/sitepath/_vti_bin

 ... where the server is your server name, and the sitepath is your site 
 path.  The best way to get this is to enter the SharePoint UI (NOT the admin 
 UI, but the SharePoint end-user UI), and log into the root site.  Then make 
 note of the URL in your browser.

 If the MCPermissions.asmx service appears under that URL, look at your IIS 
 settings and make sure that the MCPermissions.asmx service can be executed.

 Also, this may be of some help:
 https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections

 The end user documentation is also extremely helpful in describing how to 
 properly set up connections.

 You can uninstall the MCPermissions.asmx service using the .bat files that 
 are included with the plugin.  When you re-install, please make sure that you 
 are logged in as a user with full admin privileges, or the service will not 
 work properly.

 Thanks,
 Karl

 On Mon, Nov 5, 2012 at 2:33 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Hello,



 I have installed apache-manifoldcf-1.0.1 on my Linux server and
 apache-manifoldcf-sharepoint-2010-plugin-0.1-bin on my SharePoint 2010
 server.

 On my SharePoint server I can see the Permissions Page when I enter
 http://x:x/_admin/_vti_bin/MCPermissions.asmx in my browser.



 When I try to make a SharePoint Services 4.0 (2010) connection to my
 SharePoint 2010 server in the ManifoldCF interface I get this error.

 Got an unknown remote exception accessing site - axis fault = Client,
 detail = The request failed with HTTP status 404: Not Found.



 I can connect using SharePoint Services 2.0 (2003) but when I try a
 crawl it does not work properly and aborts.

 The  SharePoint Services 3.0 (2007) connection fails the same as the
 above
 2010 connection.



 Can you please give some direction on how best to resolve this issue.



 Thanks

 Bob





 Robert P. Iannetti



 Application Architect

 Novartis Institute for BioMedical Research

 186 Massachusetts Avenue

 Cambridge, MA 02139

 Phone: +1 (617) 871-5414

 robert.ianne...@novartis.com








Re: The Schedulers are not starting automatically

2012-11-06 Thread Karl Wright
Hi Anupam,

I'm having difficulty understanding what you posted here, but I will
try to explain the difference between rescan dynamically and scan
every document once.  You may find more help also in ManifoldCF in
Action, at http://www.manning.com/wright .

The first option causes your job to run forever.  The job runs only in
the schedule windows allotted for it.  It periodically discovers new
documents, and (depending on the crawling model of the connector) may
check for existence or modification of an already-crawled document.
Each document has its own schedule for doing this.

The second option is more likely to be what you want.  Each job
starts, runs, and completes, being sure to run only in the scheduling
windows you provide.  You then run it again, and again (or your job
schedule makes that happen).  It will do the minimal work to keep your
index up to date.

There are significant differences between how you would set up a job
using one model vs. the other.  I strongly suggest you read at least
the first few chapters of the book.

Karl

On Tue, Nov 6, 2012 at 12:35 PM, Anupam Bhattacharya
anupam...@gmail.com wrote:
 My incremental indexing was working previously, but I have messed up a few
 settings, due to which the documents indexed the previous day get deleted
 and only the new ones show up. I suspect that it is due to the settings in
 "List all Jobs" > "Edit selected job" > "Scheduling" > "Schedule type":
 "Rescan documents dynamically" OR "Scan every document once"? Please let me
 know the appropriate settings to index only the new documents in the
 repository.

 After deleting the Solr index data folder and clearing the table records
 in jobqueue, repohistory, and ingeststatus, I found that ManifoldCF scans
 only the remaining new documents, until I go to "List Output Connections",
 click "View" for a Solr connection, and click OK on "Re-ingest all
 associated documents". How does it keep track of which documents were
 ingested previously and then fetch only the new ones?

 Regards
 Anupam


 On Tue, Aug 14, 2012 at 10:01 AM, Anupam Bhattacharya anupam...@gmail.com
 wrote:

 Thanks..

 There is an option to set the Start Method in the Connection tab of the Job
 settings. I changed it to "Start when the schedule window starts" and the
 problem got resolved.

 Regards
 Anupam


 On Thu, Aug 2, 2012 at 10:59 PM, Karl Wright daddy...@gmail.com wrote:

 The incremental will work the same whether the job is run manually or
 started automatically.

 If you have added the appropriate schedule record to your job, you
 also have to select the run job automatically radio button on one of
 the other job tabs for automatic runs to take place.  I suspect that
 is what you are missing.

 Karl

 On Thu, Aug 2, 2012 at 1:12 PM, Anupam Bhattacharya anupam...@gmail.com
 wrote:
  I have a job which indexes properly, even incrementally, when
  initiated/run manually. But even after adding a specific time to run,
  the scheduler process does not start the job on its own.
 
  What is the ideal configuration for a job which runs automatically
  every day at 12 am and does an incremental re-index (only looking at
  documents which are new OR modified after the last crawl) of the
  repository?
 
  Is it necessary to input/give the total run time details when adding a
  specific schedule time?
 
  Regards
  Anupam





 --
 Thanks  Regards
 Anupam Bhattacharya




Re: Cannot connect to SharePoint 2010 instance

2012-11-06 Thread Karl Wright
Yes, this can be somewhat tricky.  There are a lot of potential
configurations that could affect this.

First, you want to verify that your IIS is using NTLM authentication,
and that all the web services directories are executable.  This is
critical.

Second, the credentials, in the form of domain\user, may be sensitive
to whether you use a fully-qualified domain name or a shortcut domain
name, e.g. mydomain.novartis.com or just mydomain.  I suggest you try
some combinations.  The other thing you may want to check is whether
the machine you are running ManifoldCF on is known by your domain
controller; you may not be able to authenticate if it is not.

If this doesn't help, and you want to eliminate ManifoldCF's NTLM
implementation from the list of possibilities, I suggest downloading
the curl utility, and trying to fetch a web service listing or wsdl
using it (specifying NTLM of course as the authentication method).  If
that also doesn't work, it's a server-side configuration problem of
some kind.

You can also refer to the server-side IIS logs for some additional
info.  But I've found these are not very helpful for authentication
issues.

Let me know if you are still stuck after this; there are other
diagnostics available but they start to get ugly.

Karl
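
The curl check suggested above could look something like the following. This is a hedged sketch: the server, site path, domain, and user name are all placeholders, and `--ntlm` selects NTLM as the authentication method as the message recommends:

```shell
# Hypothetical example: fetch the MCPermissions wsdl via NTLM outside
# of ManifoldCF.  Replace server, sitepath, MYDOMAIN, and myuser with
# real values; curl will prompt for the password.
curl --ntlm -u 'MYDOMAIN\myuser' \
  'http://server/sitepath/_vti_bin/MCPermissions.asmx?wsdl'
# A 200 response containing WSDL XML means NTLM auth works independently
# of ManifoldCF; a 401 points at a server-side configuration problem.
```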

On Tue, Nov 6, 2012 at 2:35 PM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 Karl,

 I turned on the additional debugging and was able to resolve the 404 issue.

 Now I am getting:
 Crawl user did not authenticate properly, or has insufficient permissions to 
 access http://.xxx.xxx: (401)Unauthorized

 I can log into the SharePoint site from the browser using the same 
 credentials.


 Any Thoughts?

 Thanks
 Bob

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 10:05 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Well, you can turn on httpclient wire debugging, as I believe is described in 
 the article URL I sent you before, and then you can see precisely what URL 
 the connector is trying to reach when it accesses the MCPermissions service.

 There's no magic here.  If the connector gets a 404 error back from IIS, 
 either its URL is wrong, or IIS has decided it's not going to serve that page 
 to the client.

 Karl


 On Tue, Nov 6, 2012 at 8:58 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Yes, The URL and what I enter in the ManifoldCF interface are a match.

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 8:52 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 I've seen situations where a SharePoint site is configured to perform a 
 redirection, and this is messing things up internally.  Does your 
 connection server name etc. match precisely the URL you see when you are in 
 the SharePoint user interface?

 Karl

 On Tue, Nov 6, 2012 at 8:47 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 After further review it appears the MCpermissions.asmx was installed 
 globally in SharePoint. I am able to access it from within my SharePoint 
 site as well as all other valid SharePoint sub-sites.
 So this connection http://server/sitepath/_vti_bin works with any valid 
 site in sitepath including the previously mentioned _admin site.

 That said do you have any thoughts on why I would be getting the 404 error?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 05, 2012 2:45 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 The 404 error indicates that your MCPermissions service is not properly 
 deployed.  The _admin in your path is a clue that something might not be 
 right.  The place you want to see the MCPermissions.asmx is in the 
 following location:

 http[s]://server/sitepath/_vti_bin

 ... where the server is your server name, and the sitepath is your site 
 path.  The best way to get this is to enter the SharePoint UI (NOT the 
 admin UI, but the SharePoint end-user UI), and log into the root site.  
 Then make note of the URL in your browser.

 If the MCPermissions.asmx service appears under that URL, look at your IIS 
 settings and make sure that the MCPermissions.asmx service can be executed.
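As a quick sanity check, the expected service URL can be assembled from the server name and site path taken from the SharePoint end-user UI. A minimal sketch (the server and site path values below are placeholders, not from this thread):

```python
def mcpermissions_url(server, sitepath, https=False):
    """Build the expected MCPermissions service URL for a SharePoint site.

    `server` and `sitepath` are placeholders -- substitute the values you
    see in the browser URL when logged into the SharePoint end-user UI.
    """
    scheme = "https" if https else "http"
    path = sitepath.strip("/")
    base = f"{scheme}://{server}/{path}" if path else f"{scheme}://{server}"
    return f"{base}/_vti_bin/MCPermissions.asmx"

# Example with placeholder values:
print(mcpermissions_url("sharepoint.example.com", "/sites/intranet"))
```

You can then point a browser (or curl with NTLM credentials) at the printed URL to confirm the service responds rather than returning a 404.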

 Also, this may be of some help:
 https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Conn
 e
 ctions

 The end user documentation is also extremely helpful in describing how to 
 properly set up connections.

 You can uninstall the MCPermissions.asmx service using the .bat files that 
 are included with the plugin.  When you re-install, please make sure that 
 you are logged in as a user with full admin privileges, or the service will 
 not work properly.

 Thanks,
 Karl

 On Mon, Nov 5, 2012 at 2:33 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Hello,



 I have

Re: Cannot connect to SharePoint 2010 instance

2012-11-06 Thread Karl Wright
No, Kerberos is not supported.  This is a limitation of the Apache
commons-httpclient library that we use for communicating with
SharePoint.

It is possible to set up IIS to serve a different port with different
authentication that goes to the same SharePoint instance but is NTLM
protected, not Kerberos protected.  Perhaps you can do this and limit
access to that port to only the ManifoldCF machine.

Karl

On Tue, Nov 6, 2012 at 3:03 PM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 Karl,

 Our SharePoint sites use Kerberos authentication is this supported in 
 ManifoldCF?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 2:50 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Yes, this can be somewhat tricky.  There are a lot of potential 
 configurations that could affect this.

 First, you want to verify that your IIS is using NTLM authentication, and 
 that all the web services directories are executable.  This is critical.

 Second, the credentials, in the form of domain\user, may be sensitive to 
 whether you use a fully-qualified domain name or a shortcut domain name, e.g. 
 mydomain.novartis.com or just mydomain.  I suggest you try some combinations. 
  The other thing you may want to check is whether the machine you are running 
 ManifoldCF on is known by your domain controller; you may not be able to 
 authenticate if it is not.

 If this doesn't help, and you want to eliminate ManifoldCF's NTLM 
 implementation from the list of possibilities, I suggest downloading the 
 curl utility, and trying to fetch a web service listing or wsdl using it 
 (specifying NTLM of course as the authentication method).  If that also 
 doesn't work, it's a server-side configuration problem of some kind.
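The suggested curl diagnostic can be sketched as follows; all host, domain, user, and password values here are hypothetical placeholders, and the snippet only assembles the command rather than running it:

```python
def ntlm_curl_command(url, domain, user, password):
    # Assemble (but do not run) a curl invocation that fetches the
    # service WSDL using NTLM authentication; NTLM credentials take
    # the DOMAIN\user form.
    return ["curl", "--ntlm", "-u", f"{domain}\\{user}:{password}", url + "?wsdl"]

cmd = ntlm_curl_command(
    "http://sharepoint.example.com/_vti_bin/Permissions.asmx",
    "MYDOMAIN", "crawluser", "secret")
print(" ".join(cmd))
# If the short domain fails, retry with the fully-qualified form
# (e.g. mydomain.example.com) in place of MYDOMAIN.
```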

 You can also refer to the server-side IIS logs for some additional info.  But 
 I've found these are not very helpful for authentication issues.

 Let me know if you are still stuck after this; there are other diagnostics 
 available but they start to get ugly.

 Karl

 On Tue, Nov 6, 2012 at 2:35 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 I turned on the additional debugging and was able to resolve the 404 issue.

 Now I am getting:
 Crawl user did not authenticate properly, or has insufficient
 permissions to access http://.xxx.xxx: (401)Unauthorized

 I can log into the SharePoint site from the browser using the same 
 credentials.


 Any Thoughts?

 Thanks
 Bob

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 10:05 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Well, you can turn on httpclient wire debugging, as I believe is described 
 in the article URL I sent you before, and then you can see precisely what 
 URL the connector is trying to reach when it accesses the MCPermissions 
 service.

 There's no magic here.  If the connector gets a 404 error back from IIS, 
 either its URL is wrong, or IIS has decided it's not going to serve that 
 page to the client.

 Karl


 On Tue, Nov 6, 2012 at 8:58 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Yes, The URL and what I enter in the ManifoldCF interface are a match.

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 8:52 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 I've seen situations where a SharePoint site is configured to perform a 
 redirection, and this is messing things up internally.  Does your 
 connection server name etc. match precisely the URL you see when you are in 
 the SharePoint user interface?

 Karl

 On Tue, Nov 6, 2012 at 8:47 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 After further review it appears the MCpermissions.asmx was installed 
 globally in SharePoint. I am able to access it from within my SharePoint 
 site as well as all other valid SharePoint sub-sites.
 So this connection http://server/sitepath/_vti_bin works with any 
 valid site in sitepath including the previously mentioned _admin site.

 That said do you have any thoughts on why I would be getting the 404 error?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 05, 2012 2:45 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 The 404 error indicates that your MCPermissions service is not properly 
 deployed.  The _admin in your path is a clue that something might not be 
 right.  The place you want to see the MCPermissions.asmx is in the 
 following location:

 http[s]://server/sitepath/_vti_bin

 ... where the server is your server name, and the sitepath is your 
 site path.  The best way to get this is to enter the SharePoint UI (NOT 
 the admin UI

Re: Cannot connect to SharePoint 2010 instance

2012-11-06 Thread Karl Wright
Hi Bob,

The only products I know of have similar limitations.  The only one I
know of is the SharePoint Google appliance connector, which, when I looked
last, had exactly the same restriction.  It also has other limitations,
some severe, such as limiting the number of documents you can crawl to
no more than 5000 per library.

We are willing to do a reasonable amount of work to upgrade ManifoldCF
to be able to support Kerberos.  Here's a link which describes the
situation:

http://old.nabble.com/Support-for-Kerberos-SPNEGO-td14564857.html

We currently use a significantly-patched version of 3.1, which
supplied the NTLM implementation for 4.0 that is currently in use.
Our issue is similar to the commons-httpclient team's: we
have no good way of testing all of this, and none of us are security
protocol experts.  If you have (or know somebody with) such expertise,
who would be willing/able to donate their time, this problem could be
tackled I think without too much pain.  So at least httpclient, given
the right tickets, would be able to connect.

The other issue with Kerberos auth is that I believe it will require a
significant amount of work to allow anything using it to obtain the
tickets from the AD domain controller.  This would obviously require
UI work for all connectors that would support Kerberos.  But that is
something I am willing to attempt if everything else is in place.

Karl


On Tue, Nov 6, 2012 at 3:11 PM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 Karl,

 If this is not possible can you recommend any other products to crawl 
 SharePoint content and index it in Solr?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 3:10 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 No, Kerberos is not supported.  This is a limitation of the Apache 
 commons-httpclient library that we use for communicating with SharePoint.

 It is possible to set up IIS to serve a different port with different 
 authentication that goes to the same SharePoint instance but is NTLM 
 protected, not Kerberos protected.  Perhaps you can do this and limit access 
 to that port to only the ManifoldCF machine.

 Karl

 On Tue, Nov 6, 2012 at 3:03 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 Our SharePoint sites use Kerberos authentication is this supported in 
 ManifoldCF?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 2:50 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Yes, this can be somewhat tricky.  There are a lot of potential 
 configurations that could affect this.

 First, you want to verify that your IIS is using NTLM authentication, and 
 that all the web services directories are executable.  This is critical.

 Second, the credentials, in the form of domain\user, may be sensitive to 
 whether you use a fully-qualified domain name or a shortcut domain name, 
 e.g. mydomain.novartis.com or just mydomain.  I suggest you try some 
 combinations.  The other thing you may want to check is whether the machine 
 you are running ManifoldCF on is known by your domain controller; you may 
 not be able to authenticate if it is not.

 If this doesn't help, and you want to eliminate ManifoldCF's NTLM 
 implementation from the list of possibilities, I suggest downloading the 
 curl utility, and trying to fetch a web service listing or wsdl using it 
 (specifying NTLM of course as the authentication method).  If that also 
 doesn't work, it's a server-side configuration problem of some kind.

 You can also refer to the server-side IIS logs for some additional info.  
 But I've found these are not very helpful for authentication issues.

 Let me know if you are still stuck after this; there are other diagnostics 
 available but they start to get ugly.

 Karl

 On Tue, Nov 6, 2012 at 2:35 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 I turned on the additional debugging and was able to resolve the 404 issue.

 Now I am getting:
 Crawl user did not authenticate properly, or has insufficient
 permissions to access http://.xxx.xxx: (401)Unauthorized

 I can log into the SharePoint site from the browser using the same 
 credentials.


 Any Thoughts?

 Thanks
 Bob

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 10:05 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Well, you can turn on httpclient wire debugging, as I believe is described 
 in the article URL I sent you before, and then you can see precisely what 
 URL the connector is trying to reach when it accesses the MCPermissions 
 service.

 There's no magic here.  If the connector gets a 404 error back from IIS, 
 either its URL is wrong, or IIS has decided it's not going

Re: Cannot connect to SharePoint 2010 instance

2012-11-06 Thread Karl Wright
Hi Bob,

That depends very strongly on whether SharePoint 2013 continues the
Microsoft tradition of breaking web services that used to work. :-)

Seriously, we need three things to develop a SharePoint 2013 solution:
(1) A stable release (a beta is not sufficient because Microsoft is
famous for changing things in a major way between beta and release);
(2) a benevolent client with sufficient patience to try things out
that we develop in their environment, and (3) enough time so that
we're not on the bleeding edge and that other people have run into
most of the sticky problems first.  We're volunteers here and we all
have day jobs, so we mostly can't afford to be pounding away at brick
walls on our own.

It could be the case that everything just works, in which case the
development is trivial.  We'll have to see.

Karl

On Tue, Nov 6, 2012 at 3:37 PM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 Karl,

 On another topic is there a roadmap for supporting SharePoint 2013 ?
 We are in the process of migrating and were wondering when your ManifoldCF 
 product would be available to support it.

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 3:34 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Hi Bob,

 The only products I know of have similar limitations.  The only one I know of 
 is the SharePoint Google appliance connector, which, when I looked last, had 
 exactly the same restriction.  It also has other limitations, some severe, 
 such as limiting the number of documents you can crawl to no more than 5000 
 per library.

 We are willing to do a reasonable amount of work to upgrade ManifoldCF to be 
 able to support Kerberos.  Here's a link which describes the
 situation:

 http://old.nabble.com/Support-for-Kerberos-SPNEGO-td14564857.html

 We currently use a significantly-patched version of 3.1, which supplied the 
 NTLM implementation for 4.0 that is currently in use.
 Our issue is similar to the commons-httpclient team's, which is we have no 
 good way of testing all of this, and none of us are security protocol 
 experts.  If you have (or know somebody with) such expertise, who would be 
 willing/able to donate their time, this problem could be tackled I think 
 without too much pain.  So at least httpclient, given the right tickets, 
 would be able to connect.

 The other issue with Kerberos auth is that I believe it will require a 
 significant amount of work to allow anything using it to obtain the tickets 
 from the AD domain controller.  This would obviously require UI work for all 
 connectors that would support Kerberos.  But that is something I am willing 
 to attempt if everything else is in place.

 Karl


 On Tue, Nov 6, 2012 at 3:11 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 If this is not possible can you recommend any other products to crawl 
 SharePoint content and index it in Solr?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 3:10 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 No, Kerberos is not supported.  This is a limitation of the Apache 
 commons-httpclient library that we use for communicating with SharePoint.

 It is possible to set up IIS to serve a different port with different 
 authentication that goes to the same SharePoint instance but is NTLM 
 protected, not Kerberos protected.  Perhaps you can do this and limit access 
 to that port to only the ManifoldCF machine.

 Karl

 On Tue, Nov 6, 2012 at 3:03 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 Our SharePoint sites use Kerberos authentication is this supported in 
 ManifoldCF?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 2:50 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Yes, this can be somewhat tricky.  There are a lot of potential 
 configurations that could affect this.

 First, you want to verify that your IIS is using NTLM authentication, and 
 that all the web services directories are executable.  This is critical.

 Second, the credentials, in the form of domain\user, may be sensitive to 
 whether you use a fully-qualified domain name or a shortcut domain name, 
 e.g. mydomain.novartis.com or just mydomain.  I suggest you try some 
 combinations.  The other thing you may want to check is whether the machine 
 you are running ManifoldCF on is known by your domain controller; you may 
 not be able to authenticate if it is not.

 If this doesn't help, and you want to eliminate ManifoldCF's NTLM 
 implementation from the list of possibilities, I suggest downloading the 
 curl utility, and trying to fetch a web service listing or wsdl using it 
 (specifying NTLM of course as the authentication method

Re: Cannot connect to SharePoint 2010 instance

2012-11-06 Thread Karl Wright
If you want, we can create a ticket to cover SharePoint 2013 work.  If
you want to attempt a sanity check, if you email me (personally, to
daddy...@gmail.com) the Microsoft.SharePoint.dll I can set up a
ManifoldCF-Sharepoint-2013 plugin.  If I can build that, then the next
step would be just trying it all out and seeing where it fails.

Karl

On Tue, Nov 6, 2012 at 3:49 PM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 Karl,

 That sounds reasonable. I am having my SP Admin set up the NTLM SharePoint 
 instance described below; I will let you know how it works.

 BTW SP 2013 RTM has been released so we can cross #1 off the list :)

 Thanks
 Bob

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 3:47 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Hi Bob,

 That depends very strongly on whether SharePoint 2013 continues the Microsoft 
 tradition of breaking web services that used to work. :-)

 Seriously, we need three things to develop a SharePoint 2013 solution:
 (1) A stable release (a beta is not sufficient because Microsoft is famous 
 for changing things in a major way between beta and release);
 (2) a benevolent client with sufficient patience to try things out that we 
 develop in their environment, and (3) enough time so that we're not on the 
 bleeding edge and that other people have run into most of the sticky problems 
 first.  We're volunteers here and we all have day jobs, so we mostly can't 
 afford to be pounding away at brick walls on our own.

 It could be the case that everything just works, in which case the 
 development is trivial.  We'll have to see.

 Karl

 On Tue, Nov 6, 2012 at 3:37 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 On another topic is there a roadmap for supporting SharePoint 2013 ?
 We are in the process of migrating and were wondering when your ManifoldCF 
 product would be available to support it.

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 3:34 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Hi Bob,

 The only products I know of have similar limitations.  The only one I know of 
 is the SharePoint Google appliance connector, which, when I looked last, had 
 exactly the same restriction.  It also has other limitations, some severe, 
 such as limiting the number of documents you can crawl to no more than 5000 
 per library.

 We are willing to do a reasonable amount of work to upgrade ManifoldCF
 to be able to support Kerberos.  Here's a link which describes the
 situation:

 http://old.nabble.com/Support-for-Kerberos-SPNEGO-td14564857.html

 We currently use a significantly-patched version of 3.1, which supplied the 
 NTLM implementation for 4.0 that is currently in use.
 Our issue is similar to the commons-httpclient team's, which is we have no 
 good way of testing all of this, and none of us are security protocol 
 experts.  If you have (or know somebody with) such expertise, who would be 
 willing/able to donate their time, this problem could be tackled I think 
 without too much pain.  So at least httpclient, given the right tickets, 
 would be able to connect.

 The other issue with Kerberos auth is that I believe it will require a 
 significant amount of work to allow anything using it to obtain the tickets 
 from the AD domain controller.  This would obviously require UI work for all 
 connectors that would support Kerberos.  But that is something I am willing 
 to attempt if everything else is in place.

 Karl


 On Tue, Nov 6, 2012 at 3:11 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 If this is not possible can you recommend any other products to crawl 
 SharePoint content and index it in Solr?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 3:10 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 No, Kerberos is not supported.  This is a limitation of the Apache 
 commons-httpclient library that we use for communicating with SharePoint.

 It is possible to set up IIS to serve a different port with different 
 authentication that goes to the same SharePoint instance but is NTLM 
 protected, not Kerberos protected.  Perhaps you can do this and limit 
 access to that port to only the ManifoldCF machine.

 Karl

 On Tue, Nov 6, 2012 at 3:03 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 Our SharePoint sites use Kerberos authentication is this supported in 
 ManifoldCF?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 06, 2012 2:50 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Yes, this can be somewhat tricky.  There are a lot

Re: Problem with manifold

2012-11-07 Thread Karl Wright
-2050932820-
  
 -deny_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-
  
 allow_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-513
  
 -deny_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-513
  
 allow_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1113
  
 -deny_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1113
  
 allow_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1110
  
 -deny_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1110
  
 allow_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1107
  
 -deny_token_document:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1107
  allow_token_document:active_dir:S-1-1-0 
 -deny_token_document:active_dir:S-1-1-0
  allow_token_document:ad:S-1-5-32-545 -deny_token_document:ad:S-1-5-32-545
  allow_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820- 
 -deny_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-
  allow_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-513 
 -deny_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-513
  allow_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1113 
 -deny_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1113
  allow_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1110 
 -deny_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1110
  allow_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1107 
 -deny_token_document:ad:S-1-5-21-2039231098-2614715072-2050932820-1107
  allow_token_document:ad:S-1-1-0 -deny_token_document:ad:S-1-1-0)

 This is the _document security chunk of the BooleanQuery (quoting all the 
 SIDs with double quotes so it doesn't treat active_dir as a field name just 
 because it has a : after it). The query gives the expected results.

 Thinking about it, the truth is that when we configured our security policies 
 by means of ActiveDirectory we did not take into consideration share-level 
 policies. Our users are authenticated only at a document level. Anyway, I 
 don't think this gives us any clue on why my handler isn't working.

 But, now I could modify my own component to take care  of the _document-level 
 security alone, forgetting about the _share-level. I think it would work and 
 that's what I will try for now, but I seriously think there must be another 
 way to do it, so if this data makes you have any idea please let me know.

 I will anyway tell you whether it worked or not.

 Thanks,

 Pablo


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: lunes, 05 de noviembre de 2012 11:57
 To: user@manifoldcf.apache.org
 Subject: Re: Problem with manifold

 Just reran the tests on the trunk version of the ManifoldCF solr 3.x plugin - 
 looked good:

 [junit] Testsuite: org.apache.solr.mcf.ManifoldCFQParserPluginTest
 [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 10.56 sec
 [junit]
 [junit] - Standard Error -
 [junit] WARNING: test class left thread running: 
  Thread[MultiThreadedHttpConnectionManager cleanup,5,main]
 [junit] RESOURCE LEAK: test class left 1 thread(s) running
 [junit] -  ---
 [junit] Testsuite: org.apache.solr.mcf.ManifoldCFSearchComponentTest
 [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.096 sec
 [junit]
 [junit] - Standard Error -
 [junit] WARNING: test class left thread running: 
  Thread[MultiThreadedHttpConnectionManager cleanup,5,main]
 [junit] RESOURCE LEAK: test class left 1 thread(s) running
 [junit] -  ---
 [junit] Testsuite: org.apache.solr.mcf.ManifoldCFSCLoadTest
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 40.486 sec
 [junit]
 [junit] - Standard Output ---
 [junit] Query time = 24352
 [junit] -  ---
 [junit] - Standard Error -
 [junit] WARNING: test class left thread running: 
  Thread[MultiThreadedHttpConnectionManager cleanup,5,main]
 [junit] RESOURCE LEAK: test class left 1 thread(s) running
 [junit] -  ---


 The components that this test uses are simple:

  <?xml version="1.0" ?>

  <!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed

RE: Problem with manifold

2012-11-07 Thread Karl Wright
Hi Pablo,

Yes, I don't think you included the schema before. Having a default of
__nosecurity__ is critical. Were the instructions unclear?

And yes, this is safe, because all that it does is effectively
guarantee that solr fields without any value get one that can be
queried on.
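A minimal sketch of why this is safe (the real plugin builds a Lucene BooleanQuery, but per security level the effect is roughly the following; the SID values are hypothetical):

```python
NOSEC = "__nosecurity__"

def level_allows(user_tokens, allow, deny):
    # A level carrying only the __nosecurity__ default imposes no
    # restriction at all, so it cannot lock anyone out.
    if allow == {NOSEC} and deny == {NOSEC}:
        return True
    # Otherwise the user needs a matching allow token and no deny token.
    return bool(user_tokens & allow) and not (user_tokens & deny)

def visible(user_tokens, share, document):
    # Both the share level and the document level must pass.
    return level_allows(user_tokens, *share) and level_allows(user_tokens, *document)

# Document with no share tokens (defaults applied) but real document ACLs:
share = ({NOSEC}, {NOSEC})
document = ({"ad:S-1-5-32-545"}, {"ad:S-1-5-21-999"})   # hypothetical SIDs
print(visible({"ad:S-1-5-32-545"}, share, document))    # prints True
```

A missing share-level ACL thus degenerates to "no restriction", and only the document-level tokens decide access.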

Karl

Sent from my Windows Phone
From: Gonzalez, Pablo
Sent: 11/7/2012 6:08 AM
To: user@manifoldcf.apache.org
Subject: RE: Problem with manifold
Well, I did two things:
-first I did what I told you in the last message: I changed my
component only to care about the document-level security, and that way
the query worked
-then I realized that the documents that I indexed only had _document
tokens, not _share tokens at all. THAT is the real problem. So, what I
did was to change the definition of the fields in this way:
   <field name="allow_token_document" type="string" indexed="true"
stored="true" multiValued="true" required="true"
default="__nosecurity__"/>
   <field name="deny_token_document" type="string" indexed="true"
stored="true" multiValued="true" required="true"
default="__nosecurity__"/>
   <field name="allow_token_share" type="string" indexed="true"
stored="true" multiValued="true" required="true"
default="__nosecurity__"/>
   <field name="deny_token_share" type="string" indexed="true"
stored="true" multiValued="true" required="true"
default="__nosecurity__"/>

Then I used the default /select handler and it worked. But my question
is: is this safe? What I think this means is: if the document that I'm
indexing has no share security restrictions, then set it to no
security and let the user access it only if its document-level
policies allow him to do so.

Thinking about the system-not-indexing-share-tokens issue, I am
wondering what could be the cause. Maybe it is an error that I have in
my manifold or solr configurations that strips all the share tokens,
or perhaps we should do something at the machine that contains the
documents that we are indexing, to configure share-level security as
we did at the document level.

-Original Message-
From: Karl Wright [mailto:daddy...@gmail.com]
Sent: miércoles, 07 de noviembre de 2012 11:42
To: user@manifoldcf.apache.org
Subject: Re: Problem with manifold

So, can you look at one document, and tell me what the allow and deny
tokens are for both document and share levels?

Just taking the share part of the clause away means that you will be
allowing people to see search results when they cannot see within the
corresponding Windows share (according to Active Directory).  I'm
hoping that you are just crawling through a different share than the
one your users use to access the document.  But in any case the URLs
that are indexed will also not work to reach the files in question
because the share restrictions.

Karl

On Wed, Nov 7, 2012 at 4:20 AM, Gonzalez, Pablo
pablo.gonzalez.do...@hp.com wrote:
 Hello Karl, this is what I've done:
 -I've modified the class so that it prints out the BooleanQuery that it 
 creates.
 -I've rerun the query (with my handler), and this is what it pumps out:

 +((+allow_token_share:__nosecurity__ +deny_token_share:__nosecurity__)
  allow_token_share:active_dir:S-1-5-32-545
 -deny_token_share:active_dir:S-1-5-32-545

 allow_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820
 -
 -deny_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820
 -

 allow_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820
 -513
 -deny_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820
 -513

 allow_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820
 -1113
 -deny_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820
 -1113

 allow_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820
 -1110
 -deny_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820
 -1110

 allow_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820
 -1107
 -deny_token_share:active_dir:S-1-5-21-2039231098-2614715072-2050932820
 -1107
  allow_token_share:active_dir:S-1-1-0
 -deny_token_share:active_dir:S-1-1-0
  allow_token_share:ad:S-1-5-32-545 -deny_token_share:ad:S-1-5-32-545
  allow_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-
 -deny_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-
  allow_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-513
 -deny_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-513
  allow_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1113
 -deny_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1113
  allow_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1110
 -deny_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1110
  allow_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1107
 -deny_token_share:ad:S-1-5-21-2039231098-2614715072-2050932820-1107
  allow_token_share:ad:S-1-1-0 -deny_token_share:ad:S-1-1-0)
 +((+allow_token_document:__nosecurity__
 +deny_token_document:__nosecurity__)
  allow_token_document:active_dir:S-1-5-32-545

Re: value of DATACOLUMN

2012-11-12 Thread Karl Wright
Since it works on 3.6, it is definitely not a JDBC issue.

What content-type are you referring to?  The Solr connector does not
change what content type it posts based on the version of Solr it is
posting to, so the content-type you are talking about sounds like
something Solr is detecting rather than receiving.  Can you confirm?

Karl

On Mon, Nov 12, 2012 at 8:40 AM, Shinichiro Abe
shinichiro.ab...@gmail.com wrote:
 Thank you for the reply.
 I tried setting -Dfile.encoding, but it didn't resolve the issue.
 What I can tell now is that the euro symbol in DATACOLUMN
 can be indexed on Solr 3.6 but cannot be indexed on Solr 4.0.
 On Solr 3.6 the content-type was text/plain; on Solr 4.0 it was
 application/octet-stream.
 Is this Solr's issue, not database's encoding?

 On 2012/11/12, at 20:36, Karl Wright wrote:

 It looks like the Postgresql JDBC driver sets the encoding itself,
 from what I can find.  So I would guess that it is setting the
 character encoding based on the database you are connected to.  So if
 the euro symbol is not handled by the database's encoding, there would
 be no way to include it in the query string.  I think...


 Karl

 On Mon, Nov 12, 2012 at 6:22 AM, Karl Wright daddy...@gmail.com wrote:
 To clarify, we pass every string to the JDBC driver as a unicode
 string, but it is up to the JDBC driver to decide how to interpret it.
 I don't know what exactly the PostgreSQL 9.1 driver does here.  It
 would be interesting to see what is posted to Solr, if you have those
 logs.  It may be that it is picking an encoding that is based on your
 machine's default encoding, which would be unfortunate.

 This page apparently indicates that there is a way to set the
 encoding that JDBC uses to communicate with the database:

 http://stackoverflow.com/questions/3040597/jdbc-character-encoding

 I don't know if this is applicable to us at all though.  You can try:

 java -Dfile.encoding=utf8 start.jar

 ...and see if that changes things - it would be a good hint.

 Karl


 On Mon, Nov 12, 2012 at 6:12 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Abe-san,

 Quoted strings in SQL queries are not necessarily unicode.  See this
 page for details:

 http://www.postgresql.org/docs/7.3/static/functions-string.html

 There is nothing you can do in JDBC invocations to control character
 set.  This must be done in the query itself, or in the database
 itself.

 Karl

 On Mon, Nov 12, 2012 at 6:03 AM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Hi,

 I'm using Solr 4.0 and JDBC connection via PostgreSQL.
 The dataQuery is configured below:

 SELECT idfield AS $(IDCOLUMN), 'http://server?id=' || idfield AS 
 $(URLCOLUMN), '12345' AS $(DATACOLUMN) FROM album WHERE idfield IN 
 $(IDLIST)

 On the Solr side, '12345' was able to be indexed and stored.

 But when a non-ASCII character was configured,

 SELECT idfield AS $(IDCOLUMN), 'http://server?id=' || idfield AS 
 $(URLCOLUMN), '€€€' AS $(DATACOLUMN) FROM album WHERE idfield IN $(IDLIST)

 On the Solr side, '€€€' was not indexed or stored.

 Actually, I configure a column which contains non-ASCII characters as 
 DATACOLUMN.
 It seems the content-type differs between them.
 Can the JDBC connection control the content-type?

 Regards,
 Shinichiro Abe
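Karl's encoding hint can be illustrated with a small sketch (plain Python, purely illustrative; ManifoldCF itself is Java and none of this is its code): whether the euro sign survives depends on the character encoding used along the way. UTF-8 can represent U+20AC, while a single-byte encoding such as Latin-1 cannot, which is the kind of lossy platform default suspected in this thread.

```python
# Illustrative sketch (not ManifoldCF code): the euro sign U+20AC
# round-trips through UTF-8 but cannot be represented in Latin-1,
# which is the kind of lossy default encoding suspected here.
text = "€€€"

utf8 = text.encode("utf-8")
assert utf8.decode("utf-8") == text        # lossless round trip

try:
    text.encode("latin-1")                 # Latin-1 has no euro sign
except UnicodeEncodeError:
    lossy = text.encode("latin-1", errors="replace")

print(lossy)  # b'???' - the content that would reach Solr is garbage
```

If the platform default encoding is the culprit, forcing it with `-Dfile.encoding=utf8` as suggested above is one way to test the theory.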




Re: Process behavior of executing multiple jobs

2012-11-19 Thread Karl Wright
Hi Shigeki,

This is a complex question, which is actually at the center of what
ManifoldCF does.

There are two different kinds of scheduling that MCF does.  The first
is scheduling documents within a single connection.  The second is
scheduling documents across connections.

Let's start with the first.  Every connector, given a document, has
the ability to determine what throttling bins it belongs in.  A
throttling bin is an arbitrary grouping of documents that should be
treated together for the purposes of throttling.  For example, the web
connector uses a document's server name as a throttling bin, which
means that any new document from the same server will be rate-limited
relative to other documents from that server.  This grouping allows
the ManifoldCF document queue to be prioritized (which means that a
priority number is set) in such a way that documents from all bins
have an equal probability of being scheduled in a given time interval.
 Then, the query that finds the next set of documents to crawl can do
mostly the right thing if it just orders the query based on the
priority number.

The second layer adjusts for differences in performance between bins
and between connections.  ManifoldCF keeps track of the performance
statistics of each connector and each throttle bin.  If the statistics
show that processing a document for one bin in one connector is
significantly slower than for the others, it will take that into
account and learn to give fewer documents from that bin or connection
to the worker threads during any given time interval.

If the statistics change, it will obviously be a little while before
ManifoldCF adjusts its behavior.  But eventually it should adjust.

If you are seeing a specific long-term behavior that is not optimal,
please let us know.  It's been quite a while since anyone has had
questions/issues in this area.

Thanks,
Karl

On Sun, Nov 18, 2012 at 10:55 PM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:

 Hi.

 I have a question of process behavior of executing multiple jobs.

 I run MCF1.0 on Tomcat, crawl files on Windows file servers, and index them
 into Solr3.6.

 When I set up multiple jobs and execute them at the same time, I notice that
 the number of documents processed by each job seems to be unbalanced.
 For example, while one job has processed 100 documents, the other job has only
 processed 5 so far. In the end, all of the jobs complete processing, but I
 wonder how those jobs can process documents evenly at the same time.
 I also wonder how MCF determines the priority of each document of
 each job to crawl and index.


 Regards,


 Shigeki
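The per-bin prioritization described above can be sketched roughly as follows (a toy model in Python for illustration only, not ManifoldCF's actual queue code): give the Nth document of any throttle bin priority N, so that sorting the queue by priority interleaves the bins and every bin gets an equal chance of being scheduled in a given time interval.

```python
# Toy sketch (not ManifoldCF's actual implementation) of bin-based
# document prioritization: the Nth document of any throttle bin gets
# priority N, so sorting by priority interleaves the bins.
from collections import defaultdict

def assign_priorities(docs_with_bins):
    """docs_with_bins: iterable of (doc_id, bin_name) pairs.
    Returns doc_ids ordered so that bins are interleaved."""
    position_in_bin = defaultdict(int)
    prioritized = []
    for doc_id, bin_name in docs_with_bins:
        position_in_bin[bin_name] += 1
        prioritized.append((position_in_bin[bin_name], doc_id))
    prioritized.sort()  # mimics ORDER BY priority in the queue query
    return [doc_id for _, doc_id in prioritized]

queue = [("a1", "siteA"), ("a2", "siteA"), ("a3", "siteA"),
         ("b1", "siteB"), ("b2", "siteB")]
print(assign_priorities(queue))  # ['a1', 'b1', 'a2', 'b2', 'a3']
```

The real system additionally adjusts these priorities using the measured per-bin and per-connection performance statistics described above, so a slow bin gradually receives fewer documents per interval.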


Re: Cannot connect to SharePoint 2010 instance

2012-11-26 Thread Karl Wright
I've done further research on HttpComponents' support for Kerberos.
It turns out that HttpComponents claims they can successfully use
tickets from the local machine's ticket store.  I haven't tried this
here (don't have the setup for it), but it looks like it could
conceivably work with MCF trunk at this point.  Read up on it here:

http://hc.apache.org/httpcomponents-client-ga/tutorial/html/authentication.html

Ideally, of course, we'd really want to add the ability for ManifoldCF
to handle its own ticket cache, one per connection, so that each
connection looks like its own independent client.  In order for that
to happen, connectors that support Kerberos would need to be able to
kerberos authenticate.  But, for right now, this may work for people
needing Kerberos.

Karl

On Sun, Nov 11, 2012 at 8:42 AM, Karl Wright daddy...@gmail.com wrote:
 The port of the SharePoint connector to httpcomponents 4.2.2 is complete.

 I don't know whether it will help you or not, but if you check out
 ManifoldCF trunk (from
 https://svn.apache.org/repos/asf/manifoldcf/trunk) and run:

 ant make-core-deps build

 ... you will be running the latest code.  It has been tried against a
 plain-vanilla SharePoint system using standard NTLM and found to work.
  If you try the new code and it works for you, that would be very
 interesting to know; it looks like httpcomponents has developed some
 support for SPNEGO, which may be what is missing in the current
 ManifoldCF release.

 Thanks,
 Karl

 On Wed, Nov 7, 2012 at 4:47 PM, Karl Wright daddy...@gmail.com wrote:
 MCPermissions.asmx and Lists.asmx are two different services, and the
 Lists.asmx is likely failing before the MCPermissions.asmx is even
 needed.  If, for instance, you are just trying with the UI to see if
 you get back Connection working, this makes sense since the Lists
 service is called first and then the MCPermissions service is called
 after.

 FWIW, I'm starting to look into porting ManifoldCF to the
 httpcomponent libraries from the older httpclient 3.1 world.  This
 will make it easier, I think, to incorporate newer additions.

 Thanks,
 Karl


 On Wed, Nov 7, 2012 at 3:44 PM, Iannetti, Robert
 robert.ianne...@novartis.com wrote:
 Karl,

 It looks like I am failing connecting to the  /_vti_bin/lists.asmx service 
 but I never see the MCPermissions.asmx in any of my trace logs.

 Why is that?

 Thanks
 Bob


 -Original Message-
 From: Iannetti, Robert
 Sent: Wednesday, November 07, 2012 10:37 AM
 To: user@manifoldcf.apache.org
 Subject: RE: Cannot connect to SharePoint 2010 instance

 Karl,

  The X's you see are me trying to make the log look generic; there were valid
  GUIDs present in the real log.

 I will try WireShark and let you know the results.

 Thanks
 Bob




 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Wednesday, November 07, 2012 10:32 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 This in general looks like a proper NTLM authorization sequence, except for 
 the lack of confirmation at the end.  The only thing I see that I don't 
 recognize is this:

 DEBUG 2012-11-07 09:56:11,212 (Thread-441) -  SPRequestGuid: 
 xxx[\r][\n]

 If SharePoint is expecting this GUID to be returned somehow then that would 
 explain it, but frankly we've got a number of SP 2010 installations and 
 that hasn't been an issue anywhere else.  And, I don't expect curl would 
 work if that was the case.

 It's worth a shot using a tool like WireShark to see if you can find any 
 difference in headers etc. between curl and ManifoldCF.  We've noticed in 
 the past that the exact Host header seems to be the critical issue, so any 
 differences there would be of interest.

 Karl

 On Wed, Nov 7, 2012 at 10:08 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 We have created the NTLM SharePoint instance as instructed.

 The Curl command is now responding when before it would not.
 curl --ntlm -u domain\\username
 http://xxx.xxx.xxx.xxx/_vti_bin/MCPermissions.asmx -v

 But we are still getting an error when issuing the connection request
 from the ManifoldCF GUI Crawl user did not authenticate properly, or
 has insufficient permissions to access http://XXX.XXX.XXX.XXX:
 (401)Unauthorized

 From the log file

 DEBUG 2012-11-07 09:56:11,126 (Thread-441) -  POST /_vti_bin/lists.asmx 
 HTTP/1.1[\r][\n]
 DEBUG 2012-11-07 09:56:11,151 (Thread-441) -  Content-Type: text/xml; 
 charset=utf-8[\r][\n]
 DEBUG 2012-11-07 09:56:11,152 (Thread-441) -  SOAPAction: 
 http://schemas.microsoft.com/sharepoint/soap/GetListCollection[\r][\n];
 DEBUG 2012-11-07 09:56:11,152 (Thread-441) -  User-Agent: 
 Axis/1.4[\r][\n]
 DEBUG 2012-11-07 09:56:11,152 (Thread-441) -  Host: 
 x...[\r][\n]
 DEBUG 2012-11-07 09:56:11,152 (Thread-441) -  Transfer-Encoding: 
 chunked[\r][\n]
 DEBUG 2012-11-07 09:56:11,152 (Thread-441) -  [\r][\n]
 DEBUG 2012-11-07 09:56:11,153 (Thread-441) -  14f

Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized

2012-11-27 Thread Karl Wright
Hi Luigi,

The Negotiate is clearly part of the problem; please leave that out.

The log entries you mention are indeed harmless warnings that we don't have
an Italian localization yet.

When you view the connection in the UI, what do you see now?

Karl


On Tue, Nov 27, 2012 at 8:25 AM, Luigi D'Addario 
luigi.dadda...@googlemail.com wrote:

 hi Karl,
 thanks for your reply.

 *(1) Are you sure that your SharePoint IIS is not configured to use*
 *Kerberos auth?*


  On the SharePoint server, in MetaBase.xml, I have

 IIsWebVirtualDir Location =/LM/W3SVC/662429156/Root
  AccessFlags=AccessExecute | AccessRead | AccessWrite | AccessScript
 AppFriendlyName=Root
  AppIsolated=2
 AppPoolId=SharePoint - 80
 AppRoot=/LM/W3SVC/662429156/Root
  AuthFlags=*AuthNTLM*
 ContentIndexed=FALSE
 DefaultLogonDomain=services-kirey.lan
  DoDynamicCompression=TRUE
 DoStaticCompression=TRUE
  HttpCustomHeaders=X-Powered-By: ASP.NET
 MicrosoftSharePointTeamServices: 12.0.0.6421
  *NTAuthenticationProviders=Negotiate,NTLM*
 Path=C:\Inetpub\wwwroot\wss\VirtualDirectories\80


  OK, I have Negotiate first, but if I force only NTLM
  (*NTAuthenticationProviders=NTLM*), manifoldcf.log *does not record any
  messages*!!


  With a simple ASP script running on my SharePoint server page I tried to
  get the authentication mode via HTTP, and this is the result:

 with *NTAuthenticationProviders=NTLM:*

 *User Id = VM-SHPT2K7\Administrator The user was logged in using the NTLM 
 authentication
 method.*


 with *NTAuthenticationProviders=Negotiate,NTLM:*
 *User Id = VM-SHPT2K7\Administrator The Negotiate method was used!
 The user was logged on using NTLM*




  In manifoldcf.log I found this error, but I think it is not related to the
  401:

 ERROR 2012-11-27 10:56:49,828 (qtp17632942-166) - Missing resource bundle
 'org.apache.manifoldcf.crawler.connectors.sharepoint.common' for locale
 'it': Can't find bundle for base name
 org.apache.manifoldcf.crawler.connectors.sharepoint.common, locale it;
 trying it
 java.util.MissingResourceException: Can't find bundle for base name
 org.apache.manifoldcf.crawler.connectors.sharepoint.common, locale it





 2012/11/27 Karl Wright daddy...@gmail.com

 Hi Luigi,

 The warning is coming from the part of commons-httpclient that is
 trying to set up communication with your SharePoint instance.  It
 thinks it needs to use SPNEGO to figure out the authentication
 mechanism, and it seems to be trying to load kerberos 5 configuration
 information, which means that  it thinks Kerberos is the
 authentication mechanism of choice.

 (1) Are you sure that your SharePoint IIS is not configured to use
 Kerberos auth?

 (2) What command-line arguments are you giving to the JVM that is
 running ManifoldCF?

 Karl

 On Tue, Nov 27, 2012 at 7:44 AM, Luigi D'Addario
 luigi.dadda...@googlemail.com wrote:
  Hello,
 
  I have installed apache-manifoldcf-1.0.1 on my Windows XP and
  apache-manifoldcf-sharepoint-2007-plugin on my SharePoint 2007 server.
   (a virtual machine).
 
  I can see the Permissions Page when I enter
  http://x:x/sub_directory/_vti_bin/MCPermissions.asmx
  in my browser.
  When I try to make a SharePoint Services 3.0 (2007)
  connection to my SharePoint 2007 server in the ManifoldCF
  interface I get this error:
 
  Crawl user did not authenticate properly, or has insufficient
 permissions to
  access http://vm-shpt2k7/KireyRep: (401)HTTP/1.1 401 Unauthorized
 
  Via curl I first get a 401 and then a 200 status:
 
  curl --ntlm -u vm-shpt2k7\\administrator
  http://vm-shpt2k7/KireyRep/_vti_bin/MCPermissions.asmx -v
  Enter host password for user 'vm-shpt2k7\\administrator':
  * About to connect() to vm-shpt2k7 port 80 (#0)
  *   Trying 192.168.30.42...
  * connected
  * Connected to vm-shpt2k7 (192.168.30.42) port 80 (#0)
  * Server auth using NTLM with user 'vm-shpt2k7\\administrator'
  GET /KireyRep/_vti_bin/MCPermissions.asmx HTTP/1.1
  Authorization: NTLM
  TlRMTVNTUAABt4II4gAFASgKDw==
  User-Agent: curl/7.25.0 (i386-pc-win32) libcurl/7.25.0 OpenSSL/0.9.8u
  zlib/1.2
  .6 libssh2/1.4.0
  Host: vm-shpt2k7
  Accept: */*
 
   HTTP/1.1 401 Unauthorized
   Content-Length: 1539
   Content-Type: text/html
   Server: Microsoft-IIS/6.0
   WWW-Authenticate: NTLM
  TlRMTVNTUAACHAAcADg1goniwKcRCkDsTOwAAMo
 
 AygBUBQLODg9TAEUAUgBWAEkAQwBFAFMALQBLAEkAUgBFAFkAAgAcAFMARQBSAFYASQBDAEU
 
 AUwAtAEsASQBSAEUAWQABABQAVgBNAC0AUwBIAFAAVAAyAEsANwAEACQAcwBlAHIAdgBpAGMAZQBzAC0
 
 AawBpAHIAZQB5AC4AbABhAG4AAwA6AHYAbQAtAHMAaABwAHQAMgBrADcALgBzAGUAcgB2AGkAYwBlAHM
 
 ALQBrAGkAcgBlAHkALgBsAGEAbgAFACQAcwBlAHIAdgBpAGMAZQBzAC0AawBpAHIAZQB5AC4AbABhAG4
  AAA==
   X-Powered-By: ASP.NET
   MicrosoftSharePointTeamServices: 12.0.0.6421
   Date: Mon, 26 Nov 2012 21:47:30 GMT
  
  * Ignoring the response-body
  * Connection #0 to host vm-shpt2k7 left intact
  * Issue another request to this URL:
  'http://vm-shpt2k7/KireyRep/_vti_bin/MCPerm
  issions.asmx'
  * Re-using existing

Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized

2012-11-27 Thread Karl Wright
Ok, can you try a fully-qualified domain name, rather than the abbreviated
one you have given, for the credentials?  Also, you might want to look at
the server-side event logs for the reason for the authentication failure.

Thanks,
Karl


On Tue, Nov 27, 2012 at 9:04 AM, Luigi D'Addario 
luigi.dadda...@googlemail.com wrote:

 well,

 on SharePoint Server:

 *NTAuthenticationProviders=NTLM*

 on ManifoldCF UI interface, error:

 Parameters: serverLocation=/KireyRep
 serverPort=80
 serverVersion=3.0
 userName=VM-SHPT2K7\Administrator
 serverProtocol=http
 serverName=vm-shpt2k7.services-kirey.lan
 password=

 Connection status:Crawl user did not authenticate properly, or has
 insufficient permissions to access
 http://vm-shpt2k7.services-kirey.lan/KireyRep: *(401)HTTP/1.1 401
 Unauthorized*

 on manifoldcf.log

 *no error trace !*





 2012/11/27 Karl Wright daddy...@gmail.com

 Hi Luigi,

 The Negotiate is clearly part of the problem; please leave that out.

 The log entries you mention are indeed harmless warnings that we don't
 have an Italian localization yet.

 When you view the connection in the UI, what do you see now?


 Karl


 On Tue, Nov 27, 2012 at 8:25 AM, Luigi D'Addario 
 luigi.dadda...@googlemail.com wrote:

 hi Karl,
 thanks for your reply.

 *(1) Are you sure that your SharePoint IIS is not configured to use*
 *Kerberos auth?*


 On Sharepoint Server, in the MetaBase.xml  i have

 IIsWebVirtualDir Location =/LM/W3SVC/662429156/Root
  AccessFlags=AccessExecute | AccessRead | AccessWrite | AccessScript
 AppFriendlyName=Root
  AppIsolated=2
 AppPoolId=SharePoint - 80
 AppRoot=/LM/W3SVC/662429156/Root
  AuthFlags=*AuthNTLM*
 ContentIndexed=FALSE
 DefaultLogonDomain=services-kirey.lan
  DoDynamicCompression=TRUE
 DoStaticCompression=TRUE
  HttpCustomHeaders=X-Powered-By: ASP.NET
 MicrosoftSharePointTeamServices: 12.0.0.6421
  *NTAuthenticationProviders=Negotiate,NTLM*
 Path=C:\Inetpub\wwwroot\wss\VirtualDirectories\80


 Ok, i have first Negotiate, but  if I force only NTLM  (*
 NTAuthenticationProviders=NTLM*), manifoldcf.log *not recorder any
 messages* !!


 With a simply asp script running on my Sharepoint Server page i tried to
 get authentication mode via http and this is the result:

 with *NTAuthenticationProviders=NTLM:*

 *User Id = VM-SHPT2K7\Administrator The user was logged in using the
 NTLM authentication method.*


 with *NTAuthenticationProviders=Negotiate,NTLM:*
 *
 *
 *User Id = VM-SHPT2K7\Administrator The Negotiate method was used!
 The user was logged on using NTLM*




 In  manifoldcf.log i founded this error but i think is not related with
 401:

 ERROR 2012-11-27 10:56:49,828 (qtp17632942-166) - Missing resource
 bundle 'org.apache.manifoldcf.crawler.connectors.sharepoint.common' for
 locale 'it': Can't find bundle for base name
 org.apache.manifoldcf.crawler.connectors.sharepoint.common, locale it;
 trying it
 java.util.MissingResourceException: Can't find bundle for base name
 org.apache.manifoldcf.crawler.connectors.sharepoint.common, locale it





 2012/11/27 Karl Wright daddy...@gmail.com

 Hi Luigi,

 The warning is coming from the part of commons-httpclient that is
 trying to set up communication with your SharePoint instance.  It
 thinks it needs to use SPNEGO to figure out the authentication
 mechanism, and it seems to be trying to load kerberos 5 configuration
 information, which means that  it thinks Kerberos is the
 authentication mechanism of choice.

 (1) Are you sure that your SharePoint IIS is not configured to use
 Kerberos auth?

 (2) What command-line arguments are you giving to the JVM that is
 running ManifoldCF?

 Karl

 On Tue, Nov 27, 2012 at 7:44 AM, Luigi D'Addario
 luigi.dadda...@googlemail.com wrote:
  Hello,
 
  I have installed apache-manifoldcf-1.0.1 on my Windows XP and
  apache-manifoldcf-sharepoint-2007-plugin on my SharePoint 2007 server.
   (a virtual machine).
 
  I can see the Permissions Page when I enter
  http://x:x/sub_directory/_vti_bin/MCPermissions.asmx
  in my browser.
  When I try to make a SharePoint Services 3.0 (2007)
  connection to my SharePoint 2007 server in the ManifoldCF
  interface I get this error:
 
  Crawl user did not authenticate properly, or has insufficient
 permissions to
  access http://vm-shpt2k7/KireyRep: (401)HTTP/1.1 401 Unauthorized
 
  Via curl i get first a 401 and then a 200 status:
 
  curl --ntlm -u vm-shpt2k7\\administrator
  http://vm-shpt2k7/KireyRep/_vti_bin/MCPermissions.asmx -v
  Enter host password for user 'vm-shpt2k7\\administrator':
  * About to connect() to vm-shpt2k7 port 80 (#0)
  *   Trying 192.168.30.42...
  * connected
  * Connected to vm-shpt2k7 (192.168.30.42) port 80 (#0)
  * Server auth using NTLM with user 'vm-shpt2k7\\administrator'
  GET /KireyRep/_vti_bin/MCPermissions.asmx HTTP/1.1
  Authorization: NTLM
  TlRMTVNTUAABt4II4gAFASgKDw==
  User-Agent: curl/7.25.0 (i386-pc-win32) libcurl/7.25.0 OpenSSL/0.9.8u
  zlib/1.2
  .6

Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized

2012-11-27 Thread Karl Wright
Just on a whim, can you try POST with curl also?  It is possible that POSTs
are blocked in some way.

If that doesn't work, then your security settings are prohibiting post.

If that DOES work, then I'd like you to download a ManifoldCF 1.1-dev image
from http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev , and try
that.  This uses httpcomponents rather than our special commons-httpclient
version.

If none of this helps, getting a packet capture of both a curl POST and the
comparable ManifoldCF attempt may well show us what the key issue is.  It's
possible that there is a header or something your IIS is rejecting, for
instance.

Thanks,
Karl


On Tue, Nov 27, 2012 at 11:06 AM, Luigi D'Addario 
luigi.dadda...@googlemail.com wrote:

 Karl,

  I tried many credential combinations... always 401.

 From server log,
 with ManifoldCF UI interface (in POST), 401 error:

 #Software: Microsoft Internet Information Services 6.0
 #Version: 1.0
 #Date: 2012-11-27 15:38:37
 #Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query
 s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus
 sc-win32-status
 2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 
 *POST*/KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62 Axis/1.4
 *401* 2 2148074254
 2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 *POST 
 */KireyRep/_vti_bin/lists.asmx
 - 80 - 192.168.49.62 Axis/1.4 *401* 1 0
 2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 *POST 
 */KireyRep/_vti_bin/lists.asmx
 - 80 - 192.168.49.62 Axis/1.4 *401* 1 2148074252


 With direct call via http (http://vm-shpt2k7/KireyRep/_vti_bin/lists.asmx),
 (in GET):

 2012-11-27 15:43:48 W3SVC662429156 192.168.30.42 GET
 /KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62
 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727;+.NET+CLR+1.1.4322;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729)
 *401* 2 2148074254
 2012-11-27 15:43:48 W3SVC662429156 192.168.30.42 GET
 /KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62
 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727;+.NET+CLR+1.1.4322;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729)
 *401 *1 0
 2012-11-27 15:43:48 W3SVC662429156 192.168.30.42 GET
 /KireyRep/_vti_bin/lists.asmx - 80 vm-shpt2k7\administrator 192.168.49.62
 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727;+.NET+CLR+1.1.4322;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729)
 *200* 0 0

 It's quite a conundrum ...



 2012/11/27 Karl Wright daddy...@gmail.com

 Ok, can you try a fully-qualified domain name, rather than the
 abbreviated one you have given, for the credentials?  Also, you might want
 to look at the server-side event logs for the reason for the authentication
 failure.

 Thanks,
 Karl



 On Tue, Nov 27, 2012 at 9:04 AM, Luigi D'Addario 
 luigi.dadda...@googlemail.com wrote:

 well,

 on SharePoint Server:

 *NTAuthenticationProviders=NTLM*

 *
 *
 on ManifoldCF UI interface, error:

 Parameters: serverLocation=/KireyRep
 serverPort=80
 serverVersion=3.0
 userName=VM-SHPT2K7\Administrator
 serverProtocol=http
 serverName=vm-shpt2k7.services-kirey.lan
 password=

 Connection status:Crawl user did not authenticate properly, or has
 insufficient permissions to access
 http://vm-shpt2k7.services-kirey.lan/KireyRep: *(401)HTTP/1.1 401
 Unauthorized*

 on manifoldcf.log

 *no error trace !*





 2012/11/27 Karl Wright daddy...@gmail.com

 Hi Luigi,

 The Negotiate is clearly part of the problem; please leave that out.

 The log entries you mention are indeed harmless warnings that we don't
 have an Italian localization yet.

 When you view the connection in the UI, what do you see now?


 Karl


 On Tue, Nov 27, 2012 at 8:25 AM, Luigi D'Addario 
 luigi.dadda...@googlemail.com wrote:

 hi Karl,
 thanks for your reply.

 *(1) Are you sure that your SharePoint IIS is not configured to use*
 *Kerberos auth?*


 On Sharepoint Server, in the MetaBase.xml  i have

 IIsWebVirtualDir Location =/LM/W3SVC/662429156/Root
  AccessFlags=AccessExecute | AccessRead | AccessWrite | AccessScript
 AppFriendlyName=Root
  AppIsolated=2
 AppPoolId=SharePoint - 80
 AppRoot=/LM/W3SVC/662429156/Root
  AuthFlags=*AuthNTLM*
 ContentIndexed=FALSE
 DefaultLogonDomain=services-kirey.lan
  DoDynamicCompression=TRUE
 DoStaticCompression=TRUE
  HttpCustomHeaders=X-Powered-By: ASP.NET
 MicrosoftSharePointTeamServices: 12.0.0.6421
  *NTAuthenticationProviders=Negotiate,NTLM*
 Path=C:\Inetpub\wwwroot\wss\VirtualDirectories\80


 Ok, i have first Negotiate, but  if I force only NTLM  (*
 NTAuthenticationProviders=NTLM*), manifoldcf.log *not recorder any
 messages* !!


 With a simply asp script running on my Sharepoint Server page i tried
 to get authentication mode via http and this is the result:

 with *NTAuthenticationProviders=NTLM:*

 *User Id = VM-SHPT2K7\Administrator The user was logged in using the
 NTLM authentication method.*


 with *NTAuthenticationProviders=Negotiate,NTLM:*
 *
 *
 *User Id = VM-SHPT2K7

Re: Cannot connect to SharePoint 2010 instance

2012-11-27 Thread Karl Wright
Hi Bob,

This is really beginning to sound like there is a header problem of some kind.

This is what I'd like to try.
(1) Turn on wire debugging for SharePoint, as described here:
https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
(2) Using curl, try to use post and the proper credentials, using the
-vvv switch.  If you successfully connect, save that output.  Then try
to EXACTLY mimic the request that ManifoldCF does, and if that FAILS
record that output and send it all to me.

Thanks!
Karl



On Tue, Nov 27, 2012 at 11:22 AM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 Hi Karl,

 I have installed the dev version of the connector from below and am having an 
 issue connecting to my SharePoint 2010 site.
 It actually seems similar to what is happening in your thread with Luigi.

  I try to log in to the SharePoint site as a user with full control and I get
  this error:

 Crawl user did not authenticate properly, or has insufficient permissions to 
 access http://...: (401)HTTP/1.1 401 Unauthorized

 Thanks
 Bob

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 26, 2012 6:38 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Ok, you can download a dev build at:

 http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev

 It takes me about an hour to put one of these together, so if you can 
 possibly build ManifoldCF yourself that would be a huge help.

 Karl


 On Mon, Nov 26, 2012 at 11:12 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 That would be great please let me know when it is available

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 26, 2012 10:59 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Hi Robert,

 I can build a binary version you can download, but not until tonight.

 It may be easier to talk through getting a build environment set up on your 
 Linux machine.  Is this Debian or Ubuntu linux, by any chance?
 If so, the setup is trivial and I can help you with that.

 Karl

 On Mon, Nov 26, 2012 at 10:12 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

  Is there a binary release (pre-compiled version) of the ManifoldCF trunk
  mentioned below (https://svn.apache.org/repos/asf/manifoldcf/trunk) that you
  can point me to? I am new to Linux and don't have any experience with Ant.


 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 26, 2012 4:32 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 I've done further research on HttpComponents' support for Kerberos.
 It turns out that HttpComponents claims they can successfully use tickets 
 from the local machine's ticket store.  I haven't tried this here (don't 
 have the setup for it), but it looks like it could conceivably work with 
 MCF trunk at this point.  Read up on it here:

 http://hc.apache.org/httpcomponents-client-ga/tutorial/html/authentic
 a
 tion.html

 Ideally, of course, we'd really want to add the ability for ManifoldCF to 
 handle its own ticket cache, one per connection, so that each connection 
 looks like its own independent client.  In order for that to happen, 
 connectors that support Kerberos would need to be able to kerberos 
 authenticate.  But, for right now, this may work for people needing 
 Kerberos.

 Karl

 On Sun, Nov 11, 2012 at 8:42 AM, Karl Wright daddy...@gmail.com wrote:
 The port of the SharePoint connector to httpcomponents 4.2.2 is complete.

 I don't know whether it will help you or not, but if you check out
 ManifoldCF trunk (from
 https://svn.apache.org/repos/asf/manifoldcf/trunk) and run:

 ant make-core-deps build

 ... you will be running the latest code.  It has been tried against
 a plain-vanilla SharePoint system using standard NTLM and found to work.
  If you try the new code and it works for you, that would be very
 interesting to know; it looks like httpcomponents has developed some
 support for SPNEGO, which may be what is missing in the current
 ManifoldCF release.

 Thanks,
 Karl

 On Wed, Nov 7, 2012 at 4:47 PM, Karl Wright daddy...@gmail.com wrote:
 MCPermissions.asmx and Lists.asmx are two different services, and
 the Lists.asmx is likely failing before the MCPermissions.asmx is
 even needed.  If, for instance, you are just trying with the UI to
 see if you get back Connection working, this makes sense since
 the Lists service is called first and then the MCPermissions
 service is called after.

 FWIW, I'm starting to look into porting ManifoldCF to the
 httpcomponent libraries from the older httpclient 3.1 world.  This
 will make it easier, I think, to incorporate newer additions.

 Thanks,
 Karl


 On Wed, Nov 7, 2012 at 3:44 PM, Iannetti, Robert
 robert.ianne...@novartis.com wrote:
 Karl

Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized

2012-11-27 Thread Karl Wright
You need to use the --data option, not -X.

Karl
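The difference between the two curl invocations can be reproduced locally with a short sketch (Python for illustration; the path and SOAP body are placeholders, not the real SharePoint payload): a POST with no body carries no Content-Length header, which is exactly what IIS's 411 complains about, whereas supplying a body makes the client set Content-Length automatically, the same thing curl's --data option does.

```python
# Sketch of why "curl -X POST" with no body draws a 411 from IIS while
# "--data" does not: a POST must carry Content-Length (or use chunked
# transfer encoding). The throwaway local server below just records the
# headers that a bodied POST actually carries; the path is a placeholder.
import http.client
import http.server
import threading

captured = {}

class CaptureHandler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        captured.update(self.headers)                          # record headers
        self.rfile.read(int(self.headers["Content-Length"]))   # drain body
        self.send_response(200)
        self.end_headers()
    def log_message(self, *args):  # keep the sketch quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), CaptureHandler)
threading.Thread(target=server.handle_request, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("POST", "/_vti_bin/MCPermissions.asmx",
             body=b"<soap:Envelope/>")  # a body => Content-Length is set
conn.getresponse().read()
conn.close()
server.server_close()

print(captured["Content-Length"])  # the header IIS insisted on
```

In curl terms, something like `curl --ntlm -u domain\\user --data @request.xml http://server/_vti_bin/lists.asmx -v` should therefore get past the 411 where `-X POST` alone did not (the exact URL and payload depend on the setup).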

On Tue, Nov 27, 2012 at 11:37 AM, Luigi D'Addario 
luigi.dadda...@googlemail.com wrote:

 Karl,

  via curl with POST I get an HTTP/1.1 *411 Length Required*.

  Does that mean POST is blocked?


 curl -X POST --ntlm -u vm-shpt2k7\\administrator http://vm-s
 hpt2k7/KireyRep/_vti_bin/MCPermissions.asmx -v
 Enter host password for user 'vm-shpt2k7\\administrator':
 * About to connect() to vm-shpt2k7 port 80 (#0)
 *   Trying 192.168.30.42...
 * connected
 * Connected to vm-shpt2k7 (192.168.30.42) port 80 (#0)
 * Server auth using NTLM with user 'vm-shpt2k7\\administrator'
  POST /KireyRep/_vti_bin/MCPermissions.asmx HTTP/1.1
  Authorization: NTLM
 TlRMTVNTUAABt4II4gAFASgKDw==
  User-Agent: curl/7.25.0 (i386-pc-win32) libcurl/7.25.0 OpenSSL/0.9.8u
 zlib/1.2
 .6 libssh2/1.4.0
  Host: vm-shpt2k7
  Accept: */*
 
  HTTP/1.1 *411 Length Required*
  Content-Type: text/html
  Date: Tue, 27 Nov 2012 16:32:06 GMT
  Connection: close
  Content-Length: 24
 
 h1Length Required/h1* Closing connection #0



 2012/11/27 Karl Wright daddy...@gmail.com

 Just on a whim, can you try POST with curl also?  It is possible that
 POSTs are blocked in some way.

 If that doesn't work, then your security settings are prohibiting post.

 If that DOES work, then I'd like you to download a ManifoldCF 1.1-dev
 image from http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev ,
 and try that.  This uses httpcomponents rather than our special
 commons-httpclient version.

 If none of this helps, getting a packet capture of both a curl POST and
 the comparable ManifoldCF attempt may well show us what the key issue is.
 It's possible that there is a header or something your IIS is rejecting,
 for instance.

 Thanks,
 Karl



 On Tue, Nov 27, 2012 at 11:06 AM, Luigi D'Addario 
 luigi.dadda...@googlemail.com wrote:

 Karl,

 I tried many credential combination .. always 401 ..

 From server log,
 with ManifoldCF UI interface (in POST), 401 error:

 #Software: Microsoft Internet Information Services 6.0
 #Version: 1.0
 #Date: 2012-11-27 15:38:37
 #Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query
 s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus
 sc-win32-status
 2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 
 *POST*/KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62 Axis/1.4
 *401* 2 2148074254
 2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 *POST 
 */KireyRep/_vti_bin/lists.asmx
 - 80 - 192.168.49.62 Axis/1.4 *401* 1 0
 2012-11-27 15:38:37 W3SVC662429156 192.168.30.42 *POST 
 */KireyRep/_vti_bin/lists.asmx
 - 80 - 192.168.49.62 Axis/1.4 *401* 1 2148074252


 With direct call via http (
 http://vm-shpt2k7/KireyRep/_vti_bin/lists.asmx), (in GET):

 2012-11-27 15:43:48 W3SVC662429156 192.168.30.42 GET
 /KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62
 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727;+.NET+CLR+1.1.4322;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729)
 *401* 2 2148074254
 2012-11-27 15:43:48 W3SVC662429156 192.168.30.42 GET
 /KireyRep/_vti_bin/lists.asmx - 80 - 192.168.49.62
 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727;+.NET+CLR+1.1.4322;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729)
 *401 *1 0
 2012-11-27 15:43:48 W3SVC662429156 192.168.30.42 GET
 /KireyRep/_vti_bin/lists.asmx - 80 vm-shpt2k7\administrator 192.168.49.62
 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727;+.NET+CLR+1.1.4322;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729)
 *200* 0 0

 It's quite a conundrum ...



 2012/11/27 Karl Wright daddy...@gmail.com

 Ok, can you try a fully-qualified domain name, rather than the
 abbreviated one you have given, for the credentials?  Also, you might want
 to look at the server-side event logs for the reason for the authentication
 failure.

 Thanks,
 Karl



 On Tue, Nov 27, 2012 at 9:04 AM, Luigi D'Addario 
 luigi.dadda...@googlemail.com wrote:

 well,

 on SharePoint Server:

 *NTAuthenticationProviders=NTLM*

 *
 *
 on ManifoldCF UI interface, error:

 Parameters: serverLocation=/KireyRep
 serverPort=80
 serverVersion=3.0
 userName=VM-SHPT2K7\Administrator
 serverProtocol=http
 serverName=vm-shpt2k7.services-kirey.lan
 password=

 Connection status:Crawl user did not authenticate properly, or has
 insufficient permissions to access
 http://vm-shpt2k7.services-kirey.lan/KireyRep: *(401)HTTP/1.1 401
 Unauthorized*

 on manifoldcf.log

 *no error trace !*





 2012/11/27 Karl Wright daddy...@gmail.com

 Hi Luigi,

 The Negotiate is clearly part of the problem; please leave that out.

 The log entries you mention are indeed harmless warnings that we
 don't have an Italian localization yet.

 When you view the connection in the UI, what do you see now?


 Karl


 On Tue, Nov 27, 2012 at 8:25 AM, Luigi D'Addario 
 luigi.dadda...@googlemail.com wrote:

 hi Karl,
 thanks for your reply.

 *(1) Are you sure that your SharePoint IIS

Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized

2012-11-27 Thread Karl Wright
Curl with POST then works.

So the next step is to turn on wire debugging.  See
https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections.

Repeat the connection attempt with ManifoldCF, and send me the output.  I
want to verify that the headers (apart from the NTLM www-authenticate
headers) are the same.

Thanks!
Karl

On Tue, Nov 27, 2012 at 11:54 AM, Luigi D'Addario 
luigi.dadda...@googlemail.com wrote:

 Thanks :o]

 *HTTP/1.1 401 Unauthorized* +  *HTTP/1.1 500 Internal Server Error*

 curl --data POST --ntlm -u vm-shpt2k7\\administrator
 http://vm-shpt2k7/KireyRep/_vti_bin/MCPermissions.asmx -v
 Enter host password for user 'vm-shpt2k7\\administrator':
 * About to connect() to vm-shpt2k7 port 80 (#0)
 *   Trying 192.168.30.42...
 * connected
 * Connected to vm-shpt2k7 (192.168.30.42) port 80 (#0)
 * Server auth using NTLM with user 'vm-shpt2k7\\administrator'
  POST /KireyRep/_vti_bin/MCPermissions.asmx HTTP/1.1
  Authorization: NTLM
 TlRMTVNTUAABt4II4gAFASgKDw==
  User-Agent: curl/7.25.0 (i386-pc-win32) libcurl/7.25.0 OpenSSL/0.9.8u
 zlib/1.2
 .6 libssh2/1.4.0
  Host: vm-shpt2k7
  Accept: */*
  Content-Length: 0
  Content-Type: application/x-www-form-urlencoded
 
  *HTTP/1.1 401 Unauthorized*
  Content-Length: 1539
  Content-Type: text/html
  Server: Microsoft-IIS/6.0
  WWW-Authenticate: NTLM
 TlRMTVNTUAACHAAcADg1goniAo2Exi/3+LAAAMo

 AygBUBQLODg9TAEUAUgBWAEkAQwBFAFMALQBLAEkAUgBFAFkAAgAcAFMARQBSAFYASQBDAEU

 AUwAtAEsASQBSAEUAWQABABQAVgBNAC0AUwBIAFAAVAAyAEsANwAEACQAcwBlAHIAdgBpAGMAZQBzAC0

 AawBpAHIAZQB5AC4AbABhAG4AAwA6AHYAbQAtAHMAaABwAHQAMgBrADcALgBzAGUAcgB2AGkAYwBlAHM

 ALQBrAGkAcgBlAHkALgBsAGEAbgAFACQAcwBlAHIAdgBpAGMAZQBzAC0AawBpAHIAZQB5AC4AbABhAG4
 AAA==
  X-Powered-By: ASP.NET
  MicrosoftSharePointTeamServices: 12.0.0.6421
  Date: Tue, 27 Nov 2012 16:44:14 GMT
 
 * Ignoring the response-body
 * Connection #0 to host vm-shpt2k7 left intact
 * Issue another request to this URL: '
 http://vm-shpt2k7/KireyRep/_vti_bin/MCPermissions.asmx'
 * Re-using existing connection! (#0) with host (nil)
 * Connected to (nil) (192.168.30.42) port 80 (#0)
 * Server auth using NTLM with user 'vm-shpt2k7\\administrator'
  POST /KireyRep/_vti_bin/MCPermissions.asmx HTTP/1.1
  Authorization: NTLM
 TlRMTVNTUAADGAAYAJAYABgAqBQAFABIHAAcAFwAAA

 AYABgAeBAAEADANYKI4gUBKAoPdgBtAC0AcwBoAHAAdAAyAGsANwBcAGEAZABtAGkAbg

 BpAHMAdAByAGEAdABvAHIAUgBNAC0ARABBAEQARABBAFIASQBPAEwA/F/Dh4QXrhYAAA
 AADC8nQrN/CSjOrtwdcX5eneq+k+ZoTa0H5pip2sZd+GXoCE/Z+1QHfg==
  User-Agent: curl/7.25.0 (i386-pc-win32) libcurl/7.25.0 OpenSSL/0.9.8u
 zlib/1.2
 .6 libssh2/1.4.0
  Host: vm-shpt2k7
  Accept: */*
  Content-Length: 4
  Content-Type: application/x-www-form-urlencoded
 
 * upload completely sent off: 4 out of 4 bytes
  *HTTP/1.1 500 Internal Server Error*
  Date: Tue, 27 Nov 2012 16:44:15 GMT
  Server: Microsoft-IIS/6.0
  X-Powered-By: ASP.NET
  MicrosoftSharePointTeamServices: 12.0.0.6421
  X-AspNet-Version: 2.0.50727
  Cache-Control: private
  Content-Type: application/soap+xml; charset=utf-8
  Content-Length: 521
 
 <?xml version="1.0" encoding="utf-8"?><soap:Envelope
 xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <soap:Body><soap:Fault><soap:Code>
 <soap:Value>soap:Receiver</soap:Value></soap:Code><soap:Reason><soap:Text
 xml:lang="it">Impossibile elaborare la richiesta. ---&gt; Rilevati dati non
 validi al livello principale. Riga 1, posizione
 1.</soap:Text></soap:Reason><soap:Detail />
 </soap:Fault></soap:Body></soap:Envelope>* Connection #0 to host (nil)
 left intact
 * Closing connection #0


 And from server log:

 2012-11-27 16:44:14 W3SVC662429156 192.168.30.42 *POST 
 */KireyRep/_vti_bin/MCPermissions.asmx
 - 80 - 192.168.49.65
 curl/7.25.0+(i386-pc-win32)+libcurl/7.25.0+OpenSSL/0.9.8u+zlib/1.2.6+libssh2/1.4.0
 *401* 1 0
 2012-11-27 16:44:14 W3SVC662429156 192.168.30.42 *POST 
 */KireyRep/_vti_bin/MCPermissions.asmx
 - 80 vm-shpt2k7\Administrator 192.168.49.65
 curl/7.25.0+(i386-pc-win32)+libcurl/7.25.0+OpenSSL/0.9.8u+zlib/1.2.6+libssh2/1.4.0
 *500* 0 0




 2012/11/27 Karl Wright daddy...@gmail.com

 You need to use the --data option, not -X.

 Karl


 On Tue, Nov 27, 2012 at 11:37 AM, Luigi D'Addario 
 luigi.dadda...@googlemail.com wrote:

 Karl,

  via curl with POST I get an HTTP/1.1 *411 Length Required*

  Does that mean that POST is blocked?


 curl -X POST --ntlm -u vm-shpt2k7\\administrator http://vm-s
 hpt2k7/KireyRep/_vti_bin/MCPermissions.asmx -v
 Enter host password for user 'vm-shpt2k7\\administrator':
 * About to connect() to vm-shpt2k7 port 80 (#0)
 *   Trying 192.168.30.42...
 * connected
 * Connected to vm-shpt2k7 (192.168.30.42) port 80 (#0)
 * Server auth using NTLM with user 'vm-shpt2k7\\administrator'
  POST /KireyRep/_vti_bin/MCPermissions.asmx HTTP/1.1
   Authorization: NTLM

Re: Cannot connect to SharePoint 2010 instance

2012-11-27 Thread Karl Wright
Hi Bob,

If the headers all check out, then maybe this is the cause:

http://technet.microsoft.com/en-us/library/dd566199%28v=ws.10%29.aspx

I will have to check the httpcomponents code to verify that it uses at
least 128-bit encryption.  I won't be able to do that until tonight or
tomorrow though.

Karl


On Tue, Nov 27, 2012 at 11:36 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Bob,

 This is really beginning to sound like there is a header problem of some kind.

 This is what I'd like to try.
 (1) Turn on wire debugging for SharePoint, as described here:
 https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
 (2) Using curl, try to use post and the proper credentials, using the
 -vvv switch.  If you successfully connect, save that output.  Then try
 to EXACTLY mimic the request that ManifoldCF does, and if that FAILS
 record that output and send it all to me.

 Thanks!
 Karl



 On Tue, Nov 27, 2012 at 11:22 AM, Iannetti, Robert
 robert.ianne...@novartis.com wrote:
 Hi Karl,

 I have installed the dev version of the connector from below and am having 
 an issue connecting to my SharePoint 2010 site.
 It actually seems similar to what is happening in your thread with Luigi.

 I try to log in to the sharepoint site as a user with full control and I get 
 this error

 Crawl user did not authenticate properly, or has insufficient permissions to 
 access http://...: (401)HTTP/1.1 401 Unauthorized

 Thanks
 Bob

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 26, 2012 6:38 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Ok, you can download a dev build at:

 http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev

 It takes me about an hour to put one of these together, so if you can 
 possibly build ManifoldCF yourself that would be a huge help.

 Karl


 On Mon, Nov 26, 2012 at 11:12 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 That would be great please let me know when it is available

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 26, 2012 10:59 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Hi Robert,

 I can build a binary version you can download, but not until tonight.

 It may be easier to talk through getting a build environment set up on your 
 Linux machine.  Is this Debian or Ubuntu linux, by any chance?
 If so, the setup is trivial and I can help you with that.

 Karl

 On Mon, Nov 26, 2012 at 10:12 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

  Is there a binary release (pre-compiled version) of the ManifoldCF trunk 
  mentioned below (https://svn.apache.org/repos/asf/manifoldcf/trunk) that you 
  can point me to? I am new to Linux and don't have any experience with Ant.


 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 26, 2012 4:32 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 I've done further research on HttpComponents' support for Kerberos.
 It turns out that HttpComponents claims they can successfully use tickets 
 from the local machine's ticket store.  I haven't tried this here (don't 
 have the setup for it), but it looks like it could conceivably work with 
 MCF trunk at this point.  Read up on it here:

 http://hc.apache.org/httpcomponents-client-ga/tutorial/html/authentication.html

 Ideally, of course, we'd really want to add the ability for ManifoldCF to 
 handle its own ticket cache, one per connection, so that each connection 
 looks like its own independent client.  In order for that to happen, 
 connectors that support Kerberos would need to be able to kerberos 
 authenticate.  But, for right now, this may work for people needing 
 Kerberos.

 Karl

 On Sun, Nov 11, 2012 at 8:42 AM, Karl Wright daddy...@gmail.com wrote:
 The port of the SharePoint connector to httpcomponents 4.2.2 is complete.

 I don't know whether it will help you or not, but if you check out
 ManifoldCF trunk (from
 https://svn.apache.org/repos/asf/manifoldcf/trunk) and run:

 ant make-core-deps build

 ... you will be running the latest code.  It has been tried against
 a plain-vanilla SharePoint system using standard NTLM and found to work.
  If you try the new code and it works for you, that would be very
 interesting to know; it looks like httpcomponents has developed some
 support for SPNEGO, which may be what is missing in the current
 ManifoldCF release.

 Thanks,
 Karl

 On Wed, Nov 7, 2012 at 4:47 PM, Karl Wright daddy...@gmail.com wrote:
 MCPermissions.asmx and Lists.asmx are two different services, and
 the Lists.asmx is likely failing before the MCPermissions.asmx is
 even needed.  If, for instance, you are just trying with the UI to
 see if you get back Connection working, this makes sense since

Re: Cannot connect to SharePoint 2010 instance

2012-11-27 Thread Karl Wright
The file is usually called logging.ini, and is referenced by the main
manifoldcf properties file.

Karl
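For anyone hunting for the wiring, a hedged sketch of the relevant fragment. The property name and path are from memory and may differ between ManifoldCF versions, so verify against your distribution:

```xml
<!-- properties.xml: tells ManifoldCF where the log4j configuration lives.
     Illustrative; check the exact property name in your install. -->
<property name="org.apache.manifoldcf.logconfigfile" value="./logging.ini"/>
```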

On Tue, Nov 27, 2012 at 1:06 PM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 Karl,

 Where is the logging properties file where I would add the debugging commands 
 located ?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 27, 2012 12:52 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Hi Bob,

 If the headers all check out, then maybe this is the cause:

 http://technet.microsoft.com/en-us/library/dd566199%28v=ws.10%29.aspx

 I will have to check the httpcomponents code to verify that it uses at least 
 128-bit encryption.  I won't be able to do that until tonight or tomorrow 
 though.

 Karl


 On Tue, Nov 27, 2012 at 11:36 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Bob,

 This is really beginning to sound like there is a header problem of some 
 kind.

 This is what I'd like to try.
 (1) Turn on wire debugging for SharePoint, as described here:
  https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
 (2) Using curl, try to use post and the proper credentials, using the
 -vvv switch.  If you successfully connect, save that output.  Then try
 to EXACTLY mimic the request that ManifoldCF does, and if that FAILS
 record that output and send it all to me.

 Thanks!
 Karl



 On Tue, Nov 27, 2012 at 11:22 AM, Iannetti, Robert
 robert.ianne...@novartis.com wrote:
 Hi Karl,

 I have installed the dev version of the connector from below and am having 
 an issue connecting to my SharePoint 2010 site.
 It actually seems similar to what is happening in your thread with Luigi.

 I try to log in to the sharepoint site as a user with full control
 and I get this error

 Crawl user did not authenticate properly, or has insufficient
 permissions to access http://...: (401)HTTP/1.1 401
 Unauthorized

 Thanks
 Bob

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 26, 2012 6:38 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Ok, you can download a dev build at:

 http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev

 It takes me about an hour to put one of these together, so if you can 
 possibly build ManifoldCF yourself that would be a huge help.

 Karl


 On Mon, Nov 26, 2012 at 11:12 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 That would be great please let me know when it is available

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 26, 2012 10:59 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Hi Robert,

 I can build a binary version you can download, but not until tonight.

 It may be easier to talk through getting a build environment set up on 
 your Linux machine.  Is this Debian or Ubuntu linux, by any chance?
 If so, the setup is trivial and I can help you with that.

 Karl

 On Mon, Nov 26, 2012 at 10:12 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 Is there a binary release (pre -compiled version) of the manifold trunk 
 mentioned below https://svn.apache.org/repos/asf/manifoldcf/trunk that 
 you can point me to I am new to Linux and don't have any experience with 
 ANT.


 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Monday, November 26, 2012 4:32 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 I've done further research on HttpComponents' support for Kerberos.
 It turns out that HttpComponents claims they can successfully use tickets 
 from the local machine's ticket store.  I haven't tried this here (don't 
 have the setup for it), but it looks like it could conceivably work with 
 MCF trunk at this point.  Read up on it here:

  http://hc.apache.org/httpcomponents-client-ga/tutorial/html/authentication.html

 Ideally, of course, we'd really want to add the ability for ManifoldCF to 
 handle its own ticket cache, one per connection, so that each connection 
 looks like its own independent client.  In order for that to happen, 
 connectors that support Kerberos would need to be able to kerberos 
 authenticate.  But, for right now, this may work for people needing 
 Kerberos.

 Karl

 On Sun, Nov 11, 2012 at 8:42 AM, Karl Wright daddy...@gmail.com wrote:
 The port of the SharePoint connector to httpcomponents 4.2.2 is complete.

 I don't know whether it will help you or not, but if you check out
 ManifoldCF trunk (from
 https://svn.apache.org/repos/asf/manifoldcf/trunk) and run:

 ant make-core-deps build

 ... you will be running the latest code.  It has been tried
 against a plain-vanilla SharePoint system using standard NTLM and found 
 to work.
  If you try the new code and it works

Re: Cannot connect to SharePoint 2010 instance

2012-11-27 Thread Karl Wright
The wire debugging setup you are using will only work with
commons-httpclient, not the new httpcomponent package.  I'll have to
do some research and see if there's a comparable logger setting for
that package.

Karl


On Tue, Nov 27, 2012 at 2:01 PM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 Karl,

 It's odd. I added this to the properties.xml file:
 <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

 And this to the logging.ini file
 log4j.logger.httpclient.wire=DEBUG

 I restarted manifold but nothing is being written to the manifoldcf.log file

 Any thoughts?

 Here is the curl data


 [iannero1@ip-10-145-32-121 logs]$ curl --data POST --ntlm -u  nanet\\iannero1 
 http://searchpoc.testprojects.nibr.novartis.intra/_vti_bin/MCPermissions.asmx 
 -v
 Enter host password for user 'nanet\iannero1':
 * About to connect() to searchpoc.testprojects.nibr.novartis.intra port 80 
 (#0)
 *   Trying 160.62.169.185... connected
 * Connected to searchpoc.testprojects.nibr.novartis.intra (160.62.169.185) 
 port 80 (#0)
 * Initializing NSS with certpath: sql:/etc/pki/nssdb
 * Server auth using NTLM with user 'nanet\iannero1'
 POST /_vti_bin/MCPermissions.asmx HTTP/1.1
 Authorization: NTLM TlRMTVNTUAABBoIIAAA=
 User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 
 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
 Host: searchpoc.testprojects.nibr.novartis.intra
 Accept: */*
 Content-Length: 0
 Content-Type: application/x-www-form-urlencoded

  HTTP/1.1 401 Unauthorized
  Server: Microsoft-IIS/7.5
  SPRequestGuid: f7b8f5a5-1de4-43d1-9b70-7adf3b7d5987
  WWW-Authenticate: NTLM 
 TlRMTVNTUAACBwAHADgGgokClVhQpcbj++YAAMoAygA/BgGxHQ9OSUJSTkVUAgAOAE4ASQBCAFIATgBFAFQAAQAYAE4AUgBVAFMAQwBBAC0AUwBEADAANwA5AAQAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQAAwA8AE4AUgBVAFMAQwBBAC0AUwBEADAANwA5AC4AbgBpAGIAcgAuAG4AbwB2AGEAcgB0AGkAcwAuAG4AZQB0AAUAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQABwAIAA9MVPbLzM0BAA==
  X-Powered-By: ASP.NET
  MicrosoftSharePointTeamServices: 14.0.0.6123
  X-MS-InvokeApp: 1; RequireReadOnly
  Date: Tue, 27 Nov 2012 18:21:04 GMT
  Content-Length: 0
 
 * Connection #0 to host searchpoc.testprojects.nibr.novartis.intra left intact
 * Issue another request to this URL: 
 'http://searchpoc.testprojects.nibr.novartis.intra/_vti_bin/MCPermissions.asmx'
 * Re-using existing connection! (#0) with host 
 searchpoc.testprojects.nibr.novartis.intra
 * Connected to searchpoc.testprojects.nibr.novartis.intra (160.62.169.185) 
 port 80 (#0)
 * Server auth using NTLM with user 'nanet\iannero1'
 POST /_vti_bin/MCPermissions.asmx HTTP/1.1
 Authorization: NTLM 
 TlRMTVNTUAADGAAYAEAYABgAWAUABQBwCAAIAHUQABAAfQAABoKJApbX61a3hdN3ANfyrXuxF91dkEOBT5GMXTvsPdHWkjT6rm5hbmV0aWFubmVybzFpcC0xMC0xNDUtMzItMTIx
 User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 
 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
 Host: searchpoc.testprojects.nibr.novartis.intra
 Accept: */*
 Content-Length: 4
 Content-Type: application/x-www-form-urlencoded

  HTTP/1.1 500 Internal Server Error
  Cache-Control: private
  Content-Type: application/soap+xml; charset=utf-8
  Server: Microsoft-IIS/7.5
  X-AspNet-Version: 2.0.50727
  Persistent-Auth: true
  X-Powered-By: ASP.NET
  MicrosoftSharePointTeamServices: 14.0.0.6123
  X-MS-InvokeApp: 1; RequireReadOnly
  Date: Tue, 27 Nov 2012 18:21:04 GMT
  Content-Length: 509
 
 * Connection #0 to host searchpoc.testprojects.nibr.novartis.intra left intact
 * Closing connection #0
 <?xml version="1.0" encoding="utf-8"?><soap:Envelope
 xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:xsd="http://www.w3.org/2001/XMLSchema"><soap:Body><soap:Fault><soap:Code><soap:Value>soap:Receiver</soap:Value></soap:Code><soap:Reason><soap:Text
 xml:lang="en">Server was unable to process request. ---&gt; Data at the root
 level is invalid. Line 1, position 1.</soap:Text></soap:Reason><soap:Detail
 /></soap:Fault></soap:Body></soap:Envelope>[iannero1@ip-10-145-32-121 logs]$




 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 27, 2012 1:10 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 The file is usually called logging.ini, and is referenced by the main 
 manifoldcf properties file.

 Karl

 On Tue, Nov 27, 2012 at 1:06 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 Where is the logging properties file where I would add the debugging 
 commands located ?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 27, 2012 12:52 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Hi Bob,

 If the headers all check out, then maybe this is the cause:

 http://technet.microsoft.com/en-us/library/dd566199%28v=ws.10%29.aspx

 I will have to check

Re: Cannot connect to SharePoint 2010 instance

2012-11-27 Thread Karl Wright
Here we go:

Header logging: org.apache.http.headers=DEBUG
Wire logging (which we probably don't need): org.apache.http.wire=DEBUG

Karl
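In logging.ini terms, a minimal sketch (the logger names are the httpcomponents ones above; the surrounding syntax is standard log4j properties, and the rest of the file is assumed to already exist):

```ini
# Header-level debugging for the new httpcomponents client:
log4j.logger.org.apache.http.headers=DEBUG
# Full wire logging; very verbose, enable only if headers are not enough:
log4j.logger.org.apache.http.wire=DEBUG
```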

On Tue, Nov 27, 2012 at 2:04 PM, Karl Wright daddy...@gmail.com wrote:
 The wire debugging setup you are using will only work with
 commons-httpclient, not the new httpcomponent package.  I'll have to
 do some research and see if there's a comparable logger setting for
 that package.

 Karl


 On Tue, Nov 27, 2012 at 2:01 PM, Iannetti, Robert
 robert.ianne...@novartis.com wrote:
 Karl,

  It's odd. I added this to the properties.xml file:
  <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

 And this to the logging.ini file
 log4j.logger.httpclient.wire=DEBUG

 I restarted manifold but nothing is being written to the manifoldcf.log file

 Any thoughts?

 Here is the curl data


 [iannero1@ip-10-145-32-121 logs]$ curl --data POST --ntlm -u  
 nanet\\iannero1 
 http://searchpoc.testprojects.nibr.novartis.intra/_vti_bin/MCPermissions.asmx
  -v
 Enter host password for user 'nanet\iannero1':
 * About to connect() to searchpoc.testprojects.nibr.novartis.intra port 80 
 (#0)
 *   Trying 160.62.169.185... connected
 * Connected to searchpoc.testprojects.nibr.novartis.intra (160.62.169.185) 
 port 80 (#0)
 * Initializing NSS with certpath: sql:/etc/pki/nssdb
 * Server auth using NTLM with user 'nanet\iannero1'
 POST /_vti_bin/MCPermissions.asmx HTTP/1.1
 Authorization: NTLM TlRMTVNTUAABBoIIAAA=
 User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 
 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
 Host: searchpoc.testprojects.nibr.novartis.intra
 Accept: */*
 Content-Length: 0
 Content-Type: application/x-www-form-urlencoded

  HTTP/1.1 401 Unauthorized
  Server: Microsoft-IIS/7.5
  SPRequestGuid: f7b8f5a5-1de4-43d1-9b70-7adf3b7d5987
  WWW-Authenticate: NTLM 
 TlRMTVNTUAACBwAHADgGgokClVhQpcbj++YAAMoAygA/BgGxHQ9OSUJSTkVUAgAOAE4ASQBCAFIATgBFAFQAAQAYAE4AUgBVAFMAQwBBAC0AUwBEADAANwA5AAQAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQAAwA8AE4AUgBVAFMAQwBBAC0AUwBEADAANwA5AC4AbgBpAGIAcgAuAG4AbwB2AGEAcgB0AGkAcwAuAG4AZQB0AAUAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQABwAIAA9MVPbLzM0BAA==
  X-Powered-By: ASP.NET
  MicrosoftSharePointTeamServices: 14.0.0.6123
  X-MS-InvokeApp: 1; RequireReadOnly
  Date: Tue, 27 Nov 2012 18:21:04 GMT
  Content-Length: 0
 
 * Connection #0 to host searchpoc.testprojects.nibr.novartis.intra left 
 intact
 * Issue another request to this URL: 
 'http://searchpoc.testprojects.nibr.novartis.intra/_vti_bin/MCPermissions.asmx'
 * Re-using existing connection! (#0) with host 
 searchpoc.testprojects.nibr.novartis.intra
 * Connected to searchpoc.testprojects.nibr.novartis.intra (160.62.169.185) 
 port 80 (#0)
 * Server auth using NTLM with user 'nanet\iannero1'
 POST /_vti_bin/MCPermissions.asmx HTTP/1.1
 Authorization: NTLM 
 TlRMTVNTUAADGAAYAEAYABgAWAUABQBwCAAIAHUQABAAfQAABoKJApbX61a3hdN3ANfyrXuxF91dkEOBT5GMXTvsPdHWkjT6rm5hbmV0aWFubmVybzFpcC0xMC0xNDUtMzItMTIx
 User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 
 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
 Host: searchpoc.testprojects.nibr.novartis.intra
 Accept: */*
 Content-Length: 4
 Content-Type: application/x-www-form-urlencoded

  HTTP/1.1 500 Internal Server Error
  Cache-Control: private
  Content-Type: application/soap+xml; charset=utf-8
  Server: Microsoft-IIS/7.5
  X-AspNet-Version: 2.0.50727
  Persistent-Auth: true
  X-Powered-By: ASP.NET
  MicrosoftSharePointTeamServices: 14.0.0.6123
  X-MS-InvokeApp: 1; RequireReadOnly
  Date: Tue, 27 Nov 2012 18:21:04 GMT
  Content-Length: 509
 
 * Connection #0 to host searchpoc.testprojects.nibr.novartis.intra left 
 intact
 * Closing connection #0
  <?xml version="1.0" encoding="utf-8"?><soap:Envelope
  xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"><soap:Body><soap:Fault><soap:Code><soap:Value>soap:Receiver</soap:Value></soap:Code><soap:Reason><soap:Text
  xml:lang="en">Server was unable to process request. ---&gt; Data at the
  root level is invalid. Line 1, position
  1.</soap:Text></soap:Reason><soap:Detail
  /></soap:Fault></soap:Body></soap:Envelope>[iannero1@ip-10-145-32-121 logs]$




 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 27, 2012 1:10 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 The file is usually called logging.ini, and is referenced by the main 
 manifoldcf properties file.

 Karl

 On Tue, Nov 27, 2012 at 1:06 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 Where is the logging properties file where I would add the debugging 
 commands located ?

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 27, 2012 12:52 PM
 To: user@manifoldcf.apache.org

Re: Cannot connect to SharePoint 2010 instance

2012-11-27 Thread Karl Wright
Yes.
Karl

On Tue, Nov 27, 2012 at 2:14 PM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 So would the org.apache.http.headers=DEBUG replace the 
 log4j.logger.httpclient.wire=DEBUG entry in the logging.ini file?

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 27, 2012 2:07 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Here we go:

 Header logging: org.apache.http.headers=DEBUG
 Wire logging (which we probably don't need): org.apache.http.wire=DEBUG

 Karl

 On Tue, Nov 27, 2012 at 2:04 PM, Karl Wright daddy...@gmail.com wrote:
 The wire debugging setup you are using will only work with
 commons-httpclient, not the new httpcomponent package.  I'll have to
 do some research and see if there's a comparable logger setting for
 that package.

 Karl


 On Tue, Nov 27, 2012 at 2:01 PM, Iannetti, Robert
 robert.ianne...@novartis.com wrote:
 Karl,

  It's odd. I added this to the properties.xml file:
  <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

 And this to the logging.ini file
 log4j.logger.httpclient.wire=DEBUG

 I restarted manifold but nothing is being written to the
 manifoldcf.log file

 Any thoughts?

 Here is the curl data


 [iannero1@ip-10-145-32-121 logs]$ curl --data POST --ntlm -u
 nanet\\iannero1 
 http://searchpoc.testprojects.nibr.novartis.intra/_vti_bin/MCPermissions.asmx
  -v Enter host password for user 'nanet\iannero1':
 * About to connect() to searchpoc.testprojects.nibr.novartis.intra port 80 
 (#0)
 *   Trying 160.62.169.185... connected
 * Connected to searchpoc.testprojects.nibr.novartis.intra
 (160.62.169.185) port 80 (#0)
 * Initializing NSS with certpath: sql:/etc/pki/nssdb
 * Server auth using NTLM with user 'nanet\iannero1'
 POST /_vti_bin/MCPermissions.asmx HTTP/1.1
 Authorization: NTLM TlRMTVNTUAABBoIIAAA=
 User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7
 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
 Host: searchpoc.testprojects.nibr.novartis.intra
 Accept: */*
 Content-Length: 0
 Content-Type: application/x-www-form-urlencoded

  HTTP/1.1 401 Unauthorized
  Server: Microsoft-IIS/7.5
  SPRequestGuid: f7b8f5a5-1de4-43d1-9b70-7adf3b7d5987
  WWW-Authenticate: NTLM
 TlRMTVNTUAACBwAHADgGgokClVhQpcbj++YAAMoAygA/BgGxH
 Q9OSUJSTkVUAgAOAE4ASQBCAFIATgBFAFQAAQAYAE4AUgBVAFMAQwBBAC0AUwBEAD
 AANwA5AAQAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQAAwA8AE4AUgB
 VAFMAQwBBAC0AUwBEADAANwA5AC4AbgBpAGIAcgAuAG4AbwB2AGEAcgB0AGkAcwAuAG4A
 ZQB0AAUAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQABwAIAA9MVPbLz
 M0BAA==
  X-Powered-By: ASP.NET
  MicrosoftSharePointTeamServices: 14.0.0.6123  X-MS-InvokeApp: 1;
 RequireReadOnly  Date: Tue, 27 Nov 2012 18:21:04 GMT 
 Content-Length: 0 
 * Connection #0 to host searchpoc.testprojects.nibr.novartis.intra
 left intact
 * Issue another request to this URL: 
 'http://searchpoc.testprojects.nibr.novartis.intra/_vti_bin/MCPermissions.asmx'
 * Re-using existing connection! (#0) with host
 searchpoc.testprojects.nibr.novartis.intra
 * Connected to searchpoc.testprojects.nibr.novartis.intra
 (160.62.169.185) port 80 (#0)
 * Server auth using NTLM with user 'nanet\iannero1'
 POST /_vti_bin/MCPermissions.asmx HTTP/1.1
 Authorization: NTLM
 TlRMTVNTUAADGAAYAEAYABgAWAUABQBwCAAIAHUQABAAfQAA
 BoKJApbX61a3hdN3ANfyrXuxF91dkEOBT5GM
 XTvsPdHWkjT6rm5hbmV0aWFubmVybzFpcC0xMC0xNDUtMzItMTIx
 User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7
 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
 Host: searchpoc.testprojects.nibr.novartis.intra
 Accept: */*
 Content-Length: 4
 Content-Type: application/x-www-form-urlencoded

  HTTP/1.1 500 Internal Server Error
  Cache-Control: private
  Content-Type: application/soap+xml; charset=utf-8
  Server: Microsoft-IIS/7.5
  X-AspNet-Version: 2.0.50727
  Persistent-Auth: true
  X-Powered-By: ASP.NET
  MicrosoftSharePointTeamServices: 14.0.0.6123
  X-MS-InvokeApp: 1; RequireReadOnly
  Date: Tue, 27 Nov 2012 18:21:04 GMT
  Content-Length: 509
 * Connection #0 to host searchpoc.testprojects.nibr.novartis.intra
 left intact
 * Closing connection #0
 <?xml version="1.0" encoding="utf-8"?><soap:Envelope
 xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:xsd="http://www.w3.org/2001/XMLSchema"><soap:Body><soap:Fault>
 <soap:Code><soap:Value>soap:Receiver</soap:Value></soap:Code><soap:Reason>
 <soap:Text xml:lang="en">Server was unable to process request.
 ---&gt; Data at the root level is invalid. Line 1, position
 1.</soap:Text></soap:Reason><soap:Detail />
 </soap:Fault></soap:Body></soap:Envelope>[iannero1@ip-10-145-32-121
 logs]$




 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 27, 2012 1:10 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 The file is usually called

Re: Web crawling causes Socket Timeout after Database Exception

2012-11-28 Thread Karl Wright
Ok, fix has been checked in.
Karl

On Wed, Nov 28, 2012 at 3:19 AM, Karl Wright daddy...@gmail.com wrote:
 The ticket is CONNECTORS-571.

 Karl

 On Wed, Nov 28, 2012 at 3:12 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Shigeki,

 This confirms my theory that our MySQL driver is not detecting all
 cases where MySQL gives up on a transaction.  We need to correct this,
 but in order to do that we need the SQL error code that MySQL throws
 in this case:

 Caused by: java.sql.SQLException: Lock wait timeout exceeded; try
 restarting transaction

 It looks like somebody actually posted the SQL error code that MYSQL
 sends out with this online:

 ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

 Are you able to build ManifoldCF?  I will check in a fix to trunk for
 this problem shortly; it would be great if you could try it out.

 Thanks,
 Karl
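
For anyone following along: the essence of the fix is to treat MySQL's lock-wait-timeout error code (1205) the same way a deadlock (1213) is treated — as a signal to abort and retry the transaction rather than fail hard. A minimal sketch of that check, using illustrative class and method names rather than MCF's actual ones:

```java
import java.sql.SQLException;

// Sketch only: classify MySQL vendor error codes that mean the
// transaction should be retried instead of treated as a hard failure.
public class MySqlRetryCheck {
    static final int ER_LOCK_DEADLOCK = 1213;      // "Deadlock found when trying to get lock"
    static final int ER_LOCK_WAIT_TIMEOUT = 1205;  // "Lock wait timeout exceeded"

    public static boolean isTransactionRetryable(SQLException e) {
        int code = e.getErrorCode();
        return code == ER_LOCK_DEADLOCK || code == ER_LOCK_WAIT_TIMEOUT;
    }
}
```

A database layer would call this in its catch block and re-run the enclosing transaction when it returns true.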

 On Wed, Nov 28, 2012 at 2:30 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,


 Here is a log of the database exception that occurred while crawling the Web.
 This time, the socket timeout exception did not happen, so it might be a
 different matter.
 Even though the job status remains "Running", it seems that MCF stopped
 crawling (the job was not aborted).
 
 ERROR 2012-11-22 19:36:28,593 (Worker thread '16') - Worker thread aborting
 and restarting due to database connection reset: Database exception:
 Exception doing query: Lock wait timeout exceeded; try restarting
 transaction
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
 exception: Exception doing query: Lock wait timeout exceeded; try restarting
 transaction
 at
 org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
 at
 org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
 at
 org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
 at
 org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
 at
 org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
 at
 org.apache.manifoldcf.core.database.DBInterfaceMySQL.performModification(DBInterfaceMySQL.java:678)
 at
 org.apache.manifoldcf.core.database.DBInterfaceMySQL.performUpdate(DBInterfaceMySQL.java:275)
 at
 org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
 at
 org.apache.manifoldcf.crawler.jobs.HopCount.markForDelete(HopCount.java:1426)
 at
 org.apache.manifoldcf.crawler.jobs.HopCount.doDeleteInvalidation(HopCount.java:1356)
 at
 org.apache.manifoldcf.crawler.jobs.HopCount.doFinish(HopCount.java:1057)
 at
 org.apache.manifoldcf.crawler.jobs.HopCount.finishParents(HopCount.java:389)
 at
 org.apache.manifoldcf.crawler.jobs.JobManager.finishDocuments(JobManager.java:4309)
 at
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:557)
 Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting
 transaction
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
 at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
 at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
 at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
 at
 com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
 at
 com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
 at
 com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2345)
 at
 com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2330)
 at
 org.apache.manifoldcf.core.database.Database.execute(Database.java:840)
 at
 org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)

 


 Here is a log of the database exception that occurred while crawling files
 using the Windows shares connection:


 
 2012/11/22 23:39:28 ERROR (Job start thread) - Job start thread aborting and
 restarting due to database connection reset: Database exception: Exception
 doing query: Lock wait timeout exceeded; try restarting transaction
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
 exception: Exception doing query: Lock wait timeout exceeded; try restarting
 transaction
 at
 org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
 at
 org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
 at
 org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394

Re: Cannot connect to SharePoint 2010 instance

2012-11-29 Thread Karl Wright
 BAQAAAOCqZ5Zwzc0BgVBsO4H03jQAAgAOAE4ASQBCAFIATgBFAFQAAQAYAE4AUgBVAFMAQwBBAC0AUwBEADAANwA5AAQAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQAAwA8AE4AUgBVAFMAQwBBAC0AUwBEADAANwA5AC4AbgBpAGIAcgAuAG4Abw
 B2AGEAcgB0AGkAcwAuAG4AZQB0AAUAIgBuAGkAYgByAC4AbgBvAHYAYQByAHQAaQBzAC4AbgBlAHQABwAIAOdjFpZwzc0BAE4AQQBOAEUAVABpAGEAbgBuAGUAcgBvADEASQBQAC0AMQAwAC0AMQA0ADUALQAzADIALQAxADIAMQA=
 DEBUG 2012-11-28 08:59:31,678 (Thread-479) -  HTTP/1.1 401 Unauthorized
 DEBUG 2012-11-28 08:59:31,678 (Thread-479) -  Server: Microsoft-IIS/7.5
 DEBUG 2012-11-28 08:59:31,678 (Thread-479) -  SPRequestGuid: 
 cfac18c9-3870-4854-bb2d-816f3dc8c2f3
 DEBUG 2012-11-28 08:59:31,678 (Thread-479) -  WWW-Authenticate: NTLM
 DEBUG 2012-11-28 08:59:31,678 (Thread-479) -  X-Powered-By: ASP.NET
 DEBUG 2012-11-28 08:59:31,678 (Thread-479) -  
 MicrosoftSharePointTeamServices: 14.0.0.6123
 DEBUG 2012-11-28 08:59:31,678 (Thread-479) -  X-MS-InvokeApp: 1; 
 RequireReadOnly
 DEBUG 2012-11-28 08:59:31,678 (Thread-479) -  Date: Wed, 28 Nov 2012 
 13:59:31 GMT
 DEBUG 2012-11-28 08:59:31,678 (Thread-479) -  Content-Length: 0

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 27, 2012 5:25 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 no, you need:

 log4j.logger.logger_name=DEBUG

 in this case:

 log4j.logger.org.apache.http.headers=DEBUG

 Thanks,
 Karl
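
For reference, applying this correction to the logging.ini posted earlier in the thread would give something like the following (the appender lines are the stock ones; only the last line is new):

```properties
log4j.appender.MAIN.File=logs/manifoldcf.log
log4j.rootLogger=WARN, MAIN
log4j.appender.MAIN=org.apache.log4j.RollingFileAppender
log4j.appender.MAIN.layout=org.apache.log4j.PatternLayout
log4j.appender.MAIN.layout.ConversionPattern=%5p %d{ISO8601} (%t) - %m%n

# HTTP header debugging for the httpcomponents-based connectors;
# note the required log4j.logger. prefix on the logger name
log4j.logger.org.apache.http.headers=DEBUG
```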


 On Tue, Nov 27, 2012 at 3:30 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 I added the parameter to the logging.ini file and I am still not
 seeing any data written to the log

 # Licensed to the Apache Software Foundation (ASF) under one or more
 # contributor license agreements.  See the NOTICE file distributed with
 # this work for additional information regarding copyright ownership.
 # The ASF licenses this file to You under the Apache License, Version 2.0
 # (the "License"); you may not use this file except in compliance with
 # the License.  You may obtain a copy of the License at
 #
 # http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.

 log4j.appender.MAIN.File=logs/manifoldcf.log
 log4j.rootLogger=WARN, MAIN
 log4j.appender.MAIN=org.apache.log4j.RollingFileAppender
 log4j.appender.MAIN.layout=org.apache.log4j.PatternLayout
 log4j.appender.MAIN.layout.ConversionPattern=%5p %d{ISO8601} (%t) -
 %m%n

 # add additional logging
 org.apache.http.headers=DEBUG




 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 27, 2012 3:20 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Yes.
 Karl

 On Tue, Nov 27, 2012 at 2:14 PM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 So would the org.apache.http.headers=DEBUG replace the 
 log4j.logger.httpclient.wire=DEBUG entry in the logging.ini file?

 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Tuesday, November 27, 2012 2:07 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Here we go:

 Header logging: org.apache.http.headers=DEBUG Wire logging (which we
 probably don't need): org.apache.http.wire=DEBUG

 Karl

 On Tue, Nov 27, 2012 at 2:04 PM, Karl Wright daddy...@gmail.com wrote:
 The wire debugging setup you are using will only work with
 commons-httpclient, not the new httpcomponent package.  I'll have to
 do some research and see if there's a comparable logger setting for
 that package.

 Karl


 On Tue, Nov 27, 2012 at 2:01 PM, Iannetti, Robert
 robert.ianne...@novartis.com wrote:
 Karl,

 It's odd. I added this to the properties.xml file: <property
 name="org.apache.manifoldcf.connectors" value="DEBUG"/>

 And this to the logging.ini file
 log4j.logger.httpclient.wire=DEBUG

 I restarted manifold but nothing is being written to the
 manifoldcf.log file

 Any thoughts?

 Here is the curl data


 [iannero1@ip-10-145-32-121 logs]$ curl --data POST --ntlm -u
 nanet\\iannero1 
 http://searchpoc.testprojects.nibr.novartis.intra/_vti_bin/MCPermissions.asmx
  -v
 Enter host password for user 'nanet\iannero1':
 * About to connect() to searchpoc.testprojects.nibr.novartis.intra port 
 80 (#0)
 *   Trying 160.62.169.185... connected
 * Connected to searchpoc.testprojects.nibr.novartis.intra
 (160.62.169.185) port 80 (#0)
 * Initializing NSS with certpath: sql:/etc/pki/nssdb
 * Server auth using NTLM with user 'nanet\iannero1'
 POST /_vti_bin/MCPermissions.asmx HTTP/1.1
 Authorization: NTLM TlRMTVNTUAABBoIIAAA=
 User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7
 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
 Host: searchpoc.testprojects.nibr.novartis.intra
 Accept: */*
 Content-Length: 0
 Content-Type: application/x-www-form

Re: Web crawling causes Socket Timeout after Database Exception

2012-11-30 Thread Karl Wright
Hi Shigeki,

I noticed that your crawl is using hopcount filtering.  This feature
is costly performance-wise.  If you can crawl with hopcount filtering
disabled, your crawl will be much faster.

To disable it completely, select the radio button titled
読込めないコンテンツ情報は永久保存 (roughly, "keep unreachable content
information forever"), and leave the hopcount fields blank.

Thanks,
Karl

On Fri, Nov 30, 2012 at 1:57 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi, Karl


 I think handling the MySQL exception keeps MCF crawling content. However,
 because of deadlocks, crawling speed would remain slow. I think the
 fundamental solution is to reduce deadlocks in MySQL itself. I am not sure
 whether MCF can solve this, but it is something that people using MySQL
 need to know about.


 Regards,


 Shigeki


 2012/11/28 Karl Wright daddy...@gmail.com

 Yes, the SQL code will be output to the manifoldcf.log as part of the
 exception text.

 However I hope that this checkin will already fix your problem.

 Thanks,
 Karl

 On Wed, Nov 28, 2012 at 3:44 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 
  Hi Karl,
 
  I can try. To obtain the error code, could you let me know what code to
  put in which line of which file? I suppose the error code will be output
  into manifoldcf.log; is that right?
 
 
  Regards,
 
 
  Shigeki
 
 
 
  2012/11/28 Karl Wright daddy...@gmail.com
 
  Hi Shigeki,
 
  This confirms my theory that our MySQL driver is not detecting all
  cases where MySQL gives up on a transaction.  We need to correct this,
  but in order to do that we need the SQL error code that MySQL throws
  in this case:
 
  Caused by: java.sql.SQLException: Lock wait timeout exceeded; try
  restarting transaction
 
  It looks like somebody actually posted the SQL error code that MYSQL
  sends out with this online:
 
  ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting
  transaction
 
  Are you able to build ManifoldCF?  I will check in a fix to trunk for
  this problem shortly; it would be great if you could try it out.
 
  Thanks,
  Karl
 
  On Wed, Nov 28, 2012 at 2:30 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
   Hi Karl,
  
  
   Here is a log of Database Exception that is occurred while crawling
   Web.
   This time, socket timeout exception did not happen so it might be a
   different matter.
   Even though the job status remain Running, it seems that MCF
   stopped
   crawling (The job was not aborted).
   
   ERROR 2012-11-22 19:36:28,593 (Worker thread '16') - Worker thread
   aborting
   and restarting due to database connection reset: Database exception:
   Exception doing query: Lock wait timeout exceeded; try restarting
   transaction
   org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
   exception: Exception doing query: Lock wait timeout exceeded; try
   restarting
   transaction
   at
  
  
   org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
   at
  
  
   org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
   at
  
  
   org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
   at
  
  
   org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
   at
  
  
   org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
   at
  
  
   org.apache.manifoldcf.core.database.DBInterfaceMySQL.performModification(DBInterfaceMySQL.java:678)
   at
  
  
   org.apache.manifoldcf.core.database.DBInterfaceMySQL.performUpdate(DBInterfaceMySQL.java:275)
   at
  
  
   org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
   at
  
  
   org.apache.manifoldcf.crawler.jobs.HopCount.markForDelete(HopCount.java:1426)
   at
  
  
   org.apache.manifoldcf.crawler.jobs.HopCount.doDeleteInvalidation(HopCount.java:1356)
   at
  
   org.apache.manifoldcf.crawler.jobs.HopCount.doFinish(HopCount.java:1057)
   at
  
  
   org.apache.manifoldcf.crawler.jobs.HopCount.finishParents(HopCount.java:389)
   at
  
  
   org.apache.manifoldcf.crawler.jobs.JobManager.finishDocuments(JobManager.java:4309)
   at
  
  
   org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:557)
   Caused by: java.sql.SQLException: Lock wait timeout exceeded; try
   restarting
   transaction
   at
   com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
   at
   com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624

Re: Running multiple MCFs on one Tomcat

2012-11-30 Thread Karl Wright
Hi Shigeki,

Each MCF instance should have its own properties.xml file.  Since the
way you tell MCF where the properties.xml file is located is with a -D
switch, I don't think you can run multiple instances properly in one
JVM.

If this is important to you, please let us know, and also please
describe what you are trying to do this for.

Thanks,
Karl
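
For completeness, running several crawler processes side by side (one JVM each) is straightforward, since each JVM gets its own -D switch. A hypothetical sketch — the instance paths are made up, and start.jar stands in for whatever launcher your MCF deployment uses:

```shell
# Two independent MCF instances, each with its own properties.xml
# (and therefore its own database, synch directory, and ports).
java -Dorg.apache.manifoldcf.configfile=/opt/mcf-a/properties.xml -jar start.jar &
java -Dorg.apache.manifoldcf.configfile=/opt/mcf-b/properties.xml -jar start.jar &
```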

On Thu, Nov 29, 2012 at 8:05 PM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi everyone,


 Just wondering if anyone has tried running multiple MCF instances on one Tomcat
 (not multiple jobs in one MCF).

 If that's possible, I'd like to test crawling performance using multiple
 MCF instances.

 Regards,


 Shigeki


Re: Running multiple MCFs on one Tomcat

2012-12-04 Thread Karl Wright
CPU usage is a function of the crawling task, and to some extent the
database.  When you run an open crawl on PostgreSQL, CPU usage is very
high.  If you are running a constrained, throttled crawl, CPU usage is
low.

Karl

On Tue, Dec 4, 2012 at 4:07 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,


 I noticed MCF does not use much CPU. I was wondering if running multiple
 MCF instances could increase CPU usage.


 Regards,

 Shigeki

 2012/11/30 Karl Wright daddy...@gmail.com

 Hi Shigeki,

 Each MCF instance should have its own properties.xml file.  Since the
 way you tell MCF where the properties.xml file is located is with a -D
 switch, I don't think you can run multiple instances properly in one
 JVM.

 If this is important to you, please let us know, and also please
 describe what you are trying to do this for.

 Thanks,
 Karl

 On Thu, Nov 29, 2012 at 8:05 PM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
  Hi everyone,
 
 
  Just wondering if there is anyone tried running multiple MCFs on one
  Tomcat
  (not multiple jobs in one MCF).
 
  If that's possible, I like to try testing crawling performance using
  multiple MCFs.
 
  Regards,
 
 
  Shigeki






Re: Cannot connect to SharePoint 2010 instance

2012-12-05 Thread Karl Wright
Hi Robert,

I've solved Luigi's problem - and now I want to know if it solves
yours.  Unfortunately, you WILL have to build ManifoldCF for this
step, since I cannot modify the build process easily to accommodate
the patched httpcomponents dependencies.

Can you do the following:

(1) Check out a trunk copy of the ManifoldCF sources, e.g. svn co
"https://svn.apache.org/repos/asf/manifoldcf/trunk".
(2) Download the lib package from
http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev, unpack
it, and install it in the lib directory as per the instructions in the
lib package.
(3) Run ant build to be sure you can actually build the project.  If
that works, download the two patched httpcomponents jars from
http://people.apache.org/~kwright , and use them to overwrite
lib/httpcore.jar and lib/httpclient.jar.
(4) Run ant build clean
(5) Start manifoldcf (it's under the dist directory), and see if you
can connect to your sharepoint instance.

Thanks!
Karl


On Thu, Nov 29, 2012 at 8:56 AM, Iannetti, Robert
robert.ianne...@novartis.com wrote:
 Hi Karl,

 I have been following your thread with Luigi I look forward to testing the 
 new release.

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Thursday, November 29, 2012 3:28 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Hi Robert,

 Luigi and I think we've discovered the issue, which we're going to see if we 
 can confirm today.  There is a ticket tracking it, which is CONNECTORS-572.  
 If correct, it appears that Windows may have changed what it considers to be 
 the name of the user at some recent time, and the httpcomponents and 
 commons-httpclient implementations of NTLM are not resilient to this change - 
 which isn't surprising since they are basically reverse-engineered.  If 
 correct, httpcomponents will likely need to release a patch, so the schedule 
 will be, in part, up to them.
  Alternatively, we can build and patch httpcomponents as part of the 
 ManifoldCF release process, but it would require us to have a new Maven 
 dependency for the make-core-deps part of our release.

 Karl

 On Wed, Nov 28, 2012 at 9:01 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 Here is my debug output

 DEBUG 2012-11-28 08:59:25,884 (Thread-479) -  POST
 /_vti_bin/lists.asmx HTTP/1
 .1
 DEBUG 2012-11-28 08:59:25,899 (Thread-479) -  Content-Type:
 text/xml; charset=
 utf-8
 DEBUG 2012-11-28 08:59:25,899 (Thread-479) -  SOAPAction:
 http://schemas.micr osoft.com/sharepoint/soap/GetListCollection
 DEBUG 2012-11-28 08:59:25,899 (Thread-479) -  User-Agent: Axis/1.4
 DEBUG 2012-11-28 08:59:25,899 (Thread-479) -  Content-Length: 335
 DEBUG 2012-11-28 08:59:25,899 (Thread-479) -  Host:
 searchpoc.testprojects.nib r.novartis.intra DEBUG 2012-11-28
 08:59:25,899 (Thread-479) -  Connection: Keep-Alive DEBUG 2012-11-28
 08:59:30,629 (Thread-479) -  HTTP/1.1 401 Unauthorized DEBUG
 2012-11-28 08:59:30,629 (Thread-479) -  Server: Microsoft-IIS/7.5
 DEBUG 2012-11-28 08:59:30,629 (Thread-479) -  SPRequestGuid:
 56647ed0-9bac-4a2
 e-b61a-2d2e76ae8db0
 DEBUG 2012-11-28 08:59:30,629 (Thread-479) -  WWW-Authenticate: NTLM
 DEBUG 2012-11-28 08:59:30,629 (Thread-479) -  X-Powered-By: ASP.NET
 DEBUG 2012-11-28 08:59:30,630 (Thread-479) -  
 MicrosoftSharePointTeamServices:
  14.0.0.6123
 DEBUG 2012-11-28 08:59:30,630 (Thread-479) -  X-MS-InvokeApp: 1;
 RequireReadOn ly DEBUG 2012-11-28 08:59:30,630 (Thread-479) -  Date:
 Wed, 28 Nov 2012 13:59:30 GMT DEBUG 2012-11-28 08:59:30,630
 (Thread-479) -  Content-Length: 0 DEBUG 2012-11-28 08:59:30,663
 (Thread-479) -  POST /_vti_bin/lists.asmx HTTP/1.1 DEBUG 2012-11-28
 08:59:30,663 (Thread-479) -  Content-Type: text/xml; charset=utf-8
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  SOAPAction: 
 http://schemas.microsoft.com/sharepoint/soap/GetListCollection;
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  User-Agent: Axis/1.4
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  Content-Length: 335
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  Host:
 searchpoc.testprojects.nibr.novartis.intra
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  Connection: Keep-Alive
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  Authorization: NTLM
 TlRMTVNTUAABNQIIIAoACgBAIAAgACBJAFAALQAxADAALQAxADQANQAtAD
 MAMgAtADEAMgAxAE4AQQBOAEUAVAA= DEBUG 2012-11-28 08:59:30,680
 (Thread-479) -  HTTP/1.1 401 Unauthorized DEBUG 2012-11-28
 08:59:30,680 (Thread-479) -  Server: Microsoft-IIS/7.5 DEBUG
 2012-11-28 08:59:30,680 (Thread-479) -  SPRequestGuid:
 208f5c66-7d26-4761-b578-d01645f042ed
 DEBUG 2012-11-28 08:59:30,680 (Thread-479) -  WWW-Authenticate: NTLM
 TlRMTVNTUAACDgAOADg1Aoki47BOSwwS+moAAMoAygBGBgGxHQ
 9OAEkAQgBSAE4ARQBUAAIADgBOAEkAQgBSAE4ARQBUAAEAGABOAFIAVQBTAEMA
 QQAtAFMARAAwADcAOQAEACIAbgBpAGIAcgAuAG4AbwB2AGEAcgB0AGkAcwAuAG4AZQB0AA
 MAPABOAFIAVQBTAEMAQQAtAFMARAAwADcAOQAuAG4AaQBiAHIALgBuAG8AdgBhAHIAdABp

Re: Cannot connect to SharePoint 2010 instance

2012-12-05 Thread Karl Wright
I actually did decide to modify the build to pull the changed jars
down automatically.  So you can just download the artifacts under
http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev and you
should get updated binaries.

Karl

On Wed, Dec 5, 2012 at 6:03 PM, Karl Wright daddy...@gmail.com wrote:
 Hi Robert,

 I've solved Luigi's problem - and now I want to know if it solves
 yours.  Unfortunately, you WILL have to build ManifoldCF for this
 step, since I cannot modify the build process easily to accommodate
 the patched httpcomponents dependencies.

 Can you do the following:

 (1) Check out a trunk copy of the ManifoldCF sources, e.g. svn co
 "https://svn.apache.org/repos/asf/manifoldcf/trunk".
 (2) Download the lib package from
 http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev, unpack
 it, and install it in the lib directory as per the instructions in the
 lib package.
 (3) Run ant build to be sure you can actually build the project.  If
 that works, download the two patched httpcomponents jars from
 http://people.apache.org/~kwright , and use them to overwrite
 lib/httpcore.jar and lib/httpclient.jar.
 (4) Run ant build clean
 (5) Start manifoldcf (it's under the dist directory), and see if you
 can connect to your sharepoint instance.

 Thanks!
 Karl


 On Thu, Nov 29, 2012 at 8:56 AM, Iannetti, Robert
 robert.ianne...@novartis.com wrote:
 Hi Karl,

 I have been following your thread with Luigi I look forward to testing the 
 new release.

 Thanks
 Bob


 -Original Message-
 From: Karl Wright [mailto:daddy...@gmail.com]
 Sent: Thursday, November 29, 2012 3:28 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Cannot connect to SharePoint 2010 instance

 Hi Robert,

 Luigi and I think we've discovered the issue, which we're going to see if we 
 can confirm today.  There is a ticket tracking it, which is CONNECTORS-572.  
 If correct, it appears that Windows may have changed what it considers to be 
 the name of the user at some recent time, and the httpcomponents and 
 commons-httpclient implementations of NTLM are not resilient to this change 
 - which isn't surprising since they are basically reverse-engineered.  If 
 correct, httpcomponents will likely need to release a patch, so the schedule 
 will be, in part, up to them.
  Alternatively, we can build and patch httpcomponents as part of the 
 ManifoldCF release process, but it would require us to have a new Maven 
 dependency for the make-core-deps part of our release.

 Karl

 On Wed, Nov 28, 2012 at 9:01 AM, Iannetti, Robert 
 robert.ianne...@novartis.com wrote:
 Karl,

 Here is my debug output

 DEBUG 2012-11-28 08:59:25,884 (Thread-479) -  POST
 /_vti_bin/lists.asmx HTTP/1
 .1
 DEBUG 2012-11-28 08:59:25,899 (Thread-479) -  Content-Type:
 text/xml; charset=
 utf-8
 DEBUG 2012-11-28 08:59:25,899 (Thread-479) -  SOAPAction:
 http://schemas.micr osoft.com/sharepoint/soap/GetListCollection
 DEBUG 2012-11-28 08:59:25,899 (Thread-479) -  User-Agent: Axis/1.4
 DEBUG 2012-11-28 08:59:25,899 (Thread-479) -  Content-Length: 335
 DEBUG 2012-11-28 08:59:25,899 (Thread-479) -  Host:
 searchpoc.testprojects.nib r.novartis.intra DEBUG 2012-11-28
 08:59:25,899 (Thread-479) -  Connection: Keep-Alive DEBUG 2012-11-28
 08:59:30,629 (Thread-479) -  HTTP/1.1 401 Unauthorized DEBUG
 2012-11-28 08:59:30,629 (Thread-479) -  Server: Microsoft-IIS/7.5
 DEBUG 2012-11-28 08:59:30,629 (Thread-479) -  SPRequestGuid:
 56647ed0-9bac-4a2
 e-b61a-2d2e76ae8db0
 DEBUG 2012-11-28 08:59:30,629 (Thread-479) -  WWW-Authenticate: NTLM
 DEBUG 2012-11-28 08:59:30,629 (Thread-479) -  X-Powered-By: ASP.NET
 DEBUG 2012-11-28 08:59:30,630 (Thread-479) -  
 MicrosoftSharePointTeamServices:
  14.0.0.6123
 DEBUG 2012-11-28 08:59:30,630 (Thread-479) -  X-MS-InvokeApp: 1;
 RequireReadOn ly DEBUG 2012-11-28 08:59:30,630 (Thread-479) -  Date:
 Wed, 28 Nov 2012 13:59:30 GMT DEBUG 2012-11-28 08:59:30,630
 (Thread-479) -  Content-Length: 0 DEBUG 2012-11-28 08:59:30,663
 (Thread-479) -  POST /_vti_bin/lists.asmx HTTP/1.1 DEBUG 2012-11-28
 08:59:30,663 (Thread-479) -  Content-Type: text/xml; charset=utf-8
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  SOAPAction: 
 http://schemas.microsoft.com/sharepoint/soap/GetListCollection;
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  User-Agent: Axis/1.4
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  Content-Length: 335
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  Host:
 searchpoc.testprojects.nibr.novartis.intra
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  Connection: Keep-Alive
 DEBUG 2012-11-28 08:59:30,663 (Thread-479) -  Authorization: NTLM
 TlRMTVNTUAABNQIIIAoACgBAIAAgACBJAFAALQAxADAALQAxADQANQAtAD
 MAMgAtADEAMgAxAE4AQQBOAEUAVAA= DEBUG 2012-11-28 08:59:30,680
 (Thread-479) -  HTTP/1.1 401 Unauthorized DEBUG 2012-11-28
 08:59:30,680 (Thread-479) -  Server: Microsoft-IIS/7.5 DEBUG
 2012-11-28 08:59:30,680 (Thread-479) -  SPRequestGuid:
 208f5c66-7d26-4761-b578-d01645f042ed
 DEBUG 2012-11-28 08:59:30,680 (Thread-479

Re: Web crawl exited with an unexpected jobqueue status error under MySQL

2012-12-05 Thread Karl Wright
Actually, I just noticed this: "I ran MCF 0.6".  MCF 0.6 runs MySQL in
the wrong mode, so this was a problem.  It was fixed in ManifoldCF
1.0.

Can you upgrade to MCF 1.0.1 and see if this still happens for you?

Karl

On Wed, Dec 5, 2012 at 9:46 PM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:
 Hello Karl.

 MySQL:  5.5.24
 Tomcat:  6.0.35
 CentOS: 6.3


 Regards,

 Shigeki


 2012/12/5 Karl Wright daddy...@gmail.com

 Yes, I believe it is related, in the sense that the fix for
 CONNECTORS-246 was a fix to the HSQLDB database.  This error makes it
 clear that MySQL has a similar problem with its MVCC model, and will
 also require a fix.  However, I do not have the same kinds of leverage
 in the MySQL community that I do with HSQLDB.

 Can you give some details about the version of MySQL you are running,
 and on what platform?  I will capture that and then maybe figure out
 how to open a MySQL ticket.

 Karl

 On Wed, Dec 5, 2012 at 6:57 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 
  Hi.
 
  I ran MCF 0.6 under MySQL 5.5. I crawled the Web and the following error
  occurred; then MCF stopped the job:
 
  
  2012/12/04 18:50:07 ERROR (Worker thread '0') - Exception tossed:
  Unexpected
  jobqueue status - record id 1354608871138, expecting active status, saw
  3
  org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
  jobqueue status - record id 1354608871138, expecting active status, saw
  3
  at
 
  org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:711)
  at
 
  org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:2435)
  at
 
  org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:745)
  
 
  There was a similar ticket, "A file crawl exited with an unexpected
  jobqueue status error under HSQLDB":
  https://issues.apache.org/jira/browse/CONNECTORS-246
 
 
  Wondering if this is related..
 
 
  Regards,
 
  Shigeki







Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized

2012-12-06 Thread Karl Wright
Hi Luigi,

Others have also run into this exception, from one or more SharePoint
web services.  It is a server side catch-all exception which tells us
very little.

You may get more details by looking at the server's event logs.
SharePoint also has a log you can look at which may be even more
helpful.  In my experience, this is often the result of administrators
changing the system's permissions in ways that cause SharePoint's web
services to stop functioning correctly.  At MetaCarta we never saw
this on fresh SharePoint installations, only on installations where
people later made adjustments to the system permissions.

I hope you have access to a competent SharePoint system administrator,
because without that, it will be very hard to resolve this problem.

Thanks,
Karl

On Thu, Dec 6, 2012 at 5:12 AM, Luigi D'Addario
luigi.dadda...@googlemail.com wrote:
 Karl,

 I'm trying to put my SharePoint documents from Shared Documents into Solr.
 What do you think about this exception?
 Permission problems again, or something else?

 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Getting
 version of '/Shared Documents//'
 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Checking
 whether to include library '/Shared Documents'
 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Library
 '/Shared Documents' exactly matched rule path '/Shared Documents'
 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Including
 library '/Shared Documents'
 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Processing:
 '/Shared Documents//'
 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Document
 identifier is a library: '/Shared Documents'
 DEBUG 2012-12-06 11:02:09,515 (Worker thread '3') - Enter:
 CommonsHTTPSender::invoke
 DEBUG 2012-12-06 11:02:10,000 (Worker thread '3') - Exit:
 CommonsHTTPSender::invoke
 DEBUG 2012-12-06 11:02:10,031 (Worker thread '3') - Enter:
 CommonsHTTPSender::invoke
 DEBUG 2012-12-06 11:02:10,406 (Worker thread '3') - Exit:
 CommonsHTTPSender::invoke
 DEBUG 2012-12-06 11:02:10,421 (Worker thread '3') - SharePoint: Got an
 unknown remote exception getting child documents for site  guid
 {CC072748-E1EE-4F34-B120-FAF33273A616} - axis fault = Server.Dsp.Connect,
 detail = Cannot open the requested Sharepoint Site. - retrying
 AxisFault
  faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.Dsp.Connect
  faultSubcode:
  faultString: Cannot open the requested Sharepoint Site.
  faultActor:
  faultNode:
  faultDetail:
 {http://schemas.microsoft.com/sharepoint/dsp}queryResponse: <dsQueryResponse status="failure"/>

 Cannot open the requested Sharepoint Site.


 I send you manifoldcf.log.


 Thanks.

 Luigi


 2012/12/5 Luigi D'Addario luigi.dadda...@googlemail.com

 ..and tomorrow I will finally try to put my SharePoint documents into
 Solr!


 2012/12/5 Karl Wright daddy...@gmail.com

 I'll have to figure out how to get this patched httpcomponents release
 into the field





Re: SharePoint 2007 Connector - (401)HTTP/1.1 401 Unauthorized

2012-12-06 Thread Karl Wright
If you have access to the SharePoint installation media itself, one
approach would be to try to install your own version of SharePoint on
a similar environment.  Prove to yourself (and others) that you can
actually crawl on that SharePoint.  Then, based on what the target
system's event logs and SharePoint logs tell you, you can start
modifying settings and module permissions to match the fresh
installation's, until it works.  You can also save yourself some time
by getting the actual request being done using http wire debugging in
ManifoldCF, and then trying that request over and over with curl until
you get it to not fail.
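That "try it over and over" loop is easy to automate; here is a minimal sketch, with the actual HTTP call (curl via subprocess, urllib, an Axis client, whatever issued the captured request) abstracted behind a callable, so nothing in it is real ManifoldCF code or a real SharePoint host:

```python
import time

def replay_until_ok(fetch, attempts=5, delay=0.0):
    """Call fetch() until it returns an HTTP status below 400, or give up."""
    last = None
    for _ in range(attempts):
        last = fetch()
        if last < 400:
            return last
        time.sleep(delay)  # leave room to adjust permissions between tries
    raise RuntimeError("still failing after %d attempts: HTTP %s" % (attempts, last))

# Stub standing in for the captured web-service request; here it starts
# succeeding on the third try, as if a permission had just been corrected.
responses = iter([401, 401, 200])
assert replay_until_ok(lambda: next(responses)) == 200
```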

Thanks,
Karl

On Thu, Dec 6, 2012 at 6:29 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Luigi,

 Others have also run into this exception, from one or more SharePoint
 web services.  It is a server side catch-all exception which tells us
 very little.

 You may get more details by looking at the server's event logs.
 SharePoint also has a log you can look at which may be even more
 helpful.  In my experience, this is often the result of administrators
 changing the system's permissions in ways that cause SharePoint's web
 services to stop functioning correctly.  At MetaCarta we never would
 see this on fresh SharePoint installations, but only on those where
 SharePoint was first installed, and then afterwards people made
 adjustments to the system permissions.

 I hope you have access to a competent SharePoint system administrator,
 because without that, it will be very hard to resolve this problem.

 Thanks,
 Karl

 On Thu, Dec 6, 2012 at 5:12 AM, Luigi D'Addario
 luigi.dadda...@googlemail.com wrote:
 Karl,

 I'm trying to put my SharePoint documents from Shared Documents into Solr.
 What do you think about this exception?
 Permission problems again, or something else?

 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Getting
 version of '/Shared Documents//'
 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Checking
 whether to include library '/Shared Documents'
 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Library
 '/Shared Documents' exactly matched rule path '/Shared Documents'
 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Including
 library '/Shared Documents'
 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Processing:
 '/Shared Documents//'
 DEBUG 2012-12-06 11:02:09,500 (Worker thread '3') - SharePoint: Document
 identifier is a library: '/Shared Documents'
 DEBUG 2012-12-06 11:02:09,515 (Worker thread '3') - Enter:
 CommonsHTTPSender::invoke
 DEBUG 2012-12-06 11:02:10,000 (Worker thread '3') - Exit:
 CommonsHTTPSender::invoke
 DEBUG 2012-12-06 11:02:10,031 (Worker thread '3') - Enter:
 CommonsHTTPSender::invoke
 DEBUG 2012-12-06 11:02:10,406 (Worker thread '3') - Exit:
 CommonsHTTPSender::invoke
 DEBUG 2012-12-06 11:02:10,421 (Worker thread '3') - SharePoint: Got an
 unknown remote exception getting child documents for site  guid
 {CC072748-E1EE-4F34-B120-FAF33273A616} - axis fault = Server.Dsp.Connect,
 detail = Cannot open the requested Sharepoint Site. - retrying
 AxisFault
  faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.Dsp.Connect
  faultSubcode:
  faultString: Cannot open the requested Sharepoint Site.
  faultActor:
  faultNode:
  faultDetail:
 {http://schemas.microsoft.com/sharepoint/dsp}queryResponse: <dsQueryResponse status="failure"/>

 Cannot open the requested Sharepoint Site.


 I send you manifoldcf.log.


 Thanks.

 Luigi


 2012/12/5 Luigi D'Addario luigi.dadda...@googlemail.com

 ..and tomorrow I will finally try to put my SharePoint documents into
 Solr!


 2012/12/5 Karl Wright daddy...@gmail.com

 I'll have to figure out how to get this patched httpcomponents release
 into the field





Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-09 Thread Karl Wright
Hi Shigeki,

The rules for when a database will use an index for an ORDER BY clause
differ significantly from database to database.  The current logic
seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
 I will see if I can find a solution.  The ticket for this is
CONNECTORS-584.

Karl

On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:

 Hi.


 I downloaded MCF 1.1-dev on Nov 29th and ran it using MySQL.
 I tried to crawl 10 million files using the Windows share connection and index
 them into Solr.

 As MCF reached over 1 million files, the crawling speed started getting
 slower.
 So I checked slow queries and found out that too many slow queries occurred,
 especially the following kinds:

 
 # Time: 121204 16:25:40
 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
 # Query_time: 7.240532  Lock_time: 0.000204 Rows_sent: 1200  Rows_examined:
 611091
 SET timestamp=1354605940;
 SELECT
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND
 t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 # Time: 121204 16:25:44
 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
 # Query_time: 3.064339  Lock_time: 0.84 Rows_sent: 1  Rows_examined:
 406359
 SET timestamp=1354605944;
 SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE status IN
 ('P','G') AND checkaction='R' AND checktime<=1354605932817 AND EXISTS(SELECT
 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid)  ORDER BY
 docpriority ASC,status ASC,checkaction ASC,checktime ASC LIMIT 1;
 ---

 I wonder if the queries use the table's indexes appropriately.
 Running EXPLAIN against the slow query showed a filesort.
 There seem to be some conditions under which MySQL will not use an index,
 depending on the ORDER BY:
  - when ORDER BY is executed against multiple keys
  - when the keys used to select records differ from the keys used by ORDER BY

 Since a filesort is happening, the resulting full record scans are probably
 what is slowing MCF down.

 Do you think this could happen even in PostgreSQL or HSQLDB?
 Do you think queries could be modified to use index appropriately?


 Regards,

 Shigeki


Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Karl Wright
Since you have a large table, can you try an EXPLAIN for the following
query, which should match the explanation given here:
http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
Does it use the index?

SELECT
t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
AND t0.checkaction='R' AND
t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status
IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200
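To make concrete what this query asks the database to do: it must deliver the 1200 "smallest" eligible rows under a four-column sort key. The sketch below (plain Python over toy in-memory rows with made-up values; not ManifoldCF code) shows the work a filesort performs when no index can supply that order:

```python
import heapq

# Toy stand-in for jobqueue rows; only the four ORDER BY columns matter here.
rows = [
    {"id": i, "docpriority": float(i % 7), "status": "P" if i % 2 else "G",
     "checkaction": "R", "checktime": 1354605932817 - i}
    for i in range(10000)
]

def sort_key(row):
    # ORDER BY docpriority ASC, status ASC, checkaction ASC, checktime ASC
    return (row["docpriority"], row["status"], row["checkaction"], row["checktime"])

# With no usable index, every candidate row must be examined to find the 1200
# smallest; this is the work MySQL reports as "Using filesort".
batch = heapq.nsmallest(1200, rows, key=sort_key)

# An index on (docpriority, status, checkaction, checktime) instead lets the
# database read rows back already in this order and stop after 1200.
assert batch == sorted(rows, key=sort_key)[:1200]
```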

Thanks!
Karl

On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Shigeki,

 The rules for when a database will use an index for an ORDER BY clause
 differ significantly from database to database.  The current logic
 seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
  I will see if I can find a solution.  The ticket for this is
 CONNECTORS-584.

 Karl

 On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:

 Hi.


 I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
 I tried to crawl 10 million files using Windows share connection and index
 them into Solr.

 As MCF reached over 1 million files, the crawling speed started getting
 slower.
 So I checked slow queries and found out that too many slow queries occurred,
 especially the following kinds:

 
 # Time: 121204 16:25:40
 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
 # Query_time: 7.240532  Lock_time: 0.000204 Rows_sent: 1200  Rows_examined:
 611091
 SET timestamp=1354605940;
 SELECT
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 # Time: 121204 16:25:44
 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
 # Query_time: 3.064339  Lock_time: 0.84 Rows_sent: 1  Rows_examined:
 406359
 SET timestamp=1354605944;
 SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE status IN
 ('P','G') AND checkaction='R' AND checktime=1354605932817 AND EXISTS(SELECT
 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid)  ORDER BY
 docpriority ASC,status ASC,checkaction ASC,checktime ASC LIMIT 1;
 ---

 I wonder if the queries appropriately use index of the table.
 As a result of EXPLAIN against the slow query, there was filesort.
 There seems to be some conditions that MySQL does not use index depending on
 ORDER BY:
  - Executing ORDER BY against multiple keys
  - When keys selected from records are different from keys used by ORDER BY

 Since filesort was happening, fully scanning records should be having MCF
 slower.

 Do you think this could happen even in PostgreSQL or HSQLDB?
 Do you think queries could be modified to use index appropriately?


 Regards,

 Shigeki


Re: latest trunk BUILD FAILED

2012-12-10 Thread Karl Wright
Ok, I fixed this.
Karl

On Sun, Dec 9, 2012 at 8:57 PM, Shinichiro Abe
shinichiro.ab...@gmail.com wrote:
 Hi,
 I couldn't build the latest trunk. It seemed that MeridioConnector could not 
 be compiled.

 compile-connector:
 [javac] /Users/abe/mcf/trunk/connectors/connector-build.xml:420: warning: 
 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set 
 to false for repeatable builds
 [javac] Compiling 10 source files to 
 /Users/abe/mcf/trunk/connectors/meridio/build/connector/classes
 [javac] 
 /Users/abe/mcf/trunk/connectors/meridio/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/meridio/MeridioConnector.java:1398:
  package org.apache.commons.httpclient does not exist
 [javac]   catch 
 (org.apache.commons.httpclient.ConnectTimeoutException ioex)
 [javac]   ^
 [javac] Note: 
 /Users/abe/mcf/trunk/connectors/meridio/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/meridio/CommonsHTTPSender.java
  uses or overrides a deprecated API.
 [javac] Note: Recompile with -Xlint:deprecation for details.
 [javac] Note: Some input files use unchecked or unsafe operations.
 [javac] Note: Recompile with -Xlint:unchecked for details.
 [javac] 1 error

 BUILD FAILED

 Regards,
 Shinichiro Abe




Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Karl Wright
Ok, that is unfortunate.  I will do some further MySQL research here.
There is a FORCE INDEX MySQL construct that may help, e.g.

SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...

which we can also try.  In this case that would be FORCE INDEX
(docpriority,status,checkaction,checktime) or FORCE INDEX
(docpriority_status_checkaction_checktime); it is unclear what the right
syntax actually is.  Maybe you can try an EXPLAIN with that in the
query?

FWIW, PostgreSQL should always use the index for this situation.
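For experimenting with the idea outside MySQL: SQLite's INDEXED BY clause is a rough analogue of FORCE INDEX, which makes for a tiny runnable sketch (toy table, made-up index name, not the real jobqueue schema) of what pinning the planner to the composite index buys:

```python
import sqlite3

# Toy illustration of an index hint: SQLite's INDEXED BY is a rough analogue
# of MySQL's FORCE INDEX.  Table contents and the index name are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobqueue (docpriority REAL, status TEXT,
                           checkaction TEXT, checktime INTEGER);
    CREATE INDEX idx_prio
        ON jobqueue (docpriority, status, checkaction, checktime);
""")
conn.executemany(
    "INSERT INTO jobqueue VALUES (?, ?, ?, ?)",
    [(float(i % 5), "P", "R", 1000 + i) for i in range(100)],
)

# Pinning the planner to idx_prio means rows come back already in the
# ORDER BY order, so the LIMIT can stop early instead of sorting everything.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM jobqueue INDEXED BY idx_prio
    WHERE docpriority >= 0
    ORDER BY docpriority, status, checkaction, checktime
    LIMIT 10
""").fetchall()

detail = " ".join(row[-1] for row in plan)
assert "idx_prio" in detail  # the hinted index is actually the one used
```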

Karl


On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,

 Thanks for the reply.

 I ran EXPLAIN as follows:

 mysql> explain SELECT
     -> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
     -> FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
     -> AND t0.checkaction='R' AND
     -> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
     -> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
     -> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status
     -> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
     -> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
     -> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
     -> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
 | id | select_type        | table | type   | possible_keys                                | key            | key_len | ref                     | rows   | Extra                       |
 +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
 |  1 | PRIMARY            | t0    | range  | I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25      | NULL                    | 151494 | Using where; Using filesort |
 |  4 | DEPENDENT SUBQUERY | t3    | ref    | I1354241297077                               | I1354241297077 | 8       | manifoldcf.t0.id        |      1 |                             |
 |  4 | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY                                      | PRIMARY        | 767     | manifoldcf.t3.eventname |      1 | Using index                 |
 |  3 | DEPENDENT SUBQUERY | t2    | ref    | I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122     | manifoldcf.t0.dochash   |      1 | Using where                 |
 |  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1354241297080                       | PRIMARY        | 8       | manifoldcf.t0.jobid     |      1 | Using where                 |
 +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+


 As you can see from "Using filesort", I do not think it uses the index.

 By the way, which database do you recommend for the case of crawling  a
 humongous number of files for now? PostgreSQL?


 Regards,

 Shigeki

 2012/12/10 Karl Wright daddy...@gmail.com

 Since you have a large table, can you try an EXPLAIN for the following
 query, which should match the explanation given here:
 http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
 Does it use the index?

 SELECT

 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200

 Thanks!
 Karl

 On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote:
  Hi Shigeki,
 
  The rules for when a database will use an index for an ORDER BY clause
  differ significantly from database to database.  The current logic
  seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
   I will see if I can find a solution.  The ticket for this
  CONNECTORS-584.
 
  Karl
 
  On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
 
  Hi.
 
 
  I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
  I tried to crawl 10 million files using Windows share connection and
  index
  them into Solr.
 
  As MCF reached over 1 million files, the crawling speed started getting
  slower.
  So I checked slow queries and found out that too many slow queries
  occurred,
  especially the following kinds:
 
  
  # Time: 121204 16:25

Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Karl Wright
Sorry, the FORCE INDEX hint requires the name of the index.  Since
ManifoldCF does not assign index names to fixed values, you will need
to find the right one, by using the SHOW INDEX command first to get
the right index's name.
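Since the index names are generated, any script applying a hint has to look the name up first. Here is a small sketch of that lookup, using SQLite pragmas as a stand-in (on MySQL the equivalent lookup is SHOW INDEX FROM jobqueue; the table, columns, and index here are toys):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobqueue (docpriority REAL, status TEXT, checktime INTEGER);
    CREATE INDEX I1354241297071 ON jobqueue (docpriority, status, checktime);
""")

# Enumerate the table's indexes and their column lists, then pick the one
# whose leading columns match the ORDER BY we want to serve.
names = [row[1] for row in conn.execute("PRAGMA index_list(jobqueue)")]

def index_columns(name):
    return [r[2] for r in conn.execute(f"PRAGMA index_info({name})")]

target = next(n for n in names if index_columns(n)[:2] == ["docpriority", "status"])
assert target == "I1354241297071"

# Build the hinted query with the discovered name (INDEXED BY being SQLite's
# analogue of MySQL's FORCE INDEX) and confirm the plan resolves to it.
hinted = (f"SELECT * FROM jobqueue INDEXED BY {target} "
          "WHERE docpriority >= 0 ORDER BY docpriority, status")
plan = conn.execute("EXPLAIN QUERY PLAN " + hinted).fetchall()
assert target in " ".join(row[-1] for row in plan)
```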

Apologies,
Karl


On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com wrote:
 Ok, that is unfortunate.  I will do some further MySQL research here.
 There is a FORCE INDEX MySQL construct that may help, e.g.

 SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...

 which we can also try.  In this case that would be: FORCE INDEX
 (docpriority,status,checkaction,checktime) or FORCE INDEX
 (docpriority_status_checkaction_checktime)  - unclear what the right
 syntax actually is.  Maybe you can try an explain with that in the
 query?

 FWIW, PostgreSQL should always use the index for this situation.

 Karl


 On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,

 Thanks for the reply.

 I did EXPLAIN as following:

 mysql explain SELECT
 -
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 - FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 - AND t0.checkaction='R' AND
 - t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 - t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 - EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 - IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
 EXISTS(SELECT
 - 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 - t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 - ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 +++---++--++-+-++-+
 | id | select_type| table | type   | possible_keys
 | key| key_len | ref | rows   | Extra
 |
 +++---++--++-+-++-+
 |  1 | PRIMARY| t0| range  |
 I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25  |
 NULL| 151494 | Using where; Using filesort |
 |  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
 | I1354241297077 | 8   | manifoldcf.t0.id|  1 |
 |
 |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
 | PRIMARY| 767 | manifoldcf.t3.eventname |  1 | Using index
 |
 |  3 | DEPENDENT SUBQUERY | t2| ref|
 I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122 |
 manifoldcf.t0.dochash   |  1 | Using where |
 |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
 | PRIMARY| 8   | manifoldcf.t0.jobid |  1 | Using where
 |
 +++---++--++-+-++-+


 As you see Using filesort, I do not think it uses the index.

 By the way, which database do you recommend for the case of crawling  a
 humongous number of files for now? PostgreSQL?


 Regards,

 Shigeki

 2012/12/10 Karl Wright daddy...@gmail.com

 Since you have a large table, can you try an EXPLAIN for the following
 query, which should match the explanation given here:
 http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
 Does it use the index?

 SELECT

 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200

 Thanks!
 Karl

 On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote:
  Hi Shigeki,
 
  The rules for when a database will use an index for an ORDER BY clause
  differ significantly from database to database.  The current logic
  seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
   I will see if I can find a solution.  The ticket for this
  CONNECTORS-584.
 
  Karl
 
  On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
 
  Hi.
 
 
  I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
  I tried to crawl 10 million files using Windows

Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Karl Wright
Experiments here indicate that FORCE INDEX seems to do what we need.

I'm going to think about it a bit and then come up with a fix that
should use FORCE INDEX in this situation.  Then we can see if it
actually helps for you.

Karl


On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com wrote:
 Sorry, the FORCE INDEX hint requires the name of the index.  Since
 ManifoldCF does not assign index names to fixed values, you will need
 to find the right one, by using the SHOW INDEX command first to get
 the right index's name.

 Apologies,
 Karl


 On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com wrote:
 Ok, that is unfortunate.  I will do some further MySQL research here.
 There is a FORCE INDEX MySQL construct that may help, e.g.

 SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...

 which we can also try.  In this case that would be: FORCE INDEX
 (docpriority,status,checkaction,checktime) or FORCE INDEX
 (docpriority_status_checkaction_checktime)  - unclear what the right
 syntax actually is.  Maybe you can try an explain with that in the
 query?

 FWIW, PostgreSQL should always use the index for this situation.

 Karl


 On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,

 Thanks for the reply.

 I did EXPLAIN as following:

 mysql explain SELECT
 -
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 - FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 - AND t0.checkaction='R' AND
 - t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 - t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 - EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 - IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
 EXISTS(SELECT
 - 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 - t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 - ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 +++---++--++-+-++-+
 | id | select_type| table | type   | possible_keys
 | key| key_len | ref | rows   | Extra
 |
 +++---++--++-+-++-+
 |  1 | PRIMARY| t0| range  |
 I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25  |
 NULL| 151494 | Using where; Using filesort |
 |  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
 | I1354241297077 | 8   | manifoldcf.t0.id|  1 |
 |
 |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
 | PRIMARY| 767 | manifoldcf.t3.eventname |  1 | Using index
 |
 |  3 | DEPENDENT SUBQUERY | t2| ref|
 I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122 |
 manifoldcf.t0.dochash   |  1 | Using where |
 |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
 | PRIMARY| 8   | manifoldcf.t0.jobid |  1 | Using where
 |
 +++---++--++-+-++-+


 As you see Using filesort, I do not think it uses the index.

 By the way, which database do you recommend for the case of crawling  a
 humongous number of files for now? PostgreSQL?


 Regards,

 Shigeki

 2012/12/10 Karl Wright daddy...@gmail.com

 Since you have a large table, can you try an EXPLAIN for the following
 query, which should match the explanation given here:
 http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
 Does it use the index?

 SELECT

 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200

 Thanks!
 Karl

 On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote:
  Hi Shigeki,
 
  The rules for when a database will use an index for an ORDER BY clause
  differ significantly from database to database.  The current logic
  seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL

Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Karl Wright
Hi Shigeki,

I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick
up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev .
This has a good chance of fixing the query performance problem.
Please try it out, and let me know if you still get slow queries in
the log.  You should be able to use the existing database instance.

Thanks,
Karl

On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright daddy...@gmail.com wrote:
 Experiments here indicate that FORCE INDEX seems to do what we need.

 I'm going to think about it a bit and then come up with a fix that
 should use FORCE INDEX in this situation.  Then we can see if it
 actually helps for you.

 Karl


 On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com wrote:
 Sorry, the FORCE INDEX hint requires the name of the index.  Since
 ManifoldCF does not assign index names to fixed values, you will need
 to find the right one, by using the SHOW INDEX command first to get
 the right index's name.

 Apologies,
 Karl


 On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com wrote:
 Ok, that is unfortunate.  I will do some further MySQL research here.
 There is a FORCE INDEX MySQL construct that may help, e.g.

 SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...

 which we can also try.  In this case that would be: FORCE INDEX
 (docpriority,status,checkaction,checktime) or FORCE INDEX
 (docpriority_status_checkaction_checktime)  - unclear what the right
 syntax actually is.  Maybe you can try an explain with that in the
 query?

 FWIW, PostgreSQL should always use the index for this situation.

 Karl


 On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,

 Thanks for the reply.

 I did EXPLAIN as following:

 mysql explain SELECT
 -
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 - FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN 
 ('P','G')
 - AND t0.checkaction='R' AND
 - t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 - t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 - EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 - IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
 EXISTS(SELECT
 - 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 - t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 - ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 +++---++--++-+-++-+
 | id | select_type| table | type   | possible_keys
 | key| key_len | ref | rows   | Extra
 |
 +++---++--++-+-++-+
 |  1 | PRIMARY| t0| range  |
 I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25  |
 NULL| 151494 | Using where; Using filesort |
 |  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
 | I1354241297077 | 8   | manifoldcf.t0.id|  1 |
 |
 |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
 | PRIMARY| 767 | manifoldcf.t3.eventname |  1 | Using index
 |
 |  3 | DEPENDENT SUBQUERY | t2| ref|
 I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122 |
 manifoldcf.t0.dochash   |  1 | Using where |
 |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
 | PRIMARY| 8   | manifoldcf.t0.jobid |  1 | Using where
 |
 +++---++--++-+-++-+


 As you see Using filesort, I do not think it uses the index.

 By the way, which database do you recommend for the case of crawling  a
 humongous number of files for now? PostgreSQL?


 Regards,

 Shigeki

 2012/12/10 Karl Wright daddy...@gmail.com

 Since you have a large table, can you try an EXPLAIN for the following
 query, which should match the explanation given here:
 http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
 Does it use the index?

 SELECT

 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3

RE: Too many slow queries caused by MCF running MySQL 5.5

2012-12-11 Thread Karl Wright
You just need to run ant make-deps too before building.

Karl

Sent from my Windows Phone
--
From: Shigeki Kobayashi
Sent: 12/11/2012 3:58 AM
To: user@manifoldcf.apache.org
Subject: Re: Too many slow queries caused by MCF running MySQL 5.5

Hi Karl.

 I could build the source OK, but the following line is missing
 from connectors.xml. Does this mean I built it incorrectly, or is this
 intentional?
 Do I just have to add this line to enable the Windows share connection?


  <repositoryconnector name="Windows shares"
class="org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector"/>


Regards,

Shigeki


2012/12/11 Karl Wright daddy...@gmail.com

 Hi Shigeki,

 I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick
 up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev .
 This has a good chance of fixing the query performance problem.
 Please try it out, and let me know if you still get slow queries in
 the log.  You should be able to use the existing database instance.

 Thanks,
 Karl

 On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright daddy...@gmail.com wrote:
  Experiments here indicate that FORCE INDEX seems to do what we need.
 
  I'm going to think about it a bit and then come up with a fix that
  should use FORCE INDEX in this situation.  Then we can see if it
  actually helps for you.
 
  Karl
 
 
  On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com wrote:
  Sorry, the FORCE INDEX hint requires the name of the index.  Since
  ManifoldCF does not assign index names to fixed values, you will need
  to find the right one, by using the SHOW INDEX command first to get
  the right index's name.
 
  Apologies,
  Karl
 
 
  On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com
 wrote:
  Ok, that is unfortunate.  I will do some further MySQL research here.
  There is a FORCE INDEX MySQL construct that may help, e.g.
 
  SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...
 
  which we can also try.  In this case that would be: FORCE INDEX
  (docpriority,status,checkaction,checktime) or FORCE INDEX
  (docpriority_status_checkaction_checktime)  - unclear what the right
  syntax actually is.  Maybe you can try an explain with that in the
  query?
 
  FWIW, PostgreSQL should always use the index for this situation.
 
  Karl
 
 
  On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
  Hi Karl,
 
  Thanks for the reply.
 
  I did EXPLAIN as following:
 
  mysql> explain SELECT
      -> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
      -> FROM jobqueue t0 WHERE t0.docpriority <= 0 AND t0.status IN ('P','G')
      -> AND t0.checkaction='R' AND
      -> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
      -> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
      -> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status
      -> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
      -> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
      -> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
      -> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 
 +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
 | id | select_type        | table | type   | possible_keys                                | key            | key_len | ref                     | rows   | Extra                       |
 +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
 |  1 | PRIMARY            | t0    | range  | I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25      | NULL                    | 151494 | Using where; Using filesort |
 |  4 | DEPENDENT SUBQUERY | t3    | ref    | I1354241297077                               | I1354241297077 | 8       | manifoldcf.t0.id        |      1 |                             |
 |  4 | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY                                      | PRIMARY        | 767     | manifoldcf.t3.eventname |      1 | Using index                 |
 |  3 | DEPENDENT SUBQUERY | t2    | ref    | I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122     | manifoldcf.t0.dochash   |      1 | Using where                 |
 |  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1354241297080                       | PRIMARY        | 8       | manifoldcf.t0.jobid     |      1 | Using where                 |
 +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
 
 
  As you can see from "Using filesort", I do not think it is using the index.
 
  By the way, which database do you recommend for the case of crawling
  a
  humongous number of files for now? PostgreSQL?
 
 
  Regards,
 
  Shigeki
 
  2012/12/10
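Karl's FORCE INDEX hint above is MySQL-specific syntax. As a runnable stand-in (a sketch, not ManifoldCF code), SQLite's analogous INDEXED BY clause illustrates the same idea of pinning the planner to a named index:

```python
import sqlite3

# SQLite's INDEXED BY is the closest runnable analogue of MySQL's
# FORCE INDEX: the query fails outright if the named index cannot be
# used, so it doubles as a check that the planner takes your index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobqueue (docpriority REAL, status TEXT, checktime INTEGER);
    CREATE INDEX prio_idx ON jobqueue (docpriority, status, checktime);
    INSERT INTO jobqueue VALUES (1.0, 'P', 100), (2.0, 'G', 200);
""")
rows = conn.execute(
    "SELECT status FROM jobqueue INDEXED BY prio_idx "
    "WHERE docpriority <= 1.5 ORDER BY docpriority"
).fetchall()
```

In MySQL itself the equivalent is SELECT ... FROM jobqueue t0 FORCE INDEX (index_name) WHERE ..., with the index name taken from SHOW INDEX as Karl notes.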

Re: How to crawl from the point where the job is stopped by errors

2012-12-12 Thread Karl Wright
ManifoldCF is incremental and will do as little work as possible when
a job is restarted.  The details of what that means depend on the
actual connector involved.

For Windows Share connections, the document's modify date is checked
again, but the document does not need to be indexed if that has not
changed.

Karl

On Wed, Dec 12, 2012 at 12:11 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:

 Hi

 Sometimes errors occur that stop jobs crawling files using the
 Windows share connection.
 In this case, when starting the stopped job again by clicking 'Start', I
 suppose that MCF crawls from the beginning again.
 If that's right, is there any way to have MCF crawl from the point
 where the job was stopped?


 Regards,

 Shigeki


Re: Many sleep process in MySQL while crawling files using Window share connection

2012-12-12 Thread Karl Wright
The MySQL threads correspond to handles in the ManifoldCF handle pool.
 Since a worker thread can use only one handle at a time, one expects
that at best the number of MySQL processes that are active during a
crawl is about equal to the number of ManifoldCF worker threads.  If
this is not true, it indicates low database use - which may be OK,
depending on your crawl, because of throttle settings.  For example,
if you are crawling only N domains and you have more than N worker
threads, some of these threads will have to wait.

However, if your CPU is 100%, and that is all going into ONE MySQL
process, it means that one query is blocking all the rest.  This would
usually be the stuffing query, which is the one we have been looking
at over the last couple of days.  This query must be fast for
ManifoldCF to use its resources well; if it takes a long time to run,
the rest of the worker threads get nothing to do.

A good way of assessing the state of ManifoldCF under these conditions
is to get a thread dump (which can be gotten with kill -QUIT on Linux
systems).  Look at the worker threads and see what they are doing.  If
you send me a dump, I will interpret it for you.

Thanks,
Karl
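Karl's kill -QUIT step can be sketched as follows (a minimal illustration using a throwaway child process; a real JVM catches SIGQUIT, prints the thread dump to stdout, and keeps running):

```python
import os
import signal
import subprocess
import time

def request_thread_dump(pid: int) -> None:
    """Send SIGQUIT; a JVM responds by writing a full thread dump to stdout."""
    os.kill(pid, signal.SIGQUIT)

# Demonstrate against a disposable child process.  Unlike a JVM, plain
# 'sleep' installs no SIGQUIT handler, so it simply terminates on the signal.
child = subprocess.Popen(["sleep", "60"])
time.sleep(0.2)
request_thread_dump(child.pid)
child.wait()
```

Against the actual ManifoldCF process you would pass the Java process's PID instead, then read the dump from the process's stdout/console log.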


On Wed, Dec 12, 2012 at 2:52 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:


 Hi

 I run MCF 1.1-dev, downloaded on Dec. 11th, with MySQL 5.5.
 While crawling, I listed the processes in MySQL and realized there are many
 processes that are sleeping.
 I set org.apache.manifoldcf.database.maxhandles to 100.

 Does this mean that MCF does not handle MySQL processes appropriately?
 It feels strange that even though many processes are created, they are not
 used much.
 I see 100% CPU usage in mysql, but the CPU state is shown as sleep. Do you
 think this is related to the sleeping processes in MySQL?
 Is this correct behavior?

 mysql> show processlist;
 +++-++-+--++--+
 | Id | User   | Host| db | Command | Time | State  |
 Info
 |
 +++-++-+--++--+
 |  1 | manifoldcf | localhost:37683 | manifoldcf | Sleep   |  279 ||
 NULL
 |
 |  2 | manifoldcf | localhost:37684 | manifoldcf | Query   |0 | update |
 INSERT INTO ingeststatus
 (id,changecount,dockey,firstingest,connectionname,authorityname,urihash,las
 |
 |  3 | manifoldcf | localhost:37685 | manifoldcf | Sleep   |  279 ||
 NULL
 |
 |  4 | manifoldcf | localhost:37686 | manifoldcf | Sleep   |   24 ||
 NULL
 |
 |  5 | manifoldcf | localhost:37687 | manifoldcf | Sleep   |   24 ||
 NULL
 |
 |  6 | manifoldcf | localhost:37688 | manifoldcf | Sleep   |  217 ||
 NULL
 |
 |  7 | manifoldcf | localhost:37689 | manifoldcf | Sleep   |  279 ||
 NULL
 |
 |  8 | manifoldcf | localhost:37690 | manifoldcf | Sleep   |   12 ||
 NULL
 |
 |  9 | manifoldcf | localhost:37694 | manifoldcf | Sleep   |  279 ||
 NULL
 |
 | 10 | manifoldcf | localhost:37695 | manifoldcf | Sleep   |0 ||
 NULL
 |
 | 11 | manifoldcf | localhost:37696 | manifoldcf | Sleep   |   24 ||
 NULL
 |
 | 12 | manifoldcf | localhost:37697 | manifoldcf | Sleep   |  279 ||
 NULL
 |
 | 13 | manifoldcf | localhost:37698 | manifoldcf | Sleep   |  279 ||
 NULL
 |
 | 14 | manifoldcf | localhost:37699 | manifoldcf | Sleep   |  279 ||
 NULL
 |
 | 15 | manifoldcf | localhost:37700 | manifoldcf | Sleep   |  217 ||
 NULL
 |
 | 16 | manifoldcf | localhost:37701 | manifoldcf | Sleep   |   24 ||
 NULL
 |
 | 17 | manifoldcf | localhost:37703 | manifoldcf | Sleep   |   24 ||
 NULL
 |
 | 18 | manifoldcf | localhost:37732 | manifoldcf | Sleep   |  217 ||
 NULL
 |
 | 19 | manifoldcf | localhost:37733 | manifoldcf | Sleep   |5 ||
 NULL
 |
 | 20 | manifoldcf | localhost:37734 | manifoldcf | Sleep   |   24 ||
 NULL
 |
 | 21 | manifoldcf | localhost:37735 | manifoldcf | Sleep   |  217 ||
 NULL
 |
 | 22 | manifoldcf | localhost:37736 | manifoldcf | Sleep   |0 ||
 NULL
 |
 | 23 | manifoldcf | localhost:37737 | manifoldcf | Sleep   |  217 ||
 NULL
 |
 | 24 | manifoldcf | localhost:37738 | manifoldcf | Sleep   |  217 ||
 NULL
 |
 | 25 | manifoldcf | localhost:37739 | manifoldcf | Sleep   |   24 ||
 NULL
 |
 | 26 | manifoldcf | localhost:37740 | manifoldcf | Sleep   |   24 ||
 NULL
 |
 | 27 | manifoldcf | localhost:39340 | manifoldcf | Sleep   |  279 ||
 NULL
 |
 | 28 | manifoldcf | localhost:39341 | manifoldcf | Sleep   |3 ||
 NULL
 |
 | 29 | manifoldcf | localhost:39342 | manifoldcf | Sleep   |0 ||
 NULL
 |
 | 30 | manifoldcf | localhost:39343 | manifoldcf | Query   |0 | 

Re: Build failure on Java7

2012-12-12 Thread Karl Wright
I created a ticket, CONNECTORS-586, to track this problem.

Karl

On Wed, Dec 12, 2012 at 4:52 PM, Karl Wright daddy...@gmail.com wrote:
 Native2Ascii is a maven plugin, but it may well not be compatible with
 Java 7, or you might be using a non-Oracle jdk.  Generally we
 recommend openjdk or oracle.

 I suggest you try the ant build.  I believe ant implemented their own
 native2ascii converter without relying on the sun/oracle proprietary
 classes.

 Karl

 On Wed, Dec 12, 2012 at 3:48 PM, Arcadius Ahouansou
 arcad...@menelic.com wrote:

 Hello.

 I have tried to build both trunk and the 1.0.1 tag on Java 1.7.0_04-b22
 without luck.

 The error is:

 [INFO]
 
 [INFO] Building ManifoldCF - Framework - UI Core 1.0
 [INFO]
 
 [INFO]
 [INFO] --- maven-remote-resources-plugin:1.1:process (default) @ mcf-ui-core
 ---
 [INFO]
 [INFO] --- native2ascii-maven-plugin:1.0-alpha-1:native2ascii
 (native2ascii-utf8) @ mcf-ui-core ---
 [INFO]
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] ManifoldCF  SUCCESS [4.024s]
 [INFO] ManifoldCF - Framework  SUCCESS [0.061s]
 [INFO] ManifoldCF - Framework - Core . SUCCESS [6.795s]
 [INFO] ManifoldCF - Framework - UI Core .. FAILURE [1.453s]
 [INFO] ManifoldCF - Framework - Agents ... SKIPPED
 [INFO] ManifoldCF - Framework - Pull Agent ... SKIPPED
 [INFO] ManifoldCF - Framework - Authority Servlet  SKIPPED
 [INFO] ManifoldCF - Framework - API Servlet .. SKIPPED
 [INFO] ManifoldCF - Framework - Authority Service  SKIPPED
 [INFO] ManifoldCF - Framework - API Service .. SKIPPED
 [INFO] ManifoldCF - Framework - Crawler UI ... SKIPPED
 [INFO] ManifoldCF - Framework - Script Engine  SKIPPED
 [INFO] ManifoldCF - Connectors ... SKIPPED
 [INFO] ManifoldCF - Connectors - Active Directory  SKIPPED
 [INFO] ManifoldCF - Connectors - Filesystem .. SKIPPED
 [INFO] ManifoldCF - Connectors - MetaCarta GTS ... SKIPPED
 [INFO] ManifoldCF - Connectors - jCIFS ... SKIPPED
 [INFO] ManifoldCF - Connectors - JDBC  SKIPPED
 [INFO] ManifoldCF - Connectors - Null Authority .. SKIPPED
 [INFO] ManifoldCF - Connectors - Null Output . SKIPPED
 [INFO] ManifoldCF - Connectors - RSS . SKIPPED
 [INFO] ManifoldCF - Connectors - SharePoint .. SKIPPED
 [INFO] ManifoldCF - Connectors - Solr  SKIPPED
 [INFO] ManifoldCF - Connectors - Web . SKIPPED
 [INFO] ManifoldCF - Connectors - CMIS  SKIPPED
 [INFO] ManifoldCF - Connectors - OpenSearchServer  SKIPPED
 [INFO] ManifoldCF - Connectors - Wiki  SKIPPED
 [INFO] ManifoldCF - Connectors - Alfresco  SKIPPED
 [INFO] ManifoldCF - Connectors - ElasticSearch ... SKIPPED
 [INFO] ManifoldCF - Test materials ... SKIPPED
 [INFO] ManifoldCF - Test Materials - Alfresco WAR  SKIPPED
 [INFO] ManifoldCF - Framework - Jetty Runner . SKIPPED
 [INFO] ManifoldCF - Tests  SKIPPED
 [INFO] ManifoldCF - Test - ElasticSearch . SKIPPED
 [INFO] ManifoldCF - Test - Alfresco .. SKIPPED
 [INFO] ManifoldCF - Test - Wiki .. SKIPPED
 [INFO] ManifoldCF - Test - CMIS .. SKIPPED
 [INFO] ManifoldCF - Test - Filesystem  SKIPPED
 [INFO] ManifoldCF - Test - Sharepoint  SKIPPED
 [INFO] ManifoldCF - Test - RSS ... SKIPPED
 [INFO]
 
 [INFO] BUILD FAILURE
 [INFO]
 
 [INFO] Total time: 15.711s
 [INFO] Finished at: Wed Dec 12 20:34:41 GMT 2012
 [INFO] Final Memory: 14M/34M
 [INFO]
 
 [ERROR] Failed to execute goal
 org.codehaus.mojo:native2ascii-maven-plugin:1.0-alpha-1:native2ascii
 (native2ascii-utf8) on project mcf-ui-core: Execution native2ascii-utf8 of
 goal org.codehaus.mojo:native2ascii-maven-plugin:1.0-alpha-1:native2ascii
 failed: Error starting Sun's native2ascii: sun.tools.native2ascii.Main -
 [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
 switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions, please
 read the following articles:
 [ERROR] [Help 1]
 http://cwiki.apache.org/confluence

Re: File crawl using exited with an unexpected jobqueue status error under MySQL

2012-12-20 Thread Karl Wright
Yes, it is the same cause - a transactional integrity bug in the
database, MySQL in this case.  I can open a ManifoldCF ticket, but the
real fix has to come from the MySQL team.

Karl

On Thu, Dec 20, 2012 at 8:59 PM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:

 Hi


 I run MCF 1.1-dev trunk downloaded on Dec. 22nd and crawl files using the
 Windows share connection
 under MySQL 5.5.28, for Linux (x86_64).

 The following Error occurred and then the job exited:

 ---
 2012/12/21 10:09:37 ERROR (Worker thread '78') - Exception tossed:
 Unexpected jobqueue status - record id 1356045273314, expecting active
 status, saw 0
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
 jobqueue status - record id 1356045273314, expecting active status, saw 0
 at
 org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:742)
 at
 org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:2438)
 at
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:765)
 ---


 Do you think this is related to
 https://issues.apache.org/jira/browse/CONNECTORS-246?


 Regards,


 Shigeki


Re: Timeout values to be configurable

2013-01-03 Thread Karl Wright
FWIW, the newest version of the Solr connector now has configurable
timeout values.  But my original comment still stands; you really
should not find yourself in a position to need this.

Karl

On Wed, Dec 26, 2012 at 6:19 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Shigeki,

 While timeout values into Solr could theoretically be configured as
 connection parameters, the timeout values for jCIFS are currently only
 settable globally.  Therefore, to make changes configurable by
 connection, the jCIFS library needs to change.  I've already
 approached the jCIFS developer about changes of this kind, and he was
 unreceptive to this request.  Part of the reason is the nature of the
 CIFS protocol, which multiplexes many simultaneous requests using the
 same connection.  So this cannot be solved in the manner you suggest,
 in any case.

 Furthermore, on a properly-set-up system, it should be unnecessary to
 adjust either jCIFS timeout parameters or Solr timeout parameters.  If
 you are consistently getting timeouts from jCIFS, it is a strong sign
 you are overloading the Windows servers you are trying to crawl, and
 you should take steps immediately to reduce the maximum number of
 connections you are trying to crawl with.  Similarly, chronically
 exceeding the Solr timeout parameters indicates you are pushing
 documents into a Solr that is either insufficiently powered, or has
 too few available threads.  Cutting back on the max number of
 connections is also indicated here as well.

 Since ManifoldCF retries failures, occasional failures due to other
 loads on either the Windows servers or on Solr are expected and will
 not cause problems.  But chronic failures indicate serious
 configuration problems, for which increasing the timeouts is the wrong
 solution.  So I hesitate to add features of the kind you request,
 unless you can convince me that there is a fundamental reason why it
 should be necessary to change these parameters.

 Thanks,
 Karl


 On Wed, Dec 26, 2012 at 2:18 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:


 Hi.

 As I have used MCF so far, I've faced timeout errors many times while
 crawling and indexing files to Solr.
 I would like to propose making the following timeout values configurable in
 properties.xml.

 Timeout errors often occur depending on files and environments (machines), so
 it would be nice to change
 the timeout values without rebuilding the whole source.


 $MCF_HOME\connectors\solr\connector\src\main\java\org\apache\manifoldcf\agents\output\solr\HttpPoster.java

 int responseRetries = 9000; // Long basic wait: 3 minutes.  This
 will also be added to by a term based on the size of the request.

 $MCF_HOME\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java
  System.setProperty("jcifs.smb.client.soTimeout","15");
  System.setProperty("jcifs.smb.client.responseTimeout","12");


 Regards,


 Shigeki


Re: Http status code 302

2013-01-09 Thread Karl Wright
When I try the URL you gave using curl and no special arguments, I get this:


C:\Users\Karl> curl -vvv "http://lucene.jugem.jp/?eid=39"
* About to connect() to lucene.jugem.jp port 80 (#0)
*   Trying 210.172.160.170... connected
* Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
> GET /?eid=39 HTTP/1.1
> User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3
> Host: lucene.jugem.jp
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 09 Jan 2013 08:47:52 GMT
< Server: Apache/2.0.59 (Unix)
< Vary: User-Agent,Host,Accept-Encoding
< Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
< Accept-Ranges: bytes
< Content-Length: 22594
< Cache-Control: private
< Pragma: no-cache
< Connection: close
< Content-Type: text/html

There's no 302 from here.

Are you trying to crawl through a proxy?  If so, that might be where
the problem lies.

Karl

On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright daddy...@gmail.com wrote:
 It sounds like the httpclient upgrade definitely broke something.  We
 should open a ticket.

 But first, can you confirm what connector this is?  Is it the web
 connector?  If so, I am puzzled because the web connector has always
 logged any 302 return, but then queued a second document which it
 subsequently fetches.

 Karl

 On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Hi,

 I'm using trunk code and crawling a web site with seeds which include
 http://lucene.jugem.jp/?eid=39 (Koji's blog -- I don't obey robots.txt).
 As I look at the Simple History, it shows a 302 result code for the fetch activity
 and doesn't ingest the document.

 When I used MCF 1.0.1 in the same situation, the Simple History showed a 200
 result code and MCF could ingest documents.

 Why does the trunk show a 302 status? Is it relevant to the httpclient upgrade?

 Thanks in advance,
 Shinichiro Abe


Re: Http status code 302

2013-01-09 Thread Karl Wright
Odd that curl would yield a 200 while ManifoldCF gets a 302.  Maybe
Koji's blog site does not like one of the headers, crawler-agent
perhaps?

I am behind a firewall now but I will explore this later today.  In
the meantime, if you want to research the problem, could you turn on
wire debugging?  You do this in the logging.ini file following these
instructions:

http://hc.apache.org/httpcomponents-client-ga/logging.html

You should see everything happening in the log then, and you can then
compare against curl using -vvv.  Please let me know what you find.
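For reference, the log4j-style lines that the linked HttpComponents page describes for wire logging would look like this in logging.ini (an assumption based on that page; the logger names are HttpComponents', not ManifoldCF's):

```ini
log4j.logger.org.apache.http=DEBUG
log4j.logger.org.apache.http.wire=DEBUG
```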

Thanks!
Karl

On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe
shinichiro.ab...@gmail.com wrote:
 I'm using web connector.

 Are you trying to crawl through a proxy?
 No. I just set that URL as a seed, without a proxy.
 (Also, I didn't obey robots.txt.)

 Using curl, it is the same as your result.

 Could you reproduce that?

 Shinichiro

 On 2013/01/09, at 17:49, Karl Wright wrote:

 When I try the URL you gave using curl and no special arguments, I get this:


 C:\Users\Karlcurl -vvv http://lucene.jugem.jp/?eid=39;
 * About to connect() to lucene.jugem.jp port 80 (#0)
 *   Trying 210.172.160.170... connected
 * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
 GET /?eid=39 HTTP/1.1
 User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c 
 zlib/1.2
 .5 librtmp/2.3
 Host: lucene.jugem.jp
 Accept: */*

  HTTP/1.1 200 OK
  Date: Wed, 09 Jan 2013 08:47:52 GMT
  Server: Apache/2.0.59 (Unix)
  Vary: User-Agent,Host,Accept-Encoding
  Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
  Accept-Ranges: bytes
  Content-Length: 22594
  Cache-Control: private
  Pragma: no-cache
  Connection: close
  Content-Type: text/html

 There's no 302 from here.

 Are you trying to crawl through a proxy?  If so, that might be where
 the problem lies.

 Karl

 On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright daddy...@gmail.com wrote:
 It sounds like the httpclient upgrade definitely broke something.  We
 should open a ticket.

 But first, can you confirm what connector this is?  Is it the web
 connector?  If so, I am puzzled because the web connector has always
 logged any 302 return, but then queued a second document which it
 subsequently fetches.

 Karl

 On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Hi,

 I'm using trunk code and crawling web site with seeds which have 
 http://lucene.jugem.jp/?eid=39 (koji's blog --I don't obey robots.txt).
 As I'm look at Simple History, it shows 302 result code at fetch activity 
 and doesn't ingest document.

 When I used MCF 1.0.1 in the same situation, Simple History showed 200 
 result code and MCF could ingest documents.

 Why does the trunk shows 302 status? Is it relevant to upgrading 
 httpclient?

 Thanks in advance,
 Shinichiro Abe



Re: Http status code 302

2013-01-09 Thread Karl Wright
There seem to be only two differences.  The Host header value is
different, and there is an Accept header in the one that works
(Accept: */*).

I will experiment with curl this evening to see which of these is
causing the problem.  Or, if you don't want to wait, you can use curl
and explicitly set these headers to see which one causes it to fail.

Thanks,
Karl


On Wed, Jan 9, 2013 at 9:56 AM, Shinichiro Abe
shinichiro.ab...@gmail.com wrote:
 Thank you for your navigation.
 I got a log from MCF 1.0.1.

 A) a log from curl

 curl -vvv "http://lucene.jugem.jp/?eid=39"
 * About to connect() to lucene.jugem.jp port 80 (#0)
 *   Trying 210.172.160.170... connected
 * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
 > GET /?eid=39 HTTP/1.1
 > User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3
 > Host: lucene.jugem.jp
 > Accept: */*
 >
 < HTTP/1.1 200 OK
 < Date: Wed, 09 Jan 2013 13:23:15 GMT
 < Server: Apache/2.0.59 (Unix)
 < Vary: User-Agent,Host,Accept-Encoding
 < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
 < Accept-Ranges: bytes
 < Content-Length: 22594
 < Cache-Control: private
 < Pragma: no-cache
 < Connection: close
 < Content-Type: text/html


 B) a log from MCF 1.0.1

 DEBUG 2013-01-09 23:40:11,313 (Thread-472) - Open connection to 
 210.172.160.170:80
 DEBUG 2013-01-09 23:40:11,436 (Thread-472) -  GET /?eid=39 
 HTTP/1.1[\r][\n]
 DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Using virtual host name: 
 lucene.jugem.jp
 DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Adding Host request header
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) -  User-Agent: Mozilla/5.0 
 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)[\r][\n]
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) -  From: 
 shinichiro.ab...@gmail.com[\r][\n]
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) -  Host: 
 lucene.jugem.jp[\r][\n]
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) -  [\r][\n]
 DEBUG 2013-01-09 23:40:11,629 (Thread-472) -  HTTP/1.1 200 OK[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) -  Date: Wed, 09 Jan 2013 
 14:39:24 GMT[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) -  Server: Apache/2.0.59 
 (Unix)[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) -  Vary: 
 User-Agent,Host,Accept-Encoding[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) -  Last-Modified: Tue, 08 Jan 
 2013 07:58:33 GMT[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Accept-Ranges: bytes[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Content-Length: 
 22594[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Cache-Control: 
 private[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Pragma: no-cache[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Connection: close[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Content-Type: 
 text/html[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  [\r][\n]
 DEBUG 2013-01-09 23:40:12,054 (Worker thread '0') - Should close connection 
 in response to directive: close

 Is it enough to diagnose?

 Thank you very much,
 Shinichiro




 On 2013/01/09, at 23:12, Karl Wright wrote:

 Wire debugging with MCF 1.0.1 requires different logging.ini
 parameters, because it uses commons-httpclient instead.  That's
 described here:

 http://hc.apache.org/httpclient-3.x/logging.html

 I will need a working comparison to diagnose what is happening, so
 please either get a log from curl, or better yet from MCF 1.0.1.

 Thanks!
 Karl


 On Wed, Jan 9, 2013 at 9:04 AM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Hi,

 I did wire debugging:
 curl yielded a 200 while ManifoldCF trunk got a 302; ManifoldCF 1.0.1 got a 200.

 The manifoldcf.log of trunk showed the logs [1], but the one from 1.0.1 showed no such logs.

 [1]
 DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 
 HTTP/1.1
 DEBUG 2013-01-09 22:07:26,495 (Thread-474) -  GET /?eid=39 
 HTTP/1.1[\r][\n]
 DEBUG 2013-01-09 22:07:26,496 (Thread-474) -  User-Agent: Mozilla/5.0 
 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)[\r][\n]
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  From: 
 shinichiro.ab...@gmail.com[\r][\n]
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  Host: 
 lucene.jugem.jp:80[\r][\n]
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  Connection: 
 Keep-Alive[\r][\n]
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  [\r][\n]
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  GET /?eid=39 HTTP/1.1
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  User-Agent: Mozilla/5.0 
 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  From: 
 shinichiro.ab...@gmail.com
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  Host: lucene.jugem.jp:80
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  Connection: Keep-Alive
 DEBUG 2013-01-09 22:07:26,556 (Thread-474) -  HTTP/1.1 302 Found[\r][\n]
 DEBUG 2013-01-09 22:07:26,561 (Thread-474) -  Date: Wed, 09 Jan 2013 
 13:06:39 GMT[\r][\n]
 DEBUG 2013-01-09 22:07:26,561 (Thread-474) -  Server: Apache

Re: Http status code 302

2013-01-09 Thread Karl Wright
I created CONNECTORS-604 to track this problem.

Karl

On Wed, Jan 9, 2013 at 10:02 AM, Karl Wright daddy...@gmail.com wrote:
 There seems to be only two differences.  The Host header value is
 different, and there is an Accept header in the one that works.
 (Accept: */*)

 I will experiment with curl this evening to see which of these is
 causing the problem.  Or, if you don't want to wait, you can use curl
 and explicitly set these headers to see which one causes it to fail.

 Thanks,
 Karl


 On Wed, Jan 9, 2013 at 9:56 AM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Thank you for your navigation.
 I got a log from MCF 1.0.1.

 A) a log from curl

 curl -vvv http://lucene.jugem.jp/?eid=39;
 * About to connect() to lucene.jugem.jp port 80 (#0)
 *   Trying 210.172.160.170... connected
 * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
 GET /?eid=39 HTTP/1.1
 User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 
 OpenSSL/0.9.8r zlib/1.2.3
 Host: lucene.jugem.jp
 Accept: */*

  HTTP/1.1 200 OK
  Date: Wed, 09 Jan 2013 13:23:15 GMT
  Server: Apache/2.0.59 (Unix)
  Vary: User-Agent,Host,Accept-Encoding
  Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
  Accept-Ranges: bytes
  Content-Length: 22594
  Cache-Control: private
  Pragma: no-cache
  Connection: close
  Content-Type: text/html


 B) a log from MCF 1.0.1

 DEBUG 2013-01-09 23:40:11,313 (Thread-472) - Open connection to 
 210.172.160.170:80
 DEBUG 2013-01-09 23:40:11,436 (Thread-472) -  GET /?eid=39 
 HTTP/1.1[\r][\n]
 DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Using virtual host name: 
 lucene.jugem.jp
 DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Adding Host request header
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) -  User-Agent: Mozilla/5.0 
 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)[\r][\n]
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) -  From: 
 shinichiro.ab...@gmail.com[\r][\n]
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) -  Host: 
 lucene.jugem.jp[\r][\n]
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) -  [\r][\n]
 DEBUG 2013-01-09 23:40:11,629 (Thread-472) -  HTTP/1.1 200 OK[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) -  Date: Wed, 09 Jan 2013 
 14:39:24 GMT[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) -  Server: Apache/2.0.59 
 (Unix)[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) -  Vary: 
 User-Agent,Host,Accept-Encoding[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) -  Last-Modified: Tue, 08 Jan 
 2013 07:58:33 GMT[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Accept-Ranges: 
 bytes[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Content-Length: 
 22594[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Cache-Control: 
 private[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Pragma: no-cache[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Connection: close[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  Content-Type: 
 text/html[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) -  [\r][\n]
 DEBUG 2013-01-09 23:40:12,054 (Worker thread '0') - Should close connection 
 in response to directive: close

 Is it enough to diagnose?

 Thank you very much,
 Shinichiro




 On 2013/01/09, at 23:12, Karl Wright wrote:

 Wire debugging with MCF 1.0.1 requires different logging.ini
 parameters, because it uses commons-httpclient instead.  That's
 described here:

 http://hc.apache.org/httpclient-3.x/logging.html

 I will need a working comparison to diagnose what is happening, so
 please either get a log from curl, or better yet from MCF 1.0.1.

 Thanks!
 Karl


 On Wed, Jan 9, 2013 at 9:04 AM, Shinichiro Abe
 shinichiro.ab...@gmail.com wrote:
 Hi,

 I did wire debugging:
 curl yielded a 200 while ManifoldCF trunk got a 302, ManifoldCF 1.0.1 got 
 a 200.

 The manifoldcf.log of trunk showed logs[1] but one of 1.0.1 showed no logs.

 [1]
 DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 
 HTTP/1.1
 DEBUG 2013-01-09 22:07:26,495 (Thread-474) -  GET /?eid=39 
 HTTP/1.1[\r][\n]
 DEBUG 2013-01-09 22:07:26,496 (Thread-474) -  User-Agent: Mozilla/5.0 
 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)[\r][\n]
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  From: 
 shinichiro.ab...@gmail.com[\r][\n]
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  Host: 
 lucene.jugem.jp:80[\r][\n]
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  Connection: 
 Keep-Alive[\r][\n]
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  [\r][\n]
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  GET /?eid=39 HTTP/1.1
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  User-Agent: Mozilla/5.0 
 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  From: 
 shinichiro.ab...@gmail.com
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  Host: lucene.jugem.jp:80
 DEBUG 2013-01-09 22:07:26,497 (Thread-474) -  Connection: Keep-Alive
 DEBUG 2013-01-09 22:07:26,556 (Thread-474) -  HTTP/1.1 302 
 Found[\r][\n]
 DEBUG 2013-01

Re: Monitoring Manifold CF

2013-01-16 Thread Karl Wright
Hi,

The REST API can give you the job status.

Karl

On Wed, Jan 16, 2013 at 6:12 AM, Christian Hepworth
christian.hepwo...@york.ac.uk wrote:
 Hello

 We are using Manifold CF to index Solr, via an Oracle connection. Our job is
 currently scheduled to run every evening, but we have had a few failed jobs.

 Ideally, we would like to use a tool such as Nagios to monitor the logs (or
 something else) and report on the success/failure of the job. We are
 struggling to find any output in the logs which would indicate the status of
 a job.

 Has anyone else put this sort of monitoring in place? Any advice would be
 much appreciated.

 Many thanks

 Christian
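As a concrete starting point, a minimal Nagios-style check could poll the jobstatuses resource of the ManifoldCF API service. This is a sketch only: the base URL below is an assumed deployment address, and the exact field names in the response ("jobstatus", "job", "status") should be verified against your version's REST API documentation.

```python
import json
from urllib.request import urlopen

API = "http://localhost:8345/mcf-api-service/json"  # assumed deployment URL

def failed_jobs(statuses: dict) -> list:
    """Pick out jobs whose reported status is 'error' from a jobstatuses response."""
    jobs = statuses.get("jobstatus", [])
    if isinstance(jobs, dict):  # a single job may come back as a bare object
        jobs = [jobs]
    return [j["job"] for j in jobs if j.get("status") == "error"]

def check() -> list:
    """Fetch current job statuses and return the identifiers of failed jobs."""
    with urlopen(API + "/jobstatuses") as response:
        return failed_jobs(json.load(response))
```

A Nagios plugin wrapper would then exit non-zero (CRITICAL) whenever check() returns a non-empty list.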


Re: Crawling new/updated files using Windows share connection takes too long

2013-01-18 Thread Karl Wright
Hi Shigeki,

What database is ManifoldCF configured to use in this case?  Do you
see any indication of slow queries in the ManifoldCF log?


Karl

On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:
 Hello


 I would like some advice on improving the crawling time of new/updated files
 using the Windows share connection.

 I crawl files on a Windows server and index them into Solr.

 Currently, the second crawl of two hundred thousand files takes over 5
 hours, even though no files have been updated, created, or deleted.

 I assume MCF does the following processes (let me know if I am wrong)

 - obtain the updated time of a file
 - compare that updated time with the one MCF obtained during the last crawl
 (probably stored in the DB)
 - if they differ, MCF recognizes the file as one to be re-indexed
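
The three steps above can be sketched as a toy model (this is not ManifoldCF's actual jobqueue logic, and using the raw mtime as the version string is an assumption for illustration):

```python
# Toy incremental-crawl check: re-fetch and re-index a file only when its
# stored "version" (here simply the mtime) differs from the current one.
def needs_reindex(doc_id, current_mtime, version_store):
    stored = version_store.get(doc_id)
    if stored == str(current_mtime):
        return False                      # unchanged: skip fetch and index
    version_store[doc_id] = str(current_mtime)
    return True

store = {"//server/share/a.txt": "1358649661"}
print(needs_reindex("//server/share/a.txt", 1358649661, store))  # False
print(needs_reindex("//server/share/b.txt", 1358649700, store))  # True
```

Even in this model, every document still costs one metadata lookup plus one database read per crawl, which is where a slow database shows up.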

 If the above processes are done for two hundred thousand files, which part
 of the process could take the most time? Obtaining the updated time? Reading
 data from the DB? What could be done to improve the crawling time, do you think?

 Please give me some advice.


 Regards,

 Shigeki




Re: XML parsing error quits file crawling using Windows share connection

2013-01-21 Thread Karl Wright
This means that the Solr you are talking to has returned an
unintelligible (non-XML) response.

When this happens I believe the actual return text is included in the
Simple History, so I'd look there first to see what the problem might
be.

You may also eventually want to update to the current ManifoldCF 1.1
release candidate, which has a revised Solr connector based on SolrJ.
I don't think that will help but at least you'll know it isn't our
code. ;-)

Karl

On Sun, Jan 20, 2013 at 11:56 PM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi

 I use trunk 1.1dev downloaded on Dec 12th and crawl files using Windows
 Share Connection to index them into Solr 4.0

 The following error occurred and it aborted the crawling job.

 2013/01/19 20:59:04 ERROR (Worker thread '8') - Exception tossed: XML
 parsing error on response
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error
 on response
 at
 org.apache.manifoldcf.agents.output.solr.HttpPoster$CodeDetails.parseIngestionResponse(HttpPoster.java:2059)
 at
 org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1365)



 Anyone has any ideas?


 Regards,


 Shigeki


Re: Job hanging on Starting up with never ending external query.

2013-01-21 Thread Karl Wright
Hi Anthony,

What happens between the framework recognizing that the job should be
started (which it does fine in both cases), and actually achieving a
correct job start, is the seeding phase, which is going to try to
execute the seeding query against your Oracle database.  If something
happens at that time to hang the JDBC connection's seeding query, then
it precisely explains the behavior you are seeing.

It is also the case that the timeout on the queries that the JDBC
connector does is effectively infinite.  This makes me suspicious that
what is happening is an Oracle query is going out but there is no
response ever coming back.

The other possibility is that the JDBC connector is in fact correctly
throwing a ServiceInterruption, but that the ManifoldCF code is either
not handling it properly, or the connector is not forming it properly.
 In that case, when you notice a hung job, the startup thread will be at
a particular place in the code, and I can diagnose it that way.

The first order of business is therefore to get a thread dump when the
system is hung.  That will help confirm the picture.  There are a
number of additional questions here.

(1) Why is this happening?  Is there any possibility that the Oracle
database you are crawling is (very occasionally) not able to properly
respond to a JDBC query?  I can imagine that, under some network
conditions, it might be possible for the Oracle JDBC driver to wind up
waiting indefinitely for a response that never comes.

(2) Given that we can't always control the infrastructure we're trying
to crawl through, should we attempt to provide a reasonable
workaround?  For example, a timeout on JDBC connector queries, where
we throw a ServiceInterruption if the timeout is exceeded?
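
Point (2) can be sketched as a watchdog around the query thread. This is a hedged illustration: the class and function names are invented here, and ManifoldCF's real fix would throw its own ServiceInterruption type.

```python
import threading

class ServiceInterruption(Exception):
    """Stand-in for ManifoldCF's retryable-error signal (illustrative name)."""

def run_with_timeout(query_fn, timeout_sec):
    """Run query_fn on a worker thread; give up if it exceeds timeout_sec."""
    result = {}
    def worker():
        result["rows"] = query_fn()
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_sec)
    if t.is_alive():                  # query still running: signal a retry
        raise ServiceInterruption("seeding query exceeded %ss" % timeout_sec)
    return result["rows"]

print(run_with_timeout(lambda: ["seed1", "seed2"], 5.0))  # ['seed1', 'seed2']
```

The daemon worker is abandoned on timeout rather than killed, which mirrors the awkward reality that JDBC offers no reliable way to interrupt a driver stuck waiting on the network.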

Karl

On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
anthony.leon...@york.ac.uk wrote:
 Hi there,

 We have recently started running a nightly job at 2AM in ManifoldCF to extract
 data from an Oracle repository and populate a Solr index. Most nights this
 works fine, but occasionally the job has been hanging at the Starting up
 phase. We have observed this on our test setup also occasionally. A restart
 of ManifoldCF usually solves this.

 Using the simple history reports today I looked up all records and sorted
 them by the Time column, largest first, and found the following:

 Start Time,Activity,Identifier,Result Code,Bytes,Time,Result Description
 11-12-2012 05:00:05.941,external query ... SQL QUERY
 ...,ERROR,0,1926607529,Interrupted: null
 01-21-2013 02:00:11.843,external query ... SQL QUERY
 ...,ERROR,0,31644956,Interrupted: null
 01-17-2013 02:00:03.600,external query ... SQL QUERY
 ...,ERROR,0,31637594,Interrupted: null
 12-04-2012 12:12:19.860,external query ... SQL QUERY
 ...,OK,0,17511,
 ... etc ...

 If the Time column is in millis that means the first query was hanging for
 22 days! (This was in the period before we went live when our live server
 was sitting idle for a while.) The other two occasions it was hanging for
 about 8 hours until we arrived to restart the job in the morning. I have
 confirmed that the Oracle database we are connecting to was available
 throughout these periods. These times are also too long for any network or
 database timeouts, which makes me suspect that it's a problem with the
 application.

 We have the following logging config in properties.xml

   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
   <property name="org.apache.manifoldcf.agents" value="ALL"/>
   <property name="org.apache.manifoldcf.misc" value="ALL"/>

 The job failed again last night and when I checked at 10:40 AM this morning
 the last few lines of manifoldcf.log were:

 DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
 1352455005553 needs to be started; it was last checked at 1358733575454, and
 now it is 1358733585635
 DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match found
 within interval 1358733575454 to 1358733585635
 DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
 1352455005553 needs to be started; it was last checked at 1358733585636, and
 now it is 1358733595662
 DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match found
 within interval 1358733585636 to 1358733595662
 DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
 1352455005553 needs to be started; it was last checked at 1358733595663, and
 now it is 1358733605813
 DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND within
 interval 1358733595663 to 1358733605813
 DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553' is
 within run window at 1358733605813 ms. (which starts at 135873360 ms.)
 DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job start
 for job 1352455005553
 DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job 1352455005553
 for startup
 DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The 

Re: Crawling new/updated files using Windows share connection

2013-01-21 Thread Karl Wright
CONNECTORS-618

Karl

On Mon, Jan 21, 2013 at 9:08 AM, Karl Wright daddy...@gmail.com wrote:
 Bad news, I am afraid.  MySQL seems to always put null values at the
 front of the index, and that cannot be changed through any means I can
 find.  This is different from all other databases I know of.

 The only possible fixes for the problem are as follows:

 (1) Not use a null doc priority but instead use an actual special
 number that is guaranteed to always sort to the end.  This would be
 non-trivial because this column is a FLOAT value, and round-off errors
 will prevent the ManifoldCF code from reliably using a special number
 like that on all databases.

 (2) Use docpriorities that are ordered in the opposite way - which
 would work ONLY for MySQL and would break all other databases.

 I'll create a ticket and think about the problem some more.
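
The NULLS-first quirk is easy to demonstrate, and the "NULLs last" ordering that other databases give an ascending index can be emulated with a compound sort key. A Python illustration only (None plays the role of SQL NULL); this is not a MySQL-side fix:

```python
# MySQL sorts NULL docpriority values FIRST in ascending order; most other
# databases put them last, so queue stuffing never has to scan past them.
priorities = [0.5, None, 0.1, None, 0.3]

# (p is None, value) sorts all non-NULL values first, then the NULLs.
nulls_last = sorted(priorities, key=lambda p: (p is None, p or 0.0))
print(nulls_last)   # [0.1, 0.3, 0.5, None, None]
```

The stuffer query wants exactly this "nulls last" shape: paused/aborted rows with a NULL priority should sit at the end of the index, not in front of every active row.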

 Karl

 On Mon, Jan 21, 2013 at 8:48 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Shigeki,

 I reviewed the code in detail.  At the time CONNECTORS-290 was fixed,
 all document priorities were set to null whenever a job was paused or
 aborted, so what I suspected might be the problem cannot in fact
 happen.

 The most likely possible explanation for MySQL's behavior, therefore,
 is that MySQL orders null docpriority values BEFORE all other rows in
 the index it is using for queue stuffing.  I have no other way of
 explaining why it thinks it needs to go through 6.5 million rows
 before it gets to the ones that are active.

 If this is the case, it may be possible to tell MySQL to order null
 column values to the END instead of the beginning of the index.  I'll
 do some research on this later and get back to you.

 Thanks,
 Karl



 On Mon, Jan 21, 2013 at 6:21 AM, Karl Wright daddy...@gmail.com wrote:
 Are there any large paused or aborted jobs present on the same
 ManifoldCF?  If so, can you tell me whether the job is paused, or
 aborted?  (I am betting paused...)

 Karl

 On Mon, Jan 21, 2013 at 5:59 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,


 Here is the explain. There isn't such sort...

 mysql explain SELECT
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G') AND
 t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT 'x'
 FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
 t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
 t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
 t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events 
 t4
 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC
 LIMIT 4800;
 +++---++--++-+-+--+-+
 | id | select_type| table | type   | possible_keys
 | key| key_len | ref | rows | Extra   |
 +++---++--++-+-+--+-+
 |  1 | PRIMARY| t0| index  | NULL
 | I1358228295210 | 25  | NULL| 4800 | Using where |
 |  4 | DEPENDENT SUBQUERY | t3| ref| I1358228295216
 | I1358228295216 | 8   | manifoldcf.t0.id|1 | |
 |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
 | PRIMARY| 767 | manifoldcf.t3.eventname |1 | Using index |
 |  3 | DEPENDENT SUBQUERY | t2| ref|
 I1358228295209,I1358228295212,I1358228295211 | I1358228295209 | 122 |
 manifoldcf.t0.dochash   |1 | Using where |
 |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1358228295219
 | PRIMARY| 8   | manifoldcf.t0.jobid |1 | Using where |
 +++---++--++-+-+--+-+
 5 rows in set (0.00 sec)


 Regards,


 Shigeki



 2013/1/21 Karl Wright daddy...@gmail.com


 Can you get an EXPLAIN for this query? It sounds like it is
 disregarding the hint for some reason.

 Karl

 Sent from my Windows Phone
 From: Shigeki Kobayashi
 Sent: 1/20/2013 9:37 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Crawling new/updated files using Windows share connection
 takes too long
 Hi Karl.

 I configured MySQL 5.5 to run MCF this time.
 The version of MCF is trunk 1.1dev downloaded on Dec 12th, in which you
 fixed the slow query using FORCE INDEX. Solr is 4.0.

 I thought is was fixed but the log shows that  the following are slow
 queries

Re: Job hanging on Starting up with never ending external query.

2013-01-21 Thread Karl Wright
kill -QUIT should not abort the agents process, just cause a thread
dump.  kill -9 is a different story.

You can also do the same thing by using jstack, in the jvm bin directory.

Karl


On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
anthony.leon...@york.ac.uk wrote:
 Dear Karl,

 Many thanks for your insights. I'll do a kill -QUIT next time we have this
 issue which should hopefully give me the thread dump. However we've noticed
 that killing processes means we have to run the locks-clean script so it's
 not our favourite way of doing it.

 Also I definitely think a timeout for queries would be a good thing.

 I guess we go back to checking that the connection to the database should
 have been ok last night...

 Best wishes,
 Anthony.

 --
 Dr Anthony Leonard
 System Integrator, Information Directorate
 University of York, Heslington, York, UK, YO10 5DD
 Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
 Times Higher Education University of the Year 2010


 On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright daddy...@gmail.com wrote:

 Hi Anthony,

 What happens between the framework recognizing that the job should be
 started (which it does fine in both cases), and actually achieving a
 correct job start, is the seeding phase, which is going to try to
 execute the seeding query against your Oracle database.  If something
 happens at that time to hang the JDBC connection's seeding query, then
 it precisely explains the behavior you are seeing.

 It is also the case that the timeout on the queries that the JDBC
 connector does is effectively infinite.  This makes me suspicious that
 what is happening is an Oracle query is going out but there is no
 response ever coming back.

 The other possibility is that the JDBC connector is in fact correctly
 throwing a ServiceInterruption, but that the ManifoldCF code is either
 not handling it properly, or the connector is not forming it properly.
  In that case, when you notice a hung job, the startup thread will be
 a particular place in the code, and I can diagnose it that way.

 The first order of business is therefore to get a thread dump when the
 system is hung.  That will help confirm the picture.  There are a
 number of additional questions here.

 (1) Why is this happening?  Is there any possibility that the Oracle
 database you are crawling is (very occasionally) not able to properly
 respond to a JDBC query?  I can imagine that, under some network
 conditions, it might be possible for the Oracle JDBC driver to wind up
 waiting indefinitely for a response that never comes.

 (2) Given that we can't always control the infrastructure we're trying
 to crawl through, should we attempt to provide a reasonable
 workaround?  For example, a timeout on JDBC connector queries, where
 we throw a ServiceInterruption if the timeout is exceeded?

 Karl

 On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
 anthony.leon...@york.ac.uk wrote:
  Hi there,
 
   We have recently started running a nightly job at 2AM in ManifoldCF to
  extract
  data from an Oracle repository and populate a Solr index. Most nights
  this
  works fine, but occasionally the job has been hanging at the Starting
  up
  phase. We have observed this on our test setup also occasionally. A
  restart
  of ManifoldCF usually solves this.
 
  Using the simple history reports today I looked up all records and
  sorted
  them by the Time column, largest first, and found the following:
 
   Start Time,Activity,Identifier,Result Code,Bytes,Time,Result Description
  11-12-2012 05:00:05.941,external query ... SQL QUERY
  ...,ERROR,0,1926607529,Interrupted: null
  01-21-2013 02:00:11.843,external query ... SQL QUERY
  ...,ERROR,0,31644956,Interrupted: null
  01-17-2013 02:00:03.600,external query ... SQL QUERY
  ...,ERROR,0,31637594,Interrupted: null
  12-04-2012 12:12:19.860,external query ... SQL QUERY
  ...,OK,0,17511,
  ... etc ...
 
  If the Time column is in millis that means the first query was hanging
  for
  22 days! (This was in the period before we went live when our live
  server
  was sitting idle for a while.) The other two occasions it was hanging
  for
  about 8 hours until we arrived to restart the job in the morning. I have
  confirmed that the Oracle database we are connecting to was available
  throughout these periods. These times are also too long for any network
  or
  database timeouts, which makes me suspect that it's a problem with the
  application.
 
  We have the following logging config in properties.xml
 
     <property name="org.apache.manifoldcf.jobs" value="ALL"/>
     <property name="org.apache.manifoldcf.connectors" value="ALL"/>
     <property name="org.apache.manifoldcf.agents" value="ALL"/>
     <property name="org.apache.manifoldcf.misc" value="ALL"/>
 
  The job failed again last night and when I checked at 10:40 AM this
  morning
  the last few lines of manifoldcf.log were:
 
  DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
  1352455005553 needs to be started; it was last checked

Re: Crawling new/updated files using Windows share connection

2013-01-21 Thread Karl Wright
I checked a fix for this into trunk.

Please sync up with trunk and see if this fixes your problem.  If it
does, I will gladly include the fix in MCF 1.1.

Karl


On Mon, Jan 21, 2013 at 9:14 AM, Karl Wright daddy...@gmail.com wrote:
 CONNECTORS-618

 Karl

 On Mon, Jan 21, 2013 at 9:08 AM, Karl Wright daddy...@gmail.com wrote:
 Bad news, I am afraid.  MySQL seems to always put null values at the
 front of the index, and that cannot be changed through any means I can
 find.  This is different from all other databases I know of.

 The only possible fixes for the problem are as follows:

 (1) Not use a null doc priority but instead use an actual special
 number that is guaranteed to always sort to the end.  This would be
 non-trivial because this column is a FLOAT value, and round-off errors
 will prevent the ManifoldCF code from reliably using a special number
 like that on all databases.

 (2) Use docpriorities that are ordered in the opposite way - which
 would work ONLY for MySQL and would break all other databases.

 I'll create a ticket and think about the problem some more.

 Karl

 On Mon, Jan 21, 2013 at 8:48 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Shigeki,

 I reviewed the code in detail.  At the time CONNECTORS-290 was fixed,
 all document priorities were set to null whenever a job was paused or
 aborted, so what I suspected might be the problem cannot in fact
 happen.

 The most likely possible explanation for MySQL's behavior, therefore,
 is that MySQL orders null docpriority values BEFORE all other rows in
 the index it is using for queue stuffing.  I have no other way of
 explaining why it thinks it needs to go through 6.5 million rows
 before it gets to the ones that are active.

 If this is the case, it may be possible to tell MySQL to order null
 column values to the END instead of the beginning of the index.  I'll
 do some research on this later and get back to you.

 Thanks,
 Karl



 On Mon, Jan 21, 2013 at 6:21 AM, Karl Wright daddy...@gmail.com wrote:
 Are there any large paused or aborted jobs present on the same
 ManifoldCF?  If so, can you tell me whether the job is paused, or
 aborted?  (I am betting paused...)

 Karl

 On Mon, Jan 21, 2013 at 5:59 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,


 Here is the explain. There isn't such sort...

 mysql explain SELECT
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
  FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G') AND
  t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT 'x'
 FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
 t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
 t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
 t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events 
 t4
 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC
 LIMIT 4800;
 +++---++--++-+-+--+-+
 | id | select_type| table | type   | possible_keys
 | key| key_len | ref | rows | Extra   
 |
 +++---++--++-+-+--+-+
 |  1 | PRIMARY| t0| index  | NULL
 | I1358228295210 | 25  | NULL| 4800 | Using where 
 |
 |  4 | DEPENDENT SUBQUERY | t3| ref| I1358228295216
 | I1358228295216 | 8   | manifoldcf.t0.id|1 | 
 |
 |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
 | PRIMARY| 767 | manifoldcf.t3.eventname |1 | Using index 
 |
 |  3 | DEPENDENT SUBQUERY | t2| ref|
 I1358228295209,I1358228295212,I1358228295211 | I1358228295209 | 122 |
 manifoldcf.t0.dochash   |1 | Using where |
 |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1358228295219
 | PRIMARY| 8   | manifoldcf.t0.jobid |1 | Using where 
 |
 +++---++--++-+-+--+-+
 5 rows in set (0.00 sec)


 Regards,


 Shigeki



 2013/1/21 Karl Wright daddy...@gmail.com


 Can you get an EXPLAIN for this query? It sounds like it is
 disregarding the hint for some reason.

 Karl

 Sent from my Windows Phone
 From: Shigeki Kobayashi
 Sent: 1/20/2013 9:37 PM
 To: user@manifoldcf.apache.org
 Subject: Re: Crawling new/updated files using Windows share connection
 takes too long
 Hi Karl.

 I configured MySQL 5.5 to run MCF

Re: Job hanging on Starting up with never ending external query.

2013-01-22 Thread Karl Wright
Hmm.

The following threads are of interest here:

Thread 29975: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
be imprecise)
 - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
 - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
 - 
org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
(Interpreted frame)
 - 
org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
 - 
org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
long, int) @bci=106, line=246 (Interpreted frame)
 - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
line=179 (Interpreted frame)

... which is probably waiting in this one:

Thread 24457: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
be imprecise)
 - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
 - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
@bci=80, line=80 (Interpreted frame)
 - 
org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
java.lang.String, java.lang.String, java.lang.String,
java.lang.String) @bci=433, line=128 (Interpreted frame)
 - 
org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
@bci=36, line=1212 (Interpreted frame)

... which is waiting to obtain a JDBC connection, and the reason it
can't is because it thinks that the only available JDBC connection is
currently in use.

Since you have only a single connection around, and nothing else is
active, it stands to reason that a JDBC connection handle has somehow
been leaked, which is a challenge since connections are typically
freed in a try/finally block through ManifoldCF.
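
The starvation scenario Karl describes can be illustrated with a toy fixed-size pool (a hedged sketch, not ManifoldCF's actual ConnectionPool): with a pool of size 1, a single leaked handle blocks every later acquisition, while the try/finally pattern guarantees the handle comes back.

```python
import queue

class ToyPool:
    """Toy fixed-size connection pool (illustrative, not ManifoldCF code)."""
    def __init__(self, size):
        self._free = queue.Queue()
        for n in range(size):
            self._free.put("conn-%d" % n)

    def get(self, timeout=0.1):
        # Blocks (then raises queue.Empty) when every handle is checked out.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        self._free.put(conn)

pool = ToyPool(size=1)
conn = pool.get()
try:
    pass  # run the query; an exception here WITHOUT the finally leaks conn
finally:
    pool.release(conn)        # guarantees the handle returns to the pool

print(pool.get())             # 'conn-0' -- available again because release() ran
```

If the release path is skipped even once on an exception, the next `get()` on a size-1 pool waits forever, which is exactly the BLOCKED `ConnectionPool.getConnection` frame in the thread dump.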

I notice that your stack frames are pretty unusual - what JDK is this
that you are using?

Karl


On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
anthony.leon...@york.ac.uk wrote:
 Dear Karl,

 Our DBA noticed that each time our job was run 10 Oracle connections were
 created. So, we dropped the Max connections parameter on the repository
 connection config to 1 and re-ran the job with the DBA watching. The job
 worked fine but the DBA reported that 1 connection was created and then 10
 more briefly ...

 Out of curiosity we re-ran the job again with no further changes and this
 time got the following results:

 * the job hung in the Starting Up phase again, with the same logging and
 symptoms as detailed before on this thread.
 * the DBA reported seeing no connections at all this time.
 * I have attached a thread dump created by jstack -F pid. This is
 reporting all threads as blocked.

 Any ideas? Any help with this would certainly be very gratefully received.

 Best wishes,
 Anthony.

 --
 Dr Anthony Leonard
 System Integrator, Information Directorate
 University of York, Heslington, York, UK, YO10 5DD
 Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
 Times Higher Education University of the Year 2010


 On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright daddy...@gmail.com wrote:

 kill -QUIT should not abort the agents process, just cause a thread
 dump.  kill -9 is a different story.

 You can also do the same thing by using jstack, in the jvm bin directory.

 Karl


 On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
 anthony.leon...@york.ac.uk wrote:
  Dear Karl,
 
  Many thanks for your insights. I'll do a kill -QUIT next time we have
  this
  issue which should hopefully give me the thread dump. However we've
  noticed
  that killing processes means we have to run the locks-clean script so
  it's
  not our favourite way of doing it.
 
  Also I definitely think a timeout for queries would be a good thing.
 
  I guess we go back to checking that the connection to the database
  should
  have been ok last night...
 
  Best wishes,
  Anthony.
 
  --
  Dr Anthony Leonard
  System Integrator, Information Directorate
  University of York, Heslington, York, UK, YO10 5DD
  Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
  Times Higher Education University of the Year 2010
 
 
  On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright daddy...@gmail.com wrote:
 
  Hi Anthony,
 
  What happens between the framework recognizing that the job should be
  started (which it does fine in both cases), and actually achieving a
  correct job start, is the seeding phase, which is going to try to
  execute the seeding query against your Oracle database.  If something
  happens at that time to hang the JDBC connection's seeding query, then
  it precisely explains the behavior you are seeing.
 
  It is also

Re: Job hanging on Starting up with never ending external query.

2013-01-22 Thread Karl Wright
I've looked into the code in some detail.  There is indeed a place
where I believe a JDBC connection handle can be leaked.  However, it's
not clear whether this is the circumstance you are encountering, since
it involves an exception getting thrown while doing something not
terribly likely to cause exceptions.

I've opened a ticket - CONNECTORS-620.

Karl

On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright daddy...@gmail.com wrote:
 Hmm.

 The following threads are of interest here:

 Thread 29975: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
 be imprecise)
  - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
  - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
  - 
 org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
 java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
 (Interpreted frame)
  - 
 org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
 java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
  - 
 org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
 org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
 long, int) @bci=106, line=246 (Interpreted frame)
  - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
 line=179 (Interpreted frame)

 ... which is probably waiting in this one:

 Thread 24457: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
 be imprecise)
  - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
  - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
 @bci=80, line=80 (Interpreted frame)
  - 
 org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
 java.lang.String, java.lang.String, java.lang.String,
 java.lang.String) @bci=433, line=128 (Interpreted frame)
  - 
 org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
 @bci=36, line=1212 (Interpreted frame)

 ... which is waiting to obtain a JDBC connection, and the reason it
 can't is because it thinks that the only available JDBC connection is
 currently in use.

 Since you have only a single connection around, and nothing else is
 active, it stands to reason that a JDBC connection handle has somehow
 been leaked, which is a challenge since connections are typically
 freed in a try/finally block through ManifoldCF.

 I notice that your stack frames are pretty unusual - what JDK is this
 that you are using?

 Karl


 On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
 anthony.leon...@york.ac.uk wrote:
 Dear Karl,

 Our DBA noticed that each time our job was run 10 Oracle connections were
 created. So, we dropped the Max connections parameter on the repository
 connection config to 1 and re-ran the job with the DBA watching. The job
 worked fine but the DBA reported that 1 connection was created and then 10
 more briefly ...

 Out of curiosity we re-ran the job again with no further changes and this
 time got the following results:

 * the job hung in the Starting Up phase again, with the same logging and
 symptoms as detailed before on this thread.
 * the DBA reported seeing no connections at all this time.
 * I have attached a thread dump created by jstack -F pid. This is
 reporting all threads as blocked.

 Any ideas? Any help with this would certainly be very gratefully received.

 Best wishes,
 Anthony.

 --
 Dr Anthony Leonard
 System Integrator, Information Directorate
 University of York, Heslington, York, UK, YO10 5DD
 Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
 Times Higher Education University of the Year 2010


 On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright daddy...@gmail.com wrote:

 kill -QUIT should not abort the agents process, just cause a thread
 dump.  kill -9 is a different story.

 You can also do the same thing by using jstack, in the jvm bin directory.

 Karl


 On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
 anthony.leon...@york.ac.uk wrote:
  Dear Karl,
 
  Many thanks for your insights. I'll do a kill -QUIT next time we have
  this
  issue which should hopefully give me the thread dump. However we've
  noticed
  that killing processes means we have to run the locks-clean script so
  it's
  not our favourite way of doing it.
 
  Also I definitely think a timeout for queries would be a good thing.
 
  I guess we go back to checking that the connection to the database
  should
  have been ok last night...
 
  Best wishes,
  Anthony.
 
  --
  Dr Anthony Leonard
  System Integrator, Information Directorate
  University of York, Heslington, York, UK, YO10 5DD
  Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
  Times Higher Education University of the Year 2010
 
 
  On Mon

Re: max_pred_locks_per_transaction

2013-01-25 Thread Karl Wright
Hi Erlend,

Leaving logging at the default values would have shown the ERROR
message you have below.  So the cause for the pause must have been
something else.

When ManifoldCF seems to make no progress, the first thing to do is
look at the simple history and see if it is retrying on something for
some reason.  If that is not helpful, get a thread dump.  You can use
jstack for that purpose.

As for the PostgreSQL parameters, max_pred_locks_per_transaction seems
to be PostgreSQL 9 black magic.  Here's the documentation:

http://www.postgresql.org/docs/9.1/static/runtime-config-locks.html

Default is 64, but they don't say how it is allocated.  I'd guess
therefore that you should try 50% higher and see if that works, e.g.
96.  I guess the limit is the amount of shared memory your OS allows
you to allocate.

Karl
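For reference, the setting lives in postgresql.conf and takes effect only after a server restart; the value below is just the ~50% bump suggested above, not a tested recommendation:

```ini
# postgresql.conf (PostgreSQL 9.x) -- a server restart is required
max_pred_locks_per_transaction = 96   # default is 64; raise further if the
                                      # "out of shared memory" errors persist
```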

On Fri, Jan 25, 2013 at 5:26 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

 After we started to crawl journals, the crawler just stopped after a couple
 of hours. Running MCF in debug mode gave me the stack trace shown below. I
 think we need to adjust some PG parameters, perhaps
 max_pred_locks_per_transaction. The database admins are now asking me about
 which value to set. They have increased it, but I don't know whether it is
 sufficient.

 Another thing: I did not see this message until I changed the log level to
 debug, but log4j should catch these error messages with the warn level
 enabled. So maybe it is a dead end, i.e. there is a totally different cause
 and this just occurred as a coincidence.

 ERROR 2013-01-19 03:47:15,049 (Worker thread '49') - Worker thread
 aborting and restarting due to database connection reset: Database
 exception: Exception doing query: ERROR: out of shared memory
 Hint: You might need to increase max_pred_locks_per_transaction.
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
 exception: Exception doing query: ERROR: out of shared memory
 Hint: You might need to increase max_pred_locks_per_transaction.
 at
 org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
 at
 org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
 at
 org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
 at
 org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
 at
 org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
 at
 org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:803)
 at
 org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
 at
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
 at
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863)
 at
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554)
 Caused by: org.postgresql.util.PSQLException: ERROR: out of shared memory
 Hint: You might need to increase max_pred_locks_per_transaction.
 at
 org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
 at
 org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)
 at
 org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
 at
 org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500)
 at
 org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
 at
 org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
 at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
 at
 org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
 DEBUG 2013-01-19 03:47:23,386 (Idle cleanup thread) - Checking for
 connections, idleTimeout: 1358563583386

 --
 Erlend Garåsen
 Center for Information Technology Services
 University of Oslo
 P.O. Box 1086 Blindern, N-0317 OSLO, Norway
 Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Job hanging on Starting up with never ending external query.

2013-01-25 Thread Karl Wright
You can download the current release candidate for 1.1 (RC6) from
http://people.apache.org/~kwright/apache-manifoldcf-1.1 .

Karl


On Fri, Jan 25, 2013 at 12:02 PM, Anthony Leonard
anthony.leon...@york.ac.uk wrote:
 Hi Karl,

 Thank you so much for this. Sorry for the lack of response as we've been
 working on other things.

 One question - would we have to build ManifoldCF ourselves to get the new
 code you've checked in or would it already be part of a binary distribution
 somewhere?

 Best wishes,
 Anthony.

 --
 Dr Anthony Leonard
 System Integrator, Information Directorate
 University of York, Heslington, York, UK, YO10 5DD
 Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
 Times Higher Education University of the Year 2010


 On Tue, Jan 22, 2013 at 6:51 PM, Karl Wright daddy...@gmail.com wrote:

 I've checked in code in both trunk and the release branch for this issue.

 It would be good if you could try this again in your environment.  The
 fix simply prevents some kinds of exceptions from causing a handle
  leak.  Please try this with only 1 JDBC connection handle
 per JVM and let me know if you see any hangs.

 Thanks,
 Karl


 On Tue, Jan 22, 2013 at 1:11 PM, Karl Wright daddy...@gmail.com wrote:
  I've looked into the code in some detail.  There is indeed a place
  where it is possible for a JDBC connection handle to be leaked, I
  believe.  However, it's not clear whether this is the circumstance you
  are encountering or not, since it does involve an exception getting
  thrown while doing something not terribly likely to cause exceptions.
 
  I've opened a ticket - CONNECTORS-620.
 
  Karl
 
  On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright daddy...@gmail.com
  wrote:
  Hmm.
 
  The following threads are of interest here:
 
  Thread 29975: (state = BLOCKED)
   - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
  be imprecise)
   - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
   - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
   -
  org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.init(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
  java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
  (Interpreted frame)
   -
  org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
  java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
   -
  org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
  org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
  long, int) @bci=106, line=246 (Interpreted frame)
   - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
  line=179 (Interpreted frame)
 
  ... which is probably waiting in this one:
 
  Thread 24457: (state = BLOCKED)
   - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
  be imprecise)
   - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
   - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
  @bci=80, line=80 (Interpreted frame)
   -
  org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
  java.lang.String, java.lang.String, java.lang.String,
  java.lang.String) @bci=433, line=128 (Interpreted frame)
   -
  org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
  @bci=36, line=1212 (Interpreted frame)
 
  ... which is waiting to obtain a JDBC connection, and the reason it
  can't is because it thinks that the only available JDBC connection is
  currently in use.
 
  Since you have only a single connection around, and nothing else is
  active, it stands to reason that a JDBC connection handle has somehow
  been leaked, which is a challenge since connections are typically
  freed in a try/finally block through ManifoldCF.
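The release discipline being described can be sketched with a toy pool (a hypothetical Pool class, not MCF's actual jdbcpool code). Without the finally, the exception would leak the only handle and the next acquire() would block forever, which matches the hang symptom:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class Pool {
    private final Deque<Object> free = new ArrayDeque<>();
    private int outstanding = 0;

    Pool(int size) { for (int i = 0; i < size; i++) free.push(new Object()); }

    synchronized Object acquire() throws InterruptedException {
        while (free.isEmpty()) wait();   // blocks forever if a handle leaked
        outstanding++;
        return free.pop();
    }

    synchronized void release(Object c) { outstanding--; free.push(c); notifyAll(); }

    synchronized int inUse() { return outstanding; }
}

public class PoolDemo {
    public static void main(String[] args) throws Exception {
        Pool pool = new Pool(1);             // single connection, as in the test
        Object conn = pool.acquire();
        try {
            // ... use the connection; an exception here would otherwise leak it
            throw new RuntimeException("query failed");
        } catch (RuntimeException e) {
            // handled
        } finally {
            pool.release(conn);              // always runs, so the pool recovers
        }
        System.out.println(pool.inUse());    // 0 -- the next acquire() will not hang
    }
}
```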
 
  I notice that your stack frames are pretty unusual - what JDK is this
  that you are using?
 
  Karl
 
 
  On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
  anthony.leon...@york.ac.uk wrote:
  Dear Karl,
 
  Our DBA noticed that each time our job was run 10 Oracle connections
  were
  created. So, we dropped the Max connections parameter on the
  repository
  connection config to 1 and re-ran the job with the DBA watching. The
  job
  worked fine but the DBA reported that 1 connection was created and
  then 10
  more briefly ...
 
  Out of curiosity we re-ran the job again with no further changes and
  this
  time got the following results:
 
  * the job hung in the Starting Up phase again, with the same logging
  and
  symptoms as detailed before on this thread.
  * the DBA reported seeing no connections at all this time.
  * I have attached a thread dump created by jstack -F pid. This is
  reporting all threads as blocked.
 
  Any ideas? Any help with this would certainly be very gratefully
  received

Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Karl Wright
Ok, so let's back up a bit.

First, which version of ManifoldCF is this?  I need to know that
before I can interpret the stack trace.

Second, what do you see when you view the connection in the crawler
UI?  Does it say Connection working, or something else, and if so,
what?

I've created a ticket for better error reporting in this connector -
it was a contribution and AFAIK the error handling is not very robust
at this point, but I can fix that quickly with your help. ;-)

Karl

On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote:

 So you saw events in the history which correspond to these documents
 and which are of type Indexation that say success?  If that is the
 case, then the ElasticSearch connector thinks it handed the documents
 successfully to the ElasticSearch server.

 Ah, no, the activity is fetch rather than indexation. e.g.

 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361

 I don't see any history entries relating to indexing as a specific
 activity in its own right. Sorry, that was probably a red herring, I
 don't think it's getting that far.

 I just noticed that, above all the service interruption reported
 warnings, there are some errors like this:

 ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

 Sadly there's no description, just a stacktrace.

 I know the ES server is visible from the MCF server -- actually
 they're the same machine, and it's configured to use
 http://127.0.0.1:9200/ as the server URL. And I can go to the command
 line on that server and curl that URL successfully.


Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Karl Wright
I agree that the Elastic Search connector needs far better logging and
error handling.  CONNECTORS-629.

Karl

On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Nailed it with the help of wireshark! Turns out it was my fault -- I
 had set it up to use (i.e. create) an index called DocumentumRoW but
 it turns out ES index names must be all lowercase.

 Never knew that before.

 Slightly annoyed that ES didn't log that...

 Thanks again for your help Karl :-)

 My only request on the MCF front would be that it would be nice for
 the output connector to log the actual status code and content of a
 non-successful HTTP response.
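One defensive option on the client side is to lowercase the index name before creating it (a hypothetical helper for illustration, not MCF connector code):

```java
import java.util.Locale;

public class IndexName {
    // ES index names must be all lowercase; "DocumentumRoW" was silently
    // rejected. Normalize before sending the index-creation request.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("DocumentumRoW")); // documentumrow
    }
}
```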


 On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote:
 That information isn't being recorded in manifoldcf.log unfortunately
 -- I included all that was there. And there are no exceptions in
 elasticsearch.log either...

 I'll try running wireshark to see if I can follow the TCP stream.



 On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:
 Ok, ElasticSearch is not happy about something when the document is
 being posted.  The connector is seeing a non-200 HTTP response, and
 throwing an exception as a result:

   if (!checkResultCode(method.getStatusCode()))
 throw new ManifoldCFException(getResultDescription());

 Presumably the exception message in the log tells us what that HTTP
 code is, but you did not include that key info.

 Karl

 On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Thanks for all your help Karl!

 It's 1.0.1 from the binary distro.

 And yes, it says Connection working when I view it.

 On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

 I've created a ticket for better error reporting in this connector -
 it was a contribution and AFAIK the error handling is not very robust
 at this point, but I can fix that quickly with your help. ;-)

 Karl

 On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote:

 So you saw events in the history which correspond to these documents
 and which are of type Indexation that say success?  If that is the
 case, then the ElasticSearch connector thinks it handed the documents
 successfully to the ElasticSearch server.

 Ah, no, the activity is fetch rather than indexation. e.g.

 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361

 I don't see any history entries relating to indexing as a specific
 activity in its own right. Sorry, that was probably a red herring, I
 don't think it's getting that far.

 I just noticed that above all the service interruption reported
 warnings are some errors like this:

 ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

 Sadly there's no description, just a stacktrace.

 I know the ES server is visible from the MCF server -- actually
 they're the same machine, and it's configured to use
 http://127.0.0.1:9200/ as the server URL. And I can go to the command
 line on that server and curl that URL successfully.



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg


Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Karl Wright
I just checked in a refactoring to trunk that should improve Elastic
Search error reporting significantly.

Karl


On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright daddy...@gmail.com wrote:
 I agree that the Elastic Search connector needs far better logging and
 error handling.  CONNECTORS-629.

 Karl

 On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Nailed it with the help of wireshark! Turns out it was my fault -- I
 had set it up to use (i.e. create) an index called DocumentumRoW but
 it turns out ES index names must be all lowercase.

 Never knew that before.

 Slightly annoyed that ES didn't log that...

 Thanks again for your help Karl :-)

 My only request on the MCF front would be that it would be nice for
 the output connector to log the actual status code and content of a
 non-successful HTTP response.


 On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote:
 That information isn't being recorded in manifoldcf.log unfortunately
 -- I included all that was there. And there are no exceptions in
 elasticsearch.log either...

 I'll try running wireshark to see if I can follow the TCP stream.



 On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:
 Ok, ElasticSearch is not happy about something when the document is
 being posted.  The connector is seeing a non-200 HTTP response, and
 throwing an exception as a result:

   if (!checkResultCode(method.getStatusCode()))
 throw new ManifoldCFException(getResultDescription());

 Presumably the exception message in the log tells us what that HTTP
 code is, but you did not include that key info.

 Karl

 On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Thanks for all your help Karl!

 It's 1.0.1 from the binary distro.

 And yes, it says Connection working when I view it.

 On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

 I've created a ticket for better error reporting in this connector -
 it was a contribution and AFAIK the error handling is not very robust
 at this point, but I can fix that quickly with your help. ;-)

 Karl

 On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote:

 So you saw events in the history which correspond to these documents
 and which are of type Indexation that say success?  If that is the
 case, then the ElasticSearch connector thinks it handed the documents
 successfully to the ElasticSearch server.

 Ah, no, the activity is fetch rather than indexation. e.g.

 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361

 I don't see any history entries relating to indexing as a specific
 activity in its own right. Sorry, that was probably a red herring, I
 don't think it's getting that far.

 I just noticed that above all the service interruption reported
 warnings are some errors like this:

 ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

 Sadly there's no description, just a stacktrace.

 I know the ES server is visible from the MCF server -- actually
 they're the same machine, and it's configured to use
 http://127.0.0.1:9200/ as the server URL. And I can go to the command
 line on that server and curl that URL successfully.



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



 --

 http://tinyurl.com/andrew-clegg-linkedin | http

Re: Diagnosing REJECTED documents in job history

2013-02-01 Thread Karl Wright
The problem is that there are some documents you are indexing that
have no mime type set at all.  The ElasticSearch connector is not
handling that case properly.  I've opened ticket CONNECTORS-637, and
will fix it shortly.

Karl
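The guard for that case can be sketched like this (a hypothetical helper, not the actual ElasticSearchSpecs code). TreeSet.contains(null) throws the NullPointerException seen in the trace, so the null mime type must be handled before the lookup:

```java
import java.util.Set;
import java.util.TreeSet;

public class MimeCheck {
    // Allowed types are stored lowercase, matching how the spec is configured.
    static final Set<String> allowed =
        new TreeSet<>(Set.of("application/pdf", "text/html"));

    // Guard the null case explicitly: TreeSet.contains(null) throws NPE.
    static boolean indexable(String mimeType) {
        return mimeType != null && allowed.contains(mimeType.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(indexable(null) + " " + indexable("application/pdf"));
        // false true
    }
}
```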

On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Hi Karl,

 The extended logging has helped me find the next problem :-)

 Now I'm seeing hundreds of exceptions like this in the manifold log:


 FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
 java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.containsKey(TreeMap.java:209)
 at java.util.TreeSet.contains(TreeSet.java:217)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)


 There'll be a whole batch, then a pause, then another batch. I suspect
 this is because MCF is retrying?

 My theory about this is that Documentum is returning the mime type as
 just pdf instead of application/pdf -- although I did add pdf as
 an allowed mime type in the ElasticSearch page of the job config, just
 to see if it would parse this ok.

 Do you know if there's any way to map from a source's content type to
 a destination's content type?



 On 31 January 2013 23:09, Karl Wright daddy...@gmail.com wrote:
 I just chased down and fixed a problem in trunk.  ElasticSearch is now
 returning a 201 code for successful indexing in some cases, and the
 connector was not handling that as 'success'.

 Karl
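The underlying fix amounts to accepting any 2xx status rather than exactly 200. A minimal sketch of the idea (not the connector's actual checkResultCode):

```java
public class ResultCheck {
    // Broken check: only 200 counts as success, so ES's "201 Created"
    // response for a newly indexed document is treated as a failure.
    static boolean strict(int code) { return code == 200; }

    // Fixed check: any 2xx response is success.
    static boolean lenient(int code) { return code >= 200 && code < 300; }

    public static void main(String[] args) {
        System.out.println(strict(201) + " " + lenient(201)); // false true
    }
}
```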


 On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright daddy...@gmail.com wrote:
 Please let me know if you see any problems.  I'll fix anything you
 find as quickly as I can.

 Karl

 On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Great, thanks, I'll give it a try.

 On 30 January 2013 18:52, Karl Wright daddy...@gmail.com wrote:
 I just checked in a refactoring to trunk that should improve Elastic
 Search error reporting significantly.

 Karl


 On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright daddy...@gmail.com wrote:
 I agree that the Elastic Search connector needs far better logging and
 error handling.  CONNECTORS-629.

 Karl

 On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Nailed it with the help of wireshark! Turns out it was my fault -- I
 had set it up to use (i.e. create) an index called DocumentumRoW but
 it turns out ES index names must be all lowercase.

 Never knew that before.

 Slightly annoyed that ES didn't log that...

 Thanks again for your help Karl :-)

 My only request on the MCF front would be that it would be nice for
 the output connector to log the actual status code and content of a
 non-successful HTTP response.


 On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote:
 That information isn't being recorded in manifoldcf.log unfortunately
 -- I included all that was there. And there are no exceptions in
 elasticsearch.log either...

 I'll try running wireshark to see if I can follow the TCP stream.



 On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:
 Ok, ElasticSearch is not happy about something when the document is
 being posted.  The connector is seeing a non-200 HTTP response, and
 throwing an exception as a result:

   if (!checkResultCode(method.getStatusCode()))
 throw new ManifoldCFException(getResultDescription());

 Presumably the exception message in the log tells us what that HTTP
 code is, but you did not include that key info.

 Karl

 On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg 
 andrew.cl...@gmail.com wrote:
 Thanks for all your help Karl!

 It's 1.0.1 from the binary distro.

 And yes, it says Connection working when I view it.

 On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

 I've created a ticket for better error reporting in this connector -
 it was a contribution and AFAIK the error handling is not very 
 robust
