Re: When running on MySQL initialize.sh causes access denied error
This sounds like a reasonable fix. Would you be so kind as to create a ticket, and attach your proposed change? Adding a special property for MySQL is also reasonable.

Karl

On Mon, May 21, 2012 at 5:34 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

Hi guys.

I suppose some people use multiple servers to create an MCF-MySQL environment. Well, I'm one of them, but I found that initialize.sh causes an access denied error if the DB server is separate from MCF's. Suppose the servers' IPs are as follows:

  MySQL server IP: A
  MCF server IP:   B

and properties.xml has the following parameters and values:

  <property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfaceMySQL"/>
  <property name="org.apache.manifoldcf.dbsuperusername" value="root"/>
  <property name="org.apache.manifoldcf.dbsuperuserpassword" value="password"/>
  <property name="org.apache.manifoldcf.database.name" value="manifoldcf"/>
  <property name="org.apache.manifoldcf.mysql.server" value="A"/>

Then executing initialize.sh causes the following error:

  Caused by: java.sql.SQLException: Access denied for user 'manifoldcf'@'B' (using password: YES)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:943)
    at com.mysql.jdbc.MysqlIO.secureAuth411(MysqlIO.java:4113)
    at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1308)
    at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2336)
    at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2369)
    at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2153)
    at com.mysql.jdbc.ConnectionImpl.init(ConnectionImpl.java:792)
    at com.mysql.jdbc.JDBC4Connection.init(JDBC4Connection.java:47)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

The problem is that MCF asks MySQL to create the new manifoldcf user with localhost access. In this case, since MCF is on the server with IP B, MySQL should of course create the new user with IP address B. The following method is the one that creates the new MySQL user in the user table; modifying the part where the host information is added solves this problem:

  JAR NAME:    mcf-core.jar
  PACKAGE:     org.apache.manifoldcf.core.database
  CLASS NAME:  DBInterfaceMySQL
  METHOD NAME: public void createUserAndDatabase

  if (userName != null)
  {
    try
    {
      list.clear();
      list.add(userName);
      // list.add("localhost");
      list.add(IP_ADDRESS_B);
      list.add(password);
      ...
    }
    ...
  }

I guess it would be nice if properties.xml could take a new property holding the MCF server's IP, so that MySQL creates the manifoldcf user with that IP. What do you think?

Regards,

Shigeki
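For readers following along: the host part of a MySQL account ('user'@'host') controls which machine the account may connect from, which is exactly what the stack trace above is complaining about. Illustrative SQL only - the exact statement ManifoldCF issues may differ - but the effect of the proposed change is the difference between these two grants:

  -- What initialize.sh effectively sets up today: the manifoldcf user
  -- may only connect from the MySQL server itself.
  GRANT ALL PRIVILEGES ON manifoldcf.* TO 'manifoldcf'@'localhost' IDENTIFIED BY 'password';

  -- What a split-server setup needs: the manifoldcf user may connect
  -- from the MCF server's address (B, in the example above).
  GRANT ALL PRIVILEGES ON manifoldcf.* TO 'manifoldcf'@'B' IDENTIFIED BY 'password';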
Re: When running on MySQL initialize.sh causes access denied error
I checked a fix into trunk. The property is:

  org.apache.manifoldcf.mysql.client

... which defaults to localhost.

Karl

On Mon, May 21, 2012 at 9:17 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

OK, I posted a ticket as CONNECTORS-476. Thanks.

Shigeki

2012/5/21 Karl Wright daddy...@gmail.com

[...]
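On a split-server setup like the one described above, the new property would be set to the MCF server's address in properties.xml alongside the existing server entry (the value B is the example address from this thread):

  <property name="org.apache.manifoldcf.mysql.server" value="A"/>
  <property name="org.apache.manifoldcf.mysql.client" value="B"/>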
Re: Proposed first graduation step: Moving the repository
INFRA-4802.

Karl

On Thu, May 17, 2012 at 2:33 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote:

2012/5/17 Karl Wright daddy...@gmail.com

  Looks like I don't have permissions to do this. I suppose I would need to open an infra ticket?

Yes, I think so.

Tommaso

On Wed, May 16, 2012 at 10:25 PM, Karl Wright daddy...@gmail.com wrote:

Folks,

Heads up: Now that we've graduated, I'd like to move the repository from https://svn.apache.org/repos/asf/incubator/lcf to https://svn.apache.org/repos/asf/manifoldcf. This, of course, will mean that all workspaces will need to do an svn switch operation to change their path; "svn help switch" should give you sufficient hints as to how.

I'm planning to do the move tomorrow morning, as soon as Abe-san is done producing a 0.5.1 RC0 release candidate. Please object if you want me to hold off.

[I'm hoping, of course, that I now have the proper permissions to do this. We'll see shortly.]

Karl
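Since the repository stays on the same SVN server and only the path changes, a plain svn switch from inside an existing working copy should be all that is needed; a sketch, assuming a trunk checkout (the directory name is a placeholder):

  cd lcf-trunk
  svn switch https://svn.apache.org/repos/asf/manifoldcf/trunk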
Re: Proposed first graduation step: Moving the repository
I've found the documents on PMC chair responsibilities, and have started the process of setting up ManifoldCF as a TLP in the foundation documents. I've opened a second ticket (INFRA-4806) for creating a root site under /www/manifoldcf.apache.org.

But I've not found a checklist of all the tasks that need to be done to complete graduation, and Google is not helpful. Does anyone have a link I can use?

Karl

On Thu, May 17, 2012 at 6:31 AM, Karl Wright daddy...@gmail.com wrote:

[...]
Re: Proposed first graduation step: Moving the repository
Since he's already done, I'm going to give this a try right now.

Karl

On Wed, May 16, 2012 at 10:25 PM, Karl Wright daddy...@gmail.com wrote:

[...]
Re: [ManifoldCF 0.5] The web crawler remains running after a network connection refused
Shigeki,

There are dozens of individual kinds of error that the Web Connector detects and retries for; it would of course be possible to allow users to set parameters to control all of them, but it seems to me that would be almost too much freedom. And, as I said initially, one prime reason for the retry strategy of each error type is to avoid having ManifoldCF behave badly and get blocked by the webmaster of the site being crawled.

Having said that, if you have a case for changing the strategy for any particular kind of error, we can certainly look into that. In the case of connect exceptions, because there is a fairly long socket timeout when trying to connect (it's measured in minutes), and because attempting to connect ties up a worker thread for that whole time, you really don't want to retry too frequently. You could make the case for retrying over a longer period of time (say, 12 or 24 hours), or slightly more frequently (1 hour instead of 2 hours). If you have a case for doing that, please go ahead and create a ticket.

Thanks,
Karl

On Thu, May 10, 2012 at 10:09 PM, 小林 茂樹(情報システム本部 / サービス企画部) shigeki.kobayas...@g.softbank.co.jp wrote:

Karl,

  There should be a Scheduled value also listed which is *when* the URL will be retried

So, I see values in Scheduled and Retry Limit. The next re-crawl is two hours later and the final crawl is six hours later. That sounds like too much waiting. Are you planning a new feature that lets you change these waiting periods, or does such a thing already exist?

Thanks for sharing your knowledge.

Best regards,
Shigeki

2012/5/10 Karl Wright daddy...@gmail.com

[...]
Re: [ManifoldCF 0.5] The web crawler remains running after a network connection refused
Waiting for Processing means that the URL will be retried. There should be a Scheduled value also listed, which is *when* the URL will be retried, and a Scheduled action column that says Process. If you see these things, you only need to wait until the time specified and the document will be recrawled.

Karl

On Wed, May 9, 2012 at 9:54 PM, 小林 茂樹(情報システム本部 / サービス企画部) shigeki.kobayas...@g.softbank.co.jp wrote:

Karl,

Thanks for the reply.

  For web crawling, no single URL failure will cause the job to abort;

OK, so I understand that if I want it stopped, I need to manually abort the job.

  You can check on the status of an individual URL by using the Document Status report.

The Document Status report says the seed URL is Waiting for Processing, which makes sense because the connection is refused. The report does not show a retry count. The MCF log outputs an exception. Is this also expected behavior?

  DEBUG 2012-05-10 10:10:48,215 (Worker thread '34') - WEB: Fetch exception for 'http://xxx.xxx.xxx/index.html'
  java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(Unknown Source)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(Unknown Source)
    at org.apache.commons.httpclient.HttpConnection.open(Unknown Source)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(Unknown Source)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(Unknown Source)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
    at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)
  WARN 2012-05-10 10:10:48,216 (Worker thread '34') - Pre-ingest service interruption reported for job 1335340623530 connection 'WEB': Timed out waiting for a connection for 'http://xxx.xxx.xxx/index.html': Connection refused

Regards,
Shigeki

2012/5/9 Karl Wright daddy...@gmail.com

[...]
Re: [ManifoldCF 0.5] The web crawler remains running after a network connection refused
Hi,

ManifoldCF's web connector is, in general, very cautious about not offending the owners of sites. If it concludes that the site has blocked access to a URL, it may remove the URL from its queue for politeness, which would prevent further crawling of that URL for the duration of the current job. Under most cases, however, if a URL is temporarily unavailable, it will be requeued for crawling at a later time. The typical pattern is to attempt to recrawl the URL periodically (e.g. every 5 minutes) for many hours before giving up on it.

For web crawling, no single URL failure will cause the job to abort; it will continue running until all the other URLs have been processed, or forever (if the job is continuous).

You can check on the status of an individual URL by using the Document Status report. This report should tell you what ManifoldCF intends to do with a specific document. If you locate one such URL and try out this report, what does it say?

Karl

On Tue, May 8, 2012 at 10:04 PM, 小林 茂樹(情報システム本部 / サービス企画部) shigeki.kobayas...@g.softbank.co.jp wrote:

Hi guys.

I need some advice on stopping the MCF web crawler from a running state when a network connection is refused. I use MCF 0.5 with Solr 3.5.

I was testing what would happen to the web crawler when shutting down the web site that is to be crawled. I checked the Simple History and saw "Connection refused" with a status code of "-1", which looked fine. But as I waited, the job status never changed and remained Running. The crawler never crawls in this situation, but when I brought the web site back up, the crawler never started crawling again either.

At the least, I somehow want the crawler to stop running when a network connection is refused, but I don't know how. Does anyone have any ideas?
Re: manifoldcf 0.5 from Windows Dev machine to Debian Server
Did you use Tomcat on Windows? There is a -D switch you need to use when starting Tomcat, which tells the ManifoldCF web applications where to find the properties.xml file. It may be that you'd need to modify the Tomcat startup (/etc/init.d/tomcat6) to set that property.

The other thing to note is that, unless you change something explicitly, under Debian Tomcat runs as the tomcat6 user. So your synch directory has to be both readable and writable by that user, as well as by the user that runs the agents process. Indeed, at MetaCarta we gave up and ran everything as the tomcat6 user - it seemed easier.

Karl

On Wed, May 9, 2012 at 6:14 AM, Marcus Kröller kroel...@igd-r.fraunhofer.de wrote:

Hello everybody,

for searching internal resources (MySQL DBs, wiki, filesystem, our own JDBC-based connector) we have created a ManifoldCF and Solr instance (0.5 and 3.4). Development happened on Windows (7) machines and everything is running as desired. Now I am facing the challenge of getting it all to run on a Debian server with Tomcat 6 and PostgreSQL 8.4 on the same machine. Configuration paths have been adjusted accordingly. The webapps and the agent start individually. We are using scripts calling the Java command API, as well as the servlet API via curl (easier for connection/job creation using JSON files), to initialize the DB and register the agent, connectors, connections, etc. (these have been translated to bash, including EOL character conversion).

The problem is: following the script, the servlet API does not respond, and the Crawler UI hangs on the empty template when requesting any of the lists (Connections, Jobs, etc.). When restoring a database dump from the Windows machine these lists are accessible, but starting the agents process leads to the same unresponsive behavior.

I imagine I am facing permission issues with the synch directory, but I was unable to find documentation or similar issues, and I would unfortunately not consider myself a Linux professional. Any input would be highly appreciated.

Regards and thank you,
Marcus Kröller
Student Research Assistant - Fraunhofer IGD Rostock
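A minimal sketch of the two changes Karl describes, assuming Debian's stock tomcat6 package and placeholder paths for properties.xml and the synch directory (org.apache.manifoldcf.configfile is the standard ManifoldCF define for locating properties.xml):

  # /etc/default/tomcat6 - the Debian init script reads JAVA_OPTS from here:
  JAVA_OPTS="$JAVA_OPTS -Dorg.apache.manifoldcf.configfile=/etc/manifoldcf/properties.xml"

  # Make the synch directory readable and writable by the tomcat6 user,
  # which also runs the agents process in the everything-as-tomcat6 setup:
  chown -R tomcat6:tomcat6 /var/lib/manifoldcf/synch
  chmod -R u+rwX /var/lib/manifoldcf/synch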
Re: JDBC Connection Exception
FWIW, the ticket is CONNECTORS-96. I've created a branch to work on it. I'll let you know when I think it's ready to try out.

Karl

On Mon, May 7, 2012 at 5:53 AM, Karl Wright daddy...@gmail.com wrote:

[...]
Re: ManifoldCF 0.5 / SharePoint 2010 connector
Hi Prem,

In the future, questions like this should go to the connectors-user list, not my personal email.

If you search the users list you will find that a number of people have successfully used ManifoldCF to crawl SharePoint recently. You can see this yourself by searching the archive here: http://incubator.apache.org/connectors/en_US/mail.html . I do not remember what version they are using, but we have made no functional changes to the SharePoint connector between version 0.4 and 0.5; internationalization of the UI was the only change that was done. The users include at least one other who is crawling secure governmental systems.

If you would like assistance in diagnosing your particular problems, please provide some details as to the exact problems you are having. Are you able to establish a working connection to SharePoint? What version of SharePoint are you trying to connect to? If version 3 or above, did you deploy the ManifoldCF user permissions web service?

Thanks,
Karl

On Tue, May 8, 2012 at 8:58 AM, prem bangle prem...@gmail.com wrote:

Hi Karl,

We are unable to successfully crawl a SharePoint 2010 repository using ManifoldCF ver 0.5. Do you have feedback from others successfully crawling SharePoint 2010? Your opinion on this will help us go forward. Google searches and issues recorded in the Apache Jira did not help us come to a conclusion. We are evaluating ManifoldCF in the context of one of the Dept. of Homeland Security (DHS) programs. Any feedback from you is much appreciated.

thanks
Prem
Re: ManifoldCF 0.5 / SharePoint 2010 connector
Hi Daniel,

Here's the story. The SharePoint connector works using SharePoint web services, and has been explicitly tested against both SharePoint 2003 and SharePoint 2007. It has not been explicitly tested against SharePoint 2010, because none of the developers have a working SharePoint 2010 instance to test against. The MetaCarta Permissions web service was designed to provide access to folder and file permissions, which appeared in SharePoint 2007 and are no doubt also present in SharePoint 2010. So, I would expect the following for SharePoint 2010:

- The basic web services should continue to work as they did in SharePoint 2007. If you can connect and get "Connection working" it basically confirms this picture.
- The MetaCarta Permissions service is a greater risk. It may not work, because it is compiled against SharePoint.dll from SharePoint 2007, not SharePoint 2010. It's technically still required, so if it *doesn't* work we're going to need to make some changes to support SharePoint 2010.

So I'd suggest that you try the following, in order:

(1) First, try connecting to SharePoint 2010, specifying SharePoint 2.0 in the connection parameters. Do not try deploying the MetaCarta Permissions service for this test. If you can connect, and crawl, then we're in pretty good shape.

(2) If (1) works, then try deploying the MC Permissions service on the SharePoint 2010 server. If it deploys correctly, then try connecting to it by specifying a SharePoint 3.0 connection. If you get back "Connection working" from that, then it is functioning, and everything should be working.

Please let me know exactly how far you get in this process, and what errors you see both in manifoldcf.log and for the connection status.

Thanks!
Karl

On Tue, May 8, 2012 at 10:06 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote:

Hi Karl,

When upgrading to SharePoint 2010, will we still need to install the MetaCarta Permissions web service on the SharePoint instance?

Thanks,
Dan Silvia

From: Karl Wright [daddy...@gmail.com]
Sent: Tuesday, May 08, 2012 9:06 AM
To: prem bangle; connectors-user@incubator.apache.org
Subject: Re: ManifoldCF 0.5 / SharePoint 2010 connector

[...]
Re: ManifoldCF 0.5 / SharePoint 2010 connector
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - Target service: StsAdapterSoap
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - Enter: SOAPPart::getAsSOAPEnvelope()
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - org.apache.axis.i18n.resource::handleGetObject(currForm)
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - current form is FORM_SOAPENVELOPE
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - org.apache.axis.i18n.resource::handleGetObject(addHeader00)
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - Adding header to message...
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - org.apache.axis.i18n.resource::handleGetObject(addHeader00)
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - Adding header to message...
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - MessageContext: setTargetService(http://schemas.microsoft.com/sharepoint/dsp/queryRequest)
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - org.apache.axis.i18n.resource::handleGetObject(noService10)
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - Exception:
org.apache.axis.ConfigurationException: No service named http://schemas.microsoft.com/sharepoint/dsp/queryRequest is available
  at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper$ResourceProvider.getService(SPSProxyHelper.java:2208)
  at org.apache.axis.AxisEngine.getService(AxisEngine.java:311)
  at org.apache.axis.MessageContext.setTargetService(MessageContext.java:756)
  at org.apache.axis.transport.http.HTTPTransport.setupMessageContextImpl(HTTPTransport.java:89)
  at org.apache.axis.client.Transport.setupMessageContext(Transport.java:46)
  at org.apache.axis.client.Call.invoke(Call.java:2738)
  at org.apache.axis.client.Call.invoke(Call.java:2443)
  at org.apache.axis.client.Call.invoke(Call.java:2366)
  at org.apache.axis.client.Call.invoke(Call.java:1812)
  at com.microsoft.schemas.sharepoint.dsp.StsAdapterSoapStub.query(StsAdapterSoapStub.java:317)
  at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getDocuments(SPSProxyHelper.java:540)
  at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:906)
  at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
  at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:561)
  [the same stack frames repeat in the log output]
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - MessageContext: setServiceHandler(null)

On Tue, May 8, 2012 at 11:14 AM, Karl Wright daddy...@gmail.com wrote:

[...]
Re: JDBC Connection Exception
What database are you using? (Not the JDBC database, the underlying one...) If PostgreSQL, what version? What version of ManifoldCF? If you could also post some of the long-running queries, that would be good as well.

Depending on the database, ManifoldCF periodically re-analyzes/reindexes the underlying database during the crawl, which when the table is large can cause some warnings about long-running queries, because during the reindex process database performance is slowed. That's not usually a problem, other than briefly slowing the crawl. However, it's also possible that there's a point where PostgreSQL's plan is poor, and we should see that because the warning also dumps the plan.

Truncating the jobqueue table is not recommended, since then ManifoldCF has no idea of what it has crawled and what it hasn't, and its incremental properties tend to suffer.

Karl

On Mon, May 7, 2012 at 1:25 AM, Michael Le michael.aaron...@gmail.com wrote:

Hello,

Using a JDBC repository connection to an Oracle 11g database, I've had issues where, in the initial seeding stage, the connection to the database is closed in the middle of processing the result set. The original data table I'm trying to index is about 10 million records, and with the original code I could never get past about 750K records. I spent some time with the pooling parameters of the bitmechanic database pooling library, but its API and source don't seem to be available any more; even the original author doesn't have the code or specs. The parameter modifications to the pool allowed me to get through the first stage of processing a 2M-row subset, but during the second stage, where it's trying to obtain the documents, the connections again started being closed. I ended up just replacing the connection pool code with an Oracle implementation, and it's churning through the documents happily. As a footnote, on my sample subset of about 400K documents, the throughput went from about 10 documents/s to 19 docs/s, but this may just be a side effect of Oracle database load or network traffic. Has anyone else had issues processing a large Oracle repository? I've noted the benchmarks were done with 300K documents, and even in our initial testing with about 500K documents, no issues arose.

The second and more pressing issue is the jobqueue table. In the process of debugging the database connection issues, jobs were started, stopped, deleted, and aborted, and various WHERE clauses were applied to the seeding queries/jobs. MCF is now reporting that there are long-running queries against this table. In the past, I've just truncated the jobqueue table, but this had the side effect of stuffing a document into Solr (the output connector) multiple times. What API calls, or SQL, can I run to clean up the jobqueue table? Should I just wait for all jobs to finish and then truncate the table? I've broken my data into several smaller subsets of around 1-2 million rows, but that has the side effect of a jobqueue table that is 6-8 million rows.

Any support would be greatly appreciated.

Thanks,
-Michael Le
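As an illustration only - this is not Michael's actual code, and the connect string and credentials are placeholders - a pool replacement along these lines can lean on the connection cache built into the ojdbc driver's OracleDataSource:

  import java.sql.Connection;
  import java.util.Properties;
  import oracle.jdbc.pool.OracleDataSource;

  public class OraclePoolSketch
  {
    public static Connection openPooledConnection()
      throws Exception
    {
      // OracleDataSource ships with the ojdbc driver; in the 11g era it
      // offered an implicit connection cache, i.e. a built-in pool.
      OracleDataSource ods = new OracleDataSource();
      ods.setURL("jdbc:oracle:thin:@dbhost:1521:orcl"); // placeholder
      ods.setUser("mcfuser");                           // placeholder
      ods.setPassword("secret");                        // placeholder
      ods.setConnectionCachingEnabled(true);

      // Bound the pool so long crawls cannot exhaust server connections.
      Properties cacheProps = new Properties();
      cacheProps.setProperty("MinLimit", "2");
      cacheProps.setProperty("MaxLimit", "10");
      ods.setConnectionCacheProperties(cacheProps);

      // Connections come from the cache; close() returns them to it.
      return ods.getConnection();
    }
  }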
Re: JDBC Connection Exception
Also, there has been a long-running ticket to replace the JDBC pool driver with something more modern for a while. Many of the off-the-shelf pool drivers are inadequate for various reasons, so I have one that I wrote myself, but it is not yet committed. So I am curious - which connections are timing out? The Oracle connections or the PostgreSQL ones?

Karl

On Mon, May 7, 2012 at 5:34 AM, Karl Wright daddy...@gmail.com wrote:

[...]
Re: Adding more exclusions and document deletion
If the job is run in continuous mode, you would have to wait until the document went away. But if you are using a job meant to be run to the end, then it should pick up changes to the spec on each run.

Karl

On Mon, May 7, 2012 at 7:11 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

How sophisticated is MCF when it comes to document deletion?

I previously entered a lot of URLs into the "exclude from crawl" list in order to exclude them from my web crawl. Now, a couple of months later, I want to exclude a bunch of other URLs as well, since these are now handled by our CMS instead. Will MCF delete these new URLs/documents from our Solr server at the next job run, or will they only be deleted once they have become outdated?

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Ingestion API socket timeout exception waiting for response code
Thanks for the update!

Karl

On Mon, May 7, 2012 at 7:15 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

Document deletion works perfectly after I reinstalled the SSL certificate and re-entered the username and password for our Solr server. So I think this issue has been solved.

Erlend

On 27.04.12 12.11, Erlend Garåsen wrote:

Many thanks for your suggestions and help, Karl. Using a filesystem crawl was actually a good idea for debugging/testing. Installing a new version of Solr is not that easy on our test server, for many reasons, mainly because it is under the control of another division dealing with servers at the university, even though I can get root access.

Anyway, according to the logs on our Solr 3.2 server, it seems that MCF successfully managed to delete one test document I removed:

  [2012-04-27 11:18:33.092] {delete=[file:/tmp/mcf/docs/app_lasso.pdf]} 0 7
  [2012-04-27 11:18:33.092] [] webapp=/solr path=/update params={} status=0 QTime=7

The result code is 200 according to Simple History in MCF. I entered the passwords for the Solr servers into the Solr output configuration once again, and deleted and re-uploaded our SSL certificate, before I did the filesystem test. I should have performed the tests prior to the password updates. The crawl will start again later today at 6 pm on our production server, so I will try to figure out later whether we still have problems. I'm going to Scotland later this evening for some days without my laptop, so I cannot check the status of my crawl before I'm back, but I'll let my colleague watch the logs.

Erlend

On 26.04.12 21.14, Karl Wright wrote:

Hi Erlend,

I had some time today and was able to verify that everything worked fine against what I have currently on my laptop, which is Solr 3.2. The second job run looks like this:

  04-26-2012 15:11:44.154  job end                   1335467343879(test)                0  1
  04-26-2012 15:11:34.159  document deletion (solr)  file:/C:/testcrawl/there.txt  200  0  117
  04-26-2012 15:11:24.690  read document             C:\testcrawl                  OK   0  1
  04-26-2012 15:11:24.494  job start                 1335467343879(test)                0  1

So it appears that either something changed in Solr, or SSL support is broken, or your network is not permitting a valid HTTP response for some reason.

Karl

On Thu, Apr 26, 2012 at 11:10 AM, Karl Wright daddy...@gmail.com wrote:

Hi Erlend,

Can you try the following:

(1) Make a fresh Solr checkout of 3.6, or whatever Solr version you are using, and build it
(2) Start it
(3) Run a simple filesystem crawl using a Solr connection that is created with the default values
(4) Delete a file in your filesystem that was crawled
(5) Crawl again

Does the deletion happen OK? AFAIK, nothing has changed in the Solr connector that should affect the ability to delete. This test will confirm that it is still working.

Thanks,
Karl

On Thu, Apr 26, 2012 at 10:19 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

It seems that MCF cannot delete documents from Solr. A timeout occurs, and the job stops after a while. This is what I can see from the log:

  WARN 2012-04-20 18:24:30,373 (Worker thread '16') - Service interruption reported for job 1327930125433 connection 'Web crawler': Ingestion API socket timeout exception waiting for response code: Read timed out; ingestion will be retried again later

If I take a further look in Simple History, it seems that this error is related to document deletion. I have tried to delete the document manually by using curl from the same server MCF is installed on, in case we have some access restrictions, and curl succeeded. We do not have any problems with adding; the timeout only occurs while deleting documents.

I have checked our Solr configuration. MCF does use the correct path for document deletion, i.e. /update. The correct realm, username and password for our Solr server are entered correctly, and the SSL certificate is valid as well.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
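For reference, the manual deletion Erlend mentions can be issued against Solr 3.x's /update handler with curl; a sketch with a placeholder host and the document id taken from the log lines above:

  curl "http://localhost:8983/solr/update?commit=true" \
    -H "Content-Type: text/xml" \
    --data-binary "<delete><id>file:/tmp/mcf/docs/app_lasso.pdf</id></delete>"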
Re: Can we have location indexed as a field into solr through ManifoldCF
You want to look at the end-user documentation where it describes the Metadata tab for the Windows Share connector.

Karl

On Thu, May 3, 2012 at 4:14 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote:

Hi,

I am using ManifoldCF to crawl a Windows Share repository and index documents from it into Solr. I have a requirement where I want the location of the document to be indexed as a field in Solr. Can we achieve this from ManifoldCF?

For example, when I define a Windows Share repository connection with server1, and a job with path name path1 (present on server1), I want all the documents indexed into Solr with this job to have a field which tells me the location, which is \\server1\path1. Hope I am clear in what I want.

Can we achieve this by defining some parameters at the ManifoldCF end, or is there any other possible way? Can someone please let me know of any ideas to get this done?

Thanks and Regards,
Swapna.
Re: MCF Crawler UI doesn't load
You need a patch. See CONNECTORS-467.

Karl

On Tue, May 1, 2012 at 12:25 PM, Swapna Vuppala swapna.kollip...@gmail.com wrote:

Hi,

Until recently I had been using the ManifoldCF 0.4 version with Tomcat 7.0, and it worked perfectly fine. I am trying to switch to the ManifoldCF 0.5 version; I have built it from source and configured everything as I did for the earlier version. But I am not able to browse to the page http://localhost:8080/mcf-crawler-ui. It says:

  HTTP Status 404 - /mcf-crawler-ui
  type: Status report
  message: /mcf-crawler-ui
  description: The requested resource (/mcf-crawler-ui) is not available.
  Apache Tomcat/7.0.22

And I am able to use the other services, like http://localhost:8080/mcf-api-service/json/outputconnectors, which gives me the JSON object listing all connectors.

Can someone please help me out in resolving this issue?

Thanks and Regards,
Swapna.
Re: Output Connector for SearchBlox
The right guy to look at this is on Easter vacation at the moment. I'm sure he will respond when he is back.

Thanks,
Karl

On Sat, Apr 7, 2012 at 5:36 PM, Timo Selvaraj tselva...@searchblox.com wrote:

Hello,

Is anyone available (on a paid basis) to create an output connector for SearchBlox? I am interested in creating an output connector for SearchBlox (through the REST API, http://www.searchblox.com/developers/api ) for contribution to the ManifoldCF project. Please message me directly: tselva...@searchblox.com

Thanks,
--
Timo Selvaraj
SearchBlox Software, Inc.
http://www.searchblox.com/
Re: Running 2 jobs to update same document Index but different fields
I did not see that you tried creating a filesystem connection and job. Did you do that, and did it work for you without sending a deletion? If not, please go back to using the ManifoldCF id field and try that first.

Here is the patch I'd like you to apply:

  ===
  --- framework/agents/src/main/java/org/apache/manifoldcf/agents/incrementalingest/IncrementalIngester.java (revision 1307149)
  +++ framework/agents/src/main/java/org/apache/manifoldcf/agents/incrementalingest/IncrementalIngester.java (working copy)
  @@ -697,6 +697,8 @@
     {
       IOutputConnection connection = connectionManager.load(outputConnectionName);
  
  +    Logging.ingest.error("Deleting documents!", new Exception("Deletion stack trace"));
  +
       if (Logging.ingest.isDebugEnabled())
       {
         int i = 0;

Then, rebuild ManifoldCF. Every document that is deleted from the index will generate a trace in the log. Run your crawl and send me one of those traces.

Karl

On Fri, Mar 30, 2012 at 6:06 AM, Anupam Bhattacharya anupam...@gmail.com wrote:

I checked the ManifoldCF logs and there were no exceptions. Additionally, I changed the id (uniqueKey) in SOLR to the Documentum-specific unique id, i.e. r_object_id, and ran the job. This time I could easily create the indexes. For (4), please provide the places for which I need to enable logging.

On Thu, Mar 29, 2012 at 6:56 PM, Karl Wright daddy...@gmail.com wrote:

  But as per my observation the deletion happens only when uniqueKey in SOLR schema is set to id.

The SOLR setup cannot influence the flow in ManifoldCF unless it causes SOLR to reject the ManifoldCF requests. So I suspect that the delete request is happening in both cases, and it is not getting acted upon by SOLR in the case where uniqueKey is not set to id. That's because the delete request from ManifoldCF will be for a key that Solr doesn't recognize as such. Please do try recommendations (3) and (4).

Karl
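For anyone else following the thread: applying such a patch from the source tree root and rebuilding would look roughly like this (the patch file name and checkout directory are placeholders; "ant build" is the usual ManifoldCF build invocation):

  cd manifoldcf-trunk
  patch -p0 < deletion-trace.patch
  ant build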
Re: Running 2 jobs to update same document Index but different fields
The REJECTED result is because the document has the wrong mime type, or is too long according to your length restriction.

Do you have just one job, or do you still have two? If you have two jobs covering the same overall documents with different document criteria, this is the kind of thing that happens when you run one job after the other: the documents belonging to the first job get removed by the second. You will only need one job if you try the plan I was talking about, but it should include the PDFs as well as the XML documents. If you only have one job, then I can't explain it, unless you changed the document criteria and ran the job a second time.

Karl

On Thu, Mar 29, 2012 at 3:39 AM, Anupam Bhattacharya anupam...@gmail.com wrote:

Okay. I tried to use the id which is formed by the ManifoldCF Documentum connector. I ran the job, and I could see from the SOLR admin screen that documents were getting indexed in between. But just after the end of the job, I see all my created indexes get deleted. A snippet from Simple History is given below. Why does this document deletion activity get added, deleting all my created indexes, when I keep the unique id as id in the schema.xml file of SOLR?

  Start Time               Activity                          Identifier                                                                                           Result Code  Bytes  Time
  03-29-2012 13:00:26.837  document deletion (Solr_TEST_QA)  http://example.domain.com:8088/webtop/component/drl?versio...nLabel=CURRENTobjectId=09d905e78000676d  200          0      110
  03-29-2012 12:55:37.869  fetch                             09d905e78000676d                                                                                     REJECTED     86823  4184
  03-29-2012 12:55:34.934  document ingest (Solr_TEST_QA)    http://example.domain.com:8088/webtop/component/drl?versio...nLabel=CURRENTobjectId=09d905e78000676d  200          8158   235

On Thu, Mar 29, 2012 at 12:41 AM, Karl Wright daddy...@gmail.com wrote:

  So do you find this design appropriate and feasible?

It sounds like you are still trying to merge records in Solr, but this time using Solr Cell to somehow do this. Since Solr Cell is a pipeline, I don't think you will find it easy to keep data from one job aligned with data from another. That's why I suggested just allowing both kinds of documents to be indexed as-is, and just making sure that you include a metadata reference to the main document in each.

Karl

On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya anupam...@gmail.com wrote:

The second option seems to be more useful, as it will allow me to add any business logic. So, similar to Solr Cell (/update/extract), my new RequestHandler will be added in solrconfig.xml, and it will do all the manipulations. Later, I need to get all field values into a temp variable by first searching by id in the Lucene indexes, and then add those values into the incoming new field-values list. So do you find this design appropriate and feasible?

Anupam

On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright daddy...@gmail.com wrote:

Thanks - now I understand what you are trying to do more clearly.

The Documentum connector is going to pick up the XML document and the PDF document as separate entities. Thus, they'd also be indexed in Solr separately. So if we use that as a starting point, let's see where it might lead.

First, you'd want each PDF document to have metadata that refers back to the XML parent document. I'm not sure how easy it is to set up such a metadata reference in Documentum, but I vaguely recall there was indeed some such field. So let's presume you can get that. Then, you'd want to make sure your Solr schema included an XML-document field which had the URL of the parent XML document (or, for XML documents, the document's own URL) as content. That would be the ID you'd use to present a result item to a user. Does this sound reasonable so far?

The only other piece you might need is manipulation of the PDF's metadata, or the XML document's metadata, or both. For that, I'd use Solr Cell to perform whatever mappings and manipulations made sense before the documents actually get indexed.

Karl

On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya anupam...@gmail.com wrote:

I would have been happy if I had to index PDF and XML separately. But for my use case, the XML is the main document, containing bibliographic information (which needs to be presented as the search result), and it holds a reference to a child/supporting document, which is the actual PDF file. I need to index the PDF text, and if any search matches the PDF content, then the parent XML's bibliographic information needs to be presented. I am trying to call the SOLR search engine with one single query to show the corresponding XML detail for a search term present in the PDF. I checked that from SOLR 4.x version SOLR ...
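To make Karl's schema suggestion concrete: only the idea (a stored string field holding the parent XML document's URL, or the document's own URL for XML documents) comes from the thread; the field name below is invented for illustration. In schema.xml this would be something like:

  <!-- Every document (PDF or XML) carries the URL of the bibliographic
       XML record that should be shown as the search result. -->
  <field name="parent_xml_url" type="string" indexed="true" stored="true"/>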
Re: Running 2 jobs to update same document Index but different fields
Right, LUCENE never did allow you to modify a document's indexes, only replace them. What I'm trying to tell you is that there is no reason to have the same document ID for both documents. ManifoldCF will support treating the XML document and PDF document as different documents in Solr. But if you want them to in fact be the same document, just combined in some way, neither ManifoldCF nor Lucene will support that at this time. Karl

On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya anupam...@gmail.com wrote: I saw the index getting created by the 1st PDF indexing job, which worked perfectly well for a particular id. Later, when I ran the 2nd XML indexing job for the same id, I lost all fields indexed by the 1st job and was left with only the field indexes created by this 2nd job. I thought that it would combine field values for a specified doc id. The Lucene developers mention that by design Lucene doesn't support this. Please see the following URL: https://issues.apache.org/jira/browse/LUCENE-3837 -Anupam

On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright daddy...@gmail.com wrote: The Solr handler that you are using should not matter here. Can you look at the Simple History report, and do the following:
- Look for a document that is being indexed in both PDF and XML.
- Find the ingestion activity for that document for both PDF and XML.
- Compare the IDs (which for the ingestion activity are the URLs of the documents in Webtop).
If the URLs are in fact different, then you should be able to make this work. You need to look at how you configured your Solr instance, and which fields you are specifying in your Solr output connection. You want those Webtop urls to be indexed as the unique document identifier in Solr, not some other ID. Thanks, Karl

On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya anupam...@gmail.com wrote: Today I ran the 2 jobs one by one, but it seems that since we are using the /update/extract request handler, the field values for a common id get overridden by the latest job. I want to update certain fields in the lucene indexes for the doc, rather than completely updating with new values and losing the other field value entries.

On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright daddy...@gmail.com wrote: For Documentum, content length is in bytes, I believe. It does not set the length, it filters out all documents greater than the specified length. Leaving the field blank will perform no filtering. Document types in Documentum are specified by mime type, so you'd want to select all that apply. The actual one used will depend on how your particular instance of Documentum is configured, but if you pick them all you should have no problem. Karl

On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Thanks!! It seems from your explanation that I can update the same document's other field values. I inquired about this because I have two different documents with a parent-child relationship which need to be indexed as one document in the lucene index. As you must have understood by now, I am trying to do this for Documentum CMS. I have seen the configuration screen for setting the Content length and, second, for filtering document type. So my question is: what unit does the Content length accept (bits, bytes, KB, MB, etc.), and does this configuration set the length for a document's full-text indexing? Additionally, to scan only one kind of document, e.g. PDF, what should be added to filter those documents? Is it application/pdf OR PDF?
Regards Anupam

On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright daddy...@gmail.com wrote: The document key in Solr is the url of the document, as constructed by the connector you are using. If you are using the same document to construct two different Solr documents, ManifoldCF by definition cannot be aware of this. But if these are different files from the point of view of ManifoldCF, they will have different URLs and be treated differently. The jobs can overlap in this case with no difficulty. Karl

On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya anupam...@gmail.com wrote: I want to configure two jobs to index in SOLR using ManifoldCF, using the /update/extract requestHandler: the 1st to synchronize only the XML files, the 2nd to synchronize the PDF files. If both these documents share a unique id, can I combine the indexes for both in one SOLR schema without overriding the details added by the previous job? Suppose:
xmldoc indexes: field0(id), field1, field2, field3
pdfdoc indexes: field0(id), field4, field5, field6
Output docindex == (xml+pdf doc): field0(id), field1, field2, field3, field4, field5, field6
Regards Anupam -- Thanks Regards Anupam Bhattacharya
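To illustrate the point about the document key, here is a minimal schema.xml sketch, assuming the Solr output connection is configured to map the ManifoldCF-generated document URL into a field named "id" (the field name is an assumption, not something stated in this thread):

    <!-- schema.xml: the ManifoldCF document URL acts as the unique key -->
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <uniqueKey>id</uniqueKey>

Because the XML and PDF documents have different Webtop URLs, they keep distinct keys, and neither job overwrites the other's documents.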
Re: Running 2 jobs to update same document Index but different fields
Thanks - now I understand what you are trying to do more clearly. The Documentum connector is going to pick up the XML document and the PDF document as separate entities. Thus, they'd also be indexed in Solr separately. So if we use that as a starting point, let's see where it might lead. First, you'd want each PDF document to have metadata that refers back to the XML parent document. I'm not sure how easy it is to set up such a metadata reference in Documentum, but I vaguely recall there was indeed some such field. So let's presume you can get that. Then, you'd want to make sure your Solr schema included an XML document field, which had the URL of the parent XML document (or, for XML documents, the document's own URL) as content. That would be the ID you'd use to present a result item to a user. Does this sound reasonable so far? The only other piece you might need is manipulation of either the PDF's metadata, or the XML document's metadata, or both. For that, I'd use Solr Cell to perform whatever mappings and manipulations made sense before the documents actually get indexed. Karl

On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya anupam...@gmail.com wrote: I would have been happy if I had to index PDF and XML separately. But for my use-case, XML is the main document, containing bibliographic information (which needs to be presented as the search result), and it contains a reference to a child/supporting document, which is the actual PDF file. I need to index the PDF text, and if a search matches the PDF content, then the parent XML's bibliographic information needs to be presented. I am trying to call the SOLR search engine with one single query to show the corresponding XML detail for a search term present in the PDF. I checked that from SOLR 4.x the SOLR-Join plugin is introduced (http://wiki.apache.org/solr/Join), but it works like an inner query. Again, the main requirement is that the PDF should be searchable, but its master details from the XML should be presented in order to request the actual PDF. -Anupam

On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright daddy...@gmail.com wrote: This doesn't sound like a problem a connector can solve. The problem sounds like severe misuse of Solr/Lucene to me. You are using the wrong document key, and Lucene does not let you modify a document index once it is created; no matter what you do to ManifoldCF, it can't get around that restriction. So it sounds like you need to fundamentally rethink your design. If all you want to do is index XML and PDF as separate documents, just change your Solr output connection specification to change the selected id field appropriately. Then, BOTH documents will be indexed by Solr, each with different metadata as you originally specified. I'm frankly having a really hard time seeing why this is so hard. Karl

On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Should I write a new Documentum Connector for our specific use-case to go forward? I guess your book will be helpful to understand the connector framework in ManifoldCF.

On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright daddy...@gmail.com wrote: Right, LUCENE never did allow you to modify a document's indexes, only replace them. What I'm trying to tell you is that there is no reason to have the same document ID for both documents. ManifoldCF will support treating the XML document and PDF document as different documents in Solr.
But if you want them to in fact be the same document, just combined in some way, neither ManifoldCF nor Lucene will support that at this time. Karl

On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya anupam...@gmail.com wrote: I saw the index getting created by the 1st PDF indexing job, which worked perfectly well for a particular id. Later, when I ran the 2nd XML indexing job for the same id, I lost all fields indexed by the 1st job and was left with only the field indexes created by this 2nd job. I thought that it would combine field values for a specified doc id. The Lucene developers mention that by design Lucene doesn't support this. Please see the following URL: https://issues.apache.org/jira/browse/LUCENE-3837 -Anupam

On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright daddy...@gmail.com wrote: The Solr handler that you are using should not matter here. Can you look at the Simple History report, and do the following:
- Look for a document that is being indexed in both PDF and XML.
- Find the ingestion activity for that document for both PDF and XML.
- Compare the IDs (which for the ingestion activity are the URLs of the documents in Webtop).
If the URLs are in fact different, then you should be able to make this work. You need to look at how you configured your Solr instance, and which fields you are specifying in your Solr output connection
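To make the suggested schema addition concrete, here is a sketch of the extra field; the field name "parent_xml_url" is hypothetical, not from the thread:

    <!-- schema.xml: URL of the parent XML document; for XML documents, the document's own URL -->
    <field name="parent_xml_url" type="string" indexed="true" stored="true"/>

Each PDF would then carry its Documentum metadata reference mapped into this field, so a hit on PDF content can be presented to the user via the parent XML record.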
Re: Running 2 jobs to update same document Index but different fields
So do you find this design appropriate and feasible? It sounds like you are still trying to merge records in Solr, but this time using Solr Cell to somehow do this. Since SolrCell is a pipeline, I don't think you will find it easy to keep data from one job aligned with data from another. That's why I suggested just allowing both kinds of documents to be indexed as-is, and just making sure that you include a metadata reference to the main document in each. Karl

On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya anupam...@gmail.com wrote: The second option seems to be more useful, as it will allow me to add any business logic. So, similar to SOLR Cell (/update/extract), my new RequestHandler will be added in solrconfig.xml, which will do all the manipulations. Later, I need to get all field values into a temp variable by first searching by id in the lucene indexes and then adding these values into the incoming new field values list. So do you find this design appropriate and feasible? Anupam

On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright daddy...@gmail.com wrote: Thanks - now I understand what you are trying to do more clearly. The Documentum connector is going to pick up the XML document and the PDF document as separate entities. Thus, they'd also be indexed in Solr separately. So if we use that as a starting point, let's see where it might lead. First, you'd want each PDF document to have metadata that refers back to the XML parent document. I'm not sure how easy it is to set up such a metadata reference in Documentum, but I vaguely recall there was indeed some such field. So let's presume you can get that. Then, you'd want to make sure your Solr schema included an XML document field, which had the URL of the parent XML document (or, for XML documents, the document's own URL) as content. That would be the ID you'd use to present a result item to a user. Does this sound reasonable so far? The only other piece you might need is manipulation of either the PDF's metadata, or the XML document's metadata, or both. For that, I'd use Solr Cell to perform whatever mappings and manipulations made sense before the documents actually get indexed. Karl

On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya anupam...@gmail.com wrote: I would have been happy if I had to index PDF and XML separately. But for my use-case, XML is the main document, containing bibliographic information (which needs to be presented as the search result), and it contains a reference to a child/supporting document, which is the actual PDF file. I need to index the PDF text, and if a search matches the PDF content, then the parent XML's bibliographic information needs to be presented. I am trying to call the SOLR search engine with one single query to show the corresponding XML detail for a search term present in the PDF. I checked that from SOLR 4.x the SOLR-Join plugin is introduced (http://wiki.apache.org/solr/Join), but it works like an inner query. Again, the main requirement is that the PDF should be searchable, but its master details from the XML should be presented in order to request the actual PDF. -Anupam

On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright daddy...@gmail.com wrote: This doesn't sound like a problem a connector can solve. The problem sounds like severe misuse of Solr/Lucene to me. You are using the wrong document key, and Lucene does not let you modify a document index once it is created; no matter what you do to ManifoldCF, it can't get around that restriction. So it sounds like you need to fundamentally rethink your design.
If all you want to do is index XML and PDF as separate documents, just change your Solr output connection specification to change the selected id field appropriately. Then, BOTH documents will be indexed by Solr, each with different metadata as you originally specified. I'm frankly having a really hard time seeing why this is so hard. Karl

On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Should I write a new Documentum Connector for our specific use-case to go forward? I guess your book will be helpful to understand the connector framework in ManifoldCF.

On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright daddy...@gmail.com wrote: Right, LUCENE never did allow you to modify a document's indexes, only replace them. What I'm trying to tell you is that there is no reason to have the same document ID for both documents. ManifoldCF will support treating the XML document and PDF document as different documents in Solr. But if you want them to in fact be the same document, just combined in some way, neither ManifoldCF nor Lucene will support that at this time. Karl

On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya anupam...@gmail.com wrote: I saw the index getting created by the 1st PDF
Re: Running 2 jobs to update same document Index but different fields
For Documentum, content length is in bytes, I believe. It does not set the length, it filters out all documents greater than the specified length. Leaving the field blank will perform no filtering. Document types in Documentum are specified by mime type, so you'd want to select all that apply. The actual one used will depend on how your particular instance of Documentum is configured, but if you pick them all you should have no problem. Karl

On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Thanks!! It seems from your explanation that I can update the same document's other field values. I inquired about this because I have two different documents with a parent-child relationship which need to be indexed as one document in the lucene index. As you must have understood by now, I am trying to do this for Documentum CMS. I have seen the configuration screen for setting the Content length and, second, for filtering document type. So my question is: what unit does the Content length accept (bits, bytes, KB, MB, etc.), and does this configuration set the length for a document's full-text indexing? Additionally, to scan only one kind of document, e.g. PDF, what should be added to filter those documents? Is it application/pdf OR PDF? Regards Anupam

On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright daddy...@gmail.com wrote: The document key in Solr is the url of the document, as constructed by the connector you are using. If you are using the same document to construct two different Solr documents, ManifoldCF by definition cannot be aware of this. But if these are different files from the point of view of ManifoldCF, they will have different URLs and be treated differently. The jobs can overlap in this case with no difficulty. Karl

On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya anupam...@gmail.com wrote: I want to configure two jobs to index in SOLR using ManifoldCF, using the /update/extract requestHandler: the 1st to synchronize only the XML files, the 2nd to synchronize the PDF files. If both these documents share a unique id, can I combine the indexes for both in one SOLR schema without overriding the details added by the previous job? Suppose:
xmldoc indexes: field0(id), field1, field2, field3
pdfdoc indexes: field0(id), field4, field5, field6
Output docindex == (xml+pdf doc): field0(id), field1, field2, field3, field4, field5, field6
Regards Anupam
RE: SmbException
Hi, The problem you are seeing is a server-side error of some kind. The jcifs connector will retry documents that fail to fetch properly after some period of time, usually five minutes. Warning messages are nevertheless recorded in the log for every retry. So, unless the job aborts, or ceases to make progress, everything may actually be OK. If the job does abort or stops moving forward, it means that certain documents are having consistent errors. You can find an example of such a document easily enough by reviewing the Simple History. Then, see if you can reach the doc via Windows, or whether you get errors there also. Please let us know what you find. Karl Sent from my Windows Phone

From: takagig Sent: 3/17/2012 1:48 AM To: connectors-user@incubator.apache.org Subject: SmbException

Hi, everyone. I am using ManifoldCF (from a trunk build, 0.5?) to crawl a Windows share folder for our application. When I run ManifoldCF, I sometimes get an SmbException. SmbExceptions occur more often once the crawl goes past about 70,000 files. Please suggest a method of solving this problem.

1) My Environment
- Windows 2003 R2 SE SP2
- ManifoldCF (from trunk | 2012-03-01)

2) Case
crawling target = 100,000 files. max complete files = 79503 files.

3) ManifoldCF Status (showjobstatus.jsp)
Error: SmbException thrown: No process is on the other end of the pipe.

4) Log
++
WARN 2012-03-16 17:58:55,731 (Worker thread '8') - JCIFS: Possibly transient exception detected on attempt 1 while getting share security: 0x
jcifs.dcerpc.DcerpcException: 0x
at jcifs.dcerpc.DcerpcBind.getResult(DcerpcBind.java:40)
at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:249)
at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:140)
at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2943)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2342)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.describeDocumentSecurity(SharedDriveConnector.java:1003)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getDocumentVersions(SharedDriveConnector.java:546)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
WARN 2012-03-16 17:58:55,731 (Worker thread '8') - JCIFS: Possibly transient exception detected on attempt 2 while getting share security: No process is on the other end of the pipe.
jcifs.smb.SmbException: No process is on the other end of the pipe.
at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
at jcifs.smb.SmbSession.send(SmbSession.java:238)
at jcifs.smb.SmbTree.send(SmbTree.java:119)
at jcifs.smb.SmbFile.send(SmbFile.java:775)
at jcifs.smb.SmbFile.open0(SmbFile.java:989)
at jcifs.smb.SmbFile.open(SmbFile.java:1006)
at jcifs.smb.SmbFileOutputStream.<init>(SmbFileOutputStream.java:142)
at jcifs.smb.TransactNamedPipeOutputStream.<init>(TransactNamedPipeOutputStream.java:32)
at jcifs.smb.SmbNamedPipe.getNamedPipeOutputStream(SmbNamedPipe.java:187)
at jcifs.dcerpc.DcerpcPipeHandle.doSendFragment(DcerpcPipeHandle.java:68)
at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:190)
at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:140)
at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2943)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2342)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.describeDocumentSecurity(SharedDriveConnector.java:1003)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getDocumentVersions(SharedDriveConnector.java:546)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
++
What kind of other information should I provide?
Re: [ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions
Hi Shigeki, A service interruption means that a connector (either a repository connector like the web connector, or an output connector like the Solr connector) could not communicate with the configured service. Repeated service interruptions means that certain URLs failed to fetch properly even after a pattern of retries which lasted many hours. ManifoldCF connectors deal with such errors in one of several ways, depending on the exact details of the error:
- ignore it and proceed
- retry periodically for some time interval, and then give up and proceed
- retry periodically for some time interval, and then shut down the job
It sounds like your job has encountered one of the latter errors. The Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500 indicates that the problem is due to communication with Solr. Apparently certain documents you are indexing are causing Solr to return an error code 500, which is an internal server error, and is usually associated with a Solr exception. You will need to diagnose why this is, and take corrective steps, in order for your ManifoldCF job to complete successfully. Job no longer active is harmless - it's a side effect of the job shutting down. When a job is shutting down, active document processing cannot always be interrupted within a connector, but the framework helps it to stop quickly by throwing this exception. Thanks, Karl

2012/3/16 Shigeki Kobayashi (Information Systems Division / Service Planning Dept.) shigeki.kobayas...@g.softbank.co.jp: I was crawling web sites with links to html and pdf files on the provided multiprocess-example agent for a few hours, when the Simple History started showing a -104 result code with a message saying Interrupted: Job no longer active. After the same error occurred repeatedly, around 40 times, the job status became Aborting and then ended up with Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500. The job was interrupted and stopped. Does anyone know what situation brings about Repeated service interruptions and stops jobs? Also, in what circumstance does an error status code -104 occur? What is the meaning of the code -104? If you have any ideas, please advise me on how to avoid this error. I am using the following:
Solr 1.4 (Extracting Request Handler is set)
ManifoldCF 0.4 (multiprocess-example) - Repository connector: WEB - Output connector: Solr
Tomcat 6.0.29
PostgreSQL 9.1.3
Here is MCF's debug log right before the job was interrupted:
DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (95697 ms)
DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Waiting 3895 ms before starting fetch on http://xx.xx.xx.xx:80
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (99593 ms)
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Successfully got connection to http://xx.xx.xx.xx:80 (99593 ms)
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Waiting for an HttpClient object
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Got an HttpClient object after 0 ms.
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Get method for '/xx/xx.pdf'
DEBUG 2012-03-15 20:04:20,222 (Worker thread '4') - WEB: For http://xx.xx/xx/xx.pdf, setting virtual host to xx.xx
DEBUG 2012-03-15 20:04:20,315 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 128 ms.
DEBUG 2012-03-15 20:04:20,445 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,509 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,573 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,637 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,701 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,765 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,829 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,893 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,957 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,021 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,085 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,149 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,213 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,277 (Worker thread '4') - WEB: Performing a
Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR
Please see my response interleaved below.

On Mon, Feb 27, 2012 at 9:53 AM, Matthew Parker mpar...@apogeeintegration.com wrote: I'm trying to push data into SOLR. Is there a way to transform the metadata coming in from different data sources, like SharePoint and the File Share, prior to posting it into SOLR?

In general, ManifoldCF does not have data transformation abilities. With Solr, we rely on Solr Cell, which is a pipeline built on Tika, to extract content from documents and to perform transformations to document metadata etc. It is possible that at some point it will be possible to do more transformations in ManifoldCF in order to support search engines that don't have a pipeline, but that is currently not available.

For instance, documents have metadata specifying their file path. I need to transform that to a URL I can use within SOLR to retrieve that document through a servlet that I wrote.

The ManifoldCF model is that a connector creates a URL for each document that it indexes, using whatever makes sense for that particular repository to get you back to the document in question. So, for instance, Documentum documents will use URLs that point at Documentum's Webtop web application. It would be helpful to understand more precisely what you are trying to do. You could, for instance, modify your servlet to redirect to the ManifoldCF-generated URL.

It gets indexed into Solr as the id field. Also, based on specific metadata that I'm seeing in the documents, I might want to conditionally populate other fields in the SOLR index.

That sounds like a job for the Tika pipeline to me. Thanks, Karl
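As an illustration of the kind of Solr Cell mapping described above, here is a minimal solrconfig.xml sketch; the specific mappings and field names are assumptions, not a tested configuration for SharePoint or file-share metadata:

    <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" startup="lazy">
      <lst name="defaults">
        <!-- lowercase incoming metadata field names -->
        <str name="lowernames">true</str>
        <!-- map Tika's extracted body text into the schema's "text" field -->
        <str name="fmap.content">text</str>
        <!-- prefix metadata fields that have no schema match, instead of failing -->
        <str name="uprefix">attr_</str>
      </lst>
    </requestHandler>

Conditional logic beyond simple renames would have to go into a custom UpdateRequestProcessor or, as suggested, the Tika pipeline itself.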
Re: Cannot find OracleDriver
So if the Database and Host field really is 21:16:18:145:1521, try 21.16.18.145:1521 instead. ;-) Karl

On Mon, Feb 27, 2012 at 9:22 AM, Matthew Parker mpar...@apogeeintegration.com wrote:
Type: JDBC
Authority: None
Database Type: ORACLE
Database and Host: 21:16:18:145:1521
Instance/Database: main
User Name:
Password: X

On Sun, Feb 26, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote: I haven't seen this one. I'd love to know what the connect descriptor it refers to is. Can you tell me what the parameters all look like for the JDBC connection you are setting up? Are you specifying, for instance, the port as part of the server name? Karl

On Sat, Feb 25, 2012 at 1:22 PM, Matthew Parker mpar...@apogeeintegration.com wrote: Karl, That fixed the driver issue. I just updated my start.jar file by hand for now. The problem I have now is connecting to ORACLE. I can do it through NetBeans on my machine, but I cannot connect through ManifoldCF with the same settings. I get the following error: Error getting connection. Listener refused the connection with the following error. ORA-12514. TNS:Listener does not currently know of service requested in connect descriptor. This might be more of an ORACLE issue than a Manifold issue, but I was wondering whether you've encountered the same thing during testing? Regards, Matt

On Fri, Jan 20, 2012 at 10:28 AM, Matthew Parker mpar...@apogeeintegration.com wrote: Thanks Karl.

On Thu, Jan 19, 2012 at 9:44 PM, Karl Wright daddy...@gmail.com wrote: The problem has been fixed on trunk. Basically, the instructions changed, as did some of the build files. It turned out to be extremely challenging to get JDBC drivers to run when they were loaded by anything other than the system classloader, so that's what I was forced to ensure. Thanks, Karl

On Thu, Jan 19, 2012 at 3:33 PM, Karl Wright daddy...@gmail.com wrote: The ticket for this problem is CONNECTORS-390. Karl

On Thu, Jan 19, 2012 at 3:05 PM, Matthew Parker mpar...@apogeeintegration.com wrote: Many thanks. I'll give that a try.

On Thu, Jan 19, 2012 at 3:01 PM, Karl Wright daddy...@gmail.com wrote: The problem is that the JDBC driver is using a pool driver that is in common with the core of ManifoldCF. So the connector-lib path, which only the connectors know about, won't do. That's a bug, which I'll create a ticket for. A temporary fix, which is slightly involved, requires you to put the ojdbc6.jar in the example/lib area, as you already tried, but in addition you will need to explicitly include the jar in your classpath. Normally the start.jar's manifest describes all the jars in the initial classpath. I thought it was possible to also include additional classpath info through the normal --classpath mechanism, but that doesn't seem to work, so you may be stuck with modifying the root build.xml file to add the jar to the manifest. I'm going to experiment a bit and see if I can come up with something quickly. Karl

On Thu, Jan 19, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote: I was able to reproduce the problem. I'll get back to you when I figure out what the issue is. Karl

On Thu, Jan 19, 2012 at 2:47 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I've used the jar file in NetBeans to connect to the database without any issue. Seems more like a class loader issue.

On Thu, Jan 19, 2012 at 2:41 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I have the latest release from the Apache Manifold site (i.e. 0.3-incubating).
I checked the driver jar file with WinZip, and the driver name is still the same (oracle.jdbc.OracleDriver). I'm running Java 1.6.0_18-b7 on Windows XP SP3.

On Thu, Jan 19, 2012 at 2:27 PM, Karl Wright daddy...@gmail.com wrote: MCF's Oracle support was written against earlier versions of the Oracle driver. It is possible that they have changed the driver class. If the driver winds up in the dist/connector-lib directory (I'm assuming you are using trunk or 0.4-incubating), then it should be accessible. Could you please try the following:

jar -tf ojdbc6.jar | grep oracle/jdbc/OracleDriver

... assuming you are using Linux? If the driver class IS found, then the other possibility is that the jar is compiled against a later version of Java than the one you are using to run MCF. Please let me know what you find. Karl

On Thu, Jan 19, 2012 at 1:43 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I downloaded MCF and started playing with the default setup under Jetty
Re: Cannot find OracleDriver
The connect URL it will use given those parameters is the following:

String dburl = "jdbc:" + providerName + "//" + host + "/" + database + ((instanceName==null)?"":";instance="+instanceName);

Or, filled in with your parameters: jdbc:oracle:thin:@//21.16.18.145:1521/main

The main at the end is what I would wonder about. Oracle's default is database; if you leave the database/instance name field blank, that's what you'll get. I also recommend turning on connector debugging, in properties.xml, by adding:

<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

... and restarting ManifoldCF. Try viewing the connection in the UI; you should see the connect string logged, as well as possibly a more detailed response. Thanks, Karl

On Mon, Feb 27, 2012 at 11:12 AM, Matthew Parker mpar...@apogeeintegration.com wrote: Sorry. I used the wrong character. It is configured for 21.16.18.145:1521

On Mon, Feb 27, 2012 at 10:27 AM, Karl Wright daddy...@gmail.com wrote: So if the Database and Host field really is 21:16:18:145:1521, try 21.16.18.145:1521 instead. ;-) Karl

On Mon, Feb 27, 2012 at 9:22 AM, Matthew Parker mpar...@apogeeintegration.com wrote:
Type: JDBC
Authority: None
Database Type: ORACLE
Database and Host: 21:16:18:145:1521
Instance/Database: main
User Name:
Password: X

On Sun, Feb 26, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote: I haven't seen this one. I'd love to know what the connect descriptor it refers to is. Can you tell me what the parameters all look like for the JDBC connection you are setting up? Are you specifying, for instance, the port as part of the server name? Karl

On Sat, Feb 25, 2012 at 1:22 PM, Matthew Parker mpar...@apogeeintegration.com wrote: Karl, That fixed the driver issue. I just updated my start.jar file by hand for now. The problem I have now is connecting to ORACLE. I can do it through NetBeans on my machine, but I cannot connect through ManifoldCF with the same settings. I get the following error: Error getting connection. Listener refused the connection with the following error. ORA-12514. TNS:Listener does not currently know of service requested in connect descriptor. This might be more of an ORACLE issue than a Manifold issue, but I was wondering whether you've encountered the same thing during testing? Regards, Matt

On Fri, Jan 20, 2012 at 10:28 AM, Matthew Parker mpar...@apogeeintegration.com wrote: Thanks Karl.

On Thu, Jan 19, 2012 at 9:44 PM, Karl Wright daddy...@gmail.com wrote: The problem has been fixed on trunk. Basically, the instructions changed, as did some of the build files. It turned out to be extremely challenging to get JDBC drivers to run when they were loaded by anything other than the system classloader, so that's what I was forced to ensure. Thanks, Karl

On Thu, Jan 19, 2012 at 3:33 PM, Karl Wright daddy...@gmail.com wrote: The ticket for this problem is CONNECTORS-390. Karl

On Thu, Jan 19, 2012 at 3:05 PM, Matthew Parker mpar...@apogeeintegration.com wrote: Many thanks. I'll give that a try.

On Thu, Jan 19, 2012 at 3:01 PM, Karl Wright daddy...@gmail.com wrote: The problem is that the JDBC driver is using a pool driver that is in common with the core of ManifoldCF. So the connector-lib path, which only the connectors know about, won't do. That's a bug, which I'll create a ticket for. A temporary fix, which is slightly involved, requires you to put the ojdbc6.jar in the example/lib area, as you already tried, but in addition you will need to explicitly include the jar in your classpath.
Normally the start.jar's manifest describes all the jars in the initial classpath. I thought it was possible to also include additional classpath info through the normal --classpath mechanism, but that doesn't seem to work, so you may be stuck with modifying the root build.xml file to add the jar to the manifest. I'm going to experiment a bit and see if I can come up with something quickly. Karl

On Thu, Jan 19, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote: I was able to reproduce the problem. I'll get back to you when I figure out what the issue is. Karl

On Thu, Jan 19, 2012 at 2:47 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I've used the jar file in NetBeans to connect to the database without any issue. Seems more like a class loader issue.

On Thu, Jan 19, 2012 at 2:41 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I have the latest release from the Apache Manifold site (i.e. 0.3-incubating). I
Re: ManifoldCF's dist/shapoint-integration dir
Hi Daniel, I have not personally tried ManifoldCF on JBoss, but since both Jetty and Tomcat work without modification, I would wonder if there is a JBoss classloader option you might be setting incorrectly. The reason this is likely is because the web container specification is pretty clear about the hierarchical order of resolution of classes for web applications, and it is this characteristic which will determine whether JDBC DriverManager registration works properly or not. Jetty has two possible settings, for instance - one that makes it conform to the spec, and one that is useful for single-process deployments. Perhaps other users on this list might have some hints? Karl

On Thu, Feb 23, 2012 at 7:47 PM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl, I have been trying to configure ManifoldCF to run on JBoss. When I run Manifold on JBoss, the connection pool can't be created. Do we need to set the datasource through the web console of JBoss? I believe the code is in the DatabaseFactory. Thanks, Dan

From: Karl Wright [daddy...@gmail.com] Sent: Monday, February 13, 2012 10:10 AM To: Silvia, Daniel [USA] Cc: connectors-user@incubator.apache.org Subject: Re: ManifoldCF's dist/shapoint-integration dir

The SharePoint connector only looks at documents within libraries, and documents within folders in those libraries. I don't know how SharePoint is structuring your Wiki content, though. If it is individual documents within libraries, it should be accessible by the SharePoint Connector. If it is some other construct, then it likely won't be found by that connector. The Simple History is going to list the URLs that the SharePoint connector fetches. If you know the URL of a piece of Wiki content and that URL does not appear in the Simple History, it's not being fetched. Similarly, if the URL of that piece of Wiki content has no library name in the path, it's not something the SharePoint Connector will be able to index. If the SharePoint connector is not going to do it for you, and your wiki content is being rendered in a manner that supports standard Wiki API calls, you can use the Wiki Connector to index it. If that too isn't going to work, then we should analyze exactly what SharePoint is presenting, with a view towards extending the SharePoint connector. Karl

On Mon, Feb 13, 2012 at 9:51 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl, Does the SharePoint connector only pull files from the SharePoint instance, and not content like Wiki content? As mentioned in the previous e-mail, I am able to see the xml content in the log file for the wikis, with an element similar to <someWiki><someNameWiki_row>[some other elements]<WikiField>content.</WikiField></someNameWiki_row></someWiki>. However, I do not see information in the Simple History Report pulling Wiki information or the .aspx pages. Does this report only produce information on files, and not content pulled from SharePoint? I am just trying to figure out if I need to configure another connector to pull content from SharePoint other than the SharePoint connector. Thanks, Dan

From: Karl Wright [daddy...@gmail.com] Sent: Sunday, February 12, 2012 12:08 PM To: Silvia, Daniel [USA] Cc: connectors-user@incubator.apache.org Subject: Re: ManifoldCF's dist/shapoint-integration dir

Hi Daniel, If you are seeing fetches in the Simple History that include the wiki URLs you are trying to capture, the SharePoint job is likely correct. Are you seeing Document ingest activities for the same documents?
If so, they are being sent to Solr, and you'd have to look into the Solr configuration to figure out why they aren't being indexed. Thanks,

On Sun, Feb 12, 2012 at 11:37 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl, A quick question regarding SharePoint Wikis and ingesting them into Solr. I have been trying to get the Wikis, created in SharePoint, to be ingested into Solr. I am able to see the Wikis in the logging where the SharePoint Connector pulls everything from the site; however, I do not see the Wikis' content in the Solr instance. When creating a job to run, do I need to indicate a path similar to *Wiki* for the entire site, or do I need to configure the Solr metadata in the job to capture the WikiField element in the xml being passed to the Solr connector? Thanks for your help. Dan

From: Karl Wright [daddy...@gmail.com] Sent: Tuesday, January 31, 2012 10:52 AM To: Silvia, Daniel [USA] Cc: connectors-user@incubator.apache.org Subject: Re: ManifoldCF's dist/shapoint-integration dir

It's been a while since I've set up a SharePoint job, but I think what you are missing is a file rule (instead of just a library rule). Here's what the end-user documentation says on the matter: Each rule consists of a path, a rule
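For reference, the two Jetty settings mentioned above are controlled by the webapp classloader priority. A sketch, assuming a Jetty 6-era context XML (the class name dates from that generation of Jetty):

    <Configure class="org.mortbay.jetty.webapp.WebAppContext">
      <!-- false = servlet-spec behavior (webapp classes resolved first);
           true = system classloader first, the setting useful for single-process deployments -->
      <Set name="parentLoaderPriority">false</Set>
    </Configure>

A JBoss deployment would need the equivalent knob in its own classloader configuration.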
Re: Need Help on setting up ManifoldCF
Hi Anupam, I did not see a ticket from you about the DOCUMENTUM environment variable and the dmcl.ini vs. dfc.properties file. I've created an issue at https://issues.apache.org/jira/browse/CONNECTORS-410 to track this problem. It would be great if you could confirm that: (a) the DOCUMENTUM environment variable is still needed at all by DFC, and (b) that when it is set properly, the file dfc.properties can be found at $DOCUMENTUM\dfc.properties (on Windows, at least). Thanks, Karl

On Tue, Feb 14, 2012 at 3:23 PM, Karl Wright daddy...@gmail.com wrote: Hi Anupam, Please post emails like this directly to connectors-user@incubator.apache.org. See below for responses.

On Tue, Feb 14, 2012 at 3:07 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Hello Karl, I am a software programmer at DuPont, Gurgaon, India. Recently, due to the economic instability all over the world, the company has decided to go for cheaper search engine applications. Thus we are getting rid of many costly proprietary search applications and will be replacing with FAST. I recently came across the SOLR search engine and the ManifoldCF connector framework. Thus, I am currently driving this effort within my company, as I am a big supporter of open source technologies. I started my career in Alfresco CMS and now work on search technologies. Currently I am facing lots of initial building/deploying/installing issues. I have already referred to the URL http://incubator.apache.org/connectors/en_US/how-to-build-and-deploy.html and read it multiple times, but still face many issues. I downloaded the latest 0.4 version, and it seems the documentation at the above link is not up to date.

The online documentation is pertinent to trunk. The documentation you want to use is contained within the 0.4-incubating release. Go to dist/doc and you will see it there.

A few issues which took me a long time to resolve, and which could be added to the ManifoldCF wiki as learnings for others, are listed below:

a. No single example is given for running executecommand.bat with proper arguments; only a list of commands is given, with parameters defined.

I'm not entirely sure I get this. Do you just want an example in the documentation?

b. Setting where and which file for the property manifoldcf.configfile when deploying the war on Tomcat with a PostgreSQL database.

The documentation already tells you that you need to add an appropriate -D to your tomcat invocation to point to your properties.xml file. Tomcat documentation differs from version to version and platform to platform on how best to do that, and if you run under Windows there's even a service wrapper with a configuration UI that allows you to set these parameters. So it's way beyond ManifoldCF's mission to describe all that, I think.

c. I am trying to build the Documentum Connector, but came to know that some additional environment variables need to be added for DOCUMENTUM. Additionally, the latest version of Documentum uses a dfc.properties file, while run.bat looks for a dmcl.ini file.

Could you open a ticket in Jira for this issue? https://issues.apache.org/jira. It should not be a problem if you modify the script temporarily, but we can readily make the script look for either of these.

d. The postgresql driver is JDBC3, thus it creates problems with JVM 6 or above.

We use JDK 6 all the time without problems, so I don't know what you are talking about here.

e. I was getting errors during the ant build, which tries to delete jar files from the lib directory. I don't have the source code with me right now, thus I can't provide the full path.
It sounds like you were trying to run ant while you still had ManifoldCF processes running from the same tree.

f. It was advised in the documentation to set MCF_Home for the example_multiprocess project, but it seems the build of the Documentum connector refers to this property differently from run.bat.

Yes, this was noticed and fixed on trunk recently.

Can you please update the Apache ManifoldCF website with the latest installation procedures? Also, it would be very kind of you in the meanwhile to send a few notes to give me a head start on the configuration of ManifoldCF with the SOLR and Documentum connectors.

The documentation online has been updated to be consistent with trunk, so if you want to use the trunk version this might be a good opportunity to help clarify the documentation. Either that, or you will need to stick with the 0.4-incubating release and the 0.4-incubating documentation that is part of it; we cannot at this time update documentation that has already been released. Thanks, Karl

Looking forward to your help. Thanks and Regards, Anupam Bhattacharya
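For what it's worth, the -D mentioned above would look something like this on Tomcat (the path shown is an assumption, not a required location):

    -Dorg.apache.manifoldcf.configfile=C:\manifoldcf\properties.xml

... added to JAVA_OPTS/CATALINA_OPTS, or entered in the Java tab of the Windows service configuration UI.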
Re: Need Help on setting up ManifoldCF
By all means, please go ahead. Solr has a tutorial - maybe something like that would be appropriate? Karl

On Tue, Feb 14, 2012 at 6:54 PM, Hitoshi Ozawa ozawa_hito...@ogis-ri.co.jp wrote: Hi, I agree with Anupam on getting started with ManifoldCF. I'm thinking of writing up a simple quick guide, because many people are having trouble. I think it would help others if there were a simple example with ManifoldCF + Solr + local files + jsp, to crawl some files in a local directory (the ManifoldCF documents in PDF?) and search and display the results. H.Ozawa

(2012/02/15 5:23), Karl Wright wrote: Hi Anupam, Please post emails like this directly to connectors-user@incubator.apache.org. See below for responses.

On Tue, Feb 14, 2012 at 3:07 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Hello Karl, I am a software programmer at DuPont, Gurgaon, India. Recently, due to the economic instability all over the world, the company has decided to go for cheaper search engine applications. Thus we are getting rid of many costly proprietary search applications and will be replacing with FAST. I recently came across the SOLR search engine and the ManifoldCF connector framework. Thus, I am currently driving this effort within my company, as I am a big supporter of open source technologies. I started my career in Alfresco CMS and now work on search technologies. Currently I am facing lots of initial building/deploying/installing issues. I have already referred to the URL http://incubator.apache.org/connectors/en_US/how-to-build-and-deploy.html and read it multiple times, but still face many issues. I downloaded the latest 0.4 version, and it seems the documentation at the above link is not up to date.

The online documentation is pertinent to trunk. The documentation you want to use is contained within the 0.4-incubating release. Go to dist/doc and you will see it there.

A few issues which took me a long time to resolve, and which could be added to the ManifoldCF wiki as learnings for others, are listed below:

a. No single example is given for running executecommand.bat with proper arguments; only a list of commands is given, with parameters defined.

I'm not entirely sure I get this. Do you just want an example in the documentation?

b. Setting where and which file for the property manifoldcf.configfile when deploying the war on Tomcat with a PostgreSQL database.

The documentation already tells you that you need to add an appropriate -D to your tomcat invocation to point to your properties.xml file. Tomcat documentation differs from version to version and platform to platform on how best to do that, and if you run under Windows there's even a service wrapper with a configuration UI that allows you to set these parameters. So it's way beyond ManifoldCF's mission to describe all that, I think.

c. I am trying to build the Documentum Connector, but came to know that some additional environment variables need to be added for DOCUMENTUM. Additionally, the latest version of Documentum uses a dfc.properties file, while run.bat looks for a dmcl.ini file.

Could you open a ticket in Jira for this issue? https://issues.apache.org/jira. It should not be a problem if you modify the script temporarily, but we can readily make the script look for either of these.

d. The postgresql driver is JDBC3, thus it creates problems with JVM 6 or above.

We use JDK 6 all the time without problems, so I don't know what you are talking about here.

e. I was getting errors during the ant build, which tries to delete jar files from the lib directory.
I don't have the source code with me right now, thus I can't provide the full path.

It sounds like you were trying to run ant while you still had ManifoldCF processes running from the same tree.

f. It was advised in the documentation to set MCF_Home for the example_multiprocess project, but it seems the build of the Documentum connector refers to this property differently from run.bat.

Yes, this was noticed and fixed on trunk recently.

Can you please update the Apache ManifoldCF website with the latest installation procedures? Also, it would be very kind of you in the meanwhile to send a few notes to give me a head start on the configuration of ManifoldCF with the SOLR and Documentum connectors.

The documentation online has been updated to be consistent with trunk, so if you want to use the trunk version this might be a good opportunity to help clarify the documentation. Either that, or you will need to stick with the 0.4-incubating release and the 0.4-incubating documentation that is part of it; we cannot at this time update documentation that has already been released. Thanks, Karl

Looking forward to your help. Thanks and Regards, Anupam Bhattacharya
Re: ManifoldCF's dist/shapoint-integration dir
The SharePoint connector only looks at documents within libraries, and documents within folders in those libraries. I don't know how SharePoint is structuring your Wiki content, though. If it is individual documents within libraries, it should be accessible by the SharePoint Connector. If it is some other construct, then it likely won't be found by that connector. The Simple History is going to list the URLs that the SharePoint connector fetches. If you know the URL of a piece of Wiki content and that URL does not appear in the Simple History, it's not being fetched. Similarly, if the URL of that piece of Wiki content has no library name in the path, it's not something the SharePoint Connector will be able to index. If the SharePoint connector is not going to do it for you, and your wiki content is being rendered in a manner that supports standard Wiki API calls, you can use the Wiki Connector to index it. If that too isn't going to work, then we should analyze exactly what SharePoint is presenting, with a view towards extending the SharePoint connector. Karl

On Mon, Feb 13, 2012 at 9:51 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl, Does the SharePoint connector only pull files from the SharePoint instance, and not content like Wiki content? As mentioned in the previous e-mail, I am able to see the xml content in the log file for the wikis, with an element similar to <someWiki><someNameWiki_row>[some other elements]<WikiField>content.</WikiField></someNameWiki_row></someWiki>. However, I do not see information in the Simple History Report pulling Wiki information or the .aspx pages. Does this report only produce information on files, and not content pulled from SharePoint? I am just trying to figure out if I need to configure another connector to pull content from SharePoint other than the SharePoint connector. Thanks, Dan

From: Karl Wright [daddy...@gmail.com] Sent: Sunday, February 12, 2012 12:08 PM To: Silvia, Daniel [USA] Cc: connectors-user@incubator.apache.org Subject: Re: ManifoldCF's dist/shapoint-integration dir

Hi Daniel, If you are seeing fetches in the Simple History that include the wiki URLs you are trying to capture, the SharePoint job is likely correct. Are you seeing Document ingest activities for the same documents? If so, they are being sent to Solr, and you'd have to look into the Solr configuration to figure out why they aren't being indexed. Thanks,

On Sun, Feb 12, 2012 at 11:37 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl, A quick question regarding SharePoint Wikis and ingesting them into Solr. I have been trying to get the Wikis, created in SharePoint, to be ingested into Solr. I am able to see the Wikis in the logging where the SharePoint Connector pulls everything from the site; however, I do not see the Wikis' content in the Solr instance. When creating a job to run, do I need to indicate a path similar to *Wiki* for the entire site, or do I need to configure the Solr metadata in the job to capture the WikiField element in the xml being passed to the Solr connector? Thanks for your help. Dan

From: Karl Wright [daddy...@gmail.com] Sent: Tuesday, January 31, 2012 10:52 AM To: Silvia, Daniel [USA] Cc: connectors-user@incubator.apache.org Subject: Re: ManifoldCF's dist/shapoint-integration dir

It's been a while since I've set up a SharePoint job, but I think what you are missing is a file rule (instead of just a library rule). Here's what the end-user documentation says on the matter: Each rule consists of a path, a rule type, and an action.
The actions are Include and Exclude. The rule type tells the connection what kind of SharePoint entity it is allowed to exactly match. For example, a File rule will only exactly match SharePoint paths that represent files - it cannot exactly match sites or libraries. The path itself is just a sequence of characters, where the * character has the special meaning of being able to match any number of any kind of characters, and the ? character matches exactly one character of any kind. The rule matcher extends strict, exact matching by introducing a concept of implicit inclusion rules. If your rule action is Include, and you specify (say) a File rule, the matcher presumes implicit inclusion rules for the corresponding site and library. So, if you create an Include File rule that matches (for example) /MySite/MyLibrary/MyFile, there is an implied Site Include rule for /MySite, and an implied Library Include rule for /MySite/MyLibrary. Similarly, if you create a Library Include rule, there is an implied Site Include rule that corresponds to it. Note that these shortcuts only apply to Include rules - there are no corresponding implied Exclude rules. What this means is that you should probably be declaring file rules with * as the file name for each
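As an illustration of that advice (a hypothetical rule, not taken from the documentation): to index every file in a library, you would typically declare

    Path: /MySite/MyLibrary/*    Rule type: File    Action: Include

and the matcher then implies a Site Include rule for /MySite and a Library Include rule for /MySite/MyLibrary.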
Re: Unable to index Windows share repositories
Nothing has changed as far as the connectors are concerned. Is your domain controller now upgraded to a different version of Windows too? If so, you may need to play around with the fields that are used for authorization, e.g. the form of the username and/or the domain name. Windows is not an open platform and they change stuff all the time, but to the best of my knowledge they have not introduced any new authentication modes in Windows 7, so something should work. If not, the guy to talk with is Michael Allen, who maintains the jcifs library. Karl

On Fri, Feb 10, 2012 at 2:17 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, Until recently, I had been using ManifoldCF trunk code (before 0.4 was released) on Windows XP. I was able to index files from Windows share repositories successfully into Solr. Now, I have started using the ManifoldCF 0.4 version on Windows 7. With the new setup, I am able to index files from the file system repository with no issue, but I have problems indexing data from the Windows share repository. The job starts and ends with Result Description: Authorization: Access is denied. in the Simple History. The log file has the message JCIFS: Authorization exception reading document/directory smb://nhance29/TestMails/ - skipping. Can you please tell me what needs to be done to resolve this? I tried enabling Debug from properties.xml, and this is what I get in the log file:

DEBUG 2012-02-10 12:34:37,869 (Startup thread) - Connecting to: smb://GLOBAL;stgserver:password@nhance29/
DEBUG 2012-02-10 12:34:37,907 (Startup thread) - Seed = 'smb://nhance29/TestMails/'
DEBUG 2012-02-10 12:34:39,781 (Worker thread '1') - JCIFS: getVersions(): documentIdentifiers[0] is: smb://nhance29/TestMails/
DEBUG 2012-02-10 12:34:44,417 (Worker thread '1') - JCIFS: In checkInclude for 'smb://nhance29/TestMails/'
DEBUG 2012-02-10 12:34:44,417 (Worker thread '1') - JCIFS: Matching startpoint 'smb://nhance29/TestMails/' against actual 'smb://nhance29/TestMails/'
DEBUG 2012-02-10 12:34:44,417 (Worker thread '1') - JCIFS: Startpoint found!
DEBUG 2012-02-10 12:34:44,417 (Worker thread '1') - JCIFS: Startpoint: always included
DEBUG 2012-02-10 12:34:44,417 (Worker thread '1') - JCIFS: Leaving checkInclude for 'smb://nhance29/TestMails/'
DEBUG 2012-02-10 12:34:44,421 (Worker thread '1') - JCIFS: Processing 'smb://nhance29/TestMails/'
DEBUG 2012-02-10 12:34:44,421 (Worker thread '1') - JCIFS: 'smb://nhance29/TestMails/' is a directory
WARN 2012-02-10 12:34:44,425 (Worker thread '1') - JCIFS: Possibly transient exception detected on attempt 1 while listing files: Access is denied.
jcifs.smb.SmbAuthException: Access is denied.
at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:546)
at jcifs.smb.SmbTransport.send(SmbTransport.java:640)
at jcifs.smb.SmbSession.send(SmbSession.java:238)
at jcifs.smb.SmbTree.send(SmbTree.java:119)
at jcifs.smb.SmbFile.send(SmbFile.java:775)
at jcifs.smb.SmbFile.doFindFirstNext(SmbFile.java:1986)
at jcifs.smb.SmbFile.doEnum(SmbFile.java:1738)
at jcifs.smb.SmbFile.listFiles(SmbFile.java:1715)
at jcifs.smb.SmbFile.listFiles(SmbFile.java:1704)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileListFiles(SharedDriveConnector.java:2224)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:701)
at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:561)
Thanks and Regards, Swapna.
Re: Unable to index Windows share repositories
Good to hear. The connector, by the way, is resigned to the fact that sometimes various things fail when talking to Windows, which is why you see the transient-failure notification; it will retry on its own eventually without killing the job, and only give up when things don't work for an extended period of time. Karl On Fri, Feb 10, 2012 at 5:08 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, Not sure why, but now I am able to index data from Windows Share repositories into Solr. I don't get the Access denied messages any more, although I haven't changed anything. Sorry for the inconvenience caused. Will get back again if I see any issue. Thanks and Regards, Swapna.
Re: Web Crawl using ManifoldCF
On Wed, Feb 8, 2012 at 8:24 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl I want to thank you for your help regarding the SharePoint to Solr connections; everything seems to be working properly after getting the Viewers and Home Owners group permissions set properly by our SharePoint Admins. That's great news! Thanks for sticking with it. ;-) However, I have another question regarding pulling site content from the SharePoint instance, and not the files stored on the SharePoint instance. When creating a Repository connection, would you use the Web connection type to pull site content? If that is the case, when creating the job, do you indicate just the site URL you want to crawl to pull site content in the Seed tab? Are we using the correct repository connection? Is there a repository type we can use to just crawl websites for the content and not files? I think that's the right approach, if there's a document you can crawl somewhere that has a reference to the other documents, or the documents all refer to each other. You need such a document or documents at the root of a document web, otherwise a web crawler has no way of locating the documents in question. That would be how you identify your seed document. For typical (non-SharePoint) sites, that's usually the main URL of the site. So, for example, if you wanted to crawl cnn.com you'd probably use a seed of http://www.cnn.com, because that's a good place to start to get to all of CNN's content. If no such document(s) exist, then web crawling is not going to do it. If this site is served by SharePoint, then some kind of enhancement to the SharePoint connector would be a better approach. Thanks, Karl As you can see, I hope I have explained myself properly; we are trying to crawl just the site content. Thanks Dan
Re: ManifoldCF's dist/shapoint-integration dir
Ok, let's do one thing at a time. First: For the Path tab where there are Path Rules, are these the paths we want ManifoldCF to follow? Each site, and each Library like Documents and Shared Documents. And in the Metadata tab, this is the tab where you indicate for each Site and Library you want to include specific metadata or include all metadata? For SharePoint, there are Path Rules and Metadata Rules. The Path Rules describe what documents you want to include or exclude. The Metadata Rules describe what metadata you want to include or exclude. For right now I would ignore the Metadata Rules and just make sure you have Path Rules that actually include documents. As I run the report, I see Documents, Active, and Processed, where the numbers change under the Active column as well as the Documents and Processed columns (these just get larger, where Active changes). This report we actually call the Job Status screen. The fact that the numbers get larger and the job doesn't just end indicates that you are successfully crawling your SharePoint, and you have set up the job to include at least some documents. This is good news. However, this is NOT the Simple History report I was alluding to earlier. To get to that report, click on the Simple History link in the left-hand navigation area. This report will show the events of your choice (default: ALL recorded events) over a given time window (default: the last hour). If you've done this right you should at least see a Job start event. The events you are most interested in are fetch (which describes all attempts to fetch documents from SharePoint) and document ingest (which describes attempts to get documents into Solr). You can refresh the displayed events by clicking the Go button in the middle of the screen whenever you wish. I'd like you to delete your job, create it again, and start it. Then, while it is running, I'd like you to go to the Simple History screen, select the appropriate connection (your SharePoint repository connection), and click the Go button. So as not to skip anything basic: (1) What event types do you see? (2) Are there fetch events? (3) Are there document ingest events? If you see no fetch events, that implies you have either not specified any documents to include in your job, OR your Solr connection is configured to reject too many document types, so they are all getting filtered out. If you see document ingest events, but those have errors, it implies that the configuration of your Solr connection is incorrect and does not match the way your Solr is configured. If you send me a specific error code and/or text, I can help you figure out what is happening. If you see document ingest events with NO errors, but the Solr instance is not getting documents, you are describing an impossible situation. While your Solr instance may not be configured to have the Extracting Update Handler active, or it may be at a different URL than what you pointed at, that would definitely yield errors or notifications in the Simple History. Please let me know what you actually see. Karl On Tue, Jan 31, 2012 at 7:53 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl I am trying to figure out why I can't see anything being indexed into our Solr index. I was looking at another post where you were working with Martijn, and that individual was not able to see info getting into Solr. In the job that I have set up, I have included all metadata associated with each site, Shared Documents, and Documents.
In the Solr Field Mapping, I am associating metadata fields that are indicated in the Metadata tab with fields that exist in our Solr index. For the Path tab where there are Path Rules, are these the paths we want ManifoldCF to follow? Each site, and each Library like Documents and Shared Documents. And in the Metadata tab, this is the tab where you indicate for each Site and Library you want to include specific metadata or include all metadata? As I run the report, I see Documents, Active, and Processed, where the numbers change under the Active column as well as the Documents and Processed columns (these just get larger, where Active changes). While I was researching why I may not be seeing something over on the Solr side, I saw your communication with another individual indicating that I should see something like literal.xxx=yyy in the Solr log. This is an older post, so there may be something else I should see. But the only thing I see when I look at the Solr log is [ ] webapp=/solr path=/update/extract params={commit=true} status=0 QTime=0. Any ideas? Thanks From: Karl Wright [daddy...@gmail.com] Sent: Monday, January 30, 2012 10:40 AM To: Silvia, Daniel [USA] Subject: Re: ManifoldCF's dist/shapoint-integration dir The default time range for the Simple History is the last hour. I suspect you are unaware of that. If you
Re: ManifoldCF's dist/shapoint-integration dir
I should clarify that the reason for deleting and recreating the job is that ManifoldCF crawls incrementally. If you just run a job a second time, you may well not get any documents if none have changed since the first time the job was run. Thanks, Karl
Re: ManifoldCF's dist/shapoint-integration dir
When I select only the fetch activity, I don't see anything in the events; when I select the Document Ingest activity, I don't see anything in the events. So either you've already run the job and the documents were accessed the first time (and won't be accessed again until they change), or the problem is likely that your SharePoint Path Rules are not including any documents. It would be very helpful at this point to include a screen shot of the job you've created. Since you are not on the net, perhaps you can jot down your SharePoint path rules for me to have a look at, as they are displayed when you view the job. Thanks, Karl On Tue, Jan 31, 2012 at 9:44 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl Ok, I have created a new job, ran the job, and went to the Simple History Report. I see the Events. If all the Activities in the Simple History Report (Document Deletion (SolrPipeline), Document Ingest (SolrPipeline), and Fetch) are selected, I see a start job and an end job for events. When I get to the Simple History Report I can select the Connection; I don't have an option to select the Activities until I run the report first. When I select only the fetch activity, I don't see anything in the events; when I select the Document Ingest activity, I don't see anything in the events. My Solr output connection has the following information:
Protocol: http
Server: the server name
Port: 8080 (we are running Solr on JBoss port 8080)
Web Application Name: solr
Core Name: collection1
Update Handler: update/extract
Remove Handler: /update
Status Handler: /admin/ping
Re: ManifoldCF's dist/shapoint-integration dir
It's been a while since I've set up a SharePoint job, but I think what you are missing is a file rule (instead of just a library rule). Here's what the end-user documentation says on the matter: Each rule consists of a path, a rule type, and an action. The actions are Include and Exclude. The rule type tells the connection what kind of SharePoint entity it is allowed to exactly match. For example, a File rule will only exactly match SharePoint paths that represent files - it cannot exactly match sites or libraries. The path itself is just a sequence of characters, where the * character has the special meaning of being able to match any number of any kind of characters, and the ? character matches exactly one character of any kind. The rule matcher extends strict, exact matching by introducing a concept of implicit inclusion rules. If your rule action is Include, and you specify (say) a File rule, the matcher presumes implicit inclusion rules for the corresponding site and library. So, if you create an Include File rule that matches (for example) /MySite/MyLibrary/MyFile, there is an implied Site Include rule for /MySite, and an implied Library Include rule for /MySite/MyLibrary. Similarly, if you create a Library Include rule, there is an implied Site Include rule that corresponds to it. Note that these shortcuts apply only to Include rules - there are no corresponding implied Exclude rules. What this means is that you should probably be declaring file rules with * as the file name for each library, rather than a library rule. You might want to just try this (see the sketch at the end of this message). If you still have trouble, you can try setting the org.apache.manifoldcf.connectors property to DEBUG in the properties.xml file and restarting ManifoldCF before your next crawl. The manifoldcf.log file will then have output describing the decisions the SharePoint connector made about each site, library, file, or folder it encountered. Thanks, Karl On Tue, Jan 31, 2012 at 10:27 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl The Path Rules are:
Path Match: /Shared Documents Type: library Action: include
Path Match: /IDD/Shared Documents Type: library Action: include
Path Match: /IDD/Documents Type: library Action: include
Path Match: /manifoldcf/Shared Documents Type: library Action: include
I hope this helps. I really appreciate your help.
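To make that concrete, here is a sketch of what Dan's rules might become under Karl's suggestion. The paths are the ones Dan listed, with a wildcard file name appended and the type changed to file; whether each library actually needs this depends on the site layout, so treat these as illustrative rather than definitive:
  Path Match: /Shared Documents/* Type: file Action: include
  Path Match: /IDD/Shared Documents/* Type: file Action: include
  Path Match: /IDD/Documents/* Type: file Action: include
  Path Match: /manifoldcf/Shared Documents/* Type: file Action: include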
Re: Cannot find OracleDriver
MCF's Oracle support was written against earlier versions of the Oracle driver. It is possible that they have changed the driver class. If the driver winds up in the dist/connector-lib directory (I'm assuming you are using trunk or 0.4-incubating), then it should be accessible. Could you please try the following: jar -tf ojdbc6.jar | grep oracle/jdbc/OracleDriver ... assuming you are using Linux? If the driver class IS found, then the other possibility is that the jar is compiled against a later version of Java than the one you are using to run MCF. Please let me know what you find. Karl On Thu, Jan 19, 2012 at 1:43 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I downloaded MCF and started playing with the default setup under Jetty and Derby. It starts up without any issue. I would like to connect to our Oracle database and import data into Solr. I placed the ojdbc6.jar file in the connectors/jdbc/jdbc-drivers directory, as stated in the README instruction file, to use the Oracle driver. I ran ant build from the main directory, and restarted the example in dist/example using Jetty. When I set up a connection, MCF throws an exception stating that it cannot find the oracle.jdbc.OracleDriver class. Looking in the connector-lib directory, the Oracle jar is there. I also tried placing the ojdbc6.jar in the dist/example/lib directory, but that didn't fix the problem either. Can anyone point me in the right direction? TIA -- This e-mail and any files transmitted with it may be proprietary. Please note that any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of Apogee Integration.
Re: Cannot find OracleDriver
I was able to reproduce the problem. I'll get back to you when I figure out what the issue is. Karl On Thu, Jan 19, 2012 at 2:47 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I've used the jar file in NetBeans to connect to the database without any issue. It seems more like a class-loader issue. On Thu, Jan 19, 2012 at 2:41 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I have the latest release from the Apache ManifoldCF site (i.e. 0.3-incubating). I checked the driver jar file with WinZip, and the driver name is still the same (oracle.jdbc.OracleDriver). I'm running Java 1.6.0_18-b7 on Windows XP SP 3.
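A small aside: Karl's pipe assumes Linux, and Matthew turned out to be on Windows. The equivalent check there would be something like the line below (findstr in place of grep; jar comes from the JDK's bin directory, which is assumed to be on the PATH):
  jar -tf ojdbc6.jar | findstr oracle/jdbc/OracleDriver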
Re: Cannot find OracleDriver
The problem is that the JDBC driver is using a pool driver that is in common with the core of ManifoldCF. So the connector-lib path, which only the connectors know about, won't do. That's a bug, which I'll create a ticket for. A temporary fix, which is slightly involved, requires you to put the ojdbc6.jar in the example/lib area, as you already tried, but in addition you will need to explicitly include the jar in your classpath. Normally the start.jar's manifest describes all the jars in the initial classpath. I thought it was possible to also include additional classpath info through the normal --classpath mechanism, but that doesn't seem to work, so you may be stuck with modifying the root build.xml file to add the jar to the manifest (see the sketch below). I'm going to experiment a bit and see if I can come up with something quickly. Karl
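For illustration, the manifest change would be along these lines in Ant. This is a hedged sketch, not the actual ManifoldCF build.xml (the real target layout and jar list differ); the point is simply that ojdbc6.jar has to be appended to the Class-Path attribute carried by start.jar's manifest:
  <jar destfile="dist/example/start.jar">
    <manifest>
      <!-- append ojdbc6.jar to the jars the build already lists here -->
      <attribute name="Class-Path" value="lib/mcf-core.jar lib/ojdbc6.jar"/>
    </manifest>
  </jar>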
Re: Programmatic Interaction with ManifoldCF
Hi Swapna, Passwords are stored in obfuscated form. There's a different method call to set passwords accordingly, which performs the obfuscation. See org.apache.manifoldcf.core.interfaces.ConfigParams.setObfuscatedParameter(String key, String value). Thanks, Karl On Mon, Jan 16, 2012 at 6:23 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I have looked at the examples you suggested and they have been very helpful. I am using the API to get/put different jobs, repository connections, etc., and everything is working fine. But I have an issue when creating a Windows Share repository connection. I am using code something like the below:
Configuration connectionConfiguration = new Configuration();
addParameterNode(connectionConfiguration, "Server", serverName);
addParameterNode(connectionConfiguration, "Domain/Realm", "GLOBAL");
addParameterNode(connectionConfiguration, "User Name", userName);
addParameterNode(connectionConfiguration, "Password", password);
Then I am using this connectionConfiguration to create the repository connection. The connection is getting created without any issue, but when I check the status using the crawler UI, it shows that the connection is not working. When I edit the connection (from the crawler UI) to see the details, the password is shown as empty. Can you please tell me how to create a Windows Share repository connection using this API such that it works by using the credentials that are sent as arguments? Thanks and Regards, Swapna. On Thu, Jan 5, 2012 at 11:43 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Thanks Karl, I have started looking into the API Service. Will get back to you with more specific questions once into it. Thanks and Regards, Swapna. On Tue, Jan 3, 2012 at 7:47 PM, daddy...@gmail.com daddy...@gmail.com wrote: The preferred way to do this is via the API Service. See Chapter 3 of ManifoldCF in Action. There are examples at http://manifoldcfinaction.googlecode.com/svnroot/trunk from the book. Karl Sent from my Nokia phone -Original Message- From: Swapna Vuppala Sent: 03/01/2012, 6:55 AM To: connectors-user@incubator.apache.org Subject: Programmatic Interaction with ManifoldCF Hi, I am looking for the best way to interact with ManifoldCF programmatically for my purposes. My target is to develop a small command-line tool which can read an XML file to get the list of locations that have to be crawled, and create repository connections and jobs that use the created repository connections, with paths that have been read from the XML file. If I write a Java program for this, which API should I be using? Earlier, I looked at scripts that can be run to create repository connections, jobs, etc. Should I run such scripts from my Java program, or is there a better way to approach this? Or is it possible to use the classes of ManifoldCF in my program to achieve this? If so, how? Can you please direct me to the ideal approach? Thanks and Regards, Swapna.
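A hedged sketch of the call Karl names, using the in-process Java interfaces. Note that Swapna's snippet goes through the API-service Configuration objects; this sketch only shows where setObfuscatedParameter fits. The parameter names mirror her snippet, and obtaining the IRepositoryConnection and saving it afterwards are assumed to happen elsewhere:
  import org.apache.manifoldcf.core.interfaces.ConfigParams;
  import org.apache.manifoldcf.crawler.interfaces.IRepositoryConnection;

  // Sketch: fill in a Windows Share connection's parameters, storing the
  // password via setObfuscatedParameter() so it round-trips correctly.
  public class SetShareCredentials {
      public static void setCredentials(IRepositoryConnection connection,
              String serverName, String userName, String password) {
          ConfigParams cp = connection.getConfigParams();
          cp.setParameter("Server", serverName);
          cp.setParameter("Domain/Realm", "GLOBAL");
          cp.setParameter("User Name", userName);
          cp.setObfuscatedParameter("Password", password); // not setParameter()
      }
  }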
Re: required attribute of solr-integration security fields
required=true affects the update handler, though, and ManifoldCF does not send __nosecurity__ as a value; it expects Solr to add it. So without the default value, the Solr 3.x and Solr 4.x components do not work. ManifoldCF in Action has its own example, which doesn't use __nosecurity__, but is slower. The book is now out of date in this regard, though. You should not mix the schema.xml from the book with code from the ManifoldCF tree. Karl On Tue, Jan 10, 2012 at 3:44 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: The tests for the components would not pass if required was true. I ran ant test for the components and it passed when adding required=true. If someone forgets to set the default value of __nosecurity__, required=true is effective. In fact I forgot to add the default value, so I couldn't search anything. That is because I used the security fields described in ManifoldCF in Action, which don't have the default attribute. So I think setting required=true is helpful. Regards, Shinichiro Abe
Re: required attribute of solr-integration security fields
I might as well clearly specify the necessity of field value by using required=true. What do you think? If you do that, the Solr Connector will cease to work. Try it if you do not believe me. The current contract is that the connector sends in tokens of each type to Solr, and will send zero tokens if it has none. In that case, if you set required=true, Solr will reject the document with an error. Karl On Tue, Jan 10, 2012 at 7:43 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi, So without default value, the solr 3.x and solr 4.x components do not work. Okay, I understand we must have the default value. Then do we need the required attribute itself? I don't mind having the required attribute or not (because we have the default value), but I might as well clearly specify the necessity of a field value by using required=true. What do you think? Thank you, Shinichiro Abe
Re: required attribute of solr-integration security fields
The fields should be required=false but with a default value of __nosecurity__. I believe that means that if there is no field value attached to the document when it is sent to Solr, Solr will make sure it has the value __nosecurity__. The tests for the components would not pass if required was true, so I am a little puzzled as to why you feel there is a problem here? Here's what the tests use for the schema:
<!-- MCF Security fields -->
<field name="allow_token_document" type="string" indexed="true" stored="false" multiValued="true" default="__nosecurity__"/>
<field name="deny_token_document" type="string" indexed="true" stored="false" multiValued="true" default="__nosecurity__"/>
<field name="allow_token_share" type="string" indexed="true" stored="false" multiValued="true" default="__nosecurity__"/>
<field name="deny_token_share" type="string" indexed="true" stored="false" multiValued="true" default="__nosecurity__"/>
Here's how the test documents are added:
assertU(adoc("id", "da12", "allow_token_document", "token1", "allow_token_document", "token2"));
assertU(adoc("id", "da13-dd3", "allow_token_document", "token1", "allow_token_document", "token3", "deny_token_document", "token3"));
assertU(adoc("id", "sa123-sd13", "allow_token_share", "token1", "allow_token_share", "token2", "allow_token_share", "token3", "deny_token_share", "token1", "deny_token_share", "token3"));
assertU(adoc("id", "sa3-sd1-da23", "allow_token_document", "token2", "allow_token_document", "token3", "allow_token_share", "token3", "deny_token_share", "token1"));
assertU(adoc("id", "notoken"));
Karl On Mon, Jan 9, 2012 at 11:12 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi. README[1] of solr-integration says that you will need to add security fields, and to specify required=false. I think I should specify required=true, because MCF connectors do not always return tokens, and we can't search anything if these fields have no tokens (that is, null; the fields don't even have the __nosecurity__ value that stands for no security token) when using the MCF security plugin. May I open a JIRA ticket for modifying the README? Is there a reason it should be required=false? [1]https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/solr/integration/README-3.x.txt Regards, Shinichiro Abe
Re: Jetty configuration
The single-process example was originally conceived as just a quick-and-dirty way to get ManifoldCF running, and nobody thought it would ever become a serious deployment model. But having said that, I believe it is pretty straightforward to add support for more Jetty configuration options. Please consider opening a ticket and specifying what you'd like to see as far as Jetty configuration support. I suppose just supporting a jetty.xml would be sufficient? Karl On Wed, Dec 28, 2011 at 1:17 PM, M Kelleher mj.kelle...@gmail.com wrote: In spite of the class that starts Jetty and MCF, is there still a way to configure Jetty to use any of the supported OPTIONS, or specify a jetty.xml? I would like to enable JMX and configure BASIC authentication for the container, and also enable a plugin that will allow me to specify what IP addresses Jetty will respond to. Before I write my own wrapper replacement for start.jar and the invocation of the ManifoldCF Jetty starter class, I was hoping that there was a built-in way to specify these kinds of configurations. Thanks. Sent from my iPad
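For context, the kind of jetty.xml support being discussed would let the quick start accept a stanza like the one below. This is a hedged sketch in the Jetty 6.x configuration style of the era, not anything ManifoldCF currently reads; the host value is an illustrative assumption addressing Michael's IP-restriction point, and 8345 is the quick-start's default port:
  <Configure id="Server" class="org.mortbay.jetty.Server">
    <Call name="addConnector">
      <Arg>
        <New class="org.mortbay.jetty.nio.SelectChannelConnector">
          <!-- bind to a single address so Jetty only answers there -->
          <Set name="host">127.0.0.1</Set>
          <Set name="port">8345</Set>
        </New>
      </Arg>
    </Call>
  </Configure>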
Re: Incubator status, Manning email, is any action needed by users?
is the continued Incubator status a problem? It's only a problem in that some potential users of the software may avoid it due to this status, which is unfortunate. It also limits book sales. Is there something more we, as a group, need to do to push this forward? The decision not to pursue graduation at this time has to do with nothing more than the percentage of commits that are done by all the active committers, and their distribution. The ASF wants its projects well-covered, and ManifoldCF has not yet achieved that. In all other respects, I believe ManifoldCF would have no problem being able to graduate. For people who follow the project, this means basically that we need your contributions and your continued involvement. If you contribute consistently and well, you may be asked to become a committer, and that would certainly help the project towards graduation. Karl On Mon, Dec 12, 2011 at 11:40 PM, Mark Bennett mbenn...@ideaeng.com wrote: I got the email from Manning mentioning that the print book would be delayed, in favor of updates to the electronic copy, since the incubation period has been extended. This seems quite reasonable. This is NOT a post about the book - if it were, I'd post to that board. But more importantly, is the continued Incubator status a problem? Is it something we need to do something about? Is there something more we, as a group, need to do to push this forward? For example, something with docs or unit tests to comply with ASF? Or starting a blog campaign to lobby the ASF? Or is this even a problem to be worked on? Maybe there are good reasons for it, and Karl is content? I'm a newbie to these lists, so wanted to ask before doing anything. Only saw one recent post on the topic. Thanks, Mark -- Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
Re: URL modification
The existing canonicalization logic is complex and cannot correctly be represented as a simple regexp (or I would have simply done it that way in the first place ;-) ). But we could certainly entertain the notion of adding arbitrary parameter removal/addition as one of the kinds of canonicalization that could be done. If you think you need this enhancement, please create a ticket for it and we'll mull it over. Thanks, Karl On Wed, Dec 7, 2011 at 8:59 AM, Michael Kelleher mj.kelle...@gmail.com wrote: Is it possible to modify the URLs at all at collection time or before fetch time? There is a URL parameter I would like to remove before the URL is fetched. Canonicalization seems to do that, but the modification types are fixed (remove JSP sessions, ASP sessions, PHP sessions, BV sessions). It does not seem to allow a regex to transform the URL. thanks
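As an aside, the parameter stripping Michael describes is simple enough in isolation; what is hard, per Karl, is combining it correctly with the connector's existing canonicalization. A hedged sketch of just the stripping step in plain Java, where the parameter name sessionid and the URL are made-up examples and this is not the web connector's canonicalizer:
  public class StripParam {
      public static void main(String[] args) {
          String url = "https://example.com/page?a=1&sessionid=XYZ&b=2";
          String cleaned = url
              .replaceAll("([?&])sessionid=[^&]*&?", "$1") // drop the parameter
              .replaceAll("[?&]$", "");                    // tidy a trailing ? or &
          System.out.println(cleaned); // https://example.com/page?a=1&b=2
      }
  }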
Re: WEB: Illegal seed URL
The URL as stated is fine and is pretty standard. I don't think there's a problem there, unless you inadvertently fixed something when you changed the hostname. Can you look at the log - there may well be a stack trace, especially if you have <property name="org.apache.manifoldcf.connectors" value="DEBUG"/> set. I'd love to see what the trace is. Karl On Tue, Dec 6, 2011 at 1:52 PM, Michael Kelleher mj.kelle...@gmail.com wrote: Here is my seed URL (minus the hostname): https://hostname.com/vwebv/search?searchArg=dvd&searchCode=SALL&searchType=1&recCount=100 I am using a Web Crawler connection that has been tested with the NullOutputConnector, so I don't think the issue can be there. I am also using the Solr Output Connector - this had been throwing an exception till I fixed the core name - this is the first time I have used it. So maybe I don't have things configured correctly here. However, there are no exceptions in the log. Also, I am not using authentication at all on Solr. I looked at the class connectors\webcrawler\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\webcrawler\WebcrawlerConnector.java and it was not obvious what the issue is. Also, in logging.ini I changed the logging level to DEBUG and restarted before I tested the crawl, which further obscures the logic in WebcrawlerConnector.java to me. Is there somewhere else I can set logging levels? I am not sure my change to logging.ini is having any effect. Also, is there some other test you might suggest? Thanks. --mike
Re: Problem crawling windows share
About your capture - Michael Allen says the following: Actually this has nothing to do with DFS. JCIFS does not get to the point where it does DFS anything. The capture shows a vanilla STATUS_LOGON_FAILURE when GLOBAL\swapna.vuppala tries to auth with l-carx01.global.arup.com. So the possible causes for this are 1) the account name is not valid, 2) the supplied password is incorrect, 3) some security policy is deliberately blocking that user or particular type of auth, or 4) some server configuration is incompatible with JCIFS. I only mention this last option because I noticed the target server has security signatures disabled. That's strange. If they're messing around with things like that, who knows what their clients are expected to do. Try a Windows client that uses NTLM instead of Kerberos. Meaning, try a machine that is not joined to the domain, so that when you try to access the target it asks you for credentials, at which point you can test with GLOBAL\swapna.vuppala. Then it will use NTLM and you can actually compare captures. If the operator doesn't have a laptop or something not joined to the domain, it might be sufficient to log into a workstation using machine credentials and not domain credentials. Also, when testing JCIFS you should use a simple stand-alone program like examples/ListFiles.java. In other words: (a) Since JCIFS does not use Kerberos for authentication, you need to try to log into the recalcitrant server via Windows without using Kerberos to be able to do a side-by-side comparison. Michael has some ways of doing that, above. (b) You may find that it doesn't work, in which case JCIFS is not going to work either. (c) If it *does* work, then try to generate your side-by-side comparisons using a simpler example rather than ManifoldCF in toto; you can see how at jcifs.samba.org, or I can help you further. He also mentions that there is some bizarreness in the response that indicates that the server is configured in a way that he's never seen before. And believe me, Michael has seen a *lot* of strange configurations... Hope this helps. Karl On Mon, Nov 28, 2011 at 4:12 AM, Karl Wright daddy...@gmail.com wrote: That should read properties.xml, not properties.ini. It looks like this page needs updating. The debug property in the XML form is: <property name="org.apache.manifoldcf.connectors" value="DEBUG"/> I don't think it will provide you with any additional information that is useful for debugging your authentication issue, however, if that is why you are looking at it. There may be some jcifs.jar debugging switches that might be of more help, but in the end I suspect you will need a packet capture of both a successful connection (via Windows) and an unsuccessful one (via MCF). The person you will need to talk with after that is the jcifs author, Michael Allen; I can give you his email address if you get that far. Karl On Mon, Nov 28, 2011 at 1:30 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I was planning to debug the jCIFS repository connection using Wireshark and I came across this: https://cwiki.apache.org/CONNECTORS/debugging-connections.html Here, I see something like add org.apache.manifoldcf.connectors=DEBUG to the properties.ini file. Is it the properties.xml file that is being referred to here? If not, where do I find the properties.ini file? Thanks and Regards, Swapna. On Thu, Nov 17, 2011 at 1:31 PM, Karl Wright daddy...@gmail.com wrote: See http://jcifs.samba.org/src/docs/api/overview-summary.html#scp.
The properties jcifs.smb.lmCompatibility and jcifs.smb.client.useExtendedSecurity are the ones you may want to change. These two properties go together, so certain combinations make sense and others don't; there are really only a few combinations you need, but I'll need to look at what they are and get back to you later today. As far as setting the switches is concerned, if you are using the Quick Start you do this trivially by: java -Dxxx -Dyyy -jar start.jar If you are using the multi-process configuration, that is what the defines directory is for; you only need to create files in that directory with the names jcifs.smb.lmCompatibility and jcifs.smb.client.useExtendedSecurity, containing the values you want to set. Karl On Thu, Nov 17, 2011 at 1:11 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I am able to access the folders on the problem server through Windows Explorer (\\server3\Folder1). I tried a couple of things with the credentials form, changing username, domain, etc., but I keep getting the same error: Couldn't connect to server: Logon failure: unknown user name or bad password Can you tell me more about the -D switch you were talking about? Thanks and Regards, Swapna. On Tue, Nov 15, 2011 at 12:40 PM, Karl Wright daddy...@gmail.com wrote: Glad you chased it down this far. First thing to try is whether you can get into the problem server using Windows
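Concretely, Karl's Quick Start pattern with those two properties filled in would look like the line below. The values shown are illustrative assumptions; as he notes, only certain combinations of the two make sense together:
  java -Djcifs.smb.lmCompatibility=3 -Djcifs.smb.client.useExtendedSecurity=true -jar start.jar
And here is a hedged sketch of the kind of simple stand-alone jcifs test Michael Allen recommends, in the spirit of (but not a copy of) examples/ListFiles.java. The domain, user, password, and share are placeholders drawn from this thread:
  import jcifs.smb.NtlmPasswordAuthentication;
  import jcifs.smb.SmbFile;

  // Minimal jcifs connectivity check: authenticate and list one share.
  public class ListShare {
      public static void main(String[] args) throws Exception {
          NtlmPasswordAuthentication auth =
              new NtlmPasswordAuthentication("GLOBAL", "swapna.vuppala", "password");
          SmbFile dir = new SmbFile("smb://server3/Folder1/", auth);
          for (SmbFile f : dir.listFiles())
              System.out.println(f.getName());
      }
  }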
Re: Export crawled URLs
Well, the history comes from the repohistory table, yes - but you may not be able to construct a query with entityid=jobs.id, first of all because that is incorrect (what the entity field contains is dependent on the activity type), and secondly because that column is potentially long and only some kinds of queries can be done against it. Specifically, it cannot be built into an index on PostgreSQL. Karl On Sun, Dec 4, 2011 at 7:50 PM, Hitoshi Ozawa ozawa_hito...@ogis-ri.co.jp wrote: Is history just entries in the repohistory table with entityid = jobs.id? H.Ozawa (2011/12/03 1:43), Karl Wright wrote: The best place to get this from is the simple history. A command-line utility to dump this information to a text file should be possible with the currently available interface primitives. If that is how you want to go, you will need to run ManifoldCF in multiprocess mode. Alternatively you might want to request the info from the API, but that's problematic because nobody has implemented report support in the API as of now. A final alternative is to get this from the log. There is an [INFO] level line from the web connector for every fetch, I seem to recall, and you might be able to use that. Thanks, Karl On Fri, Dec 2, 2011 at 11:18 AM, M Kelleher mj.kelle...@gmail.com wrote: Is it possible to export / download the list of URLs visited during a crawl job?
Re: Exception while processing document
Hmm, this is a new one for me. Can you include the entire trace? Karl On Sat, Nov 26, 2011 at 2:43 PM, Michael Kelleher mj.kelle...@gmail.com wrote: I get the following exception: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty Anyone know what this relates to and how to fix it? I currently have nutch 1.3 crawling the same site without exceptions.
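For what it's worth, the trustAnchors message generally means the JVM found no CA certificates at all - an empty, missing, or unreadable truststore - rather than anything crawl-specific. Assuming the Quick Start invocation, pointing the JVM at a known-good cacerts file explicitly is a quick way to confirm; javax.net.ssl.trustStore and javax.net.ssl.trustStorePassword are standard JSSE properties, and the path below is a placeholder:

  java -Djavax.net.ssl.trustStore=/path/to/jre/lib/security/cacerts -Djavax.net.ssl.trustStorePassword=changeit -jar start.jar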
Re: Question about deploy to Tomcat
On Sat, Nov 26, 2011 at 8:02 PM, Michael Kelleher mj.kelle...@gmail.com wrote: I have been reading file: http://incubator.apache.org/connectors/how-to-build-and-deploy.html#Running+ManifoldCF If I am reading this correctly, it seems that the Database initialization is completely manual, and does not happen at MCF startup time. Is this correct? Yes - there is an example set of steps listed in the how-to-build-and-deploy page. This is necessary because of the multiprocess nature of the model. The standalone instance does not seem to work for me, and apparently neither does deploying this to Tomcat. The standalone instance seems fine here, and passes the tests. Can you be more explicit about what is happening for you? Karl
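For reference, a typical manual initialization sequence under the multiprocess model looks roughly like the following. The command class names are as of the 0.3/0.4 timeframe and the superuser arguments and connector choices here are placeholders - the how-to-build-and-deploy page is the authoritative list:

  executecommand.bat org.apache.manifoldcf.core.DBCreate <dbsuperusername> <dbsuperuserpassword>
  executecommand.bat org.apache.manifoldcf.agents.Install
  executecommand.bat org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent
  executecommand.bat org.apache.manifoldcf.agents.RegisterOutput org.apache.manifoldcf.agents.output.solr.SolrConnector "Solr"
  executecommand.bat org.apache.manifoldcf.crawler.RegisterConnector org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector "FileSystem"
  executecommand.bat org.apache.manifoldcf.agents.AgentRun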
Re: Question about deploy to Tomcat
On Sun, Nov 27, 2011 at 5:23 PM, Adam LaPila adam.lap...@lmal.com.au wrote: Hi Michael, I too am having trouble getting MCF with Tomcat to work successfully; if I get it completed I'll be sure to reply with details on how I got it to work. What happened for me: I followed the steps several times on the how-to page. In my browser I could open up the main MCF page, but if I clicked on any of the links (output, repository, job, etc.) it would just open a blank page. It would be the mcf-crawler-ui layout but there would be nothing displayed apart from the default template. Did you have something similar? It sounds like something is wrong with database communication. What database are you using, and are there any exceptions in the Tomcat logs pertaining to ManifoldCF? I bet there are. Karl Cheers, Adam. -Original Message- From: Michael Kelleher [mailto:mj.kelle...@gmail.com] Sent: Sunday, 27 November 2011 12:03 PM To: connectors-user@incubator.apache.org Subject: Question about deploy to Tomcat [original message snipped - quoted in full in the previous thread]
Re: Authority Connection works unpredictably
Hi Swapna, There should be manifoldcf log output that contains the actual stack trace of the exception. That would be very helpful; I need the line numbers. The code is quite simple, and indicates that the LDAP server is refusing a connection:

protected void getSession()
  throws ManifoldCFException
{
  if (ctx == null)
  {
    // Calculate the ldap url first
    String ldapURL = "ldap://" + domainControllerName + ":389";
    Hashtable env = new Hashtable();
    env.put(Context.INITIAL_CONTEXT_FACTORY,"com.sun.jndi.ldap.LdapCtxFactory");
    env.put(Context.SECURITY_AUTHENTICATION,authentication);
    env.put(Context.SECURITY_PRINCIPAL,userName);
    env.put(Context.SECURITY_CREDENTIALS,password);
    //connect to my domain controller
    env.put(Context.PROVIDER_URL,ldapURL);
    //specify attributes to be returned in binary format
    env.put("java.naming.ldap.attributes.binary","tokenGroups objectSid");
    // Now, try the connection...
    try
    {
      ctx = new InitialLdapContext(env,null);
    }
    catch (AuthenticationException e)
    {
      // This means we couldn't authenticate!
      throw new ManifoldCFException("Authentication problem authenticating admin user '"+userName+"': "+e.getMessage(),e);
    }
    catch (CommunicationException e)
    {
      // This means we couldn't connect, most likely
      throw new ManifoldCFException("Couldn't communicate with domain controller '"+domainControllerName+"': "+e.getMessage(),e);
    }
    catch (NamingException e)
    {
      throw new ManifoldCFException(e.getMessage(),e);
    }
  }
  else
  {
    // Attempt to reconnect. I *hope* this is efficient and doesn't do unnecessary work.
    try
    {
      ctx.reconnect(null);
    }
    catch (AuthenticationException e)
    {
      // This means we couldn't authenticate!
      throw new ManifoldCFException("Authentication problem authenticating admin user '"+userName+"': "+e.getMessage(),e);
    }
    catch (CommunicationException e)
    {
      // This means we couldn't connect, most likely
      throw new ManifoldCFException("Couldn't communicate with domain controller '"+domainControllerName+"': "+e.getMessage(),e);
    }
    catch (NamingException e)
    {
      throw new ManifoldCFException(e.getMessage(),e);
    }
  }
  expiration = System.currentTimeMillis() + expirationInterval;
  try
  {
    responseLifetime = Long.parseLong(this.cacheLifetime) * 60L * 1000L;
    LRUsize = Integer.parseInt(this.cacheLRUsize);
  }
  catch (NumberFormatException e)
  {
    throw new ManifoldCFException("Cache lifetime or Cache LRU size must be an integer: "+e.getMessage(),e);
  }
}

Your problem description indicates that it is possible that the ctx.reconnect() call is failing to reconnect, but a new connection works OK on your setup. A stack trace should tell me everything. Thanks, Karl On Wed, Nov 23, 2011 at 12:58 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Even after reducing the max connections to 3, the connection fails abruptly for me. Currently, the domain controller I am using is mapped to only one IP address, and that responds on ping, and the max connections are 3. It was working yesterday and it fails suddenly, throwing different exceptions like those below: Threw exception: 'Couldn't communicate with domain controller 'globalad1': null' Threw exception: 'Couldn't communicate with domain controller 'globalad1.global.arup.com': null' Threw exception: 'globalad1.global.arup.com:389; socket closed' Sometimes, it works when I change the cache lifetime parameter. What other factors do you think can cause this to fail? Thanks and Regards, Swapna. On Tue, Nov 22, 2011 at 11:56 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: OK..
Thanks for the information On Mon, Nov 21, 2011 at 6:31 PM, Karl Wright daddy...@gmail.com wrote: The sAMAccountName and UserPrincipalName LDAP fields were used by different versions of Windows at different points in time. Some backwards compatibility was maintained; however, Microsoft has apparently decided to deprecate one of them (can't remember which), and thus you need support for both. Karl On Mon, Nov 21, 2011 at 6:39 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Yes, my Active Directory authority connection is configured to talk to only one IP address and that particular one is responding to ping always. Earlier, the max connections parameter was set to 10; now I have reduced it to 3. It's working as of now and I'll keep checking whether it's going to throw an exception. Thanks a lot for the inputs. Also, I was wondering what the difference was between the 2 options for the Login name AD attribute, sAMAccountName and UserPrincipalName? Thanks and Regards, Swapna. On Mon, Nov 21, 2011 at 4:57 PM, Karl
Re: Authority Connection works unpredictably
To clarify, what I think may be happening is this. (1) The Java LDAP context is keeping a socket connection to the AD controller. (2) The AD controller must be configured to close connections forcibly after a certain period of time. (3) The LDAP context's reconnect() operation doesn't recover from a socket that was closed by the server. (4) The authority code won't release the LDAP context until 5 idle minutes go by. So basically, a connection winds up in a busted state and doesn't recover, if the server closes the socket out from under the ldap connection. It's easy to fix, so I've opened a ticket (CONNECTORS-291), and will commit code changes to trunk shortly. What version of MCF are you using? Karl On Wed, Nov 23, 2011 at 5:23 AM, Karl Wright daddy...@gmail.com wrote: [quoted getSession() code and earlier messages snipped - they appear in full in the previous thread]
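A sketch of the general shape such a fix could take - not the committed CONNECTORS-291 patch, just an illustration of point (3), with env assumed to be the same environment Hashtable used to create the original context:

  import java.util.Hashtable;
  import javax.naming.CommunicationException;
  import javax.naming.NamingException;
  import javax.naming.ldap.InitialLdapContext;
  import javax.naming.ldap.LdapContext;

  public class LdapReconnect {
    // If reconnect() dies because the server already closed the socket,
    // discard the stale context and build a fresh one from the same environment.
    public static LdapContext reconnectOrRebuild(LdapContext ctx, Hashtable<String,String> env)
        throws NamingException {
      try {
        ctx.reconnect(null);
        return ctx;
      } catch (CommunicationException e) {
        try { ctx.close(); } catch (NamingException ignored) {}
        return new InitialLdapContext(env, null);
      }
    }
  }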
Re: Authority Connection works unpredictably
I've attached a patch to the ticket so even MCF 0.3 users should be able to apply it. Karl On Wed, Nov 23, 2011 at 5:40 AM, Karl Wright daddy...@gmail.com wrote: [quoted messages snipped - they appear in full in the two threads above]
Re: MCF - Oracle to Solr
Hi Adam, Like I said before, the Simple History shows clearly that you have a perfectly reasonable URL for your documents. That is NOT the problem. The URL does not even need to be real; it's just an identifier of sorts as far as ManifoldCF and Solr are concerned. As I said before, you will probably want to make it real eventually, because otherwise there's no way to link back to display the content of your search results, but that's not important for indexing. Many people have indexed JDBC content successfully. But Solr is, on the other hand, very highly configurable, and depending on how you have set up your solrconfig.xml and/or schema.xml file you can certainly get back 500 errors or 400 errors from it when ManifoldCF tries to index something. When that happens, all that usually needs to be done is that either the configuration of the output connection needs to be changed, or the solrconfig.xml and/or schema.xml needs to be changed. So let's start by exploring how you have set up your Solr. Are you running the Solr example without modification? Or have you (or someone else) set Solr up specifically for your search problem? Can you find out where the Solr standard error and standard output is going? If so, you should see output for each document that ManifoldCF tries to index. Do you see this output, and what does it say? I should also mention that several versions of Solr returned 400 errors for zero-length documents indexed through the extracting update handler, which is what ManifoldCF uses. This is not usually a problem anyhow because, although it is noisy, there would not be any content for the document anyway. But is there any possibility that the database field you are indexing as the content field has nothing in it some or all of the time? Karl On Mon, Nov 21, 2011 at 1:01 AM, Adam LaPila adam.lap...@lmal.com.au wrote: Hi Karl, Still no luck. You wouldn't happen to have a link to any good resources on how to index a DB to Solr with MCF, other than the end-user examples from the website? Perhaps some of your own work with the use of a database - it can be Oracle, MySQL, etc. Do you know of anyone who has tried this before and was successful? Any design documents? I've been googling non-stop; surely someone has done this before. With the CONCAT('http://localhost:8080/solr?id=',AIRCRAFT_ID) AS $(URLCOLUMN)..FROM...WHERE) Does the URL need to be the link to the solr example? As the end-user documentation says of URLCOLUMN, "The name of an expected resultset column containing a URL". This is what I have been getting in the Simple History since my last email. 11-21-2011 16:53:28.378 document ingest (Solr) localhost:8080/solr?id=AC004 400 48 1 Bad Request 11-21-2011 16:53:28.347 document ingest (Solr) localhost:8080/solr?id=AC003 400 43 1 Bad Request 11-21-2011 16:53:28.331 document ingest (Solr) localhost:8080/solr?id=AC002 400 34 1 Bad Request 11-21-2011 16:53:28.300 document ingest (Solr) localhost:8080/solr?id=AC001 400 27 1 Bad Request Sorry for any troubles, just a little confused with it all. Regards, Adam. -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Monday, 21 November 2011 11:43 AM To: connectors-user@incubator.apache.org Subject: Re: MCF - Oracle to Solr Hi Adam, The 500 error is coming from Solr, so the place to look is in the Solr logs and output. If you are running the Solr example, you should be seeing stack traces which may shed light on what is happening.
FWIW, I doubt very much that this has anything to do with your URL construction, which looks good based on what the Simple History indicates. Thanks, Karl On Sun, Nov 20, 2011 at 7:02 PM, Adam LaPila adam.lap...@lmal.com.au wrote: Hello, I'm trying to get MCF to index my Oracle repository into my Solr output repository. I have been following the end-user documentation and I'm still having trouble getting things to work. I also have Solr installed and running on a Tomcat server on port 8080. I have set up my output and repository connectors. These seem to be fine, as each has the Connection Working status. I am sure the problem is how I'm setting up my job to extract the database table data into my Solr index. I received an email from Karl a couple of days ago in regards to the queries provided. SELECT CONCAT('http://myserver.com?id=',Aircraft_ID) AS $(URLCOLUMN), ... FROM ... WHERE ... I have changed my query to be more like this. This is what I have as my Data Query: SELECT AIRCRAFT_ID AS $(IDCOLUMN), AIRCRAFT_INFO AS $(DATACOLUMN), CONCAT('http://localhost:8080/solr?id=',AIRCRAFT_ID) AS $(URLCOLUMN) FROM AIRCRAFT WHERE AIRCRAFT_ID IN $(IDLIST) When I run the job, I find that in the simple history I get something like this: document ingest (Solr) http://localhost:8080/solr?id=AC001 500 27 16 Internal Server Error AC001 is one of the IDs
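One way to take ManifoldCF out of the loop entirely is to post a document straight to Solr's extracting update handler and watch what comes back. This assumes the Solr instance from this thread at localhost:8080/solr and the stock /update/extract handler path from the Solr example solrconfig.xml; the id and file name are placeholders:

  curl "http://localhost:8080/solr/update/extract?literal.id=AC001&commit=true" -F "myfile=@test.txt"

A 400 on an empty test.txt but success on a non-empty one would point at the zero-length-document behavior Karl describes above; a 400 on both points at the schema or handler configuration instead.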
Re: Authority Connection works unpredictably
So let me get this straight - your Active Directory authority connection is configured to talk to only one IP address? And that IP address responds to ping even when you are receiving an error back from the authority connection? Another possibility is that the DC can only accept a limited number of connections at a time. What is the max connections parameter for your authority connection? Try reducing it to no more than 3-4 and see if that helps. Karl On Mon, Nov 21, 2011 at 5:34 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I think I see many domain controllers for the domain I am using. But I see only one IP address mapped to the domain controller name that I am using in the credentials form. As I told you, it's working sometimes and throwing an exception sometimes. But ping always works fine on the domain controller name that I am using, from which I assume that it is not unreachable. Can you tell me what else I should be checking, or what other factors could be causing this to fail? Thanks and Regards, Swapna. On Thu, Nov 17, 2011 at 1:18 PM, Karl Wright daddy...@gmail.com wrote: Try doing nslookup on the domain controller. In some larger companies there are many domain controllers, all with the same name but different IPs. These *should* all be in synch, but it may be the case that they are not - or some of them are unreachable or offline. This can also be the cause of intermittent authorization failures during crawling. If that is the case, you have the option of setting the local machine's /etc/hosts file to point to a couple of domain controller instances that are local and in good working order, rather than relying on DNS to find one. Karl On Thu, Nov 17, 2011 at 1:32 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, I seem to have some problem with my Authority Connection. When I define an Authority Connection specifying all the parameters like Domain Controller, username, password etc., the connection status shows Connection Working and everything works fine: crawling and sending docs to Solr, using mcf-authority-service to get only those docs that a user has got permission to see, etc. But suddenly, the connection status for the Authority Connection throws an exception, and when I play around with the credentials form - toggling the Login name AD attribute, or changing the domain controller name, or authentication, or sometimes even with the same settings that threw an exception earlier - the status shows Connection working again. I cannot define when it fails and when it works and for what settings it works. Can someone help me in understanding why this is happening and what needs to be done to make it work always? Thanks and Regards, Swapna.
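If nslookup does show several controllers behind one name, the /etc/hosts pinning Karl describes is just an ordinary hosts entry. The address below is a placeholder for a known-good controller from the nslookup output:

  # /etc/hosts (or C:\Windows\System32\drivers\etc\hosts on Windows)
  10.0.0.21   globalad1.global.arup.com globalad1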
Re: Problem crawling windows share
See http://jcifs.samba.org/src/docs/api/overview-summary.html#scp. The properties jcifs.smb.lmCompatibility and jcifs.smb.client.useExtendedSecurity are the ones you may want to change. These two properties go together, so certain combinations make sense and others don't; there are really only a few combinations you need, but I'll need to look at what they are and get back to you later today. As far as setting the switches is concerned, if you are using the Quick Start you do this trivially by: java -Dxxx -Dyyy -jar start.jar If you are using the multi-process configuration, that is what the defines directory is for; you only need to create files in that directory with the names jcifs.smb.lmCompatibility and jcifs.smb.client.useExtendedSecurity, containing the values you want to set. Karl On Thu, Nov 17, 2011 at 1:11 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I am able to access the folders on the problem server through Windows Explorer (\\server3\Folder1). I tried a couple of things with the credentials form, changing username, domain etc., but I keep getting the same error: Couldn't connect to server: Logon failure: unknown user name or bad password Can you tell me more about the -D switch you were talking about? Thanks and Regards, Swapna. On Tue, Nov 15, 2011 at 12:40 PM, Karl Wright daddy...@gmail.com wrote: [earlier messages snipped - they appear in full in the following threads]
Re: Problem crawling windows share
There's two kinds of problem you might be having. The first is intermittent, and the second is not intermittent but would have something to do with specific directories. Intermittent problems might include a domain controller that is not always accessible. In such cases, the crawl will proceed but will tend to fail unpredictably. On the other hand, if you have a directory that is handled by a DFS redirection, it is possible that the redirection is indicating a new server (let's call it server3) which may not like the precise form of your login credentials. Can you determine which scenario you are seeing? Karl On Mon, Nov 14, 2011 at 3:11 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, I have been using a windows share repository connection to crawl and get data from a particular server (server 1). It's working perfectly fine. However, I am having trouble when I try with data from another server (server 2). When I define a repository connection of type windows share and specify the server name (server 2) with my credentials, the connection status shows Connection working. But when I run a job to use this repository connection and index data from a location on this server 2, I keep getting the exception below: JCIFS: Possibly transient exception detected on attempt 3 while checking if file exists: Logon failure: unknown user name or bad password.
jcifs.smb.SmbAuthException: Logon failure: unknown user name or bad password.
  at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:544)
  at jcifs.smb.SmbTransport.send(SmbTransport.java:661)
  at jcifs.smb.SmbSession.sessionSetup(SmbSession.java:390)
  at jcifs.smb.SmbSession.send(SmbSession.java:218)
  at jcifs.smb.SmbTree.treeConnect(SmbTree.java:176)
  at jcifs.smb.SmbFile.doConnect(SmbFile.java:911)
  at jcifs.smb.SmbFile.connect(SmbFile.java:954)
  at jcifs.smb.SmbFile.connect0(SmbFile.java:880)
  at jcifs.smb.SmbFile.queryPath(SmbFile.java:1335)
  at jcifs.smb.SmbFile.exists(SmbFile.java:1417)
  at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileExists(SharedDriveConnector.java:2064)
  at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getDocumentVersions(SharedDriveConnector.java:521)
  at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
I am able to access this location from Windows Explorer. What else should I be checking, or what could be the reasons/factors causing this to fail? Thanks and Regards, Swapna.
Re: Problem crawling windows share
Glad you chased it down this far. First thing to try is whether you can get into the problem server using Windows Explorer. Obviously ManifoldCF is not going to be able to do it if Windows can't. If you *can* get in, then just playing with the form of the credentials in the MCF connection might do the trick. Some Windows or net appliance servers are picky about this. Try various things, like leaving the domain blank and specifying the user as a...@domain.com, for instance. There's also a different NTLM mode you can operate jcifs in that some servers may be configured to require; this would need you to set a -D switch on the command line to enable. Karl On Tue, Nov 15, 2011 at 12:10 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Thanks for the input. It looks like my problem is related to the second one that you specified. One of the directories in the path I am trying to index is actually redirecting to a different server. And when I specify this new server in defining the repository connection, with my credentials, the connection fails with the message: Couldn't connect to server: Logon failure: unknown user name or bad password I'll look into why I am not able to connect to this server. Thanks and Regards, Swapna. On Mon, Nov 14, 2011 at 4:56 PM, Karl Wright daddy...@gmail.com wrote: [earlier messages snipped - they appear in full in the thread above]
Re: Authorization for Ubuntu server and Windows WS not in a domain
Hi, File ACLs in ManifoldCF would normally be handled by the repository connector, during the process of indexing documents. The corresponding Linux authority information would include what Unix groups a user was part of. There is currently no authority connector I am aware of that does that. You could perhaps try to write your own - I doubt it would be very hard to write. Karl On Sat, Nov 12, 2011 at 9:39 AM, mi...@grf.bg.ac.rs wrote: Hello, So I could use MCF as a service provider for authorization. That is nice, but only if the formats agree (I'll check that). But still there is one question left: What authorization component should I use if the users and indexed files are not on Windows servers, but on an Ubuntu server with Unix-style file rights of type rwxr--r-- ? Also, could I use the AD component for Windows machines not in a domain? The format of a ManifoldCF access token is a collaboration between an authority connector and repository connectors designed to work with that authority connector. If you've already indexed documents using another mechanism, you can still use ManifoldCF's authority service to obtain access tokens for authenticated users. This service is a web application accessible by http. You can see what it returns (after defining an authority connection or two in the ManifoldCF UI) by simply using curl: curl http://localhost:8345/mcf-authority-service/UserACLs?username=myn...@mydomain.com ... and noting what is returned. The access tokens indexed in Solr by your crawler will have to match the access token format returned by the authority service, or the Solr query modification components we supply will not work. Hope this helps, Karl On Fri, Nov 11, 2011 at 2:12 PM, mi...@grf.bg.ac.rs wrote: Hello, I would like to test ManifoldCF (MCF) in order to achieve doc-level security in my Solr search app. Actually, I have already developed my own document crawler apart from the MCF framework. My test Solr app is located on an Ubuntu server (the indexed docs are located on that server). When I tried to use the MCF Quick Start app I didn't know what authorization connector to use for this case (of course I would use MCF crawlers in this case to retrieve documents). I need MCF only for authorization. My crawler already uses Java 7 capabilities to retrieve file ACLs (POSIX attributes). What classes do I need to use from the MCF libraries to perform authorization based on Ubuntu server usernames? Finally, is it possible to perform authorization against Windows accounts on workstations not in a domain (local users)? Thank you, Milos
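To give a flavor of what "writing your own" would involve, here is only a rough sketch against what the 0.x authority API looks like from the shipped connectors. BaseAuthorityConnector and AuthorizationResponse are real MCF classes, but the exact signatures should be checked against the source, and the group lookup here is a hypothetical stub:

  import org.apache.manifoldcf.authorities.authorities.BaseAuthorityConnector;
  import org.apache.manifoldcf.authorities.interfaces.AuthorizationResponse;

  // Hypothetical Unix-groups authority: maps a user name to access tokens
  // derived from the Unix groups the user belongs to.
  public class UnixGroupAuthority extends BaseAuthorityConnector {
    public AuthorizationResponse getAuthorizationResponse(String userName) {
      // lookupGroups() is a stub - e.g. parse /etc/group, or shell out to getent
      String[] tokens = lookupGroups(userName);
      return new AuthorizationResponse(tokens, AuthorizationResponse.RESPONSE_OK);
    }
    private String[] lookupGroups(String userName) {
      return new String[]{"users"}; // placeholder
    }
  }

The repository side would then have to index matching tokens (the group names of each file's rwx bits) for the Solr query modification components to line up, as Karl notes above.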
Re: Solr - ManifoldCFSecurityFilter
It doesn't have to be in the URL, but it has to be in the solr request object somehow. If you want another source for the parameter, please describe what you are trying to do and maybe we can come up with something different. Karl On Fri, Oct 21, 2011 at 4:34 AM, Wunderlich, Tobias tobias.wunderl...@igd-r.fraunhofer.de wrote: Hey guys, I’ve got a question concerning the security search component for Solr. To authorize a user I have to send the parameter “AuthenticatedUserName=…”. Does this always have to happen directly in the URL, or is there another way to send this parameter?
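If the search client is Java/SolrJ rather than a hand-built URL, the parameter can be set on the request object instead. A sketch against the SolrJ 3.x API, with the server URL as a placeholder:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class SecureSearch {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrQuery query = new SolrQuery("*:*");
      // Any mechanism that puts the parameter into the Solr request works;
      // it does not have to be appended to the URL by hand.
      query.set("AuthenticatedUserName", "username@domain");
      QueryResponse rsp = server.query(query);
      System.out.println(rsp.getResults().getNumFound());
    }
  }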
Re: Using Active Directory
) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:254) at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:372) at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:98) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4584) at org.apache.catalina.core.StandardContext$2.call(StandardContext.java:5262) at org.apache.catalina.core.StandardContext$2.call(StandardContext.java:5257) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.lang.ClassNotFoundException: org.apache.solr.mcf.ManifoldCFSearchComponent at Have I missed any steps? What else should I be doing for Solr integration? Thanks and Regards, Swapna. On Fri, Oct 14, 2011 at 2:53 PM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Thanks a lot for the info Shinichiro Abe, I'll look into it. Thanks and Regards, Swapna. On Fri, Oct 14, 2011 at 2:21 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi. If you can use ManifoldCF 0.4 trunk, you can use the solr integration components. Recently the plugin was added. Please see: http://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/solr/integration/README-3.x.txt You can get the results depending on user access tokens on the Solr side. curl 'http://localhost:8983/solr/select?q=*:*&AuthenticatedUserName=username@domain' Regards, Shinichiro Abe On 2011/10/14, at 16:39, Swapna Vuppala wrote: Hi Karl, Thanks for the reply. I built the jCIFS connector, registered it, created a repository connection of type Windows Share, and created a job using the Solr connection and the Windows share connection. I modified the Solr schema to include fields <field name="allow_token_document" type="string" indexed="true" stored="true" multiValued="true"/> <field name="deny_token_document" type="string" indexed="true" stored="true" multiValued="true"/> <field name="allow_token_share" type="string" indexed="true" stored="true" multiValued="true"/> <field name="deny_token_share" type="string" indexed="true" stored="true" multiValued="true"/> I set the stored attribute to true just for testing purposes. Now when I run the job, I see these tokens in the indexed data as expected. My next job would be to make the search from Solr secure. Do I have to make any changes on the Solr side to make use of these tokens and present only those docs to the user that he's entitled to see? Can you please direct me as to how to filter the search results depending upon the user's credentials? Thanks and Regards, Swapna. On Thu, Oct 13, 2011 at 1:22 PM, Karl Wright daddy...@gmail.com wrote: Hi, First, it is DOCUMENT access tokens that are sent to Solr, not user access tokens. You must therefore be crawling a repository that has some notion of security. The File System connector does not do that; you probably want to use the CIFS connector instead. Thanks, Karl On Thu, Oct 13, 2011 at 3:19 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, I am trying to use an Active Directory authority connection to address Solr security. I created an Authority Connection of type Active Directory (the connection status shows Connection Working) and used it in creating a File System repository connection. Then, I created a job with Solr as the output connection and the above created repository connection.
As per my understanding (I might be totally wrong, please correct me if so), ManifoldCF now sends the user's access tokens along with the documents to be indexed to Solr. I should be able to see the access tokens in Solr's indexed data, either by extending the schema with fields <field name="allow_token_document" type="string" indexed="true" stored="true" multiValued="true"/> <field name="deny_token_document" type="string" indexed="true" stored="true" multiValued="true"/> or they come as some automatic fields that Solr creates, with the attr_ prefix, as specified at http://www.mail-archive.com/connectors-user@incubator.apache.org/msg00462.html But I am not able to see any access tokens with/without modifying the Solr schema. Have I missed configuring anything else, or how do I check if my Active Directory connection is working properly? I am using the ManifoldCF 0.3 version
Re: Trouble accessing mcf-api-service
Glad you worked it out! Thanks, Karl On Mon, Oct 10, 2011 at 5:36 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: I could resolve the issue by following the commands at http://incubator.apache.org/connectors/programmatic-operation.html#Control+by+Servlet+API Thanks and Regards, Swapna. On Mon, Oct 10, 2011 at 12:18 PM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, I am able to access and use the ManifoldCF crawler (configured on my Windows machine on Tomcat). But I am not able to access http://localhost:8080/mcf-api-service/json. (I was trying to follow the last section at http://www.searchworkings.org/blog/-/blogs/344989). It says {error:Unrecognized resource.} I copied all the war files (mcf-api-service.war, mcf-authority-service.war, mcf-crawler-ui.war) to the webapps folder of my Tomcat installation folder (C:\Program Files\Apache Software Foundation\Tomcat 7.0\webapps) and configured properties.xml to use the PostgreSQL database. Can you please help me resolve this issue? Thanks and Regards, Swapna.
Re: MCF 0.3 - WebCrawlerConnector - Ingestion Problems
Hi Tobias, Sorry for the delay. There are a number of reasons a document can be rejected for indexing. They are: (1) URL criteria, as specified in the Web job's specification information (2) Maximum document length, as controlled by the output connection (you never told us what that was) (3) Mime type criteria, as controlled by the output connection So I bet this is a mime type issue. What content-type does the page have? What output connector are you using? Karl On Thu, Oct 6, 2011 at 7:18 AM, Wunderlich, Tobias tobias.wunderl...@igd-r.fraunhofer.de wrote: Hey guys, I try to crawl a website generated with a Mediawiki extension and always get the message: “[WebcrawlerConnector.java:1312] - WEB: Decided not to ingest 'http://wiki.host/index.php?title=Spezial%3AAlle+Seiten&from=p&to=s&namespace=0' because it did not match ingestability criteria” Seed url: 'http://wiki.host/index.php?title=Spezial%3AAlle+Seiten&from=p&to=s&namespace=0' Inclusions (crawl and index): .* Exclusions: none Other sites are crawled without problems, so I’m wondering what those ingestability criteria exactly are. Best regards, Tobias
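A quick way to answer Karl's content-type question from the command line is a HEAD request against the rejected page (URL taken from the thread); the Content-Type response header is what gets compared against the output connection's mime-type criteria:

  curl -I "http://wiki.host/index.php?title=Spezial%3AAlle+Seiten&from=p&to=s&namespace=0"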
Re: config files for ManifoldCF
Did you remember to start the agents process? Karl On Wed, Oct 5, 2011 at 5:47 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I installed the PostgreSQL database and changed properties.xml accordingly, used executecommand.bat to initialize the database, install the schema, and register the solr, filesystem and active directory connectors, and ran the agents process. I am able to access the crawler UI at http://localhost:8080/mcf-crawler-ui and define a Solr output connection, a file system repository connection, and also a job. But my problem is that when I run the job, the status shows Starting up and does not change after that. Connection status for the Solr connection shows Connection working. I see nothing in manifoldcf.log. Can you please direct me as to where to look for any errors or how to resolve this? Thanks and Regards, Swapna. On Tue, Oct 4, 2011 at 3:38 PM, Karl Wright daddy...@gmail.com wrote: [earlier messages snipped - they appear in full in the following threads]
Re: config files for ManifoldCF
I worked with Swapna directly to resolve this. Turned out he'd been using ^C to kill the agents process, so a LockClean procedure fixed it. Karl On Wed, Oct 5, 2011 at 6:37 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: I used this property: <property name="org.apache.manifoldcf.synchdirectory" value="c:/mysynchdir"/> Thanks and Regards, Swapna. On Wed, Oct 5, 2011 at 4:05 PM, Karl Wright daddy...@gmail.com wrote: What do you have set for your synch directory? Karl On Wed, Oct 5, 2011 at 6:09 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Yes, I used executecommand.bat org.apache.manifoldcf.agents.AgentRun Thanks and Regards, Swapna. On Wed, Oct 5, 2011 at 3:37 PM, Karl Wright daddy...@gmail.com wrote: Did you remember to start the agents process? Karl On Wed, Oct 5, 2011 at 5:47 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: [earlier messages snipped - they appear in full in the previous thread]
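The LockClean procedure mentioned above is itself just another of the same command-style invocations, run while all MCF processes are stopped (class name assumed from the 0.x command set; check the how-to page if it has moved):

  executecommand.bat org.apache.manifoldcf.core.LockClean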
Re: config files for ManifoldCF
How you add the -D switch for tomcat depends on what platform you are running tomcat on. On Windows, there is an application that allows you to add commands to the java invocation. On linux, the /etc/init.d/tomcat script allows you to set options - depending on version, you can even put these in a directory that the script scrapes to put them together. As for what else you need: - a properties.xml file that specifies a synch directory - you will need to initialize the database, register the crawler agent, and register the connectors using commands, as described in how-to-build-and-deploy - You'll need to run the agents process, and any of the sidecar processes needed by the connectors you have registered. There are scripts for all of these, which require you to set MCF_HOME and JAVA_HOME environment variables first. Karl On Tue, Oct 4, 2011 at 4:15 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Thanks Karl and Piergiorgio, I need one more clarification, but that's regarding deploying ManifoldCF on Tomcat. I have built ManifoldCF 0.3 and have been running it so far on Jetty, and everything works fine. But now I want to use Tomcat instead of Jetty. I tried the instructions at http://incubator.apache.org/connectors/how-to-build-and-deploy.html. I already have Tomcat installed on my machine. So I copied the war files (mfc-api-service, mfc-authority-service, crawler-ui) into Tomcat's webapps directory, and copied all contents of the dist directory of manifoldcf into a separate directory (which I set as the MFC_HOME environment variable). Now I am trying to access the crawler UI at http://localhost:8080/mcf-crawler-ui/ But I get the exception org.apache.jasper.JasperException: javax.servlet.ServletException: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Initialization failed: Could not read configuration file 'C:\lcf\properties.xml' I understand that the property org.apache.manifoldcf.configfile is not set. How do I set this, and what else do I have to do for proper and complete deployment on Tomcat? Thanks a lot in advance, Swapna. On Mon, Oct 3, 2011 at 5:13 AM, Karl Wright daddy...@gmail.com wrote: Hi Swapna, To clarify Piergiorgio's answer a little, ManifoldCF uses a properties.xml file for its basic configuration information. However, everything else is kept in the database. That includes connection definitions and job definitions. I recommend that you start by using the Quick-Start example, which uses an embedded Apache Derby database instance by default. You can change this later, of course. For real work we recommend PostgreSQL. You can find more information at http://incubator.apache.org/connectors/how-to-build-and-deploy.html. Have a look at the quick-start instructions. Karl On Sun, Oct 2, 2011 at 1:43 PM, Piergiorgio Lucidi piergior...@apache.org wrote: Hi Swapna, 2011/10/2 Swapna Vuppala swapna.kollip...@gmail.com Hi, I am new to using ManifoldCF and I have got a couple of doubts about using it. I am interested in knowing what config files are used in ManifoldCF, where they are located, and how they are used. Also, I was wondering where all the information about output connection definitions, repository definitions and job definitions, defined by a user using the crawler UI, is stored. The only config file is properties.xml, for which you need to add a new JVM parameter: -Dorg.apache.manifoldcf.configfile=<configuration file path> This is only needed if you are deploying ManifoldCF in an application server. Otherwise you can leave properties.xml in your user home/lcf folder.
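As a concrete sketch of the two pieces Karl lists (paths and values below are placeholders, and the synchdirectory property name should be checked against the how-to-build-and-deploy page), a minimal properties.xml that names a synch directory might look like:

  <?xml version="1.0" encoding="UTF-8" ?>
  <configuration>
    <!-- directory used to synchronize the web applications with the agents process -->
    <property name="org.apache.manifoldcf.synchdirectory" value="/var/manifoldcf/synch"/>
  </configuration>

and on Linux one common way to hand Tomcat the -D switch is a bin/setenv.sh file, which catalina.sh picks up automatically:

  # $CATALINA_HOME/bin/setenv.sh
  export CATALINA_OPTS="$CATALINA_OPTS -Dorg.apache.manifoldcf.configfile=/etc/manifoldcf/properties.xml"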
Re: config files for ManifoldCF
In the Apache Tomcat group of programs, there's a Configure Tomcat application. Click on that, and you will find within it the ability to add switches to your Tomcat instance. Since you've never used Java before, I'm afraid this group is not likely to be your best resource. You might try googling for some web resources that will likely help you more than I can in that regard. Karl On Tue, Oct 4, 2011 at 6:34 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Thanks for the quick response. I am running Tomcat on Windows, sorry that I didn't mention it earlier. How do I do this on Windows? Also, this is the first time I am working in a Java environment, and my questions may look too trivial. Please bear with me. Thanks and Regards, Swapna.
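For the record, the Windows route Karl describes: the Configure Tomcat application has a Java tab with a Java Options box, and each -D switch goes there on its own line, e.g. (the path is a placeholder):

  -Dorg.apache.manifoldcf.configfile=C:\manifoldcf\properties.xml

After adding the option, restart the Tomcat service so the new JVM argument takes effect.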
Re: Not able to create sharepoint connection
You might be interested to know that we now prebuild the SharePoint MCPermissions web service, and will be including this as part of the release for ManifoldCF 0.4-incubating. You can pick it up on trunk now; it's delivered (along with installation instructions and .bat scripts) under dist/sharepoint-integration when you build. Karl On Wed, Jul 20, 2011 at 6:12 AM, Karl Wright daddy...@gmail.com wrote: Hi Pravin, The .NET piece of the SharePoint connector is necessary only if you select the SharePoint 3.0 radio button when you set up your SharePoint repository connection. If you don't want to build and deploy the MCPermissions plugin on the server-side SharePoint, you can certainly avoid that by simply selecting the SharePoint 2.0 radio button instead. However, if you choose to do it this way, then SharePoint's file and folder permissions will not be accessible to the connector. (These were added in 3.0, but Microsoft overlooked the web service methods that would allow external access to them.) It sounds like, with this change, you were probably successful in connecting to the second system you tried. For the first system, can you tell me a bit more about it? For example, what version of SharePoint was it? And, did you follow the instructions and use your browser to determine the proper connection URL? I'm actually very interested to talk with someone who has access to a functioning SharePoint setup, since I lost access to my SharePoint testbed a while ago and would dearly love to bring that connector up to snuff for SharePoint 2010. I'm hoping Microsoft actually corrected the missing feature that required the MCPermissions plugin, for instance. Please let me know if you are willing to experiment a bit to help us with this connector. Thanks! Karl On Wed, Jul 20, 2011 at 5:20 AM, Pravin Agrawal pravin_agra...@persistent.co.in wrote: Hi All, I was trying out the SharePoint repository connector provided by ManifoldCF; following are the steps that I carried out to get it up and running. For building and deploying ManifoldCF, I followed the procedure given at http://incubator.apache.org/connectors/how-to-build-and-deploy.html I have built the connector on my Linux machine and deployed the application using the quick-start process. The question that came to me during the build process was: are .NET and MS Visual Studio absolutely necessary for building the SharePoint connector, or is it sufficient to provide only those 5 WSDL files mentioned in the guide? And does that step work on Windows only? I am able to see the ManifoldCF UI in my browser by simply deploying the quick-start version along with the SharePoint repository connection. I started to create a SharePoint repository connection to crawl one of our SharePoint sites, and following are the problems I encountered while creating it: 1. After creating the SharePoint repository connection, the status shows me a message as follows: Connection status: The site at http://portal.mydomain.co.in/sites/Documents/Forms did not exist 2. With a different SharePoint site, the connection status shows me the following message: Connection status: ManifoldCF's MCPermissions web service may not be installed on the target SharePoint server. MCPermissions service is needed for SharePoint repositories version 3.0 or higher, to allow access to security information for files and folders. Consult your system administrator. Can anyone tell me whether I am missing some steps? Thanks in advance.
-Regards, Pravin
Re: config files for ManifoldCF
Hi Swapna, To clarify Piergiorgio's answer a little, ManifoldCF uses a properties.xml file for its basic configuration information. However, everything else is kept in the database. That includes connection definitions and job definitions. I recommend that you start by using the Quick-Start example, which uses an embedded Apache Derby database instance by default. You can change this later, of course. For real work we recommend PostgreSQL. You can find more information at http://incubator.apache.org/connectors/how-to-build-and-deploy.html. Have a look at the quick-start instructions. Karl On Sun, Oct 2, 2011 at 1:43 PM, Piergiorgio Lucidi piergior...@apache.org wrote: Hi Swapna, 2011/10/2 Swapna Vuppala swapna.kollip...@gmail.com Hi, I am new to using ManifoldCF and I have got a couple of doubts about using it. I am interested in knowing what config files are used in ManifoldCF, where they are located, and how they are used. Also, I was wondering where all the information about output connection definitions, repository definitions and job definitions, defined by a user using the crawler UI, is stored. The only config file is properties.xml, which you point to by adding a new JVM parameter: -Dorg.apache.manifoldcf.configfile=<configuration file path> This is only needed if you are deploying ManifoldCF in an application server; otherwise you can leave properties.xml in your user home/lcf folder. You can find an example of the properties.xml file in the dist/example folder of the distribution bundle. All the information managed by the crawler UI is stored in a database, HSQL by default, but you can configure a PostgreSQL DBMS by changing the properties.xml file. For more information about all the parameters you can visit the following page: http://incubator.apache.org/connectors/how-to-build-and-deploy.html#The+ManifoldCF+configuration+file Hope this helps. Piergiorgio Can you please help me in clarifying these doubts? Thanks and Regards, Swapna. -- Piergiorgio Lucidi http://about.me/piergiorgiolucidi
Re: Indexing Wikipedia/MediaWiki
This looked easy enough that I just went ahead and implemented it. If you check out trunk, and add site map document URLs to the Feed URLs tab for an RSS job, it should locate the documents the sitemap points at. Furthermore, it should not chase links within those documents unless the documents are also site map documents or RSS feeds in their own right. Karl On Fri, Sep 16, 2011 at 5:31 AM, Karl Wright daddy...@gmail.com wrote: It might be worth exploring sitemaps. http://en.wikipedia.org/wiki/Site_map It may be possible to create a connector, much like the RSS connector, that you can point at a site map and it would just pick up the pages. In fact, I think it would be straightforward to modify the RSS connector to understand the sitemap format. If you can do a little research to figure out whether this might work for you, I'd be willing to do some work and try to implement it. Karl On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias tobias.wunderl...@igd-r.fraunhofer.de wrote: Hey folks, I am currently working on a project to create a basic search platform using Solr and ManifoldCF. One of the content repositories I need to index is a wiki (MediaWiki), and that's where I ran into a wall. I tried using the web connector, but simply crawling the sites resulted in a lot of content I don't need (navigation links, ...) and not all the information I wanted was gathered (author, last modified, ...). The only metadata I got was the metadata included in head/meta, which wasn't relevant. Is there another way to get the wiki's data, and more importantly, is there a way to get the right data into the right field? I know that there is a way to export the wiki sites as XML with wiki syntax, but I don't know how that would help me. I could simply use Solr's DataImportHandler to index a complete wiki dump, but it would be nice to use the same framework for every repository, especially since Manifold manages all the recrawling. Does anybody have some experience in this direction, or any idea for a solution? Thanks in advance, Tobias
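For anyone trying this: the sitemap format in question is presumably the standard sitemaps.org XML, so a minimal file that the Feed URLs tab of an RSS job could point at would look like the following (URLs and dates are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://wiki.example.com/index.php/Main_Page</loc>
      <lastmod>2011-09-16</lastmod>
    </url>
    <url>
      <loc>http://wiki.example.com/index.php/Some_Article</loc>
    </url>
  </urlset>

MediaWiki can generate such files itself via its maintenance/generateSitemap.php script, which makes it a natural fit for this approach.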
Re: Last-modified from Web crawler
If I recall, the Solr output connector has a tab that will let you map incoming metadata to whatever Solr field name you want. It's called the Solr Field Mapping tab, and you set it on each job that indexes to a Solr output connection. Give it a try and see if it works for you. Karl On Wed, Aug 24, 2011 at 4:38 AM, Jan Høydahl jan@cominvent.com wrote: Wow, that was quick :) So, how can we now configure it so that Last-Modified is sent to the Solr output connector as e.g. literal.last_modified? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 22. aug. 2011, at 17.09, Jan Høydahl wrote: CONNECTORS-243 -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 22. aug. 2011, at 16.38, Karl Wright wrote: It would have to be sent as a metadata field. This should not be difficult to implement. Can you create a JIRA ticket for it please? Thanks, Karl On Mon, Aug 22, 2011 at 10:35 AM, Jan Høydahl jan@cominvent.com wrote: Hi, How can we have the Web connector send the last-modified value from a page's HTTP header to the output connector? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com
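To make the mapping concrete: with Solr's extracting request handler, a mapped metadata value ends up on the request as a literal parameter, so mapping Last-Modified to last_modified is roughly equivalent to the following hand-built request (URL, core layout and field names are placeholders):

  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.last_modified=2011-08-24T08:38:00Z" \
    -F "myfile=@page.html"

The target field (last_modified here) must of course exist in your Solr schema, or be covered by a dynamic field.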
Re: Setting heapsize of agent
Good point. We should probably have an environment variable or script parameter for this. Would you like to create a ticket? Karl On Fri, Jun 24, 2011 at 1:51 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hello. When using ./executecommand.sh org.apache.manifoldcf.agents.AgentRun, where do we set the JVM heap size of the agent (-Xms1024m -Xmx1024m)? We cannot use files in the processes/define folder, since those only add -D switches. Regards, Shinichiro Abe
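Until such a parameter exists, the pragmatic workaround is to edit the java invocation inside executecommand.sh directly; something along these lines (the exact invocation line varies by version, so treat this purely as a sketch):

  # inside executecommand.sh: add heap options ahead of the existing -D switches
  "$JAVA_HOME/bin/java" -Xms1024m -Xmx1024m -Dorg.apache.manifoldcf.configfile="$MCF_HOME/properties.xml" org.apache.manifoldcf.agents.AgentRun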
Travel assistance, ApacheCon NA 2011
The Apache Software Foundation (ASF)'s Travel Assistance Committee (TAC) is now accepting applications for ApacheCon North America 2011, 7-11 November in Vancouver BC, Canada. The TAC is seeking individuals from the Apache community at-large --users, developers, educators, students, Committers, and Members-- who would like to attend ApacheCon, but need some financial support in order to be able to get there. There are limited places available, and all applicants will be scored on their individual merit. Financial assistance is available to cover flights/trains, accommodation and entrance fees either in part or in full, depending on circumstances. However, the support available for those attending only the BarCamp (7-8 November) is less than that for those attending the entire event (Conference + BarCamp 7-11 November). The Travel Assistance Committee aims to support all official ASF events, including cross-project activities; as such, it may be prudent for those in Asia and Europe to wait for an event geographically closer to them. More information can be found at http://www.apache.org/travel/index.html including a link to the online application and detailed instructions for submitting. Applications will close on 8 July 2011 at 22:00 BST (UTC/GMT +1). We wish good luck to all those who will apply, and thank you in advance for tweeting, blogging, and otherwise spreading the word. Regards, The Travel Assistance Committee
ManifoldCF now officially requires Java 1.5
Hi everyone, I've checked in changes that move ManifoldCF from mostly the Java 1.4 world into the Java 1.5 world. This should introduce no compilation errors in user connector code, but most people will need to do a clean recompile to get a working system again. Please let me know ASAP if anyone finds any problems. Thanks! Karl
My ManifoldCF talk has been accepted for ApacheCon North America 2011 in Vancouver
I'll be giving a 45-minute introductory talk in Vancouver at ApacheCon North America, some time between November 9 and November 11, 2011. If anyone has any particular detail or issue they would like to see in the talk, I'd be happy to entertain your suggestion. Please let me know. Karl
Re: Re-sending docs to output connector
More thoughts: Including this functionality as a general feature of ManifoldCF would allow one to use ManifoldCF as a repository of content in its own right. In this model, the data would probably be keyed by the output connection name, and if integrated at this level, in theory this would work with any output connection. The UI modifications would be modest and would consist of additional buttons on the output connection view page to re-feed documents to the connection rather than recrawl. Advantages: it would leverage multiple output connectors transparently, and would support the refeed-everything-to-Solr model. Guaranteed commit on the part of a target search engine would no longer be a requirement. Downsides: First, lots of storage would be required that probably can't live in PostgreSQL, complicating the deployment model. Second, depending on the details of implementation, there may not be feedback available at crawl time from the output connection about the acceptability of a document for indexing. Third, for many repository connectors the benefit of reading from the file system might well be zero. Fourth, the entire process of keeping the target repository managed properly is a manual one, and thus prone to errors. Karl
Re: Re-sending docs to output connector
I've been thinking about this further. First, it seems clear to me that both Solr AND ManifoldCF would need access to the document cache. If the cache lives under ManifoldCF, I cannot see a good way towards a Solr integration that works the way I'd hope it would. Furthermore, the cache is not needed by many (or even most) ManifoldCF targets, so adding this as a general feature of ManifoldCF doesn't make sense to me. On the other hand, while Solr can certainly use this facility, I can well imagine other situations where it would be very useful as well. So I am now leaning towards having a wholly separate service which functions as both a cache and a transaction log. A ManifoldCF output connector would communicate with the service, and Solr also would - or, rather, some automatic Solr-specific push process would query for changes between a specified time range and push those into Solr. Other such processes would be possible too. The list of moving parts would therefore be:
- a configuration file containing details on how to communicate with Solr
- a stand-alone web application which accepts documents and metadata via HTTP, and can also respond to HTTP transaction log queries and commands
- a number of command classes (processes) which provide a means of pushing the transaction log contents into Solr, using the HTTP API mentioned above.
I'd be interested in working on the development of such a widget, but I probably wouldn't have the serious time necessary to do much until July 1 given my current schedule. Anybody else interested in collaborating? Other thoughts? Karl On Tue, May 24, 2011 at 7:28 PM, Karl Wright daddy...@gmail.com wrote: The only requirement you may have overlooked is the requirement that Solr be able to take advantage of the item cache automatically if it happens to be restarted in the middle of an indexing pass. If you think about it, you will realize that this cannot be done externally to Solr, unless Solr learns how to pull documents from the item cache, and keeps track somehow of the last item/operation it successfully committed. That's why I proposed putting the whole cache under Solr auspices. Deletions also would need to be enumerated in the cache, so it would not really be a cache but more like a transaction log. But I agree that the right place for such a transaction log is effectively between MCF and Solr. Obviously the cache would also need to be disk-based, or once again guaranteed delivery would not be possible. Compression might be useful, as would be checkpoints in case the data got large. This is very database-like, so CouchDB might be a reasonable way to do it, especially if this code is considered to be part of Solr. If part of ManifoldCF, we should try to see if PostgreSQL would suffice, since it will likely be already installed and ready to go. Karl On Tue, May 24, 2011 at 5:01 PM, Jan Høydahl jan@cominvent.com wrote: The Refetch all ingested documents button works, but with Web crawling the problem is that it will take almost as long as a new crawl to re-feed. The solutions could be:
A) Add a stand-alone cache in front of Solr
B) Add a caching proxy in front of MCF - will allow speedy re-crawl (but clunky to administer)
C) Extend MCF with an optional item cache. This could allow a refeed from cache button somewhere...
The cache in C could be realized externally to MCF, e.g. as a CouchDB cluster. To enable it, you'd add the CouchDB access info to properties.xml. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
Re: Re-sending docs to output connector
On a refeed from cache request, send all objects to Solr - this should probably be per Job, not per output connector This is where your proposal gets in trouble, I think. There is no infrastructure mechanism in ManifoldCF to do either of these things at this time. Connections are not aware of what jobs are using them, and there is no way to send a signal to a connector to tell it to refeed, nor is there a button in the crawler UI for it. You're basically proposing significant infrastructure changes in ManifoldCF to support a missing feature in Solr, it seems to me. Also, I'm pretty sure we want to try to solve the guaranteed delivery problem using the same mechanism, whatever it turns out to be. The problems are almost identical, and the overhead of having two independent solutions for the same issue is very high. So let us try to make this work for both cases. Karl On Wed, May 25, 2011 at 9:55 AM, Jan Høydahl jan@cominvent.com wrote: Hi, Definitely, Solr also needs some sort of guaranteed delivery mechanism, but it's probably not the same thing as this cache; I imagine more like a message queue or callback mechanism. But that's a separate discussion all together :) So if we don't shoot for a 100% solution, but try to solve the need to re-feed a bunch of documents from MCF really quickly after some schema change or other processing change on the output (may be any output really), then we'd have a simpler case: not a standalone server but a lightweight library (jar) which knows how to talk to a persistent object store (CouchDB), supporting simple put(), get(), delete() operations as well as querying for objects within time stamps etc. An output connector that wishes to support caching could then inject calls to this library in all the places it talks with Solr:
* On add: put() the object into the cache along with a timestamp for sequence, then send the doc directly to Solr
* On delete: delete the document from the cache, then add a delete meta object with timestamp as a transaction-log feature, then delete from Solr
* On a refeed from cache request, send all objects to Solr - this should probably be per Job, not per output connector
* A refeed from cache since timestamp X request would be useful after Solr downtime.
The command would use the cache as a transaction log. The cache will always be a mirror of what the output (Solr) SHOULD look like, thus it would also be possible to support a consistency check feature, in which we compare all IDs from the cache with all IDs in Solr, and if they are not equal, get back in sync. Doing this as a lightweight library would then provide a tool for programmers of other clients. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
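For concreteness, the lightweight library Jan describes might amount to an interface like the following - a sketch with invented names, since the thread never pins down an API beyond put()/get()/delete() plus time-range queries:

  import java.util.Iterator;
  import java.util.Map;

  // Hypothetical cache / transaction-log API for the design discussed above.
  public interface DocumentTransactionLog {
    // Record an add or replace; returns the entry's log timestamp.
    long put(String documentId, byte[] content, Map<String,String> metadata);
    // Record a deletion as a transaction-log entry of its own.
    long delete(String documentId);
    // Latest cached copy of a document, or null if absent or deleted.
    CachedDocument get(String documentId);
    // Replay every entry with timestamp in [fromTime, toTime), in order;
    // this is what a "refeed since timestamp X" command would iterate over.
    Iterator<LogEntry> replay(long fromTime, long toTime);

    // Placeholder value types for the sketch.
    class CachedDocument { public byte[] content; public Map<String,String> metadata; }
    class LogEntry { public long timestamp; public String documentId; public boolean isDelete; public CachedDocument document; }
  }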
Re: Re-sending docs to output connector
ManifoldCF is designed to deal with the problem of repeated or continuous crawling, doing only what is needed on subsequent crawls. It is thus a true incremental crawler. But in order for this to work for you, you need to let ManifoldCF do its job of keeping track of what documents (and what document versions) have been handed to the output connection. For the situation where you change something in Solr, the ManifoldCF solution is the refetch all ingested documents button in the Crawler UI. This is on the view page for the output connection. Clicking that button will cause ManifoldCF to re-index all documents - but it will also require ManifoldCF to recrawl them, because ManifoldCF does not keep copies of the documents it crawls anywhere. If you need to avoid recrawling at all costs when you change Solr configurations, you may well need to put some sort of software of your own devising between ManifoldCF and Solr. You basically want to develop a content repository which ManifoldCF outputs to, and which can be scanned to send documents to your Solr instance. I actually proposed this design for a Solr guaranteed delivery mechanism, because until Solr commits a document it can still be lost if the Solr instance is shut down. Clearly something like this is needed, and it would also likely solve your problem too. The main issue, though, is that it would need to be integrated with Solr itself, because you'd really want it to pick up where it left off if Solr is cycled etc. In my opinion this functionality really can't function as part of ManifoldCF for that reason. Karl On Tue, May 24, 2011 at 8:57 AM, Jan Høydahl jan@cominvent.com wrote: Hi, Is there an easy way to separate fetching from ingestion? I'd like to first run a crawl for several days, and then feed it to my Solr output as fast as possible. Also, after schema changes in Solr, there is a need to re-feed all docs. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
Re: Treatment of protected files
This should be enough. I'll open a ticket. The changes to the Solr connector are trivial; I can do them and check them in, if someone is willing to try it out for real. Karl On Thu, May 19, 2011 at 6:11 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: Here's what I found in my simple history logs: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types So, yes, Tika exceptions are stored in the MCF logs, so I guess it should be possible to find a workaround for this. Erlend On 19.05.11 12.00, Karl Wright wrote: There was a Solr ticket created, I believe by Shinichiro. The question is whether the Solr 500 response has anything in its body that could help ManifoldCF recognize a Tika exception. If not, there is little the Solr connector can do to detect this case. The problem is that you need to look in the Simple History to see what the response actually is, and I don't think Shinichiro did that. Karl On Thu, May 19, 2011 at 4:42 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: Do we have an MCF ticket for this issue yet? Or is it rather a Solr issue? I agree with Karl. We should look for a TikaException and then tell MCF to skip the affected documents. But maybe this should just be a temporary fix until it has been fixed in Solr Cell. Exactly the same happens if Tika cannot parse a document type which it does not support. Solr/Solr Cell returns a 500 server error, causing MCF to retry over and over again: [2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx} status=500 QTime=5 [2011-05-18 17:39:39.102] {} 0 4 [2011-05-18 17:39:39.103] org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types And finally, the job just aborts: Exception tossed: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500 at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630) Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Ingestion HTTP error code 500 at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362) I guess I can find a workaround, since I have created my own ExtractingRequestHandler in order to support language detection etc., but I think MCF should act differently when the underlying cause is a TikaException. Erlend On 27.04.11 12.25, Karl Wright wrote: If I recall, it treats the 400 response as meaning this document should be skipped, and it treats the 500 response as meaning this document should be retried because I have absolutely no idea what happened. However, we could modify the code for the 500 response to look at the content of the response as well, and look for a string in it that would give us a clue, such as TikaException. If we see a TikaException, we could have it conclude this document should be skipped. That was what I was thinking. Karl On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi. Thank you for your reply. It seems that Solr's ExtractingRequestHandler returns the same HTTP response (SERVER_ERROR, 500) any time an error occurs. I'll try to open a ticket for Solr.
Is it correct that MCF retries processing when it receives a 500-level response, but not a 400-level response? Thank you. Shinichiro Abe On 2011/04/27, at 14:45, Karl Wright wrote: So the 500 error is occurring because Solr is throwing an exception at indexing time, is that correct? If this is correct, then here's my take. (1) A 500 error is a nasty error that Solr should not be returning under normal conditions. (2) A password-protected PDF is not what I would consider exceptional, so Tika should not be throwing an exception when it sees it, merely (at worst) logging an error and continuing. However, having said that, output connectors in ManifoldCF can make the decision to never retry the document, by returning a certain status, provided the connector can figure out that the error warrants this treatment. My suggestion is therefore the following. First, we should open a ticket for Solr about this. Second, if you can see the error output from the Simple History for a TikaException being thrown in Solr, we can look for that text in the response from Solr and perhaps modify the Solr Connector to detect the case. If you could open a ManifoldCF ticket and include that text, I'd be very grateful. Thanks! Karl On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hello. There are pdf
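A sketch of the 500-response detection Karl describes - names and constants here are invented for illustration, not the actual Solr connector API:

  // Classify an ingestion response: accepted, permanently skipped, or retried.
  public final class IngestResponseClassifier {
    public static final int ACCEPTED = 0; // indexed successfully
    public static final int SKIP = 1;     // permanent failure; do not retry
    public static final int RETRY = 2;    // transient failure; retry later

    public static int classify(int httpCode, String responseBody) {
      if (httpCode == 200) return ACCEPTED;
      // 4xx means Solr rejected this particular document; retrying won't help.
      if (httpCode >= 400 && httpCode < 500) return SKIP;
      // 5xx normally means "unknown server trouble", so retry - unless the body
      // reveals a Tika parse failure, which is permanent for this document.
      if (responseBody != null && responseBody.contains("TikaException")) return SKIP;
      return RETRY;
    }
  }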
Re: Which version of Solr have implements the Document Level Access Control
OK, if you try what I sent and it works, I will check it in. Karl On Thu, May 5, 2011 at 6:29 PM, Kadri Atalay atalay.ka...@gmail.com wrote: I'm assuming that since this is a domain logon name, we don't need to add any escaping sequence; otherwise the OS would reject it during authentication. Yes, you are right, the user SID is needed if the user is not part of any group but still has access to a document. On Thu, May 5, 2011 at 6:23 PM, Karl Wright daddy...@gmail.com wrote: Thanks - we do need the user sid, so I will put that back. Also, I'd like to ask what you know about escaping the user name in this expression: String searchFilter = "(&(objectClass=user)(sAMAccountName=" + userName + "))"; It seems to me that there is probably some escaping needed, but I don't know what style. Do you think it is the same (C-style, with \ escape) as for the other case? Karl On Thu, May 5, 2011 at 6:20 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, String returnedAtts[] = {"tokenGroups"}; is ONLY returning the member groups:
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_ad...@teqa.filetek.com;
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-5-32-545 TOKEN:TEQA-DC:S-1-5-32-544 TOKEN:TEQA-DC:S-1-5-32-555 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1124 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-512 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-513 TOKEN:TEQA-DC:S-1-1-0
but String returnedAtts[] = {"tokenGroups","objectSid"}; is returning the member groups AND the SID for that user:
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_ad...@teqa.filetek.com;
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-5-32-545 TOKEN:TEQA-DC:S-1-5-32-544 TOKEN:TEQA-DC:S-1-5-32-555 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1124 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-512 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-513 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1480 TOKEN:TEQA-DC:S-1-1-0
Since we are only interested in the member groups, tokenGroups is sufficient, but if you also need the user SID then you might keep the objectSid as well. Thanks Kadri On Thu, May 5, 2011 at 6:01 PM, Karl Wright daddy...@gmail.com wrote: I am curious about the following change, which does not seem correct:
//Specify the attributes to return
- String returnedAtts[] = {"tokenGroups","objectSid"};
+ String returnedAtts[] = {"tokenGroups"};
searchCtls.setReturningAttributes(returnedAtts);
Karl On Thu, May 5, 2011 at 5:36 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Karl, The ActiveDirectoryAuthority.java is attached. I'm not sure about clicking Grant ASF License, or how to do that from Tortoise. But you have my consent for granting the ASF license. Thanks Kadri On Thu, May 5, 2011 at 5:28 PM, Karl Wright daddy...@gmail.com wrote: You may attach the whole ActiveDirectoryAuthority.java file to the ticket if you prefer. But you must click the Grant ASF License button. Karl On Thu, May 5, 2011 at 5:24 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Karl, I'm using TortoiseSVN, and I'm new to SVN. Do you know how to do this with Tortoise? Otherwise, I can just send the source code directly to you. BTW, there are some changes in the ParseUser method also; you can see them all when you run the diff. Thanks Kadri
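To answer the escaping question for the archive: LDAP search filters use RFC 4515 hex escapes rather than C-style backslash escapes. A minimal helper in the style the connector could use (illustrative only; the committed fix may differ) is:

  // Escape a value for use inside an LDAP search filter (RFC 4515).
  // The five characters below are the only ones special in a filter value.
  public static String ldapEscape(String value) {
    StringBuilder sb = new StringBuilder(value.length());
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      switch (c) {
        case '\\': sb.append("\\5c"); break;
        case '*':  sb.append("\\2a"); break;
        case '(':  sb.append("\\28"); break;
        case ')':  sb.append("\\29"); break;
        case '\0': sb.append("\\00"); break;
        default:   sb.append(c);
      }
    }
    return sb.toString();
  }

With this, the filter above becomes "(&(objectClass=user)(sAMAccountName=" + ldapEscape(userName) + "))".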
Re: Which version of Solr have implements the Document Level Access Control
It must mean we're somehow throwing an exception in the case where the user is missing. I bet I know why - the CN lookup is failing instead. I'll see if I can change it. Karl On Thu, May 5, 2011 at 6:43 PM, Kadri Atalay atalay.ka...@gmail.com wrote: It works; the only difference I see from the previous one is that if a domain is reachable, the USERNOTFOUND message makes a better indicator, and somehow we lost that.
C:\OPT>testauthority
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=fakeuser;
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=fakeuser@fakedomain;
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=fakeu...@teqa.filetek.com;
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
Previous one:
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=fakeu...@teqa.filetek.com;
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_admin@teqa;
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_ad...@teqa.filetek.com;
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-5-32-545 TOKEN:TEQA-DC:S-1-5-32-544 TOKEN:TEQA-DC:S-1-5-32-555 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1124 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-512 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-513 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1480 TOKEN:TEQA-DC:S-1-1-0
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=kata...@teqa.filetek.com;
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-5-32-545 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-513 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1473 TOKEN:TEQA-DC:S-1-1-0
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=katalay@fakedomain;
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
On Thu, May 5, 2011 at 6:29 PM, Karl Wright daddy...@gmail.com wrote: I've cleaned things up slightly to restore the objectSid, and also to fix an infinite loop if you have more than one comma in the escape expression. I've attached the file; can you see if it works? Thanks, Karl
Re: Which version of Solr have implements the Document Level Access Control
Try this. Karl
Re: Which version of Solr have implements the Document Level Access Control
I think yours was working because it was returning cn=null, cn=users, which was a result of the fact that cn was null and the expression was assembled using the + operator. When I separated the LDAP escape out, it caused a null pointer exception to be thrown instead. It should be fixed now. Karl On Thu, May 5, 2011 at 7:19 PM, Kadri Atalay atalay.ka...@gmail.com wrote: FYI, the file I sent you was returning USERNOTFOUND. Sent from my iPhone
Re: Which version of Solr implements the Document Level Access Control
I thought you were using the Quick Start, which does not have a sync directory. Karl On Tue, May 3, 2011 at 6:16 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Note: Did that, still didn't help, but deleting the contents of mysyncdir worked. On Tue, May 3, 2011 at 5:48 PM, Karl Wright daddy...@gmail.com wrote: Never seen that before. Do you have more than one instance running? Only one instance can run at a time or the database is unhappy. If that still doesn't seem to be the problem, try "ant clean" and then "ant build" again. It will clean out the existing database instance. Karl On Tue, May 3, 2011 at 5:34 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, You are right, somehow I still had the OLD 195 code.. Just got the latest, compiled, but this one doesn't start after the message "Configuration file successfully read". Any ideas? Thanks Kadri On Tue, May 3, 2011 at 3:12 PM, Karl Wright daddy...@gmail.com wrote: The latest CONNECTORS-195 branch code doesn't use sAMAccountName. It uses ObjectSid. Your schema has ObjectSid. The version of ActiveDirectoryAuthority in trunk looks up ObjectSid too. Indeed, the only change is the addition of the following: if (theGroups.size() == 0) return userNotFoundResponse; This CANNOT occur for an existing user, because all existing users must have at least one SID. And, if existing users returned the proper SIDs before, this should not change anything. So I cannot see how you could be getting the result you claim. Are you SURE you synched up the CONNECTORS-195 branch and built that? I have not checked this code into trunk yet. Karl On Tue, May 3, 2011 at 2:46 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, Got the latest one, built and tried, but same result.. In the meantime I took a look at my user account with an AD browser, and as you can see (attached) it does have a sAMAccountName attribute. BTW, do we have to use objectClass = user for the search filter? May need to check into this.. Thanks Kadri On Tue, May 3, 2011 at 1:16 PM, Karl Wright daddy...@gmail.com wrote: I tried locating details of DSID-031006E0 on MSDN, to no avail. Microsoft apparently doesn't document this error. But I asked around, and there are two potential avenues forward. Avenue 1: There is a Windows tool called LDP, which should allow you to browse AD's LDAP. What you would need to do is confirm that each user has a sAMAccountName attribute. If they *don't*, it is possible that the domain was not set up in compatibility mode, which means we'll need to find a different attribute to query against. Avenue 2: Just change the string sAMAccountName in the ActiveDirectoryAuthority.java class to uid, and try again. The uid attribute should exist on all AD installations after Windows 2000. Thanks, Karl On Tue, May 3, 2011 at 12:52 PM, Karl Wright daddy...@gmail.com wrote: I removed the object scope from the user lookup - it's worth another try. Care to synch up and run again? Karl On Tue, May 3, 2011 at 12:36 PM, Karl Wright daddy...@gmail.com wrote: As I feared, the new user-exists-check code is not correct in some way. Apparently we can't retrieve the attribute I'm looking for by this kind of query. The following website seems to have some suggestions as to how to do better, with downloadable samples, but I'm not going to be able to look at it in any detail until this evening.
http://www.techtalkz.com/windows-server-2003/424352-get-samaccountnames-all-users-active-directory-group.html Karl On Tue, May 3, 2011 at 12:12 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Karl, Here is the first round of tests with CONNECTORS-195: Now we are getting all responses as TEQA-DC:DEAD_AUTHORITY.. even with valid users. Please take a look at the 2 bitmap files I have attached (they have the screenshots from the debug screens).

invalid user and invalid domain:
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=fakeuser@fakedomain"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY

invalid user and valid (full) domain name:
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=fakeu...@teqa.filetek.com"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY

valid user and valid domain (please see bitmap file katalay_ad...@teqa.bmp) - this name gets a similar error to the first fakeuser, even though it's a valid user:
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_admin@teqa"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
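For anyone reproducing this thread's lookups, a self-contained JNDI search by sAMAccountName looks roughly like the sketch below. The host, base DN, credentials, and user name are placeholders, not values from this thread, and "simple" authentication is an assumption made to keep the example minimal.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class AdUserLookup {
  public static void main(String[] args) throws NamingException {
    Hashtable<String,String> env = new Hashtable<String,String>();
    env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
    env.put(Context.PROVIDER_URL, "ldap://dc.example.com:389");   // placeholder DC
    env.put(Context.SECURITY_AUTHENTICATION, "simple");
    env.put(Context.SECURITY_PRINCIPAL, "admin@example.com");     // placeholder bind user
    env.put(Context.SECURITY_CREDENTIALS, "password");            // placeholder
    DirContext ctx = new InitialDirContext(env);

    SearchControls ctls = new SearchControls();
    ctls.setSearchScope(SearchControls.SUBTREE_SCOPE);
    ctls.setReturningAttributes(new String[]{"sAMAccountName"});

    // Look for one user object by its sAMAccountName
    NamingEnumeration<SearchResult> answer =
      ctx.search("DC=example,DC=com", "(&(objectClass=user)(sAMAccountName=jdoe))", ctls);
    System.out.println(answer.hasMore() ? "user found" : "user not found");
    ctx.close();
  }
}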
Re: Which version of Solr implements the Document Level Access Control
I went back over these emails. It appears that at no time have you actually received SIDs, either user or group, back from any Authority Connector inquiry:

response to actual domain account call:
C:\OPT\security_example>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_admin@teqa"
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-1-0

I could have sworn that you had seen SIDs other than S-1-1-0 back for existing users on your setup, but I can find no evidence that was ever the case. Given that, it seems perfectly reasonable that the change in CONNECTORS-195 would convert ALL of these responses to USERNOTFOUND ones. Other recent users of the AD connector had no difficulty getting SIDs back, most notably Mr. Abe, who worked closely with me on getting the AD connector working with caching. The conclusion I have is that either your domain controller configuration or your connection credentials/credential permissions are incorrect. (I'd look carefully at the permissions of the account you are giving to the connection, because on the face of it that sounds most likely.) But the fix for non-existent users seems to be right nevertheless, so I will go ahead and commit to trunk. Thanks, Karl On Tue, May 3, 2011 at 7:38 PM, Karl Wright daddy...@gmail.com wrote: Ok, can you try the trunk code? If that works, I'll be shocked. I think something must have changed in your environment since you began this experiment. Karl On Tue, May 3, 2011 at 6:19 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Karl, This is the result from the latest 195 branch.. I'll run it in the debugger to see the actual error messages later on. Is there anyone else who can verify this code against their Active Directory? Thanks Kadri

C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=fakeuser@fakedomain"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=fakeu...@teqa.filetek.com"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_admin@teqa"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_ad...@teqa.filetek.com"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=kata...@teqa.filetek.com"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY

On Tue, May 3, 2011 at 5:48 PM, Karl Wright daddy...@gmail.com wrote: Never seen that before. Do you have more than one instance running? Only one instance can run at a time or the database is unhappy. If that still doesn't seem to be the problem, try "ant clean" and then "ant build" again. It will clean out the existing database instance. Karl On Tue, May 3, 2011 at 5:34 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, You are right, somehow I still had the OLD 195 code.. Just got the latest, compiled, but this one doesn't start after the message "Configuration file successfully read". Any ideas? Thanks Kadri On Tue, May 3, 2011 at 3:12 PM, Karl Wright daddy...@gmail.com wrote: The latest CONNECTORS-195 branch code doesn't use sAMAccountName. It uses ObjectSid. Your schema has ObjectSid. The version of ActiveDirectoryAuthority in trunk looks up ObjectSid too. Indeed, the only change is the addition of the following: if (theGroups.size() == 0) return userNotFoundResponse; This CANNOT occur for an existing user, because all existing users must have at least one SID.
And, if existing users returned the proper SIDs before, this should not change anything. So I cannot see how you could be getting the result you claim. Are you SURE you synched up the CONNECTORS-195 branch and built that? I have not checked this code into trunk yet. Karl On Tue, May 3, 2011 at 2:46 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, Got the latest one, built and tried, but same result.. In the meantime I took a look at my user account with an AD browser, and as you can see (attached) it does have a sAMAccountName attribute. BTW, do we have to use objectClass = user for the search filter? May need to check into this.. Thanks Kadri On Tue, May 3, 2011 at 1:16 PM, Karl Wright daddy...@gmail.com wrote: I tried locating details of DSID-031006E0 on MSDN, to no avail. Microsoft apparently doesn't document this error. But I asked around, and there are two potential avenues forward. Avenue 1: There is a Windows tool called LDP, which should allow you to browse AD's LDAP. What you would need to do is confirm that each user has a sAMAccountName attribute. If they *don't*, it is possible that the domain was not set up in compatibility mode, which means we'll need to find a different attribute
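The SID tokens in these transcripts (S-1-1-0, S-1-5-21-...) are decoded from the binary objectSid/tokenGroups attribute values. A sketch of the standard decoding follows, assuming the usual on-wire layout (revision byte, sub-authority count byte, 48-bit big-endian identifier authority, then little-endian 32-bit sub-authorities); it is illustrative and not copied from the connector's own implementation.

public class SidDecoder {
  // Decode a binary Windows SID into its "S-1-..." string form.
  public static String sidToString(byte[] sid) {
    StringBuilder sb = new StringBuilder("S-");
    sb.append(sid[0] & 0xFF);                  // revision
    int count = sid[1] & 0xFF;                 // number of sub-authorities
    long authority = 0;
    for (int i = 2; i < 8; i++)                // 48-bit big-endian authority
      authority = (authority << 8) | (sid[i] & 0xFFL);
    sb.append('-').append(authority);
    for (int i = 0; i < count; i++) {          // little-endian 32-bit sub-authorities
      long sub = 0;
      for (int j = 3; j >= 0; j--)
        sub = (sub << 8) | (sid[8 + 4*i + j] & 0xFFL);
      sb.append('-').append(sub);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // S-1-1-0 ("Everyone"): revision 1, one sub-authority, authority 1, sub-authority 0
    byte[] everyone = {1, 1, 0,0,0,0,0,1, 0,0,0,0};
    System.out.println(sidToString(everyone)); // prints S-1-1-0
  }
}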
Re: Which version of Solr implements the Document Level Access Control
NameNotFound exception is never being reached because the process is throwing an internal exception, and this is never checked. I see the logging trace; it looks like the ldap code is eating the exception and returning a blank list. This is explicitly NOT what is supposed to happen, nor did it happen on JDK 1.5, I am certain. You might find that this behavior has changed between Java releases. Also, what is the reason for adding the "everyone" group to each response? I added this in because the standard treatment of Active Directory 2000 and 2003 was to exclude the public ACL. Since all users have it, if the user exists (which was the case if the NameNotFound exception was not being thrown), it was always safe to add it in. If JDK xxx, which is eating the internal exception, gives back SOME signal that the user does not exist, we can certainly check for that. What signal do you recommend looking for, based on the trace? Is there any way to get at errEx PartialResultException (id=7962) from the NamingEnumeration answer? Karl On Mon, May 2, 2011 at 3:31 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, I noticed in the code that the NameNotFound exception is never being reached because the process is throwing an internal exception, and this is never checked. (see below) Also, what is the reason for adding the "everyone" group to each response? theGroups.add("S-1-1-0"); When no groups or SIDs are returned, the following return code is still used.. return new AuthorizationResponse(tokens,AuthorizationResponse.RESPONSE_OK); Should I assume this code was tested against an Active Directory and working, or should I start from the beginning, checking every parameter that is entered? (see below) For example, in the following code, DIGEST-MD5 GSSAPI is used for security authentication, but the user name and password are passed as clear text.. and not in the format they suggest in their documentation.
Thanks Kadri
http://download.oracle.com/javase/jndi/tutorial/ldap/security/gssapi.html

if (ctx == null)
{
  // Calculate the ldap url first
  String ldapURL = "ldap://" + domainControllerName + ":389";
  Hashtable env = new Hashtable();
  env.put(Context.INITIAL_CONTEXT_FACTORY,"com.sun.jndi.ldap.LdapCtxFactory");
  env.put(Context.SECURITY_AUTHENTICATION,"DIGEST-MD5 GSSAPI");
  env.put(Context.SECURITY_PRINCIPAL,userName);
  env.put(Context.SECURITY_CREDENTIALS,password);
  //connect to my domain controller
  env.put(Context.PROVIDER_URL,ldapURL);
  //specify attributes to be returned in binary format
  env.put("java.naming.ldap.attributes.binary","tokenGroups objectSid");

fakeuser@teqa

//Search for objects using the filter
NamingEnumeration answer = ctx.search(searchBase, searchFilter, searchCtls);

answer LdapSearchEnumeration (id=6635)
  cleaned false
  cont Continuation (id=6674)
  entries Vector<E> (id=6675)
  enumClnt LdapClient (id=6676)
    authenticateCalled true
    conn Connection (id=6906)
    isLdapv3 true
    pcb null
    pooled false
    referenceCount 1
    unsolicited Vector<E> (id=6907)
  errEx PartialResultException (id=6677)
    cause PartialResultException (id=6677)
    detailMessage [LDAP: error code 10 - 202B: RefErr: DSID-031006E0, data 0, 1 access points\n\tref 1: 'teqa'\n

ArrayList theGroups = new ArrayList();
// All users get certain well-known groups
theGroups.add("S-1-1-0");

answer LdapSearchEnumeration (id=7940)
  cleaned false
  cont Continuation (id=7959)
  entries Vector<E> (id=7960)
  enumClnt LdapClient (id=7961)
  errEx PartialResultException (id=7962)
    cause PartialResultException (id=7962)
    detailMessage [LDAP: error code 10 - 202B: RefErr: DSID-031006E0, data 0, 1 access points\n\tref 1: 'teqa'\n

return new AuthorizationResponse(tokens,AuthorizationResponse.RESPONSE_OK);

On Tue, Apr 26, 2011 at 12:54 PM, Karl Wright daddy...@gmail.com wrote: If a completely unknown user still comes back as existing, then it's time to look at how your domain controller is configured. Specifically, what do you have it configured to trust? What version of Windows is this? The way LDAP tells you a user does not exist in Java is by an exception. So this statement: NamingEnumeration answer = ctx.search(searchBase, searchFilter, searchCtls); will throw the NameNotFoundException if the name doesn't exist, which the Active Directory connector then catches: catch (NameNotFoundException e) { // This means that the user doesn't exist return userNotFoundResponse; } Clearly this is not working at all for your setup. Maybe you can look at the DC's event logs, and see what kinds of decisions it is making here? It's not making much sense to me
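On the question of what signal to look for: with Sun's LDAP provider, a deferred referral failure like the one in this trace typically surfaces only when the enumeration is drained, so one hedged approach is to iterate fully and catch PartialResultException rather than treating an empty enumeration as a missing user. This is a sketch under that assumption; as the thread notes, the behavior can vary between Java releases.

import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.PartialResultException;
import javax.naming.directory.SearchResult;

public class ReferralAwareCheck {
  // Returns true only if the search produced at least one entry; a deferred
  // referral error (e.g. "RefErr: DSID-031006E0") propagates instead of
  // being silently read as "no groups".
  static boolean hasEntries(NamingEnumeration<SearchResult> answer)
    throws NamingException {
    boolean found = false;
    try {
      while (answer.hasMore()) {   // hasMore() is where the provider may
        answer.next();             // throw the deferred PartialResultException
        found = true;
      }
    } catch (PartialResultException e) {
      throw e;                     // referral problem, not USERNOTFOUND
    }
    return found;
  }
}

The other common workaround for LDAP error code 10 is to set env.put(Context.REFERRAL, "follow") so that the provider chases the referral instead of deferring the failure.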
Re: Which version of Solr implements the Document Level Access Control
I opened a ticket, CONNECTORS-195, and added what I think is an explicit check for the existence of the user as a patch. Can you apply the patch and let me know if it seems to fix the problem? Thanks, Karl On Mon, May 2, 2011 at 3:51 PM, Kadri Atalay atalay.ka...@gmail.com wrote: I see, thanks for the response. I'll look into it a little deeper before making a suggestion on how to check for this internal exception.. If the JDK 1.6 behavior is different from JDK 1.5 for LDAP, this may not be the only problem we may encounter.. Maybe any exception generated by the JDK during this request should be evaluated.. We'll see. Thanks. Kadri On Mon, May 2, 2011 at 3:44 PM, Karl Wright daddy...@gmail.com wrote: NameNotFound exception is never being reached because the process is throwing an internal exception, and this is never checked. I see the logging trace; it looks like the ldap code is eating the exception and returning a blank list. This is explicitly NOT what is supposed to happen, nor did it happen on JDK 1.5, I am certain. You might find that this behavior has changed between Java releases. Also, what is the reason for adding the "everyone" group to each response? I added this in because the standard treatment of Active Directory 2000 and 2003 was to exclude the public ACL. Since all users have it, if the user exists (which was the case if the NameNotFound exception was not being thrown), it was always safe to add it in. If JDK xxx, which is eating the internal exception, gives back SOME signal that the user does not exist, we can certainly check for that. What signal do you recommend looking for, based on the trace? Is there any way to get at errEx PartialResultException (id=7962) from the NamingEnumeration answer? Karl On Mon, May 2, 2011 at 3:31 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, I noticed in the code that the NameNotFound exception is never being reached because the process is throwing an internal exception, and this is never checked. (see below) Also, what is the reason for adding the "everyone" group to each response? theGroups.add("S-1-1-0"); When no groups or SIDs are returned, the following return code is still used.. return new AuthorizationResponse(tokens,AuthorizationResponse.RESPONSE_OK); Should I assume this code was tested against an Active Directory and working, or should I start from the beginning, checking every parameter that is entered? (see below) For example, in the following code, DIGEST-MD5 GSSAPI is used for security authentication, but the user name and password are passed as clear text.. and not in the format they suggest in their documentation.
Thanks Kadri
http://download.oracle.com/javase/jndi/tutorial/ldap/security/gssapi.html

if (ctx == null)
{
  // Calculate the ldap url first
  String ldapURL = "ldap://" + domainControllerName + ":389";
  Hashtable env = new Hashtable();
  env.put(Context.INITIAL_CONTEXT_FACTORY,"com.sun.jndi.ldap.LdapCtxFactory");
  env.put(Context.SECURITY_AUTHENTICATION,"DIGEST-MD5 GSSAPI");
  env.put(Context.SECURITY_PRINCIPAL,userName);
  env.put(Context.SECURITY_CREDENTIALS,password);
  //connect to my domain controller
  env.put(Context.PROVIDER_URL,ldapURL);
  //specify attributes to be returned in binary format
  env.put("java.naming.ldap.attributes.binary","tokenGroups objectSid");

fakeuser@teqa

//Search for objects using the filter
NamingEnumeration answer = ctx.search(searchBase, searchFilter, searchCtls);

answer LdapSearchEnumeration (id=6635)
  cleaned false
  cont Continuation (id=6674)
  entries Vector<E> (id=6675)
  enumClnt LdapClient (id=6676)
    authenticateCalled true
    conn Connection (id=6906)
    isLdapv3 true
    pcb null
    pooled false
    referenceCount 1
    unsolicited Vector<E> (id=6907)
  errEx PartialResultException (id=6677)
    cause PartialResultException (id=6677)
    detailMessage [LDAP: error code 10 - 202B: RefErr: DSID-031006E0, data 0, 1 access points\n\tref 1: 'teqa'\n

ArrayList theGroups = new ArrayList();
// All users get certain well-known groups
theGroups.add("S-1-1-0");

answer LdapSearchEnumeration (id=7940)
  cleaned false
  cont Continuation (id=7959)
  entries Vector<E> (id=7960)
  enumClnt LdapClient (id=7961)
  errEx PartialResultException (id=7962)
    cause PartialResultException (id=7962)
    detailMessage [LDAP: error code 10 - 202B: RefErr: DSID-031006E0, data 0, 1 access points\n\tref 1: 'teqa'\n

return new AuthorizationResponse(tokens,AuthorizationResponse.RESPONSE_OK);

On Tue, Apr 26, 2011 at 12:54 PM, Karl Wright daddy...@gmail.com wrote
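For context, a hedged sketch of the shape an explicit user-existence check can take; this illustrates the idea behind the CONNECTORS-195 patch, not the patch itself, and the scope and attribute choices here are assumptions (the thread above shows the scope choice was itself being experimented with).

import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class UserExistenceProbe {
  // Probe the computed search base for a user object before asking for
  // group SIDs; if nothing comes back, the caller can return its
  // userNotFoundResponse instead of an empty-but-"authorized" answer.
  static boolean userExists(DirContext ctx, String searchBase)
    throws NamingException {
    SearchControls ctls = new SearchControls();
    ctls.setSearchScope(SearchControls.SUBTREE_SCOPE);
    ctls.setReturningAttributes(new String[]{"objectSid"});
    NamingEnumeration<SearchResult> probe =
      ctx.search(searchBase, "(objectClass=user)", ctls);
    return probe.hasMore();
  }
}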
Re: Treatment of protected files
If I recall, it treats the 400 response as meaning "this document should be skipped", and it treats the 500 response as meaning "this document should be retried because I have absolutely no idea what happened". However, we could modify the code for the 500 response to look at the content of the response as well, and look for a string in it that would give us a clue, such as TikaException. If we see a TikaException, we could have it conclude that the document should be skipped. That was what I was thinking. Karl On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi. Thank you for your reply. It seems that Solr's ExtractingRequestHandler responds with the same HTTP response (SERVER_ERROR (500)) any time an error occurs. I'll try to open a ticket for Solr. Is it correct that MCF retries crawling when it receives a 500-level response, but not a 400-level response? Thank you. Shinichiro Abe On 2011/04/27, at 14:45, Karl Wright wrote: So the 500 error is occurring because Solr is throwing an exception at indexing time, is that correct? If this is correct, then here's my take. (1) A 500 error is a nasty error that Solr should not be returning under normal conditions. (2) A password-protected PDF is not what I would consider exceptional, so Tika should not be throwing an exception when it sees it, merely (at worst) logging an error and continuing. However, having said that, output connectors in ManifoldCF can make the decision to never retry the document, by returning a certain status, provided the connector can figure out that the error warrants this treatment. My suggestion is therefore the following. First, we should open a ticket for Solr about this. Second, if you can see the error output from the Simple History for a TikaException being thrown in Solr, we can look for that text in the response from Solr and perhaps modify the Solr Connector to detect the case. If you could open a ManifoldCF ticket and include that text I'd be very grateful. Thanks! Karl On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hello. There are PDF and Office files that are protected by a read password. We do not need to read those files if we do not know their password. Now, an MCF job starts to crawl the filesystem repository and post to Solr. Document ingestion of non-protected files succeeds, but ingestion of a protected file does not; it fails repeatedly until the job is processed beyond the retry limit. During that time, a 500 result code is logged in the Simple History. (Solr throws a TikaException, caused by PDFBox or Apache POI, because it cannot read protected documents.) When I ran that test with continuous crawling, not with a simple one-time crawl, the job stopped halfway and logged the following: Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500. The job tried to crawl those files many times. It seems that a job spends a lot of time and resources on protected files, so I want to find a way to skip reading those files quickly. In my survey: hop filters are not relevant (right?). Tika, PDFBox, and POI each have a mechanism to decrypt protected files, but each throws a different exception when given an invalid password. One idea, feasible or not: Solr could return a different result code when protected files are posted. Do you have any ideas? Regards, Shinichiro Abe
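A sketch of the retry decision Karl describes; handleIngestResult, SKIP_DOCUMENT, and RETRY_DOCUMENT are illustrative names, not the actual ManifoldCF output-connector API, and matching on the string "TikaException" in the response body is exactly the heuristic proposed above.

public class IngestRetryPolicy {
  static final int SKIP_DOCUMENT = 0;   // permanent: never retry this document
  static final int RETRY_DOCUMENT = 1;  // transient: retry later

  // 400 means the document itself is bad; 500 normally means "no idea,
  // retry" - unless the response body names a TikaException, in which
  // case the document (e.g. a password-protected PDF) is skipped.
  static int handleIngestResult(int httpCode, String responseBody) {
    if (httpCode == 400)
      return SKIP_DOCUMENT;
    if (httpCode == 500 && responseBody != null
        && responseBody.contains("TikaException"))
      return SKIP_DOCUMENT;
    return RETRY_DOCUMENT;
  }
}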
Re: Which version of Solr implements the Document Level Access Control
So you are trying to extend the example in the book, correct, to run against Active Directory and the JCIFS connector? And this is with Solr 3.1? The book was written for Solr 1.4.1, so it's entirely possible that something in Solr changed in relation to the way search components are used. So I think we're going to need to do some debugging. (1) First, to confirm sanity, try using curl against the mcf authority service. Try some combination of users to see how that works, e.g.:

curl "http://localhost:8345/mcf-authority-service/UserACLs?username=joe"

...and

curl "http://localhost:8345/mcf-authority-service/UserACLs?username=joe@fakedomain"

...and also the real domain name, whatever that is. See if the access tokens that come back look correct. If they don't, then we know where there's an issue. If they *are* correct, let me know and we'll go to the next stage, which would be to make sure the authority service is actually getting called and the proper query is being built and run under Solr 3.1. Thanks, Karl On Tue, Apr 26, 2011 at 11:59 AM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, I followed the instructions, and for testing purposes set stored="true" to be able to see the ACL values stored in Solr. But when I run the search in the following format, I get peculiar results: http://10.1.200.155:8080/solr/select/?q=*%3A*&AuthenticatedUserName=username Any user name without a domain name, i.e. AuthenticatedUserName=joe, does not return any results (which is correct). But any user name with ANY domain name, i.e. AuthenticatedUserName=joe@fakedomain, returns all the indexes (which is not correct). Any thoughts? Thanks Kadri On Sun, Apr 24, 2011 at 7:08 PM, Karl Wright daddy...@gmail.com wrote: Solr 3.1 is being clever here; it's seeing arguments coming in that do not correspond to known schema fields, and presuming they are automatic fields. So when the schema is unmodified, you see these fields that Solr creates for you, with the attr_ prefix. They are created as being stored, which is not good for access tokens since then you will see them in the response. I don't know if they are indexed or not, but I imagine not, which is also not good. So following the instructions is still the right thing to do, I would say. Karl On Fri, Apr 22, 2011 at 3:24 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, There is one thing I noticed while following the example in chapter 4: prior to making any changes to the schema.xml, I was able to see the following security information in query responses, e.g.:

<doc>
  <arr name="attr_allow_token_document">
    <str>TEQA-DC:S-1-3-0</str>
    <str>TEQA-DC:S-1-5-13</str>
    <str>TEQA-DC:S-1-5-18</str>
    <str>TEQA-DC:S-1-5-32-544</str>
    <str>TEQA-DC:S-1-5-32-545</str>
    <str>TEQA-DC:S-1-5-32-547</str>
  </arr>
  <arr name="attr_allow_token_share">
    <str>TEQA-DC:S-1-1-0</str>
    <str>TEQA-DC:S-1-5-2</str>
    <str>TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1480</str>
  </arr>
  <arr name="attr_content">
    <str>Autonomy ODBC Fetch Technical Brief 0506 Technical Brief

But, after I modified the schema.xml and added the following fields,

<!-- Security fields -->
<field name="allow_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="deny_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="allow_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="deny_token_share" type="string" indexed="true" stored="false" multiValued="true"/>

I no longer see either the attr_allow_token_document or the allow_token_document fields..
Since the same fields exist with the attr_ prefix, do we need to add these new field names to the schema file, or can we simply change ManifoldSecurity to use the attr_ fields? Also, when Solr is running under Tomcat, I have to restart the Solr app, or restart Tomcat, to see the newly added indexes.. Any thoughts? Thanks Kadri On Fri, Apr 22, 2011 at 12:53 PM, Karl Wright daddy...@gmail.com wrote: I don't believe Solr has yet officially released document access control, so you will need to use the patch for ticket 1895. Alternatively, the ManifoldCF in Action chapter 4 example has an implementation based on this ticket. You can get the code for it at https://manifoldcfinaction.googlecode.com/svn/trunk/edition_1/security_example. Thanks, Karl On Fri, Apr 22, 2011 at 11:45 AM, Kadri Atalay atalay.ka...@gmail.com wrote: Hello, Does anyone know which version of Solr implements the Document Level Access Control, or has it implemented (partially or fully)? Particularly issue #s 1834, 1872, 1895. Thanks Kadri
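For context on what the search component does with the tokens the authority service returns: it folds them into the query as mandatory allow clauses and prohibited deny clauses. A simplified Lucene 3.x-style sketch follows; the real chapter-4 code also handles documents that carry no security fields at all, which is omitted here, and the share-level fields would get a parallel clause.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class SecurityFilterSketch {
  // Visible documents must carry at least one of the user's allow tokens
  // and none of the deny tokens.
  static BooleanQuery buildDocumentClause(String[] userTokens) {
    BooleanQuery filter = new BooleanQuery();
    BooleanQuery allows = new BooleanQuery();
    for (String token : userTokens)
      allows.add(new TermQuery(new Term("allow_token_document", token)),
                 BooleanClause.Occur.SHOULD);
    filter.add(allows, BooleanClause.Occur.MUST);
    for (String token : userTokens)
      filter.add(new TermQuery(new Term("deny_token_document", token)),
                 BooleanClause.Occur.MUST_NOT);
    return filter;
  }
}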
Re: Which version of Solr implements the Document Level Access Control
If a completely unknown user still comes back as existing, then it's time to look at how your domain controller is configured. Specifically, what do you have it configured to trust? What version of Windows is this? The way LDAP tells you a user does not exist in Java is by an exception. So this statement: NamingEnumeration answer = ctx.search(searchBase, searchFilter, searchCtls); will throw the NameNotFoundException if the name doesn't exist, which the Active Directory connector then catches: catch (NameNotFoundException e) { // This means that the user doesn't exist return userNotFoundResponse; } Clearly this is not working at all for your setup. Maybe you can look at the DC's event logs, and see what kinds of decisions it is making here? It's not making much sense to me at this point. Karl On Tue, Apr 26, 2011 at 12:45 PM, Kadri Atalay atalay.ka...@gmail.com wrote: I get the same result with a user that doesn't exist:

C:\OPT\security_example>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=fakeuser@fakedomain"
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-1-0

BTW, is there a command to get all users available in Active Directory from the mcf-authority service, or other test commands to see if it's working correctly? Also, I set the logging level to finest from the Solr Admin for ManifoldCFSecurityFilter, but I don't see any logs created.. Are there any other settings that need to be tweaked? Thanks Kadri On Tue, Apr 26, 2011 at 12:38 PM, Karl Wright daddy...@gmail.com wrote: One other quick note. You might want to try a user that doesn't exist and see what you get. It should be a USERNOTFOUND response. If that's indeed what you get back, then this is a relatively minor issue with Active Directory. Basically the S-1-1-0 SID is added by the active directory authority, so the DC is actually returning an empty list of SIDs for the user with an unknown domain. It *should* tell us the user doesn't exist, I agree, but that's clearly a problem only Active Directory can solve; we can't make that decision in the active directory connector because the DC may be just one node in a hierarchy. Perhaps there's a Microsoft knowledge-base article that would clarify things further. Please let me know what you find. Karl On Tue, Apr 26, 2011 at 12:27 PM, Karl Wright daddy...@gmail.com wrote: The method code from the Active Directory authority that handles the LDAP query construction is below. It looks perfectly reasonable to me:

/** Parse a user name into an ldap search base. */
protected static String parseUser(String userName)
  throws ManifoldCFException
{
  //String searchBase = "CN=Administrator,CN=Users,DC=qa-ad-76,DC=metacarta,DC=com";
  int index = userName.indexOf("@");
  if (index == -1)
    throw new ManifoldCFException("Username is in unexpected form (no @): '"+userName+"'");
  String userPart = userName.substring(0,index);
  String domainPart = userName.substring(index+1);
  // Start the search base assembly
  StringBuffer sb = new StringBuffer();
  sb.append("CN=").append(userPart).append(",CN=Users");
  int j = 0;
  while (true)
  {
    int k = domainPart.indexOf(".",j);
    if (k == -1)
    {
      sb.append(",DC=").append(domainPart.substring(j));
      break;
    }
    sb.append(",DC=").append(domainPart.substring(j,k));
    j = k+1;
  }
  return sb.toString();
}

So I have to conclude that your Active Directory domain controller is simply not caring what the DC= fields are, for some reason. No idea why.
If you want to confirm this picture, you might want to create a patch to add some Logging.authorityConnectors.debug statements at appropriate places so we can see the actual query it's sending to LDAP. I'm happy to commit this debug output patch eventually if you also want to create a ticket. Thanks, Karl On Tue, Apr 26, 2011 at 12:17 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Yes, ManifoldCF is running with the JCIFS connector, and using Solr 3.1.

response to first call:
C:\OPT\security_example>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=joe"
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY

response to fake domain call:
C:\OPT\security_example>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=joe@fakedomain"
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-1-0

response to actual domain account call:
C:\OPT\security_example>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_admin@teqa"
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-1-0

Looks like as long as there is a domain suffix, the return is positive.. Thanks Kadri On Tue, Apr 26, 2011 at 12:10 PM, Karl Wright daddy...@gmail.com wrote: So you are trying to extend the example in the book, correct, to run against Active Directory and the JCIFS connector
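A sketch of what the suggested debug statements might look like around the search call quoted above; Logging.authorityConnectors is the logger Karl names, but the exact placement and message text here are assumptions, not the committed patch.

// Inserted just before the ctx.search(...) call shown earlier.
if (Logging.authorityConnectors.isDebugEnabled())
{
  Logging.authorityConnectors.debug("AD user lookup: search base = '" + searchBase + "'");
  Logging.authorityConnectors.debug("AD user lookup: search filter = '" + searchFilter + "'");
}
NamingEnumeration answer = ctx.search(searchBase, searchFilter, searchCtls);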
Re: Treatment of protected files
So the 500 error is occurring because Solr is throwing an exception at indexing time, is that correct? If this is correct, then here's my take. (1) A 500 error is a nasty error that Solr should not be returning under normal conditions. (2) A password-protected PDF is not what I would consider exceptional, so Tika should not be throwing an exception when it sees it, merely (at worst) logging an error and continuing. However, having said that, output connectors in ManifoldCF can make the decision to never retry the document, by returning a certain status, provided the connector can figure out that the error warrants this treatment. My suggestion is therefore the following. First, we should open a ticket for Solr about this. Second, if you can see the error output from the Simple History for a TikaException being thrown in Solr, we can look for that text in the response from Solr and perhaps modify the Solr Connector to detect the case. If you could open a ManifoldCF ticket and include that text I'd be very grateful. Thanks! Karl On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hello. There are PDF and Office files that are protected by a read password. We do not need to read those files if we do not know their password. Now, an MCF job starts to crawl the filesystem repository and post to Solr. Document ingestion of non-protected files succeeds, but ingestion of a protected file does not; it fails repeatedly until the job is processed beyond the retry limit. During that time, a 500 result code is logged in the Simple History. (Solr throws a TikaException, caused by PDFBox or Apache POI, because it cannot read protected documents.) When I ran that test with continuous crawling, not with a simple one-time crawl, the job stopped halfway and logged the following: Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500. The job tried to crawl those files many times. It seems that a job spends a lot of time and resources on protected files, so I want to find a way to skip reading those files quickly. In my survey: hop filters are not relevant (right?). Tika, PDFBox, and POI each have a mechanism to decrypt protected files, but each throws a different exception when given an invalid password. One idea, feasible or not: Solr could return a different result code when protected files are posted. Do you have any ideas? Regards, Shinichiro Abe