Re: [VOTE] Release Apache ManifoldCF 2.0.2, RC0
Local tests are running fine, but there is a problem with a table which is not properly installed on our Resin deployment server. I guess the following command should install the needpriority table? No errors are shown when running this command:

$MCF_HOME/executecommand.sh org.apache.manifoldcf.agents.Install

But the following does not register properly:

$MCF_HOME/executecommand.sh org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent

ERROR: column needpriority does not exist

The complete output of the first command is shown here (the output from the other command, with the stack trace, follows it):

Configuration file successfully read
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:host.name=solr-test02.uio.no
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.7.0_75
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Oracle Corporation
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usit/solr-test02/www/java/jdk1.7.0_75-x86_64/jre
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=.:../lib/mcf-core.jar:../lib/mcf-agents.jar:../lib/mcf-pull-agent.jar:../lib/hsqldb-2.3.2.jar:../lib/postgresql-9.1-901.jdbc4.jar:../lib/commons-codec-1.9.jar:../lib/commons-collections-3.2.1.jar:../lib/commons-discovery-0.5.jar:../lib/commons-el-1.0.jar:../lib/commons-fileupload-1.2.2.jar:../lib/commons-io-2.1.jar:../lib/commons-lang-2.6.jar:../lib/commons-logging-1.1.3.jar:../lib/ecj-4.3.1.jar:../lib/httpclient-4.3.5.jar:../lib/httpcore-4.3.2.jar:../lib/jasper-6.0.35.jar:../lib/jasper-el-6.0.35.jar:../lib/javax.servlet-api-3.1.0.jar:../lib/json-20090211.jar:../lib/json-simple-1.1.jar:../lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:../lib/juli-6.0.35.jar:../lib/log4j-1.2.16.jar:../lib/mail-1.4.5.jar:../lib/serializer-2.7.1.jar:../lib/slf4j-api-1.7.7.jar:../lib/slf4j-simple-1.7.7.jar:../lib/velocity-1.7.jar:../lib/xalan-2.7.1.jar:../lib/xercesImpl-2.10.0.jar:../lib/xml-apis-1.4.01.jar:../lib/zookeeper-3.4.6.jar:
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/local/opt/oraclient10.2/product/10.2.0/lib::/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=NA
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.version=2.6.18-400.1.1.el5
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.name=resin
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.home=/www/home/resin
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/usit/solr-test02/www/var/data/mcf/mcf-1/conf
[main] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=localhost:8349 sessionTimeout=2000 watcher=org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection$ZooKeeperWatcher@37b335b7
[main-SendThread(localhost.localdomain:8349)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost.localdomain/127.0.0.1:8349. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(localhost.localdomain:8349)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to localhost.localdomain/127.0.0.1:8349, initiating session
[main-SendThread(localhost.localdomain:8349)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server localhost.localdomain/127.0.0.1:8349, sessionid = 0x14b3573590c001a, negotiated timeout = 2
Agent tables installed
[Shutdown thread] INFO org.apache.zookeeper.ZooKeeper - Session: 0x14b3573590c001a closed
[main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down

Here's the stack trace from the other command:

org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (42703): ERROR: column needpriority does not exist
    at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
    at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
    at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
    at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
    at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
    at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
    at
Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
I can check for a possible network problem by using file-based synchronization instead. I'll do that right away and test RC2 as well, even though you already have three +1's. The three other jobs I started before I left my office on Thursday all completed successfully.

Erlend

On 19.09.14 12:27, Karl Wright wrote:
Well, it's crawled fine overnight, with no issues whatsoever. I'm using a Zookeeper setup, with MCF 1.7.1 RC1. I still maintain you've got something broken with the network in your production machine.

Karl

On Thu, Sep 18, 2014 at 5:31 PM, Karl Wright daddy...@gmail.com wrote:
Well, FWIW it is still crawling perfectly. I'll let it run until done.

Karl

On Thu, Sep 18, 2014 at 5:29 PM, Erlend Fedt Garåsen e.f.gara...@usit.uio.no wrote:
I know. I spent a lot of time creating the rules, which seem to index what we really want. Your observation is correct. Crawling Dspace repositories is very difficult. There are a lot of nonsense pages we need to filter out. We have crawled this host for the last two years using file-based synchronization. I'm planning a new approach, i.e. using a connector etc.

E

On 18. sep. 2014, at 22:35, Karl Wright daddy...@gmail.com wrote:
Ok, I started this crawl. It fetched and processed robots.txt perfectly. And then I saw the following: lots of fetches of fairly good-sized documents, with very few ingestions. The documents that did not ingest look like this:

https://www.duo.uio.no/handle/10852/163/discover?order=DESCr...pp=100sort_by=dc.date.issued_dt

I think your index inclusion rules may be excluding most of the content.

Karl

On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright daddy...@gmail.com wrote:
Thanks -- I will probably not be able to get to this further until tonight anyhow.

Karl
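Karl's observation about the un-ingested URL can be checked against the "Exclude from index" rules posted in this thread. The following is a minimal sketch, assuming the rules are applied with "find" semantics (a match anywhere in the URL); that is an assumption about how the web connector evaluates them, and Python's `re` is used here in place of Java's regex engine. The reported URL is elided in the thread, so only its visible prefix is tested, and `item_url` is a hypothetical item page.

```python
import re

# "Exclude from index" patterns as posted in the job settings in this thread.
EXCLUDE_FROM_INDEX = [
    r"https?://www\.duo\.uio\.no/$",
    r"https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+",
    r"https://www\.duo\.uio\.no/handle/\d+/\d+/.+",
    r"https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$",
    r"https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$",
]


def matching_rules(url, patterns=EXCLUDE_FROM_INDEX):
    """Return every pattern that matches somewhere in the URL (assumed semantics)."""
    return [p for p in patterns if re.search(p, url)]


# Visible prefix of the URL Karl reported; the remainder is elided in the thread.
navigation_url = "https://www.duo.uio.no/handle/10852/163/discover?order=DESC"
# Hypothetical item page, used only for contrast.
item_url = "https://www.duo.uio.no/handle/10852/12345"

for url in (navigation_url, item_url):
    print(url, "->", matching_rules(url) or "not excluded from index")
```

Under these assumptions, the reported URL is caught by the "new URL structure" navigation rule (`/handle/\d+/\d+/.+`), which would explain why it is fetched and followed but not ingested.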
Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
I'm able to fetch documents from www.duo.uio.no using file-based synchronization, so there are no network problems. Anyway, I'll continue to test RC2. Even though I'm not able to use Zookeeper-based synchronization on that host, I may find other bugs/problems.

Erlend
Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
Even though Zookeeper is running on the same machine? I'm planning to investigate this issue further by using tcpdump. I have already turned on DEBUG logging, but nothing suspicious is showing up in my logs. This machine is on a very strict network, and that may cause these problems, but it's strange that all the other jobs are working perfectly.

Erlend

On 22.09.14 12:26, Karl Wright wrote:
Hi Erlend, What I think you might want to look for, network-wise, are periods of significant packet loss. Normally your server seems to have no trouble talking to either zookeeper or the external network, but periodically, it seems to lose that ability for times of at least 20 seconds. It could be bad hardware, it could be routing, hard to tell. What I'd suggest to prove this is to set up a long-running ping, e.g. ping -n 1, from that machine to the server that zookeeper is running on, and then do a crawl. I will wager, well, quite a lot of money, that you will see periods of packet loss. ;-)

Karl
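The long-running check Karl suggests could also be approximated without ICMP. The sketch below is not the ping he describes: it swaps in a plain TCP connect probe against the ZooKeeper port, then looks for runs of failures of at least 20 seconds. The function names, the one-second interval, and the one-hour duration are all illustrative assumptions.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_windows(samples, min_seconds=20.0):
    """Given (timestamp, ok) samples in time order, return (start, end) spans
    of consecutive failed probes that last at least min_seconds."""
    windows, start, last_fail = [], None, None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        else:
            if start is not None and last_fail - start >= min_seconds:
                windows.append((start, last_fail))
            start = None
    # A run of failures may still be open when the samples end.
    if start is not None and last_fail - start >= min_seconds:
        windows.append((start, last_fail))
    return windows


def monitor(host, port, duration=3600.0, interval=1.0):
    """Probe once per interval for the given duration and report long outages."""
    samples = []
    deadline = time.time() + duration
    while time.time() < deadline:
        samples.append((time.time(), probe(host, port)))
        time.sleep(interval)
    return outage_windows(samples)
```

Running something like `monitor("localhost", 8349)` during a crawl (the port is taken from the connectString shown earlier in this archive) would list any spans where the ZooKeeper port was unreachable for 20 seconds or more. A TCP probe only shows reachability of that port, so tcpdump or a real ping remains the better tool for measuring packet loss itself.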
Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
I tried to restart the job dealing with www.duo.no on our test server, but it does not seem to touch the robots.txt file at all. That's the reason why it's able to continue. Both servers are set up to obey the rules of such files.

Erlend

On 18.09.14 11:12, Erlend Garåsen wrote:
I'm facing the same problems with robots.txt files using RC1, so maybe this is another issue we have to fix. Can you please try to fetch the host below? For some odd reason, it seems that MCF on our test server can handle it.

This is exactly what happened when I started MCF (referring to my previous post) after I had deployed RC1:

09-18-2014 11:02:14.400 robots parse https:www.duo.uio.no:443 ERRORS 0 3 Unknown robots.txt line: ''

No activity after this error. Here's the robots.txt file:
https://www.duo.uio.no/robots.txt

This is the content of manifoldcf.log after the startup:

WARN 2014-09-18 11:02:14,401 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': ''
WARN 2014-09-18 11:02:14,401 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': 'The contents of this file are subject to the license and copyright'
WARN 2014-09-18 11:02:14,402 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': 'detailed in the LICENSE and NOTICE files at the root of the source'
WARN 2014-09-18 11:02:14,402 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': 'tree and available online at'
WARN 2014-09-18 11:02:14,402 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': ' http://www.dspace.org/license/'
WARN 2014-09-18 11:02:14,402 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': ''

E

On 18.09.14 03:12, Karl Wright wrote:
Please vote on whether to release Apache ManifoldCF 1.7.1, RC1. This release fixes a number of critical issues, as well as a number of user priorities, most notably:

- A bad Zookeeper support issue, which made locking support fail when Zookeeper connections got lost and then restored;
- The Alfresco connector, which was nonfunctional in both MCF 1.6 and 1.7;
- Solr Cloud support, which had ceased working due to changes to SolrJ;
- Non-null connector components caused failure;
- PostgreSQL queries not performing well.

The complete list of included fixes can be found at:
https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7.1-RC1/CHANGES.txt

The release candidate can be downloaded from:
http://people.apache.org/~kwright/apache-manifoldcf-1.7.1

There is a tag at:
https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7.1-RC1

Thanks,
Karl
Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
MCF should handle invalid robots.txt files. We cannot rely on what people have entered into such files. So I guess MCF should just ignore invalid robots.txt files. I guess it already does. It seems invalid due to use of the = symbol instead of a #. I'm not an expert on such files, so I'm not completely sure.

E

On 18.09.14 12:04, Karl Wright wrote:
Hi Erlend, Your robots file has this at the top:

The contents of this file are subject to the license and copyright
detailed in the LICENSE and NOTICE files at the root of the source
tree and available online at
http://www.dspace.org/license/

That's fine except to the best of my knowledge the robots spec does not allow for comments at all. If you have reason to believe that has changed, then please point me at a reference and I can change our robots parser.

Thanks,
Karl

On Thu, Sep 18, 2014 at 6:02 AM, Karl Wright daddy...@gmail.com wrote:
Hi Erlend, MCF caches the robots.txt file in the database, which it considers valid for 1 hour. I'll look at the logs and thread dump and let you know if this is a locking issue or something else. Please stand by.

Karl
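To illustrate the "just ignore invalid lines" behaviour discussed above, here is a small sketch using Python's standard `urllib.robotparser`, which silently skips lines it does not recognize. It is not MCF's parser and says nothing about how MCF behaves. The header lines mirror the ones MCF warned about; the `User-agent` and `Disallow` directives are hypothetical, since the thread does not quote the rest of the file.

```python
from urllib.robotparser import RobotFileParser

# Header lines as they appear in the MCF warnings; the directives below them
# are hypothetical and only there to show that parsing continues past the header.
ROBOTS_TXT = """\

The contents of this file are subject to the license and copyright
detailed in the LICENSE and NOTICE files at the root of the source
tree and available online at
 http://www.dspace.org/license/

User-agent: *
Disallow: /discover
Disallow: /search-filter
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # unrecognized lines are skipped, not fatal


def allowed(url, agent="*"):
    """Return True if the parsed rules permit the agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("https://www.duo.uio.no/handle/10852/163"))
print(allowed("https://www.duo.uio.no/discover?query=x"))
```

The point of the sketch is only that a tolerant parser can log or drop the unknown header lines and still honour the directives that follow.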
Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
On 18.09.14 13:00, Karl Wright wrote:
Hi Erlend, please can you also add the manifoldcf log as well?

Yes, I will, but it includes entries from RC0 as well. MCF works perfectly using the other jobs for the other hosts.

Take a look at the following once again. MCF is being interrupted:

INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException| Interrupted: Interrupted: null

You can find this entry near the other regarding the robots.txt file:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log

Erlend
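For anyone sifting through manifoldcf.log for similar entries, the pipe-delimited FETCH record above can be split mechanically. This is a sketch only: the field names are guesses based on this single line, not a documented log format. In particular, reading `1411030940209+682605` as "start time in epoch milliseconds plus elapsed milliseconds" is an assumption, although the sum does land on 09:13:42 UTC, which would agree with the 11:13:42 log timestamp if the server clock is on CEST.

```python
from datetime import datetime, timezone

# The record exactly as it appears after "WEB: FETCH " in the log line above.
RECORD = ("URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|"
          "org.apache.manifoldcf.core.interfaces.ManifoldCFException|"
          " Interrupted: Interrupted: null")


def parse_fetch_record(record):
    """Split a FETCH record into named fields (field names are assumptions)."""
    kind, url, timing, code, size, exc_class, message = record.split("|", 6)
    start_ms, elapsed_ms = (int(part) for part in timing.split("+"))
    finished_ms = start_ms + elapsed_ms
    return {
        "kind": kind,
        "url": url,
        "start_ms": start_ms,
        "elapsed_ms": elapsed_ms,
        "finished_ms": finished_ms,
        # Whole seconds only, to keep the conversion exact.
        "finished_utc": datetime.fromtimestamp(finished_ms // 1000, tz=timezone.utc),
        "code": int(code),
        "bytes": int(size),
        "exception": exc_class,
        "message": message.strip(),
    }


entry = parse_fetch_record(RECORD)
print(entry["code"], entry["elapsed_ms"], entry["finished_utc"].isoformat())
```

Filtering parsed entries on `code == -104` and sorting them by `start_ms` is one way to see whether the interruptions cluster right after the robots.txt parse.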
Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
I tried to fetch documents by using curl from our prod server just in case a webmaster had blocked access. No problem. Maybe I should ask the webmaster of that host anyway, just to be sure.

The interrupted message may have been caused by an abort of that job. I think I should just stop the problematic job and start the three remaining jobs instead. I bet they will all complete.

Ideally we shouldn't crawl www.duo.uio.no at all since it's a Dspace resource. I have just contacted someone who is indexing Dspace resources. I guess a Dspace connector is a better approach.

Below you'll find some parameters.

REPOSITORY CONNECTION
Throttling - max connections: 30
Throttling - max fetches/min: 100
Bandwidth - max connections: 25
Bandwidth - max kbytes/sec: 8000
Bandwidth - max fetches/min: 20

JOB SETTINGS
Hop filters: Keep forever
Seeds: https://www.duo.uio.no/

Exclude from crawl:
# Exclude some file types:
\.gif$
\.GIF$
\.jpeg$
\.JPEG$
\.jpg$
\.JPG$
\.png$
\.PNG$
\.mpg$
\.MPG$
\.mpeg$
\.MPEG$
\.exe$
\.bmp$
\.BMP$
\.mov$
\.MOV$
\.wmf$
\.css$
\.ico$
\.ICO$
\.mp2$
\.mp3$
\.mp4$
\.wmv$
\.tif$
\.tiff$
\.avi$
\.ogg$
\.ogv$
\.zip$
\.gz$
\.psd$
# TIKA-1011
\.mhtml$
# Exclude log files:
\.log$
\.logfile$
# In general, indexing of DUO search results is not allowed:
https?://www\.duo\.uio\.no/sok/search.*
# Other elements in DUO to be excluded:
https://www\.duo\.uio\.no.*open-search/description\.xml$
https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
# Skip locale settings - makes duplicates:
https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
# Temporarily skip PDFs since we are indexing abstracts:
https://www\.duo\.uio\.no/bitstream/handle/.+
# skip full item record:
https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
# new URL structure:
https://www\.duo\.uio\.no/handle/.*\?show=full$
# Skip all navigations but start with letter:
https://www\.duo\.uio\.no/.*type=(author|dateissued)$
# Skip search:
#https://www\.duo\.uio\.no/handle/.*/discover\?.*
https://www\.duo\.uio\.no/handle/.*search-filter\?.*
# new URL structure:
https://www\.duo\.uio\.no/discover\?.*
https://www\.duo\.uio\.no/search-filter\?.*
# Skip statistics:
https://www\.duo\.uio\.no/handle/.*/statistics$

Exclude from index:
# Exclude front page - no valuable info and we have QL:
https?://www\.duo\.uio\.no/$
# Do not index navigation, but follow:
https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
# new URL structure:
https://www\.duo\.uio\.no/handle/\d+/\d+/.+
# Exclude id's lower than four, probably category listing:
https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
# new URL structure:
https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$

Thanks for looking at this!

BTW: Within an hour, I will be away from my computer and cannot test anymore until Monday. I'm leaving Oslo for some days, but I will still be able to read and answer emails.

Erlend

On 18.09.14 13:43, Karl Wright wrote:
Hi Erlend, The Interrupted: null message with a -104 code means only that the fetch was interrupted by something. Unfortunately, the message is not clear about what the cause of the interruption is. This is unrelated to Zookeeper; but I agree that it is suspicious that many such interruptions appear right after robots is parsed.

One cause of a -104 is when the target server forcibly drops the connection, so an InterruptedIOException is thrown. Having a look at the timestamps for the fetch messages, it looks believable that you might have exceeded some predetermined limit on that machine. They're all within a few milliseconds of each other. When a robots file needs to be read, ManifoldCF creates an event for that, and the urls blocked by that event will all be 'fetchable' as soon as the event is released. Perhaps your throttling needs to be adjusted now that the rate limit bug has been fixed?

I won't be able to work with this without at least your crawling parameters for the server in question. I can ping that server so if you would like I can try crawling that server from here.

For zookeeper, I would still try to either increase your tick count to maybe 1, or better yet, find out why you periodically lose the ability to transmit pings from MCF to your zookeeper process.

Thanks,
Karl
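Karl's point about fetches starting "within a few milliseconds of each other" once the robots event is released can be checked against the configured limit of 20 fetches per minute. The sketch below is a generic sliding-window count over fetch start times; it is not how ManifoldCF throttles internally, and the sample timestamps are made up for illustration.

```python
from collections import deque


def max_fetches_in_window(start_times_ms, window_ms=60_000):
    """Return the largest number of fetch starts that fall inside any
    sliding window of window_ms milliseconds."""
    window = deque()
    worst = 0
    for ts in sorted(start_times_ms):
        window.append(ts)
        # Drop starts that are a full window or more older than the newest one.
        while ts - window[0] >= window_ms:
            window.popleft()
        worst = max(worst, len(window))
    return worst


# Hypothetical burst: 25 fetches released within 25 ms, then one a minute later.
burst = list(range(1_000, 1_025)) + [70_000]
limit_per_minute = 20  # "max fetches/min: 20" from the settings in this thread

peak = max_fetches_in_window(burst)
print(peak, "fetch starts in the busiest minute; limit exceeded:", peak > limit_per_minute)
```

Feeding it the start times taken from the FETCH entries in manifoldcf.log would show whether the burst after the robots parse is large enough to trip a per-minute limit on the target server.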
Re: Bug-fix release 1.7.1
+1 As an exception to the rule, I will deploy a patched version on our production server just to be sure that we have fixed the problem. For some reason, I'm not able to reproduce the Zookeeper problem on our test server, so I'll go ahead on our prod server instead. I'll let you know whether this solved our problem. Erlend On 16.09.14 23:21, Karl Wright wrote: I think we're going to need a bug fix release for 1.7. Specifically, we need the fixes for CONNECTORS-1031, as well as the fix for the Axis classpath that allows the Alfresco connector to work. Please let me know if you agree that a point fix is warranted. Thanks, Karl
Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC0
I got the following on both my test and prod servers. The error also shows up in simple history: Error: KeeperErrorCode = NoNode for /org.apache.manifoldcf.locks-_Cache_OUTPUTCONNECTION_Solr/read-0001039554 I guess it is related to the shutdown process - either when I stopped the Resin instance or the Agent. Just mentioning. BTW, for some reason I had to restart the job on our test server. Our prod server hangs at the moment, so I will try to restart everything once again. The applied patch works well, but version 1.7.1 seems to be tricky. I'll publish a thread dump and my logs if I don't manage to get MCF running on our prod server before I leave the office. This is what I can see in manifoldcf.log: WARN 2014-09-17 14:06:52,228 (Shutdown thread) - Exception tossed on repository connector pool cleanup: KeeperErrorCode = NoNode for /org.apache.manifoldcf.serviceactive-_REPOSITORYCONNECTORPOOL_Web-_ANON_5 org.apache.manifoldcf.core.interfaces.ManifoldCFException: KeeperErrorCode = NoNode for /org.apache.manifoldcf.serviceactive-_REPOSITORYCONNECTORPOOL_Web-_ANON_5 at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.handleKeeperException(ZooKeeperConnection.java:941) at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.deleteNode(ZooKeeperConnection.java:155) at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager.endServiceActivity(ZooKeeperLockManager.java:478) at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.releaseAll(ConnectorPool.java:735) at org.apache.manifoldcf.core.connectorpool.ConnectorPool.closeAllConnectors(ConnectorPool.java:381) at org.apache.manifoldcf.crawler.repositoryconnectorpool.RepositoryConnectorPool.closeAllConnectors(RepositoryConnectorPool.java:144) at org.apache.manifoldcf.crawler.system.ManifoldCF.localCleanup(ManifoldCF.java:110) at org.apache.manifoldcf.crawler.system.CrawlerAgent.cleanUp(CrawlerAgent.java:105) at org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:171) at 
org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:386) at org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1295) at org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1483) Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /org.apache.manifoldcf.serviceactive-_REPOSITORYCONNECTORPOOL_Web-_ANON_5 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873) at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.deleteNode(ZooKeeperConnection.java:150) ... 10 more On 17.09.14 13:35, Karl Wright wrote: Please vote on whether to release Apache ManifoldCF 1.7.1, RC0. This release fixes a number of critical issues, as well as a number of user priorities, most notably: - A bad Zookeeper support issue, which made locking support fail when Zookeeper connections got lost and then restored; - The Alfresco connector, which was nonfunctional in both MCF 1.6 and 1.7; - Solr Cloud support, which had ceased working due to changes to SolrJ; - Non-null connector components caused failure; - PostgreSQL queries not performing well. The complete list of included fixes can be found at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7.1-RC0/CHANGES.txt The release candidate can be downloaded from: http://people.apache.org/~kwright/apache-manifoldcf-1.7.1 There is a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7.1-RC0 Thanks, Karl
Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC0
On 17.09.14 14:55, Karl Wright wrote: Hi Erlend, Yes, this is shutdown related. The patch file did not include the fix for this particular problem. The release candidate, however, does. This is not from the patch, but from 1.7.1. I just meant to say that I did not have any problems using the patch. The thread dump is included in my stdout log file, since the output of kill -3 was placed there. Please note that it is at THE END of that file. I'm in a hurry, so I didn't have time to delete all the other irrelevant entries. Sorry about that: http://folk.uio.no/erlendfg/manifoldcf/mcf_agent.stdout.log I'll try to restart everything and get MCF up and running. Runs fine on our test server, but not on prod. I'll get back to this. E
Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC0
Both servers are running now. Not sure what caused the problems on prod. The only thing I did differently was to do a lock clean on prod prior to startup. I'll keep both servers up and running for 24 hours and vote thereafter. Erlend On 17.09.14 15:05, Erlend Garåsen wrote: On 17.09.14 14:55, Karl Wright wrote: Hi Erlend, Yes, this is shutdown related. The patch file did not include the fix for this particular problem. The release candidate, however, does. This is not from the patch, but from 1.7.1. I just meant to say that I did not have any problems using the patch. The thread dump is included in my stdout log file, since the output of kill -3 was placed there. Please note that it is at THE END of that file. I'm in a hurry, so I didn't have time to delete all the other irrelevant entries. Sorry about that: http://folk.uio.no/erlendfg/manifoldcf/mcf_agent.stdout.log I'll try to restart everything and get MCF up and running. Runs fine on our test server, but not on prod. I'll get back to this. E
Re: [VOTE] Release Apache ManifoldCF 1.7 RC2
+1 - Deployed binary dist on Caucho Resin on Linux and ran: - a huge crawl using FileLockManager - Built source dist on OS X and: - Ran single-process version under example directory - Ran ant uitest and test Erlend On 20.08.14 02:58, Mingchun Zhao wrote: Hi all, Please vote on whether to release the ManifoldCF, version 1.7, RC2. RC2 included the following changes from RC1. CONNECTORS-1009: Fix CMIS connector again, to handle typical case where a new version is a new node. CONNECTORS-1011: Upgrade to httpclient 4.3.5. CONNECTORS-1012: Upgrade xmlbeans and POI to fix various CVE's. You can find the artifact at: http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC2 There is also a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC2 Vote will remain open at least 72 hours. Thanks! Mingchun Zhao
Re: [VOTE] Release Apache ManifoldCF 1.7 RC0
On 12.08.14 05:13, Mingchun Zhao wrote: Hi all, Please vote on whether to release the ManifoldCF, version 1.7, RC0. You can find the artifact at: http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0 There is also a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0 Vote will remain open at least 72 hours. Thanks! Mingchun Zhao
Re: [VOTE] Release Apache ManifoldCF 1.7 RC0
-1 All my first tests pass, but I think I found a blocker when I ran the last one. By running MCF using FileLockManager, I'm getting the following error and MCF just tries to run this task over and over again. My synch folder now contains a lot of files and it still grows. I think MCF should handle long URLs and just strip the length of the filename if it becomes too large. INFO 2014-08-15 09:30:54,485 (Worker thread '9') - WEB: FETCH URL|https://www.journals.uio.no/index.php/nordina/search/advancedResults?subject=effective%20continuing%20professional%20development%2C%20authentic%20and%20entrepreneurial%20learning%2C%20science%20and%20technology%20education|1408087853848+633|200|15735| WARN 2014-08-15 09:30:54,609 (Worker thread '9') - Attempt to set file lock '/www/var/data/mcf/mcf-1/conf/../data/synchdir/948/350/lock-Solr58!https58!47!47!www.journals.uio.no58!44347!index.php47!nordina47!search47!advancedResults?subject61!effective%20continuing%20professional%20development%2C%20authentic%20and%20entrepreneurial%20learning%2C%20science%20and%20technology%20education.lock' failed: File name too long java.io.IOException: File name too long at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.createNewFile(File.java:947) at org.apache.manifoldcf.core.lockmanager.FileLockObject.grabFileLock(FileLockObject.java:221) at org.apache.manifoldcf.core.lockmanager.FileLockObject.obtainGlobalWriteLockNoWait(FileLockObject.java:77) at org.apache.manifoldcf.core.lockmanager.LockObject.obtainGlobalWriteLock(LockObject.java:121) at org.apache.manifoldcf.core.lockmanager.LockObject.enterWriteLock(LockObject.java:74) at org.apache.manifoldcf.core.lockmanager.LockGate.enterWriteLock(LockGate.java:177) at org.apache.manifoldcf.core.lockmanager.BaseLockManager.enter(BaseLockManager.java:1473) at org.apache.manifoldcf.core.lockmanager.BaseLockManager.enterLocks(BaseLockManager.java:803) at 
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$OutputAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3329) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3051) On 12.08.14 05:13, Mingchun Zhao wrote: Hi all, Please vote on whether to release the ManifoldCF, version 1.7, RC0. You can find the artifact at: http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0 There is also a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0 Vote will remain open at least 72 hours. Thanks! Mingchun Zhao
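To make the suggestion about stripping over-long lock file names concrete, here is a minimal Java sketch of one way it could be done: keep a readable prefix and append a digest of the full name so that two long URLs sharing a prefix still get different files. This is only an illustration; the class LockFileNames is hypothetical, the 255-byte limit is an assumption about the file system, and the real FileLockManager may well solve this differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper; ManifoldCF's FileLockManager may handle this differently. */
final class LockFileNames {
    // Assumption: 255 bytes per path component, the usual limit on Linux file systems.
    static final int MAX_NAME_BYTES = 255;

    /** Returns the name unchanged if it fits, otherwise a truncated prefix plus a SHA-256 suffix. */
    static String shorten(String name) {
        byte[] raw = name.getBytes(StandardCharsets.UTF_8);
        if (raw.length <= MAX_NAME_BYTES) {
            return name; // short names are left untouched
        }
        String suffix = "-" + sha256Hex(raw) + ".lock"; // 1 + 64 + 5 ASCII bytes
        int budget = MAX_NAME_BYTES - suffix.length();
        StringBuilder prefix = new StringBuilder();
        int used = 0;
        // Copy whole code points so a multi-byte character is never cut in half.
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            int len = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
            if (used + len > budget) {
                break;
            }
            prefix.appendCodePoint(cp);
            used += len;
            i += Character.charCount(cp);
        }
        return prefix + suffix;
    }

    private static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

The digest is computed over the full original name, so the mapping is deterministic across processes, which matters when several agents must agree on the same lock file.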
Re: [VOTE] Release Apache ManifoldCF 1.7 RC0
Another thing. It's not possible to abort the job due to this problem. LockManager still tries to set locks over and over again. It's not just the previous URL/filename I entered, but several others: WARN 2014-08-15 10:07:46,178 (Worker thread '31') - Attempt to set file lock '/www/var/data/mcf/mcf-1/conf/../data/synchdir/664/756/lock-Solr58!https58!47!47!www.journals.uio.no58!44347!index.php47!nordina47!search47!advancedResults?subject61!Small%20group%20learning%2C%203rd%20graders%2C%20learning%20of%20DC-circuit%20phenomena%2C%20active%20and%20spontaneous%20learning.lock' failed: File name too long Erlend On 15.08.14 09:46, Erlend Garåsen wrote: -1 All my first tests pass, but I think I found a blocker when I ran the last one. By running MCF using FileLockManager, I'm getting the following error and MCF just tries to run this task over and over again. My synch folder now contains a lot of files and it still grows. I think MCF should handle long URLs and just strip the length of the filename if it becomes too large. 
INFO 2014-08-15 09:30:54,485 (Worker thread '9') - WEB: FETCH URL|https://www.journals.uio.no/index.php/nordina/search/advancedResults?subject=effective%20continuing%20professional%20development%2C%20authentic%20and%20entrepreneurial%20learning%2C%20science%20and%20technology%20education|1408087853848+633|200|15735| WARN 2014-08-15 09:30:54,609 (Worker thread '9') - Attempt to set file lock '/www/var/data/mcf/mcf-1/conf/../data/synchdir/948/350/lock-Solr58!https58!47!47!www.journals.uio.no58!44347!index.php47!nordina47!search47!advancedResults?subject61!effective%20continuing%20professional%20development%2C%20authentic%20and%20entrepreneurial%20learning%2C%20science%20and%20technology%20education.lock' failed: File name too long java.io.IOException: File name too long at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.createNewFile(File.java:947) at org.apache.manifoldcf.core.lockmanager.FileLockObject.grabFileLock(FileLockObject.java:221) at org.apache.manifoldcf.core.lockmanager.FileLockObject.obtainGlobalWriteLockNoWait(FileLockObject.java:77) at org.apache.manifoldcf.core.lockmanager.LockObject.obtainGlobalWriteLock(LockObject.java:121) at org.apache.manifoldcf.core.lockmanager.LockObject.enterWriteLock(LockObject.java:74) at org.apache.manifoldcf.core.lockmanager.LockGate.enterWriteLock(LockGate.java:177) at org.apache.manifoldcf.core.lockmanager.BaseLockManager.enter(BaseLockManager.java:1473) at org.apache.manifoldcf.core.lockmanager.BaseLockManager.enterLocks(BaseLockManager.java:803) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$OutputAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3329) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3051) On 12.08.14 05:13, Mingchun Zhao wrote: Hi all, Please vote on whether to release the ManifoldCF, version 1.7, RC0. 
You can find the artifact at: http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0 There is also a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0 Vote will remain open at least 72 hours. Thanks! Mingchun Zhao
Re: [VOTE] Release Apache ManifoldCF 1.6.1, RC1
+1 from me. 1. Ran test, uitest 2. Ran single process example, registered a Solr server and performed a web crawl 3. Deployed on Resin, ran huge crawl with Multiprocess/Zookeeper model Looks good! Erlend On 29.05.14 10:51, Karl Wright wrote: This minor release of ManifoldCF fixes a number of critical problems, including compatibility with JDK 8. A full list of changes can be found with the release candidate. The release candidate can be found at: http://people.apache.org/~kwright/apache-manifoldcf-1.6.1 There is also a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.6.1-RC1 Voting will be held open the Apache-mandated 72 hours. Thanks!
IO exception during indexing: null
I'm getting the following error after I upgraded to version 1.6. I think HttpClient is the source of the problem and that the following ticket describes the issue in detail: https://issues.apache.org/jira/browse/CONNECTORS-661 I have turned on HttpClient logging and placed the manifoldcf.log here: http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log The first line indicates that the Solr server requires authentication. Then it seems that the authentication was unsuccessful using the BASIC method. Then we're getting the following from HttpClient, just like we did as described in CONNECTORS-661: NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. There is nothing wrong with my settings. The realm, user ID and password values are all correctly set. Last time it was something about two HTTP statuses sent (100 and 401), but I can only see the 401 status this time. We have also changed our authentication implementation on our Solr server since then to only rely on Apache, i.e. a setup in httpd.conf. So neither Resin nor mod_caucho should be the problem here. Erlend
Re: IO exception during indexing: null
Thanks for looking at this, Karl. I have sent the output from tcpdump directly to you. Erlend On 21.05.14 14:42, Karl Wright wrote: Looking at CONNECTORS-661, the fix for this was to enable expect-continue. The current code still does this, so clearly it's not working as expected. I'll post to the HTTPCLIENT list for answers. In the meantime, I think we should open a ticket. Karl On Wed, May 21, 2014 at 8:35 AM, Karl Wright daddy...@gmail.com wrote: Hi Erlend, Looking at the log you provided, it's missing critical information, namely what ManifoldCF is sending to your server. So in its current form it's not very helpful. I can see that there are two 401 responses, but that's about it. Karl On Wed, May 21, 2014 at 6:39 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: I'm getting the following error after I upgraded to version 1.6. I think HttpClient is the source of the problem and that the following ticket describes the issue in detail: https://issues.apache.org/jira/browse/CONNECTORS-661 I have turned on HttpClient logging and placed the manifoldcf.log here: http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log The first line indicates that the Solr server requires authentication. Then it seems that the authentication was unsuccessful using the BASIC method. Then we're getting the following from HttpClient, just like we did as described in CONNECTORS-661: NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. There is nothing wrong with my settings. The realm, user ID and password values are all correctly set. Last time it was something about two HTTP statuses sent (100 and 401), but I can only see the 401 status this time. We have also changed our authentication implementation on our Solr server since then to only rely on Apache, i.e. a setup in httpd.conf. So neither Resin nor mod_caucho should be the problem here. Erlend
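For readers unfamiliar with expect-continue: the client sends "Expect: 100-continue" with the request headers and waits for the server's verdict before transmitting the body, so a 401 challenge can arrive while the body is still unsent. The sketch below only illustrates the idea. It deliberately uses the JDK's own java.net.http API (Java 11 or later) rather than the Apache HttpClient that the Solr connector uses, the class ExpectContinueSketch and the URL in the usage note are hypothetical, and nothing is actually sent.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustration only; the Solr connector itself is built on Apache HttpClient, not this API. */
final class ExpectContinueSketch {
    /** Builds (but does not send) a streamed POST that asks the server for "100 Continue" first. */
    static HttpRequest buildUpdateRequest(URI solrUpdateUri, byte[] document) {
        return HttpRequest.newBuilder(solrUpdateUri)
            // Ask the server to accept or reject the headers before the body is transmitted.
            .expectContinue(true)
            .header("Content-Type", "application/xml")
            // The body is supplied as a stream, similar in spirit to a streamed request entity.
            .POST(HttpRequest.BodyPublishers.ofInputStream(() -> new ByteArrayInputStream(document)))
            .build();
    }
}
```

A call such as ExpectContinueSketch.buildUpdateRequest(URI.create("http://solr.example.invalid/solr/update"), bytes) yields a request whose expectContinue() flag is set. Whether the server on the other end honours the header is exactly what the tcpdump output should reveal.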
Re: IO exception during indexing: null
The complete log is now available here: http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log Erlend On 21.05.14 15:09, Erlend Garåsen wrote: Thanks for looking at this, Karl. I have sent the output from tcpdump directly to you. Erlend On 21.05.14 14:42, Karl Wright wrote: Looking at CONNECTORS-661, the fix for this was to enable expect-continue. The current code still does this, so clearly it's not working as expected. I'll post to the HTTPCLIENT list for answers. In the meantime, I think we should open a ticket. Karl On Wed, May 21, 2014 at 8:35 AM, Karl Wright daddy...@gmail.com wrote: Hi Erlend, Looking at the log you provided, it's missing critical information, namely what ManifoldCF is sending to your server. So in its current form it's not very helpful. I can see that there are two 401 responses, but that's about it. Karl On Wed, May 21, 2014 at 6:39 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: I'm getting the following error after I upgraded to version 1.6. I think HttpClient is the source of the problem and that the following ticket describes the issue in detail: https://issues.apache.org/jira/browse/CONNECTORS-661 I have turned on HttpClient logging and placed the manifoldcf.log here: http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log The first line indicates that the Solr server requires authentication. Then it seems that the authentication was unsuccessful using the BASIC method. Then we're getting the following from HttpClient, just like we did as described in CONNECTORS-661: NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. There is nothing wrong with my settings. The realm, user ID and password values are all correctly set. Last time it was something about two HTTP statuses sent (100 and 401), but I can only see the 401 status this time. We have also changed our authentication implementation on our Solr server since then to only rely on Apache, i.e. a setup in httpd.conf. 
So neither Resin nor mod_caucho should be the problem here. Erlend
Re: New committer: Graeme Seaton
Greetings from Oslo, Norway, and welcome aboard, Graeme! Erlend On 10.03.14 08:18, Karl Wright wrote: The Project Management Committee (PMC) for Apache ManifoldCF has asked Graeme Seaton to become a committer and we are pleased to announce that they have accepted. Graeme has been instrumental in driving the ManifoldCF project in the direction of scalability, and will no doubt continue to do so now that he has committer and PMC privileges. Being a committer enables easier contribution to the project since there is no need to go via the patch submission process. This should enable better productivity. Being a PMC member enables assistance with the management and to guide the direction of the project. Thanks, Karl Wright
Re: [VOTE] Release Apache ManifoldCF 1.5, RC7
+1 - Ran ant test | uitest | doc - Installed binary version and ran single process model - Installed source version, built and ran multi-process model and a huge crawl - Deployed on Resin application server and ran a huge crawl Erlend On 04.02.14 13:33, Karl Wright wrote: This is a major release of ManifoldCF that includes the following: - Federated authority support - Multiple authorization domains - ZooKeeper process coordination - Multiple agents processes - Support for SharePoint Claims-based authorization - An Email connector - A revamped look-and-feel Voting will remain open for 3 days. You can download the artifacts from http://people.apache.org/~kwright/apache-manifoldcf-1.5 . There is also a release tag at https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.5-RC7 . This RC includes changes to the dist directory organization so that jar files are not duplicated, saving 40MB from each binary download. It also fixes an issue with connection limits in the zookeeper example. Finally, it fixes a limitation in the CMIS connector (CONNECTORS-864) and a maven build problem (CONNECTORS-865). Also fixes CONNECTORS-866 (the lockclean script), and two more Maven version issues. Finally, corrects a LiveLink connector reversion described in CONNECTORS-871. Missing SolrJ dependencies in CONNECTORS-873. Workaround for SolrJ runtime exception being thrown in CONNECTORS-874. Throttling lockup dealt with, improved, and tested in CONNECTORS-872. Karl
Re: [VOTE] Release Apache ManifoldCF 1.5, RC7
We're still having problems with this release on our test server. It runs stable and does not hang anymore, but nothing gets sent to Solr. Since there was a problem with the SSL certificate in previous RCs, maybe there is a similar problem related to the Solr Output Connector? We have configured the same certificate in order to post documents to Solr. I get entries like this in manifoldcf.log which indicates that documents should be indexed, but they aren't: DEBUG 2014-02-06 10:28:06,609 (Worker thread '29') - WEB: Decided to ingest 'http://www.ibsen.uio.no/varia.xhtml' In Simple history, only fetch activities are shown. Any suggestions how to debug what's really going on? I can try to turn on debug logging for Httpclient in case that helps. Erlend On 2/4/14 1:33 PM, Karl Wright wrote: This is a major release of ManifoldCF that includes the following: - Federated authority support - Multiple authorization domains - ZooKeeper process coordination - Multiple agents processes - Support for SharePoint Claims-based authorization - An Email connector - A revamped look-and-feel Voting will remain open for 3 days. You can download the artifacts from http://people.apache.org/~kwright/apache-manifoldcf-1.5 . There is also a release tag at https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.5-RC7 . This RC includes changes to the dist directory organization so that jar files are not duplicated, saving 40MB from each binary download. It also fixes an issue with connection limits in the zookeeper example. Finally, it fixes a limitation in the CMIS connector (CONNECTORS-864) and a maven build problem (CONNECTORS-865). Also fixes CONNECTORS-866 (the lockclean script), and two more Maven version issues. Finally, corrects a LiveLink connector reversion described in CONNECTORS-871. Missing SolrJ dependencies in CONNECTORS-873. Workaround for SolrJ runtime exception being thrown in CONNECTORS-874. Throttling lockup dealt with, improved, and tested in CONNECTORS-872. Karl
Re: [VOTE] Release Apache ManifoldCF 1.5, RC7
On 06.02.14 12:41, Erlend Garåsen wrote: p://www.ibsen.uio.no/diktsamlinger.xhtml]} 0 16 select * from repohistory where entityid like '%www.ibsen.uio.no/diktsamlinger.xhtml%' 11227;1391439283905;1391439277790;1391439283890;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200; 11227;1391439283948;1391439277782;1391439283923;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;document ingest (Solr);OK 11227;1391678979841;1391678941727;1391678951418;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200; 11227;1391685900353;1391685874021;1391685881017;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200; 11227;1391686685694;1391686673738;1391686540299;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200; So it should show up in simple history. Erlend
Re: [VOTE] Release Apache ManifoldCF 1.5, RC7
Any suggestions for what to include in logging.ini? I have tried the following without any success: log4j.logger.org.postgresql=DEBUG log4j.logger.java.sql.Connection=DEBUG log4j.logger.java.sql=DEBUG log4j.logger.java.sql.ResultSet=TRACE Erlend On 06.02.14 13:34, Karl Wright wrote: Hi Erlend, This isn't making much sense. Nothing here has changed, AFAIK, between 1.4.1 and 1.5. If you want to see the queries being submitted for the simple history, try these steps: (1) Stop agents process and Resin (2) Enable db debugging (3) Start Resin (4) try the simple history in the UI (5) have a look at the log; the query should be there Karl On Thu, Feb 6, 2014 at 6:59 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: On 06.02.14 12:41, Erlend Garåsen wrote: p://www.ibsen.uio.no/diktsamlinger.xhtml]} 0 16 select * from repohistory where entityid like '%www.ibsen.uio.no/diktsamlinger.xhtml%' 11227;1391439283905;1391439277790;1391439283890;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200 11227;1391439283948;1391439277782;1391439283923;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;document ingest (Solr);OK 11227;1391678979841;1391678941727;1391678951418;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200 11227;1391685900353;1391685874021;1391685881017;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200 11227;1391686685694;1391686673738;1391686540299;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200 So it should show up in simple history. Erlend
Re: [VOTE] Release Apache ManifoldCF 1.5, RC7
On 06.02.14 15:25, Karl Wright wrote: So I conclude that simple history is working fine, but since it is only returning indexing results within the last hour by default it is confusing you. I also think it is likely that documents are getting skipped because you've crawled this set before with the same job and many of the documents have not changed. Karl, we are indexing these documents: I have tail -F opened up from our Solr test server at the moment: [2014-02-06 15:21:00.321] INFO [uio] OP crawl {add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=B]} 0 38 [2014-02-06 15:21:00.359] INFO [uio] OP crawl {add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=N]} 0 23 [2014-02-06 15:21:29.732] INFO [uio] OP crawl {add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=G]} 0 29 [2014-02-06 15:22:11.954] INFO [uio] OP crawl {add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=S]} 0 38 [2014-02-06 15:22:15.752] INFO [uio] OP crawl {add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=D]} 0 28 [2014-02-06 15:22:18.323] INFO [uio] OP crawl {add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=H]} 0 34 [2014-02-06 15:22:21.657] INFO [uio] OP crawl {add=[http://www.ibsen.uio.no/variakronologi.xhtml]} 0 73 How could these log entries show up on our Solr server if the documents were skipped? And why did I get entries like this earlier today: DEBUG 2014-02-06 10:28:06,609 (Worker thread '29') - WEB: Decided to ingest 'http://www.ibsen.uio.no/varia.xhtml' (I have changed the log level back to INFO right now, so I cannot see these entries for the last crawl, but I will re-enable DEBUG again). I have re-ingested all documents several times today to be sure that all documents were crawled all over again. Of course, I can try to remove all jobs, delete all tables in PostgreSQL and try to create everything from scratch in case the old settings did not get upgraded successfully. Unfortunately MCF will delete all tables in my index as well. Erlend
Re: [VOTE] Release Apache ManifoldCF 1.5, RC7
And why do I get the following result from pgAdmin when I run the following SQL?: select * from repohistory where entityid = 'http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=H' 54251;1391440247586;1391440244203;1391440247542;http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=H;Web;document ingest (Solr);OK This shows that the document was indexed, but it's not visible inside simple history. Erlend
Re: [VOTE] Release Apache ManifoldCF 1.5, RC7
On 06.02.14 15:53, Karl Wright wrote: Hi Erlend, Please go into the Simple History, and change the start time of the query to be one day earlier than the default. By default, Simple History only reports the last hour's worth of events. Then it only displays the crawl which completed tonight before I did the upgrade of MCF. If Solr is indexing the documents, you should also see the entries in simple history. I changed the start time to four hours earlier than the default which should catch the Solr activity. The query I posted seems to include an old start time (3rd of Feb) and that's the reason why pgAdmin displays a result set. At that time, I reindexed Solr prior to the MCF upgrade. If I re-ingest all documents and start the job, I see activity in both our Solr log and in manifoldcf.log (Decided to ingest...). And if I continuously refresh the simple history window, all I can see is fetch activities (and job start etc.). For some odd reason, 'document ingest (Solr)' as an activity type does not seem to be added to my repohistory table after I did the upgrade. Take a look at this query: select count(*) from repohistory where owner='Web' AND starttime > 1391691978799 and activitytype = 'fetch' == 141. (This is everything from 1:06 pm until now.) But then take a look at this one: select starttime, activitytype from repohistory where owner='Web' AND starttime > 1391691978799 and activitytype <> 'fetch' == 1391693068680;job stop 1391693560219;job start 1391693602432;robots parse 1391694720907;job stop 1391694830347;job start 1391694870310;job stop 1391695481359;job continue 1391695518007;robots parse 1391696593141;job end I can try to debug more tomorrow. Erlend
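The starttime values in these queries look like milliseconds since the Unix epoch (an assumption, but one that fits the "1:06 pm" remark above). A small Java sketch for converting such a value to a readable timestamp, and for computing the cutoff that a one-hour history window would use, might look like this; the class HistoryWindow is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for reading repohistory timestamps; assumes epoch milliseconds. */
final class HistoryWindow {
    /** Converts a starttime/endtime value to an Instant (printed in UTC). */
    static Instant toInstant(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    /** Lower bound for "starttime > ?" when only the last 'window' of events is wanted. */
    static long cutoffMillis(Instant now, Duration window) {
        return now.minus(window).toEpochMilli();
    }
}
```

HistoryWindow.toInstant(1391691978799L) prints as 2014-02-06T13:06:18.799Z, and HistoryWindow.cutoffMillis(Instant.now(), Duration.ofHours(1)) gives the bound corresponding to the default one-hour view described above.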
Re: [VOTE] Release Apache ManifoldCF 1.5, RC7
On 06.02.14 18:18, Karl Wright wrote: Actually yes, I found it. Only exceptions/errors are recorded by the solr connector. CONNECTORS-884. However, I don't think this rises to the level of needing to respin the RC. Do you agree? Since we are on RC7 now, I agree. I'll start a complete crawl after dinner. If that completes, I'll place my final vote. :) Erlend
Re: [VOTE] Release Apache ManifoldCF 1.5, RC5
On 1/29/14 3:57 PM, Karl Wright wrote: Thanks - this shows that threads are all waiting on connection throttling. How many simultaneous connections did you make available to the site you are crawling, and can you look at the simple history report to confirm that there is no activity? I'll dig further through your stack trace while I await your answer. I'll answer even though you have cancelled the vote. No activity in simple history and no new log entries in manifoldcf.log other than the debug entries included in the screenshot. Throttling -- Max connections: 30 Max avg fetches/min: 100 Bandwidth - Max connections: 25 Max kbytes/sec: 2000 Max fetches/min: 20 Erlend
Re: [VOTE] Release Apache ManifoldCF 1.4, RC1
+1 Looks good. 1. Deployed the binary version on Resin and did a test crawl. 2. Built the source version using Ant. 3. Ran UI tests. 4. Built docs (ant doc) using Forrest 0.9. 5. Ran ant test. 6. Ran the single process model within the example dir, started the web crawler and posted to Solr 4. 7. Ran multiprocess example, started the web crawler and posted to Solr. Erlend On 10/22/13 12:05 AM, Karl Wright wrote: Please vote on whether to release Apache ManifoldCF 1.4, RC1. The release candidate can be downloaded from: http://people.apache.org/~kwright/apache-manifoldcf-1.4 There is a tag at: http://svn.apache.org/repos/asf/manifoldcf/tags/release-1.4-RC1 This release contains a substantial refactoring of the SharePoint connector, as well as new features including attachment crawling. Proxy support for the wiki and jira connectors has also been added. It also fixes CONNECTORS-790 and CONNECTORS-791. Vote will remain open for at least 72 hours. Thanks, Karl
Re: [VOTE] Release Apache ManifoldCF 1.4, RC0
-1

I'm getting an NPE when running ant test. I will of course withdraw my vote if I have forgotten a crucial step before running the tests, but I don't think that is the case.

Otherwise, I completed the following tests successfully:

1. Deployed the binary version on Resin and did a test crawl.
2. Built the source version using Ant
3. Ran UI tests
4. Built docs (ant doc) using Forrest 0.9

Stack trace:

[junit] 2178 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - started o.e.j.w.WebAppContext{/mcf-api-service,file:/private/var/folders/11/q0gk5wfs4pl662rzx319gg14gn/T/jetty-0.0.0.0-8346-mcf-api-service.war-_mcf-api-service-any-/webapp/},../../../framework/build/war-proprietary/mcf-api-service.war
[junit] 2190 [main] INFO org.eclipse.jetty.server.AbstractConnector - Started SelectChannelConnector@0.0.0.0:8346 STARTING
[junit] 32865 [qtp1273157698-145] WARN org.eclipse.jetty.servlet.ServletHandler - /mcf-api-service/json/jobstatuses/1382362019797
[junit] java.lang.NullPointerException
[junit] at org.apache.manifoldcf.crawler.system.ManifoldCF.apiReadJobStatus(ManifoldCF.java:2063)
[junit] at org.apache.manifoldcf.crawler.system.ManifoldCF.executeReadCommand(ManifoldCF.java:3196)
[junit] at org.apache.manifoldcf.apiservlet.APIServlet.executeRead(APIServlet.java:231)
[junit] at org.apache.manifoldcf.apiservlet.APIServlet.doGet(APIServlet.java:77)
[junit] at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
[junit] at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
[junit] at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:547)
[junit] at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:480)
[junit] at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
[junit] at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:520)
[junit] at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
[junit] at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:941)
[junit] at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:409)
[junit] at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:186)
[junit] at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:875)
[junit] at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
[junit] at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
[junit] at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
[junit] at org.eclipse.jetty.server.Server.handle(Server.java:349)
[junit] at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:441)
[junit] at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:919)
[junit] at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:582)
[junit] at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:218)
[junit] at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:51)
[junit] at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:586)
[junit] at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:44)
[junit] at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:598)
[junit] at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:533)
[junit] at java.lang.Thread.run(Thread.java:722)

Erlend

On 10/21/13 2:23 AM, Karl Wright wrote:

Please vote on whether to release Apache ManifoldCF 1.4, RC0. The release candidate can be downloaded from: http://people.apache.org/~kwright/apache-manifoldcf-1.4 There is a tag at: http://svn.apache.org/repos/asf/manifoldcf/tags/release-1.4-RC0

This release contains a substantial refactoring of the SharePoint connector, as well as new features including attachment crawling. Proxy support for the wiki and jira connectors has also been added.

Vote will remain open for at least 72 hours.

Thanks, Karl
Re: 1.3 release schedule
I'm sorry to report that I haven't worked on the Hydra connector for the last month. I have been busy with a major release of our search project at the university, and my summer vacation starts tomorrow. I have also been busy creating patches for Solr. Erlend On 6/24/13 5:13 PM, Karl Wright wrote: Hi All, Our quarterly release schedule means that we'll need to code-freeze what is in 1.3 in about 1 month. There is currently a *lot* of outstanding activity that needs to solidify between now and then, and I'm not likely to be able to get done with everything currently assigned to me. If you have the capacity to look at any of these tickets, please let me know. Thanks, Karl
Re: [VOTE] Release Apache ManifoldCF 1.2, RC1
+1 - Deployed and started a big crawl on Resin. - Ran: ant uitest ant doc - Built using Maven 3.0.4 I will withdraw my vote if CONNECTORS-682 is still not resolved. So far so good, the job has been running for 7 hours. I will check it again on Friday since I will be away from my computer till then. Erlend On 08.05.13 12.50, Karl Wright wrote: Please vote on whether to release Apache ManifoldCF 1.2, RC1. The release artifact can be found at: http://people.apache.org/~kwright/apache-manifoldcf-1.2 The release tag can be found at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.2-RC1 Fixes from the last RC include: - DropBox connector documentation - A fix for CONNECTORS-682 The 1.2 release has a large number of changes in it, including a new connector for DropBox, and also new framework functionality for minimal crawls, with better support for ADD_CHANGE_DELETE models of crawling. (See CHANGES.txt for a complete list.) Karl -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.2, RC1
On 08.05.13 18.00, Erlend Garåsen wrote: I will withdraw my vote if CONNECTORS-682 is still not resolved. So far so good, the job has been running for 7 hours. I will check it again on Friday since I will be away from my computer till then. I just want to report that the job completed without any errors, so CONNECTORS-682 seems to be resolved. Over 14000 documents crawled. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Release status
On 28.04.13 23.27, Karl Wright wrote: The upshot is that the release is going to be late. I am not yet sure *how* late. Will keep everyone posted. I have explained the problem in detail to my colleague, who will get back to work tomorrow, when I'm leaving Norway. Unfortunately I don't think I will bring my laptop with me to the US since I want to carry as little as possible, only my iPad, so it will be difficult to work more on this issue from tomorrow. I'll be back on Monday next week. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Disabling retries
I enabled this functionality once. In previous releases I think the retry functionality was enabled by default. In HttpClient 4, this became more complicated. I think the method must return true in order to enable retries, but then you need to check for several types of exceptions first in order to figure out when to retry the connection.

Just a warning - I haven't tried/tested this, but here's what I added some time ago. It probably must be adapted to the MCF environment to decide when to retry:

    import java.io.IOException;
    import java.io.InterruptedIOException;
    import java.net.ConnectException;
    import java.net.UnknownHostException;
    import javax.net.ssl.SSLException;
    import org.apache.http.HttpEntityEnclosingRequest;
    import org.apache.http.HttpRequest;
    import org.apache.http.client.HttpRequestRetryHandler;
    import org.apache.http.protocol.ExecutionContext;
    import org.apache.http.protocol.HttpContext;

    HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
        public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
            if (executionCount >= 3) {
                // Do not retry if over max retry count
                return false;
            }
            if (exception instanceof InterruptedIOException) {
                // Timeout
                return false;
            }
            if (exception instanceof UnknownHostException) {
                // Unknown host
                return false;
            }
            if (exception instanceof ConnectException) {
                // Connection refused
                return false;
            }
            if (exception instanceof SSLException) {
                // SSL handshake exception
                return false;
            }
            HttpRequest request = (HttpRequest) context.getAttribute(ExecutionContext.HTTP_REQUEST);
            boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
            if (idempotent) {
                // Retry if the request is considered idempotent
                return true;
            }
            return false;
        }
    };

Erlend

On 07.03.13 14.09, Karl Wright wrote:

Hi all,

We have code that creates a DefaultHttpClient instance for use with Solr. The HttpEntity that is created when sending data is not reusable, so we've disabled retries (we thought) using the following code:

    DefaultHttpClient localClient = new DefaultHttpClient(connectionManager,params);
    // No retries
    localClient.setHttpRequestRetryHandler(new HttpRequestRetryHandler() {
        public boolean retryRequest(
            IOException exception,
            int executionCount,
            HttpContext context) {
            return false;
        }
    });

Unfortunately it does not seem to have actually worked; we are still seeing non-reusable stream retry errors in some cases. Has anybody seen this before, and what are we doing wrong?

Karl

-- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
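The retry decision discussed in this thread can be separated from HttpClient and exercised on its own. What follows is only a minimal sketch under stated assumptions: the class RetryDecision, the method shouldRetry and the constant MAX_ATTEMPTS are hypothetical names that exist neither in ManifoldCF nor in HttpClient, and the limit of 3 is simply taken from the snippet in the message above. It uses only JDK exception types, so it runs without HttpClient on the class path.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;

/** Hypothetical helper (an assumption for illustration, not MCF or HttpClient code). */
final class RetryDecision {
    /** Assumed upper bound on attempts, mirroring the value used in the thread. */
    static final int MAX_ATTEMPTS = 3;

    private RetryDecision() {}

    /**
     * Decides whether a failed request should be sent again.
     * executionCount is the number of attempts made so far; idempotent tells
     * whether the request can be repeated safely (for example, it carries no
     * non-repeatable entity).
     */
    static boolean shouldRetry(IOException exception, int executionCount, boolean idempotent) {
        if (executionCount >= MAX_ATTEMPTS) {
            return false; // attempt budget used up
        }
        if (exception instanceof InterruptedIOException) {
            return false; // timeout
        }
        if (exception instanceof UnknownHostException) {
            return false; // unknown host
        }
        if (exception instanceof ConnectException) {
            return false; // connection refused
        }
        if (exception instanceof SSLException) {
            return false; // SSL handshake failure
        }
        return idempotent; // only repeat requests that are safe to repeat
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(new IOException("connection reset"), 1, true));
        System.out.println(shouldRetry(new SocketTimeoutException("read timed out"), 1, true));
    }
}
```

Inside a real HttpRequestRetryHandler, the idempotent flag would be derived from the request found in the HttpContext, as in the snippet quoted in the thread. For the Solr case Karl describes, where the entity is not reusable, that flag would be false and the sketch would never ask for a retry.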
Re: Fwd: FW: Google Summer of Code 2013
Were you thinking about the general Email connector? https://issues.apache.org/jira/browse/CONNECTORS-553 Since I want to finish the Hydra Output Connector before I start working on this one, I think this might be a good candidate, provided it's not too complicated and doesn't take longer than eight weeks to finish. What about a simple LDAP connector for indexing LDAP content (people/unit trees etc.)? Erlend On 07.03.13 16.19, Karl Wright wrote: Hey everyone - Now is the time to come up with reasonable ManifoldCF enhancement ideas that a student could likely succeed at in eight weeks' time. This can range from new connectors (I think we still have an outstanding ticket for a Microsoft Outlook connector, for instance), to significant enhancements you can think of that might be useful. If you don't want to create a ticket yourself, I'm happy to do it. Just drop me an email. Karl -----Original Message----- From: ext Ulrich Stärk [mailto:u...@apache.org] Sent: Tuesday, March 05, 2013 10:27 AM To: p...@apache.org Subject: Google Summer of Code 2013 Hello PMCs, Google Summer of Code [1] is the ideal opportunity for you to attract new contributors to your projects. The ASF will apply as a participating organization meaning individual projects don't have to apply separately. If you want to participate with your project you NOW need to - understand what it means to be a mentor [2]. - record your project ideas. Just create issues in JIRA, label them with gsoc2013, and they will show up at [3]. Please be as specific as possible when describing your idea. Include the programming language, the tools and skills required, but try not to scare potential students away. They are supposed to learn what's required before the program starts. Use labels, e.g. for the programming language (java, c, c++, erlang, python, brainfuck, ...) or technology area (cloud, xml, web, foo, bar, ...) and record them at [5]. 
Please use the COMDEV JIRA project for recording your ideas if your project doesn't use JIRA (e.g. httpd, ooo). Contact d...@community.apache.org if you need assistance. - subscribe to code-awa...@apache.org (restricted to potential mentors, meant to be used as a private list - general discussions on the public d...@community.apache.org list as much as possible please). Use a recognized address when subscribing (@apache.org or one of your alias addresses on record). Note that the ASF isn't accepted yet, nevertheless you *really* should start recording your ideas now. Over the years we were able to complete hundreds of projects successfully. Some of our prior students are active contributors now! Let's make this a success again this year! Uli P.S.: Except for the private parts (label spreadsheet mostly), this email is free to be shared publicly if you want to. [1] http://www.google-melange.com/gsoc/homepage/google/gsoc2013 [2] http://community.apache.org/guide-to-being-a-mentor.html [3] http://s.apache.org/gsoc2013ideas [4] http://community.apache.org/gsoc.html [5] http://s.apache.org/gsoclabels -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Fwd: FW: Google Summer of Code 2013
On 07.03.13 19.26, Karl Wright wrote: Yes, I thought the email connector might be a good choice. I think an email connector is more useful than an ldap connector. I'm not sure what the use case would be for an LDAP repository connector. Can you describe a scenario where it might be useful? We were thinking about indexing LDAP resources at the university since we're offering people and unit searches on our web pages. The problem is that we are limited by the search facilities LDAP itself offers. If we had the opportunity to search this information from Solr, we could do faceting and more complex searches, for instance list all professors within an academic field/topic etc. *If* we decide to develop such a connector, we will probably make it, but we might find other solutions. So basically, I think an LDAP connector might be interesting for larger institutions with a lot of information stored in LDAP. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.1.1, RC0
+1 (I'm withdrawing my previous -1). - Downloaded the binary zip version: * It now runs stably on Resin, even after a test where I stopped the running job and started it once again. Posting to Solr 4.0 works. - Downloaded the source zip version: * ran ant uitest * ran Jetty within both the example and multithread-example folders and posted to Solr 3.1 (backward compatibility test) Erlend On 12.02.13 20.39, Karl Wright wrote: Right, since this is a long-standing problem and this is a patch release, I'd hope that we could hold off until the next real release for this one. I doubt it will be too challenging to fix. Karl On Tue, Feb 12, 2013 at 1:22 PM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: You are probably right. I can withdraw my vote, but I'm unsure whether I should wait and see what happens with the crawl I just started on our test server with new hop filter settings. I can then make a final vote tomorrow. All the other tests I have done with this RC have passed. Erlend On 12.02.13 19.06, Karl Wright wrote: If this problem is non-critical, and has been around a long time, it is not necessary to cancel a release in order to fix it. The logic in question has not changed since probably ManifoldCF 0.3 or so. Karl On Tue, Feb 12, 2013 at 1:04 PM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: -1 due to CONNECTORS-644. The job restart link does not work, causing Error: Repeated service interruptions - failure getting document version. I think this functionality is so basic that it should be fixed for this release. This problem is totally unrelated to Resin. It happens if you are running Jetty as well. Erlend On 10.02.13 20.01, Karl Wright wrote: Please vote on whether to release Apache ManifoldCF 1.1.1, RC0. 
The release artifact can be downloaded from: http://people.apache.org/~kwright/apache-manifoldcf-1.1.1 There is a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1.1-RC0 This release has been made primarily to fix a leak of connection handles, described by CONNECTORS-638. Other major fixes have also been included, specifically: - Fix the maven build (various tickets) - Fix the rather broken Elastic Search connector (also various tickets) Karl -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.1.1, RC0
I tried to restart the crawl ten minutes after I started it. The job ends after a while and will not start again. This is the status after it stopped: Error: Repeated service interruptions - failure getting document version If I start it manually, it just fetches and fetches without posting anything to Solr. The only thing I did while it was running the first time was to edit the exclude list once - removed a white space at the end of a reg exp rule. Then I commented out the regexp line in case it DID affect the documents (it shouldn't) and restarted again. Same problem - the job does not want to start: Error: Repeated service interruptions - failure getting document version Just before the job ends, the result description shows Interrupted: Job no longer active. This is normal, but why won't MCF start the job again after it stops? Same problem after I manually start it - MCF just fetches and fetches without posting anything to Solr. E On 12.02.13 13.38, Erlend Garåsen wrote: I have changed some settings in MCF which will reduce the heavy load on our PG server (changed hop count mode to Keep unreachable documents, forever). I will start a new crawl today and make a final vote tomorrow. Erlend On 11.02.13 20.49, Karl Wright wrote: I've looked at this enough now to conclude that this problem is probably not intrinsic to ManifoldCF. It may instead be due to timeouts present in Erlend's PostgreSQL installation. I am therefore leaving the vote open until there is some reason to believe that there is a general problem here. Thanks, Karl On Mon, Feb 11, 2013 at 10:02 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: The job just stopped working and there is nothing suspicious in my logs. The database people are saying that we have connection locks again (idle in transaction). 
Karl, you mentioned that after using the following parameter: <property name="org.apache.manifoldcf.database.connectiontracking" value="true"/> there was no way back to an older release due to changes in the database. That's OK, but was that just temporary functionality, meaning that I need to clear my database in order to use 1.1.1 RC0? Erlend On 10.02.13 20.01, Karl Wright wrote: Please vote on whether to release Apache ManifoldCF 1.1.1, RC0. The release artifact can be downloaded from: http://people.apache.org/~kwright/apache-manifoldcf-1.1.1 There is a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1.1-RC0 This release has been made primarily to fix a leak of connection handles, described by CONNECTORS-638. Other major fixes have also been included, specifically: - Fix the maven build (various tickets) - Fix the rather broken Elastic Search connector (also various tickets) Karl -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.1.1, RC0
You are probably right. I can withdraw my vote, but I'm unsure whether I should wait and see what happens with the crawl I just started on our test server with new hop filter settings. I can then make a final vote tomorrow. All the other tests I have done with this RC have passed. Erlend On 12.02.13 19.06, Karl Wright wrote: If this problem is non-critical, and has been around a long time, it is not necessary to cancel a release in order to fix it. The logic in question has not changed since probably ManifoldCF 0.3 or so. Karl On Tue, Feb 12, 2013 at 1:04 PM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: -1 due to CONNECTORS-644. The job restart link does not work, causing Error: Repeated service interruptions - failure getting document version. I think this functionality is so basic that it should be fixed for this release. This problem is totally unrelated to Resin. It happens if you are running Jetty as well. Erlend On 10.02.13 20.01, Karl Wright wrote: Please vote on whether to release Apache ManifoldCF 1.1.1, RC0. The release artifact can be downloaded from: http://people.apache.org/~kwright/apache-manifoldcf-1.1.1 There is a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1.1-RC0 This release has been made primarily to fix a leak of connection handles, described by CONNECTORS-638. Other major fixes have also been included, specifically: - Fix the maven build (various tickets) - Fix the rather broken Elastic Search connector (also various tickets) Karl -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.1.1, RC0
On 11.02.13 16.26, Karl Wright wrote: trunk has a different schema than 1.1.1. So yes, you'd have to blow away the old database to go back. OK, who knows. Maybe this was the reason why it just stopped. I'll clean up by deleting all tables and related db resources and try again. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.1.1, RC0
On 11.02.13 17.50, Karl Wright wrote: The trace didn't look like it was the result of stuck ManifoldCF locks. Even though it does not seem to be a result of this, I found an error in our control script. The path to executecommand.sh was incorrect for the stop function. I'll send your suggestions to the PostgreSQL admins and try to get more information. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: ManifoldCF 1.1.1?
On 08.02.13 10.31, Maciej Liżewski wrote: I would go with a patch. It is serious enough because it causes problems with running crawler jobs... +1. I'm running the job once again and examining the logs. I will report any suspicious deviations as they occur. I think it is advisable that I complete this job before I conclude that we have got rid of the problem. This may take some time, probably about 30 hours. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.1, RC7
Difficult to say what caused these problems, but I have now deployed RC8, which works well on Resin. I just have a couple more tests to do, so I will give my final vote within a couple of hours.

Erlend

On 29.01.13 17.41, Karl Wright wrote:

I ran the multiprocess example, with httpmime.jar just as we deliver it in the connector-lib directory, and I did not see this issue. It is almost certainly a configuration problem.

Karl

On Tue, Jan 29, 2013 at 11:26 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

I have to run now, but I will investigate this further. BTW, I have the following in my lib folder, so it should work: httpmime.jar

I did not see this yesterday when I was testing RC6 with Resin. The difference now is that the crawler just fetches and fetches, but nothing gets posted to Solr. I hope it is just me who has misconfigured something, but I will get back to this as soon as possible.

FATAL 2013-01-29 17:19:17,609 (Worker thread '17') - Error tossed: org/apache/http/entity/mime/content/ContentBody
java.lang.NoClassDefFoundError: org/apache/http/entity/mime/content/ContentBody
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.init(HttpPoster.java:246)
    at org.apache.manifoldcf.agents.output.solr.SolrConnector.getSession(SolrConnector.java:256)
    at org.apache.manifoldcf.agents.output.solr.SolrConnector.removeDocument(SolrConnector.java:629)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.removeDocument(IncrementalIngester.java:1598)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:469)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1651)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.deleteDocument(WorkerThread.java:1672)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1445)
    at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
Caused by: java.lang.ClassNotFoundException: org.apache.http.entity.mime.content.ContentBody
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:627)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 11 more

Erlend

On 28.01.13 23.09, Karl Wright wrote:

Please vote on whether or not to release ManifoldCF 1.1, RC7. The release artifact can be found at: http://people.apache.org/~kwright/apache-manifoldcf-1.1 There is a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1-RC7

This release candidate fixes a packaging problem for wars similar to CONNECTORS-619. It also fixes a problem with the CMIS connector and another SolrJ-related issue (CONNECTORS-622 and CONNECTORS-623). This release candidate provides a better workaround for CONNECTORS-616 than RC5. It also fixes CONNECTORS-617. This release candidate fixes one problem since RC4, which is the inconfigurability of the commit action path for Solr commits in the Solr connector. This needed to be fixed to maintain backwards compatibility. CONNECTORS-621. This release candidate fixes two problems since RC3. The problems were in the included jars for the multiprocess example (CONNECTORS-619) and in connection leakage for JDBC handles (CONNECTORS-620). This release candidate fixes one problem since RC2. The problem is CONNECTORS-618, which relates to MySQL performance. This release candidate fixes one additional problem since RC1. 
The problem is CONNECTORS-616, and relates to Solr dropping connections during indexing. This release candidate fixes two other problems since RC0, both related to Solr 4.0.0 support. - CONNECTORS-613: The version of Tika used in Solr 4.0.0 cannot extract text unless told an accurate mime type. While this is probably a Tika bug, in this ticket we at least make sure a good guess as to the mime type is sent to Solr. - CONNECTORS-614: Fix logic having to do with releasing idle Solr connections. This shows up as socket timeout exceptions, because it becomes very easy to exhaust the Solr application server's thread pool when idle connections are not released in a timely way. This release includes a significant amount of long-planned upgrading and refactoring since Apache ManifoldCF 1.0.1, including: - Port to HttpComponents from commons-httpclient - Port
Re: [VOTE] Release Apache ManifoldCF 1.1, RC8
+1 - Deployed on Resin. Adding documents to Solr 4.0 and document deletion works - Ran ant uitest - Ran ant doc - Ran single process. - Started Jetty from the example folder and tried to post to Solr 3.1 (backward compatibility test) I have examined all the logs in case there were stack traces or other error messages, but everything seems to be OK. Erlend On 30.01.13 03.07, Karl Wright wrote: Please vote on whether or not to release ManifoldCF 1.1, RC8. The release artifact can be found at: http://people.apache.org/~kwright/apache-manifoldcf-1.1 There is a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1-RC8 This release candidate upgrades to SolrJ 4.1.0, with necessary dependency upgrades. This was necessary because SolrJ 4.0.0 had serious issues in its Solr Cloud support. SolrJ 4.0.0 also would not work with Solr 4.1.0, but the SolrJ 4.1.0 does work with Solr 4.0.0, mostly. Potential issues remain with cross-version SolrCloud support. See CONNECTORS-627. This release candidate fixes a packaging problem for wars similar to CONNECTORS-619. It also fixes a problem with the CMIS connector and another SolrJ-related issue (CONNECTORS-622 and CONNECTORS-623). This release candidate provides a better workaround for CONNECTORS-616 than RC5. It also fixes CONNECTORS-617. This release candidate fixes one problem since RC4, which is the inconfigurability of the commit action path for Solr commits in the Solr connector. This needed to be fixed to maintain backwards compatibility. CONNECTORS-621. This release candidate fixes two problems since RC3. The problems were in the included jars for the multiprocess example (CONNECTORS-619) and in connection leakage for JDBC handles (CONNECTORS-620). This release candidate fixes one problem since RC2. The problem is CONNECTORS-618, which relates to MySQL performance. This release candidate fixes one additional problem since RC1. The problem is CONNECTORS-616, and relates to Solr dropping connections during indexing. 
This release candidate fixes two other problems since RC0, both related to Solr 4.0.0 support. - CONNECTORS-613: The version of Tika used in Solr 4.0.0 cannot extract text unless told an accurate mime type. While this is probably a Tika bug, in this ticket we at least make sure a good guess as to the mime type is sent to Solr. - CONNECTORS-614: Fix logic having to do with releasing idle Solr connections. This shows up as socket timeout exceptions, because it becomes very easy to exhaust the Solr application server's thread pool when idle connections are not released in a timely way. This release includes a significant amount of long-planned upgrading and refactoring since Apache ManifoldCF 1.0.1, including: - Port to HttpComponents from commons-httpclient - Port to SolrJ from homegrown for the Solr connector, so that SolrCloud is supported - Improved NTLM support - Partial Kerberos support - Many other improvements, which are summarized in CHANGES.txt -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.1, RC7
I have to run now, but I will investigate this further. BTW, I have the following in my lib folder, so it should work: httpmime.jar

I did not see this yesterday when I was testing RC6 with Resin. The difference now is that the crawler just fetches and fetches, but nothing gets posted to Solr. I hope it is just me who has misconfigured something, but I will get back to this as soon as possible.

FATAL 2013-01-29 17:19:17,609 (Worker thread '17') - Error tossed: org/apache/http/entity/mime/content/ContentBody
java.lang.NoClassDefFoundError: org/apache/http/entity/mime/content/ContentBody
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.init(HttpPoster.java:246)
    at org.apache.manifoldcf.agents.output.solr.SolrConnector.getSession(SolrConnector.java:256)
    at org.apache.manifoldcf.agents.output.solr.SolrConnector.removeDocument(SolrConnector.java:629)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.removeDocument(IncrementalIngester.java:1598)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:469)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1651)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.deleteDocument(WorkerThread.java:1672)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1445)
    at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
Caused by: java.lang.ClassNotFoundException: org.apache.http.entity.mime.content.ContentBody
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:627)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 11 more

Erlend

On 28.01.13 23.09, Karl Wright wrote:

Please vote on whether or not to release ManifoldCF 1.1, RC7. The release artifact can be found at: http://people.apache.org/~kwright/apache-manifoldcf-1.1 There is a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1-RC7

This release candidate fixes a packaging problem for wars similar to CONNECTORS-619. It also fixes a problem with the CMIS connector and another SolrJ-related issue (CONNECTORS-622 and CONNECTORS-623). This release candidate provides a better workaround for CONNECTORS-616 than RC5. It also fixes CONNECTORS-617. This release candidate fixes one problem since RC4, which is the inconfigurability of the commit action path for Solr commits in the Solr connector. This needed to be fixed to maintain backwards compatibility. CONNECTORS-621. This release candidate fixes two problems since RC3. The problems were in the included jars for the multiprocess example (CONNECTORS-619) and in connection leakage for JDBC handles (CONNECTORS-620). This release candidate fixes one problem since RC2. The problem is CONNECTORS-618, which relates to MySQL performance. This release candidate fixes one additional problem since RC1. The problem is CONNECTORS-616, and relates to Solr dropping connections during indexing. This release candidate fixes two other problems since RC0, both related to Solr 4.0.0 support. - CONNECTORS-613: The version of Tika used in Solr 4.0.0 cannot extract text unless told an accurate mime type. While this is probably a Tika bug, in this ticket we at least make sure a good guess as to the mime type is sent to Solr. - CONNECTORS-614: Fix logic having to do with releasing idle Solr connections. 
This shows up as socket timeout exceptions, because it becomes very easy to exhaust the Solr application server's thread pool when idle connections are not released in a timely way. This release includes a significant amount of long-planned upgrading and refactoring since Apache ManifoldCF 1.0.1, including: - Port to HttpComponents from commons-httpclient - Port to SolrJ from homegrown for the Solr connector, so that SolrCloud is supported - Improved NTLM support - Partial Kerberos support - Many other improvements, which are summarized in CHANGES.txt -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
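A NoClassDefFoundError like the one reported above usually comes down to which jars the class loader can actually see. As a rough diagnostic, here is a minimal sketch that lists which jars in a folder contain a given class. The function name `jars_containing` and the folder name in the usage note are assumptions for illustration, and finding the class in a folder does not prove that the application server's class loader reads that folder.

```python
import zipfile
from pathlib import Path


def jars_containing(lib_dir, class_name):
    """Return the names of the jars directly under lib_dir that contain class_name."""
    # A class a.b.C is stored inside a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid archives (e.g. truncated copies).
            continue
    return hits
```

Calling `jars_containing("lib", "org.apache.http.entity.mime.content.ContentBody")` would return an empty list if no jar in the hypothetical `lib` folder ships that class, which is a quick way to tell a missing jar apart from a class path problem.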
Re: [CANCEL][VOTE] Release Apache ManifoldCF 1.1, RC6
RC6 runs well on Resin 4, so please prepare the next RC7. Erlend On 25.01.13 15.57, Karl Wright wrote: FWIW, I'm going to hold off on spinning RC7 until Erlend signs off on RC6 in it University of Oslo Resin environment. Hopefully that should be sometime this weekend. Karl On Fri, Jan 25, 2013 at 9:31 AM, Karl Wright daddy...@gmail.com wrote: Agreed that we aren't shooting for perfection. Absence of significant regression is the best we can hope for. ;-) Unfortunately, due to the significant amount of refactoring that took place in this release, we're still discovering exactly where the bodies lie. But I think we are getting close. Karl On Fri, Jan 25, 2013 at 8:42 AM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Fri, Jan 25, 2013 at 3:35 PM, Karl Wright daddy...@gmail.com wrote: I've pulled up this fix to the release branch, and the vote on RC6 is canceled. However, can everyone do at least a preliminary smoke-test evaluation of RC6 at this time, and not wait for the final vote? We aren't going to converge if we all keep waiting to do the evaluation until we think there are no more RC's likely. No software is perfect and it's always possible to cut more releases to address issues as they come along, so in general I'd recommend people not to vote -1 because of each individual bug they encounter. As long as the code compiles, passes the existing test suite and has no other major issues (known security vulnerabilities, licensing problems, etc.), it should be ready to release. Smaller issues like CONNECTORS-622 can and IMHO should be fixed in the following release instead of holding up the current one. It's better to release early and often than to wait for perfection. BR, Jukka Zitting -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache Manifold 1.1, RC3
-1 so far. Until the problem described below is solved or explained. Running Jetty within the example folder seems to work normally, but not within the multiprocess-example folder. In both configurations I have defined a Solr Output Connector and a web crawler. The funny thing within the latter folder is that nothing is sent to Solr. The crawler just fetches and fetches, and that is the only activity I can see. I have ran: ./start-database.sh ./initialize.sh ./start-agents.sh ./start-webapps.sh The Solr Output connection is working and I have gone through the settings in my job - very similar configurations from my first attempt within the example folder, but nothing shows up. When I looked in my logs, I discovered this: FATAL 2013-01-22 14:10:31,802 (Worker thread '43') - Error tossed: Could not initialize class org.apache.solr.client.solrj.impl.HttpSolrServer java.lang.NoClassDefFoundError: Could not initialize class org.apache.solr.client.solrj.impl.HttpSolrServer at org.apache.manifoldcf.agents.output.solr.HttpPoster.init(HttpPoster.java:246) at org.apache.manifoldcf.agents.output.solr.SolrConnector.getSession(SolrConnector.java:256) at org.apache.manifoldcf.agents.output.solr.SolrConnector.addOrReplaceDocument(SolrConnector.java:609) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1651) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1409) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at 
org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551) BTW, I'm running Solr 3.1, not the latest version. I don't think this has something to do with the problems described above since my Solr server does not seem to be hit my MCF at all. Erlend On 22.01.13 09.59, Karl Wright wrote: Please vote on whether or not to release ManifoldCF 1.1, RC3. The release artifact can be found at: http://people.apache.org/~kwright/apache-manifoldcf-1.1 There is a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1-RC3 Please vote on whether or not to release ManifoldCF 1.1, RC2. The release artifact can be found at: http://people.apache.org/~kwright/apache-manifoldcf-1.1 There is a tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1-RC2 This release candidate fixes one problem since RC2. The problem is CONNECTORS-618, which relates to MySQL performance. This release candidate fixes one additional problem since RC1. The problem is CONNECTORS-616, and relates to Solr dropping connections during indexing. This release candidate fixes two other problems since RC0, both related to Solr 4.0.0 support. - CONNECTORS-613: The version of Tika used in Solr 4.0.0 cannot extract text unless told an accurate mime type. While this is probably a Tika bug, in this ticket we at least make sure a good guess as to the mime type is sent to Solr. - CONNECTORS-614: Fix logic having to do with releasing idle Solr connections. This shows up as socket timeout exceptions, because it becomes very easy to exhaust the Solr application server's thread pool when idle connections are not released in a timely way. 
This release includes a significant amount of long-planned upgrading and refactoring since Apache ManifoldCF 1.0.1, including: - Port to HttpComponents from commons-httpclient - Port to SolrJ from homegrown for the Solr connector, so that SolrCloud is supported - Improved NTLM support - Partial Kerberos support - Many other improvements, which are summarized in CHANGES.txt Karl -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: New committer: Minoru Osuka
Welcome as a ManifoldCF committer, Minoru! Erlend On 10.01.13 21.00, Karl Wright wrote: The Project Management Committee (PMC) for Apache ManifoldCF has asked Minoru Osuka to become a committer and we are pleased to announce that they have accepted. Minoru has been active in using advanced features of Solr, and has been instrumental in bringing our Solr connector into the modern era. Please join me in welcoming Minoru to the Apache ManifoldCF project! Karl -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Do we need an org.apache.manifoldcf.core.DBClean command class?
On 18.12.12 16.29, Karl Wright wrote: Hmm, somehow you lost a connector jar out of the connector-lib or connector-lib-proprietary area. Deleting the jars before you clean up the database is not going to work. ;-) I guess not. This is probably what I have done. Maybe this behaviour is so odd that we do not need the functionality I was mentioning? As long as one is removing things in the correct order, problems will not show up. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Issues marked fix in ManifoldCF next
The Hydra framework has recently been changed and I just got some necessary documentation two weeks ago. The Hydra connector will therefore be included in version 1.2 instead. Anyway, I have started to work on it and will commit some very basic stuff soon. Fix version for CONNECTORS-193 is changed from next to 1.2 as well since this has been assigned to me for a while now. Erlend On 08.12.12 03.29, Karl Wright wrote: Hi folks, I've created a ManifoldCF 1.2 release in JIRA and triaged some tickets I intend to work on for that release. I've also closed/resolved a fair number of tickets that were hanging around marked fix in ManifoldCF next. You may want to do the same... Karl -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Developing an Email Connector
Yes, I'm aware of that. It's on my TODO list. Thanks for all your suggestions and comments, but please make a comment in Jira next time. It will be easier for me if everything related to this connector is placed there. :) The connector will not be ready for the next release since I have to finish another connector first. Erlend On 24.10.12 15.44, Maciej Liżewski wrote: I was thinking about one more thing: document IDs for indexed emails... it should be somehow possible to form them as valid URLs, but POP3 and IMAP do not support it by themselves. So you have to create links to some third party web-mail system but there is a number of such systems and the document ID should be customizable to support as many of them as possible... what do you think? 2012/10/15 Erlend Garåsen e.f.gara...@usit.uio.no Sounds like a good idea. I didn't even think about attachments, even though it's quite obvious that we need to take care of them. :) Erlend On 15.10.12 14.40, Maciej Liżewski wrote: I would like to add my 5 cents here. I would like the connector to set a multivalued cat field with some categories which I could use with faceted dynamic groups, like: with attachment, sender domain, sender name, etc. It would also be nice to index the attachments as linked documents (they could have the same categories as the email message above). On 15 Oct 2012 14:13, Karl Wright daddy...@gmail.com wrote: Sounds great! I can't wait to see it. Karl On Mon, Oct 15, 2012 at 6:31 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: Karl and I had a short discussion about such a connector in Cambridge some months ago. Now I have created the following ticket regarding an Email Connector: https://issues.apache.org/jira/browse/CONNECTORS-553 I'm notifying the list in case some of you have comments or special wishes. 
Generally it will support IMAP and POP3, SSL/TLS, the possibility to specify the port numbers if necessary, server certificate upload etc. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Developing an Email Connector
Sounds like a good idea. I didn't even think about attachments, even though it's quite obvious that we need to take care of them. :) Erlend On 15.10.12 14.40, Maciej Liżewski wrote: I would like to add my 5 cents here. I would like the connector to set a multivalued cat field with some categories which I could use with faceted dynamic groups, like: with attachment, sender domain, sender name, etc. It would also be nice to index the attachments as linked documents (they could have the same categories as the email message above). On 15 Oct 2012 14:13, Karl Wright daddy...@gmail.com wrote: Sounds great! I can't wait to see it. Karl On Mon, Oct 15, 2012 at 6:31 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: Karl and I had a short discussion about such a connector in Cambridge some months ago. Now I have created the following ticket regarding an Email Connector: https://issues.apache.org/jira/browse/CONNECTORS-553 I'm notifying the list in case some of you have comments or special wishes. Generally it will support IMAP and POP3, SSL/TLS, the possibility to specify the port numbers if necessary, server certificate upload etc. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [PROPOSAL] Release a ManifoldCF 1.0.1 release
+1 Erlend On 09.10.12 22.53, Karl Wright wrote: Hi folks, Due to the potential severity of CONNECTORS-551, I think it might be a good idea to release a ManifoldCF 1.0.1 release which contains the fix for this ticket. Please can I have a show of hands as to whether people agree that this is serious enough to warrant such a release. Thanks! Karl -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: question about multiple languages
On 08.10.12 17.03, Maciej Liżewski wrote: Now there are two possibilities: 1. when fields are untouched - processing data (stemming, etc) is same for every document, which is rather wrong because polish stemming is different from english one... :) 2. attributes are mapped to *_lang and every *_lang field has different processing definition (stemming, stop words, etc). The latter seems more reasonable for me and is more common practice. There are different stemmers you may try out such as Hunspell. If you want to detect languages, I would use TikaLanguageIdentifierUpdateProcessorFactory: http://wiki.apache.org/solr/LanguageDetection It can be configured by using an Update Request Processor: http://wiki.apache.org/solr/UpdateRequestProcessor This part I understand, but I am confused on how to perform valid queries in both cases? I have single (simple) page which should work google-like: you enter a text and get results. But there is no language guess process for queries... Do I have to specify on each query whether it should search in 'text_en' or 'text_pl' fields? If so - it is not very good because I would like users to get all documents that match query no matter what language they are written in. There are many similar words, technical names, etc, which are same in many languages... I think you should search in both fields, yes. I will explain why further down. In other words - how to achieve google-like search with stemming for multiple languages and without to force users to select language they would like to search in? Google does a guessing about the query language. If you hit www.google.com, you will be redirected to www.google.pl if you're sitting in Poland. This may also be achieved in your application by detecting the browser's locale etc. Many web application frameworks have support for this. Then you may give (at query time) a higher boost to the fields belonging to the language detected. 
Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
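The advice above (search all language fields, but boost the one matching the user's likely language) can be sketched as a set of Solr request parameters. This is only an illustration: the field names `text_en` and `text_pl` come from the thread, while the function name, the boost value of 2.0 and the choice of the edismax query parser with its `qf` parameter are assumptions about how one might set it up. Detecting the language from the browser locale is left out.

```python
from urllib.parse import urlencode


def build_query_params(user_query, detected_lang, languages=("en", "pl"), boost=2.0):
    """Build edismax parameters that search every text_<lang> field,
    giving a higher boost to the field of the detected language."""
    fields = []
    for lang in languages:
        field = "text_" + lang
        if lang == detected_lang:
            # Solr's qf syntax for a per-field boost is field^boost.
            field += "^" + str(boost)
        fields.append(field)
    return {"q": user_query, "defType": "edismax", "qf": " ".join(fields)}
```

For a user whose locale suggests Polish, `build_query_params("connector", "pl")` yields a `qf` value of `text_en text_pl^2.0`, so documents in either language still match but Polish hits rank higher. The dictionary can be passed through `urlencode` to form the query string of a select request.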
Re: [VOTE] Release Apache ManifoldCF 1.0, RC7
-manifoldcf-1.0 There is also an SVN tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.0-RC7 Fixes since RC6: CONNECTORS-549 Fixes since RC5: CONNECTORS-547 CONNECTORS-548 (documentation fix) Fixes since RC4: CONNECTORS-545 Fixes since RC3: CONNECTORS-544 -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.0, RC5
On 28.09.12 13.31, Erlend Garåsen wrote: OK, I will give you a stack trace in the beginning of next week. Do you still need the stack trace? If you do, I need to adjust the log level and/or change the source code in order to print it out. I'm still a little bit worried about how MCF deals with 500 server errors since the job I started last Friday is still running. It keeps retrying the three documents I previously mentioned. Is it really normal behaviour that MCF retries the same document every four seconds after the last attempt and continues to do this (perhaps) a thousand times? MCF has probably been retrying these documents for four days now. I doubt this is normal behaviour. The job should have ended in the middle of the day on Saturday, and now it's Tuesday. I will test the latest RC after these issues have been clarified. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
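For contrast with the fixed four-second retry described above, here is a minimal sketch of a capped exponential backoff schedule with a finite number of attempts. It is a generic illustration and not a description of how ManifoldCF schedules retries; the function name and all default values are assumptions.

```python
def retry_delays(base_seconds=4.0, factor=2.0, max_delay=3600.0, max_attempts=10):
    """Return the wait before each retry: exponential growth, capped, finite."""
    delays = []
    delay = base_seconds
    for _ in range(max_attempts):
        # Never wait longer than max_delay between two attempts.
        delays.append(min(delay, max_delay))
        delay *= factor
    return delays
```

With these defaults a failing document is tried ten more times over roughly 68 minutes (4092 seconds) and then given up, whereas retrying every four seconds for four days amounts to something on the order of 86,400 attempts per document.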
Re: [VOTE] Release Apache ManifoldCF 1.0, RC5
Karl, you wrote: I was able to reproduce the exception here using your URL. It is indeed a bug in how it handles the 500 error. OK, then I guess that the StringIndexOutOfBoundsException *was* related to the 500 server issue (It is not clear at all that it is related to the 500 error you described before, but it could be.). To clarify another thing: These three documents are fetched over and over again every fourth second (in four days). I was mentioning this in case we had another issue. I'm just trying to clarify this before I deploy RC7 as I wrote. Anyway, I will deploy RC7 now and start my job once more. Erlend On 02.10.12 11.03, Karl Wright wrote: No stack trace needed. If you read the rest of the mail, you will note that I was able to reproduce the issue using the URL you had provided. There have been two RC's since; we are on RC7 now. Karl On Tue, Oct 2, 2012 at 4:38 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: On 28.09.12 13.31, Erlend Garåsen wrote: OK, I will give you a stack trace in the beginning of next week. Do you still need the stack trace? If you do, I need to adjust the log level and/or change the source code in order to print it out. I'm still a little bit worried about how MCF deals with 500 server errors since the job I started last Friday is still running. It retries and retries the three documents I previously mentioned. Is it really a normal behaviour that MCF retries the same document every fourth second after the last attempt and continues do do this (perhaps) thousand times? MCF has probably retried these documents in four days now. I doubt this is normal behaviour. The job should end in the middle of the day on Saturday, and now it's Tuesday. I will test the latest RC after these issues have been clarified. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. 
Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.0, RC5
I'm trying to start a crawl before I have to run to the airport. I just discovered that MCF recrawls the same host over and over again when it returns result code 500:

09-28-2012 11:40:11.024 fetch http://foreninger.uio.no/go/oslo_open_2012_no.php 500

It's not just this document; several others return the same HTTP result code. Meanwhile, the following is filling up my log:

FATAL 2012-09-28 11:42:32,112 (Worker thread '29') - Error tossed: String index out of range: -1
java.lang.StringIndexOutOfBoundsException: String index out of range: -1

I'm pretty sure they are related to each other. I will end this job before I leave because I'm afraid that MCF will try to fetch these documents over and over again during the weekend. Erlend On 28.09.12 09.58, Karl Wright wrote: Please vote +1 to release ManifoldCF 1.0, RC5. The release artifact can be found at: http://people.apache.org/~kwright/apache-manifoldcf-1.0 There is also an SVN tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.0-RC5 Fixes since RC4: CONNECTORS-545 Fixes since RC3: CONNECTORS-544 -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.0, RC5
OK, I will give you a stack trace in the beginning of next week. I will start the crawler once more and check the results when I'm back and change my vote then if it is ok. Erlend On 28.09.12 13.26, Karl Wright wrote: Meanwhile, the following is filling up my log: FATAL 2012-09-28 11:42:32,112 (Worker thread '29') - Error tossed: String index out of range: -1 java.lang.StringIndexOutOfBoundsException: String index out of range: -1 This is indeed a problem I agree we should fix, but in order to do that I need a stack trace. It is not clear at all that it is related to the 500 error you described before, but it could be. I will create a ticket for it though. Karl On Fri, Sep 28, 2012 at 5:49 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: I'm trying to start a crawl before I have to run to the airport. I just discovered that MCF recrawls the same host over and over again when it returns result code 500: 09-28-2012 11:40:11.024 fetch http://foreninger.uio.no/go/oslo_open_2012_no.php 500 It's just not this document, but several others returning the same HTTP result code. Meanwhile, the following is filling up my log: FATAL 2012-09-28 11:42:32,112 (Worker thread '29') - Error tossed: String index out of range: -1 java.lang.StringIndexOutOfBoundsException: String index out of range: -1 I'm pretty sure they are related to each other. I will end this job before I leave because I'm afraid that MCF will try to fetch these documents over and over again during this weekend. Erlend On 28.09.12 09.58, Karl Wright wrote: Please vote +1 to release ManifoldCF 1.0, RC5. The release artifact can be found at: http://people.apache.org/~kwright/apache-manifoldcf-1.0 There is also an SVN tag at: https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.0-RC5 Fixes since RC4: CONNECTORS-545 Fixes since RC3: CONNECTORS-544 -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. 
Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.0, RC3
Resin restores the application because of a new feature in version 4 that is related to its clustering functionality. Resin stores all versions of an application (war) in a local Git repository, which means it is able to restore an application if it has been deleted: http://www.caucho.com/resin-4.0/admin/clustering-overview.xtp#DeployingApplicationstoaCluster Since I'm going to Amsterdam tomorrow morning, I don't want to change anything in Resin right now in case I break something (other apps are running on the same server). I tried to deploy the combined war on Tomcat, but then I couldn't connect to our PostgreSQL server because it seems to have a new SSL certificate I haven't installed into my local keystore. I guess it is possible to configure HSQLDB, but I'm afraid that my time is running out. Sorry. Erlend On 26.09.12 18.00, Erlend Garåsen wrote: Yes, I know, Karl, but I'm actually deleting the places where they are unpacked. I will get back to this as soon as I have spoken to my colleague. I will find out why tomorrow after our Solr meeting. If I get a reply this evening, I will try to do a new test from home. Erlend On 26.09.12 17.55, Karl Wright wrote: Usually application servers unpack the war somewhere. Unless you remove the place where it is unpacked you will continue to have the applications even after the war is gone. Karl On Wed, Sep 26, 2012 at 11:52 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: Hm, it seems that Resin manages to restore the three applications even though I delete the three war files and the path where they are built. This makes it a little bit more difficult to test. I haven't restarted Resin, only the instance where MCF is running since other applications are running on the same server. I have asked someone with better server skills and Resin knowledge. Erlend On 26.09.12 15.02, Erlend Garåsen wrote: On 26.09.12 14.39, Karl Wright wrote: I didn't do documentation (or tests) because it is experimental at this point. 
It replaces ALL of manifoldcf in one war. So it is exactly like the single-process example (and would use the same properties.xml) but deployable as a war. Does this help? Sure! I will test it within 24 hours, probably later today. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.0, RC0
On 21.09.12 20.55, Karl Wright wrote: I see what has happened here. You unregistered the connectors before you deleted the job. That basically meant that the job cleanup can't take place until the connector(s) it requires are registered again. That's correct, and I also see the problem now. I forgot to install the Filesystem connector before I did the configuration import (normally we do not use this connector, but I installed one as a part of a test I did). After I installed it, I no longer get an NPE. Maybe our routines for upgrading MCF need to be changed. We want to be sure that these connectors do not need new fields in the database tables due to changed/new functions. Therefore I thought this was the safest approach. First we export the configuration, then we uninstall all connectors by using the executecommand script, then delete the tables by performing an agents.Uninstall command, then reinstall everything and finally import the configuration. I still cannot delete my jobs since their status is "cleaning up". And the reason is that I didn't delete my jobs prior to executing crawler.UnRegisterAll? Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: [VOTE] Release Apache ManifoldCF 1.0, RC0
According to my bash history, I did a LockClean prior to the upgrade: solr-test02 mcf-1 $ history | grep LockClean 154 $MCF_HOME/processes/script/executecommand.sh org.apache.manifoldcf.core.LockClean I will create a ticket. Let's hope this is a local problem on my server and not a bug, but I think it's best to investigate it to be sure. I will try to provide as much information as possible in the ticket. Erlend On 21.09.12 15.08, Karl Wright wrote: Hmm, that's not good. Can you open a ticket for this NPE, and also please attach the .java file it is referring to: _simplereport__jsp.java? It should be found in resin's workarea somewhere. As for the job not being deleted, can you supply further details of your setup? Specifically, properties.xml (so I can see what synch settings you have), and what database, etc. If there is nothing in the log, I'd shut down everything, execute the lock-clean procedure, and start everything back up, and see if that fixed the issue. Karl On Fri, Sep 21, 2012 at 9:01 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: On 21.09.12 14.47, Karl Wright wrote: A temporary error should not block a (non running) job from getting cleaned up. The job can't be deleted until it is stopped, and no outstanding documents are being worked on. How many documents are still listed for the job in the UI? Documents: 0 Active: 0 Proceed: 0 Nothing in the simple history, but I get a 500 Server Exceptions if I try to read the history of the file connector (due to a NPE) - stack trace further down in this post. I can try to unregister all connectors and empty the database and try again. 
[2012-09-21 14:57:10.220] {resin-port-127.0.0.1:6945-143} java.lang.NullPointerException at _jsp._simplereport__jsp._jspService(_simplereport__jsp.java:367) at _jsp._simplereport__jsp._jspService(_simplereport__jsp.java:36) at com.caucho.jsp.JavaPage.service(JavaPage.java:64) at com.caucho.jsp.Page.pageservice(Page.java:542) at com.caucho.server.dispatch.PageFilterChain.doFilter(PageFilterChain.java:194) at com.caucho.server.webapp.DispatchFilterChain.doFilter(DispatchFilterChain.java:126) at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:289) at com.caucho.server.webapp.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:298) at com.caucho.server.webapp.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:116) at com.caucho.jsp.PageContextImpl.forward(PageContextImpl.java:1149) at _jsp._execute__jsp._jspService(_execute__jsp.java:1072) at _jsp._execute__jsp._jspService(_execute__jsp.java:36) at com.caucho.jsp.JavaPage.service(JavaPage.java:64) at com.caucho.jsp.Page.pageservice(Page.java:542) at com.caucho.server.dispatch.PageFilterChain.doFilter(PageFilterChain.java:194) at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:156) at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:289) at com.caucho.server.hmux.HmuxRequest.handleInvocation(HmuxRequest.java:468) at com.caucho.server.hmux.HmuxRequest.handleRequestImpl(HmuxRequest.java:369) at com.caucho.server.hmux.HmuxRequest.handleRequest(HmuxRequest.java:336) at com.caucho.network.listen.TcpSocketLink.dispatchRequest(TcpSocketLink.java:1301) at com.caucho.network.listen.TcpSocketLink.handleRequest(TcpSocketLink.java:1257) at com.caucho.network.listen.TcpSocketLink.handleRequestsImpl(TcpSocketLink.java:1241
Japanese translation needed for CONNECTORS-486
Japanese translation is needed for CONNECTORS-486. I have just committed the English version (r1388020). It seems that we do not have a Japanese translation of the how-to-build-and-deploy page, so I'm unsure whether we need to translate my changes at this time. If we *do* have a translation for that page, here's what needs to be translated:

1. New row in the property.xml table: "Yes, if file encryption is used" and "Specify the seed value to be used for encrypting the file to which the crawler configuration is exported."

2. (New section under the commands table): "Encrypting crawler configuration data. By adding a passcode as a second argument to the ExportConfiguration command class, the file will be encrypted by using the AES algorithm. This can be useful to prevent repository passwords from being stored in clear text. In order to use this functionality, you must add a seed value to your configuration file. The same passcode along with the seed value are used to decrypt the file with the ImportConfiguration command class. See the documentation for the commands and properties above to find the correct arguments and settings."

Thanks, Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
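The text above says that the same passcode and seed value are needed for both export and import. One common way to get that property is to derive the encryption key deterministically from the two values. The sketch below uses PBKDF2 with the seed as salt; this is an assumption for illustration only and not necessarily how ManifoldCF derives its AES key. The function name, iteration count and key length are hypothetical.

```python
import hashlib


def derive_key(passcode, seed, length=16, iterations=100_000):
    """Derive a fixed-length key from a passcode, using the seed as salt."""
    return hashlib.pbkdf2_hmac(
        "sha256",
        passcode.encode("utf-8"),
        seed.encode("utf-8"),
        iterations,
        dklen=length,
    )
```

Because the derivation is deterministic, the import side recovers the same 16-byte key only if it is given both the same passcode and the same seed; changing either one produces a different key and decryption fails.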
Re: ant test - BUILD FAILED
You're right! I have been so busy the last two weeks that I have barely read the latest emails. Sorry.

Erlend

On 11.09.12 15.06, Karl Wright wrote:

Hi Erlend, I posted a warning about this a few days back. You need to rerun ant make-core-deps.

On Tue, Sep 11, 2012 at 9:02 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

BUILD FAILED
/Users/erlendfg/tmp/mcf_2012/build.xml:870: The following error occurred while executing this line:
/Users/erlendfg/tmp/mcf_2012/connectors/sharepoint/build.xml:71: /Users/erlendfg/tmp/mcf_2012/lib/sharepoint-2007 does not exist.

I just encountered this as I was about to commit my own contribution regarding CONNECTORS-486. Since my changes are not related to SharePoint, I will commit them anyway, but later in the evening.

BTW, I will try to promote ManifoldCF at Scandinavia's biggest developer conference tomorrow and Thursday. Basically I'm responsible for promoting Oslo Solr Community, but I have a lot of time to recommend ManifoldCF to people who need a new open source search engine for their business. http://jz12.java.no/

Thanks for doing this! Karl

Erlend

-- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: New committer: Ahmet Arslan
Welcome to the MCF community, Ahmet! Erlend On 28.08.12 23.01, Karl Wright wrote: The Project Management Committee (PMC) for Apache ManifoldCF has asked Ahmet Arslan to become a committer and we are pleased to announce that he has accepted. Ahmet brings significant skills and resources in the area of SharePoint development to the project, and we look forward to his continuing contribution in this area, and any other area he wishes to address. Please join me in welcoming Ahmet to the community! Karl -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Where do you MCF committers live?
Since I'm traveling a lot, I'm curious about where you committers live in the world. I have already met Karl in Boston in May this year, and I would like to meet other committers as well. I live in Oslo, Norway's capital and largest city. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Missing jdbcpool in Maven repository
Thanks! mvn-bootstrap.sh is not related to this problem, but I managed to build core with Maven after an svn up. I still have some problems building other parts of MCF, but I'm afraid I have to get back to this issue tomorrow, since I have to leave my office in five minutes.

Erlend

On 27.06.12 18.31, Karl Wright wrote:

Yup, the dependency is superfluous, and r1354618 removes it. If you find any others, please let me know. Karl

On Wed, Jun 27, 2012 at 12:29 PM, Karl Wright daddy...@gmail.com wrote:

Also, we no longer have a dependency on jdbcpool so I think that pom can be modified to just remove it. Let me check. Karl

On Wed, Jun 27, 2012 at 12:28 PM, Karl Wright daddy...@gmail.com wrote:

You need to run the mvn-bootstrap.sh script first. Karl

On Wed, Jun 27, 2012 at 11:09 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

mvn eclipse:eclipse fails, probably because the following is not available in any Maven repository (framework/core/pom.xml):

<dependency>
  <groupId>com.bitmechanic</groupId>
  <artifactId>jdbcpool</artifactId>
  <version>${jdbcpool.version}</version>
</dependency>

Erlend

-- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Translation issues [was: Re: ManifoldCF 0.6 release]
On 25.06.12 03.25, Shinichiro Abe wrote:

5番目のタブはコミットタブです。これはコミットを動作を制御することができます。すべてのジョブの終了時にドキュメントをコミットするようデフォルトで有効になっています。また、ミリ秒単位で一定時間内に各ドキュメントをコミットすることができます(10秒以内にコミットなら10000と登録します)。commit withinの挙動はManifoldCFでなくSolrに委ねられています。タブは以下のように表示されます:

Can you please place a link to http://wiki.apache.org/solr/CommitWithin in your translation? Here is the English version including a link:

The fifth tab is the Commits tab, which allows you to control the commit strategies. As well as committing documents at the end of every job, an option which is enabled by default, you may also commit each document within a certain time in milliseconds (e.g. 10000 for committing within 10 seconds). The <a href="http://wiki.apache.org/solr/CommitWithin">commit within</a> strategy will leave the responsibility to Solr instead of ManifoldCF. The tab looks like:

Erlend

-- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
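Since the commit-within value is given in milliseconds, a small sketch of the conversion may help translators and readers check the example figure. Solr's update handler accepts a `commitWithin` request parameter in milliseconds; everything else here (the helper names, the host, the port, and the `collection1` core name) is an assumption for illustration and does not show how ManifoldCF itself builds its requests.

```python
from urllib.parse import urlencode


def commit_within_params(seconds: float) -> dict:
    """Convert a delay in seconds to the millisecond value passed as commitWithin."""
    return {"commitWithin": str(int(seconds * 1000))}


def build_update_url(base_url: str, seconds: float) -> str:
    """Build an update URL carrying the commitWithin parameter (hypothetical helper)."""
    return f"{base_url}/update?{urlencode(commit_within_params(seconds))}"


# Hypothetical Solr location; 10 seconds corresponds to commitWithin=10000.
url = build_update_url("http://localhost:8983/solr/collection1", 10)
```

With this strategy the client only states the deadline; Solr decides when the commit actually happens within that window.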
Translation issues [was: Re: ManifoldCF 0.6 release]
The following needs translation:

You can for instance add update.chain=myChain to select the document processing pipeline/chain to use for processing documents in Solr.

Google Translation:
あなたは、たとえば、Solrのドキュメントを処理するために使用する文書処理 パイプライン/チェーンを選択するupdate.chain= myChainを追加することができ ます。

The fourth tab is the Documents tab, which allows you to do document filtering based on size and mime types. By specifying a maximum document length in bytes, you can filter out documents which exceed that size (e.g. 10485760 which is equivalent to 10 MB). If you only want to add documents with specific mime types, you can enter them into the included mime types field (e.g. text/html for filtering out all documents but HTML). The excluded mime types field is for excluding documents with specific mime types (e.g. image/jpeg for filtering out JPEG images). The tab looks like:
===
4番目のタブは、ドキュメントのサイズやMIMEタイプに基づいてフィルタリング を行うことができますドキュメントタブです。バイト単位の最大ドキュメント の長さを指定することによって、あなたはそのサイズ(10 MBと同等です例えば 10485760)を超えてドキュメントをフィルタリングすることができます。あなた が特定のMIMEタイプを使用してドキュメントを追加したい場合は、(すべての文 書が、HTMLをフィルタリングするために必要な、例えば text / htmlの)が 含まれてMIMEタイプフィールドにそれらを入力することができます。 MIMEタ イプの除外フィールドは、特定のMIMEタイプ(JPEG画像をフィルタリングする 例: image / jpegに)を使って文書を除くためのものです。タブは以下のよ うに表示されます:

The fifth tab is the Commits tab, which allows you to control the commit strategies. As well as committing documents at the end of every job, an option which is enabled by default, you may also commit each document within a certain time in milliseconds (e.g. 10000 for committing within 10 seconds). The commit within strategy will leave the responsibility to Solr instead of ManifoldCF. The tab looks like:
===
5番目のタブでは、コミットの戦略を制御することができますコミットタブで す。同様にすべてのジョブの終了時にドキュメントをコミットするように、デ フォルトで有効になっているオプションは、また、ミリ秒単位で一定時間(10秒 以内にコミット例えば10000)内の各ドキュメントをコミットすることができ ます。戦略にコミットではなくManifoldCFのSolrに責任を残します。あなたは Solrの出力接続を作成するときに、5つのコンフィギュレーションタブが表示さ れます。

Comment: I need to place a link to http://wiki.apache.org/solr/CommitWithin over the text "commit within" (last sentence).
I need help placing the link correctly in the Japanese translation. I haven't committed my changes yet, since the password to my Apache account has been changed. If I do not manage to fix my account by the beginning of next week, I will add a patch instead.

Karl: I'm not sure I explained the purpose of the included mime types field correctly. Please review and comment if necessary.

Thanks, Erlend

On 22.06.12 17.03, Erlend Garåsen wrote:

On 22.06.12 15.08, Karl Wright wrote:

That's OK. Any improvement welcome. ;-)

Shinichiro Abe: Will you assist me in translating the documentation? I have made Japanese screenshots as well, but the content has changed slightly; that is, I have removed certain sentences for the existing tabs (for Solr Output Connection). I will mark the ticket with "needs translation" after I have committed my changes (including my attempts at the translations). Or I can simply email the suggested sentences to the list. I will get back to this issue on Sunday (or tomorrow if I get time).

Erlend

-- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
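For anyone reviewing the Documents tab wording in this thread, the filtering rules as described there can be sketched in a few lines of Python. This only mirrors the description in the message (maximum length, included mime types, excluded mime types); the function name `accept_document` and its parameters are hypothetical, and the connector's real behaviour, in particular for the included mime types field, is still awaiting Karl's review.

```python
from typing import Collection, Optional


def accept_document(
    length_bytes: int,
    mime_type: str,
    max_length: Optional[int] = None,
    included: Collection[str] = (),
    excluded: Collection[str] = (),
) -> bool:
    """Return True if a document passes the size and mime type filters (sketch only)."""
    # Documents longer than the configured maximum are filtered out.
    if max_length is not None and length_bytes > max_length:
        return False
    # If an include list is given, only the listed mime types are accepted.
    if included and mime_type not in included:
        return False
    # Mime types on the exclude list are always rejected.
    if mime_type in excluded:
        return False
    return True


# 10 MB expressed in bytes, matching the example value in the documentation text.
TEN_MB = 10 * 1024 * 1024
```

For example, `accept_document(2048, "text/html", max_length=TEN_MB, included={"text/html"})` accepts a small HTML page, while any document larger than `TEN_MB` bytes or of type `image/jpeg` with `excluded={"image/jpeg"}` is rejected.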