Hi Ahmet, I tried your example, but it looked like it worked fine here. Here's part of the simple history:
>>>>>>
11-18-2012 12:59:52.182 document ingest (null) http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu... ndem/gundemdetay/18.11.2012/1628733/default.htm OK 16307 1
11-18-2012 12:59:47.482 document ingest (null) http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/... gundemdetay/18.11.2012/1628657/default.htm OK 10573 1
11-18-2012 12:59:47.133 fetch http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu... ndem/gundemdetay/18.11.2012/1628733/default.htm 200 16307 5050
11-18-2012 12:59:42.133 fetch http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/... gundemdetay/18.11.2012/1628657/default.htm 200 10573 5340
11-18-2012 12:59:42.092 document ingest (null) http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile... sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm OK 10212 1
11-18-2012 12:59:37.252 document ingest (null) http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi... nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm OK 16105 1
11-18-2012 12:59:37.133 fetch http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile... sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm 200 10212 4950
11-18-2012 12:59:32.332 document ingest (null) http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde... m/gundemdetay/18.11.2012/1628801/default.htm OK 10170 1
11-18-2012 12:59:32.133 fetch http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi... nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm 200 16105 5110
11-18-2012 12:59:27.142 document ingest (null) http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu... ndemdetay/18.11.2012/1628661/default.htm OK 10102 1
11-18-2012 12:59:27.133 fetch http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde... m/gundemdetay/18.11.2012/1628801/default.htm 200 10170 5200
11-18-2012 12:59:22.182 document ingest (null) http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/... gundemdetay/18.11.2012/1628824/default.htm OK 10217 1
11-18-2012 12:59:22.133 fetch http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu... ndemdetay/18.11.2012/1628661/default.htm 200 10102 4990
11-18-2012 12:59:18.062 document ingest (null) http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem... /gundemdetay/18.11.2012/1628856/default.htm OK 9721 1
11-18-2012 12:59:17.133 fetch http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/... gundemdetay/18.11.2012/1628824/default.htm 200 10217 5050
11-18-2012 12:59:12.452 document ingest (null) http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd... etti/gundem/gundemdetay/18.11.2012/1628795/default.htm OK 11412 1
11-18-2012 12:59:12.133 fetch http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem... /gundemdetay/18.11.2012/1628856/default.htm 200 9721 5930
11-18-2012 12:59:07.133 fetch http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd... etti/gundem/gundemdetay/18.11.2012/1628795/default.htm 200 11412 5300
11-18-2012 12:59:06.892 document ingest (null) http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/... gundemdetay/17.11.2012/1628402/default.htm OK 11183 1
11-18-2012 12:59:02.772 document ingest (null) http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi... /gundem/gundemdetay/18.11.2012/1628740/default.htm OK 10632 1
11-18-2012 12:59:02.153 fetch http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/... gundemdetay/17.11.2012/1628402/default.htm 200 11183 4720
11-18-2012 12:58:57.173 fetch http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi... /gundem/gundemdetay/18.11.2012/1628740/default.htm 200 10632 5570
11-18-2012 12:58:52.533 robots parse www.hurriyet.com.tr SUCCESS 0 78
11-18-2012 12:58:52.511 robots parse gundem.milliyet.com.tr SUCCESS 0 70
11-18-2012 12:58:52.136 fetch http://www.hurriyet.com.tr/robots.txt 200 928 476
11-18-2012 12:58:52.129 fetch http://gundem.milliyet.com.tr/robots.txt 200 797 453
11-18-2012 12:58:49.013 fetch http://rss.hurriyet.com.tr/rss.aspx?sectionId=2 200 34467 1080
11-18-2012 12:58:48.993 fetch http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml 200 72439 1510
11-18-2012 12:58:44.513 robots parse www.milliyet.com.tr SUCCESS 0 340
11-18-2012 12:58:44.013 fetch http://rss.hurriyet.com.tr/robots.txt 404 4096 770
11-18-2012 12:58:44.013 fetch http://www.milliyet.com.tr/robots.txt 200 17484 840
11-18-2012 12:58:41.502 job start 1353261469661(rss) 0 1
<<<<<<

So it looks like there's a http://www.milliyet.com.tr/robots.txt that it fetched fine, and there is no http://rss.hurriyet.com.tr/robots.txt. Does this seem correct to you? Furthermore, there is content that the feed points at that requires access to (and robots fetches for) two other servers...

Karl

On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <daddy...@gmail.com> wrote:
> Odd. The problem is obviously the port of -1. But the code does not
> attach a specific port to the URL in that case.
>
> I will try your example exactly when I have access to internet again.
>
> Karl
>
> Sent from my Windows Phone
> From: Ahmet Arslan
> Sent: 11/17/2012 4:47 PM
> To: dev@manifoldcf.apache.org
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> Hi,
>
> Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> Pre-ingest service interruption reported for job 1353185325276
> connection 'rss': Couldn't fetch robots.txt from
> http://www.milliyet.com.tr:-1"
>
> I see that http://www.milliyet.com.tr/robots.txt exists.
>
> Ahmet
>
> --- On Sat, 11/17/12, Ahmet Arslan <iori...@yahoo.com> wrote:
>
>> From: Ahmet Arslan <iori...@yahoo.com>
>> Subject: Re: Anyone out there using RSS connector, who wants to help?
>> To: dev@manifoldcf.apache.org
>> Date: Saturday, November 17, 2012, 11:11 PM
>> Hi Karl,
>>
>> Never used rss connector. But here is what I have done.
>>
>> I defined a job to crawl using mcf-trunk. mcf-trunk crawled
>> following two URLs:
>>
>> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>
>> With CONNECTORS-120 branch I can crawl
>>
>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>
>> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
>> status of "Error: Repeated service interruptions - failure
>> getting document version"
>>
>> I see these in the log file :
>>
>> WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
>> Pre-ingest service interruption reported for job
>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>> from http://www.milliyet.com.tr:-1
>> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') -
>> Exception tossed: Repeated service interruptions - failure
>> getting document version
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> Repeated service interruptions - failure getting document
>> version
>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>> WARN 2012-11-17 23:02:27,307 (Worker thread '30') -
>> Pre-ingest service interruption reported for job
>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>> from http://www.milliyet.com.tr:-1
>> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') -
>> Exception tossed: Repeated service interruptions - failure
>> getting document version
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> Repeated service interruptions - failure getting document
>> version
>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>
>>
>> By the way in "Dechromed Content" tab (Job Setting UI) I see
>> four " "
>>
>> Thanks,
>> Ahmet
>> --- On Fri, 11/16/12, Karl Wright <daddy...@gmail.com> wrote:
>>
>> > From: Karl Wright <daddy...@gmail.com>
>> > Subject: Anyone out there using RSS connector, who wants to help?
>> > To: "dev" <dev@manifoldcf.apache.org>
>> > Date: Friday, November 16, 2012, 3:54 PM
>> > Hi all,
>> >
>> > The branch
>> > https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
>> > contains an RSS connector that has been updated to use
>> > httpcomponents 4.2.2. I'd love for people who are in a
>> > position to do significant RSS crawling to try it out
>> > before I pull it into trunk. Any takers?
>> >
>> > Karl
>> >
>>
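
On the port of -1 in the warning above: in java.net, `URI.getPort()` (like `URL.getPort()`) returns -1 when the URL carries no explicit port, so a server string built from the raw port comes out as `http://www.milliyet.com.tr:-1`. The sketch below only illustrates that behavior. `PortDemo`, `naiveServerKey`, `serverKey`, and `effectivePort` are hypothetical names, not code from the RSS connector or the CONNECTORS-120 branch, and whether the connector actually builds its string this way is an assumption.

```java
import java.net.URI;

public class PortDemo {
    // Hypothetical helper: the explicit port if one is present, otherwise
    // the conventional default for the scheme (80 for http, 443 for https).
    public static int effectivePort(URI uri) {
        int port = uri.getPort(); // -1 when the URI has no explicit port
        if (port != -1) {
            return port;
        }
        String scheme = uri.getScheme();
        if ("https".equalsIgnoreCase(scheme)) {
            return 443;
        }
        if ("http".equalsIgnoreCase(scheme)) {
            return 80;
        }
        return -1; // unknown scheme: no default assumed
    }

    // Naive version: appends the raw port, so a missing port shows up as ":-1".
    public static String naiveServerKey(URI uri) {
        return uri.getScheme() + "://" + uri.getHost() + ":" + uri.getPort();
    }

    // Safer version: resolves the default port before building the string.
    public static String serverKey(URI uri) {
        return uri.getScheme() + "://" + uri.getHost() + ":" + effectivePort(uri);
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://www.milliyet.com.tr/robots.txt");
        System.out.println(naiveServerKey(uri)); // http://www.milliyet.com.tr:-1
        System.out.println(serverKey(uri));      // http://www.milliyet.com.tr:80
    }
}
```

A URL with an explicit port, such as `http://www.milliyet.com.tr:8080/robots.txt`, yields the same string from both methods, so only the no-port case needs the fallback.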