Odd. The problem is obviously the port of -1. But the code does not attach a specific port to the URL in that case.
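
For reference, java.net.URL.getPort() returns -1 when a URL carries no explicit port, so a ":-1" in a log message suggests that value reached a string somewhere without a guard. The following is a minimal sketch of such a guard, assuming a plain java.net.URL is used to build the robots.txt location. The class RobotsUrlSketch and the method robotsUrl are hypothetical names for illustration and are not taken from the connector's code.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class RobotsUrlSketch {
    // Hypothetical helper (not ManifoldCF code): derives the robots.txt URL
    // for the host that serves the given document URL.
    static String robotsUrl(String documentUrl) {
        URL url;
        try {
            url = URI.create(documentUrl).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable URL: " + documentUrl, e);
        }
        int port = url.getPort(); // -1 when the URL has no explicit port
        StringBuilder sb = new StringBuilder();
        sb.append(url.getProtocol()).append("://").append(url.getHost());
        // Append the port only if one was given and it is not the scheme default.
        if (port != -1 && port != url.getDefaultPort()) {
            sb.append(':').append(port);
        }
        return sb.append("/robots.txt").toString();
    }

    public static void main(String[] args) {
        // prints http://www.milliyet.com.tr/robots.txt
        System.out.println(robotsUrl("http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml"));
        // prints http://example.com:8080/robots.txt
        System.out.println(robotsUrl("http://example.com:8080/feed.xml"));
    }
}
```

Whether the connector's failure comes from string building like this or from a -1 being passed to the HTTP client as a real port number is not established here; the sketch only shows the default-port check itself.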
I will try your example exactly when I have access to internet again.

Karl

Sent from my Windows Phone

From: Ahmet Arslan
Sent: 11/17/2012 4:47 PM
To: dev@manifoldcf.apache.org
Subject: Re: Anyone out there using RSS connector, who wants to help?

Hi,

Regarding

"WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1"

I see that http://www.milliyet.com.tr/robots.txt exists.

Ahmet

--- On Sat, 11/17/12, Ahmet Arslan <iori...@yahoo.com> wrote:

> From: Ahmet Arslan <iori...@yahoo.com>
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> To: dev@manifoldcf.apache.org
> Date: Saturday, November 17, 2012, 11:11 PM
>
> Hi Karl,
>
> Never used rss connector. But here is what I have done.
>
> I defined a job to crawl using mcf-trunk. mfc-trunk crawled
> following two URLs:
>
> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>
> With CONNECTORS-120 branch I can crawl
>
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>
> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
> status of "Error: Repeated service interruptions - failure
> getting document version"
>
> I see these in the log file :
>
> WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>
> By the way in "Dechromed Content" tab (Job Setting UI) I see
> four " "
>
> Thanks,
> Ahmet
>
> --- On Fri, 11/16/12, Karl Wright <daddy...@gmail.com> wrote:
>
> > From: Karl Wright <daddy...@gmail.com>
> > Subject: Anyone out there using RSS connector, who wants to help?
> > To: "dev" <dev@manifoldcf.apache.org>
> > Date: Friday, November 16, 2012, 3:54 PM
> >
> > Hi all,
> >
> > The branch
> > https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
> > contains an RSS connector that has been updated to use httpcomponents
> > 4.2.2. I'd love for people who are in a position to do significant
> > RSS crawling to try it out before I pull it into trunk. Any takers?
> >
> > Karl