Hi Ahmet,

I tried your example, and it worked fine here.  Here's
part of the simple history:

>>>>>>
11-18-2012 12:59:52.182         document ingest (null)
        http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
ndem/gundemdetay/18.11.2012/1628733/default.htm
        OK      16307   1       
11-18-2012 12:59:47.482         document ingest (null)
        http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
gundemdetay/18.11.2012/1628657/default.htm
        OK      10573   1       
11-18-2012 12:59:47.133         fetch
        http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
ndem/gundemdetay/18.11.2012/1628733/default.htm
        200     16307   5050    
11-18-2012 12:59:42.133         fetch
        http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
gundemdetay/18.11.2012/1628657/default.htm
        200     10573   5340    
11-18-2012 12:59:42.092         document ingest (null)
        http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
        OK      10212   1       
11-18-2012 12:59:37.252         document ingest (null)
        http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
        OK      16105   1       
11-18-2012 12:59:37.133         fetch
        http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
        200     10212   4950    
11-18-2012 12:59:32.332         document ingest (null)
        http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
m/gundemdetay/18.11.2012/1628801/default.htm
        OK      10170   1       
11-18-2012 12:59:32.133         fetch
        http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
        200     16105   5110    
11-18-2012 12:59:27.142         document ingest (null)
        http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
ndemdetay/18.11.2012/1628661/default.htm
        OK      10102   1       
11-18-2012 12:59:27.133         fetch
        http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
m/gundemdetay/18.11.2012/1628801/default.htm
        200     10170   5200    
11-18-2012 12:59:22.182         document ingest (null)
        http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
gundemdetay/18.11.2012/1628824/default.htm
        OK      10217   1       
11-18-2012 12:59:22.133         fetch
        http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
ndemdetay/18.11.2012/1628661/default.htm
        200     10102   4990    
11-18-2012 12:59:18.062         document ingest (null)
        http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
/gundemdetay/18.11.2012/1628856/default.htm
        OK      9721    1       
11-18-2012 12:59:17.133         fetch
        http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
gundemdetay/18.11.2012/1628824/default.htm
        200     10217   5050    
11-18-2012 12:59:12.452         document ingest (null)
        http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
        OK      11412   1       
11-18-2012 12:59:12.133         fetch
        http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
/gundemdetay/18.11.2012/1628856/default.htm
        200     9721    5930    
11-18-2012 12:59:07.133         fetch
        http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
        200     11412   5300    
11-18-2012 12:59:06.892         document ingest (null)
        http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
gundemdetay/17.11.2012/1628402/default.htm
        OK      11183   1       
11-18-2012 12:59:02.772         document ingest (null)
        http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
/gundem/gundemdetay/18.11.2012/1628740/default.htm
        OK      10632   1       
11-18-2012 12:59:02.153         fetch
        http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
gundemdetay/17.11.2012/1628402/default.htm
        200     11183   4720    
11-18-2012 12:58:57.173         fetch
        http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
/gundem/gundemdetay/18.11.2012/1628740/default.htm
        200     10632   5570    
11-18-2012 12:58:52.533         robots parse    www.hurriyet.com.tr
        SUCCESS         0       78      
11-18-2012 12:58:52.511         robots parse    gundem.milliyet.com.tr
        SUCCESS         0       70      
11-18-2012 12:58:52.136         fetch   http://www.hurriyet.com.tr/robots.txt
        200     928     476     
11-18-2012 12:58:52.129         fetch   http://gundem.milliyet.com.tr/robots.txt
        200     797     453     
11-18-2012 12:58:49.013         fetch   
http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
        200     34467   1080    
11-18-2012 12:58:48.993         fetch   
http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
        200     72439   1510    
11-18-2012 12:58:44.513         robots parse    www.milliyet.com.tr
        SUCCESS         0       340     
11-18-2012 12:58:44.013         fetch   http://rss.hurriyet.com.tr/robots.txt
        404     4096    770     
11-18-2012 12:58:44.013         fetch   http://www.milliyet.com.tr/robots.txt
        200     17484   840     
11-18-2012 12:58:41.502         job start       1353261469661(rss)
                0       1       
<<<<<<

So it looks like there is a http://www.milliyet.com.tr/robots.txt, which
it fetched fine (200), and there is no
http://rss.hurriyet.com.tr/robots.txt (404).  Does this seem correct to you?
Furthermore, the feeds point at content that requires access to (and
robots.txt fetches for) two other servers, gundem.milliyet.com.tr and
www.hurriyet.com.tr.
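
For reference on the "port of -1" in the quoted message below: in Java, java.net.URI.getPort() (like java.net.URL.getPort()) returns -1 when the URL names no explicit port, so a ":-1" shows up whenever that value is appended to the host without a check. The following is only a minimal sketch of that behavior, not the connector's actual code; the class PortDemo and the helper robotsBase are hypothetical names made up for illustration.

```java
import java.net.URI;

public class PortDemo {
    // Hypothetical helper: builds "scheme://host[:port]" and leaves the
    // port out when the URL did not specify one.
    static String robotsBase(URI uri) {
        int port = uri.getPort(); // -1 means "no explicit port in the URL"
        String base = uri.getScheme() + "://" + uri.getHost();
        return port == -1 ? base : base + ":" + port;
    }

    public static void main(String[] args) {
        URI feed = URI.create("http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml");

        // Naive concatenation reproduces the odd-looking host:port pair.
        System.out.println(feed.getHost() + ":" + feed.getPort());

        // Checking for -1 first gives a usable robots.txt location.
        System.out.println(robotsBase(feed) + "/robots.txt");
    }
}
```

Whether the message on the CONNECTORS-120 branch is produced this way is an assumption; the sketch only shows where a -1 port can come from.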

Karl

On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <daddy...@gmail.com> wrote:
> Odd. The problem is obviously the port of -1. But the code does not
> attach a specific port to the URL in that case.
>
> I will try your example exactly when I have access to internet again.
>
> Karl
>
> Sent from my Windows Phone
> From: Ahmet Arslan
> Sent: 11/17/2012 4:47 PM
> To: dev@manifoldcf.apache.org
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> Hi,
>
> Regarding  "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> Pre-ingest service interruption reported for job 1353185325276
> connection 'rss': Couldn't fetch robots.txt from
> http://www.milliyet.com.tr:-1";
>
> I see that http://www.milliyet.com.tr/robots.txt exists.
>
> Ahmet
>
> --- On Sat, 11/17/12, Ahmet Arslan <iori...@yahoo.com> wrote:
>
>> From: Ahmet Arslan <iori...@yahoo.com>
>> Subject: Re: Anyone out there using RSS connector, who wants to help?
>> To: dev@manifoldcf.apache.org
>> Date: Saturday, November 17, 2012, 11:11 PM
>> Hi Karl,
>>
>> I have never used the RSS connector, but here is what I did.
>>
>> I defined a job to crawl using mcf-trunk. mcf-trunk crawled
>> the following two URLs:
>>
>> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>
>> With CONNECTORS-120 branch I can crawl
>>
>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>
>> but  http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
>> status of "Error: Repeated service interruptions - failure
>> getting document version"
>>
>> I see these in the log file :
>>
>>  WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
>> Pre-ingest service interruption reported for job
>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>> from http://www.milliyet.com.tr:-1
>> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') -
>> Exception tossed: Repeated service interruptions - failure
>> getting document version
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> Repeated service interruptions - failure getting document
>> version
>>     at
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>  WARN 2012-11-17 23:02:27,307 (Worker thread '30') -
>> Pre-ingest service interruption reported for job
>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>> from http://www.milliyet.com.tr:-1
>> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') -
>> Exception tossed: Repeated service interruptions - failure
>> getting document version
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> Repeated service interruptions - failure getting document
>> version
>>     at
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>
>>
>> By the way in "Dechromed Content" tab (Job Setting UI) I see
>> four "&nbsp;"
>>
>> Thanks,
>> Ahmet
>> --- On Fri, 11/16/12, Karl Wright <daddy...@gmail.com>
>> wrote:
>>
>> > From: Karl Wright <daddy...@gmail.com>
>> > Subject: Anyone out there using RSS connector, who
>> wants to help?
>> > To: "dev" <dev@manifoldcf.apache.org>
>> > Date: Friday, November 16, 2012, 3:54 PM
>> > Hi all,
>> >
>> > The branch 
>> > https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
>> > contains an RSS connector that has been updated to use
>> > httpcomponents
>> > 4.2.2.  I'd love for people who are in a position to
>> do
>> > significant
>> > RSS crawling to try it out before I pull it into
>> > trunk.  Any takers?
>> >
>> > Karl
>> >
>>
