I have just reworked the SharePoint connector in branches/CONNECTORS-120 somewhat, to stream rather than copy through an intermediate file. While I don't expect any change in behavior, it would be good to confirm I didn't do anything stupid, so another sample crawl would be very welcome.
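To make the change concrete, here is a minimal sketch of the streaming idea in Java. It is not the connector's actual code: the class name `StreamThrough` and the method `pipe` are made up for illustration, and the in-memory streams merely stand in for the repository response and the ingestion side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not code from the SharePoint connector.
final class StreamThrough {
    // Copies bytes from source to sink through a fixed-size buffer, so the
    // document is never staged in a temporary file. Returns the byte count.
    static long pipe(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for the fetched document and the ingestion target.
        byte[] doc = "sample document body".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = pipe(new ByteArrayInputStream(doc), sink);
        System.out.println(copied + " bytes streamed");
    }
}
```

With this shape, memory use is bounded by the buffer size rather than the document size, and no temporary file has to be cleaned up afterwards. That is also why a sample crawl is a useful check: the bytes delivered should be identical either way.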
Thanks!
Karl

On Tue, Nov 20, 2012 at 7:11 AM, Karl Wright <daddy...@gmail.com> wrote:
> Thanks for the update!
>
> I'm working on the web connector now. That's going to require a bit more
> work.
>
> Karl
>
> On Tue, Nov 20, 2012 at 7:09 AM, Maciej Liżewski
> <maciej.lizew...@gmail.com> wrote:
>> CONNECTORS-120 is already merged to trunk as I see. Tested wiki connector
>> in my environment and works correctly.
>>
>>
>> 2012/11/19 Ahmet Arslan <iori...@yahoo.com>
>>
>>> Hi Karl,
>>>
>>> I re-ran experiments with r1411016 and both RSS URLs are working now with
>>> CONNECTORS-120.
>>>
>>> Regarding robots.txt, http://rss.hurriyet.com.tr/robots.txt does not
>>> exist but http://www.hurriyet.com.tr/robots.txt exists.
>>>
>>> Ahmet
>>>
>>> --- On Sun, 11/18/12, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>> > From: Karl Wright <daddy...@gmail.com>
>>> > Subject: Re: Anyone out there using RSS connector, who wants to help?
>>> > To: "Ahmet Arslan" <iori...@yahoo.com>, "dev@manifoldcf.apache.org" <dev@manifoldcf.apache.org>
>>> > Date: Sunday, November 18, 2012, 8:04 PM
>>> > Hi Ahmet,
>>> >
>>> > I tried your example, but it looked like it worked fine here. Here's
>>> > part of the simple history:
>>> >
>>> > >>>>>>
>>> > 11-18-2012 12:59:52.182 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
>>> > ndem/gundemdetay/18.11.2012/1628733/default.htm
>>> > OK 16307
>>> > 1
>>> > 11-18-2012 12:59:47.482 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
>>> > gundemdetay/18.11.2012/1628657/default.htm
>>> > OK 10573
>>> > 1
>>> > 11-18-2012 12:59:47.133 fetch
>>> > http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
>>> > ndem/gundemdetay/18.11.2012/1628733/default.htm
>>> > 200 16307
>>> > 5050
>>> > 11-18-2012 12:59:42.133 fetch
>>> > http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
>>> > gundemdetay/18.11.2012/1628657/default.htm
>>> > 200 10573
>>> > 5340
>>> > 11-18-2012 12:59:42.092 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
>>> > sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
>>> > OK 10212
>>> > 1
>>> > 11-18-2012 12:59:37.252 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
>>> > nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
>>> > OK 16105
>>> > 1
>>> > 11-18-2012 12:59:37.133 fetch
>>> > http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
>>> > sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
>>> > 200 10212
>>> > 4950
>>> > 11-18-2012 12:59:32.332 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
>>> > m/gundemdetay/18.11.2012/1628801/default.htm
>>> > OK 10170
>>> > 1
>>> > 11-18-2012 12:59:32.133 fetch
>>> > http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
>>> > nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
>>> > 200 16105
>>> > 5110
>>> > 11-18-2012 12:59:27.142 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
>>> > ndemdetay/18.11.2012/1628661/default.htm
>>> > OK 10102
>>> > 1
>>> > 11-18-2012 12:59:27.133 fetch
>>> > http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
>>> > m/gundemdetay/18.11.2012/1628801/default.htm
>>> > 200 10170
>>> > 5200
>>> > 11-18-2012 12:59:22.182 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
>>> > gundemdetay/18.11.2012/1628824/default.htm
>>> > OK 10217
>>> > 1
>>> > 11-18-2012 12:59:22.133 fetch
>>> > http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
>>> > ndemdetay/18.11.2012/1628661/default.htm
>>> > 200 10102
>>> > 4990
>>> > 11-18-2012 12:59:18.062 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
>>> > /gundemdetay/18.11.2012/1628856/default.htm
>>> > OK 9721
>>> > 1
>>> > 11-18-2012 12:59:17.133 fetch
>>> > http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
>>> > gundemdetay/18.11.2012/1628824/default.htm
>>> > 200 10217
>>> > 5050
>>> > 11-18-2012 12:59:12.452 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
>>> > etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
>>> > OK 11412
>>> > 1
>>> > 11-18-2012 12:59:12.133 fetch
>>> > http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
>>> > /gundemdetay/18.11.2012/1628856/default.htm
>>> > 200 9721
>>> > 5930
>>> > 11-18-2012 12:59:07.133 fetch
>>> > http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
>>> > etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
>>> > 200 11412
>>> > 5300
>>> > 11-18-2012 12:59:06.892 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
>>> > gundemdetay/17.11.2012/1628402/default.htm
>>> > OK 11183
>>> > 1
>>> > 11-18-2012 12:59:02.772 document ingest
>>> > (null)
>>> > http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
>>> > /gundem/gundemdetay/18.11.2012/1628740/default.htm
>>> > OK 10632
>>> > 1
>>> > 11-18-2012 12:59:02.153 fetch
>>> > http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
>>> > gundemdetay/17.11.2012/1628402/default.htm
>>> > 200 11183
>>> > 4720
>>> > 11-18-2012 12:58:57.173 fetch
>>> > http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
>>> > /gundem/gundemdetay/18.11.2012/1628740/default.htm
>>> > 200 10632
>>> > 5570
>>> > 11-18-2012 12:58:52.533 robots parse
>>> > www.hurriyet.com.tr
>>> > SUCCESS 0
>>> > 78
>>> > 11-18-2012 12:58:52.511 robots parse
>>> > gundem.milliyet.com.tr
>>> > SUCCESS 0
>>> > 70
>>> > 11-18-2012 12:58:52.136 fetch
>>> > http://www.hurriyet.com.tr/robots.txt
>>> > 200 928
>>> > 476
>>> > 11-18-2012 12:58:52.129 fetch
>>> > http://gundem.milliyet.com.tr/robots.txt
>>> > 200 797
>>> > 453
>>> > 11-18-2012 12:58:49.013 fetch
>>> > http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>> > 200 34467
>>> > 1080
>>> > 11-18-2012 12:58:48.993 fetch
>>> > http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>>> > 200 72439
>>> > 1510
>>> > 11-18-2012 12:58:44.513 robots parse
>>> > www.milliyet.com.tr
>>> > SUCCESS 0
>>> > 340
>>> > 11-18-2012 12:58:44.013 fetch
>>> > http://rss.hurriyet.com.tr/robots.txt
>>> > 404 4096
>>> > 770
>>> > 11-18-2012 12:58:44.013 fetch
>>> > http://www.milliyet.com.tr/robots.txt
>>> > 200 17484
>>> > 840
>>> > 11-18-2012 12:58:41.502 job start
>>> > 1353261469661(rss)
>>> > 0 1
>>> >
>>> > <<<<<<
>>> >
>>> > So it looks like there's a http://www.milliyet.com.tr/robots.txt that
>>> > it fetched fine, and there is no
>>> > http://rss.hurriyet.com.tr/robots.txt. Does this seem correct to you?
>>> > Furthermore, there is content that the feed points at that requires
>>> > access to (and robots fetches for) two other servers...
>>> >
>>> > Karl
>>> >
>>> > On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <daddy...@gmail.com> wrote:
>>> > > Odd. The problem is obviously the port of -1. But the code does not
>>> > > attach a specific port to the URL in that case.
>>> > >
>>> > > I will try your example exactly when I have access to internet again.
>>> > >
>>> > > Karl
>>> > >
>>> > > Sent from my Windows Phone
>>> > > From: Ahmet Arslan
>>> > > Sent: 11/17/2012 4:47 PM
>>> > > To: dev@manifoldcf.apache.org
>>> > > Subject: Re: Anyone out there using RSS connector, who wants to help?
>>> > > Hi,
>>> > >
>>> > > Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
>>> > > Pre-ingest service interruption reported for job 1353185325276
>>> > > connection 'rss': Couldn't fetch robots.txt from
>>> > > http://www.milliyet.com.tr:-1"
>>> > >
>>> > > I see that http://www.milliyet.com.tr/robots.txt exists.
>>> > >
>>> > > Ahmet
>>> > >
>>> > > --- On Sat, 11/17/12, Ahmet Arslan <iori...@yahoo.com> wrote:
>>> > >
>>> > >> From: Ahmet Arslan <iori...@yahoo.com>
>>> > >> Subject: Re: Anyone out there using RSS connector, who wants to help?
>>> > >> To: dev@manifoldcf.apache.org
>>> > >> Date: Saturday, November 17, 2012, 11:11 PM
>>> > >> Hi Karl,
>>> > >>
>>> > >> Never used rss connector. But here is what I have done.
>>> > >>
>>> > >> I defined a job to crawl using mcf-trunk.
>>> > >> mcf-trunk crawled the following two URLs:
>>> > >>
>>> > >> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>>> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>> > >>
>>> > >> With CONNECTORS-120 branch I can crawl
>>> > >>
>>> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>> > >>
>>> > >> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
>>> > >> status of "Error: Repeated service interruptions - failure
>>> > >> getting document version"
>>> > >>
>>> > >> I see these in the log file:
>>> > >>
>>> > >> WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
>>> > >> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
>>> > >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
>>> > >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>> > >> WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
>>> > >> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
>>> > >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
>>> > >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>> > >>
>>> > >>
>>> > >> By the way in "Dechromed Content" tab (Job Setting
>>> > >> UI) I see four " "
>>> > >>
>>> > >> Thanks,
>>> > >> Ahmet
>>> > >> --- On Fri, 11/16/12, Karl Wright <daddy...@gmail.com> wrote:
>>> > >>
>>> > >> > From: Karl Wright <daddy...@gmail.com>
>>> > >> > Subject: Anyone out there using RSS connector, who wants to help?
>>> > >> > To: "dev" <dev@manifoldcf.apache.org>
>>> > >> > Date: Friday, November 16, 2012, 3:54 PM
>>> > >> > Hi all,
>>> > >> >
>>> > >> > The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
>>> > >> > contains an RSS connector that has been updated to use httpcomponents
>>> > >> > 4.2.2. I'd love for people who are in a position to do significant
>>> > >> > RSS crawling to try it out before I pull it into trunk. Any takers?
>>> > >> >
>>> > >> > Karl
>>> > >> >
>>> > >>
>>> >
>>>
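
A side note on the `http://www.milliyet.com.tr:-1` in the warnings quoted above. In Java, `java.net.URL.getPort()` returns -1 when the URL names no port, so a string of that shape can appear whenever the raw value is appended without a check. The thread does not establish that this is what happened in the branch; the sketch below only shows the mechanism, and `PortDemo`, `effectivePort`, and `serverKey` are hypothetical names, not ManifoldCF code.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration only; not code from the RSS connector.
final class PortDemo {
    // Returns the explicit port if the URL names one, else the protocol default.
    static int effectivePort(URL url) {
        int port = url.getPort();  // -1 when the URL names no port
        return (port != -1) ? port : url.getDefaultPort();
    }

    // Builds "scheme://host" and appends ":port" only when the port is explicit.
    static String serverKey(URL url) {
        int port = url.getPort();
        String base = url.getProtocol() + "://" + url.getHost();
        return (port == -1) ? base : base + ":" + port;
    }

    public static void main(String[] args) throws MalformedURLException {
        // Constructing a URL parses the string only; no network access happens.
        URL url = new URL("http://www.milliyet.com.tr/robots.txt");
        System.out.println(url.getPort());       // prints -1
        System.out.println(effectivePort(url));  // prints 80
        System.out.println(serverKey(url));      // prints http://www.milliyet.com.tr
    }
}
```

Either helper avoids the `:-1` suffix; which one fits depends on whether the caller needs a number to connect to or a string to display and log.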