Hi Vishal,

If you find Nutch heavy-weight, consider using http://manifoldcf.apache.org
Ahmet

On Wednesday, October 8, 2014 1:54 AM, Vishal Sharma <vish...@grazitti.com> wrote:

Hey Jorge,

I guess Nutch can help me. Thanks for this. I am sure I should be able to
configure it to crawl only specific portions of the site.

Vishal Sharma
TL, Grazitti Interactive
T: +1 650 641 1754
E: vish...@grazitti.com
www.grazitti.com

On Tue, Oct 7, 2014 at 2:45 PM, Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu> wrote:

> If you’re talking about a generic web crawl, you could use something like
> Nutch [1]. Keep in mind that it is a full web crawler, and it does a pretty
> good job. I’ve been using it for more than 2 years now and I’m very
> happy with it, although I don’t crawl just a couple of sites but a much
> wider spectrum (think the web of an entire country). With Nutch you just
> have to configure a couple of options in an XML file, and it will crawl the
> web and index the content into Solr.
>
> Regards,
>
> [1] http://nutch.apache.org
>
> On Oct 7, 2014, at 4:53 PM, Vishal Sharma <vish...@grazitti.com> wrote:
>
> > Makes sense.
> >
> > I'll just dive in now. Thanks so much.
> >
> > On Tue, Oct 7, 2014 at 1:44 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> >
> >> I am pretty sure Swift is not Solr. That's why I was asking whether
> >> you were starting from scratch.
> >>
> >> As to the other items, please re-read my original response. Solr has
> >> an example reading in RSS feeds; you could probably use that. Or a
> >> generic XML using DataImportHandler's mapping. Or directly from the
> >> database, again with DIH.
> >>
> >> Basically, it sounds totally doable. So, it's hard to advise anything
> >> specific beyond "go, do it" and wait for you to come back with a much
> >> more specific issue once you get going. Most of the issues will be
> >> related to your schema and your WordPress configuration, so no
> >> abstract advice is available.
> >>
> >> Regards,
> >> Alex.
> >>
> >> On 7 October 2014 16:36, Vishal Sharma <vish...@grazitti.com> wrote:
> >>> Hey Alex,
> >>>
> >>> Thanks for the prompt response.
> >>>
> >>> Here is what I am trying to solve: I am showing search results from
> >>> content coming from 3 different places on a single site.
> >>> And I have done that by pumping all this content to a Solr server
> >>> running on a single flat schema, using the different APIs of these
> >>> platforms. Now I also need to index blog posts written in WordPress.
> >>> I was wondering if there is any solution already available which can
> >>> help me crawl and pump these posts to my running Solr instance.
> >>> Otherwise I might have to write a few more scripts to do that.
> >>>
> >>> BTW, is Swift using Solr on the backend? Because I thought it's a
> >>> paid enterprise solution.
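
For the "write a few more scripts" route mentioned in the original question, here is a minimal sketch of what such a script might look like. It is not something anyone in this thread posted. It assumes the blog exposes a standard RSS 2.0 feed and that the Solr core accepts a JSON array of documents on its update handler. The URLs, the core name `collection1`, and the field names `id`, `title`, `content`, and `source` are all hypothetical and would have to match your own schema.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical values: replace with your own Solr core's update URL.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def feed_to_docs(rss_xml):
    """Turn an RSS 2.0 document into a list of flat dicts for Solr."""
    root = ET.fromstring(rss_xml)
    docs = []
    for item in root.iter("item"):
        docs.append({
            # Field names are assumptions; map them to your flat schema.
            "id": item.findtext("link", default=""),
            "title": item.findtext("title", default=""),
            "content": item.findtext("description", default=""),
            "source": "wordpress",
        })
    return docs


def post_to_solr(docs, update_url=SOLR_UPDATE_URL):
    """POST the documents as a JSON array to the Solr update handler."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

To use it, fetch the feed body (for example with `urllib.request.urlopen` on the blog's feed URL), pass it to `feed_to_docs`, and hand the result to `post_to_solr`. The post permalink is used as `id` so that re-running the script overwrites existing documents rather than duplicating them, assuming `id` is the schema's unique key.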