Hi Vishal,

If you find Nutch heavy-weight, consider using http://manifoldcf.apache.org

Ahmet
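
For the "write a few more scripts" route mentioned further down the thread, a minimal sketch of pulling posts from a WordPress RSS feed and pushing them to Solr might look like the following. Everything specific here is an assumption for illustration: the field names (`id`, `title`, `content`, `source`), the feed URL, and the Solr core URL are hypothetical and would need to match your own schema and setup. It also assumes your Solr version accepts a JSON array of documents on its update handler.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET


def rss_to_solr_docs(rss_xml, source="wordpress"):
    """Convert an RSS 2.0 feed (as a string) into flat dicts for Solr.

    The field names are hypothetical; map them to your own schema.
    """
    root = ET.fromstring(rss_xml)
    docs = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            # Skip items without a link, since the link is used as the id.
            continue
        docs.append({
            "id": link,
            "title": (item.findtext("title") or "").strip(),
            "content": (item.findtext("description") or "").strip(),
            "source": source,
        })
    return docs


def post_to_solr(docs, update_url):
    """POST the documents as a JSON array to Solr and commit."""
    request = urllib.request.Request(
        update_url + "?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical URLs: replace with your blog feed and Solr core.
    feed_url = "http://blog.example.com/feed/"
    solr_update_url = "http://localhost:8983/solr/collection1/update"
    with urllib.request.urlopen(feed_url) as feed:
        posts = rss_to_solr_docs(feed.read().decode("utf-8"))
    print(post_to_solr(posts, solr_update_url))
```

A feed usually carries only the most recent posts, so this suits incremental updates better than a full initial import.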


On Wednesday, October 8, 2014 1:54 AM, Vishal Sharma <vish...@grazitti.com> 
wrote:
Hey Jorge,

I guess Nutch can help me. Thanks for this. I am sure I should be able to
configure it to crawl only specific portions of the site.

Vishal Sharma
TL, Grazitti Interactive
T: +1 650 641 1754
E: vish...@grazitti.com
www.grazitti.com

On Tue, Oct 7, 2014 at 2:45 PM, Jorge Luis Betancourt Gonzalez <
jlbetanco...@uci.cu> wrote:

> If you’re talking about a generic web crawl, you could use something like
> Nutch [1]. Keep in mind that it is a full web crawler, and it does a pretty
> good job. I’ve been using it for more than 2 years now and I’m very
> happy, although I don’t crawl just a couple of sites but a much wider
> spectrum (think the web of a whole country). With Nutch you just have to
> configure a couple of options in an XML file and it will crawl the web and
> index the content into Solr.
>
> Regards,
>
> [1] http://nutch.apache.org
>
> On Oct 7, 2014, at 4:53 PM, Vishal Sharma <vish...@grazitti.com> wrote:
>
> > Makes sense.
> >
> > I'll just dive in now. Thanks so much.
> >
> > On Tue, Oct 7, 2014 at 1:44 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> >
> >> I am pretty sure Swift is not Solr. That's why I was asking whether
> >> you were starting from scratch.
> >>
> >> As to the other items, please re-read my original response. Solr has
> >> an example reading in RSS feeds, you could probably use that. Or a
> >> generic XML using DataImportHandler's mapping. Or directly from
> >> database, again with DIH.
> >>
> >> Basically, it sounds totally doable. So it's hard to advise anything
> >> specific beyond "go, do it" and wait for you to come back with a much
> >> more specific issue once you get going. Most of the issues will be
> >> related to your schema and your WordPress configuration, so no
> >> abstract advice is available.
> >>
> >> Regards,
> >>    Alex.
> >>
> >> On 7 October 2014 16:36, Vishal Sharma <vish...@grazitti.com> wrote:
> >>> Hey Alex,
> >>>
> >>> Thanks for the prompt response.
> >>>
> >>> Here is what I am trying to solve: I am showing search results from
> >>> content coming from 3 different places on a single site. I have done
> >>> that by pumping all this content into a Solr server running on a single
> >>> flat schema, using the different APIs of these platforms. Now I need to
> >>> index blog posts written in WordPress as well. I was wondering if there
> >>> is any solution already available which can help me crawl and pump
> >>> these posts into my running Solr instance. Otherwise I might have to
> >>> write a few more scripts to do that.
> >>>
> >>> BTW, is Swift using Solr on the backend? Because I thought it's a paid
> >>> enterprise solution.
> >>>
> >>
>
