Thanks Karl

________________________________________
From: Karl Wright [daddy...@gmail.com]
Sent: Wednesday, February 08, 2012 8:40 AM
To: Silvia, Daniel [USA]
Cc: connectors-user@incubator.apache.org
Subject: Re: Web Crawl using ManifoldCF

On Wed, Feb 8, 2012 at 8:24 AM, Silvia, Daniel [USA] <silvia_dan...@bah.com> wrote:
> Hi Karl
>
> I want to thank you for your help regarding the SharePoint to Solr
> connections, everything seems to be working properly after getting the
> Viewers and Home Owners groups permission set properly by our SharePoint
> Admins.

That's great news! Thanks for sticking with it. ;-)

> However, I have another question regarding pulling site content from
> the SharePoint instance and not the files stored on the SharePoint instance.
>
> When creating a Repository connection, would you use the "Web" connection
> type to pull site content? If that is the case, when creating the job, do
> you indicate just the site url you want to crawl to pull site content in the
> "Seed" tab? Are we using the correct connection repository? Is there a
> repository type we use to just crawl websites for the content and not
> files?

I think that's the right approach, if there's a document you can crawl
somewhere that has a reference to the other documents, or the documents
all refer to each other. You need such a document or documents at the
root of a document web, otherwise a web crawler has no way of locating
the documents in question. That would be how you identify your "seed"
document. For typical (non-SharePoint) sites, that's usually the main
URL of the site. So, for example, if you wanted to crawl cnn.com you'd
probably use a seed of http://www.cnn.com, because that's a good place
to start to get to all of cnn's content.

If no such document(s) exist, then web crawling is not going to do it.
If this "site" is served by SharePoint, then some kind of enhancement to
the SharePoint connector would be a better approach.

Thanks,
Karl

> As you can see, I hope I have explained myself properly, we are trying to
> just crawl site content.
>
> Thanks
>
> Dan
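
To illustrate Karl's point about seed documents, here is a minimal sketch of link-following from a seed. It is not ManifoldCF code and does not touch the network: the LINKS table, the example.com URLs, and the crawl function are all hypothetical, made up only to show why a page that nothing links to is never reached.

```python
from collections import deque

# Hypothetical link graph standing in for a small site:
# page URL -> URLs that page links to. Not real ManifoldCF data.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
    # This page links out, but no other page links to it.
    "http://example.com/orphan": ["http://example.com/a"],
}


def crawl(seeds, links):
    """Return the set of pages reachable from the seeds by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


found = crawl(["http://example.com/"], LINKS)
missed = sorted(set(LINKS) - found)
print(missed)  # ['http://example.com/orphan']
```

Seeding with the site's main URL reaches every page that is linked, directly or indirectly, from it. The orphan page is only picked up if it is added as a seed itself, which is the situation Karl describes when no root document exists.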