RE: Nutch identifier while indexing.
You can use the subcollection indexing filter to set a value for URLs that match a string. With it you can distinguish the sites even if they are on the same host and domain.

-Original message-
> From: mbehlok
> Sent: Wed 13-Feb-2013 21:20
> To: user@nutch.apache.org
> Subject: Re: Nutch identifier while indexing.
>
> wish it was that simple:
>
> SiteA = www.myDomain.com/index.aspx?site=1
> SiteB = www.myDomain.com/index.aspx?site=2
> SiteC = www.myDomain.com/index.aspx?site=3
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285p4040323.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
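To illustrate the idea of tagging a document by a string found in its URL, here is a minimal Java sketch. It is not the subcollection plugin's code or API (the real plugin is driven by configuration, not by a class like this); the class name `SubcollectionSketch`, its methods, and the labels are hypothetical and exist only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: maps URL substrings to a label,
// roughly the kind of matching a whitelist-based filter performs.
final class SubcollectionSketch {
    // Insertion order is preserved so the first registered match wins.
    private final Map<String, String> whitelist = new LinkedHashMap<>();

    SubcollectionSketch add(String urlSubstring, String label) {
        whitelist.put(urlSubstring, label);
        return this;
    }

    // Returns the label of the first substring contained in the URL,
    // or an empty string when nothing matches.
    String labelFor(String url) {
        for (Map.Entry<String, String> entry : whitelist.entrySet()) {
            if (url.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        SubcollectionSketch sketch = new SubcollectionSketch()
                .add("index.aspx?site=1", "SiteA")
                .add("index.aspx?site=2", "SiteB")
                .add("index.aspx?site=3", "SiteC");
        System.out.println(sketch.labelFor("http://www.myDomain.com/index.aspx?site=2"));
    }
}
```

Note that plain substring matching would also tag a URL ending in `site=10` as SiteA, so the match strings need to be chosen with care if such URLs can occur.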
Re: Nutch identifier while indexing.
The only suggestion I know of is to index the site parameter at the end of the URLs as a separate field and use faceted search in Solr on that parameter's values.

Alex.

-Original Message-
From: mbehlok
To: user
Sent: Wed, Feb 13, 2013 12:20 pm
Subject: Re: Nutch identifier while indexing.

wish it was that simple:

SiteA = www.myDomain.com/index.aspx?site=1
SiteB = www.myDomain.com/index.aspx?site=2
SiteC = www.myDomain.com/index.aspx?site=3
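As a rough sketch of the "index the site param as a separate field" idea, the value could be pulled out of each URL along these lines. This is plain Java using only `java.net.URI`; the class name `SiteParamExtractor` and the method `siteOf` are hypothetical, and wiring the result into a Nutch indexing filter and a Solr field is left out. The sketch assumes the parameter is literally named `site` and does not URL-decode the value.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper: extracts the value of the "site" query parameter.
final class SiteParamExtractor {
    static Optional<String> siteOf(String url) {
        // The URLs in the thread have no scheme; add one so URI parses the host and query.
        String withScheme = url.contains("://") ? url : "http://" + url;
        String query = URI.create(withScheme).getRawQuery();
        if (query == null) {
            return Optional.empty();
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("site")) {
                return Optional.of(pair.substring(eq + 1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Prints the extracted value, or "none" if the parameter is absent.
        System.out.println(siteOf("www.myDomain.com/index.aspx?site=2").orElse("none"));
    }
}
```

With the value stored in its own field, each site can then be selected or faceted on that field in Solr instead of on the host.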
Re: Nutch identifier while indexing.
wish it was that simple:

SiteA = www.myDomain.com/index.aspx?site=1
SiteB = www.myDomain.com/index.aspx?site=2
SiteC = www.myDomain.com/index.aspx?site=3
Re: Nutch identifier while indexing.
Are you saying that your sites have the form siteA.mydomain.com, siteB.mydomain.com, siteC.mydomain.com?

Alex.

-Original Message-
From: mbehlok
To: user
Sent: Wed, Feb 13, 2013 11:05 am
Subject: Nutch identifier while indexing.

Hello, I am indexing 3 sites:

SiteA
SiteB
SiteC

I want to index these sites in a way that, when searching them in Solr, I can run a query against each of these sites separately. So one could say... that's easy, just filter them by host... WRONG... The sites are hosted on the same host but have different starting points. That is, starting the crawl from different root URLs (SiteA, SiteB, SiteC) produces different results. My imagination tells me to somehow specify an identifier in schema.xml that tells Solr which root URL produced that crawl. Any ideas on how to implement this? Any variations?

Mitch

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285.html
Sent from the Nutch - User mailing list archive at Nabble.com.