RE: Nutch identifier while indexing.

2013-02-13 Thread Markus Jelsma
You can use the subcollection indexing filter to set a value for URLs that 
match a string. With it you can distinguish the sites even if they are on the 
same host and domain.
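
A rough Python sketch of the matching idea described above, not the plugin's actual code or configuration format: each label is tied to one or more substrings, and a URL gets every label whose substring it contains. The `SUBCOLLECTIONS` mapping and the `subcollections_for` helper are hypothetical names chosen for illustration; the URLs are the ones from the quoted message.

```python
# Hypothetical mapping of label -> identifying substrings.
# This is an illustration only, not Nutch's configuration format.
SUBCOLLECTIONS = {
    "SiteA": ["www.myDomain.com/index.aspx?site=1"],
    "SiteB": ["www.myDomain.com/index.aspx?site=2"],
    "SiteC": ["www.myDomain.com/index.aspx?site=3"],
}


def subcollections_for(url):
    """Return every label whose substring occurs somewhere in the URL."""
    return [
        name
        for name, needles in SUBCOLLECTIONS.items()
        if any(needle in url for needle in needles)
    ]


if __name__ == "__main__":
    print(subcollections_for("http://www.myDomain.com/index.aspx?site=2&page=5"))
```

Plain substring matching is loose: the `site=1` pattern would also match a URL containing `site=10`, and only pages whose own URL contains the string get a label, so the patterns need to be chosen with that in mind.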
 
-Original message-
> From:mbehlok 
> Sent: Wed 13-Feb-2013 21:20
> To: user@nutch.apache.org
> Subject: Re: Nutch identifier while indexing.
> 
> Wish it were that simple:
> 
> SiteA = www.myDomain.com/index.aspx?site=1
> 
> SiteB = www.myDomain.com/index.aspx?site=2
> 
> SiteC = www.myDomain.com/index.aspx?site=3
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285p4040323.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


Re: Nutch identifier while indexing.

2013-02-13 Thread alxsss
The only suggestion I know of is to index the site parameter at the end of 
the URLs as a separate field and run faceted searches in Solr on that 
parameter's values.

Alex.
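
A minimal sketch of the parameter extraction suggested above, assuming the identifier is always carried in a query parameter named `site`, as in the example URLs in this thread. The `site_param` helper is a hypothetical name, and this is plain Python rather than a Nutch indexing filter, so it only illustrates the value that would go into the separate field.

```python
from urllib.parse import parse_qs, urlsplit


def site_param(url):
    """Return the value of the 'site' query parameter, or None if absent."""
    query = urlsplit(url).query
    values = parse_qs(query).get("site")
    return values[0] if values else None


if __name__ == "__main__":
    for url in (
        "http://www.myDomain.com/index.aspx?site=1",
        "http://www.myDomain.com/index.aspx?site=3&page=2",
        "http://www.myDomain.com/index.aspx",
    ):
        print(url, "->", site_param(url))
```

If that value were indexed in a Solr field (here assumed to be called `site`), a filter query such as `fq=site:2` could restrict a search to one site. Pages whose URLs do not carry the parameter would get no value with this approach.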

-Original Message-
From: mbehlok 
To: user 
Sent: Wed, Feb 13, 2013 12:20 pm
Subject: Re: Nutch identifier while indexing.


Wish it were that simple:

SiteA = www.myDomain.com/index.aspx?site=1

SiteB = www.myDomain.com/index.aspx?site=2

SiteC = www.myDomain.com/index.aspx?site=3




 


Re: Nutch identifier while indexing.

2013-02-13 Thread mbehlok
Wish it were that simple:

SiteA = www.myDomain.com/index.aspx?site=1

SiteB = www.myDomain.com/index.aspx?site=2

SiteC = www.myDomain.com/index.aspx?site=3





Re: Nutch identifier while indexing.

2013-02-13 Thread alxsss
Are you saying that your sites have the form siteA.mydomain.com, 
siteB.mydomain.com, siteC.mydomain.com?

Alex.

-Original Message-
From: mbehlok 
To: user 
Sent: Wed, Feb 13, 2013 11:05 am
Subject: Nutch identifier while indexing.


Hello, I am indexing 3 sites:

SiteA
SiteB
SiteC

I want to index these sites so that, when searching them in Solr, I can
query each site separately. One could say: that's easy, just filter them
by host. WRONG. The sites are hosted on the same host but have different
starting points. That is, starting the crawl from different root URLs
(SiteA, SiteB, SiteC) produces different results. My idea is to somehow
define an identifier field in schema.xml that tells Solr which root URL
produced a given crawl result. Any ideas on how to implement this? Any
variations?

Mitch 
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285.html
Sent from the Nutch - User mailing list archive at Nabble.com.