Re: Italian web sites
The first one.

Bye,
Laura

> What does it mean? An "Italian website" can be:
> - a site that uses the Italian language
> - a site owned by an Italian organization
> - a site hosted at a geographical location in Italy
> Every definition has a different solution.
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:lucene-user-[EMAIL PROTECTED]>
Re: Italian web sites
What does it mean? An "Italian website" can be:
- a site that uses the Italian language
- a site owned by an Italian organization
- a site hosted at a geographical location in Italy
Every definition has a different solution.

Date sent: Wed, 24 Apr 2002 11:02:32 +0200
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Subject: Italian web sites
To: [EMAIL PROTECTED]
Send reply to: Lucene Users List <[EMAIL PROTECTED]>

> I'm using Jobo for spidering web sites and Lucene for indexing. The
> problem is that I'd like to spider only Italian web sites.
> How can I discover the country of a web site?

--
Marco Ferrante ([EMAIL PROTECTED])
CSITA (Centro Servizi Informatici e Telematici d'Ateneo)
Università degli Studi di Genova - Italy
Via Brigata Salerno, ponte - 16147 Genova
tel (+39) 0103532621 (interno tel. 2621)
--
Re: Italian web sites
Laura,

>I'm using Jobo for spidering web sites and Lucene for indexing. The
>problem is that I'd like to spider only Italian web sites.
>How can I discover the country of a web site?
>
>Do you know of a method that you can suggest?

The best method I know of is to split the text into character n-grams and compare the frequencies of the n-grams that occur most often:
http://citeseer.nj.nec.com/context/698873/68861

Regards,
Ype
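A minimal sketch of the character n-gram idea Ype describes. One common way to compare n-gram frequencies is to rank the most frequent n-grams of a text and measure how far each one is "out of place" relative to a per-language profile (the Cavnar and Trenkle approach). This is not necessarily the exact method in the linked paper. The function names, the profile size of 300, and the `samples` dictionary are illustrative assumptions.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the most frequent character n-grams (length 1..n_max), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, samples):
    """Pick the language whose sample text has the closest n-gram profile.

    `samples` maps a language code to training text (assumed input; real use
    needs far more text per language than a single sentence).
    """
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in samples.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In a spider, the page text would be passed to `guess_language` and the page kept only when the result is the Italian code. The language profiles would normally be built once from large sample texts rather than recomputed on every call.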
Re: Italian web sites
Hm... this looks very interesting! If it is a Perl executable, you can just copy the text into a temp file, run the Perl executable on that file, and redirect the output to another temp file. Then read that file and use the result in a Lucene keyword.

Regards,
karl øie

On Wednesday 24 April 2002 13:46, [EMAIL PROTECTED] wrote:
> Hi all,
>
> I have found a very interesting library which is written in Perl.
> The problem now is how I can use this library.
>
> Anyway, the library is TextCat and you can find it at:
>
> http://odur.let.rug.nl/~vannoord/TextCat/
>
> Bye
>
> Laura
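The temp-file approach Karl describes can be sketched roughly as follows. This is an illustration under assumptions, not TextCat's documented interface: `classify_with_external_tool` is a hypothetical helper, and the command used to start the Perl script (its path and any options) must be taken from the TextCat documentation. The sketch captures the tool's standard output directly instead of redirecting it to a second temp file.

```python
import os
import subprocess
import tempfile


def classify_with_external_tool(text, command):
    """Write `text` to a temp file, run `command <file>`, return the tool's output.

    `command` is a list such as ["perl", "/path/to/script"]; the exact script
    name and options for TextCat are an assumption to verify locally.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", encoding="utf-8", delete=False
    ) as tmp:
        tmp.write(text)
        path = tmp.name
    try:
        # check=True raises CalledProcessError if the tool exits with an error.
        result = subprocess.run(
            command + [path], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    finally:
        os.remove(path)  # always clean up the temp file
```

The returned string is what would then be stored with the page as a Lucene keyword, so that a search can be restricted to one language. Starting a new process for every page is slow, so for a large crawl a native implementation of the classifier may be preferable.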
Re: Italian web sites
Hi all,

I have found a very interesting library which is written in Perl. The problem now is how I can use this library.

Anyway, the library is TextCat and you can find it at:

http://odur.let.rug.nl/~vannoord/TextCat/

Bye,
Laura

> combined with that you could use an Italian stop-word list to run statistics
> on a page :-) ?!?
Re: Italian web sites
Combined with that, you could use an Italian stop-word list to run statistics on a page :-) ?!?

On Wednesday 24 April 2002 11:02, [EMAIL PROTECTED] wrote:
> I'm using Jobo for spidering web sites and Lucene for indexing. The
> problem is that I'd like to spider only Italian web sites.
> How can I discover the country of a web site?
>
> Do you know of a method that you can suggest?
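One way to read the stop-word suggestion is to measure what fraction of a page's words are common Italian function words. The sketch below assumes a tiny hand-picked word list and an arbitrary threshold of 0.2. Both are placeholders for illustration, and the names `stop_word_ratio` and `looks_italian` are hypothetical.

```python
import re

# A deliberately small, illustrative list; a real stop-word list is much longer.
ITALIAN_STOP_WORDS = {
    "il", "lo", "la", "gli", "le", "un", "una", "di", "da", "con", "su",
    "per", "tra", "fra", "e", "che", "non", "è", "del", "della", "dei",
    "al", "alla", "si", "ma", "sono",
}


def stop_word_ratio(text, stop_words=ITALIAN_STOP_WORDS):
    """Return the fraction of word tokens in `text` that are stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return hits / len(tokens)


def looks_italian(text, threshold=0.2):
    """Heuristic check; the threshold is an assumption to tune on real pages."""
    return stop_word_ratio(text) >= threshold
```

Short words overlap between languages, so this check is weak on its own. It is more useful as a cheap second signal alongside an n-gram classifier.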
RE: Italian web sites
Sniff the IP and then, using the database at the internet topology website http://netgeo.caida.org/perl/netgeo.cgi, you can find the country of origin. (Use that to populate your own DB, so that remote lookups decrease as you accumulate IPs.) But that will give you websites hosted in Italy, not Italian-language websites. Unfortunately, unless Italian pages use a different encoding, picking the language up from the page itself (JavaScript) won't help much.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 24, 2002 1:03 PM
To: [EMAIL PROTECTED]
Subject: Italian web sites

Hi all,

I'm using Jobo for spidering web sites and Lucene for indexing. The problem is that I'd like to spider only Italian web sites. How can I discover the country of a web site?

Do you know of a method that you can suggest?

Thanks,
Laura
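The "populate your own DB" idea above amounts to caching country lookups by IP address. A minimal sketch, assuming the actual geolocation query is supplied by the caller: `ip_to_country` is a hypothetical callable wrapping whatever database or service is used, and no particular NetGeo query format is assumed here.

```python
import socket


def make_country_lookup(ip_to_country, resolve=socket.gethostbyname):
    """Return a function mapping a hostname to a country code, with caching.

    `ip_to_country` is a caller-supplied function (an assumption of this
    sketch) that asks the remote geolocation source about one IP address.
    """
    cache = {}  # local store: IP address -> country code

    def country_of(hostname):
        ip = resolve(hostname)
        if ip not in cache:
            # Only unseen IPs trigger a remote lookup.
            cache[ip] = ip_to_country(ip)
        return cache[ip]

    return country_of
```

The in-memory dictionary stands in for the local database. A crawler that runs more than once would persist it so that lookups accumulated in earlier runs are reused.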
Italian web sites
Hi all,

I'm using Jobo for spidering web sites and Lucene for indexing. The problem is that I'd like to spider only Italian web sites. How can I discover the country of a web site?

Do you know of a method that you can suggest?

Thanks,
Laura