Re: Stats for link pages

2011-07-05 Thread Julien Nioche
Alexander, We can already remove boilerplate from HTML pages thanks to Boilerpipe in Tika (there is an open issue on JIRA for this). Markus is looking for a way to classify an entire page as content-rich vs mostly links. Markus: don't know any specific literature on the subject but determining

Re: Stats for link pages

2011-07-05 Thread Markus Jelsma
Thanks, both of you. I'll do some research on the corpus I have. And Sujit's page is always a nice read! Alexander, We can already remove boilerplate from HTML pages thanks to Boilerpipe in Tika (there is an open issue on JIRA for this). Markus is looking for a way to classify an entire

Re: Stats for link pages

2011-07-05 Thread Alexander Aristov
In that article the author uses an approach where he extracts the text (with links), splits the whole text into chunks (by lines in the simplest case, or by paragraph) and then scores chunks by their number of links or amount of garbage text. You can take these figures as input and discard a page if the ratio is not
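A minimal sketch of the link-density heuristic described above, using only Python's standard library. The 0.5 cutoff is illustrative, not something fixed by the thread; a real classifier would tune it against a corpus, as Markus suggests doing.

```python
from html.parser import HTMLParser

class LinkTextCounter(HTMLParser):
    """Count characters of visible text inside <a> tags vs. overall."""
    def __init__(self):
        super().__init__()
        self.in_anchor = 0      # nesting depth of open <a> tags
        self.link_chars = 0     # characters of anchor text
        self.total_chars = 0    # characters of all visible text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            self.in_anchor -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.in_anchor:
            self.link_chars += n

def link_density(html):
    """Fraction of visible text that sits inside anchor tags."""
    parser = LinkTextCounter()
    parser.feed(html)
    return parser.link_chars / parser.total_chars if parser.total_chars else 0.0

def is_link_page(html, threshold=0.5):
    # Classify a page as "mostly links" when more than half of its
    # visible text is anchor text (illustrative cutoff).
    return link_density(html) > threshold
```

The same counting could be done per chunk (per line or per paragraph, as Alexander describes) rather than for the whole page, and the per-chunk ratios aggregated.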

Does Nutch make any use of solr.WhitespaceTokenizerFactory defined in schema.xml?

2011-07-05 Thread Gabriele Kahlout
Hello, I'm trying to understand better Nutch and Solr integration. My understanding is that Documents are added to Solr index from SolrWriter's write(NutchDocument doc) method. But does it make any use of the WhitespaceTokenizerFactory? -- Regards, K. Gabriele --- unchanged since 20/9/10 ---

Re: Does Nutch make any use of solr.WhitespaceTokenizerFactory defined in schema.xml?

2011-07-05 Thread Markus Jelsma
No. SolrJ only builds input docs from NutchDocument objects. Solr will do analysis. The integration is analogous to XML post of Solr documents. On Tuesday 05 July 2011 12:28:21 Gabriele Kahlout wrote: Hello, I'm trying to understand better Nutch and Solr integration. My understanding is
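To make Markus's point concrete: the tokenizer lives in the analyzer chain of a fieldType in Solr's schema.xml and runs inside Solr at index and query time; Nutch's SolrWriter only ships field values over SolrJ and never invokes it. An illustrative fieldType (names are typical, not taken from any specific Nutch schema):

```xml
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Applied by Solr when the document is indexed, not by Nutch -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```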

Re: Stats for link pages

2011-07-05 Thread Markus Jelsma
Thanks again. If there are more pages to look at, please post them if you know any. On Tuesday 05 July 2011 12:05:33 Alexander Aristov wrote: In that article the author uses an approach where he extracts the text (with links), splits the whole text into chunks (by lines in the simplest case or by

Re: Nutch CrawlDbReader -stats gives EOFException error on hadoop

2011-07-05 Thread Markus Jelsma
Hi Viksit, It's a known issue now: https://issues.apache.org/jira/browse/NUTCH-1029 Cheers, On Thursday 12 May 2011 22:10:12 Viksit Gaur wrote: Hi all, When trying to run nutch's crawldb reader to get stats for my crawl database, I get the following error when calling it using hadoop,

Re: Stats for link pages

2011-07-05 Thread Ken Krugler
On Jul 5, 2011, at 3:05am, Alexander Aristov wrote: In that article the author uses an approach where he extracts the text (with links), splits the whole text into chunks (by lines in the simplest case or by paragraph) and then scores chunks by their number of links or amount of garbage text. You can take these

Re: Searching for documents with a certain boost value

2011-07-05 Thread lewis john mcgibbney
Hi, I am sorry that I have not been able to try to replicate the scenario and confirm whether I get zero scores in a similar situation, as I am temporarily unable to do so, but I would like to add this resource [1] in case you have not seen it yet. I am aware that this doesn't address the problem

Crawling relational database

2011-07-05 Thread lewis john mcgibbney
Hi, I'm curious to hear if anyone has information on configuring Nutch to crawl an RDB such as MySQL. In my hypothetical example there are N databases residing in various distributed geographical locations; to make a worst-case scenario, say that they are NOT all the same type, and I

Re: Crawling relational database

2011-07-05 Thread Kirby Bohling
I was tasked with doing something like this. We didn't do it, but my thought process was to use the Java JDBC URL pathed out to the database. The two ideas we had were the following: 1. jdbc://dbserver.domain.com/tableName/schemaName/tableName 2. jdbc://dbserver.domain.com/${SQLQuery} Then
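Kirby's two jdbc:// layouts are a thought experiment, not a real Nutch feature. A sketch of how a crawler plugin might split such seed URLs into parts (segment names like database/schema/table are one interpretation of his layout 1, and everything here is hypothetical):

```python
from urllib.parse import urlparse, unquote

def parse_jdbc_seed(url):
    """Split one of the hypothetical jdbc:// seed-URL layouts from the
    thread into its parts. Purely illustrative routing logic."""
    parts = urlparse(url)
    segments = [unquote(s) for s in parts.path.strip("/").split("/")]
    if len(segments) == 1:
        # Layout 2: jdbc://host/${SQLQuery} -- the path is a raw query.
        return {"host": parts.netloc, "query": segments[0]}
    # Layout 1: jdbc://host/database/schema/table
    db, schema, table = segments[:3]
    return {"host": parts.netloc, "database": db,
            "schema": schema, "table": table}
```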

Re: Crawling relational database

2011-07-05 Thread Markus Jelsma
Hmm. About geographical search: Solr will do this for you. Built-in for 3.x+ and via third-party plugins for 1.4.x. Both provide different features. In Solr you'd not base similarity on geographical data but use spatial data to boost textually similar documents instead, or to filter. This

Re: Crawling relational database

2011-07-05 Thread lewis john mcgibbney
Thanks to you both On Tue, Jul 5, 2011 at 4:35 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hmm. About geographical search: Solr will do this for you. Built-in for 3.x+ and via third-party plugins for 1.4.x. Both provide different features. In Solr you'd not base similarity on

Re: Crawling relational database

2011-07-05 Thread Ken Krugler
Another approach for large DBs is to use Sqoop to pull the DB records into Hadoop, then create Solr indexes from those records. In our situation we're using Cascading, so we just hook up our cascading.solr tap. For small(er) DBs, or if you need incremental updates, then Solr's DIH is more
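For the incremental-update case Ken mentions, Solr's DataImportHandler is configured with a data-config.xml. A hedged fragment sketching a delta import (driver, URL, table, column and field names are all made up for illustration):

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://dbserver.domain.com/mydb"
              user="solr" password="secret"/>
  <document>
    <!-- Full import uses query; delta import finds changed rows via
         deltaQuery, then re-fetches each one via deltaImportQuery. -->
    <entity name="article"
            query="SELECT id, title, body FROM articles"
            deltaQuery="SELECT id FROM articles
                        WHERE updated_at &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title, body FROM articles
                              WHERE id = '${dih.delta.id}'">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="text"/>
    </entity>
  </document>
</dataConfig>
```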

Re: Crawling relational database

2011-07-05 Thread Kirby Bohling
By the by, the best source of open geospatial data I've found is OpenStreetMap [1]. I'm unaware of any Java bindings for it (honestly I haven't looked that hard), but it appears to have pretty excellent data, and the built-in search seems better than any other open source implementation