Alexander,
We can already remove boilerplate from HTML pages thanks to Boilerpipe in
Tika (there is an open issue on JIRA for this). Markus is looking for a way
to classify an entire page as content-rich vs mostly links.
Markus: I don't know of any specific literature on the subject, but determining
Thanks, both of you.
I'll do some research on the corpus I have. And Sujit's page is always a nice
read!
In that article the author's approach is to extract the text (with links),
split the whole text into chunks (by lines in the simplest case, or by
paragraph), and then check each chunk for its amount of link or garbage text.
You can take these figures as input and discard a page if the ratio is not
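The chunk/ratio idea above can be sketched roughly as follows. This is a toy illustration, not Boilerpipe's actual algorithm; chunking by line and the 0.5 threshold are assumptions you would tune against your corpus.

```java
import java.util.List;

// Toy link-density classifier: a page is "mostly links" when the
// share of link text in its visible text exceeds a threshold.
public class LinkDensity {

    // One chunk of extracted text: total characters vs. characters inside links.
    public record Chunk(int totalChars, int linkChars) {}

    // Ratio of link text to all text across the chunks (0.0 .. 1.0).
    public static double linkRatio(List<Chunk> chunks) {
        int total = 0, link = 0;
        for (Chunk c : chunks) {
            total += c.totalChars();
            link += c.linkChars();
        }
        return total == 0 ? 0.0 : (double) link / total;
    }

    // Keep pages whose link ratio stays below the (assumed) threshold.
    public static boolean isContentRich(List<Chunk> chunks, double threshold) {
        return linkRatio(chunks) < threshold;
    }

    public static void main(String[] args) {
        List<Chunk> article  = List.of(new Chunk(1000, 50), new Chunk(800, 0));
        List<Chunk> linkFarm = List.of(new Chunk(300, 280), new Chunk(200, 190));
        System.out.println(isContentRich(article, 0.5));   // true
        System.out.println(isContentRich(linkFarm, 0.5));  // false
    }
}
```

A per-chunk variant (discard only the link-heavy chunks, keep the rest) is what boilerplate removers effectively do; the page-level ratio is what you'd use for Markus's content-rich vs. mostly-links decision.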
Hello,
I'm trying to better understand the Nutch and Solr integration. My understanding
is that documents are added to the Solr index from SolrWriter's write(NutchDocument doc)
method. But does it make any use of the WhitespaceTokenizerFactory?
--
Regards,
K. Gabriele
No. SolrJ only builds input documents from NutchDocument objects; Solr itself does the
analysis at index time. The integration is analogous to an XML post of Solr documents.
On Tuesday 05 July 2011 12:28:21 Gabriele Kahlout wrote:
Hello,
I'm trying to understand better Nutch and Solr integration. My
understanding is
Thanks again. If there are more pages to look at, please post them if you know
any.
On Tuesday 05 July 2011 12:05:33 Alexander Aristov wrote:
In that article the author uses an approach where he extracts text (with links),
splits the whole text into chunks (by strings in the simplest case or by
Hi Viksit,
It's a known issue now: https://issues.apache.org/jira/browse/NUTCH-1029
Cheers,
On Thursday 12 May 2011 22:10:12 Viksit Gaur wrote:
Hi all,
When trying to run Nutch's crawldb reader to get stats for my crawl
database, I get the following error when calling it using hadoop,
On Jul 5, 2011, at 3:05am, Alexander Aristov wrote:
In that article the author uses an approach where he extracts text (with links),
splits the whole text into chunks (by strings in the simplest case or by
paragraph) and then compares chunks by their amount of links or garbage text.
You can take these
Hi,
I am sorry that I have not been able to try to replicate the scenario and
confirm whether I get zero scores in a similar situation, as I am temporarily
unable to do so, but I would like to add this resource [1], if you have not
seen it yet. I am aware that this doesn't address the problem
Hi,
I'm curious to hear if anyone has information on configuring Nutch to crawl
an RDB such as MySQL. In my hypothetical example there are N databases
residing in various distributed geographical locations; to make it a
worst-case scenario, say that they are NOT all the same type, and I
I was tasked with doing something like this. We didn't do it, but my
thought process was to use a Java JDBC URL pointing at the
database.
The two ideas we had were the following:
1. jdbc://dbserver.domain.com/tableName/schemaName/tableName
2. jdbc://dbserver.domain.com/${SQLQuery}
Then
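One way to see how a plugin could act on URL style 1 above: split the pseudo-JDBC URL into host and path segments with java.net.URI, then map the segments to database/schema/table before issuing a query. A sketch under that assumption (the segment names below are illustrative, not from the original post):

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Splits a hypothetical jdbc:// crawl URL (idea 1 above) into host + path
// segments, which a crawler plugin could map to database/schema/table
// before issuing a real JDBC query against that server.
public class JdbcUrlSketch {

    public record Target(String host, List<String> segments) {}

    public static Target parse(String url) {
        URI u = URI.create(url);
        List<String> segs = Arrays.stream(u.getPath().split("/"))
                                  .filter(s -> !s.isEmpty())
                                  .toList();
        return new Target(u.getHost(), segs);
    }

    public static void main(String[] args) {
        Target t = parse("jdbc://dbserver.domain.com/mydb/myschema/mytable");
        System.out.println(t.host());     // dbserver.domain.com
        System.out.println(t.segments()); // [mydb, myschema, mytable]
    }
}
```

Style 2 (embedding a query in the URL) would need escaping and raises injection concerns, which is one reason the thread below steers toward Sqoop or DIH instead.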
Hi,
About geographical search: Solr will do this for you, built in for 3.x+ and
via third-party plugins for 1.4.x; both provide different features. In Solr
you'd not base similarity on geographical data, but instead use spatial data
to boost textually similar documents, or to filter.
This
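Concretely, the filter side of that advice maps to Solr 3.x+'s {!geofilt} filter query, appended alongside a normal textual q parameter. A small sketch that builds such a filter string (the sfield name "location" is an assumption about your schema):

```java
// Builds a Solr 3.x+ spatial filter query ({!geofilt}): relevance stays
// textual, and the location is used only to filter within d kilometers.
// The sfield name passed in must match a LatLonType field in your schema.
public class GeoFilterSketch {

    public static String geofilt(String sfield, double lat, double lon, double km) {
        return String.format(java.util.Locale.ROOT,
                "{!geofilt sfield=%s pt=%.4f,%.4f d=%.1f}", sfield, lat, lon, km);
    }

    public static void main(String[] args) {
        // e.g. send as fq=... next to a normal textual query q=pizza
        System.out.println(geofilt("location", 52.3676, 4.9041, 10));
    }
}
```

Swapping the fq for a bq (boost query) gives the other behavior Markus mentions: geographically close documents rank higher without excluding distant ones.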
thanks to you both
On Tue, Jul 5, 2011 at 4:35 PM, Markus Jelsma markus.jel...@openindex.iowrote:
Hi,
About geographical search: Solr will do this for you. Built in for 3.x+ and
using third-party plugins for 1.4.x. Both provide different features. In
Solr
you'd not base similarity on
Another approach for large DBs is to use Sqoop to pull the DB records into
Hadoop, then create Solr indexes from those records.
In our situation we're using Cascading, so we just hook up our cascading.solr
tap.
For small(er) DBs, or if you need incremental updates, Solr's DIH (DataImportHandler) is more
By the by, the best source of open geospatial data I've found
is OpenStreetMap [1]. I'm unaware of any Java bindings for it
(honestly I haven't looked that hard), but it appears to have pretty
excellent data, and its built-in search seems better than any other
open-source implementation