Thanks Julien. This helps. I’ll look into this.
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Monday, May 05, 2014 8:57 PM
To: dev@nutch.apache.org
Subject: Re: Post process Nutch data
Hi Talat,
thanks for the examples. I've also observed that Neko has some problems
even with valid HTML5. Luckily, most pages do not make excessive use of the
syntactic "freedom" HTML5 allows (not closing tags, leaving out "implicit"
tags). Some problems can be easily fixed (e.g., NUTCH-1733), and since
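A quick way to reproduce this outside of a crawl is to run a small HTML5
fragment with unclosed tags straight through NekoHTML's DOMParser and look
at the DOM it builds. The sketch below is only an illustration; the feature
URI and the uppercase element names reflect Neko's defaults as far as I
recall.

import java.io.StringReader;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NekoHtml5Check {

  public static void main(String[] args) throws Exception {
    // HTML5 fragment that relies on implicit closing of <p> elements.
    String html = "<!DOCTYPE html><html><body><p>first paragraph<p>second paragraph";

    DOMParser parser = new DOMParser();
    // Tag balancing is on by default; set it explicitly for clarity.
    parser.setFeature("http://cyberneko.org/html/features/balance-tags", true);
    parser.parse(new InputSource(new StringReader(html)));

    Document doc = parser.getDocument();
    // Neko uppercases element names by default, hence "P".
    NodeList paragraphs = doc.getElementsByTagName("P");
    for (int i = 0; i < paragraphs.getLength(); i++) {
      System.out.println(paragraphs.item(i).getTextContent());
    }
  }
}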
Hi
As mentioned earlier in a different discussion on this list, Behemoth would
be the right tool for this.
Julien
On Monday, 5 May 2014, Srikanth Shankara Rao wrote:
>
> Hi All,
>
> I have crawled data using Nutch 1.8. The data is in HDFS. I would like to
> post-process this data before indexing into SOLR.
Hi All,
I have crawled data using Nutch 1.8. The data is in HDFS. I would like to
post-process this data before indexing into SOLR. The idea is to transform the
data based on the content and add a few additional fields that describe the
content.
I would like to do this as part of a Hadoop job. What
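One common way to do this kind of transformation in Nutch 1.x is a custom
IndexingFilter plugin: the indexer itself runs as a Hadoop job, so the filter
executes inside that job just before documents are sent to Solr. The sketch
below is only an illustration of the idea, not what was actually used here;
the class name and the added field are hypothetical, and the filter()
signature is the 1.x extension point as far as I recall.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/** Hypothetical filter that derives an extra field from the parsed content. */
public class ContentTagIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String text = parse.getText();
    // Toy transformation: classify documents by length so Solr can facet on it.
    doc.add("content_length_class", text.length() > 10000 ? "long" : "short");
    return doc;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

Such a plugin also needs the usual plugin.xml/build.xml registration and has
to be listed in plugin.includes before the indexing job picks it up.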
Hi Lewis and Sebastian,
First of all, thanks for the reply :) There isn't an issue for this in Jira yet,
but I have found a lot of websites that end up with HTML tags in the parsed text.
For example
http://www.dersimiz.com/kisa-ilginc-enteresan-tuhaf-acayip-sasirtici-bilgiler.asp#.U2c6H3V_t2M
When it is parsed by Neko, i
2014-05-03 20:04 GMT+03:00 Lewis John Mcgibbney :
> Hi Talat,
>
> On Sat, May 3, 2014 at 4:35 AM, wrote:
>>
>>
>> The parser plugin we currently use, nekohtml, doesn't parse correctly.
>
>
> What is wrong with it? Are there any issues in Jira to back this up?
>
>>
>> When I tested
>> on a huge website, it le
Hi Lewis,
Thanks for the information about this work. Emre worked at our company. I
reviewed the code. The architecture of the work is based on an abstract
RankingJob. It is similar to our old IndexerJob architecture.
Moreover, Emre didn't use Gora or ToolRunner, and it doesn't take a crawlId,
etc. I want to create
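For reference, the ToolRunner pattern mentioned above looks roughly like the
skeleton below; the class and the crawlId handling are hypothetical, not
taken from Emre's code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/** Hypothetical skeleton showing the Tool/ToolRunner pattern. */
public class RankingJobSkeleton extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: RankingJobSkeleton <crawlId>");
      return -1;
    }
    // A real job would configure and submit a MapReduce job here, using
    // getConf() so that -D options from the command line are honored.
    getConf().set("storage.crawl.id", args[0]);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic Hadoop options (-D, -conf, ...) before run().
    int res = ToolRunner.run(new Configuration(), new RankingJobSkeleton(), args);
    System.exit(res);
  }
}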
Hi Sebastian,
Thank you for reviewing my email. "A pluggable RankingJob" means a job
that has pluggable ranking backends for graph-based algorithms. This
job would be similar to our present IndexingJob architecture. If we create a
RankingJob in our crawler workflow, we can create a dummy Scoring
Filter that
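To make "pluggable ranking backends" more concrete, an extension point could
be declared along the lines of the sketch below. This is purely hypothetical,
not an existing Nutch interface; implementations would be resolved through
the plugin system at job setup time, much like indexing backends are today.

import org.apache.hadoop.conf.Configurable;

/**
 * Hypothetical extension point: one implementation per graph-based ranking
 * algorithm, selected by configuration.
 */
public interface RankingBackend extends Configurable {

  /** Prepare any state needed before the ranking iterations start. */
  void setup(String crawlId) throws Exception;

  /** Run one ranking pass; return true once the scores have converged. */
  boolean runIteration(int iteration) throws Exception;

  /** Write the final scores back to the storage layer. */
  void commit() throws Exception;
}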
I have designed a vertical spider and am interested in Nutch's
architecture. After reading some introductions, I have some questions.
1. Why does Nutch 2.x use third-party databases such as HBase/Cassandra?
As far as I know, Nutch 1.x stores its data in HDFS and manages it by
itself. Using NoSQL like HBase
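For context on what 2.x gains from that choice: the storage layer is
abstracted behind Apache Gora, so the same crawl code can read and write
WebPage records against HBase, Cassandra, or other backends, while 1.x keeps
segments and the crawldb as flat files in HDFS and manages them itself. A
rough sketch of the Gora access pattern (the API details here are from
memory, so treat them as an assumption):

import org.apache.gora.query.Query;
import org.apache.gora.query.Result;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;

public class ListCrawledPages {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Gora picks the concrete backend (HBase, Cassandra, ...) from gora.properties.
    DataStore<String, WebPage> store =
        DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

    Query<String, WebPage> query = store.newQuery();
    Result<String, WebPage> result = store.execute(query);
    while (result.next()) {
      WebPage page = result.get();
      System.out.println(result.getKey() + " status=" + page.getStatus());
    }
    result.close();
    store.close();
  }
}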