RE: Post process Nutch data

2014-05-05 Thread Srikanth Shankara Rao
Thanks Julien. This helps. I’ll look into this. From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: Monday, May 05, 2014 8:57 PM To: dev@nutch.apache.org Subject: Re: Post process Nutch data Hi, as mentioned earlier in a different discussion on this list, Behemoth would be the right

Re: Better Parser Plugin

2014-05-05 Thread Sebastian Nagel
Hi Talat, thanks for the examples. I've also observed that Neko has some problems even with valid HTML5. Luckily, most pages do not make excessive use of the syntactic "freedom" HTML5 allows (not closing tags, omitting "implicit" tags). Some problems can be easily fixed (e.g., NUTCH-1733), and since

Re: Post process Nutch data

2014-05-05 Thread Julien Nioche
Hi, as mentioned earlier in a different discussion on this list, Behemoth would be the right tool for this. Julien On Monday, 5 May 2014, Srikanth Shankara Rao wrote: > > Hi All, > > I have crawled Nutch data using 1.8. Data is in HDFS. I would like to > post-process this data before indexing int

Post process Nutch data

2014-05-05 Thread Srikanth Shankara Rao
Hi All, I have crawled data using Nutch 1.8. The data is in HDFS. I would like to post-process this data before indexing it into Solr. The idea is to transform the data based on the content and add a few additional fields that describe the content. I would like to do this as part of a Hadoop job. What

Re: Better Parser Plugin

2014-05-05 Thread Talat Uyarer
Hi Lewis and Sebastian, First of all, thanks for the reply :) There is no issue in our Jira, but I have found many websites that have HTML tags in the parsed text. For example: http://www.dersimiz.com/kisa-ilginc-enteresan-tuhaf-acayip-sasirtici-bilgiler.asp#.U2c6H3V_t2M When it is parsed by Neko, i

Re: Better Parser Plugin

2014-05-05 Thread Talat Uyarer
2014-05-03 20:04 GMT+03:00 Lewis John Mcgibbney : > Hi Talat, > > On Sat, May 3, 2014 at 4:35 AM, wrote: >> >> >> The parser plugin currently in use, NekoHTML, doesn't parse correctly. > > > What is wrong with it? Are there any issues in Jira to back this up? > >> >> When I tested >> in huge website site, it le

Re: Giraph Integration

2014-05-05 Thread Talat Uyarer
Hi Lewis, Thanks for the information about this work. Emre worked at our company. I reviewed the code. The architecture of the work is based on an abstract RankingJob. It is similar to our old IndexerJob architecture. Moreover, Emre's code didn't use Gora or ToolRunner, and it didn't take a crawlId, etc. I want to create

Re: About RankingJob for Giraph

2014-05-05 Thread Talat Uyarer
Hi Sebastian, Thank you for reviewing my email. "A pluggable RankingJob" means a job that has pluggable ranking backends for graph-based algorithms. This job is similar to our present IndexingJob architecture. If we create a RankingJob in our crawler workflow, we can create a dummy Scoring Filter that

some questions about nutch?

2014-05-05 Thread Li Li
I have designed a vertical spider and am interested in Nutch's architecture. After reading some introductions, I have some questions. 1. Why does Nutch 2.x use third-party databases such as HBase/Cassandra? As far as I know, Nutch 1.x stores its data in HDFS and manages it itself. Using NoSQL like HBase