Re: javascript in href does not get into outlink

2012-07-01 Thread Alexander Aristov
If you are referring to links such as href="javascript:__doPostBack('lnkGoa','') ", then these types of links cannot be processed and get discarded by the URL normalizer and filter. In fact, Nutch doesn't run JavaScript on fetched content, and so it cannot invoke javascript ASP functio
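To illustrate the point above, here is a minimal sketch of a scheme-based outlink check. It is plain Python, not Nutch's actual normalizer or filter code; the function name `keep_outlink`, the `FETCHABLE_SCHEMES` set, and the example.org URL are assumptions made up for the example.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes count as fetchable in this sketch.
FETCHABLE_SCHEMES = {"http", "https"}


def keep_outlink(href):
    """Return the trimmed href if its scheme is fetchable, else None (discarded)."""
    candidate = href.strip()
    # Relative links have an empty scheme here; a real crawler would resolve
    # them against the page URL before applying a check like this one.
    scheme = urlsplit(candidate).scheme.lower()
    return candidate if scheme in FETCHABLE_SCHEMES else None


hrefs = [
    "javascript:__doPostBack('lnkGoa','') ",  # link from the message above
    "http://example.org/page",                # hypothetical ordinary link
]
kept = [h for h in hrefs if keep_outlink(h) is not None]
print(kept)
```

A link with the javascript: scheme is only text to a crawler that does not execute scripts, so it never becomes an outlink in this sketch.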

How to update the index quickly?

2012-07-01 Thread 何建云
Hi, I am using Nutch for a search engine. I cannot index web pages until the entire crawling process has ended, but I would like a quick update operation: the data crawled so far should be added to the index even if the entire crawl process is not over yet. 1. Does anyone have a good idea? 2. I

Re: exception that can't be caught in nutch plugin

2012-07-01 Thread Jiang Fung Wong
Hi, No. I don't have that enabled. The logging level is INFO for others. -Jiang On Fri, Jun 29, 2012 at 5:32 PM, Ferdy Galema wrote: > A quick pointer: > > Do you have trace logging enabled? If so, try disabling it and see if that > works. > See https://issues.apache.org/jira/browse/NUTCH-1253 >

Re: ParseSegment taking a long time to finish

2012-07-01 Thread mstekel
Hi guys. Did you find a solution for this issue?

javascript in href does not get into outlink

2012-07-01 Thread arijit
Hi, I am trying to crawl the URL: http://districts.nic.in. The javascript links contain the meat of all information in this website. However, on crawling, Nutch ignores all these href="javascript: links. I have ensured the following: nutch-site.xml contains parse-js in plugin.includes.

Re: Language-focused crawling

2012-07-01 Thread Safdar Kureishy
Thanks Markus/Alexander. Alexander -- what "filter" are you suggesting I implement, if not the scoring filter? Markus -- the fetch filter filters the content regardless of who pointed to it. The use case that I'm trying to implement does that on the next hop ... i.e., pages of other languages that

Re: Nutch Exception

2012-07-01 Thread Alexander Aristov
I would check the build path for issues. Then, did you set up the plugins folder according to the article? I use an absolute path to avoid possible mistakes. What about error messages? Do you have any? Sometimes I have experienced a problem where Eclipse didn't compile when I had errors in xml files

RE: Language-focused crawling

2012-07-01 Thread Markus Jelsma
It's a use case for a fetch filter: https://issues.apache.org/jira/browse/NUTCH-828 -Original message- > From:Alexander Aristov > Sent: Sun 01-Jul-2012 20:43 > To: user@nutch.apache.org; safdar.kurei...@gmail.com > Subject: Re: Language-focused crawling > > Hi > > First of all you u

Re: Language-focused crawling

2012-07-01 Thread Alexander Aristov
Hi. First of all, you understand that in order to detect a page's language, the page must be crawled and at least sent to the parser. As you noted, the language-identifier filter adds the lang field and that's it. You will need to modify it or write your own filter that would discard unwanted languages (return null)
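A rough sketch of the "discard unwanted languages" idea follows. It is plain Python standing in for a filter step, not the Nutch plugin API: the document is modelled as a dict, `None` plays the role of the null return mentioned above, and `ALLOWED_LANGUAGES`, `filter_by_language`, and the example.org URLs are hypothetical names chosen for illustration. It assumes language identification has already run and set the `lang` field.

```python
# Assumption: the set of languages the crawl should keep.
ALLOWED_LANGUAGES = {"en"}


def filter_by_language(doc, allowed=ALLOWED_LANGUAGES):
    """Return doc unchanged if its 'lang' field is allowed, else None (discard)."""
    lang = doc.get("lang")
    # A document with no detected language is discarded as well in this sketch.
    if lang is None or lang.lower() not in allowed:
        return None
    return doc


docs = [
    {"url": "http://example.org/a", "lang": "en"},
    {"url": "http://example.org/b", "lang": "de"},
    {"url": "http://example.org/c"},  # language was never detected
]
kept = [d for d in docs if filter_by_language(d) is not None]
print([d["url"] for d in kept])
```

Whether pages with no detected language should be kept or dropped is a policy choice; this sketch drops them.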

Re: How to Read Data Clawled by nutch1.5 and Search keywords

2012-07-01 Thread Alexander Aristov
Look at this tutorial: http://wiki.apache.org/nutch/NutchTutorial It gives basic information about Nutch and the necessary steps. Best Regards Alexander Aristov On 1 July 2012 11:28, 余靖毅 <502437...@qq.com> wrote: > Hi, > I'm a new nutch's user and developer. I have configed nutch1.5 in > nutc

How to Read Data Clawled by nutch1.5 and Search keywords

2012-07-01 Thread 余靖毅
Hi, I'm a new Nutch user and developer. I have configured nutch1.5 and succeeded in crawling data from the Internet. During the process I didn't set SolrUrl, and I want to use Solr to read the data. The data is now in the directories "crawldb", "linkdb", and "segments". I would like to call API in