Re: Few questions from a newbie

2011-01-26 Thread Julien Nioche
Tom White's book on Hadoop is a must-have for anyone wanting to understand how Nutch and Hadoop work. There is also a section in it specifically about Nutch, written by Andrzej. On 26 January 2011 03:02, .: Abhishek :. ab1s...@gmail.com wrote: Thanks a bunch Markus. By the way, is there

Re: Few questions from a newbie

2011-01-26 Thread .: Abhishek :.
Thanks Julien. I will get the book :) On Wed, Jan 26, 2011 at 5:09 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Tom White's book on Hadoop is a must have for anyone wanting to understand how Nutch and Hadoop work. There is a section in it specifically about Nutch written by

RE: Few questions from a newbie

2011-01-26 Thread McGibbney, Lewis John
I can only speak for myself, but I think that reading up on 'search', e.g. Lucene, is really the first stop prior to engaging with the crawling stuff. There are publications out there dealing with building search applications, but these only contain small sections on web crawlers and code examples

Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
Hi list, I have given the set of URLs as http://is.gd/Jt32Cf http://is.gd/hS3lEJ http://is.gd/Jy1Im3 http://is.gd/QoJ8xy http://is.gd/e4ct89 http://is.gd/WAOVmd http://is.gd/lhkA69 http://is.gd/3OilLD (43 such URLs). And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth 3

Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
I am developing an application based on Twitter feeds, so 90% of the URLs will be short URLs. It is therefore difficult for me to manually convert all these URLs to the actual URLs. Do we have any other solution for this? Thanks and regards, Arjun Kumar Reddy On Wed, Jan 26, 2011 at 7:09 PM, Estrada

Re: Archiving Audio and Video

2011-01-26 Thread Gora Mohanty
On Wed, Jan 26, 2011 at 7:17 PM, Estrada Groups estrada.adam.gro...@gmail.com wrote: Thanks Gora! I am interested in searching through the text from these audio and video streams. An example would be a 911 dispatch call and maybe even all the recorded official chatter about it. That is just

Reply: Re: Few questions from a newbie

2011-01-26 Thread Mike Zuehlke
Hi Arjun, Nutch handles redirects by itself, such as the return codes 301 and 302. Did you check how many redirects you have to follow until you get an HTTP 200 (OK) response? I think four redirects are needed to get the content of the given URL. So you have to increase the depth for your crawling.
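
One way to check how many redirects a short URL goes through before it reaches its final page is to follow the chain by hand, one hop at a time. The following is a minimal sketch in Python, not part of Nutch; the function names `resolve_redirects` and `http_head` are made up for illustration, and the hop limit of 10 is an arbitrary assumption.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def http_head(url):
    """Send one HEAD request and return (status, Location header or None).

    http.client does not follow redirects by itself, so each call
    reports exactly one hop.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def resolve_redirects(url, fetch_head=http_head, max_hops=10):
    """Follow redirects hop by hop and return (final_url, number_of_hops)."""
    seen = {url}
    hops = 0
    while hops < max_hops:
        status, location = fetch_head(url)
        if status not in REDIRECT_CODES or not location:
            return url, hops
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
        if url in seen:
            raise ValueError("redirect loop at " + url)
        seen.add(url)
        hops += 1
    raise ValueError("more than %d redirects" % max_hops)
```

Calling `resolve_redirects("http://is.gd/Jt32Cf")` would report the final URL and the hop count for one of the seeds, assuming the short link is still live. The resolved URLs could then be written to the seed file instead of the short ones.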

Re: Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
Hi Mike, Actually, in my application I am working on Twitter feeds: I am filtering the tweets that contain links and storing the contents of those links. I maintain all such links in the urls file and give it as input to the Nutch crawler. Here, I am not bothered about the inlinks or

Re: Few questions from a newbie

2011-01-26 Thread Churchill Nanje Mambe
Hello, you have to use the short-URL APIs to get the long URLs... It's a bit complex, as you have to determine whether the URL is short, then determine which URL shortening service was used, e.g. tinyurl.com, bit.ly or goo.gl, and then use the respective API: you send in the URL and they will return the long

Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
Hi Mambe, Thanks for the feedback. I have mentioned the details of my application in the above post. I have tried doing this crawling job using PHP multi-curl and I am getting results which are good enough, but the problem I am facing is that it is taking a hell of a lot of time to get the contents

Re: Archiving Audio and Video

2011-01-26 Thread Adam Estrada
Another example would be the content embedded in this flash movie. http://digitalmedia.worldbank.org/SSP/lac/investment-in-haiti/ Adam On Wed, Jan 26, 2011 at 1:02 AM, Gora Mohanty g...@mimirtech.com wrote: On Wed, Jan 26, 2011 at 9:15 AM, Adam Estrada estrada.adam.gro...@gmail.com wrote:

Re: Archiving Audio and Video

2011-01-26 Thread Gora Mohanty
On Wed, Jan 26, 2011 at 7:38 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Another example would be the content embedded in this flash movie. http://digitalmedia.worldbank.org/SSP/lac/investment-in-haiti/ [...] ffmpeg can pull out audio from video streams, and a working speech-to-text

Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Claudio Martella
Today I had a look at the code and wrote this class. It works here on my test cluster. It scans the crawldb for entries carrying the STATUS_DB_GONE status and issues a delete to Solr for those entries. Is that what you guys have in mind? Should I file a JIRA? On 1/24/11 10:26 AM, Markus Jelsma
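
The class itself is not shown in this listing. As a rough illustration of the idea only, here is a minimal Python sketch that turns a list of (URL, status) pairs into a Solr XML delete message. It assumes that the Solr document id is the page URL and that gone entries are labelled `db_gone`; both are assumptions, and `build_solr_delete` is a hypothetical helper, not the posted class.

```python
from xml.sax.saxutils import escape


def build_solr_delete(entries, gone_status="db_gone"):
    """Build a Solr XML <delete> message for all entries marked as gone.

    entries: iterable of (url, status) pairs, e.g. taken from a crawldb dump.
    Returns None when there is nothing to delete.
    Assumption: the Solr unique key of each document is its URL.
    """
    ids = [url for url, status in entries if status == gone_status]
    if not ids:
        return None
    # Escape &, < and > so URLs with query strings stay valid XML.
    body = "".join("<id>%s</id>" % escape(url) for url in ids)
    return "<delete>" + body + "</delete>"


if __name__ == "__main__":
    sample = [
        ("http://example.com/old?page=1&lang=en", "db_gone"),
        ("http://example.com/live", "db_fetched"),
    ]
    print(build_solr_delete(sample))
```

The resulting message would be posted to Solr's update handler and followed by a commit.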

Re: Few questions from a newbie

2011-01-26 Thread alxsss
You can set fetching of external and internal links to false and increase the depth. -Original Message- From: Churchill Nanje Mambe mambena...@afrovisiongroup.com To: user user@nutch.apache.org Sent: Wed, Jan 26, 2011 8:03 am Subject: Re: Few questions from a newbie even if the url

Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Markus Jelsma
Hi, Please open a ticket, I'll test it. Cheers, On Wednesday 26 January 2011 18:12:51 Claudio Martella wrote: Today I had a look at the code and wrote this class. It works here on my test cluster. It scans the crawldb for entries carrying the STATUS_DB_GONE and it issues a delete to solr

Webserver configuration to successfully get modified time?

2011-01-26 Thread Joshua J Pavel
We've been crawling with nutch and deleting the crawldb between crawls. I believe I've managed to get my recrawl script to finally work, but I was disappointed to see that in my db, the modified time of all of my pages is Jan 1 1970. Since I control both the crawler and the web server in our
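
A date of Jan 1 1970 is what a timestamp of 0 looks like when printed, so one possible explanation (not confirmed in this thread) is that no usable modification time reached the crawler. The Python sketch below shows how a `Last-Modified` response header maps to epoch milliseconds and what a missing header would look like under that assumption; `last_modified_millis` is a hypothetical helper, not a Nutch function.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def last_modified_millis(header_value):
    """Convert a Last-Modified header to epoch milliseconds.

    Returns 0 when the header is missing or unparsable, which is the
    value that renders as Jan 1 1970 (assumed failure mode).
    """
    if not header_value:
        return 0
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return 0
    if parsed.tzinfo is None:
        # Treat a header without a zone as UTC for this sketch.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


def render(millis):
    """Format epoch milliseconds as a UTC timestamp string."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    print(render(last_modified_millis("Wed, 26 Jan 2011 18:12:51 GMT")))
    print(render(last_modified_millis(None)))
```

Checking whether the web server actually sends a `Last-Modified` header for the crawled pages would be a reasonable first step.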

RE: [Example] Configuration for a Hadoop Cluster

2011-01-26 Thread Chris Woolum
This is a helpful wiki link http://wiki.apache.org/nutch/NutchHadoopTutorial Chris -Original Message- From: estrada.a...@gmail.com [mailto:estrada.a...@gmail.com] On Behalf Of Adam Estrada Sent: Wednesday, January 26, 2011 7:31 AM To: user@nutch.apache.org Subject: [Example]

Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Erlend Garåsen
But we also need to detect modified documents in order to trigger an update command to Solr (an improvement of SolrIndexer). I was planning to open a Jira issue on this missing functionality this week. Erlend On 26.01.11 18.12, Claudio Martella wrote: Today I had a look at the code and

Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Markus Jelsma
This is default behaviour. If pages are scheduled for fetching they will show up in the next segment. If you index that segment the old document in Solr is overwritten. But we also need to detect modified documents in order to trigger an update command to Solr (an improvement of SolrIndexer).

Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Markus Jelsma
See this post in a recent thread: http://search.lucidimagination.com/search/document/5b7ba8a6fc5e0305/few_questions_from_a_newbie This is default behaviour. If pages are scheduled for fetching they will show up in the next segment. If you index that segment the old document in Solr is

Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Claudio Martella
Yes, absolutely. The only optimization we could make here would be to send to Solr only updates for documents that we know for sure have changed (i.e. based on digests, like the deduplication code). I'm not sure how Solr behaves if you send an update with no change in the document. I'm sure they
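
The digest-based optimization mentioned above can be sketched as follows. This is an illustration in Python rather than Nutch code: it assumes an MD5 digest of the raw page content and two hypothetical dictionaries mapping URL to digest, one from the previous crawl and one from the current crawl.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a hex MD5 digest of the page content."""
    return hashlib.md5(content).hexdigest()


def changed_urls(previous, current):
    """Return the URLs that are new or whose digest differs.

    previous, current: dicts mapping url -> digest string.
    Only these URLs would need an update sent to Solr.
    """
    return sorted(
        url for url, value in current.items() if previous.get(url) != value
    )


if __name__ == "__main__":
    old = {"http://example.com/a": digest(b"same"),
           "http://example.com/b": digest(b"before")}
    new = {"http://example.com/a": digest(b"same"),
           "http://example.com/b": digest(b"after"),
           "http://example.com/c": digest(b"brand new")}
    print(changed_urls(old, new))
```

Unchanged pages are skipped entirely, so the question of how Solr treats an update with identical content does not arise for them.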