Tom White's book on Hadoop is a must-have for anyone wanting to understand how Nutch and Hadoop work. There is also a section in it specifically about Nutch, written by Andrzej.
On 26 January 2011 03:02, .: Abhishek :. ab1s...@gmail.com wrote:
Thanks a bunch, Markus.
By the way, is there
Thanks Julien. I will get the book :)
On Wed, Jan 26, 2011 at 5:09 PM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
Tom White's book on Hadoop is a must-have for anyone wanting to understand how Nutch and Hadoop work. There is also a section in it specifically about Nutch, written by Andrzej.
I can only speak for myself, but I think that reading up on 'search', e.g. Lucene, is really the first stop prior to engaging with the crawling stuff. There are publications out there dealing with building search applications, but these only contain small sections on web crawlers and code examples.
Hi list,
I have given the set of URLs as:
http://is.gd/Jt32Cf
http://is.gd/hS3lEJ
http://is.gd/Jy1Im3
http://is.gd/QoJ8xy
http://is.gd/e4ct89
http://is.gd/WAOVmd
http://is.gd/lhkA69
http://is.gd/3OilLD
... 43 such URLs in total.
And I have run the crawl command: bin/nutch crawl urls/ -dir crawl -depth 3
I am developing an application based on Twitter feeds, so 90% of the URLs will be short URLs.
So it is difficult for me to manually convert all these short URLs to their actual targets. Do we have any other solution for this?
Thanks and regards,
Arjun Kumar Reddy
On Wed, Jan 26, 2011 at 7:17 PM, Estrada Groups
estrada.adam.gro...@gmail.com wrote:
Thanks Gora! I am interested in searching through the text from these audio and video streams. An example would be a 911 dispatch call, and maybe even all the recorded official chatter about it. That is just one example.
Hi Arjun,
Nutch handles redirects by itself (return codes 301 and 302). Did you check how many redirects have to be followed until you get a 200 (OK) response? I think four redirects are needed to reach the content behind the given URLs, so you have to increase the depth of your crawl.
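One knob worth checking here (a hedged suggestion; verify the property against your nutch-default.xml): http.redirect.max in nutch-site.xml controls how many redirects the fetcher follows immediately. With the default of 0, a redirect is only recorded for fetching in a later round, which is why chains of short URLs need extra depth.

<property>
  <name>http.redirect.max</name>
  <!-- assumption: four hops covers these is.gd chains; adjust after testing -->
  <value>4</value>
</property>

Alternatively, keep the default and rerun with more depth, e.g. bin/nutch crawl urls/ -dir crawl -depth 6.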
Hi Mike,
Actually, in my application I am working on Twitter feeds, where I am filtering the tweets that contain links and storing the contents of those links. I maintain all such links in the urls file, giving it as the input to the Nutch crawler. Here, I am not bothered about the inlinks or outlinks.
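As a minimal sketch of that feed-to-seed step (all class and method names here are hypothetical, not Arjun's actual code): pull the links out of each tweet and append them, one per line, to the seed file that Nutch reads.

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeedListBuilder {
  // naive URL pattern; production tweet parsing may need something more robust
  private static final Pattern URL = Pattern.compile("https?://\\S+");

  public static void appendLinks(String tweetText, String seedFile) throws Exception {
    PrintWriter out = new PrintWriter(new FileWriter(seedFile, true)); // append mode
    Matcher m = URL.matcher(tweetText);
    while (m.find()) {
      out.println(m.group()); // one URL per line, as the urls/ dir expects
    }
    out.close();
  }
}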
Hello,
You have to use the short-URL APIs to get the long URLs. It's a bit complex: you have to determine whether a URL is a short one, then determine the URL-shortening service used (e.g. tinyurl.com, bit.ly, or goo.gl), and then call the respective service's API, which takes the short URL and returns the long URL.
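A hedged alternative that sidesteps the per-service APIs (a sketch, not production code; the class name is made up): most shorteners answer with a 301/302 whose Location header is the long URL, so a single HEAD request expands almost any service.

import java.net.HttpURLConnection;
import java.net.URL;

public class ShortUrlResolver {
  /** Follows one redirect hop; returns the Location header, or the input if there is none. */
  public static String expand(String shortUrl) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(shortUrl).openConnection();
    conn.setInstanceFollowRedirects(false); // we want the Location header itself
    conn.setRequestMethod("HEAD");          // headers only, no body
    int status = conn.getResponseCode();
    String location = conn.getHeaderField("Location");
    conn.disconnect();
    if (status >= 300 && status < 400 && location != null) {
      return location;  // the expanded long URL
    }
    return shortUrl;    // not a redirect: nothing to expand
  }
}

Expanding the seeds this way before writing the urls/ file would also remove the need for extra crawl depth.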
Hi Mambe,
Thanks for the feedback. I have mentioned the details of my application in the post above.
I have tried doing this crawling job using PHP multi-cURL, and I am getting results which are good enough, but the problem I am facing is that it takes a very long time to get the contents.
Another example would be the content embedded in this flash movie.
http://digitalmedia.worldbank.org/SSP/lac/investment-in-haiti/
Adam
On Wed, Jan 26, 2011 at 1:02 AM, Gora Mohanty g...@mimirtech.com wrote:
On Wed, Jan 26, 2011 at 9:15 AM, Adam Estrada
estrada.adam.gro...@gmail.com wrote:
On Wed, Jan 26, 2011 at 7:38 PM, Adam Estrada
estrada.adam.gro...@gmail.com wrote:
Another example would be the content embedded in this flash movie.
http://digitalmedia.worldbank.org/SSP/lac/investment-in-haiti/
[...]
ffmpeg can pull out audio from video streams, and a working speech-to-text system could then turn that audio into indexable text.
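As a hedged illustration of that first step (the flags are standard ffmpeg options, but verify them against your ffmpeg version; the class name is made up): extract a mono 16 kHz WAV, a format most speech-to-text engines accept.

import java.io.IOException;
import java.io.InputStream;

public class AudioExtractor {
  /** Shells out to the ffmpeg binary to strip the audio track from a video file. */
  public static void extract(String videoPath, String wavPath)
      throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(
        "ffmpeg", "-i", videoPath, // input video (or captured stream)
        "-vn",                     // drop the video track
        "-ac", "1",                // downmix to mono
        "-ar", "16000",            // resample to 16 kHz for STT engines
        wavPath);
    pb.redirectErrorStream(true);  // fold ffmpeg's stderr chatter into stdout
    Process p = pb.start();
    InputStream out = p.getInputStream();
    while (out.read() != -1) {
      // drain output so ffmpeg cannot block on a full pipe buffer
    }
    if (p.waitFor() != 0) {
      throw new IOException("ffmpeg failed for " + videoPath);
    }
  }
}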
Today I had a look at the code and wrote this class. It works here on my
test cluster.
It scans the crawldb for entries carrying the STATUS_DB_GONE status and issues a delete to Solr for those entries.
Is that what you guys have in mind? Should I file a JIRA?
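For readers of the archive, a rough sketch of what such a cleaner can look like (a reconstruction under assumptions, not Claudio's actual class; the crawldb path and Solr URL are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SolrGoneCleaner {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // crawldb part files hold <Text url, CrawlDatum> records;
    // a real job would iterate over every part-*/data file
    Path data = new Path("crawl/crawldb/current/part-00000/data");
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
        solr.deleteById(url.toString()); // remove the gone (404) page from the index
      }
    }
    reader.close();
    solr.commit();
  }
}

A real implementation would run as a MapReduce job over all part files, but the status check plus deleteById is the heart of it.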
On 1/24/11 10:26 AM, Markus Jelsma
You can set the options for fetching external and internal links to false and increase the depth.
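In nutch-site.xml terms that maps, as far as I can tell (an assumption worth checking against your nutch-default.xml), to properties such as db.ignore.external.links:

<property>
  <name>db.ignore.external.links</name>
  <!-- true drops outlinks pointing to other hosts,
       keeping the crawl focused on the seed sites -->
  <value>true</value>
</property>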
-Original Message-
From: Churchill Nanje Mambe mambena...@afrovisiongroup.com
To: user user@nutch.apache.org
Sent: Wed, Jan 26, 2011 8:03 am
Subject: Re: Few questions from a newbie
even if the url
Hi,
Please open a ticket, I'll test it.
Cheers,
On Wednesday 26 January 2011 18:12:51 Claudio Martella wrote:
Today I had a look at the code and wrote this class. It works here on my
test cluster.
It scans the crawldb for entries carrying the STATUS_DB_GONE status and issues a delete to Solr for those entries.
We've been crawling with Nutch and deleting the crawldb between crawls. I
believe I've managed to get my recrawl script to finally work, but I was
disappointed to see that in my db, the modified time of all of my pages is
Jan 1 1970. Since I control both the crawler and the web server in our
This is a helpful wiki link:
http://wiki.apache.org/nutch/NutchHadoopTutorial
Chris
-Original Message-
From: estrada.a...@gmail.com [mailto:estrada.a...@gmail.com] On Behalf Of Adam Estrada
Sent: Wednesday, January 26, 2011 7:31 AM
To: user@nutch.apache.org
Subject: [Example]
But we also need to detect modified documents in order to trigger an
update command to Solr (an improvement of SolrIndexer). I was planning
to open a Jira issue on this missing functionality this week.
Erlend
On 26.01.11 18.12, Claudio Martella wrote:
Today I had a look at the code and
This is default behaviour. If pages are scheduled for fetching, they will show up in the next segment. If you index that segment, the old document in Solr is overwritten.
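That overwrite works because Solr replaces any document arriving with the same uniqueKey; in a Nutch/Solr setup that key is the URL-based id (hedged: treat the field names below as assumptions for illustration). A tiny SolrJ demonstration:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OverwriteDemo {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.com/page");
    doc.addField("content", "old version");
    solr.add(doc);

    SolrInputDocument newer = new SolrInputDocument();
    newer.addField("id", "http://example.com/page"); // same uniqueKey
    newer.addField("content", "new version");
    solr.add(newer); // replaces the earlier document wholesale
    solr.commit();   // only "new version" remains searchable
  }
}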
But we also need to detect modified documents in order to trigger an
update command to Solr (an improvement of SolrIndexer).
See this post in a recent thread:
http://search.lucidimagination.com/search/document/5b7ba8a6fc5e0305/few_questions_from_a_newbie
This is default behaviour. If pages are scheduled for fetching, they will show up in the next segment. If you index that segment, the old document in Solr is overwritten.
Yes, absolutely.
The only optimization we could make here would be to send to Solr only updates about documents that we know for sure have changed (i.e. based on digests, like the deduplication code). I'm not sure how Solr behaves if you send an update with no change in the document.
I'm sure they
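A minimal sketch of that digest-based filtering (the names and the persistence of the url-to-digest map are assumptions for illustration): compare the CrawlDatum signature from the freshly fetched segment with the digest recorded at the last indexing run, and only send changed documents on to Solr.

import java.util.Map;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.StringUtil;

public class DigestFilter {
  private final Map<String, String> lastIndexed; // url -> hex digest, persisted elsewhere

  public DigestFilter(Map<String, String> lastIndexed) {
    this.lastIndexed = lastIndexed;
  }

  /** Returns true if the document changed since the last run and should be sent to Solr. */
  public boolean shouldIndex(String url, CrawlDatum datum) {
    byte[] sig = datum.getSignature();
    if (sig == null) {
      return true; // no digest available: index it to be safe
    }
    String digest = StringUtil.toHexString(sig);
    String previous = lastIndexed.put(url, digest); // record the new digest
    return previous == null || !previous.equals(digest);
  }
}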