Re: Archiving Audio and Video

2011-01-26 Thread Julien Nioche
Hi Adam,

This could be done by implementing a custom parser for these formats, using
a speech-to-text API, and then storing the text in the same way as we do for
other formats. Definitely doable, but it would require some work.

Julien
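
To make that concrete, here is a minimal sketch of such a parse plugin against
the Nutch 1.x Parser extension point. The package name and the transcribe()
helper are hypothetical stand-ins for a real speech-to-text integration:

/*
 * Minimal sketch of an audio/video parse plugin for Nutch 1.x. The package
 * name and the transcribe() helper are hypothetical; transcribe() stands in
 * for a call to whatever speech-to-text API or library is chosen.
 */
package org.example.parse.media;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

public class MediaTranscriptParser implements Parser {

  private Configuration conf;

  public ParseResult getParse(Content content) {
    // Turn the raw media bytes into text, then hand it back to Nutch exactly
    // as a text parser would, so indexing works the same as for other formats.
    String text = transcribe(content.getContent(), content.getContentType());
    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, "",
        new Outlink[0], content.getMetadata());
    return ParseResult.createParseResult(content.getUrl(),
        new ParseImpl(text, parseData));
  }

  // Hypothetical helper: call out to a speech-to-text service here.
  private String transcribe(byte[] media, String contentType) {
    return "";
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

A real plugin would also need the usual plugin.xml descriptor and a
parse-plugins.xml mapping from the audio/video MIME types to this parser.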

On 26 January 2011 03:45, Adam Estrada wrote:

> Curious...I have been using Nutch for a while now and have never tried to
> index any audio or video formats. Is it feasible to grab the audio out of
> both forms of media and then index it? I believe this would require some
> kind of transcription which may be out of reach on this project.
>
> Thanks,
> Adam




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: Few questions from a newbie

2011-01-26 Thread Julien Nioche
Tom White's book on Hadoop is a must-have for anyone wanting to understand
how Nutch and Hadoop work. There is a section in it specifically about Nutch,
written by Andrzej as well.


On 26 January 2011 03:02, .: Abhishek :.  wrote:

> Thanks a bunch Markus.
>
> By the way, is there some book or material on Nutch which would help me
> understanding it better? I  come from an application development background
> and all the crawl n search stuff is *very* new to me :)
>
>
> On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
> wrote:
>
> > These values come from the CrawlDB and have the following meaning.
> >
> > db_unfetched
> > This is the number of URL's that are to be crawled when the next batch is
> > started. This number is usually limited by the generate.max.per.host
> > setting. So, if there are 5000 unfetched URL's and generate.max.per.host is
> > set to 1000, the next batch will fetch only 1000. Note that the number of
> > unfetched URL's will usually not drop to 5000-1000, because new URL's are
> > discovered and added to the CrawlDB in the meantime.
> >
> > db_fetched
> > These URL's have been fetched. Their next fetch is due after
> > db.fetcher.interval. But this is not always the case: the adaptive schedule
> > algorithm can tune this interval depending on several settings, so you can
> > shorten or lengthen it depending on whether a page is modified or not.
> >
> > db_gone
> > HTTP 404 Not Found
> >
> > db_redir-temp
> > HTTP 307 Temporary Redirect
> >
> > db_redir_perm
> > HTTP 301 Moved Permanently
> >
> > Code:
> >
> >
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
> >
> > Configuration:
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
> >
> > > Thanks Chris, Charan and Alex.
> > >
> > > I am looking into the crawl statistics now. And I see fields like
> > > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm,
> what
> > do
> > > they mean?
> > >
> > > And, I also see the db_unfetched is way too high than the db_fetched.
> > Does
> > > it mean most of the pages did not crawl at all due to some issues?
> > >
> > > Thanks again for your time!
> > >
> > > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar  > >wrote:
> > > > db.fetcher.interval : It means that URLS which were fetched in the
> last
> > > > 30 days  will not be fetched. Or A URL is eligible for
> refetch
> > > > only after 30 days of last crawl.
> > > >
> > > > On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
> > > > > How to use solr to index nutch segments?
> > > > > What is the meaning of db.fetcher.interval? Does this mean that if
> I
> > > > > run the same crawl command before 30 days it will do nothing?
> > > > >
> > > > > Thanks.
> > > > > Alex.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -Original Message-
> > > > > From: Charan K 
> > > > > To: user 
> > > > > Cc: user 
> > > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > > Subject: Re: Few questions from a newbie
> > > > >
> > > > >
> > > > > Refer NutchBean.java for the their question. You can run than from
> > > >
> > > > command
> > > >
> > > > > line
> > > > >
> > > > > to test the index.
> > > > >
> > > > >  If you use SOLR indexing, it is going to be much simpler, they
> have
> > a
> > > >
> > > > solr
> > > >
> > > > > java
> > > > >
> > > > > client..
> > > > >
> > > > >
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar 
> > wrote:
> > > > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > > > crawl
> > > > > >
> > > > > > gives u more control and speed
> > > > > >
> > > > > > 2.After the first crawl,the recrawling the same sites time is 30
> > days
> > > >
> > > > by
> > > >
> > > > > > default in db.fetcher.interval,you can change it according to ur
> > own
> > > > > >
> > > > > > convenience.
> > > > > >
> > > > > > 3.I ve no idea about the third question
> > > > > >
> > > > > > cz  i m also a newbie
> > > > > >
> > > > > > Best of luck with nutch learning
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <
> ab1s...@gmail.com
> > >
> > > > >
> > > > > wrote:
> > > > > >> Hi all,
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> I am very new to Nutch and Lucene as well. I am having few
> > questions
> > > > >
> > > > > about
> > > > >
> > > > > >> Nutch, I know they are very much basic but I could not get clear
> > cut
> > > > > >>
> > > > > >> answers
> > > > > >>
> > > > > >> out of googling for this. The questions are,
> > > > > >>
> > > > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > > >
> > > > intranet
> > > >
> > > > > >>  crawl or whole web crawl.
> > > > > >>
> > > > > >>  - How do I set recrawl's for these sa

Re: Few questions from a newbie

2011-01-26 Thread .: Abhishek :.
Thanks Julien. I will get the book :)

On Wed, Jan 26, 2011 at 5:09 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Tom White's book on Hadoop is a must have for anyone wanting to understand
> how Nutch and Hadoop work. There is a section in it specifically about
> Nutch
> written by Andrzej as well
>
>
> On 26 January 2011 03:02, .: Abhishek :.  wrote:
>
> > Thanks a bunch Markus.
> >
> > By the way, is there some book or material on Nutch which would help me
> > understanding it better? I  come from an application development
> background
> > and all the crawl n search stuff is *very* new to me :)
> >
> >
> > On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
> > wrote:
> >
> > > These values come from the CrawlDB and have the following meaning.
> > >
> > > db_unfetched
> > > This is the number of URL's that are to be crawled when the next batch is
> > > started. This number is usually limited by the generate.max.per.host
> > > setting. So, if there are 5000 unfetched URL's and generate.max.per.host
> > > is set to 1000, the next batch will fetch only 1000. Note that the number
> > > of unfetched URL's will usually not drop to 5000-1000, because new URL's
> > > are discovered and added to the CrawlDB in the meantime.
> > >
> > > db_fetched
> > > These URL's have been fetched. Their next fetch is due after
> > > db.fetcher.interval. But this is not always the case: the adaptive
> > > schedule algorithm can tune this interval depending on several settings,
> > > so you can shorten or lengthen it depending on whether a page is modified
> > > or not.
> > >
> > > db_gone
> > > HTTP 404 Not Found
> > >
> > > db_redir-temp
> > > HTTP 307 Temporary Redirect
> > >
> > > db_redir_perm
> > > HTTP 301 Moved Permanently
> > >
> > > Code:
> > >
> > >
> >
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
> > >
> > > Configuration:
> > > http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
> > >
> > > > Thanks Chris, Charan and Alex.
> > > >
> > > > I am looking into the crawl statistics now. And I see fields like
> > > > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm,
> > what
> > > do
> > > > they mean?
> > > >
> > > > And, I also see the db_unfetched is way too high than the db_fetched.
> > > Does
> > > > it mean most of the pages did not crawl at all due to some issues?
> > > >
> > > > Thanks again for your time!
> > > >
> > > > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar <
> charan.ku...@gmail.com
> > > >wrote:
> > > > > db.fetcher.interval : It means that URLS which were fetched in the
> > last
> > > > > 30 days  will not be fetched. Or A URL is eligible for
> > refetch
> > > > > only after 30 days of last crawl.
> > > > >
> > > > > On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
> > > > > > How to use solr to index nutch segments?
> > > > > > What is the meaning of db.fetcher.interval? Does this mean that
> if
> > I
> > > > > > run the same crawl command before 30 days it will do nothing?
> > > > > >
> > > > > > Thanks.
> > > > > > Alex.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > -Original Message-
> > > > > > From: Charan K 
> > > > > > To: user 
> > > > > > Cc: user 
> > > > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > > > Subject: Re: Few questions from a newbie
> > > > > >
> > > > > >
> > > > > > Refer NutchBean.java for the their question. You can run than
> from
> > > > >
> > > > > command
> > > > >
> > > > > > line
> > > > > >
> > > > > > to test the index.
> > > > > >
> > > > > >  If you use SOLR indexing, it is going to be much simpler, they
> > have
> > > a
> > > > >
> > > > > solr
> > > > >
> > > > > > java
> > > > > >
> > > > > > client..
> > > > > >
> > > > > >
> > > > > >
> > > > > > Sent from my iPhone
> > > > > >
> > > > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar  >
> > > wrote:
> > > > > > > 1,to crawl just 5 to 6 websites,u can use both cases but
> intranet
> > > > > > > crawl
> > > > > > >
> > > > > > > gives u more control and speed
> > > > > > >
> > > > > > > 2.After the first crawl,the recrawling the same sites time is
> 30
> > > days
> > > > >
> > > > > by
> > > > >
> > > > > > > default in db.fetcher.interval,you can change it according to
> ur
> > > own
> > > > > > >
> > > > > > > convenience.
> > > > > > >
> > > > > > > 3.I ve no idea about the third question
> > > > > > >
> > > > > > > cz  i m also a newbie
> > > > > > >
> > > > > > > Best of luck with nutch learning
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <
> > ab1s...@gmail.com
> > > >
> > > > > >
> > > > > > wrote:
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> I am very new 

RE: Few questions from a newbie

2011-01-26 Thread McGibbney, Lewis John
I can only speak for myself, but I think that reading up on 'search', e.g.
Lucene, is really the first stop prior to engaging with the crawling stuff.
There are publications out there dealing with building search applications,
but these only contain small sections on web crawlers, and the code examples
are fairly dated now.

Hope this helps


From: .: Abhishek :. [ab1s...@gmail.com]
Sent: 26 January 2011 03:02
To: markus.jel...@openindex.io
Cc: user@nutch.apache.org
Subject: Re: Few questions from a newbie

Thanks a bunch Markus.

By the way, is there some book or material on Nutch which would help me
understanding it better? I  come from an application development background
and all the crawl n search stuff is *very* new to me :)


On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
wrote:

> These values come from the CrawlDB and have the following meaning.
>
> db_unfetched
> This is the number of URL's that are to be crawled when the next batch is
> started. This number is usually limited by the generate.max.per.host
> setting. So, if there are 5000 unfetched URL's and generate.max.per.host is
> set to 1000, the next batch will fetch only 1000. Note that the number of
> unfetched URL's will usually not drop to 5000-1000, because new URL's are
> discovered and added to the CrawlDB in the meantime.
>
> db_fetched
> These URL's have been fetched. Their next fetch is due after
> db.fetcher.interval. But this is not always the case: the adaptive schedule
> algorithm can tune this interval depending on several settings, so you can
> shorten or lengthen it depending on whether a page is modified or not.
>
> db_gone
> HTTP 404 Not Found
>
> db_redir-temp
> HTTP 307 Temporary Redirect
>
> db_redir_perm
> HTTP 301 Moved Permanently
>
> Code:
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
>
> Configuration:
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
>
> > Thanks Chris, Charan and Alex.
> >
> > I am looking into the crawl statistics now. And I see fields like
> > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm, what
> do
> > they mean?
> >
> > And, I also see the db_unfetched is way too high than the db_fetched.
> Does
> > it mean most of the pages did not crawl at all due to some issues?
> >
> > Thanks again for your time!
> >
> > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar  >wrote:
> > > db.fetcher.interval : It means that URLS which were fetched in the last
> > > 30 days  will not be fetched. Or A URL is eligible for refetch
> > > only after 30 days of last crawl.
> > >
> > > On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
> > > > How to use solr to index nutch segments?
> > > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > > run the same crawl command before 30 days it will do nothing?
> > > >
> > > > Thanks.
> > > > Alex.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -Original Message-
> > > > From: Charan K 
> > > > To: user 
> > > > Cc: user 
> > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > Subject: Re: Few questions from a newbie
> > > >
> > > >
> > > > Refer NutchBean.java for the their question. You can run than from
> > >
> > > command
> > >
> > > > line
> > > >
> > > > to test the index.
> > > >
> > > >  If you use SOLR indexing, it is going to be much simpler, they have
> a
> > >
> > > solr
> > >
> > > > java
> > > >
> > > > client..
> > > >
> > > >
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar 
> wrote:
> > > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > > crawl
> > > > >
> > > > > gives u more control and speed
> > > > >
> > > > > 2.After the first crawl,the recrawling the same sites time is 30
> days
> > >
> > > by
> > >
> > > > > default in db.fetcher.interval,you can change it according to ur
> own
> > > > >
> > > > > convenience.
> > > > >
> > > > > 3.I ve no idea about the third question
> > > > >
> > > > > cz  i m also a newbie
> > > > >
> > > > > Best of luck with nutch learning
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :.  >
> > > >
> > > > wrote:
> > > > >> Hi all,
> > > > >>
> > > > >>
> > > > >>
> > > > >> I am very new to Nutch and Lucene as well. I am having few
> questions
> > > >
> > > > about
> > > >
> > > > >> Nutch, I know they are very much basic but I could not get clear
> cut
> > > > >>
> > > > >> answers
> > > > >>
> > > > >> out of googling for this. The questions are,
> > > > >>
> > > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > >
> > > intranet
> > >
> > > > >>  crawl or whole web crawl.
> > > > >>
> > > > >>  - How do I set recrawl's for these same web sites after the first
> > > >
> > > > crawl.
> >

Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
Hi list,

I have given the set of urls as

http://is.gd/Jt32Cf
http://is.gd/hS3lEJ
http://is.gd/Jy1Im3
http://is.gd/QoJ8xy
http://is.gd/e4ct89
http://is.gd/WAOVmd
http://is.gd/lhkA69
http://is.gd/3OilLD
. 43 such urls

And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth 3

arjun@arjun-ninjas:~/nutch$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 43
retry 0: 43
min score: 1.0
avg score: 1.0
max score: 1.0
status 3 (db_gone): 1
status 4 (db_redir_temp): 1
status 5 (db_redir_perm): 41
CrawlDb statistics: done

When I am trying to read the content from the segments, the content block is
empty for every record.

Can you please tell me where I can get the content of these urls?

Thanks and regards,
Arjun Kumar Reddy
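
For what it's worth, the raw fetched content lives in the segments rather than
in the crawldb, and can be inspected with the segment reader. A sketch (the
segment name is illustrative; list crawl/segments/ to see the real ones):

bin/nutch readseg -dump crawl/segments/20110126120000 segdump

Here, though, the stats above already hint at the problem: 41 of the 43 URL's
ended up as db_redir_perm, which suggests only redirect responses were stored.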


Re: Few questions from a newbie

2011-01-26 Thread Estrada Groups
You probably have to literally click on each URL to get the URL it's 
referencing. Those are URL shorteners  and probably won't play nicely with a 
crawler because of the redirection.

Adam

Sent from my iPhone

On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy  
wrote:

> Hi list,
> 
> I have given the set of urls as
> 
> http://is.gd/Jt32Cf
> http://is.gd/hS3lEJ
> http://is.gd/Jy1Im3
> http://is.gd/QoJ8xy
> http://is.gd/e4ct89
> http://is.gd/WAOVmd
> http://is.gd/lhkA69
> http://is.gd/3OilLD
> . 43 such urls
> 
> And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth 3
> 
> *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> *CrawlDb statistics start: crawl/crawldb*
> *Statistics for CrawlDb: crawl/crawldb*
> *TOTAL urls: 43*
> *retry 0: 43*
> *min score: 1.0*
> *avg score: 1.0*
> *max score: 1.0*
> *status 3 (db_gone): 1*
> *status 4 (db_redir_temp): 1*
> *status 5 (db_redir_perm): 41*
> *CrawlDb statistics: done*
> 
> When I am trying to read the content from the segments, the content block is
> empty for every record.
> 
> Can you please tell me where I can get the content of these urls.
> 
> Thanks and regards,*
> *Arjun Kumar Reddy


Re: Archiving Audio and Video

2011-01-26 Thread Estrada Groups
Thanks Gora! I am interested in searching through the text from these audio
and video streams. An example would be a 911 dispatch call and maybe even all
the recorded official chatter about it. That is just a random use case I can
think of this morning.

Thanks,

Adam

Sent from my iPhone

On Jan 26, 2011, at 1:02 AM, Gora Mohanty  wrote:

> On Wed, Jan 26, 2011 at 9:15 AM, Adam Estrada
>  wrote:
>> Curious...I have been using Nutch for a while now and have never tried to 
>> index any audio or video formats. Is it feasible to grab the audio out of 
>> both forms of media and then index it? I believe this would require some 
>> kind of transcription which may be out of reach on this project.
> [...]
> 
> One should be able to serialize/de-serialize audio, and video streams
> with ffmpeg, but what is your use case here, i.e., what are you planning
> to do with the indexed content?
> 
> Regards,
> Gora


Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
I am developing an application based on twitter feeds...so 90% of the url's
will be short urls.
So, it is difficult for me to manually convert all these urls to actual
urls. Do we have any other solution for this?


Thanks and regards,
Arjun Kumar Reddy


On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
estrada.adam.gro...@gmail.com> wrote:

> You probably have to literally click on each URL to get the URL it's
> referencing. Those are URL shorteners  and probably won't play nicely with a
> crawler because of the redirection.
>
> Adam
>
> Sent from my iPhone
>
> On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> charjunkumar.re...@iiitb.net> wrote:
>
> > Hi list,
> >
> > I have given the set of urls as
> >
> > http://is.gd/Jt32Cf
> > http://is.gd/hS3lEJ
> > http://is.gd/Jy1Im3
> > http://is.gd/QoJ8xy
> > http://is.gd/e4ct89
> > http://is.gd/WAOVmd
> > http://is.gd/lhkA69
> > http://is.gd/3OilLD
> > . 43 such urls
> >
> > And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth
> 3
> >
> > *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> > *CrawlDb statistics start: crawl/crawldb*
> > *Statistics for CrawlDb: crawl/crawldb*
> > *TOTAL urls: 43*
> > *retry 0: 43*
> > *min score: 1.0*
> > *avg score: 1.0*
> > *max score: 1.0*
> > *status 3 (db_gone): 1*
> > *status 4 (db_redir_temp): 1*
> > *status 5 (db_redir_perm): 41*
> > *CrawlDb statistics: done*
> >
> > When I am trying to read the content from the segments, the content block
> is
> > empty for every record.
> >
> > Can you please tell me where I can get the content of these urls.
> >
> > Thanks and regards,*
> > *Arjun Kumar Reddy
>


Re: Archiving Audio and Video

2011-01-26 Thread Gora Mohanty
On Wed, Jan 26, 2011 at 7:17 PM, Estrada Groups
 wrote:
> Thanks Gora! I am interested I'm searching through the text from these audio 
> and video streams. An example would be a 911 dispatch call and maybe even all 
> the recorded official chatter about it. That is just a random use case I can 
> think of this morning.
[...]

OK, in that case, you will need to use a speech-to-text library, as Julien has
already suggested.

Sounds like an interesting application. I have not used open-source speech-
to-text libraries much, but people say good things about CMU Sphinx
( http://cmusphinx.sourceforge.net/ ).

Regards,
Gora


Antwort: Re: Few questions from a newbie

2011-01-26 Thread Mike Zuehlke
Hi Arjun,

Nutch handles redirects by itself - like the return codes 301 and 302.

Did you check how many redirects you have to follow until you get
HTTP 200 (OK)?
I think there are four redirects needed to get the given url content. So
you have to increase the depth for your crawling.

Regards
Mike




Von:Arjun Kumar Reddy 
An: user@nutch.apache.org
Datum:  26.01.2011 15:43
Betreff:Re: Few questions from a newbie



I am developing an application based on twitter feeds...so 90% of the 
url's
will be short urls.
So, it is difficult for me to manually convert all these urls to actual
urls. Do we have any other solution for this?


Thanks and regards,
Arjun Kumar Reddy


On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
estrada.adam.gro...@gmail.com> wrote:

> You probably have to literally click on each URL to get the URL it's
> referencing. Those are URL shorteners  and probably won't play nicely 
with a
> crawler because of the redirection.
>
> Adam
>
> Sent from my iPhone
>
> On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> charjunkumar.re...@iiitb.net> wrote:
>
> > Hi list,
> >
> > I have given the set of urls as
> >
> > http://is.gd/Jt32Cf
> > http://is.gd/hS3lEJ
> > http://is.gd/Jy1Im3
> > http://is.gd/QoJ8xy
> > http://is.gd/e4ct89
> > http://is.gd/WAOVmd
> > http://is.gd/lhkA69
> > http://is.gd/3OilLD
> > . 43 such urls
> >
> > And I have run the crawl command bin/nutch crawl urls/ -dir crawl 
-depth
> 3
> >
> > *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> > *CrawlDb statistics start: crawl/crawldb*
> > *Statistics for CrawlDb: crawl/crawldb*
> > *TOTAL urls: 43*
> > *retry 0: 43*
> > *min score: 1.0*
> > *avg score: 1.0*
> > *max score: 1.0*
> > *status 3 (db_gone): 1*
> > *status 4 (db_redir_temp): 1*
> > *status 5 (db_redir_perm): 41*
> > *CrawlDb statistics: done*
> >
> > When I am trying to read the content from the segments, the content 
block
> is
> > empty for every record.
> >
> > Can you please tell me where I can get the content of these urls.
> >
> > Thanks and regards,*
> > *Arjun Kumar Reddy
>







Re: Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
Hi Mike,

Actually, in my application I am working on twitter feeds, where I am
filtering the tweets that contain links and storing the contents of the
links. I am maintaining all such links in the urls file, giving it as an
input to the nutch crawler. Here, I am not bothered about the inlinks or
outlinks of that particular link.

So, at first I have given the depth as 1 and later on increased to 3. If I
increase the depth, I can prevent the unwanted crawls. So is there any other
solution for this?

I have also changed the number-of-redirects configuration parameter to 4 in
the nutch-default.xml file.

Thanks and regards,
Ch. Arjun Kumar Reddy
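
A sketch of that redirect setting, assuming it is the http.redirect.max
property from nutch-default.xml; overriding it in nutch-site.xml is the usual
practice rather than editing nutch-default.xml directly:

<property>
  <name>http.redirect.max</name>
  <value>4</value>
  <description>Number of redirects the fetcher follows immediately; with a
  value of 0, redirects are only recorded for a later fetch round, which is
  why extra crawl depth is otherwise needed.</description>
</property>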


On Wed, Jan 26, 2011 at 8:28 PM, Mike Zuehlke wrote:

> Hi Arjun,
>
> nutch handles redirect by itself - like the return codes 301 and 302.
>
> Did you check how much redirects you have to follow until you get
> HTTP_ACCESS (200).
> I think there are four redirects needed to get the given url content. So
> you have to increase the depth for your crawling.
>
> Regards
> Mike
>
>
>
>
> Von:Arjun Kumar Reddy 
> An: user@nutch.apache.org
> Datum:  26.01.2011 15:43
> Betreff:Re: Few questions from a newbie
>
>
>
> I am developing an application based on twitter feeds...so 90% of the
> url's
> will be short urls.
> So, it is difficult for me to manually convert all these urls to actual
> urls. Do we have any other solution for this?
>
>
> Thanks and regards,
> Arjun Kumar Reddy
>
>
> On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
> estrada.adam.gro...@gmail.com> wrote:
>
> > You probably have to literally click on each URL to get the URL it's
> > referencing. Those are URL shorteners  and probably won't play nicely
> with a
> > crawler because of the redirection.
> >
> > Adam
> >
> > Sent from my iPhone
> >
> > On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> > charjunkumar.re...@iiitb.net> wrote:
> >
> > > Hi list,
> > >
> > > I have given the set of urls as
> > >
> > > http://is.gd/Jt32Cf
> > > http://is.gd/hS3lEJ
> > > http://is.gd/Jy1Im3
> > > http://is.gd/QoJ8xy
> > > http://is.gd/e4ct89
> > > http://is.gd/WAOVmd
> > > http://is.gd/lhkA69
> > > http://is.gd/3OilLD
> > > . 43 such urls
> > >
> > > And I have run the crawl command bin/nutch crawl urls/ -dir crawl
> -depth
> > 3
> > >
> > > *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> > > *CrawlDb statistics start: crawl/crawldb*
> > > *Statistics for CrawlDb: crawl/crawldb*
> > > *TOTAL urls: 43*
> > > *retry 0: 43*
> > > *min score: 1.0*
> > > *avg score: 1.0*
> > > *max score: 1.0*
> > > *status 3 (db_gone): 1*
> > > *status 4 (db_redir_temp): 1*
> > > *status 5 (db_redir_perm): 41*
> > > *CrawlDb statistics: done*
> > >
> > > When I am trying to read the content from the segments, the content
> block
> > is
> > > empty for every record.
> > >
> > > Can you please tell me where I can get the content of these urls.
> > >
> > > Thanks and regards,*
> > > *Arjun Kumar Reddy
> >
>
>
>
>
>
>


Re: Few questions from a newbie

2011-01-26 Thread Churchill Nanje Mambe
hello
 you have to use the short-url APIs and get the long URLs... it's a bit
complex, as you have to determine whether a url is short, then determine the
url shortening service used, e.g. tinyurl.com, bit.ly or goo.gl, and then you
use their respective API and send in the url, and they will return the long
url... I used this before, but it was a simple PHP-based aggregator and not
nutch


Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
Yea Hi Mambe,

Thanks for the feedback. I have mentioned the details of my application in
the above post.
I have tried doing this crawling job using PHP multi-cURL and I am getting
results which are good enough, but the problem I am facing is that it is
taking a hell of a lot of time to get the contents of the urls. I have done
this without using any API or conversions.

So, in order to crawl within smaller time limits and also to help me scale my
application, I have chosen the Nutch crawler.

Thanks and regards,
Ch. Arjun Kumar Reddy

On Wed, Jan 26, 2011 at 9:19 PM, Churchill Nanje Mambe <
mambena...@afrovisiongroup.com> wrote:

> hello
>  you have to use the short url APIs and get the long URLs... its abit
> complex as you have to determine the url if its short, then determine the
> url shortening service used eg: tinyurl.com bit.ly or goo.gl and then you
> use their respective api and send in the url and they will return the long
> url... I used this before but it was a simple php based aggregator and not
> nutch
>


Re: Archiving Audio and Video

2011-01-26 Thread Adam Estrada
Another example would be the content embedded in this flash movie.

http://digitalmedia.worldbank.org/SSP/lac/investment-in-haiti/

Adam

On Wed, Jan 26, 2011 at 1:02 AM, Gora Mohanty  wrote:
> On Wed, Jan 26, 2011 at 9:15 AM, Adam Estrada
>  wrote:
>> Curious...I have been using Nutch for a while now and have never tried to 
>> index any audio or video formats. Is it feasible to grab the audio out of 
>> both forms of media and then index it? I believe this would require some 
>> kind of transcription which may be out of reach on this project.
> [...]
>
> One should be able to serialize/de-serialize audio, and video streams
> with ffmpeg, but what is your use case here, i.e., what are you planning
> to do with the indexed content?
>
> Regards,
> Gora
>


[Example] Configuration for a Hadoop Cluster

2011-01-26 Thread Adam Estrada
Does anyone have any information on this for use with Nutch?

Thanks,
Adam


Re: Few questions from a newbie

2011-01-26 Thread Churchill Nanje Mambe
Even if the url being crawled is shortened, it will still lead nutch to the
actual link and nutch will fetch it.

Churchill Nanje Mambe
237 77545907,
AfroVisioN Founder, President,CEO
www.camerborn.com/mambenanje
http://www.afrovisiongroup.com | http://mambenanje.blogspot.com
skypeID: mambenanje
www.twitter.com/mambenanje



On Wed, Jan 26, 2011 at 4:56 PM, Arjun Kumar Reddy <
charjunkumar.re...@iiitb.net> wrote:

> Yea Hi Mambe,
>
> Thanks for the feedback. I have mentioned the details of my application in
> the above post.
> I have tried doing this crawling job using php-multi curl and I am getting
> results which are good enough but the problem I am facing is that it is
> taking hell lot of time to get the contents of the urls. I have done this
> without using any API or conversions.
>
> So, in order to crawl in lesser time limits and also helps me to scale my
> application, I have chosen Nutch crawler.
>
> Thanks and regards,*
> *Ch. Arjun Kumar Reddy
>
> On Wed, Jan 26, 2011 at 9:19 PM, Churchill Nanje Mambe <
> mambena...@afrovisiongroup.com> wrote:
>
> > hello
> >  you have to use the short url APIs and get the long URLs... its abit
> > complex as you have to determine the url if its short, then determine the
> > url shortening service used eg: tinyurl.com bit.ly or goo.gl and then
> you
> > use their respective api and send in the url and they will return the
> long
> > url... I used this before but it was a simple php based aggregator and
> not
> > nutch
> >
>




Re: Archiving Audio and Video

2011-01-26 Thread Gora Mohanty
On Wed, Jan 26, 2011 at 7:38 PM, Adam Estrada
 wrote:
> Another example would be the content embedded in this flash movie.
>
> http://digitalmedia.worldbank.org/SSP/lac/investment-in-haiti/
[...]

ffmpeg can pull out audio from video streams, and a working
speech-to-text converter can store the audio as text that one
can search through.
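
As a rough sketch of that extraction step (the file names and flags here are
illustrative; the exact options depend on the input container):

ffmpeg -i input.flv -vn -ac 1 -ar 16000 audio.wav

The resulting mono 16 kHz WAV is the sort of input a speech-to-text engine
would then consume.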

A computer-vision library like openCV also lets one search through
the video itself.

Be warned that much of this stuff might be experimental.

Regards,
Gora


Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Claudio Martella
Today I had a look at the code and wrote this class. It works here on my
test cluster.

It scans the crawldb for entries carrying the STATUS_DB_GONE and it
issues a delete to solr for those entries.

Is that what you guys have in mind? Should I file a JIRA?



On 1/24/11 10:26 AM, Markus Jelsma wrote:
> Each item in the CrawlDB carries a status field. Reading the CrawlDB will 
> return this information as well, the same goes for a complete dump with which 
> you could create the appropriate delete statements for your Solr instance.
>
> 51  /** Page no longer exists. */
> 52  public static final byte STATUS_DB_GONE = 0x03;
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
>
>> Where is that information stored? it could be then easily used to issue
>> deletes on solr.
>>
>> On 1/23/11 10:32 PM, Markus Jelsma wrote:
>>> Nutch can detect 404's by recrawling existing URL's. The mutation,
>>> however, is not pushed to Solr at the moment.
>>>
 As far as I know, Nutch can only discover new URLs to crawl and send the
 parsed content to Solr. But what about maintaining the index? Say that
 you have a daily Nutch script that fetches/parses the web and updates
 the Solr index. After one month, several web pages have been modified
 and some have also been deleted. In other words, the Solr index is out
 of sync.

 Is it possible to detect such changes in order to send update/delete
 commands to Solr?

 It looks like the Aperture crawler has a workaround for this since the
 crawler handler have methods such as objectChanged(...):
 http://sourceforge.net/apps/trac/aperture/wiki/Crawlers

 Erlend


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it


/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.nutch.indexer.solr;

import java.io.IOException;
import java.net.MalformedURLException;
import java.text.SimpleDateFormat;
import java.util.Iterator;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.NullOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;
import org.apache.nutch.util.TimingUtil;
import org.apache.solr.client.solrj.SolrServer;
import org.apach
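
A standalone sketch of the core idea described above (scan the CrawlDb for
STATUS_DB_GONE entries and delete them from Solr). The class name, the
command-line handling and the use of SolrJ's CommonsHttpSolrServer are
assumptions of this sketch, not the attached code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

/** Sketch: delete CrawlDb entries with STATUS_DB_GONE from a Solr index. */
public class SolrDeleteGone {

  public static void main(String[] args) throws Exception {
    // args[0]: a CrawlDb data file, e.g. crawldb/current/part-00000/data
    // args[1]: the Solr core URL, e.g. http://localhost:8983/solr
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    CommonsHttpSolrServer solr = new CommonsHttpSolrServer(args[1]);

    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
        // The page is permanently gone; drop it from the index.
        solr.deleteById(url.toString());
      }
    }
    reader.close();
    solr.commit();
  }
}

On a real cluster a MapReduce job (as in the attachment) is the better fit;
this plain reader is just the minimal illustration of the logic.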

Re: Few questions from a newbie

2011-01-26 Thread alxsss
You can set fetching of external and internal links to false and increase the
depth.
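
The advice above presumably maps to the db.ignore.external.links and
db.ignore.internal.links properties from nutch-default.xml (that mapping is an
assumption). A sketch of a nutch-site.xml override that keeps the crawl from
following links off the seed hosts:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks pointing to a different host are ignored,
  keeping the crawl on the seed hosts.</description>
</property>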
 

 


 

 

-Original Message-
From: Churchill Nanje Mambe 
To: user 
Sent: Wed, Jan 26, 2011 8:03 am
Subject: Re: Few questions from a newbie


even if the url being crawled is shortened, it will still lead nutch to the

actual link and nutch will fetch it




 


Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Markus Jelsma
Hi,

Please open a ticket, I'll test it.

Cheers,

On Wednesday 26 January 2011 18:12:51 Claudio Martella wrote:
> Today I had a look at the code and wrote this class. It works here on my
> test cluster.
> 
> It scans the crawldb for entries carrying the STATUS_DB_GONE and it
> issues a delete to solr for those entries.
> 
> Is that what you guys have in mind? Should i file a JIRA?
> 
> On 1/24/11 10:26 AM, Markus Jelsma wrote:
> > Each item in the CrawlDB carries a status field. Reading the CrawlDB will
> > return this information as well, the same goes for a complete dump with
> > which you could create the appropriate delete statements for your Solr
> > instance.
> > 
> > 51  /** Page no longer exists. */
> > 52  public static final byte STATUS_DB_GONE = 0x03;
> > 
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apach
> > e/nutch/crawl/CrawlDatum.java?view=markup
> > 
> >> Where is that information stored? it could be then easily used to issue
> >> deletes on solr.
> >> 
> >> On 1/23/11 10:32 PM, Markus Jelsma wrote:
> >>> Nutch can detect 404's by recrawling existing URL's. The mutation,
> >>> however, is not pushed to Solr at the moment.
> >>> 
>  As far as I know, Nutch can only discover new URLs to crawl and send
>  the parsed content to Solr. But what about maintaining the index? Say
>  that you have a daily Nutch script that fetches/parses the web and
>  updates the Solr index. After one month, several web pages have been
>  modified and some have also been deleted. In other words, the Solr
>  index is out of sync.
>  
>  Is it possible to detect such changes in order to send update/delete
>  commands to Solr?
>  
>  It looks like the Aperture crawler has a workaround for this since the
>  crawler handler have methods such as objectChanged(...):
>  http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
>  
>  Erlend

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Claudio Martella
Done.

https://issues.apache.org/jira/browse/NUTCH-963

On 1/26/11 6:30 PM, Markus Jelsma wrote:
> Hi,
>
> Please open a ticket, i'll test it.
>
> Cheers,
>
> On Wednesday 26 January 2011 18:12:51 Claudio Martella wrote:
>> Today I had a look at the code and wrote this class. It works here on my
>> test cluster.
>>
>> It scans the crawldb for entries carrying the STATUS_DB_GONE and it
>> issues a delete to solr for those entries.
>>
>> Is that what you guys have in mind? Should i file a JIRA?
>>
>> On 1/24/11 10:26 AM, Markus Jelsma wrote:
>>> Each item in the CrawlDB carries a status field. Reading the CrawlDB will
>>> return this information as well, the same goes for a complete dump with
>>> which you could create the appropriate delete statements for your Solr
>>> instance.
>>>
>>> 51  /** Page no longer exists. */
>>> 52  public static final byte STATUS_DB_GONE = 0x03;
>>>
>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apach
>>> e/nutch/crawl/CrawlDatum.java?view=markup
>>>
 Where is that information stored? it could be then easily used to issue
 deletes on solr.

 On 1/23/11 10:32 PM, Markus Jelsma wrote:
> Nutch can detect 404's by recrawling existing URL's. The mutation,
> however, is not pushed to Solr at the moment.
>
>> As far as I know, Nutch can only discover new URLs to crawl and send
>> the parsed content to Solr. But what about maintaining the index? Say
>> that you have a daily Nutch script that fetches/parses the web and
>> updates the Solr index. After one month, several web pages have been
>> modified and some have also been deleted. In other words, the Solr
>> index is out of sync.
>>
>> Is it possible to detect such changes in order to send update/delete
>> commands to Solr?
>>
>> It looks like the Aperture crawler has a workaround for this since the
>> crawler handler have methods such as objectChanged(...):
>> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
>>
>> Erlend


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it





Webserver configuration to successfully get modified time?

2011-01-26 Thread Joshua J Pavel


We've been crawling with nutch and deleting the crawldb between crawls.  I
believe I've managed to get my recrawl script to finally work, but I was
disappointed to see that in my db, the modified time of all of my pages is
Jan 1 1970.   Since I control both the crawler and the web server in our
setup, is there some setting that we can set to enable Nutch to
successfully get the modified time for the pages?  I want to reduce the
number of fetches as much as possible.

Thanks!

RE: [Example] Configuration for a Hadoop Cluster

2011-01-26 Thread Chris Woolum
This is a helpful wiki link:

http://wiki.apache.org/nutch/NutchHadoopTutorial

Chris

-Original Message-
From: estrada.a...@gmail.com [mailto:estrada.a...@gmail.com] On Behalf Of Adam 
Estrada
Sent: Wednesday, January 26, 2011 7:31 AM
To: user@nutch.apache.org
Subject: [Example] Configuration for a Hadoop Cluster

Does anyone have any information on this for use with Nutch?

Thanks,
Adam


Re: Archiving Audio and Video

2011-01-26 Thread Adam Estrada
Thank you very much for the info!

Adam

On Wed, Jan 26, 2011 at 11:37 AM, Gora Mohanty  wrote:
> On Wed, Jan 26, 2011 at 7:38 PM, Adam Estrada
>  wrote:
>> Another example would be the content embedded in this flash movie.
>>
>> http://digitalmedia.worldbank.org/SSP/lac/investment-in-haiti/
> [...]
>
> ffmpeg can pull out audio from video streams, and a working
> speech-to-text converter can store the audio as text that one
> can search through.
>
> A computer-vision library like openCV also lets one search through
> the video itself.
>
> Be warned that much of this stuff might be experimental.
>
> Regards,
> Gora
>


Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Erlend Garåsen


But we also need to detect modified documents in order to trigger an 
update command to Solr (an improvement of SolrIndexer). I was planning 
to open a Jira issue on this missing functionality this week.


Erlend

On 26.01.11 18.12, Claudio Martella wrote:

Today I had a look at the code and wrote this class. It works here on my
test cluster.

It scans the crawldb for entries carrying the STATUS_DB_GONE and it
issues a delete to solr for those entries.

Is that what you guys have in mind? Should i file a JIRA?



On 1/24/11 10:26 AM, Markus Jelsma wrote:

Each item in the CrawlDB carries a status field. Reading the CrawlDB will
return this information as well, the same goes for a complete dump with which
you could create the appropriate delete statements for your Solr instance.

51  /** Page no longer exists. */
52  public static final byte STATUS_DB_GONE = 0x03;

http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup


Where is that information stored? it could be then easily used to issue
deletes on solr.

On 1/23/11 10:32 PM, Markus Jelsma wrote:

Nutch can detect 404's by recrawling existing URL's. The mutation,
however, is not pushed to Solr at the moment.


As far as I know, Nutch can only discover new URLs to crawl and send the
parsed content to Solr. But what about maintaining the index? Say that
you have a daily Nutch script that fetches/parses the web and updates
the Solr index. After one month, several web pages have been modified
and some have also been deleted. In other words, the Solr index is out
of sync.

Is it possible to detect such changes in order to send update/delete
commands to Solr?

It looks like the Aperture crawler has a workaround for this since the
crawler handler have methods such as objectChanged(...):
http://sourceforge.net/apps/trac/aperture/wiki/Crawlers

Erlend






--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Markus Jelsma
This is default behaviour. If pages are scheduled for fetching they will show 
up in the next segment. If you index that segment the old document in Solr is 
overwritten.

> But we also need to detect modified documents in order to trigger an
> update command to Solr (an improvement of SolrIndexer). I was planning
> to open a Jira issue on this missing functionality this week.
> 
> Erlend
> 
> On 26.01.11 18.12, Claudio Martella wrote:
> > Today I had a look at the code and wrote this class. It works here on my
> > test cluster.
> > 
> > It scans the crawldb for entries carrying the STATUS_DB_GONE and it
> > issues a delete to solr for those entries.
> > 
> > Is that what you guys have in mind? Should i file a JIRA?
> > 
> > On 1/24/11 10:26 AM, Markus Jelsma wrote:
> >> Each item in the CrawlDB carries a status field. Reading the CrawlDB
> >> will return this information as well, the same goes for a complete dump
> >> with which you could create the appropriate delete statements for your
> >> Solr instance.
> >> 
> >> 51 /** Page no longer exists. */
> >> 52 public static final byte STATUS_DB_GONE = 0x03;
> >> 
> >> http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apac
> >> he/nutch/crawl/CrawlDatum.java?view=markup
> >> 
> >>> Where is that information stored? it could be then easily used to issue
> >>> deletes on solr.
> >>> 
> >>> On 1/23/11 10:32 PM, Markus Jelsma wrote:
>  Nutch can detect 404's by recrawling existing URL's. The mutation,
>  however, is not pushed to Solr at the moment.
>  
> > As far as I know, Nutch can only discover new URLs to crawl and send
> > the parsed content to Solr. But what about maintaining the index?
> > Say that you have a daily Nutch script that fetches/parses the web
> > and updates the Solr index. After one month, several web pages have
> > been modified and some have also been deleted. In other words, the
> > Solr index is out of sync.
> > 
> > Is it possible to detect such changes in order to send update/delete
> > commands to Solr?
> > 
> > It looks like the Aperture crawler has a workaround for this since
> > the crawler handler have methods such as objectChanged(...):
> > http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
> > 
> > Erlend


Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Markus Jelsma
See this post in a recent thread:
http://search.lucidimagination.com/search/document/5b7ba8a6fc5e0305/few_questions_from_a_newbie

> This is default behaviour. If pages are scheduled for fetching they will
> show up in the next segment. If you index that segment the old document in
> Solr is overwritten.
> 
> > But we also need to detect modified documents in order to trigger an
> > update command to Solr (an improvement of SolrIndexer). I was planning
> > to open a Jira issue on this missing functionality this week.
> > 
> > Erlend
> > 
> > On 26.01.11 18.12, Claudio Martella wrote:
> > > Today I had a look at the code and wrote this class. It works here on
> > > my test cluster.
> > > 
> > > It scans the crawldb for entries carrying the STATUS_DB_GONE and it
> > > issues a delete to solr for those entries.
> > > 
> > > Is that what you guys have in mind? Should i file a JIRA?
> > > 
> > > On 1/24/11 10:26 AM, Markus Jelsma wrote:
> > >> Each item in the CrawlDB carries a status field. Reading the CrawlDB
> > >> will return this information as well, the same goes for a complete
> > >> dump with which you could create the appropriate delete statements
> > >> for your Solr instance.
> > >> 
> > >> 51   /** Page no longer exists. */
> > >> 52   public static final byte STATUS_DB_GONE = 0x03;
> > >> 
> > >> http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/ap
> > >> ac he/nutch/crawl/CrawlDatum.java?view=markup
> > >> 
> > >>> Where is that information stored? it could be then easily used to
> > >>> issue deletes on solr.
> > >>> 
> > >>> On 1/23/11 10:32 PM, Markus Jelsma wrote:
> >  Nutch can detect 404's by recrawling existing URL's. The mutation,
> >  however, is not pushed to Solr at the moment.
> >  
> > > As far as I know, Nutch can only discover new URLs to crawl and
> > > send the parsed content to Solr. But what about maintaining the
> > > index? Say that you have a daily Nutch script that fetches/parses
> > > the web and updates the Solr index. After one month, several web
> > > pages have been modified and some have also been deleted. In other
> > > words, the Solr index is out of sync.
> > > 
> > > Is it possible to detect such changes in order to send
> > > update/delete commands to Solr?
> > > 
> > > It looks like the Aperture crawler has a workaround for this since
> > > the crawler handler have methods such as objectChanged(...):
> > > http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
> > > 
> > > Erlend


Re: Can Nucth detect modified and deleted URLs?

2011-01-26 Thread Claudio Martella
Yes, absolutely.

The only optimization we could make here would be to send to SOLR only
updates about documents we know for sure have changed (i.e. based on
digests, like the deduplication code). I'm not sure how SOLR behaves if
you send an update with no change in the document.
I'm sure they pretty much do the same internally, so I guess what we'd
minimize is only the transmission of the update.

On 1/26/11 9:16 PM, Markus Jelsma wrote:
> This is default behaviour. If pages are scheduled for fetching they will show 
> up in the next segment. If you index that segment the old document in Solr is 
> overwritten.
>
>> But we also need to detect modified documents in order to trigger an
>> update command to Solr (an improvement of SolrIndexer). I was planning
>> to open a Jira issue on this missing functionality this week.
>>
>> Erlend
>>
>> On 26.01.11 18.12, Claudio Martella wrote:
>>> Today I had a look at the code and wrote this class. It works here on my
>>> test cluster.
>>>
>>> It scans the crawldb for entries carrying the STATUS_DB_GONE and it
>>> issues a delete to solr for those entries.
>>>
>>> Is that what you guys have in mind? Should i file a JIRA?
>>>
>>> On 1/24/11 10:26 AM, Markus Jelsma wrote:
 Each item in the CrawlDB carries a status field. Reading the CrawlDB
 will return this information as well, the same goes for a complete dump
 with which you could create the appropriate delete statements for your
 Solr instance.

 51 /** Page no longer exists. */
 52 public static final byte STATUS_DB_GONE = 0x03;

 http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apac
 he/nutch/crawl/CrawlDatum.java?view=markup

> Where is that information stored? it could be then easily used to issue
> deletes on solr.
>
> On 1/23/11 10:32 PM, Markus Jelsma wrote:
>> Nutch can detect 404's by recrawling existing URL's. The mutation,
>> however, is not pushed to Solr at the moment.
>>
>>> As far as I know, Nutch can only discover new URLs to crawl and send
>>> the parsed content to Solr. But what about maintaining the index?
>>> Say that you have a daily Nutch script that fetches/parses the web
>>> and updates the Solr index. After one month, several web pages have
>>> been modified and some have also been deleted. In other words, the
>>> Solr index is out of sync.
>>>
>>> Is it possible to detect such changes in order to send update/delete
>>> commands to Solr?
>>>
>>> It looks like the Aperture crawler has a workaround for this since
>>> the crawler handler have methods such as objectChanged(...):
>>> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
>>>
>>> Erlend


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it
