RE: Hadoop Tutorial

2011-01-25 Thread McGibbney, Lewis John
In getting Nutch 1.2 up and running with Hadoop, should I be using the tutorial 
on the Nutch wiki (1), this tutorial link (2), or the link to the Hadoop cluster 
setup (3)?

I am using 2 desktops (one running Vista and one running XP) with Cygwin to 
execute commands, and wish to experiment with running the Hadoop tutorial. I have 
OpenSSH installed with my Cygwin installation and can start the sshd service fine 
on both desktops.

The problem begins when I am about a quarter of the way through tutorial (1), when 
attempting to run the following

ssh -l root Mcgibbney-PC
mkdir /nutch
mkdir /nutch/search
mkdir /nutch/filesystem
mkdir /nutch/local
mkdir /nutch/home
groupadd users
useradd -d /nutch/home -g users nutch
chown -R nutch:users /nutch
passwd nutch nutchuserpassword

as my output is as follows:

$ ssh -1 root Mcgibbney-PC
ssh: Could not resolve hostname root: hostname nor servname provided, or not 
known

Can anyone provide insight into how I can get past this hurdle?

(1) http://wiki.apache.org/nutch/NutchHadoopTutorial
(2) 
http://wiki.apache.org/nutch/Nutch0.9-Hadoop0.10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29
(3) http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html


--
Sorry, the file name should be mapred-site.xml. If it doesn't exist, you can
create one and add the property as defined below.

Also, in older versions the configuration could be defined in hadoop-site.xml, which
is deprecated in hadoop-0.20. If you have already configured it in
hadoop-site.xml, just make sure the following entry is in the right format:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>HOST:PORT</value>
  </property>
</configuration>

Link to the Hadoop cluster setup documentation:

http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html

Thanks,
Charan

On Mon, Jan 24, 2011 at 12:00 PM, McGibbney, Lewis John <
lewis.mcgibb...@gcu.ac.uk> wrote:

> Hi Charan
>
> I have not touched mapred-config.xml and don't appear to have this file in
> my conf directory within 1.2 dist!
>
> 
> From: Charan K [charan.ku...@gmail.com]
> Sent: 24 January 2011 17:49
> To: user@nutch.apache.org
> Subject: Re: Hadoop Tutorial
>
> Hi
>  Can you verify  if you set the job tracker entry in right format  in
> mapred-config.xml?
>
> Thanks,
> Charan
>
> Sent from my iPhone
>
> On Jan 24, 2011, at 9:42 AM, "McGibbney, Lewis John" <
> lewis.mcgibb...@gcu.ac.uk> wrote:
>
> > Hi list,
> >
> > I am using Nutch 1.2 and currently working my way through the Nutch and
> Hadoop tutorial on the wiki for the first time. Not having much luck to date
> and have reached the following part "So log into the master nodes and all of
> the slave nodes as root." which I do not understand. Under
> /logs/hadoop-...-jobtracker-...log I get the following
> >
> >
> > 2011-01-24 16:55:08,008 ERROR mapred.JobTracker -
> java.lang.RuntimeException: Not a host:port pair: local
> >
> > at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:136)
> >
> > at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
> >
> > at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1807)
> >
> > at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1579)
> >
> > at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
> >
> > at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
> >
> > at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)
> >
> > 2011-01-24 16:57:01,233 ERROR mapred.JobTracker -
> java.lang.RuntimeException: Not a host:port pair: local
> >
> > at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:136)
> >
> > at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
> >
> > at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1807)
> >
> > at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1579)
> >
> > at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
> >
> > at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
> >
> > at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)
> >
> >
> >
> > I am pretty much stuck here so any help would be appreciated. Please
> state whether I need to provide more information if this is not sufficient.
> >
> >
> >
> > Thanks
> >
> > Lewis
> >

Re: Hadoop Tutorial

2011-01-25 Thread Tanguy Moal
Hello,
-1 tells ssh to try protocol version 1 of SSH, which is not what you want to
do.
You want to log in as the root user, so you should use '-l' (a lowercase 'L', as
in 'login') as the option.
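In other words, the command from the tutorial should be typed with the lowercase
letter, not the digit:

$ ssh -l root Mcgibbney-PC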

Alternatively, you could do:
$ ssh root@Mcgibbney-PC

Provided Mcgibbney-PC resolves as a hostname (try "ping Mcgibbney-PC"; if
that works, you should be fine).


Hope this helps,

--
Tanguy

2011/1/25 McGibbney, Lewis John 

> In getting Nutch 1.2 up and running with Hadoop should I be using the
> tutorial on the Nutch wiki (1), this tutorial link (2) or the link to the
> hadoop cluster setup (3)
>
> I am using 2 desktops (one running Vista and one running XP) with Cygwin to
> execute commands and wish to experiment running a hadoop tutorial. I have
> Openssh installed with my Cygwin installation and can start sshd service
> fine on both desktops.
>
> Problem begins when I hit a quarter of the way through tutorial (1) when
> attempting to run the following
>
> ssh -l root Mcgibbney-PC
> mkdir /nutch
> mkdir /nutch/search
> mkdir /nutch/filesystem
> mkdir /nutch/local
> mkdir /nutch/home
> groupadd users
> useradd -d /nutch/home -g users nutch
> chown -R nutch:users /nutch
> passwd nutch nutchuserpassword
>
> as my output is as follows
>
> $ ssh -1 root Mcgibbney-PC
> ssh: Could not resolve hostname root: hostname nor servname provided, or
> not known
>
> Can anyone provide insight into how I can get past this hurdle?
>
> (1) http://wiki.apache.org/nutch/NutchHadoopTutorial
> (2)
> http://wiki.apache.org/nutch/Nutch0.9-Hadoop0.10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29
> (3) http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
>
>
> --
> Sorry, the file name should be mapred-site.xml. If it doesn't exist, you can
> create one and add the property as defined below.
>
> Also, in older versions the configuration could be defined in hadoop-site.xml, which
> is deprecated in hadoop-0.20. If you have already configured it in
> hadoop-site.xml, just make sure the following entry is in the right format:
>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>HOST:PORT</value>
>   </property>
> </configuration>
>
>
> Link to the Hadoop cluster setup documentation:
>
> http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
>
> Thanks,
> Charan
>
> On Mon, Jan 24, 2011 at 12:00 PM, McGibbney, Lewis John <
> lewis.mcgibb...@gcu.ac.uk> wrote:
>
> > Hi Charan
> >
> > I have not touched mapred-config.xml and don't appear to have this file
> in
> > my conf directory within 1.2 dist!
> >
> > 
> > From: Charan K [charan.ku...@gmail.com]
> > Sent: 24 January 2011 17:49
> > To: user@nutch.apache.org
> > Subject: Re: Hadoop Tutorial
> >
> > Hi
> >  Can you verify  if you set the job tracker entry in right format  in
> > mapred-config.xml?
> >
> > Thanks,
> > Charan
> >
> > Sent from my iPhone
> >
> > On Jan 24, 2011, at 9:42 AM, "McGibbney, Lewis John" <
> > lewis.mcgibb...@gcu.ac.uk> wrote:
> >
> > > Hi list,
> > >
> > > I am using Nutch 1.2 and currently working my way through the Nutch and
> > Hadoop tutorial on the wiki for the first time. Not having much luck to
> date
> > and have reached the following part "So log into the master nodes and all
> of
> > the slave nodes as root." which I do not understand. Under
> > /logs/hadoop-...-jobtracker-...log I get the following
> > >
> > >
> > > 2011-01-24 16:55:08,008 ERROR mapred.JobTracker -
> > java.lang.RuntimeException: Not a host:port pair: local
> > >
> > > at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:136)
> > >
> > > at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
> > >
> > > at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1807)
> > >
> > > at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1579)
> > >
> > > at
> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
> > >
> > > at
> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
> > >
> > > at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)
> > >
> > > 2011-01-24 16:57:01,233 ERROR mapred.JobTracker -
> > java.lang.RuntimeException: Not a host:port pair: local
> > >
> > > at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:136)
> > >
> > > at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
> > >
> > > at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1807)
> > >
> > > at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1579)
> > >
> > > at
> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
> > >
> > > at
> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
> > >
> > > at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)
> > >
> > >
> > >
> > > I am pretty much stuck here so any help would be appreciated. Please
> > state whether I need to provide more information if this is not
> sufficient.
> > >
> > >
> > >
> > > Thanks

Regarding crawling of short URL's

2011-01-25 Thread Arjun Kumar Reddy
Hi,

My application needs to crawl a set of URLs which I give to the urls
directory, and fetch only the contents of those URLs.
I am not interested in the contents of the internal or external links,
so I have run the crawl command with a depth of 1.

bin/nutch crawl urls -dir crawl -depth 1

Nutch crawls the URLs and gives me the contents of the given URLs.

I am reading the content using the readseg utility.

bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch
-nogenerate -noparse -noparsedata

With this I am fetching the content of the webpage.

The problem I am facing is that if I give direct URLs like

http://isoc.org/wp/worldipv6day/
http://openhackindia.eventbrite.com
http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
http://bangalore.yahoo.com/labs/summerschool.html
http://riadevcamp.eventbrite.com
http://www.sleepingtime.org/

then I am able to get the contents of the webpage.
But when I give the set of URLs as short URLs like

http://is.gd/jOoAa9
http://is.gd/ubHRAF
http://is.gd/GiFqj9
http://is.gd/H5rUhg
http://is.gd/wvKINL
http://is.gd/K6jTNl
http://is.gd/mpa6fr
http://is.gd/fmobvj
http://is.gd/s7uZfr

I am not able to fetch the contents.

When I read the segments, they do not show any content. Please find below
the content of the dump file read from the segments.

Recno:: 0
URL:: http://is.gd/0yKjO6

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1295969171407

Content::
Version: -1
url: http://is.gd/0yKjO6
base: http://is.gd/0yKjO6
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=
http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36
nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8
Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:


Recno:: 1
URL:: http://is.gd/1tpKaN

Content::
Version: -1
url: http://is.gd/1tpKaN
base: http://is.gd/1tpKaN
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=
http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36
nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8
Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0


I have also tried setting the max.redirects property in nutch-default.xml
to 4, but didn't see any progress.
Kindly provide me with a solution to this problem.

Thanks and regards,
Ch. Arjun Kumar Reddy



RE: Hadoop Tutorial

2011-01-25 Thread McGibbney, Lewis John
Hi Tanguy

Using your first suggestion I am prompted for a password (which I have not set 
up; after creating the private and public SSH keys using the command ssh-keygen -t 
dsa, I left the passphrase blank). Regarding your alternative suggestion of using 
ping, I receive output showing 4 packets sent, 4 packets received and 0% loss.

I am aware that getting SSH up and running is a prerequisite for running Nutch 
with Hadoop. My problem seems to be that I can't configure passwordless login. 
I have been looking for some resources to get this sorted out; the best one I 
could find was this one:

http://inside.mines.edu/~gmurray/HowTo/sshNotes.html
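
As far as I understand that page, the procedure is roughly the following (a sketch;
I am assuming the key pair lives in ~/.ssh/id_dsa under the Cygwin home directory
and that the remote account is the nutch user on Mcgibbney-PC, so adjust names to
suit your setup):

$ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub | ssh nutch@Mcgibbney-PC "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys; chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys"
$ ssh nutch@Mcgibbney-PC hostname

The last command should not prompt for a password if everything is set up
correctly. Is that roughly right?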

Are there any more suggestions out there as to how I can get this configured?

Lewis


From: Tanguy Moal [tanguy.m...@gmail.com]
Sent: 25 January 2011 13:23
To: user@nutch.apache.org
Subject: Re: Hadoop Tutorial

Hello,
-1 tells ssh to try protocol version 1 of SSH, which is not what you want to
do.
You want to log in as the root user, so you should use '-l' (a lowercase 'L', as
in 'login') as the option.

Alternatively, you could do :
$ ssh root@Mcgibbney-PC

Provided Mcgibbney-PC is resolved as a hostname (try "ping Mcgibbney-PC", if
that works, you should be fine)


Hope this helps,

--
Tanguy

2011/1/25 McGibbney, Lewis John 

> In getting Nutch 1.2 up and running with Hadoop should I be using the
> tutorial on the Nutch wiki (1), this tutorial link (2) or the link to the
> hadoop cluster setup (3)
>
> I am using 2 desktops (one running Vista and one running XP) with Cygwin to
> execute commands and wish to experiment running a hadoop tutorial. I have
> Openssh installed with my Cygwin installation and can start sshd service
> fine on both desktops.
>
> Problem begins when I hit a quarter of the way through tutorial (1) when
> attempting to run the following
>
> ssh -l root Mcgibbney-PC
> mkdir /nutch
> mkdir /nutch/search
> mkdir /nutch/filesystem
> mkdir /nutch/local
> mkdir /nutch/home
> groupadd users
> useradd -d /nutch/home -g users nutch
> chown -R nutch:users /nutch
> passwd nutch nutchuserpassword
>
> as my output is as follows
>
> $ ssh -1 root Mcgibbney-PC
> ssh: Could not resolve hostname root: hostname nor servname provided, or
> not known
>
> Can anyone provide insight into how I can get past this hurdle?
>
> (1) http://wiki.apache.org/nutch/NutchHadoopTutorial
> (2)
> http://wiki.apache.org/nutch/Nutch0.9-Hadoop0.10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29
> (3) http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
>
>
> --
> Sorry, the file name should be mapred-site.xml. If it doesn't exist, you can
> create one and add the property as defined below.
>
> Also, in older versions the configuration could be defined in hadoop-site.xml, which
> is deprecated in hadoop-0.20. If you have already configured it in
> hadoop-site.xml, just make sure the following entry is in the right format:
>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>HOST:PORT</value>
>   </property>
> </configuration>
>
>
> Link to the Hadoop cluster setup documentation:
>
> http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
>
> Thanks,
> Charan
>
> On Mon, Jan 24, 2011 at 12:00 PM, McGibbney, Lewis John <
> lewis.mcgibb...@gcu.ac.uk> wrote:
>
> > Hi Charan
> >
> > I have not touched mapred-config.xml and don't appear to have this file
> in
> > my conf directory within 1.2 dist!
> >
> > 
> > From: Charan K [charan.ku...@gmail.com]
> > Sent: 24 January 2011 17:49
> > To: user@nutch.apache.org
> > Subject: Re: Hadoop Tutorial
> >
> > Hi
> >  Can you verify  if you set the job tracker entry in right format  in
> > mapred-config.xml?
> >
> > Thanks,
> > Charan
> >
> > Sent from my iPhone
> >
> > On Jan 24, 2011, at 9:42 AM, "McGibbney, Lewis John" <
> > lewis.mcgibb...@gcu.ac.uk> wrote:
> >
> > > Hi list,
> > >
> > > I am using Nutch 1.2 and currently working my way through the Nutch and
> > Hadoop tutorial on the wiki for the first time. Not having much luck to
> date
> > and have reached the following part "So log into the master nodes and all
> of
> > the slave nodes as root." which I do not understand. Under
> > /logs/hadoop-...-jobtracker-...log I get the following
> > >
> > >
> > > 2011-01-24 16:55:08,008 ERROR mapred.JobTracker -
> > java.lang.RuntimeException: Not a host:port pair: local
> > >
> > > at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:136)
> > >
> > > at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
> > >
> > > at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1807)
> > >
> > > at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1579)
> > >
> > > at
> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
> > >
> > > at
> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
> > >
> > > at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)
> > >

CFP - Berlin Buzzwords 2011 - Search, Store, Scale

2011-01-25 Thread Isabel Drost
This is to announce Berlin Buzzwords 2011, the second edition of the 
successful conference on scalable and open search, data processing and data 
storage in Germany, taking place in Berlin.

Call for Presentations Berlin Buzzwords
   http://berlinbuzzwords.de
  Berlin Buzzwords 2011 - Search, Store, Scale
6/7 June 2011

The event will comprise presentations on scalable data processing. We invite 
you 
to submit talks on the topics:

   * IR / Search - Lucene, Solr, katta or comparable solutions
   * NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
   * Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives
   * Closely related topics not explicitly listed above are welcome. We are
 looking for presentations on the implementation of the systems themselves,
 real world applications and case studies.

Important Dates (all dates in GMT +2)
   * Submission deadline: March 1st 2011, 23:59 MEZ
   * Notification of accepted speakers: March 22nd, 2011, MEZ.
   * Publication of final schedule: April 5th, 2011.
   * Conference: June 6/7, 2011

High quality, technical submissions are called for, ranging from principles to 
practice. We are looking for real world use cases, background on the 
architecture of specific projects and a deep dive into architectures built on 
top of e.g. Hadoop clusters.

Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no 
later than March 1st, 2011. Acceptance notifications will be sent out soon 
after 
the submission deadline. Please include your name, bio and email, the title of 
the talk, and a brief abstract in English. Please indicate whether you want 
to give a lightning (10 min), short (20 min) or long (40 min) presentation, and 
indicate the level of experience with the topic your audience should have (e.g. 
whether your talk will be suitable for newbies or is targeted at experienced 
users). If you'd like to pitch your brand new product in your talk, please let 
us know as well - there will be extra space for presenting new ideas, awesome 
products and great new projects.

The presentation format is short. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to 
provide 
videos after the event, free drinks for attendees as well as an after-show 
party), please contact us.

Follow @hadoopberlin on Twitter for updates. Tickets, news on the conference, 
and the final schedule will be published at http://berlinbuzzwords.de.

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Please re-distribute this CfP to people who might be interested.

If you are local and wish to meet us earlier, please note that this Thursday 
evening there will be an Apache Hadoop Get Together (videos kindly sponsored by 
Cloudera, venue kindly provided for free by Zanox) featuring talks on Apache 
Hadoop in production as well as news on current Apache Lucene developments.

Contact us at:

newthinking communications GmbH
Schönhauser Allee 6/7
10119 Berlin, Germany

Julia Gemählich
Isabel Drost 

+49(0)30-9210 596




Re: Few questions from a newbie

2011-01-25 Thread .: Abhishek :.
Thanks Chris, Charan and Alex.

I am looking into the crawl statistics now, and I see fields like
db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm. What do
they mean?

Also, I see that db_unfetched is much higher than db_fetched. Does
this mean that most of the pages were not crawled at all due to some issue?

Thanks again for your time!


On Tue, Jan 25, 2011 at 2:33 PM, charan kumar wrote:

> db.fetcher.interval: it means that URLs which were fetched in the last 30
> days will not be fetched again. In other words, a URL is eligible for refetch only
> after 30 days have passed since the last crawl.
>
>
> On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
>
> > How to use solr to index nutch segments?
> > What is the meaning of db.fetcher.interval? Does this mean that if I run
> > the same crawl command before 30 days it will do nothing?
> >
> > Thanks.
> > Alex.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > -Original Message-
> > From: Charan K 
> > To: user 
> > Cc: user 
> > Sent: Mon, Jan 24, 2011 8:24 pm
> > Subject: Re: Few questions from a newbie
> >
> >
> > Refer to NutchBean.java for that question. You can run it from the
> > command
> > line
> >
> > to test the index.
> >
> >
> >
> >  If you use SOLR indexing, it is going to be much simpler, they have a
> solr
> > java
> >
> > client..
> >
> >
> >
> > Sent from my iPhone
> >
> >
> >
> > On Jan 24, 2011, at 8:07 PM, Amna Waqar  wrote:
> >
> >
> >
> > > 1. To crawl just 5 to 6 websites, you can use both cases, but an intranet crawl
> >
> > > gives you more control and speed.
> >
> > > 2. After the first crawl, the recrawl interval for the same sites is 30 days
> by
> >
> > > default in db.fetcher.interval; you can change it according to your own
> >
> > > convenience.
> >
> > > 3. I've no idea about the third question
> >
> > > because I am also a newbie.
> >
> > > Best of luck with Nutch learning
> >
> > >
> >
> > >
> >
> > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. 
> > wrote:
> >
> > >
> >
> > >> Hi all,
> >
> > >>
> >
> > >> I am very new to Nutch and Lucene as well. I am having few questions
> > about
> >
> > >> Nutch, I know they are very much basic but I could not get clear cut
> >
> > >> answers
> >
> > >> out of googling for this. The questions are,
> >
> > >>
> >
> > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> intranet
> >
> > >>  crawl or whole web crawl.
> >
> > >>  - How do I set recrawl's for these same web sites after the first
> > crawl.
> >
> > >>  - If I have to start search the results via my own java code which
> jar
> >
> > >>  files or api's or samples should I be looking into.
> >
> > >>  - Is there a book on Nutch?
> >
> > >>
> >
> > >> Thanks a bunch for your patience. I appreciate your time.
> >
> > >>
> >
> > >> ./Abishek
> >
> > >>
> >
> >
> >
> >
> >
> >
>


Re: Few questions from a newbie

2011-01-25 Thread Markus Jelsma
These values come from the CrawlDB and have the following meaning.

db_unfetched
This is the number of URLs that are to be crawled when the next batch is 
started. This number is usually limited by the generate.max.per.host 
setting. So, if there are 5000 unfetched and generate.max.per.host is set to 
1000, the next batch will fetch only 1000. Note that the number of unfetched will 
usually not be 5000-1000, because new URLs will have been discovered and added to 
the CrawlDB.

db_fetched
These URLs have been fetched. Their next fetch will be after db.fetcher.interval. 
But this is not always the case: the adaptive schedule algorithm can 
tune this interval depending on several settings. With these you can tune the 
interval depending on whether a page was modified or not.
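For example, a rough sketch of switching to the adaptive schedule in
conf/nutch-site.xml (class name from memory, so verify it against your
nutch-default.xml; the adaptive-specific knobs live under the
db.fetch.schedule.adaptive.* settings):

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>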

db_gone
HTTP 404 Not Found

db_redir-temp
HTTP 307 Temporary Redirect

db_redir_perm
HTTP 301 Moved Permanently

Code:
http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup

Configuration:
http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
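
As a side note, you can print these per-status counts for your own CrawlDB with
the stats tool (assuming your crawl directory is named "crawl"; adjust the path):

bin/nutch readdb crawl/crawldb -stats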

> Thanks Chris, Charan and Alex.
> 
> I am looking into the crawl statistics now. And I see fields like
> db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm, what do
> they mean?
> 
> And, I also see the db_unfetched is way too high than the db_fetched. Does
> it mean most of the pages did not crawl at all due to some issues?
> 
> Thanks again for your time!
> 
> On Tue, Jan 25, 2011 at 2:33 PM, charan kumar wrote:
> > db.fetcher.interval : It means that URLS which were fetched in the last
> > 30 days  will not be fetched. Or A URL is eligible for refetch
> > only after 30 days of last crawl.
> > 
> > On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
> > > How to use solr to index nutch segments?
> > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > run the same crawl command before 30 days it will do nothing?
> > > 
> > > Thanks.
> > > Alex.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > -Original Message-
> > > From: Charan K 
> > > To: user 
> > > Cc: user 
> > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > Subject: Re: Few questions from a newbie
> > > 
> > > 
> > > Refer NutchBean.java for the their question. You can run than from
> > 
> > command
> > 
> > > line
> > > 
> > > to test the index.
> > > 
> > >  If you use SOLR indexing, it is going to be much simpler, they have a
> > 
> > solr
> > 
> > > java
> > > 
> > > client..
> > > 
> > > 
> > > 
> > > Sent from my iPhone
> > > 
> > > On Jan 24, 2011, at 8:07 PM, Amna Waqar  wrote:
> > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > crawl
> > > > 
> > > > gives u more control and speed
> > > > 
> > > > 2.After the first crawl,the recrawling the same sites time is 30 days
> > 
> > by
> > 
> > > > default in db.fetcher.interval,you can change it according to ur own
> > > > 
> > > > convenience.
> > > > 
> > > > 3.I ve no idea about the third question
> > > > 
> > > > cz  i m also a newbie
> > > > 
> > > > Best of luck with nutch learning
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. 
> > > 
> > > wrote:
> > > >> Hi all,
> > > >> 
> > > >> 
> > > >> 
> > > >> I am very new to Nutch and Lucene as well. I am having few questions
> > > 
> > > about
> > > 
> > > >> Nutch, I know they are very much basic but I could not get clear cut
> > > >> 
> > > >> answers
> > > >> 
> > > >> out of googling for this. The questions are,
> > > >> 
> > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > 
> > intranet
> > 
> > > >>  crawl or whole web crawl.
> > > >>  
> > > >>  - How do I set recrawl's for these same web sites after the first
> > > 
> > > crawl.
> > > 
> > > >>  - If I have to start search the results via my own java code which
> > 
> > jar
> > 
> > > >>  files or api's or samples should I be looking into.
> > > >>  
> > > >>  - Is there a book on Nutch?
> > > >> 
> > > >> Thanks a bunch for your patience. I appreciate your time.
> > > >> 
> > > >> 
> > > >> 
> > > >> ./Abishek


Re: Regarding crawling of short URL's

2011-01-25 Thread Markus Jelsma
Reading a URL from the DB returns the HTTP response of that URL: some header 
information and the body. Crawling a URL with an HTTP redirect won't result in the 
HTTP response of the redirection target for that redirecting URL.
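
If you want the fetcher to follow a redirect immediately instead of only
recording the target for a later round, you could try raising http.redirect.max
in conf/nutch-site.xml, for example (a sketch; with the default of 0 redirected
URLs are merely queued, and with a depth-1 crawl that later fetch never happens):

<property>
  <name>http.redirect.max</name>
  <value>4</value>
</property>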

> Hi,
> 
> My application needs to crawl a set of urls which I give to the urls
> directory and fetch only the contents of that urls only.
> I am not interested in the contents of the internal or external links.
> So I have run the crawl command by giving depth as 1.
> 
> bin/nutch crawl urls -dir crawl -depth 1
> 
> Nutch crawls the urls and gives me the contents of the given urls.
> 
> I am reading the content using readseg utility.
> 
> bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch
> -nogenerate -noparse -noparsedata
> 
> With this I am fetching the content of webpage.
> 
> The problem I am facing is if I give direct urls like
> 
> http://isoc.org/wp/worldipv6day/
> http://openhackindia.eventbrite.com
> http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
> http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
> http://bangalore.yahoo.com/labs/summerschool.html
> http://riadevcamp.eventbrite.com
> http://www.sleepingtime.org/
> 
> then I am able to get the contents of the webpage.
> But when I give the set of urls as short urls like
> 
> http://is.gd/jOoAa9
> http://is.gd/ubHRAF
> http://is.gd/GiFqj9
> http://is.gd/H5rUhg
> http://is.gd/wvKINL
> http://is.gd/K6jTNl
> http://is.gd/mpa6fr
> http://is.gd/fmobvj
> http://is.gd/s7uZfr
> 
> I am not able to fetch the contents.
> 
> When I read the segments, it is not showing any content. Please find below
> the content of dump file read from segments.
> 
> Recno:: 0
> URL:: http://is.gd/0yKjO6
> 
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Jan 25 20:56:07 IST 2011
> Modified time: Thu Jan 01 05:30:00 IST 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1295969171407
> 
> Content::
> Version: -1
> url: http://is.gd/0yKjO6
> base: http://is.gd/0yKjO6
> contentType: text/html
> metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0
> Location= http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1
> _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html;
> charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
> Content:
> 
> 
> Recno:: 1
> URL:: http://is.gd/1tpKaN
> 
> Content::
> Version: -1
> url: http://is.gd/1tpKaN
> base: http://is.gd/1tpKaN
> contentType: text/html
> metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0
> Location=
> http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1
> _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html;
> charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
> Content:
> 
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Jan 25 20:56:07 IST 2011
> Modified time: Thu Jan 01 05:30:00 IST 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> 
> 
> I have also tried setting the max.redirects property in
> nutch-default.xml to 4, but didn't see any progress.
> Kindly provide me with a solution to this problem.
> 
> Thanks and regards,
> Ch. Arjun Kumar Reddy


Re: Few questions from a newbie

2011-01-25 Thread .: Abhishek :.
Thanks a bunch, Markus.

By the way, is there some book or material on Nutch which would help me
understand it better? I come from an application development background
and all the crawl and search stuff is *very* new to me :)


On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
wrote:

> These values come from the CrawlDB and have the following meaning.
>
> db_unfetched
> This is the number of URL's that are to be crawled when the next batch is
> started. This number is usually limited with the generate.max.per.host
> setting. So, if there are 5000 unfetched and generate.max.per.host is set
> to
> 1000, the next batch will fetch only 1000. Watch, the number of unfetched
> will
> usually not be 5000-1000 because new URL's have been discovered and added
> to
> the CrawlDB.
>
> db_fetched
> These URL's have been fetched. Their next fetch will be
> db.fetcher.interval.
> But, this is not always the case. There the adaptive schedule algorithm can
> tune this number depending on several settings. With these you can tune the
> interval when a page is modified or not modified.
>
> db_gone
> HTTP 404 Not Found
>
> db_redir-temp
> HTTP 307 Temporary Redirect
>
> db_redir_perm
> HTTP 301 Moved Permanently
>
> Code:
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
>
> Configuration:
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
>
> > Thanks Chris, Charan and Alex.
> >
> > I am looking into the crawl statistics now. And I see fields like
> > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm, what
> do
> > they mean?
> >
> > And, I also see the db_unfetched is way too high than the db_fetched.
> Does
> > it mean most of the pages did not crawl at all due to some issues?
> >
> > Thanks again for your time!
> >
> > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar  >wrote:
> > > db.fetcher.interval : It means that URLS which were fetched in the last
> > > 30 days  will not be fetched. Or A URL is eligible for refetch
> > > only after 30 days of last crawl.
> > >
> > > On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
> > > > How to use solr to index nutch segments?
> > > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > > run the same crawl command before 30 days it will do nothing?
> > > >
> > > > Thanks.
> > > > Alex.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -Original Message-
> > > > From: Charan K 
> > > > To: user 
> > > > Cc: user 
> > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > Subject: Re: Few questions from a newbie
> > > >
> > > >
> > > > Refer NutchBean.java for the their question. You can run than from
> > >
> > > command
> > >
> > > > line
> > > >
> > > > to test the index.
> > > >
> > > >  If you use SOLR indexing, it is going to be much simpler, they have
> a
> > >
> > > solr
> > >
> > > > java
> > > >
> > > > client..
> > > >
> > > >
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar 
> wrote:
> > > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > > crawl
> > > > >
> > > > > gives u more control and speed
> > > > >
> > > > > 2.After the first crawl,the recrawling the same sites time is 30
> days
> > >
> > > by
> > >
> > > > > default in db.fetcher.interval,you can change it according to ur
> own
> > > > >
> > > > > convenience.
> > > > >
> > > > > 3.I ve no idea about the third question
> > > > >
> > > > > cz  i m also a newbie
> > > > >
> > > > > Best of luck with nutch learning
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :.  >
> > > >
> > > > wrote:
> > > > >> Hi all,
> > > > >>
> > > > >>
> > > > >>
> > > > >> I am very new to Nutch and Lucene as well. I am having few
> questions
> > > >
> > > > about
> > > >
> > > > >> Nutch, I know they are very much basic but I could not get clear
> cut
> > > > >>
> > > > >> answers
> > > > >>
> > > > >> out of googling for this. The questions are,
> > > > >>
> > > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > >
> > > intranet
> > >
> > > > >>  crawl or whole web crawl.
> > > > >>
> > > > >>  - How do I set recrawl's for these same web sites after the first
> > > >
> > > > crawl.
> > > >
> > > > >>  - If I have to start search the results via my own java code
> which
> > >
> > > jar
> > >
> > > > >>  files or api's or samples should I be looking into.
> > > > >>
> > > > >>  - Is there a book on Nutch?
> > > > >>
> > > > >> Thanks a bunch for your patience. I appreciate your time.
> > > > >>
> > > > >>
> > > > >>
> > > > >> ./Abishek
>


Archiving Audio and Video

2011-01-25 Thread Adam Estrada
Curious...I have been using Nutch for a while now and have never tried to index 
any audio or video formats. Is it feasible to grab the audio out of both forms 
of media and then index it? I believe this would require some kind of 
transcription which may be out of reach on this project.

Thanks,
Adam

Re: Archiving Audio and Video

2011-01-25 Thread Gora Mohanty
On Wed, Jan 26, 2011 at 9:15 AM, Adam Estrada
 wrote:
> Curious...I have been using Nutch for a while now and have never tried to 
> index any audio or video formats. Is it feasible to grab the audio out of 
> both forms of media and then index it? I believe this would require some kind 
> of transcription which may be out of reach on this project.
[...]

One should be able to serialize/de-serialize audio and video streams
with ffmpeg, but what is your use case here, i.e., what are you planning
to do with the indexed content?

Regards,
Gora