Re: hadoop on unstable nodes

2010-08-03 Thread Alex Loddengaard
I don't know of any research, but such a scenario is likely not going to
turn out so well.  Hadoop is very network hungry and is designed to be run
in a datacenter.  Sorry I don't have more information for you.

Alex

On Mon, Aug 2, 2010 at 9:14 PM, Rahul.V. wrote:

> Hi,
> Is there any research currently going on where map reduce is applied to
> nodes in normal internet scenarios? In environments where network bandwidth
> is at a premium, what tweaks are applied to Hadoop?
> I would be very thankful if you can post me links in this direction.
>
> --
> Regards,
> R.V.
>


Re: jobtracker.jsp reports "GC overhead limit exceeded"

2010-07-30 Thread Alex Loddengaard
err, "ps aux", not "ps".

Alex

On Fri, Jul 30, 2010 at 3:19 PM, Alex Loddengaard  wrote:

> What does "ps" show you?  How much memory is being used by the jobtracker,
> and how large is its heap (look for HADOOP_HEAPSIZE in hadoop-env.sh)?  Also
> consider turning on GC logging, which will find its way to the jobtracker
> .out log in /var/log/hadoop:
>
> <http://java.sun.com/developer/technicalArticles/Programming/GCPortal/>
>
> Alex
>
>
> On Fri, Jul 30, 2010 at 3:10 PM, jiang licht wrote:
>
>> http://server:50030/jobtracker.jsp generates the following error message:
>>
>> HTTP ERROR: 500
>>
>> GC overhead limit exceeded
>>
>> RequestURI=/jobtracker.jsp
>> Caused by:
>>
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> Powered by Jetty://
>>
>> The jobtracker is running below the limit. But "hadoop job -status" seems
>> to halt and does not respond ...
>>
>> The last 2 lines of jobtracker logs:
>>
>> 2010-07-30 13:53:18,482 DEBUG org.apache.hadoop.mapred.JobTracker: Got
>> heartbeat from: 
>> tracker_host1:localhost.localdomain/127.0.0.1:53914(restarted: false 
>> initialContact: false acceptNewTasks: true) with
>> responseId: -31252
>> 2010-07-30 13:55:32,917 DEBUG org.apache.hadoop.mapred.JobTracker:
>> Starting launching task sweep
>>
>> Any thought about this?
>>
>> Thanks!
>> --Michael
>>
>>
>>
>
>
>


Re: jobtracker.jsp reports "GC overhead limit exceeded"

2010-07-30 Thread Alex Loddengaard
What does "ps" show you?  How much memory is being used by the jobtracker,
and how large is its heap (look for HADOOP_HEAPSIZE in hadoop-env.sh)?  Also
consider turning on GC logging, which will find its way to the jobtracker
.out log in /var/log/hadoop:

<http://java.sun.com/developer/technicalArticles/Programming/GCPortal/>

Alex

On Fri, Jul 30, 2010 at 3:10 PM, jiang licht  wrote:

> http://server:50030/jobtracker.jsp generates the following error message:
>
> HTTP ERROR: 500
>
> GC overhead limit exceeded
>
> RequestURI=/jobtracker.jsp
> Caused by:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> Powered by Jetty://
>
> The jobtracker is running below the limit. But "hadoop job -status" seems
> to halt and does not respond ...
>
> The last 2 lines of jobtracker logs:
>
> 2010-07-30 13:53:18,482 DEBUG org.apache.hadoop.mapred.JobTracker: Got
> heartbeat from: 
> tracker_host1:localhost.localdomain/127.0.0.1:53914(restarted: false 
> initialContact: false acceptNewTasks: true) with
> responseId: -31252
> 2010-07-30 13:55:32,917 DEBUG org.apache.hadoop.mapred.JobTracker: Starting
> launching task sweep
>
> Any thought about this?
>
> Thanks!
> --Michael
>
>
>


Re: What is IPC classes for???

2010-07-07 Thread Alex Loddengaard
Hi Ahmad,

Those are the classes used for daemons to talk to each other.  Take a look
at what IPC means:



Hope this helps.

Alex

On Wed, Jul 7, 2010 at 11:32 AM, Ahmad Shahzad  wrote:

> Hi ALL,
>  Can anyone tell me what the purpose of the IPC classes in
> hadoop is? They are in the \src\core\org\apache\hadoop\ipc folder.
>
> Regards,
> Ahmad Shahzad
>


Re: Question about Data Node configuration

2010-07-07 Thread Alex Loddengaard
I would recommend not putting / in dfs.data.dir.  You'll want that space for
logs, which will grow very large in heavily-used clusters (userlogs in
particular).

/ for OS and logs
/mount* for mapred.local.dir and dfs.data.dir

Hope this helps.

Alex

On Wed, Jul 7, 2010 at 10:38 AM, A Levine  wrote:

> I am trying to configure a large install and I have a question about
> the configuration of Data Nodes.  Each data node has multiple drives.
> Each drive is 1TB in size.  In the hdfs-site.xml, I can have multiple
> directories (which will be mounted drives) specified as shown by:
>
>  <property>
>    <name>dfs.data.dir</name>
>    <value>/mount1,/mount2,/mount3,</value>
>    <final>true</final>
>  </property>
>
> For the drive that has the OS, only 100G will be used for the OS.  Is
> it good practice to have a partition on the drive that has the OS used
> for the dfs.data.dir?  Will this slow things down?  Will the size
> difference available to each directory be a problem?  Also, if it is
> not a good idea to use the OS drive, then how about pointing logs to
> that drive?
>
> andrew
>


Re: Please help! Corrupt fsimage?

2010-07-07 Thread Alex Loddengaard
Hi Peter,

The edits.new file is used when the edits and fsimage are pulled to the
secondarynamenode.  Here's the process:

1) SNN pulls edits and fsimage
2) NN starts writing edits to edits.new
3) SNN sends new fsimage to NN
4) NN replaces its fsimage with the SNN fsimage
5) NN replaces edits with edits.new

Certainly taking a different fsimage and trying to apply edits to it won't
work.  Your best bet might be to take the 3-day-old fsimage with an empty
edits and delete edits.new.  But before you do any of this, make sure you
completely back up everything under dfs.name.dir and dfs.checkpoint.dir.  What
are the timestamps on the fsimage files in each dfs.name.dir and
dfs.checkpoint.dir?

Do the namenode and secondarynamenode have enough disk space?  Have you
consulted the logs to learn why the SNN/NN didn't properly update the
fsimage and edits log?

Hope this helps.

Alex

On Wed, Jul 7, 2010 at 7:34 AM, Peter Falk  wrote:

> Just a little update. We found a working fsimage that was just a couple of
> days older than the corrupt one. We tried to replace the fsimage with the
> working one, and kept the edits and edits.new files, hoping that the latest
> edits would still be in use. However, when starting the namenode, the
> following error message appears. Any thoughts, ideas, or hints on how to
> continue? Edit the edits files somehow?
>
> TIA,
> Peter
>
> 2010-07-07 16:21:10,312 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files = 28372
> 2010-07-07 16:21:11,162 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files under construction = 8
> 2010-07-07 16:21:11,164 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 3315887 loaded in 0 seconds.
> 2010-07-07 16:21:11,164 DEBUG
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 9:
> /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 numblocks :
> 1
> clientHolder  clientMachine
> 2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.StateChange: DIR*
> FSDirectory.unprotectedDelete: failed to remove
> /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 because it
> does not exist
> 2010-07-07 16:21:11,164 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode:
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:982)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:194)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:615)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>
> 2010-07-07 16:21:11,165 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at fanta/192.168.10.53
> /
>
>
> On Wed, Jul 7, 2010 at 14:46, Peter Falk  wrote:
>
> > Hi,
> >
> > After a restart of our live cluster today, the name node fails to start
> > with the log message seen below. There is a big file called edits.new in
> the
> > "current" folder that seems be the only one that have received changes
> > recently (no changes to the edits or the fsimage for over a month). Is
> that
> > normal?
> >
> > The last change to the edits.new file was right before shutting down the
> > cluster. It seems like the shutdown was unable to store valid fsimage,
> > edits, edits.new files. The secondary name node image does not include
> the
> > edits.new file, only edits and fsimage, which are identical to the name
> > node's version. So no help from them.
> >
> > Would appreciate any help in understanding what could have gone wrong.
> The
> > shutdown seemed to complete just fine, without any error message. Is
> there
> > any way to recreate the image from the data, or any other way to save our
> > production data?
> >
> > Sincerely,
> > Peter
> >
> > 2010-07-07 14:30:26,949 INFO org.apache.hadoop.ipc.metrics.R

Re: What is it??? help required

2010-07-05 Thread Alex Loddengaard
Hi Ahmad,

On Sat, Jul 3, 2010 at 11:21 AM, Ahmad Shahzad  wrote:
>
> 1) What is the purpose of the HttpServer that is started at port 50060, and
> jetty bounded to it.
>

This is used for web UI status (just like the jobtracker and namenode web
UI), along with map -> reduce intermediate data transfers.


> 2) What is the purpose of the TaskTracker that is started at localhost and
> port 42641.
>

This is the tasktracker daemon that communicates with the jobtracker and
coordinates tasks on the local machine.

Alex


Re: Text files vs. SequenceFiles

2010-07-02 Thread Alex Loddengaard
Hi David,

On Fri, Jul 2, 2010 at 2:54 PM, David Rosenstrauch wrote:
>
> * We should use a SequenceFile (binary) format as it's faster for the
> machine to read than parsing text, and the files are smaller.
>
> * We should use a text file format as it's easier for humans to read,
> easier to change, text files can be compressed quite small, and a) if the
> text format is designed well and b) given the context of a distributed
> system like Hadoop where you can throw more nodes at a problem, the text
> parsing time will wind up being negligible/irrelevant in the overall
> processing time.
>

SequenceFiles can also be compressed, either per record or per block.  This
is advantageous if you want to use gzip, because gzip isn't splittable.  A
SequenceFile compressed by blocks is therefore splittable, because each block is
gzipped vs. the entire file being gzipped.
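
For concreteness, here is a minimal sketch of writing a block-compressed
SequenceFile directly with the org.apache.hadoop.io API; the output path and
the Text/IntWritable key/value types are just placeholders for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BlockCompressedWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/example.seq");  // placeholder path

    // BLOCK compression compresses many records at once, so the file stays
    // splittable even when the codec itself (e.g. gzip) is not.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, IntWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (int i = 0; i < 1000; i++) {
        writer.append(new Text("key-" + i), new IntWritable(i));
      }
    } finally {
      writer.close();
    }
  }
}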

As for readability, "hadoop fs -text" is the same as "hadoop fs -cat" for
SequenceFiles.

Lastly, I promise that eventually you'll run out of space in your cluster
and wish you did better compression.  Plus compression makes jobs faster.

The general recommendation is to use SequenceFiles as early in your ETL as
possible.  Usually people get their data in as text, and after the first MR
pass they work with SequenceFiles from there on out.

Alex


Re: Intermediate files generated.

2010-07-01 Thread Alex Loddengaard
You could use the HDFS API from within your mapper, and run with 0 reducers.
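
A rough sketch of that idea with the org.apache.hadoop.mapred API is below;
the output directory and the one-file-per-task-attempt naming are my own
assumptions, not a standard recipe:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map-only job (configure it with conf.setNumReduceTasks(0)); each map task
// writes its records straight to a file it opens itself via the HDFS API.
public class DirectHdfsMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private FSDataOutputStream out;

  @Override
  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // One file per task attempt under a directory of your choice
      // ("/user/pb/map-out" is hypothetical).
      out = fs.create(new Path("/user/pb/map-out", job.get("mapred.task.id")));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<NullWritable, NullWritable> collector,
                  Reporter reporter) throws IOException {
    out.write(value.getBytes(), 0, value.getLength());  // write the raw line
    out.write('\n');
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}

Your later "reduce" pass can then be a second job whose input path points at
that same directory.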

Alex

On Thu, Jul 1, 2010 at 3:07 PM, Pramy Bhats wrote:

> Hi,
>
> I am using hadoop framework for writing MapReduce jobs. I want  to redirect
> the output of Map into files of my choice and later use those files as
> input
> for Reduce phase.
>
>
> Could you please suggest, how to proceed for it ?
>
> thanks,
> --PB.
>


Re: Decommissioning Individual Disks

2009-09-10 Thread Alex Loddengaard
Hi David,
Unfortunately there's really no way to do what you're hoping to do in an
automatic way.  You can move the block files (including their .meta files)
from one disk to another.  Do this when the datanode daemon is stopped.
 Then, when you start the datanode daemon, it will scan dfs.data.dir and be
totally happy if blocks have moved hard drives.  I've never tried to do this
myself, but others on the list have suggested this technique for "balancing
disks."

You could also change your process around a little.  It's not too crazy to
decommission an entire node, replace one of its disks, then bring it back
into the cluster.  Seems to me that this is a much saner approach: your ops
team will tell you which disk needs replacing.  You decommission the node,
they replace the disk, you add the node back to the pool.  Your call I
guess, though.

Hope this was helpful.

Alex

On Thu, Sep 10, 2009 at 6:30 PM, David B. Ritch wrote:

> What do you do with the data on a failing disk when you replace it?
>
> Our support person comes in occasionally, and often replaces several
> disks when he does.  These are disks that have not yet failed, but
> firmware indicates that failure is imminent.  We need to be able to
> migrate our data off these disks before replacing them.  If we were
> replacing entire servers, we would decommission them - but we have 3
> data disks per server.  If we were replacing one disk at a time, we
> wouldn't worry about it (because of redundancy).  We can decommission
> the servers, but moving all the data off of all their disks is a waste.
>
> What's the best way to handle this?
>
> Thanks!
>
> David
>


Re: Delete replicated blocks?

2009-08-27 Thread Alex Loddengaard
I don't know for sure, but running the rebalancer might do this for you.

<
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html#Rebalancer
>

Alex

On Thu, Aug 27, 2009 at 9:18 AM, Michael Thomas wrote:

> dfs.replication is only used by the client at the time the files are
> written.  Changing this setting will not automatically change the
> replication level on existing files.  To do that, you need to use the
> hadoop cli:
>
> hadoop fs -setrep -R 1 /
>
> --Mike
>
>
> Vladimir Klimontovich wrote:
> > This will happen automatically.
> > On Aug 27, 2009, at 6:04 PM, Andy Liu wrote:
> >
> >> I'm running a test Hadoop cluster, which had a dfs.replication value
> >> of 3.
> >> I'm now running out of disk space, so I've reduced dfs.replication to
> >> 1 and
> >> restarted my datanodes.  Is there a way to free up the over-replicated
> >> blocks, or does this happen automatically at some point?
> >>
> >> Thanks,
> >> Andy
> >
> > ---
> > Vladimir Klimontovich,
> > skype: klimontovich
> > GoogleTalk/Jabber: klimontov...@gmail.com
> > Cell phone: +7926 890 2349
> >
>
>


Re: control map to split assignment

2009-08-27 Thread Alex Loddengaard
Hi Rares,

Unfortunately there isn't a way to control the scheduling of individual
tasks, at least as far as I know.  Might you be able to split this up into
two jobs: one for the "short" inputs; another for the "long" inputs?  Just a
thought.

Alex

On Wed, Aug 26, 2009 at 6:52 PM, Rares Vernica  wrote:

> Hello,
>
> I wonder is there is a way to control how maps are assigned to splits
> in order to balance the load across the cluster.
>
> Here is a simplified example. I have two types of inputs: "long" and
> "short". Each input is in a different file and will be processed by a
> single map task. Suppose the "long" inputs take 10s to process while
> the "short" inputs take 3s to process. I have two "long" inputs and
> two "short" inputs. My cluster has 2 nodes and each node can execute
> only one map task at a time. A possible schedule of the tasks could be
> the following:
>
> Node 1: "long map", "short map" -> 10s + 3s = 13s
> Node 2: "long map", "short map" -> 10s + 3s = 13s
>
> So, my job will be done in 13s. Another possible schedule is:
>
> Node 1: "long map" -> 10s
> Node 2: "short map", "short map", "long map" -> 3s + 3s + 10s = 16s
>
> And, my job will be done in 16s. Clearly, the first scheduling is better.
>
> Is there a way to control how the schedule is built? If I can control
> which inputs are processed first, I could schedule the "long" inputs
> to be processed first and so they will be balanced across nodes and I
> will end up with something similar to the first schedule.
>
> I could configure the job so that a "long" input gets processed by
> more than one map, and so end up balancing the work, but I noticed that
> overall, this takes more time than a bad scheduling with only one map
> per input.
>
> Thanks!
>
> Cheers,
> Rares Vernica
>


Re: Intra-datanode balancing?

2009-08-25 Thread Alex Loddengaard
Changing the ordering of dfs.data.dir won't change anything, because
dfs.data.dir is written to in a round-robin fashion.

Kris, I think you're stuck with the hack you're performing :(.  Sorry I
don't have better news.

Alex

On Tue, Aug 25, 2009 at 1:16 PM, Ted Dunning  wrote:

> Change the ordering of the volumes in the config files.
>
> On Tue, Aug 25, 2009 at 12:51 PM, Kris Jirapinyo  >wrote:
>
> > Hi all,
> >I know this has been filed as a JIRA improvement already
> > http://issues.apache.org/jira/browse/HDFS-343, but is there any good
> > workaround at the moment?  What's happening is I have added a few new EBS
> > volumes to half of the cluster, but Hadoop doesn't want to write to them.
> > When I try to do cluster rebalancing, since the new disks make the
> > percentage used lower, it fills up the first two existing local disks,
> > which
> > is exactly what I don't want to happen.  Currently, I just delete several
> > subdirs from dfs, since I know that with a replication factor of 3, it'll
> > be
> > ok, so that fixes the problems in the short term.  But I still cannot get
> > Hadoop to use those new larger disks efficiently.  Any thoughts?
> >
> > -- Kris.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: How can I copy files from S3 to my local hadoop cluster

2009-08-21 Thread Alex Loddengaard
Hi Jeff,

You can use distcp.  Something like "hadoop distcp s3n://bucket/object
foo/bar".  Read more here: 

Alex

On Fri, Aug 21, 2009 at 3:19 AM, zhang jianfeng  wrote:

> Hi all,
>
>
> I found hadoop has a filesystem implementation for S3, So how can I copy
> files from S3 to my local hadoop cluster ?
> Is there any Java API examples?
>
>
> Thank you.
>
> Jeff Zhang
>


Re: Exception when starting namenode

2009-08-21 Thread Alex Loddengaard
Have you tampered with anything in dfs.name.dir?  This exception occurs when
your image files in dfs.name.dir are corrupt.  What have you set
dfs.name.dir to?  If it's set to /tmp, then I imagine tmpwatch might have
deleted your HDFS metadata.

Hope this helps.

Alex

On Thu, Aug 20, 2009 at 10:08 PM, Zheng Lv wrote:

> Hello,
>
>I got these exceptions when I started the cluster, any suggestions?
>I used hadoop 0.15.2.
>2009-08-21 12:12:53,463 ERROR org.apache.hadoop.dfs.NameNode:
> java.io.EOFException
>at java.io.DataInputStream.readInt(DataInputStream.java:375)
>at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:650)
>at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:614)
>at
> org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
>at
> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
>at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:221)
>at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:168)
>at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:804)
>at org.apache.hadoop.dfs.NameNode.main(NameNode.java:813)
>Thank you,
>LvZheng
>


Re: which group to post to?

2009-07-28 Thread Alex Loddengaard
Hi Mark, please use common-user.  core-user will eventually be deprecated,
as it existed before the project split.

Thanks,

Alex

On Tue, Jul 28, 2009 at 10:59 AM, Mark Kerzner wrote:

> Hi,
> it seems that posting to core-user works the same as common-user, and it
> does not matter which I post to, is that right?
>
> Thank you,
> Mark
>


Re: permission denied error on multiple slaves

2009-07-10 Thread Alex Loddengaard
This sounds like an SSH key issue.  I'm going to assume that you're invoking
the start-*.sh scripts from the NameNode.  On the NameNode, you'll want to
run "ssh-keygen -t rsa" as the user that runs Hadoop (probably "hadoop").
This should create two files: ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub.  scp the
*.pub file to all of your other nodes, and store that file as
~/.ssh/authorized_keys on each node, including the NameNode.  Give
~/.ssh/authorized_keys 600 permissions.  You should be good to go.  I
recommend testing this stuff before running the start-*.sh scripts.

You may also want to look at our (Cloudera's) RPMs and DEBs.  They
simplify the installation of Hadoop and give you init scripts to start all
the daemons.  Then you can avoid the start-*.sh scripts.  <
http://www.cloudera.com/hadoop>

Hope this helps.

Alex

On Fri, Jul 10, 2009 at 9:19 AM, Divij Durve  wrote:

> Hey everyone,
>
> I am quite new to using hadoop. I have got the config and everything
> working
> perfectly with 1 namenode/jobtracker, 1 datanode, 1 secondary namenode.
> However, keeping the config the same, I just added a slave to the list in
> the
> conf/slaves file and tried running the cluster. This resulted in me
> getting
> permission denied when I put in the password for ssh. The ssh
> passwordless login is not working for some reason. It's only the data nodes
> that are giving trouble; however, the secondary name node is starting up
> without a hitch even though its password is the last one to be entered.
> Any ideas/suggestions anyone might have.
>
> Thanks
>


Re: how to compress..!

2009-07-09 Thread Alex Loddengaard
A few comments before I answer:
1) Each time you send an email, we receive two emails.  Is your mail client
misconfigured?
2) You already asked this question in another thread :).  See my response
there.

Short answer: <
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html
>

Alex

On Thu, Jul 9, 2009 at 1:11 AM, Sugandha Naolekar wrote:

> Hello!
>
> How to compress data by using hadoop api's??
>
> I want to write Java code to compress the core files (the data I am going
> to dump in HDFS) and then place them in HDFS. So, the API usage is sufficient.
> What about making related changes in the hadoop-site.xml file?
>
>
> --
> Regards!
> Sugandha
>


Re: how to use hadoop in real life?

2009-07-09 Thread Alex Loddengaard
Writing a Java program that uses the API is basically equivalent to
installing a Hadoop client and writing a Python script to manipulate HDFS and
fire off a MR job.  It's up to you to decide how much you like Java :).

Alex

On Thu, Jul 9, 2009 at 2:27 AM, Shravan Mahankali <
shravan.mahank...@catalytic.com> wrote:

> Hi Group,
>
> I have data to be analyzed and I would like to dump this data to Hadoop
> from
> machine.X, whereas Hadoop is running on machine.Y. After dumping this
> data
> I would like to initiate a job, get this data analyzed, and get the
> output information back to machine.X.
>
> I would like to do all this programmatically. I am going through the Hadoop API
> for this same purpose. I remember Alex saying the other day to install
> Hadoop
> on machine.X, but I was not sure why to do that.
>
> I would simply write a Java program including the Hadoop-core jar; I was planning to
> use "FsUrlStreamHandlerFactory" to connect to Hadoop on machine.Y and then
> use "org.apache.hadoop.fs.shell" to copy data to the Hadoop machine and
> initiate
> the job and get the results.
>
> Please advice.
>
> Thank You,
> Shravan Kumar. M
> Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
> -
> This email and any files transmitted with it are confidential and intended
> solely for the use of the individual or entity to whom they are addressed.
> If you have received this email in error please notify the system
> administrator - netopshelpd...@catalytic.com
>
> -Original Message-
> From: Shravan Mahankali [mailto:shravan.mahank...@catalytic.com]
> Sent: Thursday, July 09, 2009 10:35 AM
> To: common-user@hadoop.apache.org
> Cc: 'Alex Loddengaard'
> Subject: RE: how to use hadoop in real life?
>
> Thanks for the information Ted.
>
> Regards,
> Shravan Kumar. M
> Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
> -
> This email and any files transmitted with it are confidential and intended
> solely for the use of the individual or entity to whom they are addressed.
> If you have received this email in error please notify the system
> administrator - netopshelpd...@catalytic.com
>
> -Original Message-
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: Wednesday, July 08, 2009 10:48 PM
> To: common-user@hadoop.apache.org; shravan.mahank...@catalytic.com
> Cc: Alex Loddengaard
> Subject: Re: how to use hadoop in real life?
>
> In general hadoop is simpler than you might imagine.
>
> Yes, you need to create directories to store data.  This is much lighter
> weight than creating a table in SQL.
>
> But the key question is volume.  Hadoop makes some things easier and Pig
> queries are generally easier to write than SQL (for programmers ... not for
> those raised on SQL), but, overall, map-reduce programs really are more
> work
> to write than SQL queries until you get to really large scale problems.
>
> If your database has less than 10 million rows or so, I would recommend
> that
> you consider doing all analysis in SQL augmented by procedural languages.
> Only as your data goes beyond 100 million to a billion rows do the clear
> advantages of map-reduce formulation become apparent.
>
> On Tue, Jul 7, 2009 at 11:35 PM, Shravan Mahankali <
> shravan.mahank...@catalytic.com> wrote:
>
> > Use Case: We have a web app where user performs some actions, we have to
> > track these actions and various parameters related to action initiator,
> we
> > actually store this information in the database. But our manager has
> > suggested evaluating Hadoop for this scenario; however, I am not clear that
> > every time I run a job in Hadoop I have to create a directory and how can
> I
> > track that later to read the data analyzed by Hadoop. Even though I drop
> > user action information in Hadoop, I have to put this information in our
> > database such that it knows the trend and responds to various
> requests
> > accordingly.
> >
>
>


Re: Few Queries..!!!

2009-07-09 Thread Alex Loddengaard
Answers in-line.  Let me know if any questions follow.

Alex

On Wed, Jul 8, 2009 at 10:49 PM, Sugandha Naolekar
wrote:

> Hello!
>
> I have a 7 node hadoop cluster!
>
> As of now, I am able to transfer (dump) the data into HDFS from a remote
> node (not a part of the hadoop cluster). And through the web UI, I am able to
> download the same.
>
> -> But, if I need to restrict that web UI to a few users only, what am I
> supposed to do?
>
Hadoop doesn't have any mechanism for authentication, so you'll have to do
this with Linux tools.  It's also dangerous to restrict access to the web
ports, because those same ports are used by the Hadoop daemons themselves.
You could use iptables to create an IP whitelist, and include your users'
IPs, as well as your nodes' IPs.  There may be a way to massage Jetty to
restrict access, but I don't know enough about Jetty to be able to say for
sure.

>
> -> Also, what if I need to do some kind of search, i.e., whether a particular
> file
> or folder is available or not in HDFS? Will I be able to do it simply
> by writing code using the hadoop FileSystem APIs? Will it be fast and
> efficient when the data grows to a huge amount?

The API should be sufficient here.  Another possibility, if you'd rather not
use Java, is to get the Fuse contrib project working and mount HDFS onto a
Linux box.  Then you could use Python, bash, or whatever to do your file
traversals.  Note that fuse isn't widely used, so it may be hard to get
going (I've never done it).
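
As a concrete sketch (the paths are hypothetical), an existence check and a
directory listing with the FileSystem API look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSearch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up your site config
    FileSystem fs = FileSystem.get(conf);

    Path candidate = new Path("/data/archive/file.txt");   // placeholder
    System.out.println(candidate + " exists? " + fs.exists(candidate));

    // Listing a directory is answered from the namenode's in-memory metadata,
    // so it stays cheap even when the files themselves are huge.
    for (FileStatus status : fs.listStatus(new Path("/data/archive"))) {
      System.out.println(status.getPath() + (status.isDir() ? " (dir)" : ""));
    }
  }
}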

>
>
> -> Also, after the above tasks, I want to implement compression algorithms. The
> data that is getting placed in HDFS should be placed in a compressed
> format.
> Will I have to use the hadoop APIs only, or some map-reduce techniques? In
> this whole process, is Map-Reduce necessary? If yes, where?

There are a few different ways to do this.  Probably the easiest is the
following.  First, put your data in HDFS in its original format.  Then, use
IdentityMapper and IdentityReducer to read your (presumably plain text) data
via TextInputFormat, and configure your job to use SequenceFileOutputFormat
(to learn about the different compression options, see <
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html>).
After this map reduce job is done, you will have your original data, and
your data in SequenceFiles.  Make sense?
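
A minimal sketch of that identity job with the org.apache.hadoop.mapred API
follows; the input and output paths are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TextToSequenceFile {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TextToSequenceFile.class);
    conf.setJobName("text-to-seqfile");

    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("/data/raw"));    // placeholder

    // Identity map and reduce: records pass through unchanged.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // TextInputFormat produces <LongWritable offset, Text line> pairs.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // Write block-compressed SequenceFiles.
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(conf, new Path("/data/seq"));   // placeholder
    SequenceFileOutputFormat.setCompressOutput(conf, true);
    SequenceFileOutputFormat.setOutputCompressionType(
        conf, SequenceFile.CompressionType.BLOCK);

    JobClient.runJob(conf);
  }
}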

>
>
>
> --
> Regards!
> Sugandha
>


Re: apply

2009-07-07 Thread Alex Loddengaard
Email common-user-subscr...@hadoop.apache.org to subscribe.  Thanks,

Alex

On Tue, Jul 7, 2009 at 1:05 AM, antrao  wrote:

> Hi, i hope i could join the mail list.
>


Re: how to use hadoop in real life?

2009-07-06 Thread Alex Loddengaard
Answers inline.  Hope this is helpful.

Alex

On Mon, Jul 6, 2009 at 5:25 AM, Shravan Mahankali <
shravan.mahank...@catalytic.com> wrote:

> Hi Group,
>
>
>
> Finally I have written a sample Mapred program, submitted this job to
> Hadoop
> and got the expected results. Thanks to all of you!
>
>
>
> Now I don't have an idea of how to use Hadoop in real life (am sorry if am
> asking wrong question at wrong time.! (So, am right ;-))) :
>
>
>
> 1) If I re-submit my job, Hadoop responds with an error message saying:
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> hdfs://localhost:9000/user/root/impressions_output already exists
>
If the output directory specified to a job already exists, Hadoop won't run
the job.  This makes a lot of sense, because it helps you avoid overwriting
your data.  Subsequent job runs will have to write to different output
directories.
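
Two common ways to handle that are sketched below; the directory names are
only illustrative:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class OutputDirHandling {

  // Option 1: give every run its own output directory.
  public static void useFreshOutputDir(JobConf conf) {
    Path out = new Path("impressions_output_" + System.currentTimeMillis());
    FileOutputFormat.setOutputPath(conf, out);
  }

  // Option 2 (destructive): reuse the same name, deleting old results first.
  public static void overwriteOutputDir(JobConf conf) throws IOException {
    Path out = new Path("impressions_output");
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(out)) {
      fs.delete(out, true);   // true = recursive
    }
    FileOutputFormat.setOutputPath(conf, out);
  }
}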

>
> 2) How to automatically execute Hadoop jobs? let's say I have set a cron
> job
> which runs various Hadoop jobs at specified times. Is this the way we do in
> Hadoop world?

Cron is a very good tool for running jobs at various times :).  That said,
cron does not provide a workflow management system that can tie jobs
together.  For workflow management, the community has hamake (<
http://code.google.com/p/hamake/>), Oozie (<
http://issues.apache.org/jira/browse/HADOOP-5303>), and Cascading (<
http://www.cascading.org/>).

>
>
> 3) Can I submit jobs to Hadoop from a different machine/ network/ domain?
>
A different machine is easy.  Just install Hadoop as you did on your other
machines, and in hadoop-site.xml (assuming you're not using 0.20), just
configure hadoop.job.ugi (optional, actually), fs.default.name, and
mapred.job.tracker.  As for a different network / domain, take a look at
this: <
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
>
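
For a plain Java client on that other machine, the same properties can also be
set programmatically; the hostnames, user/group, and paths below are
placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point these at your cluster's namenode and jobtracker.
    conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
    conf.set("mapred.job.tracker", "jobtracker.example.com:9001");
    // Optional, pre-0.20 style: act as a particular user and group.
    conf.set("hadoop.job.ugi", "someuser,somegroup");

    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/local/data.txt"),          // local source
                         new Path("/user/someuser/data.txt")); // HDFS target
  }
}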

>
> 4) I would like to generate reports from the data collected in the Hadoop.
> How can I do that?

What do you mean by "data collected in the Hadoop?"  Log files?  Output
data?

>
>
> 5) Am thinking of replacing data in my database with Hadoop and query
> Hadoop
> for various information. Is this correct?

This probably isn't correct.  HDFS does not have good latency, so it
shouldn't be queried in a real-time/interactive environment.  Similarly,
once a file is written, it can only be appended to, which makes HDFS a bad
application for a database.  Lastly, HDFS doesn't really give you any sort
of transactional behavior that a database would, which may or may not be
what you're looking for.  You may want to take a look at HBase (<
http://hadoop.apache.org/hbase/>), as it is much more like a database than
HDFS.

>
>
> 6) How can I access analyzed data in Hadoop from external world, external
> program?

Depending on the type of program accessing HDFS, you could just read files
from HDFS.  You could use DBOutputFormat (<
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/db/DBOutputFormat.html>)
to write to a SQL database somewhere, and similarly you could output SQL
syntax (or even data that can be imported with something like mysqlimport <
http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html>) that you then run
on a database somewhere.  Again, it's important to realize that HDFS is by
no means meant to serve interactive/real-time data as a local file system
might.  HDFS is meant for throughput for purposes of analyzing lots of data,
and for storing lots of data.  HBase, however, is being used by several
people for real-time/interactive queries.
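
If you go the DBOutputFormat route, the wiring looks roughly like the sketch
below.  The table, columns, JDBC URL, credentials, and the PageviewRecord
class are all hypothetical, and you should check the exact method signatures
against the javadoc linked above:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// Hypothetical record: one row of a "pageviews" summary table.
public class PageviewRecord implements Writable, DBWritable {
  private String url;
  private long count;

  public PageviewRecord() {}

  public PageviewRecord(String url, long count) {
    this.url = url;
    this.count = count;
  }

  // DBWritable: bind fields to the generated INSERT statement, in column order.
  public void write(PreparedStatement stmt) throws SQLException {
    stmt.setString(1, url);
    stmt.setLong(2, count);
  }

  public void readFields(ResultSet rs) throws SQLException {
    url = rs.getString(1);
    count = rs.getLong(2);
  }

  // Writable: lets Hadoop serialize the record between map and reduce.
  public void write(DataOutput out) throws IOException {
    out.writeUTF(url);
    out.writeLong(count);
  }

  public void readFields(DataInput in) throws IOException {
    url = in.readUTF();
    count = in.readLong();
  }

  // Job wiring: the reducer emits PageviewRecord keys, and DBOutputFormat
  // turns each one into an INSERT into the named table.
  public static void configureOutput(JobConf conf) {
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost.example.com/analytics", "dbuser", "dbpass");
    DBOutputFormat.setOutput(conf, "pageviews", "url", "views");
    conf.setOutputFormat(DBOutputFormat.class);
    conf.setOutputKeyClass(PageviewRecord.class);
    conf.setOutputValueClass(NullWritable.class);
  }
}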

>
>
>
>
> NOTE: I would like to use Java for any of above implementations.
>
>
>
> Thanks in advance,
>
> Shravan Kumar. M
>
> Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
>
> -
>
> This email and any files transmitted with it are confidential and intended
> solely for the use of the individual or entity to whom they are addressed.
> If you have received this email in error please notify the system
> administrator -  
> netopshelpd...@catalytic.com
>
>
>
>


Re: Wich is better Namenode and JobTracker run in different server or not?

2009-07-03 Thread Alex Loddengaard
It's unnecessary to run the NN and JT daemons on separate machines in small
clusters with more than three nodes.  You'll only have performance benefits
by putting these daemons on separate machines if you have a large (100s of
nodes) cluster.  It makes sense to separate the NN and JT daemons in a three
node cluster as well, assuming each node is a DataNode and TaskTracker as
well.

Hope this clears things up.

Alex

On Fri, Jul 3, 2009 at 3:37 AM, calikus  wrote:

>
> Hi,
>
> I wonder which is better, Namenode and JobTracker run in different server
> or
> not?
>
>
> --
> View this message in context:
> http://www.nabble.com/Wich-is-better-Namenode-and-JobTracker-run-in-different-server-or-not--tp24321039p24321039.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: My secondary namenode seem not be running, and may be the reason of my problem!!!

2009-07-03 Thread Alex Loddengaard
Hi,

It's unclear exactly what the problem is, so you should try and follow the
getting started guide more closely:

<
http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
>

You should get a single-node cluster working before you try and get a
multi-node cluster.

Good luck!

Alex

On Fri, Jul 3, 2009 at 2:21 AM, C J  wrote:

> Hello everyone,
> I have installed hadoop 0.18.3 on three Linux machines, and I am trying to
> run the
> example of WordCount v1.0 on a cluster. But I guess I have a problem
> somewhere.
>
> *Problem*
>
> *After formatting the name node:*
> I am getting several STARTUP_MSG lines and at the end a "SHUTDOWN_MSG: shutting
> down the namenode..."
> Is this normal?
>
> *Afterwards I try starting the dfs:*
> I get a message "starting namenode..."; afterwards I get another message
> "starting secondary namenode".
> At this stage the shell is blocked unless I press the enter key. Then the
> system tries to start another secondary
> namenode and the shell is then not blocked. What is going on?
>
> *Then I proceed and try starting the mapred:*
> I get the two messages "starting jobtracker." and "starting
> tasktracker"
>
> *Following the tutorial for running WordCount v1.0, if I try to list the
> files
> in the input folder I have created,*
> I get the famous error "Retrying connect to server: 134.130.222.20:9000".
> What am I doing wrong?
>
>
> *Steps I have already verified*
>
> *I have already checked the iptables of the three machines and they look like this:*
>
> Chain INPUT (policy ACCEPT)
> target     prot opt source     destination
>
> Chain FORWARD (policy ACCEPT)
> target     prot opt source     destination
>
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source     destination
>
> *My hadoop-site.xml file looks like this:*
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>134.130.222.20:9000/</value>
>   </property>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>134.130.222.18:9001</value>
>   </property>
> </configuration>
>
> *Can someone help me out?*
> *Thank you, CJ*
>


Re: Unsusbscribe

2009-07-01 Thread Alex Loddengaard
Try emailing common-user-unsubscr...@hadoop.apache.org

Alex

On Wed, Jul 1, 2009 at 3:25 PM, Himanshu Vijay  wrote:

>
>


Re: Permissions needed to run RandomWriter ?

2009-06-29 Thread Alex Loddengaard
Make sure /user/smulcahy exists in HDFS.  Also, make sure that
/hadoop/mapred/system in HDFS is 733 and owned by hadoop:supergroup.

Let me know if this doesn't work for you.  Also, what version of Hadoop are
you running?

Hope this helps!

Alex

On Mon, Jun 29, 2009 at 1:11 AM, stephen mulcahy
wrote:

> Alex Loddengaard wrote:
>
>> Have you tried to run the example job as the superuser?  It seems like
>> this
>> might be an issue where hadoop.tmp.dir doesn't have the correct
>> permissions.  hadoop.tmp.dir and dfs.data.dir should be owned by the unix
>> user running your Hadoop daemons and owner-writable and readable.
>>
>> Can you confirm this is the case?  Thanks,
>>
>
> Hi Alex,
>
> The RandomWriter example runs without any problems when run as the hadoop
> user (i.e. the superuser / user that runs the hadoop daemons).
>
> hadoop.tmp.dir permissions
>
> smulc...@hadoop01:~$ ls -la /data1/hadoop-tmp/
> total 16
> drwxr-xr-x 4 hadoop hadoop 4096 2009-06-19 14:01 .
> drwxr-xr-x 5 root   root   4096 2009-06-19 10:12 ..
> drwxr-xr-x 4 hadoop hadoop 4096 2009-06-19 10:16 dfs
> drwxr-xr-x 3 hadoop hadoop 4096 2009-06-19 10:49 mapred
>
>
>
> smulc...@hadoop01:~$ ls -la /data?/hdfs
> /data1/hdfs:
> total 8
> drwxr-xr-x 2 hadoop hadoop 4096 2009-06-19 10:12 .
> drwxr-xr-x 5 root   root   4096 2009-06-19 10:12 ..
>
> /data2/hdfs:
> total 8
> drwxr-xr-x 2 hadoop hadoop 4096 2009-06-19 10:12 .
> drwxr-xr-x 4 root   root   4096 2009-06-19 10:12 ..
>
> Does hadoop.tmp.dir need to be writeable by all users running hadoop jobs?
>
>
> -stephen
>
> --
> Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
> NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
> http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com
>