Re: CUDA on Hadoop

2011-02-10 Thread Lance Norskog
If you want to use Python, one of the Py+CUDA projects generates CUDA
C from the Python byte-codes. You don't have to write any C. I don't
remember which project it is.

This lets you debug the CUDA code in isolation, then run it from the
Hadoop streaming mode.


On 2/9/11, Adarsh Sharma adarsh.sha...@orkash.com wrote:
 He Chen wrote:
 Hi sharma

 I shared our slides about CUDA performance on Hadoop clusters. Feel
 free to modify them; please mention the copyright!

 Chen

 On Wed, Feb 9, 2011 at 11:13 AM, He Chen airb...@gmail.com
 mailto:airb...@gmail.com wrote:

 Hi  Sharma

 I have some experience working with hybrid Hadoop + GPU setups. Our
 group has tested CUDA performance on Hadoop clusters. We obtained a 20
 times speedup and saved up to 95% power consumption in some
 computation-intensive test cases.

 You can parallelize your Java code by using JCUDA, which is an
 API that helps you call CUDA from your Java code.

 Chen
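
For anyone curious what calling CUDA from Java via JCuda looks like, below is a minimal, untested sketch of a host-to-device round trip (no kernel launch). The class name and array contents are made up for illustration, and it assumes the JCuda runtime bindings and native libraries are installed.

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;

public class JCudaRoundTrip {
    public static void main(String[] args) {
        int n = 1024;
        float[] host = new float[n];
        for (int i = 0; i < n; i++) host[i] = i;

        // Allocate device memory and copy the host array onto the GPU.
        Pointer device = new Pointer();
        JCuda.cudaMalloc(device, n * Sizeof.FLOAT);
        JCuda.cudaMemcpy(device, Pointer.to(host), n * Sizeof.FLOAT,
                cudaMemcpyKind.cudaMemcpyHostToDevice);

        // ... a kernel launch (via the JCuda driver API) would go here ...

        // Copy the data back and free the device buffer.
        float[] result = new float[n];
        JCuda.cudaMemcpy(Pointer.to(result), device, n * Sizeof.FLOAT,
                cudaMemcpyKind.cudaMemcpyDeviceToHost);
        JCuda.cudaFree(device);
        System.out.println("first element back from GPU: " + result[0]);
    }
}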


 On Wed, Feb 9, 2011 at 8:45 AM, Steve Loughran ste...@apache.org
 mailto:ste...@apache.org wrote:

 On 09/02/11 13:58, Harsh J wrote:

 You can check-out this project which did some work for
 Hama+CUDA:
 http://code.google.com/p/mrcl/


 Amazon lets you bring up a Hadoop cluster on machines with GPUs
 you can code against, but I haven't heard of anyone using it.
 The big issue is bandwidth; it just doesn't make sense for a
 classic "scan through the logs" kind of problem, as the
 disk:GPU bandwidth ratio is even worse than disk:CPU.

 That said, if you were doing something that involved a lot of
 compute on a block of data (e.g. rendering tiles in a map),
 this could work.



 Thanks Chen, I am looking for some white papers on this topic
 or related ones.
 I think no one has written any white paper on this topic, or maybe I'm wrong.

 However, your PPT is very nice.
 Thanks once again.

 Adarsh



-- 
Lance Norskog
goks...@gmail.com


Is there any smart ways to give arguments to mappers reducers from a main job?

2011-02-10 Thread Jun Young Kim

Hi, all

In my job, I want to pass some arguments to mappers and reducers from the
main job.

I googled some references that suggest doing that with Configuration,
but it's not working.

code)

job)
Configuration conf = new Configuration();
conf.set("test", "value");

mapper)

doMap() extends Mapper... {
System.out.println(context.getConfiguration().get("test"));
/// -- this printed out null
}

How can I make this work?
--

Junyoung Kim (juneng...@gmail.com)



Re: why is it invalid to have non-alphabet characters as a result of MultipleOutputs?

2011-02-10 Thread Jun Young Kim

OK, thanks for your replies.

I decided to use '00' as a delimiter. :(

Junyoung Kim (juneng...@gmail.com)


On 02/09/2011 01:46 AM, David Rosenstrauch wrote:

On 02/08/2011 05:01 AM, Jun Young Kim wrote:

Hi,

MultipleOutputs supports named outputs for the results of a Hadoop job,
but it has an inconvenient restriction.

Only alphanumeric characters are valid in a named output name:

A ~ Z
a ~ z
0 ~ 9

are the only characters we can use.

I believe it would be more convenient if I could use other characters
like '.' or '_'.


There's already a bug report open for this.

https://issues.apache.org/jira/browse/MAPREDUCE-2293

DR
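
For reference, a minimal sketch of the 0.20 mapred API usage under that restriction; "errors00" is a hypothetical named output built with the '00' delimiter mentioned above, and the key/value types are just examples.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class NamedOutputReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private MultipleOutputs mos;

    @Override
    public void configure(JobConf job) {
        mos = new MultipleOutputs(job);
    }

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            // Only [A-Za-z0-9] is accepted in the named output name.
            mos.getCollector("errors00", reporter).collect(key, values.next());
        }
    }

    @Override
    public void close() throws IOException {
        mos.close();
    }
}

In the driver, the named output would be registered before submission with something like MultipleOutputs.addNamedOutput(conf, "errors00", TextOutputFormat.class, Text.class, Text.class).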


Re: Is there any smart ways to give arguments to mappers reducers from a main job?

2011-02-10 Thread Harsh J
Your 'Job' must reference this Configuration object for it to take
those values. If it does not know about it, it would not work,
logically :-)

For example, create your Configuration and set things into it, and
only then do new Job(ConfigurationObj) to make it use your configured
object for this job.
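
A minimal sketch of that ordering (new mapreduce API; the property name and class names are just examples, not from the original thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ParamPassingExample {

    public static class EchoMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void setup(Context context) {
            // Reads the value that was set on the driver's Configuration.
            System.out.println(context.getConfiguration().get("test")); // "value"
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("test", "value");                 // set BEFORE constructing the Job
        Job job = new Job(conf, "param passing example");
        job.setJarByClass(ParamPassingExample.class);
        job.setMapperClass(EchoMapper.class);
        // ... input/output paths and formats would be configured here ...
        job.waitForCompletion(true);
    }
}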

On Thu, Feb 10, 2011 at 3:19 PM, Jun Young Kim juneng...@gmail.com wrote:
 Hi, all

 in my job, I wanna pass some arguments to mappers and reducers from a main
 job.

 I googled some references to do that by using Configuration.

 but, it's not working.

 code)

 job)
 Configuration conf = new Configuration();
 conf.set("test", "value");

 mapper)

 doMap() extends Mapper... {
 System.out.println(context.getConfiguration().get("test"));
 /// -- this printed out null
 }

 How can I make this work?

 Junyoung Kim (juneng...@gmail.com)





-- 
Harsh J
www.harshj.com


Re: Is there any smart ways to give arguments to mappers reducers from a main job?

2011-02-10 Thread li ping
Correct.
Just like this:

Configuration conf = new Configuration();
conf.setStrings("test", "test");
Job job = new Job(conf, "job name");

On Thu, Feb 10, 2011 at 6:42 PM, Harsh J qwertyman...@gmail.com wrote:

 Your 'Job' must reference this Configuration object for it to take
 those values. If it does not know about it, it would not work,
 logically :-)

 For example, create your Configuration and set things into it, and
 only then do new Job(ConfigurationObj) to make it use your configured
 object for this job.

 On Thu, Feb 10, 2011 at 3:19 PM, Jun Young Kim juneng...@gmail.com
 wrote:
  Hi, all
 
  in my job, I wanna pass some arguments to mappers and reducers from a
 main
  job.
 
  I googled some references to do that by using Configuration.
 
  but, it's not working.
 
  code)
 
  job)
  Configuration conf = new Configuration();
  conf.set("test", "value");

  mapper)

  doMap() extends Mapper... {
  System.out.println(context.getConfiguration().get("test"));
  /// -- this printed out null
  }

  How can I make this work?
 
  Junyoung Kim (juneng...@gmail.com)
 
 



 --
 Harsh J
 www.harshj.com




-- 
-李平


Re: CUDA on Hadoop

2011-02-10 Thread Steve Loughran

On 09/02/11 17:31, He Chen wrote:

Hi sharma

I shared our slides about CUDA performance on Hadoop clusters. Feel free to
modify them; please mention the copyright!


This is nice. If you stick it up online you should link to it from the 
Hadoop wiki pages -maybe start a hadoop+cuda page and refer to it




Re: CUDA on Hadoop

2011-02-10 Thread Adarsh Sharma

Steve Loughran wrote:

On 09/02/11 17:31, He Chen wrote:

Hi sharma

I shared our slides about CUDA performance on Hadoop clusters. Feel
free to modify them; please mention the copyright!


This is nice. If you stick it up online you should link to it from the 
Hadoop wiki pages -maybe start a hadoop+cuda page and refer to it


Yes, this will be very helpful for others too. But this much
information is not sufficient; more detail is needed.




Best Regards

Adarsh Sharma





some doubts Hadoop MR

2011-02-10 Thread Matthew John
Hi all,

I had some doubts regarding the functioning of Hadoop MapReduce :

1) I understand that every MapReduce job is parameterized using an XML file
(with all the job configurations). So whenever I set certain parameters
using my MR code (say I set the split size to 32 KB) it does get reflected
in the job (number of mappers). How exactly does that happen? Do the
parameters coded in the MR module override the default parameters set in the
configuration XML? And how does the JobTracker ensure that the
configuration is followed by all the TaskTrackers? What is the mechanism
followed?

2) Assume I am running cascading (chained) MR modules. In this case I feel
there is a huge overhead when the output of MR1 is written back to HDFS and then
read from there as the input of MR2. Can this be avoided? (Maybe store it in
some memory without hitting HDFS and the NameNode.) Please let me know if
there is some means of doing this, because it would increase the
efficiency of chained MR to a great extent.

Matthew


Re: some doubts Hadoop MR

2011-02-10 Thread Harsh J
Hello,

On Thu, Feb 10, 2011 at 5:16 PM, Matthew John
tmatthewjohn1...@gmail.com wrote:
 Hi all,

 I had some doubts regarding the functioning of Hadoop MapReduce :

 1) I understand that every MapReduce job is parameterized using an XML file
 (with all the job configurations). So whenever I set certain parameters
 using my MR code (say I set splitsize to be 32kb) it does get reflected
 in the job (number of mappers). How exactly does that happen ? Does the
 parameters coded in the MR module override the default parameters set in the
 configuration XML ? And how does the JobTracker ensure that the
 configuration is followed by all the TaskTrackers ? What is the mechanism
 followed ?

Yes, your configurations are applied over the defaults that are loaded
from Hadoop's core/etc jars.

A job is represented by its job file + jars/files, where the job file
is the 'job.xml' produced by the configuration saving mechanism,
performed upon submission of a Job. This file is distributed to all
workers to read and utilize, by the JobTracker as part of its
submission and localization process. I suggest reading Hadoop's source
code from the submit call upwards.
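
A small sketch of that layering, assuming 0.20 property names (mapred.reduce.tasks is just an example): values set in code are applied over whatever the XML defaults provide, and the merged result is what lands in job.xml.

import org.apache.hadoop.mapred.JobConf;

public class ConfigOverrideDemo {
    public static void main(String[] args) {
        // JobConf loads the bundled *-default.xml plus any *-site.xml overrides.
        JobConf job = new JobConf();
        System.out.println("from XML: " + job.get("mapred.reduce.tasks")); // typically "1"

        // A code-level setting overrides the XML value; this merged view is
        // what gets serialized into job.xml at submission time.
        job.setNumReduceTasks(4);
        System.out.println("in code:  " + job.get("mapred.reduce.tasks")); // "4"
    }
}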

 2) Assume I am running cascading (chained) MR modules. In this case I feel
 there is a huge overhead when output of MR1 is written back to HDFS and then
 read from there as input of MR2.Can this be avoided ? (maybe store it in
 some memory without hitting the HDFS and NameNode ) Please let me know if
 there s some means of exercising this because it will increase the
 efficiency of chained MR to a great extent.

It is not possible to pipeline this in Apache Hadoop. Have a look at HOP (the
Hadoop Online Prototype), which has some of what you seek.

-- 
Harsh J
www.harshj.com


Re: Hadoop Multi user - Cluster Setup

2011-02-10 Thread Piyush Joshi
Hey Amit, please try HOD, the Hadoop on Demand tool. This should suffice for your
need of supporting multiple users on your cluster.

-Piyush

On Thu, Feb 10, 2011 at 12:42 AM, Kumar, Amit H. ahku...@odu.edu wrote:

 Dear All,

 I am trying to set up Hadoop for multiple users in a class, on our cluster.
 For some reason I don't seem to get it right. If only one user is running, it
 works great.
 I would want all of the users to submit Hadoop jobs to the existing
 DataNode and cluster; not sure if this is right.
 Do I need to start a DataNode for every user? If so, I was not able to do so
 because I ran into issues of the port already being in use.
 Please advise. Below are a few of the config files.

 Also I have tried searching for other documents that tell us to create a
 user hadoop and a group hadoop and then start the daemons as the hadoop
 user. This didn't work for me as well. I am sure I am doing something
 wrong. Could anyone please throw in some more ideas?

 =List of env changed in Hadoop-env.sh:
 export HADOOP_LOG_DIR=/scratch/$USER/hadoop-logs
 export HADOOP_PID_DIR=/scratch/$USER/.var/hadoop/pids

 #cat core-site.xml
 <configuration>
 <property>
 <name>fs.default.name</name>
 <value>hdfs://frontend:9000</value>
 </property>
 <property>
 <name>hadoop.tmp.dir</name>
 <value>/scratch/${user.name}/hadoop-FS</value>
 <description>A base for other temporary directories.</description>
 </property>
 </configuration>

 # cat hdfs-site.xml
 <configuration>
 <property>
 <name>dfs.replication</name>
 <value>1</value>
 </property>
 <property>
 <name>dfs.name.dir</name>
 <value>/scratch/${user.name}/.hadoop/.transaction/.edits</value>
 </property>
 </configuration>

 # cat mapred-site.xml
 <configuration>
 <property>
 <name>mapred.job.tracker</name>
 <value>frontend:9001</value>
 </property>
 <property>
 <name>mapreduce.tasktracker.map.tasks.maximum</name>
 <value>2</value>
 </property>
 <property>
 <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
 <value>2</value>
 </property>
 </configuration>


 Thank you,
 Amit





Re: Could not add a new data node without rebooting Hadoop system

2011-02-10 Thread 안의건
Dear Harsh,

Your advice gave me insight, and I finally solved my problem.

I'm not sure this is the correct way, but anyway it worked in my situation.

I hope it is helpful to someone else who has a similar problem.



hadoop/conf
slaves update
*.xml update

hadoop/bin start-dfs.sh
hadoop/bin start-mapred.sh

--


Regards,
Henny (ahneui...@gmail.com)

2011/2/7 Harsh J qwertyman...@gmail.com

 On Mon, Feb 7, 2011 at 5:16 PM, ahn ahneui...@gmail.com wrote:
  Hello everybody
  1. configure conf/slaves and *.xml files on master machine
 
  2. configure conf/master and *.xml files on slave machine

 'slaves' and 'masters' files are generally only required on the master
 machine, and only if you are using the start-* scripts supplied with
 Hadoop for use with SSH (the FAQ has an entry on this) from the master.

  3. run ${HADOOP}/bin/hadoop datanode
  But when I ran the commands on the master node, the master node was
  recognized as a data node.

 Step 3 wasn't a valid command in this case; use start-dfs.sh.

  When I ran the commands on the data node which I want to add, the data
 node
  was not properly added.(The number of total data node didn't show any
  change)

 What do the logs say for the DataNode on the slave? Does it start
 successfully? If fs.default.name is set properly in slave's
 core-site.xml it should be able to communicate properly if started
 (and if the version is not mismatched).

 --
 Harsh J
 www.harshj.com



hadoop 0.20 append - some clarifications

2011-02-10 Thread Gokulakannan M
Hi All,

I have run the hadoop 0.20 append branch. Can someone please clarify the
following behavior?

A writer is writing a file but has not flushed the data and has not closed the
file. Can a parallel reader read this partial file?

For example,

1. a writer is writing a 10MB file (block size 2 MB)

2. it wrote the file up to 5MB (2 finalized blocks + 1 blockBeingWritten); note
that the writer is not calling FsDataOutputStream sync() at all

3. now a reader tries to read the above partially written file

I can see that the reader is able to see the partially
written 5MB of data, but I feel the reader should be able to see the data only
after the writer calls the sync() API.

Is this the correct behavior, or is my understanding wrong?

 

 Thanks,

  Gokul



Re: hadoop 0.20 append - some clarifications

2011-02-10 Thread Ted Dunning
Correct is a strong word here.

There is actually an HDFS unit test that checks to see if partially written
and unflushed data is visible.  The basic rule of thumb is that you need to
synchronize readers and writers outside of HDFS.  There is no guarantee that
data is visible or invisible after writing, but there is a guarantee that it
will become visible after sync or close.
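
A minimal writer-side sketch of that guarantee, assuming the 0.20-append branch (the path and strings are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/tmp/sync-demo.txt"));

        out.writeBytes("record 1\n"); // may or may not be visible to readers yet
        out.sync();                   // guaranteed visible to readers after this point
        out.writeBytes("record 2\n"); // visibility undefined until the next sync/close
        out.close();                  // everything is visible after close
    }
}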

On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M gok...@huawei.com wrote:

 Is this the correct behavior or my understanding is wrong?



Re: MRUnit and Herriot

2011-02-10 Thread Edson Ramiro
Hi,

I took a look around on the Internet, but I didn't find any docs about
MiniDFS and MiniMRCluster. Are there docs about them?

It reminds me of this phrase I got from the Herriot [1] page:
"As always your best source of information and knowledge about any software
system is its source code :)"

Do you think it is possible to have just one tool to cover all kinds of tests?

Another question: do you know if it is possible to evaluate an MR program, e.g.
sort, with Herriot using several sets of test data?

Thanks in Advance

--
Edson Ramiro Lucas Filho
{skype, twitter, gtalk}: erlfilho
http://www.inf.ufpr.br/erlf07/


On Mon, Feb 7, 2011 at 10:29 PM, Konstantin Boudnik c...@apache.org wrote:

 On Mon, Feb 7, 2011 at 04:20, Edson Ramiro erlfi...@gmail.com wrote:
  Well, I'm studying the Hadoop test tools to evaluate some (if there are)
  deficiences, also trying to compare these tools to see what one cover
 that
  other doesn't and what is possible to do with each one.

 There's also a simulated test cluster infrastructure called MiniDFS
 and MiniMRCluster to allow you to develop functional tests without
 actual cluster deployment.

  As far as I know we have just Herriot and MRUnit for test, and them do
  different things as you said me :)
 
  I'm very interested in your initial version, is there a link?

 Not at the moment, but I will send it here as soon as a initial
 version is pushed out.

 
  Thanks in advance
 
  --
  Edson Ramiro Lucas Filho
  {skype, twitter, gtalk}: erlfilho
  http://www.inf.ufpr.br/erlf07/
 
 
  On Fri, Feb 4, 2011 at 3:40 AM, Konstantin Boudnik c...@apache.org
 wrote:
 
  Yes, Herriot can be used for integration tests of MR. Unit test is a
 very
  different thing and normally is done against a 'unit of compilation'
 e.g. a
  class, etc. Typically you won't expect to do unit tests against a
 deployed
  cluster.
 
  There is fault injection framework wich works at the level of functional
  tests
  (with mini-clusters). Shortly we'll be opening an initial version of
 smoke
  and
  integration test framework (maven and JUnit based).
 
  It'd be easier to provide you with a hint if you care to explain what
  you're
  trying to solve.
 
  Cos
 
  On Thu, Feb 03, 2011 at 10:25AM, Edson Ramiro wrote:
   Thank you a lot Konstantin, you cleared my mind.
  
   So, Herriot is a framework designed to test Hadoop as a whole, and
 (IMHO)
  is
   a tool for help Hadoop developers and not for who is developing MR
  programs,
   but can we use Herriot to do unit, integration or other tests on our
 MR
   jobs?
  
   Do you know another test tool or test framework for Hadoop?
  
   Thanks in Advance
  
   --
   Edson Ramiro Lucas Filho
   {skype, twitter, gtalk}: erlfilho
   http://www.inf.ufpr.br/erlf07/
  
  
   On Wed, Feb 2, 2011 at 4:58 PM, Konstantin Boudnik c...@apache.org
  wrote:
  
(Moving to common-user where this belongs)
   
Herriot is system test framework which runs against a real physical
cluster deployed with a specially crafted build of Hadoop. That
instrumented build of provides an extra APIs not available in Hadoop
otherwise. These APIs are created to facilitate cluster software
testability. Herriot isn't limited by MR but also covered (although
 in
a somewhat lesser extend) HDFS side of Hadoop.
   
MRunit is for MR job unit testing as in making sure that your MR
 job
is ok and/or to allow you to debug it locally before scale
 deployment.
   
So, long story short - they are very different ;) Herriot can do
intricate fault injection and can work closely with a deployed
 cluster
(say control Hadoop nodes and daemons); MRUnit is focused on MR jobs
testing.
   
Hope it helps.
--
  Take care,
Konstantin (Cos) Boudnik
   
   
On Wed, Feb 2, 2011 at 05:44, Edson Ramiro erlfi...@gmail.com
 wrote:
 Hi all,

 Plz, could you explain me the difference between MRUnit and
 Herriot?

 I've read the documentation of both and they seem very similar to
 me.

 Is Herriot an evolution of MRUnit?

 What can Herriot do that MRUnit can't?

 Thanks in Advance

 --
 Edson Ramiro Lucas Filho
 {skype, twitter, gtalk}: erlfilho
 http://www.inf.ufpr.br/erlf07/

   
 
 
 
 



Fwd: multiple namenode directories

2011-02-10 Thread mike anderson
-- Forwarded message --
From: mike anderson saidthero...@gmail.com
Date: Thu, Feb 10, 2011 at 11:57 AM
Subject: multiple namenode directories
To: core-u...@hadoop.apache.org


This should be a straightforward question, but better safe than sorry.

I wanted to add a second name node directory (on an NFS as a backup), so now
my hdfs-site.xml contains:

  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hadoop/name</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/public/hadoop/name</value>
  </property>


When I go to start DFS i'm getting the exception:

org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory
/public/hadoop/name is in an inconsistent state: storage directory does not
exist or is not accessible.


After googling a bit, it seems like I want to do bin/hadoop namenode
-format

Is this right? As long as I shut down DFS before issuing the command I
shouldn't lose any data?

Thanks in advance,
Mike


multiple namenode directories

2011-02-10 Thread mike anderson
This should be a straightforward question, but better safe than sorry.

I wanted to add a second name node directory (on an NFS as a backup), so now
my hdfs-site.xml contains:

  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hadoop/name</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/public/hadoop/name</value>
  </property>


When I go to start DFS i'm getting the exception:

org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory
/public/hadoop/name is in an inconsistent state: storage directory does not
exist or is not accessible.


After googling a bit, it seems like I want to do bin/hadoop namenode
-format

Is this right? As long as I shut down DFS before issuing the command I
shouldn't lose any data?

Thanks in advance,
Mike


Re: CUDA on Hadoop

2011-02-10 Thread He Chen
Thank you Steve Loughran. I just created a new page on Hadoop wiki, however,
how can I create a new document page on Hadoop Wiki?

Best wishes

Chen

On Thu, Feb 10, 2011 at 5:38 AM, Steve Loughran ste...@apache.org wrote:

 On 09/02/11 17:31, He Chen wrote:

 Hi sharma

 I shared our slides about CUDA performance on Hadoop clusters. Feel free
 to modify them; please mention the copyright!


 This is nice. If you stick it up online you should link to it from the
 Hadoop wiki pages -maybe start a hadoop+cuda page and refer to it




Re: multiple namenode directories

2011-02-10 Thread Harsh J
DO NOT format your NameNode. Formatting a NameNode is equivalent to
formatting a FS -- you're bound to lose it all.

And while messing with NameNode, after bringing it down safely, ALWAYS
take a backup of the existing dfs.name.dir contents and preferably the
SNN checkpoint directory contents too (if you're running it).

The RIGHT way to add new directories to the NameNode's dfs.name.dir is
by comma-separating them in the same value and NOT by adding two
properties - that is not how Hadoop's configuration operates. In your
case, bring NN down and edit conf as:

  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hadoop/name,/public/hadoop/name</value>
  </property>

Create the new directory by copying the existing one. Both must have
the SAME file and structure in them, like mirror copies of one
another. Ensure that this new location, apart from being symmetric in
content, is also symmetric in permissions. NameNode will require WRITE
permissions via its user on all locations configured.

Having configured properly and ensured that both storage directories
mirror one another, launch your NameNode back up again (feel a little
paranoid and do check namenode logs for any issues -- in which case
your backup would be very essential as a requirement for recovery!).

P.s. Hold on for a bit for a possible comment from another user before
getting into action. I've added extra directories this way, but I do
not know if this is the genuine way to do so - although it feels
right to me.

On Thu, Feb 10, 2011 at 10:27 PM, mike anderson saidthero...@gmail.com wrote:
 This should be a straightforward question, but better safe than sorry.

 I wanted to add a second name node directory (on an NFS as a backup), so now
 my hdfs-site.xml contains:

  <property>
     <name>dfs.name.dir</name>
     <value>/mnt/hadoop/name</value>
  </property>
  <property>
     <name>dfs.name.dir</name>
     <value>/public/hadoop/name</value>
  </property>


 When I go to start DFS i'm getting the exception:

 org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory
 /public/hadoop/name is in an inconsistent state: storage directory does not
 exist or is not accessible.


 After googling a bit, it seems like I want to do bin/hadoop namenode
 -format

 Is this right? As long as I shut down DFS before issuing the command I
 shouldn't lose any data?

 Thanks in advance,
 Mike




-- 
Harsh J
www.harshj.com


Map reduce streaming unable to partition

2011-02-10 Thread Kelly Burkhart
Hi,

I'm trying to get partitioning working from a streaming map/reduce
job.  I'm using hadoop r0.20.2.

Consider the following files, both in the same hdfs directory:

f1:
01:01:01TABa,a,a,a,a,1
01:01:02TABa,a,a,a,a,2
01:02:01TABa,a,a,a,a,3
01:02:02TABa,a,a,a,a,4
02:01:01TABa,a,a,a,a,5
02:01:02TABa,a,a,a,a,6
02:02:01TABa,a,a,a,a,7
02:02:02TABa,a,a,a,a,8

f2:
01:01:01TABb,b,b,b,b,1
01:01:02TABb,b,b,b,b,2
01:02:01TABb,b,b,b,b,3
01:02:02TABb,b,b,b,b,4
02:01:01TABb,b,b,b,b,5
02:01:02TABb,b,b,b,b,6
02:02:01TABb,b,b,b,b,7
02:02:02TABb,b,b,b,b,8

I execute the following command:

hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D stream.map.output.field.separator=: \
  -D stream.num.map.output.key.fields=3 \
  -D map.output.key.field.separator=: \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -input /tmp/krb/part \
  -output /tmp/krb/mp \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

(actually I've executed about a zillion permutations of various -D arguments...)

I end up with a single file sorted by the entire key, exactly what I
expect if no partitioning at all is going on.  What I'm hoping to end
up with is two output files, each file has the first component of the
key in common:

01:01:01TABa,a,a,a,a,1
01:01:01TABb,b,b,b,b,1
01:01:02TABa,a,a,a,a,2
01:01:02TABb,b,b,b,b,2
01:02:01TABa,a,a,a,a,3
01:02:01TABb,b,b,b,b,3
01:02:02TABa,a,a,a,a,4
01:02:02TABb,b,b,b,b,4

Can anyone suggest a command that may partition files as I describe?

Also, it seems that the API has changed considerably from my version
0.20.x to the latest version r0.21.  Is 0.20 expected to work?  Or were
there some fatal issues that forced major changes resulting in release
0.21?

Thanks,

-Kelly


RE: Hadoop Multi user - Cluster Setup

2011-02-10 Thread Kumar, Amit H.
Li Ping: Disabling dfs.permissions did the trick!

I have the following questions, if you can help me understand this better:
1. I am not sure what the consequences are of disabling it, or even of doing chmod o+w
on the entire filesystem (/).
2. Is there any need to have the permissions in place, other than securing
users from each other's work?
3. Is it still possible to have HDFS permissions enabled and yet be able to
run multiple users submitting jobs to a common pool of resources?

Thank you so much for your help!
Amit


 -Original Message-
 From: li ping [mailto:li.j...@gmail.com]
 Sent: Wednesday, February 09, 2011 9:00 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Hadoop Multi user - Cluster Setup
 
 You can check this property in hdfs-site.xml:

 <property>
   <name>dfs.permissions</name>
   <value>true</value>
   <description>
     If true, enable permission checking in HDFS.
     If false, permission checking is turned off,
     but all other behavior is unchanged.
     Switching from one parameter value to the other does not change the mode,
     owner or group of files or directories.
   </description>
 </property>
 
 You can disable this option.
 
 the second way is:
 running this command in hadoop: hadoop fs -chmod o+w /
 It has the same effect as the first one.
 
 On Thu, Feb 10, 2011 at 3:12 AM, Kumar, Amit H. ahku...@odu.edu
 wrote:
 
  Dear All,
 
  I am trying to setup Hadoop for multiple users in a class, on our
 cluster.
  For some reason I don't seem to get it right. If only one user is
 running it
  works great.
  I would want to have all of the users submit a Hadoop job to the
 existing
  DataNode and on the cluster, not sure if this is right.
  Do I need to start a DataNode for every user, if so I was not able to
 do
  because I ran into issues of port already being used.
  Please advise. Below are few of the config files.
 
  Also I have tired searching for other documents, that tell us to
 create a
  user Hadoop and a group Hadoop and then start the daemons as
 Hadoop
  user. This didn't work for me as well.  I am sure I am doing
 something
  wrong. Could anyone please thrown in some more ideas.
 
  =List of env changed in Hadoop-env.sh:
  export HADOOP_LOG_DIR=/scratch/$USER/hadoop-logs
  export HADOOP_PID_DIR=/scratch/$USER/.var/hadoop/pids
 
  #cat core-site.xml
  <configuration>
  <property>
  <name>fs.default.name</name>
  <value>hdfs://frontend:9000</value>
  </property>
  <property>
  <name>hadoop.tmp.dir</name>
  <value>/scratch/${user.name}/hadoop-FS</value>
  <description>A base for other temporary directories.</description>
  </property>
  </configuration>

  # cat hdfs-site.xml
  <configuration>
  <property>
  <name>dfs.replication</name>
  <value>1</value>
  </property>
  <property>
  <name>dfs.name.dir</name>
  <value>/scratch/${user.name}/.hadoop/.transaction/.edits</value>
  </property>
  </configuration>

  # cat mapred-site.xml
  <configuration>
  <property>
  <name>mapred.job.tracker</name>
  <value>frontend:9001</value>
  </property>
  <property>
  <name>mapreduce.tasktracker.map.tasks.maximum</name>
  <value>2</value>
  </property>
  <property>
  <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  </property>
  </configuration>
 
 
  Thank you,
  Amit
 
 
 
 
 
 --
 -李平
 
 



Re: multiple namenode directories

2011-02-10 Thread Harsh J
The links appeared outdated, so I've updated them to reflect the current
release 0.21's configurations. The configuration descriptions properly
describe the way to set them 'right'.

For 0.20 releases, only the configuration name changes:
dfs.name.dir instead of dfs.namenode.name.dir, and
dfs.data.dir instead of dfs.datanode.data.dir

The value formatting remains the same.

On Thu, Feb 10, 2011 at 11:18 PM, mike anderson saidthero...@gmail.com wrote:
 Whew, glad I asked.

 It might be useful for someone to update the wiki:
 http://wiki.apache.org/hadoop/FAQ#How_do_I_set_up_a_hadoop_node_to_use_multiple_volumes.3F

 -Mike

 On Thu, Feb 10, 2011 at 12:43 PM, Harsh J qwertyman...@gmail.com wrote:

 DO NOT format your NameNode. Formatting a NameNode is equivalent to
 formatting a FS -- you're bound lose it all.

 And while messing with NameNode, after bringing it down safely, ALWAYS
 take a backup of the existing dfs.name.dir contents and preferably the
 SNN checkpoint directory contents too (if you're running it).

 The RIGHT way to add new directories to the NameNode's dfs.name.dir is
 by comma-separating them in the same value and NOT by adding two
 properties - that is not how Hadoop's configuration operates. In your
 case, bring NN down and edit conf as:

    <property>
      <name>dfs.name.dir</name>
      <value>/mnt/hadoop/name,/public/hadoop/name</value>
    </property>

 Create the new directory by copying the existing one. Both must have
 the SAME file and structure in them, like mirror copies of one
 another. Ensure that this new location, apart from being symmetric in
 content, is also symmetric in permissions. NameNode will require WRITE
 permissions via its user on all locations configured.

 Having configured properly and ensured that both storage directories
 mirror one another, launch your NameNode back up again (feel a little
 paranoid and do check namenode logs for any issues -- in which case
 your backup would be very essential as a requirement for recovery!).

 P.s. Hold on for a bit for a possible comment from another user before
 getting into action. I've added extra directories this way, but I do
 not know if this is the genuine way to do so - although it feels
 right to me.

 On Thu, Feb 10, 2011 at 10:27 PM, mike anderson saidthero...@gmail.com
 wrote:
  This should be a straightforward question, but better safe than sorry.
 
  I wanted to add a second name node directory (on an NFS as a backup), so
 now
  my hdfs-site.xml contains:
 
    <property>
      <name>dfs.name.dir</name>
      <value>/mnt/hadoop/name</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/public/hadoop/name</value>
    </property>
 
 
  When I go to start DFS i'm getting the exception:
 
  org.apache.hadoop.hdfs.server.common.InconsistentFSStateException:
 Directory
  /public/hadoop/name is in an inconsistent state: storage directory does
 not
  exist or is not accessible.
 
 
  After googling a bit, it seems like I want to do bin/hadoop namenode
  -format
 
  Is this right? As long as I shut down DFS before issuing the command I
  shouldn't lose any data?
 
  Thanks in advance,
  Mike
 



 --
 Harsh J
 www.harshj.com





-- 
Harsh J
www.harshj.com


Re: Hadoop Multi user - Cluster Setup

2011-02-10 Thread Harsh J
Please read the HDFS Permissions guide which explains the
understanding required to have a working permissions model on the DFS:
http://hadoop.apache.org/hdfs/docs/current/hdfs_permissions_guide.html

On Thu, Feb 10, 2011 at 11:15 PM, Kumar, Amit H. ahku...@odu.edu wrote:
 Li Ping: Disabling dfs.permissions did the charm!.

 I have the following questions, if you can help me understand this better:
 1. Not sure what are the consequences of disabling it or even doing chmod o+w 
 on the entire filesyste(/).
 2. Is there any need to have the permissions in place, other than securing 
 users from each other's work.
 3. Is it still possible to have the hdfs permissions enabled and yet be able 
 to run multiple user submitting jobs to a common pool of resources.

 Thank you so much for your help!
 Amit


 -Original Message-
 From: li ping [mailto:li.j...@gmail.com]
 Sent: Wednesday, February 09, 2011 9:00 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Hadoop Multi user - Cluster Setup

 If can check this property in hdfs-site.xml

  <property>
    <name>dfs.permissions</name>
    <value>true</value>
    <description>
      If true, enable permission checking in HDFS.
      If false, permission checking is turned off,
      but all other behavior is unchanged.
      Switching from one parameter value to the other does not change the mode,
      owner or group of files or directories.
    </description>
  </property>

 You can disable this option.

 the second way is:
 running the command in hadoop. hadoop fs -chmod o+w /
 It has the same effect with first one

 On Thu, Feb 10, 2011 at 3:12 AM, Kumar, Amit H. ahku...@odu.edu
 wrote:

  Dear All,
 
  I am trying to setup Hadoop for multiple users in a class, on our
 cluster.
  For some reason I don't seem to get it right. If only one user is
 running it
  works great.
  I would want to have all of the users submit a Hadoop job to the
 existing
  DataNode and on the cluster, not sure if this is right.
  Do I need to start a DataNode for every user, if so I was not able to
 do
  because I ran into issues of port already being used.
  Please advise. Below are few of the config files.
 
  Also I have tired searching for other documents, that tell us to
 create a
  user Hadoop and a group Hadoop and then start the daemons as
 Hadoop
  user. This didn't work for me as well.  I am sure I am doing
 something
  wrong. Could anyone please thrown in some more ideas.
 
  =List of env changed in Hadoop-env.sh:
  export HADOOP_LOG_DIR=/scratch/$USER/hadoop-logs
  export HADOOP_PID_DIR=/scratch/$USER/.var/hadoop/pids
 
  #cat core-site.xml
   <configuration>
       <property>
           <name>fs.default.name</name>
           <value>hdfs://frontend:9000</value>
       </property>
       <property>
          <name>hadoop.tmp.dir</name>
          <value>/scratch/${user.name}/hadoop-FS</value>
          <description>A base for other temporary directories.</description>
       </property>
   </configuration>

   # cat hdfs-site.xml
   <configuration>
       <property>
           <name>dfs.replication</name>
           <value>1</value>
       </property>
       <property>
           <name>dfs.name.dir</name>
           <value>/scratch/${user.name}/.hadoop/.transaction/.edits</value>
       </property>
   </configuration>

   # cat mapred-site.xml
   <configuration>
       <property>
           <name>mapred.job.tracker</name>
           <value>frontend:9001</value>
       </property>
       <property>
           <name>mapreduce.tasktracker.map.tasks.maximum</name>
           <value>2</value>
       </property>
       <property>
           <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
           <value>2</value>
       </property>
   </configuration>
 
 
  Thank you,
  Amit
 
 
 


 --
 -李平







-- 
Harsh J
www.harshj.com


recommendation on HDDs

2011-02-10 Thread Shrinivas Joshi
What would be a good hard drive for a 7 node cluster which is targeted to
run a mix of IO and CPU intensive Hadoop workloads? We are looking for
around 1 TB of storage on each node distributed amongst 4 or 5 disks. So
either 250GB * 4 disks or 160GB * 5 disks. They should also be less than $100
each ;)

I looked at HDD benchmark comparisons on tomshardware, storagereview etc.
Got overwhelmed with the # of benchmarks and different aspects of HDD
performance.

Appreciate your help on this.

-Shrinivas


Re: recommendation on HDDs

2011-02-10 Thread Ted Dunning
Get bigger disks.  Data only grows and having extra is always good.

You can get 2TB drives for $100 and 1TB for under $75.

As far as transfer rates are concerned, any 3Gb/s SATA drive is going to be
about the same (ish).  Seek times will vary a bit with rotation speed, but
with Hadoop, you will be doing long reads and writes.

Your controller and backplane will have a MUCH bigger vote in getting
acceptable performance.  With only 4 or 5 drives, you don't have to worry
about a super-duper backplane, but you can still kill performance with a lousy
controller.

On Thu, Feb 10, 2011 at 12:26 PM, Shrinivas Joshi jshrini...@gmail.comwrote:

 What would be a good hard drive for a 7 node cluster which is targeted to
 run a mix of IO and CPU intensive Hadoop workloads? We are looking for
 around 1 TB of storage on each node distributed amongst 4 or 5 disks. So
 either 250GB * 4 disks or 160GB * 5 disks. Also it should be less than 100$
 each ;)

 I looked at HDD benchmark comparisons on tomshardware, storagereview etc.
 Got overwhelmed with the # of benchmarks and different aspects of HDD
 performance.

 Appreciate your help on this.

 -Shrinivas



Re: Map reduce streaming unable to partition

2011-02-10 Thread Kelly Burkhart
OK, I think I stumbled upon the correct incantation:

time hadoop jar
/opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D map.output.key.field.separator=: \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -D mapred.reduce.tasks=16 \
  -input /tmp/krb/part \
  -output /tmp/krb/mp \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

This will partition and sort the files as I expect, leaving me with 16
output files, 14 of which are empty and 2 non-empty.  If I increase
the number of partitions in the data so they exceed the number of
reduce tasks, multiple partitions will be written to some or all of
the output files.  I believe I can deal with that now that I
understand it, but it would be nice if the number of output files was
equal to the number of partitions in the data.

-K
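
For what it's worth, the number of output files is fixed by mapred.reduce.tasks, and the partitioner only decides which reducer each key group lands in, so distinct first fields can collide in one file while other files stay empty. A rough sketch of that hashing idea (an illustration, not the actual KeyFieldBasedPartitioner source), assuming Text keys with ':'-separated fields:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstFieldPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) {
        // nothing to configure in this sketch
    }

    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Hash only the first ':'-separated field, modulo the reducer count.
        String firstField = key.toString().split(":", 2)[0];
        return (firstField.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}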

On Thu, Feb 10, 2011 at 11:45 AM, Kelly Burkhart
kelly.burkh...@gmail.com wrote:
 Hi,

 I'm trying to get partitioning working from a streaming map/reduce
 job.  I'm using hadoop r0.20.2.

 Consider the following files, both in the same hdfs directory:

 f1:
 01:01:01TABa,a,a,a,a,1
 01:01:02TABa,a,a,a,a,2
 01:02:01TABa,a,a,a,a,3
 01:02:02TABa,a,a,a,a,4
 02:01:01TABa,a,a,a,a,5
 02:01:02TABa,a,a,a,a,6
 02:02:01TABa,a,a,a,a,7
 02:02:02TABa,a,a,a,a,8

 f2:
 01:01:01TABb,b,b,b,b,1
 01:01:02TABb,b,b,b,b,2
 01:02:01TABb,b,b,b,b,3
 01:02:02TABb,b,b,b,b,4
 02:01:01TABb,b,b,b,b,5
 02:01:02TABb,b,b,b,b,6
 02:02:01TABb,b,b,b,b,7
 02:02:02TABb,b,b,b,b,8

 I execute the following command:

 hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D stream.map.output.field.separator=: \
  -D stream.num.map.output.key.fields=3 \
  -D map.output.key.field.separator=: \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -input /tmp/krb/part \
  -output /tmp/krb/mp \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

 (actually I've executed about a zillion permutations of various -D 
 arguments...)

 I end up with a single file sorted by the entire key, exactly what I
 expect if no partitioning at all is going on.  What I'm hoping to end
 up with is two output files, each file has the first component of the
 key in common:

 01:01:01TABa,a,a,a,a,1
 01:01:01TABb,b,b,b,b,1
 01:01:02TABa,a,a,a,a,2
 01:01:02TABb,b,b,b,b,2
 01:02:01TABa,a,a,a,a,3
 01:02:01TABb,b,b,b,b,3
 01:02:02TABa,a,a,a,a,4
 01:02:02TABb,b,b,b,b,4

 Can anyone suggest a command that may partition files as I describe?

 Also, it seems that the API has changed considerably from my version
 0.20.x to the latest version r0.21.  Is 0.20 expected to work?  Or are
 there some fatal issues that forced major work resulting in release
 0.21.

 Thanks,

 -Kelly



Re: MRUnit and Herriot

2011-02-10 Thread Konstantin Boudnik
On Thu, Feb 10, 2011 at 08:39, Edson Ramiro erlfi...@gmail.com wrote:
 Hi,

 I took a look around on the Internet, but I didn't find any docs about
 MiniDFS
 and MiniMRCluster. Is there docs about them?

 It remember me this phrase I got from the Herriot [1] page.
 As always your best source of information and knowledge about any software
 system is its source code :)

Yes, this still holds ;) Source code is your best friend for a number
of reasons:
  - this is _the_ best documentation for the code and shows what an
application does
  - it is always up-to-date
  - developers can focus on their development/testing rather than
writing end-user documents about some internals (which no one but
other developers will ever need)

 Do you think is possible to have just one tool to cover all kinds of tests?

Sure, why not? I am also a big believer that a single OS would do just fine.

 Another question, do you know if is possible to evaluate a MR program, eg
 sort, with Herriot considering several test data?

Absolutely... Herriot does run workloads against a physical cluster.
So, I don't see why it would be impossible. Would it be the most effective use
of your time? Perhaps not, because Herriot requires a specially
tailored (instrumented) cluster to be executed against.

What you need, I think, is a simple way to get a jar file containing
some tests, drop it onto a cluster's gateway machine, and run them. That looks
like what we are trying to achieve in the iTest framework I have mentioned
earlier.

Cos

 Thanks in Advance

 --
 Edson Ramiro Lucas Filho
 {skype, twitter, gtalk}: erlfilho
 http://www.inf.ufpr.br/erlf07/


 On Mon, Feb 7, 2011 at 10:29 PM, Konstantin Boudnik c...@apache.org wrote:

 On Mon, Feb 7, 2011 at 04:20, Edson Ramiro erlfi...@gmail.com wrote:
  Well, I'm studying the Hadoop test tools to evaluate some (if there are)
  deficiences, also trying to compare these tools to see what one cover
 that
  other doesn't and what is possible to do with each one.

 There's also a simulated test cluster infrastructure called MiniDFS
 and MiniMRCluster to allow you to develop functional tests without
 actual cluster deployment.

  As far as I know we have just Herriot and MRUnit for test, and them do
  different things as you said me :)
 
  I'm very interested in your initial version, is there a link?

 Not at the moment, but I will send it here as soon as a initial
 version is pushed out.

 
  Thanks in advance
 
  --
  Edson Ramiro Lucas Filho
  {skype, twitter, gtalk}: erlfilho
  http://www.inf.ufpr.br/erlf07/
 
 
  On Fri, Feb 4, 2011 at 3:40 AM, Konstantin Boudnik c...@apache.org
 wrote:
 
  Yes, Herriot can be used for integration tests of MR. Unit test is a
 very
  different thing and normally is done against a 'unit of compilation'
 e.g. a
  class, etc. Typically you won't expect to do unit tests against a
 deployed
  cluster.
 
  There is fault injection framework wich works at the level of functional
  tests
  (with mini-clusters). Shortly we'll be opening an initial version of
 smoke
  and
  integration test framework (maven and JUnit based).
 
  It'd be easier to provide you with a hint if you care to explain what
  you're
  trying to solve.
 
  Cos
 
  On Thu, Feb 03, 2011 at 10:25AM, Edson Ramiro wrote:
   Thank you a lot Konstantin, you cleared my mind.
  
   So, Herriot is a framework designed to test Hadoop as a whole, and
 (IMHO)
  is
   a tool for help Hadoop developers and not for who is developing MR
  programs,
   but can we use Herriot to do unit, integration or other tests on our
 MR
   jobs?
  
   Do you know another test tool or test framework for Hadoop?
  
   Thanks in Advance
  
   --
   Edson Ramiro Lucas Filho
   {skype, twitter, gtalk}: erlfilho
   http://www.inf.ufpr.br/erlf07/
  
  
   On Wed, Feb 2, 2011 at 4:58 PM, Konstantin Boudnik c...@apache.org
  wrote:
  
(Moving to common-user where this belongs)
   
Herriot is system test framework which runs against a real physical
cluster deployed with a specially crafted build of Hadoop. That
instrumented build of provides an extra APIs not available in Hadoop
otherwise. These APIs are created to facilitate cluster software
testability. Herriot isn't limited by MR but also covered (although
 in
a somewhat lesser extend) HDFS side of Hadoop.
   
MRunit is for MR job unit testing as in making sure that your MR
 job
is ok and/or to allow you to debug it locally before scale
 deployment.
   
So, long story short - they are very different ;) Herriot can do
intricate fault injection and can work closely with a deployed
 cluster
(say control Hadoop nodes and daemons); MRUnit is focused on MR jobs
testing.
   
Hope it helps.
--
  Take care,
Konstantin (Cos) Boudnik
   
   
On Wed, Feb 2, 2011 at 05:44, Edson Ramiro erlfi...@gmail.com
 wrote:
 Hi all,

 Plz, could you explain me the difference between MRUnit and
 Herriot?

 I've read the documentation of both and 

Re: hadoop 0.20 append - some clarifications

2011-02-10 Thread Konstantin Boudnik
You might also want to check append design doc published at HDFS-265
--
  Take care,
Konstantin (Cos) Boudnik




On Thu, Feb 10, 2011 at 07:11, Gokulakannan M gok...@huawei.com wrote:
 Hi All,

 I have run the hadoop 0.20 append branch . Can someone please clarify the
 following behavior?

 A writer writing a file but he has not flushed the data and not closed the
 file. Could a parallel reader read this partial file?

 For example,

 1. a writer is writing a 10MB file(block size 2 MB)

 2. wrote the file upto 5MB (2 finalized blocks + 1 blockBeingWritten) . note
 that writer is not calling FsDataOutputStream sync( ) at all

 3. now a reader tries to read the above partially written file

 I can be able to see that the reader can be able to see the partially
 written 5MB data but I feel the reader should be able to see the data only
 after the writer calls sync() api.

 Is this the correct behavior or my understanding is wrong?



  Thanks,

  Gokul




Re: some doubts Hadoop MR

2011-02-10 Thread Greg Roelofs
 2) Assume I am running cascading (chained) MR modules. In this case I feel
 there is a huge overhead when output of MR1 is written back to HDFS and then
 read from there as input of MR2.Can this be avoided ? (maybe store it in
 some memory without hitting the HDFS and NameNode ) Please let me know if
 there s some means of exercising this because it will increase the
 efficiency of chained MR to a great extent.

 Not possible to pipeline in Apache Hadoop. Have a look at HOP (Hadoop
 On-line project), which has some of what you seek.

It is under some circumstances.  With ChainMapper and ChainReducer, if the
key/value signatures of the inputs and outputs of all mappers and reducers
are the same, then the only disk I/O is at the endpoints.  Note that there
is _no_ buffering at all, however (just a single-element queue between each
pair), so all maps and reduces in each ChainMapper or ChainReducer chain
have to reside in memory simultaneously.

I haven't ever used them, btw, so I don't know how useful or efficient they
are.  I just came across them while working on another feature that turns
out to be fundamentally incompatible with them...

Greg
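
A hedged sketch of the ChainMapper/ChainReducer driver setup Greg describes (old mapred API); the UpperMap, PassReduce and TrimMap stages are made-up stand-ins for real logic:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainedJobSketch {

    // Stand-in stages; a real job would do actual work in these.
    public static class UpperMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> out, Reporter r) throws IOException {
            out.collect(new Text(value.toString().toUpperCase()), value);
        }
    }

    public static class PassReduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> out, Reporter r) throws IOException {
            while (values.hasNext()) {
                out.collect(key, values.next());
            }
        }
    }

    public static class TrimMap extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text key, Text value,
                OutputCollector<Text, Text> out, Reporter r) throws IOException {
            out.collect(key, new Text(value.toString().trim()));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ChainedJobSketch.class);
        job.setJobName("chained example");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // map -> reduce -> map within ONE MapReduce pass: only the final
        // TrimMap output hits HDFS; intermediate records are handed from
        // stage to stage in memory.
        ChainMapper.addMapper(job, UpperMap.class,
                LongWritable.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));
        ChainReducer.setReducer(job, PassReduce.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));
        ChainReducer.addMapper(job, TrimMap.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        JobClient.runJob(job);
    }
}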


File name which includes defined keyword

2011-02-10 Thread 안의건
File name which includes defined keyword

Dear all

I get an error when I copy a localsrc with the Hadoop fs commands.

e.g.

hadoop/bin hadoop fs -copyFromLocal abcdef:abcdef.exm /test

I can't copy a localsrc whose name includes ':' to the dst. Does anybody know
what I could do?


Regards,
Henny ahn(ahneui...@gmail.com)


Re: File name which includes defined keyword

2011-02-10 Thread Harsh J
There appears to be a bug filed about this; check its JIRA out here:
https://issues.apache.org/jira/browse/HDFS-13

On Fri, Feb 11, 2011 at 6:09 AM, 안의건 ahneui...@gmail.com wrote:
 File name which includes defined keyword

 Dear all

 I have an error when I copy localsrc in Hadoop fs commands.

 e.g.

 hadoop/bin hadoop fs -copyFromLocal abcdef:abcdef.exm /test

 I can't copy a localsrc which includes ':' to the dst. Does anybody know
 what could I do?


 Regards,
 Henny ahn(ahneui...@gmail.com)




-- 
Harsh J
www.harshj.com


How do I insert a new node while running a MapReduce hadoop?

2011-02-10 Thread Sandro Simas
Hi, I started using Hadoop recently and I'm doing some tests on a cluster of three
machines. I want to insert a new node after the MapReduce job has started. Is this
possible? How do I do it?


Re: How do I insert a new node while running a MapReduce hadoop?

2011-02-10 Thread li ping
Of course you can.
What is the node type: datanode? job tracker? task tracker?
Let's say you are trying to add a datanode.
You can modify the xml files so that the datanode points to the NameNode,
JobTracker, TaskTracker.

<property>
 <name>fs.default.name</name>
 <value>hdfs://:9000/</value>
 </property>

<property>
  <name>mapred.job.tracker</name>
  <value>ip:port</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If local, then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

In most cases, the tasktracker and datanode are running on the same machine
(to get the best performance).

After doing this, you can start HDFS with the command start-dfs.sh.
On Fri, Feb 11, 2011 at 11:13 AM, Sandro Simas sandro.csi...@gmail.comwrote:

 Hi, i started using hadoop now and I'm doing some tests on a cluster of
 three
 machines. I wanted to insert a new node after the MapReduce started, is
 this
 possible? How do I?




-- 
-李平


Re: hadoop 0.20 append - some clarifications

2011-02-10 Thread Ted Dunning
It is a bit confusing.

SequenceFile.Writer#sync isn't really sync.

There is SequenceFile.Writer#syncFs which is more what you might expect to
be sync.

Then there is HADOOP-6313 which specifies hflush and hsync.  Generally, if
you want portable code, you have to reflect a bit to figure out what can be
done.
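
A rough sketch of that "reflect a bit" approach, probing for whichever flush method the running Hadoop version exposes (nothing here is an official API, just plain reflection):

import java.io.IOException;
import java.lang.reflect.Method;

import org.apache.hadoop.fs.FSDataOutputStream;

public final class PortableFlush {
    private PortableFlush() {}

    public static void flush(FSDataOutputStream out) throws IOException {
        // Prefer hsync/hflush (post HADOOP-6313), fall back to the older sync().
        for (String name : new String[] {"hsync", "hflush", "sync"}) {
            try {
                Method m = out.getClass().getMethod(name);
                m.invoke(out);
                return;                     // first method that exists wins
            } catch (NoSuchMethodException e) {
                // try the next candidate
            } catch (Exception e) {
                throw new IOException("flush via " + name + " failed", e);
            }
        }
        throw new IOException("no usable flush method found");
    }
}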

On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M gok...@huawei.com wrote:

  Thanks Ted for clarifying.

 So the *sync* is to just flush the current buffers to datanode and persist
 the block info in namenode once per block, isn't it?



 Regarding reader able to see the unflushed data, I faced an issue in the
 following scneario:

 1. a writer is writing a *10MB* file(block size 2 MB)

 2. wrote the file upto 4MB (2 finalized blocks in *current* and nothing in
 *blocksBeingWritten* directory in DN) . So 2 blocks are written

 3. client calls addBlock for the 3rd block on namenode and not yet created
 outputstream to DN(or written anything to DN). At this point of time, the
 namenode knows about the 3rd block but the datanode doesn't.

 4. at point 3, a reader is trying to read the file and he is getting
 exception and not able to read the file as the datanode's getBlockInfo
 returns null to the client(of course DN doesn't know about the 3rd block
 yet)

 In this situation the reader cannot see the file. But when the block
 writing is in progress , the read is successful.

 *Is this a bug that needs to be handled in append branch?*



  -Original Message-
  From: Konstantin Boudnik [mailto:c...@boudnik.org]
  Sent: Friday, February 11, 2011 4:09 AM
 To: common-user@hadoop.apache.org
  Subject: Re: hadoop 0.20 append - some clarifications

  You might also want to check append design doc published at HDFS-265



 I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's
 design doc won't apply to it.


  --

 *From:* Ted Dunning [mailto:tdunn...@maprtech.com]
 *Sent:* Thursday, February 10, 2011 9:29 PM
 *To:* common-user@hadoop.apache.org; gok...@huawei.com
 *Cc:* hdfs-u...@hadoop.apache.org
 *Subject:* Re: hadoop 0.20 append - some clarifications



 Correct is a strong word here.



 There is actually an HDFS unit test that checks to see if partially written
 and unflushed data is visible.  The basic rule of thumb is that you need to
 synchronize readers and writers outside of HDFS.  There is no guarantee that
 data is visible or invisible after writing, but there is a guarantee that it
 will become visible after sync or close.

 On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M gok...@huawei.com wrote:

 Is this the correct behavior or my understanding is wrong?