sub

2011-05-04 Thread ChenJF
Hello, hadoop~

-- 
--from feng:)
- - - - - - - - - - - - - - - - - - - - - - - - - -
Blog: www.jferic.com/blog
Email: cjfsmart...@gmail.com
Studio: ws.nju.edu.cn
- - - - - - - - - - - - - - - - - - - - - - - - - -


33 Days left to Berlin Buzzwords 2011

2011-05-04 Thread Simon Willnauer
hey folks,

Berlin Buzzwords 2011 is close: only 33 days left until the big Search,
Store and Scale open source crowd gathers in Berlin on June 6th/7th.

The conference again focuses on the topics of search, data analysis and
NoSQL, and takes place on June 6th/7th, 2011 in Berlin.

We are looking forward to two awesome keynote speakers who shaped the world of
open source data analysis: Doug Cutting (founder of Apache Lucene and Hadoop)
as well as Ted Dunning (Chief Application Architect at MapR Technologies and
active developer at Apache Hadoop and Mahout).

We are amazed by the amount and quality of the talk submissions we received.
As a result, this year we have added one more track to the main conference.
If you haven't done so already, make sure to book your ticket now - early bird
tickets have been sold out since April 7th and there might not be many tickets left.

As we would like to give visitors to our main conference a reason to stay in
town for the whole week, we have been talking to local co-working spaces and
companies, asking them for free space and WiFi to host Hackathons right after
the main conference - that is, on June 8th through 10th.

If you would like to gather with fellow developers and users of your project,
fix bugs together, hack on new features or give users a hands-on introduction to
your tools, please submit your workshop proposal to our wiki:

http://berlinbuzzwords.de/node/428

Please note that slots are assigned on a first come, first served basis. We are
doing our best to get you connected; however, space is limited.

The deal is simple: we get you in touch with a conference room provider, and your
event gets promoted in our schedule. Coordination, however, is completely up to
you: make sure to provide an interesting abstract and a Hackathon
registration area - see the Barcamp page for a good example:

http://berlinbuzzwords.de/wiki/barcamp

Attending Hackathons requires a Berlin Buzzwords ticket and (then free)
registration at the Hackathon in question.

Hope I see you all around in Berlin,

Simon


Change block size from 64M to 128M does not work on Hadoop-0.21

2011-05-04 Thread He Chen
Hi all

I ran into a problem changing the block size from 64M to 128M. I am sure I
modified the correct configuration file (hdfs-site.xml), because I can change
the replication factor correctly. However, the block size change does not take
effect.

For example:

I changed dfs.block.size to 134217728 bytes.

I uploaded a file which is 128M and used fsck to find out how many blocks this
file has. It shows:
/user/file1/file 134217726 bytes, 2 blocks(s): OK
0. blk_xx len=67108864 repl=2 [192.168.0.3:50010, 192.168.0.32:50010
]
1. blk_xx len=67108862 repl=2 [192.168.0.9:50010, 192.168.0.8:50010]

The hadoop version is 0.21. Any suggestion will be appreciated!

thanks

Chen


Re: How do I create per-reducer temporary files?

2011-05-04 Thread Matt Pouttu-Clarke
Hi Bryan,

These are called side effect files, and I use them extensively:

O'Reilly Hadoop, 2nd Edition, p. 187
Pro Hadoop, p. 279

You get the path to save the file(s) using:
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath%28org.apache.hadoop.mapred.JobConf%29

The output committer moves these files from the work directory to the output
directory when the task completes.  That way you don't have duplicate files
due to speculative execution.  You should also generate a unique name for
each of your output files by using this function to prevent file name
collisions:
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName%28org.apache.hadoop.mapred.JobConf,%20java.lang.String%29

Hope this helps,
Matt


On 5/4/11 12:18 PM, Bryan Keller brya...@gmail.com wrote:

 Right. What I am struggling with is how to retrieve the path/drive that the
 reducer is using, so I can use the same path for local temp files.
 
 On May 4, 2011, at 9:03 AM, Robert Evans wrote:
 
 Bryan,
 
 I believe that map/reduce gives you a single drive to write to so that your
 reducer has less of an impact on other reducers/mappers running on the same
 box.  If you want to write to more drives I thought the idea would then be to
 increase the number of reducers you have and let mapred assign each to a
 drive to use, instead of having one reducer eating up I/O bandwidth from all
 of the drives.
 
 --Bobby Evans
 
 On 5/4/11 7:11 AM, Bryan Keller brya...@gmail.com wrote:
 
 I too am looking for the best place to put local temp files I create during
 reduce processing. I am hoping there is a variable or property someplace that
 defines a per-reducer temp directory. The mapred.child.tmp property is by
 default simply the relative directory ./tmp so it isn't useful on it's own.
 
 I have 5 drives being used in mapred.local.dir, and I was hoping to use
 them all for writing temp files, rather than specifying a single temp
 directory that all my reducers use.
 
 
 On Apr 9, 2011, at 2:40 AM, Harsh J wrote:
 
 Hello,
 
 On Tue, Apr 5, 2011 at 2:53 AM, W.P. McNeill bill...@gmail.com wrote:
 If I try:
 
 storePath = FileOutputFormat.getPathForWorkFile(context, "my-file", ".seq");
 writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
     configuration, storePath, IntWritable.class, itemClass);
 ...
 reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
     storePath, configuration);
 
 I get an exception about a mismatch in file systems when trying to read
 from
 the file.
 
 Alternately if I try:
 
 storePath = new Path(SequenceFileOutputFormat.getUniqueFile(context,
     "my-file", ".seq"));
 writer = SequenceFile.createWriter(FileSystem.get(configuration),
     configuration, storePath, IntWritable.class, itemClass);
 ...
 reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
     storePath, configuration);
 
 FileOutputFormat.getPathForWorkFile will give back HDFS paths. And
 since you are looking to create local temporary files to be used only
 by the task within itself, you shouldn't really worry about unique
 filenames (stuff can go wrong).
 
 You're looking for the tmp/ directory locally created in the FS where
 the Task is running (at ${mapred.child.tmp}, which defaults to ./tmp).
 You can create a regular file there using vanilla Java APIs for files,
 or using RawLocalFS + your own created Path (not derived via
 OutputFormat/etc.).
 
 storePath = new Path(new Path(context.getConfiguration().get("mapred.child.tmp")),
     "my-file.seq");
 writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
     configuration, storePath, IntWritable.class, itemClass);
 ...
 reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
     storePath, configuration);
 
 The above should work, I think (haven't tried, but the idea is to use
 the mapred.child.tmp).
 
 Also see: 
 http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Directory+
 Structure
 
 --
 Harsh J
 
 
 






don't want to output anything

2011-05-04 Thread Gang Luo


Hi,

I use MapReduce to process and output my own stuff in a customized way. I don't
use context.write to output anything, and thus I don't want the empty
part-r-x files on my fs. Is there some way to eliminate the output?

Thanks.

-Gang



Re: Change block size from 64M to 128M does not work on Hadoop-0.21

2011-05-04 Thread Harsh J
Your client (put) machine must have the same block size configuration
during upload as well.

Alternatively, you may do something explicit like `hadoop dfs
-Ddfs.block.size=size -put file file`

On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote:
 Hi all

 I met a problem about changing block size from 64M to 128M. I am sure I
 modified the correct configuration file hdfs-site.xml. Because I can change
 the replication number correctly. However, it does not work on block size
 changing.

 For example:

 I change the dfs.block.size to 134217728 bytes.

 I upload a file which is 128M and use fsck to find how many blocks this
 file has. It shows:
 /user/file1/file 134217726 bytes, 2 blocks(s): OK
 0. blk_xx len=67108864 repl=2 [192.168.0.3:50010, 192.168.0.32:50010
 ]
 1. blk_xx len=67108862 repl=2 [192.168.0.9:50010, 192.168.0.8:50010]

 The hadoop version is 0.21. Any suggestion will be appreciated!

 thanks

 Chen




-- 
Harsh J


Re: Change block size from 64M to 128M does not work on Hadoop-0.21

2011-05-04 Thread He Chen
Hi Harsh

Thank you for the reply.

Actually, the Hadoop directory is on my NFS server; every node reads the
same file from the NFS server, so I think that is not the problem.

I like your second solution, but I am not sure whether the namenode
will split those 128MB blocks into smaller ones later on.

Chen

On Wed, May 4, 2011 at 3:00 PM, Harsh J ha...@cloudera.com wrote:

 Your client (put) machine must have the same block size configuration
 during upload as well.

 Alternatively, you may do something explicit like `hadoop dfs
 -Ddfs.block.size=size -put file file`

 On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote:
  Hi all
 
  I met a problem about changing block size from 64M to 128M. I am sure I
  modified the correct configuration file hdfs-site.xml. Because I can
 change
  the replication number correctly. However, it does not work on block size
  changing.
 
  For example:
 
  I change the dfs.block.size to 134217728 bytes.
 
  I upload a file which is 128M and use fsck to find how many blocks this
  file has. It shows:
  /user/file1/file 134217726 bytes, 2 blocks(s): OK
  0. blk_xx len=67108864 repl=2 [192.168.0.3:50010,
 192.168.0.32:50010
  ]
  1. blk_xx len=67108862 repl=2 [192.168.0.9:50010,
 192.168.0.8:50010]
 
  The hadoop version is 0.21. Any suggestion will be appreciated!
 
  thanks
 
  Chen
 



 --
 Harsh J



Re: Change block size from 64M to 128M does not work on Hadoop-0.21

2011-05-04 Thread He Chen
Tried the second solution. It does not work; still two 64M blocks. Hmm.

On Wed, May 4, 2011 at 3:16 PM, He Chen airb...@gmail.com wrote:

 Hi Harsh

 Thank you for the reply.

 Actually, the hadoop directory is on my NFS server, every node reads the
 same file from NFS server. I think this is not a problem.

 I like your second solution. But I am not sure, whether the namenode
 will divide those 128MB

  blocks to smaller ones in future or not.

 Chen

 On Wed, May 4, 2011 at 3:00 PM, Harsh J ha...@cloudera.com wrote:

 Your client (put) machine must have the same block size configuration
 during upload as well.

 Alternatively, you may do something explicit like `hadoop dfs
 -Ddfs.block.size=size -put file file`

 On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote:
  Hi all
 
  I met a problem about changing block size from 64M to 128M. I am sure I
  modified the correct configuration file hdfs-site.xml. Because I can
 change
  the replication number correctly. However, it does not work on block
 size
  changing.
 
  For example:
 
  I change the dfs.block.size to 134217728 bytes.
 
  I upload a file which is 128M and use fsck to find how many blocks
 this
  file has. It shows:
  /user/file1/file 134217726 bytes, 2 blocks(s): OK
  0. blk_xx len=67108864 repl=2 [192.168.0.3:50010,
 192.168.0.32:50010
  ]
  1. blk_xx len=67108862 repl=2 [192.168.0.9:50010,
 192.168.0.8:50010]
 
  The hadoop version is 0.21. Any suggestion will be appreciated!
 
  thanks
 
  Chen
 



 --
 Harsh J





Re: don't want to output anything

2011-05-04 Thread Gang Luo
Exactly what I want. Thanks Harsh J.

-Gang



----- Original Message -----
From: Harsh J ha...@cloudera.com
To: common-user@hadoop.apache.org
Sent: 2011/5/4 (Wed) 4:03:35 PM
Subject: Re: don't want to output anything

Hello Gang,

On Thu, May 5, 2011 at 1:22 AM, Gang Luo lgpub...@yahoo.com.cn wrote:


 Hi,

 I use MapReduce to process and output my own stuff, in a customized way. I 
don't
 use context.write to output anything, and thus I don't want the empty files
 part-r-x on my fs. Is there someway to eliminate the output?

You're looking for the NullOutputFormat:
http://search-hadoop.com/?q=nulloutputformat
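
A tiny sketch of wiring that into a job driver (new mapreduce API; the job
name below is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

Configuration conf = new Configuration();
Job job = new Job(conf, "no-output-job");   // placeholder job name
// ... set mapper/reducer/input as usual ...
// Discards everything handed to the OutputFormat, so no empty part-r-* files appear.
job.setOutputFormatClass(NullOutputFormat.class);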

-- 
Harsh J



(nfs) outputdir

2011-05-04 Thread gabriel

 Hello

I'm using a small, fully distributed Hadoop cluster. All Hadoop daemons run as
the "hadoop" user, and I submit jobs as "user".

I ran into a couple of problems when I set mapred.output.dir to an (NFS)
file:// location.

1. The output dir gets created, but it belongs to "hadoop".
It sort of makes sense: the processes writing the output files run as the
"hadoop" user.
I would like the resulting output dir to belong to "user" (same as when setting
mapred.output.dir=hdfs://...).

2. My job driver creates a report file in the output dir after the job is
complete.
However, the job driver is run by "user", which doesn't have permission to
write in the output dir (the output dir belongs to "hadoop").



Forcing the users to run jobs as "hadoop" (the Hadoop admin user) is a poor
option.
Do I have any other choices?

Thank you very much
Gabriel Balan

--
The statements and opinions expressed here are my own and do not necessarily 
represent those of Oracle Corporation.



Re: Change block size from 64M to 128M does not work on Hadoop-0.21

2011-05-04 Thread He Chen
Got it. Thank you Harsh. BTW, it is `hadoop dfs -Ddfs.blocksize=size -put
file file`. No dot between "block" and "size".
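
For anyone doing the upload from Java rather than the shell, a minimal sketch
of the same idea - the block size is a client-side, per-file setting picked up
at write time (property key dfs.blocksize on 0.21, dfs.block.size on older
releases); the paths below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// 128 MB = 134217728 bytes; must be set on the client doing the write -
// changing hdfs-site.xml on the namenode alone has no effect on new files.
conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("/local/path/file"), new Path("/user/file1/file"));
fs.close();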

On Wed, May 4, 2011 at 3:18 PM, He Chen airb...@gmail.com wrote:

 Tried second solution. Does not work, still 2 64M blocks. h


 On Wed, May 4, 2011 at 3:16 PM, He Chen airb...@gmail.com wrote:

 Hi Harsh

 Thank you for the reply.

 Actually, the hadoop directory is on my NFS server, every node reads the
 same file from NFS server. I think this is not a problem.

 I like your second solution. But I am not sure, whether the namenode
 will divide those 128MB

  blocks to smaller ones in future or not.

 Chen

 On Wed, May 4, 2011 at 3:00 PM, Harsh J ha...@cloudera.com wrote:

 Your client (put) machine must have the same block size configuration
 during upload as well.

 Alternatively, you may do something explicit like `hadoop dfs
 -Ddfs.block.size=size -put file file`

 On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote:
  Hi all
 
  I met a problem about changing block size from 64M to 128M. I am sure I
  modified the correct configuration file hdfs-site.xml. Because I can
 change
  the replication number correctly. However, it does not work on block
 size
  changing.
 
  For example:
 
  I change the dfs.block.size to 134217728 bytes.
 
  I upload a file which is 128M and use fsck to find how many blocks
 this
  file has. It shows:
  /user/file1/file 134217726 bytes, 2 blocks(s): OK
  0. blk_xx len=67108864 repl=2 [192.168.0.3:50010,
 192.168.0.32:50010
  ]
  1. blk_xx len=67108862 repl=2 [192.168.0.9:50010,
 192.168.0.8:50010]
 
  The hadoop version is 0.21. Any suggestion will be appreciated!
 
  thanks
 
  Chen
 



 --
 Harsh J






Re: Cluster hard drive ratios

2011-05-04 Thread M. C. Srivas
Hey Matt,

 we are using the same Dell boxes, and we can get 2 GB/s per node (read and
write) without problems.


On Wed, May 4, 2011 at 8:43 AM, Matt Goeke msg...@gmail.com wrote:

 I have been reviewing quite a few presentations on the web from
 various businesses, in addition to the ones I watched first hand at
 the Cloudera data summit last week, and I am curious about others'
 thoughts on hard drive ratios. Various sources, including Cloudera,
 have cited 1 HDD x 2 cores x 4 GB ECC, but this makes me wonder what
 the upper bound for HDDs is in this ratio. We have specced out various
 machines from Dell, and it is possible to get dual hexacores with 14
 drives (2 RAIDed for the OS and 12x2TB), but this seems to conflict with
 that original ratio and some of the specs I have witnessed in
 presentations (which are mostly 4-drive configurations). I would
 assume all you incur is additional complexity and more potential for
 hardware failure on a specific machine, but I have seen little to no
 data stating at what point there is a plateau in write speed
 performance. Can anyone share personal experience with this type of
 setup?

 If we accept that we are incurring the negatives I stated above but
 gain higher data density in the cluster, is this setup fine, or are we
 overlooking something?

 Thanks,
 Matt



bin/start-dfs/mapred.sh with input slave file

2011-05-04 Thread Matthew John
Hi all,

I see that there is an option to provide a slaves_file as input to
bin/start-dfs.sh and bin/start-mapred.sh so that slaves are parsed from this
input file rather than the default conf/slaves.

Can someone please help me with the syntax for this? I am not able to figure
it out.

Thanks,
Matthew John


Re: bin/start-dfs/mapred.sh with input slave file

2011-05-04 Thread Harsh J
Keep two configuration directories with different slaves files (say
conf.dfs/ and conf.mr/) and use `hadoop-daemons.sh --config {conf dir
path} start {daemon}` to start up DN/TT daemons.

On Thu, May 5, 2011 at 8:06 AM, Matthew John tmatthewjohn1...@gmail.com wrote:
 Hi all,

 I see that there is an option to provide a slaves_file as input to
 bin/start-dfs.sh and bin/start-mapred.sh so that slaves are parsed from this
 input file rather than the default conf/slaves.

 Can someone please help me with the syntax for this. I am not able to figure
 this out.

 Thanks,
 Matthew John




-- 
Harsh J


Re: How do I create per-reducer temporary files?

2011-05-04 Thread Matt Pouttu-Clarke
Bryan,

Not sure you should be concerned with whether the output is on local vs.
HDFS.  I wouldn't think there would be much of a performance difference if
you are doing streaming output (append) in both cases.  Hadoop already uses
local storage wherever possible (including for the task working
directories as far as I know).  I've never had performance problems with
side effect files, as long as the correct setup is used.

Definitely, if multiple mounts are available locally where the tasks are
running, you can add a comma-delimited list to mapreduce.cluster.local.dir in
the mapred-site.xml of those machines:
http://hadoop.apache.org/common/docs/current/cluster_setup.html#mapred-site.xml

Theoretically you can use the methods I listed below to create unique
files/paths under /tmp or any other mount point you wish.  However, it is
much better to let Hadoop manage where the files are stored (i.e. Use the
work directory given to you).

If you add multiple paths to mapreduce.cluster.local.dir then Hadoop will
spread the I/O from multiple mappers/reducers across these paths.  Likewise
you can mount a RAID 0 (stripe) of multiple drives to get the same effect.
You can use a single RAID 0 to keep the mapred-site.xml uniform.  RAID 0 is
fine since speculative execution takes care of things if a disk fails.

It would be helpful to know your use case, since the primary option is
normally to create multiple outputs from a reducer:
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
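
A rough sketch of that MultipleOutputs route in the new API; the named output
"stats" and the Text/IntWritable types are placeholders for whatever your job
actually emits:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Driver side: register the named output on the Job being configured, e.g.
// MultipleOutputs.addNamedOutput(job, "stats", TextOutputFormat.class,
//     Text.class, IntWritable.class);

public class StatsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable value : values) {
            // Records go to files named stats-r-* instead of the default part-r-*.
            mos.write("stats", key, value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}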

Most likely you should try that before going into the realm of side effect
files (or messing with local temp on the task nodes).  Try the multiple
outputs if you are dealing with streaming data.  If you absolutely cannot
get it to work then you may have to cross check the other more complex
options.

Cheers,
Matt



On 5/4/11 1:07 PM, Bryan Keller brya...@gmail.com wrote:

 Am I mistaken or are side-effect files on HDFS? I need my temp files to be on
 the local filesystem. Also, the java working directory is not the reducer's
 local processing directory, thus ./tmp doesn't get me what I'm after. As it
 stands now I'm using java.io.tmpdir which is not a long-term solution for me.
 I am looking to use the reducer's task-specific local directory which should
 be balanced across my local drives.
 
 On May 4, 2011, at 12:31 PM, Matt Pouttu-Clarke wrote:
 
 Hi Bryan,
 
 These are called side effect files, and I use them extensively:
 
 O'Reilly Hadoop, 2nd Edition, p. 187
 Pro Hadoop, p. 279
 
 You get the path to the save the file(s) using:
 http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/Fi
 leOutputFormat.html#getWorkOutputPath%28org.apache.hadoop.mapred.JobConf%29
 
 The output committer moves these files from the work directory to the output
 directory when the task completes.  That way you don't have duplicate files
 due to speculative execution.  You should also generate a unique name for
 each of your output files by using this function to prevent file name
 collisions:
 http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/Fi
 leOutputFormat.html#getUniqueName%28org.apache.hadoop.mapred.JobConf,%20java
 .lang.String%29
 
 Hope this helps,
 Matt
 
 
 On 5/4/11 12:18 PM, Bryan Keller brya...@gmail.com wrote:
 
 Right. What I am struggling with is how to retrieve the path/drive that the
 reducer is using, so I can use the same path for local temp files.
 
 On May 4, 2011, at 9:03 AM, Robert Evans wrote:
 
 Bryan,
 
 I believe that map/reduce gives you a single drive to write to so that your
 reducer has less of an impact on other reducers/mappers running on the same
 box.  If you want to write to more drives I thought the idea would then be
 to
 increase the number of reducers you have and let mapred assign each to a
 drive to use, instead of having one reducer eating up I/O bandwidth from
 all
 of the drives.
 
 --Bobby Evans
 
 On 5/4/11 7:11 AM, Bryan Keller brya...@gmail.com wrote:
 
 I too am looking for the best place to put local temp files I create during
 reduce processing. I am hoping there is a variable or property someplace
 that
 defines a per-reducer temp directory. The mapred.child.tmp property is by
 default simply the relative directory ./tmp so it isn't useful on it's
 own.
 
 I have 5 drives being used in mapred.local.dir, and I was hoping to use
 them all for writing temp files, rather than specifying a single temp
 directory that all my reducers use.
 
 
 On Apr 9, 2011, at 2:40 AM, Harsh J wrote:
 
 Hello,
 
 On Tue, Apr 5, 2011 at 2:53 AM, W.P. McNeill bill...@gmail.com wrote:
 If I try:
 
   storePath = FileOutputFormat.getPathForWorkFile(context, "my-file", ".seq");
   writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
       configuration, storePath, IntWritable.class, itemClass);
   ...
   reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
       storePath, configuration);
 
 I get an