Re: aggregation by time window

2013-01-28 Thread Kai Voigt
Hi again,

the idea is that you emit every event multiple times. So your map input record 
(event1, 10:07) will be emitted seven times during the map() call. Like I said, 
(10:04,event1), (10:05,event1), ..., (10:10,event1) will be the seven outputs 
for processing a single event.

The output key is the time stamp of each minute bucket in whose neighbourhood 
the event falls, so that every event gets joined with the events that happened 
within +/- 3 minutes of it. Two events that fall into the same 7-minute window 
will both be emitted with the same time stamp as the map() output key, and 
thus meet in the same reduce() call.

A reduce() call will look like this: reduce(10:03, list_of_events), where 
those events had time stamps between 10:00 and 10:06 in the original input.
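
In code, a minimal mapper sketch of that idea could look like this (assuming 
minute-granularity "HH:MM" time stamps, the "event | time" layout from your 
mail, and made-up class names; windows wrapping around midnight are ignored 
for simplicity):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TimeWindowMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int WINDOW = 3; // +/- 3 minutes

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // assumed input layout: "event1 | 10:07"
        String[] parts = line.toString().split("\\|");
        String eventId = parts[0].trim();
        String[] hhmm = parts[1].trim().split(":");
        int minuteOfDay = Integer.parseInt(hhmm[0]) * 60 + Integer.parseInt(hhmm[1]);

        // emit the event once per minute bucket in its +/- 3 minute window
        for (int m = minuteOfDay - WINDOW; m <= minuteOfDay + WINDOW; m++) {
            context.write(new Text(String.format("%02d:%02d", m / 60, m % 60)),
                    new Text(eventId));
        }
    }
}

Each reduce(bucket, events) call then sees every event whose original time 
stamp falls into that bucket's 7-minute window.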

Kai

Am 28.01.2013 um 14:43 schrieb Oleg Ruchovets :

> Hi Kai.
>It is very interesting. Can you please explain in more details your
> Idea?
> What will be a key in a map phase?
> 
> Suppose we have event at 10:07. How would you emit this to the multiple
> buckets?
> 
> Thanks
> Oleg.
> 
> 
> On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt  wrote:
> 
>> Quick idea:
>> 
>> since each of your events will go into several buckets, you could use
>> map() to emit each item multiple times for each bucket.
>> 
>> Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets :
>> 
>>> Hi ,
>>>   I have such row data structure:
>>> 
>>> event_id  |   time
>>> ==
>>> event1 |  10:07
>>> event2 |  10:10
>>> event3 |  10:12
>>> 
>>> event4 |   10:20
>>> event5 |   10:23
>>> event6 |   10:25
>> 
>> map(event1,10:07) would emit (10:04,event1), (10:05,event1), ...,
>> (10:10,event1) and so on.
>> 
>> In reduce(), all your desired events would meet for the same minute.
>> 
>> Kai
>> 
>> --
>> Kai Voigt
>> k...@123.org
>> 
>> 
>> 
>> 
>> 

-- 
Kai Voigt
k...@123.org






Re: aggregation by time window

2013-01-28 Thread Kai Voigt
Quick idea:

since each of your events will go into several buckets, you could use map() to 
emit each item multiple times for each bucket.

Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets :

> Hi ,
>I have such row data structure:
> 
> event_id  |   time
> ==
> event1 |  10:07
> event2 |  10:10
> event3 |  10:12
> 
> event4 |   10:20
> event5 |   10:23
> event6 |   10:25

map(event1,10:07) would emit (10:04,event1), (10:05,event1), ..., 
(10:10,event1) and so on.

In reduce(), all your desired events would meet for the same minute.

Kai

-- 
Kai Voigt
k...@123.org






Re: Transfer large file >50Gb with DistCp from s3 to cluster

2012-09-04 Thread Kai Voigt
Hi,

my guess is that you run "hadoop distcp" on one of the datanodes. In that 
case, that node receives the first replica of each block, so it ends up 
holding a replica of every block, while the remaining replicas are spread 
across the other nodes.

Kai

Am 04.09.2012 um 22:07 schrieb Soulghost :

> 
> Hello guys 
> 
> I have a problem using the DistCp to transfer a large file from s3 to HDFS
> cluster, whenever I tried to make the copy, I only saw processing work and
> memory usage in one of the nodes, not in all of them, I don't know if this
> is the proper behaviour of this or if it is a configuration problem. If I
> make the transfer of multiple files each node handles a single file at the
> same time, I understand that this transfer would be in parallel but it
> doesn't seems like that. 
> 
> I am using 0.20.2 distribution for hadoop in a two Ec2Instances cluster, I
> was hoping that any of you have an idea of how it works distCp and which
> properties could I tweak to improve the transfer rate that is currently in
> 0.7 Gb per minute. 
> 
> Regards.
> -- 
> View this message in context: 
> http://old.nabble.com/Transfer-large-file-%3E50Gb-with-DistCp-from-s3-to-cluster-tp34389118p34389118.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 
> 

-- 
Kai Voigt
k...@123.org






Re: Hadoop or HBase

2012-08-28 Thread Kai Voigt
Having a distributed filesystem doesn't save you from needing backups. 
Replication protects against hardware failure, not against mistakes: if 
someone deletes a file in HDFS, it's gone.

What backend storage is supported by your CMS?

Kai

Am 28.08.2012 um 08:36 schrieb Kushal Agrawal :

> As the data is too much in (10's of terabytes) it's difficult to take backup
> because it takes 1.5 days to take backup of data every time. Instead of that
> if we uses distributed file system we need not to do that.
> 
> Thanks & Regards,
> Kushal Agrawal
> kushalagra...@teledna.com
>  
> -Original Message-
> From: Kai Voigt [mailto:k...@123.org] 
> Sent: Tuesday, August 28, 2012 11:57 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Hadoop or HBase
> 
> Typically, CMSs require an RDBMS, which Hadoop and HBase are not.
> 
> Which CMS do you plan to use, and what's wrong with MySQL or other open
> source RDBMSs?
> 
> Kai
> 
> Am 28.08.2012 um 08:21 schrieb "Kushal Agrawal" :
> 
>> Hi,
>>I wants to use DFS for Content-Management-System (CMS), in
> that I just wants to store and retrieve files.
>> Please suggest me what should I use:
>> Hadoop or HBase
>> 
>> Thanks & Regards,
>> Kushal Agrawal
>> kushalagra...@teledna.com
>> 
>> One Earth. Your moment. Go green...
>> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise private information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the email by you is prohibited.
>> 
> 
> -- 
> Kai Voigt
> k...@123.org
> 
> 
> 
> 
> 
> 

-- 
Kai Voigt
k...@123.org






Re: Hadoop or HBase

2012-08-27 Thread Kai Voigt
Typically, CMSs require an RDBMS, which Hadoop and HBase are not.

Which CMS do you plan to use, and what's wrong with MySQL or other open source 
RDBMSs?

Kai

Am 28.08.2012 um 08:21 schrieb "Kushal Agrawal" :

> Hi,
> I wants to use DFS for Content-Management-System (CMS), in 
> that I just wants to store and retrieve files.
> Please suggest me what should I use:
> Hadoop or HBase
>  
> Thanks & Regards,
> Kushal Agrawal
> kushalagra...@teledna.com
>  
> One Earth. Your moment. Go green...
> This message is for the designated recipient only and may contain privileged, 
> proprietary, or otherwise private information. If you have received it in 
> error, please notify the sender immediately and delete the original. Any 
> other use of the email by you is prohibited.
>  

-- 
Kai Voigt
k...@123.org






Re: Simple hadoop processes/testing on windows machine

2012-07-25 Thread Kai Voigt
I suggest using a virtual machine with all required services installed and 
configured.

Cloudera offers a distribution as a VM, at 
https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads#CDHDownloads-CDH4PackagesandDownloads

So all you need to do is install VMware Player on your Windows box and deploy 
that VM.

Kai

Am 25.07.2012 um 22:49 schrieb "Brown, Berlin [GCG-PFS]" 
:

> Is there a tutorial out there or quick startup that would allow me to:
> 
> 
> 
> 1.   Start my hadoop nodes by double clicking on a node, possibly a
> single node
> 
> 2.   Installing my Task
> 
> 3.   And then my client connects to those nodes.
> 
> 
> 
> I want to avoid having to install a hadoop windows service or any such
> installs.  Actually, I don't want to install anything for hadoop to run,
> I want to be able to unarchive the software and just double-click and
> run.  I also want to avoid additional hadoop windows registry or env
> params?
> 
> 
> 
> -
> 
> 
> 

-- 
Kai Voigt
k...@123.org






Re: Counting records

2012-07-23 Thread Kai Voigt
Hi,

an additional idea is to use the counter API inside the framework.

http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/ has 
a good example.
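
For your case, a rough sketch could look like this (the enum and the 
matches() criterion are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    public enum MyCounters { MATCHING_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (matches(value)) {
            // increments are aggregated by the framework across all tasks
            context.getCounter(MyCounters.MATCHING_RECORDS).increment(1);
        }
        // no map output is needed, the job can run with zero reduce tasks
    }

    private boolean matches(Text record) {
        return record.toString().contains("ERROR"); // placeholder criterion
    }
}

After the job finishes, the driver can read the total with 
job.getCounters().findCounter(CountingMapper.MyCounters.MATCHING_RECORDS).getValue().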

Kai

Am 23.07.2012 um 16:25 schrieb Peter Marron:

> I am a complete noob with Hadoop and MapReduce and I have a question that
> is probably silly, but I still don't know the answer.
> 
> To simplify (a fair bit) I want to count all the records that meet specific 
> criteria.
> I would like to use MapReduce because I anticipate large sources and I want to
> get the performance and reliability that MapReduce offers.

-- 
Kai Voigt
k...@123.org






Re: HDFS APPEND

2012-07-13 Thread Kai Voigt
Hello,

Am 13.07.2012 um 22:19 schrieb abhiTowson cal:

> Does CDH4 have append option??

Yes!!

Check out 
http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#append(org.apache.hadoop.fs.Path)
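
A minimal usage sketch (the path is just an example, and the file must 
already exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // throws an exception if appends are disabled on the cluster
        FSDataOutputStream out = fs.append(new Path("/user/abhi/existing.log"));
        out.write("another line\n".getBytes("UTF-8"));
        out.close();
    }
}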

Kai

-- 
Kai Voigt
k...@123.org






Re: Binary Files With No Record Begin and End

2012-07-05 Thread Kai Voigt
Hi,

if you know the block size and the fixed record size, you can calculate the 
offsets of your records and write a custom record reader class that seeks to 
the record boundaries within each split.
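
Newer Hadoop releases also ship a FixedLengthInputFormat that covers this 
case; if it's available in your version, a driver sketch could look like this 
(paths and names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;

public class FixedRecordDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // every record is exactly 180 bytes
        FixedLengthInputFormat.setRecordLength(conf, 180);
        Job job = Job.getInstance(conf, "fixed-length-records");
        job.setJarByClass(FixedRecordDriver.class);
        job.setInputFormatClass(FixedLengthInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/binary-records"));
        // the mapper then receives a LongWritable offset and a 180-byte
        // BytesWritable value per record; set mapper/reducer/output as usual
    }
}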

Kai

Am 05.07.2012 um 22:54 schrieb MJ Sam:

> Hi,
> 
> The input of my map reduce is a binary file with no record begin and
> end marker. The only thing is that each record is a fixed 180bytes
> size in the binary file. How do I make Hadoop to properly find the
> record in the splits when a record overlap two splits. I was thinking
> to make the splits size to be a multiple of 180 but was wondering if
> there is anything else that I can do?  Please note that my files are
> not sequence file and just a custom binary file.
> 

-- 
Kai Voigt
k...@123.org






Re: file checksum

2012-06-25 Thread Kai Voigt
HDFS has block checksums. Whenever a block is written to the datanodes, a 
checksum is calculated and written with the block to the datanodes' disks.

Whenever a block is requested, the block's checksum is verified against the 
stored checksum. If they don't match, that block is corrupt. But since there 
are additional replicas of the block, chances are high that another replica 
matches the checksum. Corrupt blocks will be scheduled for re-replication.

Also, to prevent bit rot, blocks are checked periodically in the background 
(weekly by default, I believe; you can configure that period).

Kai

Am 25.06.2012 um 13:29 schrieb Rita:

> Does Hadoop, HDFS in particular, do any sanity checks of the file before
> and after balancing/copying/reading the files? We have 20TB of data and I
> want to make sure after these operating are completed the data is still in
> good shape. Where can I read about this?
> 
> tia
> 
> -- 
> --- Get your facts first, then you can distort them as you please.--

-- 
Kai Voigt
k...@123.org






Re: map task execution time

2012-04-05 Thread Kai Voigt
Hi,

Am 05.04.2012 um 00:20 schrieb bikash sharma:

> Is it possible to get the execution time of the constituent map/reduce
> tasks of a MapReduce job (say sort) at the end of a job run?
> Preferably, can we obtain this programatically?


you can access the JobTracker's web UI and see the start and stop timestamps 
for every individual task.

Since the JobTracker's Java API is exposed, you can also write your own 
application to fetch that data programmatically.

Also, "hadoop job" on the command line can be used to read job statistics.
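
A rough sketch with the JobClient API (the job id is just a placeholder):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskReport;

public class TaskTimes {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        TaskReport[] maps =
                client.getMapTaskReports(JobID.forName("job_201204050000_0001"));
        for (TaskReport report : maps) {
            long millis = report.getFinishTime() - report.getStartTime();
            System.out.println(report.getTaskID() + " ran for " + millis + " ms");
        }
    }
}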

Kai


-- 
Kai Voigt
k...@123.org






Re: How does Hadoop compile the program written in language other than Java ?

2012-03-04 Thread Kai Voigt
Hi,

please read http://developer.yahoo.com/hadoop/tutorial/module4.html#streaming 
for more explanation and examples.

Kai

Am 04.03.2012 um 16:10 schrieb Lac Trung:

> Can you give one or some examples about this, Kai ?
> I haven't understood how Hadoop run a mapreduce program in other language :D
> 
> Vào 02:21 Ngày 04 tháng 3 năm 2012, Kai Voigt  đã viết:
> 
>> Hi,
>> 
>> the streaming API doesn't compile the streaming scripts.
>> 
>> The PHP/Perl/Python/Ruby scripts you create as mapper and reducer will be
>> called as external programs.
>> 
>> The input key/value pairs will be sent to your scripts on stdin, and the
>> output will be collected from their stdout.
>> 
>> So, no compilation, the scripts will just be executed.
>> 
>> Kai
>> 
>> Am 04.03.2012 um 15:42 schrieb Lac Trung:
>> 
>>> Hi everyone !
>>> 
>>> Hadoop is written in Java, so mapreduce programs are written in Java,
>> too.
>>> But Hadoop provides an API to MapReduce that allows you to write your map
>>> and reduce functions in languages other than Java (ex. Python), called
>>> Hadoop Streaming.
>>> I read the guide of Hadoop Streaming in
>>> here<
>> http://www.hadoop.apache.org/common/docs/r0.15.2/streaming.html#More+usage+examples
>>> 
>>> but
>>> I haven't seen any paragraph that write about converting language to
>> Java.
>>> Can anybody tell me how Hadoop compile the program written in language
>>> other than Java.
>>> 
>>> Thank you !
>>> --
>>> Lac Trung
>> 
>> --
>> Kai Voigt
>> k...@123.org
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> Lạc Trung
> 20083535

-- 
Kai Voigt
k...@123.org






Re: How does Hadoop compile the program written in language other than Java ?

2012-03-04 Thread Kai Voigt
Hi,

the streaming API doesn't compile the streaming scripts.

The PHP/Perl/Python/Ruby scripts you create as mapper and reducer will be 
called as external programs.

The input key/value pairs will be sent to your scripts on stdin, and the 
output will be collected from their stdout.

So, no compilation, the scripts will just be executed.

Kai

Am 04.03.2012 um 15:42 schrieb Lac Trung:

> Hi everyone !
> 
> Hadoop is written in Java, so mapreduce programs are written in Java, too.
> But Hadoop provides an API to MapReduce that allows you to write your map
> and reduce functions in languages other than Java (ex. Python), called
> Hadoop Streaming.
> I read the guide of Hadoop Streaming in
> here<http://www.hadoop.apache.org/common/docs/r0.15.2/streaming.html#More+usage+examples>
> but
> I haven't seen any paragraph that write about converting language to Java.
> Can anybody tell me how Hadoop compile the program written in language
> other than Java.
> 
> Thank you !
> -- 
> Lac Trung

-- 
Kai Voigt
k...@123.org






Re: dfs.block.size

2012-02-27 Thread Kai Voigt
"hadoop fsck  -blocks" is something that I think of quickly.

http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck has more 
details
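
If you'd rather check it from code, a small sketch (the path is just an 
example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/mohit/data.seq"));
        System.out.println("block size: " + status.getBlockSize());
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("number of blocks: " + blocks.length);
    }
}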

Kai

Am 28.02.2012 um 02:30 schrieb Mohit Anchlia:

> How do I verify the block size of a given file? Is there a command?
> 
> On Mon, Feb 27, 2012 at 7:59 AM, Joey Echeverria  wrote:
> 
>> dfs.block.size can be set per job.
>> 
>> mapred.tasktracker.map.tasks.maximum is per tasktracker.
>> 
>> -Joey
>> 
>> On Mon, Feb 27, 2012 at 10:19 AM, Mohit Anchlia 
>> wrote:
>>> Can someone please suggest if parameters like dfs.block.size,
>>> mapred.tasktracker.map.tasks.maximum are only cluster wide settings or
>> can
>>> these be set per client job configuration?
>>> 
>>> On Sat, Feb 25, 2012 at 5:43 PM, Mohit Anchlia >> wrote:
>>> 
>>>> If I want to change the block size then can I use Configuration in
>>>> mapreduce job and set it when writing to the sequence file or does it
>> need
>>>> to be cluster wide setting in .xml files?
>>>> 
>>>> Also, is there a way to check the block of a given file?
>>>> 
>> 
>> 
>> 
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
>> 

-- 
Kai Voigt
k...@123.org






Re: Secondary namenode fsimage concept

2011-10-06 Thread Kai Voigt
Hi,

yes, the secondary namenode is actually a badly named piece of software, as 
it's not a namenode at all; it's going to be renamed to checkpoint node.

To prevent metadata loss when your namenode fails, you should write the 
namenode's metadata files to a local RAID and also to networked storage (NFS, 
SAN, DRBD). It's not the secondary namenode's task to keep the metadata 
available.

Kai

Am 06.10.2011 um 09:04 schrieb shanmuganathan.r:

> Hi Kai,
> 
>  There is no datas stored  in the secondarynamenode related to the Hadoop 
> cluster . Am I correct?
> If it correct means If we run the secondaryname node in separate machine then 
> fetching , merging and transferring time is increased if the cluster has 
> large data in the namenode fsimage file . At the time if fail over occurs , 
> then how can we recover the nearly one hour changes in the HDFS file ? 
> (default check point time is one hour)
> 
> Thanks R.Shanmuganathan  
> 
> 
> 
> 
> 
> 
>  On Thu, 06 Oct 2011 12:20:28 +0530 Kai Voigt<k...@123.org> wrote 
>  
> 
> 
> Hi, 
> 
> the secondary namenode only fetches the two files when a checkpointing is 
> needed. 
> 
> Kai 
> 
> Am 06.10.2011 um 08:45 schrieb shanmuganathan.r: 
> 
> > Hi Kai, 
> > 
> > In the Second part I meant 
> > 
> > 
> > Is the secondary namenode also contain the FSImage file or the two 
> files(FSImage and EdiltLog) are transferred from the namenode at the 
> checkpoint time. 
> > 
> > 
> > Thanks 
> > Shanmuganathan 
> > 
> > 
> > 
> > 
> > 
> >  On Thu, 06 Oct 2011 11:37:50 +0530 Kai 
> Voigt <k...@123.org> wrote 
> > 
> > 
> > Hi, 
> > 
> > you're correct when saying the namenode hosts the fsimage file and the 
> edits log file. 
> > 
> > The fsimage file contains a snapshot of the HDFS metadata (a filename to 
> blocks list mapping). Whenever there is a change to HDFS, it will be appended 
> to the edits file. Think of it as a database transaction log, where changes 
> will not be applied to the datafile, but appended to a log. 
> > 
> > To prevent the edits file growing infinitely, the secondary namenode 
> periodically pulls these two files, and the namenode starts writing changes 
> to a new edits file. Then, the secondary namenode merges the changes from the 
> edits file with the old snapshot from the fsimage file and creates an updated 
> fsimage file. This updated fsimage file is then copied to the namenode. 
> > 
> > Then, the entire cycle starts again. To answer your question: The 
> namenode has both files, even if the secondary namenode is running on a 
> different machine. 
> > 
> > Kai 
> > 
> > Am 06.10.2011 um 07:57 schrieb shanmuganathan.r: 
> > 
> > > 
> > > Hi All, 
> > > 
> > > I have a doubt in hadoop secondary namenode concept . Please 
> correct if the following statements are wrong . 
> > > 
> > > 
> > > The namenode hosts the fsimage and edit log files. The 
> secondary namenode hosts the fsimage file only. At the time of checkpoint the 
> edit log file is transferred to the secondary namenode and the both files are 
> merged, Then the updated fsimage file is transferred to the namenode . Is it 
> correct? 
> > > 
> > > 
> > > If we run the secondary namenode in separate machine , then 
> both machines contain the fsimage file . Namenode only contains the editlog 
> file. Is it true? 
> > > 
> > > 
> > > 
> > > Thanks R.Shanmuganathan 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > -- 
> > Kai Voigt 
> > k...@123.org 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> -- 
> Kai Voigt 
> k...@123.org 
> 
> 
> 
> 
> 
> 

-- 
Kai Voigt
k...@123.org






Re: Secondary namenode fsimage concept

2011-10-05 Thread Kai Voigt
Hi,

the secondary namenode only fetches the two files when a checkpointing is 
needed.

Kai

Am 06.10.2011 um 08:45 schrieb shanmuganathan.r:

> Hi Kai,
> 
>  In the Second part I meant 
> 
> 
> Is the secondary namenode also contain the FSImage file or the two 
> files(FSImage and EdiltLog) are transferred from the namenode at the 
> checkpoint time.
> 
> 
> Thanks 
> Shanmuganathan
> 
> 
> 
> 
> 
>  On Thu, 06 Oct 2011 11:37:50 +0530 Kai Voigt<k...@123.org> wrote 
>  
> 
> 
> Hi, 
> 
> you're correct when saying the namenode hosts the fsimage file and the edits 
> log file. 
> 
> The fsimage file contains a snapshot of the HDFS metadata (a filename to 
> blocks list mapping). Whenever there is a change to HDFS, it will be appended 
> to the edits file. Think of it as a database transaction log, where changes 
> will not be applied to the datafile, but appended to a log. 
> 
> To prevent the edits file growing infinitely, the secondary namenode 
> periodically pulls these two files, and the namenode starts writing changes 
> to a new edits file. Then, the secondary namenode merges the changes from the 
> edits file with the old snapshot from the fsimage file and creates an updated 
> fsimage file. This updated fsimage file is then copied to the namenode. 
> 
> Then, the entire cycle starts again. To answer your question: The namenode 
> has both files, even if the secondary namenode is running on a different 
> machine. 
> 
> Kai 
> 
> Am 06.10.2011 um 07:57 schrieb shanmuganathan.r: 
> 
> > 
> > Hi All, 
> > 
> > I have a doubt in hadoop secondary namenode concept . Please correct if 
> the following statements are wrong . 
> > 
> > 
> > The namenode hosts the fsimage and edit log files. The secondary 
> namenode hosts the fsimage file only. At the time of checkpoint the edit log 
> file is transferred to the secondary namenode and the both files are merged, 
> Then the updated fsimage file is transferred to the namenode . Is it correct? 
> > 
> > 
> > If we run the secondary namenode in separate machine , then both 
> machines contain the fsimage file . Namenode only contains the editlog file. 
> Is it true? 
> > 
> > 
> > 
> > Thanks R.Shanmuganathan 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> -- 
> Kai Voigt 
> k...@123.org 
> 
> 
> 
> 
> 
> 
> 

-- 
Kai Voigt
k...@123.org






Re: Secondary namenode fsimage concept

2011-10-05 Thread Kai Voigt
Hi,

you're correct when saying the namenode hosts the fsimage file and the edits 
log file.

The fsimage file contains a snapshot of the HDFS metadata (a filename to blocks 
list mapping). Whenever there is a change to HDFS, it will be appended to the 
edits file. Think of it as a database transaction log, where changes will not 
be applied to the datafile, but appended to a log.

To prevent the edits file growing infinitely, the secondary namenode 
periodically pulls these two files, and the namenode starts writing changes to 
a new edits file. Then, the secondary namenode merges the changes from the 
edits file with the old snapshot from the fsimage file and creates an updated 
fsimage file. This updated fsimage file is then copied to the namenode.

Then, the entire cycle starts again. To answer your question: The namenode has 
both files, even if the secondary namenode is running on a different machine.

Kai

Am 06.10.2011 um 07:57 schrieb shanmuganathan.r:

> 
> Hi All,
> 
>I have a doubt in hadoop secondary namenode concept . Please 
> correct if the following statements are wrong .
> 
> 
> The namenode hosts the fsimage and edit log files. The secondary namenode 
> hosts the fsimage file only. At the time of checkpoint the edit log file is 
> transferred to the secondary namenode and the both files are merged, Then the 
> updated fsimage file is transferred to the namenode . Is it correct?
> 
> 
> If we run the secondary namenode in separate machine , then both machines 
> contain the fsimage file . Namenode only contains the editlog file. Is it 
> true?
> 
> 
> 
> Thanks R.Shanmuganathan  
> 
> 
> 
> 
> 
> 

-- 
Kai Voigt
k...@123.org






Re: Reducer to concatenate string values

2011-09-19 Thread Kai Voigt
Hi Daniel,

the values for a single key will be passed to reduce() in a non-predictable 
order. Actually, when running the same job on the same data again, the order is 
most likely different every time.

If you want the values to arrive in sorted order, you need to apply a 
'secondary sort'. The basic idea is to attach your values to the key and then 
benefit from the sorting Hadoop does on the key.

However, you need to write some code to make that happen. Josh wrote a nice 
series of articles on it, and you will find more if you google for "secondary 
sort".

http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/
http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-2/
http://www.cloudera.com/blog/2011/04/simple-moving-average-secondary-sort-and-mapreduce-part-3/
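
The core pieces of such a secondary sort are roughly these (a sketch with 
made-up class names, assuming the mapper emits a composite key of the form 
"naturalKey#value"):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// send all composite keys with the same natural key to the same reducer
public class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text compositeKey, Text value, int numPartitions) {
        String naturalKey = compositeKey.toString().split("#", 2)[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// group all composite keys with the same natural key into one reduce() call
class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String ka = a.toString().split("#", 2)[0];
        String kb = b.toString().split("#", 2)[0];
        return ka.compareTo(kb);
    }
}

The driver then sets job.setPartitionerClass(NaturalKeyPartitioner.class) and 
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class), and the 
sort Hadoop does on the composite key delivers the values of one natural key 
in sorted order within a single reduce() call.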

Kai

Am 20.09.2011 um 07:43 schrieb Daniel Yehdego:

> 
> Good evening, 
> I have a certain value output from a mapper and I want to concatenate the 
> string outputs using a Reducer (one reducer).But the order of the 
> concatenated string values is not in order. How can I use a reducer that 
> receives a value from a mapper output and concatenate the strings in order. 
> waiting your response and thanks in advance.  
> 
> Regards, 
> 
> Daniel T. Yehdego
> Computational Science Program 
> University of Texas at El Paso, UTEP 
> dtyehd...@miners.utep.edu   

-- 
Kai Voigt
k...@123.org






Re: phases of Hadoop Jobs

2011-09-18 Thread Kai Voigt
Hi Chen,

yes, it saves time to move map() output to the nodes where it will be needed 
as reduce() input. After map() has processed the first blocks, it makes sense 
to copy that output to the reduce nodes right away. Imagine a very large map() 
output: if the shuffle/copy were postponed until all map tasks are done, we'd 
just be waiting. So those things happen in parallel.

Consider it like Unix pipes where you start processing the output of the first 
command as the input of the next command.

$ command1 | command2

as opposed to store the output first and then process it.

$ command1 > file; command2 < file

Kai

Am 19.09.2011 um 08:20 schrieb He Chen:

> Hi Kai
> 
> Thank you  for the reply.
> 
> The reduce() will not start because the shuffle phase does not finish. And
> the shuffle phase will not finish untill alll mapper end.
> 
> I am curious about the design purpose about overlapping the map and reduce
> stage. Was this only for saving shuffling time? Or there are some other
> reasons.
> 
> Best wishes!
> 
> Chen
> On Mon, Sep 19, 2011 at 12:36 AM, Kai Voigt  wrote:
> 
>> Hi Chen,
>> 
>> the times when nodes running instances of the map and reduce nodes overlap.
>> But map() and reduce() execution will not.
>> 
>> reduce nodes will start copying data from map nodes, that's the shuffle
>> phase. And the map nodes are still running during that copy phase. My
>> observation had been that if the map phase progresses from 0 to 100%, it
>> matches with the reduce phase progress from 0-33%. For example, if you map
>> progress shows 60%, reduce might show 20%.
>> 
>> But the reduce() will not start until all the map() code has processed the
>> entire input. So you will never see the reduce progress higher than 66% when
>> map progress didn't reach 100%.
>> 
>> If you see map phase reaching 100%, but reduce phase not making any higher
>> number than 66%, it means your reduce() code is broken or slow because it
>> doesn't produce any output. An infinitive loop is a common mistake.
>> 
>> Kai
>> 
>> Am 19.09.2011 um 07:29 schrieb He Chen:
>> 
>>> Hi Arun
>>> 
>>> I have a question. Do you know what is the reason that hadoop allows the
>> map
>>> and the reduce stage overlap? Or anyone knows about it. Thank you in
>>> advance.
>>> 
>>> Chen
>>> 
>>> On Sun, Sep 18, 2011 at 11:17 PM, Arun C Murthy 
>> wrote:
>>> 
>>>> Nan,
>>>> 
>>>> The 'phase' is implicitly understood by the 'progress' (value) made by
>> the
>>>> map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
>>>> 
>>>> For e.g.
>>>> Reduce:
>>>> 0-33% -> Shuffle
>>>> 34-66% -> Sort (actually, just 'merge', there is no sort in the reduce
>>>> since all map-outputs are sorted)
>>>> 67-100% -> Reduce
>>>> 
>>>> With 0.23 onwards the Map has phases too:
>>>> 0-90% -> Map
>>>> 91-100% -> Final Sort/merge
>>>> 
>>>> Now,about starting reduces early - this is done to ensure shuffle can
>>>> proceed for completed maps while rest of the maps run, there-by
>> pipelining
>>>> shuffle and map completion. There is a 'reduce slowstart' feature to
>> control
>>>> this - by default, reduces aren't started until 5% of maps are complete.
>>>> Users can set this higher.
>>>> 
>>>> Arun
>>>> 
>>>> On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
>>>> 
>>>>> Hi, all
>>>>> 
>>>>> recently, I was hit by a question, "how is a hadoop job divided into 2
>>>>> phases?",
>>>>> 
>>>>> In textbooks, we are told that the mapreduce jobs are divided into 2
>>>> phases,
>>>>> map and reduce, and for reduce, we further divided it into 3 stages,
>>>>> shuffle, sort, and reduce, but in hadoop codes, I never think about
>>>>> this question, I didn't see any variable members in JobInProgress class
>>>>> to indicate this information,
>>>>> 
>>>>> and according to my understanding on the source code of hadoop, the
>>>> reduce
>>>>> tasks are unnecessarily started until all mappers are finished, in
>>>>> constract, we can see the reduce tasks are in shuffle stage while there
>>>> are
>>>>> mappers which are still in running,
>>>>> So how can I indicate the phase which the job is belonging to?
>>>>> 
>>>>> Thanks
>>>>> --
>>>>> Nan Zhu
>>>>> School of Electronic, Information and Electrical Engineering,229
>>>>> Shanghai Jiao Tong University
>>>>> 800,Dongchuan Road,Shanghai,China
>>>>> E-Mail: zhunans...@gmail.com
>>>> 
>>>> 
>> 
>> --
>> Kai Voigt
>> k...@123.org
>> 
>> 
>> 
>> 
>> 

-- 
Kai Voigt
k...@123.org






Re: phases of Hadoop Jobs

2011-09-18 Thread Kai Voigt
Hi Chen,

the time windows in which nodes run map tasks and reduce tasks overlap, but 
the execution of map() and reduce() code does not.

Reduce nodes start copying data from map nodes; that's the shuffle phase, and 
the map tasks are still running during that copy phase. My observation has 
been that the map phase progressing from 0 to 100% corresponds to the reduce 
phase progressing from 0 to 33%. For example, if your map progress shows 60%, 
reduce might show 20%.

But reduce() will not start until all the map() code has processed the entire 
input. So you will never see the reduce progress higher than 66% before map 
progress reaches 100%.

If you see the map phase reaching 100% but the reduce phase never getting past 
66%, it means your reduce() code is broken or slow because it doesn't produce 
any output. An infinite loop is a common mistake.

Kai

Am 19.09.2011 um 07:29 schrieb He Chen:

> Hi Arun
> 
> I have a question. Do you know what is the reason that hadoop allows the map
> and the reduce stage overlap? Or anyone knows about it. Thank you in
> advance.
> 
> Chen
> 
> On Sun, Sep 18, 2011 at 11:17 PM, Arun C Murthy  wrote:
> 
>> Nan,
>> 
>> The 'phase' is implicitly understood by the 'progress' (value) made by the
>> map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
>> 
>> For e.g.
>> Reduce:
>> 0-33% -> Shuffle
>> 34-66% -> Sort (actually, just 'merge', there is no sort in the reduce
>> since all map-outputs are sorted)
>> 67-100% -> Reduce
>> 
>> With 0.23 onwards the Map has phases too:
>> 0-90% -> Map
>> 91-100% -> Final Sort/merge
>> 
>> Now,about starting reduces early - this is done to ensure shuffle can
>> proceed for completed maps while rest of the maps run, there-by pipelining
>> shuffle and map completion. There is a 'reduce slowstart' feature to control
>> this - by default, reduces aren't started until 5% of maps are complete.
>> Users can set this higher.
>> 
>> Arun
>> 
>> On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
>> 
>>> Hi, all
>>> 
>>> recently, I was hit by a question, "how is a hadoop job divided into 2
>>> phases?",
>>> 
>>> In textbooks, we are told that the mapreduce jobs are divided into 2
>> phases,
>>> map and reduce, and for reduce, we further divided it into 3 stages,
>>> shuffle, sort, and reduce, but in hadoop codes, I never think about
>>> this question, I didn't see any variable members in JobInProgress class
>>> to indicate this information,
>>> 
>>> and according to my understanding on the source code of hadoop, the
>> reduce
>>> tasks are unnecessarily started until all mappers are finished, in
>>> constract, we can see the reduce tasks are in shuffle stage while there
>> are
>>> mappers which are still in running,
>>> So how can I indicate the phase which the job is belonging to?
>>> 
>>> Thanks
>>> --
>>> Nan Zhu
>>> School of Electronic, Information and Electrical Engineering,229
>>> Shanghai Jiao Tong University
>>> 800,Dongchuan Road,Shanghai,China
>>> E-Mail: zhunans...@gmail.com
>> 
>> 

-- 
Kai Voigt
k...@123.org






Re: phases of Hadoop Jobs

2011-09-18 Thread Kai Voigt
Hi,

these 0-33-66-100% phases are really confusing to beginners; we see that in 
our training classes. The output should be more verbose, for example by 
breaking the phases down into separate progress numbers.

Does that make sense?

Am 19.09.2011 um 06:17 schrieb Arun C Murthy:

> Nan,
> 
> The 'phase' is implicitly understood by the 'progress' (value) made by the 
> map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
> 
> For e.g. 
> Reduce: 
> 0-33% -> Shuffle
> 34-66% -> Sort (actually, just 'merge', there is no sort in the reduce since 
> all map-outputs are sorted)
> 67-100% -> Reduce
> 
> With 0.23 onwards the Map has phases too:
> 0-90% -> Map
> 91-100% -> Final Sort/merge
> 
> Now,about starting reduces early - this is done to ensure shuffle can proceed 
> for completed maps while rest of the maps run, there-by pipelining shuffle 
> and map completion. There is a 'reduce slowstart' feature to control this - 
> by default, reduces aren't started until 5% of maps are complete. Users can 
> set this higher.
> 
> Arun
> 
> On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
> 
>> Hi, all
>> 
>> recently, I was hit by a question, "how is a hadoop job divided into 2
>> phases?",
>> 
>> In textbooks, we are told that the mapreduce jobs are divided into 2 phases,
>> map and reduce, and for reduce, we further divided it into 3 stages,
>> shuffle, sort, and reduce, but in hadoop codes, I never think about
>> this question, I didn't see any variable members in JobInProgress class
>> to indicate this information,
>> 
>> and according to my understanding on the source code of hadoop, the reduce
>> tasks are unnecessarily started until all mappers are finished, in
>> constract, we can see the reduce tasks are in shuffle stage while there are
>> mappers which are still in running,
>> So how can I indicate the phase which the job is belonging to?
>> 
>> Thanks
>> -- 
>> Nan Zhu
>> School of Electronic, Information and Electrical Engineering,229
>> Shanghai Jiao Tong University
>> 800,Dongchuan Road,Shanghai,China
>> E-Mail: zhunans...@gmail.com
> 
> 

-- 
Kai Voigt
k...@123.org






Re: Hadoop with Netapp

2011-08-25 Thread Kai Voigt
Hi,

Am 25.08.2011 um 08:58 schrieb Hakan İlter:

> We are going to create a new Hadoop cluster in our company, i have to get
> some advises from you:
> 
> 1. Does anyone have stored whole Hadoop data not on local disks but
> on Netapp or other storage system? Do we have to store datas on local disks,
> if so is it because of performace issues?

HDFS and MapReduce benefit massively from local storage, so using any kind of 
remote storage (SAN, Amazon S3, etc) will make things slower.

> 2. What do you think about running Hadoop nodes in virtual (VMware)
> servers?


Virtualization can make certain things easier to handle, but it's a layer that 
will eat resources.

Kai

-- 
Kai Voigt
k...@123.org






Re: Where is the hadoop-examples source code for the Sort example mapper/reducer?

2011-08-13 Thread Kai Voigt
Good job. In MapReduce you can build your own Partitioner, that is, the code 
determining which reducer gets which keys.

For simplicity, assume you're running 26 reducers. Your custom Partitioner 
will make sure the first reducer gets all keys starting with 'a', the second 
all keys starting with 'b', and so on.

Since the keys are sorted within a single reducer, you can concatenate your 26 
output files to get an overall sorted output.

Making sense?
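
To make it concrete, a sketch of such a partitioner (made-up class name, 
assuming 26 reducers and lower-case ASCII keys):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReducers) {
        // reducer 0 gets keys starting with 'a', reducer 1 gets 'b', and so on
        int bucket = key.toString().charAt(0) - 'a';
        return Math.max(0, Math.min(numReducers - 1, bucket));
    }
}

TeraSort does something similar, but it samples the input first to pick split 
points that spread the keys evenly across the reducers.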

Kai

Am 13.08.2011 um 17:44 schrieb Sean Hogan:

> Oh, okay, got it - if there was more than one reducer then there needs to be
> a way to guarantee that the overall output from multiple reducers will still
> be sorted.
> 
> So I want to look for where the implementation of the shuffle/sort phase is
> located. Or find something on how Hadoop implements the MapReduce
> sort/shuffle phase.
> 
> Thanks!
> 
> -Sean

-- 
Kai Voigt
k...@123.org






Re: Where is the hadoop-examples source code for the Sort example mapper/reducer?

2011-08-13 Thread Kai Voigt
Hi,

the Identity Mapper and Reducer do what the name implies: they pretty much 
return their input as their output.

TeraSort relies on the sorting that is built into Hadoop's Sort&Shuffle phase.

So, the map() method in TeraSort looks like this:

map(offset, line) -> (line, _)

offset is the key to map() and represents the byte offset of the line (which is 
the value). map() returns the line as the key and some value which is not 
needed.

reduce() looks like this:

reduce(line, values) -> (line)

Again, the input is returned as is. The sort&shuffle layer between map() and 
reduce() guarantees that keys (lines) will come in sorted order. That's why the 
overall output will be the sorted input.
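
In (simplified) Java, the mapper boils down to something like this sketch 
(made-up class name, not the actual example source):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineAsKeyMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // move the line into the key position so sort & shuffle orders it
        context.write(line, NullWritable.get());
    }
}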

This all is easy when there's just one reducer. Question to make sure you 
understood things so far: What's the issue with more than one reducer?

Kai

Am 13.08.2011 um 17:10 schrieb Sean Hogan:

> Thanks for the link, but it hasn't helped answer my original question - that
> Sort.java seems to use IdentityMapper and IdentityReducer. Perhaps it is the
> Sort.java that is used when executing the below command, but I can't figure
> out what it actually uses for the mapper and reducer. It's entirely possible
> I'm just missing something obvious.
> 
> I'm interested in seeing how the map and reduce fits into sorting with the
> following command:
> 
> $ hadoop jar hadoop-*-examples.jar sort input output
> 
> I'd appreciate it if someone could explain what mappers/reducers are used in
> that above command (link to the implementation of whatever sort they use and
> how it fits into MapReduce)
> 
> Thanks.
> 
> -Sean

-- 
Kai Voigt
k...@123.org






Re: Where is the hadoop-examples source code for the Sort example mapper/reducer?

2011-08-13 Thread Kai Voigt
Hi,

some search on Google would have told you. Here's one link:

http://code.google.com/p/hop/source/browse/trunk/src/examples/org/apache/hadoop/examples/?r=131

Kai

Am 13.08.2011 um 15:27 schrieb Sean Hogan:

> Hi all,
> 
> I was interested in learning from how Hadoop implements their sort algorithm
> in the map/reduce framework. Could someone point me to the directory of the
> source code that has the mapper/reducer that the Sort example uses by
> default when I invoke:
> 
> $ hadoop jar hadoop-*-examples.jar sort input output
> 
> Thanks. I've found Sort.java here :
> 
> http://svn.apache.org/viewvc/hadoop/common/trunk/mapreduce/src/examples/org/apache/hadoop/examples/
> 
> But have not been able to track down the mapper/reducer implementation.
> 
> -Sean Hogan

-- 
Kai Voigt
k...@123.org






Re: Question about RAID controllers and hadoop

2011-08-11 Thread Kai Voigt
Yahoo did some testing 2 years ago: http://markmail.org/message/xmzc45zi25htr7ry

But an updated benchmark would be interesting to see.

Kai

Am 12.08.2011 um 00:13 schrieb GOEKE, MATTHEW (AG/1000):

> My assumption would be that having a set of 4 raid 0 disks would actually be 
> better than having a controller that allowed pure JBOD of 4 disks due to the 
> cache on the controller. If anyone has any personal experience with this I 
> would love to know performance numbers but our infrastructure guy is doing 
> tests on exactly this over the next couple days so I will pass it along once 
> we have it.
> 
> Matt
> 
> -Original Message-
> From: Bharath Mundlapudi [mailto:bharathw...@yahoo.com] 
> Sent: Thursday, August 11, 2011 5:00 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Question about RAID controllers and hadoop
> 
> True, you need a P410 controller. You can create RAID0 for each disk to make 
> it as JBOD.
> 
> 
> -Bharath
> 
> 
> 
> 
> From: Koert Kuipers 
> To: common-user@hadoop.apache.org
> Sent: Thursday, August 11, 2011 2:50 PM
> Subject: Question about RAID controllers and hadoop
> 
> Hello all,
> We are considering using low end HP proliant machines (DL160s and DL180s)
> for cluster nodes. However with these machines if you want to do more than 4
> hard drives then HP puts in a P410 raid controller. We would configure the
> RAID controller to function as JBOD, by simply creating multiple RAID
> volumes with one disk. Does anyone have experience with this setup? Is it a
> good idea, or am i introducing a i/o bottleneck?
> Thanks for your help!
> Best, Koert
> This e-mail message may contain privileged and/or confidential information, 
> and is intended to be received only by persons entitled
> to receive such information. If you have received this e-mail in error, 
> please notify the sender immediately. Please delete it and
> all attachments from any servers, hard drives or any other media. Other use 
> of this e-mail by you is strictly prohibited.
> 
> All e-mails and attachments sent and received are subject to monitoring, 
> reading and archival by Monsanto, including its
> subsidiaries. The recipient of this e-mail is solely responsible for checking 
> for the presence of "Viruses" or other "Malware".
> Monsanto, along with its subsidiaries, accepts no liability for any damage 
> caused by any such code transmitted by or accompanying
> this e-mail or any attachment.
> 
> 
> The information contained in this email may be subject to the export control 
> laws and regulations of the United States, potentially
> including but not limited to the Export Administration Regulations (EAR) and 
> sanctions regulations issued by the U.S. Department of
> Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this 
> information you are obligated to comply with all
> applicable U.S. export laws and regulations.
> 
> 

-- 
Kai Voigt
k...@123.org






Re: Where is web interface in stand alone operation?

2011-08-10 Thread Kai Voigt
Hi,

for further clarification: are you running in standalone mode (1 JVM which 
runs everything) or in pseudo-distributed mode (1 machine, but with 5 JVMs)? 
In pseudo-distributed mode, you can access the web interfaces on ports 50030 
and 50070. In standalone mode, you should at least be able to do process 
monitoring (the Unix time command, various trace commands).

Kai

Am 10.08.2011 um 15:58 schrieb A Df:

> 
> 
> Hello Harsh:
> 
> See inline at *
> 
> 
>> 
>> From: Harsh J 
>> To: common-user@hadoop.apache.org; A Df 
>> Sent: Wednesday, 10 August 2011, 14:44
>> Subject: Re: Where is web interface in stand alone operation?
>> 
>> A Df,
>> 
>> The web UIs are a feature of the daemons JobTracker and NameNode. In
>> standalone/'local'/'file:///' modes, these daemons aren't run
>> (actually, no daemon is run at all), and hence there would be no 'web'
>> interface.
>> 
>> *ok, but is there any other way to check the performance in this mode such 
>> as time to complete etc? I am trying to compare performance between the two. 
>> And also for the pseudo mode how would I change the ports for the web 
>> interface because I have to connect to a remote server which only allows 
>> certain ports to be accessed from the web?
>> 
>> On Wed, Aug 10, 2011 at 6:01 PM, A Df  wrote:
>>> Dear All:
>>> 
>>> I know that in pseudo mode that there is a web interface for the NameNode 
>>> and the JobTracker but where is it for the standalone operation? The Hadoop 
>>> page at http://hadoop.apache.org/common/docs/current/single_node_setup.html 
>>> just shows to run the jar example but how do you view job details? For 
>>> example time to complete etc. I know it will not be as detailed as the 
>>> other modes but I wanted to compare the job peformance in standalone vs 
>>> pseudo mode. Thank you.
>>> 
>>> 
>>> Cheers,
>>> A Df
>>> 
>> 
>> 
>> 
>> -- 
>> Harsh J
>> 
>> 

-- 
Kai Voigt
k...@123.org






Re: Where is web interface in stand alone operation?

2011-08-10 Thread Kai Voigt
Hi,

just connect to http://localhost:50070/ or http://localhost:50030/ to access 
the web interfaces.

Kai

Am 10.08.2011 um 14:31 schrieb A Df:

> Dear All:
> 
> I know that in pseudo mode that there is a web interface for the NameNode and 
> the JobTracker but where is it for the standalone operation? The Hadoop page 
> at http://hadoop.apache.org/common/docs/current/single_node_setup.html just 
> shows to run the jar example but how do you view job details? For example 
> time to complete etc. I know it will not be as detailed as the other modes 
> but I wanted to compare the job peformance in standalone vs pseudo mode. 
> Thank you.
> 
> 
> Cheers,
> A Df

-- 
Kai Voigt
k...@123.org






Re: Can I number output results with a Counter?

2011-05-20 Thread Kai Voigt
Also, with speculative execution enabled, you might see a higher count than 
you expect while the same task is running multiple times in parallel. When a 
task gets killed because another instance was quicker, its counters will be 
removed from the global count, though.

Kai

Am 20.05.2011 um 19:34 schrieb Joey Echeverria:

> Counters are a way to get status from your running job. They don't
> increment a global state. They locally save increments and
> periodically report those increments to the central counter. That
> means that the final count will be correct, but you can't use them to
> coordinate counts while your job is running.
> 
> -Joey
> 
> On Fri, May 20, 2011 at 10:17 AM, Mark Kerzner  wrote:
>> Joey,
>> 
>> You understood me perfectly well. I see your first advice, but I am not
>> allowed to have gaps. A central service is something I may consider if
>> single reducer becomes a worse bottleneck than it.
>> 
>> But what are counters for? They seem to be exactly that.
>> 
>> Mark
>> 
>> On Fri, May 20, 2011 at 12:01 PM, Joey Echeverria  wrote:
>> 
>>> To make sure I understand you correctly, you need a globally unique
>>> one up counter for each output record?
>>> 
>>> If you had an upper bound on the number of records a single reducer
>>> could output and you can afford to have gaps, you could just use the
>>> task id and multiply that by the max number of records and then one up
>>> from there.
>>> 
>>> If that doesn't work for you, then you'll need to use some kind of
>>> central service for allocating numbers which could become a
>>> bottleneck.
>>> 
>>> -Joey
>>> 
>>> On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner 
>>> wrote:
>>>> Hi, can I use a Counter to give each record in all reducers a consecutive
>>>> number? Currently I am using a single Reducer, but it is an anti-pattern.
>>>> But I need to assign consecutive numbers to all output records in all
>>>> reducers, and it does not matter how, as long as each gets its own
>>> number.
>>>> 
>>>> If it IS possible, then how are multiple processes accessing those
>>> counters
>>>> without creating race conditions.
>>>> 
>>>> Thank you,
>>>> 
>>>> Mark
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Joseph Echeverria
>>> Cloudera, Inc.
>>> 443.305.9434
>>> 
>> 
> 
> 
> 
> -- 
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
> 

-- 
Kai Voigt
k...@123.org






Re: providing the same input to more than one Map task

2011-04-25 Thread Kai Voigt
Hi,

I'd use the distributed cache to store the vector on every mapper machine 
locally.
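
A rough mapper sketch (file formats, paths and class names are assumptions, 
not your actual data layout):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatrixVectorMapper
        extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
    private final Map<Integer, Double> vector = new HashMap<Integer, Double>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // the driver shipped the vector with
        // DistributedCache.addCacheFile(new URI("/user/alexandra/vector.txt"),
        //         job.getConfiguration());
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t"); // assumed format: "j <TAB> v[j]"
            vector.put(Integer.parseInt(parts[0]), Double.parseDouble(parts[1]));
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // assumed matrix format: "i <TAB> j <TAB> m[i][j]"
        String[] parts = value.toString().split("\t");
        int i = Integer.parseInt(parts[0]);
        int j = Integer.parseInt(parts[1]);
        double m = Double.parseDouble(parts[2]);
        context.write(new LongWritable(i), new DoubleWritable(m * vector.get(j)));
    }
}

The reducer then just sums the values per row key i.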

Kai

Am 22.04.2011 um 21:15 schrieb Alexandra Anghelescu:

> Hi all,
> 
> I am trying to perform matrix-vector multiplication using Hadoop.
> So I have matrix M in a file, and vector v in another file. How can I make
> it so that each Map task will get the whole vector v and a chunk of matrix
> M?
> Basically I want my map function to output key-value pairs (i,m[i,j]*v[j]),
> where i is the row number, and j the column number. And the reduce function
> will sum up all the values with the same key i, and that will be the ith
> element of my result vector.
> Or can you suggest another way to do it?
> 
> 
> Thanks,
> Alexandra Anghelescu

-- 
Kai Voigt
k...@123.org






Re: How HDFS decides where to put the block

2011-04-18 Thread Kai Voigt
Hi,

I found 
http://hadoopblog.blogspot.com/2009/09/hdfs-block-replica-placement-in-your.html
 explains the process nicely.

The first replica of each block will be stored on the client machine, if it's a 
datanode itself. Makes sense, as it doesn't require a network transfer. 
Otherwise, a random datanode will be picked for the first replica.

The second replica will be written to a random datanode on a random rack other 
than the rack where the first replica is stored. Here, HDFS's rack awareness 
will be utilized. So HDFS would survive a rack failure.

The third replica will be written to the same rack as the second replica, but 
to another random datanode in that rack. That makes the pipeline between the 
second and third replica quick.

Does that make sense to you? Note that this is the current hard-coded policy; 
there are ideas to make it customizable 
(https://issues.apache.org/jira/browse/HDFS-385).

Kai

Am 18.04.2011 um 15:46 schrieb Nan Zhu:

> Hi, all
> 
> I'm confused by a question that "how does the HDFS decide where to put the
> data blocks "
> 
> I mean that the user invokes some commands like "./hadoop put ***", we
> assume that this  file consistes of 3 blocks, but how HDFS decides where
> these 3 blocks to be put?
> 
> Most of the materials don't involve this issue, but just introduce the data
> replica where talking about blocks in HDFS,
> 
> can anyone give me some instructions?
> 
> Thanks
> 
> Nan
> 
> -- 
> Nan Zhu
> School of Software,5501
> Shanghai Jiao Tong University
> 800,Dongchuan Road,Shanghai,China
> E-Mail: zhunans...@gmail.com

-- 
Kai Voigt
k...@123.org






Re: Best way to Merge small XML files

2011-02-03 Thread Kai Voigt
Did you look into Hadoop Archives?

http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html

Kai

Am 03.02.2011 um 11:44 schrieb madhu phatak:

> Hi
> You can write an InputFormat which create input splits from multiple files .
> It will solve your problem.
> 
> On Wed, Feb 2, 2011 at 4:04 PM, Shuja Rehman  wrote:
> 
>> Hi Folks,
>> 
>> I am having hundreds of small xml files coming each hour. The size varies
>> from 5 Mb to 15 Mb. As Hadoop did not work well with small files so i want
>> to merge these small files. So what is the best option to merge these xml
>> files?
>> 
>> 
>> 
>> --
>> Regards
>> Shuja-ur-Rehman Baig
>> <http://pk.linkedin.com/in/shujamughal>
>> 

-- 
Kai Voigt
k...@123.org