Reading multiple input files.

2014-01-10 Thread Kim Chew
How does a MR job read multiple input files from different locations?

What if the input files are not in hdfs and located on different servers?
Do I have to copy them to hdfs first and instruct my MR job to read from
them? Can I instruct my MR job to read directly from those servers?

Thanks.

Kim


Re: Hadoop 2.2.0 YARN isolation and Windows

2014-01-10 Thread Hadoop Dev
Hi,
Though I am not an expert in Hadoop, I think one of Hadoop's main motives is
data locality, i.e. the business logic (the job) should run locally with the
data. This is helped by keeping replicas of each data block on several
datanodes, so that a task can be scheduled next to a copy of the block it
reads. This saves network bandwidth. So you may be able to run the HDFS and
YARN services on different clusters, but I think that's not what Hadoop aims
for. This is just my point of view.




On Fri, Jan 10, 2014 at 3:45 AM, Arpit Agarwal wrote:

> Hi Kevin,
>
> > 2. According to
> > http://www.i-programmer.info/news/197-data-mining/6518-hadoop-2-introduces-yarn.html,
> > Hadoop 2.2.0 supports Microsoft Windows. How do/Can you configure YARN for
> > secure container isolation in Windows? It seems that the ContainerExecutor
> > and DefaultContainerExecutor can detect and run on Windows, but the secure
> > LinuxContainerExecutor is for *nix systems, so is there anything in place
> > for maximum security like LCE is?
>
> Cluster security is not supported on Windows at the moment.
>
>
>
> > 1. Does YARN need to run on the same machines that are hosting the HDFS
> > services or can HDFS be remote of a YARN cluster? Is this done by placing
> > the remote HDFS cluster's configuration files (core-site.xml and
> > hdfs-site.xml) on the YARN cluster's machines?
>
> > 3. If 1 is yes, then is it possible to have a cluster mixed with both
> > Linux and Windows machines running YARN and working together?
>
> It should work in theory if you get the configuration right - I have not
> tried it out so I am not sure. YARN containers and HDFS datanodes should be
> collocated for good performance. The MapReduce compute model especially
> depends on access to fast local storage.
>
>
>
> On Thu, Jan 9, 2014 at 1:19 PM, Kevin  wrote:
>
>> Hi,
>>
>> Three questions about the new Hadoop release regarding YARN:
>>
>> 1. Does YARN need to run on the same machines that are hosting the HDFS
>> services or can HDFS be remote of a YARN cluster? Is this done by placing
>> the remote HDFS cluster's configuration files (core-site.xml and
>> hdfs-site.xml) on the YARN cluster's machines?
>>
>> 2. According to
>> http://www.i-programmer.info/news/197-data-mining/6518-hadoop-2-introduces-yarn.html,
>> Hadoop 2.2.0 supports Microsoft Windows. How do/Can you configure YARN for
>> secure container isolation in Windows? It seems that the ContainerExecutor
>> and DefaultContainerExecutor can detect and run on Windows, but the secure
>> LinuxContainerExecutor is for *nix systems, so is there anything in place
>> for maximum security like LCE is?
>>
>> 3. If 1 is yes, then is it possible to have a cluster mixed with both
>> Linux and Windows machines running YARN and working together?
>>
>> Thanks,
>> Kevin
>>
>
>


MultipleOutputs and Hadoop Pipes 2.2.0

2014-01-10 Thread Silvina CaĆ­no Lores
Hello everyone,

I'm trying to separate the reducer's output in order to have one file per
key with a specific name, something similar to THIS.


I did it in 1.1.2 using MultipleTextOutputFormat, but after migrating to
2.2.0 I don't seem to find the proper way to do it since I'm using C++ code
and Pipes and I couldn't find a way to link MultipleOutputs (which belongs
to the Java library) to my code.

Am I missing something? Is there any workaround you can suggest?

Thanks a lot in advance!

Regards,
Silvina


Switch and configure single node and cluster XML configuration

2014-01-10 Thread Michael Bremen

Hi,

I have a workstation and a laptop. I have installed 2.1.1 release on 
both computers. Now I want to test single node and cluster setup. I am 
attaching the conf/core-site.xml and conf/hdfs-site.xml. The network 
topology is as follows


Workstation connected to router via ethernet.
Laptop connected to router via wireless.
SSH ( without pass phrase ) working between the two.

LAN broadcast ip : 192.168.0.0
Workstation will act as the master in cluster setup with ip: 192.168.0.101
Laptop ip: 192.168.0.103
Operating systems : Ubuntu 13.04 distribution of Linux operating system.

I have commented the hdfs-site.xml properties for switching between 
single node and cluster setup.


Problems :
1. In single node configuration ( Datanode and secondary Namenode 
successfully working) Namenode failed to start on the workstation.
2. In cluster node configuration, the secondary name node is on laptop. 
The master script ( start-dfs.sh ) tries to start the secondary name 
node on the laptop with IP address 192.168.0.0 instead of 192.168.0.103 
hence failed to connect to port 22. ( using 192.168.0.0:22 instead of 
192.168.0.103:22).


Am I configuring the hdfs-site.xml correctly?

Thanks.


core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>0.0.0.0:50090</value>
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50010</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:50075</value>
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:50020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:50070</value>
  </property>
</configuration>



RE: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem

2014-01-10 Thread java8964
When you said that the mappers seem to be accessing files sequentially, why do
you think so?
NFS may change some things, but the mappers shouldn't access files sequentially.
NFS could make a file unsplittable, but you need more tests to verify that.
The class you want to check out is
org.apache.hadoop.mapred.FileInputFormat, especially its method getSplits().
That code is the key to how the split list is generated. If it doesn't
perform well for your underlying storage system, you can always write your
own InputFormat to suit your storage system.
Yong
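
If you do go down the custom InputFormat route, here is a minimal sketch (it
uses the newer org.apache.hadoop.mapreduce API rather than the mapred class
named above, and the class name is illustrative, not something from this
thread). Forcing isSplitable() to false makes getSplits() return exactly one
split per file, which is one simple way to take split generation into your own
hands:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One split (and therefore one mapper) per input file, regardless of block size.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // getSplits() in FileInputFormat consults this; returning false
        // prevents the file from being chopped into block-sized splits.
        return false;
    }
}

The job would then call job.setInputFormatClass(WholeFileTextInputFormat.class)
instead of relying on the default.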

From: atish.kath...@gmail.com
Date: Wed, 8 Jan 2014 15:48:12 +0530
Subject: Re: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem
To: user@hadoop.apache.org

Figured out 1: the output of the reduce was going to the slave node, while I
was looking for it on the master node, which is perfectly fine. Still need
guidance for 2, though!


Thanks,
Atish

On Wed, Jan 8, 2014 at 3:30 PM, Atish Kathpal  wrote:



Hi,
By giving the complete URI, the MR jobs worked across both nodes. Thanks a lot
for the advice.

Two issues though:

1. On completion of the MR job, I see only the "_SUCCESS" file in the output
directory, but no part-r-* file containing the actual results of the wordcount
job. However, I am seeing the correct output when running MR over HDFS. What is
going wrong? Is there any place I can find logs for the MR job? I see no errors
on the console.

Command used: hadoop jar
/home/hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
wordcount file:///home/hduser/testmount/ file:///home/hduser/testresults/





2. I am observing that the mappers seem to be accessing the files sequentially:
splitting one file across mappers, reading its data in parallel, and then
moving on to the next file. What I want instead is for the files themselves to
be accessed in parallel; that is, if there are 10 files to be MRed, then MR
should ask for each of these files in parallel in one go, and then work on the
splits of these files in parallel.

Why do I need this? Some of the data coming from the NFS mount point is coming
from offline media (which takes ~5-10 seconds before the first bytes are
received). So I would like all required files to be requested at the onset from
the NFS mount point. This way several offline media will be spun up in
parallel, and as the data from these media becomes available MR can process it.




Would be glad to get inputs on these points!

Thanks,
Atish

Tip for those who are trying similar stuff: in my case, after a while the jobs
would fail, complaining of "java.lang.OutOfMemoryError: Java heap space", but I
was able to rectify this with help from:
http://stackoverflow.com/questions/13674190/cdh-4-1-error-running-child-java-lang-outofmemoryerror-java-heap-space
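
(For anyone hitting the same error: the usual fix is giving the task JVMs more
heap in mapred-site.xml, along the lines of the sketch below; the -Xmx values
are only illustrative and should be sized to your nodes.)

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1024m</value>
</property>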









On Sun, Dec 22, 2013 at 2:47 PM, Atish Kathpal  wrote:




Thanks Devin, Yong, and Chris for your replies and suggestions. I will test the 
suggestions made by Yong and Devin and get back to you guys.




As for the bottlenecking issue, I agree, but I am trying to run a few MR jobs on
a traditional NAS server. I can live with a few bottlenecks, as long as I don't
have to move the data to a dedicated HDFS cluster.






On Sat, Dec 21, 2013 at 8:06 AM, Chris Mawata  wrote:






  

  
  
Yong raises an important issue: You have thrown out the I/O advantages of HDFS
and also thrown out the advantages of data locality. It would be interesting to
know why you are taking this approach.

Chris

On 12/20/2013 9:28 AM, java8964 wrote:

> I believe the "-fs local" should be removed too. The reason is that even if
> you have a dedicated JobTracker after removing "-jt local", with "-fs local"
> I believe that all the mappers will be run sequentially.
>
> "-fs local" will force the mapreduce job to run in "local" mode, which is
> really a test mode.
>
> What you can do is to remove both "-fs local -jt local", but give the FULL
> URI of the input and output path, to tell Hadoop that they are on the local
> filesystem instead of HDFS.
>
> "hadoop jar
> /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
> wordcount file:///hduser/mount_point file:///results"
>
> Keep in mind the following:
>
> 1) The NFS mount needs to be available on all your task nodes, and mounted
> in the same way.
> 2) Even if you can do that, your shared storage will be your bottleneck. NFS
> won't work well for scalability.
>
> Yong
>
> Date: Fri, 20 Dec 2013 09:01:32 -0500
> Subject: Re: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem
> From: dsui...@rdx.com
> To: user@ha

RE: Reading multiple input files.

2014-01-10 Thread java8964
Yes.
Hadoop is very flexible about the underlying storage system. It is in your
control how to utilize the cluster's resources, including CPU, memory, IO and
network bandwidth.
Check out Hadoop's NLineInputFormat; it may be the right choice for your case.
You can put the metadata of all your files (data) into one text file, and send
that text file to your MR job.
Each mapper will get one line of text from that file, and start to process the
data represented by that one line.
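
For concreteness, a minimal driver sketch of that approach (class names, paths
and the mapper body are illustrative placeholders, not something from this
thread):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FileListDriver {

    // Each mapper receives one line of the metadata file, e.g. "machine-1:/foo/bar.txt".
    public static class FetchAndProcessMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // A real job would open the location named on this line and process it;
            // this placeholder just echoes the line back.
            ctx.write(line, new Text("processed"));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process-file-list");
        job.setJarByClass(FileListDriver.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);          // one line -> one mapper
        NLineInputFormat.addInputPath(job, new Path(args[0])); // the metadata text file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(FetchAndProcessMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}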
Is it a good solution for you? You have to judge that yourself. Keep in mind
the following:
1) Normally, the above approach is good for an MR job that loads data from a
third-party system, for CPU-intensive jobs.
2) You do utilize the cluster: if you have 100 mapper tasks and 100 files to be
processed, you get pretty good concurrency.
But:
1) Is your data split equally across the third-party system? In the above
example, for 100 files (or chunks of data), if one file is 10G and the rest are
only 100M, then one mapper will take MUCH longer than the rest. You will have a
long-tail problem, and it will hurt overall performance.
2) There is NO data locality advantage compared to HDFS. All the mappers need
to load the data from the third-party system remotely.
3) If each file (or chunk of data) is very large, what about failover? For
example, if you have 100 mapper task slots but only 20 files with 10G of data
each, then you under-utilize your cluster resources, as only 20 mappers will
handle them and the other 80 mapper task slots will just be idle. More
importantly, if one mapper fails, all the data it has already processed has to
be discarded; another mapper has to restart from the beginning for that chunk
of data. Your overall performance is hurt.
As you can see, you get a lot of benefits from HDFS, and here you lose all of
them. Sometimes you have no other choice but to load the data on the fly from
some 3rd-party system. But you need to think about the points above, and try to
get from the 3rd-party system as many of the benefits HDFS provides as you can.
Yong

Date: Fri, 10 Jan 2014 01:21:19 -0800
Subject: Reading multiple input files.
From: kchew...@gmail.com
To: user@hadoop.apache.org

How does a MR job read multiple input files from different locations?

What if the input files are not in hdfs and located on different servers? Do I 
have to copy them to hdfs first and instruct my MR job to read from them? Can I 
instruct my MR job to read directly from those servers?


Thanks.

Kim
  

[no subject]

2014-01-10 Thread Andrea Barbato
Hi, I have a simple question.
I have this example code:

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) { }
  // map function: receives a line, outputs (word,"1") to reducer.
  void map( HadoopPipes::MapContext& context ) { ... }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer(HadoopPipes::TaskContext& context) {}
  // reduce function
  void reduce( HadoopPipes::ReduceContext& context ) { ... }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}

Can I write some code lines (like the I/O operations) in the main function
body?
Thanks in advance.


Hadoop Pipes C++ Simple Question

2014-01-10 Thread Andrea Barbato
Hi, I have a simple question.
I have this example code:

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) { }
  // map function: receives a line, outputs (word,"1") to reducer.
  void map( HadoopPipes::MapContext& context ) { ... }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer(HadoopPipes::TaskContext& context) {}
  // reduce function
  void reduce( HadoopPipes::ReduceContext& context ) { ... }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}

Can I write some code lines (like the I/O operations) in the main function
body?
Thanks in advance.


Re: Switch and configure single node and cluster XML configuration

2014-01-10 Thread Mohammad Alkahtani
Have a look at this tutorial on installing a multi-node cluster:

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
In hdfs-site.xml you don't need to specify the address for the secondary
namenode; you put that address in the slaves file.

Regards,
Mohammad Alkahtani


> On 10 Jan 2014, at 04:36 pm, Michael Bremen  wrote:
> 
> HI,
> 
> I have a workstation and a laptop. I have installed 2.1.1 release on both 
> computers. Now I want to test single node and cluster setup. I am attaching 
> the conf/core-site.xml and conf/hdfs-site.xml. The network topology is as 
> follows
> 
> Workstation connected to router via ethernet.
> Laptop connected to router via wireless.
> SSH ( without pass phrase ) working between the two.
> 
> LAN broadcast ip : 192.168.0.0
> Workstation will act as the master in cluster setup with ip: 192.168.0.101
> Laptop ip: 192.168.0.103
> Operating systems : Ubuntu 13.04 distribution of Linux operating system.
> 
> I have commented the hdfs-site.xml properties for switching between single 
> node and cluster setup.
> 
> Problems :
> 1. In single node configuration ( Datanode and secondary Namenode 
> successfully working) Namenode failed to start on the workstation.
> 2. In cluster node configuration, the secondary name node is on laptop. The 
> master script ( start-dfs.sh ) tries to start the secondary name node on the 
> laptop with IP address 192.168.0.0 instead of 192.168.0.103 hence failed to 
> connect to port 22. ( using 192.168.0.0:22 instead of 192.168.0.103:22).
> 
> Am I configuring the hdfs-site.xml correctly?
> 
> Thanks.
> 
> 
> 


Re: Find max and min of a column in a csvfile

2014-01-10 Thread Jiayu Ji
If you are dealing with only one column, then I think the key/value pair could
be a null key (e.g. NullWritable) with the number (the element) as the value.
If you are doing more than one column, then use the column name as the key and
the numbers as the values.
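
A minimal sketch of that idea for a single numeric column (the column index,
delimiter and class names are illustrative; a combiner doing per-mapper
min/max pre-aggregation would keep the shuffle small, since everything funnels
through one key):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MinMaxOfColumn {

    // Mapper: emit the chosen CSV column under a single (null) key.
    public static class ColumnValueMapper
            extends Mapper<LongWritable, Text, NullWritable, DoubleWritable> {
        private static final int COLUMN = 2; // hypothetical column index

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length <= COLUMN) {
                return;
            }
            try {
                double v = Double.parseDouble(fields[COLUMN].trim());
                ctx.write(NullWritable.get(), new DoubleWritable(v));
            } catch (NumberFormatException e) {
                // skip header rows / malformed values
            }
        }
    }

    // Reducer: all values share the one key, so a single pass finds min and max.
    public static class MinMaxReducer
            extends Reducer<NullWritable, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<DoubleWritable> values,
                              Context ctx) throws IOException, InterruptedException {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : values) {
                min = Math.min(min, v.get());
                max = Math.max(max, v.get());
            }
            ctx.write(new Text("min"), new DoubleWritable(min));
            ctx.write(new Text("max"), new DoubleWritable(max));
        }
    }
}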


On Fri, Jan 10, 2014 at 12:36 AM, unmesha sreeveni wrote:

>
> Need help
> How do I find the maximum element and the minimum element of a column in a
> csv file? What will be the mapper output?
>
> --
> *Thanks & Regards*
>
> Unmesha Sreeveni U.B
> Junior Developer
>
> http://www.unmeshasreeveni.blogspot.in/
>
>
>


-- 
Jiayu (James) Ji,

Cell: (312)823-7393


Re: HDFS authentication

2014-01-10 Thread Jing Zhao
You can set up Kerberos for HDFS or even the whole Hadoop stack. The link
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.8.0/bk_installing_manually_book/content/rpm-chap14.html
contains detailed setup steps.

Thanks,
-Jing
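
(For orientation: enabling Kerberos starts with core-site.xml settings like the
ones below; the per-service principal and keytab configuration described in
guides such as the one linked above is still required. A minimal sketch only.)

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>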

On Thu, Jan 9, 2014 at 11:56 PM, Pinak Pani
 wrote:
> Does HDFS provide any built-in authentication out of the box? I wanted to
> make explicit access to HDFS from Java. I wanted people to access HDFS using
> "username:password@hdfs://client.skynet.org:9000/user/data" or something
> like that.
>
> I am new to Hadoop. We are planning to use Hadoop mainly for Archiving and
> probably processing at a later time. The idea is customers can setup their
> own HDFS cluster and provide us the HDFS URL to dump the data to.
>
> Is it possible to have access to HDFS in a similar way we access databases
> using credential?
>
> Thanks.



Re: HDFS authentication

2014-01-10 Thread Juan Carlos
As far as I know, the only authentication method available in HDFS 2.2.0 is
Kerberos, so it's not possible to authenticate with a URL.
Regards


2014/1/10 Pinak Pani 

> Does HDFS provide any built-in authentication out of the box? I wanted to
> make explicit access to HDFS from Java. I wanted people to access HDFS
> using "username:password@hdfs://client.skynet.org:9000/user/data" or
> something like that.
>
> I am new to Hadoop. We are planning to use Hadoop mainly for Archiving and
> probably processing at a later time. The idea is customers can setup their
> own HDFS cluster and provide us the HDFS URL to dump the data to.
>
> Is it possible to have access to HDFS in a similar way we access databases
> using credential?
>
> Thanks.
>


Re: HDFS authentication

2014-01-10 Thread Larry McCay
Hi Pinak -

If you want to use the REST interface of WebHDFS then you can set up Knox as
the Hadoop REST gateway and authenticate against LDAP or other stores through
the Apache Shiro integration. This opens up your authentication possibilities.

http://knox.incubator.apache.org/

It would then proxy your access to HDFS and the rest of Hadoop through the
gateway.

If you intend to only use the Hadoop command-line tooling then you are
limited to Kerberos for real authentication.

HTH.

--larry


On Fri, Jan 10, 2014 at 3:31 AM, Juan Carlos  wrote:

> As far as I know, the only authentication method available in hdfs 2.2.0
> is Kerberos, so it's not possible to authenticate with an URL.
> Regards
>
>
> 2014/1/10 Pinak Pani 
>
>> Does HDFS provide any built-in authentication out of the box? I wanted to
>> make explicit access to HDFS from Java. I wanted people to access HDFS
>> using "username:password@hdfs://client.skynet.org:9000/user/data" or
>> something like that.
>>
>> I am new to Hadoop. We are planning to use Hadoop mainly for Archiving
>> and probably processing at a later time. The idea is customers can setup
>> their own HDFS cluster and provide us the HDFS URL to dump the data to.
>>
>> Is it possible to have access to HDFS in a similar way we access
>> databases using credential?
>>
>> Thanks.
>>
>
>



Re: Reading multiple input files.

2014-01-10 Thread Kim Chew
Yong, this is very helpful! Thanks.

Still trying to wrap my head around all this :)

Let's stick to this hypothetical scenario that my data files are located on
different servers, for example,

machine-1:/foo/bar.txt
machine-2:/foo/bar.txt
machine-3:/foo/bar.txt
machine-4:/foo/bar.txt
machine-5:/foo/bar.txt
...

So how does Hadoop determine how many mappers it needs? Can I run my job
like this?

hadoop MyJob -input /foo -output output

Kim


On Fri, Jan 10, 2014 at 8:04 AM, java8964  wrote:

> Yes.
>
> Hadoop is very flexible about the underlying storage system. It is in your
> control how to utilize the cluster's resources, including CPU, memory, IO
> and network bandwidth.
>
> Check out Hadoop's NLineInputFormat; it may be the right choice for your
> case.
>
> You can put the metadata of all your files (data) into one text file, and
> send that text file to your MR job.
>
> Each mapper will get one line of text from that file, and start to process
> the data represented by that one line.
>
> Is it a good solution for you? You have to judge that yourself. Keep in
> mind the following:
>
> 1) Normally, the above approach is good for an MR job that loads data from
> a third-party system, for CPU-intensive jobs.
> 2) You do utilize the cluster: if you have 100 mapper tasks and 100
> files to be processed, you get pretty good concurrency.
>
> But:
>
> 1) Is your data split equally across the third-party system?
> In the above example, for 100 files (or chunks of data), if one file is
> 10G and the rest are only 100M, then one mapper will take MUCH longer than
> the rest. You will have a long-tail problem, and it will hurt overall
> performance.
> 2) There is NO data locality advantage compared to HDFS. All the mappers
> need to load the data from the third-party system remotely.
> 3) If each file (or chunk of data) is very large, what about failover? For
> example, if you have 100 mapper task slots but only 20 files with 10G of
> data each, then you under-utilize your cluster resources, as only 20 mappers
> will handle them and the other 80 mapper task slots will just be idle. More
> importantly, if one mapper fails, all the data it has already processed has
> to be discarded; another mapper has to restart from the beginning for that
> chunk of data. Your overall performance is hurt.
>
> As you can see, you get a lot of benefits from HDFS, and here you lose all
> of them. Sometimes you have no other choice but to load the data on the fly
> from some 3rd-party system. But you need to think about the points above,
> and try to get from the 3rd-party system as many of the benefits HDFS
> provides as you can.
>
> Yong
>
> --
> Date: Fri, 10 Jan 2014 01:21:19 -0800
> Subject: Reading multiple input files.
> From: kchew...@gmail.com
> To: user@hadoop.apache.org
>
>
> How does a MR job read multiple input files from different locations?
>
> What if the input files are not in hdfs and located on different servers?
> Do I have to copy them to hdfs first and instruct my MR job to read from
> them? Can I instruct my MR job to read directly from those servers?
>
> Thanks.
>
> Kim
>


Re:

2014-01-10 Thread Zesheng Wu
Of course you can; you can think of this as an independent runnable program.


2014/1/11 Andrea Barbato 

> Hi, I have a simple question.
> I have this example code:
>
> class WordCountMapper : public HadoopPipes::Mapper {
> public:
>   // constructor: does nothing
>   WordCountMapper( HadoopPipes::TaskContext& context ) { }
>   // map function: receives a line, outputs (word,"1") to reducer.
>   void map( HadoopPipes::MapContext& context ) { ... }
> };
>
> class WordCountReducer : public HadoopPipes::Reducer {
> public:
>   // constructor: does nothing
>   WordCountReducer(HadoopPipes::TaskContext& context) {}
>   // reduce function
>   void reduce( HadoopPipes::ReduceContext& context ) { ... }
> };
>
> int main(int argc, char *argv[]) {
>   return HadoopPipes::runTask(
>       HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
> }
>
> Can I write some code lines (like the I/O operations) in the main function
> body?
> Thanks in advance.
>



-- 
Best Wishes!

Yours, Zesheng


RE: Reading multiple input files.

2014-01-10 Thread java8964
How many mappers are generated depends on the InputFormat class you choose
for your MR job.
The default one, used if you didn't specify any in your job, is
TextInputFormat, which will generate one split per block, assuming your file is
splittable.
Which task node will run a given mapper depends on a lot of conditions. You can
search online about it, but for now just assume it is unpredictable. That means
all of your task nodes need to be able to access your data.
If you are using HDFS, then you have no problem, as HDFS is available to
all the task nodes. If the data is on a local disk, then it has to be
available on all the task nodes.
Yong
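
To make that concrete, a driver can list each location explicitly with full
URIs (a minimal sketch; the mount points and paths are illustrative, and every
task node must be able to resolve them, as noted above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "MyJob");
        job.setJarByClass(MyJobDriver.class);

        // Each input path gets its own full URI; with the default TextInputFormat
        // this yields at least one split (and hence one mapper) per file.
        FileInputFormat.addInputPath(job, new Path("file:///mnt/machine-1/foo/bar.txt"));
        FileInputFormat.addInputPath(job, new Path("file:///mnt/machine-2/foo/bar.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///user/kim/output"));

        // Mapper/reducer classes omitted -- set them as in any other MR job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}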

Date: Fri, 10 Jan 2014 14:18:53 -0800
Subject: Re: Reading multiple input files.
From: kchew...@gmail.com
To: user@hadoop.apache.org

Yong, this is very helpful! Thanks.

Still trying to wrap my head around all this :)

Let's stick to this hypothetical scenario that my data files are located on
different servers, for example,


machine-1:/foo/bar.txt
machine-2:/foo/bar.txt
machine-3:/foo/bar.txt
machine-4:/foo/bar.txt
machine-5:/foo/bar.txt
...


So how does Hadoop determine how many mappers it needs? Can I run my job
like this?

hadoop MyJob -input /foo -output output

Kim



On Fri, Jan 10, 2014 at 8:04 AM, java8964  wrote:




Yes.
Hadoop is very flexible about the underlying storage system. It is in your
control how to utilize the cluster's resources, including CPU, memory, IO and
network bandwidth.

Check out Hadoop's NLineInputFormat; it may be the right choice for your case.
You can put the metadata of all your files (data) into one text file, and send
that text file to your MR job.

Each mapper will get one line of text from that file, and start to process the
data represented by that one line.
Is it a good solution for you? You have to judge that yourself. Keep in mind
the following:

1) Normally, the above approach is good for an MR job that loads data from a
third-party system, for CPU-intensive jobs.
2) You do utilize the cluster: if you have 100 mapper tasks and 100 files to be
processed, you get pretty good concurrency.

But:
1) Is your data split equally across the third-party system? In the above
example, for 100 files (or chunks of data), if one file is 10G and the rest are
only 100M, then one mapper will take MUCH longer than the rest. You will have a
long-tail problem, and it will hurt overall performance.
2) There is NO data locality advantage compared to HDFS. All the mappers need
to load the data from the third-party system remotely.
3) If each file (or chunk of data) is very large, what about failover? For
example, if you have 100 mapper task slots but only 20 files with 10G of data
each, then you under-utilize your cluster resources, as only 20 mappers will
handle them and the other 80 mapper task slots will just be idle. More
importantly, if one mapper fails, all the data it has already processed has to
be discarded; another mapper has to restart from the beginning for that chunk
of data. Your overall performance is hurt.

As you can see, you get a lot of benefits from HDFS, and here you lose all of
them. Sometimes you have no other choice but to load the data on the fly from
some 3rd-party system. But you need to think about the points above, and try to
get from the 3rd-party system as many of the benefits HDFS provides as you can.

Yong

Date: Fri, 10 Jan 2014 01:21:19 -0800
Subject: Reading multiple input files.
From: kchew...@gmail.com
To: user@hadoop.apache.org


How does a MR job read multiple input files from different locations?

What if the input files are not in hdfs and located on different servers? Do I 
have to copy them to hdfs first and instruct my MR job to read from them? Can I 
instruct my MR job to read directly from those servers?



Thanks.

Kim