Open already existing sequenceFile

2015-05-27 Thread rab ra
Hello

Is it possible to open an already existing sequence file and append to it?
I could not find any pointers or tutorials anywhere.
Can someone help me out here?


with thanks and regards
Bala
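
For reference, a minimal sketch of one way to do this, assuming a Hadoop
release that ships the Writer.appendIfExists option (added by HADOOP-7139 and
not present in every 2.x version); the path and key/value classes below are
placeholders, not anything from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class AppendToSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path("/data/existing.seq");   // hypothetical path

    // If the file already exists, the writer reopens it and appends new
    // records after the last one instead of overwriting it.
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(file),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.appendIfExists(true));

    writer.append(new Text("key-n"), new Text("value-n"));
    writer.close();
  }
}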


Cannot initialize cluster issue - Why is the jobclient-tests jar needed?

2015-05-12 Thread rab ra
Hello

In one of my use cases, I am running a Hadoop job using the following command:

java -cp /etc/hadoop/conf .class

This command gave the following error:

"cannot initialize cluster. please check the configuration for
mapreduce.framework.name and the correspond server address"

I understand that I need to specify the Hadoop jars in the classpath.
Running the command 'hadoop classpath' and putting the output into the
classpath did work. However, I wanted to narrow down the exact jars that
are needed, so I started putting each jar in the classpath one by one and
finally figured out the jar files that are actually needed:

hadoop-yarn-client-2.6.0.jar
hadoop-mapreduce-client-common-2.6.0.jar
hadoop-mapreduce-client-core-2.6.0.jar
hadoop-mapreduce-client-jobclient-2.6.0.jar
hadoop-mapreduce-client-jobclient-2.6.0-tests.jar


However, I expected jobclient-2.6.0.jar alone to be enough, but that is not
the case; I also needed to include jobclient-tests.jar as well. Can someone
throw light on this? Why would the tests jar be needed?
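
I don't know the exact dependency chain in 2.6.0, but a small diagnostic like
the one below can show which jar a given class is actually loaded from, which
helps explain why a particular jar has to stay on the classpath. The class
name is just an example; substitute any class your job needs at runtime.

public class WhichJar {
  public static void main(String[] args) throws Exception {
    // Example class; substitute the one you care about.
    Class<?> c = Class.forName("org.apache.hadoop.mapred.JobClient");
    // Prints the jar (or directory) the class was loaded from.
    System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
  }
}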


regards
Balachandar


Re: simple hadoop MR program to be executed using java

2015-01-17 Thread rab ra
On Sat, Jan 17, 2015 at 12:33 AM, Chris Nauroth 
wrote:

> Hello Rab,
>
> There is actually quite a lot of logic in the "hadoop jar" shell scripts
> to set up the classpath (including Hadoop configuration file locations) and
> set up extra arguments (like heap sizes and log file locations).  It is
> possible to replicate it with a straight java call, but it might not be
> worth the effort, and end users of your jar would lose functionality
> implemented in the shell scripts, such as configuration file location
> overrides.
>
> If you still want to pursue this, then you might want to make a small
> change to the "hadoop jar" script and add a line right before the java call
> to echo the command it's running.  That will give you a sense for the java
> command that ultimately gets run.  You could also take a look at the
> process table for a running "hadoop jar" process and inspect its command
> line and environment variables.
>
> Another potentially helpful tool  is the "hadoop classpath" command:
>
>
> http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/CommandsManual.html#classpath
>
> This uses the full logic of the shell scripts for classpath construction,
> but then just echoes it instead of using it to run a jar.
>
> Chris Nauroth
> Hortonworks
> http://hortonworks.com/
>
>
> Hello,

Thanks for your response. I had a feeling that if a web application needs
to process a request from a client and subsequently spawn MR jobs, it
would not spawn a command-line process using the 'hadoop' command; instead
there would be a way to instantiate a Hadoop driver class that contains the
Mapper and Reducer. In this setup, I expected there would be a place where
all the Hadoop-related configuration and jars would be kept so that they are
available to the Hadoop job. Hence I asked this question. I thought it was
straightforward and that many people would have attempted it, so finding
help in the form of documentation or blog posts would not be a problem. I
spent two days on this but still could not find a way to do it.
regards
rab



> On Fri, Jan 16, 2015 at 10:15 AM, rab ra  wrote:
>
>> Hello,
>>
>> I have a simple java program that sets up a MR job. I could successfully
>> execute this in Hadoop infrastructure (hadoop 2x) using 'hadoop jar
>> '. But I want to achieve the same thing using java command as below.
>>
>> java 
>>
>> 1. How can I pass hadoop configuration to this className?
>> 2. What extra arguments do I need to supply?
>> 3. Any link/documentation would be highly appreciated.
>>
>>
>> regards
>> rab
>>
>>
>>
>


simple hadoop MR program to be executed using java

2015-01-16 Thread rab ra
Hello,

I have a simple Java program that sets up an MR job. I could successfully
execute this on Hadoop infrastructure (Hadoop 2.x) using 'hadoop jar
'. But I want to achieve the same thing using the java command, as below.

java 

1. How can I pass hadoop configuration to this className?
2. What extra arguments do I need to supply?
3. Any link/documentation would be highly appreciated.
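
For what it's worth, a minimal sketch of one way to do this. It assumes the
client-side site files live under /etc/hadoop/conf and that the java process
is started with the output of 'hadoop classpath' on its classpath; MyDriver
and the job settings are placeholders, not anything from this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Load the same site files that the 'hadoop jar' wrapper would put on the classpath.
    conf.addResource(new Path("file:///etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("file:///etc/hadoop/conf/hdfs-site.xml"));
    conf.addResource(new Path("file:///etc/hadoop/conf/mapred-site.xml"));
    conf.addResource(new Path("file:///etc/hadoop/conf/yarn-site.xml"));

    Job job = Job.getInstance(conf, "submitted-from-plain-java");
    job.setJarByClass(MyDriver.class);
    // ... set mapper, reducer, input and output paths as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}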


regards
rab


Launching Hadoop map reduce job from a servlet

2015-01-16 Thread rab ra
Hello,

I have a servlet program deployed in a Jetty server listening on port
8080. As soon as a request arrives from a client, it parses the request
and instantiates the MR program that is to be launched on the Hadoop cluster.
Here, I cannot launch the Hadoop job using the hadoop command as 'hadoop jar
 .'. From the servlet code, I instantiate the MR main program that
implements Tool and contains the Mapper and Reducer classes in it.

My issue is that though the job is launched, it always uses the
LocalJobRunner. I do have Hadoop installed, with all the configuration files
containing the right information. For instance, in my mapred-site.xml, I
have set 'yarn' as my mapreduce framework.

With the current configuration setup, I was able to submit jobs to YARN
through the hadoop command. But I want to achieve this through the 'java' command.

1. How can I do it? If there is any pointer/link, please share it.
2. I tried to set up all the configuration inside the code, something like
below:

conf.set("mapreduce.framework.name","yarn");



But somehow, it seems that this information is not cascading to the job
despite creating the job instance with the above configuration. So, I am
struggling to make the Hadoop configuration available to the Java application.

I would be grateful to you for any help to fix this issue
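
A minimal sketch of the direction described above: load the cluster's site
XMLs into the Configuration or set the relevant keys explicitly before
creating the Job, then submit without blocking the servlet thread. The host
names and ports are placeholders I made up, not values from this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobLauncher {
  // Called from the servlet's request handler (sketch).
  public static void launch() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");        // placeholder host
    conf.set("mapreduce.framework.name", "yarn");                 // avoid falling back to the LocalJobRunner
    conf.set("yarn.resourcemanager.hostname", "rm-host");         // placeholder host
    conf.set("yarn.resourcemanager.address", "rm-host:8032");     // placeholder host:port

    Job job = Job.getInstance(conf, "job-from-servlet");
    // job.setJarByClass(...), set mapper/reducer/input/output as usual ...
    job.submit();   // non-blocking, so the servlet thread is not held while the job runs
  }
}

If the cluster's site XML files are on the webapp classpath (for example under
WEB-INF/classes), they are generally picked up as default resources, which can
make the explicit set calls unnecessary.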


regards
rab


Appending to hadoop sequence file

2014-09-13 Thread rab ra
Hello,

Is there a way to append to a sequence file? I need to open a new seq file,
write something to it, close it and save it. Later, I want to open it again,
add some more information to that seq file and close it. Is it possible? I am
using Hadoop 2.x.

The same question applies to MapFile too.


regards
rab


FloatWritable and hadoop streaming

2014-09-12 Thread rab ra
Hello



In my use case, I need to build a single big sequence file. The key-value
pairs are generated by map processes and a single reducer is used to
generate the sequence file. My value is a FloatWritable (a list of float
values). I use Hadoop streaming 2.4. I have a mapper that prints key-value
pairs like this:



println "${key}@${value}"



In my reducer, I need to collect them and append them to a sequence file.
However, in the reducer the input is a string, and I do not know how to
convert 'value' into a FloatWritable.



any clue here?
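
Not from the thread, but a sketch of how the reducer side could turn the
streamed text back into Writables, assuming the mapper emits lines of the
form key@v1,v2,... (the '@' and ',' delimiters are my assumptions) and that
the list of floats is stored as an ArrayWritable of FloatWritable:

import java.io.IOException;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileBuilder {
  // Parses one "key@v1,v2,..." line and appends it to the given sequence file writer.
  static void appendLine(SequenceFile.Writer writer, String line) throws IOException {
    String[] kv = line.split("@", 2);            // assumed delimiter between key and value
    String[] tokens = kv[1].split(",");          // assumed delimiter between the floats
    FloatWritable[] floats = new FloatWritable[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      floats[i] = new FloatWritable(Float.parseFloat(tokens[i].trim()));
    }
    writer.append(new Text(kv[0]), new ArrayWritable(FloatWritable.class, floats));
  }
}

One caveat: to read such a file back, an ArrayWritable subclass with a
no-argument constructor (passing FloatWritable.class to super) is normally
needed, since plain ArrayWritable cannot be instantiated by the deserializer.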



thanks
Rab


Sequential files sizes

2014-09-02 Thread rab ra
Hello,

In one of my use cases, I generate a large number of sequence files. In each
of these files, I store a bunch of key/value pairs. The key is a string, and
the value is a list of FLOAT values. I know the number of float values that I
am storing, and based on that I estimate the size of each file to be around
700KB (approximately). However, when I check the size in HDFS, it shows much
less, around 20KB. I am not using any compression while writing the sequence
files. Any clue here?


regards
rab


Re: Hadoop InputFormat - Processing large number of small files

2014-09-01 Thread rab ra
Hi,

I tried to use your CombineFileInputFormat implementation. However, I get
the following exception:

'not org.apache.hadoop.mapred.InputFormat'

I am using Hadoop 2.4.1 and it looks like it expects the older interface, as
it does not accept 'org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat'.
May I know what version of Hadoop you used?

Looks like I need to use the older one,
'org.apache.hadoop.mapred.lib.CombineFileInputFormat'?

Thanks and Regards

rab
On 20 Aug 2014 22:59, "Felix Chern"  wrote:

> I wrote a post on how to use CombineInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
> In the RecordReader constructor, you can get the context of which file you
> are reading in.
> In my example, I created FileLineWritable to include the filename in the
> mapper input key.
> Then you can use the input key as:
>
>  public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>    private Text txt = new Text();
>    private IntWritable count = new IntWritable(1);
>    public void map(FileLineWritable key, Text val, Context context) throws IOException, InterruptedException {
>      StringTokenizer st = new StringTokenizer(val.toString());
>      while (st.hasMoreTokens()) {
>        txt.set(key.fileName + st.nextToken());
>        context.write(txt, count);
>      }
>    }
>  }
>
>
> Cheers,
> Felix
>
>
> On Aug 20, 2014, at 8:19 AM, rab ra  wrote:
>
> Thanks for the response.
>
> Yes, I know wholeFileInputFormat. But i am not sure filename comes to map
> process either as key or value. But, I think this file format reads the
> contents of the file. I wish to have a inputformat that just gives filename
> or list of filenames.
>
> Also, files are very small. The wholeFileInputFormat spans one map process
> per file and thus results huge number of map processes. I wish to span a
> single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's recordreader() so that it
> does not read the entire file but just filename.
>
>
> regards
> rab
>
> regards
> Bala
>
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus 
> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if search for them...
>>
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>>
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra  wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein i need to process huge set of files stored in
>>> HDFS. Those files are non-splittable and they need to be processed as a
>>> whole. Here, I have the following question for which I need answers to
>>> proceed further in this.
>>>
>>> 1.  I wish to schedule the map process in task tracker where data is
>>> already available. How can I do it? Currently, I have a file that contains
>>> list of filenames. Each map get one line of it via NLineInputFormat. The
>>> map process then accesses the file via FSDataInputStream and work with it.
>>> Is there a way to ensure this map process is running on the node where the
>>> file is available?.
>>>
>>> 2.  Since the files are not large and it can be called as 'small' files
>>> by hadoop standard. Now, I came across CombineFileInputFormat that can
>>> process more than one file in a single map process.  What I need here is a
>>> format that can process more than one files in a single map but does not
>>> have to read the files, and either in key or value, it has the filenames.
>>> In map process then, I can run a loop to process these files. Any help?
>>>
>>> 3. Any othe alternatives?
>>>
>>>
>>>
>>> regards
>>>  rab
>>>
>>>
>>
>
>


Re: toolrunner issue

2014-09-01 Thread rab ra
Yes, it's my bad, you are right.
Thanks
On 1 Sep 2014 17:11, "unmesha sreeveni"  wrote:

> public class MyClass extends Configured implements Tool{
> public static void main(String[] args) throws Exception {
> Configuration conf = new Configuration();
>  int res = ToolRunner.run(conf, new MyClass(), args);
>  System.exit(res);
>  }
>
>  @Override
> public int run(String[] args) throws Exception {
>  // TODO Auto-generated method stub
> Job job = new Job(conf, "");
> job.setJarByClass(MyClass.class);
>  job.setMapOutputKeyClass(IntWritable.class);
> job.setMapOutputValueClass(TwovalueWritable.class);
>  job.setOutputKeyClass(IntWritable.class);
> job.setOutputValueClass(TwovalueWritable.class);
>  job.setMapperClass(Mapper.class);
> job.setReducerClass(Reducer.class);
>  job.setInputFormatClass(TextInputFormat.class);
> job.setOutputFormatClass(TextOutputFormat.class);
>.
> }
>
> I am able to work without any errors. Please make sure that you are doing
> the same code above.
>
>
> On Mon, Sep 1, 2014 at 4:18 PM, rab ra  wrote:
>
>> Hello
>>
>> I m having an issue in running one simple map reduce job.
>> The portion of the code is below. It gives a warning that Hadoop command
>> line parsing was not peformed.
>> This occurs despite the class implements Tool interface. Any clue?
>>
>> public static void main(String[] args) throws Exception {
>>
>> try{
>>
>> int exitcode = ToolRunner.run(new Configuration(), new
>> MyClass(), args);
>>
>> System.exit(exitcode);
>>
>> }
>>
>> catch(Exception e)
>>
>> {
>>
>> e.printStackTrace();
>>
>> }
>>
>> }
>>
>>
>>
>> @Override
>>
>> public int run(String[] args) throws Exception {
>>
>> JobConf conf = new JobConf(MyClass.class);
>>
>> System.out.println(args);
>>
>> FileInputFormat.addInputPath(conf, new Path("/smallInput"));
>>
>> conf.setInputFormat(CFNInputFormat.class);
>>
>> conf.setMapperClass(MyMapper.class);
>>
>> conf.setMapOutputKeyClass(Text.class);
>>
>> conf.setMapOutputValueClass(Text.class);
>>
>> FileOutputFormat.setOutputPath(conf, new Path("/TEST"));
>>
>> JobClient.runJob(conf);
>>
>> return 0;
>>
>
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>


toolrunner issue

2014-09-01 Thread rab ra
Hello

I am having an issue running one simple MapReduce job.
The portion of the code is below. It gives a warning that Hadoop command-line
parsing was not performed.
This occurs despite the class implementing the Tool interface. Any clue?

public static void main(String[] args) throws Exception {
    try {
        int exitcode = ToolRunner.run(new Configuration(), new MyClass(), args);
        System.exit(exitcode);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

@Override
public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(MyClass.class);
    System.out.println(args);
    FileInputFormat.addInputPath(conf, new Path("/smallInput"));
    conf.setInputFormat(CFNInputFormat.class);
    conf.setMapperClass(MyMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path("/TEST"));
    JobClient.runJob(conf);
    return 0;
}
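
Not from the thread, but my reading of the likely cause: the warning usually
goes away when run() builds its JobConf from the Configuration that ToolRunner
has already parsed (available through Configured.getConf()) instead of
creating a fresh one. A sketch of only the changed part:

@Override
public int run(String[] args) throws Exception {
    // Reuse the configuration ToolRunner populated (including -D and other
    // generic options) rather than starting from an empty JobConf.
    JobConf conf = new JobConf(getConf(), MyClass.class);
    // ... same job setup as above ...
    JobClient.runJob(conf);
    return 0;
}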


RE: Appending to HDFS file

2014-08-28 Thread rab ra
Thank you all,

It works now

Regards
rab
On 28 Aug 2014 12:06, "Liu, Yi A"  wrote:

>  Right, please use FileSystem#append
>
>
>
> *From:* Stanley Shi [mailto:s...@pivotal.io]
> *Sent:* Thursday, August 28, 2014 2:18 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Appending to HDFS file
>
>
>
> You should not use this method:
>
> FSDataOutputStream fp = fs.create(pt, true)
>
>
>
> Here's the java doc for this "create" method:
>
>
>
>   /**
>    * Create an FSDataOutputStream at the indicated Path.
>    * @param f the file to create
>    * @param overwrite if a file with this name already exists, then if true,
>    *   the file will be overwritten, and if false an exception will be thrown.
>    */
>   public FSDataOutputStream create(Path f, boolean overwrite)
>       throws IOException {
>     return create(f, overwrite,
>                   getConf().getInt("io.file.buffer.size", 4096),
>                   getDefaultReplication(f),
>                   getDefaultBlockSize(f));
>   }
>
>
>
> On Wed, Aug 27, 2014 at 2:12 PM, rab ra  wrote:
>
>
> hello
>
> Here is d code snippet, I use to append
>
> def outFile = "${outputFile}.txt"
>
> Path pt = new Path("${hdfsName}/${dir}/${outFile}")
>
> def fs = org.apache.hadoop.fs.FileSystem.get(configuration);
>
> FSDataOutputStream fp = fs.create(pt, true)
>
> fp << "${key} ${value}\n"
>
> On 27 Aug 2014 09:46, "Stanley Shi"  wrote:
>
> would you please past the code in the loop?
>
>
>
> On Sat, Aug 23, 2014 at 2:47 PM, rab ra  wrote:
>
> Hi
>
> By default, it is true in hadoop 2.4.1. Nevertheless, I have set it to
> true explicitly in hdfs-site.xml. Still, I am not able to achieve append.
>
> Regards
>
> On 23 Aug 2014 11:20, "Jagat Singh"  wrote:
>
> What is value of dfs.support.append in hdfs-site.cml
>
>
>
>
> https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>
>
>
>
>
>
>
> On Sat, Aug 23, 2014 at 1:41 AM, rab ra  wrote:
>
> Hello,
>
>
>
> I am currently using Hadoop 2.4.1.I am running a MR job using hadoop
> streaming utility.
>
>
>
> The executable needs to write large amount of information in a file.
> However, this write is not done in single attempt. The file needs to be
> appended with streams of information generated.
>
>
>
> In the code, inside a loop, I open a file in hdfs, appends some
> information. This is not working and I see only the last write.
>
>
>
> How do I accomplish append operation in hadoop? Can anyone share a pointer
> to me?
>
>
>
>
>
>
>
>
>
> regards
>
> Bala
>
>
>
>
>
>
>
> --
>
> Regards,
>
> *Stanley Shi,*
>
>
>
>
>
> --
>
> Regards,
>
> *Stanley Shi,*
>
>


Re: Appending to HDFS file

2014-08-26 Thread rab ra
Hello,

Here is the code snippet I use to append:

def outFile = "${outputFile}.txt"
Path pt = new Path("${hdfsName}/${dir}/${outFile}")
def fs = org.apache.hadoop.fs.FileSystem.get(configuration);
FSDataOutputStream fp = fs.create(pt, true)
fp << "${key} ${value}\n"
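
As pointed out in the RE: thread above, fs.create(pt, true) overwrites the
existing file on every call, which is why only the last write survives. A
minimal Java sketch of an append-based alternative, assuming append support
is enabled on the cluster; the helper shape is hypothetical:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppend {
  // Appends one "key value" line to the file, creating it on first use.
  static void appendLine(Configuration conf, Path pt, String key, String value) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.exists(pt) ? fs.append(pt) : fs.create(pt, false);
    out.write((key + " " + value + "\n").getBytes(StandardCharsets.UTF_8));
    out.close();
  }
}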
On 27 Aug 2014 09:46, "Stanley Shi"  wrote:

> would you please past the code in the loop?
>
>
> On Sat, Aug 23, 2014 at 2:47 PM, rab ra  wrote:
>
>> Hi
>>
>> By default, it is true in hadoop 2.4.1. Nevertheless, I have set it to
>> true explicitly in hdfs-site.xml. Still, I am not able to achieve append.
>>
>> Regards
>> On 23 Aug 2014 11:20, "Jagat Singh"  wrote:
>>
>>> What is value of dfs.support.append in hdfs-site.cml
>>>
>>>
>>> https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>>>
>>>
>>>
>>>
>>> On Sat, Aug 23, 2014 at 1:41 AM, rab ra  wrote:
>>>
>>>> Hello,
>>>>
>>>> I am currently using Hadoop 2.4.1.I am running a MR job using hadoop
>>>> streaming utility.
>>>>
>>>> The executable needs to write large amount of information in a file.
>>>> However, this write is not done in single attempt. The file needs to be
>>>> appended with streams of information generated.
>>>>
>>>> In the code, inside a loop, I open a file in hdfs, appends some
>>>> information. This is not working and I see only the last write.
>>>>
>>>> How do I accomplish append operation in hadoop? Can anyone share a
>>>> pointer to me?
>>>>
>>>>
>>>>
>>>>
>>>> regards
>>>> Bala
>>>>
>>>
>>>
>
>
> --
> Regards,
> *Stanley Shi,*
>
>


Re: Hadoop InputFormat - Processing large number of small files

2014-08-26 Thread rab ra
Hi,

Is it not a good idea to model the key as Text type?

I have a large number of sequence files, each with a bunch of key-value
pairs. I will read these seq files inside the map, hence my map needs only
the filenames. I believe that with CombineFileInputFormat the map will run on
nodes where the data is already available, and hence my explicit HDFS read
will be faster.
I do not want the contents in the map, as not all key-value pairs are needed.

Regards
Rab
On 20 Aug 2014 22:59, "Felix Chern"  wrote:

> I wrote a post on how to use CombineInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
> In the RecordReader constructor, you can get the context of which file you
> are reading in.
> In my example, I created FileLineWritable to include the filename in the
> mapper input key.
> Then you can use the input key as:
>
>  public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>    private Text txt = new Text();
>    private IntWritable count = new IntWritable(1);
>    public void map(FileLineWritable key, Text val, Context context) throws IOException, InterruptedException {
>      StringTokenizer st = new StringTokenizer(val.toString());
>      while (st.hasMoreTokens()) {
>        txt.set(key.fileName + st.nextToken());
>        context.write(txt, count);
>      }
>    }
>  }
>
>
> Cheers,
> Felix
>
>
> On Aug 20, 2014, at 8:19 AM, rab ra  wrote:
>
> Thanks for the response.
>
> Yes, I know wholeFileInputFormat. But i am not sure filename comes to map
> process either as key or value. But, I think this file format reads the
> contents of the file. I wish to have a inputformat that just gives filename
> or list of filenames.
>
> Also, files are very small. The wholeFileInputFormat spans one map process
> per file and thus results huge number of map processes. I wish to span a
> single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's recordreader() so that it
> does not read the entire file but just filename.
>
>
> regards
> rab
>
> regards
> Bala
>
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus 
> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if search for them...
>>
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>>
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra  wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein i need to process huge set of files stored in
>>> HDFS. Those files are non-splittable and they need to be processed as a
>>> whole. Here, I have the following question for which I need answers to
>>> proceed further in this.
>>>
>>> 1.  I wish to schedule the map process in task tracker where data is
>>> already available. How can I do it? Currently, I have a file that contains
>>> list of filenames. Each map get one line of it via NLineInputFormat. The
>>> map process then accesses the file via FSDataInputStream and work with it.
>>> Is there a way to ensure this map process is running on the node where the
>>> file is available?.
>>>
>>> 2.  Since the files are not large and it can be called as 'small' files
>>> by hadoop standard. Now, I came across CombineFileInputFormat that can
>>> process more than one file in a single map process.  What I need here is a
>>> format that can process more than one files in a single map but does not
>>> have to read the files, and either in key or value, it has the filenames.
>>> In map process then, I can run a loop to process these files. Any help?
>>>
>>> 3. Any othe alternatives?
>>>
>>>
>>>
>>> regards
>>>  rab
>>>
>>>
>>
>
>


Sequence files and merging

2014-08-23 Thread rab ra
Hello,

I need a few clarifications on the following questions related to
sequence files.

1. I have a bunch of sequence files. Each file has 8 keys and corresponding
values. The values are float array bytes, and the key is a name, which is a
string. Storing and processing these smaller files is not efficient, as
there can be millions of such files. Hence, I am thinking of creating one
sequence file out of this large number of files. Is it possible? I read in
the literature that there are ways to merge sequence files. My question is:
if I merge a large number of sequence files, how can I retrieve an individual
small sequence file in my map processes?

2. When I merge, does it become a different sequence file altogether with the
keys merged? If so, my keys will be the same across all the files. How will
that be handled? Will there be any problem here?

3. Is it possible to append keys and values to an existing sequence file?
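
On questions 1 and 2, a sketch of a straightforward merge (my assumptions:
all inputs share the same key and value classes, shown here as Text and
BytesWritable placeholders, and the paths are made up). A SequenceFile is
just an ordered list of records, so duplicate keys across the inputs are
simply kept as separate records; if you need to know which original file a
record came from, that has to be encoded in the key or value yourself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MergeSeqFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path("/seq/in");        // hypothetical input directory
    Path merged = new Path("/seq/merged.seq");  // hypothetical output file

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(merged),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));

    Text key = new Text();
    BytesWritable value = new BytesWritable();
    for (FileStatus st : fs.listStatus(inputDir)) {
      SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(st.getPath()));
      while (reader.next(key, value)) {
        writer.append(key, value);   // records are copied as-is; duplicate keys are allowed
      }
      reader.close();
    }
    writer.close();
  }
}

On question 3, appending to an existing file is what the Writer.appendIfExists
option is for, where the Hadoop version supports it (see the append thread
earlier in this archive).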



regards
rab


Re: Hadoop YARN Cluster Setup Questions

2014-08-23 Thread rab ra
Hi,

1. Typically, we copy the slaves file to all the participating nodes,
though I do not have a concrete theory to back this up. At least, this is
what I was doing in Hadoop 1.2, and I am doing the same in Hadoop 2.x.

2. I think you should investigate the YARN GUI and see how many maps it
has spawned. There is a high possibility that both maps are running on the
same node in parallel. Since there are two splits, there will be two map
processes, and one node is capable of handling more than one map.

3. The input file may have no extra replicas stored and may be small, and
hence be stored as a single block on one node.

These are a few hints which might help you.

regards
rab



On Sat, Aug 23, 2014 at 12:26 PM, S.L  wrote:

> Hi Folks,
>
> I was not able to find  a clear answer to this , I know that on the master
> node we need to have a slaves file listing all the slaves , but do we need
> to have the slave nodes have a master file listing the single name node( I
> am not using a secondary name node). I only have the slaves file on the
> master node.
>
> I was not able to find a clear answer to this ,the reason I ask this is
> because when I submit a hadoop job , even though the input is being split
> into 2 parts , only one data node is assigned applications , the other two
> ( I have three) are no tbeing assigned any applications.
>
> Thanks in advance!
>


Re: Appending to HDFS file

2014-08-22 Thread rab ra
Hi

By default, it is true in hadoop 2.4.1. Nevertheless, I have set it to true
explicitly in hdfs-site.xml. Still, I am not able to achieve append.

Regards
On 23 Aug 2014 11:20, "Jagat Singh"  wrote:

> What is value of dfs.support.append in hdfs-site.cml
>
>
> https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>
>
>
>
> On Sat, Aug 23, 2014 at 1:41 AM, rab ra  wrote:
>
>> Hello,
>>
>> I am currently using Hadoop 2.4.1.I am running a MR job using hadoop
>> streaming utility.
>>
>> The executable needs to write large amount of information in a file.
>> However, this write is not done in single attempt. The file needs to be
>> appended with streams of information generated.
>>
>> In the code, inside a loop, I open a file in hdfs, appends some
>> information. This is not working and I see only the last write.
>>
>> How do I accomplish append operation in hadoop? Can anyone share a
>> pointer to me?
>>
>>
>>
>>
>> regards
>> Bala
>>
>
>


Appending to HDFS file

2014-08-22 Thread rab ra
Hello,

I am currently using Hadoop 2.4.1. I am running an MR job using the hadoop
streaming utility.

The executable needs to write a large amount of information to a file.
However, this write is not done in a single attempt. The file needs to be
appended with the streams of information as they are generated.

In the code, inside a loop, I open a file in HDFS and append some
information. This is not working, and I see only the last write.

How do I accomplish the append operation in Hadoop? Can anyone share a
pointer with me?




regards
Bala


Re: Hadoop InputFormat - Processing large number of small files

2014-08-21 Thread rab ra
Hello,

Does this mean that a file with the names of all the files that need to be
processed is fed to Hadoop with NLineInputFormat?

If this is the case, then how can we ensure that the map processes are
scheduled on the nodes where the blocks containing the files are already stored?

regards
rab


On Thu, Aug 21, 2014 at 9:07 PM, Felix Chern  wrote:

> If I were you, I’ll first generate a file with those file name:
>
> hadoop fs -ls > term_file
>
> Then run the normal map reduce job
>
> Felix
>
> On Aug 21, 2014, at 1:38 AM, rab ra  wrote:
>
> Thanks for the link. If it is not required for CFinputformat to have
> contents of the files in the map process but only the filename, what
> changes need to be done in the code?
>
> rab.
> On 20 Aug 2014 22:59, "Felix Chern"  wrote:
>
>> I wrote a post on how to use CombineInputFormat:
>>
>> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
>> In the RecordReader constructor, you can get the context of which file
>> you are reading in.
>> In my example, I created FileLineWritable to include the filename in the
>> mapper input key.
>> Then you can use the input key as:
>>
>>   public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>>     private Text txt = new Text();
>>     private IntWritable count = new IntWritable(1);
>>     public void map(FileLineWritable key, Text val, Context context) throws IOException, InterruptedException {
>>       StringTokenizer st = new StringTokenizer(val.toString());
>>       while (st.hasMoreTokens()) {
>>         txt.set(key.fileName + st.nextToken());
>>         context.write(txt, count);
>>       }
>>     }
>>   }
>>
>>
>> Cheers,
>> Felix
>>
>>
>> On Aug 20, 2014, at 8:19 AM, rab ra  wrote:
>>
>> Thanks for the response.
>>
>> Yes, I know wholeFileInputFormat. But i am not sure filename comes to map
>> process either as key or value. But, I think this file format reads the
>> contents of the file. I wish to have a inputformat that just gives filename
>> or list of filenames.
>>
>> Also, files are very small. The wholeFileInputFormat spans one map
>> process per file and thus results huge number of map processes. I wish to
>> span a single map process per group of files.
>>
>> I think I need to tweak CombineFileInputFormat's recordreader() so that
>> it does not read the entire file but just filename.
>>
>>
>> regards
>> rab
>>
>> regards
>> Bala
>>
>>
>> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus 
>> wrote:
>>
>>> Have you looked at the WholeFileInputFormat implementations? There are
>>> quite a few if search for them...
>>>
>>>
>>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>>>
>>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra  wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a use case wherein i need to process huge set of files stored in
>>>> HDFS. Those files are non-splittable and they need to be processed as a
>>>> whole. Here, I have the following question for which I need answers to
>>>> proceed further in this.
>>>>
>>>> 1.  I wish to schedule the map process in task tracker where data is
>>>> already available. How can I do it? Currently, I have a file that contains
>>>> list of filenames. Each map get one line of it via NLineInputFormat. The
>>>> map process then accesses the file via FSDataInputStream and work with it.
>>>> Is there a way to ensure this map process is running on the node where the
>>>> file is available?.
>>>>
>>>> 2.  Since the files are not large and it can be called as 'small' files
>>>> by hadoop standard. Now, I came across CombineFileInputFormat that can
>>>> process more than one file in a single map process.  What I need here is a
>>>> format that can process more than one files in a single map but does not
>>>> have to read the files, and either in key or value, it has the filenames.
>>>> In map process then, I can run a loop to process these files. Any help?
>>>>
>>>> 3. Any othe alternatives?
>>>>
>>>>
>>>>
>>>> regards
>>>>  rab
>>>>
>>>>
>>>
>>
>>
>


Re: Hadoop InputFormat - Processing large number of small files

2014-08-21 Thread rab ra
Thanks for the link. If the CombineFileInputFormat is not required to provide
the contents of the files to the map process but only the filenames, what
changes need to be made in the code?

rab.
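
A sketch of one possible answer, using the new (org.apache.hadoop.mapreduce)
API; this is my own illustration, not Felix's code. The per-file record reader
emits a single record whose key is the file path and never opens the file, so
the mapper can read the files it wants by itself:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Groups many small files into few splits, but hands the mapper only the
// file paths (as Text keys); the files themselves are never opened here.
public class FileNameOnlyInputFormat extends CombineFileInputFormat<Text, NullWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // each small file stays whole
  }

  @Override
  public RecordReader<Text, NullWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
      throws IOException {
    return new CombineFileRecordReader<Text, NullWritable>(
        (CombineFileSplit) split, context, FileNameRecordReader.class);
  }

  // Emits exactly one record per file in the combined split: key = file path, value = nothing.
  public static class FileNameRecordReader extends RecordReader<Text, NullWritable> {
    private final Path path;
    private boolean done = false;

    public FileNameRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
      this.path = split.getPath(index);   // the file this reader instance is responsible for
    }

    @Override public void initialize(InputSplit split, TaskAttemptContext context) { }
    @Override public boolean nextKeyValue() { boolean more = !done; done = true; return more; }
    @Override public Text getCurrentKey() { return new Text(path.toString()); }
    @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
    @Override public float getProgress() { return done ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}

In the driver, setting mapreduce.input.fileinputformat.split.maxsize (or
calling setMaxSplitSize from a subclass) is the usual way to bound how much
data ends up in one combined split, and CombineFileInputFormat builds its
splits from files on the same node or rack where possible, which addresses
the locality question raised earlier in these threads.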
On 20 Aug 2014 22:59, "Felix Chern"  wrote:

> I wrote a post on how to use CombineInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
> In the RecordReader constructor, you can get the context of which file you
> are reading in.
> In my example, I created FileLineWritable to include the filename in the
> mapper input key.
> Then you can use the input key as:
>
>  public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>    private Text txt = new Text();
>    private IntWritable count = new IntWritable(1);
>    public void map(FileLineWritable key, Text val, Context context) throws IOException, InterruptedException {
>      StringTokenizer st = new StringTokenizer(val.toString());
>      while (st.hasMoreTokens()) {
>        txt.set(key.fileName + st.nextToken());
>        context.write(txt, count);
>      }
>    }
>  }
>
>
> Cheers,
> Felix
>
>
> On Aug 20, 2014, at 8:19 AM, rab ra  wrote:
>
> Thanks for the response.
>
> Yes, I know wholeFileInputFormat. But i am not sure filename comes to map
> process either as key or value. But, I think this file format reads the
> contents of the file. I wish to have a inputformat that just gives filename
> or list of filenames.
>
> Also, files are very small. The wholeFileInputFormat spans one map process
> per file and thus results huge number of map processes. I wish to span a
> single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's recordreader() so that it
> does not read the entire file but just filename.
>
>
> regards
> rab
>
> regards
> Bala
>
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus 
> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if search for them...
>>
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>>
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra  wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein i need to process huge set of files stored in
>>> HDFS. Those files are non-splittable and they need to be processed as a
>>> whole. Here, I have the following question for which I need answers to
>>> proceed further in this.
>>>
>>> 1.  I wish to schedule the map process in task tracker where data is
>>> already available. How can I do it? Currently, I have a file that contains
>>> list of filenames. Each map get one line of it via NLineInputFormat. The
>>> map process then accesses the file via FSDataInputStream and work with it.
>>> Is there a way to ensure this map process is running on the node where the
>>> file is available?.
>>>
>>> 2.  Since the files are not large and it can be called as 'small' files
>>> by hadoop standard. Now, I came across CombineFileInputFormat that can
>>> process more than one file in a single map process.  What I need here is a
>>> format that can process more than one files in a single map but does not
>>> have to read the files, and either in key or value, it has the filenames.
>>> In map process then, I can run a loop to process these files. Any help?
>>>
>>> 3. Any othe alternatives?
>>>
>>>
>>>
>>> regards
>>>  rab
>>>
>>>
>>
>
>


Re: Hadoop InputFormat - Processing large number of small files

2014-08-20 Thread rab ra
Thanks for the response.

Yes, I know about WholeFileInputFormat, but I am not sure the filename comes
to the map process as either key or value. Also, I think this file format
reads the contents of the file. I wish to have an InputFormat that just gives
the filename or a list of filenames.

Also, the files are very small. WholeFileInputFormat spawns one map process
per file and thus results in a huge number of map processes. I wish to spawn
a single map process per group of files.

I think I need to tweak CombineFileInputFormat's RecordReader so that it
does not read the entire file but only the filename.


regards
rab

regards
Bala


On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus 
wrote:

> Have you looked at the WholeFileInputFormat implementations? There are
> quite a few if search for them...
>
>
> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>
> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>
> Regards,
> Shahab
>
>
> On Wed, Aug 20, 2014 at 1:46 AM, rab ra  wrote:
>
>> Hello,
>>
>> I have a use case wherein i need to process huge set of files stored in
>> HDFS. Those files are non-splittable and they need to be processed as a
>> whole. Here, I have the following question for which I need answers to
>> proceed further in this.
>>
>> 1.  I wish to schedule the map process in task tracker where data is
>> already available. How can I do it? Currently, I have a file that contains
>> list of filenames. Each map get one line of it via NLineInputFormat. The
>> map process then accesses the file via FSDataInputStream and work with it.
>> Is there a way to ensure this map process is running on the node where the
>> file is available?.
>>
>> 2.  Since the files are not large and it can be called as 'small' files
>> by hadoop standard. Now, I came across CombineFileInputFormat that can
>> process more than one file in a single map process.  What I need here is a
>> format that can process more than one files in a single map but does not
>> have to read the files, and either in key or value, it has the filenames.
>> In map process then, I can run a loop to process these files. Any help?
>>
>> 3. Any othe alternatives?
>>
>>
>>
>> regards
>>  rab
>>
>>
>


Re: Multiple map writing into same hdfs file

2014-08-20 Thread rab ra
Hello,

I finally moved to per-task writes; the reducers gather them all and write
them into the file.

Thanks for the help


regards
rab
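
A sketch of what the per-task-write approach can look like (my illustration,
with placeholder paths and content, and assuming one input record per map
task as with NLineInputFormat): each map writes its own HDFS file named after
its task attempt, so concurrent maps never collide, and a later gather step
combines them.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PerTaskFileMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Attempt IDs are unique, so no two running maps write to the same file.
    // Failed or speculative attempts leave extra files, which the gather step must ignore.
    Path out = new Path("/results/" + context.getTaskAttemptID());   // placeholder output directory
    FileSystem fs = FileSystem.get(context.getConfiguration());
    FSDataOutputStream os = fs.create(out, false);
    os.writeBytes("the 8 computed values for this input\n");         // placeholder content
    os.close();
  }
}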


On Fri, Jul 11, 2014 at 10:50 AM, Bertrand Dechoux 
wrote:

> And beside with a single file, if that were possible, how do you handle
> error? Let' say task 1 ran 3 times : 1 error, 1 speculative and 1
> success... A per-task file has been a standard to easily solve that
> problem.
>
> Bertrand Dechoux
>
>
> On Thu, Jul 10, 2014 at 10:00 PM, Vinod Kumar Vavilapalli <
> vino...@hortonworks.com> wrote:
>
>> Current writes to a single file in HDFS is not possible today. You  may
>> want to write a per-task file and use that entire directory as your output.
>>
>> +Vinod
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>>
>> On Wed, Jul 9, 2014 at 10:42 PM, rab ra  wrote:
>>
>>>
>>> hello
>>>
>>>
>>>
>>> I have one use-case that spans multiple map tasks in hadoop environment.
>>> I use hadoop 1.2.1 and with 6 task nodes. Each map task writes their output
>>> into a file stored in hdfs. This file is shared across all the map tasks.
>>> Though, they all computes thier output but some of them are missing in the
>>> output file.
>>>
>>>
>>>
>>> The output file is an excel file with 8 parameters(headings). Each map
>>> task is supposed to compute all these 8 values, and save it as soon as it
>>> is computed. This means, the programming logic of a map task opens the
>>> file, writes the value and close, 8 times.
>>>
>>>
>>>
>>> Can someone give me a hint on whats going wrong here?
>>>
>>>
>>>
>>> Is it possible to make more than one map task to write in a shared file
>>> in HDFS?
>>>
>>
>>
>
>
>


Hadoop InputFormat - Processing large number of small files

2014-08-19 Thread rab ra
Hello,

I have a use case wherein I need to process a huge set of files stored in
HDFS. The files are non-splittable and need to be processed as a whole. I
have the following questions, for which I need answers to proceed further.

1. I wish to schedule the map process on the task tracker where the data is
already available. How can I do it? Currently, I have a file that contains a
list of filenames. Each map gets one line of it via NLineInputFormat. The
map process then accesses the file via FSDataInputStream and works with it.
Is there a way to ensure the map process runs on the node where the file is
available?

2. The files are not large and can be called 'small' files by Hadoop
standards. I came across CombineFileInputFormat, which can process more than
one file in a single map process. What I need here is a format that can
process more than one file in a single map but does not have to read the
files, and that provides the filenames either as key or value. In the map
process, I can then run a loop to process these files. Any help?

3. Any other alternatives?



regards
rab


More than one map task in a node - Hadoop 2x

2014-07-23 Thread rab ra
Hello,

I am trying to configure Hadoop 2.4.0 to run more than one map task on a
node. I have done this in Hadoop 1.x and found it straightforward. But in
Hadoop 2.x, with YARN coming in, I found it a bit difficult to follow the
documentation. Can someone give me a link or share some ideas on this topic?
Thanks.


with regards
rab


Hadoop streaming - Class not found

2014-07-23 Thread rab ra
Hello,

I am trying to run an executable using Hadoop streaming 2.4.

My executable is my mapper, which is a Groovy script. This script uses a
class from a jar file which I am shipping via the -libjars argument.

Hadoop streaming is made to spawn maps via an input file; each line feeds
one map.

The question is: though Hadoop successfully executes the use case, I see
that some maps failed and were restarted later. The failure was due to
failing to locate the class. The script has some imports and they are not
found. However, they are all in the jar file.

I am tempted to think that when Hadoop executes the first few map tasks,
the jar file is not yet "prepared" to be made available to the maps, and
hence the initial maps fail to locate the class; later, when they are
restarted, they are able to locate the class and execute smoothly.

Is this correct? If not, can someone tell me why this behavior occurs? How
can I get around this issue? Because of it, the use case takes a little more
time to execute. I fear that when I expand the use case, this will cause a
noticeable performance delay.


with regards
rab


multiple map tasks writing in same hdfs file -issue

2014-07-10 Thread rab ra
Hello


I have one use case that spawns multiple map tasks in a Hadoop environment. I
use Hadoop 1.2.1 with 6 task nodes. Each map task writes its output into a
file stored in HDFS. This file is shared across all the map tasks. Though
they all compute their output, some of the results are missing from the
output file.

The output file is an Excel file with 8 parameters (headings). Each map task
is supposed to compute all 8 values and save each one as soon as it is
computed. This means the programming logic of a map task opens the file,
writes the value and closes it, 8 times.

Can someone give me a hint on what's going wrong here?

Is it possible to make more than one map task write to a shared file in
HDFS?

Regards
Rab


Multiple map writing into same hdfs file

2014-07-09 Thread rab ra
hello



I have one use case that spawns multiple map tasks in a Hadoop environment. I
use Hadoop 1.2.1 with 6 task nodes. Each map task writes its output into a
file stored in HDFS. This file is shared across all the map tasks. Though
they all compute their output, some of the results are missing from the
output file.

The output file is an Excel file with 8 parameters (headings). Each map task
is supposed to compute all 8 values and save each one as soon as it is
computed. This means the programming logic of a map task opens the file,
writes the value and closes it, 8 times.

Can someone give me a hint on what's going wrong here?

Is it possible to make more than one map task write to a shared file in
HDFS?


RE: HDFS data transfer is faster than SCP based transfer?

2014-01-25 Thread rab ra
The input files are provided as arguments to a binary executed by the map
process. This binary cannot read from HDFS, and I can't rewrite it.
On 25 Jan 2014 19:47, "John Lilley"  wrote:

>  There are no short-circuit writes, only reads, AFAIK.
>
> Is it necessary to transfer from HDFS to local disk?  Can you read from
> HDFS directly using the FileSystem interface?
>
> john
>
>
>
> *From:* Shekhar Sharma [mailto:shekhar2...@gmail.com]
> *Sent:* Saturday, January 25, 2014 3:44 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: HDFS data transfer is faster than SCP based transfer?
>
>
>
> We have the concept of short circuit reads which directly reads from data
> node which improve read performance. Do we have similar concept like short
> circuit writes
>
> On 25 Jan 2014 16:10, "Harsh J"  wrote:
>
> There's a lot of difference here, although both do use TCP underneath,
> but do note that SCP securely encrypts data but stock HDFS
> configuration does not.
>
> You can also ask SCP to compress data transfer via the "-C" argument
> btw - unsure if you already applied that pre-test - it may help show
> up some difference. Also, the encryption algorithm can be changed to a
> weaker one if security is not a concern during the transfer, via "-c
> arcfour".
>
> On Fri, Jan 24, 2014 at 10:55 AM, rab ra  wrote:
> > Hello
> >
> > I have a use case that requires transfer of input files from remote
> storage
> > using SCP protocol (using jSCH jar).  To optimize this use case, I have
> > pre-loaded all my input files into HDFS and modified my use case so that
> it
> > copies required files from HDFS. So, when tasktrackers works, it copies
> > required number of input files to its local directory from HDFS. All my
> > tasktrackers are also datanodes. I could see my use case has run faster.
> The
> > only modification in my application is that file copy from HDFS instead
> of
> > transfer using SCP. Also, my use case involves parallel operations (run
> in
> > tasktrackers) and they do lot of file transfer. Now all these transfers
> are
> > replaced with HDFS copy.
> >
> > Can anyone tell me HDFS transfer is faster as I witnessed? Is it
> because, it
> > uses TCP/IP? Can anyone give me reasonable reasons to support the
> decrease
> > of time?
> >
> >
> > with thanks and regards
> > rab
>
>
>
> --
> Harsh J
>


Re: HDFS data transfer is faster than SCP based transfer?

2014-01-24 Thread rab ra
It is not a single file; there are lots of small files. The files are stored
in HDFS, and the map operations copy the required files from HDFS. Only one
map process runs on each node. Each file is about 16MB.
On 24 Jan 2014 23:49, "Vinod Kumar Vavilapalli" 
wrote:

> Is it a single file? Lots of files? How big are the files? Is the copy on
> a single node or are you running some kind of a MapReduce program?
>
> +Vinod
> Hortonworks Inc.
> http://hortonworks.com/
>
>
> On Fri, Jan 24, 2014 at 7:21 AM, rab ra  wrote:
>
>> Hi
>>
>> Can anyone please answer my query?
>>
>> -Rab
>> -- Forwarded message --
>> From: "rab ra" 
>> Date: 24 Jan 2014 10:55
>> Subject: HDFS data transfer is faster than SCP based transfer?
>> To: 
>>
>> Hello
>>
>> I have a use case that requires transfer of input files from remote
>> storage using SCP protocol (using jSCH jar).  To optimize this use case, I
>> have pre-loaded all my input files into HDFS and modified my use case so
>> that it copies required files from HDFS. So, when tasktrackers works, it
>> copies required number of input files to its local directory from HDFS. All
>> my tasktrackers are also datanodes. I could see my use case has run faster.
>> The only modification in my application is that file copy from HDFS instead
>> of transfer using SCP. Also, my use case involves parallel operations (run
>> in tasktrackers) and they do lot of file transfer. Now all these transfers
>> are replaced with HDFS copy.
>>
>> Can anyone tell me HDFS transfer is faster as I witnessed? Is it because,
>> it uses TCP/IP? Can anyone give me reasonable reasons to support the
>> decrease of time?
>>
>>
>> with thanks and regards
>> rab
>>
>
>


Fwd: HDFS data transfer is faster than SCP based transfer?

2014-01-24 Thread rab ra
Hi

Can anyone please answer my query?

-Rab
-- Forwarded message --
From: "rab ra" 
Date: 24 Jan 2014 10:55
Subject: HDFS data transfer is faster than SCP based transfer?
To: 

Hello

I have a use case that requires transferring input files from remote storage
using the SCP protocol (using the jSch jar). To optimize this use case, I have
pre-loaded all my input files into HDFS and modified my use case so that it
copies the required files from HDFS. So, when the tasktrackers work, they copy
the required input files to their local directories from HDFS. All my
tasktrackers are also datanodes. I could see that my use case ran faster. The
only modification in my application is the file copy from HDFS instead of the
transfer using SCP. Also, my use case involves parallel operations (run in
tasktrackers) and they do a lot of file transfer. Now all these transfers are
replaced with HDFS copies.

Can anyone tell me why the HDFS transfer is faster, as I witnessed? Is it
because it uses TCP/IP? Can anyone give me reasonable reasons to explain the
decrease in time?


with thanks and regards
rab


HDFS data transfer is faster than SCP based transfer?

2014-01-23 Thread rab ra
Hello

I have a use case that requires transferring input files from remote storage
using the SCP protocol (using the jSch jar). To optimize this use case, I have
pre-loaded all my input files into HDFS and modified my use case so that it
copies the required files from HDFS. So, when the tasktrackers work, they copy
the required input files to their local directories from HDFS. All my
tasktrackers are also datanodes. I could see that my use case ran faster. The
only modification in my application is the file copy from HDFS instead of the
transfer using SCP. Also, my use case involves parallel operations (run in
tasktrackers) and they do a lot of file transfer. Now all these transfers are
replaced with HDFS copies.

Can anyone tell me why the HDFS transfer is faster, as I witnessed? Is it
because it uses TCP/IP? Can anyone give me reasonable reasons to explain the
decrease in time?


with thanks and regards
rab


Re: Problem with RPC encryption over wire

2013-11-13 Thread rab ra
Any hint ?
On 14 Nov 2013 11:45, "rab ra"  wrote:

> Thank for the response.
>
> I have removed rpc.protection parameter in all of my configuration and now
> I am getting an error as below:-
>
>
>
> Any Hint on whats going on here
>
>
>
> 13/11/14 10:10:47 INFO mapreduce.Job: Task Id :
> attempt_1384339616944_0002_m_26_0, Status : FAILED
>
> Container launch failed for container_1384339616944_0002_01_29 :
> java.net.SocketTimeoutException: Call From U1204D32/40.221.95.97 to
> U1204D32.blr.in.as.airbus.corp:45003 failed on socket timeout exception:
> java.net.SocketTimeoutException: 6 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected 
> local=/40.221.95.97:48268remote=U1204D32.blr.in.as.airbus.corp/
> 40.221.95.97:45003]; For more details see:
> http://wiki.apache.org/hadoop/SocketTimeout
>
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
>   at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>
>   at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
>
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>
>   at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>
>   at com.sun.proxy.$Proxy30.startContainers(Unknown Source)
>
>   at
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
>
>   at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:151)
>
>   at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
>
>   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>   at java.lang.Thread.run(Thread.java:744)
>
> Caused by: java.net.SocketTimeoutException: 6 millis timeout while
> waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected 
> local=/40.221.95.97:48268remote=U1204D32.blr.in.as.airbus.corp/
> 40.221.95.97:45003]
>
>   at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>
>   at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>
>   at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>
>   at java.io.FilterInputStream.read(FilterInputStream.java:133)
>
>   at java.io.FilterInputStream.read(FilterInputStream.java:133)
>
>   at
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:457)
>
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>
>   at java.io.DataInputStream.readInt(DataInputStream.java:387)
>
>   at
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:995)
>
>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891)
>
>
>
> 13/11/14 10:10:50 INFO mapreduce.Job:  map 72% reduce 24%
>
> 13/11/14 10:19:13 INFO mapreduce.Job:  map 76% reduce 24%
>
> 13/11/14 10:19:55 INFO mapreduce.Job:  map 78% reduce 24%
>
> 13/11/14 10:20:10 INFO mapreduce.Job:  map 80% reduce 24%
>
> 13/11/14 10:20:29 INFO mapreduce.Job:  map 81% reduce 24%
>
> 13/11/14 10:21:50 INFO mapreduce.Job:  map 82% reduce 24%
>
> 13/11/14 10:22:01 INFO mapreduce.Job:  map 83% reduce 24%
>
> 13/11/14 10:22:10 INFO mapreduce.Job:  map 84% reduce 24%
>
> 13/11/14 10:22:13 INFO mapreduce.Job:  map 85% reduce 24%
>
> 13/11/14 10:22:16 INFO mapreduce.Job:  map 86% reduce 24%
>
> 13/11/14 10:22:27 INFO mapreduce.Job:  map 86% reduce 28%
>
> 13/11/14 10:22:30 INFO mapreduce.Job:  map 86% reduce 29%
>
> 13/11/14 10:28:30 INFO mapreduce.Job:  map 88% reduce 29%
>
> 13/11/14 10:28:35 INFO mapreduce.Job:  map 90% reduce 29%
>
> 13/11/14 10:28:37 INFO mapreduce.Job:  map 97% reduce 29%
>
> 13/11/14 10:28:43 INFO mapreduce.Job:  map 98% reduce 29%
>
> 13/11/14 10:28:45 INFO mapreduce.Job:  map 99% redu

Re: Folder not created using Hadoop Mapreduce code

2013-11-13 Thread rab ra
Unless you use FileSystem's mkdirs() method, I am not sure you can create a
folder in HDFS.
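
For what it's worth, a small sketch of both points: relative paths resolve
against the file system's working directory (on HDFS this is normally
/user/<username>), and mkdirs() creates the directory explicitly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereIsIn {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path in = new Path("in");                   // relative path
    System.out.println(fs.makeQualified(in));   // shows the absolute location it resolves to
    fs.mkdirs(in);                              // create the directory explicitly
  }
}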
On 14 Nov 2013 11:58, "unmesha sreeveni"  wrote:

> I am trying to create a file with in "in" folder. but when i tried to run
> this in cluster i noticed that this "in" folder is not within hdfs.
>
> why is it so?
>
> Any thing wrong?
>
> My Driver code is
>
>  Path in = new Path("in");
> Path input = new Path("in/inputfile");
> BufferedWriter createinput = new BufferedWriter(new 
> OutputStreamWriter(fs.create(input)));
>
> According to this code a "in" folder and a file "inputfile" should be
> created in working directory of cluster right?
>
> --
> *Thanks & Regards*
>
> Unmesha Sreeveni U.B
>
> *Junior Developer*
>
>
>


Re: Problem with RPC encryption over wire

2013-11-13 Thread rab ra
 mapred.ClientServiceDelegate: Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history
server

13/11/14 10:29:56 ERROR security.UserGroupInformation:
PriviledgedActionException as:prrekapaee9f (auth:SIMPLE)
cause:java.io.IOException:
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException):
java.lang.NullPointerException

  at
org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getCounters(HistoryClientService.java:220)

  at
org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getCounters(MRClientProtocolPBServiceImpl.java:159)

  at
org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:281)

  at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)

  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)

  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)

  at java.security.AccessController.doPrivileged(Native Method)

  at javax.security.auth.Subject.doAs(Subject.java:415)

  at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)

  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)



 [Error] Exception:

java.io.IOException:
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException):
java.lang.NullPointerException

  at
org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getCounters(HistoryClientService.java:220)

  at
org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getCounters(MRClientProtocolPBServiceImpl.java:159)

  at
org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:281)

  at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)

  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)

  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)

  at java.security.AccessController.doPrivileged(Native Method)

  at javax.security.auth.Subject.doAs(Subject.java:415)

  at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)

  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)



  at
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:331)

  at
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:368)

  at
org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511)

  at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756)

  at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753)

  at java.security.AccessController.doPrivileged(Native Method)

  at javax.security.auth.Subject.doAs(Subject.java:415)

  at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)

  at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753)

  at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361)

  at
org.apache.hadoop.mapred.JobClient$NetworkedJob.monitorAndPrintJob(JobClient.java:407)

  at
org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:855)

  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:835)

  at com.eads.dcgo.uc4.DCGoUC4.run(DCGoUC4.java:593)

  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

  at com.eads.dcgo.uc4.DCGoUC4.main(DCGoUC4.java:694)

  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
On 13 Nov 2013 20:11, "Daryn Sharp"  wrote:

>  "No common protection layer between server and client " likely means the
> host for job submission does not have hadoop.rpc.protection=privacy.  In
> order for QOP to work, all client hosts (DN & others used to access the
> cluster) must have an identical setting.
>
>  A few quick questions: I'm assuming you mis-posted your configs and the
> protection setting isn't really commented out?  Your configs don't show
> security being enabled, but you do have it enabled, correct?  Otherwise QOP
> shouldn't apply.  Perhaps a bit obvious, but did you restart your NN after
> changing the QOP?  Since your defaultFS is just "master", are you using HA?
>
>  It's a bit concerning that you aren't consistently receiving the
> mismatch error.  Is the client looping on retries and then you get timeouts
> after 5 attempts?  If yes, we've got a major bug.  5 is the default number
> of RPC readers which handle SASL auth which means the protection mismatch
> is killing off the reader threads and rendering the NN unusable.  This
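
For reference, the setting described above lives in core-site.xml and has to be
identical on every host that talks to the cluster; a minimal snippet (not taken
from the poster's configs):

  <property>
    <name>hadoop.rpc.protection</name>
    <value>privacy</value>
  </property>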

Problem with RPC encryption over wire

2013-11-13 Thread rab ra
Hello,

I am facing a problem using the Hadoop RPC encryption over the wire feature in
Hadoop 2.2.0. I have a 3-node cluster.


Services running on node 1 (master):
ResourceManager
NameNode
DataNode
SecondaryNameNode

Services running on the slaves (nodes 2 & 3):
NodeManager



I am trying to make the data transfer between master and slaves secure. For
that, I wanted to use the data encryption over the wire (RPC encryption)
feature of Hadoop 2.2.0.

When I ran the code, I got the exception below:

Caused by: java.net.SocketTimeoutException: 6 millis timeout while
waiting for channel to be ready for read.


In another run, I saw the following error in the log:

No common protection layer between server and client

I am not sure whether my configuration is in line with what I want to achieve.

Can someone give me a hint on where I am going wrong?

By the way, I have the configuration settings below on all of these nodes.

Core-site.xml

<configuration>

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:8020</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp</value>
  </property>

  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>

</configuration>


Hdfs-site.xml

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/app/hadoop/dfs-2.2.0/name</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/app/hadoop/dfs-2.2.0/data</value>
  </property>

  <property>
    <name>dfs.encrypt.data.transfer</name>
    <value>true</value>
  </property>

  <property>
    <name>dfs.encrypt.data.transfer.algorithm</name>
    <value>rc4</value>
  </property>

  <property>
    <name>dfs.block.access.token.enable</name>
    <value>true</value>
  </property>

</configuration>



Mapred-site.xml

<configuration>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

  <property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>

  <property>
    <name>mapreduce.map.speculative</name>
    <value>false</value>
  </property>

  <property>
    <name>mapreduce.reduce.speculative</name>
    <value>false</value>
  </property>

  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024m</value>
  </property>

</configuration>




Yarn-site.xml

<configuration>

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

</configuration>





With thanks and regards
Rab


Tasktracker not running with LinuxTaskController

2013-11-12 Thread rab ra
Hi

I would like to use the LinuxTaskController with Hadoop 1.2.1. Accordingly, I
changed the configuration. When I started my services, all but the TaskTracker
came up. The TaskTracker log says the LinuxTaskController class was not found.
Please note that I did not build the task-controller executable.
Can someone help me here?
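
For context, the configuration change being referred to is usually along these
lines in mapred-site.xml (a hedged sketch for Hadoop 1.x; exact property names
may vary by version, and the native task-controller binary must also be built
and installed):

  <property>
    <name>mapred.task.tracker.task-controller</name>
    <value>org.apache.hadoop.mapred.LinuxTaskController</value>
  </property>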

Thanks and regards


Re: Hadoop on multiple user mode?

2013-11-12 Thread rab ra
Thanks for the response. However I could not find LinuxTaskController in
hadoop 2.2.0.
On 12 Nov 2013 03:10, "Harsh J"  wrote:

> If you'd like the tasks to use the actual submitting user accounts,
> you'll need to turn on security, or more specifically use the
> LinuxTaskController instead of the DefaultTaskController.
>
> On Mon, Nov 11, 2013 at 10:07 PM, rab ra  wrote:
> > -- Forwarded message --
> > From: "rab ra" 
> > Date: 11 Nov 2013 20:11
> > Subject: Hadoop on multiple user mode
> > To: "user@hadoop.apache.org" 
> >
> > Hello
> >
> > I want to configure hadoop so that it is started as user admin and more
> than
> > one user can launch job. I notice that while i submit job as a guest
> user,
> > the map process is executed as admin user. I print user home in my main
> code
> > as well as inside map process. Is there a way span map process a job
> > submitting user?
>
>
>
> --
> Harsh J
>


Hadoop on multiple user mode?

2013-11-11 Thread rab ra
-- Forwarded message --
From: "rab ra" 
Date: 11 Nov 2013 20:11
Subject: Hadoop on multiple user mode
To: "user@hadoop.apache.org" 

Hello

I want to configure Hadoop so that it is started as the admin user and more
than one user can launch jobs. I notice that when I submit a job as a guest
user, the map process is executed as the admin user. I print the user home in
my main code as well as inside the map process. Is there a way to spawn the map
process as the job-submitting user?


Re: Why SSH

2013-11-11 Thread rab ra
Thanks for the response. It helps. :-)
On 11 Nov 2013 14:04, "unmesha sreeveni"  wrote:

> I guess it is  TCP .
>
> The server is 
> DataXceiverServer<http://www.grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-hdfs/2.0.0-cdh4.3.0/org/apache/hadoop/hdfs/server/datanode/DataXceiverServer.java?av=f>
>  and
> the client is 
> DFSClient<http://www.grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-hdfs/2.0.0-cdh4.3.0/org/apache/hadoop/hdfs/DFSClient.java#DFSClient>.
> Basically, they use the Java Socket API.
>
> DataXceiverServer<http://www.grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-hdfs/2.0.0-cdh4.3.0/org/apache/hadoop/hdfs/server/datanode/DataXceiverServer.java?av=f>
> :
> http://www.grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-hdfs/2.0.0-cdh4.3.0/org/apache/hadoop/hdfs/server/datanode/DataXceiverServer.java?av=f
>
>
>
> On Mon, Nov 11, 2013 at 12:11 PM, Harsh J  wrote:
>
>> Neither of those. We stream data directly over a TCP network socket.
>>
>> Please read http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F
>> regarding your SSH confusion.
>>
>> On Mon, Nov 11, 2013 at 10:21 AM, rab ra  wrote:
>> > Hello
>> >
>> > I have a question. To transfer the files to datanodes what protocol
>> hadoop
>> > uses? SSH or http or https
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> *Thanks & Regards*
>
> Unmesha Sreeveni U.B
>
> *Junior Developer*
>
> *Amrita Center For Cyber Security *
>
>
> * Amritapuri.www.amrita.edu/cyber/ <http://www.amrita.edu/cyber/>*
>


Hadoop on multiple user mode

2013-11-11 Thread rab ra
Hello

I want to configure Hadoop so that it is started as the admin user and more
than one user can launch jobs. I notice that when I submit a job as a guest
user, the map process is executed as the admin user. I print the user home in
my main code as well as inside the map process. Is there a way to spawn the map
process as the job-submitting user?


Why SSH

2013-11-10 Thread rab ra
Hello

I have a question: which protocol does Hadoop use to transfer files to the
datanodes? SSH, HTTP, or HTTPS?


Sending map process to multiple nodes, special use case

2013-11-07 Thread rab ra
Hello

In one of my use cases, I am sending map processes to a large number of Hadoop
nodes, assuming the nodes are obtained from a public cloud. I would like to
ensure that the security of the nodes is not compromised. For this, I am
planning to implement a voting mechanism wherein multiple copies, let us say 3,
of the same map process are sent to 3 different nodes. In this regard, I have
the following questions.

1. I am using NLineInputFormat, wherein each line is sent to one map process.
Is there any mechanism in Hadoop to create 3 identical map processes for a
single line? I can mimic this by writing the same line three times in the input
file referred to by NLineInputFormat (a small sketch of this appears after this
list). Is there a more elegant way to do it?

2. Is there any mechanism with which I can ensure that these identical map
processes are sent to three different nodes? Is there any way to control the
scheduling of map processes to specific nodes, for example, map1 goes to
node 1, and so on?

3. Or is there any scheduler that implements a voting mechanism that I can
use in conjunction with Hadoop?
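
For what it is worth, a minimal sketch of the duplication workaround from
point 1 above (file names here are only illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

public class TriplicateLines {
  public static void main(String[] args) throws Exception {
    try (BufferedReader in = new BufferedReader(new FileReader("inputs.txt"));
         PrintWriter out = new PrintWriter("inputs-x3.txt")) {
      String line;
      while ((line = in.readLine()) != null) {
        // Write each line three times; with NLineInputFormat (N = 1) each
        // copy then becomes the input of a separate map task.
        for (int i = 0; i < 3; i++) {
          out.println(line);
        }
      }
    }
  }
}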

I am not sure about the above approach. Basically, I would like to ensure that
the results generated by the nodes are correct and can be trusted. For
instance, I send one map process to three nodes and verify the results from
these three nodes; if one node has given a different result, that node is
assumed to need verification.

If there is any other possible approach, please share it with me.

regards
rab


send map process to specific node

2013-11-07 Thread rab ra
Hello,

I have a use case scenario wherein I need to schedule a map process to a
particular node; ideally, I want to send the map processes to the nodes of my
interest. Is it possible? If not, is there any workaround? Please share some
pointers to appropriate literature in this regard.


How to pass parameter to mappers

2013-08-28 Thread rab ra
Hello

Any hint on how to pass parameters to mappers in the Hadoop 1.2.1 release?
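
A common approach (sketched below with made-up key and class names) is to put
the value into the job Configuration in the driver and read it back in the
mapper's setup() method; this works with the new-API classes shipped in 1.2.1:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParameterDemo {

  public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private int threshold;

    @Override
    protected void setup(Context context) {
      // Mapper side: read the parameter back out of the job configuration.
      threshold = context.getConfiguration().getInt("myapp.threshold", 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      if (value.getLength() > threshold) {
        context.write(value, new IntWritable(1));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("myapp.threshold", "42");          // driver side: stash the parameter

    Job job = new Job(conf, "parameter-demo");  // 1.2.1-era constructor
    job.setJarByClass(ParameterDemo.class);
    job.setMapperClass(MyMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}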


Re: Issue with fs.delete

2013-08-28 Thread rab ra
Yes

I fixed the uri and it worked

Thanks
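
For reference, the corrected calls look roughly like this (namenode host/port
and user name are placeholders, and fs is the FileSystem handle from the
original snippet):

Path target = new Path("hdfs://namenode:8020/user/someuser/input/input.txt");
fs.delete(target, false);   // false = do not delete recursively
fs.copyFromLocalFile(false, true, new Path("input.txt"), target);   // overwrite = true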
On 28 Aug 2013 14:11, "Harsh J"  wrote:

> Please also try to share your error/stacktraces when you post a question.
>
> All I can suspect is that your URI is malformed, and is missing the
> authority component. That is, it should be
> hdfs://host:port/path/to/file and not hdfs:/path/to/file.
>
> On Wed, Aug 28, 2013 at 1:44 PM, rab ra  wrote:
> > -- Forwarded message --
> > From: "rab ra" 
> > Date: 28 Aug 2013 13:26
> > Subject: Issue with fs.delete
> > To: "us...@hadoop.apache.org" 
> >
> > Hello,
> >
> > I am having a trouble in deleting a file from hdfs. I am using hadoop
> 1.2.1
> > stable release. I use the following code segment in my program
> >
> >
> > fs.delete(new Path("hdfs:/user//input/input.txt"))
> > fs.copyFromLocalFile(false,false,new Path("input.txt"),new
> > Path("hdfs:/user//input/input.txt"))
> >
> > Any hint?
> >
> >
>
>
>
> --
> Harsh J
>


Fwd: Issue with fs.delete

2013-08-28 Thread rab ra
-- Forwarded message --
From: "rab ra" 
Date: 28 Aug 2013 13:26
Subject: Issue with fs.delete
To: "us...@hadoop.apache.org" 

Hello,

I am having trouble deleting a file from HDFS. I am using the Hadoop 1.2.1
stable release. I use the following code segment in my program:


fs.delete(new Path("hdfs:/user//input/input.txt"))
fs.copyFromLocalFile(false,false,new Path("input.txt"),new
Path("hdfs:/user//input/input.txt"))

Any hint?


Re: running map tasks in remote node

2013-08-25 Thread rab ra
Dear Yong,

Thanks for your elaborate answer. It really makes sense, and I am ending up
with something close to it, except for the shared storage.

In my use case, I am not allowed to use any shared storage system. The reason
is that the slave nodes may not be safe for hosting sensitive data (they could
belong to a different enterprise, perhaps in a cloud). I do agree that we still
need this data on the slave nodes while processing, and hence need to transfer
the data from the enterprise node to the processing nodes. But that is
acceptable, as it is better than using the slave nodes for storage. If I could
use shared storage, then I could use HDFS itself. I wrote simple example code
with a 2-node cluster setup and tested various input formats such as
WholeFileInputFormat, NLineInputFormat, and TextInputFormat. I ran into issues
when I did not want to use shared storage, as I explained in my last email. I
was thinking that having the input file on the master node (job tracker) would
be sufficient and that it would send a portion of the input file to the map
process on the second node (slave). But this was not the case, as the
setInputPath() method (and the MapReduce system) expects this path to be a
shared one. All these observations lead to a straightforward question: does
the MapReduce system expect a shared storage system, with the input
directories present in that shared system? Is there a workaround for this
issue? In fact, I am prepared to use HDFS just to satisfy the MapReduce system
and feed input to it, and for the actual processing I shall end up
transferring the required data files to the slave nodes.

I do note that I cannot enjoy the advantages that come with HDFS, such as data
replication and data-locality awareness.
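
For what it is worth, a driver along the lines Yong suggests below (explicit
file:/// URLs on a mount that every node sees at the same path) might look like
this sketch; the paths and job name are placeholders, and the identity mapper
is used just to keep the example self-contained:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedMountDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "shared-mount-demo");
    job.setJarByClass(SharedMountDriver.class);
    job.setNumReduceTasks(0);   // identity mapper only, no reduce

    // Explicit scheme: every task node must see the same mount at /share_data.
    FileInputFormat.addInputPath(job, new Path("file:///share_data/myfolder"));
    FileOutputFormat.setOutputPath(job, new Path("file:///share_data/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}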


with thanks and regards
rabmdu







On Fri, Aug 23, 2013 at 7:41 PM, java8964 java8964 wrote:

> It is possible to do what you are trying to do, but only make sense if
> your MR job is very CPU intensive, and you want to use the CPU resource in
> your cluster, instead of the IO.
>
> You may want to do some research about what is the HDFS's role in Hadoop.
> First but not least, it provides a central storage for all the files will
> be processed by MR jobs. If you don't want to use HDFS, so you need to
>  identify a share storage to be shared among all the nodes in your cluster.
> HDFS is NOT required, but a shared storage is required in the cluster.
>
> For simply your question, let's just use NFS to replace HDFS. It is
> possible for a POC to help you understand how to set it up.
>
> Assume your have a cluster with 3 nodes (one NN, two DN. The JT running on
> NN, and TT running on DN). So instead of using HDFS, you can try to use NFS
> by this way:
>
> 1) Mount /share_data in all of your 2 data nodes. They need to have the
> same mount. So /share_data in each data node point to the same NFS
> location. It doesn't matter where you host this NFS share, but just make
> sure each data node mount it as the same /share_data
> 2) Create a folder under /share_data, put all your data into that folder.
> 3) When kick off your MR jobs, you need to give a full URL of the input
> path, like 'file:///shared_data/myfolder', also a full URL of the output
> path, like 'file:///shared_data/output'. In this way, each mapper will
> understand that in fact they will run the data from local file system,
> instead of HDFS. That's the reason you want to make sure each task node has
> the same mount path, as 'file:///shared_data/myfolder' should work fine for
> each  task node. Check this and make sure that /share_data/myfolder all
> point to the same path in each of your task node.
> 4) You want each mapper to process one file, so instead of using the
> default 'TextInputFormat', use a 'WholeFileInputFormat', this will make
> sure that every file under '/share_data/myfolder' won't be split and sent
> to the same mapper processor.
> 5) In the above set up, I don't think you need to start NameNode or
> DataNode process any more, anyway you just use JobTracker and TaskTracker.
> 6) Obviously when your data is big, the NFS share will be your bottleneck.
> So maybe you can replace it with Share Network Storage, but above set up
> gives you a start point.
> 7) Keep in mind when set up like above, you lost the Data Replication,
> Data Locality etc, that's why I said it ONLY makes sense if your MR job is
> CPU intensive. You simple want to use the Mapper/Reducer tasks to process
> your data, instead of any scalability of IO.
>
> Make sense?
>
> Yong
>
> --
> Date: Fri, 23 Aug 2013 15:43:38 +0530
> Subject: Re: running map tasks in remote node
>
> From: rab...@gmail.com
> To: user@hadoop.apache.org
>
> Thanks for the reply.
>
> I am basically exploring possible ways to work with hadoop framework for
> one of my use case. I have my limitations in using hdfs but agree with the
> fact that using map reduce in conjunction with hdfs makes sense.
>
> I successfully tested wholeFileInputFormat by some googling

Re: running map tasks in remote node

2013-08-23 Thread rab ra
Thanks for the reply.

I am basically exploring possible ways to work with the Hadoop framework for
one of my use cases. I have limitations on using HDFS, but I agree that using
MapReduce in conjunction with HDFS makes sense.

I successfully tested WholeFileInputFormat after some googling.

Now, coming to my use case: I would like to keep some files on my master node
and do some processing on the cloud nodes. The policy does not allow us to
configure and use the cloud nodes as HDFS. However, I would like to spawn map
processes on those nodes. Hence, I set the input path to the local file system,
for example $HOME/inputs. I have a file listing filenames (10 lines) in this
input directory. I use NLineInputFormat and spawn 10 map processes; each map
process gets a line. The map process will then do a file transfer and process
it. However, I get a FileNotFoundException for $HOME/inputs in the map. I am
sure this directory is present on my master but not on the slave nodes; when I
copy this input directory to the slave nodes, it works fine. I am not able to
figure out how to fix this, and I do not understand why it complains that the
input directory is not present. As far as I know, the slave nodes get a map,
and the map method receives the contents of the input file; this should be
enough for the map logic to work.


with regards
rabmdu




On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 wrote:

> If you don't plan to use HDFS, what kind of sharing file system you are
> going to use between cluster? NFS?
> For what you want to do, even though it doesn't make too much sense, but
> you need to the first problem as the shared file system.
>
> Second, if you want to process the files file by file, instead of block by
> block in HDFS, then you need to use the WholeFileInputFormat (google this
> how to write one). So you don't need a file to list all the files to be
> processed, just put them into one folder in the sharing file system, then
> send this folder to your MR job. In this way, as long as each node can
> access it through some file system URL, each file will be processed in each
> mapper.
>
> Yong
>
> --
> Date: Wed, 21 Aug 2013 17:39:10 +0530
> Subject: running map tasks in remote node
> From: rab...@gmail.com
> To: user@hadoop.apache.org
>
>
> Hello,
>
> Here is the newbie question of the day.
>
> For one of my use cases, I want to use hadoop map reduce without HDFS.
> Here, I will have a text file containing a list of file names to process.
> Assume that I have 10 lines (10 files to process) in the input text file
> and I wish to generate 10 map tasks and execute them in parallel in 10
> nodes. I started with basic tutorial on hadoop and could setup single node
> hadoop cluster and successfully tested wordcount code.
>
> Now, I took two machines A (master) and B (slave). I did the below
> configuration in these machines to setup a two node cluster.
>
> hdfs-site.xml
>
> <configuration>
>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>/tmp/hadoop-bala/dfs/name</value>
>   </property>
>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>/tmp/hadoop-bala/dfs/data</value>
>   </property>
>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>A:9001</value>
>   </property>
>
> </configuration>
>
> mapred-site.xml
>
> <configuration>
>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>A:9001</value>
>   </property>
>
>   <property>
>     <name>mapreduce.tasktracker.map.tasks.maximum</name>
>     <value>1</value>
>   </property>
>
> </configuration>
>
> core-site.xml
>
> <configuration>
>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://A:9000</value>
>   </property>
>
> </configuration>
>
>
> In A and B, I do have a file named ‘slaves’ with an entry ‘B’ in it and
> another file called ‘masters’ wherein an entry ‘A’ is there.
>
> I have kept my input file at A. I see the map method process the input
> file line by line but they are all processed in A. Ideally, I would expect
> those processing to take place in B.
>
> Can anyone highlight where I am going wrong?
>
>  regards
> rab
>


Fwd: Create a file in local file system in map method

2013-08-22 Thread rab ra
-- Forwarded message --
From: "rab ra" 
Date: 22 Aug 2013 15:14
Subject: Create a file in local file system in map method
To: "us...@hadoop.apache.org" 

Hi

I am not able to create a file in my local file system from my map method. Is
there a way to do it? Please provide some links. My map needs to generate one
file during execution. When executing on a remote node, it should generate the
file in a predefined location on the executing node.
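
For reference, writing to the local file system of the node executing the map
task is plain java.io; a minimal sketch (the /tmp/myapp location and the
key/value types are illustrative only):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LocalFileMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Predefined location on the node that executes this map task.
    File dir = new File("/tmp/myapp");
    dir.mkdirs();

    // The task attempt id plus the record offset keeps file names unique.
    File out = new File(dir, context.getTaskAttemptID() + "-" + key.get() + ".txt");
    try (FileWriter writer = new FileWriter(out)) {
      writer.write(value.toString());
    }

    context.write(new Text(out.getAbsolutePath()), value);
  }
}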

Thanks
Rab


running map tasks in remote node

2013-08-22 Thread rab ra
Hello,

Here is the newbie question of the day.

For one of my use cases, I want to use Hadoop MapReduce without HDFS. Here, I
will have a text file containing a list of file names to process. Assume that I
have 10 lines (10 files to process) in the input text file and I wish to
generate 10 map tasks and execute them in parallel on 10 nodes. I started with
the basic Hadoop tutorial, set up a single-node Hadoop cluster, and
successfully tested the wordcount code.

Now, I took two machines, A (master) and B (slave), and applied the
configuration below on these machines to set up a two-node cluster.

hdfs-site.xml

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-bala/dfs/name</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-bala/dfs/data</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>

</configuration>

mapred-site.xml

<configuration>

  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>

  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

</configuration>

core-site.xml

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://A:9000</value>
  </property>

</configuration>


On both A and B, I have a file named ‘slaves’ with an entry ‘B’ in it and
another file called ‘masters’ containing an entry ‘A’.

I have kept my input file on A. I see the map method process the input file
line by line, but the lines are all processed on A; ideally, I would expect
that processing to take place on B.

Can anyone point out where I am going wrong?

 regards
rab


running map task in remote node

2013-08-21 Thread rab ra
Hello,

Here is the newbie question of the day.

For one of my use cases, I want to use Hadoop MapReduce without HDFS. Here, I
will have a text file containing a list of file names to process. Assume that I
have 10 lines (10 files to process) in the input text file and I wish to
generate 10 map tasks and execute them in parallel on 10 nodes. I started with
the basic Hadoop tutorial, set up a single-node Hadoop cluster, and
successfully tested the wordcount code.

Now, I took two machines, A (master) and B (slave), and applied the
configuration below on these machines to set up a two-node cluster.

hdfs-site.xml

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-bala/dfs/name</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-bala/dfs/data</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>

</configuration>

mapred-site.xml

<configuration>

  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>

  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

</configuration>

core-site.xml

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://A:9000</value>
  </property>

</configuration>


On both A and B, I have a file named ‘slaves’ with an entry ‘B’ in it and
another file called ‘masters’ containing an entry ‘A’.

I have kept my input file on A. I see the map method process the input file
line by line, but the lines are all processed on A; ideally, I would expect
that processing to take place on B.

Can anyone point out where I am going wrong?


regards
rab