RE: When to use DFSInputStream and HdfsDataInputStream

2013-09-30 Thread Uma Maheswara Rao G
Hi Rob, DFSInputStream: the InterfaceAudience for this class is Private, so you should not use it directly. It mainly implements the actual core read functionality and is a DFS-specific implementation only. HdfsDataInputStream: the InterfaceAudience for this class is Public and you
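
A minimal sketch of the public-API route, assuming a Hadoop 2.x client and a hypothetical file path; the cast to HdfsDataInputStream (org.apache.hadoop.hdfs.client) only succeeds when the underlying filesystem is HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataInputStream;

public class ReadThroughPublicApi {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Open via the public FileSystem API; the private DFSInputStream is wrapped inside.
        FSDataInputStream in = fs.open(new Path("/tmp/example.txt")); // hypothetical path
        if (in instanceof HdfsDataInputStream) {
            // HDFS-specific extras live on the public wrapper, e.g. the visible length.
            HdfsDataInputStream hin = (HdfsDataInputStream) in;
            System.out.println("visible length = " + hin.getVisibleLength());
        }
        in.close();
        fs.close();
    }
}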

Question on BytesWritable

2013-09-30 Thread Chandra Mohan, Ananda Vel Murugan
Hi, I am using Hadoop 1.0.2 and have written a MapReduce job. I have a requirement to process the whole file without splitting, so I have written a new input format that processes each file as a whole by overriding the isSplitable() method. I have also created a new RecordReader implementation to
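
A minimal sketch of such an input format, assuming the new mapreduce API (also present in Hadoop 1.0.2); the class name is hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Returning false forces one split per file regardless of block size,
        // so a single mapper sees the entire file.
        return false;
    }
}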

When to use DFSInputStream and HdfsDataInputStream

2013-09-30 Thread Rob Blah
Hi, what is the use-case difference between: - DFSInputStream and HdfsDataInputStream - DFSOutputStream and HdfsDataOutputStream When should one be preferred over the other? From the sources I see they have similar functionality, only HdfsData*Stream "follows" Data*Stream instead of *Stream. Also is DFS*S

RE: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

2013-09-30 Thread java8964 java8964
I am also thinking about this for my current project, so here I share some of my thoughts, though maybe some of them are not correct. 1) In my previous projects years ago, we stored a lot of data as plain text, since at that time people thought big data could store all the data, no need to worry about

RE: All datanodes are bad IOException when trying to implement multithreading serialization

2013-09-30 Thread java8964 java8964
I don't know exactly what you are trying to do, but it seems like memory is your bottleneck, and you think you have enough CPU resource, so you want to use multiple threads to utilize the CPU? You can start multiple threads in your mapper if you think your mapper logic is very CPU-intensive
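
An alternative to hand-rolled threads is the MultithreadedMapper that ships with Hadoop; a minimal sketch assuming the new mapreduce API, with a hypothetical CPU-heavy mapper and an illustrative thread count:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedJobDriver {

    // Hypothetical CPU-heavy mapper; the expensive per-record work goes in map().
    public static class MyCpuHeavyMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(value, new LongWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cpu-heavy-job");
        job.setJarByClass(MultithreadedJobDriver.class);
        // MultithreadedMapper runs several copies of the delegate mapper inside one
        // task JVM, sharing a single RecordReader, so CPU can be saturated without
        // raising the per-node map slot count.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, MyCpuHeavyMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // input/output paths, reducer, etc. omitted for brevity
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}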

Re: Cluster config: Mapper:Reducer Task Capacity

2013-09-30 Thread Sandy Ryza
Hi Himanshu, Changing the ratio is definitely a reasonable thing to do. The capacities come from the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum tasktracker configurations. You can tweak these on your nodes to get your desired ratio. -Sandy On Mon, Sep 30,
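
A sketch of where those knobs live, assuming classic MRv1 tasktrackers; the values are illustrative only, and each tasktracker must be restarted to pick them up. In mapred-site.xml on every node:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>12</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>6</value>
</property>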

Cluster config: Mapper:Reducer Task Capacity

2013-09-30 Thread Himanshu Vijay
Hi, our Hadoop cluster is running 0.20.203. The cluster currently has a 'Map Task Capacity' of 8900+ and a 'Reduce Task Capacity' of 3300+, giving a ratio of about 2.7. We have a wide variety of jobs running and we want to increase the throughput. My manual observation was that we hit the mapper capacity

Re: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

2013-09-30 Thread Rahul Bhattacharjee
Sequence files are language-neutral, like Avro. Yes, but I am not sure about support in other-language libraries for processing seq files. Thanks, Rahul On Mon, Sep 30, 2013 at 11:10 PM, Peyman Mohajerian wrote: > It is not recommended to keep the data at rest in sequence file format, > because it is Java

Hadoop Fault Injection example

2013-09-30 Thread Felipe Gutierrez
Is there a build.xml available to use fault injection with Hadoop as this tutorial says? http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/FaultInjectFramework.html#Aspect_Example I cannot find the jar file for org.apache.hadoop.fi.ProbabilityModel and org.apache.hadoop.hdfs.se

Re: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

2013-09-30 Thread Peyman Mohajerian
It is not recommended to keep data at rest in sequence file format, because it is Java-specific and you cannot easily share it with non-Java systems; it is, however, ideal for running map/reduce jobs. One approach would be to bring all the data of different formats into HDFS as is and then convert them to
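
The Java tie-in comes from the fact that sequence file keys and values are Hadoop Writable objects; a minimal write sketch, with a hypothetical output path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/demo.seq"); // hypothetical path
        // Keys and values are Writable types, which is what ties the format to Java.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, IntWritable.class, Text.class);
        try {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            writer.close();
        }
    }
}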

Re: Can container requests be made in parallel from multiple threads

2013-09-30 Thread Krishna Kishore Bonagiri
Hi Omkar, I have a distributed application that I am trying to port to YARN. My application does many things in multiple threads in parallel, and those threads in turn run some executables (how many of them depends on some business logic and is variable). Now I am trying to launch those executabl
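
One conservative pattern for this, assuming the Hadoop 2.x AMRMClient API: since it is not obvious whether the client may be driven from many threads at once, the business-logic threads only enqueue requests and a single dedicated thread forwards them to the client (class and method names below are hypothetical):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class RequestDispatcher implements Runnable {
    private final AMRMClient<ContainerRequest> amrmClient;
    private final BlockingQueue<ContainerRequest> queue =
            new LinkedBlockingQueue<ContainerRequest>();

    public RequestDispatcher(AMRMClient<ContainerRequest> amrmClient) {
        this.amrmClient = amrmClient;
    }

    // Called from any worker thread that decides another executable is needed.
    public void submit(ContainerRequest request) {
        queue.add(request);
    }

    // Runs on one dedicated thread, the only caller of the AM-RM client.
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                amrmClient.addContainerRequest(queue.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}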

Add machine with bigger storage to cluster

2013-09-30 Thread Amit Sela
I would like to add new machines to my existing cluster, but they won't be similar to the current nodes. There are two scenarios I'm thinking of: 1. What are the implications (besides initial load balancing) of adding a new node to the cluster, if this node runs on a machine similar to all other nodes

Re: Retrieve and compute input splits

2013-09-30 Thread Sai Sai
Thanks for your suggestions and replies. I am still confused about this: "To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6)." My question: does the input split in the above statement refer to the p
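
For what it's worth, a split is only a logical descriptor of a byte range, not the data itself; a small sketch with hypothetical paths and hosts, assuming the new mapreduce API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitIsMetadata {
    public static void main(String[] args) {
        // A split carries no file contents -- just which file, which byte range,
        // and which hosts hold that range locally. The map task later opens the
        // file and reads that range through a RecordReader.
        FileSplit split = new FileSplit(
                new Path("/data/input/part-00000"),   // hypothetical file
                0L,                                    // start offset in bytes
                64L * 1024 * 1024,                     // split length in bytes
                new String[] {"node-a", "node-b"});    // preferred (data-local) hosts
        System.out.println(split);
    }
}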

Re: unable to restart namenode on hadoop 1.0.4

2013-09-30 Thread Ravi Shetye
I do not think these are the same issue; please correct me if I am wrong. The SO link is about the SNN being unable to establish communication with the NN. In my case I am unable to launch the NN itself. The NPE is at the highlighted line, but I am not sure how to go about resolving it. /** Add a node child to

Re: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

2013-09-30 Thread Raj K Singh
For processing XML files, Hadoop comes with a class for this purpose called StreamXmlRecordReader. You can use it by setting your input format to StreamInputFormat and setting the stream.recordreader.class property to org.apache.hadoop.streaming.StreamXmlRecordReader. For JSON files, an open-source
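
A minimal sketch of that setup through the old mapred API (it needs the hadoop-streaming jar on the classpath); the <record> begin/end tags are illustrative:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlJobSetup {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        // Hand whole <record>...</record> elements to each map call instead of single lines.
        conf.setInputFormat(StreamInputFormat.class);
        conf.set("stream.recordreader.class",
                 "org.apache.hadoop.streaming.StreamXmlRecordReader");
        conf.set("stream.recordreader.begin", "<record>");   // opening tag of one record
        conf.set("stream.recordreader.end", "</record>");    // closing tag of one record
        return conf;
    }
}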

NullPointerException when start datanode

2013-09-30 Thread lei liu
I use CDH 4.3.1. When I start the datanode, I get the error below: 2013-09-26 17:57:07,803 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 0.0.0.0:40075 2013-09-26 17:57:07,814 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dfs.webhdfs.enabled = false 2013-09-26 17:

Re: unable to restart namenode on hadoop 1.0.4

2013-09-30 Thread Manoj Sah
Hi, try this link: http://stackoverflow.com/questions/5490805/hadoop-nullpointerexcep Thanks, Manoj On Mon, Sep 30, 2013 at 1:03 PM, Ravi Shetye wrote: > Can someone please help me with how to go about debugging the issue? The > NN log has the following error stack > > 2013-09-30 07:28:42,768 I

cmsg cancel

2013-09-30 Thread Fabian Zimmermann
sorry, just trying to cancel my mail

File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

2013-09-30 Thread Wolfgang Wyremba
Hello, the file format topic is still confusing me and I would appreciate it if you could share your thoughts and experience with me. From reading different books/articles/websites I understand that - Sequence files (used frequently, but not only, for binary data), - AVRO, - RC (was developed to work

unable to restart namenode on hadoop 1.0.4

2013-09-30 Thread Ravi Shetye
Can someone please help me with how to go about debugging the issue? The NN log has the following error stack: 2013-09-30 07:28:42,768 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started 2013-09-30 07:28:42,967 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAd