HDFS metrics

2013-06-12 Thread Pedro Sá da Costa
I am using Yarn, and

1 - I want to know the average I/O throughput of HDFS (e.g., how fast the
datanodes are writing to disk) so that I can compare between 2 HDFS
instances. The command hdfs dfsadmin -report doesn't give me that. Does
HDFS have a command for that?

2 - And is there a similar way to know how fast data is being transferred
between the map and reduce phases?
-- 
Best regards,


Re: HDFS metrics

2013-06-12 Thread Bhasker Allene

http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
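
The TestDFSIO job described on that page runs many mappers in parallel and
reports aggregate read/write throughput, which is the usual way to compare two
clusters. For a rough, single-client spot check, a sketch like the following
(plain FileSystem API; the HDFS URI, path, and sizes are made-up values) times
a write and prints MB/s:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SimpleWriteThroughput {
    public static void main(String[] args) throws Exception {
        // args[0] = HDFS URI of the cluster to test, e.g. hdfs://namenode:8020 (placeholder)
        FileSystem fs = FileSystem.get(URI.create(args[0]), new Configuration());
        Path testFile = new Path("/tmp/throughput-test.dat");
        byte[] buffer = new byte[1024 * 1024];      // 1 MB of zeros
        long totalBytes = 256L * buffer.length;     // write 256 MB in total

        long start = System.currentTimeMillis();
        FSDataOutputStream out = fs.create(testFile, true);
        for (long written = 0; written < totalBytes; written += buffer.length) {
            out.write(buffer);
        }
        out.close();                                // include flush/close in the timing
        long millis = System.currentTimeMillis() - start;

        double mbPerSec = (totalBytes / (1024.0 * 1024.0)) / (millis / 1000.0);
        System.out.println("Wrote " + totalBytes + " bytes in " + millis
                + " ms = " + mbPerSec + " MB/s");
        fs.delete(testFile, false);
    }
}

Running the same check against both clusters gives a comparable number, but a
single stream will not saturate the datanodes the way the MapReduce-based
TestDFSIO benchmark does.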

On 12/06/2013 09:49, Pedro Sá da Costa wrote:

I am using Yarn, and

1 - I want to know the average I/O throughput of HDFS (e.g., how fast
the datanodes are writing to disk) so that I can compare between 2
HDFS instances. The command hdfs dfsadmin -report doesn't give me
that. Does HDFS have a command for that?


2 - And is there a similar way to know how fast data is being
transferred between the map and reduce phases?

--
Best regards,


--
Thanks & Regards,
Bhasker Allene



Get the history info in Yarn

2013-06-12 Thread Pedro Sá da Costa
I tried the command mapred job -list all to get the history of completed
jobs, but the output doesn't include the time when a job started and ended,
the number of maps and reduces, or the size of the data read and written.
Can I get this info with a shell command?

I am using Yarn.

-- 
Best regards,


RE: Get the history info in Yarn

2013-06-12 Thread Devaraj K
Hi,

 

You can get all the details for a job using this mapred command:

mapred job -status <Job-ID>

For this you need to have the Job History Server running, and the same job
history server address configured on the client side.
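
If you want the same details programmatically rather than by parsing the shell
output, a sketch along these lines (MRv2 client API; the job ID below is a
made-up placeholder, and it assumes the client configuration points at the
history server, e.g. via mapreduce.jobhistory.address) prints the start/finish
times and the counters, which include launched maps/reduces and HDFS bytes
read/written:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.JobStatus;

public class JobHistoryInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // must see the history server address
        Cluster cluster = new Cluster(conf);

        // Placeholder job id -- use one listed by "mapred job -list all".
        // getJob() returns null if the id is unknown.
        Job job = cluster.getJob(JobID.forName("job_1370000000000_0001"));
        JobStatus status = job.getStatus();

        System.out.println("Started : " + status.getStartTime());
        System.out.println("Finished: " + status.getFinishTime());

        // Launched maps/reduces and bytes read/written show up as counters.
        for (CounterGroup group : job.getCounters()) {
            for (Counter counter : group) {
                System.out.println(group.getDisplayName() + " / "
                        + counter.getDisplayName() + " = " + counter.getValue());
            }
        }
    }
}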

 

 

Thanks & Regards

Devaraj K

 

From: Pedro Sá da Costa [mailto:psdc1...@gmail.com] 
Sent: Thursday, June 13, 2013 10:52 AM
To: mapreduce-user
Subject: Get the history info in Yarn

 

I tried the command mapred job -list all to get the history of completed jobs,
but the output doesn't include the time when a job started and ended, the
number of maps and reduces, or the size of the data read and written. Can I get
this info with a shell command?

I am using Yarn.



-- 
Best regards,



Task Tracker going down on hive cluster

2013-06-12 Thread Ravi Shetye
In the last 4-5 days, the task tracker on one of my slave machines has gone
down a couple of times. It had been working fine for the past 4-5 months.

The cluster configuration is
4 machine cluster on AWS
1 m2.xlarge master
3 m2.xlarge slaves

The cluster is dedicated to run hive queries, with the data residing on s3.

the slave on which the task tracker went down had the following log

***
2013-06-11 00:26:30,968 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060,
dest: 10.190.***.***:60659, bytes: 38, op: MAPRED_SHUFFLE, cliID:
attempt_201306071409_0151_m_005693_0, duration: 279198
2013-06-11 00:26:30,971 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060,
dest: 10.191.**.***:37605, bytes: 38, op: MAPRED_SHUFFLE, cliID:
attempt_201306071409_0151_m_005700_0, duration: 193135
2013-06-11 00:26:30,971 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060,
dest: 10.190.***.***:60630, bytes: 6, op: MAPRED_SHUFFLE, cliID:
attempt_201306071409_0151_m_005700_0, duration: 192011
2013-06-11 00:26:30,972 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060,
dest: 10.190.***.***:60656, bytes: 6, op: MAPRED_SHUFFLE, cliID:
attempt_201306071409_0151_m_005693_0, duration: 178209
2013-06-11 00:26:30,973 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060,
dest: 10.8.***.**:45321, bytes: 6, op: MAPRED_SHUFFLE, cliID:
attempt_201306071409_0151_m_005694_0, duration: 186452
2013-06-11 00:26:30,973 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060,
dest: 10.190.***.***:60659, bytes: 6, op: MAPRED_SHUFFLE, cliID:
attempt_201306071409_0151_m_005694_0, duration: 157360
2013-06-11 00:26:30,974 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060,
dest: 10.8.***.**:45321, bytes: 38, op: MAPRED_SHUFFLE, cliID:
attempt_201306071409_0151_m_005700_0, duration: 157555
2013-06-11 00:26:30,991 INFO org.apache.hadoop.mapred.JvmManager: JVM Not
killed jvm_201306071409_0151_m_-435659475 but just removed
2013-06-11 00:26:30,991 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201306071409_0151_m_-435659475 exited with exit code 0. Number of tasks
it ran: 0
2013-06-11 00:26:30,991 ERROR org.apache.hadoop.mapred.JvmManager: Caught
Throwable in JVMRunner. Aborting TaskTracker.
org.apache.hadoop.fs.FSError: java.io.IOException: Broken pipe
at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:200)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:220)
at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:315)
at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:148)
at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
at java.io.BufferedWriter.close(BufferedWriter.java:265)
at java.io.PrintWriter.close(PrintWriter.java:312)
at
org.apache.hadoop.mapred.TaskController.writeCommand(TaskController.java:231)
at
org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:126)
at
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:497)
at
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:471)
Caused by: java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:297)
at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:198)
... 13 more
2013-06-11 00:26:31,007 INFO org.apache.hadoop.mapred.JvmManager: In
JvmRunner constructed JVM ID: jvm_201306071409_0151_m_-495709221
2013-06-11 00:26:31,008 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060,
dest: 10.190.***.***:60656, bytes: 6, op: MAPRED_SHUFFLE, cliID:
attempt_201306071409_0151_m_005694_0, duration: 222430
2013-06-11 00:26:31,008 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060,
dest: 10.190.***.***:60653, bytes: 38, op: MAPRED_SHUFFLE, cliID:
attempt_201306071409_0151_m_005693_0, duration: 154027
2013-06-11 00:26:31,008 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060,
dest: 10.190.***.***:60659, bytes: 6, op: MAPRED_SHUFFLE, cliID:
attempt_201306071409_0151_m_005700_0, duration: 132067
2013-06-11 00:26:31,326 INFO org.apache.hadoop.mapred.JvmManager: JVM
Runner jvm_201306071409_0151_m_-495709221 spawned.
2013-06-11 00:26:31,328 INFO org.apache.hadoop.mapred.TaskController:
Writing commands to
/mnt/app/hadoop-tmp/ttprivate/taskTracker/piyushv/jobcache/job_201306071409_0151/attempt_201306071409_0151_m_005717_0/taskjvm.sh
2013-06-11 00:26:31,331 INFO

Re: Container allocation on the same node

2013-06-12 Thread Krishna Kishore Bonagiri
Hi Harsh,

   What will happen when I specify local host as the required host? Doesn't
the resource manager give me all the containers on the local host?  I don't
want to constrain myself to the local host, which might be busy while other
nodes in the cluster have enough resources available for me.

Thanks,
Kishore


On Wed, Jun 12, 2013 at 6:45 PM, Harsh J ha...@cloudera.com wrote:

 You can request containers with the local host name as the required
 host, and perhaps reject and re-request if they aren't designated to
 be on that one until you have sufficient. This may take a while
 though.

 On Wed, Jun 12, 2013 at 6:25 PM, Krishna Kishore Bonagiri
 write2kish...@gmail.com wrote:
  Hi,
 
    I want to get some containers for my application on the same node. Is
  there a way to make such a request?
 
    For example, I have an application which needs 10 containers, but with a
  constraint that a set of those containers needs to be running on the same
  node. Can I ask my resource manager to give me, let us say, 5 containers
  on the same node?
 
    I know that there is now a way to specify the node name on which I need
  a container, but I don't care which node in the cluster they get allocated
  on; I just need them on the same node.
 
    Please let me know if this is possible, and how I can do that.
 
  Thanks,
  Kishore



 --
 Harsh J
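
For reference, a rough sketch of what Harsh describes, using the AMRMClient
helper (the node name, memory, and priority are made-up values; the exact
constructors, and whether locality relaxation can be disabled, vary across
2.x releases):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SameNodeRequests {
    public static void requestOnOneNode(String nodeName, int howMany) {
        AMRMClient<ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
        amrmClient.init(new YarnConfiguration());
        amrmClient.start();

        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(1024);          // 1 GB per container (made-up value)
        capability.setVirtualCores(1);
        Priority priority = Records.newRecord(Priority.class);
        priority.setPriority(0);

        for (int i = 0; i < howMany; i++) {
            // Ask for this specific node; relaxLocality=false tells the RM not
            // to fall back to other nodes or racks.
            ContainerRequest request = new ContainerRequest(
                    capability, new String[] { nodeName }, null /* racks */,
                    priority, false /* relaxLocality */);
            amrmClient.addContainerRequest(request);
        }
        // allocate() is then called in the AM heartbeat loop as usual; any
        // container that still comes back on the wrong node can be released
        // and re-requested, as suggested above.
    }
}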



Re: Management API

2013-06-12 Thread MARCOS MEDRADO RUBINELLI
Rita,

There aren't any specs as far as I know, but in my experience the interface is 
stable enough from version to version, with the occasional extra field added 
here or there. If you query specifically for the beans you want (e.g. 
http://namenode:50070/jmx?get=Hadoop:service=NameNode,name=NameNodeInfo::LiveNodes
 ) and build in some flexibility, you shouldn't have any problems.


Regards,
Marcos
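
As an illustration, a bare-bones poller for that kind of bean (plain JDK HTTP;
the host name is a placeholder, and JSON parsing is left to whatever library
you already use):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class JmxPoll {
    public static void main(String[] args) throws Exception {
        // Placeholder host; use your namenode's HTTP address.
        URL url = new URL("http://namenode:50070/jmx"
                + "?qry=Hadoop:service=NameNode,name=NameNodeInfo");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            json.append(line).append('\n');
        }
        reader.close();

        // The response is a JSON document with a "beans" array; feed it to any
        // JSON parser and pick out the attributes (e.g. LiveNodes) you care about.
        System.out.println(json);
    }
}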

On 09-06-2013 11:30, Rita wrote:
Are there any specs for the JSON schema?


On Thu, Jun 6, 2013 at 9:49 AM, MARCOS MEDRADO RUBINELLI
marc...@buscapecompany.com wrote:
Brian,

If you have access to the web UI, you can get those metrics in JSON from the 
JMXJsonServlet. Try hitting http://namenode_hostname:50070/jmx?qry=Hadoop:* and 
http://jobtracker_v1_hostname:50030/jmx?qry=hadoop:*

It isn't as extensive as other options, but if you just need a snapshot of node 
capacity and utilization, it's pretty handy. I used it to plug some basic 
warnings into Nagios.

Regards,
Marcos


On 06-06-2013 09:51, Brian Mason wrote:
I am looking for a way to access a list of nodes (compute, data, etc.). My 
application is not running on the name node; it is remote. The 2.0 YARN APIs 
look like they may be useful, but I am not on 2.0 and cannot move to 2.0 
anytime soon.

DFSClient.java looks useful, but it's not in the API docs, so I am not sure how 
to use it or even if I should.
Any pointers would be helpful.

Thanks,




--
--- Get your facts first, then you can distort them as you please.--



Re: Now give .gz file as input to the MAP

2013-06-12 Thread Sanjay Subramanian
Rahul-da

I found bz2 pretty slow (although splittable) so I switched to snappy (only 
sequence files are splittable but compress-decompress is fast)

Thanks
Sanjay

From: Rahul Bhattacharjee rahul.rec@gmail.com
Reply-To: user@hadoop.apache.org
Date: Tuesday, June 11, 2013 9:53 PM
To: user@hadoop.apache.org
Subject: Re: Now give .gz file as input to the MAP

Nothing special is required to process .gz files using MR. However, as Sanjay 
mentioned, verify the codecs configured in core-site; another thing to note is 
that these files are not splittable.

You might want to use bz2; those are splittable.

Thanks,
Rahul


On Wed, Jun 12, 2013 at 10:14 AM, Sanjay Subramanian
sanjay.subraman...@wizecommerce.com wrote:

hadoopConf.set("mapreduce.job.inputformat.class",
    "com.wizecommerce.utils.mapred.TextInputFormat");

hadoopConf.set("mapreduce.job.outputformat.class",
    "com.wizecommerce.utils.mapred.TextOutputFormat");

No special settings are required for reading Gzip except those above.

If you want to output Gzip:


hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");

hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec",
    "org.apache.hadoop.io.compress.GzipCodec");


Make sure the Gzip codec is defined in core-site.xml:

<!-- core-site.xml -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
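
As a side note, the same output-compression settings can be applied through the
Job API instead of raw conf.set() calls; a small sketch, assuming the new
org.apache.hadoop.mapreduce API used elsewhere in this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipOutputConfig {
    public static Job newGzipOutputJob() throws Exception {
        Job job = Job.getInstance(new Configuration(), "gzip-output-example");
        // Equivalent to setting mapreduce.output.fileoutputformat.compress=true
        FileOutputFormat.setCompressOutput(job, true);
        // Equivalent to setting ...compress.codec to GzipCodec
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }
}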

I have a question

Why are you using GZIP as input to the map? These files are not splittable… 
unless you have to read multiple lines (like lines between a BEGIN and END 
block in a log file) and send them as one record to the mapper.

Also, if the input is non-splittable anyway, the Snappy codec is better.

Good Luck


sanjay

From: samir das mohapatra samir.help...@gmail.com
Reply-To: user@hadoop.apache.org
Date: Tuesday, June 11, 2013 9:07 PM
To: cdh-u...@cloudera.com, user@hadoop.apache.org, user-h...@hadoop.apache.org
Subject: Now give .gz file as input to the MAP

Hi All,
Has anyone worked on how to pass a .gz file as the input for a
mapreduce job?

Regards,
samir.



Re: Now give .gz file as input to the MAP

2013-06-12 Thread Rahul Bhattacharjee
Yeah, I too found it quite slow and memory hungry!

Thanks,
Rahul-da


On Wed, Jun 12, 2013 at 11:13 PM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.com wrote:

  Rahul-da

  I found bz2 pretty slow (although splittable) so I switched to snappy
 (only sequence files are splittable but compress-decompress is fast)

  Thanks
 Sanjay




RE: Shuffle design: optimization tradeoffs

2013-06-12 Thread John Lilley
In reading this link as well as the sailfish report, it strikes me that Hadoop 
skipped a potentially significant optimization.  Namely, why are multiple 
sorted spill files merged into a single output file?  Why not have the 
auxiliary service merge on the fly, thus avoiding landing them to disk?  Was 
this considered and rejected due to placing memory/CPU requirements on the 
auxiliary service?  I am assuming that whether the merge was done on disk or in 
a stream, it would require decompression/recompression of the data.
John


-Original Message-
From: Albert Chu [mailto:ch...@llnl.gov] 
Sent: Tuesday, June 11, 2013 3:32 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle design: optimization tradeoffs

On Tue, 2013-06-11 at 16:00 +, John Lilley wrote:
 I am curious about the tradeoffs that drove design of the 
 partition/sort/shuffle (Elephant book p 208).  Doubtless this has been 
 tuned and measured and retuned, but I’d like to know what observations 
 came about during the iterative optimization process to drive the 
 final design.  For example:
 
 ·Why does the mapper output create a single ordered file
 containing all partitions, as opposed to a file per group of 
 partitions (which would seem to lend itself better to multi-core 
 scaling), or even a file per partition?

I researched this awhile back wondering the same thing, and found this JIRA

https://issues.apache.org/jira/browse/HADOOP-331

Al

 ·Why does the max number of streams to merge at once
 (io.sort.factor) default to 10?  Is this obsolete?  In my experience, 
 so long as you have memory to buffer each input at 1MB or so, the 
 merger is more efficient as a single phase.
 
 ·Why does the mapper do a final merge of the spill files do
 disk, instead of having the auxiliary process (in YARN) merge and 
 stream data on the fly?
 
 ·Why do mappers sort the tuples, as opposed to only
 partitioning them and letting the reducers do the sorting?
 
 Sorry if this is overly academic, but I’m sure a lot of people put a 
 lot of time into the tuning effort, and I hope they left a record of 
 their efforts.
 
 Thanks
 
 John
 
  
 
 
--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory




Aggregating data nested into JSON documents

2013-06-12 Thread Tecno Brain
Hello,
   I'm new to Hadoop.
   I have a large quantity of JSON documents with a structure similar to
what is shown below.

   {
     "g"    : "some-group-identifier",
     "sg"   : "some-subgroup-identifier",
     "j"    : "some-job-identifier",
     "page" : 23,
     ... // other fields omitted
     "important-data" : [
         {
           "f1" : "abc",
           "f2" : "a",
           "f3" : "/",
           ...
         },
         ...
         {
           "f1" : "xyz",
           "f2" : "q",
           "f3" : "/",
           ...
         }
     ],
     ... // other fields omitted
     "other-important-data" : [
         {
           "x1"  : "ford",
           "x2"  : "green",
           "x3"  : 35,
           "map" : {
               "free-field"       : "value",
               "other-free-field" : "value2"
           }
         },
         ...
         {
           "x1" : "vw",
           "x2" : "red",
           "x3" : 54,
           ...
         }
     ]
   }


Each file contains a single JSON document (gzip compressed, and roughly
about 200KB uncompressed of pretty-printed json text per document)

I am interested in analyzing only the  important-data array and the
other-important-data array.
My source data would ideally be easier to analyze if it looked like a
couple of tables with a fixed set of columns. Only the column map would
be a complex column, all others would be primitives.

( g, sg, j, page, f1, f2, f3 )

( g, sg, j, page, x1, x2, x3, map )

So, for each JSON document, I would like to create several rows, but I
would like to avoid the intermediate step of persisting -and duplicating-
the flattened data.

In order to avoid persisting the data flattened, I thought I had to write
my own map-reduce in Java code, but discovered that others have had the
same problem of using JSON as the source and there are somewhat standard
solutions.

By reading about the SerDe approach for Hive, I get the impression that each
JSON document is transformed into a single row of the table, with some
columns being an array, a map, or other nested structures.
a) Is there a way to break each JSON document into several rows for a
Hive external table?
b) It seems there are too many JSON SerDe libraries! Is any of them
considered the de-facto standard?

The Pig approach using Elephant Bird also seems promising. Does anybody have
pointers to more user documentation on this project? Or is browsing through
the examples on GitHub my only source?

Thanks
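
For what it's worth, if you do end up flattening the documents yourself (in a
mapper or in a pre-processing step), the per-document part is small. A sketch
of the important-data half, assuming Jackson 2 on the classpath and
tab-separated output, with field names simply mirroring the example above:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class FlattenImportantData {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Turns one JSON document into one output line per element of "important-data":
    // g \t sg \t j \t page \t f1 \t f2 \t f3
    public static void flatten(String jsonDocument, StringBuilder out)
            throws Exception {
        JsonNode root = MAPPER.readTree(jsonDocument);
        String prefix = root.path("g").asText() + "\t"
                + root.path("sg").asText() + "\t"
                + root.path("j").asText() + "\t"
                + root.path("page").asText();

        for (JsonNode item : root.path("important-data")) {
            out.append(prefix).append('\t')
               .append(item.path("f1").asText()).append('\t')
               .append(item.path("f2").asText()).append('\t')
               .append(item.path("f3").asText()).append('\n');
        }
    }
}

A Hive SerDe combined with LATERAL VIEW/explode, or an Elephant Bird loader in
Pig, performs an equivalent step without materializing the flattened rows.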


Install CDH4 using tar ball with MRv1, Not YARN version

2013-06-12 Thread selva
Hi folks,

I am trying to install CDH4 from tarballs with MRv1, not the YARN
version (MRv2).

I downloaded two tarballs (mr1-0.20.2+n and hadoop-2.0.0+n) from this
location http://archive.cloudera.com/cdh4/cdh/4/

As per the Cloudera instructions, I found:

 If you install CDH4 from a tarball, you will install YARN. To install
MRv1 as well, install the separate MRv1 tarball (mr1-0.20.2+n) alongside
the YARN one (hadoop-2.0.0+n).
(@ bottom of
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_2.html
)

But I could not find steps to install using these two tarballs, since
Cloudera tailored the instructions to the package installation.

I am totally confused about whether to start dfs from the hadoop-2.0.0+n
version and mapred from mr1-0.20.2+n, or something else.

Kindly help me on setting up.

Thanks
Selva


recovery accidently deleted pig script

2013-06-12 Thread feng jiang
Hi everyone,

We have a pig script scheduled to run every 4 hours. Someone accidentally
deleted the pig script (rm). Is there any way to recover the script?

I am guessing Hadoop copies the program to every node before running, so
perhaps a copy still exists on one of the nodes.


Best regards,
Feng Jiang


Re: recovery accidently deleted pig script

2013-06-12 Thread Michael Segel
Where was the pig script? On HDFS? 

How often does your cluster clean up the trash? 

(Deleted files don't get cleaned up immediately when they are deleted...) It's a 
configurable setting, so YMMV.
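
If the script lived on HDFS and trash is enabled (fs.trash.interval > 0),
listing the trash directory is the first thing to try; a minimal sketch,
assuming the default per-user trash layout:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListTrash {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Default layout: /user/<username>/.Trash/Current/<original path>
        // (listStatus throws FileNotFoundException if there is no trash dir)
        Path trash = new Path(fs.getHomeDirectory(), ".Trash");
        for (FileStatus status : fs.listStatus(trash)) {
            System.out.println(status.getPath());
        }
    }
}

The same check is a hadoop fs -ls of /user/<username>/.Trash. Note that a plain
rm of a local (non-HDFS) copy bypasses the HDFS trash entirely.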

On Jun 12, 2013, at 8:58 PM, feng jiang jiangfut...@gmail.com wrote:

 Hi everyone,
 
 We have a pig script scheduled to run every 4 hours. Someone accidentally 
 deleted the pig script (rm). Is there any way to recover the script?
 
 I am guessing Hadoop copies the program to every node before running, so 
 perhaps a copy still exists on one of the nodes.
 
 
 Best regards,
 Feng Jiang



Re: SSD support in HDFS

2013-06-12 Thread Michael Segel
I could have sworn there was a thread on this already. (Maybe the HBase list?) 

Andrew P. kinda nailed it when he talked about the fact that you had to write 
the replication(s). 

If you wanted improved performance, why not look at the hybrid drives that have 
a small SSD buffer and a spinning disk? 

I don't know but it may be what you're looking for. 

HTH

-Mike

On Jun 12, 2013, at 5:18 AM, Lucas Stanley lucas23...@gmail.com wrote:

 Thanks Chris and Phil.
 
 
 On Tue, Jun 11, 2013 at 1:31 PM, Chris Nauroth cnaur...@hortonworks.com 
 wrote:
 Hi Lucas,
 
 HDFS does not have this capability right now, but there has been some 
 preliminary discussion around adding features to support it.  You might want 
 to follow jira issues HDFS-2832 and HDFS-4672 if you'd like to receive 
 notifications about the discussion.
 
 https://issues.apache.org/jira/browse/HDFS-2832
 https://issues.apache.org/jira/browse/HDFS-4672
 
 Chris Nauroth
 Hortonworks
 http://hortonworks.com/
 
 
 
 On Mon, Jun 10, 2013 at 6:57 PM, Lucas Stanley lucas23...@gmail.com wrote:
 Hi, 
 
 Is it possible to tell Apache HDFS to store some files on SSD and the rest of 
 the files on spinning disks?
 
 So if each of my nodes has 1 SSD and 5 spinning disks, can I configure a 
 directory in HDFS to put all files in that dir on the SSD?
 
 I think Intel's Hadoop distribution is working on some SSD support right?
 
 



Compatibility of Hadoop 0.20.x and hadoop 1.0.3

2013-06-12 Thread Lin Yang
Hi, all,

I was wondering: could an application written with the Hadoop 0.20.3 API run on
a Hadoop 1.0.3 cluster?

If not, is there any way to run this application on Hadoop 1.0.3 instead of
rewriting all the code?

-- 
Lin Yang


Reducer not getting called

2013-06-12 Thread Omkar Joshi
Hi,

I have a SequenceFile which contains several jpeg images with (image name, 
image bytes) as key-value pairs. My objective is to count the no. of images by 
grouping them by the source, something like this :

Nikon Coolpix  100
Sony Cybershot 251
N82 100


The MR code is :

package com.hadoop.basics;

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.drew.imaging.ImageMetadataReader;
import com.drew.imaging.ImageProcessingException;
import com.drew.metadata.Directory;
import com.drew.metadata.Metadata;
import com.drew.metadata.exif.ExifIFD0Directory;

public class ImageSummary extends Configured implements Tool {

    public static class ImageSourceMapper extends
            Mapper<Text, BytesWritable, Text, IntWritable> {

        private static int tagId = 272;
        private static final IntWritable one = new IntWritable(1);

        public void map(Text imageName, BytesWritable imageBytes,
                Context context) throws IOException, InterruptedException {
            // TODO Auto-generated method stub

            System.out.println("In the map method, image is "
                    + imageName.toString());

            byte[] imageInBytes = imageBytes.getBytes();
            ByteArrayInputStream bais = new ByteArrayInputStream(imageInBytes);
            BufferedInputStream bis = new BufferedInputStream(bais);

            Metadata imageMD = null;

            try {
                imageMD = ImageMetadataReader.readMetadata(bis, true);
            } catch (ImageProcessingException e) {
                // TODO Auto-generated catch block
                System.out.println("Got an ImageProcessingException !");
                e.printStackTrace();
            }

            Directory exifIFD0Directory = imageMD
                    .getDirectory(ExifIFD0Directory.class);

            String imageSource = exifIFD0Directory.getString(tagId);

            System.out.println(imageName.toString() + " is taken using "
                    + imageSource);

            context.write(new Text(imageSource), one);

            System.out.println("Returning from the map method");
        }
    }

    public static class ImageSourceReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text imageSource, Iterator<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            // TODO Auto-generated method stub

            System.out.println("In the reduce method");

            int finalCount = 0;

            while (counts.hasNext()) {
                finalCount += counts.next().get();
            }

            context.write(imageSource, new IntWritable(finalCount));

            System.out.println("Returning from the reduce method");
        }

    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new ImageSummary(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        // TODO Auto-generated method stub

        System.out.println("In ImageSummary.run(...)");

        Configuration configuration = getConf();


Re: Reducer not getting called

2013-06-12 Thread Harsh J
You're not using the recommended @Override annotations, and are
hitting a classic programming mistake. Your issue is same as this
earlier discussion: http://search-hadoop.com/m/gqA3rAaVQ7 (and the
ones before it).
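
Concretely: the reduce() that actually overrides Reducer.reduce() takes an
Iterable, not an Iterator. With a signature like the following in
ImageSourceReducer (plus @Override, so the compiler flags the mismatch), the
reducer runs instead of the default identity reduce; this is a drop-in
replacement for the reduce method shown above:

@Override
public void reduce(Text imageSource, Iterable<IntWritable> counts,
        Context context) throws IOException, InterruptedException {
    int finalCount = 0;
    for (IntWritable count : counts) {
        finalCount += count.get();
    }
    context.write(imageSource, new IntWritable(finalCount));
}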

On Thu, Jun 13, 2013 at 9:52 AM, Omkar Joshi
omkar.jo...@lntinfotech.com wrote:
 Hi,



 I have a SequenceFile which contains several jpeg images with (image name,
 image bytes) as key-value pairs. My objective is to count the no. of images
 by grouping them by the source, something like this :



 Nikon Coolpix  100

 Sony Cybershot 251

 N82 100






RE: Reducer not getting called

2013-06-12 Thread Omkar Joshi
Ok but that link is broken - can you provide a working one?

Regards,
Omkar Joshi


-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Thursday, June 13, 2013 11:01 AM
To: user@hadoop.apache.org
Subject: Re: Reducer not getting called

You're not using the recommended @Override annotations, and are
hitting a classic programming mistake. Your issue is same as this
earlier discussion: http://search-hadoop.com/m/gqA3rAaVQ7 (and the
ones before it).

On Thu, Jun 13, 2013 at 9:52 AM, Omkar Joshi
omkar.jo...@lntinfotech.com wrote:
 Hi,



 I have a SequenceFile which contains several jpeg images with (image name,
 image bytes) as key-value pairs. My objective is to count the no. of images
 by grouping them by the source, something like this :



 Nikon Coolpix  100

 Sony Cybershot 251

 N82 100






Re: Compatibility of Hadoop 0.20.x and hadoop 1.0.3

2013-06-12 Thread Lin Yang
Hi, Vinod,

Thanks.

2013/6/13 Vinod Kumar Vavilapalli vino...@hortonworks.com


 It should mostly work. I just checked our CHANGES.txt file and haven't
 seen much incompatibilities introduced between those releases.

 But 0.20.3 is pretty old, so only one way to know for sure - compile and
 run against 1.x.

 If you are making that jump, you may as well use the latest releases in
 1.x line.

  Thanks,
 +Vinod Kumar Vavilapalli
 Hortonworks Inc.
 http://hortonworks.com/

 On Jun 12, 2013, at 8:34 PM, Lin Yang wrote:

 Hi, all,

 I was wondering: could an application written with the Hadoop 0.20.3 API run
 on a Hadoop 1.0.3 cluster?

 If not, is there any way to run this application on Hadoop 1.0.3 instead
 of rewriting all the code?

 --
 Lin Yang





-- 
Lin Yang