Hello,

The lack of an HDFS API is just one of the drawbacks that motivated us to abandon Streaming and develop Pydoop. Unfortunately, in the blog post cited by Harsh J, Pydoop is only briefly mentioned because the author failed to build and install it.

Here is how you can solve your problem with Pydoop (for details on how to run programs, see the docs at http://pydoop.sourceforge.net/docs):

import pydoop.pipes as pp
import pydoop.hdfs as hdfs

class Mapper(pp.Mapper):

  def __init__(self, context):
    super(Mapper, self).__init__(context)
    jc = context.getJobConf()
    # embed the task attempt ID in the file name so that concurrent
    # map tasks never write to the same file
    fname = "%s/%s" % (jc.get("mapred.output.dir"), jc.get("mapred.task.id"))
    self.fo = hdfs.open(fname, "w", user="simleo")
    self.fo.close()  # create the file, then reopen it in append mode
    self.fo = hdfs.open(fname, "a", user="simleo")

  def map(self, context):
    # write the length of each input value, one per line
    l = len(context.getInputValue())
    self.fo.write("%d\n" % l)

  def close(self):
    self.fo.close()

class Reducer(pp.Reducer):
  pass

if __name__ == "__main__":
  pp.runTask(pp.Factory(Mapper, Reducer))
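
For completeness, a pipes application like this one is usually submitted with the hadoop pipes command, keeping the Java record reader and writer. Roughly (script and directory names are just placeholders; check the docs for the full procedure):

hadoop fs -put lengths.py lengths.py
hadoop pipes -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -program lengths.py -input input -output output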

Note that I'm embedding the task attempt ID in the file name to avoid clashes between different mappers trying to write to the same file at the same time.
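
Incidentally, the hdfs module also gives you a filesystem handle, so renaming a file after it has been created (your further question below) is straightforward. A minimal sketch, with made-up paths:

import pydoop.hdfs as hdfs

fs = hdfs.hdfs()  # connect to the default HDFS instance
# move the per-task file to a friendlier name
fs.rename("output/attempt_201302061706_0001_m_000000_0", "output/lengths.txt")
fs.close()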

Simone

On 02/07/2013 06:18 AM, Harsh J wrote:
The raw streaming interface has many issues of this kind; furthermore,
Python's open(…, 'w') calls won't open files on HDFS. Since you wish to
use Python for its various advantages, perhaps check out Uri's detailed
comparison guide of various Python-based Hadoop frameworks (including
the raw streaming we offer as part of Apache Hadoop) at
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Many of these provide Python extensions to HDFS etc., letting you do
much more than plain streaming.

On Thu, Feb 7, 2013 at 6:43 AM, Julian Bui <julian...@gmail.com> wrote:
Hi hadoop users,

I am trying to use the streaming interface with a Python script mapper to
create some files, but I am running into difficulties actually creating files
on HDFS.

I have a Python script mapper with no reducers.  Currently, it doesn't even
read the input; instead, it reads the output directory from an environment
variable (outdir = os.environ['mapred_output_dir']) and attempts to create an
empty file at that location.  However, that appears to fail with the [vague]
error message appended to this email.

I am using the streaming interface because the Python examples seem so
much cleaner and abstract a lot of the details away for me, but if I instead
need to use the Java bindings (and create mapper and reducer classes), then
please let me know.  I'm still learning Hadoop.  As I understand it, I
should be able to create files in Hadoop, but perhaps this ability is
limited when using the streaming I/O interface.

A further question: if my mapper absolutely must send its output to stdout, is
there a way to rename the file after it has been created?

Please help.

Thanks,
-Julian

Python mapper code:
import os

outdir = os.environ['mapred_output_dir']
f = open(outdir + "/testfile.txt", "wb")  # open() targets the local FS, not HDFS
f.close()


13/02/06 17:07:55 INFO streaming.StreamJob:  map 100%  reduce 100%
13/02/06 17:07:55 INFO streaming.StreamJob: To kill this job, run:
13/02/06 17:07:55 INFO streaming.StreamJob:
/opt/hadoop/libexec/../bin/hadoop job
-Dmapred.job.tracker=gcn-13-88.ibnet0:54311 -kill job_201302061706_0001
13/02/06 17:07:55 INFO streaming.StreamJob: Tracking URL:
http://gcn-13-88.ibnet0:50030/jobdetails.jsp?jobid=job_201302061706_0001
13/02/06 17:07:55 ERROR streaming.StreamJob: Job not successful. Error: # of
failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask:
task_201302061706_0001_m_000000
13/02/06 17:07:55 INFO streaming.StreamJob: killJob...
Streaming Command Failed!




--
Harsh J


--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone....@crs4.it
http://www.crs4.it
