On 14/08/12 23:55, Shawn Smith wrote:
> Has anyone tried using Crunch with Amazon Elastic MapReduce? I've run
> into a few issues, and I thought I'd share my experiences so far:

Thank you. You made this much easier for me.

> 1. A typical Elastic MapReduce job uses S3 input and output files
> (w/Amazon's customized Native S3 File System) and HDFS intermediate
> files. This doesn't work with Crunch calls to
> FileSystem.get(Configuration) that assume the default file system
> (HDFS). Example stack trace:
>
> Exception in thread "main" java.lang.IllegalArgumentException: This
> file system object (hdfs://10.114.37.65:9000) does not support
> access to the request path 's3://test-bucket/test/Input.avro' You
> possibly called FileSystem.get(conf) when you should have called
> FileSystem.get(uri, conf) to obtain a file system supporting your path.
>
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:129)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:513)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:767)
> at org.apache.crunch.io.SourceTargetHelper.getPathSize(SourceTargetHelper.java:44)
>
> It looks like switching to Path.getFileSystem(Configuration) throughout
> allows mixing S3 and HDFS files.

There's another one of these in
https://issues.apache.org/jira/browse/CRUNCH-138 that has just been merged.

> 3. EMR Hadoop 1.0.3 includes Avro 1.5.3 which apparently takes
> precedence over Crunch's Avro 1.7.0. I didn't mess around with trying
> to get my classes in the class path first… Instead I used the
> maven-shade-plugin in my job's build to shade Avro 1.7.0 from
> "org.apache.avro.*" to "shaded.org.apache.avro.*" so it wouldn't
> conflict with the EMR version of Avro.
> Example exception (you can see
> the Avro source code line numbers correspond to version 1.5.3):
>
> 2012-08-13 06:50:57,547 WARN org.apache.hadoop.mapred.Child (main):
> Error running child
> java.lang.RuntimeException: java.lang.NoSuchMethodException:
> org.apache.avro.mapred.Pair.<init>()
> at org.apache.avro.specific.SpecificDatumReader.newInstance(SpecificDatumReader.java:101)
> at org.apache.avro.specific.SpecificDatumReader.newRecord(SpecificDatumReader.java:56)

This kind of shade is difficult to do using SBT for a Scrunch project.
Messing around with the Hadoop classpath variables in AWS EMR bootstrap
actions is no fun either. Instead, a quick and nasty hack is to remove the
conflicting Avro jar from the Hadoop installations using a bootstrap action:

#!/bin/bash
# Remove Avro 1.5.3 from Amazon Hadoop 1.0.3 to fix Crunch conflict.
rm /home/hadoop/lib/avro-1.5.3.jar

Daithi
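
A slightly more defensive take on the bootstrap action above, as a rough sketch rather than a tested EMR recipe. The helper name remove_conflicting_jars and the two optional arguments are made up for illustration; the default directory and jar name are the ones from the one-liner. It skips the jar quietly if it is not there, so that a missing file is not treated as an error:

```shell
#!/bin/bash
# Rough sketch: remove jars that clash with the versions bundled in the job.
# LIB_DIR and JAR_PATTERN are illustrative parameters, not EMR settings; the
# defaults are the path and Avro version from the original one-liner.
LIB_DIR="${1:-/home/hadoop/lib}"
JAR_PATTERN="${2:-avro-1.5.3.jar}"

# Delete every regular file in a directory whose name matches a glob pattern.
#   $1: library directory   $2: jar file name or glob pattern
remove_conflicting_jars() {
  local dir="$1" pattern="$2" jar
  # $pattern is left unquoted on purpose so that globs such as avro-1.5.*.jar expand.
  for jar in "$dir"/$pattern; do
    # An unmatched glob stays literal, so check that the file really exists.
    if [ -f "$jar" ]; then
      rm -f -- "$jar" && echo "removed $jar"
    fi
  done
  # Always report success, even when nothing matched.
  return 0
}

remove_conflicting_jars "$LIB_DIR" "$JAR_PATTERN"
```

Run with no arguments it does the same thing as the one-liner; passing a directory and a pattern makes it possible to try it against a scratch directory first.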
