Re: Map with another map inside (unpredictable naming)
The map union value you currently have can certainly carry another map type within. Here's how you'd probably want to define it: { "name": "metadata", "type": { "type": "map", "values": [ "null", "int", "float", "string", "boolean", "long", { "type": "map", "values": [ "null", "int", "float", "string", "boolean", "long" ] } ] } } P.S. It's generally better to specify "null" as the first union type: http://avro.apache.org/docs/current/spec.html#Unions On Mon, 27 Mar 2017 at 16:47 Dag Stockstad wrote: > Hi Avro aficionados, > > I'm having trouble serializing a record with a nested map structure i.e. a > map within a map. The record I'm trying to send has the following structure: > { > "event_type": "some_type", > "data": { > "id": "2f720f90-ea06-4248-a72e-01eea44981ed", > "metadata": { > "some_attr": "some_value", > "some_map_with_unpredictable_name": { > "some_attr": "some_value" > } > } > } > } > > And the schema is this: > { > "namespace": "org.example.event.avro", > "type": "record", > "name": "EventNotification", > "fields": [{ > "name": "event_type", > "type": "string" > }, { > "name": "data", > "type": { > "type": "record", > "name": "EventData", > "fields": [{ > "name": "id", > "type": "string" > }, { > "name": "metadata", > "type": { > "type": "map", > "values": [ > "int", > "float", > "string", > "boolean", > "long", > "null" > ] > } > }] > } > }] > } > > The nested map (some_map_with_unpredictable_name) is causing problems > (serialization error). Is there any way I can have another map as a value > in the metadata map? > > Due to the nature of the system, I cannot 100% predict the structure of > the metadata field. Can Avro accommodate these requirements or do I have to > fall back on something such as JSON for this one? > > Help very appreciated (I'm a bit stuck). > > Kind regards, > Dag
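For completeness, here is a minimal sketch of building a matching datum with the Java generic API (assuming the corrected schema above is saved as event.avsc; names and values mirror the example record and are otherwise illustrative):

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

Schema schema = new Schema.Parser().parse(new File("event.avsc"));
Schema dataSchema = schema.getField("data").schema();

// The nested, unpredictably named map resolves against the inner map branch of the union.
Map<String, Object> inner = new HashMap<String, Object>();
inner.put("some_attr", "some_value");

Map<String, Object> metadata = new HashMap<String, Object>();
metadata.put("some_attr", "some_value");
metadata.put("some_map_with_unpredictable_name", inner);

GenericRecord data = new GenericData.Record(dataSchema);
data.put("id", "2f720f90-ea06-4248-a72e-01eea44981ed");
data.put("metadata", metadata);

GenericRecord event = new GenericData.Record(schema);
event.put("event_type", "some_type");
event.put("data", data);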
Re: avro-tools tojson where avro file in HDFS
The avro-tools jar is usually a standalone (fat) one, and running such a standalone variant with 'hadoop jar' may cause classpath pollution, as Hadoop also places a (likely different) version of Avro on the runtime classpath. Run it instead this way: export HADOOP_USER_CLASSPATH_FIRST=true export HADOOP_CLASSPATH=avro-tools-1.7.7.jar hadoop jar avro-tools-1.7.7.jar … Or, if you are certain avro-tools is built against a compatible Hadoop server version, simply run: java -jar avro-tools-1.7.7.jar … On Fri, Sep 4, 2015 at 8:26 PM Ashish Rastogi (BLOOMBERG/ 731 LEX) < arastog...@bloomberg.net> wrote: > Hi Avro users, > > I'm using avro-tools-1.7.7.jar, and would like to print records to stdout > using the "tojson" option. I want to do this with my avro files in HDFS > (and not on the local file system). I thought AVRO-867 ( > https://issues.apache.org/jira/browse/AVRO-867) would allow me to do > this. However, I get the following exception when I run: > > $ hadoop jar avro-tools-1.7.7.jar tojson hdfs:// > > Exception in thread "main" java.lang.NoSuchMethodError: > org.apache.avro.io.EncoderFactory.jsonEncoder(Lorg/apache/avro/Schema;Ljava/io/OutputStream;Z)Lorg/apache/avro/io/JsonEncoder; > at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:76) > at org.apache.avro.tool.Main.run(Main.java:84) > at org.apache.avro.tool.Main.main(Main.java:73) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.util.RunJar.main(RunJar.java:212) > > Anyone know what this means? Any help is appreciated. > > As a confirmation, I'm able to use avro-tools after I -copyToLocal this > avro file to view these records. > > Thanks, > Ashish
Re: Is Avro Splittable?
Could you also link to the articles that claim Avro containers are not splittable? It'd be good to correct them to avoid this confusion. On Thu, Jun 25, 2015 at 11:25 AM Ankur Jain ankur.j...@yash.com wrote: Hello, I am reading various forums and docs; somewhere it is mentioned that Avro is splittable and somewhere non-splittable. So which one is right? Regards, Ankur
Re: Binary output in MR job
You're not explicitly specifying a relevant form of AvroOutputFormat, so the default TextOutputFormat gets used, which gives you a JSON (toString) representation of the records. On Sat, Aug 16, 2014 at 3:07 PM, Anand Nalya anand.na...@gmail.com wrote: Hi, I'm writing an MR2 job in which I'm reading plain text as input and producing avro output. On running the job in local mode, the output is being serialized into json format. What can I do so that the output uses binary encoding? Following is my job definition: Job job = new Job(getConf(), "Post convertor"); job.setJarByClass(getClass()); AvroJob.setOutputKeySchema(job, Post.getClassSchema()); AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.LONG)); AvroJob.setMapOutputValueSchema(job, Post.getClassSchema()); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setMapperClass(PostMapper.class); job.setReducerClass(PostReducer.class); Regards, Anand -- Harsh J
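If it helps, a minimal sketch of the likely fix with the new-API Avro classes (assuming the avro-mapred artifact is on the job classpath, and `job` as in the quoted definition; untested):

import org.apache.avro.mapreduce.AvroKeyOutputFormat;

// Without an Avro output format, Hadoop falls back to TextOutputFormat,
// which writes record.toString(), i.e. the JSON form being observed.
job.setOutputFormatClass(AvroKeyOutputFormat.class);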
Re: Error when trying to convert a local datafile to plain text with Avro Tools
It's useful to have plaintext data in compressed avro files in HDFS for MR/etc. processing, since the container format allows splitting. The 'totext'/'fromtext' feature was added originally via AVRO-567. You may instead be looking for the avro (avrocat) tool? You can obtain it by installing the Python 'avro' package (easy_install avro, or pip install avro) and by then running the 'avro' command. It allows configurable forms of text transformation from regular Avro schema files. On Mon, Jul 14, 2014 at 10:52 AM, julianpeeters julianpeet...@gmail.com wrote: Hi, I'm exploring the human-readable avro options in the avro-tools jar, namely `tojson` and `totext`. `tojson` works fine, but when I try `totext` with: `$ java -jar avro-tools-1.7.6.jar totext twitter.avro twitter.txt`, twitter.txt is empty and I get this error: Jul 13, 2014 8:41:19 PM org.apache.hadoop.util.NativeCodeLoader <clinit> WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Avro file is not generic text schema What am I doing wrong? Thanks for looking, -Julian PS (Looking into the source, it looks like this error is thrown when the schema in the datafile is not equal to the string "bytes", but I have a hard time understanding why the datafile's schema would ever be that.) -- Harsh J
Re: Passing Schema objects through Avro RPC
Hey Joey, Are you perhaps looking for https://issues.apache.org/jira/browse/AVRO-251? On Sat, Jun 21, 2014 at 2:20 AM, Joey Echeverria j...@cloudera.com wrote: Has anyone built an Avro RPC interface that includes a method that returns Avro Schema objects? I've built my protocol using the reflect APIs, but because Schema doesn't have an empty constructor, I get a NoSuchMethodException when trying to deserialize on the client. -Joey -- Harsh J
Re: How to dynamically create a MapSchema with String as key, not Utf8
You can pass -string to the avro-tools compile program, to make the generated classes use String/CharSequence and not Utf8. On Mon, May 19, 2014 at 1:37 PM, Fengyun RAO raofeng...@gmail.com wrote: I've noticed the jira page: https://issues.apache.org/jira/browse/AVRO-803, and know how to generate a Map<String, MyType> class using a schema file. In my case, the schema file is “MyType.avsc”, and I used avro-tools to generate a MyType.java class. My question is how to dynamically create a map Schema of Map<String, MyType>, since I have to Ser/De a Map<String, MyType>. I tried to use the method Schema.createMap(Schema myType), but the key is Utf8, not String. -- Harsh J
Re: How to dynamically create a MapSchema with String as key, not Utf8
Ah my bad, I didn't realise you wanted to infer/generate the schema in code. I believe you may be looking for this static method perhaps: http://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/generic/GenericData.html#setStringType(org.apache.avro.Schema, org.apache.avro.generic.GenericData.StringType) On Mon, May 19, 2014 at 4:03 PM, Fengyun RAO raofeng...@gmail.com wrote: Yes, I knew it. MyType.java was compiled using -string. My question is how to generate the schema of Map<String, MyType>, given that I only have the MyType.avsc file, but not the map schema file. The Schema.createMap(Schema myType) method generates a Map<Utf8, MyType>, not a Map<String, MyType>. 2014-05-19 18:16 GMT+08:00 Harsh J ha...@cloudera.com: You can pass -string to the avro-tools compile program, to make the generated classes use String/CharSequence and not Utf8. On Mon, May 19, 2014 at 1:37 PM, Fengyun RAO raofeng...@gmail.com wrote: I've noticed the jira page: https://issues.apache.org/jira/browse/AVRO-803, and know how to generate a Map<String, MyType> class using a schema file. In my case, the schema file is “MyType.avsc”, and I used avro-tools to generate a MyType.java class. My question is how to dynamically create a map Schema of Map<String, MyType>, since I have to Ser/De a Map<String, MyType>. I tried to use the method Schema.createMap(Schema myType), but the key is Utf8, not String. -- Harsh J -- Harsh J
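Putting the two together, a small sketch (untested) of building the map schema and marking its keys to decode as java.lang.String:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

Schema valueSchema = new Schema.Parser().parse(new File("MyType.avsc"));
Schema mapSchema = Schema.createMap(valueSchema);

// Per the linked javadoc, this is meaningful for string and map schemas
// (for map keys): it stores the "avro.java.string" hint on the schema so
// the runtime materializes keys as String instead of Utf8.
GenericData.setStringType(mapSchema, GenericData.StringType.String);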
Re: Record field names
Hi Yael, The field name requirements are defined in the Avro specification: http://avro.apache.org/docs/current/spec.html#Names On Tue, May 20, 2014 at 1:15 AM, yael aharon yael.aharo...@gmail.com wrote: Hello, Recently, I noticed that the name of generic record fields are very restrictive. Only alphanumeric characters and underscores are allowed. Could someone shed some light on why is this restriction? thanks, Yael -- Harsh J
Re: avro_file_writer_sync
Read more about sync markers in the part of the specification that explains the data file format: http://avro.apache.org/docs/current/spec.html#Object+Container+Files I don't think the call helps with your situation though - that error is usually seen for badly written/produced avro data files, or possibly because of some specific bug. I've asked for more details on your other thread. On Sun, Apr 20, 2014 at 10:35 PM, amit nanda amit...@gmail.com wrote: Hi, I see the function avro_file_writer_sync in the io.h file in the Avro C library. What does this function do? I am facing "Invalid Sync!" for some of the files; can this function help me with that? Thanks Amit -- Harsh J
Re: Cannot decode file block: Error decompressing block with deflate, possible data error
Are all your files running into this, or just a few among the C-written ones? The file is possibly corrupt because of missing bytes or other reasons (hard to tell specifically). Having more info on the behaviour, or a sample such file to inspect, may help. On Fri, Apr 18, 2014 at 3:57 PM, amit nanda amit...@gmail.com wrote: Hi, While reading a file using the avrocat tool, I am getting errors: *Cannot decode file block: Error decompressing block with deflate, possible data error.* And when using avro-tools-1.7.6.jar I get "Invalid Sync!" errors. Can anyone please tell me why these errors are coming, and is there a way to correct the file now without losing any data? The file was created using the 1.7.4 C library. Thanks Amit -- Harsh J
Re: Saving arbitrary data to avro files
You can do this, sure. You just need a schema of string type or something similar. Are you not concerned about the read time of the data you plan to store as strings? Typically you write once and read more than once during processing. Storing the data types in proper serialized form would help greatly during reads. On Mar 4, 2014 8:54 AM, yael aharon yael.aharo...@gmail.com wrote: Hello, I am writing a C++ library that stores arbitrary data to avro files. The schema is given to me through my library's API. All the data is given to me in the form of strings; including integers, doubles, etc. Is there a way for me to store this data to avro files without converting the strings to the correct types first? I am concerned about the performance impact that this conversion would have thanks, Yael
Re: Multiple inputs for different avro inputs
One more doubt: Why we don't have AvroMultipleInputs just like AvroMultipleOutputs? Any reason? This (and other) question belongs to Apache Avro's list (user@avro.apache.org). Moving user@hadoop to bcc:. For AvroMultipleInputs, see https://issues.apache.org/jira/browse/AVRO-1439 On Thu, Feb 27, 2014 at 2:43 AM, AnilKumar B akumarb2...@gmail.com wrote: Hi, I am using MultipleInputs to read two different avro inputs with different schemas. But in the run method, as we need to specify AvroJob.setInputKeySchema(job, schema), which schema do I need to set? I tried as below: List<Schema> schemas = new ArrayList<Schema>(); schemas.add(FlumeEvent.SCHEMA$); schemas.add(Event.SCHEMA$); AvroJob.setInputKeySchema(testJob, Schema.createUnion(schemas)); But I am facing an issue in the map phase: Error: org.apache.avro.AvroTypeException: Found Event, expecting union How to fix this issue? One more doubt: Why we don't have AvroMultipleInputs just like AvroMultipleOutputs? Any reason? Thanks Regards, B Anil Kumar. -- Harsh J
Re: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to java.util.Map
at org.apache.avro.generic.GenericDatumWriter.getMapSize(GenericDatumWriter.java:194) at org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:173) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:69) at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143) at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:138) at org.apache.avro.reflect.ReflectDatumWriter.writeArray(ReflectDatumWriter.java:64) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68) at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143) at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114) at org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:175) at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66) at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143) at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114) at org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:175) at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66) at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58) at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:290) Thanks Regards, B Anil Kumar. -- Harsh J
Re: Avro MapReduce (MR1): Prevent Key from being output by reducer when using Pair schema
Hello Ed, The AvroReducer per http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroReducer.html has a simple spec of <K,V,OUT>, where OUT can be any record type and not necessarily a Pair<KO,VO> type. AvroJob.setOutputSchema(…) should accept non-pair configs. I think its java-doc is incorrect though. I wrote a test case yesterday at http://issues.apache.org/jira/browse/AVRO-1439, in which I set a non-Pair schema via the same call without any trouble. We could get the java-doc fixed, if it is indeed wrong. On Thu, Jan 16, 2014 at 2:14 PM, ed edor...@gmail.com wrote: Hello, I am currently reading in lots of small avro files and then writing them out into one large avro file using Map Reduce MR1. I'm trying to do this using the AvroMapper and AvroReducer and it's almost working how I want. The problem right now is that it looks like I have to use org.apache.avro.mapred.Pair if I use AvroJob.setOutputSchema. Is there a way to output a Pair schema from AvroReducer and have the key in that schema be ignored (i.e., not included in the output from the reducer)? Right now when I check the Reducer output there is an added field in each record called key which I'd like to not have there. Essentially I'm looking for something like NullWritable where the key will just be ignored in the final output. Thank you for any assistance or guidance you can provide! Best Regards, Ed -- Harsh J
Re: Avro MapReduce (MR1): Prevent Key from being output by reducer when using Pair schema
Thanks Ed! Can you also file an improvement JIRA under https://issues.apache.org/jira/browse/AVRO with a patch that changes it to make more sense? On Thu, Jan 16, 2014 at 5:14 PM, ed edor...@gmail.com wrote: Hi Harsh, Thank you for your response which was invaluable in helping me to figure out my issue. The Java-Doc is in fact incorrect when it states that AvroJob.setOutputSchema cannot accept non-Pair configs, as it turns out it can. What was throwing me off is that if you use AvroJob.setOutputSchema to set a non-Pair config, then you also need to call AvroJob.setMapOutputSchema (which does require the use of Pair). Otherwise, by default, the map output schema gets set to whatever you set in setOutputSchema, and if that is non-Pair you'll get an error at runtime. Maybe the JavaDoc should say something along the lines of: "Configure a job's output schema. If this is not a Pair-schema then you must explicitly set the job's map output schema using setMapOutputSchema." Thank you! Best Regards, Ed On Thu, Jan 16, 2014 at 6:47 PM, Harsh J ha...@cloudera.com wrote: Hello Ed, The AvroReducer per http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroReducer.html has a simple spec of <K,V,OUT>, where OUT can be any record type and not necessarily a Pair<KO,VO> type. AvroJob.setOutputSchema(…) should accept non-pair configs. I think its java-doc is incorrect though. I wrote a test case yesterday at http://issues.apache.org/jira/browse/AVRO-1439, in which I set a non-Pair schema via the same call without any trouble. We could get the java-doc fixed, if it is indeed wrong. On Thu, Jan 16, 2014 at 2:14 PM, ed edor...@gmail.com wrote: Hello, I am currently reading in lots of small avro files and then writing them out into one large avro file using Map Reduce MR1. I'm trying to do this using the AvroMapper and AvroReducer and it's almost working how I want. The problem right now is that it looks like I have to use org.apache.avro.mapred.Pair if I use AvroJob.setOutputSchema. Is there a way to output a Pair schema from AvroReducer and have the key in that schema be ignored (i.e., not included in the output from the reducer)? Right now when I check the Reducer output there is an added field in each record called key which I'd like to not have there. Essentially I'm looking for something like NullWritable where the key will just be ignored in the final output. Thank you for any assistance or guidance you can provide! Best Regards, Ed -- Harsh J -- Harsh J
Re: Nullable Fields
Hi Alparslan, Are you looking for the Nullable annotation for ReflectData-based schemas: http://avro.apache.org/docs/1.7.5/api/java/org/apache/avro/reflect/Nullable.html? Sample usage here: https://github.com/apache/avro/blob/release-1.7.5/lang/java/avro/src/test/java/org/apache/avro/reflect/TestReflect.java#L324 On Thu, Jan 16, 2014 at 5:27 PM, Alparslan Avcı alparslan.a...@agmlab.com wrote: Hi all, I've been developing Gora to upgrade to Avro 1.7.x and have a question. Why do we have to use the UNION type (e.g. {"name": "field1", "type": ["null", {"type": "map", "values": ["null", "string"]}], "default": null}) for nullable fields? Because of this issue, we have to handle UNION types in an appropriate way for both normal values and null values as special cases. Instead of the UNION type, why don't we use a 'nullable' property for any field? -- Harsh J
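To illustrate the reflect route, a tiny sketch (hypothetical class, untested):

import org.apache.avro.Schema;
import org.apache.avro.reflect.Nullable;
import org.apache.avro.reflect.ReflectData;

public class Thing {
  @Nullable String maybeName; // induced as the union ["null", "string"]
  int count;                  // plain, non-nullable int

  public static void main(String[] args) {
    Schema schema = ReflectData.get().getSchema(Thing.class);
    System.out.println(schema.toString(true));
  }
}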
Re: an awkward question
This question is for the ASF Infra I think - we use their services :) There are services such as Nabble, etc. that let you also browse and post on most indexed ASF project threads, by displaying it as a forum. Note though that Google Groups carries its own annoying limits, which are raised only for large customers/users of their paid apps services. I'm still a big fan of the older backends, though I do enjoy Google Groups' permalink feature :) On Fri, Sep 27, 2013 at 9:42 PM, John Langley dige...@gmail.com wrote: Sorry if this is an awkward question, I don't really want to start some huge debate, but... google groups rock you guys. Is there any easy way to view and use this email distribution list from google groups, or is there any interest in moving there? I know the community probably doesn't want to endorse any one vendor, etc. etc. but honestly, working with other technology bases... google-groups REALLY makes it easy to collaborate and search for issues that someone may already have solved. (certainly http://search-hadoop.com/?q=fc_project=Avro helps, if only you could post from there as well, w/out having to jump to another interface etc.) -- Harsh J
Re: Kafka JMX
Hello, Did you mean to send this to the Kafka lists instead of the Avro one? On Fri, Aug 30, 2013 at 4:08 AM, Mark static.void@gmail.com wrote: Can you view Kafka metrics via JConsole? I've tried connecting to port with no such luck? -- Harsh J
Re: Avro file Compression
Can you share your test? There is an example at http://svn.apache.org/repos/asf/avro/trunk/lang/c/examples/quickstop.c which has the right calls for using a file writer with a deflate codec - is yours similar? On Mon, Aug 19, 2013 at 9:42 PM, amit nanda amit...@gmail.com wrote: I am trying to compress the avro files that I am writing; for that I am using the latest Avro C with the deflate option, but I am not able to see any difference in the file size. Is there any special type of data that this works on, or is there any more setting that needs to be done for this to work? -- Harsh J
Re: Is there a way to conditionally read Avro data?
What Eric suggests (reader schemas) would work, but may incur a double-read cost whenever a record does meet your condition and you then need its full body. If this data is held, order-wise, early in the record, then perhaps using a custom DatumReader implementation (the class that does the low-level deserialization) may work more effectively. You can pass a DatumReader when constructing a DataFileReader - but it's quite a long route to go, IMO. On Sat, Aug 17, 2013 at 4:17 AM, Eric Wasserman ewasser...@247-inc.com wrote: If you write your records with a schema like this (this is in the Avro IDL lang. for brevity): record R { Header header; Body body; } Then you can read with a schema like this: record RSansBody { Header header; } And the Avro libraries will read the header part (in which your type would reside) and effectively skip the body part. From: Anna Lahoud annalah...@gmail.com Sent: Friday, August 16, 2013 12:23 PM To: user@avro.apache.org Subject: Is there a way to conditionally read Avro data? I am wondering if there is a way that I can avoid reading all of an item in an Avro file, based on some of the data that I have already read. For instance, say I have a datum where I know that if its 'type' value is a 'ComputerVirus', I do not want to touch the remaining fields. Is there a way to 'move on' and get the next datum, without touching the remainder of the scary datum? I would call it a 'conditional read' in that I only want to fully read the datum if the datum meets some criteria. Anna -- Harsh J
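A sketch of Eric's projection approach with the generic API (the schema literal and file name are hypothetical; the projection's record name must match, or alias, the writer's record name for resolution to succeed):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

Schema projection = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
    + "{\"name\":\"header\",\"type\":\"string\"}]}");

// Writer schema is left null; DataFileReader supplies it from the file header.
DatumReader<GenericRecord> datumReader =
    new GenericDatumReader<GenericRecord>(null, projection);
DataFileReader<GenericRecord> reader =
    new DataFileReader<GenericRecord>(new File("records.avro"), datumReader);
while (reader.hasNext()) {
  GenericRecord rec = reader.next(); // carries only "header"; body fields are skipped
}
reader.close();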
Re: Mapper not called
I've often found the issue behind such an observation to be that the input files lack an .avro extension. Is that true in your case? Can you retry after a rename if yes? On Wed, Jul 31, 2013 at 1:02 AM, Anna Lahoud annalah...@gmail.com wrote: I am following the directions on http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/package-summary.html to write a job that takes Avro files as input and outputs non-Avro files. I created the following job. I should note that I have tried different variations of ordering the setInput/OutputPath lines, the AvroJob lines, and the reduce task settings. It always results the same: the job runs with 0 mappers and 1 reducer (which gets no data so is essentially an empty SequenceFile). It always says there are 10 input files so that's not the issue. There is an @Override statement on my map and my reduce so that's not the issue. And I believe I have correctly followed the Avro input/non-Avro output instructions mentioned in the link above. Any other ideas would be welcome!!! public class MyAvroJob extends Configured implements Tool { @Override public int run(String[] args) throws Exception { JobConf job = new JobConf(getConf(), this.getClass()); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); AvroJob.setMapperClass(job, MyAvroMapper.class); AvroJob.setInputSchema(job, MySchema.SCHEMA$); AvroJob.setMapOutputSchema(job, Pair.getPairSchema(Schema.create(Type.STRING), Schema.create(Type.STRING))); job.setReducerClass(MyNonAvroReducer.class); job.setOutputFormat(SequenceFileOutputFormat.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); job.setNumReduceTasks(1); return JobClient.runJob(job).isSuccessful(); } public static class MyAvroMapper extends AvroMapper<MySchema, Pair<String, String>> { @Override public void map(MySchema in, AvroCollector<Pair<String, String>> collector, Reporter reporter) throws IOException { List<MyThings> things = in.getRecords(); ... collector.collect(new Pair<String, String>(newKey, newValue)); } } public static class MyNonAvroReducer extends MapReduceBase implements Reducer<AvroKey<String>, AvroValue<String>, Text, Text> { @Override public void reduce(AvroKey<String> key, Iterator<AvroValue<String>> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { while (values.hasNext()) { output.collect(new Text(key.datum()), new Text(values.next().datum())); } } } public static void main(String[] args) throws Exception { ToolRunner.run(new MyAvroJob(), args); } -Anna -- Harsh J
Re: Avro schema
We read it from the top of the file at start (just the schema bytes) and then initialize the reader. On Thu, Aug 1, 2013 at 8:32 PM, Lior Schachter lior...@gmail.com wrote: Hi all, When writing Avro schema to the a data file, what will be the expected behavior if the file is used as M/R input. How does the second/third/... splits get the schema (the schema is always written to the first split) ? Thanks, Lior -- Harsh J
Re: Avro schema
Yes, we seek to 0 and we read the header then seek back to the split offset. On Aug 1, 2013 11:16 PM, Lior Schachter lior...@gmail.com wrote: Hi Harsh, So for each split you first read the header of the file directly from HDFS ? Thanks, Lior On Thu, Aug 1, 2013 at 7:36 PM, Harsh J ha...@cloudera.com wrote: We read it from the top of the file at start (just the schema bytes) and then initialize the reader. On Thu, Aug 1, 2013 at 8:32 PM, Lior Schachter lior...@gmail.com wrote: Hi all, When writing Avro schema to the a data file, what will be the expected behavior if the file is used as M/R input. How does the second/third/... splits get the schema (the schema is always written to the first split) ? Thanks, Lior -- Harsh J
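Roughly, what the input format does per split looks like this (a sketch with hypothetical file name and split offsets; FsInput comes from the avro-mapred module):

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.SeekableInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

SeekableInput in = new FsInput(new Path("data.avro"), conf);
DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
    in, new GenericDatumReader<GenericRecord>()); // header (schema) is read at offset 0

reader.sync(splitStart); // forward to the first sync marker at or after the split start
while (reader.hasNext() && !reader.pastSync(splitEnd)) {
  GenericRecord rec = reader.next();
}
reader.close();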
Re: Avro and MapReduce 2.0
Hi, The MR APIs are not tied to MR version 1.0 or 2.0. Both APIs are available on both versions and are fully supported. If you're looking for a new API Avro MapReduce example though, you can check out some of its test cases such as http://svn.apache.org/repos/asf/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java On Fri, Jul 19, 2013 at 1:10 AM, F. Jerrell Schivers jerr...@bordercore.com wrote: Hi folks, I'm writing a map-only job that outputs to Avro. Can someone point me to some examples that use the 2.0 version of the API? Most of the sample code I've found refer to the 1.0 version. Thanks, Jerrell -- Harsh J
Re: ArrayIndexOutOfBoundsException in Symbol.getSymbol in map reduce job
It's difficult to tell what the error means without context and other info (such as the version). If I had to guess, I'd say there may be corruption in the file being processed here. Does running the file through avro-tools' tojson sub-command end up in a successful read? On Tue, May 14, 2013 at 3:28 AM, Sripad Sriram sri...@path.com wrote: Hi all, A java hadoop job that previously executed without issue began erroring with the following stack trace - have any of you seen this before? java.lang.ArrayIndexOutOfBoundsException: 14 at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364) at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229) at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129) at org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer.deserialize(AvroSerialization.java:83) at org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer.deserialize(AvroSerialization.java:65) at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:1262) at org.apache.hadoop.mapred.Task$ValuesIterator.nextKey(Task.java:1233) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:533) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:429) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132) at org.apache.hadoop.mapred.Child.main(Child.java:249) -- Harsh J
Re: Hadoop Avro Question
Oops, moving for sure this time :) On Wed, May 1, 2013 at 10:35 AM, Harsh J ha...@cloudera.com wrote: Moving the question to Apache Avro's user@ lists. Please use the right lists for the most relevant answers. Avro is a different serialization technique that intends to replace the Writable serialization defaults in Hadoop. MR accepts a list of serializers it can use for its key/value structures and isn't limited to Writable in any way. Look up the property io.serializations in your Hadoop's core-default.xml for more information. The Avro project also offers fast comparator classes that are used for comparing the bytes/structures of Avro objects. This is mostly auto-set for you when you use the MR framework as described at http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/package-summary.html (via the AvroJob helper class). On Tue, Apr 30, 2013 at 6:51 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, When dealing with Avro data files in MR jobs we use AvroMapper. I noticed that the output K and V of AvroMapper aren't Writable, and neither is the key comparable (these are AvroKey and AvroValue). As the general serialization mechanism is Writable, how do the K,V pairs, in the case of Avro, travel across nodes? Thanks, Rahul -- Harsh J -- Harsh J
Re: Hadoop Avro Question
For the TotalOrderPartitioner, check out https://issues.apache.org/jira/browse/MAPREDUCE-4574 which we fixed in Hadoop recently to allow full reuse with Avro. On Wed, May 1, 2013 at 10:49 AM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi Harsh, Looks like a lot of other classes also have to be rewritten for Avro, like the total sort partitioner, which I think currently assumes Writable as the IO mechanism. I faced a problem using it with Avro, so I thought of writing to the forum. Thanks a lot! Rahul On Wed, May 1, 2013 at 10:35 AM, Harsh J ha...@cloudera.com wrote: Moving the question to Apache Avro's user@ lists. Please use the right lists for the most relevant answers. Avro is a different serialization technique that intends to replace the Writable serialization defaults in Hadoop. MR accepts a list of serializers it can use for its key/value structures and isn't limited to Writable in any way. Look up the property io.serializations in your Hadoop's core-default.xml for more information. The Avro project also offers fast comparator classes that are used for comparing the bytes/structures of Avro objects. This is mostly auto-set for you when you use the MR framework as described at http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/package-summary.html (via the AvroJob helper class). On Tue, Apr 30, 2013 at 6:51 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, When dealing with Avro data files in MR jobs we use AvroMapper. I noticed that the output K and V of AvroMapper aren't Writable, and neither is the key comparable (these are AvroKey and AvroValue). As the general serialization mechanism is Writable, how do the K,V pairs, in the case of Avro, travel across nodes? Thanks, Rahul -- Harsh J -- Harsh J
Re: Python Errors
Isn't RHEL4 too old as well, now? On Tue, Apr 16, 2013 at 3:48 AM, Milind Vaidya kava...@gmail.com wrote: Thanks...I will upgrade n check...I was using whatever installed on my RHEL4 box On Mon, Apr 15, 2013 at 4:50 PM, Miki Tebeka miki.teb...@gmail.com wrote: Python 2.3 is too old. IIRC the minimal Python version supported is 2.6. On Mon, Apr 15, 2013 at 1:54 PM, Milind Vaidya kava...@gmail.com wrote: I installed avro for python. Like Referred : https://avro.apache.org/docs/current/gettingstartedpython.html 1. Build as per the instructions. Here is the output. ** Installation Output*** /usr/lib64/python2.3/distutils/dist.py:227: UserWarning: Unknown distribution option: 'extras_require' warnings.warn(msg) /usr/lib64/python2.3/distutils/dist.py:227: UserWarning: Unknown distribution option: 'install_requires' warnings.warn(msg) running install running build running build_py running build_scripts running install_lib byte-compiling /usr/lib/python2.3/site-packages/avro/io.py to io.pyc File /usr/lib/python2.3/site-packages/avro/io.py, line 371 @staticmethod ^ SyntaxError: invalid syntax byte-compiling /usr/lib/python2.3/site-packages/avro/schema.py to schema.pyc File /usr/lib/python2.3/site-packages/avro/schema.py, line 589 @staticmethod ^ SyntaxError: invalid syntax byte-compiling /usr/lib/python2.3/site-packages/avro/datafile.py to datafile.pyc File /usr/lib/python2.3/site-packages/avro/datafile.py, line 71 @staticmethod ^ SyntaxError: invalid syntax running install_scripts changing mode of /usr/bin/avro to 755 ** Installation Output*** 2.I checked import avro on python prompt as follows Python 2.3.4 (#1, Jan 11 2011, 14:40:50) [GCC 3.4.6 20060404 (Red Hat 3.4.6-11)] on linux2 Type help, copyright, credits or license for more information. import avro 3. I created the file user.avsc containing schema given at about link 4. Copied the code from above link in BasicAvro.py (I added #! /usr/bin/python) 5.Both BasicAvrio..py and user.avsc are in the same directory. If I run pyhon BasicAvro.py gives error Traceback (most recent call last): File BasicAvro.py, line 2, in ? import avro.schema File /usr/lib/python2.3/site-packages/avro/schema.py, line 589 @staticmethod ^ SyntaxError: invalid syntax 6. Tried executing the script under scripts directory called avro. gives following error File avro, line 75 return dict((k, obj[k]) for k in (set(obj) fields)) ^ SyntaxError: invalid syntax 7. What is going wrong ? -- Harsh J
Re: Enabling compression
Hi Vinod, In Avro, compression is provided only at the file container level (i.e. block compression). For compressing a simple byte array, you can rely on Hadoop's compression classes such as GzipCodec [1] to compress the byte stream directly (wrapping via a compressed output stream [2] obtained from its helper method [3]). Something like this, for example (I've not tested it out): ByteArrayOutputStream outputStream = new ByteArrayOutputStream(); GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, new Configuration()); OutputStream compressedOutputStream = codec.createOutputStream(outputStream); [… Encode over compressedOutputStream, etc. …] [1] - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/GzipCodec.html [2] - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressorStream.html [3] - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/GzipCodec.html#createOutputStream(java.io.OutputStream) On Tue, Apr 9, 2013 at 11:17 AM, Vinod Jammula vinod.kumar.jamm...@ericsson.com wrote: Hi, I have a CSV string which I want to serialize, compress and write to a database. I have the following code to serialize the string: ByteArrayOutputStream outputStream = new ByteArrayOutputStream(); Encoder e = EncoderFactory.get().binaryEncoder(outputStream, null); GenericDatumWriter w = new GenericDatumWriter(schema); w.write(record, e); byte[] avroBytes = outputStream.toByteArray(); And the following code to de-serialize and process the record: DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema); Decoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null); GenericRecord record = reader.read(decoder, null); I see compression with DataFileWriter and DataFileReader. But how do I enable compression for an Avro-serialized buffer? Thanks and Regards, Vinod -- Harsh J
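Extending that sketch into a rough round-trip (untested; `schema` and `record` as in the snippets above, plus the usual java.io and Avro io imports and Hadoop's GzipCodec/ReflectionUtils):

ByteArrayOutputStream bytes = new ByteArrayOutputStream();
GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, new Configuration());

// Write: encode through the codec's compressing stream.
OutputStream compressed = codec.createOutputStream(bytes);
Encoder encoder = EncoderFactory.get().binaryEncoder(compressed, null);
new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
encoder.flush();
compressed.close(); // also flushes the gzip trailer
byte[] payload = bytes.toByteArray();

// Read: decompress before handing the stream to the decoder.
InputStream plain = codec.createInputStream(new ByteArrayInputStream(payload));
Decoder decoder = DecoderFactory.get().binaryDecoder(plain, null);
GenericRecord back = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);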
Re: Record sort order is lexicographically by field -- what does that mean?
Hey Jeremy, On Thu, Mar 28, 2013 at 5:15 AM, Jeremy Kahn jer...@trochee.net wrote: According to the documentation http://avro.apache.org/docs/current/spec.html#order , the sort order for records is: record data is ordered lexicographically by field. If a field specifies that its order is: ascending, then the order of its values is unaltered. descending, then the order of its values is reversed. ignore, then its values are ignored when sorting. What does "ordered lexicographically by field" mean? I can see two interpretations. Consider a record of the following schema: {"name": "ZooInventory", "type": "record", "fields": [ {"name": "city", "type": "string", "order": "ignore"}, {"name": "zebras", "type": "int", "order": "descending"}, {"name": "anacondas", "type": "int", "order": "ascending"}, {"name": "baboons", "type": "int"} ] } I can read "ordered lexicographically by field" in two ways: the names of the fields are sorted lexicographically, and the field that goes lexicographically first (not marked as "order": "ignore") dominates. the records are sorted by the sort order of each field, with the first fields (not marked "order": "ignore") taking sort priority. The second one is correct. The field's order in the defined schema is not changed but only walked through. I've always read this more like "it will compare in the provided order (of the read schema) and based on the type of ordering (positive, ignore or negative)", and that's true from my use of it in Hadoop MR as well. So suppose I have my ZooInventory objects, and I sort them according to the sort order specification. Under interpretation (1), cities with low anaconda counts would go first in the sorted list, and within a given value of anacondas, sort by baboon count. Under interpretation (2), large zebra-count zoos would go first, and within a given value of zebras, sort ascending by anacondas. Yes, (2) is the result you'll see. Baboons would also be considered ascending as you've not ignored them, btw. It seems to me that (2), in which the zebras field values dominate the sort descending, is the right way to behave, but I can't seem to square that with my understanding of "ordered lexicographically by field" -- or maybe "lexicographically" means something different to me than to you, or maybe (2) just isn't really right after all. Behavior (2) -- relative to behavior (1) -- offers the ability to adjust the order of the schema to express a different sort order, but might present problems for schema negotiation. What kind of problems are you describing here? Sorry if I'm not getting it by the words "schema negotiation" alone. -- Harsh J
Re: Record sort order is lexicographically by field -- what does that mean?
Hmmm, I've not used the messaging aspects of Avro much as of yet, but AFAIK the sorting is only applied manually by use of the BinaryData.compare(…) API methods. If the IPC parts use that for some reason to compare two messages or more, then I can imagine this to be a problem as well. On Thu, Mar 28, 2013 at 11:27 PM, Jeremy Kahn troc...@trochee.net wrote: Thanks for the information, Harsh. Further comments inline below: On Thu, Mar 28, 2013 at 4:01 AM, Harsh J ha...@cloudera.com wrote: On Thu, Mar 28, 2013 at 5:15 AM, Jeremy Kahn jer...@trochee.net wrote: I can read ordered lexicographically by field in two ways: 1. the names of the fields are sorted lexicographically, and the field that goes lexicographically first (not marked as order:ignore) dominates. 2. the records are sorted by the sort order of each field, with the first fields (not marked order: ignore) taking sort priority. The second one is correct. The field's order in the defined schema is not changed but only walked through. [...] that's true from my use of it in Hadoop MR as well. Okay, this is very helpful to know: it's working the way I had hoped. Behavior (2) -- relative to behavior (1) -- offers the ability to adjust the order of the schema to express a different sort order, but might present problems for schema negotiation. What kind of problems are you describing here? Sorry if I'm not getting it by the words schema negotiation alone. Suppose I sort a sequence of ZooInventory objects by the sort order implied by this schema, and I send them to you in sorted order over a protocol with an IDL type specification of array<ZooInventory>. You *read* the sequence with a different ZooInventory schema with the same fields, but which contains a different ordering. The objects in the array will not (necessarily) appear to be sorted *to you*. This isn't necessarily a problem -- it might actually be a feature. It is worth noting that two schemas may be compatible under schema negotiation but have different sort order for reader and writer. --jeremy -- Harsh J
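For reference, the manual comparison entry point is a one-liner (assumes encodedA/encodedB hold binary-encoded ZooInventory datums and zooSchema is the shared writer schema):

import org.apache.avro.io.BinaryData;

// Returns negative, zero, or positive, honoring each field's "order" attribute.
int cmp = BinaryData.compare(encodedA, 0, encodedB, 0, zooSchema);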
Re: Avro and Oozie Map Reduce action
The value you're specifying for io.serializations below is incorrect: <property> <name>io.serializations</name> <value>org.apache.avro.mapred.AvroSerialization, avro.serialization.key.reader.schema, avro.serialization.value.reader.schema, avro.serialization.key.writer.schema, avro.serialization.value.writer.schema</value> </property> If the goal is to include org.apache.avro.mapred.AvroSerialization, then it should look more like: <property> <name>io.serializations</name> <value>org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization,org.apache.avro.mapred.AvroSerialization</value> </property> That is, it must be an extension of the default values, and not a replacement of them. On Wed, Mar 13, 2013 at 4:05 AM, M, Paul pa...@iqt.org wrote: Hello, I am trying to run an M/R job with Avro serialization via Oozie. I've made some progress in the workflow.xml, however I am still running into the following error. Any thoughts? I believe it may have to do with the io.serializations property below. FYI, I am using CDH 4.2.0 mr1. 2013-03-12 15:24:32,334 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_20130318_0080_m_00_3: java.lang.NullPointerException at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:389) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333) at org.apache.hadoop.mapred.Child$4.run(Child.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1407) at org.apache.hadoop.mapred.Child.main(Child.java:262) <action name="mr-node"> <map-reduce> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <prepare> <delete path="${nameNode}/user/${wf:user()}/${outputDir}" /> </prepare> <configuration> <property> <name>mapred.job.queue.name</name> <value>${queueName}</value> </property> <property> <name>mapreduce.reduce.class</name> <value>org.apache.avro.mapred.HadoopReducer</value> </property> <property> <name>mapreduce.map.class</name> <value>org.apache.avro.mapred.HadoopMapper</value> </property> <property> <name>avro.reducer</name> <value>org.my.project.mapreduce.CombineAvroRecordsByHourReducer</value> </property> <property> <name>avro.mapper</name> <value>org.my.project.mapreduce.ParseMetadataAsTextIntoAvroMapper</value> </property> <property> <name>mapreduce.inputformat.class</name> <value>org.my.project.mapreduce.NonSplitableInputFormat</value> </property> <!-- Key Value Mapper --> <property> <name>avro.output.schema</name> <value>{"type":"record","name":"Pair","namespace":"org.apache.avro.mapred","fields":...}]}</value> </property> <property> <name>mapred.mapoutput.key.class</name> <value>org.apache.avro.mapred.AvroKey</value> </property> <property> <name>mapred.mapoutput.value.class</name> <value>org.apache.avro.mapred.AvroValue</value> </property> <property> <name>avro.schema.output.key</name> <value>{"type":"record","name":"DataRecord","namespace":...]}]}</value> </property> <property> <name>mapreduce.outputformat.class</name> <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value> </property> <property> <name>mapred.output.key.comparator.class</name> <value>org.apache.avro.mapred.AvroKeyComparator</value> </property> <property> <name>io.serializations</name> <value>org.apache.avro.mapred.AvroSerialization, avro.serialization.key.reader.schema, avro.serialization.value.reader.schema, avro.serialization.key.writer.schema, avro.serialization.value.writer.schema</value> </property> <property> <name>mapred.map.tasks</name> <value>1</value> </property> <!-- Input/Output --> <property> <name>mapred.input.dir</name> <value>/user/${wf:user()}/input/</value> </property> <property> <name>mapred.output.dir</name> <value>/user/${wf:user()}/${outputDir}</value> </property> </configuration> </map-reduce> -- Harsh J
Re: Is it possible to append to an already existing avro file
I assume by non-trivial you meant the extra Seekable stuff I needed to wrap around the DFS output streams to let Avro take it as append-able? I don't think it's possible for Avro to carry it, since Avro (core) does not reverse-depend on Hadoop. Should we document it somewhere though? Do you have any ideas on the best place to do that? On Thu, Feb 7, 2013 at 6:12 AM, Michael Malak michaelma...@yahoo.com wrote: Thanks so much for the code -- it works great! Since it is a non-trivial amount of code required to achieve append, I suggest attaching that code to AVRO-1035, in the hopes that someone will come up with an interface that requires just one line of user code to achieve append. --- On Wed, 2/6/13, Harsh J ha...@cloudera.com wrote: From: Harsh J ha...@cloudera.com Subject: Re: Is it possible to append to an already existing avro file To: user@avro.apache.org Date: Wednesday, February 6, 2013, 11:17 AM Hey Michael, It does implement the regular Java OutputStream interface, as seen in the API: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataOutputStream.html. Here's a sample program that works on Hadoop 2.x in my tests: https://gist.github.com/QwertyManiac/4724582 On Wed, Feb 6, 2013 at 9:00 AM, Michael Malak michaelma...@yahoo.com wrote: I don't believe a Hadoop FileSystem is a Java OutputStream? --- On Tue, 2/5/13, Doug Cutting cutt...@apache.org wrote: From: Doug Cutting cutt...@apache.org Subject: Re: Is it possible to append to an already existing avro file To: user@avro.apache.org Date: Tuesday, February 5, 2013, 5:27 PM It will work on an OutputStream that supports append. http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(org.apache.avro.file.SeekableInput, java.io.OutputStream) So it depends on how well HDFS implements FileSystem#append(), not on any changes in Avro. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#append(org.apache.hadoop.fs.Path) I have no recent personal experience with append in HDFS. Does anyone else here? Doug On Tue, Feb 5, 2013 at 4:10 PM, Michael Malak michaelma...@yahoo.com wrote: My understanding is that will append to a file on the local filesystem, but not to a file on HDFS. --- On Tue, 2/5/13, Doug Cutting cutt...@apache.org wrote: From: Doug Cutting cutt...@apache.org Subject: Re: Is it possible to append to an already existing avro file To: user@avro.apache.org Date: Tuesday, February 5, 2013, 5:08 PM The Jira is: https://issues.apache.org/jira/browse/AVRO-1035 It is possible to append to an existing Avro file: http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(java.io.File) Should we close that issue as fixed? Doug On Fri, Feb 1, 2013 at 11:32 AM, Michael Malak michaelma...@yahoo.com wrote: Was a JIRA ticket ever created regarding appending to an existing Avro file on HDFS? What is the status of such a capability, a year out from when the issue below was raised? On Wed, 22 Feb 2012 10:57:48 +0100, Vyacheslav Zholudev vyacheslav.zholu...@gmail.com wrote: Thanks for your reply, I suspected this. I will create a JIRA ticket. Vyacheslav On Feb 21, 2012, at 6:02 PM, Scott Carey wrote: On 2/21/12 7:29 AM, Vyacheslav Zholudev vyacheslav.zholu...@gmail.com wrote: Yep, I saw that method as well as the stackoverflow post. However, I'm interested how to append to a file on the arbitrary file system, not only on the local one.
I want to get an OutputStream based on the Path and the FileSystem implementation and then pass it for appending to avro methods. Is that possible? It is not possible without modifying DataFileWriter. Please open a JIRA ticket. It could not simply append to an OutputStream, since it must either: * Seek to the start to validate the schemas match and find the sync marker, or * Trust that the schemas match and find the sync marker from the last block DataFileWriter cannot refer to Hadoop classes such as FileSystem, but we could add something to the mapred module that takes a Path and FileSystem and returns something that implements an interface that DataFileWriter can append to. This would be something that is both a http://avro.apache.org/docs/1.6.2/api/java/org/apache/avro/file/SeekableInput.html and an OutputStream, or has both an InputStream from the start of the existing file and an OutputStream at the end. Thanks, Vyacheslav On Feb 21, 2012, at 5:29 AM, Harsh J wrote: Hi, Use the appendTo feature of the DataFileWriter. See
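For the archives, the local-file form of that API is simple; a sketch (untested):

import java.io.File;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
    new GenericDatumWriter<GenericRecord>());
writer.appendTo(new File("existing.avro")); // reads the schema and sync marker, positions at the end
writer.append(newRecord); // newRecord: a datum matching the file's schema
writer.close();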
Re: run time error during reduce stage: No field named ____ in: null
You could also use an @Override to assert an override at compile time. On Fri, Nov 2, 2012 at 9:55 PM, Brian Derickson bderickso...@gmail.com wrote: That did it! I never would have found that, thank you so much. This is what I get for trying to just use Vim and Maven instead of a proper IDE. I'll work on getting Eclipse set up. Again, thanks a bunch. I've been poring over this for a while now and I'm both glad and embarrassed it was so simple. On Fri, Nov 2, 2012 at 11:06 AM, Dave Beech d...@paraliatech.com wrote: I think I have it. Your reducer isn't being called at all, because the signature of the reducer method doesn't match the one in AvroReducer. So, the base implementation isn't being overridden. You've stated Iterator where it should actually be Iterable. If you use Eclipse, look for a green arrow icon next to the method declaration - that means it's being overridden properly. Dave On 2 November 2012 15:55, Brian Derickson bderickso...@gmail.com wrote: I've made another gist for this rather than clutter up the mail with code snippets: https://gist.github.com/4002132 I basically just changed all instances of Pair<GenericRecord, Integer> in the reducer to just GenericRecord. I also changed the output schema that gets set in the main function. When I run this, I get a run time error that's also included in the above gist: java.lang.IllegalArgumentException: Not a Pair schema: The pom.xml file I'm using is also in this gist, in case I'm screwing up a version somewhere. My intent is to be running on CDH4 using MRv1 and Avro 1.7.1, and as far as I can tell from the pom.xml I'm doing just that. Could be mistaken. Thanks again for your time. On Thu, Nov 1, 2012 at 5:49 PM, Dave Beech d...@paraliatech.com wrote: Hi Brian I don't think the output from the reducer should be a Pair. You said you got an error when you didn't use a Pair here - what was it? Cheers, Dave On 1 November 2012 22:09, Brian Derickson bderickso...@gmail.com wrote: I've been pulling my hair out over this all day, and I'm hoping this is something simple I'm overlooking. The relevant portions of my code, the schema I'm using, and the stack trace are at https://gist.github.com/3996847. I'm using Hadoop 0.20.2 and Avro 1.7.1 as part of CDH4. To briefly describe what I'm doing: the mapper (not included in the gist) is taking a bam file and spitting out some information. The key is the chromosome and position, colon delimited, and the value is an integer. The reducer is summing up all the integers at a particular position and creating a Pair object containing a record using the schema included in my gist. The second portion of the pair is an integer that I don't care about... if I didn't use a Pair here, I'd get an error. If this is something I could do differently, please correct me. :) Every time this is run, I get the stack trace included in the gist. I've run out of things to try to fix this... I'd really really appreciate any help I can get. Thanks! -- Harsh J
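The shape of the fix, as a sketch with hypothetical types (the key details are Iterable plus @Override):

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroReducer;
import org.apache.hadoop.mapred.Reporter;

public static class SumReducer
    extends AvroReducer<GenericRecord, Integer, GenericRecord> {
  @Override // now a compile error if the signature drifts from AvroReducer's
  public void reduce(GenericRecord key, Iterable<Integer> values,
      AvroCollector<GenericRecord> collector, Reporter reporter)
      throws IOException {
    int sum = 0;
    for (int v : values) {
      sum += v;
    }
    key.put("count", sum); // "count" is a hypothetical field on the record
    collector.collect(key);
  }
}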
Re: Example of secondary sort using Avro data.
Hi Ravi, Avro questions are best asked on the user@avro lists; I've moved your question there. Take a look at Jacob's responses at http://search-hadoop.com/m/woY9Gz8Qyz1 for a detailed take on how to set up the comparators. On Tue, Oct 16, 2012 at 1:54 AM, Ravi P hadoo...@outlook.com wrote: Hello Group, Are there any sample code/documentation available on writing Map-reduce jobs with secondary sort using Avro data? -- Thanks, Ravi -- Harsh J
Re: How to convert Avro GenericRecord to AvroKeyGenericRecord?
Ravi, Moving this to the Avro user lists (user@avro.apache.org). You can simply do: AvroKey<GenericRecord> key = new AvroKey<GenericRecord>(datum); On Thu, Sep 27, 2012 at 4:51 AM, Ravi P hadoo...@outlook.com wrote: Hello, I am using Avro files for Hadoop MapReduce. My Mapper function has the following definition: Mapper<AvroKey<GenericRecord>, NullWritable, AvroKey<GenericRecord>, AvroValue<GenericRecord>> For writing unit tests for the above Mapper function I need to pass an AvroKey<GenericRecord>. How do I convert a GenericRecord to an AvroKey<GenericRecord>? Is there any example available? Thanks, Ravi -- Harsh J
Re: avrogencpp generates vector of $Undefined$ type
Hey Jan, Perhaps filing a JIRA with your reproduction steps and a sample, runnable test case will help speed this up. Seems like a genuine bug, so you should go ahead! On Tue, Aug 28, 2012 at 5:30 AM, Jan van der Lugt janl...@gmail.com wrote: It seems the $Undefined$ is coming from an AVRO_UNION type, which is also not checked in the cppTypeOf method. I could try to come up with some solution, but if someone with knowledge of this code could tell me what the issue is and why AVRO_UNION is not being handled, that would be very helpful. - Jan On Sun, Aug 26, 2012 at 9:56 PM, Jan van der Lugt janl...@gmail.com wrote: Good find! I'll take a look at this tomorrow, see if I can come up with a fix. On Sun, Aug 26, 2012 at 5:26 AM, Harsh J ha...@cloudera.com wrote: I'm not an expert on the Avro C++ implementation, but I wonder if this is because of the nulls not being checked for in http://svn.apache.org/repos/asf/avro/trunk/lang/c++/impl/avrogencpp.cc's CodeGen::cppTypeOf method. On Sun, Aug 26, 2012 at 1:54 PM, Jan van der Lugt janl...@gmail.com wrote: Hi all, Sorry to be impatient, but could someone please comment on this issue? I know that the C++ version isn't as popular as the Java version, but the whole idea is to make information exchange between applications in different languages easier, right? - Jan On Sat, Aug 18, 2012 at 12:10 AM, Jan van der Lugt janl...@gmail.com wrote: Hi all, After deciding on Apache Avro for one of the main formats for storing our graph data, I tried to integrate it with our graph processing system built in C++. If I generate a header file from the attached Avro schema using avrogencpp, I get a vector of type $Undefined$ somewhere in the generated code (see the snippet below). Is there an error in my schema or is this a bug in avrogencpp? Thanks in advance for your help! - Jan static void decode(Decoder& d, gm::gm_avro_graph_avpr_Union__5__& v) { size_t n = d.decodeUnionIndex(); if (n >= 2) { throw avro::Exception("Union index too big"); } switch (n) { case 0: d.decodeNull(); v.set_null(); break; case 1: { std::vector<$Undefined$ > vv; avro::decode(d, vv); v.set_array(vv); } break; } } -- Harsh J
Re: avro-1.5.4 jars missing
Hi Steven, You can find all releases on the main archive link: http://archive.apache.org/dist/avro/. Hope this helps! On Tue, Jul 24, 2012 at 10:15 PM, Steven Willis swil...@compete.com wrote: Hello, I was looking for the avro-1.5.4 jars today and found that they are no longer in the releases directory on the mirrors: http://www.us.apache.org/dist/avro/

[PARENTDIR] Parent Directory
[DIR] avro-1.6.3/ 2012-03-02 22:31 -
[DIR] avro-1.7.1/ 2012-07-12 19:29 -
[DIR] stable/ 2012-07-12 19:29 -
[ ] KEYS 2010-06-07 16:58 5.4K

Nor are they in the archive location: http://archive.apache.org/dist/hadoop/avro/

[PARENTDIR] Parent Directory
[DIR] avro-1.0.0/ 2009-08-06 16:22 -
[DIR] avro-1.1.0/ 2009-09-11 17:29 -
[DIR] avro-1.2.0/ 2009-10-09 21:15 -
[DIR] avro-1.3.0/ 2010-03-01 16:46 -
[DIR] avro-1.3.1/ 2010-03-16 17:30 -
[DIR] avro-1.3.2/ 2010-03-25 22:17 -
[DIR] stable/ 2010-03-25 22:17 -
[ ] KEYS 2009-07-14 20:13 2.0K

Is this an oversight, or should I be looking elsewhere? -Steven Willis -- Harsh J
Re: Avro file size is too big
Snappy is known to have a lower compression ratio than gzip, but perhaps you can try larger blocks in the Avro DataFiles as indicated in the thread, via a higher sync-interval? [1] What snappy is really good at is fast decompression, though, so perhaps your reads will compare favorably with gzipped plaintext? P.s. What do you get if you use deflate compression on the data files, with maximal compression level (9)? [2] [1] - http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setSyncInterval(org.apache.hadoop.mapred.JobConf,%20int) or http://avro.apache.org/docs/1.7.1/api/java/index.html?org/apache/avro/mapred/AvroOutputFormat.html [2] - http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setDeflateLevel(org.apache.hadoop.mapred.JobConf,%20int) or via http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/CodecFactory.html#deflateCodec(int) coupled with http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/DataFileWriter.html#setCodec(org.apache.avro.file.CodecFactory) On Thu, Jul 19, 2012 at 5:29 AM, Ey-Chih chow eyc...@gmail.com wrote: We are converting our compression scheme from gzip to snappy for our json logs. In one case, the size of a gzip file is 715MB and the corresponding snappy file is 1.885GB. The schema of the snappy file is bytes. In other words, we compress our json logs line by line, and each line is a json string. Is there any way we can optimize our compression with snappy? Ey-Chih Chow On Jul 5, 2012, at 3:19 PM, Doug Cutting wrote: You can use the Avro command-line tool to dump the metadata, which will show the schema and codec: java -jar avro-tools.jar getmeta file Doug On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh metarus...@gmail.com wrote: Hey Doug, Here is a little more of explanation http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E I'll answer your questions later after some investigation Thank you! On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting cutt...@apache.org wrote: Ruslan, This is unexpected. Perhaps we can understand it if we have more information. What Writable class are you using for keys and values in the SequenceFile? What schema are you using in the Avro data file? Can you provide small sample files of each and/or code that will reproduce this? Thanks, Doug On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh metarus...@gmail.com wrote: Hello, In my organization we are currently evaluating Avro as a format. Our concern is file size. I've done some comparisons of a piece of our data. Say we have sequence files, compressed. The payload (values) are just lines. As far as I know we use line numbers as keys, and we use the default codec for compression inside the sequence files. The size is 1.6G; when I put it into Avro with the deflate codec at deflate level 9 it becomes 2.2G. This is interesting, because the values in the seq files are just strings, while Avro has a normal schema with primitive types, and those are kept binary. Shouldn't Avro be smaller? Also, I took another dataset which is 28G (gzip files, plain tab-delimited text, don't know the deflate level), put it into Avro, and it became 38G. Why is Avro so big in size? Am I missing some size optimization? Thanks in advance! -- Harsh J
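For reference, a minimal sketch of the deflate-level-9 suggestion above; the schema file and output path are illustrative:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class DeflateWriteExample {
  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(new File("log.avsc"));
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.deflateCodec(9)); // 1 = fastest, 9 = smallest output
    writer.create(schema, new File("log.avro"));
    // writer.append(record) for each record, then:
    writer.close();
  }
}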
Re: Which jar file is for what?
The answer would depend on what you're looking to do. That said, the best way to use Avro is via Maven and the avro-maven-plugin. See http://github.com/phunt/avro-rpc-quickstart for an example of how to use it for different things. On Tue, Jul 10, 2012 at 1:00 AM, Saptarshi Guha sg...@mozilla.com wrote: Hello, In the folder http://www.eng.lsu.edu/mirrors/apache/avro/stable/java/ there is

avro-1.6.3.jar 02-Mar-2012 16:27 286K
avro-compiler-1.6.3.jar 02-Mar-2012 16:27 71K
avro-ipc-1.6.3.jar 02-Mar-2012 16:27 180K
avro-mapred-1.6.3.jar 02-Mar-2012 16:27 89K
avro-maven-plugin-1.6.3.jar 02-Mar-2012 16:27 20K
avro-protobuf-1.6.3.jar 02-Mar-2012 16:27 18K
avro-thrift-1.6.3.jar 02-Mar-2012 16:27 15K
avro-tools-1.6.3-nodeps.jar 02-Mar-2012 16:27 46K
avro-tools-1.6.3.jar 02-Mar-2012 16:27 10M

Where would I use the different JAR files? Many thanks Regards Saptarshi -- Harsh J
Re: Avro + Snappy changing blocksize of snappy compression
Hey Nikhil, When using Avro data files, you may need to tweak the writer's sync-interval to affect compression chunk sizes: http://avro.apache.org/docs/1.6.3/api/java/org/apache/avro/file/DataFileWriter.html#setSyncInterval(int) On Wed, Apr 18, 2012 at 10:53 PM, snikhil0 snik...@telenav.com wrote: I am experimenting with Avro and snappy and want to plot the size of the compressed avro data file as a function of varying compression block size. I am doing this by setting the configuration value io.compression.codec.snappy.buffersize. Unfortunately, this is not working; more precisely, for buffer sizes from 256K to 2MB I get the same size output avro (snappyfied) data file. What am I missing? Has someone had success with this? Thanks, Nikhil -- View this message in context: http://apache-avro.679487.n3.nabble.com/Avro-Snappy-changing-blocksize-of-snappy-compression-tp3920732p3920732.html Sent from the Avro - Users mailing list archive at Nabble.com. -- Harsh J
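A hedged sketch of that tweak on the writer itself; the interval value is just an example, and I believe the default is around 64 KB:

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SnappyWriters {
  // Returns a snappy writer with ~1 MB between sync markers, so each
  // compressed chunk covers more data; call create(schema, file) on the
  // result before appending records.
  public static DataFileWriter<GenericRecord> snappyWriter(Schema schema) {
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.snappyCodec());
    writer.setSyncInterval(1024 * 1024); // larger blocks for snappy to chew on
    return writer;
  }
}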
Re: Getting started with Avro + Reading from an Avro formatted file
Selvi, Expanding on Douglas' response, if you have installed Avro's python libraries (Simplest way to get the latest stable is: easy_install avro, or install from the distribution -- Post back if you need help on this), you can simply do, using the now-installed 'avro' executable: $ ls sample_input.avro $ avro cat sample_input.avro --format csv 011990-9,0,-61952400 011990-9,22,-61950600 011990-9,-11,-61948440 012650-9,111,-65553120 012650-9,78,-65550960 Or, write to a resultant file, as you would regularly in a shell: $ avro cat sample_input.avro --format csv > sample_input.csv For more options on avro's cat and write opts: $ avro --help On Tue, Jan 24, 2012 at 9:01 PM, selvi k gridsngat...@gmail.com wrote: Hello All, I would like some suggestions on where I can start in the Avro project. I want to be able to read from an Avro formatted log file (specifically the History Log file created at the end of a Hadoop job) and create a Comma Separated file of certain log entries. I need a csv file because this is the format that is accepted by the post-processing software I am working with (eg: Matlab). Initially I was using a BASH script to grep and awk from this file and create my CSV file because I needed very few values from it, and a quick script just worked. I didn't try to get to know what format the log file was in and utilize that. (my bad!) Now that I need to be scaling up and want to have a reliable way to parse, I would like to try and do it the right way. My question is this: for the above goal, could you please guide me with steps I can follow, such as reading material and libraries I could try to use. As I go through the Quick Start Guide and FAQ, I see that a lot of the information here is geared toward someone who wants to use the data serialization and RPC functionality provided by Avro. Given that I only want to be able to read, where may I start? I can comfortably script with BASH and Perl. Given that I only see support for Java, Python and Ruby, I think I can take this as an opportunity to learn Python and get up to speed. Thanks a lot. -Selvi -- Harsh J Customer Ops. Engineer, Cloudera
Re: Getting started with Avro + Reading from an Avro formatted file
[2] /avro$ sudo easy_install avro Searching for avro Best match: avro 1.6.1 Processing avro-1.6.1-py2.6.egg avro 1.6.1 is already the active version in easy-install.pth Installing avro script to /usr/local/bin Using /usr/local/lib/python2.6/dist-packages/avro-1.6.1-py2.6.egg Processing dependencies for avro Searching for python-snappy Reading http://pypi.python.org/simple/python-snappy/ Reading http://github.com/andrix/python-snappy Best match: python-snappy 0.3.2 Downloading http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f Processing python-snappy-0.3.2.tar.gz Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-c6jLm0/python-snappy-0.3.2/egg-dist-tmp-TTWQBN cc1plus: warning: command line option -Wstrict-prototypes is valid for Ada/C/ObjC but not for C++ snappymodule.cc:31:22: error: snappy-c.h: No such file or directory snappymodule.cc: In function ‘PyObject* snappy__compress(PyObject*, PyObject*)’: snappymodule.cc:62: error: ‘snappy_status’ was not declared in this scope snappymodule.cc:62: error: expected ‘;’ before ‘status’ snappymodule.cc:75: error: ‘snappy_max_compressed_length’ was not declared in this scope snappymodule.cc:79: error: ‘status’ was not declared in this scope snappymodule.cc:79: error: ‘snappy_compress’ was not declared in this scope snappymodule.cc:81: error: ‘SNAPPY_OK’ was not declared in this scope snappymodule.cc: In function ‘PyObject* snappy__uncompress(PyObject*, PyObject*)’: snappymodule.cc:107: error: ‘snappy_status’ was not declared in this scope snappymodule.cc:107: error: expected ‘;’ before ‘status’ snappymodule.cc:120: error: ‘status’ was not declared in this scope snappymodule.cc:120: error: ‘snappy_uncompressed_length’ was not declared in this scope snappymodule.cc:121: error: ‘SNAPPY_OK’ was not declared in this scope snappymodule.cc:128: error: ‘snappy_uncompress’ was not declared in this scope snappymodule.cc:129: error: ‘SNAPPY_OK’ was not declared in this scope snappymodule.cc: In function ‘PyObject* snappy__is_valid_compressed_buffer(PyObject*, PyObject*)’: snappymodule.cc:151: error: ‘snappy_status’ was not declared in this scope snappymodule.cc:151: error: expected ‘;’ before ‘status’ snappymodule.cc:156: error: ‘status’ was not declared in this scope snappymodule.cc:156: error: ‘snappy_validate_compressed_buffer’ was not declared in this scope snappymodule.cc:157: error: ‘SNAPPY_OK’ was not declared in this scope snappymodule.cc: At global scope: snappymodule.cc:41: warning: ‘_state’ defined but not used error: Setup script exited with error: command 'gcc' failed with exit status 1 [3] python$ sudo easy_install python-snappy Searching for python-snappy Reading http://pypi.python.org/simple/python-snappy/ Reading http://github.com/andrix/python-snappy Best match: python-snappy 0.3.2 Downloading http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f Processing python-snappy-0.3.2.tar.gz Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-Hpzssm/python-snappy-0.3.2/egg-dist-tmp-UStJPW gcc: error trying to exec 'cc1plus': execvp: No such file or directory error: Setup script exited with error: command 'gcc' failed with exit status 1 On Tue, Jan 24, 2012 at 11:01 AM, Harsh J ha...@cloudera.com wrote: Selvi, Expanding on Douglas' response, if you have installed Avro's python libraries (Simplest way to get latest stable is: easy_install avro, or install from the 
distribution -- Post back if you need help on this), you can simply do, using the now-installed 'avro' executable: $ ls sample_input.avro $ avro cat sample_input.avro --format csv 011990-9,0,-61952400 011990-9,22,-61950600 011990-9,-11,-61948440 012650-9,111,-65553120 012650-9,78,-65550960 Or, write to a resultant file, as you would regularly in a shell: $ avro cat sample_input.avro --format csv > sample_input.csv For more options on avro's cat and write opts: $ avro --help On Tue, Jan 24, 2012 at 9:01 PM, selvi k gridsngat...@gmail.com wrote: Hello All, I would like some suggestions on where I can start in the Avro project. I want to be able to read from an Avro formatted log file (specifically the History Log file created at the end of a Hadoop job) and create a Comma Separated file of certain log entries. I need a csv file because this is the format that is accepted by the post-processing software I am working with (eg: Matlab). Initially I was using a BASH script to grep and awk from this file and create my CSV file because I needed very few values from it, and a quick script just worked. I didn't try to get to know what format the log file was in and utilize that. (my bad!)
Re: Decode without using DataFileReader
I do not understand what you're trying to achieve here. Encoders work at the primitive level - they merely serialize a given data structure (records and unions, for example) and do not look at the schema (notice that you create a record with a schema, not an encoder with a schema). Decoders could do the same and read back primitives, but if they had a schema they'd read back properly packed data structures. Since encoders do not store the schema, decoders need it externally. DataFiles solve this for you by writing the schema itself into the file as a header. The reader loads this schema into the decoder when it attempts to read the file back. On 05-Dec-2011, at 11:43 PM, Gaurav wrote: it makes no sense for the encoder to store schema for every given record, into a stream. Agreed. It's not even the encoder/decoder's job to store the schema. While writing data, I noticed that we don't even need DataFileWriter; all it needs is a GenericDatumWriter, an Encoder and any kind of output stream (which can also be a file output stream). Sample:

private static ByteArrayOutputStream EncodeData() throws IOException {
    Schema schema = createMetaData();
    GenericDatumWriter<GenericData.Record> datum =
        new GenericDatumWriter<GenericData.Record>(schema);
    GenericData.Record inner_record =
        new GenericData.Record(schema.getField("trade").schema());
    inner_record.put("inner_abc", new Long(23490843));
    GenericData.Record record = new GenericData.Record(schema);
    record.put("abc", 1050324);
    record.put("trade", inner_record);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = ENCODER_FACTORY.binaryEncoder(out, null);
    datum.write(record, encoder);
    encoder.flush();
    out.close();
    return out;
}

Then why can't I just use the same output stream to read back the metadata and data? It should not be the responsibility of the stream reader (which in this case is served by DataFileReader) to parse out the schema. Thanks, Gaurav Nanda -- View this message in context: http://apache-avro.679487.n3.nabble.com/Decode-without-using-DataFileReader-tp3561722p3562127.html Sent from the Avro - Users mailing list archive at Nabble.com.
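To complete the picture, a hedged sketch of the matching read path for Gaurav's EncodeData() output; the point being that the reader must be handed the same schema out-of-band, exactly as described above:

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class DecodeExample {
  // 'bytes' would be out.toByteArray() from EncodeData(); 'schema' the same
  // Schema that built the records, supplied externally since the raw binary
  // stream does not carry it.
  public static GenericData.Record decodeData(byte[] bytes, Schema schema)
      throws IOException {
    GenericDatumReader<GenericData.Record> reader =
        new GenericDatumReader<GenericData.Record>(schema);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    return reader.read(null, decoder);
  }
}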
Re: Avro and Hadoop streaming
Miki, You'll need to provide the entire canonical class name (org.apache.avro.mapred…). On Wed, Jun 15, 2011 at 5:31 AM, Miki Tebeka miki.teb...@gmail.com wrote: Greetings, I've tried to run a job with the following command: hadoop jar ./hadoop-streaming-0.20.2-cdh3u0.jar \ -input /in/avro \ -output $out \ -mapper avro-mapper.py \ -reducer avro-reducer.py \ -file avro-mapper.py \ -file avro-reducer.py \ -cacheArchive /cache/avro-mapred-1.6.0-SNAPSHOT.jar \ -inputformat AvroAsTextInputFormat However I get -inputformat : class not found : AvroAsTextInputFormat I'm probably missing something obvious to do. Any ideas? Thanks! -- Miki On Fri, Jun 3, 2011 at 1:43 AM, Doug Cutting cutt...@apache.org wrote: Miki, Have you looked at AvroAsTextInputFormat? http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/AvroAsTextInputFormat.html Also, release 1.5.2 will include AvroTextOutputFormat: https://issues.apache.org/jira/browse/AVRO-830 Are these perhaps what you're looking for? Doug On 06/02/2011 11:30 PM, Miki Tebeka wrote: Greetings, I'd like to use hadoop streaming with Avro files. My plan is to write an inputformat class that emits json records, one per line. This way the streaming application can read one record per line. (http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#Specifying+Other+Plugins+for+Jobs) I couldn't find any documentation/help about writing inputformat classes. Can someone point me to the right direction? Thanks, -- Miki -- Harsh J
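For the record, going by the package visible in Doug's javadoc URL above, the fully-qualified flag would read:

-inputformat org.apache.avro.mapred.AvroAsTextInputFormat

(with the avro-mapred jar still shipped to the cluster via -cacheArchive, as in the original command.)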
Re: could I add a field Map
You can use Maps as long as their key type is limited to strings, I think. Map<String, X> is alright (X should also be Avro-acceptable, of course..). On Fri, Apr 8, 2011 at 7:49 PM, Weishung Chung weish...@gmail.com wrote: I am using Apache Avro in my project and was wondering: could it be possible to add a Map field (TreeMap)? I know that we can use Array and it works, but I would like to be able to get by key :) Thank you, Wei Shung -- Harsh J http://harshj.com
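As a small illustration of the string-key constraint through the Java API (the long value type here is arbitrary):

import org.apache.avro.Schema;

public class MapSchemaExample {
  public static void main(String[] args) {
    // Avro map schemas only declare the value type; keys are always strings
    Schema mapSchema = Schema.createMap(Schema.create(Schema.Type.LONG));
    System.out.println(mapSchema); // prints {"type":"map","values":"long"}
  }
}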
Re: How to direct Reducer to write avro objects to avro sequence file?
By 'Avro sequence files' do you mean Avro data files? The Avro-Mapred classes right now only support the older, stable API (which has been undeprecated in 0.20.3 and is supported in 0.21 as well - no worries in using it, really). There is AVRO-593 that tracks a new-API implementation of Avro's mapred support (but it should be fairly easy to write your own wrappers for these after a bit of reading, since the changes are mostly superficial). On Fri, Mar 11, 2011 at 11:24 AM, Aleksey Maslov aleksey.mas...@lab49.com wrote: Hi, (using hadoop 0.20.2 and avro 1.4.1) I have defined a simple avro object 'AvroObj' (a record of strings), compiled the schema and set up a simple MR job that takes as input <Object, Text> and emits <Text, IntWritable>, and a reducer that takes said <Text, IntWritable> and ... What I would like to achieve is: have the reducer emit <NullWritable, AvroObj> pairs into an avro sequence file, so the next MR job will open that avro file and read in avro objects, not text lines, out of it. I have looked through the (Hadoop, 2nd ed.) book and a few online samples but can't figure out how to do it; some online sources mention job config settings like: job.setOutputFormatClass(AvroOutputFormat.class); AvroOutputFormat.setCompressOutput(conf, false); But this doesn't compile - setCompressOutput asks for the deprecated JobConf object, and setOutputFormatClass gives an error about its param - param not applicable to AvroOutputFormat.class. Could someone enlighten me how to have the reducer write to an avro sequence file? Cheers; -- View this message in context: http://apache-avro.679487.n3.nabble.com/How-to-direct-Reducer-to-write-avro-objects-to-avro-sequence-file-tp2663706p2663706.html Sent from the Avro - Users mailing list archive at Nabble.com. -- Harsh J www.harshj.com
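A hedged sketch of the old-API wiring described above; 'AvroObj' is the poster's generated record class, the driver class is hypothetical, and AvroJob may set the output format for you anyway, so the explicit call is just belt-and-braces:

import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroOutputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyJob.class);
    // SCHEMA$ is the public static Schema field on an Avro-generated class
    AvroJob.setOutputSchema(conf, AvroObj.SCHEMA$);
    conf.setOutputFormat(AvroOutputFormat.class);
    // note: the old-API FileOutputFormat takes the JobConf that tripped up
    // the new-API call the poster tried
    FileOutputFormat.setCompressOutput(conf, false);
    // set mapper/reducer and input/output paths, then JobClient.runJob(conf)
  }
}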
Importing into Eclipse
Hello, I updated my svn clone of Avro after quite a while and noticed that the Java build has moved from Ant to Maven. I was not very familiar with Maven-based projects yet, but I got some reading done and am able to use it now. But I have a remaining question that I was not able to solve: how do I ask it to generate Eclipse project files? I liked the 'eclipse' or 'eclipse-files' target in the earlier Ant-based build system of Avro, which easily generated Eclipse project files. But when I try mvn install; mvn eclipse:eclipse it fails for the tools package (in the eclipse part; the install goes fine). I do not have enough experience with Maven to know if it is a mistake I'm making or a fault in the build files related to Maven. Any help with creating Eclipse project files for Avro's Java sub-project? -- Harsh J www.harshj.com
Re: How to get started with examples on avro
Based on the language you're targeting, have a look at its test cases available in the project's version control: http://svn.apache.org/repos/asf/avro/trunk/lang/ [You can check it out via SVN, or via Git mirrors] Another good resource on both ends of Avro (Data and RPC) is by phunt at http://github.com/phunt/avro-rpc-quickstart#readme I had written a python data-file centric snippet for Avro a while ago on my blog; it may help if you're looking to get started with Python (although it does not cover all aspects, which the functions in the available test cases for lang/python do): http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/ On Sat, Jan 29, 2011 at 1:34 AM, felix gao gre1...@gmail.com wrote: Hi all, I am trying to convert a lot of our existing logs into avro format in hadoop. I am not sure if there are any examples to follow. Thanks, Felix -- Harsh J www.harshj.com
Re: How to get started with examples on avro
On Sat, Jan 29, 2011 at 1:59 AM, felix gao gre1...@gmail.com wrote: Thanks for the quick reply. I am interested in doing this through the Java implementation, and I would like to do it in parallel, utilizing the MapReduce framework. That operation is pretty similar to writing a normal output data file. You can use the MapReduce API of Avro (which provides Input/Output Format classes to use, given a Schema) to do so, or write your own custom record-writing classes that convert your input format's record representation to Avro-serialized records and write those out to an open DataFile for a given schema. Alternatively, you can also write Avro-serialized data bytes into SequenceFiles. I believe the Hadoop MapReduce trunk may have some good code on Avro serialization classes and their use in MapReduce. On Fri, Jan 28, 2011 at 12:22 PM, Harsh J qwertyman...@gmail.com wrote: Based on the language you're targeting, have a look at its test cases available in the project's version control: http://svn.apache.org/repos/asf/avro/trunk/lang/ [You can check it out via SVN, or via Git mirrors] Another good resource on both ends of Avro (Data and RPC) is by phunt at http://github.com/phunt/avro-rpc-quickstart#readme I had written a python data-file centric snippet for Avro a while ago on my blog; it may help if you're looking to get started with Python (although it does not cover all aspects, which the functions in the available test cases for lang/python do): http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/ On Sat, Jan 29, 2011 at 1:34 AM, felix gao gre1...@gmail.com wrote: Hi all, I am trying to convert a lot of our existing logs into avro format in hadoop. I am not sure if there are any examples to follow. Thanks, Felix -- Harsh J www.harshj.com -- Harsh J www.harshj.com
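As a rough sketch of the second option (custom conversion into a data file), using the 1.4-era Schema.parse API; the schema, field name, and input lines are all illustrative:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class LogToAvro {
  public static void main(String[] args) throws IOException {
    Schema schema = Schema.parse(
        "{\"type\":\"record\",\"name\":\"LogLine\","
        + "\"fields\":[{\"name\":\"line\",\"type\":\"string\"}]}");
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("logs.avro"));
    for (String line : new String[] {"a log line", "another one"}) {
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("line", line); // one record per raw log line
      writer.append(rec);
    }
    writer.close();
  }
}

The same conversion logic would sit inside a map task when parallelized, with the MapReduce API handling the file writing instead of an explicit DataFileWriter.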
Re: Avro Python appending data
Hi, On Thu, Dec 23, 2010 at 6:29 AM, felix gao gre1...@gmail.com wrote: Hi all, I am having trouble adding more data into a file. Environment: Python 2.6.5, avro-1.3.3-py2.6 Program looks like this I see you've read my blog post on Avro+Python :P http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/ if I remove the second write_avro_file() call then everything is fine. How do I properly append more data into the file? To append to an existing datafile, do not initialize the writer object with a writers_schema again. Just create it using:

df_writer = datafile.DataFileWriter(
    open(OUTFILE_NAME, 'wb'),
    io.DatumWriter(),
)

-- Harsh J www.harshj.com
Re: Avro Python appending data
Sorry, minor error: not 'wb', but 'ab+'.

df_writer = datafile.DataFileWriter(
    open(OUTFILE_NAME, 'ab+'),
    io.DatumWriter(),
)

-- Harsh J www.harshj.com
Parsing Anonymous Schema
Hello, Is it possible to parse an anonymous record schema (back into a proper Schema object)? If I've created an anonymous record schema, I'm getting an error ("No name found", via Schema.parse() of the Java API) when I de-serialize its JSON to form a Schema object again. Is this the intended behavior, or is it a bug? -- Harsh J www.harshj.com