Re: Map with another map inside (unpredictable naming)

2017-03-27 Thread Harsh J
The union you currently have for the map values can certainly carry another
map type within it. Here's how you'd probably want to define it:

{
  "name": "metadata",
  "type": {
    "type": "map",
    "values": [
      "null",
      "int",
      "float",
      "string",
      "boolean",
      "long",
      {
        "type": "map",
        "values": [
          "null",
          "int",
          "float",
          "string",
          "boolean",
          "long"
        ]
      }
    ]
  }
}

P.S. It's generally better to specify "null" as the first union type:
http://avro.apache.org/docs/current/spec.html#Unions
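
To build a matching datum in Java, something along these lines should work
(an untested sketch; plain java.util maps are what the generic API expects
for a map schema):

import java.util.HashMap;
import java.util.Map;

// Values of the "metadata" map can be primitives or another map,
// since both are members of the union defined above.
Map<String, Object> inner = new HashMap<String, Object>();
inner.put("some_attr", "some_value");

Map<String, Object> metadata = new HashMap<String, Object>();
metadata.put("some_attr", "some_value");
metadata.put("some_map_with_unpredictable_name", inner);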

On Mon, 27 Mar 2017 at 16:47 Dag Stockstad  wrote:

> Hi Avro aficionados,
>
> I'm having trouble serializing a record with a nested map structure i.e. a
> map within a map. The record I'm trying to send has the following structure:
> {
> "event_type": "some_type",
> "data": {
> "id": "2f720f90-ea06-4248-a72e-01eea44981ed",
> "metadata": {
> "some_attr": "some_value",
> "some_map_with_unpredictable_name": {
> "some_attr": "some_value"
> }
> }
> }
> }
>
> And the schema is this:
> {
> "namespace": "org.example.event.avro",
> "type": "record",
> "name": "EventNotification",
> "fields": [{
> "name": "event_type",
> "type": "string"
> }, {
> "name": "data",
> "type": {
> "type": "record",
> "name": "EventData",
> "fields": [{
> "name": "id",
> "type": "string"
> }, {
> "name": "metadata",
> "type": {
> "type": "map",
> "values": [
> "int",
> "float",
> "string",
> "boolean",
> "long",
> "null"
> ]
> }
> }]
> }
> }]
> }
>
> The nested map (some_map_with_unpredictable_name) is causing problems
> (serialization error). Is there any way I can have another map as a value
> in the metadata map?
>
> Due to the nature of the system, I cannot 100% predict the structure of
> the metadata field. Can Avro accommodate these requirements or do I have to
> fall back on something such as JSON for this one?
>
> Help very appreciated (I'm a bit stuck).
>
> Kind regards,
> Dag
>
>


Re: avro-tools tojson where avro file in HDFS

2015-09-10 Thread Harsh J
The avro-tools jar is usually a standalone (fat) one, and running such a
standalone variant with 'hadoop jar' may cause classpath pollution, since
Hadoop also includes a (likely different) version of Avro on its runtime
classpath.

Run it instead this way:

export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CLASSPATH=avro-tools-1.7.7.jar
hadoop jar avro-tools-1.7.7.jar …

Or if you are certain avro-tools is built against a compatible hadoop
server version, simply run:

java -jar avro-tools-1.7.7.jar …

On Fri, Sep 4, 2015 at 8:26 PM Ashish Rastogi (BLOOMBERG/ 731 LEX) <
arastog...@bloomberg.net> wrote:

> Hi Avro users,
>
> I'm using avro-tools-1.7.7.jar, and would like to print records to stdout
> using the "tojson" option. I want to do this with my avro files in HDFS
> (and not on the local file system). I thought AVRO-867 (
> https://issues.apache.org/jira/browse/AVRO-867) would allow me to do
> this. However, I get the following exception when I run:
>
> $ hadoop jar avro-tools-1.7.7.jar tojson hdfs://
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.avro.io.EncoderFactory.jsonEncoder(Lorg/apache/avro/Schema;Ljava/io/OutputStream;Z)Lorg/apache/avro/io/JsonEncoder;
> at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:76)
> at org.apache.avro.tool.Main.run(Main.java:84)
> at org.apache.avro.tool.Main.main(Main.java:73)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>
> Anyone knows what this means? Any help is appreciated.
>
> As a confirmation, I'm able to use avro-tools after I -copyToLocal this
> avro file to view these records.
>
> Thanks,
> Ashish
>


Re: Is Avro Splittable?

2015-07-18 Thread Harsh J
Could you also link to the articles that claim Avro containers are not
splittable? It'd be good to correct them to avoid this confusion.

On Thu, Jun 25, 2015 at 11:25 AM Ankur Jain ankur.j...@yash.com wrote:

  Hello,



 I am reading various forms and docs, somewhere it is mentioned that avro
 is splittable and somewhere non-splittable.

 So which one is right??



 Regards,

 Ankur





Re: Binary output in MR job

2014-08-16 Thread Harsh J
You're not explicitly specifying a relevant form of AvroOutputFormat.
This causes the default TextOutputFormat to be used, giving you a JSON
representation of the records.
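
For example, with the new (org.apache.avro.mapreduce) API your job below
appears to use, the relevant bit is just declaring the output format class.
An untested sketch, with made-up job/schema names:

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AvroBinaryOutputSketch {
  public static Job configure(Configuration conf, Schema outputSchema) throws Exception {
    Job job = Job.getInstance(conf, "avro binary output");
    AvroJob.setOutputKeySchema(job, outputSchema);
    // Without this, the default TextOutputFormat is used and the records
    // come out as their JSON toString() form rather than binary Avro.
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    return job;
  }
}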

On Sat, Aug 16, 2014 at 3:07 PM, Anand Nalya anand.na...@gmail.com wrote:
 Hi,

 I'm writing a MR 2 job in which I'm reading plain text as input and
 producing avro output. On running the job in local mode, the output is being
 serialized into json format. What can I do so that the output uses binary
 encoding. Following is my job definition:

 Job job = new Job(getConf(), Post convertor);
 job.setJarByClass(getClass());

 AvroJob.setOutputKeySchema(job, Post.getClassSchema());
 AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.LONG));
 AvroJob.setMapOutputValueSchema(job, Post.getClassSchema());

 FileInputFormat.addInputPath(job, new Path(args[0]));
 FileOutputFormat.setOutputPath(job, new Path(args[1]));

 job.setMapperClass(PostMapper.class);
 job.setReducerClass(PostReducer.class);

 Regards.
 Anand



-- 
Harsh J


Re: Error when trying to convert a local datafile to plain text with Avro Tools

2014-07-19 Thread Harsh J
It's useful to have plaintext data in compressed avro files in HDFS for
MR/etc. processing, since the container format allows splitting. The
'totext'/'fromtext' feature was added originally via AVRO-567.

You may instead be looking for the avro (avrocat) tool? You can obtain
it by installing the Python 'avro' package (easy_install avro, or pip
install avro) and by then running the 'avro' command. It allows
configurable forms of text transformation from regular Avro data
files.

On Mon, Jul 14, 2014 at 10:52 AM, julianpeeters julianpeet...@gmail.com wrote:
 Hi,

 I'm exploring the human-readable avro options in the avro-tools jar, namely
 `tojson` and `totext`.

 `tojson` works fine, but I try `totext` with:

 `$ java -jar avro-tools-1.7.6.jar totext twitter.avro twitter.txt`,

 then twitter.txt is empty and I get this error:

 Jul 13, 2014 8:41:19 PM org.apache.hadoop.util.NativeCodeLoader clinit
 WARNING: Unable to load native-hadoop library for your platform... using
 builtin-java classes where applicable
 Avro file is not generic text schema


 What am I doing wrong?

 Thanks for looking,
 -Julian

 PS (Looking into the source, it looks like this error is thrown when the
 schema in the datafile is not equal to the string \bytes\, but I have a
 hard time understanding why the datafile's schema would ever be that.)








-- 
Harsh J


Re: Passing Schema objects through Avro RPC

2014-06-21 Thread Harsh J
Hey Joey,

Are you perhaps looking for https://issues.apache.org/jira/browse/AVRO-251?

On Sat, Jun 21, 2014 at 2:20 AM, Joey Echeverria j...@cloudera.com wrote:
 Has anyone built an Avro RPC interface that includes a method that returns
 Avro Schema objects?

 I've built my protocol using the reflect APIs, but because Schema doesn't
 have an empty constructor, I get a NoSuchMethodException when trying to
 deserialize on the client.

 -Joey



-- 
Harsh J


Re: How to dynamically create a MapSchema with String as key, not Utf8

2014-05-19 Thread Harsh J
You can pass -string to the avro-tools compile program, to make the
generated classes use String/CharSequence and not Utf8.

On Mon, May 19, 2014 at 1:37 PM, Fengyun RAO raofeng...@gmail.com wrote:
 I've noticed the jira page: https://issues.apache.org/jira/browse/AVRO-803,
 and known how to generate a Map<String, MyType> class using a schema file.

 In my case, the schema file is “MyType.avsc”, and I used avro-tools to
 generate a MyType.java class,
 My question is how to dynamically create a MapSchema of Map<String, MyType>,
 since I have to Ser/De a Map<String, MyType>.

 I tried to use the method Schema.createMap(Schema myType), but the key is
 Utf8 not String.



-- 
Harsh J


Re: How to dynamically create a MapSchema with String as key, not Utf8

2014-05-19 Thread Harsh J
Ah my bad, I didn't realise you wanted to infer/generate the schema in
code. I believe you may be looking for this static method perhaps:
http://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/generic/GenericData.html#setStringType(org.apache.avro.Schema,
org.apache.avro.generic.GenericData.StringType)
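
For instance (an untested sketch, assuming MyType is the class generated
from MyType.avsc):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

// Build the map schema dynamically and mark its keys to be read as
// java.lang.String instead of Utf8.
Schema mapSchema = Schema.createMap(MyType.getClassSchema());
GenericData.setStringType(mapSchema, GenericData.StringType.String);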

On Mon, May 19, 2014 at 4:03 PM, Fengyun RAO raofeng...@gmail.com wrote:
 Yes, I knew it. MyType.java was compiled using -string.

 My question is how to generate schema of Map<String, MyType>, providing
 that I only have MyType.avsc file, but not the map schema file.

 Schema.createMap(Schema myType) method generates a Map<Utf8, MyType>, not
 Map<String, MyType>.


 2014-05-19 18:16 GMT+08:00 Harsh J ha...@cloudera.com:

 You can pass -string to the avro-tools compile program, to make the
 generated classes use String/CharSequence and not Utf8.

 On Mon, May 19, 2014 at 1:37 PM, Fengyun RAO raofeng...@gmail.com wrote:
  I've noticed the jira page:
  https://issues.apache.org/jira/browse/AVRO-803,
  and known how to generate a MapString, MyType class using a schema
  file.
 
  In my case, the schema file is “MyType.avsc”, and I used avro-tools to
  generate a MyType.java class,
  My question is how to dynamically create a MapSchema of MapString,
  MyType,
  since I have to Ser/De a MapString, MyType.
 
  I tried to use the method Schema.createMap(Schema myType), but the key
  is
  Utf8 not String.



 --
 Harsh J





-- 
Harsh J


Re: Record field names

2014-05-19 Thread Harsh J
Hi Yael,

The field name requirements are defined in the Avro specification:
http://avro.apache.org/docs/current/spec.html#Names

On Tue, May 20, 2014 at 1:15 AM, yael aharon yael.aharo...@gmail.com wrote:
 Hello,
 Recently, I noticed that the name of generic record fields are very
 restrictive. Only alphanumeric characters and underscores are allowed.
 Could someone shed some light on why is this restriction?
 thanks, Yael



-- 
Harsh J


Re: avro_file_writer_sync

2014-04-30 Thread Harsh J
Read more about sync markers in the specification part that explains
the data file format:
http://avro.apache.org/docs/current/spec.html#Object+Container+Files

I don't think the call helps with your situation though - that error
is usually seen for badly written/produced avro data files, or
possibly because of some specific bug. I've asked for more details on
your other thread.

On Sun, Apr 20, 2014 at 10:35 PM, amit nanda amit...@gmail.com wrote:
 Hi,

 I see this function avro_file_writer_sync in the io.h file in the Avro C
 Library; what does this function do?

 I am facing Invalid Sync! for some of the files, can this function help me
 in that?

 Thanks
 Amit



-- 
Harsh J


Re: Cannot decode file block: Error decompressing block with deflate, possible data error

2014-04-30 Thread Harsh J
Are all your files running into this, or just a few among the
C-written ones? The file is possibly corrupt because of missing bytes or
other reasons (hard to tell specifically). Having more info on the
behaviour, or a sample file to inspect, may help.

On Fri, Apr 18, 2014 at 3:57 PM, amit nanda amit...@gmail.com wrote:
 Hi,

 While reading a file using avrocat tool, i am getting errors *Cannot decode
 file block: Error decompressing block with deflate, possible data error.*

 And when using avro-tools-1.7.6.jar i get Invalid Sync! errors.

 Can anyone please tel me why these errors are coming, and is there a way to
 correct the file now without loosing any data?

 File was created using 1.7.4 C library

 Thanks
 Amit



-- 
Harsh J


Re: Saving arbitrary data to avro files

2014-03-04 Thread Harsh J
You can do this, sure. You just need a schema of string type or something
similar.

Are you not concerned about the read time of the data you plan to store as
strings? Typically you write once and read more than once during processing.

Storing the data types in proper serialized form would help greatly during
reads.
On Mar 4, 2014 8:54 AM, yael aharon yael.aharo...@gmail.com wrote:

 Hello,
 I am writing a C++ library that stores arbitrary data to avro files.
 The schema is given to me through my library's API.
 All the data is given to me in the form of strings; including integers,
 doubles, etc.
 Is there a way for me to store this data to avro files without converting
 the strings to the correct types first? I am concerned about the
 performance impact that this conversion would have
 thanks, Yael



Re: Multiple inputs for different avro inputs

2014-02-27 Thread Harsh J
 One more doubt: Why we don't have AvroMultipleInputs just like 
 AvroMultipleOutputs? Any reason?

This (and other) question belongs to Apache Avro's list
(user@avro.apache.org). Moving user@hadoop to bcc:.

For AvroMultipleInputs, see https://issues.apache.org/jira/browse/AVRO-1439

On Thu, Feb 27, 2014 at 2:43 AM, AnilKumar B akumarb2...@gmail.com wrote:
 Hi,

 I am using MultipleInputs to read two different avro inputs with different
 schemas.

 But in run method, as we need to specify the
 AvroJob.setInputKeySchema(job,schema),

 Which schema I need to set?

 I tried as below

 ListSchema schemas = new ArrayListSchema();
 schemas.add(FlumeEvent.SCHEMA$);
 schemas.add(Event.SCHEMA$);
 AvroJob.setInputKeySchema(testJob, Schema.createUnion(schemas));

 But I am facing issue while Map phase
 Error: org.apache.avro.AvroTypeException: Found Event, expecting union

 How to fix this issue?

 One more doubt: Why we don't have AvroMultipleInputs just like
 AvroMultipleOutputs? Any reason?

 Thanks  Regards,
 B Anil Kumar.



-- 
Harsh J


Re: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to java.util.Map

2014-02-08 Thread Harsh J
 at
 org.apache.avro.generic.GenericDatumWriter.getMapSize(GenericDatumWriter.java:194)
 at
 org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:173)
 at
 org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:69)
 at
 org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143)
 at
 org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:138)
 at
 org.apache.avro.reflect.ReflectDatumWriter.writeArray(ReflectDatumWriter.java:64)
 at
 org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68)
 at
 org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143)
 at
 org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
 at
 org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:175)
 at
 org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
 at
 org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
 at
 org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143)
 at
 org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
 at
 org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:175)
 at
 org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
 at
 org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
 at
 org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143)
 at
 org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
 at
 org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:290)



 Thanks  Regards,
 B Anil Kumar.



-- 
Harsh J


Re: Avro MapReduce (MR1): Prevent Key from being output by reducer when using Pair schema

2014-01-16 Thread Harsh J
Hello Ed,

The AvroReducer per
http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroReducer.html
has a simple spec of <K, V, OUT>, where OUT can be any record type and
not necessarily a Pair<KO, VO> type.

AvroJob.setOutputSchema(…) should accept non-pair configs. I think its
java-doc is incorrect though. I wrote a test case yesterday at
http://issues.apache.org/jira/browse/AVRO-1439, in which I set a
non-Pair schema via the same call without any trouble. We could get
the java-doc fixed, if it is indeed wrong.
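
In other words, something like the below should be accepted (an untested
sketch with the old org.apache.avro.mapred API; MyRecord stands in for
whatever record the reducer emits). Note, as the follow-up on this thread
works out, that the intermediate map output schema then needs to be set
explicitly as a Pair:

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.JobConf;

public class NonPairOutputSketch {
  public static void configure(JobConf job, Schema myRecordSchema) {
    // Final output: plain records, with no extra "key" field wrapped in.
    AvroJob.setOutputSchema(job, myRecordSchema);
    // The map output still needs its own Pair schema; otherwise it defaults
    // to the (non-Pair) output schema and fails at runtime.
    AvroJob.setMapOutputSchema(job,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING), myRecordSchema));
  }
}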

On Thu, Jan 16, 2014 at 2:14 PM, ed edor...@gmail.com wrote:
 Hello,

 I am currently reading in lots of small avro files and then writing them out
 into one large avro file using Map Reduce MR1.  I'm trying to do this using
 the AvroMapper and AvroReducer and it's almost working how I want.

 The problem right now is that it looks like I have to use
 org.apache.avro.mapred.Pair if I use AvroJob.setOutputSchema.  Is there
 a way to output a Pair schema from AvroReducer and have the key in that
 schema be ignored (i.e., not included in the output from the reducer)?
 Right now when I check the Reducer output there is an added field in each
 record called key which I'd like to not have there.

 Essentially I'm looking for something like NullWritable where the key will
 just be ignored in the final output.

 Thank you for any assistance or guidance you can provide!

 Best Regards,

 Ed



-- 
Harsh J


Re: Avro MapReduce (MR1): Prevent Key from being output by reducer when using Pair schema

2014-01-16 Thread Harsh J
Thanks Ed! Can you also file an improvement JIRA under
https://issues.apache.org/jira/browse/AVRO with a patch that changes
it to make more sense?

On Thu, Jan 16, 2014 at 5:14 PM, ed edor...@gmail.com wrote:
 Hi Harsh,

 Thank you for your response which was invaluable in helping me to figure out
 my issue.  The Java-Doc is in fact incorrect when it states that
 AvroJob.setOutputSchema cannot accept non-Pair configs as it turns out it
 can.  What was throwing me off is that if you use AvroJob.setOutputSchema to
 set a non-Pair config, then you also need to call AvroJob.setMapOutputSchema
 (which does require the use of Pair).  Otherwise, by default, the map output
 schema gets set to whatever you set in setOutputSchema and if that is
 non-pair you'll get an error at runtime.

 Maybe the JavaDoc should say something along the lines of:

 Configure a job's output schema. If this is a not a Pair-schema then you
 must explicitly set the job's map output schema using setMapOutputSchema


 Thank you!

 Best Regards,

 Ed




 On Thu, Jan 16, 2014 at 6:47 PM, Harsh J ha...@cloudera.com wrote:

 Hello Ed,

 The AvroReducer per

 http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroReducer.html
 has a simple spec of K,V,OUT, where OUT can be any record type and
 not necessarily a PairKO,VO type.

 AvroJob.setOutputSchema(…) should accept non-pair configs. I think its
 java-doc is incorrect though. I wrote a test case yesterday at
 http://issues.apache.org/jira/browse/AVRO-1439, in which I set a
 non-Pair schema via the same call without any trouble. We could get
 the java-doc fixed, if it is indeed wrong.

 On Thu, Jan 16, 2014 at 2:14 PM, ed edor...@gmail.com wrote:
  Hello,
 
  I am currently reading in lots of small avro files and then writing them
  out
  into one large avro file using Map Reduce MR1.  I'm trying to do this
  using
  the AvroMapper and AvroReducer and it's almost working how I want.
 
  The problem right now is that it looks like I have to use
  org.apache.avro.mapred.Pair if I use AvroJob.setOutputSchema.  Is
  there
  a way to output a Pair schema from AvroReducer and have the key in
  that
  schema be ignored (i.e., not included in the output from the reducer)?
  Right now when I check the Reducer output there is an added field in
  each
  record called key which I'd like to not have there.
 
  Essentially I'm looking for something like NullWritable where the key
  will
  just be ignored in the final output.
 
  Thank you for any assistance or guidance you can provide!
 
  Best Regards,
 
  Ed



 --
 Harsh J





-- 
Harsh J


Re: Nullable Fields

2014-01-16 Thread Harsh J
Hi Alparslan,

Are you looking for the Nullable annotation for ReflectData based
schemas: 
http://avro.apache.org/docs/1.7.5/api/java/org/apache/avro/reflect/Nullable.html?
Sample usage here:
https://github.com/apache/avro/blob/release-1.7.5/lang/java/avro/src/test/java/org/apache/avro/reflect/TestReflect.java#L324
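
A minimal sketch of what that looks like on a reflect-mapped class (field and
class names made up here):

import org.apache.avro.reflect.Nullable;

public class SensorReading {
  private long timestamp;     // regular field, non-null in the induced schema

  @Nullable
  private String label;       // induced as a union of null and string
}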

On Thu, Jan 16, 2014 at 5:27 PM, Alparslan Avcı
alparslan.a...@agmlab.com wrote:
 Hi all,
 I've been developing Gora to upgrade Avro 1.7.x and have a question. Why do
 we have to use UNION type (Ex.: {"name": "field1", "type": ["null",
 {"type": "map", "values": ["null", "string"]}], "default": null}) for nullable
 fields? Because of this issue, we have to handle UNION types in an
 appropriate way both normal values and null values as exceptions.

 Instead of UNION type, why don't we use a 'nullable' property for any field?



-- 
Harsh J


Re: an awkward question

2013-10-01 Thread Harsh J
This question is for the ASF Infra I think - we use their services :)

There are services such as Nabble, etc. that let you also browse and
post on most indexed ASF project threads, by displaying it as a forum.

Note though that Google Groups carries its own annoying limits, which
are raised only for large customers/users of their paid apps
services. I'm still a big fan of the older backends, though I do enjoy
Google Groups' permalink feature :)

On Fri, Sep 27, 2013 at 9:42 PM, John Langley dige...@gmail.com wrote:
 Sorry if this is an awkward question, I don't really want to start some huge
 debate, but...

 google groups rock you guys.

 Is there any easy way to view and use this email distribution list from
 google groups, or is there any interest in moving there?

 I know the community probably doesn't want to endorse any one vendor, etc.
 etc. but honestly, working with other technology bases... google-groups
 REALLY makes it easy to collaborate and search for issues that someone may
 already have solved. (certainly http://search-hadoop.com/?q=fc_project=Avro
 helps, if only you could post from there as well, w/out having to jump to
 another interface etc.)





-- 
Harsh J


Re: Kafka JMX

2013-08-30 Thread Harsh J
Hello,

Did you mean to send this to the Kafka lists instead of the Avro one?

On Fri, Aug 30, 2013 at 4:08 AM, Mark static.void@gmail.com wrote:
 Can you view Kafka metrics via JConsole? I've tried connecting to port  
 with no such luck?





-- 
Harsh J


Re: Avro file Compression

2013-08-22 Thread Harsh J
Can you share your test? There is an example at
http://svn.apache.org/repos/asf/avro/trunk/lang/c/examples/quickstop.c
which has the right calls for using a file writer with a deflate codec
- is yours similar?

On Mon, Aug 19, 2013 at 9:42 PM, amit nanda amit...@gmail.com wrote:
 I am try to compress the avro files that i am writing, for that i am using
 the latest Avro C, with deflate option, but i am not able to see any
 difference in the file size.

 Is there any special type to data that this works on, or is there any more
 setting that needs to be done for this to work.





-- 
Harsh J


Re: Is there a way to conditionally read Avro data?

2013-08-17 Thread Harsh J
What Eric suggests (reader schemas) would work, but may incur a double
read cost when you wish to proceed based on a positive condition met
by the specific read.

If this data is held, order-wise, early in the record, then perhaps
using a custom DatumReader implementation (which does the low-level
deserialization) may work more effectively. You can pass a DatumReader
when constructing a DataFileReader - but it's quite a long route to go,
IMO.
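
As a rough, untested sketch of the reader-schema route Eric describes (class
and file names here are only illustrative):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class HeaderOnlyScanSketch {
  public static void scan(File avroFile, Schema headerOnlySchema) throws IOException {
    // Reader schema omits the "body" field, so its bytes are skipped on read.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<GenericRecord>(headerOnlySchema);
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(avroFile, datumReader);
    try {
      for (GenericRecord headerOnly : reader) {
        // Decide here, from the header, whether a second full read is worth it.
      }
    } finally {
      reader.close();
    }
  }
}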

On Sat, Aug 17, 2013 at 4:17 AM, Eric Wasserman ewasser...@247-inc.com wrote:
 If you define you records like this (this is in the Avro IDL lang. for
 brevity)

 If you write your records with a schema like this:


 record R {

 Header header;

 Body body;

   }



 Then you can read with a schema like this:


   record RSansBody {

 Header header;

   }


 And the Avro libraries will read the header part (in which your type would
 reside) and effectively skip the body part.

 
 From: Anna Lahoud annalah...@gmail.com
 Sent: Friday, August 16, 2013 12:23 PM
 To: user@avro.apache.org
 Subject: Is there a way to conditionally read Avro data?

 I am wondering if there is a way that I can avoid reading all of an item in
 an Avro file, based on some of the data that I have already read. For
 instance, say I have a datum where I know that if it's 'type' value is a
 'ComputerVirus', and that I do not want to touch the remaining fields. Is
 there a way to 'move on' and get the next datum, without touching the
 remainder of the scary datum? I would call it a 'conditional read' in that I
 only want to fully read the datum if the datum meets some criteria.

 Anna




-- 
Harsh J


Re: Mapper not called

2013-08-01 Thread Harsh J
I've often found the issue behind such an observation to be that the
input files lack a .avro extension. Is that true in your case? If so,
can you retry after renaming them?

On Wed, Jul 31, 2013 at 1:02 AM, Anna Lahoud annalah...@gmail.com wrote:
 I am following directions on
 http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/package-summary.html
 to write a job that takes Avro files as input and outputs non-Avro files, I
 created the following job. I should note that I have tried different
 variations of ordering the setInput/OutputPath lines, the AvroJob lines, and
 the reduce task settings. It always results the same: the job runs with 0
 mappers and 1 reducer (which gets no data so is essentially an emtpy
 SequenceFile). It always says there are 10 input files so that's not the
 issue. There is an @Override statement on my map and my reduce so that's not
 the issue. And I believe I have correctly followed the Avro input/non-Avro
 output instructions mentioned in the link above. Any other ideas would be
 welcome!!!


 public class MyAvroJob extends Configured implements Tool {

 @Override
 public int run(String[] args) throws Exception {

 JobConf job = new JobConf(getConf(), this.getClass());

 FileInputFormat.setInputPaths(job, new Path(args[0]));
 FileOutputFormat.setOutputPath(job, new Path(args[1]));

 AvroJob.setMapperClass(job, MyAvroMapper.class);
 AvroJob.setInputSchema(job, MySchema.SCHEMA$);
 AvroJob.setMapOutputSchema(job,
 Pair.getPairSchema(Schema.create(Type.STRING), Schema.create(Type.STRING)));

 job.setReducerClass(MyNonAvroReducer.class);
 job.setOutputFormat(SequenceFileOutputFormat.class);
 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Text.class);
 job.setNumReduceTasks(1);

 return JobClient.runJob(job).isSuccessful();
 }

 public static class MyAvroMapper extends AvroMapper<MySchema, Pair<String,
 String>> {

 @Override
 public void map(MySchema in, AvroCollector<Pair<String, String>> collector,
 Reporter reporter) throws IOException {

 List<MyThings> things = in.getRecords();
 ...
 collector.collect(new Pair<String, String>(newKey, newValue));
 }
 }

 public static class MyNonAvroReducer extends MapReduceBase implements
 Reducer<AvroKey<String>, AvroValue<String>, Text, Text> {

 @Override
 public void reduce(AvroKey<String> key, Iterator<AvroValue<String>> values,
 OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
 while (values.hasNext()) {
   output.collect(new Text(key.datum()), new Text(values.next().datum()));
 }
 }
 }

 public static void main(String[] args) throws Exception {
 ToolRunner.run(new MyAvroJob(), args);

 }




 -Anna








-- 
Harsh J


Re: Avro schema

2013-08-01 Thread Harsh J
We read it from the top of the file at start (just the schema bytes)
and then initialize the reader.

On Thu, Aug 1, 2013 at 8:32 PM, Lior Schachter lior...@gmail.com wrote:
 Hi all,

 When writing Avro schema to the a data file, what will be the expected
 behavior if the file is used as M/R input. How does the second/third/...
 splits get the schema (the schema is always written to the first split) ?

 Thanks,
 Lior





-- 
Harsh J


Re: Avro schema

2013-08-01 Thread Harsh J
Yes, we seek to 0 and we read the header then seek back to the split offset.
On Aug 1, 2013 11:16 PM, Lior Schachter lior...@gmail.com wrote:

 Hi Harsh,
 So for each split you first read the header of the file directly from HDFS
 ?

 Thanks,
 Lior




 On Thu, Aug 1, 2013 at 7:36 PM, Harsh J ha...@cloudera.com wrote:

 We read it from the top of the file at start (just the schema bytes)
 and then initialize the reader.

 On Thu, Aug 1, 2013 at 8:32 PM, Lior Schachter lior...@gmail.com wrote:
  Hi all,
 
  When writing Avro schema to the a data file, what will be the expected
  behavior if the file is used as M/R input. How does the second/third/...
  splits get the schema (the schema is always written to the first split)
 ?
 
  Thanks,
  Lior
 
 



 --
 Harsh J





Re: Avro and MapReduce 2.0

2013-07-23 Thread Harsh J
Hi,

The MR APIs are not tied to MR version 1.0 or 2.0. Both APIs are
available on both versions and are fully supported. If you're looking
for a new API Avro MapReduce example though, you can check out some of
its test cases such as
http://svn.apache.org/repos/asf/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java

On Fri, Jul 19, 2013 at 1:10 AM, F. Jerrell Schivers
jerr...@bordercore.com wrote:
 Hi folks,

 I'm writing a map-only job that outputs to Avro.  Can someone point me
 to some examples that use the 2.0 version of the API?  Most of the
 sample code I've found refer to the 1.0 version.

 Thanks,
 Jerrell



-- 
Harsh J


Re: ArrayIndexOutOfBoundsException in Symbol.getSymbol in map reduce job

2013-05-13 Thread Harsh J
It's difficult to tell what the error means without context and other
info (such as the version). If I had to guess, I think there may be
corruption in the file being processed here. Does running the file
through avro-tools' tojson sub-command end up in a successful read?

On Tue, May 14, 2013 at 3:28 AM, Sripad Sriram sri...@path.com wrote:
 Hi all,

 A java hadoop job that's previously executed without issue began erroring
 with the following stack trace - have any of you seen this before?

 java.lang.ArrayIndexOutOfBoundsException: 14
 at
 org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
 at
 org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
 at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
 at
 org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
 at
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
 at
 org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
 at
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
 at
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
 at
 org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer.deserialize(AvroSerialization.java:83)
 at
 org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer.deserialize(AvroSerialization.java:65)
 at
 org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:1262)
 at
 org.apache.hadoop.mapred.Task$ValuesIterator.nextKey(Task.java:1233)
 at
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:533)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:429)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
 at org.apache.hadoop.mapred.Child.main(Child.java:249)



-- 
Harsh J


Re: Hadoop Avro Question

2013-04-30 Thread Harsh J
Oops, moving for sure this time :)

On Wed, May 1, 2013 at 10:35 AM, Harsh J ha...@cloudera.com wrote:
 Moving the question to Apache Avro's user@ lists. Please use the right
 lists for the most relevant answers.

 Avro is a different serialization technique that intends to replace
 the Writable serialization defaults in Hadoop. MR accepts a list of
 serializers it can use for its key/value structures and isn't limited
 to Writable in any way. Look up the property io.serializations in
 your Hadoop's core-default.xml for more information.

 The Avro project also offers fast comparator classes that are used for
 comparing the bytes/structures of Avro objects. This is mostly
 auto-set for you when you use the MR framework as described at
 http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/package-summary.html
 (via AvroJob helper class).

 On Tue, Apr 30, 2013 at 6:51 PM, Rahul Bhattacharjee
 rahul.rec@gmail.com wrote:
 Hi,

 When dealing with Avro data files in MR jobs ,we use AvroMapper , I noticed
 that the output of K and V of AvroMapper isnt writable and neither the key
 is comparable (these are AvroKey and AvroValue). As the general
 serialization mechanism is writable , how is the K,V pairs in case of avro ,
 travel across nodes?

 Thanks,
 Rahul



 --
 Harsh J



-- 
Harsh J


Re: Hadoop Avro Question

2013-04-30 Thread Harsh J
For the TotalOrderPartitioner (TOP), check out
https://issues.apache.org/jira/browse/MAPREDUCE-4574 which we fixed in
Hadoop recently to allow full reuse with Avro.

On Wed, May 1, 2013 at 10:49 AM, Rahul Bhattacharjee
rahul.rec@gmail.com wrote:
 Hi Harsh,
 Looks like a lot of other classes are also to be rewritten for avro. Like
 the total sort partitioner , which I think currently assumes writable as the
 io mechanism.

 I faced problem using with avro , so though of writing to the forum.

 Thanks a lot
 Rahul!


 On Wed, May 1, 2013 at 10:35 AM, Harsh J ha...@cloudera.com wrote:

 Moving the question to Apache Avro's user@ lists. Please use the right
 lists for the most relevant answers.

 Avro is a different serialization technique that intends to replace
 the Writable serialization defaults in Hadoop. MR accepts a list of
 serializers it can use for its key/value structures and isn't limited
 to Writable in any way. Look up the property io.serializations in
 your Hadoop's core-default.xml for more information.

 The Avro project also offers fast comparator classes that are used for
 comparing the bytes/structures of Avro objects. This is mostly
 auto-set for you when you use the MR framework as described at

 http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/package-summary.html
 (via AvroJob helper class).

 On Tue, Apr 30, 2013 at 6:51 PM, Rahul Bhattacharjee
 rahul.rec@gmail.com wrote:
  Hi,
 
  When dealing with Avro data files in MR jobs ,we use AvroMapper , I
  noticed
  that the output of K and V of AvroMapper isnt writable and neither the
  key
  is comparable (these are AvroKey and AvroValue). As the general
  serialization mechanism is writable , how is the K,V pairs in case of
  avro ,
  travel across nodes?
 
  Thanks,
  Rahul



 --
 Harsh J





--
Harsh J


Re: Python Errors

2013-04-16 Thread Harsh J
Isn't RHEL4 too old as well, now?

On Tue, Apr 16, 2013 at 3:48 AM, Milind Vaidya kava...@gmail.com wrote:
 Thanks...I will upgrade n check...I was using whatever installed on my RHEL4
 box


 On Mon, Apr 15, 2013 at 4:50 PM, Miki Tebeka miki.teb...@gmail.com wrote:

 Python 2.3 is too old. IIRC the minimal Python version supported is 2.6.


 On Mon, Apr 15, 2013 at 1:54 PM, Milind Vaidya kava...@gmail.com wrote:

 I installed avro for python.

 Like Referred :
 https://avro.apache.org/docs/current/gettingstartedpython.html


 1. Build as per the instructions. Here is the output.

 ** Installation Output***
 /usr/lib64/python2.3/distutils/dist.py:227: UserWarning: Unknown
 distribution option: 'extras_require'
   warnings.warn(msg)
 /usr/lib64/python2.3/distutils/dist.py:227: UserWarning: Unknown
 distribution option: 'install_requires'
   warnings.warn(msg)
 running install
 running build
 running build_py
 running build_scripts
 running install_lib
 byte-compiling /usr/lib/python2.3/site-packages/avro/io.py to io.pyc
   File /usr/lib/python2.3/site-packages/avro/io.py, line 371
 @staticmethod
 ^
 SyntaxError: invalid syntax
 byte-compiling /usr/lib/python2.3/site-packages/avro/schema.py to
 schema.pyc
   File /usr/lib/python2.3/site-packages/avro/schema.py, line 589
 @staticmethod
 ^
 SyntaxError: invalid syntax
 byte-compiling /usr/lib/python2.3/site-packages/avro/datafile.py to
 datafile.pyc
   File /usr/lib/python2.3/site-packages/avro/datafile.py, line 71
 @staticmethod
 ^
 SyntaxError: invalid syntax
 running install_scripts
 changing mode of /usr/bin/avro to 755
 ** Installation Output***

 2.I checked import avro on python prompt as follows
 Python 2.3.4 (#1, Jan 11 2011, 14:40:50)
 [GCC 3.4.6 20060404 (Red Hat 3.4.6-11)] on linux2
 Type help, copyright, credits or license for more information.
  import avro
 

 3. I created the file user.avsc containing schema given at about link

 4. Copied the code from above link in BasicAvro.py (I added #!
 /usr/bin/python)

 5.Both BasicAvrio..py and user.avsc are in the same directory. If I run

 pyhon BasicAvro.py

 gives error

 Traceback (most recent call last):
   File BasicAvro.py, line 2, in ?
 import avro.schema
   File /usr/lib/python2.3/site-packages/avro/schema.py, line 589
 @staticmethod
 ^
 SyntaxError: invalid syntax

 6. Tried executing the script under scripts directory called avro.

 gives following error

   File avro, line 75
 return dict((k, obj[k]) for k in (set(obj)  fields))
   ^
 SyntaxError: invalid syntax

 7. What is going wrong ?












-- 
Harsh J


Re: Enabling compression

2013-04-09 Thread Harsh J
Hi Vinod,

In Avro, compression is provided only at the file container level
(i.e. block compression).

For compressing a simple byte array, you can rely on Hadoop's
compression classes such as GzipCodec [1] to compress the byte
stream directly (wrapping it via a compressed output stream [2] obtained
from its helper method [3]).

Something like this, for example (I've not tested it out):

ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, new Configuration());
OutputStream compressedOutputStream = codec.createOutputStream(outputStream);
[… Encode over compressedOutputStream, etc. …]

[1] - 
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/GzipCodec.html
[2] - 
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressorStream.html
[3] - 
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/GzipCodec.html#createOutputStream(java.io.OutputStream)
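
Fleshed out a little (same caveat - untested), the serialization side would
look roughly like:

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedDatumSketch {
  public static byte[] serializeGzipped(Schema schema, GenericRecord record) throws Exception {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, new Configuration());
    OutputStream compressed = codec.createOutputStream(outputStream);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(compressed, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();
    compressed.close(); // finishes the gzip stream
    return outputStream.toByteArray();
  }
}

Reading it back would symmetrically wrap the byte array in the codec's input
stream before handing it to the binary decoder.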

On Tue, Apr 9, 2013 at 11:17 AM, Vinod Jammula
vinod.kumar.jamm...@ericsson.com wrote:
 Hi,

 I have a a csv string which I want to serialize, compress and write to a
 database.

 I have the following code to serialize the string

 ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
 Encoder e = EncoderFactory.get().binaryEncoder(outputStream, null);
 GenericDatumWriter w = new GenericDatumWriter(schema);
 w.write(record, e)
 byte[] avroBytes = outputStream.toByteArray();


 Following code to de-serialize and process the record.

 DatumReader<GenericRecord> reader = new
 GenericDatumReader<GenericRecord>(schema);

  Decoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);

 GenericRecord record = reader.read(decoder, null);


 I find compression with DataFileWriter and DataFileReader. But how to enable
 the compression for avro serialized buffer.

 Thanks and Regards,
 Vinod



-- 
Harsh J


Re: Record sort order is lexicographically by field -- what does that mean?

2013-03-28 Thread Harsh J
Hey Jeremy,

On Thu, Mar 28, 2013 at 5:15 AM, Jeremy Kahn jer...@trochee.net wrote:
 According to the documentation
 http://avro.apache.org/docs/current/spec.html#order , the sort order for
 records is:

 record data is ordered lexicographically by field. If a field specifies that
 its order is:

 ascending, then the order of its values is unaltered.
 descending, then the order of its values is reversed.
 ignore, then its values are ignored when sorting.


 What does ordered lexicographically by field mean?  I can see two
 interpretations.  Consider a record of the following schema:

 {name: ZooInventory,
  type: Record,
  fields: [
{name: city, type: string, order: ignore},
{name: zebras, type: int, order: descending},
{name: anacondas, type: int, order: ascending},
{name: baboons, type: int}
  ]
 }


 I can read ordered lexicographically by field in two ways:

 the names of the fields are sorted lexicographically, and the field that
 goes lexicographically first (not marked as order:ignore) dominates.

 the records are sorted by the sort order of each field, with the first
 fields (not marked order: ignore) taking sort priority.

The second one is correct. The field's order in the defined schema is
not changed but only walked through.

I've always read this more like it will compare in the provided order
(of the read schema) and based on the type of ordering (positive, ignore
or negative), and that's true from my use of it in Hadoop MR as well.

 So suppose I have my ZooInventory objects, and I sort them according to the
 sort order specification.

 Under interpretation (1), cities with low anaconda counts would go first in
 the sorted list, and within a given value of anacondas, sort by baboon
 count.
 Under interpretation (2), large zebra-count zoos would go first, and within
 a given value of zebras, sort ascending by anacondas.

Yes, (2) is the result you'll see. Baboons would also be considered
ascending as you've not ignored it, btw.

 It seems to me that (2), in which the zebras field values dominate the sort
 descending, is the right way to behave, but I can't seem to square that
 with my understanding of ordered lexicographically by field -- or maybe
 lexicographically means something different to me than to you, or maybe
 (2) just isn't really right after all.

 Behavior (2) -- relative to behavior (1) -- offers the ability to adjust the
 order of the schema to express a different sort order, but might present
 problems for schema negotiation.

What kind of problems are you describing here? Sorry if I'm not
getting it by the words schema negotiation alone.

--
Harsh J


Re: Record sort order is lexicographically by field -- what does that mean?

2013-03-28 Thread Harsh J
Hmmm, I've not used the messaging aspects of Avro much as of yet, but
AFAIK the sorting is only applied manually by use of the
BinaryData.compare(…) API methods. If the IPC parts use that for some
reason to compare two messages or more, then I can imagine this to be
a problem as well.
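
For reference, the manual path looks roughly like this (an untested sketch;
the encode() helper is just illustrative):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryData;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class SortOrderSketch {
  static byte[] encode(GenericRecord record, Schema schema) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, enc);
    enc.flush();
    return out.toByteArray();
  }

  // Negative/zero/positive, honouring each field's declared "order".
  public static int compare(GenericRecord a, GenericRecord b, Schema schema) throws IOException {
    return BinaryData.compare(encode(a, schema), 0, encode(b, schema), 0, schema);
  }
}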

On Thu, Mar 28, 2013 at 11:27 PM, Jeremy Kahn troc...@trochee.net wrote:
 Thanks for the information, Harsh. Further comments inline below:

 On Thu, Mar 28, 2013 at 4:01 AM, Harsh J ha...@cloudera.com wrote:

 On Thu, Mar 28, 2013 at 5:15 AM, Jeremy Kahn jer...@trochee.net wrote:
  I can read ordered lexicographically by field in two ways:
 
  1. the names of the fields are sorted lexicographically, and the field
  that

  goes lexicographically first (not marked as order:ignore) dominates.
 
  2. the records are sorted by the sort order of each field, with the
  first

  fields (not marked order: ignore) taking sort priority.

 The second one is correct. The field's order in the defined schema is
 not changed but only walked through.

 [...] that's true from my use of it in Hadoop MR as well.


 Okay, this is very helpful to know: it's working the way I had hoped.



  Behavior (2) -- relative to behavior (1) -- offers the ability to adjust
  the
  order of the schema to express a different sort order, but might present
  problems for schema negotiation.

 What kind of problems are you describing here? Sorry if I'm not
 getting it by the words schema negotiation alone.


 Suppose I sort a sequence of ZooInventory objects by the sort order implied
 by this schema, and I send them to you in sorted order over a protocol with
 an IDL type specification of arrayZooInventory.  You *read* the sequence
 with a different ZooInventory schema with the same fields, but which
 contains a different ordering. The objects in the array will not
 (necessarily) appear to be sorted *to you*.

 This isn't necessarily a problem -- it might actually be a feature. It is
 worth noting that two schemas may be compatible under schema negotiation but
 have different sort order for reader and writer.

 --jeremy



-- 
Harsh J


Re: Avro and Oozie Map Reduce action

2013-03-18 Thread Harsh J
The value you're specifying for io.serializations below is incorrect:

<property>
  <name>io.serializations</name>
  <value>org.apache.avro.mapred.AvroSerialization,
  avro.serialization.key.reader.schema,
  avro.serialization.value.reader.schema,
  avro.serialization.key.writer.schema,avro.serialization.value.writer.schema
  </value>
</property>

If the goal is to include org.apache.avro.mapred.AvroSerialization,
then it should look more like:

<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization,org.apache.avro.mapred.AvroSerialization</value>
</property>

That is, it must be an extension of the default values, and not a
replacement of them.

On Wed, Mar 13, 2013 at 4:05 AM, M, Paul pa...@iqt.org wrote:
 Hello,

 I am trying to run an M/R job with Avro serialization via Oozie.  I've made
 some progress in the workflow.xml, however I am still running into the
 following error.  Any thoughts?  I believe it may have to do with the
 io.serializations property below.   FYI, I am using CDH 4.2.0 mr1.

 2013-03-12 15:24:32,334 INFO org.apache.hadoop.mapred.TaskInProgress: Error
 from attempt_20130318_0080_m_00_3: java.lang.NullPointerException
 at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:389)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1407)
 at org.apache.hadoop.mapred.Child.main(Child.java:262)


 <action name="mr-node">
 <map-reduce>
 <job-tracker>${jobTracker}</job-tracker>
 <name-node>${nameNode}</name-node>
 <prepare>
 <delete path="${nameNode}/user/${wf:user()}/${outputDir}" />
 </prepare>
 <configuration>
 <property>
 <name>mapred.job.queue.name</name>
 <value>${queueName}</value>
 </property>

 <property>
 <name>mapreduce.reduce.class</name>
 <value>org.apache.avro.mapred.HadoopReducer</value>
 </property>
 <property>
 <name>mapreduce.map.class</name>
 <value>org.apache.avro.mapred.HadoopMapper</value>
 </property>


 <property>
 <name>avro.reducer</name>
 <value>org.my.project.mapreduce.CombineAvroRecordsByHourReducer
 </value>
 </property>

 <property>
 <name>avro.mapper</name>
 <value>org.my.project.mapreduce.ParseMetadataAsTextIntoAvroMapper
 </value>
 </property>


 <property>
 <name>mapreduce.inputformat.class</name>
 <value>org.my.project.mapreduce.NonSplitableInputFormat</value>
 </property>

 <!-- Key Value Mapper -->
 <property>
 <name>avro.output.schema</name>
 <value>{"type":"record","name":"Pair","namespace":"org.apache.avro.mapred","fields":...}]}
 </value>
 </property>
 <property>
 <name>mapred.mapoutput.key.class</name>
 <value>org.apache.avro.mapred.AvroKey</value>
 </property>
 <property>
 <name>mapred.mapoutput.value.class</name>
 <value>org.apache.avro.mapred.AvroValue</value>
 </property>


 <property>
 <name>avro.schema.output.key</name>
 <value>{"type":"record","name":"DataRecord","namespace":...]}]}
 </value>
 </property>

 <property>
 <name>mapreduce.outputformat.class</name>
 <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
 </value>
 </property>

 <property>
 <name>mapred.output.key.comparator.class</name>
 <value>org.apache.avro.mapred.AvroKeyComparator</value>
 </property>

 <property>
 <name>io.serializations</name>
 <value>org.apache.avro.mapred.AvroSerialization,
 avro.serialization.key.reader.schema,
 avro.serialization.value.reader.schema,
 avro.serialization.key.writer.schema,avro.serialization.value.writer.schema
 </value>
 </property>

 <property>
 <name>mapred.map.tasks</name>
 <value>1</value>
 </property>



 <!--Input/Output -->
 <property>
 <name>mapred.input.dir</name>
 <value>/user/${wf:user()}/input/</value>
 </property>
 <property>
 <name>mapred.output.dir</name>
 <value>/user/${wf:user()}/${outputDir}</value>
 </property>
 </configuration>
 </map-reduce>



-- 
Harsh J


Re: Is it possible to append to an already existing avro file

2013-02-07 Thread Harsh J
I assume by non-trivial you meant the extra Seekable stuff I needed to
wrap around the DFS output streams to let Avro take it as append-able?
I don't think it's possible for Avro to carry it since Avro (core) does
not reverse-depend on Hadoop. Should we document it somewhere though?
Do you have any ideas on the best place to do that?
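
For reference, one way this can be wired up (an untested sketch, assuming the
target file already exists on an HDFS that supports append; FsInput from the
avro-mapred module provides the SeekableInput side):

import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendSketch {
  public static void appendOne(Configuration conf, Path path, Schema schema,
      GenericRecord record) throws Exception {
    FileSystem fs = path.getFileSystem(conf);
    OutputStream appendable = fs.append(path);             // HDFS append stream
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.appendTo(new FsInput(path, conf), appendable);  // re-open the existing container
    writer.append(record);
    writer.close();
  }
}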

On Thu, Feb 7, 2013 at 6:12 AM, Michael Malak michaelma...@yahoo.com wrote:
 Thanks so much for the code -- it works great!

 Since it is a non-trivial amount of code required to achieve append, I 
 suggest attaching that code to AVRO-1035, in the hopes that someone will come 
 up with an interface that requires just one line of user code to achieve 
 append.

 --- On Wed, 2/6/13, Harsh J ha...@cloudera.com wrote:

 From: Harsh J ha...@cloudera.com
 Subject: Re: Is it possible to append to an already existing avro file
 To: user@avro.apache.org
 Date: Wednesday, February 6, 2013, 11:17 AM
 Hey Michael,

 It does implement the regular Java OutputStream interface,
 as seen in
 the API: 
 http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataOutputStream.html.

 Here's a sample program that works on Hadoop 2.x in my
 tests:
 https://gist.github.com/QwertyManiac/4724582

 On Wed, Feb 6, 2013 at 9:00 AM, Michael Malak michaelma...@yahoo.com
 wrote:
  I don't believe a Hadoop FileSystem is a Java
 OutputStream?
 
  --- On Tue, 2/5/13, Doug Cutting cutt...@apache.org
 wrote:
 
  From: Doug Cutting cutt...@apache.org
  Subject: Re: Is it possible to append to an already
 existing avro file
  To: user@avro.apache.org
  Date: Tuesday, February 5, 2013, 5:27 PM
  It will work on an OutputStream that
  supports append.
 
  http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(org.apache.avro.file.SeekableInput,
  java.io.OutputStream)
 
  So it depends on how well HDFS implements
  FileSystem#append(), not on
  any changes in Avro.
 
  http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#append(org.apache.hadoop.fs.Path)
 
  I have no recent personal experience with append
 in
  HDFS.  Does anyone
  else here?
 
  Doug
 
  On Tue, Feb 5, 2013 at 4:10 PM, Michael Malak
 michaelma...@yahoo.com
  wrote:
   My understanding is that will append to a file
 on the
  local filesystem, but not to a file on HDFS.
  
   --- On Tue, 2/5/13, Doug Cutting cutt...@apache.org
  wrote:
  
   From: Doug Cutting cutt...@apache.org
   Subject: Re: Is it possible to append to
 an already
  existing avro file
   To: user@avro.apache.org
   Date: Tuesday, February 5, 2013, 5:08 PM
   The Jira is:
  
   https://issues.apache.org/jira/browse/AVRO-1035
  
   It is possible to append to an existing
 Avro file:
  
   http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(java.io.File)
  
   Should we close that issue as fixed?
  
   Doug
  
   On Fri, Feb 1, 2013 at 11:32 AM, Michael
 Malak
  michaelma...@yahoo.com
   wrote:
Was a JIRA ticket ever created
 regarding
  appending to
   an existing Avro file on HDFS?
   
What is the status of such a
 capability, a
  year out
   from when the issue below was raised?
   
On Wed, 22 Feb 2012 10:57:48 +0100,
  Vyacheslav
   Zholudev vyacheslav.zholu...@gmail.com
   wrote:
   
Thanks for your reply, I
 suspected this.
   
I will create a JIRA ticket.
   
Vyacheslav
   
On Feb 21, 2012, at 6:02 PM,
 Scott Carey
  wrote:
   
   
On 2/21/12 7:29 AM,
 Vyacheslav
  Zholudev
   vyacheslav.zholu...@gmail.com
wrote:
   
Yep, I saw that method as
 well as
  the
   stackoverflow post. However, I'm
interested how to append
 to a file
  on the
   arbitrary file system, not
only on the local one.
   
I want to get an
 OutputStream
  based on the
   Path and the FileSystem
implementation and then
 pass it
  for
   appending to avro methods.
   
Is that possible?
   
It is not possible without
 modifying
   DataFileWriter. Please open a JIRA
ticket.
   
It could not simply append to
 an
  OutputStream,
   since it must either:
* Seek to the start to
 validate the
  schemas
   match and find the sync
marker, or
* Trust that the schemas
 match and
  find the
   sync marker from the last
block
   
DataFileWriter cannot refer
 to Hadoop
  classes
   such as FileSystem, but we
could add something to the
 mapred
  module that
   takes a Path and
FileSystem and returns
 something that
   implemements an interface that
DataFileWriter can append
 to.
  This would
   be something that is both a
http://avro.apache.org/docs/1.6.2/api/java/org/apache/avro/file/SeekableInput.html
and an OutputStream, or has
 both an
  InputStream
   from the start of the
existing file and an
 OutputStream at
  the end.
   
Thanks,
Vyacheslav
   
On Feb 21, 2012, at 5:29
 AM, Harsh
  J
   wrote:
   
Hi,
   
Use the appendTo
 feature of
  the
   DataFileWriter. See

Re: run time error during reduce stage: No field named ____ in: null

2012-11-02 Thread Harsh J
You could also use an @Override to assert an override at compile-time.

On Fri, Nov 2, 2012 at 9:55 PM, Brian Derickson bderickso...@gmail.com wrote:
 That did it! I never would have found that, thank you so much. This is what
 I get for trying to just use Vim and Maven instead of a proper IDE. I'll
 work on getting Eclipse set up.

 Again, thanks a bunch. I've been pouring over this for awhile now and I'm
 both glad and embarrassed it was so simple.



 On Fri, Nov 2, 2012 at 11:06 AM, Dave Beech d...@paraliatech.com wrote:

 I think I have it.

 Your reducer isn't being called at all, because the signature of the
 reducer method doesn't match the one in AvroReducer. So, the base
 implementation isn't being overridden. You've stated Iterator where
 it should actually be Iterable.

 If you use Eclipse, look for a green arrow icon next to the method
 declaration - that means it's being overridden properly.

 Dave

 On 2 November 2012 15:55, Brian Derickson bderickso...@gmail.com wrote:
  I've made another gist for this rather than clutter up the mail with
  code
  snippets: https://gist.github.com/4002132
 
  I basically just changed all instances of PairGenericRecord, Integer
  in
  the reducer with just GenericRecord. I also changed the output schema
  that
  gets set in the Main function.
 
  When I run this, I get a run time error that's also included in the
  above
  gist: java.lang.IllegalArgumentException: Not a Pair schema:
 
  The pom.xml file I'm using is also in this gist, in case I'm screwing up
  a
  version somewhere. My intent is to be running on CDH4 using MRv1 and
  Avro
  1.7.1, and as far as I can tell from the pom.xml I'm doing just that.
  Could
  be mistaken.
 
  Thanks again for your time.
 
 
 
  On Thu, Nov 1, 2012 at 5:49 PM, Dave Beech d...@paraliatech.com wrote:
 
  Hi Brian
 
  I don't think the output from the reducer should be a Pair. You said
  you got an error when you didn't use a Pair here - what was it?
 
  Cheers,
  Dave
 
  On 1 November 2012 22:09, Brian Derickson bderickso...@gmail.com
  wrote:
   I've been pulling my hair out over this all day, and I'm hoping this
   is
   something simple I'm overlooking.
  
   The relevant portions of my code, the schema I'm using, and the stack
   trace
   are at https://gist.github.com/3996847.
  
   I'm using Hadoop 0.20.2 and Avro 1.7.1 as part of CDH4.
  
   To briefly describe what I'm doing: the mapper (not included in the
   gist) is
   taking a bam file and spitting out some information. The key is the
   chromosome and position colon delimited and the value is an integer.
  
   The reducer is summing up all the integers at a particular position
   and
   creating a Pair object containing a record using the schema included
   in
   my
   gist. The second portion of the pair is an integer that I don't care
   about... if I didn't use a Pair here, I'd get an error. If this is
   something
   I could do differently, please correct me. :)
  
   Every time this is run, I get the stack trace included in the gist.
   I've
   run
   out of things to try to fix this... I'd really really appreciate any
   help I
   can get. Thanks!
  
  
  
 
 





-- 
Harsh J


Re: Example of secondary sort using Avro data.

2012-10-15 Thread Harsh J
Hi Ravi,

Avro questions are best asked at user@avro lists. I've moved your
question there.

Take a look at Jacob's responses at
http://search-hadoop.com/m/woY9Gz8Qyz1 for a detailed take on how to
setup the comparators.

On Tue, Oct 16, 2012 at 1:54 AM, Ravi P hadoo...@outlook.com wrote:
 Hello Group,
   Is there any sample code/documentation available on writing MapReduce
 jobs with secondary sort using Avro data?
 --
 Thanks,
 Ravi



-- 
Harsh J


Re: How to convert Avro GenericRecord to AvroKey<GenericRecord>?

2012-09-26 Thread Harsh J
Ravi,

Moving this to the Avro user lists (user@avro.apache.org).

You can simply do an AvroKey<GenericRecord> key = new
AvroKey<GenericRecord>(datum).
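
For example (a minimal, self-contained sketch; the schema and field name here are
only illustrative stand-ins for your real record):

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.mapred.AvroKey;

  public class AvroKeyWrapExample {
    // A tiny illustrative schema; your real schema would come from a file or a generated class.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Example\",\"fields\":"
        + "[{\"name\":\"someField\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
      Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
      GenericRecord datum = new GenericData.Record(schema);
      datum.put("someField", "someValue");

      // Wrapping the datum is all that is needed to feed it to the mapper in a test.
      AvroKey<GenericRecord> key = new AvroKey<GenericRecord>(datum);
      System.out.println(key.datum());
    }
  }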

On Thu, Sep 27, 2012 at 4:51 AM, Ravi P hadoo...@outlook.com wrote:
 Hello,
 I am using Avro files for Hadoop MapReduce.  My Mapper function has
 the following definition.

 Mapper<AvroKey<GenericRecord>, NullWritable,
 AvroKey<GenericRecord>, AvroValue<GenericRecord>>

 For writing unit tests for the above Mapper function I need to pass an
 AvroKey<GenericRecord>.
 How do I convert GenericRecord to AvroKey<GenericRecord>?

 Is there any example available ?

 Thanks,
 Ravi



-- 
Harsh J


Re: avrogencpp generates vector of $Undefined$ type

2012-08-27 Thread Harsh J
Hey Jan,

Perhaps filing a JIRA with your reproduction steps and a sample,
runnable test case will help speed this up. Seems like a genuine bug,
so you should go ahead!

On Tue, Aug 28, 2012 at 5:30 AM, Jan van der Lugt janl...@gmail.com wrote:
 It seems the $Undefined$ is coming from an AVRO_UNION type, which is also
 not checked in the cppTypeOf method. I could try to come up with some
 solution, but if someone with knowledge of this code could tell me what the
 issue is and why AVRO_UNION is not being handled, that would be very
 helpful.

 - Jan


 On Sun, Aug 26, 2012 at 9:56 PM, Jan van der Lugt janl...@gmail.com wrote:

 Good find! I'll take a look at this tomorrow, see if I can come up with a
 fix.


 On Sun, Aug 26, 2012 at 5:26 AM, Harsh J ha...@cloudera.com wrote:

 I'm not an expert on the Avro C++ implementation, but I wonder if this
 is cause of the nulls not being checked for in
 http://svn.apache.org/repos/asf/avro/trunk/lang/c++/impl/avrogencpp.cc's
 CodeGen::cppTypeOf method.

 On Sun, Aug 26, 2012 at 1:54 PM, Jan van der Lugt janl...@gmail.com
 wrote:
  Hi all,
 
  Sorry to be impatient, but could someone please comment on this issue?
  I
  know that the C++ version isn't as popular as the Java version, but the
  whole idea is to make information exchange between applications in
  different
  languages easier, right?
 
  - Jan
 
 
  On Sat, Aug 18, 2012 at 12:10 AM, Jan van der Lugt janl...@gmail.com
  wrote:
 
  Hi all,
 
  After deciding on Apache Avro for one of the main formats for storing
  our
  graph data, I tried to integrate it with our graph processing system
  built
  in C++. If I generate a header file from the attached Avro schema
  using
  avrogencpp, I get a vector of type $Undefined$ somewhere in the
  generated
  code (see the snippet below). Is there an error in my schema or is
  this a
  bug in avrogencpp? Thanks in advance for your help!
 
  - Jan
 
   static void decode(Decoder& d, gm::gm_avro_graph_avpr_Union__5__& v)
   {
       size_t n = d.decodeUnionIndex();
       if (n >= 2) { throw avro::Exception("Union index too big"); }
       switch (n) {
       case 0:
           d.decodeNull();
           v.set_null();
           break;
       case 1:
           {
               std::vector<$Undefined$ > vv;
               avro::decode(d, vv);
               v.set_array(vv);
           }
           break;
       }
   }
 
 



 --
 Harsh J






-- 
Harsh J


Re: avrogencpp generates vector of $Undefined$ type

2012-08-26 Thread Harsh J
I'm not an expert on the Avro C++ implementation, but I wonder if this
is cause of the nulls not being checked for in
http://svn.apache.org/repos/asf/avro/trunk/lang/c++/impl/avrogencpp.cc's
CodeGen::cppTypeOf method.

On Sun, Aug 26, 2012 at 1:54 PM, Jan van der Lugt janl...@gmail.com wrote:
 Hi all,

 Sorry to be impatient, but could someone please comment on this issue? I
 know that the C++ version isn't as popular as the Java version, but the
 whole idea is to make information exchange between applications in different
 languages easier, right?

 - Jan


 On Sat, Aug 18, 2012 at 12:10 AM, Jan van der Lugt janl...@gmail.com
 wrote:

 Hi all,

 After deciding on Apache Avro for one of the main formats for storing our
 graph data, I tried to integrate it with our graph processing system built
 in C++. If I generate a header file from the attached Avro schema using
 avrogencpp, I get a vector of type $Undefined$ somewhere in the generated
 code (see the snippet below). Is there an error in my schema or is this a
 bug in avrogencpp? Thanks in advance for your help!

 - Jan

  static void decode(Decoder& d, gm::gm_avro_graph_avpr_Union__5__& v) {
      size_t n = d.decodeUnionIndex();
      if (n >= 2) { throw avro::Exception("Union index too big"); }
      switch (n) {
      case 0:
          d.decodeNull();
          v.set_null();
          break;
      case 1:
          {
              std::vector<$Undefined$ > vv;
              avro::decode(d, vv);
              v.set_array(vv);
          }
          break;
      }
  }





-- 
Harsh J


Re: avro-1.5.4 jars missing

2012-07-24 Thread Harsh J
Hi Steven,

I can find all releases here on the main archive link:
http://archive.apache.org/dist/avro/. Hope this helps!

On Tue, Jul 24, 2012 at 10:15 PM, Steven Willis swil...@compete.com wrote:
 Hello,

 I was looking for the avro-1.5.4 jars today and found that they are no longer 
 in the releases directory on the mirrors:

 http://www.us.apache.org/dist/avro/

 [PARENTDIR] Parent Directory -
 [DIR] avro-1.6.3/ 2012-03-02 22:31-
 [DIR] avro-1.7.1/ 2012-07-12 19:29-
 [DIR] stable/ 2012-07-12 19:29-
 [   ] KEYS        2010-06-07 16:58  5.4K

 Nor are they in the archive location:

 http://archive.apache.org/dist/hadoop/avro/

 [PARENTDIR] Parent Directory -
 [DIR] avro-1.0.0/ 2009-08-06 16:22-
 [DIR] avro-1.1.0/ 2009-09-11 17:29-
 [DIR] avro-1.2.0/ 2009-10-09 21:15-
 [DIR] avro-1.3.0/ 2010-03-01 16:46-
 [DIR] avro-1.3.1/ 2010-03-16 17:30-
 [DIR] avro-1.3.2/ 2010-03-25 22:17-
 [DIR] stable/ 2010-03-25 22:17-
 [   ] KEYS        2009-07-14 20:13  2.0K

 Is this an oversight, or should I be looking elsewhere?

 -Steven Willis



-- 
Harsh J


Re: Avro file size is too big

2012-07-19 Thread Harsh J
Snappy is known to have a lower compression ratio than Gzip, but
perhaps you can try larger blocks in the Avro DataFiles, as indicated
in the thread, via a higher sync-interval? [1] What Snappy is really
good at is fast decompression, so your read performance may still be
comparable with gzipped plaintext.

P.s. What do you get if you use deflate compression on the data files,
with maximal compression level (9)? [2]

[1] - 
http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setSyncInterval(org.apache.hadoop.mapred.JobConf,%20int)
or 
http://avro.apache.org/docs/1.7.1/api/java/index.html?org/apache/avro/mapred/AvroOutputFormat.html

[2] - 
http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setDeflateLevel(org.apache.hadoop.mapred.JobConf,%20int)
or via 
http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/CodecFactory.html#deflateCodec(int)
coupled with 
http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/DataFileWriter.html#setCodec(org.apache.avro.file.CodecFactory)
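
For reference, a minimal sketch of wiring both knobs into the job configuration
(the 1 MB sync interval is only an illustrative value):

  import org.apache.avro.mapred.AvroOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class OutputTuning {
    public static void configure(JobConf conf) {
      // Maximal deflate compression for the job's Avro output files.
      AvroOutputFormat.setDeflateLevel(conf, 9);
      // Larger blocks usually compress better; 1 MB here is just an example.
      AvroOutputFormat.setSyncInterval(conf, 1024 * 1024);
    }
  }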

On Thu, Jul 19, 2012 at 5:29 AM, Ey-Chih chow eyc...@gmail.com wrote:
 We are converting our compression scheme from gzip to snappy for our json 
 logs.  In one case, the size of a gzip file is 715MB and the corresponding 
 snappy file is 1.885GB.  The schema of the snappy file is bytes.  In other 
 words, we compress line by line of our json logs and each line is a json 
 string.  Is there any way we can optimize our compression with snappy?

 Ey-Chih Chow


 On Jul 5, 2012, at 3:19 PM, Doug Cutting wrote:

 You can use the Avro command-line tool to dump the metadata, which
 will show the schema and codec:

  java -jar avro-tools.jar getmeta file

 Doug

 On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh metarus...@gmail.com 
 wrote:
 Hey Doug,

 Here is a little more of explanation
 http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
 I'll answer your questions later after some investigation

 Thank you!


 On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting cutt...@apache.org wrote:
 Rusian,

 This is unexpected.  Perhaps we can understand it if we have more 
 information.

 What Writable class are you using for keys and values in the SequenceFile?

 What schema are you using in the Avro data file?

 Can you provide small sample files of each and/or code that will reproduce 
 this?

 Thanks,

 Doug

 On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh metarus...@gmail.com 
 wrote:
 Hello,

 In my organization currently we are evaluating Avro as a format. Our
 concern is file size. I've done some comparisons of a piece of our
 data.
 Say we have sequence files, compressed. The payload (values) are just
 lines. As far as I know we use line number as keys and we use the
 default codec for compression inside sequence files. The size is 1.6G,
 when I put it to avro with deflate codec with deflate level 9 it
 becomes 2.2G.
 This is interesting, because the values in seq files are just string,
 but Avro has a normal schema with primitive types. And those are kept
 binary. Shouldn't Avro be less in size?
 Also I took another dataset which is 28G (gzip files, plain
 tab-delimited text, don't know what is the deflate level) and put it
 to Avro and it became 38G
 Why Avro is so big in size? Am I missing some size optimization?

 Thanks in advance!




-- 
Harsh J


Re: Which jar file is for what?

2012-07-09 Thread Harsh J
The answer would depend on what you're looking to do.

That said, the best way to use Avro is to use it via maven and the
avro-maven-plugin. See http://github.com/phunt/avro-rpc-quickstart for
an example of how to use it for different things.

On Tue, Jul 10, 2012 at 1:00 AM, Saptarshi Guha sg...@mozilla.com wrote:
 Hello,


 In the folder

 http://www.eng.lsu.edu/mirrors/apache/avro/stable/java/

 there is


   avro-1.6.3.jar  02-Mar-2012 16:27  286K  
 Java-Apache (old)
   avro-compiler-1.6.3.jar 02-Mar-2012 16:27   71K  
 Java-Apache (old)
   avro-ipc-1.6.3.jar  02-Mar-2012 16:27  180K  
 Java-Apache (old)
   avro-mapred-1.6.3.jar   02-Mar-2012 16:27   89K  
 Java-Apache (old)
   avro-maven-plugin-1.6.3.jar 02-Mar-2012 16:27   20K  
 Java-Apache (old)
   avro-protobuf-1.6.3.jar 02-Mar-2012 16:27   18K  
 Java-Apache (old)
   avro-thrift-1.6.3.jar   02-Mar-2012 16:27   15K  
 Java-Apache (old)
   avro-tools-1.6.3-nodeps.jar 02-Mar-2012 16:27   46K  
 Java-Apache (old)
   avro-tools-1.6.3.jar02-Mar-2012 16:27   10M  
 Java-Apache (old)


 Where would i use the different JAR files?
 Many thanks
 Regards
 Saptarshi



-- 
Harsh J


Re: Avro + Snappy changing blocksize of snappy compression

2012-04-18 Thread Harsh J
Hey Nikhil,

When using Avro data files, you may need to tweak the sync-interval
to affect compression chunk sizes:
http://avro.apache.org/docs/1.6.3/api/java/org/apache/avro/file/DataFileWriter.html#setSyncInterval(int)
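
When writing data files directly (outside of MapReduce), the same knob lives on
DataFileWriter itself. A minimal sketch, where the schema file, output file name
and the 1 MB interval are only illustrative:

  import java.io.File;
  import org.apache.avro.Schema;
  import org.apache.avro.file.CodecFactory;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;

  public class SnappyWriter {
    public static void main(String[] args) throws Exception {
      Schema schema = new Schema.Parser().parse(new File("record.avsc"));

      DataFileWriter<GenericRecord> writer =
          new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
      writer.setCodec(CodecFactory.snappyCodec());
      writer.setSyncInterval(1024 * 1024); // larger blocks give snappy more data to work with
      writer.create(schema, new File("records.avro"));

      // writer.append(record); ... append records here
      writer.close();
    }
  }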

On Wed, Apr 18, 2012 at 10:53 PM, snikhil0 snik...@telenav.com wrote:
 I am experimenting with Avro and snappy and want to plot the size of the
 compressed avro datafile as a function of varying compression block size. I
 am doing this by setting the configuration value for
 io.compression.codec.snappy.buffersize. Unfortunately, this is not
 working: or more precisely for buffer sizes 256K to 2MB I get the same size
 output avro (snappyfied) data file. What am I missing? Someone had success
 with this?

 Thanks,
 Nikhil

 --
 View this message in context: 
 http://apache-avro.679487.n3.nabble.com/Avro-Snappy-changing-blocksize-of-snappy-compression-tp3920732p3920732.html
 Sent from the Avro - Users mailing list archive at Nabble.com.



-- 
Harsh J


Re: Getting started with Avro + Reading from an Avro formatted file

2012-01-24 Thread Harsh J
Selvi,

Expanding on Douglas' response, if you have installed Avro's python
libraries (Simplest way to get latest stable is: easy_install avro,
or install from the distribution -- Post back if you need help on
this), you can simply do, using the now-installed 'avro' executable:

$ ls
sample_input.avro

$ avro cat sample_input.avro --format csv
011990-9,0,-61952400
011990-9,22,-61950600
011990-9,-11,-61948440
012650-9,111,-65553120
012650-9,78,-65550960

Or, write to a resultant file, as you would regularly in a shell:

$ avro cat sample_input.avro --format csv > sample_input.csv

For more options on avro's cat and write opts:

$ avro --help

On Tue, Jan 24, 2012 at 9:01 PM, selvi k gridsngat...@gmail.com wrote:
 Hello All,


 I would like some suggestions on where I can start in the Avro project.


 I want to be able to read from an Avro formatted log file (specifically the
 History Log file created at the end of a Hadoop job) and create a Comma
 Separated file of certain log entries. I need a csv file because this is the
 format that is accepted by post processing software I am working with (eg:
 Matlab).


 Initially I was using a BASH script to grep and awk from this file and
 create my CSV file because I needed a very few values from it, and a quick
 script just worked. I didn't try to get to know what format the log file was
 in and utilize that. (my bad!)  Now that I need to be scaling up and want to
 have a reliable way to parse, I would like to try and do it the right way.


 My question is this: For the above goal, could you please guide me with
 steps I can follow - such as reading material and libraries I could try to
 use. As I go through the Quick Start Guide and FAQ, I see that a lot of the
 information here is geared to someone who wants to use the data
 serialization and RPC functionality provided by Avro. Given that I only want
 to be able to read, where may I start?


 I can comfortably script with BASH and Perl. Given that I only see support
 for Java, Python and Ruby, I think I can take this as as opportunity to
 learn Python and get up to speed.


 Thanks a lot.


 -Selvi





-- 
Harsh J
Customer Ops. Engineer, Cloudera


Re: Getting started with Avro + Reading from an Avro formatted file

2012-01-24 Thread Harsh J


 


 [2] /avro$ sudo easy_install avro

 Searching for avro

 Best match: avro 1.6.1

 Processing avro-1.6.1-py2.6.egg

 avro 1.6.1 is already the active version in easy-install.pth

 Installing avro script to /usr/local/bin


 Using /usr/local/lib/python2.6/dist-packages/avro-1.6.1-py2.6.egg

 Processing dependencies for avro

 Searching for python-snappy

 Reading http://pypi.python.org/simple/python-snappy/

 Reading http://github.com/andrix/python-snappy

 Best match: python-snappy 0.3.2

 Downloading
 http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f

 Processing python-snappy-0.3.2.tar.gz

 Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir
 /tmp/easy_install-c6jLm0/python-snappy-0.3.2/egg-dist-tmp-TTWQBN

 cc1plus: warning: command line option -Wstrict-prototypes is valid for
 Ada/C/ObjC but not for C++

 snappymodule.cc:31:22: error: snappy-c.h: No such file or directory

 snappymodule.cc: In function ‘PyObject* snappy__compress(PyObject*,
 PyObject*)’:

 snappymodule.cc:62: error: ‘snappy_status’ was not declared in this scope

 snappymodule.cc:62: error: expected ‘;’ before ‘status’

 snappymodule.cc:75: error: ‘snappy_max_compressed_length’ was not declared
 in this scope

 snappymodule.cc:79: error: ‘status’ was not declared in this scope

 snappymodule.cc:79: error: ‘snappy_compress’ was not declared in this
 scope

 snappymodule.cc:81: error: ‘SNAPPY_OK’ was not declared in this scope

 snappymodule.cc: In function ‘PyObject* snappy__uncompress(PyObject*,
 PyObject*)’:

 snappymodule.cc:107: error: ‘snappy_status’ was not declared in this scope

 snappymodule.cc:107: error: expected ‘;’ before ‘status’

 snappymodule.cc:120: error: ‘status’ was not declared in this scope

 snappymodule.cc:120: error: ‘snappy_uncompressed_length’ was not declared
 in this scope

 snappymodule.cc:121: error: ‘SNAPPY_OK’ was not declared in this scope

 snappymodule.cc:128: error: ‘snappy_uncompress’ was not declared in this
 scope

 snappymodule.cc:129: error: ‘SNAPPY_OK’ was not declared in this scope

 snappymodule.cc: In function ‘PyObject*
 snappy__is_valid_compressed_buffer(PyObject*, PyObject*)’:

 snappymodule.cc:151: error: ‘snappy_status’ was not declared in this scope

 snappymodule.cc:151: error: expected ‘;’ before ‘status’

 snappymodule.cc:156: error: ‘status’ was not declared in this scope

 snappymodule.cc:156: error: ‘snappy_validate_compressed_buffer’ was not
 declared in this scope

 snappymodule.cc:157: error: ‘SNAPPY_OK’ was not declared in this scope

 snappymodule.cc: At global scope:

 snappymodule.cc:41: warning: ‘_state’ defined but not used

 error: Setup script exited with error: command 'gcc' failed with exit
 status 1


 


 [3]

 python$ sudo easy_install python-snappy

 Searching for python-snappy

 Reading http://pypi.python.org/simple/python-snappy/

 Reading http://github.com/andrix/python-snappy

 Best match: python-snappy 0.3.2

 Downloading
 http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f

 Processing python-snappy-0.3.2.tar.gz

 Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir
 /tmp/easy_install-Hpzssm/python-snappy-0.3.2/egg-dist-tmp-UStJPW

 gcc: error trying to exec 'cc1plus': execvp: No such file or directory

 error: Setup script exited with error: command 'gcc' failed with exit
 status 1





 On Tue, Jan 24, 2012 at 11:01 AM, Harsh J ha...@cloudera.com wrote:

 Selvi,

 Expanding on Douglas' response, if you have installed Avro's python
 libraries (Simplest way to get latest stable is: easy_install avro,
 or install from the distribution -- Post back if you need help on
 this), you can simply do, using the now-installed 'avro' executable:

 $ ls
 sample_input.avro

 $ avro cat sample_input.avro --format csv
 011990-9,0,-61952400
 011990-9,22,-61950600
 011990-9,-11,-61948440
 012650-9,111,-65553120
 012650-9,78,-65550960

 Or, write to a resultant file, as you would regularly in a shell:

 $ avro cat sample_input.avro --format csv > sample_input.csv

 For more options on avro's cat and write opts:

 $ avro --help

 On Tue, Jan 24, 2012 at 9:01 PM, selvi k gridsngat...@gmail.com wrote:
  Hello All,
 
 
  I would like some suggestions on where I can start in the Avro project.
 
 
  I want to be able to read from an Avro formatted log file (specifically
  the
  History Log file created at the end of a Hadoop job) and create a Comma
  Separated file of certain log entries. I need a csv file because this
  is the
  format that is accepted by post processing software I am working with
  (eg:
  Matlab).
 
 
  Initially I was using a BASH script to grep and awk from this file and
  create my CSV file because I needed a very few values from it, and a
  quick
  script just worked. I didn't try

Re: Decode without using DataFileReader

2011-12-05 Thread Harsh J
I do not understand what you're trying to achieve here.

Encoders work at the primitive level - they merely serialize a given data
structure (records, unions, for example) and do not look at the schema (notice
that you create a record with a schema, not an encoder with a schema). Decoders
could do the same and read back primitives, but if they had a schema they'd
read back properly packed data structures. Since encoders do not store the schema,
decoders need it externally.

DataFiles solve this for you by writing the schema itself into the file as a 
header. The reader loads this schema into the decoder when it attempts to read 
it back.
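
To make the "decoders need it externally" point concrete, here is a minimal
sketch of reading back bytes produced by a raw binary encoder; it assumes the
reading side has the exact Schema object that was used when writing:

  import java.io.IOException;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryDecoder;
  import org.apache.avro.io.DecoderFactory;

  public class RawDecode {
    // 'bytes' are assumed to come from a binary encoder like the one in the
    // quoted code below; 'schema' must match the schema used when writing.
    public static GenericRecord decode(byte[] bytes, Schema schema) throws IOException {
      GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
      BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
      return reader.read(null, decoder);
    }
  }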

On 05-Dec-2011, at 11:43 PM, Gaurav wrote:

 it makes no sense for the encoder to store schema for every given record,
 into a stream. 
 
 Agree. Its not even encode/decoders job to store schema.
 
 While writing data, I noticed that we don't even need DataFileWriter, all it
 needs is GenericDatumWriter, Encoder and any kind of output stream (which
 can also be a file output stream).
 
 Sample:
 
  private static ByteArrayOutputStream EncodeData() throws IOException {
    // TODO Auto-generated method stub
    Schema schema = createMetaData();

    GenericDatumWriter<GenericData.Record> datum =
        new GenericDatumWriter<GenericData.Record>(schema);

    GenericData.Record inner_record =
        new GenericData.Record(schema.getField("trade").schema());
    inner_record.put("inner_abc", new Long(23490843));

    GenericData.Record record = new GenericData.Record(schema);
    record.put("abc", 1050324);
    record.put("trade", inner_record);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = ENCODER_FACTORY.binaryEncoder(out, null);

    datum.write(record, encoder);

    encoder.flush();
    out.close();

    return out;
  }
 
 
 Then why can't I just use back the same output stream to read back metadata
 and data. It should not be the responsibility of stream reader (which in
 this case is served by FileDataReader) to parse out schema.
 
 Thanks,
 Gaurav Nanda
 
 --
 View this message in context: 
 http://apache-avro.679487.n3.nabble.com/Decode-without-using-DataFileReader-tp3561722p3562127.html
 Sent from the Avro - Users mailing list archive at Nabble.com.



Re: Avro and Hadoop streaming

2011-06-15 Thread Harsh J
Miki,

You'll need to provide the entire canonical class name
(org.apache.avro.mapred…).

On Wed, Jun 15, 2011 at 5:31 AM, Miki Tebeka miki.teb...@gmail.com wrote:
 Greetings,

 I've tried to run a job with the following command:

 hadoop jar ./hadoop-streaming-0.20.2-cdh3u0.jar \
    -input /in/avro \
    -output $out \
    -mapper avro-mapper.py \
    -reducer avro-reducer.py \
    -file avro-mapper.py \
    -file avro-reducer.py \
    -cacheArchive /cache/avro-mapred-1.6.0-SNAPSHOT.jar \
    -inputformat AvroAsTextInputFormat

 However I get
 -inputformat : class not found : AvroAsTextInputFormat

 I'm probably missing something obvious to do.

 Any ideas?

 Thanks!
 --
 Miki

 On Fri, Jun 3, 2011 at 1:43 AM, Doug Cutting cutt...@apache.org wrote:
 Miki,

 Have you looked at AvroAsTextInputFormat?

 http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/AvroAsTextInputFormat.html

 Also, release 1.5.2 will include AvroTextOutputFormat:

 https://issues.apache.org/jira/browse/AVRO-830

 Are these perhaps what you're looking for?

 Doug

 On 06/02/2011 11:30 PM, Miki Tebeka wrote:
 Greetings,

 I'd like to use hadoop streaming with Avro files.
 My plan is to write an inputformat class that emits json records, one
 per line. This way the streaming application can read one record per
 line.
 (http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#Specifying+Other+Plugins+for+Jobs)

 I couldn't find any documentation/help about writing inputformat
 classes. Can someone point me to the right direction?

 Thanks,
 --
 Miki





-- 
Harsh J


Re: could I add a field Map

2011-04-08 Thread Harsh J
You can use Maps as long as their key type is limited to strings, I
think. Map<String, X> is alright (X should also be an Avro-acceptable type, of
course).
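
For example, a minimal sketch of populating such a field from Java (the record
and field names here are only illustrative):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;

  public class MapFieldExample {
    // An illustrative record schema with a single map-typed field.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Row\",\"fields\":"
        + "[{\"name\":\"attrs\",\"type\":{\"type\":\"map\",\"values\":\"long\"}}]}";

    public static void main(String[] args) {
      Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

      Map<String, Long> attrs = new HashMap<String, Long>();
      attrs.put("someKey", 42L);            // Avro map keys are always strings

      GenericRecord record = new GenericData.Record(schema);
      record.put("attrs", attrs);           // a plain java.util.Map works as the datum
      System.out.println(record);
    }
  }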

On Fri, Apr 8, 2011 at 7:49 PM, Weishung Chung weish...@gmail.com wrote:
 I am using Apache Avro in my project and was wondering could it be possible
 to add a field Map (TreeMap). I know that we can use Array and it works but
 I would like to be able to get by key :)
 Thank you,
 Wei Shung



-- 
Harsh J
http://harshj.com


Re: How to direct Reducer to write avro objects to avro sequence file?

2011-03-10 Thread Harsh J
By 'Avro sequence files' do you mean Avro data-files?

Avro-Mapred classes right now only support the older, stable API
(which has been undeprecated in 0.20.3, and is supported in 0.21 as
well - no worries in using it really). There is AVRO-593 that tracks a
new API implementation of Avro's mapred support (but it should be
fairly easy to write your own wrappers for these after a bit of
reading, since changes are mostly superficial).
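
For completeness, a minimal sketch of that old-API wiring (the schema is assumed
to be whatever record schema your reducer emits):

  import org.apache.avro.Schema;
  import org.apache.avro.mapred.AvroJob;
  import org.apache.hadoop.mapred.JobConf;

  public class JobSetup {
    public static void configure(JobConf conf, Schema outputSchema) {
      // AvroJob configures the Avro output format, serialization and schema
      // metadata on the JobConf, so there is no setOutputFormatClass() call here.
      AvroJob.setOutputSchema(conf, outputSchema);
    }
  }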

On Fri, Mar 11, 2011 at 11:24 AM, Aleksey Maslov
aleksey.mas...@lab49.com wrote:
 Hi,
 (using hadoop 0.20.2 and avro 1.4.1)

 I have defined a simple avro object 'AvroObj' (a record of strings),
 compiled the schema and
 setup a simple MR job that takes as input <Object, Text> and emits
 <Text, IntWritable>,
 and a reducer that takes said <Text, IntWritable> and ...
 What I would like to achieve is to have the reducer emit <NullWritable, AvroObj>
 pairs into an avro sequence file;

 so the next mr job will open that avro file and read-in avro objects, not
 text lines, out of it;

 I have looked through the (H ed.2) book and few online samples but can't
 figure out how to do it;
 some online sources mention job config settings like:
        job.setOutputFormatClass(AvroOutputFormat.class);
        AvroOutputFormat.setCompressOutput(conf, false);

 But this doesn't compile - setCompressOutput asks for deprecated JobConf
 object, and
 setOutputFormatClass gives error about its param - param not applicable to
 AvroOutputFormat.class;

 Could someone enlighten me how to have reducer write to avro sequence file ?

 Cheers;

 --
 View this message in context: 
 http://apache-avro.679487.n3.nabble.com/How-to-direct-Reducer-to-write-avro-objects-to-avro-sequence-file-tp2663706p2663706.html
 Sent from the Avro - Users mailing list archive at Nabble.com.




-- 
Harsh J
www.harshj.com


Importing into Eclipse

2011-02-06 Thread Harsh J
Hello,

I updated my svn clone of Avro after quite a while and noticed that
the Java build has moved from Ant to Maven. I was not very familiar
with maven based projects yet, but I got some reading done and am able
to use it now. But I have a remaining question that I was not able to
solve:

How do I ask it to generate Eclipse project files? I liked the
'eclipse' or 'eclipse-files' target in the earlier Ant-based build
system of Avro which easily generated Eclipse project files. But when
I try 'mvn install; mvn eclipse:eclipse' it fails for the tools
package (in the eclipse part, the install goes fine). I do not have
enough experience with Maven to know if it is a fault I'm doing or if
it is a fault with the build files related to maven. Any help with
creating eclipse project files for Avro's Java sub-project?

-- 
Harsh J
www.harshj.com


Re: How to get started with examples on avro

2011-01-28 Thread Harsh J
Based on the language you're targeting, have a look at its test-cases
available in the project's version control:
http://svn.apache.org/repos/asf/avro/trunk/lang/ [You can check it out
via SVN, or via Git mirrors]

Another good resource on the ends of Avro (Data and RPC) is by phunt
at http://github.com/phunt/avro-rpc-quickstart#readme

I had written a python data-file centric snippet for Avro a while ago
at my blog; it may help if you're looking to get started with Python
(although it does not cover all aspects, which the functions in the
available test cases for lang/python do):
http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/

On Sat, Jan 29, 2011 at 1:34 AM, felix gao gre1...@gmail.com wrote:
 Hi all,
 I am trying to convert a lot of our existing logs into avro format in
 hadoop.  I am not sure if there are any examples to follow.
 Thanks,
 Felix



-- 
Harsh J
www.harshj.com


Re: How to get started with examples on avro

2011-01-28 Thread Harsh J
On Sat, Jan 29, 2011 at 1:59 AM, felix gao gre1...@gmail.com wrote:
 Thanks for the quick reply.  I am interested in doing this through the java
 implementation and I would like to do it in parallel that utilizes the
 mapreduce framework.

That operation is pretty similar to writing a normal output data file.

You can use the MapReduce API of Avro (that provides an Input/Output
Format class to use, given a Schema) to do so, or write your own
custom record writing classes that do it by converting your input
format's record representation to Avro serialized records and writing
those out to an open DataFile for a given schema. Alternatively, you
can also write avro serialized data bytes into SequenceFiles.

I believe the Hadoop MapReduce trunk may have some good code on Avro
serialization classes and uses of that in MapReduce.

 On Fri, Jan 28, 2011 at 12:22 PM, Harsh J qwertyman...@gmail.com wrote:

 Based on the language you're targeting, have a look at its test-cases
 available in the project's version control:
 http://svn.apache.org/repos/asf/avro/trunk/lang/ [You can check it out
 via SVN, or via Git mirrors]

 Another good resource on the ends of Avro (Data and RPC) is by phunt
 at http://github.com/phunt/avro-rpc-quickstart#readme

 I had written a python data-file centric snippet for Avro a while ago
 at my blog; it may help if you're looking to get started with Python
 (although it does not cover all aspects, which the functions in the
 available test cases for lang/python do):

 http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/

 On Sat, Jan 29, 2011 at 1:34 AM, felix gao gre1...@gmail.com wrote:
  Hi all,
  I am trying to convert a lot of our existing logs into avro format in
  hadoop.  I am not sure if there are any examples to follow.
  Thanks,
  Felix



 --
 Harsh J
 www.harshj.com





-- 
Harsh J
www.harshj.com


Re: Avro Python appending data

2010-12-22 Thread Harsh J
Hi,

On Thu, Dec 23, 2010 at 6:29 AM, felix gao gre1...@gmail.com wrote:
 Hi all,

 I am having trouble adding more data into a file.

 Environment: Python 2.6.5, avro-1.3.3-py2.6

 Program looks like this

I see you've read my blog post on Avro+Python :P
http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/

 if I remove the second write_avro_file() call then everything is fine.  How
 to properly append more data into the file?

To append to an existing datafile, do not initialize the writer object
with a writers_schema again. Just create it using:

df_writer = datafile.DataFileWriter(
open(OUTFILE_NAME, 'wb'),
io.DatumWriter(),
)

-- 
Harsh J
www.harshj.com


Re: Avro Python appending data

2010-12-22 Thread Harsh J
Sorry, minor error, not 'wb', but 'ab+'

 df_writer = datafile.DataFileWriter(
                    open(OUTFILE_NAME, 'ab+'),
                    io.DatumWriter(),
                )

-- 
Harsh J
www.harshj.com


Parsing Anonymous Schema

2010-11-24 Thread Harsh J
Hello,

Is it possible to parse an anonymous record schema (back into a
proper Schema object)?

If I've created an anonymous record schema, I'm getting an error (No
name found, via Schema.parse() of Java API) when I de-serialize its
JSON to form a Schema object again.

Is this the intended behavior or is it a bug?

-- 
Harsh J
www.harshj.com