Re: Avro file Compression

2013-08-22 Thread Scott Carey
The file format compresses in blocks, and the block size is configurable. This will compress across objects in a block, so it works for small objects as well as large ones -- as long as the total block size is large enough. I have found that I can increase the ratio of compression by ordering the
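A minimal Java sketch of that tuning (the codec, block size, file name, and pre-sorted records are illustrative assumptions, not from the original message):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    // schemaJson and sortedRecords are assumed to exist elsewhere.
    Schema schema = new Schema.Parser().parse(schemaJson);
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.deflateCodec(6)); // compression is applied per block
    writer.setSyncInterval(1 << 20);               // ~1 MB blocks: more records per block
    writer.create(schema, new File("events.avro"));
    for (GenericRecord r : sortedRecords) {        // grouping similar records can raise the ratio
      writer.append(r);
    }
    writer.close();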

Re: Avro Schema to SQL

2013-06-28 Thread Scott Carey
Not all Avro schemas can be converted to SQL. Primarily, Unions can pose challenges, as well as recursive references. Nested types are a mixed bag -- some SQL-related systems have rich support for nested types and/or JSON (e.g. PostgreSQL) which can make this easier, while others are more crude

Re: Reader / Writer terminology

2013-06-10 Thread Scott Carey
to or from object representations to serialized forms. The general case includes all transformation classes as well as views. On 6/8/13 10:16 PM, Gregory (Grisha) Trubetskoy gri...@apache.org wrote: On Sat, 8 Jun 2013, Scott Carey wrote: In a more general sense it is simply from and to -- One might

Re: Reader / Writer terminology

2013-06-08 Thread Scott Carey
I'm about to make all of this even more confusing... For pair-wise resolution when the operation is deserialization, reader and writer make sense. In a more general sense it is simply from and to -- One might move from schema A to B without serialization at all, transforming a data structure, or

Re: Compressed Avro vs. compressed Sequence - unexpected results?

2013-05-23 Thread Scott Carey
For your avro files, double check that snappy is used (use avro-tools to peek at the metadata in the file, or simply view the head in a text editor; the compression codec used will be in the header). Snappy is very fast; most likely the time to read is dominated by deserialization. Avro will be
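If you would rather check from Java than with avro-tools, a small sketch (file name illustrative):

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    DataFileReader<GenericRecord> reader =
        new DataFileReader<>(new File("data.avro"), new GenericDatumReader<GenericRecord>());
    // "avro.codec" is stored in the file header; expect e.g. "snappy" or "deflate"
    System.out.println(reader.getMetaString("avro.codec"));
    reader.close();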

Re: using Avro unions with HIVE

2013-05-23 Thread Scott Carey
The Hive mailing list would have more info on the Avro SerDe usage. In general, a system that does not have union types like Hive (or Pig, etc) has to expand a union into multiple fields if there are more than one non-null type -- and at most one branch of the union is not null. For example a

Re: Newb question on importing JSON and defaults

2013-05-23 Thread Scott Carey
On 5/22/13 2:26 PM, Gregory (Grisha) Trubetskoy gri...@apache.org wrote: Hello! I have a test.json file that looks like this: {"first": "John", "last": "Doe", "middle": "C"} {"first": "John", "last": "Doe"} (Second line does NOT have a middle element). And I have a test.schema file that looks like this:

Re: Best practices for java enums...?

2013-05-13 Thread Scott Carey
It would be nice to be able to reference an existing class when using the specific compiler. If you have an existing com.mycompany.Foo enum (or SpecificRecord, or Fixed type) and provide the specific compiler with the type prior to parsing the schema, it could accept a reference: {"type": "record",

Re: Jackson and Avro, nested schema

2013-05-13 Thread Scott Carey
It appears that you will need to modify the JSON decoder in Avro to achieve this. The JSON decoder in Avro was built to encode any Avro schema into JSON with 100% fidelity, so that the decoder can read it back. The decoder does not work with any arbitrary JSON. This is because there are

Re: avro.java.string vs utf8 compatibility in recent pig and hive versions

2013-05-13 Thread Scott Carey
The change in the Pig loader in PIG-3297 seems correct -- they must use CharSequence, not Utf8. I suspect that the Avro 1.5.3.jar does not respect the avro.java.string property and is using Utf8 (for the API that Pig is using), but have not confirmed it. avro.java.string is an optional hint for

Re: Hadoop serialization DatumReader/Writer

2013-05-13 Thread Scott Carey
Making the DatumReader/Writers configurable would be a welcome addition. Ideally, much more of what goes on there could be: 1. configuration driven 2. pre-computed to avoid repeated work during decoding/encoding We do some of both already. The trick is to do #1 without impacting performance

Re: map/reduce of compressed Avro

2013-04-29 Thread Scott Carey
Martin said it already, but I will emphasize: Avro data files are splittable and can support multiple mappers no matter what codec is used for compression. This is because avro files are block based, and compression is applied only within each block. I recommend starting with gzip compression, and

Re: Could specific records implement the generic API as well?

2013-04-15 Thread Scott Carey
Which aspect of the generic API are you most interested in? The builder, getters, or setters? Most people that use Specific records do so for compile time type safety, so adding 'set(foo, fooval)' is not desired for those users. On the other hand it is certainly possible to add it. The code

Re: Could specific records implement the generic API as well?

2013-04-15 Thread Scott Carey
I would like to figure out how to make SpecificRecord and GenericRecord immutable in the longer term (or as an option with the code generation and/or builder). The builder is the first step, but setters are the enemy. Is there a way to do this that does not introduce new mutators for all

Re: Issue writing union in avro?

2013-04-07 Thread Scott Carey
- writing binary avro data? Thanks again Jon 2013/4/6 Scott Carey scottca...@apache.org This is due to using the JSON encoding for avro and not the binary encoding. It would appear that the Python version is a little bit lax on the spec. Some have built variations of the JSON encoding

Re: Issue writing union in avro?

2013-04-06 Thread Scott Carey
This is due to using the JSON encoding for avro and not the binary encoding. It would appear that the Python version is a little bit lax on the spec. Some have built variations of the JSON encoding that do not label the union, but there are drawbacks to this too, as the type can be ambiguous in a

Re: Has anyone developed a utility to tell what is missing from a record?

2013-04-06 Thread Scott Carey
Try GenericRecordBuilder. For the Specific API, there are builders that will not let you construct an object that can not be serialized. The Generic API should have the same thing, but I am not 100% sure the builder there covers it. I have always avoided using any API that allows me to create an
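A minimal sketch of the builder approach (schema and field names are illustrative):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.GenericRecordBuilder;

    Schema schema = new Schema.Parser().parse(schemaJson); // a record schema, assumed
    GenericRecord rec = new GenericRecordBuilder(schema)
        .set("first", "John")
        .set("last", "Doe")
        .build(); // throws here if a field with no default was never set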

Re: Support for char[] and short[] - Java

2013-01-08 Thread Scott Carey
You can cast both short and char safely to int and back, and use Avro's int type. These will be variable length integer encoded and take 1 to 3 bytes in binary form per short/char. It will be clunky for a user to wrap char[] or short[] into List<Integer> or int[], however. Another option would be
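A sketch of that widening/narrowing wrapper (the helper names are hypothetical):

    import java.util.ArrayList;
    import java.util.List;

    // Writing: widen each char to int. Avro ints are zigzag varint encoded,
    // so 16-bit values take 1 to 3 bytes on the wire.
    static List<Integer> toIntList(char[] chars) {
      List<Integer> out = new ArrayList<>(chars.length);
      for (char c : chars) out.add((int) c);
      return out;
    }

    // Reading: narrow back, assuming the writer only stored 16-bit values.
    static char[] toCharArray(List<Integer> ints) {
      char[] out = new char[ints.size()];
      for (int i = 0; i < ints.size(); i++) out[i] = (char) ints.get(i).intValue();
      return out;
    }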

Re: Appending to .avro log files

2013-01-08 Thread Scott Carey
A sync marker delimits each block in the avro file. If you want to start reading data from the middle of a 100GB file, DataFileReader will seek to the middle and find the next sync marker. Each block can be individually compressed, and by default when writing a file the writer will not compress
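A sketch of seeking into the middle of a file (file name illustrative):

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    File f = new File("big.avro");
    DataFileReader<GenericRecord> reader =
        new DataFileReader<>(f, new GenericDatumReader<GenericRecord>());
    reader.sync(f.length() / 2); // jump to the first sync marker past the midpoint
    while (reader.hasNext()) {
      GenericRecord rec = reader.next(); // records from the second half onward
    }
    reader.close();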

Re: Embedding schema with binary encoding

2013-01-08 Thread Scott Carey
Calling toJson() on a Schema will print it in JSON form. However you most likely do not want to invent your own file format for Avro data. Use DataFileWriter, which will manage the schema for you, along with compression, metadata, and the ability to seek to the middle of the file. Additionally it
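For example (a sketch using Schema.toString(true), which pretty-prints the JSON form; the schema and record variables are assumed):

    import java.io.File;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    // Print the schema as JSON, pretty-printed.
    System.out.println(schema.toString(true));

    // Let the container file carry the schema rather than inventing a format:
    DataFileWriter<GenericRecord> w =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    w.create(schema, new File("out.avro")); // header embeds schema and metadata
    w.append(record);
    w.close();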

Re: Setters and getters

2013-01-08 Thread Scott Carey
No. However each API (Specific, Reflect, Generic in Java) has different limitations and use cases. You'll have to provide more information about your use cases and expectations for more specific guidance. On 1/7/13 11:21 AM, Tanya Bansal tanyapban...@gmail.com wrote: Is it necessary to write

Re: Sync() between records? How do we recover from a bad record, using DataFileReader?

2013-01-08 Thread Scott Carey
For the corruption test, try corrupting the records, not the sync marker. The features added to DataFileReader for corruption recovery were for the case when decoding a record fails (corrupted record), not for when a sync marker is corrupted. Perhaps we should add that too, but it does not

Re: Serializing json against a schema

2013-01-08 Thread Scott Carey
You could use the ReflectDatumWriter to write a simple java data class to Avro, and you can create instances of such classes from JSON using a library like Jackson. There is a JSON encoding for Avro, if your data conformed to that format (which would be more verbose than what you have below)
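A sketch of that pipeline, here using Jackson 2 (the Event class is hypothetical):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.reflect.ReflectData;
    import org.apache.avro.reflect.ReflectDatumWriter;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Event is a plain Java class whose fields mirror the JSON.
    Event event = new ObjectMapper().readValue(json, Event.class);

    Schema schema = ReflectData.get().getSchema(Event.class);
    ReflectDatumWriter<Event> writer = new ReflectDatumWriter<>(schema);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(event, enc);
    enc.flush(); // out now holds the binary Avro bytes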

Re: issue with writing an array of records

2013-01-08 Thread Scott Carey
On 1/7/13 8:35 AM, Alan Miller alan.mill...@gmail.com wrote: Hi, I have a schema with an array of records (I'm open to other suggestions too) field called ifnet to store misc attribute name/values for a host's network interfaces. e.g. { "type": "record", "namespace": "com.company.avro.data",

Re: any movement on JSON encoding for RPC?

2012-11-27 Thread Scott Carey
Avro can serialize in JSON; however, most users use the compact binary serialization for performance and data storage reasons (JSON is typically 10x larger), and use the JSON format for debugging or export to other systems. I do not know if anyone is planning work on the JSON encoding in

Re: Backwards compatible - Optional fields

2012-10-03 Thread Scott Carey
A reader must always have the schema of the written data to decode it. When creating your Decoder, you must pass both the reader's schema and the schema as written. Once given this pair, Avro can know to skip data as written if the reader does not need it, or to inject default values for the
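A sketch of decoding with the writer/reader schema pair (variable names assumed):

    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;

    // writerSchema: what the bytes were written with; readerSchema: what you want now.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(writerSchema, readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    GenericRecord rec = datumReader.read(null, decoder);
    // Removed fields are skipped; fields new to the reader get their defaults.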

Re: Schema resolution failure when the writer's schema is a primitive type and the reader's schema is a union

2012-08-31 Thread Scott Carey
My understanding of the spec is that promotion to a union should work as long as the prior type is a member of the union. What happens if the union order in the reader schema is reversed? This may be a bug. -Scott On 8/16/12 5:59 PM, Alexandre Normand alexandre.norm...@gmail.com wrote:

Re: Pig with Avro and HBase

2012-08-30 Thread Scott Carey
I am using Pig on Avro data files, and Avro in HBase. Can you elaborate on what you mean by 'auto-load the schema'? In the sense that a big LOAD statement doesn't have to declare the schema? I do this with avro data files to some extent (with limitations). A working implementation of

Re: Suggestions when using Pair.getPairSchema for Reduce-Side Joins in MR2

2012-06-28 Thread Scott Carey
It sounds like we need to be extra clear in the documentation on Pair, and perhaps have a different class or flavor that serves the purpose you needed. (KeyPair?) In Avro's MRV1 API, there is no key schema or value schema for map output, but only one map output schema that must be a Pair -- a pair

Re: C/C++ parsing vs. Java parsing.

2012-06-25 Thread Scott Carey
The schema provided is a union of several schemas. Java supports parsing this, C++ may not. Does it work if you make it one single schema, and nest NA, acomplex and retypes inside of object? It only needs to be defined the first time it is referenced. If it does not, then it is certainly a

Re: Paranamer issue

2012-06-18 Thread Scott Carey
On 6/6/12 10:33 AM, Peter Cameron peter.came...@2icworld.com wrote: The BSD license is a problem for our clients, whereas the Apache 2 license is not. Go figure. That's the situation! ASL 2.0 is a derivative of the BSD license, after all... Apache projects regularly depend on other items that

Re: Scala API

2012-05-30 Thread Scott Carey
This would be fantastic. I would be excited to see it. It would be great to see a Scala language addition to the project if you wish to contribute. I believe there have been a few other Scala Avro attempts by others over time. I recall one where all records were case classes (but this broke

Re: How represent abstract in Schemas

2012-05-07 Thread Scott Carey
Avro schemas can represent Union types, but not abstract types. It does not make sense to serialize an abstract class, since its data members are not known. By definition, an abstract type does not define all of the possible sub types in advance, which presents another problem -- in order to make

Re: Nested schema issue

2012-05-01 Thread Scott Carey
On 5/1/12 9:47 AM, Peter Cameron pe...@pbpartnership.com wrote: I'm having a problem with nesting schemas. A very brief overview of why we're using Avro (successfully so far) is: o code generation not required o small binary format o dynamic use of schemas at runtime We're doing a flavour of

Re: Nested schema issue (with munged invalid schema)

2012-05-01 Thread Scott Carey
On 5/1/12 9:55 AM, Peter Cameron pe...@pbpartnership.com wrote: I'm having a problem with nesting schemas. A very brief overview of why we're using Avro (successfully so far) is: o code generation not required o small binary format o dynamic use of schemas at runtime We're doing a

Re: Support for Serialization and Externalization?

2012-05-01 Thread Scott Carey
On 4/23/12 10:37 AM, Joe Gamache gama...@cabotresearch.com wrote: Hello, We have been using Avro successfully to serialize many of our objects, using binary encoding, for storage and retrieval. Although the documentation about the Reflect Mapping states: This API is not recommended

Re: Specific/GenericDatumReader performance and resolving decoders

2012-04-19 Thread Scott Carey
I think this approach makes sense, reader=writer is common. In addition to record fields, unions are affected. I have been thinking about the issue that resolving records is slower than not for a while. In theory, it could be just as fast because you can pre-compute the steps needed and bake

[ANNOUNCE] New Apache Avro PMC Member: Douglas Creager

2012-04-10 Thread Scott Carey
The Apache Avro PMC is pleased to announce that Douglas Creager is now part of the PMC. Congratulations and Thanks!

Re: Sync Marker Issue while reading AVRO files writen with FLUME with PIG

2012-04-03 Thread Scott Carey
I have not seen this issue before with 100 TB of Avro files, but am not using Flume to write them. We have moved on to Avro 1.6.x but were on the 1.5.x line for quite some time. Perhaps while writing there was an exception of some sort that was not handled correctly in Avro or Flume. Looking at

Re: avro compression using snappy and deflate

2012-04-02 Thread Scott Carey
On 3/30/12 12:08 PM, Shirahatti, Nikhil snik...@telenav.com wrote: Hello All, I think I figured out where I goofed up. I was flushing on every record, so basically this was compression per record, so it had a meta data with each record. This was adding more data to the output when compared to

Re: BigInt / longlong

2012-03-28 Thread Scott Carey
On 3/28/12 11:01 AM, Meyer, Dennis dennis.me...@adtech.com wrote: Hi, What type refers to a Java Bigint or C long long? Or is there any other type in Avro that maps a 64 bit unsigned int? I unfortunately could only find smaller types in the docs: Primitive Types The set of primitive

Re: Problem: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 / avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches / `read_data': Writer's schem

2012-03-26 Thread Scott Carey
the reader. On Fri, Mar 23, 2012 at 8:01 PM, Russell Jurney russell.jur...@gmail.com wrote: Thanks Scott, looking at the raw data it seems to have been a truncated record due to UTF problems. Russell Jurney http://datasyndrome.com On Mar 23, 2012, at 7:59 PM, Scott Carey scottca...@apache.org

Re: Problem: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 / avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches / `read_data': Writer's schem

2012-03-23 Thread Scott Carey
It appears to be reading a union index and failing in there somehow. If it did not have any of the pig AvroStorage stuff in there I could tell you more. What does avro-tools.jar's 'tojson' tool do? (java -jar avro-tools-1.6.3.jar tojson file | your_favorite_text_reader) What version of Avro

Re: Globbing several AVRO files with different (extended) schemes

2012-03-20 Thread Scott Carey
I'm assuming you are using Pig's AvroStorage function. It appears that it does not support schema migration, but it certainly could do so. A collection of avro files can be 'viewed' as if they all are of one schema provided they can all resolve to it. I have several tools that do this

Re: a possible bug in Avro MapReduce

2012-03-20 Thread Scott Carey
Perhaps it is https://issues.apache.org/jira/browse/AVRO-1045 Are you creating a copy of the GenericRecord? -Scott On 3/19/12 3:34 PM, ey-chih chow eyc...@hotmail.com wrote: Hi, We got an Avro MapReduce job with the signature of the map function as follows: public void

Re: Java MapReduce Avro Jackson Error

2012-03-19 Thread Scott Carey
What version of Avro are you using? You may want to try Avro 1.6.3 + Jackson 1.8.8. This is related, but is not your exact problem. https://issues.apache.org/jira/browse/AVRO-1037 You are likely pulling in some other version of jackson somehow. You may want to use 'mvn dependency:tree' on

Re: Java MapReduce Avro Jackson Error

2012-03-19 Thread Scott Carey
error. Is there anything specific I need to do other than changing dependencies in pom.xml to make this error go away? On Mon, Mar 19, 2012 at 9:12 PM, Tatu Saloranta tsalora...@gmail.com wrote: On Mon, Mar 19, 2012 at 6:06 PM, Scott Carey scottca...@apache.org wrote: What version of Avro are you

Re: Make a copy of an avro record

2012-03-12 Thread Scott Carey
We should be generating Java 1.6 compatible code. What version were you testing? 1.6.3 is near release, the RC is available here: http://mail-archives.apache.org/mod_mbox/avro-dev/201203.mbox/%3C4F514F22.8070...@apache.org%3E Does it have the same problem? On 3/12/12 9:27 AM, Jeremy Lewi

Re: parsing Avro data files in JavaScript

2012-02-21 Thread Scott Carey
See also the discussion about a JavaScript Avro implementation from last week: http://search-hadoop.com/m/MiNCyvLts/HttpTranceiversubj=HttpTranceiver+and+JSON+encoded+Avro+ On 2/21/12 7:56 AM, Carriere, Jeromy jero...@x.com wrote: We're working on one to support the X.commerce Fabric:

Re: Order of the schema in Union

2012-02-21 Thread Scott Carey
As for why the union does not seem to match: The Union schemas are not the same as the one in the error -- the one in the error does not have a namespace. It finds AVRO_NCP_ICM but the union has only merced.AVRO_NCP_ICM and merced.AVRO_IVR_BY_CALLID. The namespace and name must both match. Is

Re: HttpTranceiver and JSON-encoded Avro?

2012-02-15 Thread Scott Carey
See https://issues.apache.org/jira/browse/AVRO-485 for some discussion on JavaScript for Avro. Please comment in that ticket with your needs and use case. The project would welcome a JavaScript implementation. On 2/15/12 2:07 PM, Frank Grimes frankgrime...@gmail.com wrote: Are there any fast

Re: Writing Unsolicited Messages to a Connected Netty Client

2012-01-20 Thread Scott Carey
For certain kinds of data it would be useful to continuously stream data from server to client (or vice-versa). This can be represented as an Avro array response or request where each array element triggers a callback at the receiving end. This likely requires an extension to the avro spec, but

Re: AVRO Path

2012-01-12 Thread Scott Carey
There are no plans that I know of currently, although the topic came up two times in separate conversations last night at the SF Hadoop MeetUp. I think an ability to extract a subset of a schema from a larger one and read/write/transform data accordingly makes a lot of sense. Currently, the Avro

Re: Can spill to disk be in compressed Avro format to reduce I/O?

2012-01-12 Thread Scott Carey
/org/apache/avro/file/DataFileWriter.html#appendAllFrom%28org.apache.avro.file.DataFileStream,%20boolean%29 Thanks, Frank Grimes On 2012-01-12, at 1:14 PM, Scott Carey wrote: On 1/12/12 8:27 AM, Frank Grimes frankgrime...@gmail.com wrote: Hi All, We have Avro data files

Re: Can spill to disk be in compressed Avro format to reduce I/O?

2012-01-12 Thread Scott Carey
deserializing them? Thanks, Frank Grimes On 2012-01-12, at 1:14 PM, Scott Carey wrote: On 1/12/12 8:27 AM, Frank Grimes frankgrime...@gmail.com wrote: Hi All, We have Avro data files in HDFS which are compressed using the Deflate codec. We have written an M/R job using the Avro

Re: Can spill to disk be in compressed Avro format to reduce I/O?

2012-01-12 Thread Scott Carey
12:53 PM, Scott Carey scottca...@apache.org wrote: On 1/12/12 12:35 PM, Frank Grimes frankgrime...@gmail.com wrote: So I decided to try writing my own AvroStreamCombiner utility and it seems to choke when passing multiple input files: hadoop dfs -cat hdfs://hadoop/machine1.log.avro hdfs

Re: Can spill to disk be in compressed Avro format to reduce I/O?

2012-01-12 Thread Scott Carey
a command line tool that takes a list of files (HDFS or local) and writes a new file (HDFS or local) concatenated and possibly recodec'd. Thanks, Frank Grimes On 2012-01-12, at 3:53 PM, Scott Carey wrote: On 1/12/12 12:35 PM, Frank Grimes frankgrime...@gmail.com wrote: So I decided

Re: encoding problem for ruby client

2012-01-05 Thread Scott Carey
This sounds like the Ruby implementation does not correctly use UTF-8 on your platform for encoding strings. It may be a bug, but I am not knowledgeable enough on the Ruby implementation to know for sure. The Avro specification states that a string is encoded as a long followed by that many

Re: Collecting union-ed Records in AvroReducer

2011-12-08 Thread Scott Carey
On 12/8/11 4:10 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote: Hallo, is it possible to write/collect a union-ed record from an avro reducer? I have a reduce class (extending AvroReducer), and the output schema is a union schema of record type A and record type B. In the reduce logic I

Re: Map having string, Object

2011-12-07 Thread Scott Carey
The best practice is usually to use the flexible schema with the union value rather than transmit schemas each time. This restricts the possibilities to the set defined, and the type selected in the branch is available on the decoding side. In the case above the number of variants is not too

Re: Reduce-side joins in Avro M/R

2011-12-07 Thread Scott Carey
This should be conceptually the same as a normal map-reduce join of the same type. Avro handles the serialization, but not the map-reduce algorithm or strategy. On 12/6/11 8:43 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote: Hi, I'd like to use reduce-side joins in an avro M/R job,

Re: Importing in avdl from classpath of project

2011-12-07 Thread Scott Carey
I think that at minimum, it would be useful to have an option to 'also look in the classpath' in the maven plugin, and have the option to do so in general with the IDL compiler. I would gladly review the patch in a JIRA. -Scott On 12/7/11 10:13 AM, Chau, Victor vic...@x.com wrote: Hello,

Re: Best practice for versioning IDLs?

2011-11-29 Thread Scott Carey
I don't think there are yet best practices for what you are trying to do. However, I suggest you first consider embedding the version as metadata in the schema, rather than data. If you put it in a Record, it will be data serialized with every record. If you put it as schema metadata, it will
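Two places such a version could live, sketched (the property names are made up):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    // On the schema itself -- serialized once with the schema, not per record:
    schema.addProp("myapp.schema.version", "2");
    String v = schema.getProp("myapp.schema.version");

    // Or as data file metadata, written once in the container header:
    DataFileWriter<GenericRecord> w =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    w.setMeta("myapp.schema.version", "2"); // must be set before create()
    w.create(schema, new File("data.avro"));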

Re: Overriding default velocity templates

2011-11-28 Thread Scott Carey
To the best of my recollection, the IDL custom template bits you mention below have not been wired up through all of the tooling. Please feel free to submit JIRA tickets and patches to improve it. Thanks! -Scott On 11/28/11 7:01 AM, George Fletcher gffle...@aol.com wrote: Hi, I'm

Re: Avro-mapred and new Java MapReduce API (org.apache.hadoop.mapreduce)

2011-11-13 Thread Scott Carey
I have heard some suggestions that it would be useful if we could somehow model Avro's interaction with mapreduce using composition rather than inheritance. Has anyone tried that? Or would it be too clumsy? A good relationship with the mapreduce/mapred api via composition might require changes

Re: Does extending union break compatibility

2011-11-03 Thread Scott Carey
On 11/3/11 4:56 PM, Neil Davudo neil_dav...@yahoo.com wrote: I have a record defined as follows // version 1 record SomeRecord { union { null, TypeA } unionOfTypes; } I change the record to the following // version 2 record SomeRecord { union { null, TypeA, TypeB } unionOfTypes; }

Re: How to add optional new record fields and/or new methods in avro-ipc?

2011-10-18 Thread Scott Carey
On 10/18/11 9:47 AM, Doug Cutting cutt...@apache.org wrote: On 10/17/2011 08:14 PM, 常冰琳 wrote: What I do in the demo is add a new nullable string in server side, not change a string to nullable string. I add a new field with default value using specific, and it works fine, so I suspect the

Re: How to add optional new record fields and/or new methods in avro-ipc?

2011-10-18 Thread Scott Carey
On 10/18/11 10:38 AM, Doug Cutting cutt...@apache.org wrote: On 10/18/2011 10:09 AM, Scott Carey wrote: On 10/18/11 9:47 AM, Doug Cutting cutt...@apache.org wrote: To amend this, you can use Avro's @Nullable annotation: The problem is that this does not provide the ability to evolve

Re: Avro mapred: How to avoid schema specification in job.xml?

2011-10-10 Thread Scott Carey
I'm not all that familiar with how Oozie interacts with Avro. The Job must set its avro.input.schema and avro.output.schema properties -- this can be done in code (see the unit tests in the Avro mapred project for examples), and if you are using SpecificRecords and DataFiles the schema is

Re: Avro mapred: How to avoid schema specification in job.xml?

2011-10-10 Thread Scott Carey
is not set -- reflection on a class name or an annotation. If this looks like it is an enhancement request for Avro (or a bug) please file a JIRA ticket. Thanks! Thanks, Julien Muller 2011/10/10 Scott Carey scottca...@apache.org I'm not all that familiar with how Oozie interacts with Avro

Re: Data incompatibility between Avro 1.4.1 and 1.5.4

2011-10-03 Thread Scott Carey
AVRO-793 was not a bug in the encoded data or its format. It was a bug in how schema resolution worked for certain projection corner cases during deserialization. Is your data readable with the same schema that wrote it? (for example, if it is an avro data file, you can use avro-tools.jar to

Re: In Java, how can I create an equivalent of an Apache Avro container file without being forced to use a File as a medium?

2011-10-03 Thread Scott Carey
In addition to Joe's comments: On the write side, DataFileWriter.create() can take a file or an output stream. http://avro.apache.org/docs/1.5.4/api/java/org/apache/avro/file/DataFileWriter.html On the read side, DataFileStream can be used if the input does not have random access and can be
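A sketch of the streaming read side (the InputStream source is arbitrary):

    import java.io.InputStream;
    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    // 'in' can be any InputStream -- socket, pipe, HTTP body; no seeking needed.
    DataFileStream<GenericRecord> stream =
        new DataFileStream<>(in, new GenericDatumReader<GenericRecord>());
    for (GenericRecord rec : stream) {
      // the schema comes from the stream header; process each record here
    }
    stream.close();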

Re: Compression and splittable Avro files in Hadoop

2011-09-30 Thread Scott Carey
Yes, Avro Data Files are always splittable. You may want to up the default block size in the files if this is for MapReduce. The block size can often have a bigger impact on the compression ratio than the compression level setting. If you are sensitive to the write performance, you might want

Re: Avro versioning and SpecificDatum's

2011-09-20 Thread Scott Carey
That looks like a bug. What happens if there is no aliasing/renaming involved? Aliasing is a newer feature than field addition, removal, and promotion. This should be easy to reproduce, can you file a JIRA ticket? We should discuss this further there. Thanks! On 9/19/11 6:14 PM, Alex Holmes

Re: Avro versioning and SpecificDatum's

2011-09-20 Thread Scott Carey
. On Tue, Sep 20, 2011 at 2:22 AM, Scott Carey scottca...@apache.org wrote: That looks like a bug. What happens if there is no aliasing/renaming involved? Aliasing is a newer feature than field addition, removal, and promotion. This should be easy to reproduce, can you file a JIRA ticket? We

Re: Avro versioning and SpecificDatum's

2011-09-19 Thread Scott Carey
I version with SpecificDatum objects using avro data files and it works fine. I have seen problems arise if a user is configuring or reconfiguring the schemas on the DatumReader passed into the construction of the DataFileReader. In the case of SpecificDatumReader, it is as simple as:
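That simple case, sketched (User is a hypothetical generated SpecificRecord class):

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.specific.SpecificDatumReader;

    // Reader schema comes from the generated class; writer schema from the file
    // header. Resolution between the two versions is automatic.
    DataFileReader<User> reader = new DataFileReader<>(
        new File("old-users.avro"), new SpecificDatumReader<>(User.class));
    while (reader.hasNext()) {
      User u = reader.next();
    }
    reader.close();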

Re: How should I migrate 1.4 code to avro 1.5?

2011-09-02 Thread Scott Carey
The javadoc for the deprecated method directs users to the replacement. BinaryEncoder and BinaryDecoder are well documented, with docs available via maven for IDE's to consume easily, or via the Apache Avro website: http://avro.apache.org/docs/1.5.3/api/java/org/apache/avro/io/BinaryEncoder.html

Re: How should I migrate 1.4 code to avro 1.5?

2011-09-02 Thread Scott Carey
Are you still having trouble with this? I noticed that the code has changed and you are using MyPair instead of Pair. Was there a naming conflict bug with Avro's Pair.java? -Scott On 9/2/11 3:46 PM, W.P. McNeill bill...@gmail.com wrote: I made changes that got rid of all the deprecated

Re: simultaneous read + write?

2011-09-02 Thread Scott Carey
AvroDataFile deals with this for some cases. Is it an acceptable API for your use case? You can configure the block size to be very small and/or flush() regularly. If you do this on your own, you will need to track the position that you start to read a record at, and if there is a failure,

Re: How should I migrate 1.4 code to avro 1.5?

2011-09-02 Thread Scott Carey
on github. My last email details where I'm at right now. The pull request code looks correct; I'm just trying to get it to build in my Maven environment. On Fri, Sep 2, 2011 at 5:19 PM, Scott Carey scottca...@apache.org wrote: Are you still having trouble with this? I noticed

Re: avro BinaryDecoder bug ?

2011-08-31 Thread Scott Carey
Looks like a bug to me. Can you file a JIRA ticket? Thanks! On 8/29/11 1:24 PM, Yang tedd...@gmail.com wrote: if I read an empty file with BinaryDecoder, I get EOF, good, but with the current code, if I read it again with the same decoder, I get an IndexOutOfBoundsException, not EOF. it

Re: Map output records/reducer input records mismatch

2011-08-16 Thread Scott Carey
We have had one other report of something similar happening. https://issues.apache.org/jira/browse/AVRO-782 What Avro version is this happening with? What JVM version? On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args if it is Sun and JRE 6u21 or later? (some issues in

Re: Compiling multiple input schemas

2011-08-16 Thread Scott Carey
What about leveraging shell expansion? This would mean we would need inverse syntax, like tar or zip (destination, list of sources in reverse dependency order). Then your examples are avro-tools-1.6.0.jar compile schema tmp/ input/position.avsc input/player.avsc avro-tools-1.6.0.jar compile

Re: Map output records/reducer input records mismatch

2011-08-16 Thread Scott Carey
On 8/16/11 3:56 PM, Vyacheslav Zholudev vyacheslav.zholu...@gmail.com wrote: Hi, Scott, thanks for your reply. What Avro version is this happening with? What JVM version? We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have to look up. On a hunch, have you tried adding

Re: why Utf8 (vs String)?

2011-08-11 Thread Scott Carey
Also, Utf8 caches the result of toString(), so that if you call toString() many times, it only allocates the String once. It also implements the CharSequence interface, and many libraries in the JRE accept CharSequence. Note that Utf8 is mutable and exposes its backing store (byte array). String
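Illustrated in a short sketch (the caching behavior is as described in the message above):

    import org.apache.avro.util.Utf8;

    Utf8 u = new Utf8("hello");
    CharSequence cs = u;       // Utf8 implements CharSequence
    String a = u.toString();
    String b = u.toString();   // cached: no second String allocation
    byte[] raw = u.getBytes(); // exposes the mutable backing array -- copy before sharing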

Re: Combining schemas

2011-08-09 Thread Scott Carey
On 8/9/11 11:15 AM, Bill Graham billgra...@gmail.com wrote: Hi, I'm trying to create a schema that references a type defined in another schema and I'm having some troubles. Is there an easy way to do this? My test schemas look like this: $ cat position.avsc {"type": "enum", "name": "Position",

Re: Hadoop and org.apache.avro.file.DataFileReader sez Not an Avro data file

2011-07-20 Thread Scott Carey
An avro data file is not created with a FileOutputStream. That will write avro binary data to a file, but not in the avro file format (which is splittable and contains header metadata). The API for Avro Data Files is here:

Re: Hadoop and org.apache.avro.file.DataFileReader sez Not an Avro data file

2011-07-20 Thread Scott Carey
://avro.apache.org/docs/current/api/java/org/apache/avro/file/package-summary.html On 7/20/11 5:38 PM, Scott Carey scottca...@apache.org wrote: An avro data file is not created with a FileOutputStream. That will write avro binary data to a file, but not in the avro file format (which is splittable

Re: Schema with multiple Record types Java API

2011-07-15 Thread Scott Carey
to point to a root Java object, and say serialize this, and everything it points to, as AVRO. BTW AVRO Rocks! My objects contain large amounts of data, and I am *very* impressed with the speed of serialization/deserialization. Cheers P On 7/14/11 10:10 PM, Scott Carey wrote: AvroIDL can

Re: Schema with multiple Record types Java API

2011-07-14 Thread Scott Carey
The name and namespace is part of any named schema (Type.RECORD, Type.FIXED, Type.ENUM). We don't currently have an API to search a schema for subschemas that match names. It would be useful, you might want to create a JIRA ticket explaining your use case. So it would be a little more

Re: Schema with multiple Record types Java API

2011-07-14 Thread Scott Carey
be the first. Is there a document describing best practices? Thanks P On 7/14/11 7:02 PM, Scott Carey wrote: The name and namespace is part of any named schema (Type.RECORD, Type.FIXED, Type.ENUM). We don't currently have an API to search a schema for subschemas that match names. It would be useful

Re: Classpath for java

2011-06-26 Thread Scott Carey
I suspect that you will need to go into the module with the Pair class. When executing a maven plugin directly from the command line (exec:exec) the maven 'scope' is very restricted, and when you do this on the top level project it executes on that project only by default. The surefire test

Re: Avro and Hadoop streaming

2011-06-15 Thread Scott Carey
Hadoop has an old version of Avro in it. You must place the 1.6.0 jar (and relevant dependencies, or the avro-tools.jar with all dependencies bundled) in a location that gets picked up first in the task classpath. Packaging it in the job jar works. I'm not sure if putting it in the distributed

Re: avro object reuse

2011-06-10 Thread Scott Carey
data cause attempted allocations of arrays too large for the heap. On 6/9/11 4:58 PM, Scott Carey sc...@richrelevance.com wrote: What is the stack trace on the out of memory exception? On 6/9/11 4:45 PM, ey-chih chow eyc...@hotmail.com

Re: avro object reuse

2011-06-09 Thread Scott Carey
The most likely candidate for creating many instances of BufferAccessor and ByteArrayByteSource is BinaryData.compare() and BinaryData.hashCode(). Each call will create one of each (hash) or two of each (compare). These are only 32 bytes per instance and quickly become garbage that is easily

Re: avro object reuse

2011-06-02 Thread Scott Carey
of the Schema.parse() or Protocol.parse() static methods. On 6/1/11 5:48 PM, Tatu Saloranta tsalora...@gmail.com wrote: On Wed, Jun 1, 2011 at 5:45 PM, Scott Carey sc...@richrelevance.com wrote: It would be useful to get a 'jmap

Re: mixed schema avro data file?

2011-06-01 Thread Scott Carey
Two options: * Different files per schema * One schema that is a union of all schemas you want in the file Which is best depends on your use case. On 6/1/11 4:02 PM, Yang tedd...@gmail.com wrote: our use case is that we have many different types of events, with
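A sketch of the union-schema option (the two event schemas and records are hypothetical):

    import java.io.File;
    import java.util.Arrays;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    Schema union = Schema.createUnion(Arrays.asList(clickSchema, viewSchema));
    DataFileWriter<GenericRecord> w =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(union));
    w.create(union, new File("events.avro"));
    w.append(clickRecord); // a record matching either branch can be appended
    w.append(viewRecord);
    w.close();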

Re: avro object reuse

2011-06-01 Thread Scott Carey
based on the data: Somewhere, Schema objects are being created frequently via one of the Schema.parse() or Protocol.parse() static methods. On 6/1/11 5:48 PM, Tatu Saloranta tsalora...@gmail.com wrote: On Wed, Jun 1, 2011 at 5:45 PM, Scott Carey sc...@richrelevance.com wrote: It would be useful

Re: I have written a layout for log4j using avro

2011-05-31 Thread Scott Carey
To read and write an Avro Data File use the classes in org.apache.avro.file : http://avro.apache.org/docs/current/api/java/index.html The classes in tools are command line tools that wrap Avro Java APIs. The source code of these can be used as examples for using these APIs. On 5/30/11 8:01 AM,

Re: inheritance implementation?

2011-05-31 Thread Scott Carey
You can do this a few ways. The composition you list will work; the member variable should be of type Fruit. Or you can put the type object inside the fruit: record Fruit { int size; string color; int weight; union { Apple, Orange } type; } record Orange { string skin_thickness; } record
