expect specific record but get generic
i am observing that on a particular system (spark) my code breaks: avro does not return the specific record i expected but instead returns generic records. i suspect this is some class loading issue on the distributed system (something about how the classpath is constructed for the spark slaves). has anyone had class loading issues get in the way of avro specific records?
Re: expect specific record but get generic
doug, could it be a classloader (instead of classpath) issue? looking at spark, it seems to run the tasks inside the slaves/workers with a custom classloader. thanks! koert

On Mon, Oct 21, 2013 at 1:07 PM, Doug Cutting cutt...@apache.org wrote:

If the generated classes are not on the classpath then the generic representation is used. So, yes, this sounds like a classpath problem.
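The fallback Doug describes is visible (and controllable) from code. Below is a minimal sketch, assuming Avro 1.7+ (the schema name is invented for illustration): `SpecificData.getClass()` returns null when the generated class is not visible to its ClassLoader, which is exactly when the generic representation gets used, and `SpecificData` can be constructed with an explicit ClassLoader, e.g. the task's context ClassLoader inside a worker.

```java
import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificData;
import org.apache.avro.specific.SpecificDatumReader;

public class SpecificFallbackCheck {
    public static void main(String[] args) {
        // stand-in for the writer's schema read from the container file
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"example.User\",\"fields\":[]}");

        // getClass() returns null when the generated class cannot be loaded;
        // the reader then silently falls back to GenericData.Record
        Class<?> c = SpecificData.get().getClass(schema);
        System.out.println(c == null ? "would fall back to generic" : "specific: " + c);

        // Avro 1.7+ lets you supply the ClassLoader explicitly, e.g. the
        // task's context ClassLoader inside a worker:
        SpecificData data =
            new SpecificData(Thread.currentThread().getContextClassLoader());
        SpecificDatumReader<Object> reader =
            new SpecificDatumReader<>(schema, schema, data);
    }
}
```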
Re: AVRO M/R: ClassNotFoundException: ...Paranamer
i had this too at some point. i just added paranamer to the distributed cache (or the classpath on hadoop) and it went away.

On Thu, Dec 13, 2012 at 2:21 PM, Terry Healy the...@bnl.gov wrote:
Re: version of avro
option 3 is not available to us. i have been using option 1 without issues so far (except in hive), but i have only worked with the MR1 old api and avro-generic so far (not sure which of these is relevant...).

On Sat, Oct 20, 2012 at 3:59 AM, Jacob Metcalf jacob_metc...@hotmail.com wrote:

Yes, I have CDH4 working happily with Avro 1.7.0 following the process described below. The various methods of distributing the Avro jar are discussed in http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job. I could not distribute the newer Avro jar via methods 1 and 2 because of classpath issues (discussed in AVRO-1103), so I settled for the not-so-satisfactory option 3. The latest version of Avro has a patch for AVRO-1103 and a Maven profile to compile against the newer versions of Hadoop. I have not tried upgrading yet, but you should definitely try this first before going down the self-patching route. Regards, Jacob

From: kkrugler_li...@transpac.com
Subject: Re: version of avro
Date: Fri, 19 Oct 2012 13:16:24 -0700
To: user@avro.apache.org

On Oct 19, 2012, at 1:03pm, Koert Kuipers wrote: i noticed avro version 1.5.4 is included with some versions/distros of hadoop and hive... is there a reason why 1.5.4 is included specifically and not newer ones? are there some incompatibilities to be aware of? i would like to use a newer version

In the mail archives there was a discussion back in July about using Avro 1.7 with CDH4, where Jacob Metcalf said: Avro 1.7 is compiled against Hadoop 1, i.e. CDH3. I have it running against CDH4 but I had to patch and recompile Avro, then replace the Avro 1.5.3 in the Hadoop lib directory with 1.7. I have attached my patch against JIRA AVRO-1103 for consideration.

It sounds like if you're using CDH3/Hadoop 1.0 then you should be able to use Avro 1.7 as-is, but I haven't tried this myself.

Ken
http://about.me/kkrugler
+1 530-210-6378
record schema names... a nuisance?
we are on a fairly old avro (1.5.4), so i am not sure my observations apply to newer versions. i noticed that when i read from avro files in hadoop, it does not expect the reader's schema (fully qualified) name to be equal to the writer's schema (fully qualified) name. this allows me to read from files without knowing what name the schema had when it was written. according to doug cutting this is a bug, and the read should not succeed if the reader's and writer's schemas do not have the same name. also, when the schema names are not the same, field aliases do not work.

ok, with that out of the way, this is my situation: we create lots of avro files that we add to large partitioned tables (a structure with subdirs on hdfs). the people that write the files understand the importance of canonical column names (field names), but not everyone gets the idea of schema names, so generally i have avro files with many different (writer's) schema names in there. i do not expect i can correct this. also, it is not unusual to run a hadoop map-red job reading from many different data sources at once, using avro's fantastic projection ability to extract just a few columns. but in that case, again, the (writer's) schema names cannot be expected to be the same across the avro files i am reading from.

so today all of this works, meaning i can run map-reduce jobs across all these files with different/inconsistent schema names, but only thanks to a bug, which makes me nervous that one day it will not work. also, field aliases do not work, which is a real limitation. so i am trying to see if i can come up with a better solution. of course i could go find out every time what all the schema names are in the avro files, and add all of them as aliases to my reader's schema. but that is a real pain, in particular since the set is not constant. i guess i could automate this by scanning all the avro files first and extracting their schemas, but that sounds very inelegant, so i would rather not do that.
so i have 2 questions:

1) can i reasonably assume that processing in hadoop will continue to work even if the reader's and writer's schema names are not the same (i.e. rely on this bug)? the fact that field aliases do not work in this case is too bad, but at least i would have something working...

2) is there a better solution? for example, something where i could say in my reader's schema that the schema has an alias of * (wildcard), so that i can read from all these files with different (writer's) schema names without relying on a bug, and on top of that field aliases would also work? that would be fantastic...
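One middle ground, without a wildcard alias (which I don't believe exists): read each file's writer schema from the container header and add its full name as an alias on the reader's schema before resolving, so both the name match and field aliases work without relying on the bug. A minimal sketch (schema names invented), assuming the `Schema#addAlias` method that newer Avro versions provide:

```java
import org.apache.avro.Schema;

public class AliasWriterName {
    // make a reader schema resolvable against a writer schema whose
    // record name differs, by aliasing the writer's full name
    static Schema aliasTo(Schema reader, Schema writer) {
        reader.addAlias(writer.getFullName());
        return reader;
    }

    public static void main(String[] args) {
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"their.Name\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"}]}");
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"my.Name\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"}]}");
        aliasTo(reader, writer);
        System.out.println(reader.getAliases()); // now includes their.Name
    }
}
```

This still requires extracting the writer schemas first, but it can be done per file at open time rather than as a separate scan.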
version of avro
i noticed avro version 1.5.4 is included with some versions/distros of hadoop and hive... is there a reason why 1.5.4 is included specifically and not newer ones? are there some incompatibilities to be aware of? i would like to use a newer version. thanks! koert
using strings instead of utf8
how do i tell (generic) avro to use strings for values instead of its own utf8 class? i saw a way of doing it by modifying the schemas (adding a property). i also saw mention of a way to do it if you use maven (which i don't). is there a generic way to do this? like a system property perhaps? or a static method that i call to change the default? thanks! koert
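As far as I know there is no global system property for this; the per-schema `avro.java.string` property is the supported route (it is also what the maven plugin's string-type setting generates), and newer Avro has a static helper that sets it. A minimal sketch:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

public class StringTypeDemo {
    public static void main(String[] args) {
        Schema s = Schema.create(Schema.Type.STRING);
        // equivalent to adding {"avro.java.string": "String"} to the schema;
        // readers honoring the property then produce java.lang.String, not Utf8
        GenericData.setStringType(s, GenericData.StringType.String);
        System.out.println(s.getProp("avro.java.string")); // "String"
    }
}
```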
Re: abuse of aliases?
thanks doug

On Fri, Feb 3, 2012 at 3:58 PM, Doug Cutting cutt...@apache.org wrote:

On 02/02/2012 08:03 PM, Koert Kuipers wrote: i have many avro files with similar data (same meaning, same type, etc.) but different names for the fields. can i create a reader schema that, for each field i am interested in, maps it to all the different possible fields in the files by using aliases, and then run map-reduce over the files using this schema? i am talking about tens of aliases per field, and this number will only grow as more data comes in. is this acceptable use of the alias concept, or is it abuse?

This seems like a reasonable use of aliases to me. Note that aliases are limited to elements at the same level of nesting and cannot perform arbitrary structural manipulations. But beyond that, they're meant to be a general-purpose mechanism for mapping data from one schema to another.

and is the alias implementation in avro efficient for such usage?

They should be efficient. Aliases are implemented by rewriting the old schema to have the new names prior to reading. The rewriting is performed once and cached, so performance should not be impacted.

Doug
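A reader schema along those lines might look like the following (record, field, and alias names are made up for illustration); each alias names a field as it appears in some writer's schema:

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "user_id",
     "type": "string",
     "aliases": ["userid", "uid", "user"]},
    {"name": "timestamp",
     "type": "long",
     "aliases": ["ts", "event_time"]}
  ]
}
```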
Re: re-use or copy a Field
ok, i will do that. thanks

On Fri, Feb 3, 2012 at 7:26 PM, Doug Cutting cutt...@apache.org wrote:

On 02/03/2012 01:57 PM, Koert Kuipers wrote: I could create a copy myself using the Field constructor, but that way i lose the aliases and props. In avro 1.5.4 there is no way to get to these either. In Avro 1.6 there's a Field#getProps() method, but there's no getAliases().

Please file a bug asking for this. Thanks, Doug
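For reference, later Avro versions filled this gap: `Field#aliases()` exists, and Avro 1.9+ has a `Field` copy constructor that carries doc, default, order, props, and aliases over. A sketch assuming those newer APIs:

```java
import org.apache.avro.Schema;

public class FieldCopy {
    public static void main(String[] args) {
        Schema record = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
            + "{\"name\":\"f\",\"type\":\"int\",\"aliases\":[\"g\"]}]}");
        Schema.Field original = record.getField("f");
        // Avro 1.9+ copy constructor; earlier versions must rebuild
        // doc/default/props/aliases by hand
        Schema.Field copy = new Schema.Field(original, original.schema());
        System.out.println(copy.aliases()); // [g]
    }
}
```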
abuse of aliases?
i have many avro files with similar data (same meaning, same type, etc.) but different names for the fields. can i create a reader schema that, for each field i am interested in, maps it to all the different possible fields in the files by using aliases, and then run map-reduce over the files using this schema? i am talking about tens of aliases per field, and this number will only grow as more data comes in. is this acceptable use of the alias concept, or is it abuse? and is the alias implementation in avro efficient for such usage? thanks! koert
using avro schemas to select columns (abusing versioning?)
we are working on a very sparse table with say 500 columns, where we do batch uploads that typically contain only a subset of the columns (say 100), and we run multiple map-reduce queries on subsets of the columns (typically fewer than 50 columns go into a single map-reduce job). my question is the following: if i use avro, do i ever actually need to use the full schema of the table? if i understand avro correctly, the batch uploads could simply add avro files with a schema reflecting the columns that are in the file (as opposed to first inserting many nulls into the data and then saving it with the full schema). the queries could also simply query with a schema reflecting the query (as opposed to querying with the full schema of 500 columns and then picking out the relevant columns). as long as i provide defaults of null in the query schemas, i think this would work! correct? is this considered abuse of avro's versioning capabilities?
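As a sketch (column names invented for illustration), a query schema for two of the 500 columns could look like this; under Avro's schema resolution rules, a field missing from a writer's file resolves to its default, and fields present in the file but absent here are skipped:

```json
{
  "type": "record",
  "name": "WideRow",
  "fields": [
    {"name": "col_a", "type": ["null", "string"], "default": null},
    {"name": "col_b", "type": ["null", "long"],   "default": null}
  ]
}
```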
override avro generic representation
Is there a way to override the avro generic representation, or perhaps an easy way to create my own? For example, for FIXED i would like Byte[] instead of ByteBuffer, for STRING i would prefer String over CharArray, for arrays i would like to have a List instead of a Collection, etc. Right now i do a translation from the generic representation to my internal representation and then back, which is fragile and inefficient. Thanks! Koert
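One way to get part of the way there, sketched against GenericDatumReader's protected read hooks (signatures as I recall them from the 1.5/1.6-era API, so treat this as an assumption): the `read*` methods can be overridden to change the in-memory representation, e.g. returning `java.lang.String` instead of Utf8. Similar hooks (`readBytes`, `readFixed`, `newArray`, ...) exist for the other types, though there is no single switch that swaps the whole representation at once.

```java
import java.io.IOException;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.Decoder;

// a reader that yields java.lang.String instead of Utf8; the other
// protected hooks can be overridden the same way for FIXED, arrays, etc.
public class StringDatumReader<D> extends GenericDatumReader<D> {
    @Override
    protected Object readString(Object old, Decoder in) throws IOException {
        return in.readString(null).toString();
    }
}
```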
reader in hadoop without reader's schema
I am reading from avro container files in hadoop. I know the container files have a (writer's) schema stored in them. My reader specifies its schema using the avro.input.schema job parameter. This way any schema changes are gracefully handled with both schemas present. However, i don't always need all this complexity. Is there a way to read without having to specify a reader's schema, where i basically say: just accept the writer's schema and read the data that way?
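With the file-level API this is exactly what happens when no reader's schema is given: the writer's schema from the container header is used for both sides. A minimal sketch (whether the avro.input.schema parameter can simply be omitted in a map-reduce job may depend on the Avro/Hadoop integration version):

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadWithWriterSchema {
    public static void main(String[] args) throws Exception {
        // no schema passed: the reader adopts the writer's schema
        // stored in the container file header
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(new File(args[0]), datumReader)) {
            System.out.println("writer's schema: " + fileReader.getSchema());
            for (GenericRecord rec : fileReader) {
                System.out.println(rec);
            }
        }
    }
}
```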
avro compare() operation in hadoop
If i use Avro in hadoop (and read my data from Avro container files), will i automatically get a very fast comparison for sorting in Hadoop (similar to what WritableComparator provides)? Are there benchmarks on sorting with Avro vs Writables? Best, Koert
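Avro can compare records in their serialized form: `BinaryData.compare()` walks the encoded bytes according to the schema without deserializing, and as far as I know the Hadoop integration's AvroKeyComparator is built on it. I don't know of published benchmarks against Writables. A small sketch (record and field names invented):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryData;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class RawCompareDemo {
    // serialize a datum with Avro's binary encoding
    static byte[] encode(Schema s, Object datum) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<Object>(s).write(datum, enc);
        enc.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Schema s = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"K\",\"fields\":["
            + "{\"name\":\"n\",\"type\":\"long\"}]}");
        GenericData.Record a = new GenericData.Record(s);
        a.put("n", 1L);
        GenericData.Record b = new GenericData.Record(s);
        b.put("n", 2L);
        byte[] ba = encode(s, a), bb = encode(s, b);
        // compares the serialized bytes directly, no deserialization
        int c = BinaryData.compare(ba, 0, bb, 0, s);
        System.out.println(c < 0); // true: a sorts before b
    }
}
```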