expect specific record but get generic

2013-10-21 Thread Koert Kuipers
i am observing that on a particular system (spark) my code breaks in that
avro does not return the specific record i expected but instead returns
generic records.


i suspect this is some class loading issue on the distributed system
(something about how the classpath is constructed for the spark slaves).

has anyone had class loading issues get in the way of avro specific records?


Re: expect specific record but get generic

2013-10-21 Thread Koert Kuipers
doug, could it be a classloader (instead of classpath) issue? looking at
spark it seems to run the tasks inside the slaves/workers with a custom
classloader.
thanks! koert


On Mon, Oct 21, 2013 at 1:07 PM, Doug Cutting cutt...@apache.org wrote:

 If the generated classes are not on the classpath then the generic
 representation is used.  So, yes, this sounds like a classpath
 problem.

 On Mon, Oct 21, 2013 at 8:41 AM, Koert Kuipers ko...@tresata.com wrote:
  i am observing that on a particular system (spark) my code breaks in that
  avro does not return the specific record i expected but instead returns
  generic records.
 
 
  i suspect this is some class loading issue on the distributed system
  (something about how the classpath is constructed for the spark slaves).
 
  anyone had class loading issues get in the way of avro specific?
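
a quick way to test the classloader theory is to probe, from inside a task, whether the generated class is visible to the classloader avro will use. this is a stdlib-only sketch (the record class name is made up); avro falls back to the generic representation when the specific class cannot be loaded:

```java
public class SpecificCheck {
    // returns true if the named class is visible to the given classloader
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // the thread context classloader is often the custom loader a
        // framework like spark installs for its tasks
        ClassLoader ctx = Thread.currentThread().getContextClassLoader();
        System.out.println(isLoadable("java.lang.String", ctx));         // expected: true
        System.out.println(isLoadable("com.example.MyAvroRecord", ctx)); // hypothetical generated class
    }
}
```

if the generated class loads via the system classloader but not via the context classloader, that points at the custom loader rather than the classpath itself.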



Re: AVRO M/R: ClassNotFoundException: ...Paranamer

2012-12-14 Thread Koert Kuipers
i had this too at some point. i just added paranamer to the distributed cache
(or the classpath on hadoop) and it went away

On Thu, Dec 13, 2012 at 2:21 PM, Terry Healy the...@bnl.gov wrote:



Re: version of avro

2012-10-20 Thread Koert Kuipers
option 3 is not available to us. i have been using option 1 without issues
so far (except in hive), but i have only worked with the MR1 old api and avro
generic so far (not sure which of these is relevant...).

On Sat, Oct 20, 2012 at 3:59 AM, Jacob Metcalf jacob_metc...@hotmail.comwrote:


 Yes I have CDH4 working happily with Avro 1.7.0 following the process
 described below.

 The various methods of distributing the Avro jar are discussed in
 http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job.
  I
 could not distribute the newer Avro jar via methods 1 and 2 because of
 classpath issues (discussed in AVRO-1103) so settled for the not so
 satisfactory option 3.

 The latest version of Avro has a patch for AVRO-1103 and a Maven profile
 to compile against the newer versions of Hadoop. I have not tried upgrading
 yet, but you should definitely try this first before going down the self
 patching route.

 Regards

 Jacob

 --
 From: kkrugler_li...@transpac.com
 Subject: Re: version of avro
 Date: Fri, 19 Oct 2012 13:16:24 -0700
 To: user@avro.apache.org



 On Oct 19, 2012, at 1:03pm, Koert Kuipers wrote:

 i noticed avro version 1.5.4 is included with some version/distros of
 hadoop and hive... is there a reason why 1.5.4 is included specifically and
 not newer ones? are there some incompatibilities to be aware of? i would
 like to use a newer version


 In the mail archives there was a discussion back in July about using Avro
 1.7 with CDH4, where Jacob Metcalf said:

 Avro 1.7 is compiled against Hadoop 1, i.e. CDH3.

 I have it running against CDH4 but I had to patch and recompile Avro, then
 replace the Avro 1.5.3 in the Hadoop lib directory with 1.7. I have attached
 my patch against JIRA AVRO-1103 for consideration.


 It sounds like if you're using CDH3/Hadoop 1.0 then you should be able to
 use Avro 1.7 as-is, but I haven't tried this myself.

 -- Ken

 
 http://about.me/kkrugler
 +1 530-210-6378







record schema names... a nuisance?

2012-10-20 Thread Koert Kuipers
we are on a fairly old avro (1.5.4) so not sure my observations apply to
newer versions. i noticed that when i read from avro files in hadoop it
does not expect the reader's schema (fully qualified) name to be equal to
the writer's schema (fully qualified) name. this allows me to read from
files without knowing what name the schema had when it was written.
according to doug cutting this is a bug and the read should not succeed if
the reader's and writer's schema do not have the same name. also when the
schema names are not the same then field aliases do not work.

ok with that out of the way this is my situation: we create lots of avro
files that we add to large partitioned tables (a structure with subdirs on
hdfs). the people that write the files understand the importance of
canonical columns names (field names), but not everyone gets the idea of
schema names, so generally i have avro files with many different (writer's)
schema names in there. i do not expect i can correct this. also it is not
unusual to run a hadoop map-red job reading from many different data
sources at once, using avro's fantastic projection ability to extract just
a few columns. however in that case again the (writer's) schema names are
not expected to be the same across avro files i am reading from.

so today all of this works, meaning i can run map-reduce jobs across all
these files with different/inconsistent schema names, but only thanks to a
bug, which makes me nervous that one day it will not work. also field aliases
do not work, which is a real limitation. so i am trying to see if i can come
up with a better solution. of course i could go find out every time what
all the schema names are in the avro files, and add all the aliases to my
reader's schema. but that is a real pain, in particular since the set is not
constant. i guess i could automate this by scanning all the avro files
first and extracting their schemas. however that sounds very inelegant, so i
would rather not do it.

so i have 2 questions:

1) can i reasonably assume that processing in hadoop will continue to work
even if the reader and writer's schema names are not the same (so rely on
this bug)? the fact that field aliases do not work in this case is too bad
but at least i got something working...

2) is there a better solution? for example, could i say in my reader's
schema that the schema has an alias of * (wildcard), so that i can read from
all these files with different (writer's) schema names and it works without
relying on a bug, and on top of that field aliases will also work? that
would be fantastic...
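
short of a wildcard, one way to avoid scanning files at job time is to collect the known writer names once and bake them in as aliases. a reader schema along these lines (record and field names invented for illustration) maps several historical names onto one canonical record and field; note that, per the discussion above, field aliases only take effect once the record names resolve, which is what the record-level aliases list is for:

```json
{
  "type": "record",
  "name": "canonical.Events",
  "aliases": ["legacy.EventLog", "etl.RawEvents"],
  "fields": [
    {"name": "user_id", "type": ["null", "long"], "default": null,
     "aliases": ["userId", "uid"]}
  ]
}
```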


version of avro

2012-10-19 Thread Koert Kuipers
i noticed avro version 1.5.4 is included with some version/distros of
hadoop and hive... is there a reason why 1.5.4 is included specifically and
not newer ones? are there some incompatibilities to be aware of? i would
like to use a newer version

thanks! koert


using strings instead of utf8

2012-10-19 Thread Koert Kuipers
how do i tell (generic) avro to use strings for values instead of its own
utf8 class?

i saw a way of doing it by modifying the schemas (adding a property). i
also saw mention of a way to do it if you use maven (which i don't).

is there a generic way to do this? like a system property perhaps? or a
static method that i call to change the default?
thanks!
koert
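
for reference, the schema-property route mentioned above is the avro.java.string property on string types. a minimal sketch (record and field names invented):

```json
{
  "type": "record",
  "name": "Example",
  "fields": [
    {"name": "name",
     "type": {"type": "string", "avro.java.string": "String"}}
  ]
}
```

if i remember right, newer avro versions (1.6+) also let you set this programmatically via GenericData.setStringType(schema, GenericData.StringType.String), which avoids editing every schema by hand.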


Re: abuse of aliases?

2012-02-03 Thread Koert Kuipers
thanks doug

On Fri, Feb 3, 2012 at 3:58 PM, Doug Cutting cutt...@apache.org wrote:

 On 02/02/2012 08:03 PM, Koert Kuipers wrote:
  i have many avro files with similar data (same meaning, same type, etc.)
  but different names for the fields.
  can i create a reader schema that for each field that i am interested in
  maps it to all the different possible fields in the files by using
  aliases, and then run map-reduce over the files using this schema?
  i am talking about tens of aliases per field, and this number will only
  grow as more data comes in.
  is this acceptable use of the alias concept, or is it abuse?

 This seems like a reasonable use of aliases to me.  Note that aliases
 are limited to elements at the same level of nesting and cannot perform
 arbitrary structural manipulations.  But beyond that, they're meant to
 be a general-purpose mechanism for mapping data from one schema to another.

  and is the
  alias implementation in avro efficient for such usage?

 They should be efficient.  Aliases are implemented by rewriting the old
 schema to have the new names prior to reading.  The rewriting is
 performed once and cached so performance should not be impacted.

 Doug



Re: re-use or copy a Field

2012-02-03 Thread Koert Kuipers
ok i will do thanks

On Fri, Feb 3, 2012 at 7:26 PM, Doug Cutting cutt...@apache.org wrote:

 On 02/03/2012 01:57 PM, Koert Kuipers wrote:
  I could create a copy myself using the Field constructor, however that
  way i lose the aliases and props. In avro 1.5.4 there is no way to get
  to these either.

 In Avro 1.6 there's Field#getProps() method but there's no getAliases().

 Please file a bug, asking for this.

 Thanks,

 Doug



abuse of aliases?

2012-02-02 Thread Koert Kuipers
i have many avro files with similar data (same meaning, same type, etc.)
but different names for the fields.
can i create a reader schema that for each field that i am interested in
maps it to all the different possible fields in the files by using aliases,
and then run map-reduce over the files using this schema?
i am talking about tens of aliases per field, and this number will only
grow as more data comes in.
is this acceptable use of the alias concept, or is it abuse? and is the
alias implementation in avro efficient for such usage?
thanks! koert


using avro schemas to select columns (abusing versioning?)

2012-01-23 Thread Koert Kuipers
we are working on a very sparse table with say 500 columns where we do
batch uploads that typically only contain a subset of the columns (say
100), and we run multiple map-reduce queries on subsets of the columns
(typically less than 50 columns go into a single map-reduce job).

my question is the following: if i use avro, do i ever actually need to
use the full schema of the table?

if i understand avro correctly, then the batch uploads could simply add
avro files with the schema reflective of the columns that are in the file
(as opposed to first inserting many nulls into the data and then saving it
with the full schema).

the queries could also simply query with the schema that is reflective of
the query (as opposed to querying with the full schema with 500 columns and
then picking out the relevant columns).

as long as i provide defaults of null in the query schemas, i think this
would work! correct? is this considered abuse of avro's versioning
capabilities?
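
a sketch of what such a query-side reader schema might look like (names invented): just the columns the job needs, each nullable with a null default, so records written without a given column resolve to null and any extra writer columns are skipped:

```json
{
  "type": "record",
  "name": "WideTable",
  "fields": [
    {"name": "col_a", "type": ["null", "string"], "default": null},
    {"name": "col_b", "type": ["null", "long"], "default": null}
  ]
}
```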


override avro generic representation

2011-12-01 Thread Koert Kuipers
Is there a way to override the avro generic representation, or perhaps an
easy way to create my own?
For example, for FIXED i would like Byte[] instead of ByteBuffer, for
STRING i would prefer String over CharArray, for arrays i would like to
have a List instead of a Collection, etc.

Right now i do a translation from the generic representation to my internal
representation and then back, which is fragile and inefficient.

Thanks! Koert


reader in hadoop without reader's schema

2011-12-01 Thread Koert Kuipers
I am reading from avro container files in hadoop. I know the container
files have a (writer's) schema stored in them. My reader specifies its
schema using the avro.input.schema job parameter. This way any schema changes
are gracefully handled with both schemas present.

However, i don't always need all this complexity. Is there a way to read
without having to specify a reader's schema, where i basically say: just
accept the writer's schema and read the data that way?


avro compare() operation in hadoop

2011-11-03 Thread Koert Kuipers
If i use Avro in hadoop (and read my data from Avro container files), will
i automatically get a very fast comparison for sorting in Hadoop (similar
to what WritableComparator provides)? Are there benchmarks on sorting with
Avro vs Writables?
Best, Koert
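
for context, the WritableComparator trick the question refers to is comparing serialized bytes without deserializing. a stdlib-only sketch of that style of unsigned lexicographic compare is below; avro's own binary comparator (BinaryData.compare, used by its hadoop key comparator) is schema-aware rather than plain lexicographic, so this only illustrates the flavor of the idea:

```java
public class RawCompare {
    // unsigned lexicographic comparison of two byte ranges, the kind of
    // raw compare WritableComparator performs on serialized keys
    static int compareBytes(byte[] a, int aOff, int aLen,
                            byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xff; // treat bytes as unsigned
            int y = b[bOff + i] & 0xff;
            if (x != y) return x - y;
        }
        return aLen - bLen; // shorter range sorts first on a tie
    }

    public static void main(String[] args) {
        byte[] p = {1, 2, 3};
        byte[] q = {1, 2, 4};
        System.out.println(compareBytes(p, 0, 3, q, 0, 3) < 0); // expected: true
    }
}
```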