[Announce] Please welcome Ryan Skraba to the Apache Avro PMC

2020-09-14 Thread Sean Busbey
Hi folks!

On behalf of the Apache Avro PMC I am pleased to announce that Ryan
Skraba has accepted our invitation to become a PMC member. We
appreciate Ryan stepping up to take more responsibility in the
project.

Please join me in welcoming Ryan to the Avro PMC!

As a reminder, if anyone would like to nominate another person as a
committer or PMC member, even if you are not currently a committer or
PMC member, you can always drop a note to priv...@avro.apache.org to
let us know.

-- 
busbey


Re: 1.10.0 Release?

2020-04-23 Thread Sean Busbey
Please join the dev@avro mailing list if you would like to try out things
prior to the 1.10 release. As an ASF project we must insist that downstream
users not use unreleased code, i.e. development SNAPSHOTs.

If folks would rather have a chance to test things out as downstream prior
to a release then we could discuss having alpha/beta labeled release
versions prior to the GA release.

On Wed, Apr 22, 2020 at 2:11 PM Driesprong, Fokko 
wrote:

> Thanks Corey,
>
> Please give the 1.10-SNAPSHOT a try, and if you see any issues, let me
> know. This will make sure that we capture bugs beforehand.
>
> Cheers, Fokko
>
> On Tue, 14 Apr 2020 at 23:59, Corey Fritz wrote:
>
>> Thanks! Will keep an eye out...
>>
>> *Corey Fritz | Architect*
>> corey.fr...@snagajob.com
>> *office* | 866.227.0466
>>
>>
>> On Tue, Apr 14, 2020 at 5:32 PM Ismaël Mejía  wrote:
>>
>>> So far we have discussed cutting the branch for 1.10.0 and
>>> starting the release next month (May 2020).
>>> I will send a reminder soon to Avro's dev@ mailing list so we start
>>> triaging and preparing the release.
>>>
>>>
>>> https://lists.apache.org/thread.html/rb9693e90a8141b2c9f0f9c901c488a079fa6245b2e4d475e022ab1e8%40%3Cdev.avro.apache.org%3E
>>>
>>>
>>>
>>> On Tue, Apr 14, 2020 at 10:44 PM Corey Fritz 
>>> wrote:
>>>
 Any estimate available on when 1.10.0 will be released? We have a strong
 desire to use the C#  POCO serializers added in this ticket:

 https://issues.apache.org/jira/browse/AVRO-2389

 *Corey Fritz | Architect*
 corey.fr...@snagajob.com
 *office* | 866.227.0466



 The largest platform for hourly work.

>>>
>>
>>
>> The largest platform for hourly work.
>>
>


Re: More idiomatic JSON encoding for unions

2020-01-08 Thread Sean Busbey
I agree with Zoltan here. We have a really long history of maintaining
compatibility for encoders.

On Tue, Jan 7, 2020 at 10:06 AM Zoltan Farkas  wrote:

> Fokko,
>
> I am not sure we should be changing the existing json encoder;
> I think we should just add another encoder, and devs can use either one of
> them based on their use case… and stay backward compatible.
>
> we should maybe standardize the content types for them… I have seen
> application/avro being used for binary, we could have for json:
> application/avro+json for the current format, application/avro.2+json for
> the new format….
>
> At some point in the future we could deprecate the old one…
>
> —Z
>
>
> On Jan 7, 2020, at 2:41 AM, Driesprong, Fokko 
> wrote:
>
> I would be a great fan of this as well. This also bothered me. The tricky
> part here is to see when to release this because it will break the existing
> JSON structure. We could make this configurable as well.
>
> Cheers, Fokko
>
> On Mon, 6 Jan 2020 at 22:36, roger peppe wrote:
>
>> That's great, thanks! I thought this would probably have come up before.
>>
>> Have you written down your changes in a somewhat more formal
>> specification document, by any chance?
>>
>>   cheers,
>> rog.
>>
>>
>> On Mon, 6 Jan 2020, 18:50 zoly farkas,  wrote:
>>
>>> I think there is consensus that this should be implemented; see AVRO-1582,
>>> "Json serialization of nullable fields and fields with default values
>>> improvement" (https://issues.apache.org/jira/browse/AVRO-1582).
>>>
>>> Here is a live example to get some sample data in avro json:
>>> https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson
>>> and the "Natural"
>>> https://demo.spf4j.org/example/records/1?_Accept=application/json using
>>> the encoder suggested as implementation in the jira.
>>>
>>> Somebody needs to find the time to do the work to integrate this...
>>>
>>> --Z
>>>
>>>
>>>
>>>
>>> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe <
>>> rogpe...@gmail.com> wrote:
>>>
>>>
>>> Hi,
>>>
>>> The JSON encoding in the specification includes
>>> an explicit type name for all kinds of object other than null. This means
>>> that a JSON-encoded Avro value with a union is very rarely directly
>>> compatible with normal JSON formats.
>>>
>>> For example, it's very common for a JSON-encoded value to allow a value
>>> that's either null or string. In Avro, that's trivially expressed as the
>>> union type ["null", "string"]. With conventional JSON, a string value
>>> "foo" would be encoded just as "foo", which is easily distinguished
>>> from null when decoding. However when using the Avro JSON format it
>>> must be encoded as {"string": "foo"}.
>>>
>>> This means that Avro JSON-encoded values don't interchange easily with
>>> other JSON-encoded values.
>>>
>>> AFAICS the main reason that the type name is always required in
>>> JSON-encoded unions is to avoid ambiguity. This particularly applies to
>>> record and map types, where it's not possible in general to tell which
>>> member of the union has been specified by looking at the data itself.
>>>
>>> However, that reasoning doesn't apply if all the members of the union
>>> can be distinguished from their JSON token type.
>>>
>>> I am considering using a JSON encoding that omits the type name when all
>>> the members of the union encode to distinct JSON token types (the JSON
>>> token types being: null, boolean, string, number, object and array).
>>>
>>> For example, JSON-encoded values using the Avro schema ["null",
>>> "string", "int"] would encode as the literal values themselves (e.g.
>>> null, "foo", 999), but JSON-encoded values using the Avro schema ["int",
>>> "double"] would require the type name because the JSON lexeme doesn't
>>> distinguish between different kinds of number.
>>>
>>> This would mean that it would be possible to represent a significant
>>> subset of "normal" JSON schemas with Avro. It seems to me that would
>>> potentially be very useful.
>>>
>>> Thoughts? Is this a really bad idea to be contemplating? :)
>>>
>>>   cheers,
>>> rog.
>>>
>>>
>>>
>
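
For anyone skimming the archive, here is a minimal sketch (Java; the class
name is made up, and it assumes a current avro artifact on the classpath)
of the wrapped union encoding rog describes:

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.io.Encoder;
    import org.apache.avro.io.EncoderFactory;

    public class UnionJsonSketch {
      public static void main(String[] args) throws Exception {
        // the ["null", "string"] union from the example above
        Schema union = new Schema.Parser().parse("[\"null\", \"string\"]");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Encoder enc = EncoderFactory.get().jsonEncoder(union, out);
        // write a plain string datum against the union schema
        new GenericDatumWriter<Object>(union).write("foo", enc);
        enc.flush();
        System.out.println(out); // prints {"string":"foo"}, not bare "foo"
      }
    }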


[Announce] Please welcome Nándor Kollár to the Apache Avro PMC

2019-08-30 Thread Sean Busbey
Hi folks!

On behalf of the Apache Avro PMC I am pleased to announce that Nándor
Kollár has accepted our invitation to become a PMC member. We
appreciate Nándor stepping up to take more responsibility in the
project.

Please join me in welcoming Nándor to the Avro PMC!

As a reminder, if anyone would like to nominate another person as a
committer or PMC member, even if you are not currently a committer or
PMC member, you can always drop a note to priv...@avro.apache.org to
let us know.


[ANNOUNCE] Please welcome Ismaël Mejía to the Apache Avro PMC

2019-06-10 Thread Sean Busbey
Hi folks!

On behalf of the Apache Avro PMC I am pleased to announce that Ismaël
Mejía has accepted our invitation to become a PMC member. We
appreciate Ismaël stepping up to take more responsibility in the
project.

Please join me in welcoming Ismaël to the Avro PMC!

As a reminder, if anyone would like to nominate another person as a
committer or PMC member, even if you are not currently a committer or
PMC member, you can always drop a note to priv...@avro.apache.org to
let us know.

-busbey


[ANNOUNCE] Please welcome Fokko Driesprong to the Apache Avro PMC

2019-05-14 Thread Sean Busbey
Hi folks!

On behalf of the Apache Avro PMC I am pleased to announce that Fokko
Driesprong has accepted our invitation to become a PMC member on the
Avro project. We appreciate Fokko stepping up to take more
responsibility in the project.

Please join me in welcoming Fokko to the Avro PMC!



As a reminder, if anyone would like to nominate another person as a
committer or PMC member, even if you are not currently a committer or
PMC member, you can always drop a note to priv...@avro.apache.org to
let us know.

-busbey


Re: new release with fix for AVRO-1723?

2018-12-04 Thread Sean Busbey
Hi David!

The fastest way to get a release with AVRO-1723 in it is to subscribe
to the dev@avro list and help push forward on getting a release out.
We can discuss options for it on dev@avro. For example, if there
are too many blockers waiting around for 1.8.3 it might make sense to
get a 1.8.2.1 out.
On Fri, Nov 30, 2018 at 5:28 PM David Carlton  wrote:
>
> I'm running into https://issues.apache.org/jira/browse/AVRO-1723 (forward 
> declarations in Avro IDL), and I'm wondering what the timing is for a release 
> that contains that fix?  I see 
> https://issues.apache.org/jira/browse/AVRO-2163 for releasing 1.8.3, but it's 
> not clear from that Jira what the timeline is for 1.8.3 and whether it will 
> contain a fix for AVRO-1723.  So I'm trying to figure out if I should 
> generate my own local build of Avro containing the patch for AVRO-1723, or if 
> I should just write that protocol in JSON instead of IDL and then switch it 
> over to IDL once 1.8.3 is released.
>
> Thanks for any advice you have,
> David Carlton
> carl...@sumologic.com



-- 
busbey


Re: Avro release

2018-03-22 Thread Sean Busbey
Please, please, please go ahead with cross-posting to the dev list. Suraj
isn't the only maintainer and not all maintainers monitor the user@ list.

On Thu, Mar 22, 2018 at 9:46 AM, Edward Anderson <eander...@doximity.com>
wrote:

> Thanks Suraj. Since you saw this, I won't cross-post to the dev list.
>
> On Thu, Mar 22, 2018 at 10:45 AM, Suraj Acharya <su...@apache.org> wrote:
>
>> I'll file a jira and try to get to it.
>> I know someone is trying to release 1.7.x simultaneously.
>>
>> Suraj
>>
>> On Wed, Mar 21, 2018, 1:16 PM Edward Anderson <eander...@doximity.com>
>> wrote:
>>
>>> Can do. Thanks for the pointer.
>>>
>>> Edward
>>>
>>> On Wed, Mar 21, 2018 at 4:14 PM, Sean Busbey <bus...@cloudera.com>
>>> wrote:
>>>
>>>> It'd be great to get a new set of releases out. I don't recall what
>>>> bogged us down last time we made a go of it.
>>>>
>>>> Would you mind bringing the issue up over on dev@avro? We can figure
>>>> out what's missing, volunteers to get those things done, and if needed the
>>>> PMC can vote on giving people more powers to do so.
>>>>
>>>> On Mon, Mar 19, 2018 at 11:37 AM, Edward Anderson <
>>>> eander...@doximity.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> There have been a lot of great improvements
>>>>> <https://github.com/apache/avro/compare/release-1.8.2...master> since
>>>>> May 2017 when Avro 1.8.2 was released—241 commits from 55 different
>>>>> contributors over nearly a year. We would love to see these included in a
>>>>> new versioned release. Right now, people affected by this year's worth of
>>>>> fixes need to work around the issues or do a local build of master, both 
>>>>> of
>>>>> which are inconvenient. The number of users affected will continue to grow
>>>>> until the next release.
>>>>>
>>>>> Do you agree that it's time for a new release? If not now, when do you
>>>>> think will be best? We wanted to check in before starting our own
>>>>> workarounds.
>>>>>
>>>>> Thanks for all your work on this great project!
>>>>>
>>>>> Best,
>>>>>
>>>>> Edward Anderson
>>>>> Software Engineer
>>>>> Doximity, Inc.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> busbey
>>>>
>>>
>>>
>


-- 
busbey


Re: Avro release

2018-03-21 Thread Sean Busbey
It'd be great to get a new set of releases out. I don't recall what bogged
us down last time we made a go of it.

Would you mind bringing the issue up over on dev@avro? We can figure out
what's missing, volunteers to get those things done, and if needed the PMC
can vote on giving people more powers to do so.

On Mon, Mar 19, 2018 at 11:37 AM, Edward Anderson 
wrote:

> Hi,
>
> There have been a lot of great improvements
>  since May
> 2017 when Avro 1.8.2 was released—241 commits from 55 different
> contributors over nearly a year. We would love to see these included in a
> new versioned release. Right now, people affected by this year's worth of
> fixes need to work around the issues or do a local build of master, both of
> which are inconvenient. The number of users affected will continue to grow
> until the next release.
>
> Do you agree that it's time for a new release? If not now, when do you
> think will be best? We wanted to check in before starting our own
> workarounds.
>
> Thanks for all your work on this great project!
>
> Best,
>
> Edward Anderson
> Software Engineer
> Doximity, Inc.
>



-- 
busbey


Re: Is it possible to use $ characters in field names?

2017-10-25 Thread Sean Busbey
Shoot, my copying in the NiFi user list failed. Mike, if using the
PutMongoRecord processor might work, the folks on that list are more likely
to be able to help with edge cases.

If you need the intermediate JSON for some reason, I think there's a JSON
transforming processor that you could maybe use to rewrite the JSON records
with the right field name?

On Wed, Oct 25, 2017 at 11:05 AM, Sean Busbey <bus...@cloudera.com> wrote:

> +us...@nifi.apache.org[1]
>
> Could you keep the data in Avro and then use NiFi's PutMongoRecord
> processor[2] with an AvroReader to insert?
>
>
> [1]: https://lists.apache.org/list.html?us...@nifi.apache.org
> [2]: https://s.apache.org/MmPG
>
> On Wed, Oct 25, 2017 at 7:51 AM, Mike Thomsen <mikerthom...@gmail.com>
> wrote:
>
>> No, it doesn't look like it's going to work. It accepts $date into the
>> record using the alias, but it doesn't generate $date as the field name
>> when writing the object back to JSON.
>>
>> On Wed, Oct 25, 2017 at 8:19 AM, Nandor Kollar <nkol...@cloudera.com>
>> wrote:
>>
>>> Oh yes, you're right, you're running into the limitation on field names
>>> <https://avro.apache.org/docs/1.8.0/spec.html#names>. Apart from
>>> solving this via a map, you might consider using Avro aliases
>>> <https://avro.apache.org/docs/1.8.2/spec.html#Aliases>, since it looks
>>> like aliases don't have this limitation. Can you use them?
>>>
>>> Nandor
>>>
>>> On Wed, Oct 25, 2017 at 1:40 PM, Mike Thomsen <mikerthom...@gmail.com>
>>> wrote:
>>>
>>>> Hi Nandor,
>>>>
>>>> It's not the numeric portion that is the problem for me, but the $date
>>>> field name. Mongo apparently requires the structure I provided in the
>>>> example, and whenever I use $date as the field name the Java Avro API
>>>> throws an exception about an invalid character in the field definition.
>>>>
>>>> The logical type thing is good to know for future reference.
>>>>
>>>> I admit that this is likely a really uncommon edge case for Avro. The
>>>> work around I found for defining a schema that is at least compatible with
>>>> the Mongo Extended JSON requirements was to do this (one field example):
>>>>
>>>> {
>>>>     "namespace": "test",
>>>>     "name": "PutTestRecord",
>>>>     "type": "record",
>>>>     "fields": [{
>>>>         "name": "timestampField",
>>>>         "type": {
>>>>             "type": "map",
>>>>             "values": "long"
>>>>         }
>>>>     }]
>>>> }
>>>>
>>>> It doesn't give you the full validation that would be ideal if we could
>>>> define a field with the name "$date," but it's an 80% solution that works
>>>> with NiFi and other tools that have to generate Extended JSON for Mongo.
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>> On Wed, Oct 25, 2017 at 4:48 AM, Nandor Kollar <nkol...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Hi Mike,
>>>>>
>>>>> This JSON doesn't seem like a valid Avro schema
>>>>> <https://avro.apache.org/docs/1.8.1/spec.html#schemas>. If you'd like
>>>>> to use timestamps in your schema, you should use Timestamp logical
>>>>> types,
>>>>> <https://avro.apache.org/docs/1.8.1/spec.html#Timestamp+%28millisecond+precision%29>
>>>>> which annotate Avro longs. In this case the schema of this field should
>>>>> look like this:
>>>>>
>>>>> {
>>>>>"name":"timestamp",
>>>>>"type":"long",
>>>>>"logicalType":"timestamp-millis"
>>>>> }
>>>>>
>>>>> If you'd like to create Avro files with this schema, there's a brief
>>>>> tutorial on the Avro wiki
>>>>> <https://avro.apache.org/docs/1.8.1/gettingstartedjava.html#Compiling+the+schema>
>>>>> on how to create and write Avro files with this schema in Java.
>>>>>
>>>>> Regards,
>>>>> Nandor
>>>>>
>>>>> On Tue, Oct 24, 2017 at 8:18 PM, Mike Thomsen <mikerthom...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I am trying to build an avro schema for a NiFi flow that is going to
>>>>>> insert data into Mongo, and Mongo extended JSON requires the use of $
>>>>>> characters in cases like this (to represent a date):
>>>>>>
>>>>>> {
>>>>>>     "timestamp": {
>>>>>>         "$date": TIMESTAMP_LONG_HERE
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> I tried building a schema with that, and it failed saying there was
>>>>>> an invalid character in the schema. I just wanted to check and see if
>>>>>> there was a workaround for this or if I'll have to choose another
>>>>>> option.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> busbey
>



-- 
busbey


Re: Is it possible to use $ characters in field names?

2017-10-25 Thread Sean Busbey
+us...@nifi.apache.org[1]

Could you keep the data in Avro and then use NiFi's PutMongoRecord
processor[2] with an AvroReader to insert?


[1]: https://lists.apache.org/list.html?us...@nifi.apache.org
[2]: https://s.apache.org/MmPG

On Wed, Oct 25, 2017 at 7:51 AM, Mike Thomsen 
wrote:

> No, it doesn't look like it's going to work. It accepts $date into the
> record using the alias, but it doesn't generate $date as the field name
> when writing the object back to JSON.
>
> On Wed, Oct 25, 2017 at 8:19 AM, Nandor Kollar 
> wrote:
>
>> Oh yes, you're right, you're running into the limitation on field names.
>> Apart from solving this via a map, you might consider using Avro aliases,
>> since it looks like aliases don't have this limitation. Can you use them?
>>
>> Nandor
>>
>> On Wed, Oct 25, 2017 at 1:40 PM, Mike Thomsen 
>> wrote:
>>
>>> Hi Nandor,
>>>
>>> It's not the numeric portion that is the problem for me, but the $date
>>> field name. Mongo apparently requires the structure I provided in the
>>> example, and whenever I use $date as the field name the Java Avro API
>>> throws an exception about an invalid character in the field definition.
>>>
>>> The logical type thing is good to know for future reference.
>>>
>>> I admit that this is likely a really uncommon edge case for Avro. The
>>> work around I found for defining a schema that is at least compatible with
>>> the Mongo Extended JSON requirements was to do this (one field example):
>>>
>>> {
>>>     "namespace": "test",
>>>     "name": "PutTestRecord",
>>>     "type": "record",
>>>     "fields": [{
>>>         "name": "timestampField",
>>>         "type": {
>>>             "type": "map",
>>>             "values": "long"
>>>         }
>>>     }]
>>> }
>>>
>>> It doesn't give you the full validation that would be ideal if we could
>>> define a field with the name "$date," but it's an 80% solution that works
>>> with NiFi and other tools that have to generate Extended JSON for Mongo.
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>> On Wed, Oct 25, 2017 at 4:48 AM, Nandor Kollar 
>>> wrote:
>>>
 Hi Mike,

 This JSON doesn't seem like a valid Avro schema. If you'd like to use
 timestamps in your schema, you should use Timestamp logical types,
 which annotate Avro longs. In this case the schema of this field should
 look like this:

 {
"name":"timestamp",
"type":"long",
"logicalType":"timestamp-millis"
 }

 If you'd like to create Avro files with this schema, there's a brief
 tutorial on the Avro wiki on how to create and write Avro files with this
 schema in Java.

 Regards,
 Nandor

 On Tue, Oct 24, 2017 at 8:18 PM, Mike Thomsen 
 wrote:

> I am trying to build an avro schema for a NiFi flow that is going to
> insert data into Mongo, and Mongo extended JSON requires the use of $
> characters in cases like this (to represent a date):
>
> {
>     "timestamp": {
>         "$date": TIMESTAMP_LONG_HERE
>     }
> }
>
> I tried building a schema with that, and it failed saying there was an
> invalid character in the schema. I just wanted to check and see if there
> was a workaround for this or if I'll have to choose another option.
>
> Thanks,
>
> Mike
>


>>>
>>
>


-- 
busbey


Re: When is v1.8.2 going to be released?

2017-04-20 Thread Sean Busbey
Hi Folks!

The appropriate place to discuss what gets merged and timing of
not-yet-done releases is the dev@avro list. There's even been a brief
discussion already about the tradeoff between trying to squeeze more
things in to the next 1.8.z release vs working on getting a regular
release cadence.

I don't mean to indicate that these discussions aren't important; I'd
love to see them continue on the dev list.

On Thu, Apr 20, 2017 at 10:20 AM, Gill, John  wrote:
> Suraj,
>
> Is there any way that we could look at picking up PR 217 which fixes
> AVRO-766 & AVRO-1167? These are long outstanding memory leak issues in the C
> implementation.
>
>
>
> -- John
>
>
>
>
>
> From: Suraj Acharya 
> Reply-To: "user@avro.apache.org" 
> Date: Thursday, April 20, 2017 at 7:44 AM
> To: "user@avro.apache.org" 
> Subject: Re: When is v1.8.2 going to be released?
>
>
>
> Hi Javier,
>
>
>
> I made the last two tags so I can give you some context.
>
> The RC2 and RC3 failed for license issues in Java and versioning in JS.
>
> I have fixed both of those. However, in the days after I started, a couple of
> people wanted a few code pieces merged. I had stopped that since I was
> working on a release and that would cause continuous churn. I am giving
> people time to merge any code in before I start back.
>
> I intend to make another release in the next few days. I can say I will try
> to make sure it is not paused.
>
>
>
> Regarding when will this be released. I can't give you a definite date. I
> try to make an hour or so a day for open source work and sometimes I am not
> able to do the same. Also, it needs community help to pass an RC too. All I
> can say is soon.
>
>
>
> Hope that helps.
>
>
>
> S
>
>
>
> On Thu, Apr 20, 2017 at 4:50 AM, Javier Holguera 
> wrote:
>
> Hi,
>
>
>
> My company is really looking forward to using some of the bug fixes that
> will come with this version. We are even considering compiling and uploading
> Avro 1.8.2 into our own Artifactory.
>
>
>
> Obviously it would be much better if we could just pull from public Maven
> repos, as usual.
>
>
>
> When are you planning to release this version? The Github tags suggest there
> was an attempt months ago and a few more two weeks ago.
>
>
>
> Is this really imminent or are there chances that it will be paused like it
> happened in November?
>
>
>
> Thanks.
>
>
>
> Javier.
>
>



-- 
busbey


Re: Map with another map inside (unpredictable naming)

2017-03-27 Thread Sean Busbey
Your schema says that metadata is a map that has values of a few
different types, but it does not list a map type as one of them.
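
One way to allow a single level of nesting (a sketch, not part of the reply
above) is to add a map type to the union of metadata value types:

    "name": "metadata",
    "type": {
        "type": "map",
        "values": [
            "int", "float", "string", "boolean", "long", "null",
            {"type": "map",
             "values": ["int", "float", "string", "boolean", "long", "null"]}
        ]
    }

Unions can't refer to themselves, so each extra level of nesting has to be
spelled out explicitly; for a truly unpredictable structure, a string field
holding JSON is the usual fallback.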

On Mon, Mar 27, 2017 at 6:11 AM, Dag Stockstad  wrote:
> Hi Avro aficionados,
>
> I'm having trouble serializing a record with a nested map structure i.e. a
> map within a map. The record I'm trying to send has the following structure:
> {
>     "event_type": "some_type",
>     "data": {
>         "id": "2f720f90-ea06-4248-a72e-01eea44981ed",
>         "metadata": {
>             "some_attr": "some_value",
>             "some_map_with_unpredictable_name": {
>                 "some_attr": "some_value"
>             }
>         }
>     }
> }
>
> And the schema is this:
> {
>     "namespace": "org.example.event.avro",
>     "type": "record",
>     "name": "EventNotification",
>     "fields": [{
>         "name": "event_type",
>         "type": "string"
>     }, {
>         "name": "data",
>         "type": {
>             "type": "record",
>             "name": "EventData",
>             "fields": [{
>                 "name": "id",
>                 "type": "string"
>             }, {
>                 "name": "metadata",
>                 "type": {
>                     "type": "map",
>                     "values": [
>                         "int",
>                         "float",
>                         "string",
>                         "boolean",
>                         "long",
>                         "null"
>                     ]
>                 }
>             }]
>         }
>     }]
> }
>
> The nested map (some_map_with_unpredictable_name) is causing problems
> (serialization error). Is there any way I can have another map as a value in
> the metadata map?
>
> Due to the nature of the system, I cannot 100% predict the structure of the
> metadata field. Can Avro accomodate these requirements or do I have to fall
> back on something such as JSON for this one?
>
> Help very appreciated (I'm a bit stuck).
>
> Kind regards,
> Dag
>



-- 
busbey


Re: Is this a valid Avro schema?

2016-09-02 Thread Sean Busbey
The schemas are fine, but the JSON snippet isn't a valid instance of
the second schema.

In the default JSON encoding for Avro, you have to include the name of
the record as an object field[1].

For example, given test_schema_0.avsc with your first schema and
test_schema_1.avsc as your second, here are random example instances:

$ java -jar avro-tools-1.9.0-SNAPSHOT.jar random --count 1
--schema-file test_schema_0.avsc schema_0_random.avro
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
for more info.
test.seed=1472871710806
$ java -jar avro-tools-1.9.0-SNAPSHOT.jar tojson --pretty schema_0_random.avro
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
for more info.
{
  "name" : "msbvsjefb",
  "id" : 5742171927645279316
}
$ java -jar avro-tools-1.9.0-SNAPSHOT.jar random --count 1
--schema-file test_schema_1.avsc schema_1_random.avro
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
for more info.
test.seed=1472871721099
$ java -jar avro-tools-1.9.0-SNAPSHOT.jar tojson --pretty schema_1_random.avro
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
for more info.
{
  "com.user.user_record" : {
"name" : "ljfijs",
"id" : -7695450471550075616
  }
}



[1]: http://avro.apache.org/docs/current/spec.html#json_encoding
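
So for the array (union) schema quoted below, the original datum would have
to name its branch, along these lines:

    {"com.user.user_record": {"name": "Foo", "id": 42}}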

On Fri, Sep 2, 2016 at 5:01 PM, Kamesh Kompella  wrote:
> Hi there,
>  First, please look at the following schema
>
> {"name": "user_record",
>   "namespace": "com.user",
>   "type": "record",
>   "fields" : [
> {"name": "name", "type": "string"},
> {"name": "id", "type": "long"}
>   ]}
>
> and the following JSON:
>
> {"name": “Foo", “id": 42}
>
>
> When I run avro-tools with the option fromjson, I get a .avro file. Stuff
> works.
>
> If I enclose the schema above in an array as shown below, avro-tools
> (version 1.8.1) throws the following exception and dies.
>
>
> [{"name": "user_record",
>   "namespace": "com.user",
>   "type": "record",
>   "fields" : [
> {"name": "name", "type": "string"},
> {"name": "id", "type": "long"}
>   ]}]
>
> I get the following exception:
>
> Exception in thread "main" org.apache.avro.AvroTypeException: Unknown union
> branch name at
> org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:445)
>
> Does it make sense to enclose a schema into array? Is this a bug in
> avro-tools or is this an invalid schema? The exception above seems to
> indicate that a schema file may not begin with a JSON array of schemas.
>
> The documentation seems to indicate a schema may be defined as a union of
> other schemas.
>
> I cloned the code base and I could not locate a single instance of avsc file
> in it that defined its schema as a JSON array. Hence, the question.
>
> I appreciate your response.
>
> Regards
> Kamesh



-- 
busbey


[NOTICE] jira lockdown

2016-04-22 Thread Sean Busbey
Hi folks!

Just a quick heads-up that the ASF JIRA is currently locked down to
counter a spam attack. Unfortunately, this lock down prevents our
normal open-policy that allows anyone with a JIRA account to create,
assign, and comment on issues.

If you are caught up in this, please drop me a note either on or off
list with your JIRA user name and I'll get you added to a formal JIRA
role so that you can interact with the Avro project.

-Sean


Re: avro.java.string

2016-03-25 Thread Sean Busbey
Could you make a small Maven project that reproduces the issue?

On Fri, Mar 25, 2016 at 5:24 PM, Matt Narrell 
wrote:

> Avro and avro-maven-plugin 1.7.7
>
> No matter what I do, I'm unable to get this feature to work.  I've
> exhausted my Google skills and continue to be unsuccessful.  I'm looking
> here:
>
>
> http://stackoverflow.com/questions/25118727/how-to-generate-fields-of-type-string-instead-of-charsequence-using-avro
> https://issues.apache.org/jira/browse/AVRO-803
>
> I have a simple field in my schema like this:
>
> {
> "name": "simple",
> "type": {
> "type": "string",
> "avro.java.string": "String"
> }
> }
>
> However, this consistently yields:
> @Deprecated public java.lang.CharSequence simple;
>
> I'm not able to use the <stringType> configuration of the plugin as this
> project holds all of the Avro schemas for our organization, and cannot
> suffer a change that wide.
>
> Am I missing something obvious?
>



-- 
busbey


Re: Avro consumes all memory on box

2015-10-27 Thread Sean Busbey
Well, testing with the Java avro-tools was my very next suggestion. :/

Can you make a redacted version of the schema?

On Tue, Oct 27, 2015 at 1:22 PM, web user <webuser1...@gmail.com> wrote:
> Unfortunately the company I work at has a strict policy about sharing data.
> Having said that I don't think the file is corrupted.
>
> I ran the following command:
>
> java -jar avro-tools-1.7.7.jar tojson testdata.avro
>
> and it generates a file of 1 byte
>
> I also ran java -jar avro-tools-1.7.7.jar getschema testdata.avro and it
> gets back the correct schema.
>
> Is there any way, when using the python library, for it not to consume
> all memory on the entire box?
>
> Regards,
>
> WU
>
>
>
> On Tue, Oct 27, 2015 at 2:08 PM, Sean Busbey <bus...@cloudera.com> wrote:
>>
>> It sounds like the file you are reading is malformed. Could you share
>> the file or how it was written?
>>
>> On Tue, Oct 27, 2015 at 1:01 PM, web user <webuser1...@gmail.com> wrote:
>> > I ran this in a vm with much less memory and it immediately failed with
>> > a
>> > memory error:
>> >
>> > Traceback (most recent call last):
>> >   File "testavro.py", line 31, in 
>> > for r in reader:
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/datafile.py", line
>> > 362,
>> > in next
>> > datum = self.datum_reader.read(self.datum_decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 445, in
>> > read
>> > return self.read_data(self.writers_schema, self.readers_schema,
>> > decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 490, in
>> > read_data
>> > return self.read_record(writers_schema, readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 690, in
>> > read_record
>> > field_val = self.read_data(field.type, readers_field.type, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 484, in
>> > read_data
>> > return self.read_array(writers_schema, readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 582, in
>> > read_array
>> > for i in range(block_count):
>> > MemoryError
>> >
>> >
>> > On Tue, Oct 27, 2015 at 1:36 PM, web user <webuser1...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I'm doing the following:
>> >>
>> >> from avro.datafile import DataFileReader
>> >> from avro.datafile import DataFileWriter
>> >> from avro.io import DatumReader
>> >> from avro.io import DatumWriter
>> >>
>> >> def OpenAvroFileToRead(avro_filename):
>> >>     return DataFileReader(open(avro_filename, 'r'), DatumReader())
>> >>
>> >>
>> >> with OpenAvroFileToRead(avro_filename) as reader:
>> >>     for r in reader:
>> >>
>> >>
>> >> I have an avro file which is only 500 bytes. I think there is a data
>> >> structure in there which is null or empty.
>> >>
>> >> I put in print statements before and after "for r in reader". On the
>> >> instruction, for r in reader it consumes about 400Gigs of memory before
>> >> I
>> >> have to kill the process.
>> >>
>> >> That is 400Gigs! I have 1TB on my server. I have tried this with 1.6.1
>> >> and
>> >> 1.7.1 and 1.7.7 and get the same behavior on all three versions.
>> >>
>> >> Any ideas on what is causing this?
>> >>
>> >> Regards,
>> >>
>> >> WU
>> >
>> >
>>
>>
>>
>> --
>> Sean
>
>



-- 
Sean


Re: Converting Protobuf object to Avro

2015-08-24 Thread Sean Busbey
Hiya Lan!

You need to use a container file instead of just writing via the datum
writer yourself.

Take a look at the Getting Started (Java) section on serialization[1].
The example there uses the GenericDatumWriter, but you ought to be able to
switch it out for your ProtobufDatumWriter.




[1]:
http://avro.apache.org/docs/1.7.7/gettingstartedjava.html#Serializing-N101DE
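
A rough sketch of that substitution (assuming avro-protobuf's ProtobufData
for deriving the schema; the file name is arbitrary):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.protobuf.ProtobufData;
    import org.apache.avro.protobuf.ProtobufDatumWriter;

    // myProto deserialized via the google protobuf API, as before
    Schema schema = ProtobufData.get().getSchema(MyProto.class);
    ProtobufDatumWriter<MyProto> pbWriter = new ProtobufDatumWriter<MyProto>(MyProto.class);
    DataFileWriter<MyProto> fileWriter = new DataFileWriter<MyProto>(pbWriter);
    fileWriter.create(schema, new File("myproto.avro")); // writes the magic bytes and schema header
    fileWriter.append(myProto);
    fileWriter.close();

avro-tools getschema should then work, because the container header carries
the magic bytes and the schema that DataFileStream looks for.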

On Mon, Aug 24, 2015 at 12:54 PM, Lan Jiang ljia...@gmail.com wrote:

 Hi, there

 I am trying to convert a protobuf object to Avro. I am using

 //myProto object is deserialized using google protobuf API
 ProtobufDatumWriter<MyProto> pbWriter = new
 ProtobufDatumWriter<MyProto>(MyProto.class);

 FileOutputStream fo = new FileOutputStream(args[0]);
 Encoder e = EncoderFactory.get().binaryEncoder(fo, null);
 pbWriter.write(myProto, e);
 fo.flush();

 The avro file was created successfully. If I cat the file, I can see the
 data in the file. However, when I tried to use avro-tools to get schema or
 meta info about the saved avro file, it says

 Exception in thread "main" java.io.IOException: Not a data file.
 at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
 at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
 at
 org.apache.avro.tool.DataFileGetSchemaTool.run(DataFileGetSchemaTool.java:47)

 Looking at the Avro source code, the error means the file does not have the
 first 4 bytes matching the MAGIC bytes. I am trying to see if I have done
 anything wrong.

 Appreciate any help you can give me.

 Lan




-- 
Sean


[DISCUSS] Ruby version support in the upcoming 1.8 release line

2015-07-07 Thread Sean Busbey
Hi folks!

The dev list is working to get our next minor release line, 1.8.z, ready to
ship.

We're looking to modernize our Ruby support, and there is currently a
ticket[1] to drop support for Ruby 1.8. At the moment, the implementation
moves us to Ruby 2.0. We'd like to know how this will impact downstream
folks.


Ruby 1.8 has been EOL for a *very* long time and 1.9 recently reached EOL
as well[2]. That said, Java 6 has been EOL since Feb 2013 and we still
support it for the Java bindings[3].


What do folks think?


[1]: https://issues.apache.org/jira/browse/AVRO-1559
[2]:
https://www.ruby-lang.org/en/news/2015/02/23/support-for-ruby-1-9-3-has-ended/
even with the late extension for 1.8, it's been EOL for a year
https://www.ruby-lang.org/en/news/2014/07/01/eol-for-1-8-7-and-1-9-2/
[3]: http://www.oracle.com/technetwork/java/eol-135779.html


-- 
Sean


Re: Is Avro Splittable?

2015-06-26 Thread Sean Busbey
For Avro Container Files the schema is always at the beginning. Starting
each split task reading the schema and then seeking to a particular block
has worked well enough for MapReduce over the length of the project, so I
would just stick with doing the same thing.

If you are handling split work yourself you can just use DataFileReader[1]
and use seek/sync with your desired split offset and pastSync to tell when
your work is done. This will essentially access the file the same way
MapReduce currently does: a small read at the start followed by a seek and
then deserialization of a particular task's work.

[1]:
http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileReader.html
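
A rough sketch of that pattern (names and split bounds are illustrative):

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    long splitStart = 0L, splitLength = 64L * 1024 * 1024; // illustrative split bounds
    DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
        new File("data.avro"), new GenericDatumReader<GenericRecord>());
    reader.sync(splitStart); // advance to the first sync marker at or after the offset
    while (reader.hasNext() && !reader.pastSync(splitStart + splitLength)) {
      GenericRecord record = reader.next();
      // process record
    }
    reader.close();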

On Fri, Jun 26, 2015 at 10:38 AM, Mike Stanley m...@mikestanley.org wrote:

 Not 100% on this --- but I'm pretty sure the only other thing you need to
 take into consideration is the schema.  The avro schema is sometimes
 located at the beginning of the container (or external).  If you expect it
 at the beginning of the container and are using it to introspect an avro
 file, then splitting it could be problematic for consumer code.  If you
 plan on splitting it, then it's likely best to manage the schema externally
 to the container.

 On Fri, Jun 26, 2015 at 10:11 AM, Sean Busbey bus...@cloudera.com wrote:

 Avro Container Files are always splittable[1]. They're the way you will
 commonly interact with Avro serialized data.

 Data serialized as Avro's binary encoding is not splittable by itself,
 because the encoding includes no markers[2]. This may be the source of the
 disconnect you're finding in online docs.



 [1]: http://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files
 [2]: http://avro.apache.org/docs/1.7.7/spec.html#Data+Serialization

 On Thu, Jun 25, 2015 at 12:54 AM, Ankur Jain ankur.j...@yash.com wrote:

  Hello,



 I am reading various forms and docs, somewhere it is mentioned that avro
 is splittable and somewhere non-splittable.

 So which one is right??



 Regards,

 Ankur






 --
 Sean





-- 
Sean


Re: Is Avro Splittable?

2015-06-26 Thread Sean Busbey
Avro Container Files are always splittable[1]. They're the way you will
commonly interact with Avro serialized data.

Data serialized as Avro's binary encoding is not splittable by itself,
because the encoding includes no markers[2]. This may be the source of the
disconnect you're finding in online docs.



[1]: http://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files
[2]: http://avro.apache.org/docs/1.7.7/spec.html#Data+Serialization

On Thu, Jun 25, 2015 at 12:54 AM, Ankur Jain ankur.j...@yash.com wrote:

  Hello,



 I am reading various forms and docs, somewhere it is mentioned that avro
 is splittable and somewhere non-splittable.

 So which one is right??



 Regards,

 Ankur






-- 
Sean


Re: serialization-deserialization problem

2015-06-03 Thread Sean Busbey
The JSON listed is not the form that Avro's json encoder/decoder can
handle. Because the optional fields are unions, Avro's decoder expects you
to first list the type before the values.

Presuming CustomerEmails is an Avro record, i.e.

{"emails" : { "CustomerEmails" : { "emails": ["a...@a.com", "b...@b.com"]} },
"transactions": null, "features" : null }

See the spec for more details:
http://avro.apache.org/docs/1.7.7/spec.html#json_encoding

To use the JsonDecoder, you will have to use a JsonEncoder rather than
relying on the toString method (the format of its output is undefined, so
that's probably a good idea anyways).
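
A sketch of the writing side with a JsonEncoder (Customer is the poster's
generated class; nothing else here is from the original code):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.io.JsonEncoder;
    import org.apache.avro.specific.SpecificDatumWriter;

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    JsonEncoder jsonEncoder = EncoderFactory.get().jsonEncoder(Customer.getClassSchema(), out);
    SpecificDatumWriter<Customer> writer = new SpecificDatumWriter<Customer>(Customer.class);
    writer.write(customer, jsonEncoder);
    jsonEncoder.flush();
    // union branches come out tagged, so JsonDecoder can read this back
    String customerJson = out.toString();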

On Wed, Jun 3, 2015 at 5:20 PM, C 4.5 cfourf...@gmail.com wrote:

 Hi All,

 I have defined a simple Avro schema for Customer objects. Schema compiles
 OK and I don't have any problems.

 I am trying to serialize to json string and deserialize into Customer java
 objects but I am not able to.

 Please see below:
 [1]: is the full code snippet
 [2]: is the output I obtain

 I am using Avro 1.7.7

 Customer's schema is as follows. Emails are optional.
 {"namespace": "...",
  "type": "record",
  "name": "Customer",
  "fields": [
      {"name": "emails",       "type": [ "null", "CustomerEmails"]},
      {"name": "transactions", "type": [ "null", "...CustomerTransactions"]},
      {"name": "features",     "type": [ "null", "...CustomerFeatures"]}
  ]
 }

 I have researched and tried different things but something seems to escape
 me, this should be a very trivial case to support.
 Hence, I am sure I am just overlooking something that is probably obvious.

 Any feedback?
 Thanks a ton

 ===[1] code
 // ser
 List<CharSequence> emails = new ArrayList<CharSequence>();
 emails.add("a...@a.com");
 emails.add("b...@b.com");

 CustomerEmails customerEmails = CustomerEmails.newBuilder().
     setEmails(emails).
     build();

 Customer customer = Customer.newBuilder().
     setEmails(customerEmails).
     setTransactions(null).
     setFeatures(null).
     build();

 String customerJson = customer.toString();
 System.out.println("customer json: " + customerJson);

 // de-ser
 JsonDecoder jsonDecoder =
     DecoderFactory.get().jsonDecoder(Customer.getClassSchema(), customerJson);
 SpecificDatumReader<Customer> reader = new
     SpecificDatumReader<Customer>(Customer.class);
 Object obj = reader.read(null, jsonDecoder);

 System.out.println("obj=" + obj);

 ===[2] output
 customer json: {"emails": {"emails": ["a...@a.com", "b...@b.com"]},
 "transactions": null, "features": null}
 Exception in thread "main" org.apache.avro.AvroTypeException: Unknown
 union branch emails
 at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:445)
 at
 org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
 at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
 at
 org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
 at
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
 at
 org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
 at
 org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
 at
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
 at
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
 at com.mytest.Deser.main(Deser.java:79)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)




-- 
Sean


Re: Concurrent writes to same avro file

2015-03-13 Thread Sean Busbey
The various Avro writer / readers are not thread safe. You will need to do
some sort of external synchronization. If the threads are in the same JVM,
the easiest way to write from multiple threads safely will be to
synchronize on the DataFileWriter instance.

e.g.

synchronized(myDataFileWriter) {
  myDataFileWriter.append(datum);
}



On Fri, Mar 13, 2015 at 1:05 PM, Shruthi Jeganathan 
shruthi.jeganat...@tapjoy.com wrote:

 Hi,

 I have multiple threads writing to same avro output file(out.avro). When
 deserializing out.avro, I get this exception:

 org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
 at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
 at com.example.Main.deserialize(Main.java:80)
 at com.example.Main.main(Main.java:50)
 Caused by: java.io.IOException: Invalid sync!
 at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
 at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
 ... 2 more

 Is this because I'm concurrently writing to out.avro? If it's an issue, is
 there a way for multiple threads to simultaneously write to out.avro?
 Please provide code samples, if possible.

 Thanks.




-- 
Sean


Re: Doubt in a AVRO scenario

2015-02-12 Thread Sean Busbey
Hi!

DatumWriter doesn't serialize the schema when writing an individual datum out.
If you look at your byte array contents, I believe you'll find that it just
contains the binary representation of the record.

Your use case sounds very similar to a recent question on the list on
storing records in byte arrays without the original writer schema[1]. The
guidance on that thread about using a schema id should work for your
scenario as well. Note that it will be up to you to handle serialization of
the id and lookup of the schema.

[1]: http://s.apache.org/2IM



On Thu, Feb 12, 2015 at 1:50 AM, Arunasalam G zealousa...@gmail.com wrote:

 Hi,

 Is there any way to retrieve schema from the encoded data without knowing
 its schema prior to deserialization?
 As requested, we have given the steps that we did for serializing the data
 and schema.

 Please help us in resolving the scenario.
 Looking forward to hearing from you soon.

 Thanks and Regards,
 Arun G.

 On Wed, Feb 11, 2015 at 2:56 PM, Arunasalam G zealousa...@gmail.com
 wrote:

 Hi,

 We serialized the schema using the following code.

 ByteArrayOutputStream out = new ByteArrayOutputStream();
 BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
 DatumWriter<Record> writer = new SpecificDatumWriter<Record>(schema);

 writer.write(record, encoder);
 encoder.flush();
 out.close();

 Here, record is of type org.apache.avro.generic.GenericData.Record.

 Thanks and Regards,
 Arun G


 On Wed, Feb 11, 2015 at 2:08 PM, Sean Busbey bus...@cloudera.com wrote:

 On Wed, Feb 11, 2015 at 1:24 AM, Arunasalam G zealousa...@gmail.com
 wrote:


 Our scenario is we have stored the data with schema added to it.

 I would like to make it more simple without bringing the Hbase into
 consideration.

 We have an Avro data object which has both data and schema and is
 serialized to Byte Array.
 Is there any way to retrieve the schema from this ByteArray object?

 Let's assume that we don't know what schema is present in the incoming
 object.
 I could find that for an AVRO data file, its possible to retrieve the
 schema from the file and similarly, is there any way for retrieving the
 schema from a serialized byte array object?


 It depends entirely on how you serialized the schema + binary into the
 byte array. Did you use some library or can you briefly describe the method
 used?

 --
 Sean






-- 
Sean


Re: Adding new field with default value to an Avro schema

2015-02-03 Thread Sean Busbey
Schema evolution in Avro requires access to both the schema used when
writing the data and the desired Schema for reading the data.

Normally, Avro data is stored in some container format (i.e. the one in the
spec[1]) and the parsing library takes care of pulling the schema used when
writing out of said container.

If you are using Avro data in some other location, you must have the writer
schema as well. One common use case is a shared messaging system focused on
small messages (but that doesn't use Avro RPC). In such cases, Doug Cutting
has some guidance he's previously given (quoted with permission, albeit
very late):

 A best practice for things like this is to prefix each Avro record
 with a (small) numeric schema ID.  This is used as the key for a
 shared database of schemas.  The schema corresponding to a key never
 changes, so the database can be cached heavily.  It never gets very
 big either.  It could be as simple as a .java file, with the
 constraint that you'd need to upgrade things downstream before
 upstream, or as complicated as an enterprise-wide REST schema service
 (AVRO-1124).  A variation is to use schema fingerprints as keys.

 Potentially relevant stuff:

 https://issues.apache.org/jira/browse/AVRO-1124
 http://avro.apache.org/docs/current/spec.html#Schema+Fingerprints

If you take the integer schema ID approach, you can use Avro's built in
utilities for zig-zap encoding, which will ensure that most of the time
your identifier only takes a small amount of space.

[1]: http://avro.apache.org/docs/current/spec.html#Object+Container+Files
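
A rough sketch of the prefix idea (schemaId, datumWriter, and record are
assumed to exist; none of this is from Doug's original code):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    encoder.writeInt(schemaId);         // zig-zag varint, so small ids cost one byte
    datumWriter.write(record, encoder); // the record body follows the id
    encoder.flush();
    // a reader peeks the id, fetches the writer schema from the registry,
    // then decodes the rest of the bytes with that schema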


On Tue, Feb 3, 2015 at 5:57 AM, Burak Emre emrekaba...@gmail.com wrote:

 I added a field with a default value to an Avro schema which is previously
 used for writing data. Is it possible to read the previous data using *only
 new schema* which has that new field at the end?

 I tried this scenario but unfortunately it throws EOFException while
 reading the third field. Even though it has a default value and the previous
 fields are read successfully, I'm not able to de-serialize the record back
 without providing the writer schema I used previously.

 Schema schema = Schema.createRecord("test", null, "avro.test", false);
 schema.setFields(Lists.newArrayList(
     new Field("project", Schema.create(Type.STRING), null, null),
     new Field("city", Schema.createUnion(Lists.newArrayList(
         Schema.create(Type.NULL), Schema.create(Type.STRING))), null,
         NullNode.getInstance())));

 GenericData.Record record = new GenericRecordBuilder(schema)
     .set("project", "ff").build();

 GenericDatumWriter w = new GenericDatumWriter(schema);
 ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
 BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);

 w.write(record, encoder);
 encoder.flush();

 schema = Schema.createRecord("test", null, "avro.test", false);
 schema.setFields(Lists.newArrayList(
     new Field("project", Schema.create(Type.STRING), null, null),
     new Field("city", Schema.createUnion(Lists.newArrayList(
         Schema.create(Type.NULL), Schema.create(Type.STRING))), null,
         NullNode.getInstance()),
     new Field("newField", Schema.createUnion(Lists.newArrayList(
         Schema.create(Type.NULL), Schema.create(Type.STRING))), null,
         NullNode.getInstance())));

 DatumReader<GenericRecord> reader = new GenericDatumReader(schema);
 Decoder decoder = DecoderFactory.get().binaryDecoder(outputStream.toByteArray(), null);
 GenericRecord result = reader.read(null, decoder);





-- 
Sean


Re: Adding new field with default value to an Avro schema

2015-02-03 Thread Sean Busbey
On Tue, Feb 3, 2015 at 11:34 AM, Lukas Steiblys lu...@doubledutch.me
wrote:

   On a related note, is there a tool that can check the backwards
 compatibility of schemas? I found some old messages talking about it, but
 no actual tool. I guess I could hack it together using some functions in
 the Avro library.

 Lukas


I don't think so, but this would be a great addition to the avro-tools
utility. Would you mind filing a JIRA for it?
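
For what it's worth, the Java library does expose a programmatic check that
such a tool could wrap (a sketch; readerSchema and writerSchema are the two
Schema objects to compare, and it assumes a version that ships
org.apache.avro.SchemaCompatibility):

    import org.apache.avro.SchemaCompatibility;
    import org.apache.avro.SchemaCompatibility.SchemaPairCompatibility;

    SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema);
    System.out.println(result.getType()); // COMPATIBLE or INCOMPATIBLE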

-- 
Sean


Re: avro json-ld

2014-11-26 Thread Sean Busbey
Sounds interesting. Any chance you could put up a patch for a new Encoder
that does json-ld? (rather than changing the extant json encoder)

-- 
Sean
On Nov 25, 2014 1:41 PM, peter peter.amst...@curoverse.com wrote:

 Hello everyone,

 Has anyone given any thought to a json-ld (http://json-ld.org/) encoding
 for Avro?  By my reading, it appears that the Avro schema language and
 existing Avro JSON encoding could be tweaked to produce json-ld
 compatible structures without much difficulty.  I think this would be
 very interesting as it would provide a bridge to transform any Avro data
 structure to RDF, which would benefit communities using Avro that are
 also using linked data.

 Thanks,
 Peter





Re: Generated enum dollar sign in front of a symbol.

2014-10-08 Thread Sean Busbey
Hi Lukas!

Yes, this sounds like a bug; please file an issue.



On Wed, Oct 8, 2014 at 2:26 PM, Lukas Steiblys lu...@doubledutch.me wrote:

   I realized now that “default” is a keyword in Java and can’t be used as
 an enum value. The files were generated in python using the python Avro
 library, where “default” is not a keyword and can be used freely. I assume
 there should be a conversion somewhere in the Java Avro library, where a
 dollar sign is automatically added for enum values that are Java keywords.
 Is that actually the case? Why did it fail this time then? Should I file a
 bug?

 Lukas

  *From:* Lukas Steiblys lu...@doubledutch.me
 *Sent:* Wednesday, October 8, 2014 12:06 PM
 *To:* user@avro.apache.org
 *Subject:* Generated enum dollar sign in front of a symbol.

   Has anyone run into the problem where the generated java class for an
 enum has a dollar sign for one enum value?

 The schema {"type": "enum", "name": "ButtonTypeID", "symbols":
 ["default", "keyboard"]} generates the following class:

 public final class ButtonTypeID extends java.lang.Enum<ButtonTypeID> {
   public static final ButtonTypeID default$;
   public static final ButtonTypeID keyboard;
   public static final org.apache.avro.Schema SCHEMA$;
   public static ButtonTypeID[] values();
   public static ButtonTypeID valueOf(java.lang.String);
   public static org.apache.avro.Schema getClassSchema();
   static {};
 }

 (this is what “javap ButtonTypeID.class” produces)

 When I try to read my data that has the “default” value for ButtonTypeID,
 I get the exception:

 java.lang.IllegalArgumentException: No enum constant ButtonTypeID.default
   at java.lang.Enum.valueOf(Enum.java:236)
   at 
 org.apache.avro.specific.SpecificData.createEnum(SpecificData.java:106)
   at 
 org.apache.avro.generic.GenericDatumReader.createEnum(GenericDatumReader.java:205)...

 Strangely, everything was working fine a day before. Where is this dollar 
 sign coming from?

 Lukas




-- 
Sean


Re: Where is org.apache.avro.reflect.DecimalEncoding?

2014-08-11 Thread Sean Busbey
AVRO-1402 only updated the specification to include Decimal[1].

AVRO-1497 is the ticket for adding an implementation to the java library
and it is still open[2].


HTH.

[1]: http://avro.apache.org/docs/1.7.7/spec.html#Decimal
[2]: https://issues.apache.org/jira/browse/AVRO-1497


On Mon, Aug 11, 2014 at 12:52 PM, Michael Pigott 
mpigott.subscripti...@gmail.com wrote:

 Hi,
 I'm sorry, I'm feeling kind of dense at the moment.  I just upgraded
 to Avro 1.7.7 to hopefully use the new Decimal type (
 https://issues.apache.org/jira/browse/AVRO-1402 ), however I can't seem
 to find it.  Which Maven dependency should I use?  I tried the following:

 org.apache.avro:avro:1.7.7
 org.apache.avro:avro-tools:1.7.7
 org.apache.avro:avro-ipc:1.7.7
 org.apache.avro:avro-protobuf:1.7.7

 I also noticed that the last comment on
 https://issues.apache.org/jira/browse/AVRO-1402 is a Hudson build failure
 - did the code make it into the build?

 Thanks,
 Mike




-- 
Sean


Re: Unions Only Allow One Map, Even If Values Are Different?

2014-08-05 Thread Sean Busbey
Hi Mike!

Yep, the specification calls out that only those types that are named
(records, enums, and fixed) can occur multiple times in a union[1].

[1]: http://avro.apache.org/docs/1.7.7/spec.html#Unions
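
The usual workaround (a sketch, not something the spec prescribes) is to wrap
each map in a named record, since named types may each appear once:

    ["null",
     {"type": "record", "name": "StringMap",
      "fields": [{"name": "value", "type": {"type": "map", "values": "string"}}]},
     {"type": "record", "name": "LongMap",
      "fields": [{"name": "value", "type": {"type": "map", "values": "long"}}]}
    ]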


On Tue, Aug 5, 2014 at 2:28 PM, Michael Pigott 
mpigott.subscripti...@gmail.com wrote:

 Hi,
 I looked through the JIRA tickets but I did not find one that matched
 what I just ran into - unions can't have multiple maps, even if each one
 has a different value type?  Automatic-map-generation is the last feature
 I'm adding to AVRO-457 https://issues.apache.org/jira/browse/AVRO-457 
 (before
 Avro - XML conversion) and I was surprised that two maps, each with
 different value types, could not co-exist in a union.  Is this on purpose?

 Thanks!
 Mike




-- 
Sean


Re: Avro compression doubt

2014-07-09 Thread Sean Busbey
Can you share the schema? How big is it?

The schema itself is not compressed, so given your small data size it might
be dominating.
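
A quick way to check (a sketch; schema here is the Schema object passed to
fileWriter.create): the container header stores the schema as uncompressed
JSON, so its size bounds how much of the 57 KB is metadata:

    System.out.println("schema bytes = " + schema.toString().getBytes("UTF-8").length);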


On Wed, Jul 9, 2014 at 1:20 AM, Sachin Goyal sgo...@walmartlabs.com wrote:

 Hi,

 I have been trying to use Avro compression codecs to reduce the size of
 avro-output.
 The Java object being serialized is pretty huge and here are the results
 of applying different codecs.


 Serialization     : Kilo-Bytes
 ----------------  : ----------
 Avro (No Codec)   :   57.3
 Avro (Snappy)     :   52.0
 Avro (Bzip2)      :   51.6
 Avro (Deflate)    :   51.1
 Avro (xzCodec)    :   51.0
 Direct JSON       :   23.6  (Just for comparison since we use JSON too
 heavily. This was done using Jackson)




 The Java code I used to try codecs is as follows:
  ------------------------------------------------------------------
 // schema derived via reflection from the object being serialized
 ReflectData rdata = ReflectData.get();
 Schema schema = rdata.getSchema(productObj.getClass());

 ReflectDatumWriter datumWriter =
     new ReflectDatumWriter(productObj.getClass(), rdata);
 DataFileWriter fileWriter = new DataFileWriter(datumWriter);

 // Try each one of these codecs one at a time
 fileWriter.setCodec(CodecFactory.snappyCodec());
 fileWriter.setCodec(CodecFactory.bzip2Codec());
 fileWriter.setCodec(CodecFactory.deflateCodec(9));
 fileWriter.setCodec(CodecFactory.xzCodec(5));  // using 9 here caused out-of-memory

 // Now check output size
 ByteArrayOutputStream baos = new ByteArrayOutputStream();

 fileWriter.create(schema, baos);
 fileWriter.append(productObj);
 fileWriter.close();
 System.out.println("Avro bytes = " + baos.toByteArray().length);
 ---
 



 And then, on the command line, I applied the normal zip command as:
   $ zip output.zip output.avr;
   $ ls -l output.*
 This gives me the following output:

 57339  output.avr
  9081  output.zip (20% the original size!)




 So my questions are:
 -
 1) Why I am not seeing a huge benefit in size when applying the codec? Am
 I using the API correctly?
 2) I understand that the compression achieved by normal zip command would
 be better than applying codecs in Avro, but is such a huge difference
 expected?


 One thing I expected and did notice is that Avro truly shines when the
 number of objects to be appended is more than 10.
 This is so because the schema is written only once and all the actual
 objects are appended as binary.
 So that was expected, but compression codecs output looked a bit
 questionable.

 Please suggest if I am doing something wrong.

 Thanks
 Sachin







-- 
Sean


Re: Dynamic Package/namespace naming

2014-05-06 Thread Sean Busbey
Hi Lewis,

Well, I'm not sure what you're going to gain by isolating each file into
its own namespace.

However, one way you can achieve that is to take the information you want
included (experiment, channel, and timestamp) and then create a hash of it.
You can use the hex representation of the hash to give you something that

* meets Avro's requirements for namespace names
* uniquely identifies each namespace based on experiment, channel, and
timestamp

If you need the file to also contain the information that went into making
that hash, you can attach it as Avro-ignored metadata on the schema.
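
A minimal sketch of that idea (the class, method, and metadata property
names here are illustrative, not an Avro API):

  import java.security.MessageDigest;
  import org.apache.avro.Schema;

  public class HashedNamespace {
    public static Schema record(String experiment, String channel,
                                String timestamp) throws Exception {
      String logical = experiment + "/" + channel + "/" + timestamp;
      MessageDigest md = MessageDigest.getInstance("MD5");
      StringBuilder ns = new StringBuilder("ns_");  // keep a legal leading character
      for (byte b : md.digest(logical.getBytes("UTF-8"))) {
        ns.append(String.format("%02x", b));
      }
      Schema schema = Schema.createRecord("Sample", null, ns.toString(), false);
      // carry the human-readable identity as Avro-ignored metadata
      schema.addProp("logicalNamespace", logical);
      return schema;  // callers still need setFields(...) before use
    }
  }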

-Sean


On Mon, May 5, 2014 at 6:49 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Sean,
 Arh...
 Maybe I can propose a change to the namespace semantics in my project.
 I've been given the following brief.
 Data files should have a logical namespace structure which can be globally
 unified across multiple instruments and receiver platforms into a common
 namespace on the filesystem.
 I have a whole list of other data representation constraints and Avro
 satisfies them all. This one is the only 'issue'.
 Thanks for your help.
 Lewis
 On May 5, 2014 4:21 PM, Sean Busbey busbey+li...@cloudera.com wrote:

 Hi Lewis!

 Avro namespaces don't allow the characters '/', ':', or '-'. So your
 specific example would not work.  The allowed characters for a namespace
 are defined in the Avro spec[1].

 It would help if you could clarify what purpose namespacing serves in the
 system.

 [1]: http://avro.apache.org/docs/current/spec.html#Names


 On Mon, May 5, 2014 at 6:14 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Hi Charles
 Thanks for reply.
 Re: more info
 Say I want my namespace to be

 experiment_name-{timestamp}/channel_name/2014-05-05T16:13:29/

 Where all constituent parts of namespace separated by '/' are dynamic...
 is this possible via builder api?
 Thanks again
 On May 5, 2014 4:04 PM, Pritchard, Charles X. -ND 
 charles.x.pritchard@disney.com wrote:

 Need more info — you can use the Schema builder to do anything you like
 at runtime,
 Schema.createRecord and setFields.



 On May 5, 2014, at 3:54 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 I'm trying to propose Avro for a project I've been drafted onto.
 Namespace declarations are important... and a requirement is that
 namespaces are dynamic in nature... preferably even decided at runtime.
 Is this possible in Avro?
 Thanks
 Lewis






-- 
Sean


Re: Dynamic Package/namespace naming

2014-05-05 Thread Sean Busbey
Hi Lewis!

Avro namespaces don't allow the characters '/', ':', or '-'. So your
specific example would not work.  The allowed characters for a namespace
are defined in the Avro spec[1].

It would help if you could clarify what purpose namespacing serves in the
system.

[1]: http://avro.apache.org/docs/current/spec.html#Names


On Mon, May 5, 2014 at 6:14 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Charles
 Thanks for reply.
 Re: more info
 Say I want my namespace to be

 experiment_name-{timestamp}/channel_name/2014-05-05T16:13:29/

 Where all constituent parts of namespace separated by '/' are dynamic...
 is this possible via builder api?
 Thanks again
 On May 5, 2014 4:04 PM, Pritchard, Charles X. -ND 
 charles.x.pritchard@disney.com wrote:

 Need more info — you can use the Schema builder to do anything you like
 at runtime,
 Schema.createRecord and setFields.



 On May 5, 2014, at 3:54 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 I'm trying to propose Avro for a project I've been drafted onto.
 Namespace declarations are important... and a requirement is that
 namespaces are dynamic in nature... preferably even decided at runtime.
 Is this possible in Avro?
 Thanks
 Lewis





Re: Hadoop Avro generated code error due to Turkish Locale

2014-04-05 Thread Sean Busbey
On Sat, Apr 5, 2014 at 11:49 AM, Serkan Taş serkan_...@hotmail.com wrote:

 Hi all,

 I am facing a common problem with tr locale settings for Java applications
 while trying to build a dev environment according to
 http://wiki.apache.org/hadoop/EclipseEnvironment.

 Here is the error :

 [ERROR] Failed to execute goal
 org.apache.maven.plugins:maven-compiler-plugin:2.5.1:testCompile
 (default-testCompile) on project hadoop-common: Compilation failure
 [ERROR]
 /Users/serkan/programlar/dev/hadooptest/hadoop-trunk/hadoop-common-project/hadoop-common/target/generated-test-sources/java/org/apache/hadoop/io/serializer/avro/AvroRecord.java:[10,244]
 unmappable character for encoding UTF-8
 [ERROR] - [Help 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:testCompile
 (default-testCompile) on project hadoop-common: Compilation failure
 /Users/serkan/programlar/dev/hadooptest/hadoop-trunk/hadoop-common-project/hadoop-common/target/generated-test-sources/java/org/apache/hadoop/io/serializer/avro/AvroRecord.java:[10,244]
 unmappable character for encoding UTF-8


 If I check the code, I can see the reason for the error:

  public static final org.apache.avro.Schema SCHEMA$ = new
 org.apache.avro.Schema.Parser().parse({\type\:\record\,\name\:\AvroRecord\,\namespace\:\org.apache.hadoop.io.serializer.avro\,\fields\:[{\name\:\intField\,\type\:\Ýnt\}]});

 As you can see, the locale-dependent capitalization of the letter i turns
 it into Ý.




 This code is automatically generated by avro.


 The same bug exists in some other Apache projects and has been fixed
 there. I am not sure about Avro.

 For eg.

 OPENEJB-1071 https://issues.apache.org/jira/browse/OPENEJB-1071, 
 OAK-260https://issues.apache.org/jira/browse/OAK-260
 , IBATIS-218 https://issues.apache.org/jira/browse/IBATIS-218, etc.

 Should I file a bug?


Yes, please do file a bug. Avro should be specifying a locale when
generating that Schema string.
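
For reference, the underlying pitfall looks like this (an illustration of
the default-locale behavior, not Avro's actual codegen):

  import java.util.Locale;

  public class TurkishI {
    public static void main(String[] args) {
      Locale.setDefault(new Locale("tr", "TR"));
      // Under tr_TR, "i" uppercases to the dotted capital İ (U+0130):
      System.out.println("intField".toUpperCase());             // İNTFİELD
      // An explicit locale makes the result deterministic everywhere:
      System.out.println("intField".toUpperCase(Locale.ROOT));  // INTFIELD
    }
  }

Any case conversion in the code generator should pass an explicit Locale
the same way.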


-- 
Sean


Re:

2014-03-17 Thread Sean Busbey
Hi Shaq!

Could you describe your use case in more detail?

Generally, HDFS will behave poorly in the face of many small files. Could
you perhaps colocate several records in one file? This will help both with
the relative overhead of the schema and the pressure on the HDFS NameNode.
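
If you truly need datum-only bytes with no container file and no embedded
schema, the raw binary encoding does exactly that. A sketch, assuming a
GenericRecord named record and its Schema named schema are in scope:

  ByteArrayOutputStream out = new ByteArrayOutputStream();
  BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
  new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
  encoder.flush();
  byte[] datumOnly = out.toByteArray();  // no schema, no file header

The reader then needs the writer's schema (or both schemas, for schema
resolution) available at read time, exactly as you describe.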

-Sean


On Mon, Mar 17, 2014 at 2:55 PM, Salman Haq shaq@audaxhealth.com wrote:

 Hello,

 I'd like to confirm if there is a recommended way to serialize data to a
 file but without the schema being written in the file metadata. Assume a
 reader's schema will be available for deserialization at a later time.

 My use case requires small-sized datum messages to be serialized and
 copied to HDFS. The presence of the schema in the message file adds
 considerable overhead relative to the size of the datum itself.

 Thank you,
 Shaq




Re: Unable to compile a namespace-less schema

2013-10-11 Thread Sean Busbey
Hi Vitaly,

In general, Java does not allow classes outside of the default package to
import classes from within the default package.

I think this means that this is expected behavior, given that Avro says
non-namespaced schemas generate code in the default package.

For your particular issue, this means either restructuring your application
so that Main is in the default package or giving the schemas a namespace so
they'll be in some package. The latter is preferable; in your example
com.company or com.company.serialization would be appropriate.

HTH


On Fri, Oct 11, 2013 at 3:59 PM, Vitaly Gordon vita...@gmail.com wrote:

 Hi Doug,
 I've attached a maven project that contains code that shows the problem.
 The code is basically the same as the one from the Avro guide, but what is
 important to observe is that since the Main class has a package defined, it
 cannot import the classes generated from the namespace-less schema.

 just run mvn compile to get the compilation errors

 Thanks,
 Vitaly


 On Thu, Oct 10, 2013 at 1:58 PM, Doug Cutting cutt...@apache.org wrote:

 I encourage you to please provide a complete test, code that fails.
 If maven is involved, create a simple, Maven project that illustrates
 the problem.

 Thanks,

 Doug

 On Wed, Oct 9, 2013 at 11:21 PM, Vitaly Gordon vita...@gmail.com wrote:
  Hi Doug,
  You are right, the code does compile with javac. Apparently it is some
 maven
  error, where it doesn't like to compile package-less files.
 
  Having said that, I still have the issue of not being able to use these
 java
  files in my code, because there is no way to import them. One thing I
 tried
  that sometimes works is adding some arbitrary namespace to the avro
 schema.
  However, when I try to read records using the new (with namespace)
 schema, I
  get in return a generic record instead of a specific one. This behavior
 can
  be observed in the same file I attached by adding an arbitrary
 namespace to
  the schema before generating the Java classes from it.
 
  Is there any way to read specific records when the schema that was used
 to
  write them contains no namespace?
 
  Thanks,
  Vitaly
 
 
  On Wed, Oct 9, 2013 at 6:07 PM, Doug Cutting cutt...@apache.org
 wrote:
 
  Using the current trunk of Avro I am able to:
   - extract the schema from the data file you provided (using
  avro-tools schema command)
   - generate Java classes for this schema (using the avro-tools compile
  command)
   - compile these generated Java classes (using the javac command)
 
  Can you provide a complete case of what fails for you?
 
  Thanks,
 
  Doug
 
  On Wed, Oct 9, 2013 at 4:56 PM, Vitaly Gordon vita...@gmail.com
 wrote:
   Does anyone else have an idea how I can resolve this namespace-less
   Avro schema code generation issue?
  
   Thanks,
   Vitaly
  
  
   On Mon, Oct 7, 2013 at 2:04 PM, Vitaly Gordon vita...@gmail.com
 wrote:
  
   Hi Sean,
   Here is a file that contains a single record that I cannot read
 using a
   specific reader.
  
   It's hard for me to add code because the problem is a compilation
   problem
   with the generated Java files.
  
   So to recreate the problem:
   1. Extract the schema from the record
   2. Generate the code from the schema
   3. Compile
  
   Is there another way that I can describe the issue?
  
  
  
   On Mon, Oct 7, 2013 at 10:58 AM, Sean Busbey bus...@cloudera.com
   wrote:
  
   Hi Vitay!
  
   Can you give us a minimal schema and test program that illustrates
 the
   problem you're describing?
  
   --
   Sean
  
   On Oct 7, 2013 12:27 PM, Vitaly Gordon vita...@gmail.com
 wrote:
  
   Hi All,
   I am trying to read Avro data whose schema does not have a
   namespace.
   The problem is that I cannot compile the classes, because the
   generated Java
   code does not have a package. On the other hand, if I do add some
   arbitrary
   namespace to the schema, the record is resolved as a generic one,
   which then
   fails on ClassCastException to the specific record.
  
   Any ideas on how I can resolve this issue?
  
   Thanks,
   Vitay
  
  
  
 
 





-- 
Sean


Re: Unable to compile a namespace-less schema

2013-10-07 Thread Sean Busbey
Hi Vitay!

Can you give us a minimal schema and test program that illustrates the
problem you're describing?

-- 
Sean
On Oct 7, 2013 12:27 PM, Vitaly Gordon vita...@gmail.com wrote:

 Hi All,
 I am trying to read Avro data whose schema does not have a namespace.
 The problem is that I cannot compile the classes, because the generated
 Java code does not have a package. On the other hand, if I do add some
 arbitrary namespace to the schema, the record is resolved as a generic one,
 which then fails on ClassCastException to the specific record.

 Any ideas on how I can resolve this issue?

 Thanks,
 Vitay



Re: can avro files on hdfs be read using pig

2013-09-28 Thread Sean Busbey
+user@pig

On Wed, Sep 25, 2013 at 9:33 AM, Anup ahirea...@gmail.com wrote:


  On Sep 24, 2013, at 11:15 PM, Phani pche...@gmail.com wrote:
 
  wanted to know if avro files can be read using pig from hdfs.
 
  Thanks
 
  --
  phani



 Yes. Use AvroStorage().

 Sent from my Turing Machine.


With slightly more detail:

AvroStorage will let you load and store arbitrary Avro:

https://cwiki.apache.org/confluence/display/PIG/AvroStorage

It is included in the contrib packages known as PiggyBank:

https://cwiki.apache.org/confluence/display/PIG/PiggyBank

Please note that the PiggyBank is distributed as source-only via SVN. The
page above explains how to build it.
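
Once built, usage looks roughly like this (the jar path is illustrative,
and AvroStorage's helper jars need to be REGISTERed as well):

  REGISTER /path/to/piggybank.jar;
  records = LOAD '/data/input.avro'
            USING org.apache.pig.piggybank.storage.avro.AvroStorage();
  DUMP records;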

If you use a repackaging of Pig provided by a commercial vendor, they will
likely provide a binary distribution and instructions specific to their
distribution for using AvroStorage.


-- 
Sean