Re: Questions re integrating Avro into Cascading process

2010-04-22 Thread Ken Krugler


On Apr 21, 2010, at 3:22pm, Doug Cutting wrote:


Ken Krugler wrote:
One open issue - it would be great to be able to set metadata in  
the headers of the resulting Avro files. But it wasn't obvious how  
to do that, given our (intentionally) arms-length approach via the  
use of the Avro mapred code.
One idea would be to have job conf values using keys prefixed with  
avro.metadata.xxx, and the Avro mapred support could automagically  
use that when creating the file. But this would break our goal of  
using unmodified Avro source, so I'm curious whether support for  
setting the file metadata would also be useful for the standard  
(Hadoop) use of Avro for an output format, and if so, whether there  
was a better approach.


Embedding the metadata in the configuration seems like a good  
approach.  Please file a Jira issue for this and attach a patch.


AvroOutputFormat can add properties named  
avro.mapred.output.metadata.*.  We'll have to enumerate all  
properties in the job and test for this prefix, since Configuration  
is a HashMap, but the alternative of encoding the metadata map in a  
single configuration value seems no more attractive.
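
A minimal sketch of the prefix scan described above, written in Python rather than Java for brevity. The dict stands in for the job Configuration, and extract_metadata and the sample keys are hypothetical; only the avro.mapred.output.metadata. prefix comes from this thread.

```python
# Hypothetical sketch: "conf" stands in for a Hadoop job Configuration,
# modeled here as a plain dict of string keys to string values.
METADATA_PREFIX = "avro.mapred.output.metadata."


def extract_metadata(conf):
    """Return {name: value} for every property whose key starts with the prefix."""
    return {
        key[len(METADATA_PREFIX):]: value
        for key, value in conf.items()
        # Skip a bare prefix with no metadata name after it.
        if key.startswith(METADATA_PREFIX) and len(key) > len(METADATA_PREFIX)
    }


# Sample (made-up) job properties; only two carry the metadata prefix.
conf = {
    "mapred.job.name": "crawl-export",
    "avro.mapred.output.metadata.crawl.id": "2010-04-21",
    "avro.mapred.output.metadata.source": "bixo",
}
metadata = extract_metadata(conf)
```

The resulting name/value pairs are what an output format would then hand to the file writer's metadata; every property has to be visited because the keys cannot be looked up by prefix.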


Note that https://issues.apache.org/jira/browse/HADOOP-6420 added  
support for adding maps to configuration, but the extracted map  
cannot be enumerated, so could not be added to the DataFileWriter's  
metadata. Also, this feature is perhaps slated for removal as a part  
of https://issues.apache.org/jira/browse/HADOOP-6698, but its code  
might prove useful as a starting point.


Thanks for the info, we'll work up a patch & file the issue when it's  
ready.


Two related questions:

1. I'm assuming there's no compelling reason to read the file headers  
- in fact, not sure how you'd even get at the data, much less how  
you'd deal with potentially partial/missing data from a set of Avro  
files being read as part files.


2. We'd like to not include Avro source in the Cascading scheme  
project, but rather just have a dependency on the Avro jar.


We have a similar relationship between Bixo and Tika, and what's  
worked well is for the Bixo master branch to have a dependency on the  
Tika snapshot builds, so we can quickly iterate on both projects.


So are there plans to start pushing Avro snapshot builds to the Apache  
snapshots repository? I see occasional Avro releases to the Maven  
central repo (1.0, 1.2, 1.3.2) but nothing for snapshots.


Thanks,

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






[jira] Commented: (AVRO-285) request-only messages

2010-04-22 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860060#action_12860060
 ] 

Doug Cutting commented on AVRO-285:
---

I've started looking into implementing this.  We still need a handshake to take 
place, so that client and server versions need not match exactly.  But, without 
a response, there's no way to make a piggybacked handshake.  So, I think, to 
implement this, we need to factor the handshake logic out of every request and 
response.

This can be done compatibly.  HTTP always sends a response, so with that 
transport there will always be a handshake response, and that's the only 
transport specified today.

For the Java implementation I therefore intend to refactor handshaking.  
I'll use a non-standard transport, such as SocketTransceiver and 
SocketServer, to test unidirectional messages.

> request-only messages
> -
>
> Key: AVRO-285
> URL: https://issues.apache.org/jira/browse/AVRO-285
> Project: Avro
> Issue Type: New Feature
> Components: spec
> Reporter: Doug Cutting
>
> It might be useful to have a standard mechanism in Avro for transmitting 
> messages that receive no response, not even an acknowledgement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: TLP move

2010-04-22 Thread Bruce Mitchener

+1

On Apr 22, 2010, at 1:56 PM, Jeff Hammerbacher wrote:



Switch to Confluence for the wiki, please.

On Thu, Apr 22, 2010 at 10:19 AM, Doug Cutting wrote:



The board yesterday passed a resolution making Avro a TLP.

I filed an infrastructure issue to start the move:

https://issues.apache.org/jira/browse/INFRA-2640

The primary disruption to developers will be when the subversion repository
is renamed.  We'll send out a note before we do this, then developers can
use 'svn switch' to update their repos.

One other issue is the wiki.  I don't think it's easy to rename a subtree
from a MoinMoin wiki to a new wiki.  Fortunately we don't have many wiki
pages and could cut and paste them manually.  Alternatively, we could switch
to using Confluence for our wiki.  Thoughts?

Doug



Re: TLP move

2010-04-22 Thread Scott Carey
+1

On Apr 22, 2010, at 10:56 AM, Jeff Hammerbacher wrote:

> Switch to Confluence for the wiki, please.
> 
> On Thu, Apr 22, 2010 at 10:19 AM, Doug Cutting  wrote:
> 
>> The board yesterday passed a resolution making Avro a TLP.
>> 
>> I filed an infrastructure issue to start the move:
>> 
>> https://issues.apache.org/jira/browse/INFRA-2640
>> 
>> The primary disruption to developers will be when the subversion repository
>> is renamed.  We'll send out a note before we do this, then developers can
>> use 'svn switch' to update their repos.
>> 
>> One other issue is the wiki.  I don't think it's easy to rename a subtree
>> from a MoinMoin wiki to a new wiki.  Fortunately we don't have many wiki
>> pages and could cut and paste them manually.  Alternatively, we could switch
>> to using Confluence for our wiki.  Thoughts?
>> 
>> Doug
>> 



[jira] Created: (AVRO-523) records with the same name as a member generate bad c++ code

2010-04-22 Thread John Plevyak (JIRA)
records with the same name as a member generate bad c++ code


Key: AVRO-523
URL: https://issues.apache.org/jira/browse/AVRO-523
Project: Avro
Issue Type: Bug
Components: c++
Reporter: John Plevyak


Records with the same name as a member generate bad C++ code:

{
"type" : "array",
"name" : "optionals",
"items" : [
   { "name" : "l", "type" : "record", "fields" : [ { "name" : "l", "type": "long"} ] },
   { "name" : "r", "type" : "record", "fields" : [ { "name" : "r", "type": "long"} ] }
]
}

produces C++ code that fails to compile with:

union2.h:42: error: field 'int64_t avrouser::l::l' with same name as class





Re: TLP move

2010-04-22 Thread Jeff Hammerbacher
Switch to Confluence for the wiki, please.

On Thu, Apr 22, 2010 at 10:19 AM, Doug Cutting  wrote:

> The board yesterday passed a resolution making Avro a TLP.
>
> I filed an infrastructure issue to start the move:
>
>  https://issues.apache.org/jira/browse/INFRA-2640
>
> The primary disruption to developers will be when the subversion repository
> is renamed.  We'll send out a note before we do this, then developers can
> use 'svn switch' to update their repos.
>
> One other issue is the wiki.  I don't think it's easy to rename a subtree
> from a MoinMoin wiki to a new wiki.  Fortunately we don't have many wiki
> pages and could cut and paste them manually.  Alternatively, we could switch
> to using Confluence for our wiki.  Thoughts?
>
> Doug
>


Re: [Fwd: New benchmarking page.]

2010-04-22 Thread Scott Carey
The create time can be improved.  I think the issue is that it forces Avro to 
create a lot more objects.

Others can take a string and encode it directly to the resulting byte output.  
We have to take a string, encode it into a byte[] (in a Utf8), copy that to 
the output, and then throw away the Utf8.  We could recycle the byte[] buffers 
from Utf8 objects (say, with a thread-local byte[] buffer cache like the one 
Jackson uses), or allow Strings to be written and read directly, alongside 
Utf8 objects.  Our challenge will be that we must write the length of the 
string before its bytes, and that length is not available until the string 
has been converted to UTF-8.
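
A minimal Python sketch of why the length is the sticking point, assuming the Avro binary encoding in which a string is written as a zigzag variable-length long holding the byte count, followed by the UTF-8 bytes. The function names are illustrative and are not Avro's API.

```python
def encode_long(n):
    """Encode a signed 64-bit value as a zigzag, variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small codes
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Length prefix (in bytes, not characters) followed by the UTF-8 data."""
    data = s.encode("utf-8")  # the byte length is only known after this step
    return encode_long(len(data)) + data


ascii_bytes = encode_string("foo")
accented_bytes = encode_string("\u00e9")  # one character, two UTF-8 bytes
```

Because the prefix counts bytes rather than characters, a writer that streams a String straight to the output either has to compute the UTF-8 length in a separate pass first or buffer the encoded bytes, which is the cost described above.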

Because of the way the test is partitioned, some of our serialize time ended up 
in the create time: others do the UTF-16 to UTF-8 conversion while serializing, 
whereas we do it in the 'create' phase.

Furthermore, on the Java side I think there is a lot of room for further 
improvement in raw serialization and deserialization, but not much of it is 
easy, and most of it has to do with more complicated schemas. 

The benchmark setup is suspect: last I checked, it used an inappropriate heap 
size, and the code comments around its 'warmup' process were misguided.

-Scott

On Apr 22, 2010, at 8:51 AM, Doug Cutting wrote:

> Avro seems to be sliding a bit in this benchmark.  The poor "create" 
> time has always been a problem for Avro, although I'm not sure why. 
> This isn't a great benchmark, but lots of folks look at it, so it'd be 
> nice if we did well there.
> 
> Doug
> 
> -------- Original Message --------
> Subject: New benchmarking page.
> Date: Thu, 22 Apr 2010 04:34:04 -0700
> From: Kannan Goundan 
> Reply-To: java-serialization-benchmark...@googlegroups.com
> To: java-serialization-benchmark...@googlegroups.com
> 
> I've created a "version 2" of the Benchmarking page.
> 
>    http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2
> 
> These measurements were generated using the new code I've been adding
> over the past month or so.  One advantage of the new code is that I've
> actually tried to make the various serializers do the same amount of
> work (previously, many serializers were specialized to the exact data
> value being tested).



TLP move

2010-04-22 Thread Doug Cutting

The board yesterday passed a resolution making Avro a TLP.

I filed an infrastructure issue to start the move:

  https://issues.apache.org/jira/browse/INFRA-2640

The primary disruption to developers will be when the subversion 
repository is renamed.  We'll send out a note before we do this, then 
developers can use 'svn switch' to update their repos.


One other issue is the wiki.  I don't think it's easy to rename a 
subtree from a MoinMoin wiki to a new wiki.  Fortunately we don't have 
many wiki pages and could cut and paste them manually.  Alternatively, we 
could switch to using Confluence for our wiki.  Thoughts?


Doug


[jira] Commented: (AVRO-519) Efficient sparse optional fields support

2010-04-22 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859897#action_12859897
 ] 

Doug Cutting commented on AVRO-519:
---

> What is the rationale for not permitting a name to be associated with other 
> types in a union?

This is discussed in AVRO-248.  One rationale is simply that it would be an 
incompatible change.  Existing implementations should ignore the name, but they 
should also generate an error if a union has two "bytes" branches.

A dynamic language needs a way at runtime to distinguish whether "a" or "b" is 
used, so one would need to wrap the bytes in something that indicates this, 
such as a record.  Records can add a name to any type, with no serialized 
overhead.

> Efficient sparse optional fields support
> 
>
> Key: AVRO-519
> URL: https://issues.apache.org/jira/browse/AVRO-519
> Project: Avro
> Issue Type: New Feature
> Components: spec
> Reporter: John Plevyak
>
> One of the nice features of protobuf is efficient support for very sparse
> optional fields, for example a large number of tags potentially associated
> with a document, the vast majority of which are empty.
> Avro does support optional fields as part of differing specifications, but
> not on a per-record level after a protocol has been agreed upon.  Avro does
> have support for arrays and maps; however, both of these require homogeneous
> types.
> I would suggest adding an additional field attribute:
>    * "optional" - with values "true"/"false" (where "false" is assumed)
> For the encoding I would suggest that any record which includes optional
> fields be prefixed by a presence map, which would be a sequence of int8
> values x, where:
>   x > 0 : the lower 7 bits are presence bits for the next 7 optional fields
>   (low bit first)
>   -128 < x < 0 : the next present field is position x + 135 (as x runs from 0
>   to -127 and the first 7 must be empty, otherwise we would use the x > 0
>   encoding)
>   x == -128 : no optional fields present in the next 134 optional fields
>   x == 0 : end of sequence
>   Further, if the map has covered all the options, the end-of-sequence marker
>   can be elided.  For example, a type with 3 optional fields would require
>   only a single byte.
> This will permit encoding at 8/7 of a bit per present entry (worst case) and
> at a cost of 8/134 (0.06) bits/entry per all but last not-present (7.5 bytes
> / 1000 optional fields).
> This encoding is backward compatible, as schemas which do not contain
> optional elements do not have the presence map and the encoding is therefore
> identical.  Backward compatibility can be maintained by simply using the
> default value for not-present fields.
> Language APIs:
> Efficient support could include either an explicit presence test or a
> function which returns the value or default value (if the field is not
> present).
>  
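
The cost figures in the quoted proposal can be checked with a little arithmetic. This is a sketch under one reading of the proposal, namely that a bitmap byte covers 7 optional fields and a skip byte covers 134 absent ones; the variable names are illustrative.

```python
# One reading of the quoted proposal (an interpretation, not a spec):
#  - a byte with x > 0 carries presence bits for 7 optional fields
#  - a byte with x == -128 skips 134 absent optional fields
BITS_PER_BYTE = 8
FIELDS_PER_BITMAP_BYTE = 7
FIELDS_PER_SKIP_BYTE = 134

# Densely populated case: one byte is spent per 7 present fields.
bits_per_present = BITS_PER_BYTE / FIELDS_PER_BITMAP_BYTE

# Long empty run: one byte is spent per 134 absent fields.
bits_per_absent = BITS_PER_BYTE / FIELDS_PER_SKIP_BYTE

# Bytes needed to skip over 1000 absent optional fields.
bytes_per_1000_absent = 1000 / FIELDS_PER_SKIP_BYTE
```

Rounded, these come to about 1.14 bits per present field, 0.06 bits per absent field and 7.5 bytes per 1000 absent fields, which matches the numbers quoted.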




[jira] Created: (AVRO-522) the rule on unions containing identical types are not enforced

2010-04-22 Thread John Plevyak (JIRA)
the rule on unions containing identical types are not enforced
--

Key: AVRO-522
URL: https://issues.apache.org/jira/browse/AVRO-522
Project: Avro
Issue Type: Bug
Components: c++
Reporter: John Plevyak


{
"type" : "array",
"name" : "optionals",
"items" : [ "long", "long" ]
}

is accepted by the combination of precompile and gen-cppcode.py, despite 
being illegal.
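
A minimal sketch of the missing check, in Python rather than in the C++ tools named above. It assumes the rule from the Avro specification that a union may not contain more than one schema of the same type, except for the named types record, fixed and enum, and that a union may not directly contain another union. check_union and branch_key are hypothetical names, not part of any Avro implementation.

```python
NAMED_TYPES = {"record", "fixed", "enum"}


def branch_key(schema):
    """Return the key under which a union branch must be unique."""
    if isinstance(schema, str):
        return schema  # primitive type name, or a reference to a named type
    if isinstance(schema, list):
        raise ValueError("a union may not immediately contain another union")
    kind = schema["type"]
    if kind in NAMED_TYPES:
        return (kind, schema["name"])  # named types are distinguished by name
    return kind  # e.g. "array", "map", or a primitive written in object form


def check_union(branches):
    """Raise ValueError if two branches of the union share a key."""
    seen = set()
    for branch in branches:
        key = branch_key(branch)
        if key in seen:
            raise ValueError("duplicate union branch: %r" % (key,))
        seen.add(key)


check_union(["null", "long"])  # a legal union passes silently
```

With this check, the schema above is rejected because both branches reduce to the key "long". Two records with different names pass, while two "bytes" branches fail even if each carries a "name" attribute.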




[jira] Commented: (AVRO-519) Efficient sparse optional fields support

2010-04-22 Thread John Plevyak (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859878#action_12859878
 ] 

John Plevyak commented on AVRO-519:
---

Doug, your proposed solution is made somewhat more complex by the fact that it 
is not possible to associate a name with types other than record, fixed and 
enum within a union.  One might want to do:

{
"type" : "array",
"name" : "optionals",
"items" : [
   { "name" : "a", "type" : "bytes" },
   { "name" : "b", "type" : "bytes" }
]
}

which the C++ translator accepts but for which it nevertheless generates 
incorrect code (I will file a bug).

As it stands, one would have to do:

{
"type" : "array",
"name" : "optionals",
"items" : [
   { "name" : "l", "type" : "record", "fields" : [ { "name" : "l", "type": "long"} ] },
   { "name" : "r", "type" : "record", "fields" : [ { "name" : "r", "type": "long"} ] }
]
}

which is workable, albeit more complicated than one might want.  What is the 
rationale for not permitting a name to be associated with other types in a 
union?

> Efficient sparse optional fields support
> 
>
> Key: AVRO-519
> URL: https://issues.apache.org/jira/browse/AVRO-519
> Project: Avro
> Issue Type: New Feature
> Components: spec
> Reporter: John Plevyak
>
> One of the nice features of protobuf is efficient support for very sparse
> optional fields, for example a large number of tags potentially associated
> with a document, the vast majority of which are empty.
> Avro does support optional fields as part of differing specifications, but
> not on a per-record level after a protocol has been agreed upon.  Avro does
> have support for arrays and maps; however, both of these require homogeneous
> types.
> I would suggest adding an additional field attribute:
>    * "optional" - with values "true"/"false" (where "false" is assumed)
> For the encoding I would suggest that any record which includes optional
> fields be prefixed by a presence map, which would be a sequence of int8
> values x, where:
>   x > 0 : the lower 7 bits are presence bits for the next 7 optional fields
>   (low bit first)
>   -128 < x < 0 : the next present field is position x + 135 (as x runs from 0
>   to -127 and the first 7 must be empty, otherwise we would use the x > 0
>   encoding)
>   x == -128 : no optional fields present in the next 134 optional fields
>   x == 0 : end of sequence
>   Further, if the map has covered all the options, the end-of-sequence marker
>   can be elided.  For example, a type with 3 optional fields would require
>   only a single byte.
> This will permit encoding at 8/7 of a bit per present entry (worst case) and
> at a cost of 8/134 (0.06) bits/entry per all but last not-present (7.5 bytes
> / 1000 optional fields).
> This encoding is backward compatible, as schemas which do not contain
> optional elements do not have the presence map and the encoding is therefore
> identical.  Backward compatibility can be maintained by simply using the
> default value for not-present fields.
> Language APIs:
> Efficient support could include either an explicit presence test or a
> function which returns the value or default value (if the field is not
> present).
>  




[Fwd: New benchmarking page.]

2010-04-22 Thread Doug Cutting
Avro seems to be sliding a bit in this benchmark.  The poor "create" 
time has always been a problem for Avro, although I'm not sure why. 
This isn't a great benchmark, but lots of folks look at it, so it'd be 
nice if we did well there.


Doug

-------- Original Message --------
Subject: New benchmarking page.
Date: Thu, 22 Apr 2010 04:34:04 -0700
From: Kannan Goundan 
Reply-To: java-serialization-benchmark...@googlegroups.com
To: java-serialization-benchmark...@googlegroups.com

I've created a "version 2" of the Benchmarking page.

   http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2

These measurements were generated using the new code I've been adding
over the past month or so.  One advantage of the new code is that I've
actually tried to make the various serializers do the same amount of
work (previously, many serializers were specialized to the exact data
value being tested).

A couple notes:

1. The timing measurements aren't very precise.  I tried taking the
best of 100 trials (instead of the default best of 20) and the numbers
still won't stabilize.  I sometimes get a 20% difference between runs.
A side effect of this is that we sometimes end up with weird results
like the "deserialize" time being greater than the "deserialize and
access fields" time.

2. The Scala object creation times are higher than before.  I think
this may be because I rewrote the Scala code and used "Option[T]" in a
couple places that were previously just "T".  Most of the Java-based
tools just use "null" for optional values, which is more efficient.

3. The "java (externalizable)" test didn't look like it was using the
Externalizable feature of Java at all.  I renamed it "java-manual"
since all it does is manually serialize using a DataInputStream and
DataOutputStream (which is basically what Kryo does with runtime code
generation).

4. Do you think it's useful to have the XML/JSON tests that use
abbreviated field names?  You'd think someone would use XML/JSON over
a binary format for readability; using abbreviated field names negates
that advantage (partially).

5. I use a slightly different data value (intended to be a bit more
realistic).  See the wiki page: DataStructuresV2.
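
For reference, the "best of N trials" measurement mentioned in note 1 can be sketched as follows. This is a Python illustration, not the benchmark's own Java code; best_of and its timer parameter are hypothetical names.

```python
import time


def best_of(fn, trials=20, timer=time.perf_counter):
    """Run fn `trials` times and return the smallest elapsed time in seconds."""
    best = float("inf")
    for _ in range(trials):
        start = timer()
        fn()
        elapsed = timer() - start
        best = min(best, elapsed)
    return best


# Example: time a small piece of work, keeping the fastest of 20 runs.
fastest = best_of(lambda: sum(range(1000)))
```

Taking the minimum discards runs slowed by garbage collection or scheduling, but as the note says it does not guarantee stable numbers from one invocation to the next.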

-- Kannan
