we are working on a very sparse table with say 500 columns where we do
batch uploads that typically only contain a subset of the columns (say
100), and we run multiple map-reduce queries on subsets of the columns
(typically fewer than 50 columns go into a single map-reduce job).

my question is the following: if i use avro, do i ever actually need to
use the full schema of the table?

if i understand avro correctly, then the batch uploads could simply add
avro files with the schema reflective of the columns that are in the file
(as opposed to first inserting many nulls into the data and then saving it
with the full schema).

the queries could also simply query with the schema that is reflective of
the query (as opposed to querying with the full schema with 500 columns and
then picking out the relevant columns).

as long as i provide defaults of null in the query schemas, i think this
would work! correct? is this considered "abuse" of avro's versioning
capabilities?
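
to make the idea concrete, here is a minimal sketch of the resolution rule as
i understand it: reader fields are matched to writer fields by name, a reader
field missing from the file takes its default, and writer fields the reader
does not ask for are dropped. it is plain python that only imitates the rule,
it does not call the avro library, and the column names (id, col_a, col_b,
col_z) and the resolve() helper are hypothetical.

```python
# Hypothetical schemas; the real table would have ~500 columns.
# Schema of one uploaded file: only the columns present in that batch.
WRITER_SCHEMA = {
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "col_a", "type": ["null", "long"], "default": None},
        {"name": "col_b", "type": ["null", "string"], "default": None},
    ],
}

# Schema of one query: only the columns the job needs, each with a null default.
READER_SCHEMA = {
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "col_b", "type": ["null", "string"], "default": None},
        {"name": "col_z", "type": ["null", "double"], "default": None},
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Imitate record resolution: match by field name, fall back to defaults."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field was written to the file: take the stored value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field is absent from the file: use the reader's default.
            resolved[name] = field["default"]
        else:
            # No stored value and no default: resolution fails.
            raise ValueError("no value or default for field %r" % name)
    # Writer fields that the reader did not list (col_a here) are ignored.
    return resolved


row = {"id": "r1", "col_a": 7, "col_b": "x"}
print(resolve(row, WRITER_SCHEMA, READER_SCHEMA))
```

in the json form of the schema the default would be written as null, and as
far as i know "null" has to be the first branch of the union for that default
to be accepted.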
