Thanks.

So AvroIO.parseAllGenericRecords() uses the writer schema to read the Avro
file and creates a GenericRecord using this same writer schema. That record
is then passed to the parseFn SerializableFunction, which converts it into
some other object; no other schema is supplied.

What I am suggesting is an AvroIO.readAllGenericRecords() that similarly
uses the writer schema to create a GenericRecord, but emits it as its output
so the pipeline can process it further.
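
To make the point concrete, here is a toy sketch of why a container file
can be read without being handed a schema. It is not Avro's real encoding
or API: the ToyContainer class, its header layout and the two type names
("int", "string") are all made up for illustration. The writer stores its
schema once in the header, the record bodies carry no names or type tags,
and the reader decodes them using only what it finds in the header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a self-describing container file; NOT Avro's real format. */
final class ToyContainer {

    /** Writes a header holding the "writer schema", then untagged record bodies. */
    static byte[] write(String[] fieldNames, String[] fieldTypes, Object[][] records) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(fieldNames.length);
            for (int i = 0; i < fieldNames.length; i++) {
                out.writeUTF(fieldNames[i]);
                out.writeUTF(fieldTypes[i]); // hypothetical type names: "int" or "string"
            }
            out.writeInt(records.length);
            for (Object[] record : records) {
                for (int i = 0; i < fieldTypes.length; i++) {
                    // Record bodies contain values only, in schema order.
                    if (fieldTypes[i].equals("int")) {
                        out.writeInt((Integer) record[i]);
                    } else {
                        out.writeUTF((String) record[i]);
                    }
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes all records using only the schema found in the header. */
    static List<Map<String, Object>> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int fieldCount = in.readInt();
            String[] names = new String[fieldCount];
            String[] types = new String[fieldCount];
            for (int i = 0; i < fieldCount; i++) {
                names[i] = in.readUTF();
                types[i] = in.readUTF();
            }
            int recordCount = in.readInt();
            List<Map<String, Object>> records = new ArrayList<>();
            for (int r = 0; r < recordCount; r++) {
                Map<String, Object> record = new LinkedHashMap<>();
                for (int i = 0; i < fieldCount; i++) {
                    if (types[i].equals("int")) {
                        record.put(names[i], in.readInt());
                    } else {
                        record.put(names[i], in.readUTF());
                    }
                }
                records.add(record);
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Two such files with different headers can go through the same read() call,
which is the behaviour I would like from a schema-less reader for files of
varying schemas.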



On Sun, 13 Jan 2019, 04:58 Gijun Lee, <[email protected]> wrote:

> The schema attached to an Avro file is the "writer schema", which is the
> schema required to encode and decode its records. It is critical: without
> it, the records in the file cannot be converted to GenericRecord objects or
> even to specific records. The schema passed as an argument to the API is
> not the writer schema but the "reader schema". Most of the time it is the
> same as the writer schema, but it can also be an "evolved" version of it.
> If you look at the Avro APIs, they use both schemas internally.
>
> Hope this helps.
>
>
>
>
> On Sat, Jan 12, 2019 at 3:09 PM Niel Markwick <[email protected]> wrote:
>
>> Considering the transform is reading Avro container files, which by
>> definition
>> <https://avro.apache.org/docs/1.8.1/spec.html#Object+Container+Files>
>> contain a schema, it should be possible for the reader to infer the schema
>> from the file...
>>
>> parseAllGenericRecords()
>> <https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/AvroIO.html#parseAllGenericRecords-org.apache.beam.sdk.transforms.SerializableFunction->
>> seems to be able to do this: it decodes each record and passes a
>> GenericRecord to the given parseFn without needing to know the schema in
>> advance...
>>
>> In fact parseAllGenericRecords() would be perfect for my requirements if
>> I could use a Contextful.Fn as a parseFn that accepted side inputs :/
>>
>>
>>
>> <https://cloud.google.com>
>> * •  **Niel Markwick*
>> * •  *Cloud Solutions Architect
>> * •  *Google Belgium
>> * •  *[email protected]
>> * •  *+32 2 894 6771
>>
>> Google Belgium NV/SA, Steenweg op Etterbeek 180
>> <https://maps.google.com/?q=Steenweg%C2%A0op%C2%A0Etterbeek%C2%A0180&entry=gmail&source=g>
>> , 1040 Brussel, Belgie. RPR: 0878.065.378
>>
>> If you received this communication by mistake, please don't forward it to
>> anyone else (it may contain confidential or privileged information), please
>> erase all copies of it, including all attachments, and please let the
>> sender know it went to the wrong person. Thanks
>>
>>
>> On Sat, 12 Jan 2019 at 20:08, Alex Van Boxel <[email protected]> wrote:
>>
>>> Hey Niel,
>>>
>>> The reason you need to specify the schema to GenericRecord is that
>>> without it, it's *impossible* for GenericRecord to make any sense of the
>>> binary data. Unlike protobuf, Avro doesn't have any kind of information in
>>> the message about the structure. This makes it smaller, but impossible to
>>> decode without the schema.
>>>
>>> So if you really want to do flexible messages, I would read it as binary,
>>> message by message, and handle your schema switching in a DoFn.
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Sat, Jan 12, 2019 at 7:44 PM Niel Markwick <[email protected]> wrote:
>>>
>>>> Is there a reason why we don't have an AvroIO reader that reads and
>>>> outputs a GenericRecord without requiring any schema to be given?
>>>>
>>>> Does passing the schema into readGenericRecord() have any benefits
>>>> other than verifying that the Avro file has records of the same schema?
>>>>
>>>> This could be useful for parsing a collection of Avro files of varying
>>>> schemas, then post-processing the GenericRecords in a further transform
>>>> with side inputs.
>>>>

