Typically this is done by reading the entire contents of the file into a map
side input and then consuming that side input within a DoFn.

Unfortunately, only Dataflow supports very large side inputs with an
efficient access pattern, and only when using Beam Java for bounded
pipelines. Support for very large side inputs in Beam Python bounded
pipelines on Dataflow is coming but not yet available.

Otherwise, you could still read the Avro files, build a map from them, and
store that index as a side input; as long as the index fits in memory, this
works well across all runners.

The programming guide[1] has a basic example of how to get started with
side inputs.

1: https://beam.apache.org/documentation/programming-guide/#side-inputs


On Tue, Jul 9, 2019 at 2:21 PM Shannon Duncan <joseph.dun...@liveramp.com>
wrote:

> So, being pretty new to Beam and big data, I have been working on
> standardizing some input/output items for different
> Hadoop/Beam/Spark/BigQuery jobs and processes.
>
> So what I'm working on is having them all read/write Avro files, which is
> actually pretty straightforward. So basic read/write I have down.
>
> What I'm looking for, and hoping someone on this list knows, is how to
> index an Avro file and be able to search quickly through that index to only
> open part of an Avro file in Beam.
>
> For example, currently our pipeline is able to do this with Hadoop and
> Sequence Files since they store <K,V> with byte offsets.
>
> So given a key I'd like to only pull that key from the Avro file reducing
> IO / Network costs.
>
> Any ideas, thoughts, suggestions?
>
> Thanks!
> Shannon
>