Yea, this is a delicate tradeoff.

 - TextIO is in core so that getting started is easy (also legacy) but you
can't put all text-related things in core
 - Schemas are in core because they really are a new core abstraction and
we do want getting started to be easy for them too

IIRC these two will interact, aka this module is not just about SQL and
cross-language. Am I remembering correctly? In that case, we definitely
want users to do "getting started" stuff with TextIO and schemas. CSV is
far more natural than JSON for this. If I were to choose one dependency to
add to core to make TextIO easy to get started with schemas, it would
probably be commons CSV. It has zero compile dependencies and is very
stable so risk is low AFAIK.

I'm not sure how long it has been since we had a discussion about how to
have a "getting started" level TextIO in the core while separating out more
troublesome deps... we can discuss for a long time what color to paint that
bikeshed...

I am fine with adding a commons CSV dependency to Beam core. I think the
bigger discussion is not necessary based on the points here.

Kenn


On Mon, Jul 20, 2020 at 1:32 PM Brian Hulette <[email protected]> wrote:

> I'm on the fence about this. It might make sense to have a dependency on a
> CSV library in core Beam as it's a natural text interface for Beam Rows.
> There is some precedent since we do already have code for converting
> between JSON and Rows in core Beam [1], but I assume that was able to
> leverage an existing dependency on jackson when it was added.
>
> Instead of adding the dependency in core Beam we could consider adding a
> new module that includes this logic. That would still let us make the SQL
> logic available cross-language, but it wouldn't be as accessible for Beam
> Java users.
>
> Brian
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/RowJson.java
>
> On Mon, Jul 20, 2020 at 12:38 PM Scott Lukas <[email protected]> wrote:
>
>> I'm writing a schema aware IO abstraction in core beam, SchemaIOProvider
>> <https://github.com/apache/beam/blob/9b0941945545e71a949649309e05e405ca73aea2/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java>,
>> and implementing it by shifting the logic of IO table providers and tables
>> from Beam SQL to the location of its IO (ie PubsubSchemaCapableIOProvider
>> <https://github.com/apache/beam/blob/9b0941945545e71a949649309e05e405ca73aea2/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubSchemaCapableIOProvider.java>).
>> This creates one building point for using the logic (design docs
>> <https://docs.google.com/document/d/1ic3P8EVGHIydHQ-VMDKbN9kEdwm7sBXMo80VrhwksvI/edit>
>> ).
>>
>> I'm trying to implement a TextSchemaCapableIOProvider in Beam Core (like
>> PubsubSchemaCapableIOProvider
>> <https://github.com/apache/beam/blob/9b0941945545e71a949649309e05e405ca73aea2/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubSchemaCapableIOProvider.java>),
>> where TextIO is, but the table logic
>> <https://github.com/apache/beam/blob/9b0941945545e71a949649309e05e405ca73aea2/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/text/TextTableProvider.java#L102-L126>
>> relies on org.apache.commons.csv.CSVFormat. What are your thoughts on
>> adding the dependency org.apache.commons.csv.CSVFormat to Core Beam? Or
>> should I try finding a work around?
>>
>

Reply via email to