Yea, this is a delicate tradeoff. - TextIO is in core so that getting started is easy (also legacy) but you can't put all text-related things in core - Schemas are in core because they really are a new core abstraction and we do want getting started to be easy for them too
IIRC these two will interact, aka this module is not just about SQL and cross-language. Am I remembering correctly? In that case, we definitely want users to do "getting started" stuff with TextIO and schemas. CSV is far more natural than JSON for this. If I were to choose one dependency to add to core to make TextIO easy to get started with schemas, it would probably be commons CSV. It has zero compile dependencies and is very stable so risk is low AFAIK. I'm not sure how long it has been since we had a discussion about how to have a "getting started" level TextIO in the core while separating out more troublesome deps... we can discuss for a long time what color to paint that bikeshed... I am fine with adding a commons CSV dependency to Beam core. I think the bigger discussion is not necessary based on the points here. Kenn On Mon, Jul 20, 2020 at 1:32 PM Brian Hulette <[email protected]> wrote: > I'm on the fence about this. It might make sense to have a dependency on a > CSV library in core Beam as it's a natural text interface for Beam Rows. > There is some precedent since we do already have code for converting > between JSON and Rows in core Beam [1], but I assume that was able to > leverage an existing dependency on jackson when it was added. > > Instead of adding the dependency in core Beam we could consider adding a > new module that includes this logic. That would still let us make the SQL > logic available cross-language, but it wouldn't be as accessible for Beam > Java users. > > Brian > > [1] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/RowJson.java > > On Mon, Jul 20, 2020 at 12:38 PM Scott Lukas <[email protected]> wrote: > >> I'm writing a schema aware IO abstraction in core beam, SchemaIOProvider >> <https://github.com/apache/beam/blob/9b0941945545e71a949649309e05e405ca73aea2/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java>, >> and implementing it by shifting the logic of IO table providers and tables >> from Beam SQL to the location of its IO (ie PubsubSchemaCapableIOProvider >> <https://github.com/apache/beam/blob/9b0941945545e71a949649309e05e405ca73aea2/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubSchemaCapableIOProvider.java>). >> This creates one building point for using the logic (design docs >> <https://docs.google.com/document/d/1ic3P8EVGHIydHQ-VMDKbN9kEdwm7sBXMo80VrhwksvI/edit> >> ). >> >> I'm trying to implement a TextSchemaCapableIOProvider in Beam Core (like >> PubsubSchemaCapableIOProvider >> <https://github.com/apache/beam/blob/9b0941945545e71a949649309e05e405ca73aea2/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubSchemaCapableIOProvider.java>), >> where TextIO is, but the table logic >> <https://github.com/apache/beam/blob/9b0941945545e71a949649309e05e405ca73aea2/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/text/TextTableProvider.java#L102-L126> >> relies on org.apache.commons.csv.CSVFormat. What are your thoughts on >> adding the dependency org.apache.commons.csv.CSVFormat to Core Beam? Or >> should I try finding a work around? >> >
