Re: Improvements in JavaCSVTableSource

Kaustubh Beedkar Fri, 13 Oct 2023 12:38:38 -0700

Hi Mirko,

That sounds like a great work package! The JavaCSVTableSource could go into
wayang-platforms/wayang-java/operators.


Just to give you a bit of background on this specific source:
The CSV files that this source operator expects must have a specific
format, i.e., the first row should contain the column headers and types.
See wayang-api/wayang-api-sql/src/test/resources/csv/data.csv for an
example. This is needed for supporting SQL queries over CSV files. In a way,
this is not really a generic CSV file source operator, and hence it was
initially put inside the wayang-api/wayang-api-sql module.
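
To illustrate the header convention, here is a minimal, self-contained sketch of how a typed header row could be parsed. It is not the actual JavaCSVTableSource code: the class CsvHeaderSketch, the "name:type" cell layout, the fallback to "string", and the sample header are all assumptions made for this example, so please check data.csv for the real format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Wayang: parses a typed CSV header row.
public class CsvHeaderSketch {

    /** One column parsed from a header cell such as "id:int". */
    public static final class Column {
        public final String name;
        public final String type;

        Column(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /**
     * Splits a header row into (name, type) pairs.
     * Assumption: each cell looks like "name:type"; a cell without a
     * type suffix is treated as "string".
     */
    public static List<Column> parseHeader(String headerLine, char separator) {
        List<Column> columns = new ArrayList<>();
        // Quote the separator so characters like '|' are matched literally.
        String regex = Pattern.quote(String.valueOf(separator));
        for (String cell : headerLine.split(regex, -1)) {
            int colon = cell.lastIndexOf(':');
            if (colon < 0) {
                columns.add(new Column(cell.trim(), "string"));
            } else {
                columns.add(new Column(cell.substring(0, colon).trim(),
                                       cell.substring(colon + 1).trim()));
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // Illustrative header only; not copied from data.csv.
        for (Column c : parseHeader("id:int,name:string,price:double", ',')) {
            System.out.println(c.name + " -> " + c.type);
        }
    }
}
```

Passing the separator as a parameter keeps the same parsing logic usable for TSV or other delimiters.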



Best,
Kaustubh


On Sat, Oct 14, 2023 at 12:51 AM Mirko Kämpf <[email protected]> wrote:

> Hey Wayang team,
>
> after my warm up exercise using the CsvRowConverter I want to start
> planning a work package which gives the project some new functionality and
> me a deeper understanding of the codebase.
>
> Just to wrap up the idea: I started with working on the CsvRowConverter so
> that it can handle multiple separators, depending on the data, or based on
> a decision of the developer.
>
> I arrived at the JavaCSVTableSource and due to the lack of architectural
> knowledge pressed my STOP button.
>
> So far I understand that the following could be a goal for implementation:
>
> (1) Migrate the JavaCSVTableSource to a place where it has a better home.
> (2) Configure the default separator of the JavaCSVTableSource via
> config-file (I have to learn how the config is handled during the life
> cycle of a job).
> (3) Create a JavaCSVTableSource with a well-known separator
> programmatically.
> (4) Create a JavaCSVTableSource and allow autodetection of the separator.
>
> Question: Is the JavaCSVTableSource the right class to start with, or is the
> functionality I refer to defined on a higher level in the framework?
>
> Currently I ask for hints only, so that I can go for a solution while I
> learn to navigate the code base.
>
> I envision a JavaCSVTableSource component with a set of tests which shows
> that CSV / TSV files can be loaded from local files or even using the
> "remote file source" where data can be loaded from an HTTP server.
>
> *The use case I have in mind is this:*
> We have a data asset in our local processing context which should be joined
> with a dataset which is provided in a remote data pod. I can read the small
> "lookup table" into memory and handle the larger local data set using the
> scalable platform. I can avoid all data management steps for the "small
> lookup" table which is hosted outside my environment.
>
> Would that make sense? I think if we have a clear target scenario and if
> that is aligned with existing ideas, it could become a great learning path
> for a newcomer. And if the use case I have in mind is totally against the
> core idea, no problem, there is a lot to learn from.
>
> Cheers,
> Mirko
>
>
>
> Dr. rer. nat. Mirko Kämpf
> Müchelner Str. 23
> 06259 Frankleben
>
