Hi Mirko,

That sounds like a great work package! The JavaCSVTableSource could go into wayang-platforms/wayang-java/operators.

Just to give you a bit of background on this specific source: the CSV files that this source operator expects must have a specific format, i.e., the first row should contain the column headers and types. See wayang-api/wayang-api-sql/src/test/resources/csv/data.csv for an example. This is needed to support SQL queries over CSV files. In a way, this is not really a generic CSV file source operator, and hence it was initially put inside the wayang-api/wayang-api-sql module.

Best,
Kaustubh

On Sat, Oct 14, 2023 at 12:51 AM Mirko Kämpf <[email protected]> wrote:

> Hey Wayang team,
>
> after my warm-up exercise using the CsvRowConverter I want to start
> planning a work package which gives the project some new functionality and
> me a deeper understanding of the codebase.
>
> Just to wrap up the idea: I started with working on the CsvRowConverter so
> that it can handle multiple separators, depending on the data, or based on
> a decision of the developer.
>
> I arrived at the JavaCSVTableSource and, due to the lack of architectural
> knowledge, pressed my STOP button.
>
> So far I understand that the following could be a goal for implementation:
>
> (1) Migrate the JavaCSVTableSource to a place where it has a better home.
> (2) Configure the default separator of the JavaCSVTableSource via
> config-file (I have to learn how the config is handled during the life
> cycle of a job).
> (3) Create a JavaCSVTableSource with a well-known separator
> programmatically.
> (4) Create a JavaCSVTableSource and allow autodetection of the separator.
>
> Question: Is the JavaCSVTableSource the right class to start, or is the
> functionality I refer to defined on a higher level in the framework?
>
> Currently I ask for hints only, so that I can go for a solution while I
> learn to navigate the code base.
>
> I envision a JavaCSVTableSource component with a set of tests which show
> that CSV / TSV files can be loaded from local files or even using the
> "remote file source" where data can be loaded from an HTTP server.
>
> *The use case I have in mind is this:*
> We have a data asset in our local processing context which should be joined
> with a dataset which is provided in a remote data pod. I can read the small
> "lookup table" into memory and handle the larger local data set using the
> scalable platform. I can avoid all data management steps for the "small
> lookup" table which is hosted outside my environment.
>
> Would that make sense? I think if we have a clear target scenario and if
> that is aligned with existing ideas, it could become a great learning path
> for a newcomer. And if the use case I have in mind is totally against the
> core idea, no problem, there is a lot to learn from.
>
> Cheers,
> Mirko
>
> Dr. rer. nat. Mirko Kämpf
> Müchelner Str. 23
> 06259 Frankleben
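
On points (3) and (4) of the quoted message, here is a rough, standalone sketch of how separator autodetection might look, independent of Wayang. Everything in it is hypothetical: the class name SeparatorSniffer, the candidate list, and the heuristic (pick the candidate that occurs the same non-zero number of times on every sampled line, ignoring text inside double quotes) are assumptions for illustration only, not existing Wayang APIs and not how JavaCSVTableSource currently works.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Wayang codebase.
public class SeparatorSniffer {

    // Assumed candidate set; extend as needed.
    private static final char[] CANDIDATES = {',', ';', '\t', '|'};

    /** Counts occurrences of sep in line, skipping text inside double quotes. */
    public static int countOutsideQuotes(String line, char sep) {
        int count = 0;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == sep && !inQuotes) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns the candidate that appears equally often (and at least once)
     * on every non-empty sample line; ties go to the higher count.
     * Falls back to the given default when no candidate qualifies.
     */
    public static char detect(List<String> sampleLines, char fallback) {
        char best = fallback;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = -1;
            boolean consistent = true;
            for (String line : sampleLines) {
                if (line.isEmpty()) {
                    continue;
                }
                int n = countOutsideQuotes(line, candidate);
                if (expected < 0) {
                    expected = n;
                } else if (n != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("id;name;city", "1;\"Doe, Jane\";Berlin");
        System.out.println("detected: '" + detect(sample, ',') + "'");
    }
}
```

A source operator could call something like detect(...) on the first few lines of a file when no separator is configured, and use the configured or programmatic value otherwise. The heuristic is deliberately simple and can be fooled by free-text columns, so an explicit separator should probably always take precedence.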
