I've come across file formats like this in the past, of course. The original form of this is probably found in COBOL copybooks, where REDEFINES clauses allow different record layouts depending on some key field. So far I haven't found a common pattern in the custom files that companies invariably built over the years. XML and later formats like JSON and YAML have solved most of this challenge these days, but for these older file formats the challenge remains.
For standard file formats like HL7 I opted in the past to emit key/value rows, with a path built up for each value as the file is read. You could do the same for your file format with a bit of JavaScript. Have a look at the attached pipeline; a rough standalone sketch of the same idea is included at the end of this message.

HTH,
Matt

On Tue, 28 Mar 2023 at 19:02, Michele Mor <[email protected]> wrote:

> Hi Justin,
> I think that you could try looking at the documentation for metadata
> injection (MDI):
>
> https://hop.apache.org/manual/latest/pipeline/metadata-injection.html
>
> That could be a start, but I'm sure that more expert Hoppers have better
> suggestions.
>
> Best regards,
> Michele
>
>
> On Tue, 28 Mar 2023, 15:41 Austin, Justin via users, <[email protected]> wrote:
>
>> Thank you for the advice, Diego!
>>
>> We had come across this type of multi-schema text input/output capability
>> in Talend, and I was hoping we could create our own plugins to accomplish
>> something similar here.
>>
>> *From:* Diego Mainou <[email protected]>
>> *Sent:* Monday, March 27, 2023 3:44 PM
>> *To:* users <[email protected]>; Austin, Justin <[email protected]>
>> *Subject:* Re: Custom plugin - multi-schema text input
>>
>> Hi Justin,
>>
>> It seems to me that you are trying to do too many things with one step,
>> and that you will struggle to find a piece of software, cheap or
>> expensive, that does what you are describing in one step.
>>
>> ETL tools are good, but they are not magical; even AI needs to be trained.
>>
>> Best practice is to separate acquisition from business logic, so my
>> recommendation would be to grab those files and acquire them in their
>> native state, plus governance (e.g. a load id), before you do anything to
>> them.
>>
>> Further, because you are dealing with many files of a distinct nature,
>> you may wish to separate the "acquisition" from the loading, e.g. by
>> creating:
>>
>> - A generic and reusable component that copies/moves the files from
>> wherever they are located into your landing zone.
>> - A bespoke component that acquires either a specific file or a specific
>> file type (e.g. JSON) and outputs it in a generic format, e.g. a
>> serialised file.
>> - A generic and reusable component that grabs files in the generic format
>> and loads them into a table containing the raw data plus governance.
>>
>> The above will result in files from all walks of life being loaded into
>> your staging database in their raw state. This is very important for
>> governance purposes.
>>
>> Potentially your next step is to create a generic and reusable component
>> that utilises metadata injection to parse JSON into columns, plus
>> governance. Rinse and repeat for XML, CSV, etc.
>>
>> The step after that is the mapping of your data and your dimensions. Once
>> you have your surrogate keys, you can then drop the values that were used
>> to map them, and so on.
>>
>> Diego
>>
>> Diego Mainou
>> Product Manager
>> M. +61 415 152 091
>> E. [email protected]
>> www.bizcubed.com.au
>>
>> ------------------------------
>>
>> *From:* "Austin, Justin via users" <[email protected]>
>> *To:* "users" <[email protected]>
>> *Sent:* Tuesday, 28 March, 2023 1:41:06 AM
>> *Subject:* Custom plugin - multi-schema text input
>>
>> Hi Hop users,
>>
>> We're evaluating whether Hop is the right tool to solve a common problem
>> for our business.
>>
>> We encounter hundreds of different file formats containing similar layers
>> of one-to-many hierarchy (simplified example below). Getting this to work
>> with the out-of-the-box inputs/outputs and transform components results in
>> a complex, convoluted set of workflows and pipelines. Since we run into
>> this so often, we would like to develop a plugin with a custom "input"
>> component that reads the input file, inserts some ID fields for the
>> relationships, and exposes multiple output rowsets (one per schema/row
>> type) that can be mapped to separate downstream transforms. Eventually
>> we'd like to build a matching custom "output" component that accepts
>> multiple inputs and loads them wherever we need them with the hierarchy
>> preserved (JSON, relational DB, etc.).
>>
>> After reviewing the plugin documentation and samples, I'm still not sure
>> whether this is possible. The relevant plugin base classes seem to assume
>> there will always be a single schema (IRowMeta) and a single rowset shared
>> by all input and output connections/hops. We would need a single transform
>> to expose multiple IRowMeta instances and multiple rowsets, with the
>> ability to select a specific one for any given hop to a downstream
>> transform.
>>
>> Is there a good path to accomplishing this with a Hop plugin? Or perhaps a
>> better approach to the problem with existing Hop features?
>>
>> Thanks!
>>
>> Example file:
>>
>> REC|Jane Smith|03-20-2003
>> ADDR|123 Main Street|Apartment 321|Anytown|US|55555
>> ACT|987654321|$4321.56|02-01-2023|03-02-2023
>> DTL|debit|$23.45|02-05-2023
>> DTL|debit|$143.20|02-13-2023
>> DTL|credit|$652.02|02-14-2023
>> DTL|debit|$8.78|02-28-2023
>> ACT|56789123|$7894.56|02-01-2023|03-02-2023
>> DTL|credit|$0.28|02-14-2023
>> REC|John Jacobs|03-20-2003
>> ADDR|876 Big Avenue||Anywhere|US|55556
>> ACT|5632178|$2256.79|02-01-2023|03-02-2023
>> DTL|credit|$0.02|02-14-2023
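To make the splitting Justin describes concrete, here is a small plain-JavaScript sketch (not Hop plugin code, and not based on any Hop API) of what such a custom input would have to do with the example file above: assign surrogate IDs as it reads and emit one row stream per record type. The record-type names come from the example; the field names, the ID fields and example.txt are invented for illustration.

// Plain Node.js sketch: split the multi-schema file into one row stream per
// record type, injecting surrogate IDs so the REC -> ACT -> DTL hierarchy
// survives the split. All names are hypothetical.
const fs = require('fs');

const lines = fs.readFileSync('example.txt', 'utf8')
  .split(/\r?\n/)
  .filter(l => l.trim() !== '');

const streams = { REC: [], ADDR: [], ACT: [], DTL: [] };
let recId = 0, actId = 0;

for (const line of lines) {
  const [type, ...f] = line.split('|');
  switch (type) {
    case 'REC': // new top-level record: bump its ID
      recId++;
      streams.REC.push({ recId, name: f[0], date: f[1] });
      break;
    case 'ADDR': // child of the current REC
      streams.ADDR.push({ recId, street1: f[0], street2: f[1], city: f[2], country: f[3], zip: f[4] });
      break;
    case 'ACT': // child of the current REC, parent of following DTLs
      actId++;
      streams.ACT.push({ actId, recId, account: f[0], balance: f[1], from: f[2], to: f[3] });
      break;
    case 'DTL': // child of the current ACT
      streams.DTL.push({ actId, type: f[0], amount: f[1], date: f[2] });
      break;
  }
}

// Each entry in `streams` now has a stable schema of its own and carries the
// foreign keys (recId/actId) needed to rebuild the hierarchy downstream.
console.log(JSON.stringify(streams, null, 2));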
multi-schema-text-input.hpl
Description: Binary data
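Since the attached multi-schema-text-input.hpl is not reproduced in the archive, below is a minimal standalone sketch of the key/value path-building idea Matt describes, in plain JavaScript rather than Hop's JavaScript transform. It is an approximation, not the attached pipeline: the field names, the path layout and example.txt are assumptions based on the example file.

// Emit one key/value row per field, keyed by a hierarchical path such as
// REC[1]/ACT[1]/DTL[2], so every record type flattens into a single schema.
const fs = require('fs');

// Field names per record type, guessed from the example file.
const FIELDS = {
  REC:  ['name', 'date'],
  ADDR: ['street1', 'street2', 'city', 'country', 'zip'],
  ACT:  ['account', 'balance', 'from', 'to'],
  DTL:  ['type', 'amount', 'date'],
};

// Counters used to build the path; children reset when a parent starts.
const counters = { REC: 0, ADDR: 0, ACT: 0, DTL: 0 };
const rows = []; // each row: { path, key, value }

for (const line of fs.readFileSync('example.txt', 'utf8').split(/\r?\n/)) {
  if (!line.trim()) continue;
  const [type, ...values] = line.split('|');
  if (!FIELDS[type]) continue;

  if (type === 'REC') { counters.REC++; counters.ADDR = 0; counters.ACT = 0; counters.DTL = 0; }
  if (type === 'ADDR') counters.ADDR++;
  if (type === 'ACT') { counters.ACT++; counters.DTL = 0; }
  if (type === 'DTL') counters.DTL++;

  let path = `REC[${counters.REC}]`;
  if (type === 'ADDR') path += `/ADDR[${counters.ADDR}]`;
  if (type === 'ACT' || type === 'DTL') path += `/ACT[${counters.ACT}]`;
  if (type === 'DTL') path += `/DTL[${counters.DTL}]`;

  FIELDS[type].forEach((key, i) => rows.push({ path, key, value: values[i] ?? '' }));
}

// e.g. { path: 'REC[1]/ACT[1]/DTL[2]', key: 'amount', value: '$143.20' }
console.table(rows);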
