I've come across file formats like this in the past, of course. The original form of this is probably found in COBOL copybooks, where REDEFINES clauses allow different record layouts depending on some key field. So far I haven't found a common pattern in the custom files that companies invariably built over the years. XML and later formats like JSON and YAML have solved most of this challenge these days, but for these older file formats the challenge remains.
For standard file formats like HL7 I opted in the past to emit key/value rows, with a path built up for each value as the file is read. You could do the same for your file format with a bit of JavaScript. Have a look at the attached pipeline; a rough standalone sketch of the same idea is included at the end of this message.

HTH,
Matt

On Tue, 28 Mar 2023 at 19:02, Michele Mor <[email protected]> wrote:

> Hi Justin,
> I think that you could try looking at the documentation for metadata
> injection (MDI):
>
> https://hop.apache.org/manual/latest/pipeline/metadata-injection.html
>
> That could be a start, but I'm sure that more expert Hoppers have better
> suggestions.
>
> Best regards,
> Michele
>
>
> On Tue, 28 Mar 2023, 15:41 Austin, Justin via users, <[email protected]> wrote:
>
>> Thank you for the advice, Diego!
>>
>> We had come across this type of multi-schema text input/output capability
>> in Talend, and I was hoping we could create our own plugins to accomplish
>> something similar here.
>>
>> *From:* Diego Mainou <[email protected]>
>> *Sent:* Monday, March 27, 2023 3:44 PM
>> *To:* users <[email protected]>; Austin, Justin <[email protected]>
>> *Subject:* Re: Custom plugin - multi-schema text input
>>
>> Hi Justin,
>>
>> It seems to me that you are trying to do too many things with one step,
>> and that you will struggle to find a piece of software, cheap or
>> expensive, that does what you are describing in one step.
>>
>> ETL tools are good, but they are not magical; even AI needs to be trained.
>>
>> Best practice is to separate acquisition from business logic, so my
>> recommendation would be to grab those files and acquire them in their
>> native state, plus governance (e.g. a load id), before you do anything to
>> them.
>>
>> Further, because you are dealing with many files of a distinct nature,
>> you may wish to separate the "acquisition" from the loading, e.g. by
>> creating:
>>
>> - A generic and reusable component that copies/moves the files from
>> wherever they are located into your landing zone.
>> - A bespoke component that acquires either a specific file or a specific
>> file type (e.g. JSON) and outputs it in a generic format, e.g. a
>> serialised file.
>> - A generic and reusable component that grabs files in the generic format
>> and loads them into a table containing the raw data plus governance.
>>
>> The above will result in files from all walks of life being loaded into
>> your staging database in their raw state. This is very important for
>> governance purposes.
>>
>> Potentially your next step is to create a generic and reusable component
>> that utilises metadata injection to parse JSON into columns, plus
>> governance. Rinse and repeat for XML, CSV, etc.
>>
>> The step after that is the mapping of your data and your dimensions. Once
>> you have your surrogate keys, you can then drop the values that were used
>> to map them, and so on.
>>
>> Diego
>>
>> Diego Mainou
>> Product Manager
>> M. +61 415 152 091
>> E. [email protected]
>> www.bizcubed.com.au
>>
>> ------------------------------
>>
>> *From:* "Austin, Justin via users" <[email protected]>
>> *To:* "users" <[email protected]>
>> *Sent:* Tuesday, 28 March, 2023 1:41:06 AM
>> *Subject:* Custom plugin - multi-schema text input
>>
>> Hi Hop users,
>>
>> We're evaluating whether Hop is the right tool to solve a common problem
>> for our business.
>>
>> We encounter hundreds of different file formats containing similar layers
>> of one-to-many hierarchy (simplified example below). Getting this to work
>> with the out-of-the-box inputs/outputs and transform components results in
>> a complex, convoluted set of workflows and pipelines. Since we run into
>> this so often, we would like to develop a plugin with a custom "input"
>> component that reads the input file, inserts some ID fields for the
>> relationships, and exposes multiple output rowsets (one per schema/row
>> type) that can be mapped to separate downstream transforms. Eventually
>> we'd like to build a matching custom "output" component that accepts
>> multiple inputs and loads them wherever we need them with the hierarchy
>> preserved (JSON, relational DB, etc.).
>>
>> After reviewing the plugin documentation and samples, I'm still not sure
>> whether this is possible. The relevant plugin base classes seem to assume
>> there will always be a single schema (IRowMeta) and a single rowset shared
>> by all input and output connections/hops. We would need a single transform
>> to expose multiple IRowMeta instances and multiple rowsets, with the
>> ability to select a specific one for any given hop to a downstream
>> transform.
>>
>> Is there a good path to accomplishing this with a Hop plugin? Or perhaps a
>> better approach to the problem with existing Hop features?
>>
>> Thanks!
>>
>> Example file:
>>
>> REC|Jane Smith|03-20-2003
>> ADDR|123 Main Street|Apartment 321|Anytown|US|55555
>> ACT|987654321|$4321.56|02-01-2023|03-02-2023
>> DTL|debit|$23.45|02-05-2023
>> DTL|debit|$143.20|02-13-2023
>> DTL|credit|$652.02|02-14-2023
>> DTL|debit|$8.78|02-28-2023
>> ACT|56789123|$7894.56|02-01-2023|03-02-2023
>> DTL|credit|$0.28|02-14-2023
>> REC|John Jacobs|03-20-2003
>> ADDR|876 Big Avenue||Anywhere|US|55556
>> ACT|5632178|$2256.79|02-01-2023|03-02-2023
>> DTL|credit|$0.02|02-14-2023
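To make the splitting Justin describes concrete, here is a small plain-JavaScript sketch (not Hop plugin code, and not based on any Hop API) of what such a custom input would have to do with the example file above: assign surrogate IDs as it reads and emit one row stream per record type. The record-type names come from the example; the field names, the ID fields and example.txt are invented for illustration.

// Plain Node.js sketch: split the multi-schema file into one row stream per
// record type, injecting surrogate IDs so the REC -> ACT -> DTL hierarchy
// survives the split. All names are hypothetical.
const fs = require('fs');

const lines = fs.readFileSync('example.txt', 'utf8')
  .split(/\r?\n/)
  .filter(l => l.trim() !== '');

const streams = { REC: [], ADDR: [], ACT: [], DTL: [] };
let recId = 0, actId = 0;

for (const line of lines) {
  const [type, ...f] = line.split('|');
  switch (type) {
    case 'REC': // new top-level record: bump its ID
      recId++;
      streams.REC.push({ recId, name: f[0], date: f[1] });
      break;
    case 'ADDR': // child of the current REC
      streams.ADDR.push({ recId, street1: f[0], street2: f[1], city: f[2], country: f[3], zip: f[4] });
      break;
    case 'ACT': // child of the current REC, parent of following DTLs
      actId++;
      streams.ACT.push({ actId, recId, account: f[0], balance: f[1], from: f[2], to: f[3] });
      break;
    case 'DTL': // child of the current ACT
      streams.DTL.push({ actId, type: f[0], amount: f[1], date: f[2] });
      break;
  }
}

// Each entry in `streams` now has a stable schema of its own and carries the
// foreign keys (recId/actId) needed to rebuild the hierarchy downstream.
console.log(JSON.stringify(streams, null, 2));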
multi-schema-text-input.hpl
Description: Binary data
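Since the attached multi-schema-text-input.hpl is not reproduced in the archive, below is a minimal standalone sketch of the key/value path-building idea Matt describes, in plain JavaScript rather than Hop's JavaScript transform. It is an approximation, not the attached pipeline: the field names, the path layout and example.txt are assumptions based on the example file.

// Emit one key/value row per field, keyed by a hierarchical path such as
// REC[1]/ACT[1]/DTL[2], so every record type flattens into a single schema.
const fs = require('fs');

// Field names per record type, guessed from the example file.
const FIELDS = {
  REC:  ['name', 'date'],
  ADDR: ['street1', 'street2', 'city', 'country', 'zip'],
  ACT:  ['account', 'balance', 'from', 'to'],
  DTL:  ['type', 'amount', 'date'],
};

// Counters used to build the path; children reset when a parent starts.
const counters = { REC: 0, ADDR: 0, ACT: 0, DTL: 0 };
const rows = []; // each row: { path, key, value }

for (const line of fs.readFileSync('example.txt', 'utf8').split(/\r?\n/)) {
  if (!line.trim()) continue;
  const [type, ...values] = line.split('|');
  if (!FIELDS[type]) continue;

  if (type === 'REC') { counters.REC++; counters.ADDR = 0; counters.ACT = 0; counters.DTL = 0; }
  if (type === 'ADDR') counters.ADDR++;
  if (type === 'ACT') { counters.ACT++; counters.DTL = 0; }
  if (type === 'DTL') counters.DTL++;

  let path = `REC[${counters.REC}]`;
  if (type === 'ADDR') path += `/ADDR[${counters.ADDR}]`;
  if (type === 'ACT' || type === 'DTL') path += `/ACT[${counters.ACT}]`;
  if (type === 'DTL') path += `/DTL[${counters.DTL}]`;

  FIELDS[type].forEach((key, i) => rows.push({ path, key, value: values[i] ?? '' }));
}

// e.g. { path: 'REC[1]/ACT[1]/DTL[2]', key: 'amount', value: '$143.20' }
console.table(rows);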
