[
https://issues.apache.org/jira/browse/DRILL-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440442#comment-17440442
]
ASF GitHub Bot commented on DRILL-7978:
---------------------------------------
cgivre commented on pull request #2282:
URL: https://github.com/apache/drill/pull/2282#issuecomment-963130138
> @MFoss19 @estherbuchwalter following some [recent
chat](https://github.com/apache/drill/pull/2359#issuecomment-962673076) with
@paul-rogers and my last comment here, how about a reduced format config such
as the following? The goal is to get to something terse and consistent with
what we do for other text formats.
>
> ```json
> "fixedwidth": {
> "type": "fixedwidth",
> "extensions": [
> "fwf"
> ],
> "extractHeader": true,
> "trimStrings": true,
> "columnOffsets": [1, 11, 21, 31],
> "columnWidths": [10, 10, 10, 10]
> }
> ```
>
> Column names and types can already come from a provided schema or aliasing
after calls to `CAST()`. Incidentally, the settings above can be overridden per
query using a provided schema too.
>
> There's also a part of me that wonders whether we could have justified adding
our fixed width functionality to the existing delimited text format reader.

@dzamo I'd respectfully disagree here. In effect, the configuration provides
a schema to the user, similar to the way the logRegex reader works. The user
will get the best data possible if we can include data types and field names
in the schema, so that they can just do a `SELECT *` and not have to worry
about casting etc.
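
To make the point about typed, named fields concrete, here is a minimal Python sketch of slicing one fixed width line into a typed record. It is not Drill's reader or its config schema: `FieldSpec`, `parse_line`, the field names, and the sample line are all hypothetical, and the offsets are assumed to be 1-based, as the `columnOffsets` values in the proposed config suggest.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical per-field entry; not Drill's actual config schema."""
    name: str
    offset: int  # assumed 1-based start column
    width: int   # number of characters in the field
    convert: Callable[[str], object] = str  # type conversion applied to the text


def parse_line(line: str, fields: list[FieldSpec], trim: bool = True) -> dict[str, object]:
    """Slice one fixed width line into a dict of named, typed values."""
    record: dict[str, object] = {}
    for field in fields:
        start = field.offset - 1  # convert the 1-based offset to a 0-based index
        raw = line[start:start + field.width]
        if trim:
            raw = raw.strip()  # mirrors the intent of a trimStrings-style option
        record[field.name] = field.convert(raw)
    return record


# Hypothetical layout: four 10-character columns.
FIELDS = [
    FieldSpec("event_id", 1, 10, int),
    FieldSpec("level", 11, 10),
    FieldSpec("duration_ms", 21, 10, float),
    FieldSpec("message", 31, 10),
]

# Build a sample line by left-justifying each value in its 10-character column.
line = f"{'42':<10}{'WARN':<10}{'12.5':<10}{'slow query':<10}"
print(parse_line(line, FIELDS))
```

With names and types carried by the field list, the caller receives `event_id` as an integer and `duration_ms` as a float without any casting on their side.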
Let's consider a real world use case: some fixed width log generated by a
database. Since the fields may be mashed together, there isn't a delimiter
that you can use to divide the fields. You *could*, however, use the logRegex
reader to do this. That point aside for the moment, the way I imagined someone
using this was that different configs could be set up and linked to workspaces
such that if a file was in the `mysql_logs` folder, it would use the mysql log
config, and if it was in the `postgres` folder, it would use another.
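
The folder-to-config idea can be sketched in a few lines of Python. This only illustrates the routing logic and does not show how Drill resolves workspaces; the mapping, the config names `mysql_fixedwidth` and `postgres_fixedwidth`, and the helper `config_for` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping from a workspace folder to a format config name.
CONFIG_BY_FOLDER = {
    "mysql_logs": "mysql_fixedwidth",
    "postgres": "postgres_fixedwidth",
}


def config_for(path: str) -> Optional[str]:
    """Return the config name for the first mapped folder in the path, else None."""
    # Only look at the directories, not the file name itself.
    for folder in PurePosixPath(path).parent.parts:
        if folder in CONFIG_BY_FOLDER:
            return CONFIG_BY_FOLDER[folder]
    return None


print(config_for("/data/mysql_logs/2021-11-08.fwf"))
print(config_for("/data/postgres/server.fwf"))
```

A file outside both folders returns `None`, which a caller could treat as "no fixed width config applies".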
My opinion here is that the goal should be to get the cleanest possible data
to the user, without the user having to rely on CASTs and other complicating
factors.
--
> Fixed Width Format Plugin
> -------------------------
>
> Key: DRILL-7978
> URL: https://issues.apache.org/jira/browse/DRILL-7978
> Project: Apache Drill
> Issue Type: New Feature
> Components: Storage - Other
> Reporter: Megan Foss
> Priority: Major
>
> Developing format plugin to parse fixed width files.
> Fixed Width Text File Definition:
> https://www.oracle.com/webfolder/technetwork/data-quality/edqhelp/Content/introduction/getting_started/configuring_fixed_width_text_file_formats.htm