[jira] [Commented] (DRILL-7978) Fixed Width Format Plugin

ASF GitHub Bot (Jira) Mon, 08 Nov 2021 05:06:06 -0800


    [ 
https://issues.apache.org/jira/browse/DRILL-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440442#comment-17440442
 ]


ASF GitHub Bot commented on DRILL-7978:
---------------------------------------

cgivre commented on pull request #2282:
URL: https://github.com/apache/drill/pull/2282#issuecomment-963130138


   > @MFoss19 @estherbuchwalter following some [recent 
chat](https://github.com/apache/drill/pull/2359#issuecomment-962673076) with 
@paul-rogers and my last comment here, how about a reduced format config such 
as the following? The goal is to get to something terse and consistent with 
what we do for other text formats.
   > 
   > ```json
   > "fixedwidth": {
   >   "type": "fixedwidth",
   >   "extensions": [
   >     "fwf"
   >   ],
   >   "extractHeader": true,
   >   "trimStrings": true,
   >   "columnOffsets": [1, 11, 21, 31],
   >   "columnWidths": [10, 10, 10, 10]
   > }
   > ```
   > 
   > Column names and types can already come from a provided schema or aliasing 
after calls to `CAST()`. Incidentally, the settings above can be overridden per 
query using a provided schema too.
   > 
   > There's also a part of me that wonders whether we could have justified adding 
our fixed width functionality to the existing delimited text format reader.
   
   @dzamo I'd respectfully disagree here. In effect, the configuration 
provides a schema to the user, similar to the way the logRegex reader works. 
The user will get the best data possible if we can include data types and 
field names in the schema, so that they can just do a `SELECT *` and not have 
to worry about casting, etc. 
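
To make the argument concrete, here is a minimal Python sketch (not Drill code) of what a per-field definition with names and types buys the user: each row comes back already named and typed, with no casting step. `Field`, `parse_line`, and the field names are hypothetical, and treating offsets as 1-based is an assumption read off the `columnOffsets` values in the quoted config.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical per-field definition: name, position, and type."""
    name: str
    offset: int  # assumed 1-based start column, as in the quoted columnOffsets
    width: int
    convert: Callable[[str], Any] = str


def parse_line(line: str, fields: List[Field], trim: bool = True) -> Dict[str, Any]:
    """Slice one fixed-width line into a dict of named, typed values."""
    row: Dict[str, Any] = {}
    for f in fields:
        start = f.offset - 1  # convert the 1-based offset to a 0-based index
        raw = line[start:start + f.width]
        if trim:
            raw = raw.strip()
        row[f.name] = f.convert(raw)
    return row


# Same offsets and widths as the quoted config, plus names and types.
FIELDS = [
    Field("id", 1, 10, int),
    Field("user", 11, 10),
    Field("latency_ms", 21, 10, float),
    Field("status", 31, 10),
]

line = "42".ljust(10) + "alice".ljust(10) + "12.5".ljust(10) + "OK".ljust(10)
row = parse_line(line, FIELDS)
```

With only offsets and widths, every value in `row` would be a string under a positional name, and the typing would have to happen in each query instead.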
   
   Let's consider a real-world use case: a fixed-width log generated by a 
database. Since the fields may be mashed together, there isn't a delimiter 
that you can use to divide them. You *could*, however, use the logRegex 
reader to do this. That point aside for the moment, the way I imagined someone 
using this was that different configs could be set up and linked to workspaces, 
so that a file in the `mysql_logs` folder would use the MySQL log config and a 
file in the `postgres` folder would use another. 
   
   My opinion is that the goal should be to get the cleanest possible data to 
the user, without the user having to rely on CASTs and other complicating 
factors. 



> Fixed Width Format Plugin
> -------------------------
>
>                 Key: DRILL-7978
>                 URL: https://issues.apache.org/jira/browse/DRILL-7978
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Storage - Other
>            Reporter: Megan Foss
>            Priority: Major
>
> Developing format plugin to parse fixed width files.
> Fixed Width Text File Definition: 
> https://www.oracle.com/webfolder/technetwork/data-quality/edqhelp/Content/introduction/getting_started/configuring_fixed_width_text_file_formats.htm



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
