Andries's solution is quite handy for the occasional use. But, having a storage
plugin available can be more convenient and will perform better. When used with
table functions, the format plugin allows specifying fields and column names
per-query if you find yourself querying multiple different files.
Maybe start with the simple approach and grow to the custom approach if
performance and convenience justify the extra work.
Thanks,
- Paul
On Tuesday, February 20, 2018, 2:09:02 PM PST, Kunal Khatua
<[email protected]> wrote:
I agree... Using Andries' solution in combination with a view is probably the
best approach.
-----Original Message-----
From: Flavio Pompermaier [mailto:[email protected]]
Sent: Tuesday, February 20, 2018 1:47 PM
To: [email protected]
Subject: Re: Fixed-width files
Actually what I'd like to achieve, in the end, is to remember how to read a
fixed-width file.
After considering all your opinions, the best way to achieve this will be
probably to create a VIEW and then extract through a DESCRIBE query the columns
definition. What do you think?
On 20 Feb 2018 20:25, "Arjun kr" <[email protected]> wrote:
If you have Hive storage plugin enabled, You can create Hive table with regex
serde and query the same in Drill.
-- Table contents
$ hadoop fs -cat /tmp/regex_test/*
112123
$
-- Hive DDL with regex '(.{1})(.{2})(.{3})' - column1 of width 1,column2 of
width 2 and column3 of width 3
CREATE EXTERNAL TABLE `hive_regex_test`(
`column1` string COMMENT 'from deserializer',
`column2` string COMMENT 'from deserializer',
`column3` string COMMENT 'from deserializer') ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='(.{1})(.{2})(.{3})')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/tmp/regex_test';
hive>
> select * from hive_regex_test;
OK
hive_regex_test.column1 hive_regex_test.column2 hive_regex_test.column3
1 12 123
Time taken: 0.235 seconds, Fetched: 1 row(s)
hive>
-- Drill
0: jdbc:drill:schema=dfs> select * from `hive_regex_test`;
+----------+----------+----------+
| column1 | column2 | column3 |
+----------+----------+----------+
| 1 | 12 | 123 |
+----------+----------+----------+
1 row selected (0.587 seconds)
0: jdbc:drill:schema=dfs>
Thanks,
Arjun
________________________________
From: Kunal Khatua <[email protected]>
Sent: Wednesday, February 21, 2018 12:37 AM
To: [email protected]
Subject: RE: Fixed-width files
This might be a better option, since DRILL-6170 will introduce a rigid parsing
definition. So, different fixed-width files can't leverage the same definition,
though they might share the same extension.
Thanks, Andries!
-----Original Message-----
From: Andries Engelbrecht [mailto:[email protected]]
Sent: Tuesday, February 20, 2018 7:39 AM
To: [email protected]
Subject: Re: Fixed-width files
You can also try and see if you can just use the CSV plugin to read a line as
columns[0] and then use the substr function to pull out the fields in the line.
https://urldefense.proofpoint.com/v2/url?u=http-3A__drill.ap
ache.org_docs_string-2Dmanipulation_-23substr&d=DwIGaQ&c=csk
dkSMqhcnjZxdQVpwTXg&r=-cT6otg6lpT_XkmYy7yg3A&m=oItppN_rkOKe_
pgJb06T71ul6__8GsXmWQzTOQlCvBc&s=u6-Tx7rmfJQDa3_W3hg7YxojXP3
Hf60YPLGHMnD8yLg&e=
Here is a simple example
Simple csv file
[test]$ cat test.csv
col1col2col3
jdbc:drill:zk=localhost:5181> select substr(columns[0],1,4),
substr(columns[0],5,4), substr(columns[0],9,4) from
dfs.root.`/data/csv/test/test.csv`;
+---------+---------+---------+
| EXPR$0 | EXPR$1 | EXPR$2 |
+---------+---------+---------+
| col1 | col2 | col3 |
+---------+---------+---------+
--Andries
On 2/20/18, 1:17 AM, "Flavio Pompermaier" <[email protected]> wrote:
For the moment I've created an improvement issue about this:
https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.
apache.org_jira_browse_DRILL-2D6170&d=DwIBaQ&c=cskdkSMqhcnjZ
xdQVpwTXg&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69
ohaJkyhIdPzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=ajRYDHKrMFgV1AMW
2Q8weYDZtzb7-U5CqR9fML7ihno&e=
On Tue, Feb 20, 2018 at 9:23 AM, Flavio Pompermaier < [email protected]>
wrote:
> Thanks Paul for this suggestion, I think I'm going to give it a try.
> Once I've created my EasyFormatPlugin where should I put the produced jar?
> in which folder within jars directory?
>
> On Tue, Feb 20, 2018 at 2:57 AM, Paul Rogers <[email protected]>
> wrote:
>
>> It may be that by "fixed width text", Flavio means a file in which the
>> text columns are of fixed width: kind of like old-school punch cards.
>> Drill has no reader for this use case, but if you are a Java programmer,
>> you can create one. See Drill Pull Request #1114 [1] for one example of a
>> regex reader along with pointers to a second example I'm building for a
>> book. Should be easy to adopt this code to take a list of column widths
in
>> place of the regex. Actually, you could use the regex with a pattern that
>> just picks out a fixed number of characters.
>> Thanks,
>> - Paul
>>
>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__github.
com_apache_drill_pull_1114&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg
&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69ohaJkyhId
PzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=-0LdlBnmAXaipanP87yJ
ezn5HPEHQIQVX5izxnNTYFY&e=
>>
>>
>>
>>
>> On Monday, February 19, 2018, 12:52:42 PM PST, Kunal Khatua <
>> [email protected]> wrote:
>>
>> As long as you have delimiters, you should be able to import it as a
>> regular CSV file. Using views that define the fixed-width nature should
>> help operators downstream work more efficiently.
>>
>> -----Original Message-----
>> From: Flavio Pompermaier [mailto:[email protected]]
>> Sent: Monday, February 19, 2018 6:50 AM
>> To: [email protected]
>> Subject: Fixed-width files
>>
>> Hi to all,
>> I'm currently looking for the best solution to load a fixed-width text
>> file into Drill.
>> Is there any way right now to do that? Is there anyone that already have
>> a working connector?
>> Is it better to implement a brand new FormatPluginConfig or
>> StoragePluginConfig?
>>
>> Best,
>> Flavio
>>
>>
>