Re: RE: Fixed-width files

2018-02-20 Thread Flavio Pompermaier
Thanks Paul for this suggestion, I think I'm going to give it a try.
Once I've created my EasyFormatPlugin where should I put the produced jar?
in which folder within jars directory?

On Tue, Feb 20, 2018 at 2:57 AM, Paul Rogers 
wrote:

> It may be that by "fixed width text", Flavio means a file in which the
> text columns are of fixed width: kind of like old-school punch cards.
> Drill has no reader for this use case, but if you are a Java programmer,
> you can create one. See Drill Pull Request #1114 [1] for one example of a
> regex reader along with pointers to a second example I'm building for a
> book. Should be easy to adopt this code to take a list of column widths in
> place of the regex. Actually, you could use the regex with a pattern that
> just picks out a fixed number of characters.
> Thanks,
> - Paul
>
> [1]  https://github.com/apache/drill/pull/1114
>
>
>
>
> On Monday, February 19, 2018, 12:52:42 PM PST, Kunal Khatua <
> kkha...@mapr.com> wrote:
>
>  As long as you have delimiters, you should be able to import it as a
> regular CSV file. Using views that define the fixed-width nature should
> help operators downstream work more efficiently.
>
> -Original Message-
> From: Flavio Pompermaier [mailto:pomperma...@okkam.it]
> Sent: Monday, February 19, 2018 6:50 AM
> To: user@drill.apache.org
> Subject: Fixed-width files
>
> Hi to all,
> I'm currently looking for the best solution to load a fixed-width text
> file into Drill.
> Is there any way right now to do that? Is there anyone that already have a
> working connector?
> Is it better to implement a brand new FormatPluginConfig or
> StoragePluginConfig?
>
> Best,
> Flavio
>
>


Re: RE: Fixed-width files

2018-02-20 Thread Flavio Pompermaier
For the moment I've created an improvement issue about this:
https://issues.apache.org/jira/browse/DRILL-6170

On Tue, Feb 20, 2018 at 9:23 AM, Flavio Pompermaier 
wrote:

> Thanks Paul for this suggestion, I think I'm going to give it a try.
> Once I've created my EasyFormatPlugin where should I put the produced jar?
> in which folder within jars directory?
>
> On Tue, Feb 20, 2018 at 2:57 AM, Paul Rogers 
> wrote:
>
>> It may be that by "fixed width text", Flavio means a file in which the
>> text columns are of fixed width: kind of like old-school punch cards.
>> Drill has no reader for this use case, but if you are a Java programmer,
>> you can create one. See Drill Pull Request #1114 [1] for one example of a
>> regex reader along with pointers to a second example I'm building for a
>> book. Should be easy to adopt this code to take a list of column widths in
>> place of the regex. Actually, you could use the regex with a pattern that
>> just picks out a fixed number of characters.
>> Thanks,
>> - Paul
>>
>> [1]  https://github.com/apache/drill/pull/1114
>>
>>
>>
>>
>> On Monday, February 19, 2018, 12:52:42 PM PST, Kunal Khatua <
>> kkha...@mapr.com> wrote:
>>
>>  As long as you have delimiters, you should be able to import it as a
>> regular CSV file. Using views that define the fixed-width nature should
>> help operators downstream work more efficiently.
>>
>> -Original Message-
>> From: Flavio Pompermaier [mailto:pomperma...@okkam.it]
>> Sent: Monday, February 19, 2018 6:50 AM
>> To: user@drill.apache.org
>> Subject: Fixed-width files
>>
>> Hi to all,
>> I'm currently looking for the best solution to load a fixed-width text
>> file into Drill.
>> Is there any way right now to do that? Is there anyone that already have
>> a working connector?
>> Is it better to implement a brand new FormatPluginConfig or
>> StoragePluginConfig?
>>
>> Best,
>> Flavio
>>
>>
>


Save the date: ApacheCon North America, September 24-27 in Montréal

2018-02-20 Thread Rich Bowen

Dear Apache Enthusiast,

(You’re receiving this message because you’re subscribed to a user@ or 
dev@ list of one or more Apache Software Foundation projects.)


We’re pleased to announce the upcoming ApacheCon [1] in Montréal, 
September 24-27. This event is all about you — the Apache project community.


We’ll have four tracks of technical content this time, as well as lots 
of opportunities to connect with your project community, hack on the 
code, and learn about other related (and unrelated!) projects across the 
foundation.


The Call For Papers (CFP) [2] and registration are now open. Register 
early to take advantage of the early bird prices and secure your place 
at the event hotel.


Important dates
March 30: CFP closes
April 20: CFP notifications sent
	August 24: Hotel room block closes (please do not wait until the last 
minute)


Follow @ApacheCon on Twitter to be the first to hear announcements about 
keynotes, the schedule, evening events, and everything you can expect to 
see at the event.


See you in Montréal!

Sincerely, Rich Bowen, V.P. Events,
on behalf of the entire ApacheCon team

[1] http://www.apachecon.com/acna18
[2] https://cfp.apachecon.com/conference.html?apachecon-north-america-2018


Re: Fixed-width files

2018-02-20 Thread Andries Engelbrecht
You can also try and see if you can just use the CSV plugin to read a line as 
columns[0] and then use the substr function to pull out the fields in the line.
http://drill.apache.org/docs/string-manipulation/#substr

Here is a simple example

Simple csv file

[test]$ cat test.csv
col1col2col3


jdbc:drill:zk=localhost:5181> select substr(columns[0],1,4), 
substr(columns[0],5,4), substr(columns[0],9,4) from  
dfs.root.`/data/csv/test/test.csv`;
+-+-+-+
| EXPR$0  | EXPR$1  | EXPR$2  |
+-+-+-+
| col1| col2| col3|
+-+-+-+



--Andries




On 2/20/18, 1:17 AM, "Flavio Pompermaier"  wrote:

For the moment I've created an improvement issue about this:

https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_DRILL-2D6170&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69ohaJkyhIdPzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=ajRYDHKrMFgV1AMW2Q8weYDZtzb7-U5CqR9fML7ihno&e=

On Tue, Feb 20, 2018 at 9:23 AM, Flavio Pompermaier 
wrote:

> Thanks Paul for this suggestion, I think I'm going to give it a try.
> Once I've created my EasyFormatPlugin where should I put the produced jar?
> in which folder within jars directory?
>
> On Tue, Feb 20, 2018 at 2:57 AM, Paul Rogers 
> wrote:
>
>> It may be that by "fixed width text", Flavio means a file in which the
>> text columns are of fixed width: kind of like old-school punch cards.
>> Drill has no reader for this use case, but if you are a Java programmer,
>> you can create one. See Drill Pull Request #1114 [1] for one example of a
>> regex reader along with pointers to a second example I'm building for a
>> book. Should be easy to adopt this code to take a list of column widths 
in
>> place of the regex. Actually, you could use the regex with a pattern that
>> just picks out a fixed number of characters.
>> Thanks,
>> - Paul
>>
>> [1]  
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_drill_pull_1114&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69ohaJkyhIdPzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=-0LdlBnmAXaipanP87yJezn5HPEHQIQVX5izxnNTYFY&e=
>>
>>
>>
>>
>> On Monday, February 19, 2018, 12:52:42 PM PST, Kunal Khatua <
>> kkha...@mapr.com> wrote:
>>
>>  As long as you have delimiters, you should be able to import it as a
>> regular CSV file. Using views that define the fixed-width nature should
>> help operators downstream work more efficiently.
>>
>> -Original Message-
>> From: Flavio Pompermaier [mailto:pomperma...@okkam.it]
>> Sent: Monday, February 19, 2018 6:50 AM
>> To: user@drill.apache.org
>> Subject: Fixed-width files
>>
>> Hi to all,
>> I'm currently looking for the best solution to load a fixed-width text
>> file into Drill.
>> Is there any way right now to do that? Is there anyone that already have
>> a working connector?
>> Is it better to implement a brand new FormatPluginConfig or
>> StoragePluginConfig?
>>
>> Best,
>> Flavio
>>
>>
>




RE: Fixed-width files

2018-02-20 Thread Kunal Khatua
This might be a better option, since DRILL-6170 will introduce a rigid parsing 
definition. So, different fixed-width files can't leverage the same definition, 
though they might share the same extension. 

Thanks, Andries!

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com] 
Sent: Tuesday, February 20, 2018 7:39 AM
To: user@drill.apache.org
Subject: Re: Fixed-width files

You can also try and see if you can just use the CSV plugin to read a line as 
columns[0] and then use the substr function to pull out the fields in the line.

https://urldefense.proofpoint.com/v2/url?u=http-3A__drill.apache.org_docs_string-2Dmanipulation_-23substr&d=DwIGaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=-cT6otg6lpT_XkmYy7yg3A&m=oItppN_rkOKe_pgJb06T71ul6__8GsXmWQzTOQlCvBc&s=u6-Tx7rmfJQDa3_W3hg7YxojXP3Hf60YPLGHMnD8yLg&e=



Here is a simple example



Simple csv file



[test]$ cat test.csv

col1col2col3





jdbc:drill:zk=localhost:5181> select substr(columns[0],1,4), 
substr(columns[0],5,4), substr(columns[0],9,4) from  
dfs.root.`/data/csv/test/test.csv`;

+-+-+-+

| EXPR$0  | EXPR$1  | EXPR$2  |

+-+-+-+

| col1| col2| col3|

+-+-+-+







--Andries









On 2/20/18, 1:17 AM, "Flavio Pompermaier"  wrote:



For the moment I've created an improvement issue about this:


https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_DRILL-2D6170&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69ohaJkyhIdPzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=ajRYDHKrMFgV1AMW2Q8weYDZtzb7-U5CqR9fML7ihno&e=



On Tue, Feb 20, 2018 at 9:23 AM, Flavio Pompermaier 

wrote:



> Thanks Paul for this suggestion, I think I'm going to give it a try.

> Once I've created my EasyFormatPlugin where should I put the produced jar?

> in which folder within jars directory?

>

> On Tue, Feb 20, 2018 at 2:57 AM, Paul Rogers 

> wrote:

>

>> It may be that by "fixed width text", Flavio means a file in which the

>> text columns are of fixed width: kind of like old-school punch cards.

>> Drill has no reader for this use case, but if you are a Java programmer,

>> you can create one. See Drill Pull Request #1114 [1] for one example of a

>> regex reader along with pointers to a second example I'm building for a

>> book. Should be easy to adopt this code to take a list of column widths 
in

>> place of the regex. Actually, you could use the regex with a pattern that

>> just picks out a fixed number of characters.

>> Thanks,

>> - Paul

>>

>> [1]  
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_drill_pull_1114&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69ohaJkyhIdPzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=-0LdlBnmAXaipanP87yJezn5HPEHQIQVX5izxnNTYFY&e=

>>

>>

>>

>>

>> On Monday, February 19, 2018, 12:52:42 PM PST, Kunal Khatua <

>> kkha...@mapr.com> wrote:

>>

>>  As long as you have delimiters, you should be able to import it as a

>> regular CSV file. Using views that define the fixed-width nature should

>> help operators downstream work more efficiently.

>>

>> -Original Message-

>> From: Flavio Pompermaier [mailto:pomperma...@okkam.it]

>> Sent: Monday, February 19, 2018 6:50 AM

>> To: user@drill.apache.org

>> Subject: Fixed-width files

>>

>> Hi to all,

>> I'm currently looking for the best solution to load a fixed-width text

>> file into Drill.

>> Is there any way right now to do that? Is there anyone that already have

>> a working connector?

>> Is it better to implement a brand new FormatPluginConfig or

>> StoragePluginConfig?

>>

>> Best,

>> Flavio

>>

>>

>







Re: Fixed-width files

2018-02-20 Thread Arjun kr

If you have Hive storage plugin enabled, You can create Hive table with regex 
serde and query the same in Drill.


-- Table contents

$ hadoop fs -cat /tmp/regex_test/*
112123
$

-- Hive DDL with regex '(.{1})(.{2})(.{3})' - column1 of width 1,column2 of 
width 2 and column3 of width 3

CREATE EXTERNAL TABLE `hive_regex_test`(
  `column1` string COMMENT 'from deserializer',
  `column2` string COMMENT 'from deserializer',
  `column3` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='(.{1})(.{2})(.{3})')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/tmp/regex_test';

hive>
> select * from hive_regex_test;
OK
hive_regex_test.column1 hive_regex_test.column2 hive_regex_test.column3
1 12 123
Time taken: 0.235 seconds, Fetched: 1 row(s)
hive>


-- Drill

0: jdbc:drill:schema=dfs> select * from `hive_regex_test`;
+--+--+--+
| column1  | column2  | column3  |
+--+--+--+
| 1| 12   | 123  |
+--+--+--+
1 row selected (0.587 seconds)
0: jdbc:drill:schema=dfs>

Thanks,

Arjun


From: Kunal Khatua 
Sent: Wednesday, February 21, 2018 12:37 AM
To: user@drill.apache.org
Subject: RE: Fixed-width files

This might be a better option, since DRILL-6170 will introduce a rigid parsing 
definition. So, different fixed-width files can't leverage the same definition, 
though they might share the same extension.

Thanks, Andries!

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com]
Sent: Tuesday, February 20, 2018 7:39 AM
To: user@drill.apache.org
Subject: Re: Fixed-width files

You can also try and see if you can just use the CSV plugin to read a line as 
columns[0] and then use the substr function to pull out the fields in the line.

https://urldefense.proofpoint.com/v2/url?u=http-3A__drill.apache.org_docs_string-2Dmanipulation_-23substr&d=DwIGaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=-cT6otg6lpT_XkmYy7yg3A&m=oItppN_rkOKe_pgJb06T71ul6__8GsXmWQzTOQlCvBc&s=u6-Tx7rmfJQDa3_W3hg7YxojXP3Hf60YPLGHMnD8yLg&e=



Here is a simple example



Simple csv file



[test]$ cat test.csv

col1col2col3





jdbc:drill:zk=localhost:5181> select substr(columns[0],1,4), 
substr(columns[0],5,4), substr(columns[0],9,4) from  
dfs.root.`/data/csv/test/test.csv`;

+-+-+-+

| EXPR$0  | EXPR$1  | EXPR$2  |

+-+-+-+

| col1| col2| col3|

+-+-+-+







--Andries









On 2/20/18, 1:17 AM, "Flavio Pompermaier"  wrote:



For the moment I've created an improvement issue about this:


https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_DRILL-2D6170&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69ohaJkyhIdPzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=ajRYDHKrMFgV1AMW2Q8weYDZtzb7-U5CqR9fML7ihno&e=



On Tue, Feb 20, 2018 at 9:23 AM, Flavio Pompermaier 

wrote:



> Thanks Paul for this suggestion, I think I'm going to give it a try.

> Once I've created my EasyFormatPlugin where should I put the produced jar?

> in which folder within jars directory?

>

> On Tue, Feb 20, 2018 at 2:57 AM, Paul Rogers 

> wrote:

>

>> It may be that by "fixed width text", Flavio means a file in which the

>> text columns are of fixed width: kind of like old-school punch cards.

>> Drill has no reader for this use case, but if you are a Java programmer,

>> you can create one. See Drill Pull Request #1114 [1] for one example of a

>> regex reader along with pointers to a second example I'm building for a

>> book. Should be easy to adopt this code to take a list of column widths 
in

>> place of the regex. Actually, you could use the regex with a pattern that

>> just picks out a fixed number of characters.

>> Thanks,

>> - Paul

>>

>> [1]  
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_drill_pull_1114&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69ohaJkyhIdPzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=-0LdlBnmAXaipanP87yJezn5HPEHQIQVX5izxnNTYFY&e=

>>

>>

>>

>>

>> On Monday, February 19, 2018, 12:52:42 PM PST, Kunal Khatua <

>> kkha...@mapr.com> wrote:

>>

>>  As long as you have delimiters, you should be able to import it as a

>> regular CSV file. Using views that define the fixed-width nature should

>> help operators downstream work more efficiently.

>>

>> -Original Message-

>> From: Flavio Pompermaier [mailto:pomperma...@okkam.it]

>> Sent: Monday, February 19, 2018 6:50 AM

>> To: user@drill.apache.org

>> Subject: Fixed-width files

>>

>> Hi to all,


Re: Fixed-width files

2018-02-20 Thread Flavio Pompermaier
Actually what I'd like to achieve, in the end, is to remember how to read a
fixed-width file.
After considering all your opinions, the best way to achieve this will be
probably to create a VIEW and then extract through a DESCRIBE query the
columns definition. What do you think?

On 20 Feb 2018 20:25, "Arjun kr"  wrote:


If you have Hive storage plugin enabled, You can create Hive table with
regex serde and query the same in Drill.


-- Table contents

$ hadoop fs -cat /tmp/regex_test/*
112123
$

-- Hive DDL with regex '(.{1})(.{2})(.{3})' - column1 of width 1,column2 of
width 2 and column3 of width 3

CREATE EXTERNAL TABLE `hive_regex_test`(
  `column1` string COMMENT 'from deserializer',
  `column2` string COMMENT 'from deserializer',
  `column3` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='(.{1})(.{2})(.{3})')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/tmp/regex_test';

hive>
> select * from hive_regex_test;
OK
hive_regex_test.column1 hive_regex_test.column2 hive_regex_test.column3
1 12 123
Time taken: 0.235 seconds, Fetched: 1 row(s)
hive>


-- Drill

0: jdbc:drill:schema=dfs> select * from `hive_regex_test`;
+--+--+--+
| column1  | column2  | column3  |
+--+--+--+
| 1| 12   | 123  |
+--+--+--+
1 row selected (0.587 seconds)
0: jdbc:drill:schema=dfs>

Thanks,

Arjun


From: Kunal Khatua 
Sent: Wednesday, February 21, 2018 12:37 AM
To: user@drill.apache.org
Subject: RE: Fixed-width files

This might be a better option, since DRILL-6170 will introduce a rigid
parsing definition. So, different fixed-width files can't leverage the same
definition, though they might share the same extension.

Thanks, Andries!

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com]
Sent: Tuesday, February 20, 2018 7:39 AM
To: user@drill.apache.org
Subject: Re: Fixed-width files

You can also try and see if you can just use the CSV plugin to read a line
as columns[0] and then use the substr function to pull out the fields in
the line.

https://urldefense.proofpoint.com/v2/url?u=http-3A__drill.ap
ache.org_docs_string-2Dmanipulation_-23substr&d=DwIGaQ&c=csk
dkSMqhcnjZxdQVpwTXg&r=-cT6otg6lpT_XkmYy7yg3A&m=oItppN_rkOKe_
pgJb06T71ul6__8GsXmWQzTOQlCvBc&s=u6-Tx7rmfJQDa3_W3hg7YxojXP3
Hf60YPLGHMnD8yLg&e=



Here is a simple example



Simple csv file



[test]$ cat test.csv

col1col2col3





jdbc:drill:zk=localhost:5181> select substr(columns[0],1,4),
substr(columns[0],5,4), substr(columns[0],9,4) from
dfs.root.`/data/csv/test/test.csv`;

+-+-+-+

| EXPR$0  | EXPR$1  | EXPR$2  |

+-+-+-+

| col1| col2| col3|

+-+-+-+







--Andries









On 2/20/18, 1:17 AM, "Flavio Pompermaier"  wrote:



For the moment I've created an improvement issue about this:

https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.
apache.org_jira_browse_DRILL-2D6170&d=DwIBaQ&c=cskdkSMqhcnjZ
xdQVpwTXg&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69
ohaJkyhIdPzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=ajRYDHKrMFgV1AMW
2Q8weYDZtzb7-U5CqR9fML7ihno&e=



On Tue, Feb 20, 2018 at 9:23 AM, Flavio Pompermaier <
pomperma...@okkam.it>

wrote:



> Thanks Paul for this suggestion, I think I'm going to give it a try.

> Once I've created my EasyFormatPlugin where should I put the produced
jar?

> in which folder within jars directory?

>

> On Tue, Feb 20, 2018 at 2:57 AM, Paul Rogers


> wrote:

>

>> It may be that by "fixed width text", Flavio means a file in which
the

>> text columns are of fixed width: kind of like old-school punch cards.

>> Drill has no reader for this use case, but if you are a Java
programmer,

>> you can create one. See Drill Pull Request #1114 [1] for one example
of a

>> regex reader along with pointers to a second example I'm building
for a

>> book. Should be easy to adopt this code to take a list of column
widths in

>> place of the regex. Actually, you could use the regex with a pattern
that

>> just picks out a fixed number of characters.

>> Thanks,

>> - Paul

>>

>> [1]  https://urldefense.proofpoint.com/v2/url?u=https-3A__github.
com_apache_drill_pull_1114&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg
&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69ohaJkyhId
PzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=-0LdlBnmAXaipanP87yJ
ezn5HPEHQIQVX5izxnNTYFY&e=

>>

>>

>>

>>

>> On Monday, February 19, 2018, 12:52:42 PM PST, Kunal Khatua <

>> kkha...@mapr.com> wrote:

>>

>>  As long as you have delimiters, you should be able to import it as a

>> regular CSV file. Using views that def

RE: Fixed-width files

2018-02-20 Thread Kunal Khatua
I agree... Using Andries' solution in combination with a view is probably the 
best approach.

-Original Message-
From: Flavio Pompermaier [mailto:pomperma...@okkam.it] 
Sent: Tuesday, February 20, 2018 1:47 PM
To: user@drill.apache.org
Subject: Re: Fixed-width files

Actually what I'd like to achieve, in the end, is to remember how to read a 
fixed-width file.
After considering all your opinions, the best way to achieve this will be 
probably to create a VIEW and then extract through a DESCRIBE query the columns 
definition. What do you think?

On 20 Feb 2018 20:25, "Arjun kr"  wrote:


If you have Hive storage plugin enabled, You can create Hive table with regex 
serde and query the same in Drill.


-- Table contents

$ hadoop fs -cat /tmp/regex_test/*
112123
$

-- Hive DDL with regex '(.{1})(.{2})(.{3})' - column1 of width 1,column2 of 
width 2 and column3 of width 3

CREATE EXTERNAL TABLE `hive_regex_test`(
  `column1` string COMMENT 'from deserializer',
  `column2` string COMMENT 'from deserializer',
  `column3` string COMMENT 'from deserializer') ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='(.{1})(.{2})(.{3})')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/tmp/regex_test';

hive>
> select * from hive_regex_test;
OK
hive_regex_test.column1 hive_regex_test.column2 hive_regex_test.column3
1 12 123
Time taken: 0.235 seconds, Fetched: 1 row(s)
hive>


-- Drill

0: jdbc:drill:schema=dfs> select * from `hive_regex_test`;
+--+--+--+
| column1  | column2  | column3  |
+--+--+--+
| 1| 12   | 123  |
+--+--+--+
1 row selected (0.587 seconds)
0: jdbc:drill:schema=dfs>

Thanks,

Arjun


From: Kunal Khatua 
Sent: Wednesday, February 21, 2018 12:37 AM
To: user@drill.apache.org
Subject: RE: Fixed-width files

This might be a better option, since DRILL-6170 will introduce a rigid parsing 
definition. So, different fixed-width files can't leverage the same definition, 
though they might share the same extension.

Thanks, Andries!

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com]
Sent: Tuesday, February 20, 2018 7:39 AM
To: user@drill.apache.org
Subject: Re: Fixed-width files

You can also try and see if you can just use the CSV plugin to read a line as 
columns[0] and then use the substr function to pull out the fields in the line.

https://urldefense.proofpoint.com/v2/url?u=http-3A__drill.ap
ache.org_docs_string-2Dmanipulation_-23substr&d=DwIGaQ&c=csk
dkSMqhcnjZxdQVpwTXg&r=-cT6otg6lpT_XkmYy7yg3A&m=oItppN_rkOKe_
pgJb06T71ul6__8GsXmWQzTOQlCvBc&s=u6-Tx7rmfJQDa3_W3hg7YxojXP3
Hf60YPLGHMnD8yLg&e=



Here is a simple example



Simple csv file



[test]$ cat test.csv

col1col2col3





jdbc:drill:zk=localhost:5181> select substr(columns[0],1,4), 
substr(columns[0],5,4), substr(columns[0],9,4) from 
dfs.root.`/data/csv/test/test.csv`;

+-+-+-+

| EXPR$0  | EXPR$1  | EXPR$2  |

+-+-+-+

| col1| col2| col3|

+-+-+-+







--Andries









On 2/20/18, 1:17 AM, "Flavio Pompermaier"  wrote:



For the moment I've created an improvement issue about this:

https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.
apache.org_jira_browse_DRILL-2D6170&d=DwIBaQ&c=cskdkSMqhcnjZ
xdQVpwTXg&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69
ohaJkyhIdPzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=ajRYDHKrMFgV1AMW
2Q8weYDZtzb7-U5CqR9fML7ihno&e=



On Tue, Feb 20, 2018 at 9:23 AM, Flavio Pompermaier < pomperma...@okkam.it>

wrote:



> Thanks Paul for this suggestion, I think I'm going to give it a try.

> Once I've created my EasyFormatPlugin where should I put the produced jar?

> in which folder within jars directory?

>

> On Tue, Feb 20, 2018 at 2:57 AM, Paul Rogers 

> wrote:

>

>> It may be that by "fixed width text", Flavio means a file in which the

>> text columns are of fixed width: kind of like old-school punch cards.

>> Drill has no reader for this use case, but if you are a Java programmer,

>> you can create one. See Drill Pull Request #1114 [1] for one example of a

>> regex reader along with pointers to a second example I'm building for a

>> book. Should be easy to adopt this code to take a list of column widths 
in

>> place of the regex. Actually, you could use the regex with a pattern that

>> just picks out a fixed number of characters.

>> Thanks,

>> - Paul

>>

>> [1]  https://urldefense.proofpoint.com/v2/url?u=https-3A__github.
com_apache_drill_pull_1114&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg
&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69ohaJkyhId
PzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=-0LdlBnmAXaipanP87yJ
ezn5HPEHQI

Re: Fixed-width files

2018-02-20 Thread Paul Rogers
Hi Flavio,
Great question! I've not yet experimented with the solution myself, but I 
believe that the plugin can be placed into a jar, along with the needed Drill 
config file, and then placed into the jars/3rd-party directory if you keep your 
config information in the Drill product directory. Perhaps Charles can offer 
more details based on his experience.
You may find it more convenient to use the "site" directory added in Drill 1.8. 
With a site directory, you separate your config files and custom jars from the 
Drill product files. Launch drill with the "--site" flag:
> drillbit.sh --site /my/site/dir start
For convenience, you can set the DRILL_SITE_DIR env var instead of using the 
--site flag.
If using a site directory, put your jar in the "jars" folder.
All that said, while you develop your plugin, you'll want to put the sources 
inside the Drill java-exec project. Why? Doing so allows you to very rapidly 
build and debug your library using your favorite IDE. The test file mentioned 
in the PR shows how to use the test framework to run a query, start an 
in-process Drillbit, and immediately step through (or set breakpoints in) your 
plugin code.
If you build the plugin as a jar file, then for each edit/compile/debug cycle, 
you'll need to build your jar, copy it to the proper location, restart the 
Drill server, attach the remote debugger, start a client tool, and finally 
submit a query. This works, but is quite slow; the above technique is faster 
for us impatient types...
Once the storage plugin works, then you can move the code to a new project from 
which you can build and deploy your jar.
Or, you can do as Charles did: offer your plugin to the Drill project via a PR 
so others can use it.
Thanks,
- Paul

 

On Tuesday, February 20, 2018, 12:24:10 AM PST, Flavio Pompermaier 
 wrote:  
 
 Thanks Paul for this suggestion, I think I'm going to give it a try.
Once I've created my EasyFormatPlugin where should I put the produced jar?
in which folder within jars directory?

On Tue, Feb 20, 2018 at 2:57 AM, Paul Rogers 
wrote:

> It may be that by "fixed width text", Flavio means a file in which the
> text columns are of fixed width: kind of like old-school punch cards.
> Drill has no reader for this use case, but if you are a Java programmer,
> you can create one. See Drill Pull Request #1114 [1] for one example of a
> regex reader along with pointers to a second example I'm building for a
> book. Should be easy to adopt this code to take a list of column widths in
> place of the regex. Actually, you could use the regex with a pattern that
> just picks out a fixed number of characters.
> Thanks,
> - Paul
>
> [1]  https://github.com/apache/drill/pull/1114
>
>
>
>
>    On Monday, February 19, 2018, 12:52:42 PM PST, Kunal Khatua <
> kkha...@mapr.com> wrote:
>
>  As long as you have delimiters, you should be able to import it as a
> regular CSV file. Using views that define the fixed-width nature should
> help operators downstream work more efficiently.
>
> -Original Message-
> From: Flavio Pompermaier [mailto:pomperma...@okkam.it]
> Sent: Monday, February 19, 2018 6:50 AM
> To: user@drill.apache.org
> Subject: Fixed-width files
>
> Hi to all,
> I'm currently looking for the best solution to load a fixed-width text
> file into Drill.
> Is there any way right now to do that? Is there anyone that already have a
> working connector?
> Is it better to implement a brand new FormatPluginConfig or
> StoragePluginConfig?
>
> Best,
> Flavio
>
>
  

Re: RE: Fixed-width files

2018-02-20 Thread Paul Rogers
Andries's solution is quite handy for the occasional use. But, having a storage 
plugin available can be more convenient and will perform better. When used with 
table functions, the format plugin allows specifying fields and column names 
per-query if you find yourself querying multiple different files.
Maybe start with the simple approach and grow to the custom approach if 
performance and convenience justify the extra work.
Thanks,
- Paul

 

On Tuesday, February 20, 2018, 2:09:02 PM PST, Kunal Khatua 
 wrote:  
 
 I agree... Using Andries' solution in combination with a view is probably the 
best approach.

-Original Message-
From: Flavio Pompermaier [mailto:pomperma...@okkam.it] 
Sent: Tuesday, February 20, 2018 1:47 PM
To: user@drill.apache.org
Subject: Re: Fixed-width files

Actually what I'd like to achieve, in the end, is to remember how to read a 
fixed-width file.
After considering all your opinions, the best way to achieve this will be 
probably to create a VIEW and then extract through a DESCRIBE query the columns 
definition. What do you think?

On 20 Feb 2018 20:25, "Arjun kr"  wrote:


If you have Hive storage plugin enabled, You can create Hive table with regex 
serde and query the same in Drill.


-- Table contents

$ hadoop fs -cat /tmp/regex_test/*
112123
$

-- Hive DDL with regex '(.{1})(.{2})(.{3})' - column1 of width 1,column2 of 
width 2 and column3 of width 3

CREATE EXTERNAL TABLE `hive_regex_test`(
  `column1` string COMMENT 'from deserializer',
  `column2` string COMMENT 'from deserializer',
  `column3` string COMMENT 'from deserializer') ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='(.{1})(.{2})(.{3})')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/tmp/regex_test';

hive>
    > select * from hive_regex_test;
OK
hive_regex_test.column1 hive_regex_test.column2 hive_regex_test.column3
1 12 123
Time taken: 0.235 seconds, Fetched: 1 row(s)
hive>


-- Drill

0: jdbc:drill:schema=dfs> select * from `hive_regex_test`;
+--+--+--+
| column1  | column2  | column3  |
+--+--+--+
| 1        | 12      | 123      |
+--+--+--+
1 row selected (0.587 seconds)
0: jdbc:drill:schema=dfs>

Thanks,

Arjun


From: Kunal Khatua 
Sent: Wednesday, February 21, 2018 12:37 AM
To: user@drill.apache.org
Subject: RE: Fixed-width files

This might be a better option, since DRILL-6170 will introduce a rigid parsing 
definition. So, different fixed-width files can't leverage the same definition, 
though they might share the same extension.

Thanks, Andries!

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com]
Sent: Tuesday, February 20, 2018 7:39 AM
To: user@drill.apache.org
Subject: Re: Fixed-width files

You can also try and see if you can just use the CSV plugin to read a line as 
columns[0] and then use the substr function to pull out the fields in the line.

https://urldefense.proofpoint.com/v2/url?u=http-3A__drill.ap
ache.org_docs_string-2Dmanipulation_-23substr&d=DwIGaQ&c=csk
dkSMqhcnjZxdQVpwTXg&r=-cT6otg6lpT_XkmYy7yg3A&m=oItppN_rkOKe_
pgJb06T71ul6__8GsXmWQzTOQlCvBc&s=u6-Tx7rmfJQDa3_W3hg7YxojXP3
Hf60YPLGHMnD8yLg&e=



Here is a simple example



Simple csv file



[test]$ cat test.csv

col1col2col3





jdbc:drill:zk=localhost:5181> select substr(columns[0],1,4), 
substr(columns[0],5,4), substr(columns[0],9,4) from 
dfs.root.`/data/csv/test/test.csv`;

+-+-+-+

| EXPR$0  | EXPR$1  | EXPR$2  |

+-+-+-+

| col1    | col2    | col3    |

+-+-+-+







--Andries









On 2/20/18, 1:17 AM, "Flavio Pompermaier"  wrote:



    For the moment I've created an improvement issue about this:

    https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.
apache.org_jira_browse_DRILL-2D6170&d=DwIBaQ&c=cskdkSMqhcnjZ
xdQVpwTXg&r=Q3Oz5l4W5TvDHNLpOqMYE2AgtKWFE937v89GEHyOVDU&m=69
ohaJkyhIdPzNBy3ZsqNCTa19XysjZzgmn_XPJ2yXQ&s=ajRYDHKrMFgV1AMW
2Q8weYDZtzb7-U5CqR9fML7ihno&e=



    On Tue, Feb 20, 2018 at 9:23 AM, Flavio Pompermaier < pomperma...@okkam.it>

    wrote:



    > Thanks Paul for this suggestion, I think I'm going to give it a try.

    > Once I've created my EasyFormatPlugin where should I put the produced jar?

    > in which folder within jars directory?

    >

    > On Tue, Feb 20, 2018 at 2:57 AM, Paul Rogers 

    > wrote:

    >

    >> It may be that by "fixed width text", Flavio means a file in which the

    >> text columns are of fixed width: kind of like old-school punch cards.

    >> Drill has no reader for this use case, but if you are a Java programmer,

    >> you can create one. See Drill Pull Request #1114 [1] for one example of a

    >> regex reader along with pointers to a second example I'm building

Re: Code too large

2018-02-20 Thread Khurram Faraaz
Hello Anup,


I could not repro the issue with my data, can you please share the data that 
you used so I can try it again with your data.


Thanks,

Khurram


From: Anup Tiwari 
Sent: Monday, February 19, 2018 3:09:22 AM
To: user@drill.apache.org
Subject: Re: Code too large

Hi Khurram/Arjun,
Anyone got time to look into it?





On Fri, Feb 16, 2018 4:53 PM, Anup Tiwari anup.tiw...@games24x7.com  wrote:
Hi Arjun,
After posting this reply ; i have found the same answer on net and that
parameter to 30 and then query worked but it took bit more time than expected.
Also don't you think these type of things should be adjusted automatically?
@khurram, Please find below query and logs(since log is huge in drillbit.log for
this query so i have divided logs into 3 parts in an order which i got for the
query - error + some drill code(which was too large) + error). FYI : hive.cs_all
is a hive(2.1.1) parquet table.
Query :-
create table dfs.tmp.cs_all_test AS select log_date,ssid ,select log_date,ssid ,
count((case when ((id like 'cta-action-%' and event = 'click' and sit = 'pnow'
and ptype = '1' and stype = '1') OR (id like '1:100%' and event = 'pnowclick'
and STRPOS(url,'mrc/player/sit.html') > 0) OR (id like '/fmg/110%/pn/pnow.html'
or (id like '110%/fmgopt/pnow'))) then ssid end)) as pnow_prac_c , count((case
when ((id like 'btsit%' and event = 'click' and sit like '%TSit%' and ptype1 =
'1' and stype1 = '1') OR (event = 'ts.click' and id like '1:100%') OR (id like
'/mgems/over/110%/ts.html')) then ssid end)) as ts_prac_c , count((case when
((id = '/res/vinit/' and mptype = '1' and (mgtype = 'cfp' OR mftype = '100')) OR
(id like '/dl%/fmg/110%/pn/ftpracga/vinit.html' or id like
'/dl%/fmg/110%/pn/vinit.html')) then ssid end)) as vinit_prac_c , count((case
when (id = '/res/tiu/' and mptype = '1' and (mgtype = 'cfp' OR mftype = '100'))
then ssid end)) as tiu_prac_c , count((case when (id = '/onstClick/btnStHr/' and
event='click' and mptype = '1' and (mgtype = 'cfp' OR mftype = '100')) then ssid
end)) as StHr_prac_c , count((case when ((id = '/res/dcd/' and mptype = '1' and
(mgtype = 'cfp' OR mftype = '100')) OR (id like
'/dl%/fmg/110%/pn/ftpracga/dcd.html' or id like '/dl%/fmg/110%/pn/dcd.html'))
then ssid end)) as dcd_prac_c , count((case when ((id = '/confirmdr/btnY/' and
event in ('click','Click') and mptype = '1' and (mgtype = 'cfp' OR mftype =
'100')) OR (id like '/dl%/fmg/110%/pn/dr.html')) then ssid end)) as dr_prac_c ,
count((case when ((id = '/res/finish/' and mptype = '1' and (mgtype = 'cfp' OR
mftype = '100')) OR (id like '/dl%/fmg/110%/pn/ftpracga/finish.html' or id like
'/dl%/fmg/110%/pn/finish.html')) then ssid end)) as finish_prac_c , count((case
when ((id like 'cta-action-%' and event = 'click' and sit = 'pnow' and ptype =
'2' and stype = '1') OR (id like '2:100%' and event = 'pnowclick' and
STRPOS(url,'mrc/player/sit.html') > 0) OR (id like '/fmg/210%/pn/pnow.html' or
(id like '210%/fmgopt/pnow'))) then ssid end)) as pnow_cash_c , count((case when
(id like '2:100%' and event = 'pnowclick' and STRPOS(url,'mrc/player/sit.html')
= 0) then ssid end)) as pnow_cash_c_pac , count((case when ((id like 'btsit%'
and event = 'click' and sit like '%TSit%' and ptype1 = '2' and stype1 = '1') OR
(event = 'ts.click' and id like '2:100%') OR (id like
'/mgems/over/210%/ts.html')) then ssid end)) as ts_cash_c , count((case when
((id = '/res/vinit/' and mptype = '2' and (mgtype = 'cfp' OR mftype = '100')) OR
(id like '/dl%/fmg/210%/pn/ftpracga/vinit.html' or id like
'/dl%/fmg/210%/pn/vinit.html')) then ssid end)) as vinit_cash_c , count((case
when (id = '/res/tiu/' and mptype = '2' and (mgtype = 'cfp' OR mftype = '100'))
then ssid end)) as tiu_cash_c , count((case when (id = '/onstClick/btnStHr/' and
event='click' and mptype = '2' and (mgtype = 'cfp' OR mftype = '100')) then ssid
end)) as StHr_cash_c , count((case when ((id = '/res/dcd/' and mptype = '2' and
(mgtype = 'cfp' OR mftype = '100')) OR (id like
'/dl%/fmg/210%/pn/ftpracga/dcd.html' or id like '/dl%/fmg/210%/pn/dcd.html'))
then ssid end)) as dcd_cash_c , count((case when ((id = '/confirmdr/btnY/' and
event in ('click','Click') and mptype = '2' and (mgtype = 'cfp' OR mftype =
'100')) OR (id like '/dl%/fmg/210%/pn/dr.html')) then ssid end)) as dr_cash_c ,
count((case when ((id = '/res/finish/' and mptype = '2' and (mgtype = 'cfp' OR
mftype = '100')) OR (id like '/dl%/fmg/210%/pn/ftpracga/finish.html' or id like
'/dl%/fmg/210%/pn/finish.html')) then ssid end)) as finish_cash_c , count((case
when event = 'event.ajax' and ajaxUrl = '/pyr/ac/change-uname' and ajaxResponse
like '%validationResponse%true%updateResponse%true%' then ssid end)) as
changeunamesuccess , count((case when unameautoassign = 'true' then ssid end))
as unameautoassign , count((case when (id = '/res/cdsdlt/' or id like
'/dl%/fmg/210%/pn/cdsdlt.html%') and mptype = '2' and (mgtype = 'cfp' OR mftype
= '100') then ssid end)) as cds_dlt_cfp_cash ,
count((case when ((id