Standalone drillbit without Zookeeper?

2020-09-29 Thread Matt Keranen
Is it possible to run a single-node drillbit without Zookeeper, as a
"service", without the need for coordination across multiple nodes?

`zk.connect: "local"` is not accepted as the equivalent of "zk=local" with
drill-embedded.
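
For reference, the drillbit-side configuration being attempted presumably looks something like this drill-override.conf fragment (property names per Drill's standard override file; the exact form here is an assumption, not a verified working config):

~~~
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "local"
}
~~~

whereas embedded mode reaches the same effect through the connection string `jdbc:drill:zk=local`.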


Re: [DISCUSS] Drill Storage Plugins

2019-11-05 Thread Matt
Perhaps an "awesome-drill" repo on GitHub would be a place to back fill the
book, and serve as a central location for thins like the list you supplied:

https://github.com/topics/awesome

On Tue, Nov 5, 2019 at 9:14 AM Charles Givre  wrote:

> One more thing:  I've found code for storage plugins (in various states of
> completion) for the following systems:
> DynamoDB (https://github.com/fineoio/drill-dynamo-adapter <
> https://github.com/fineoio/drill-dynamo-adapter>)
> Apache Druid:  (Current Draft PR https://github.com/apache/drill/pull/1888
> )
> Couchbase: (https://github.com/LyleLeo/Apache-Drill-CouchDB-Storage-Plugin
> ) (Author
> said he would consider submitting as PR)
> ElasticSearch: https://github.com/javiercanillas/drill-storage-elastic <
> https://github.com/javiercanillas/drill-storage-elastic>,
> https://github.com/gaoshui87/drill-storage-elastic <
> https://github.com/gaoshui87/drill-storage-elastic>
> Apache Solr
>
> Are there others that anyone knows of?
>
>
> > On Nov 4, 2019, at 10:23 PM, Charles Givre  wrote:
> >
> > Hello all,
> > I've written some UDFs and Format plugins for Drill and I'm interested
> in tackling a storage plugin.  One of my regrets from the Drill book was
> that we didn't get into this topic.  For those of you who have written one,
> my hat's off to you. I wanted to ask if there are any resources or
> tutorials available that you found particularly helpful?  I'm having a
> little trouble figuring out what all the pieces do and how they fit
> together.
> >
> > Does anyone have any ideas about which storage plugins should be implemented?
> Personally, I'd really like to see one for ElasticSearch.
> > Best,
> > -- C
>
>


Re: Problem creating jt400 jdbc connection

2019-07-25 Thread Matt Rabbitt
It works using 9.4 java8 version.  Thanks!
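
(For anyone hitting the same issue: as I understand it, the JDBC driver jar goes into Drill's third-party jar directory before the storage plugin is created; the jar name and install path below are assumptions based on a default tarball install.)

~~~
# copy the jt400 driver where Drill loads third-party jars, then restart
cp jt400-9.4-java8.jar $DRILL_HOME/jars/3rdparty/
$DRILL_HOME/bin/drillbit.sh restart
~~~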

On Thu, Jul 25, 2019 at 12:07 PM  wrote:

> Hi Matt, I tried with 9.4 jt400.rar and it works for me
> With this parameters
>
> {
>   "type": "jdbc",
>   "driver": "com.ibm.as400.access.AS400JDBCDriver",
>   "url": "jdbc:as400://93.63.150.44;prompt=false",
>   "username": "xxx",
>   "password": "xxx",
>   "caseInsensitiveTableNames": false,
>   "enabled": true
> }
>
> Best
> Alessandro
>
>
> -Messaggio originale-
> Da: Matt Rabbitt 
> Inviato: giovedì 25 luglio 2019 18:02
> A: user@drill.apache.org
> Oggetto: Problem creating jt400 jdbc connection
>
> Is anyone successfully using the jt400 jdbc driver with Drill?  I am trying
> to add a storage plugin but when I go to create it in the web gui I'm
> getting an error:
>
> Please retry: Error while creating / updating storage :
> java.sql.SQLException: Cannot create PoolableConnectionFactory (The
> application requester cannot establish the connection. (Connection was
> dropped unexpectedly.))
>
> I am using jtopen v9.8 jt400.jar, connecting to an AS400 (not totally sure
> what version).  The JDBC driver works in other applications for me like
> DBeaver and jdbcsql.zip.  Config below:
>
> {type: "jdbc",
>   enabled: true,
>   driver: "com.ibm.as400.access.AS400JDBCDriver",
>   url:"jdbc:as400://192.168.1.33:446;prompt=false;translate
> binary=true;naming=sql;toolbox trace=datastream;trace=true",
>   username:"foo",
>   password:"bar"}
>
> The trace outputs a message "Failed to read all of the data stream." before
> failing.  Full trace below:
>
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019  as400:
> Driver AS/400 Toolbox for Java JDBC Driver (1235335025) : connect called
> with URL: jdbc:as400://192.168.1.33:446;prompt=false;translate
> binary=true;naming=sql;toolbox trace=datastream;trace=true.
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : access = "all".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : block size = "32".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : block criteria = "2".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : date format = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : date separator = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : decimal separator = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : errors = "basic".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : extended dynamic = "false".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : libraries = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : naming = "sql".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package add = "true".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package cache = "false".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package clear = "false".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package error = "warning".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package library = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : password = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : prefetch = "true".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setStrin

Problem creating jt400 jdbc connection

2019-07-25 Thread Matt Rabbitt
Is anyone successfully using the jt400 jdbc driver with Drill?  I am trying
to add a storage plugin but when I go to create it in the web gui I'm
getting an error:

Please retry: Error while creating / updating storage :
java.sql.SQLException: Cannot create PoolableConnectionFactory (The
application requester cannot establish the connection. (Connection was
dropped unexpectedly.))

I am using jtopen v9.8 jt400.jar, connecting to an AS400 (not totally sure
what version).  The JDBC driver works in other applications for me like
DBeaver and jdbcsql.zip.  Config below:

{type: "jdbc",
  enabled: true,
  driver: "com.ibm.as400.access.AS400JDBCDriver",
  url:"jdbc:as400://192.168.1.33:446;prompt=false;translate
binary=true;naming=sql;toolbox trace=datastream;trace=true",
  username:"foo",
  password:"bar"}

The trace outputs a message "Failed to read all of the data stream." before
failing.  Full trace below:

Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019  as400:
Driver AS/400 Toolbox for Java JDBC Driver (1235335025) : connect called
with URL: jdbc:as400://192.168.1.33:446;prompt=false;translate
binary=true;naming=sql;toolbox trace=datastream;trace=true.
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : access = "all".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : block size = "32".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : block criteria = "2".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : date format = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : date separator = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : decimal separator = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : errors = "basic".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : extended dynamic = "false".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : libraries = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : naming = "sql".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package add = "true".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package cache = "false".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package clear = "false".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package error = "warning".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package library = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : password = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : prefetch = "true".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : prompt = "false".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : remarks = "system".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : sort = "hex".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : sort language = "ENU".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : sort table = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : sort weight = "shared".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : time format = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : time separator = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : trace = "true".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : transaction isolation = "read
uncommitted".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : 

File does not exist errors across cluster

2018-11-27 Thread Matt Keranen
Have 4 nodes running drillbits version 1.14 for queries over JSON files in
the regular filesystem (not HDFS).

Each node has an identical directory structure, but not all file names
exist on all nodes, and any query in the form of "SELECT ... FROM
dfs.logs.`logs*.json.gz`" fails with:

Error: DATA_READ ERROR: Failure reading JSON file - File
file:/localdata/logs/logs.xhr.json.gz does not exist

where the filename may change, but is always one that exists on some but
not all nodes.

Is there a configuration for Drill where drillbits querying non-distributed
filesystems don't expect all files to exist on all nodes?


File "does not exist" error on non-distributed filesystem cluster

2018-11-27 Thread Matt Keranen
Have 4 nodes running drillbits version 1.14 for queries over JSON files in
the regular filesystem (not HDFS).

Each node has an identical directory structure, but not all file names
exist on all nodes, and any query in the form of "SELECT ... FROM
dfs.logs.`logs*.json.gz`" fails with:

Error: DATA_READ ERROR: Failure reading JSON file - File
file:/localdata/logs/logs.xxx.json.gz does not exist

where the filename may change, but is always one that exists on some but
not all nodes.

Is there a configuration for Drill where drillbits querying non-distributed
filesystems don't expect all files to exist on all nodes?
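
For context, a sketch of the kind of `dfs.logs` workspace the query above refers to, with the location inferred from the error path (treat the exact values as an assumption):

~~~
{
  "type": "file",
  "connection": "file:///",
  "workspaces": {
    "logs": {
      "location": "/localdata/logs",
      "writable": false,
      "defaultInputFormat": null
    }
  }
}
~~~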


Re: Failure while reading messages from kafka

2018-09-04 Thread Matt
https://issues.apache.org/jira/browse/DRILL-6723

On Mon, Aug 27, 2018 at 12:27 PM Matt  wrote:

> I have a Kafka topic with some non-JSON test messages in it, resulting in
> errors "Error: DATA_READ ERROR: Failure while reading messages from kafka.
> Recordreader was at record: 1"
>
> I don't seem to be able to bypass these topic messages with
> "store.json.reader.skip_invalid_records" or even an OFFSET in the query.
>
> Is there a mechanism or setting I can use to query a topic and not fail on
> malformed messages?
>


Failure while reading messages from kafka

2018-08-27 Thread Matt
I have a Kafka topic with some non-JSON test messages in it, resulting in
errors "Error: DATA_READ ERROR: Failure while reading messages from kafka.
Recordreader was at record: 1"

I don't seem to be able to bypass these topic messages with
"store.json.reader.skip_invalid_records" or even an OFFSET in the query.

Is there a mechanism or setting I can use to query a topic and not fail on
malformed messages?
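
For reference, the attempt described above looks roughly like the following (topic name hypothetical):

~~~
ALTER SESSION SET `store.json.reader.skip_invalid_records` = true;
SELECT * FROM kafka.`test_topic` LIMIT 10;
~~~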


Cassandra storage plugin

2018-01-16 Thread Matt
I note there are some old Jira issues about Cassandra storage, and I have a
concept of why such a plugin could be very valuable for Drill. Can anyone
support or refute the idea?

Cassandra is an excellent engine for high volume ingest, but support for
aggregations and scans is very limited. Would a Drill plugin support the
ability to query a Cassandra keyspace using the partition keys, then
perform aggregations within Drill, adding an "analytics layer" in Drill
onto Cassandra?

This seems like a very useful approach, but I don't know the level of effort
involved in getting Calcite to properly determine which parts of the SQL query
passed to Drill can be translated into CQL and pushed down to Cassandra, with
Drill then aggregating to summarize the results.

Is there any active effort in this area?


Re: Drill Summit/Conference Proposal

2017-06-16 Thread Matt K
A counterpoint: I would be concerned that Drill would be overshadowed by
more “popular” or more entrenched platforms.

Drill is an excellent and somewhat unique tech that needs more exposure to
grow. An event that focuses purely on Drill may have better success at that.

The caveat may be that a Drill-only summit would have low turnout and thus
not succeed at the above goals, so I'm not certain what the right choice
is, but I wanted to bring up the point for discussion.


On June 14, 2017 at 2:32:45 PM, Julian Hyde (jh...@apache.org) wrote:

I like the idea of co-hosting a conference. ApacheCon in particular is a
good venue, and they explicitly encourage sub-conferences (there are “Big
Data” and “IoT” tracks, and this year there were sub-conferences for Tomcat
and CloudStack). DrillCon was part of ApacheCon, people could attend a
whole day of Drill talks, or they they could go to talks about other Apache
projects in the larger conference, and connect with Apache members.

Also, the conference is professionally organized, at a large hotel with
good facilities.

Unfortunately ApacheCon just happened (it was in Miami in May); but it’s
something to consider next year.

Julian


> On Jun 14, 2017, at 9:18 AM, Charles Givre  wrote:
>
> Hi Bob,
> Good to hear from you. I agree that there could be value in having a
joint
> Presto/Drill/Redshift conference, but how would you describe the overall
> theme?
>
> In essence (not looking to start a flame war here...) these tools are
> similar in terms of what the user experiences and I can definitely see
> value in bringing the communities together. I also like the idea of
> multiple tracks. I was thinking of having something like
developer/analyst
> tracks.
> -- C
>
> On Wed, Jun 14, 2017 at 11:27 AM, Bob Rudis  wrote:
>
>> I grok this is the Drill list and I'm also a big user of Drill (and
>> have made some UDFs) but there might be some efficacy in expanding the
>> scope to the Presto and Redshift Spectrum communities. I'm not
>> claiming there's 100% equivalence, but the broader view of being able
>> to access multiple types of data sources from a central platform is
>> compelling and -- at least in my circles -- not widely known. It could
>> be an event with a primary central track but three separate
>> specializations that have a couple of intraday time pockets.
>>
>> On Wed, Jun 14, 2017 at 8:55 AM, Charles Givre 
wrote:
>>> Hello fellow Drill users and developers,
>>> I've been involved with the Drill community for some time, and I was
>>> thinking that it might be time to start exploring the idea of a Drill
>>> Summit or Conference. If you're interested, please send me a note and
>> I'll
>>> start having some conversations about what's next.
>>>
>>> Personally, I think it could be extremely valuable to get Drill
>> developers
>>> and users together and share ideas about where things are and how
people
>>> are using Drill.
>>> Thanks!!
>>> -- C
>>


Inequality join error with date range and calendar table

2017-03-20 Thread Matt
Using a calendar table with monthly start and end dates, I am attempting 
to count records in another table that has cycle start and end dates.


In PostgreSQL I would either use a date range type, or in standard SQL 
do something like:


```
SELECT m.startdate as monthdate, COUNT(distinct p.period_id) subs
FROM dfs.test.months m
LEFT JOIN dfs.test.periods p ON p.startdate <= m.enddate AND p.enddate 
>= m.startdate

WHERE m.startdate BETWEEN '2016-01-01' AND '2017-02-01'
GROUP BY p.period_id, m.startdate
ORDER BY p.period_id, m.startdate;
```

This should result in a list of period counts that occur in each month, 
where the date range overlaps the month start and end date.


So far all my attempts with Drill result in **UNSUPPORTED_OPERATION 
ERROR: This query cannot be planned possibly due to either a cartesian 
join or an inequality join**.


Is there an approach that Drill can execute?
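
One possible workaround, offered only as a sketch and not verified against this data: allow the planner to use a nested loop join for the non-equi condition, then group by month:

```
ALTER SESSION SET `planner.enable_nljoin_for_scalar_only` = false;

SELECT m.startdate AS monthdate, COUNT(DISTINCT p.period_id) AS subs
FROM dfs.test.months m
LEFT JOIN dfs.test.periods p
  ON p.startdate <= m.enddate AND p.enddate >= m.startdate
WHERE m.startdate BETWEEN '2016-01-01' AND '2017-02-01'
GROUP BY m.startdate
ORDER BY m.startdate;
```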




Re: How to avoid case sensitivity in group by

2017-02-08 Thread Matt
Drill is not SQL Server, and not expected to work identically. 

Using the upper() and lower() functions is a common approach, unless you find 
options to set the collation sort order in the Drill docs. 
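
For example, a sketch of that normalization (table name hypothetical):

~~~
SELECT lower(`State`) AS state_name, COUNT(*) AS state_count
FROM dfs.tmp.`customers`
GROUP BY lower(`State`);
~~~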

> On Feb 8, 2017, at 1:13 PM, Dechang Gu  wrote:
> 
> Sanjiv,
> 
> Can you share the query and the output you get through sql server?
> 
> 
> Thanks,
> 
> Dechang
> 
> 
> From: Sanjiv Kumar 
> Sent: Wednesday, February 8, 2017 5:01:27 AM
> To: user@drill.apache.org
> Subject: Re: How to avoid case sensitivity in group by
> 
> lower and upper is not the solution I am expecting. I want to display the
> same output as I get through SQL Server Management Studio.
> 
>> On Wed, Feb 8, 2017 at 5:15 PM, Sanjiv Kumar  wrote:
>> 
>> Drill Environment:- OS:- Window 10
>>   Version:- 1.9
>>   Mode:- embedded mode
>> 
>> I have a column named 'State' in the db, which has data like (e.g. Florida,
>> florida, new jersey, etc.).
>> 
>> My problem is that while using count(State) and GROUP BY in a query,
>> Florida and florida are showing as two separate rows.
>> 
>> So how do I avoid case sensitivity so that both Florida and florida act as
>> one row when firing count(State)?
>> 
>> 
>> --
>> Thanks & Regards.
>> Sanjiv Kumar.
>> 
> 
> 
> 
> --
> Thanks & Regards.
> Sanjiv Kumar.


Performance with multiple FLATTENs

2016-07-15 Thread Matt
I have JSON data with a nested list and am using FLATTEN to extract 
two of three list elements as:


~~~
SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ...
~~~

This works, but each FLATTEN seems to slow the query down dramatically, 
3x slower with the second flatten.


Is there a better approach to extracting list elements?

~~~
[
  {
"id": 16,
"data": [
  [
"2016-07-13 00:00",
509,
"OK"
  ],
  [
"2016-07-13 00:01",
461,
"OK"
  ],
  [
"2016-07-13 00:02",
508,
"OK"
  ],
~~~
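
One pattern that may help, offered as an untested sketch (path hypothetical): FLATTEN once in a subquery, then index the flattened list in the outer query:

~~~
SELECT id, rec[0] AS dttm, rec[1] AS result
FROM (
  SELECT id, FLATTEN(data) AS rec
  FROM dfs.`/path/to/data.json`
) t;
~~~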


Re: Drill 1.6 on MapR cluster not using extractHeader ?

2016-04-18 Thread Matt
I found that the dfs storage section for csv file types did not all have 
the extractHeader setting in place. Manually putting it in all four of 
my nodes may have resolved the issue.


In my vanilla Hadoop 2.7.0 setup on the same servers, I don't recall 
having to set it on all nodes.


Did I perhaps miss something in the MapR cluster setup?
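
As an aside, per-query format options can also be passed inline with Drill's table function syntax, which avoids depending on each node's plugin config; a sketch (path hypothetical, parameter names as I recall them from the docs, so treat this as an assumption):

~~~
SELECT customer_id
FROM table(dfs.`/path/to/file.csv`(
  type => 'text', fieldDelimiter => ',', extractHeader => true));
~~~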


On 15 Apr 2016, at 14:16, Abhishek Girish wrote:


Hello,

This is my format setting:

"csv": {
  "type": "text",
  "extensions": [
"csv"
  ],
  "extractHeader": true,
  "delimiter": ","
}

I was able to extract the header and get expected results:



select * from mfs.tmp.`abcd.csv`;

+++++
| A  | B  | C  | D  |
+++++
| 1  | 2  | 3  | 4  |
| 2  | 3  | 4  | 5  |
| 3  | 4  | 5  | 6  |
+++++
3 rows selected (0.196 seconds)


select A from mfs.tmp.`abcd.csv`;

++
| A  |
++
| 1  |
| 2  |
| 3  |
++
3 rows selected (0.16 seconds)

I am using a MapR cluster with Drill 1.6.0. I had also enabled the new text
reader.

Note: My initial query failed to extract the header, similar to what you
reported. I had to set the "skipFirstLine" option to true for it to work.
Strangely, for subsequent queries, it works even after removing / disabling
the "skipFirstLine" option. This could be a bug, but I'm not able to
reproduce it right now. Will file a JIRA once I have more clarity.



Regards,
Abhishek

On Fri, Apr 15, 2016 at 10:53 AM, Matt <bsg...@gmail.com> wrote:

With files in the local filesystem, and an embedded drillbit from the
download on drill.apache.org, I can successfully query csv data by column
name with the extractHeader option on, as in SELECT customer_id FROM `file`;


But in a MapR cluster (v. 5.1.0.37549.GA) with the data in MapR-FS, the
extractHeader option does not seem to be taking effect. A plain "SELECT *"
returns rows with the header as a data row, not in the columns list.

I have verified that exec.storage.enable_new_text_reader is true, and in
both cases csv storage is defined as:

~~~
"csv": {
  "type": "text",
  "extensions": [
"csv"
  ],
  "extractHeader": true,
  "delimiter": ","
}
~~~

Of course with the csv reader not extracting the columns, an attempt to
reference columns by name results in:

Error: DATA_READ ERROR: Selected column 'customer_id' must have name
'columns' or must be plain '*'.

In trying to diagnose the issue, I noted that at times the file header row
was not part of the SELECT * results, but also was not being used to detect
column names.

Both cases are Drill v1.6.0, but the MapR installed version has a
different commit than the standalone copy I am using:

MapR:

~~~
version        : 1.6.0
commit_id      : 2d532bd206d7ae9f3cb703ee7f51ae3764374d43
commit_message : MD-850: Treat the type of decimal literals as DOUBLE only when planner.enable_decimal_data_type is true
commit_time    : 31.03.2016 @ 04:47:25 UTC
build_email    : Unknown
build_time     : 31.03.2016 @ 04:40:54 UTC
~~~

Local:

~~~
version        : 1.6.0
commit_id      : d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb
commit_message : [maven-release-plugin] prepare release drill-1.6.0
commit_time    : 10.03.2016 @ 16:34:37 PST
build_email    : par...@apache.org
build_time     : 10.03.2016 @ 17:45:29 PST
~~~


Drill 1.6 on MapR cluster not using extractHeader ?

2016-04-15 Thread Matt
With files in the local filesystem, and an embedded drillbit from the 
download on drill.apache.org, I can successfully query csv data by 
column name with the extractHeader option on, as in SELECT customer_id 
FROM `file`;


But in a MapR cluster (v. 5.1.0.37549.GA) with the data in MapR-FS, the 
extractHeader option does not seem to be taking effect. A plain "SELECT 
*" returns rows with the header as a data row, not in the columns list.


I have verified that exec.storage.enable_new_text_reader is true, and in 
both cases csv storage is defined as:


~~~
"csv": {
  "type": "text",
  "extensions": [
"csv"
  ],
  "extractHeader": true,
  "delimiter": ","
}
~~~

Of course with the csv reader not extracting the columns, an attempt to 
reference columns by name results in:


Error: DATA_READ ERROR: Selected column 'customer_id' must have name 
'columns' or must be plain '*'.

In trying to diagnose the issue, I noted that at times the file header row 
was not part of the SELECT * results, but also was not being used to detect 
column names.


Both cases are Drill v1.6.0, but the MapR installed version has a 
different commit than the standalone copy I am using:


MapR:

~~~
version        : 1.6.0
commit_id      : 2d532bd206d7ae9f3cb703ee7f51ae3764374d43
commit_message : MD-850: Treat the type of decimal literals as DOUBLE only when planner.enable_decimal_data_type is true
commit_time    : 31.03.2016 @ 04:47:25 UTC
build_email    : Unknown
build_time     : 31.03.2016 @ 04:40:54 UTC
~~~

Local:

~~~
version        : 1.6.0
commit_id      : d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb
commit_message : [maven-release-plugin] prepare release drill-1.6.0
commit_time    : 10.03.2016 @ 16:34:37 PST
build_email    : par...@apache.org
build_time     : 10.03.2016 @ 17:45:29 PST
~~~

Re: NumberFormatException with cast to double?

2016-03-13 Thread Matt
That does indeed remove the NumberFormatException, replacing it with the 
same error on "empty String".


As indicated by Jason in this thread I may have mistakenly believed that 
empty columns would be treated as NULLs.


Is there a setting to specify the NULL placeholder value in the data, or 
should all of those COALESCE calls be replaced with CASE statements? It 
would be a lot of CASEs in this case (39 of them).
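
For illustration, a sketch of the CASE-based transform Jason suggests below, handling the empty string explicitly for one column (the exact form is my assumption):

~~~
CASE WHEN t_total = '' OR t_total IS NULL THEN 0.0
     ELSE CAST(t_total AS double) END AS t_total
~~~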



On 11 Mar 2016, at 0:31, Abdel Hakim Deneche wrote:

Looks like the COALESCE function is the source of the problem. Passing 
a

double (0.0) instead of an int (0) as a second expression solved the
problem for me:

CAST(COALESCE(t_total, 0.0) AS double)


On Fri, Mar 11, 2016 at 12:45 AM, Matt <bsg...@gmail.com> wrote:


~~~
00-01  Project(date_tm=[CAST($23):TIMESTAMP(0)],
id_1=[CAST($11):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE
"ISO-8859-1$en_US$primary"], id_2=[CAST($15):VARCHAR(1) CHARACTER SET
"ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"],
id_3=[CAST($33):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE
"ISO-8859-1$en_US$primary"], b_total=[CAST(CASE(IS NOT NULL($4), $4,
0)):BIGINT], t_total=[CAST(CASE(IS NOT NULL($31), $31, 0)):DOUBLE],
h_total=[CAST(CASE(IS NOT NULL($40), $40, 0)):BIGINT],
b_small=[CAST(CASE(IS NOT NULL($36), $36, 0)):BIGINT],
t_small=[CAST(CASE(IS NOT NULL($14), $14, 0)):DOUBLE],
h_small=[CAST(CASE(IS NOT NULL($38), $38, 0)):BIGINT],
b_18000=[CAST(CASE(IS NOT NULL($32), $32, 0)):BIGINT],
t_18000=[CAST(CASE(IS NOT NULL($24), $24, 0)):DOUBLE],
h_18000=[CAST(CASE(IS NOT NULL($27), $27, 0)):BIGINT],
b_12000=[CAST(CASE(IS NOT NULL($30), $30, 0)):BIGINT],
t_12000=[CAST(CASE(IS NOT NULL($28), $28, 0)):DOUBLE],
h_12000=[CAST(CASE(IS NOT NULL($20), $20, 0)):BIGINT], 
b_6000=[CAST(CASE(IS
NOT NULL($41), $41, 0)):BIGINT], t_6000=[CAST(CASE(IS NOT NULL($37), 
$37,

0)):DOUBLE], h_6000=[CAST(CASE(IS NOT NULL($29), $29, 0)):BIGINT],
b_3000=[CAST(CASE(IS NOT NULL($17), $17, 0)):BIGINT], 
t_3000=[CAST(CASE(IS

NOT NULL($7), $7, 0)):DOUBLE], h_3000=[CAST(CASE(IS NOT NULL($1), $1,
0)):BIGINT], b_2000=[CAST(CASE(IS NOT NULL($26), $26, 0)):BIGINT],
t_2000=[CAST(CASE(IS NOT NULL($34), $34, 0)):DOUBLE], 
h_2000=[CAST(CASE(IS
NOT NULL($10), $10, 0)):BIGINT], b_1500=[CAST(CASE(IS NOT NULL($42), 
$42,

0)):BIGINT], t_1500=[CAST(CASE(IS NOT NULL($13), $13, 0)):DOUBLE],
h_1500=[CAST(CASE(IS NOT NULL($3), $3, 0)):BIGINT], 
b_1250=[CAST(CASE(IS
NOT NULL($21), $21, 0)):BIGINT], t_1250=[CAST(CASE(IS NOT NULL($25), 
$25,

0)):DOUBLE], h_1250=[CAST(CASE(IS NOT NULL($16), $16, 0)):BIGINT],
b_1000=[CAST(CASE(IS NOT NULL($12), $12, 0)):BIGINT], 
t_1000=[CAST(CASE(IS
NOT NULL($19), $19, 0)):DOUBLE], h_1000=[CAST(CASE(IS NOT NULL($6), 
$6,

0)):BIGINT], b_750=[CAST(CASE(IS NOT NULL($9), $9, 0)):BIGINT],
t_750=[CAST(CASE(IS NOT NULL($0), $0, 0)):DOUBLE], 
h_750=[CAST(CASE(IS NOT

NULL($5), $5, 0)):BIGINT], b_500=[CAST(CASE(IS NOT NULL($2), $2,
0)):BIGINT], t_500=[CAST(CASE(IS NOT NULL($8), $8, 0)):DOUBLE],
h_500=[CAST(CASE(IS NOT NULL($39), $39, 0)):BIGINT], 
b_0=[CAST(CASE(IS NOT

NULL($18), $18, 0)):BIGINT], t_0=[CAST(CASE(IS NOT NULL($35), $35,
0)):DOUBLE], EXPR$42=[CAST(CASE(IS NOT NULL($22), $22, 0)):BIGINT])
00-02Scan(groupscan=[EasyGroupScan
[selectionRoot=hdfs://es05:54310/caspr/csv/smalltest.csv, numFiles=1,
columns=[`date_tm`, `id_1`, `id_2`, `id_3`, `b_total`, `t_total`,
`h_total`, `b_small`, `t_small`, `h_small`, `b_18000`, `t_18000`,
`h_18000`, `b_12000`, `t_12000`, `h_12000`, `b_6000`, `t_6000`, 
`h_6000`,

`b_3000`, `t_3000`, `h_3000`, `b_2000`, `t_2000`, `h_2000`, `b_1500`,
`t_1500`, `h_1500`, `b_1250`, `t_1250`, `h_1250`, `b_1000`, `t_1000`,
`h_1000`, `b_750`, `t_750`, `h_750`, `b_500`, `t_500`, `h_500`, 
`b_0`,

`t_0`, `h_0`], files=[hdfs://es05:54310/caspr/csv/smalltest.csv]]])
~~~

~~~
{
  "head" : {
"version" : 1,
"generator" : {
  "type" : "ExplainHandler",
  "info" : ""
},
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ {
  "kind" : "STRING",
  "type" : "SESSION",
  "name" : "store.format",
  "string_val" : "parquet"
}, {
  "kind" : "BOOLEAN",
  "type" : "SESSION",
  "name" : "exec.errors.verbose",
  "bool_val" : true
} ],
"queue" : 0,
"resultMode" : "EXEC"
  },
  "graph" : [ {
"pop" : "fs-scan",
"@id" : 2,
"userName" : "hduser",
"files" : [ "hdfs://es05:54310/caspr/csv/smalltest.csv" ],
"storage" : {
  "type" : "file",
  "enabled" : true,
  "connection" : "hdfs://es05:54310",
  "wo

Re: NumberFormatException with cast to double?

2016-03-10 Thread Matt
) as FLOAT8 )"

}, {
  "ref" : "`h_3000`",
  "expr" : "cast( ( ( if (isnotnull(`h_3000`)  ) then (`h_3000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`b_2000`",
  "expr" : "cast( ( ( if (isnotnull(`b_2000`)  ) then (`b_2000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`t_2000`",
  "expr" : "cast( ( ( if (isnotnull(`t_2000`)  ) then (`t_2000` )  
else (0 )  end  )  ) as FLOAT8 )"

}, {
  "ref" : "`h_2000`",
  "expr" : "cast( ( ( if (isnotnull(`h_2000`)  ) then (`h_2000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`b_1500`",
  "expr" : "cast( ( ( if (isnotnull(`b_1500`)  ) then (`b_1500` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`t_1500`",
  "expr" : "cast( ( ( if (isnotnull(`t_1500`)  ) then (`t_1500` )  
else (0 )  end  )  ) as FLOAT8 )"

}, {
  "ref" : "`h_1500`",
  "expr" : "cast( ( ( if (isnotnull(`h_1500`)  ) then (`h_1500` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`b_1250`",
  "expr" : "cast( ( ( if (isnotnull(`b_1250`)  ) then (`b_1250` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`t_1250`",
  "expr" : "cast( ( ( if (isnotnull(`t_1250`)  ) then (`t_1250` )  
else (0 )  end  )  ) as FLOAT8 )"

}, {
  "ref" : "`h_1250`",
  "expr" : "cast( ( ( if (isnotnull(`h_1250`)  ) then (`h_1250` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`b_1000`",
  "expr" : "cast( ( ( if (isnotnull(`b_1000`)  ) then (`b_1000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`t_1000`",
  "expr" : "cast( ( ( if (isnotnull(`t_1000`)  ) then (`t_1000` )  
else (0 )  end  )  ) as FLOAT8 )"

}
~~~

On 10 Mar 2016, at 18:40, Hsuan Yi Chu wrote:


From the log, I see this one:

StringFunctionHelpers.varCharToInt():

It seems like there might be an implicit casting (to integer) be 
inserted

into your expression.

Can you do EXPLAIN PLAN FOR  "your query" (not double quoted needed)

and share what you see?

On Thu, Mar 10, 2016 at 3:11 PM, Matt <bsg...@gmail.com> wrote:

TRIM() did not help, and I visually verified there is no whitespace around
the number. The entire test data file:

~~~

date_tm,id_1,id_2,id_3,b_total,t_total,h_total,b_small,t_small,h_small,b_18000,t_18000,h_18000,b_12000,t_12000,h_12000,b_6000,t_6000,h_6000,b_3000,t_3000,h_3000,b_2000,t_2000,h_2000,b_1500,t_1500,h_1500,b_1250,t_1250,h_1250,b_1000,t_1000,h_1000,b_750,t_750,h_750,b_500,t_500,h_500,b_0,t_0,h_0
2015-10-17
00:00:00,f5e9v8u2,err,mi1,268918254,140.755,23519,268918254,140.755,23519,
~~~

Including a trim resulted in the same:

~~~
Error: SYSTEM ERROR: NumberFormatException: 140.755

Fragment 0:0

[Error Id: 01d368ee-6c06-476c-a553-5eb5baea7c3f on es07:31010]

  (java.lang.NumberFormatException) 140.755
   org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI():95

org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToInt():120
org.apache.drill.exec.test.generated.ProjectorGen122.doEval():467

org.apache.drill.exec.test.generated.ProjectorGen122.projectRecords():62

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():175
   
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93


org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():415
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745 (state=,code=0)
~~~

On 10 Mar 2016, at 17:49, Jason Altekruse wrote:

Could you try throwing a trim() in before the cast? Might just be
whitespace.

There shouldn't be a need to add the COALESCE statement, as we only ever
read defined values (which may be empty string) out of a CSV file. You
should instead write out a full case statement that checks for empty string
and provides your default value of 0 in that case.

- Jason

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Thu, Mar 10, 2016 at 2:32 PM, Matt <bsg...@gmail.com> wrote:

Have some CSV data that Drill 1.5 selects as-is without any 
problems,

until I attempt to CAST columns in a CTAS or plain SELECT:

Error: SYSTEM ERROR: NumberFormatException: 140.755

The data is unquoted CSV, and column in question does have the 
value

"140.755" (unquoted). As the column can be empty, I am using the
following
transform:

  CAST(COALESCE(t_total, 0) AS double) AS t_total

And on the first data row:

Fragment 1:0

[Error Id: 4b4d83cf-a87d-451e-9cfa-4280e5adf64f on es08:31010]

  (java.lang.NumberFormatException) 140.755

What could be causing this error?





Re: NumberFormatException with cast to double?

2016-03-10 Thread Matt
TRIM() did not help, and I visually verified there is no whitespace 
around the number. The entire test data file:


~~~
date_tm,id_1,id_2,id_3,b_total,t_total,h_total,b_small,t_small,h_small,b_18000,t_18000,h_18000,b_12000,t_12000,h_12000,b_6000,t_6000,h_6000,b_3000,t_3000,h_3000,b_2000,t_2000,h_2000,b_1500,t_1500,h_1500,b_1250,t_1250,h_1250,b_1000,t_1000,h_1000,b_750,t_750,h_750,b_500,t_500,h_500,b_0,t_0,h_0
2015-10-17 
00:00:00,f5e9v8u2,err,mi1,268918254,140.755,23519,268918254,140.755,23519,

~~~

Including a trim resulted in the same:

~~~
Error: SYSTEM ERROR: NumberFormatException: 140.755

Fragment 0:0

[Error Id: 01d368ee-6c06-476c-a553-5eb5baea7c3f on es07:31010]

  (java.lang.NumberFormatException) 140.755
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI():95

org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToInt():120

org.apache.drill.exec.test.generated.ProjectorGen122.doEval():467

org.apache.drill.exec.test.generated.ProjectorGen122.projectRecords():62

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():175

org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132

org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81

org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():415
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745 (state=,code=0)
~~~

On 10 Mar 2016, at 17:49, Jason Altekruse wrote:


Could you try throwing a trim() in before the cast? Might just be
whitespace.

There shouldn't be a need to add the COALESCE statement, as we only ever
read defined values (which may be empty string) out of a CSV file. You
should instead write out a full case statement that checks for empty string
and provides your default value of 0 in that case.

- Jason

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Thu, Mar 10, 2016 at 2:32 PM, Matt <bsg...@gmail.com> wrote:


Have some CSV data that Drill 1.5 selects as-is without any problems,
until I attempt to CAST columns in a CTAS or plain SELECT:

Error: SYSTEM ERROR: NumberFormatException: 140.755

The data is unquoted CSV, and column in question does have the value
"140.755" (unquoted). As the column can be empty, I am using the 
following

transform:

  CAST(COALESCE(t_total, 0) AS double) AS t_total

And on the first data row:

Fragment 1:0

[Error Id: 4b4d83cf-a87d-451e-9cfa-4280e5adf64f on es08:31010]

  (java.lang.NumberFormatException) 140.755

What could be causing this error?




Re: CTAS error with CSV data

2016-01-27 Thread Matt

https://issues.apache.org/jira/browse/DRILL-4317

On 26 Jan 2016, at 23:50, Abdel Hakim Deneche wrote:

This definitely looks like a bug, could you open a JIRA and share as much
information as possible about the structure of the CSV file and the number
of records.

On Tue, Jan 26, 2016 at 7:38 PM, Matt <bsg...@gmail.com> wrote:


The CTAS fails with:

~~~
Error: SYSTEM ERROR: IllegalArgumentException: length: -260 
(expected: >=

0)

Fragment 1:2

[Error Id: 1807615e-4385-4f85-8402-5900aaa568e9 on es07:31010]

(java.lang.IllegalArgumentException) length: -260 (expected: >= 0)
 io.netty.buffer.AbstractByteBuf.checkIndex():1131
 io.netty.buffer.PooledUnsafeDirectByteBuf.nioBuffer():344
 io.netty.buffer.WrappedByteBuf.nioBuffer():727
 io.netty.buffer.UnsafeDirectLittleEndian.nioBuffer():26
 io.netty.buffer.DrillBuf.nioBuffer():356

org.apache.drill.exec.store.ParquetOutputRecordWriter$VarCharParquetConverter.writeField():1842
 org.apache.drill.exec.store.EventBasedRecordWriter.write():62
 
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():106

 org.apache.drill.exec.record.AbstractRecordBatch.next():162
 org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
 org.apache.drill.exec.physical.impl.BaseRootExec.next():94
 org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
 org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
 java.security.AccessController.doPrivileged():-2
 javax.security.auth.Subject.doAs():415
 org.apache.hadoop.security.UserGroupInformation.doAs():1657
 org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
 org.apache.drill.common.SelfCleaningRunnable.run():38
 java.util.concurrent.ThreadPoolExecutor.runWorker():1145
 java.util.concurrent.ThreadPoolExecutor$Worker.run():615
 java.lang.Thread.run():745 (state=,code=0)
~~~

And a simple SELECT * fails with:

~~~
java.lang.IndexOutOfBoundsException: index: 547681, length: 1 
(expected:

range(0, 547681))
 at
io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1134)
 at
io.netty.buffer.PooledUnsafeDirectByteBuf.getBytes(PooledUnsafeDirectByteBuf.java:136)
 at 
io.netty.buffer.WrappedByteBuf.getBytes(WrappedByteBuf.java:289)

 at
io.netty.buffer.UnsafeDirectLittleEndian.getBytes(UnsafeDirectLittleEndian.java:26)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at
org.apache.drill.exec.vector.VarCharVector$Accessor.get(VarCharVector.java:443)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getBytes(VarCharAccessor.java:125)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getString(VarCharAccessor.java:146)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:136)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:94)
 at
org.apache.drill.exec.vector.accessor.BoundCheckingAccessor.getObject(BoundCheckingAccessor.java:148)
 at
org.apache.drill.jdbc.impl.TypeConvertingSqlAccessor.getObject(TypeConvertingSqlAccessor.java:795)
 at
org.apache.drill.jdbc.impl.AvaticaDrillSqlAccessor.getObject(AvaticaDrillSqlAccessor.java:179)
 at
net.hydromatic.avatica.AvaticaResultSet.getObject(AvaticaResultSet.java:351)
 at
org.apache.drill.jdbc.impl.DrillResultSetImpl.getObject(DrillResultSetImpl.java:420)
 at sqlline.Rows$Row.(Rows.java:157)
 at sqlline.IncrementalRows.hasNext(IncrementalRows.java:63)
 at
sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
 at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
 at sqlline.SqlLine.print(SqlLine.java:1593)
 at sqlline.Commands.execute(Commands.java:852)
 at sqlline.Commands.sql(Commands.java:751)
 at sqlline.SqlLine.dispatch(SqlLine.java:746)
 at sqlline.SqlLine.begin(SqlLine.java:621)
 at sqlline.SqlLine.start(SqlLine.java:375)
 at sqlline.SqlLine.main(SqlLine.java:268)
~~~

It also looks like if I run the SELECT from a bash shell as "sqlline 
-u
... -f test.sql 2>&1 > test.out" upon the error the sqlline session 
"locks

up". No errors spool to the out file and the Java thread can only be
terminated with a kill -9. It can be backgrounded with ^z, but won't
respond to a ^c.


On 26 Jan 2016, at 14:07, Abdel Hakim Deneche wrote:

It's an internal buffer index. Can you try enabling verbose errors 
and run
the query again, this should provide us with more details about the 
error.
You can enable verbose error by running the following before the 
select *:


alter session set `exec.errors.verbose`=true;

thanks

On Tue, Jan 26, 2016 at 11:01 AM, Matt <bsg...@gmail.com> wrote:


CTAS error with CSV data

2016-01-26 Thread Matt
Getting some errors when attempting to create Parquet files from CSV 
data, and trying to determine if it is due to the format of the source 
data.


It's a fairly simple format of 
"datetime,key,key,key,numeric,numeric,numeric, ..." with 32 of those 
numeric columns in total.


The source data does contain a lot of missing values for the numeric 
columns, and those are represented as consecutive delimiters: 
"datetime,key,key,key,numeric,,..."


Could this be causing the CTAS to fail with these types of errors? Or is 
there another cause to look for?


~~~
Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: >= 0)

Fragment 1:2
~~~


SELECT * via sqlline -q dumps filenames

2016-01-26 Thread Matt
sqlline -u ... -q 'SELECT * FROM dfs.`/path/to/files/file.csv` LIMIT 10' 
seems to emit a list of files in the local path (pwd), along with a 
parsing error.


Putting the query in a file and passing that file name to sqlline or 
using an explicit column list runs the query as expected.


Is this a sqlline issue, or my improper use of it (quoting) ?




Re: CTAS error with CSV data

2016-01-26 Thread Matt
Putting the "select * from 
`/csv/customer/hourly/customer_20151017.csv`;" in a local .sql file, 
and executing it with sqlline > /dev/null (to avoid a ton of scrolling) 
results in:


~~~
index: 418719, length: 2 (expected: range(0, 418719))
 Aborting 
command set because "force" is false and command failed: "select * from 
`/csv/customer/hourly/customer_20151017.csv`;"

Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl
~~~

Is that index a byte or line offset?


On 26 Jan 2016, at 12:55, Abdel Hakim Deneche wrote:


Does a select * on the same data also fail ?

On Tue, Jan 26, 2016 at 9:44 AM, Matt <bsg...@gmail.com> wrote:

Getting some errors when attempting to create Parquet files from CSV 
data,
and trying to determine if it is due to the format of the source 
data.


It's a fairly simple format of
"datetime,key,key,key,numeric,numeric,numeric, ..." with 32 of those
numeric columns in total.

The source data does contain a lot of missing values for the numeric columns,
and those are represented as consecutive delimiters:
"datetime,key,key,key,numeric,,..."

Could this be causing the CTAS to fail with these types of errors? Or 
is

there another cause to look for?

~~~
Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: >= 0)

Fragment 1:2
~~~





--

Abdelhakim Deneche

Software Engineer

<http://www.mapr.com/>




Re: CTAS error with CSV data

2016-01-26 Thread Matt

The CTAS fails with:

~~~
Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: 
>= 0)


Fragment 1:2

[Error Id: 1807615e-4385-4f85-8402-5900aaa568e9 on es07:31010]

  (java.lang.IllegalArgumentException) length: -260 (expected: >= 0)
io.netty.buffer.AbstractByteBuf.checkIndex():1131
io.netty.buffer.PooledUnsafeDirectByteBuf.nioBuffer():344
io.netty.buffer.WrappedByteBuf.nioBuffer():727
io.netty.buffer.UnsafeDirectLittleEndian.nioBuffer():26
io.netty.buffer.DrillBuf.nioBuffer():356

org.apache.drill.exec.store.ParquetOutputRecordWriter$VarCharParquetConverter.writeField():1842

org.apache.drill.exec.store.EventBasedRecordWriter.write():62

org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():106

org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93

org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():415
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745 (state=,code=0)
~~~

And a simple SELECT * fails with:

~~~
java.lang.IndexOutOfBoundsException: index: 547681, length: 1 (expected: 
range(0, 547681))
at 
io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1134)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.getBytes(PooledUnsafeDirectByteBuf.java:136)
at 
io.netty.buffer.WrappedByteBuf.getBytes(WrappedByteBuf.java:289)
at 
io.netty.buffer.UnsafeDirectLittleEndian.getBytes(UnsafeDirectLittleEndian.java:26)

at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
at 
org.apache.drill.exec.vector.VarCharVector$Accessor.get(VarCharVector.java:443)
at 
org.apache.drill.exec.vector.accessor.VarCharAccessor.getBytes(VarCharAccessor.java:125)
at 
org.apache.drill.exec.vector.accessor.VarCharAccessor.getString(VarCharAccessor.java:146)
at 
org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:136)
at 
org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:94)
at 
org.apache.drill.exec.vector.accessor.BoundCheckingAccessor.getObject(BoundCheckingAccessor.java:148)
at 
org.apache.drill.jdbc.impl.TypeConvertingSqlAccessor.getObject(TypeConvertingSqlAccessor.java:795)
at 
org.apache.drill.jdbc.impl.AvaticaDrillSqlAccessor.getObject(AvaticaDrillSqlAccessor.java:179)
at 
net.hydromatic.avatica.AvaticaResultSet.getObject(AvaticaResultSet.java:351)
at 
org.apache.drill.jdbc.impl.DrillResultSetImpl.getObject(DrillResultSetImpl.java:420)

at sqlline.Rows$Row.(Rows.java:157)
at sqlline.IncrementalRows.hasNext(IncrementalRows.java:63)
at 
sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)

at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
at sqlline.SqlLine.print(SqlLine.java:1593)
at sqlline.Commands.execute(Commands.java:852)
at sqlline.Commands.sql(Commands.java:751)
at sqlline.SqlLine.dispatch(SqlLine.java:746)
at sqlline.SqlLine.begin(SqlLine.java:621)
at sqlline.SqlLine.start(SqlLine.java:375)
at sqlline.SqlLine.main(SqlLine.java:268)
~~~

It also looks like if I run the SELECT from a bash shell as "sqlline -u 
... -f test.sql 2>&1 > test.out" upon the error the sqlline session 
"locks up". No errors spool to the out file and the Java thread can only 
be terminated with a kill -9. It can be backgrounded with ^z, but won't 
respond to a ^c.


On 26 Jan 2016, at 14:07, Abdel Hakim Deneche wrote:

It's an internal buffer index. Can you try enabling verbose errors and 
run
the query again, this should provide us with more details about the 
error.
You can enable verbose error by running the following before the 
select *:


alter session set `exec.errors.verbose`=true;

thanks

On Tue, Jan 26, 2016 at 11:01 AM, Matt <bsg...@gmail.com> wrote:


Putting the "select * from
`/csv/customer/hourly/customer_20151017.csv`;" in a local .sql 
file,
and executing it with sqlline > /dev/null (to avoid a ton of

CTAS plan showing single node?

2016-01-21 Thread Matt
Running a CTAS from csv files in a 4 node HDFS cluster into a Parquet 
file, and I note the physical plan in the Drill UI references scans of 
all the csv sources on a single node.


collectl implies read and write IO on all 4 nodes - does this imply that 
the full cluster is used for scanning the source files, or does my 
configuration have the reads pinned to a single node?


Nodes in cluster: es0{5-8}:

Scan(groupscan=[EasyGroupScan 
[selectionRoot=hdfs://es05:54310/csv/customer, numFiles=33, 
columns=[`*`], 
files=[hdfs://es05:54310/csv/customer/customer_20151001.csv, 
hdfs://es05:54310/csv/customer/customer_20151002.csv, 
hdfs://es05:54310/csv/customer/customer_20151003.csv, 
hdfs://es05:54310/csv/customer/customer_20151004.csv, 
hdfs://es05:54310/csv/customer/customer_20151005.csv, 
hdfs://es05:54310/csv/customer/customer_20151006.csv, 
hdfs://es05:54310/csv/customer/customer_20151007.csv, 
hdfs://es05:54310/csv/customer/customer_20151008.csv, 
hdfs://es05:54310/csv/customer/customer_20151009.csv, 
hdfs://es05:54310/csv/customer/customer_20151010.csv, 
hdfs://es05:54310/csv/customer/customer_20151011.csv, 
hdfs://es05:54310/csv/customer/customer_20151012.csv, 
hdfs://es05:54310/csv/customer/customer_20151013.csv, 
hdfs://es05:54310/csv/customer/customer_20151014.csv, 
hdfs://es05:54310/csv/customer/customer_20151015.csv, 
hdfs://es05:54310/csv/customer/customer_20151016.csv, 
hdfs://es05:54310/csv/customer/customer_20151017.csv, 
hdfs://es05:54310/csv/customer/customer_20151018.csv, 
hdfs://es05:54310/csv/customer/customer_20151019.csv, 
hdfs://es05:54310/csv/customer/customer_20151020.csv, 
hdfs://es05:54310/csv/customer/customer_20151021.csv, 
hdfs://es05:54310/csv/customer/customer_20151022.csv, 
hdfs://es05:54310/csv/customer/customer_20151023.csv, 
hdfs://es05:54310/csv/customer/customer_20151024.csv, 
hdfs://es05:54310/csv/customer/customer_20151025.csv, 
hdfs://es05:54310/csv/customer/customer_20151026.csv, 
hdfs://es05:54310/csv/customer/customer_20151027.csv, 
hdfs://es05:54310/csv/customer/customer_20151028.csv, 
hdfs://es05:54310/csv/customer/customer_20151029.csv, 
hdfs://es05:54310/csv/customer/customer_20151030.csv, 
hdfs://es05:54310/csv/customer/customer_20151031.csv, 
hdfs://es05:54310/csv/customer/customer_20151101.csv, 
hdfs://es05:54310/csv/customer/customer_20151102.csv]]]) : rowType = 
(DrillRecordRow[*]): rowcount = 2.407374395E9, cumulative cost = 
{2.407374395E9 rows, 2.407374395E9 cpu, 0.0 io, 0.0 network, 0.0 
memory}, id = 4355

File size limit for CTAS?

2016-01-21 Thread Matt
Converting CSV files to Parquet with CTAS, and getting errors on some 
larger files:


With a source file of 16.34GB (as reported in the HDFS explorer):

~~~
create table `/parquet/customer_20151017` partition by (date_tm) AS 
select * from `/csv/customer/customer_20151017.csv`;
Error: SYSTEM ERROR: IllegalArgumentException: length: -484 (expected: 
>= 0)


Fragment 1:1

[Error Id: da53d687-a8d5-4927-88ec-e56d5da17112 on es07:31010] 
(state=,code=0)

~~~

But an operation on a 70 MB file of the same format succeeds.

Given some HDFS advice is to avoid large numbers of small files [1], is 
there a general guideline for the max file size to ingest into Parquet 
files with CTAS?


---

[1] HDFS put performance is very poor with a large number of small 
files, thus trying to find the right amount of source rollup to perform. 
Pointers to HDFS configuration guides for beginners would be appreciated 
too. I have only used HDFS for Drill - no other Hadoop experience.


Re: Tableau / filter issues

2015-08-17 Thread Matt
I think Tableau also uses the first query to fetch the structure / 
metadata of the expected result set.


We have often eliminated performance issues using Tableau by hiding 
the structure of queries by putting them in database views. Could that 
be a possible solution here?
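
For illustration, a hedged sketch of that idea in Drill (view name and workspace hypothetical, and `Table` stands in for the real source):

~~~
CREATE VIEW dfs.tmp.`field_values` AS
SELECT field FROM `Table` GROUP BY field;
~~~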


On 17 Aug 2015, at 11:02, Boris Chmiel wrote:


Hello Driller,
I'm facing issues using the Filter function of Tableau with Drill: Tableau
always populates the values of the filter with null. The action is translated
into 2 queries within Drill:

SELECT * FROM (SELECT field FROM Table GROUP BY field ORDER BY field ASC) T LIMIT 0

SELECT field FROM Table GROUP BY field ORDER BY field ASC

It seems that Tableau is fetching the result of the first query (null - which
seems to be generated to catch exceptions) instead of the second one.
Has anyone already met this issue? Is there a workaround or customization
parameter to adjust?
Conf: Tableau v9.0 + MapR Drill ODBC driver with TDC file deployed, Drill v1.1.0


Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Matt

On 23 Jul 2015, at 10:53, Abdel Hakim Deneche wrote:

When you try to read schema-less data, Drill will first investigate the
first 1000 rows to figure out a schema for your data, then it will use this
schema for the remainder of the query.


To clarify, if the JSON schema changes on the 1001st (or 1,000,000th) record, 
is Drill supposed to report an error, or ignore new data elements and only 
consider those discovered in the first 1000 objects?


Sorting and partitioning for range scans?

2015-06-01 Thread Matt
I have seen some discussions on the Parquet storage format suggesting 
that sorting time series data on the time key prior to converting to the 
Parquet format will improve range query efficiency via min/max values on 
column chunks - perhaps analogous to skip indexes?


Is this a recommended approach for data accessed via Drill?

In addition, for data stored in HDFS for Drill that has a regular growth 
rate and is mainly subject to time range queries, is there validity to 
partitioning it by date into subdirectories?


For example, in PostgreSQL, I might partition data tables by month so 
queries including the partition date column hit the proper partitions 
directly (with an extra benefit of space management that does not touch 
all date ranges).


Segmenting data into directories in HDFS would require clients to 
structure queries accordingly, but would there be benefit in reduced 
query time by limiting scan ranges?
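
For reference, a hedged sketch of what the two ideas might look like in 
Drill SQL (workspace, column, and path names are illustrative, and none of 
this is benchmarked):

~~~
-- Sort on the time key during conversion so Parquet row-group min/max
-- statistics line up with later range predicates:
create table dfs.parquet.`customer_201505` as
select event_ts, customer_id, bytes_sent
from dfs.csv.`customer_201505.csv`
order by event_ts;

-- If the Parquet output is arranged as .../customer/<year>/<month>, the
-- implicit dir0/dir1 columns limit the scan to the matching subdirectories:
select customer_id, sum(bytes_sent)
from dfs.parquet.`customer`
where dir0 = '2015' and dir1 = '05'
  and event_ts >= timestamp '2015-05-01 00:00:00'
  and event_ts <  timestamp '2015-05-08 00:00:00'
group by customer_id;
~~~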


Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
That is a good point. The difference between the number of source rows 
and those that made it into the Parquet files is about the same as the 
count written by each of the other fragments.
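
A quick way to quantify that difference, as a sketch (the Parquet table 
name here is illustrative):

~~~
-- rows in the source file
select count(*) from root.`sample_201501.dat`;
-- rows that made it into the Parquet output
select count(*) from dfs.parquet.`sample_201501`;
~~~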


Indeed the query profile does show fragment 1_1 as CANCELED while the 
others all have State FINISHED. Additionally the other fragments have a 
runtime of less than 30 mins, where only fragment 1_1 lasted the 23 
hours before cancellation.



On 28 May 2015, at 16:43, Mehant Baid wrote:

I think the problem might be related to a single laggard, looks like 
we are waiting for one minor fragment to complete. Based on the output 
you provided looks like the fragment 1_1 hasn't completed. You might 
want to find out where the fragment was scheduled and what is going on 
in that node. It might also be useful to look at the profile for that 
minor fragment to see how much data has been processed.



Thanks
Mehant

On 5/28/15 10:57 AM, Matt wrote:

Did you check the log files for any errors?


No messages related to this query contain errors or warnings, nor 
anything mentioning memory or heap. Querying now to determine what is 
missing in the parquet destination.


drillbit.out on the master shows no error messages, and what looks 
like the last relevant line is:


~~~
May 27, 2015 6:43:50 PM INFO: 
parquet.hadoop.ColumnChunkPageWriteStore: written 2,258,263B for 
[bytes_1250] INT64: 3,069,414 values, 24,555,504B raw, 2,257,112B 
comp, 24 pages, encodings: [RLE, PLAIN, BIT_PACKED]
May 27, 2015 6:43:51 PM INFO: parquet.haMay 28, 2015 5:13:42 PM 
org.apache.calcite.sql.validate.SqlValidatorException init

~~~

The final lines in drillbit.log (which appear to use a different time 
format in the log) that contain the profile ID:


~~~
2015-05-27 18:39:49,980 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20: State change requested 
from RUNNING -- FINISHED for
2015-05-27 18:39:49,981 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20. New state: FINISHED
2015-05-27 18:40:05,650 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12: State change requested 
from RUNNING -- FINISHED for
2015-05-27 18:40:05,650 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12. New state: FINISHED
2015-05-27 18:41:57,444 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16: State change requested 
from RUNNING -- FINISHED for
2015-05-27 18:41:57,444 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16. New state: FINISHED
2015-05-27 18:43:25,005 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:8: State change requested from 
RUNNING -- FINISHED for
2015-05-27 18:43:25,005 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:8. New state: FINISHED
2015-05-27 18:43:54,539 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:0] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:0: State change requested from 
RUNNING -- FINISHED for
2015-05-27 18:43:54,540 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:0] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:0. New state: FINISHED
2015-05-27 18:43:59,947 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:4] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:4: State change requested from 
RUNNING -- FINISHED for
2015-05-27 18:43:59,947 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:4] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:4. New state: FINISHED

~~~


On 28 May 2015, at 13:42, Andries Engelbrecht wrote:


It should execute multi threaded, need to check on text file.

Did you check the log files for any errors?


On May 28, 2015, at 10:36 AM, Matt bsg...@gmail.com wrote:

The time seems pretty long for that file size. What type of file 
is it?


Tab delimited UTF-8 text.

I left the query to run overnight to see if it would complete, but 
24 hours for an import like this would indeed be too long.



Is the CTAS running single threaded?


In the first hour, with this being the only client connected to the 
cluster, I observed activity on all 4 nodes.


Is multi-threaded query execution the default? I would not have 
changed anything deliberately to force single thread execution.



On 28 May 2015, at 13:06, Andries Engelbrecht wrote

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt

Bumping memory to:

DRILL_MAX_DIRECT_MEMORY=16G
DRILL_HEAP=8G

The 44GB file imported successfully in 25 minutes - acceptable on this 
hardware.


I don't know if the default memory setting was to blame or not.
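
For reference, the corresponding drill-env.sh entries look roughly like 
this (set per node, followed by a drillbit restart; the values are simply 
the ones used above, not a recommendation):

~~~
# conf/drill-env.sh on each node
DRILL_MAX_DIRECT_MEMORY="16G"   # direct memory available to the drillbit
DRILL_HEAP="8G"                 # JVM heap for the drillbit
~~~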


On 28 May 2015, at 14:22, Andries Engelbrecht wrote:


That is the Drill direct memory per node.

DRILL_HEAP is for the heap size per node.

More info here
http://drill.apache.org/docs/configuring-drill-memory/


—Andries

On May 28, 2015, at 11:09 AM, Matt bsg...@gmail.com wrote:


Referencing http://drill.apache.org/docs/configuring-drill-memory/

Is DRILL_MAX_DIRECT_MEMORY the limit for each node, or the cluster?

The root page on a drillbit at port 8047 list for nodes, with the 16G 
Maximum Direct Memory equal to DRILL_MAX_DIRECT_MEMORY, thus 
uncertain if that is a node or cluster limit.



On 28 May 2015, at 12:23, Jason Altekruse wrote:

That is correct. I guess it could be possible that HDFS might run 
out of
heap, but I'm guessing that is unlikely the cause of the failure you 
are
seeing. We should not be taxing zookeeper enough to be causing any 
issues

there.

On Thu, May 28, 2015 at 9:17 AM, Matt bsg...@gmail.com wrote:

To make sure I am adjusting the correct config, these are heap 
parameters

within the Drill configure path, not for Hadoop or Zookeeper?


On May 28, 2015, at 12:08 PM, Jason Altekruse 
altekruseja...@gmail.com

wrote:


There should be no upper limit on the size of the tables you can 
create

with Drill. Be advised that Drill does currently operate entirely
optimistically in regards to available resources. If a network 
connection

between two drillbits fails during a query, we will not currently
re-schedule the work to make use of remaining nodes and network

connections
that are still live. While we have had a good amount of success 
using

Drill
for data conversion, be aware that these conditions could cause 
long

running queries to fail.

That being said, it isn't the only possible cause for such a 
failure. In
the case of a network failure we would expect to see a message 
returned

to

you that part of the query was unsuccessful and that it had been

cancelled.
Andries has a good suggestion in regards to checking the heap 
memory,

this
should also be detected and reported back to you at the CLI, but 
we may

be
failing to propagate the error back to the head node for the 
query. I
believe writing parquet may still be the most heap-intensive 
operation in
Drill, despite our efforts to refactor the write path to use 
direct

memory
instead of on-heap for large buffers needed in the process of 
creating

parquet files.


On Thu, May 28, 2015 at 8:43 AM, Matt bsg...@gmail.com wrote:

Is 300MM records too much to do in a single CTAS statement?

After almost 23 hours I killed the query (^c) and it returned:

~~~
+---++
| Fragment  | Number of records written  |
+---++
| 1_20  | 13568824   |
| 1_15  | 12411822   |
| 1_7   | 12470329   |
| 1_12  | 13693867   |
| 1_5   | 13292136   |
| 1_18  | 13874321   |
| 1_16  | 13303094   |
| 1_9   | 13639049   |
| 1_10  | 13698380   |
| 1_22  | 13501073   |
| 1_8   | 13533736   |
| 1_2   | 13549402   |
| 1_21  | 13665183   |
| 1_0   | 13544745   |
| 1_4   | 13532957   |
| 1_19  | 12767473   |
| 1_17  | 13670687   |
| 1_13  | 13469515   |
| 1_23  | 12517632   |
| 1_6   | 13634338   |
| 1_14  | 13611322   |
| 1_3   | 13061900   |
| 1_11  | 12760978   |
+---++
23 rows selected (82294.854 seconds)
~~~

The sum of those record counts is  306,772,763 which is close to 
the

320,843,454 in the source file:

~~~
0: jdbc:drill:zk=es05:2181 select count(*)  FROM

root.`sample_201501.dat`;

++
|   EXPR$0   |
++
| 320843454  |
++
1 row selected (384.665 seconds)
~~~


It represents one month of data, 4 key columns and 38 numeric 
measure
columns, which could also be partitioned daily. The test here was 
to

create
monthly Parquet files to see how the min/max stats on Parquet 
chunks

help

with range select performance.

Instead of a small number of large monthly RDBMS tables, I am 
attempting
to determine how many Parquet files should be used with Drill / 
HDFS.





On 27 May 2015, at 15:17, Matt wrote:

Attempting to create a Parquet backed table with a CTAS from an 
44GB tab
delimited file in HDFS. The process seemed to be running, as CPU 
and

IO was
seen on all 4 nodes in this cluster, and .parquet files being 
created

in

the expected

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt

Did you check the log files for any errors?


No messages related to this query contain errors or warnings, nor 
anything mentioning memory or heap. Querying now to determine what is 
missing in the parquet destination.


drillbit.out on the master shows no error messages, and what looks like 
the last relevant line is:


~~~
May 27, 2015 6:43:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
written 2,258,263B for [bytes_1250] INT64: 3,069,414 values, 24,555,504B 
raw, 2,257,112B comp, 24 pages, encodings: [RLE, PLAIN, BIT_PACKED]
May 27, 2015 6:43:51 PM INFO: parquet.haMay 28, 2015 5:13:42 PM 
org.apache.calcite.sql.validate.SqlValidatorException init

~~~

The final lines in drillbit.log (which appear to use a different time 
format in the log) that contain the profile ID:


~~~
2015-05-27 18:39:49,980 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20: State change requested from 
RUNNING -- FINISHED for
2015-05-27 18:39:49,981 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20. New state: FINISHED
2015-05-27 18:40:05,650 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12: State change requested from 
RUNNING -- FINISHED for
2015-05-27 18:40:05,650 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12. New state: FINISHED
2015-05-27 18:41:57,444 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16: State change requested from 
RUNNING -- FINISHED for
2015-05-27 18:41:57,444 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16. New state: FINISHED
2015-05-27 18:43:25,005 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:8: State change requested from 
RUNNING -- FINISHED for
2015-05-27 18:43:25,005 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:8. New state: FINISHED
2015-05-27 18:43:54,539 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:0] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:0: State change requested from 
RUNNING -- FINISHED for
2015-05-27 18:43:54,540 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:0] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:0. New state: FINISHED
2015-05-27 18:43:59,947 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:4] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:4: State change requested from 
RUNNING -- FINISHED for
2015-05-27 18:43:59,947 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:4] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:4. New state: FINISHED

~~~


On 28 May 2015, at 13:42, Andries Engelbrecht wrote:


It should execute multi threaded, need to check on text file.

Did you check the log files for any errors?


On May 28, 2015, at 10:36 AM, Matt bsg...@gmail.com wrote:

The time seems pretty long for that file size. What type of file is 
it?


Tab delimited UTF-8 text.

I left the query to run overnight to see if it would complete, but 24 
hours for an import like this would indeed be too long.



Is the CTAS running single threaded?


In the first hour, with this being the only client connected to the 
cluster, I observed activity on all 4 nodes.


Is multi-threaded query execution the default? I would not have 
changed anything deliberately to force single thread execution.



On 28 May 2015, at 13:06, Andries Engelbrecht wrote:

The time seems pretty long for that file size. What type of file is 
it?


Is the CTAS running single threaded?

—Andries


On May 28, 2015, at 9:37 AM, Matt bsg...@gmail.com wrote:

How large is the data set you are working with, and your 
cluster/nodes?


Just testing with that single 44GB source file currently, and my 
test cluster is made from 4 nodes, each with 8 CPU cores, 32GB RAM, 
a 6TB Ext4 volume (RAID-10).


Drill defaults left as come in v1.0. I will be adjusting memory and 
retrying the CTAS.


I know I can / should assign individual disks to HDFS, but as a 
test cluster there are apps that expect data volumes to work on. A 
dedicated Hadoop production cluster would have a disk layout 
specific to the task.



On 28 May 2015, at 12:26, Andries Engelbrecht wrote:

Just check the drillbit.log and drillbit.out files in the log 
directory.
Before adjusting memory, see if that is an issue first

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
To make sure I am adjusting the correct config, these are heap parameters 
within the Drill configure path, not for Hadoop or Zookeeper?


 On May 28, 2015, at 12:08 PM, Jason Altekruse altekruseja...@gmail.com 
 wrote:
 
 There should be no upper limit on the size of the tables you can create
 with Drill. Be advised that Drill does currently operate entirely
 optimistically in regards to available resources. If a network connection
 between two drillbits fails during a query, we will not currently
 re-schedule the work to make use of remaining nodes and network connections
 that are still live. While we have had a good amount of success using Drill
 for data conversion, be aware that these conditions could cause long
 running queries to fail.
 
 That being said, it isn't the only possible cause for such a failure. In
 the case of a network failure we would expect to see a message returned to
 you that part of the query was unsuccessful and that it had been cancelled.
 Andries has a good suggestion in regards to checking the heap memory, this
 should also be detected and reported back to you at the CLI, but we may be
 failing to propagate the error back to the head node for the query. I
 believe writing parquet may still be the most heap-intensive operation in
 Drill, despite our efforts to refactor the write path to use direct memory
 instead of on-heap for large buffers needed in the process of creating
 parquet files.
 
 On Thu, May 28, 2015 at 8:43 AM, Matt bsg...@gmail.com wrote:
 
 Is 300MM records too much to do in a single CTAS statement?
 
 After almost 23 hours I killed the query (^c) and it returned:
 
 ~~~
 +---++
 | Fragment  | Number of records written  |
 +---++
 | 1_20  | 13568824   |
 | 1_15  | 12411822   |
 | 1_7   | 12470329   |
 | 1_12  | 13693867   |
 | 1_5   | 13292136   |
 | 1_18  | 13874321   |
 | 1_16  | 13303094   |
 | 1_9   | 13639049   |
 | 1_10  | 13698380   |
 | 1_22  | 13501073   |
 | 1_8   | 13533736   |
 | 1_2   | 13549402   |
 | 1_21  | 13665183   |
 | 1_0   | 13544745   |
 | 1_4   | 13532957   |
 | 1_19  | 12767473   |
 | 1_17  | 13670687   |
 | 1_13  | 13469515   |
 | 1_23  | 12517632   |
 | 1_6   | 13634338   |
 | 1_14  | 13611322   |
 | 1_3   | 13061900   |
 | 1_11  | 12760978   |
 +---++
 23 rows selected (82294.854 seconds)
 ~~~
 
 The sum of those record counts is  306,772,763 which is close to the
 320,843,454 in the source file:
 
 ~~~
 0: jdbc:drill:zk=es05:2181 select count(*)  FROM root.`sample_201501.dat`;
 ++
 |   EXPR$0   |
 ++
 | 320843454  |
 ++
 1 row selected (384.665 seconds)
 ~~~
 
 
 It represents one month of data, 4 key columns and 38 numeric measure
 columns, which could also be partitioned daily. The test here was to create
 monthly Parquet files to see how the min/max stats on Parquet chunks help
 with range select performance.
 
 Instead of a small number of large monthly RDBMS tables, I am attempting
 to determine how many Parquet files should be used with Drill / HDFS.
 
 
 
 
 On 27 May 2015, at 15:17, Matt wrote:
 
 Attempting to create a Parquet backed table with a CTAS from a 44GB tab
 delimited file in HDFS. The process seemed to be running, as CPU and IO were
 seen on all 4 nodes in this cluster, and .parquet files were being created in
 the expected path.
 
 However, in the last two hours or so, all nodes have shown near zero CPU or
 IO, and the Last Modified dates on the .parquet files have not changed. The
 same time delay is shown in the Last Progress column in the active fragment
 profile.
 
 What approach can I take to determine what is happening (or not)?
 


Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
I did not note any memory errors or warnings in a quick scan of the logs, but 
to double check, is there a specific log I would find such warnings in?


 On May 28, 2015, at 12:01 PM, Andries Engelbrecht aengelbre...@maprtech.com 
 wrote:
 
 I have used a single CTAS to create tables using parquet with 1.5B rows.
 
 It did consume a lot of heap memory on the Drillbits and I had to increase 
 the heap size. Check your logs to see if you are running out of heap memory.
 
 I used 128MB parquet block size.
 
 This was with Drill 0.9 , so I’m sure 1.0 will be better in this regard.
 
 —Andries
 
 
 
 On May 28, 2015, at 8:43 AM, Matt bsg...@gmail.com wrote:
 
 Is 300MM records too much to do in a single CTAS statement?
 
 After almost 23 hours I killed the query (^c) and it returned:
 
 ~~~
 +---++
 | Fragment  | Number of records written  |
 +---++
 | 1_20  | 13568824   |
 | 1_15  | 12411822   |
 | 1_7   | 12470329   |
 | 1_12  | 13693867   |
 | 1_5   | 13292136   |
 | 1_18  | 13874321   |
 | 1_16  | 13303094   |
 | 1_9   | 13639049   |
 | 1_10  | 13698380   |
 | 1_22  | 13501073   |
 | 1_8   | 13533736   |
 | 1_2   | 13549402   |
 | 1_21  | 13665183   |
 | 1_0   | 13544745   |
 | 1_4   | 13532957   |
 | 1_19  | 12767473   |
 | 1_17  | 13670687   |
 | 1_13  | 13469515   |
 | 1_23  | 12517632   |
 | 1_6   | 13634338   |
 | 1_14  | 13611322   |
 | 1_3   | 13061900   |
 | 1_11  | 12760978   |
 +---++
 23 rows selected (82294.854 seconds)
 ~~~
 
 The sum of those record counts is  306,772,763 which is close to the  
 320,843,454 in the source file:
 
 ~~~
 0: jdbc:drill:zk=es05:2181 select count(*)  FROM root.`sample_201501.dat`;
 ++
 |   EXPR$0   |
 ++
 | 320843454  |
 ++
 1 row selected (384.665 seconds)
 ~~~
 
 
 It represents one month of data, 4 key columns and 38 numeric measure 
 columns, which could also be partitioned daily. The test here was to create 
 monthly Parquet files to see how the min/max stats on Parquet chunks help 
 with range select performance.
 
 Instead of a small number of large monthly RDBMS tables, I am attempting to 
 determine how many Parquet files should be used with Drill / HDFS.
 
 
 
 On 27 May 2015, at 15:17, Matt wrote:
 
 Attempting to create a Parquet backed table with a CTAS from a 44GB tab 
 delimited file in HDFS. The process seemed to be running, as CPU and IO were 
 seen on all 4 nodes in this cluster, and .parquet files were being created in 
 the expected path.
 
 However, in the last two hours or so, all nodes have shown near zero CPU or 
 IO, and the Last Modified dates on the .parquet files have not changed. The 
 same time delay is shown in the Last Progress column in the active fragment 
 profile.
 
 What approach can I take to determine what is happening (or not)?
 


Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
How large is the data set you are working with, and your 
cluster/nodes?


Just testing with that single 44GB source file currently, and my test 
cluster is made from 4 nodes, each with 8 CPU cores, 32GB RAM, a 6TB 
Ext4 volume (RAID-10).


Drill defaults left as come in v1.0. I will be adjusting memory and 
retrying the CTAS.


I know I can / should assign individual disks to HDFS, but as a test 
cluster there are apps that expect data volumes to work on. A dedicated 
Hadoop production cluster would have a disk layout specific to the task.



On 28 May 2015, at 12:26, Andries Engelbrecht wrote:

Just check the drillbit.log and drillbit.out files in the log 
directory.
Before adjusting memory, see if that is an issue first. It was for me, 
but as Jason mentioned there can be other causes as well.


You adjust memory allocation in the drill-env.sh files, and have to 
restart the drill bits.


How large is the data set you are working with, and your 
cluster/nodes?


—Andries


On May 28, 2015, at 9:17 AM, Matt bsg...@gmail.com wrote:

To make sure I am adjusting the correct config, these are heap 
parameters within the Drill configure path, not for Hadoop or 
Zookeeper?



On May 28, 2015, at 12:08 PM, Jason Altekruse 
altekruseja...@gmail.com wrote:


There should be no upper limit on the size of the tables you can 
create

with Drill. Be advised that Drill does currently operate entirely
optimistically in regards to available resources. If a network 
connection

between two drillbits fails during a query, we will not currently
re-schedule the work to make use of remaining nodes and network 
connections
that are still live. While we have had a good amount of success 
using Drill

for data conversion, be aware that these conditions could cause long
running queries to fail.

That being said, it isn't the only possible cause for such a 
failure. In
the case of a network failure we would expect to see a message 
returned to
you that part of the query was unsuccessful and that it had been 
cancelled.
Andries has a good suggestion in regards to checking the heap 
memory, this
should also be detected and reported back to you at the CLI, but we 
may be
failing to propagate the error back to the head node for the query. 
I
believe writing parquet may still be the most heap-intensive 
operation in
Drill, despite our efforts to refactor the write path to use direct 
memory
instead of on-heap for large buffers needed in the process of 
creating

parquet files.


On Thu, May 28, 2015 at 8:43 AM, Matt bsg...@gmail.com wrote:

Is 300MM records too much to do in a single CTAS statement?

After almost 23 hours I killed the query (^c) and it returned:

~~~
+---++
| Fragment  | Number of records written  |
+---++
| 1_20  | 13568824   |
| 1_15  | 12411822   |
| 1_7   | 12470329   |
| 1_12  | 13693867   |
| 1_5   | 13292136   |
| 1_18  | 13874321   |
| 1_16  | 13303094   |
| 1_9   | 13639049   |
| 1_10  | 13698380   |
| 1_22  | 13501073   |
| 1_8   | 13533736   |
| 1_2   | 13549402   |
| 1_21  | 13665183   |
| 1_0   | 13544745   |
| 1_4   | 13532957   |
| 1_19  | 12767473   |
| 1_17  | 13670687   |
| 1_13  | 13469515   |
| 1_23  | 12517632   |
| 1_6   | 13634338   |
| 1_14  | 13611322   |
| 1_3   | 13061900   |
| 1_11  | 12760978   |
+---++
23 rows selected (82294.854 seconds)
~~~

The sum of those record counts is  306,772,763 which is close to 
the

320,843,454 in the source file:

~~~
0: jdbc:drill:zk=es05:2181 select count(*)  FROM 
root.`sample_201501.dat`;

++
|   EXPR$0   |
++
| 320843454  |
++
1 row selected (384.665 seconds)
~~~


It represents one month of data, 4 key columns and 38 numeric 
measure
columns, which could also be partitioned daily. The test here was 
to create
monthly Parquet files to see how the min/max stats on Parquet 
chunks help

with range select performance.

Instead of a small number of large monthly RDBMS tables, I am 
attempting
to determine how many Parquet files should be used with Drill / 
HDFS.





On 27 May 2015, at 15:17, Matt wrote:

Attempting to create a Parquet backed table with a CTAS from an 
44GB tab
delimited file in HDFS. The process seemed to be running, as CPU 
and IO was
seen on all 4 nodes in this cluster, and .parquet files being 
created in

the expected path.

However, in the last two hours or so, all nodes show near zero 
CPU or IO, and the Last Modified date

Re: Query local files on cluster? [Beginner]

2015-05-27 Thread Matt
Drill can process a lot of data quickly, and for best performance and 
consistency you will likely find that the sooner you get the data to 
the DFS the better.


Already most of the way there. Initial confusion came from the features 
to query the local / native filesystem, and how that does not fit a 
distributed Drill cluster well. In other words it's really an embedded / 
single-node Drill feature.


Currently using the approach of doing a put from the local filesystem into 
HDFS, then CTAS into Parquet, if only for simplicity in testing (not 
performance).
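
Roughly, that workflow looks like the following sketch (paths and the 
writable workspace name are illustrative):

~~~
# 1. Copy the delimited file from the local filesystem into HDFS
hadoop fs -mkdir -p /csv/stage
hadoop fs -put /localdata/hadoop/stage/testdata.csv /csv/stage/

# 2. Convert to Parquet from sqlline:
#    create table dfs.parquet.`testdata` as
#    select * from dfs.`/csv/stage/testdata.csv`;
~~~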


Thanks!

On 27 May 2015, at 11:29, Andries Engelbrecht wrote:

You will be better off to use the Drill cluster as a whole vs trying 
to play with local vs DFS storage.


A couple of ideas:
As previously mentioned you can use the robust NFS on MapR to easily 
place the CSV/files on the DFS, and then use Drill with CTAS to 
convert the files to Parquet on the DFS.


You can set up a remote NFS server and map the local FS on each node 
to the same NFS mount point on the remote NFS server; this way the 
files will be consistently available to the Drillbits in the cluster 
and you can do CTAS to create parquet files on the DFS. This however 
will likely be a lot slower than the first option, as the NFS server 
BW will become a bottleneck if you have a number of drillbits in the 
cluster.


Just copy the files to one node in the cluster and then use hadoop fs 
to put the files in the DFS, and then do the CTAS from DFS to parquet 
DFS.


You can even place the data on S3 and then query and CTAS from there, 
however security and bandwidth may be a concern for large data 
volumes, pending the use case.


I really think you will find the first option the most robust and 
fastest in the long run. You can point Drill at any FS source as long 
as it is consistent to all nodes in the cluster, but keep in mind that 
Drill can process a lot of data quickly, and for best performance and 
consistency you will likely find that the sooner you get the data to 
the DFS the better.






On May 26, 2015, at 5:58 PM, Matt bsg...@gmail.com wrote:

Thanks, I am incorrectly conflating the file system with data 
storage.


Looking to experiment with the Parquet format, and was looking at 
CTAS queries as an import approach.


Are direct queries over local files meant for an embedded drill, 
where on a cluster files should be moved into HDFS first?


That would make sense as files on one node would be query bound to 
that local filesystem.


On May 26, 2015, at 8:28 PM, Andries Engelbrecht 
aengelbre...@maprtech.com wrote:


You can use the HDFS shell
hadoop fs -put

To copy from local file system to HDFS


For more robust mechanisms from remote systems you can look at using 
NFS, MapR has a really robust NFS integration and you can use it 
with the community edition.






On May 26, 2015, at 5:11 PM, Matt bsg...@gmail.com wrote:


That might be the end goal, but currently I don't have an HDFS 
ingest mechanism.


We are not currently a Hadoop shop - can you suggest simple 
approaches for bulk loading data from delimited files into HDFS?





On May 26, 2015, at 8:04 PM, Andries Engelbrecht 
aengelbre...@maprtech.com wrote:


Perhaps I’m missing something here.

Why not create a DFS plug in for HDFS and put the file in HDFS?




On May 26, 2015, at 4:54 PM, Matt bsg...@gmail.com wrote:

New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it 
appears text files need to be on all nodes in a cluster?


Using the dfs config below, I am only able to query if a csv file 
is on all 4 nodes. If the file is only on the local node and not 
others, I get errors in the form of:


~~~
0: jdbc:drill:zk=es05:2181 select * from 
root.`customer_reviews_1998.csv`;
Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: 
Table 'root.customer_reviews_1998.csv' not found

~~~

~~~
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/localdata/hadoop/stage",
      "writable": false,
      "defaultInputFormat": null
    },
~~~


On 25 May 2015, at 20:39, Kristine Hahn wrote:

The storage plugin location needs to be the full path to the 
localdata
directory. This partial storage plugin definition works for the 
user named

mapr:

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/home/mapr/localdata",
      "writable": false,
      "defaultInputFormat": null
    },
. . .

Here's a working query for the data in localdata:

0: jdbc:drill: SELECT COLUMNS[0] AS Ngram,
. . . . . . .  COLUMNS[1] AS Publication_Date,
. . . . . . .  COLUMNS[2] AS Frequency
. . . . . . .  FROM dfs.root.`mydata.csv`
. . . . . . .  WHERE ((columns[0] = 'Zoological Journal of the 
Linnean')

. . . . . . .  AND (columns[2] > 250)) LIMIT 10;

A complete example, not yet published on the Drill site, shows in detail 
the steps involved:
http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file


Kristine Hahn
Sr. Technical Writer
415-497-8107 @krishahn

Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt

That might be the end goal, but currently I don't have an HDFS ingest 
mechanism. 

We are not currently a Hadoop shop - can you suggest simple approaches for bulk 
loading data from delimited files into HDFS?

 


 On May 26, 2015, at 8:04 PM, Andries Engelbrecht aengelbre...@maprtech.com 
 wrote:
 
 Perhaps I’m missing something here.
 
 Why not create a DFS plug in for HDFS and put the file in HDFS?
 
 
 
 On May 26, 2015, at 4:54 PM, Matt bsg...@gmail.com wrote:
 
 New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text 
 files need to be on all nodes in a cluster?
 
 Using the dfs config below, I am only able to query if a csv file is on all 
 4 nodes. If the file is only on the local node and not others, I get errors 
 in the form of:
 
 ~~~
 0: jdbc:drill:zk=es05:2181 select * from root.`customer_reviews_1998.csv`;
 Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
 'root.customer_reviews_1998.csv' not found
 ~~~
 
 ~~~
 {
 type: file,
 enabled: true,
 connection: file:///,
 workspaces: {
   root: {
 location: /localdata/hadoop/stage,
 writable: false,
 defaultInputFormat: null
   },
 ~~~
 
 On 25 May 2015, at 20:39, Kristine Hahn wrote:
 
 The storage plugin location needs to be the full path to the localdata
 directory. This partial storage plugin definition works for the user named
 mapr:
 
 {
 type: file,
 enabled: true,
 connection: file:///,
 workspaces: {
 root: {
  location: /home/mapr/localdata,
  writable: false,
  defaultInputFormat: null
 },
 . . .
 
 Here's a working query for the data in localdata:
 
 0: jdbc:drill: SELECT COLUMNS[0] AS Ngram,
 . . . . . . .  COLUMNS[1] AS Publication_Date,
 . . . . . . .  COLUMNS[2] AS Frequency
 . . . . . . .  FROM dfs.root.`mydata.csv`
 . . . . . . .  WHERE ((columns[0] = 'Zoological Journal of the Linnean')
 . . . . . . .  AND (columns[2] > 250)) LIMIT 10;
 
 A complete example, not yet published on the Drill site, shows in detail
 the steps involved:
 http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
 
 
 Kristine Hahn
 Sr. Technical Writer
 415-497-8107 @krishahn
 
 
 On Sun, May 24, 2015 at 1:56 PM, Matt bsg...@gmail.com wrote:
 
 I have used a single node install (unzip and run) to query local text /
 csv files, but on a 3 node cluster (installed via MapR CE), a query with
 local files results in:
 
 ~~~
 sqlline version 1.1.6
 0: jdbc:drill: select * from dfs.`testdata.csv`;
 Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
 Table 'dfs./localdata/testdata.csv' not found
 
 0: jdbc:drill: select * from dfs.`/localdata/testdata.csv`;
 Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
 Table 'dfs./localdata/testdata.csv' not found
 ~~~
 
 Is there a special config for local file querying? An initial doc search
 did not point me to a solution, but I may simply not have found the
 relevant sections.
 
 I have tried modifying the default dfs config to no avail:
 
 ~~~
 type: file,
 enabled: true,
 connection: file:///,
 workspaces: {
 root: {
  location: /localdata,
  writable: false,
  defaultInputFormat: null
 }
 ~~~
 


Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
Thanks, I am incorrectly conflating the file system with data storage. 

Looking to experiment with the Parquet format, and was looking at CTAS queries 
as an import approach.

Are direct queries over local files meant for an embedded drill, where on a 
cluster files should be moved into HDFS first?

That would make sense as files on one node would be query bound to that local 
filesystem. 

 On May 26, 2015, at 8:28 PM, Andries Engelbrecht aengelbre...@maprtech.com 
 wrote:
 
 You can use the HDFS shell
 hadoop fs -put
 
 To copy from local file system to HDFS
 
 
 For more robust mechanisms from remote systems you can look at using NFS, 
 MapR has a really robust NFS integration and you can use it with the 
 community edition.
 
 
 
 
 On May 26, 2015, at 5:11 PM, Matt bsg...@gmail.com wrote:
 
 
 That might be the end goal, but currently I don't have an HDFS ingest 
 mechanism. 
 
 We are not currently a Hadoop shop - can you suggest simple approaches for 
 bulk loading data from delimited files into HDFS?
 
 
 
 
 On May 26, 2015, at 8:04 PM, Andries Engelbrecht 
 aengelbre...@maprtech.com wrote:
 
 Perhaps I’m missing something here.
 
 Why not create a DFS plug in for HDFS and put the file in HDFS?
 
 
 
 On May 26, 2015, at 4:54 PM, Matt bsg...@gmail.com wrote:
 
 New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text 
 files need to be on all nodes in a cluster?
 
 Using the dfs config below, I am only able to query if a csv file is on 
 all 4 nodes. If the file is only on the local node and not others, I get 
 errors in the form of:
 
 ~~~
 0: jdbc:drill:zk=es05:2181 select * from root.`customer_reviews_1998.csv`;
 Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
 'root.customer_reviews_1998.csv' not found
 ~~~
 
 ~~~
 {
 type: file,
 enabled: true,
 connection: file:///,
 workspaces: {
 root: {
   location: /localdata/hadoop/stage,
   writable: false,
   defaultInputFormat: null
 },
 ~~~
 
 On 25 May 2015, at 20:39, Kristine Hahn wrote:
 
 The storage plugin location needs to be the full path to the localdata
 directory. This partial storage plugin definition works for the user named
 mapr:
 
 {
 type: file,
 enabled: true,
 connection: file:///,
 workspaces: {
 root: {
 location: /home/mapr/localdata,
 writable: false,
 defaultInputFormat: null
 },
 . . .
 
 Here's a working query for the data in localdata:
 
 0: jdbc:drill: SELECT COLUMNS[0] AS Ngram,
 . . . . . . .  COLUMNS[1] AS Publication_Date,
 . . . . . . .  COLUMNS[2] AS Frequency
 . . . . . . .  FROM dfs.root.`mydata.csv`
 . . . . . . .  WHERE ((columns[0] = 'Zoological Journal of the Linnean')
 . . . . . . .  AND (columns[2] > 250)) LIMIT 10;
 
 A complete example, not yet published on the Drill site, shows in detail
 the steps involved:
 http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
 
 
 Kristine Hahn
 Sr. Technical Writer
 415-497-8107 @krishahn
 
 
 On Sun, May 24, 2015 at 1:56 PM, Matt bsg...@gmail.com wrote:
 
 I have used a single node install (unzip and run) to query local text /
 csv files, but on a 3 node cluster (installed via MapR CE), a query with
 local files results in:
 
 ~~~
 sqlline version 1.1.6
 0: jdbc:drill: select * from dfs.`testdata.csv`;
 Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
 Table 'dfs./localdata/testdata.csv' not found
 
 0: jdbc:drill: select * from dfs.`/localdata/testdata.csv`;
 Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
 Table 'dfs./localdata/testdata.csv' not found
 ~~~
 
 Is there a special config for local file querying? An initial doc search
 did not point me to a solution, but I may simply not have found the
 relevant sections.
 
 I have tried modifying the default dfs config to no avail:
 
 ~~~
 type: file,
 enabled: true,
 connection: file:///,
 workspaces: {
 root: {
 location: /localdata,
 writable: false,
 defaultInputFormat: null
 }
 ~~~
 


Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears 
text files need to be on all nodes in a cluster?


Using the dfs config below, I am only able to query if a csv file is on 
all 4 nodes. If the file is only on the local node and not others, I get 
errors in the form of:


~~~
0: jdbc:drill:zk=es05:2181 select * from 
root.`customer_reviews_1998.csv`;
Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
'root.customer_reviews_1998.csv' not found

~~~

~~~
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/localdata/hadoop/stage",
      "writable": false,
      "defaultInputFormat": null
    },
~~~

On 25 May 2015, at 20:39, Kristine Hahn wrote:

The storage plugin location needs to be the full path to the 
localdata
directory. This partial storage plugin definition works for the user 
named

mapr:

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/home/mapr/localdata",
      "writable": false,
      "defaultInputFormat": null
    },
. . .

Here's a working query for the data in localdata:

0: jdbc:drill: SELECT COLUMNS[0] AS Ngram,
. . . . . . .  COLUMNS[1] AS Publication_Date,
. . . . . . .  COLUMNS[2] AS Frequency
. . . . . . .  FROM dfs.root.`mydata.csv`
. . . . . . .  WHERE ((columns[0] = 'Zoological Journal of the 
Linnean')

. . . . . . .  AND (columns[2] > 250)) LIMIT 10;

A complete example, not yet published on the Drill site, shows in detail 
the steps involved:
http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file


Kristine Hahn
Sr. Technical Writer
415-497-8107 @krishahn


On Sun, May 24, 2015 at 1:56 PM, Matt bsg...@gmail.com wrote:

I have used a single node install (unzip and run) to query local text 
/
csv files, but on a 3 node cluster (installed via MapR CE), a query 
with

local files results in:

~~~
sqlline version 1.1.6
0: jdbc:drill: select * from dfs.`testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 
17:

Table 'dfs./localdata/testdata.csv' not found

0: jdbc:drill: select * from dfs.`/localdata/testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 
17:

Table 'dfs./localdata/testdata.csv' not found
~~~

Is there a special config for local file querying? An initial doc 
search

did not point me to a solution, but I may simply not have found the
relevant sections.

I have tried modifying the default dfs config to no avail:

~~~
type: file,
enabled: true,
connection: file:///,
workspaces: {
 root: {
   location: /localdata,
   writable: false,
   defaultInputFormat: null
 }
~~~



Re: Query local files on cluster? [Beginner]

2015-05-25 Thread Matt
That does represent something I have not tried yet. Will test as soon as 
I can.


Thanks!

On 25 May 2015, at 20:39, Kristine Hahn wrote:

The storage plugin location needs to be the full path to the 
localdata
directory. This partial storage plugin definition works for the user 
named

mapr:

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/home/mapr/localdata",
      "writable": false,
      "defaultInputFormat": null
    },
. . .

Here's a working query for the data in localdata:

0: jdbc:drill: SELECT COLUMNS[0] AS Ngram,
. . . . . . .  COLUMNS[1] AS Publication_Date,
. . . . . . .  COLUMNS[2] AS Frequency
. . . . . . .  FROM dfs.root.`mydata.csv`
. . . . . . .  WHERE ((columns[0] = 'Zoological Journal of the 
Linnean')

. . . . . . .  AND (columns[2] > 250)) LIMIT 10;

A complete example, not yet published on the Drill site, shows in detail 
the steps involved:
http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file


Kristine Hahn
Sr. Technical Writer
415-497-8107 @krishahn


On Sun, May 24, 2015 at 1:56 PM, Matt bsg...@gmail.com wrote:

I have used a single node install (unzip and run) to query local text 
/
csv files, but on a 3 node cluster (installed via MapR CE), a query 
with

local files results in:

~~~
sqlline version 1.1.6
0: jdbc:drill: select * from dfs.`testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 
17:

Table 'dfs./localdata/testdata.csv' not found

0: jdbc:drill: select * from dfs.`/localdata/testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 
17:

Table 'dfs./localdata/testdata.csv' not found
~~~

Is there a special config for local file querying? An initial doc 
search

did not point me to a solution, but I may simply not have found the
relevant sections.

I have tried modifying the default dfs config to no avail:

~~~
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
  "root": {
    "location": "/localdata",
    "writable": false,
    "defaultInputFormat": null
  }
~~~



Re: How do I make json files les painful

2015-03-19 Thread Matt

Is each file a single json array object?

If so, would converting the files to a format with one line per record be 
a potential solution?


Example using jq (http://stedolan.github.io/jq/): jq -c '.[]'
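
A small sketch of that conversion (file names are invented; assumes each 
file is a single top-level JSON array):

~~~
# input:   [{"a":1,"b":2,"c":3},{"a":1,"b":2,"c":3}, ...]
jq -c '.[]' big_array.json > one_object_per_line.json
# output:  {"a":1,"b":2,"c":3}
#          {"a":1,"b":2,"c":3}
~~~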


On 19 Mar 2015, at 12:22, Jim Bates wrote:


I constantly, constantly, constantly hit this.

I have json files that are just a huge collection of an array of json
objects

example
MyArrayInTheFile:
[{a:1,b:2,c:3},{a:1,b:2,c:3},...]

My issue is in exploring the data, I hit this.

Query failed: Query stopped., Record was too large to copy into 
vector. [

39186288-2e01-408c-b886-dcee0a2c25c5 on maprdemo:31010 ]

I can explore csv, tab, maprdb, hive at fairly large data sets and 
limit
the response to what fits in my system limitations but not json in 
this

format.

The two options I have come up with to move forward are..

1. I strip out 90% of the array values in a file and explore that to 
get
to my view. then go to a larger system and see if I have enough to get 
the

job done.
2. Move to the larger system and explore there taking resources that
don't need to be spent on a science project.

Hoping the smart people have a different option for me,

Jim