Standalone drillbit without Zookeeper?

2020-09-29 Thread Matt Keranen
Is it possible to run a single node drillbit without Zookeeper, as a
"service" without the need for coordination across multiple nodes?

`zk.connect: "local"` is not accepted as the equivalent of "zk=local" with
drill-embedded.


Re: [DISCUSS] Drill Storage Plugins

2019-11-05 Thread Matt
Perhaps an "awesome-drill" repo on GitHub would be a place to backfill the
book, and serve as a central location for things like the list you supplied:

https://github.com/topics/awesome

On Tue, Nov 5, 2019 at 9:14 AM Charles Givre  wrote:

> One more thing:  I've found code for storage plugins (in various states of
> completion) for the following systems:
> DynamoDB (https://github.com/fineoio/drill-dynamo-adapter <
> https://github.com/fineoio/drill-dynamo-adapter>)
> Apache Druid:  (Current Draft PR https://github.com/apache/drill/pull/1888
> )
> Couchbase: (https://github.com/LyleLeo/Apache-Drill-CouchDB-Storage-Plugin
> ) (Author
> said he would consider submitting as PR)
> ElasticSearch: https://github.com/javiercanillas/drill-storage-elastic <
> https://github.com/javiercanillas/drill-storage-elastic>,
> https://github.com/gaoshui87/drill-storage-elastic <
> https://github.com/gaoshui87/drill-storage-elastic>
> Apache Solr
>
> Are there others that anyone knows of?
>
>
> > On Nov 4, 2019, at 10:23 PM, Charles Givre  wrote:
> >
> > Hello all,
> > I've written some UDFs and Format plugins for Drill and I'm interested
> in tackling a storage plugin.  One of my regrets from the Drill book was
> that we didn't get into this topic.  For those of you who have written one,
> my hat's off to you. I wanted to ask if there are any resources or
> tutorials available that you found particularly helpful?  I'm having a
> little trouble figuring out what all the pieces do and how they fit
> together.
> >
> > Does anyone have any ideas about which storage plugins should be implemented?
> Personally, I'd really like to see one for ElasticSearch.
> > Best,
> > -- C
>
>


Re: Problem creating jt400 jdbc connection

2019-07-25 Thread Matt Rabbitt
It works using 9.4 java8 version.  Thanks!

On Thu, Jul 25, 2019 at 12:07 PM  wrote:

> Hi Matt, I tried with 9.4 jt400.rar and it works for me
> with these parameters:
>
> {
>   "type": "jdbc",
>   "driver": "com.ibm.as400.access.AS400JDBCDriver",
>   "url": "jdbc:as400://93.63.150.44;prompt=false",
>   "username": "xxx",
>   "password": "xxx",
>   "caseInsensitiveTableNames": false,
>   "enabled": true
> }
>
> Best
> Alessandro
>
>
> -Messaggio originale-
> Da: Matt Rabbitt 
> Inviato: giovedì 25 luglio 2019 18:02
> A: user@drill.apache.org
> Oggetto: Problem creating jt400 jdbc connection
>
> Is anyone successfully using the jt400 jdbc driver with Drill?  I am trying
> to add a storage plugin but when I go to create it in the web gui I'm
> getting an error:
>
> Please retry: Error while creating / updating storage :
> java.sql.SQLException: Cannot create PoolableConnectionFactory (The
> application requester cannot establish the connection. (Connection was
> dropped unexpectedly.))
>
> I am using jtopen v9.8 jt400.jar, connecting to an AS400 (not totally sure
> what version).  The JDBC driver works in other applications for me like
> DBeaver and jdbcsql.zip.  Config below:
>
> {type: "jdbc",
>   enabled: true,
>   driver: "com.ibm.as400.access.AS400JDBCDriver",
>   url:"jdbc:as400://192.168.1.33:446;prompt=false;translate
> binary=true;naming=sql;toolbox trace=datastream;trace=true",
>   username:"foo",
>   password:"bar"}
>
> The trace outputs a message "Failed to read all of the data stream." before
> failing.  Full trace below:
>
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019  as400:
> Driver AS/400 Toolbox for Java JDBC Driver (1235335025) : connect called
> with URL: jdbc:as400://192.168.1.33:446;prompt=false;translate
> binary=true;naming=sql;toolbox trace=datastream;trace=true.
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : access = "all".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : block size = "32".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : block criteria = "2".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : date format = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : date separator = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : decimal separator = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : errors = "basic".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : extended dynamic = "false".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : libraries = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : naming = "sql".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package add = "true".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package cache = "false".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package clear = "false".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package error = "warning".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : package library = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : password = "".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@setString: Properties  (157835982) : prefetch = "true".
> Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
>  as400@set

Problem creating jt400 jdbc connection

2019-07-25 Thread Matt Rabbitt
Is anyone successfully using the jt400 jdbc driver with Drill?  I am trying
to add a storage plugin but when I go to create it in the web gui I'm
getting an error:

Please retry: Error while creating / updating storage :
java.sql.SQLException: Cannot create PoolableConnectionFactory (The
application requester cannot establish the connection. (Connection was
dropped unexpectedly.))

I am using jtopen v9.8 jt400.jar, connecting to an AS400 (not totally sure
what version).  The JDBC driver works in other applications for me like
DBeaver and jdbcsql.zip.  Config below:

{type: "jdbc",
  enabled: true,
  driver: "com.ibm.as400.access.AS400JDBCDriver",
  url:"jdbc:as400://192.168.1.33:446;prompt=false;translate
binary=true;naming=sql;toolbox trace=datastream;trace=true",
  username:"foo",
  password:"bar"}

The trace outputs a message "Failed to read all of the data stream." before
failing.  Full trace below:

Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019  as400:
Driver AS/400 Toolbox for Java JDBC Driver (1235335025) : connect called
with URL: jdbc:as400://192.168.1.33:446;prompt=false;translate
binary=true;naming=sql;toolbox trace=datastream;trace=true.
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : access = "all".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : block size = "32".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : block criteria = "2".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : date format = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : date separator = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : decimal separator = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : errors = "basic".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : extended dynamic = "false".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : libraries = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : naming = "sql".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package add = "true".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package cache = "false".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package clear = "false".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package error = "warning".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : package library = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : password = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : prefetch = "true".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : prompt = "false".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : remarks = "system".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : sort = "hex".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : sort language = "ENU".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : sort table = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : sort weight = "shared".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : time format = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : time separator = "".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : trace = "true".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : transaction isolation = "read
uncommitted".
Thread[qtp1108639122-114,5,main]  Thu Jul 25 11:57:25:862 EDT 2019
 as400@setString: Properties  (157835982) : translate

File does not exist errors across cluster

2018-11-27 Thread Matt Keranen
Have 4 nodes running drillbits version 1.14 for queries over JSON files in
the regular filesystem (not HDFS).

Each node has an identical directory structure, but not all file names
exist on all nodes, and any query in the form of "SELECT ... FROM
dfs.logs.`logs*.json.gz`" fails with:

Error: DATA_READ ERROR: Failure reading JSON file - File
file:/localdata/logs/logs.xhr.json.gz does not exist

where the filename may change, but is always one that exists on some but
not all nodes.

Is there a configuration for Drill where drillbits querying non-distributed
filesystems don't expect all files to exist on all nodes?


File "does not exist" error on non-distributed filesystem cluster

2018-11-27 Thread Matt Keranen
Have 4 nodes running drillbits version 1.14 for queries over JSON files in
the regular filesystem (not HDFS).

Each node has an identical directory structure, but not all file names
exist on all nodes, and any query in the form of "SELECT ... FROM
dfs.logs.`logs*.json.gz`" fails with:

Error: DATA_READ ERROR: Failure reading JSON file - File
file:/localdata/logs/logs.xxx.json.gz does not exist

where the filename may change, but is always one that exists on some but
not all nodes.

Is there a configuration for Drill where drillbits querying non-distributed
filesystems don't expect all files to exist on all nodes?


Re: Failure while reading messages from kafka

2018-09-04 Thread Matt
https://issues.apache.org/jira/browse/DRILL-6723

On Mon, Aug 27, 2018 at 12:27 PM Matt  wrote:

> I have a Kafka topic with some non-JSON test messages in it, resulting in
> errors "Error: DATA_READ ERROR: Failure while reading messages from kafka.
> Recordreader was at record: 1"
>
> I don't seem to be able to bypass these topic messages with
> "store.json.reader.skip_invalid_records" or even an OFFSET in the query.
>
> Is there a mechanism or setting I can use to query a topic and not fail on
> malformed messages?
>


Failure while reading messages from kafka

2018-08-27 Thread Matt
I have a Kafka topic with some non-JSON test messages in it, resulting in
errors "Error: DATA_READ ERROR: Failure while reading messages from kafka.
Recordreader was at record: 1"

I don't seem to be able to bypass these topic messages with
"store.json.reader.skip_invalid_records" or even an OFFSET in the query.

Is there a mechanism or setting I can use to query a topic and not fail on
malformed messages?
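
For reference, a minimal sketch of the two workarounds described above, assuming a Kafka storage plugin registered as `kafka` and a hypothetical topic named `test_topic`; in this case neither bypassed the malformed messages:

~~~
-- Attempt 1: ask the JSON reader to skip records it cannot parse
ALTER SESSION SET `store.json.reader.skip_invalid_records` = true;

-- Attempt 2: try to step past the bad leading messages with OFFSET
SELECT * FROM kafka.`test_topic` LIMIT 100 OFFSET 10;
~~~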


Cassandra storage plugin

2018-01-16 Thread Matt
I note there are some old Jira issues about Cassandra storage, and have
this concept as to why it could be very valuable for Drill. Can anyone
support or refute the idea?

Cassandra is an excellent engine for high volume ingest, but support for
aggregations and scans is very limited. Would a Drill plugin support the
ability to query a Cassandra keyspace using the partition keys, then
perform aggregations within Drill, adding an "analytics layer" in Drill
onto Cassandra?
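
A sketch of the query shape in mind, assuming a hypothetical `cassandra` storage plugin, keyspace `ks`, and table `readings`; the partition-key predicate is the part that would be pushed down as CQL, while the aggregation would run in Drill:

~~~
SELECT sensor_id, AVG(reading) AS avg_reading
FROM cassandra.ks.readings
WHERE sensor_id = 'abc-123'   -- partition key predicate: candidate for CQL pushdown
GROUP BY sensor_id;           -- aggregation performed by Drill
~~~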

This seems like a very useful approach, but I don't know the level of
effort in getting Calcite to properly determine which part of the SQL query
passed to Drill can be segmented to CQL to push down to Cassandra, and then
aggregate to summarize the results.

Is there any active effort in this area?


Drillbits and duplicate files

2017-11-04 Thread Matt
If multiple Drillbits on different servers are coordinating via Zookeeper,
and some files across the servers are duplicates (with identical
filenames), will the cluster of distributed Drillbits avoid duplicating
data on queries?

I’m specifically interested in aggregating CSV data on multiple servers,
but not in HDFS.


Re: Drill Summit/Conference Proposal

2017-06-16 Thread Matt K
A counterpoint: I would be concerned that Drill would be overshadowed by
more “popular” or more entrenched platforms.

Drill is an excellent and somewhat unique tech that needs more exposure to
grow. An event that focuses purely on Drill may have better success at that.

The caveat may be that a Drill-only summit may have low turnout and thus
not succeed at the above goals, so I'm not certain what the right choice
is, but I wanted to bring up the point for discussion.


On June 14, 2017 at 2:32:45 PM, Julian Hyde (jh...@apache.org) wrote:

I like the idea of co-hosting a conference. ApacheCon in particular is a
good venue, and they explicitly encourage sub-conferences (there are “Big
Data” and “IoT” tracks, and this year there were sub-conferences for Tomcat
and CloudStack). If DrillCon were part of ApacheCon, people could attend a
whole day of Drill talks, or they could go to talks about other Apache
projects in the larger conference, and connect with Apache members.

Also, the conference is professionally organized, at a large hotel with
good facilities.

Unfortunately ApacheCon just happened (it was in Miami in May); but it’s
something to consider next year.

Julian


> On Jun 14, 2017, at 9:18 AM, Charles Givre  wrote:
>
> Hi Bob,
> Good to hear from you. I agree that there could be value in having a
joint
> Presto/Drill/Redshift conference, but how would you describe the overall
> theme?
>
> In essence (not looking to start a flame war here...) these tools are
> similar in terms of what the user experiences and I can definitely see
> value in bringing the communities together. I also like the idea of
> multiple tracks. I was thinking of having something like
developer/analyst
> tracks.
> -- C
>
> On Wed, Jun 14, 2017 at 11:27 AM, Bob Rudis  wrote:
>
>> I grok this is the Drill list and I'm also a big user of Drill (and
>> have made some UDFs) but there might be some efficacy in expanding the
>> scope to the Presto and Redshift Spectrum communities. I'm not
>> claiming there's 100% equivalence, but the broader view of being able
>> to access multiple types of data sources from a central platform is
>> compelling and -- at least in my circles -- not widely known. It could
>> be an event with a primary central track but three separate
>> specializations that have a couple intraday time pockets.
>>
>> On Wed, Jun 14, 2017 at 8:55 AM, Charles Givre 
wrote:
>>> Hello fellow Drill users and developers,
>>> I've been involved with the Drill community for some time, and I was
>>> thinking that it might be time to start exploring the idea of a Drill
>>> Summit or Conference. If you're interested, please send me a note and
>> I'll
>>> start having some conversations about what's next.
>>>
>>> Personally, I think it could be extremely valuable to get Drill
>> developers
>>> and users together and share ideas about where things are and how
people
>>> are using Drill.
>>> Thanks!!
>>> -- C
>>


Drill Cluster without HDFS/MapR-FS?

2017-05-08 Thread Matt
I have seen some posts in the past about Drill nodes mounted "close to 
the data", and am wondering if its possible to use Drill as a cluster 
without HDFS?


Using ZK would not be an issue in itself, and there are apparently 
options like https://github.com/mhausenblas/dromedar


Any experiences with this?


Inequality join error with date range and calendar table

2017-03-20 Thread Matt
Using a calendar table with monthly start and end dates, I am attempting 
to count records in another table that has cycle start and end dates.


In PostgreSQL I would either use a date range type, or in standard SQL 
do something like:


```
SELECT m.startdate as monthdate, COUNT(distinct p.period_id) subs
FROM dfs.test.months m
LEFT JOIN dfs.test.periods p ON p.startdate <= m.enddate AND p.enddate 
>= m.startdate

WHERE m.startdate BETWEEN '2016-01-01' AND '2017-02-01'
GROUP BY p.period_id, m.startdate
ORDER BY p.period_id, m.startdate;
```

This should result in a list of period counts that occur in each month, 
where the date range overlaps the month start and end date.


So far all my attempts with Drill result in **UNSUPPORTED_OPERATION 
ERROR: This query cannot be planned possibly due to either a cartesian 
join or an inequality join**.


Is there an approach that Drill can execute?
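
One avenue sometimes suggested for inequality joins is to let the planner fall back to a nested loop join. A sketch, assuming the `planner.enable_nljoin_for_scalar_only` option is available in your Drill version (and accepting that a nested loop join over large inputs may be slow):

```
ALTER SESSION SET `planner.enable_nljoin_for_scalar_only` = false;

SELECT m.startdate AS monthdate, COUNT(DISTINCT p.period_id) AS subs
FROM dfs.test.months m
LEFT JOIN dfs.test.periods p
  ON p.startdate <= m.enddate AND p.enddate >= m.startdate
WHERE m.startdate BETWEEN '2016-01-01' AND '2017-02-01'
GROUP BY m.startdate
ORDER BY m.startdate;
```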




Re: How to avoid case sensitivity in group by

2017-02-08 Thread Matt
Drill is not SQL Server, and not expected to work identically. 

Using the upper() and lower() functions is a common approach, unless you find 
options to set the collation sort order in the Drill docs. 

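A minimal sketch of that approach against a hypothetical source table, grouping on the upper-cased value so 'Florida' and 'florida' collapse into one row:

~~~
SELECT UPPER(`State`) AS state_name, COUNT(`State`) AS state_count
FROM myplugin.mydb.mytable     -- hypothetical source table
GROUP BY UPPER(`State`);
~~~
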
> On Feb 8, 2017, at 1:13 PM, Dechang Gu  wrote:
> 
> Sanjiv,
> 
> Can you share the query and the output you get through sql server?
> 
> 
> Thanks,
> 
> Dechang
> 
> 
> From: Sanjiv Kumar 
> Sent: Wednesday, February 8, 2017 5:01:27 AM
> To: user@drill.apache.org
> Subject: Re: How to avoid case sensitivity in group by
> 
> lower and upper is not the solution I am expecting. I want to display the
> same output as I get through SQL Server Management Studio.
> 
>> On Wed, Feb 8, 2017 at 5:15 PM, Sanjiv Kumar  wrote:
>> 
>> Drill Environment:- OS:- Window 10
>>   Version:- 1.9
>>   Mode:- embedded mode
>> 
>> I have a column named 'State' in the db, which has data like (e.g. Florida,
>> florida, new jersey, etc.).
>> 
>> My problem is that while using count(State) and GROUP BY in a query,
>> Florida and florida are shown as two separate rows.
>> 
>> So how do I avoid case sensitivity so that both Florida and florida act as
>> one row when firing count(State)?
>> 
>> 
>> --
>> Thanks & Regards.
>> Sanjiv Kumar.
>> 
> 
> 
> 
> --
> Thanks & Regards.
> Sanjiv Kumar.


Re: Sql Server Query taking long time to run

2016-11-24 Thread Matt Keranen
How long do the same queries take to run directly on SQL Server?

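One quick check (a sketch, reusing the query quoted below and assuming the plugin is named demoplugin as shown there) is to look at the plan and see whether the LIMIT is being pushed into the JDBC scan, or whether Drill is pulling the whole table before limiting:

~~~
-- Inspect the physical plan for the slow query
EXPLAIN PLAN FOR
SELECT * FROM demoplugin.dbo.category LIMIT 5;
~~~
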
> On Nov 24, 2016, at 1:15 PM, Sanjiv Kumar  wrote:
> 
> Hello
> I have a SQL Server database with about 1 million rows of data. I am using
> Drill 1.8 on Windows, running in embedded mode.
> 
> While firing a select query, it takes almost 1 hour to run.
> 
> Even when I add a limit, it still takes 1 hour to run.
> 
> I am executing this query:  select * from demoplugin.dbo.category limit 5;
> 
> Even this query takes 1 hour to show output.
> Please suggest a solution.
> 
> -- 
> Thanks & Regards.
> Sanjiv Kumar.


Performance with multiple FLATTENs

2016-07-15 Thread Matt
I have JSON data with a nested list and am using FLATTEN to extract 
two of three list elements as:


~~~
SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ...
~~~

This works, but each FLATTEN seems to slow the query down dramatically, 
3x slower with the second flatten.


Is there a better approach to extracting list elements?

~~~
[
  {
"id": 16,
"data": [
  [
"2016-07-13 00:00",
509,
"OK"
  ],
  [
"2016-07-13 00:01",
461,
"OK"
  ],
  [
"2016-07-13 00:02",
508,
"OK"
  ],
~~~
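
A possible alternative (a sketch, assuming a hypothetical dfs.test.`data.json` source): flatten the list once in a subquery and index into the flattened element, so FLATTEN is evaluated a single time:

~~~
SELECT t.id, t.rec[0] AS dttm, t.rec[1] AS result
FROM (
  SELECT id, FLATTEN(data) AS rec
  FROM dfs.test.`data.json`
) t;
~~~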


Re: Drill 1.6 on MapR cluster not using extractHeader ?

2016-04-18 Thread Matt
I found that the dfs storage section for csv file types did not all have 
the extractHeader setting in place. Manually putting it in all four of 
my nodes may have resolved the issue.


In my vanilla Hadoop 2.7.0 setup on the same servers, I don't recall 
having to set it on all nodes.


Did I perhaps miss something in the MapR cluster setup?


On 15 Apr 2016, at 14:16, Abhishek Girish wrote:


Hello,

This is my format setting:

"csv": {
  "type": "text",
  "extensions": [
"csv"
  ],
  "extractHeader": true,
  "delimiter": ","
}

I was able to extract the header and get expected results:



select * from mfs.tmp.`abcd.csv`;

+----+----+----+----+
| A  | B  | C  | D  |
+----+----+----+----+
| 1  | 2  | 3  | 4  |
| 2  | 3  | 4  | 5  |
| 3  | 4  | 5  | 6  |
+----+----+----+----+
3 rows selected (0.196 seconds)


select A from mfs.tmp.`abcd.csv`;

+----+
| A  |
+----+
| 1  |
| 2  |
| 3  |
+----+
3 rows selected (0.16 seconds)

I am using a MapR cluster with Drill 1.6.0. I had also enabled the new 
text

reader.

Note: My initial query failed to extract header, similar to what you
reported. I had to set the "skipFirstLine" option to true, for it to 
work.
Strangely, for subsequent queries, it works even after removing / 
disabling

the "skipFirstLine" option. This could be a bug, but I'm not able to
reproduce it right now. Will file a JIRA once i have more clarity.



Regards,
Abhishek

On Fri, Apr 15, 2016 at 10:53 AM, Matt  wrote:

With files in the local filesystem, and an embedded drillbit from the
download on drill.apache.org, I can successfully query csv data by column
name with the extractHeader option on, as in SELECT customer_id FROM `file`;


But in a MapR cluster (v. 5.1.0.37549.GA) with the data in MapR-FS, the
extractHeader option does not seem to be taking effect. A plain "SELECT *"
returns rows with the header as a data row, not in the columns list.

I have verified that exec.storage.enable_new_text_reader is true, and in
both cases csv storage is defined as:

~~~
"csv": {
  "type": "text",
  "extensions": [
"csv"
  ],
  "extractHeader": true,
  "delimiter": ","
}
~~~

Of course with the csv reader not extracting the columns, an attempt to
reference columns by name results in:

Error: DATA_READ ERROR: Selected column 'customer_id' must have name
'columns' or must be plain '*'. In trying to diagnose the issue, I noted
that at times the file header row was not part of the SELECT * results,
but also was not being used to detect column names.

Both cases are Drill v1.6.0, but the MapR installed version has a
different commit than the standalone copy I am using:

MapR:

~~~

version: 1.6.0
commit_id: 2d532bd206d7ae9f3cb703ee7f51ae3764374d43
commit_message: MD-850: Treat the type of decimal literals as DOUBLE only when planner.enable_decimal_data_type is true
commit_time: 31.03.2016 @ 04:47:25 UTC
build_email: Unknown
build_time: 31.03.2016 @ 04:40:54 UTC
~~~

Local:

~~~

version: 1.6.0
commit_id: d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb
commit_message: [maven-release-plugin] prepare release drill-1.6.0
commit_time: 10.03.2016 @ 16:34:37 PST
build_email: par...@apache.org
build_time: 10.03.2016 @ 17:45:29 PST
~~~


Drill 1.6 on MapR cluster not using extractHeader ?

2016-04-15 Thread Matt
With files in the local filesystem, and an embedded drillbit from the
download on drill.apache.org, I can successfully query csv data by
column name with the extractHeader option on, as in SELECT customer_id
FROM `file`;


But in a MapR cluster (v. 5.1.0.37549.GA) with the data in MapR-FS, the 
extractHeader option does not seem to be taking effect. A plain "SELECT 
*" returns rows with the header as a data row, not in the columns list.


I have verified that exec.storage.enable_new_text_reader is true, and in 
both cases csv storage is defined as:


~~~
"csv": {
  "type": "text",
  "extensions": [
"csv"
  ],
  "extractHeader": true,
  "delimiter": ","
}
~~~

Of course with the csv reader not extracting the columns, an attempt to 
reference columns by name results in:


Error: DATA_READ ERROR: Selected column 'customer_id' must have name 
'columns' or must be plain '*'. In trying to diagnose the issue, I noted 
that at times the file header row was not part of the SELECT * 
results, but also was not being used to detect column names.


Both cases are Drill v1.6.0, but the MapR installed version has a 
different commit than the standalone copy I am using:


MapR:

~~~
version: 1.6.0
commit_id: 2d532bd206d7ae9f3cb703ee7f51ae3764374d43
commit_message: MD-850: Treat the type of decimal literals as DOUBLE only when planner.enable_decimal_data_type is true
commit_time: 31.03.2016 @ 04:47:25 UTC
build_email: Unknown
build_time: 31.03.2016 @ 04:40:54 UTC
~~~

Local:

~~~
version: 1.6.0
commit_id: d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb
commit_message: [maven-release-plugin] prepare release drill-1.6.0
commit_time: 10.03.2016 @ 16:34:37 PST
build_email: par...@apache.org
build_time: 10.03.2016 @ 17:45:29 PST
~~~

Re: NumberFormatException with cast to double?

2016-03-13 Thread Matt
That does indeed remove the NumberFormatException, replacing it with the 
same error on "empty String".


As indicated by Jason in this thread I may have mistakenly believed that 
empty columns would be treated as NULLs.


Is there a setting to set the NULL value in the data, or should all of 
those COALESCE be replaced with CASE statements? It would be a lot of 
CASEs in this case (39 of them).

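A sketch of what one of those expressions might look like rewritten as a CASE that handles the empty string explicitly, per Jason's suggestion (the exact empty/null handling here is an assumption):

~~~
CAST(CASE WHEN t_total IS NULL OR t_total = '' THEN 0.0
          ELSE t_total END AS DOUBLE) AS t_total
~~~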


On 11 Mar 2016, at 0:31, Abdel Hakim Deneche wrote:

Looks like the COALESCE function is the source of the problem. Passing 
a

double (0.0) instead of an int (0) as a second expression solved the
problem for me:

CAST(COALESCE(t_total, 0.0) AS double)


On Fri, Mar 11, 2016 at 12:45 AM, Matt  wrote:


~~~
00-01  Project(date_tm=[CAST($23):TIMESTAMP(0)],
id_1=[CAST($11):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE
"ISO-8859-1$en_US$primary"], id_2=[CAST($15):VARCHAR(1) CHARACTER SET
"ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"],
id_3=[CAST($33):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE
"ISO-8859-1$en_US$primary"], b_total=[CAST(CASE(IS NOT NULL($4), $4,
0)):BIGINT], t_total=[CAST(CASE(IS NOT NULL($31), $31, 0)):DOUBLE],
h_total=[CAST(CASE(IS NOT NULL($40), $40, 0)):BIGINT],
b_small=[CAST(CASE(IS NOT NULL($36), $36, 0)):BIGINT],
t_small=[CAST(CASE(IS NOT NULL($14), $14, 0)):DOUBLE],
h_small=[CAST(CASE(IS NOT NULL($38), $38, 0)):BIGINT],
b_18000=[CAST(CASE(IS NOT NULL($32), $32, 0)):BIGINT],
t_18000=[CAST(CASE(IS NOT NULL($24), $24, 0)):DOUBLE],
h_18000=[CAST(CASE(IS NOT NULL($27), $27, 0)):BIGINT],
b_12000=[CAST(CASE(IS NOT NULL($30), $30, 0)):BIGINT],
t_12000=[CAST(CASE(IS NOT NULL($28), $28, 0)):DOUBLE],
h_12000=[CAST(CASE(IS NOT NULL($20), $20, 0)):BIGINT], 
b_6000=[CAST(CASE(IS
NOT NULL($41), $41, 0)):BIGINT], t_6000=[CAST(CASE(IS NOT NULL($37), 
$37,

0)):DOUBLE], h_6000=[CAST(CASE(IS NOT NULL($29), $29, 0)):BIGINT],
b_3000=[CAST(CASE(IS NOT NULL($17), $17, 0)):BIGINT], 
t_3000=[CAST(CASE(IS

NOT NULL($7), $7, 0)):DOUBLE], h_3000=[CAST(CASE(IS NOT NULL($1), $1,
0)):BIGINT], b_2000=[CAST(CASE(IS NOT NULL($26), $26, 0)):BIGINT],
t_2000=[CAST(CASE(IS NOT NULL($34), $34, 0)):DOUBLE], 
h_2000=[CAST(CASE(IS
NOT NULL($10), $10, 0)):BIGINT], b_1500=[CAST(CASE(IS NOT NULL($42), 
$42,

0)):BIGINT], t_1500=[CAST(CASE(IS NOT NULL($13), $13, 0)):DOUBLE],
h_1500=[CAST(CASE(IS NOT NULL($3), $3, 0)):BIGINT], 
b_1250=[CAST(CASE(IS
NOT NULL($21), $21, 0)):BIGINT], t_1250=[CAST(CASE(IS NOT NULL($25), 
$25,

0)):DOUBLE], h_1250=[CAST(CASE(IS NOT NULL($16), $16, 0)):BIGINT],
b_1000=[CAST(CASE(IS NOT NULL($12), $12, 0)):BIGINT], 
t_1000=[CAST(CASE(IS
NOT NULL($19), $19, 0)):DOUBLE], h_1000=[CAST(CASE(IS NOT NULL($6), 
$6,

0)):BIGINT], b_750=[CAST(CASE(IS NOT NULL($9), $9, 0)):BIGINT],
t_750=[CAST(CASE(IS NOT NULL($0), $0, 0)):DOUBLE], 
h_750=[CAST(CASE(IS NOT

NULL($5), $5, 0)):BIGINT], b_500=[CAST(CASE(IS NOT NULL($2), $2,
0)):BIGINT], t_500=[CAST(CASE(IS NOT NULL($8), $8, 0)):DOUBLE],
h_500=[CAST(CASE(IS NOT NULL($39), $39, 0)):BIGINT], 
b_0=[CAST(CASE(IS NOT

NULL($18), $18, 0)):BIGINT], t_0=[CAST(CASE(IS NOT NULL($35), $35,
0)):DOUBLE], EXPR$42=[CAST(CASE(IS NOT NULL($22), $22, 0)):BIGINT])
00-02Scan(groupscan=[EasyGroupScan
[selectionRoot=hdfs://es05:54310/caspr/csv/smalltest.csv, numFiles=1,
columns=[`date_tm`, `id_1`, `id_2`, `id_3`, `b_total`, `t_total`,
`h_total`, `b_small`, `t_small`, `h_small`, `b_18000`, `t_18000`,
`h_18000`, `b_12000`, `t_12000`, `h_12000`, `b_6000`, `t_6000`, 
`h_6000`,

`b_3000`, `t_3000`, `h_3000`, `b_2000`, `t_2000`, `h_2000`, `b_1500`,
`t_1500`, `h_1500`, `b_1250`, `t_1250`, `h_1250`, `b_1000`, `t_1000`,
`h_1000`, `b_750`, `t_750`, `h_750`, `b_500`, `t_500`, `h_500`, 
`b_0`,

`t_0`, `h_0`], files=[hdfs://es05:54310/caspr/csv/smalltest.csv]]])
~~~

~~~
{
  "head" : {
"version" : 1,
"generator" : {
  "type" : "ExplainHandler",
  "info" : ""
},
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ {
  "kind" : "STRING",
  "type" : "SESSION",
  "name" : "store.format",
  "string_val" : "parquet"
}, {
  "kind" : "BOOLEAN",
  "type" : "SESSION",
  "name" : "exec.errors.verbose",
  "bool_val" : true
} ],
"queue" : 0,
"resultMode" : "EXEC"
  },
  "graph" : [ {
"pop" : "fs-scan",
"@id" : 2,
"userName" : "hduser",
"files" : [ "hdfs://es05:54310/caspr/csv/smalltest.csv" ],
"storage" : {
  "type" : "file",
  "enabled" : true,
  "connection" : "hdfs://es05:54310",
  "workspaces" : {
   

Re: NumberFormatException with cast to double?

2016-03-10 Thread Matt
) as FLOAT8 )"

}, {
  "ref" : "`h_3000`",
  "expr" : "cast( ( ( if (isnotnull(`h_3000`)  ) then (`h_3000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`b_2000`",
  "expr" : "cast( ( ( if (isnotnull(`b_2000`)  ) then (`b_2000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`t_2000`",
  "expr" : "cast( ( ( if (isnotnull(`t_2000`)  ) then (`t_2000` )  
else (0 )  end  )  ) as FLOAT8 )"

}, {
  "ref" : "`h_2000`",
  "expr" : "cast( ( ( if (isnotnull(`h_2000`)  ) then (`h_2000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`b_1500`",
  "expr" : "cast( ( ( if (isnotnull(`b_1500`)  ) then (`b_1500` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`t_1500`",
  "expr" : "cast( ( ( if (isnotnull(`t_1500`)  ) then (`t_1500` )  
else (0 )  end  )  ) as FLOAT8 )"

}, {
  "ref" : "`h_1500`",
  "expr" : "cast( ( ( if (isnotnull(`h_1500`)  ) then (`h_1500` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`b_1250`",
  "expr" : "cast( ( ( if (isnotnull(`b_1250`)  ) then (`b_1250` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`t_1250`",
  "expr" : "cast( ( ( if (isnotnull(`t_1250`)  ) then (`t_1250` )  
else (0 )  end  )  ) as FLOAT8 )"

}, {
  "ref" : "`h_1250`",
  "expr" : "cast( ( ( if (isnotnull(`h_1250`)  ) then (`h_1250` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`b_1000`",
  "expr" : "cast( ( ( if (isnotnull(`b_1000`)  ) then (`b_1000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`t_1000`",
  "expr" : "cast( ( ( if (isnotnull(`t_1000`)  ) then (`t_1000` )  
else (0 )  end  )  ) as FLOAT8 )"

}
~~~

On 10 Mar 2016, at 18:40, Hsuan Yi Chu wrote:


From the log, I see this one:

StringFunctionHelpers.varCharToInt():

It seems like there might be an implicit cast (to integer) being inserted
into your expression.

Can you do EXPLAIN PLAN FOR "your query" (no double quotes needed)

and share what you see?

On Thu, Mar 10, 2016 at 3:11 PM, Matt  wrote:

TRIM() did not help, and I visually verified there is no whitespace around
the number. The entire test data file:

~~~

date_tm,id_1,id_2,id_3,b_total,t_total,h_total,b_small,t_small,h_small,b_18000,t_18000,h_18000,b_12000,t_12000,h_12000,b_6000,t_6000,h_6000,b_3000,t_3000,h_3000,b_2000,t_2000,h_2000,b_1500,t_1500,h_1500,b_1250,t_1250,h_1250,b_1000,t_1000,h_1000,b_750,t_750,h_750,b_500,t_500,h_500,b_0,t_0,h_0
2015-10-17
00:00:00,f5e9v8u2,err,mi1,268918254,140.755,23519,268918254,140.755,23519,
~~~

Including a trim resulted in the same:

~~~
Error: SYSTEM ERROR: NumberFormatException: 140.755

Fragment 0:0

[Error Id: 01d368ee-6c06-476c-a553-5eb5baea7c3f on es07:31010]

  (java.lang.NumberFormatException) 140.755
   org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI():95

org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToInt():120
org.apache.drill.exec.test.generated.ProjectorGen122.doEval():467

org.apache.drill.exec.test.generated.ProjectorGen122.projectRecords():62

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():175
   
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93


org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():415
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745 (state=,code=0)
~~~

On 10 Mar 2016, at 17:49, Jason Altekruse wrote:

Could you try throwing a trim() in before the cast? Might just be

whitespace.

There shouldn't be a need to add the COALESCE statement, as we only 
ever
read defined values (which may be empty string) out of a CSV file. 
You

should instead write out a full case statement that checks for empty
string
and provides your default value of 0 in that case.

- Jason

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Thu, Mar 10, 2016 at 2:32 PM, Matt  wrote:

Have some CSV data that Drill 1.5 selects as-is without any 
problems,

until I attempt to CAST columns in a CTAS or plain SELECT:

Error: SYSTEM ERROR: NumberFormatException: 140.755

The data is unquoted CSV, and the column in question does have the value
value

"140.755" (unquoted). As the column can be empty, I am using the
following
transform:

  CAST(COALESCE(t_total, 0) AS double) AS t_total

And on the first data row:

Fragment 1:0

[Error Id: 4b4d83cf-a87d-451e-9cfa-4280e5adf64f on es08:31010]

  (java.lang.NumberFormatException) 140.755

What could be causing this error?





Re: NumberFormatException with cast to double?

2016-03-10 Thread Matt
TRIM() did not help, and I visually verified there is no whitespace 
around the number. The entire test data file:


~~~
date_tm,id_1,id_2,id_3,b_total,t_total,h_total,b_small,t_small,h_small,b_18000,t_18000,h_18000,b_12000,t_12000,h_12000,b_6000,t_6000,h_6000,b_3000,t_3000,h_3000,b_2000,t_2000,h_2000,b_1500,t_1500,h_1500,b_1250,t_1250,h_1250,b_1000,t_1000,h_1000,b_750,t_750,h_750,b_500,t_500,h_500,b_0,t_0,h_0
2015-10-17 
00:00:00,f5e9v8u2,err,mi1,268918254,140.755,23519,268918254,140.755,23519,

~~~

Including a trim resulted in the same:

~~~
Error: SYSTEM ERROR: NumberFormatException: 140.755

Fragment 0:0

[Error Id: 01d368ee-6c06-476c-a553-5eb5baea7c3f on es07:31010]

  (java.lang.NumberFormatException) 140.755
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI():95

org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToInt():120

org.apache.drill.exec.test.generated.ProjectorGen122.doEval():467

org.apache.drill.exec.test.generated.ProjectorGen122.projectRecords():62

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():175

org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132

org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81

org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():415
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745 (state=,code=0)
~~~

On 10 Mar 2016, at 17:49, Jason Altekruse wrote:


Could you try throwing a trim() in before the cast? Might just be
whitespace.

There shouldn't be a need to add the COALESCE statement, as we only 
ever

read defined values (which may be empty string) out of a CSV file. You
should instead write out a full case statement that checks for empty 
string

and provides your default value of 0 in that case.

- Jason

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Thu, Mar 10, 2016 at 2:32 PM, Matt  wrote:


Have some CSV data that Drill 1.5 selects as-is without any problems,
until I attempt to CAST columns in a CTAS or plain SELECT:

Error: SYSTEM ERROR: NumberFormatException: 140.755

The data is unquoted CSV, and the column in question does have the value
"140.755" (unquoted). As the column can be empty, I am using the 
following

transform:

  CAST(COALESCE(t_total, 0) AS double) AS t_total

And on the first data row:

Fragment 1:0

[Error Id: 4b4d83cf-a87d-451e-9cfa-4280e5adf64f on es08:31010]

  (java.lang.NumberFormatException) 140.755

What could be causing this error?




NumberFormatException with cast to double?

2016-03-10 Thread Matt
Have some CSV data that Drill 1.5 selects as-is without any problems, 
until I attempt to CAST columns in a CTAS or plain SELECT:


Error: SYSTEM ERROR: NumberFormatException: 140.755

The data is unquoted CSV, and the column in question does have the value 
"140.755" (unquoted). As the column can be empty, I am using the 
following transform:


  CAST(COALESCE(t_total, 0) AS double) AS t_total

And on the first data row:

Fragment 1:0

[Error Id: 4b4d83cf-a87d-451e-9cfa-4280e5adf64f on es08:31010]

  (java.lang.NumberFormatException) 140.755

What could be causing this error?



NegativeArraySizeException on SELECT

2016-03-09 Thread Matt
Testing to see if the problems I had in v1.4 are different in v1.5 
(https://issues.apache.org/jira/browse/DRILL-4317), I ran a simple 
SELECT * on a csv file about 650MB in size:


~~~
$ time sqlline -f test.sql > /dev/null
1/3  alter session set `exec.errors.verbose`=true;
1 row selected (1.201 seconds)
2/3  use dfs.test;
1 row selected (0.177 seconds)
3/3  select * from 
`/csv/customer/hourly/customer_20151017.csv`;

java.lang.NegativeArraySizeException
Aborting command set because "force" is false and command failed: 
"select * from `/csv/customer/hourly/customer_20151017.csv`;"

Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl

real44m50.561s
user41m36.382s
sys 7m18.650s
~~~

That exception does not appear in the drillbit.log local to sqlline; would
I find relevant log information elsewhere?


Should this be a new Jira issue? DRILL-4317 remains untouched.

Re: CTAS error with CSV data

2016-02-04 Thread Matt

Is there any more information I can supply in this issue?

It's a blocker for Drill adoption for us, and my ability to diagnose 
exceptions in Java-based systems is very limited ;)



On 27 Jan 2016, at 14:30, Matt wrote:


https://issues.apache.org/jira/browse/DRILL-4317

On 26 Jan 2016, at 23:50, Abdel Hakim Deneche wrote:

This definitely looks like a bug, could you open a JIRA and share as 
much
information as possible about the structure of the CSV file and the 
number

of records.

On Tue, Jan 26, 2016 at 7:38 PM, Matt  wrote:


The CTAS fails with:

~~~
Error: SYSTEM ERROR: IllegalArgumentException: length: -260 
(expected: >=

0)

Fragment 1:2

[Error Id: 1807615e-4385-4f85-8402-5900aaa568e9 on es07:31010]

(java.lang.IllegalArgumentException) length: -260 (expected: >= 0)
io.netty.buffer.AbstractByteBuf.checkIndex():1131
io.netty.buffer.PooledUnsafeDirectByteBuf.nioBuffer():344
io.netty.buffer.WrappedByteBuf.nioBuffer():727
io.netty.buffer.UnsafeDirectLittleEndian.nioBuffer():26
io.netty.buffer.DrillBuf.nioBuffer():356

org.apache.drill.exec.store.ParquetOutputRecordWriter$VarCharParquetConverter.writeField():1842
org.apache.drill.exec.store.EventBasedRecordWriter.write():62
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():106
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():415
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745 (state=,code=0)
~~~

And a simple SELECT * fails with:

~~~
java.lang.IndexOutOfBoundsException: index: 547681, length: 1 
(expected:

range(0, 547681))
 at
io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1134)
 at
io.netty.buffer.PooledUnsafeDirectByteBuf.getBytes(PooledUnsafeDirectByteBuf.java:136)
 at io.netty.buffer.WrappedByteBuf.getBytes(WrappedByteBuf.java:289)
 at
io.netty.buffer.UnsafeDirectLittleEndian.getBytes(UnsafeDirectLittleEndian.java:26)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at
org.apache.drill.exec.vector.VarCharVector$Accessor.get(VarCharVector.java:443)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getBytes(VarCharAccessor.java:125)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getString(VarCharAccessor.java:146)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:136)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:94)
 at
org.apache.drill.exec.vector.accessor.BoundCheckingAccessor.getObject(BoundCheckingAccessor.java:148)
 at
org.apache.drill.jdbc.impl.TypeConvertingSqlAccessor.getObject(TypeConvertingSqlAccessor.java:795)
 at
org.apache.drill.jdbc.impl.AvaticaDrillSqlAccessor.getObject(AvaticaDrillSqlAccessor.java:179)
 at
net.hydromatic.avatica.AvaticaResultSet.getObject(AvaticaResultSet.java:351)
 at
org.apache.drill.jdbc.impl.DrillResultSetImpl.getObject(DrillResultSetImpl.java:420)
 at sqlline.Rows$Row.(Rows.java:157)
 at sqlline.IncrementalRows.hasNext(IncrementalRows.java:63)
 at
sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
 at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
 at sqlline.SqlLine.print(SqlLine.java:1593)
 at sqlline.Commands.execute(Commands.java:852)
 at sqlline.Commands.sql(Commands.java:751)
 at sqlline.SqlLine.dispatch(SqlLine.java:746)
 at sqlline.SqlLine.begin(SqlLine.java:621)
 at sqlline.SqlLine.start(SqlLine.java:375)
 at sqlline.SqlLine.main(SqlLine.java:268)
~~~

It also looks like the sqlline session "locks up" upon the error when I run
the SELECT from a bash shell as "sqlline -u ... -f test.sql 2>&1 > test.out".
No errors spool to the out file, and the Java thread can only be
terminated with a kill -9. It can be backgrounded with ^z, but won't
respond to a ^c.


On 26 Jan 2016, at 14:07, Abdel Hakim Deneche wrote:

It's an internal buffer index. Can you try enabling verbose errors 
and run
the query again, this should provide us with more details about the 
error.
You can enable verbose error by running the following before the 
select *:


alter session set `exec.errors.verbose`=true;

thanks

On Tue, Jan 2

Re: CTAS error with CSV data

2016-01-27 Thread Matt

https://issues.apache.org/jira/browse/DRILL-4317

On 26 Jan 2016, at 23:50, Abdel Hakim Deneche wrote:

This definitely looks like a bug, could you open a JIRA and share as 
much
information as possible about the structure of the CSV file and the 
number

of records.

On Tue, Jan 26, 2016 at 7:38 PM, Matt  wrote:


The CTAS fails with:

~~~
Error: SYSTEM ERROR: IllegalArgumentException: length: -260 
(expected: >=

0)

Fragment 1:2

[Error Id: 1807615e-4385-4f85-8402-5900aaa568e9 on es07:31010]

(java.lang.IllegalArgumentException) length: -260 (expected: >= 0)
 io.netty.buffer.AbstractByteBuf.checkIndex():1131
 io.netty.buffer.PooledUnsafeDirectByteBuf.nioBuffer():344
 io.netty.buffer.WrappedByteBuf.nioBuffer():727
 io.netty.buffer.UnsafeDirectLittleEndian.nioBuffer():26
 io.netty.buffer.DrillBuf.nioBuffer():356

org.apache.drill.exec.store.ParquetOutputRecordWriter$VarCharParquetConverter.writeField():1842
 org.apache.drill.exec.store.EventBasedRecordWriter.write():62
 
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():106

 org.apache.drill.exec.record.AbstractRecordBatch.next():162
 org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
 org.apache.drill.exec.physical.impl.BaseRootExec.next():94
 org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
 org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
 java.security.AccessController.doPrivileged():-2
 javax.security.auth.Subject.doAs():415
 org.apache.hadoop.security.UserGroupInformation.doAs():1657
 org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
 org.apache.drill.common.SelfCleaningRunnable.run():38
 java.util.concurrent.ThreadPoolExecutor.runWorker():1145
 java.util.concurrent.ThreadPoolExecutor$Worker.run():615
 java.lang.Thread.run():745 (state=,code=0)
~~~

And a simple SELECT * fails with:

~~~
java.lang.IndexOutOfBoundsException: index: 547681, length: 1 
(expected:

range(0, 547681))
 at
io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1134)
 at
io.netty.buffer.PooledUnsafeDirectByteBuf.getBytes(PooledUnsafeDirectByteBuf.java:136)
 at 
io.netty.buffer.WrappedByteBuf.getBytes(WrappedByteBuf.java:289)

 at
io.netty.buffer.UnsafeDirectLittleEndian.getBytes(UnsafeDirectLittleEndian.java:26)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
 at
org.apache.drill.exec.vector.VarCharVector$Accessor.get(VarCharVector.java:443)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getBytes(VarCharAccessor.java:125)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getString(VarCharAccessor.java:146)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:136)
 at
org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:94)
 at
org.apache.drill.exec.vector.accessor.BoundCheckingAccessor.getObject(BoundCheckingAccessor.java:148)
 at
org.apache.drill.jdbc.impl.TypeConvertingSqlAccessor.getObject(TypeConvertingSqlAccessor.java:795)
 at
org.apache.drill.jdbc.impl.AvaticaDrillSqlAccessor.getObject(AvaticaDrillSqlAccessor.java:179)
 at
net.hydromatic.avatica.AvaticaResultSet.getObject(AvaticaResultSet.java:351)
 at
org.apache.drill.jdbc.impl.DrillResultSetImpl.getObject(DrillResultSetImpl.java:420)
 at sqlline.Rows$Row.(Rows.java:157)
 at sqlline.IncrementalRows.hasNext(IncrementalRows.java:63)
 at
sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
 at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
 at sqlline.SqlLine.print(SqlLine.java:1593)
 at sqlline.Commands.execute(Commands.java:852)
 at sqlline.Commands.sql(Commands.java:751)
 at sqlline.SqlLine.dispatch(SqlLine.java:746)
 at sqlline.SqlLine.begin(SqlLine.java:621)
 at sqlline.SqlLine.start(SqlLine.java:375)
 at sqlline.SqlLine.main(SqlLine.java:268)
~~~

It also looks like the sqlline session "locks up" upon the error when I run
the SELECT from a bash shell as "sqlline -u ... -f test.sql 2>&1 > test.out".
No errors spool to the out file, and the Java thread can only be
terminated with a kill -9. It can be backgrounded with ^z, but won't
respond to a ^c.


On 26 Jan 2016, at 14:07, Abdel Hakim Deneche wrote:

It's an internal buffer index. Can you try enabling verbose errors 
and run
the query again, this should provide us with more details about the 
error.
You can enable verbose error by running the following before the 
select *:


alter session set `exec.errors.verbose`=true;

thanks

On Tue, Jan 26, 2016 at 11:01 AM, Matt  wrote:

Putting the "select * from
`/csv/customer/

Re: CTAS error with CSV data

2016-01-26 Thread Matt

The CTAS fails with:

~~~
Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: 
>= 0)


Fragment 1:2

[Error Id: 1807615e-4385-4f85-8402-5900aaa568e9 on es07:31010]

  (java.lang.IllegalArgumentException) length: -260 (expected: >= 0)
io.netty.buffer.AbstractByteBuf.checkIndex():1131
io.netty.buffer.PooledUnsafeDirectByteBuf.nioBuffer():344
io.netty.buffer.WrappedByteBuf.nioBuffer():727
io.netty.buffer.UnsafeDirectLittleEndian.nioBuffer():26
io.netty.buffer.DrillBuf.nioBuffer():356

org.apache.drill.exec.store.ParquetOutputRecordWriter$VarCharParquetConverter.writeField():1842

org.apache.drill.exec.store.EventBasedRecordWriter.write():62

org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():106

org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93

org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():415
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745 (state=,code=0)
~~~

And a simple SELECT * fails with:

~~~
java.lang.IndexOutOfBoundsException: index: 547681, length: 1 (expected: 
range(0, 547681))
at 
io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1134)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.getBytes(PooledUnsafeDirectByteBuf.java:136)
at 
io.netty.buffer.WrappedByteBuf.getBytes(WrappedByteBuf.java:289)
at 
io.netty.buffer.UnsafeDirectLittleEndian.getBytes(UnsafeDirectLittleEndian.java:26)

at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
at 
org.apache.drill.exec.vector.VarCharVector$Accessor.get(VarCharVector.java:443)
at 
org.apache.drill.exec.vector.accessor.VarCharAccessor.getBytes(VarCharAccessor.java:125)
at 
org.apache.drill.exec.vector.accessor.VarCharAccessor.getString(VarCharAccessor.java:146)
at 
org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:136)
at 
org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:94)
at 
org.apache.drill.exec.vector.accessor.BoundCheckingAccessor.getObject(BoundCheckingAccessor.java:148)
at 
org.apache.drill.jdbc.impl.TypeConvertingSqlAccessor.getObject(TypeConvertingSqlAccessor.java:795)
at 
org.apache.drill.jdbc.impl.AvaticaDrillSqlAccessor.getObject(AvaticaDrillSqlAccessor.java:179)
at 
net.hydromatic.avatica.AvaticaResultSet.getObject(AvaticaResultSet.java:351)
at 
org.apache.drill.jdbc.impl.DrillResultSetImpl.getObject(DrillResultSetImpl.java:420)

at sqlline.Rows$Row.(Rows.java:157)
at sqlline.IncrementalRows.hasNext(IncrementalRows.java:63)
at 
sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)

at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
at sqlline.SqlLine.print(SqlLine.java:1593)
at sqlline.Commands.execute(Commands.java:852)
at sqlline.Commands.sql(Commands.java:751)
at sqlline.SqlLine.dispatch(SqlLine.java:746)
at sqlline.SqlLine.begin(SqlLine.java:621)
at sqlline.SqlLine.start(SqlLine.java:375)
at sqlline.SqlLine.main(SqlLine.java:268)
~~~

It also looks like the sqlline session "locks up" upon the error when I run
the SELECT from a bash shell as "sqlline -u ... -f test.sql 2>&1 > test.out".
No errors spool to the out file, and the Java thread can only be
terminated with a kill -9. It can be backgrounded with ^z, but won't
respond to a ^c.


On 26 Jan 2016, at 14:07, Abdel Hakim Deneche wrote:

It's an internal buffer index. Can you try enabling verbose errors and 
run
the query again, this should provide us with more details about the 
error.
You can enable verbose error by running the following before the 
select *:


alter session set `exec.errors.verbose`=true;

thanks

On Tue, Jan 26, 2016 at 11:01 AM, Matt  wrote:


Putting the "select * from
`/csv/customer/hourly/customer_20151017.csv`;" in a local .sql 
file,
and executing it with sqlline > /dev/null (to avoid a ton of 
scrolling)

r

Re: CTAS error with CSV data

2016-01-26 Thread Matt
Putting the "select * from 
`/csv/customer/hourly/customer_20151017.csv`;" in a local .sql file, 
and executing it with sqlline > /dev/null (to avoid a ton of scrolling) 
results in:


~~~
index: 418719, length: 2 (expected: range(0, 418719))
Aborting command set because "force" is false and command failed: "select * from
`/csv/customer/hourly/customer_20151017.csv`;"

Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl
~~~

Is that index a byte or line offset?
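
To narrow that down I can re-run with verbose errors and a small LIMIT first - a rough sketch, assuming the option name is unchanged in this build:

~~~
alter session set `exec.errors.verbose` = true;
select * from `/csv/customer/hourly/customer_20151017.csv` limit 1000;
~~~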


On 26 Jan 2016, at 12:55, Abdel Hakim Deneche wrote:


Does a select * on the same data also fail ?

On Tue, Jan 26, 2016 at 9:44 AM, Matt  wrote:

Getting some errors when attempting to create Parquet files from CSV 
data,
and trying to determine if it is due to the format of the source 
data.


It's a fairly simple format of
"datetime,key,key,key,numeric,numeric,numeric, ..." with 32 of those
numeric columns in total.

The source data does contain a lot of missing values for the numeric
columns, and those are represented as consecutive delimiters:
"datetime,key,key,key,numeric,,..."

Could this be causing the CTAS to fail with these types of errors? Or 
is

there another cause to look for?

~~~
Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: >= 0)

Fragment 1:2
~~~





--

Abdelhakim Deneche

Software Engineer

<http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>


SELECT * via sqlline -q dumps filenames

2016-01-26 Thread Matt
sqlline -u ... -q 'SELECT * FROM dfs.`/path/to/files/file.csv` LIMIT 10' 
seems to emit a list of files in the local path (pwd), along with a 
parsing error.


Putting the query in a file and passing that file name to sqlline or 
using an explicit column list runs the query as expected.


Is this a sqlline issue, or my improper use of it (quoting)?




CTAS error with CSV data

2016-01-26 Thread Matt
Getting some errors when attempting to create Parquet files from CSV 
data, and trying to determine if it is due to the format of the source 
data.


It's a fairly simple format of
"datetime,key,key,key,numeric,numeric,numeric, ..." with 32 of those
numeric columns in total.


The source data does contain a lot of missing values for the numeric
columns, and those are represented as consecutive delimiters:
"datetime,key,key,key,numeric,,..."


Could this be causing the CTAS to fail with these types of errors? Or is 
there another cause to look for?


~~~
Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: >= 0)

Fragment 1:2
~~~
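
If the empty numeric fields turn out to be the culprit, an alternative is to cast them explicitly in the CTAS instead of relying on type discovery - only a sketch, assuming the text reader exposes the fields through the columns[] array, that the empty string is the only missing-value marker, that the datetime string is in a format CAST accepts, and with made-up column positions and names:

~~~
create table `/parquet/customer_test` as
select
  cast(columns[0] as timestamp) as event_ts,
  columns[1]                    as cust_key,
  case when columns[4] = '' then null
       else cast(columns[4] as double) end as metric1
from `/csv/customer/customer_20151017.csv`;
~~~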


File size limit for CTAS?

2016-01-21 Thread Matt
Converting CSV files to Parquet with CTAS, and getting errors on some 
larger files:


With a source file of 16.34GB (as reported in the HDFS explorer):

~~~
create table `/parquet/customer_20151017` partition by (date_tm) AS 
select * from `/csv/customer/customer_20151017.csv`;
Error: SYSTEM ERROR: IllegalArgumentException: length: -484 (expected: >= 0)


Fragment 1:1

[Error Id: da53d687-a8d5-4927-88ec-e56d5da17112 on es07:31010] 
(state=,code=0)

~~~

But the same operation on a 70 MB file of the same format succeeds.

Given that common HDFS advice is to avoid large numbers of small files [1],
is there a general guideline for the maximum file size to ingest into
Parquet files with CTAS?


---

[1] HDFS put performance is very poor with a large number of small
files, so I am trying to find the right amount of source rollup to perform.
Pointers to HDFS configuration guides for beginners would be appreciated 
too. I have only used HDFS for Drill - no other Hadoop experience.
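
One workaround I may try is splitting the conversion into a few smaller CTAS statements, each filtered to a slice of the day - a sketch only, assuming the timestamp column is exposed as date_tm as in the CTAS above and that CAST accepts its format:

~~~
create table `/parquet/customer_20151017_am` partition by (date_tm) as
select * from `/csv/customer/customer_20151017.csv`
where cast(date_tm as timestamp) < timestamp '2015-10-17 12:00:00';

create table `/parquet/customer_20151017_pm` partition by (date_tm) as
select * from `/csv/customer/customer_20151017.csv`
where cast(date_tm as timestamp) >= timestamp '2015-10-17 12:00:00';
~~~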


CTAS plan showing single node?

2016-01-21 Thread Matt
I am running a CTAS from csv files in a 4 node HDFS cluster into a Parquet
file, and I note that the physical plan in the Drill UI references scans of
all the csv sources on a single node.


collectl shows read and write IO on all 4 nodes - does this mean that
the full cluster is used for scanning the source files, or does my
configuration have the reads pinned to a single node?


Nodes in cluster: es0{5-8}:

Scan(groupscan=[EasyGroupScan 
[selectionRoot=hdfs://es05:54310/csv/customer, numFiles=33, 
columns=[`*`], 
files=[hdfs://es05:54310/csv/customer/customer_20151001.csv, 
hdfs://es05:54310/csv/customer/customer_20151002.csv, 
hdfs://es05:54310/csv/customer/customer_20151003.csv, 
hdfs://es05:54310/csv/customer/customer_20151004.csv, 
hdfs://es05:54310/csv/customer/customer_20151005.csv, 
hdfs://es05:54310/csv/customer/customer_20151006.csv, 
hdfs://es05:54310/csv/customer/customer_20151007.csv, 
hdfs://es05:54310/csv/customer/customer_20151008.csv, 
hdfs://es05:54310/csv/customer/customer_20151009.csv, 
hdfs://es05:54310/csv/customer/customer_20151010.csv, 
hdfs://es05:54310/csv/customer/customer_20151011.csv, 
hdfs://es05:54310/csv/customer/customer_20151012.csv, 
hdfs://es05:54310/csv/customer/customer_20151013.csv, 
hdfs://es05:54310/csv/customer/customer_20151014.csv, 
hdfs://es05:54310/csv/customer/customer_20151015.csv, 
hdfs://es05:54310/csv/customer/customer_20151016.csv, 
hdfs://es05:54310/csv/customer/customer_20151017.csv, 
hdfs://es05:54310/csv/customer/customer_20151018.csv, 
hdfs://es05:54310/csv/customer/customer_20151019.csv, 
hdfs://es05:54310/csv/customer/customer_20151020.csv, 
hdfs://es05:54310/csv/customer/customer_20151021.csv, 
hdfs://es05:54310/csv/customer/customer_20151022.csv, 
hdfs://es05:54310/csv/customer/customer_20151023.csv, 
hdfs://es05:54310/csv/customer/customer_20151024.csv, 
hdfs://es05:54310/csv/customer/customer_20151025.csv, 
hdfs://es05:54310/csv/customer/customer_20151026.csv, 
hdfs://es05:54310/csv/customer/customer_20151027.csv, 
hdfs://es05:54310/csv/customer/customer_20151028.csv, 
hdfs://es05:54310/csv/customer/customer_20151029.csv, 
hdfs://es05:54310/csv/customer/customer_20151030.csv, 
hdfs://es05:54310/csv/customer/customer_20151031.csv, 
hdfs://es05:54310/csv/customer/customer_20151101.csv, 
hdfs://es05:54310/csv/customer/customer_20151102.csv]]]) : rowType = 
(DrillRecordRow[*]): rowcount = 2.407374395E9, cumulative cost = 
{2.407374395E9 rows, 2.407374395E9 cpu, 0.0 io, 0.0 network, 0.0 
memory}, id = 4355
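
For my own notes, a couple of checks on how wide the scan is allowed to parallelize - a sketch, assuming the standard sys tables and option names:

~~~
-- registered drillbits (should list es05 through es08)
select * from sys.drillbits;

-- knobs that influence how many minor fragments scan the csv files
select * from sys.options
where name in ('planner.width.max_per_node', 'planner.slice_target');
~~~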

Re: Where is the Drill Explorer and / or docs on installing it?

2016-01-04 Thread Matt
That section refers only to Windows? Are there OSX specific docs for Explorer?

Sent from my iPhone

> On Jan 4, 2016, at 2:12 PM, Kristine Hahn  wrote:
> 
> Please see step 1 of
> 
> http://drill.apache.org/docs/installing-the-driver-on-windows/#step-3-verify-the-installation
> 
> "Installing the ODBC Administrator installs Drill Explorer and the Tableau
> TDC file."
> 
> Kristine Hahn
> Sr. Technical Writer
> 415-497-8107 @krishahn skype:krishahn
> 
> 
> On Mon, Jan 4, 2016 at 10:48 AM, Andries Engelbrecht <
> aengelbre...@maprtech.com> wrote:
> 
>> Peder,
>> 
>> Drill Explorer typically installs with the ODBC driver. Did you install
>> the ODBC driver on OS X?
>> 
>> --Andries
>> 
>> 
 On Dec 27, 2015, at 7:53 PM, Peder Jakobsen | gmail 
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> The docs refer to the Drill Explorer, which for OS X is supposed to be
>>> located in the Applications folder:
>>> https://drill.apache.org/docs/connecting-drill-explorer-to-data/
>>> 
>>> But there is no mention in the documentation on how to install this tool
>> (I
>>> think)?  Are the docs referring to and old version of Drill, perhaps?
>>> 
>>> Thanks,
>>> 
>>> Peder J.
>> 
>> 


Re: A single users view/opinion of Drill

2015-12-27 Thread Matt

 - Let's use Strings as the default value type if values are missing
 - Instead of Double (pet peeve)


This might be more important than just a pet peeve. Is there a reason
Drill casts by default to a more restrictive data type instead of the less
restrictive string?
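
For JSON sources at least there is a stop-gap: force everything to arrive as VARCHAR and cast afterwards - a sketch, assuming the option still carries this name (the path and column are made up):

~~~
alter session set `store.json.all_text_mode` = true;

select cast(t.amount as double) as amount
from dfs.`/data/events.json` t;
~~~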


On 27 Dec 2015, at 8:26, Stefán Baxter wrote:


Hi Drillers,

I have been meaning to share some thoughts on Drill for a long time 
and
what I, or we at Activity Stream, believe would make Drill better (for 
us).

Please keep in mind that this is a single sided view from a simple,
non-contributing, user and please excuse my English.

We love using Drill and our setup included Drill, Parquet, Avro, JSON, 
JDBC
sources and more. Drill offers many great things but in the beginning 
it
affected our decision to use Drill, over Presto, that we could use it 
with
both Hive/HDFS and local disc storage and its support for the various 
data

sources.

Working with Drill has not always been easy and we have spent a lot of 
time
adjusting to "Drill quirks", like defaulting to Double for values that 
show

up to late, but the "this is awesome" moments have  always been more
frequent than the "I don't believe this s**t" moments (please excuse 
the

language).

We see the main roles of Drill as the following:

 - Run distributed and fast SQL on top of various data sources and 
allow

 us to mix the data into a single result
 - Eliminate ETL by supporting evolving schema


Some discussion points:

*1. Null exists, let's use it!  (some pun intended)*

 - If a field is missing let's return Null
 - Schema validation is great but in a polyglot and mixed schema
 environment that should not surprise anyone.
 - Drill has a bunch of functions to deal with null values and

*2. String is the lowest common denominator*

 - Almost all values can be converted to and from Strings

 - Let's use Strings as the default value type if values are missing
 - Instead of Double (pet peeve)

 - Lets always convert String values automatically, if functions are
 expecting other value types and the value is applicable for 
conversion

 - Create a warning that this is being done when it's affecting
 performance rather than throw errors

*3. Be as tolerant towards data as possible - Log warning rather than 
throw

errors*

Let's minimize the "conversion boiler plaiting" needed in SQL by 
having a

more flexible infrastructure.

 - ISO Date-String, Time-Stamp and Long are all valid Dates, let's 
treat

 them as such for any function or conditions

 - Integers can be accurately converted to Real/Double, let's not make
 that difference matter (Going the other way is not the same)

 - Other pointers
- Missing tables in a union could return empty data sets rather 
than

throw errors
- Empty files should always return empty result sets
- Valid JSON files, starting with "[" and ending with "]" should 
be

trimmed to be suitable for Drill
- It seems odd that Drill only supports non-standard lists
- A "incomplete last line" in a any log file (JSON, CSV etc.) 
should

be ignored as it could represent an incomplete append operation
(live logs)

*4. Consistency between data/storage formats if at all possible*

Having different behavior in Parquet and Avro, for example, when it 
comes

to missing fields is counter intuitive and appears fragmented.

Please synchronize the way "Drill behaves" rather than fragment on how
every single format reader behaves.



This is by no means a conclusive list but I just wanted to see if I 
could

get this ball rolling.


Hope you are all enjoying the holidays.

Best regards,
-Stefán

ps.
Our only contribution to Drill is this simple UDF library:
https://github.com/activitystream/asdrill (Apache license)


Re: Tableau / filter issues

2015-08-17 Thread Matt
I think Tableau also uses the first query to fetch the structure / 
metadata of the expected result set.


We have often eliminated performance issues using Tableau by "hiding" 
the structure of queries by putting them in database views. Could that 
be a possible solution here?
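
For reference, wrapping the dimension in a Drill view would look roughly like this - a sketch with made-up names, assuming a writable workspace such as dfs.tmp:

~~~
create or replace view dfs.tmp.field_values_v as
select field
from dfs.`/data/some_table`
group by field;
~~~

Tableau could then be pointed at field_values_v instead of the raw table.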


On 17 Aug 2015, at 11:02, Boris Chmiel wrote:


Hello Driller,
I'm facing issues using the Filter function of Tableau with Drill : 
Tableau always populates the values of the filter with "null". The action
is translated into 2 queries within Drill:
SELECT * FROM (SELECT field FROM Table GROUP BY field ORDER BY field
ASC) T LIMIT 0

SELECT field FROM Table GROUP BY field ORDER BY field ASC
It seems that Tableau is fetching the result of the first query (null
- which seems to be generated to catch exceptions) instead of the
second one.
Has anyone already met this issue? Is there a workaround or customization
parameter to adjust?
conf: Tableau v9.0 + MapRDrillODBC driver with TDC file
deployed, Drill V1.1.0


Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Matt

On 23 Jul 2015, at 10:53, Abdel Hakim Deneche wrote:

When you try to read schema-less data, Drill will first investigate the
first 1000 rows to figure out a schema for your data, then it will use this
schema for the remainder of the query.


To clarify, if the JSON schema changes on the 1,001st or 1,000,000th record,
is Drill supposed to report an error, or ignore the new data elements and only
consider those discovered in the first 1000 objects?


Re: Drill authentication with pyodbc

2015-06-08 Thread Matt

Does using a DSN as per this notebook help?

http://nbviewer.ipython.org/github/cjmatta/drill_ipython_notebook/blob/master/Twitter%20Drill%20Pandas.ipynb

https://github.com/cjmatta/drill_ipython_notebook


On 8 Jun 2015, at 22:20, Christopher Matta wrote:

Does anyone know what the expected key names are for userid and 
password
for an ODBC connection? I was using pyodbc to connect to Drill pre-1.0 
but

now with authentication enabled I haven’t figured out how to do it.

Relevant errors:

conn = 
pyodbc.connect('Driver=/opt/mapr/drillodbc/lib/universal/libmaprdrillodbc.dylib;ConnectionType=Zookeeper;ZKQuorum=hosta:5181,hostb:5181,hostc:5181;ZKClusterID=cluster-drillbits;Catalog=DRILL;AuthenticationType=BasicAuthentication;AdvancedProperties=CastAnyToVarchar=true;HandshakeTimeout=5;QueryTimeout=180;TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;uid=cmatta;pwd=',

autocommit=True)
cursor = conn.cursor()
cursor.execute('select * from maprfs.cmatta.tweets_view limit 10;')

--- output ---
---
Error Traceback (most recent call last)
 in ()
1 conn = 
pyodbc.connect('Driver=/opt/mapr/drillodbc/lib/universal/libmaprdrillodbc.dylib;ConnectionType=Zookeeper;ZKQuorum=hosta:5181,hostb:5181,hostc:5181;ZKClusterID=cluster-drillbits;Catalog=DRILL;AuthenticationType=BasicAuthentication;AdvancedProperties=CastAnyToVarchar=true;HandshakeTimeout=5;QueryTimeout=180;TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;uid=cmatta;pwd=',

autocommit=True)
2 cursor = conn.cursor()
> 3 cursor.execute('select * from maprfs.cmatta.tweets_view limit 
10;')


Error: ('HY000', '[HY000] [MapR][Drill] (1040) Drill failed to execute
the query: select * from maprfs.cmatta.tweets_view limit
10;\n[30027]Query execution error. Details:[ \nSYSTEM ERROR:
java.lang.IllegalArgumentException: A valid userName is
expected\n\n\n[Error Id: 2f26217c-ddac-4195-9bfb-03c519a95e56 on
se-node12.se.lab:31010]\n] (1040) (SQLExecDirectW)')

I’ve also tried replacing uid with user, username, userName etc… 
with no

luck.

Chris mattacma...@mapr.com
215-701-3146
​


Sorting and partitioning for range scans?

2015-06-01 Thread Matt
I have seen some discussions on the Parquet storage format suggesting 
that sorting time series data on the time key prior to converting to the 
Parquet format will improve range query efficiency via min/max values on 
column chunks - perhaps analogous to skip indexes?


Is this a recommended approach for data accessed via Drill?

In addition, for data stored in HDFS for Drill that has a regular growth 
rate and is mainly subject to time range queries, is there validity to 
partitioning it by date into subdirectories?


For example, in PostgreSQL, I might partition data tables by month so 
queries including the partition date column hit the proper partitions 
directly (with an extra benefit of space management that does not touch 
all date ranges).


Segmenting data into directories in HDFS would require clients to 
structure queries accordingly, but would there be benefit in reduced 
query time by limiting scan ranges?
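
To make the question concrete, this is roughly what I have in mind - a sketch with made-up names, assuming the source already exposes named columns (via a view or extracted headers): one sorted CTAS per month dropped into a date-named subdirectory, then pruning on dir0 at query time.

~~~
-- monthly load, sorted on the time key so Parquet row groups get tight min/max ranges
create table `/parquet/metrics/2015-06` as
select * from `/csv/metrics_201506.csv`
order by event_ts;

-- range query: dir0 prunes to the month, min/max stats narrow the scan inside it
select *
from `/parquet/metrics`
where dir0 = '2015-06'
  and event_ts >= timestamp '2015-06-10 00:00:00'
  and event_ts <  timestamp '2015-06-11 00:00:00';
~~~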


Re: Monitoring long / stuck CTAS

2015-05-29 Thread Matt
0]
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) 
[netty-common-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603) 
[netty-common-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563) 
[netty-common-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) 
[netty-common-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:788) 
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:689) 
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1114) 
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:705) 
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32) 
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:980) 
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1032) 
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:965) 
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) 
[netty-common-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254) 
[netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) 
[netty-common-4.0.27.Final.jar:4.0.27.Final]

at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
~~~



On 28 May 2015, at 16:43, Mehant Baid wrote:

I think the problem might be related to a single laggard, looks like 
we are waiting for one minor fragment to complete. Based on the output 
you provided looks like the fragment 1_1 hasn't completed. You might 
want to find out where the fragment was scheduled and what is going on 
in that node. It might also be useful to look at the profile for that 
minor fragment to see how much data has been processed.



Thanks
Mehant

On 5/28/15 10:57 AM, Matt wrote:

Did you check the log files for any errors?


No messages related to this query contain errors or warnings, nor
anything mentioning memory or heap. Querying now to determine what is
missing in the parquet destination.


drillbit.out on the master shows no error messages, and what looks 
like the last relevant line is:


~~~
May 27, 2015 6:43:50 PM INFO: 
parquet.hadoop.ColumnChunkPageWriteStore: written 2,258,263B for 
[bytes_1250] INT64: 3,069,414 values, 24,555,504B raw, 2,257,112B 
comp, 24 pages, encodings: [RLE, PLAIN, BIT_PACKED]
May 27, 2015 6:43:51 PM INFO: parquet.haMay 28, 2015 5:13:42 PM 
org.apache.calcite.sql.validate.SqlValidatorException 

~~~

The final lines in drillbit.log (which appear to use a different time 
format in the log) that contain the profile ID:


~~~
2015-05-27 18:39:49,980 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20: State change requested 
from RUNNING --> FINISHED for
2015-05-27 18:39:49,981 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20. New state: FINISHED
2015-05-27 18:40:05,650 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12: State change requested 
from RUNNING --> FINISHED for
2015-05-27 18:40:05,650 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12. New state: FINISHED
2015-05-27 18:41:57,444 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16: State change requested 
from RUNNING --> FINISHED for
2015-05-27 18:41:57,444 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16. New state: FINISHED
2015-05-27 18:43:25,005 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] INFO 
o.a.d.e.w.fragment.FragmentExecutor

Re: Monitoring long / stuck CTAS

2015-05-29 Thread Matt
> 1) it isn't HDFS.

Is MapR-FS a replacement or stand-in for HDFS?


On 29 May 2015, at 5:55, Ted Dunning wrote:

> Apologies for the plug, but using MapR FS would help you a lot here.  The
> trick is that you can run an NFS server on every node and mount that server
> as localhost.
>
> The benefits are:
>
> 1) the entire cluster appears as a conventional POSIX style file system in
> addition to being available via HDFS API's.
>
> 2) you get very high I/O speeds
>
> 3) you get real snapshots and mirrors if you need them
>
> 4) you get the use of the HBase API without having to run HBase.  Tables
> are integrated directly into MapR FS.
>
> 5) programs that need to exceed local disk size can do so
>
> 6) data can be isolated to single machines if you want.
>
> 7) you can get it for free or pay for support
>
>
> The downsides are:
>
> 1) it isn't HDFS.
>
> 2) the data platform isn't Apache licensed (all of eco-system code is
> unchanged wrt licensing)
>
>
>
>
>
> On Thu, May 28, 2015 at 9:37 AM, Matt  wrote:
>
>> I know I can / should assign individual disks to HDFS, but as a test
>> cluster there are apps that expect data volumes to work on. A dedicated
>> Hadoop production cluster would have a disk layout specific to the task.


Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt

Bumping memory to:

DRILL_MAX_DIRECT_MEMORY="16G"
DRILL_HEAP="8G"

The 44GB file imported successfully in 25 minutes - acceptable on this 
hardware.


I don't know if the default memory setting was to blame or not.
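
For anyone checking their own settings, the per-drillbit numbers can also be read back over SQL - a sketch, assuming this build exposes the sys.memory table:

~~~
-- one row per drillbit, with the direct and heap limits actually in effect
select * from sys.memory;
~~~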


On 28 May 2015, at 14:22, Andries Engelbrecht wrote:


That is the Drill direct memory per node.

DRILL_HEAP is for the heap size per node.

More info here
http://drill.apache.org/docs/configuring-drill-memory/


—Andries

On May 28, 2015, at 11:09 AM, Matt  wrote:


Referencing http://drill.apache.org/docs/configuring-drill-memory/

Is DRILL_MAX_DIRECT_MEMORY the limit for each node, or the cluster?

The root page on a drillbit at port 8047 lists four nodes, with the 16G
Maximum Direct Memory equal to DRILL_MAX_DIRECT_MEMORY, so I am
uncertain whether that is a node or cluster limit.



On 28 May 2015, at 12:23, Jason Altekruse wrote:

That is correct. I guess it could be possible that HDFS might run 
out of
heap, but I'm guessing that is unlikely the cause of the failure you 
are
seeing. We should not be taxing zookeeper enough to be causing any 
issues

there.

On Thu, May 28, 2015 at 9:17 AM, Matt  wrote:

To make sure I am adjusting the correct config, these are heap 
parameters

within the Drill configure path, not for Hadoop or Zookeeper?


On May 28, 2015, at 12:08 PM, Jason Altekruse 


wrote:


There should be no upper limit on the size of the tables you can 
create

with Drill. Be advised that Drill does currently operate entirely
optimistically in regards to available resources. If a network 
connection

between two drillbits fails during a query, we will not currently
re-schedule the work to make use of remaining nodes and network

connections
that are still live. While we have had a good amount of success 
using

Drill
for data conversion, be aware that these conditions could cause 
long

running queries to fail.

That being said, it isn't the only possible cause for such a 
failure. In
the case of a network failure we would expect to see a message 
returned

to

you that part of the query was unsuccessful and that it had been

cancelled.
Andries has a good suggestion in regards to checking the heap 
memory,

this
should also be detected and reported back to you at the CLI, but 
we may

be
failing to propagate the error back to the head node for the 
query. I
believe writing parquet may still be the most heap-intensive 
operation in
Drill, despite our efforts to refactor the write path to use 
direct

memory
instead of on-heap for large buffers needed in the process of 
creating

parquet files.


On Thu, May 28, 2015 at 8:43 AM, Matt  wrote:

Is 300MM records too much to do in a single CTAS statement?

After almost 23 hours I killed the query (^c) and it returned:

~~~
+---++
| Fragment  | Number of records written  |
+---++
| 1_20  | 13568824   |
| 1_15  | 12411822   |
| 1_7   | 12470329   |
| 1_12  | 13693867   |
| 1_5   | 13292136   |
| 1_18  | 13874321   |
| 1_16  | 13303094   |
| 1_9   | 13639049   |
| 1_10  | 13698380   |
| 1_22  | 13501073   |
| 1_8   | 13533736   |
| 1_2   | 13549402   |
| 1_21  | 13665183   |
| 1_0   | 13544745   |
| 1_4   | 13532957   |
| 1_19  | 12767473   |
| 1_17  | 13670687   |
| 1_13  | 13469515   |
| 1_23  | 12517632   |
| 1_6   | 13634338   |
| 1_14  | 13611322   |
| 1_3   | 13061900   |
| 1_11  | 12760978   |
+---++
23 rows selected (82294.854 seconds)
~~~

The sum of those record counts is  306,772,763 which is close to 
the

320,843,454 in the source file:

~~~
0: jdbc:drill:zk=es05:2181> select count(*)  FROM

root.`sample_201501.dat`;

++
|   EXPR$0   |
++
| 320843454  |
++
1 row selected (384.665 seconds)
~~~


It represents one month of data, 4 key columns and 38 numeric 
measure
columns, which could also be partitioned daily. The test here was 
to

create
monthly Parquet files to see how the min/max stats on Parquet 
chunks

help

with range select performance.

Instead of a small number of large monthly RDBMS tables, I am 
attempting
to determine how many Parquet files should be used with Drill / 
HDFS.





On 27 May 2015, at 15:17, Matt wrote:

Attempting to create a Parquet backed table with a CTAS from a 44GB tab
delimited file in HDFS. The process seemed to be running, as CPU and IO was
seen on all 4 nodes in this cluster, and .parquet files being created in
the expected path.

However, in the las

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
That is a good point. The difference between the number of source rows
and those that made it into the parquet files is about the same as the
count written by each of the other fragments.


Indeed the query profile does show fragment 1_1 as CANCELED while the 
others all have State FINISHED. Additionally the other fragments have a 
runtime of less than 30 mins, where only fragment 1_1 lasted the 23 
hours before cancellation.



On 28 May 2015, at 16:43, Mehant Baid wrote:

I think the problem might be related to a single laggard, looks like 
we are waiting for one minor fragment to complete. Based on the output 
you provided looks like the fragment 1_1 hasn't completed. You might 
want to find out where the fragment was scheduled and what is going on 
in that node. It might also be useful to look at the profile for that 
minor fragment to see how much data has been processed.



Thanks
Mehant

On 5/28/15 10:57 AM, Matt wrote:

Did you check the log files for any errors?


No messages related to this query contain errors or warnings, nor
anything mentioning memory or heap. Querying now to determine what is
missing in the parquet destination.


drillbit.out on the master shows no error messages, and what looks 
like the last relevant line is:


~~~
May 27, 2015 6:43:50 PM INFO: 
parquet.hadoop.ColumnChunkPageWriteStore: written 2,258,263B for 
[bytes_1250] INT64: 3,069,414 values, 24,555,504B raw, 2,257,112B 
comp, 24 pages, encodings: [RLE, PLAIN, BIT_PACKED]
May 27, 2015 6:43:51 PM INFO: parquet.haMay 28, 2015 5:13:42 PM 
org.apache.calcite.sql.validate.SqlValidatorException 

~~~

The final lines in drillbit.log (which appear to use a different time 
format in the log) that contain the profile ID:


~~~
2015-05-27 18:39:49,980 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20: State change requested 
from RUNNING --> FINISHED for
2015-05-27 18:39:49,981 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20. New state: FINISHED
2015-05-27 18:40:05,650 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12: State change requested 
from RUNNING --> FINISHED for
2015-05-27 18:40:05,650 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12. New state: FINISHED
2015-05-27 18:41:57,444 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16: State change requested 
from RUNNING --> FINISHED for
2015-05-27 18:41:57,444 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16. New state: FINISHED
2015-05-27 18:43:25,005 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:8: State change requested from 
RUNNING --> FINISHED for
2015-05-27 18:43:25,005 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:8. New state: FINISHED
2015-05-27 18:43:54,539 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:0] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:0: State change requested from 
RUNNING --> FINISHED for
2015-05-27 18:43:54,540 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:0] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:0. New state: FINISHED
2015-05-27 18:43:59,947 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:4] INFO 
o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:4: State change requested from 
RUNNING --> FINISHED for
2015-05-27 18:43:59,947 
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:4] INFO 
o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:4. New state: FINISHED

~~~


On 28 May 2015, at 13:42, Andries Engelbrecht wrote:


It should execute multi threaded, need to check on text file.

Did you check the log files for any errors?


On May 28, 2015, at 10:36 AM, Matt  wrote:

The time seems pretty long for that file size. What type of file 
is it?


Tab delimited UTF-8 text.

I left the query to run overnight to see if it would complete, but 
24 hours for an import like this would indeed be too long.



Is the CTAS running single threaded?


In the first hour, with this being the only client connected to the 
cluster, I observed activity on all 4 nodes.


Is multi-threaded query execution the default? I would not have 
changed anything deliberately to force single thread execution.



On 28 May 2015, at 13:06, Andries Engelbrecht

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt

Referencing http://drill.apache.org/docs/configuring-drill-memory/

Is DRILL_MAX_DIRECT_MEMORY the limit for each node, or the cluster?

The root page on a drillbit at port 8047 lists four nodes, with the 16G
Maximum Direct Memory equal to DRILL_MAX_DIRECT_MEMORY, so I am
uncertain whether that is a node or cluster limit.



On 28 May 2015, at 12:23, Jason Altekruse wrote:

That is correct. I guess it could be possible that HDFS might run out 
of
heap, but I'm guessing that is unlikely the cause of the failure you 
are
seeing. We should not be taxing zookeeper enough to be causing any 
issues

there.

On Thu, May 28, 2015 at 9:17 AM, Matt  wrote:

To make sure I am adjusting the correct config, these are heap 
parameters

within the Drill configure path, not for Hadoop or Zookeeper?


On May 28, 2015, at 12:08 PM, Jason Altekruse 


wrote:


There should be no upper limit on the size of the tables you can 
create

with Drill. Be advised that Drill does currently operate entirely
optimistically in regards to available resources. If a network 
connection

between two drillbits fails during a query, we will not currently
re-schedule the work to make use of remaining nodes and network

connections
that are still live. While we have had a good amount of success 
using

Drill

for data conversion, be aware that these conditions could cause long
running queries to fail.

That being said, it isn't the only possible cause for such a 
failure. In
the case of a network failure we would expect to see a message 
returned

to

you that part of the query was unsuccessful and that it had been

cancelled.
Andries has a good suggestion in regards to checking the heap 
memory,

this
should also be detected and reported back to you at the CLI, but we 
may

be
failing to propagate the error back to the head node for the query. 
I
believe writing parquet may still be the most heap-intensive 
operation in

Drill, despite our efforts to refactor the write path to use direct

memory
instead of on-heap for large buffers needed in the process of 
creating

parquet files.


On Thu, May 28, 2015 at 8:43 AM, Matt  wrote:

Is 300MM records too much to do in a single CTAS statement?

After almost 23 hours I killed the query (^c) and it returned:

~~~
+---++
| Fragment  | Number of records written  |
+---++
| 1_20  | 13568824   |
| 1_15  | 12411822   |
| 1_7   | 12470329   |
| 1_12  | 13693867   |
| 1_5   | 13292136   |
| 1_18  | 13874321   |
| 1_16  | 13303094   |
| 1_9   | 13639049   |
| 1_10  | 13698380   |
| 1_22  | 13501073   |
| 1_8   | 13533736   |
| 1_2   | 13549402   |
| 1_21  | 13665183   |
| 1_0   | 13544745   |
| 1_4   | 13532957   |
| 1_19  | 12767473   |
| 1_17  | 13670687   |
| 1_13  | 13469515   |
| 1_23  | 12517632   |
| 1_6   | 13634338   |
| 1_14  | 13611322   |
| 1_3   | 13061900   |
| 1_11  | 12760978   |
+---++
23 rows selected (82294.854 seconds)
~~~

The sum of those record counts is  306,772,763 which is close to 
the

320,843,454 in the source file:

~~~
0: jdbc:drill:zk=es05:2181> select count(*)  FROM

root.`sample_201501.dat`;

++
|   EXPR$0   |
++
| 320843454  |
++
1 row selected (384.665 seconds)
~~~


It represents one month of data, 4 key columns and 38 numeric 
measure
columns, which could also be partitioned daily. The test here was 
to

create
monthly Parquet files to see how the min/max stats on Parquet 
chunks

help

with range select performance.

Instead of a small number of large monthly RDBMS tables, I am 
attempting
to determine how many Parquet files should be used with Drill / 
HDFS.





On 27 May 2015, at 15:17, Matt wrote:

Attempting to create a Parquet backed table with a CTAS from a 44GB tab
delimited file in HDFS. The process seemed to be running, as CPU and IO was
seen on all 4 nodes in this cluster, and .parquet files being created in
the expected path.

However, in the last two hours or so, all nodes show near zero CPU or
IO, and the Last Modified date on the .parquet files has not changed. The same
time delay is shown in the Last Progress column in the active fragment
profile.


What approach can I take to determine what is happening (or not)?






Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt

Did you check the log files for any errors?


No messages related to this query contain errors or warnings, nor
anything mentioning memory or heap. Querying now to determine what is
missing in the parquet destination.


drillbit.out on the master shows no error messages, and what looks like 
the last relevant line is:


~~~
May 27, 2015 6:43:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
written 2,258,263B for [bytes_1250] INT64: 3,069,414 values, 24,555,504B 
raw, 2,257,112B comp, 24 pages, encodings: [RLE, PLAIN, BIT_PACKED]
May 27, 2015 6:43:51 PM INFO: parquet.haMay 28, 2015 5:13:42 PM 
org.apache.calcite.sql.validate.SqlValidatorException 

~~~

The final lines in drillbit.log (which appear to use a different time 
format in the log) that contain the profile ID:


~~~
2015-05-27 18:39:49,980 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20: State change requested from 
RUNNING --> FINISHED for
2015-05-27 18:39:49,981 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20. New state: FINISHED
2015-05-27 18:40:05,650 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12: State change requested from 
RUNNING --> FINISHED for
2015-05-27 18:40:05,650 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12. New state: FINISHED
2015-05-27 18:41:57,444 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16: State change requested from 
RUNNING --> FINISHED for
2015-05-27 18:41:57,444 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16. New state: FINISHED
2015-05-27 18:43:25,005 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:8: State change requested from 
RUNNING --> FINISHED for
2015-05-27 18:43:25,005 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:8. New state: FINISHED
2015-05-27 18:43:54,539 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:0] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:0: State change requested from 
RUNNING --> FINISHED for
2015-05-27 18:43:54,540 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:0] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:0. New state: FINISHED
2015-05-27 18:43:59,947 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:4] 
INFO  o.a.d.e.w.fragment.FragmentExecutor - 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:4: State change requested from 
RUNNING --> FINISHED for
2015-05-27 18:43:59,947 [2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:4] 
INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:4. New state: FINISHED

~~~


On 28 May 2015, at 13:42, Andries Engelbrecht wrote:


It should execute multi threaded, need to check on text file.

Did you check the log files for any errors?


On May 28, 2015, at 10:36 AM, Matt  wrote:

The time seems pretty long for that file size. What type of file is 
it?


Tab delimited UTF-8 text.

I left the query to run overnight to see if it would complete, but 24 
hours for an import like this would indeed be too long.



Is the CTAS running single threaded?


In the first hour, with this being the only client connected to the 
cluster, I observed activity on all 4 nodes.


Is multi-threaded query execution the default? I would not have 
changed anything deliberately to force single thread execution.



On 28 May 2015, at 13:06, Andries Engelbrecht wrote:

The time seems pretty long for that file size. What type of file is 
it?


Is the CTAS running single threaded?

—Andries


On May 28, 2015, at 9:37 AM, Matt  wrote:

How large is the data set you are working with, and your 
cluster/nodes?


Just testing with that single 44GB source file currently, and my 
test cluster is made from 4 nodes, each with 8 CPU cores, 32GB RAM, 
a 6TB Ext4 volume (RAID-10).


Drill defaults left as come in v1.0. I will be adjusting memory and 
retrying the CTAS.


I know I can / should assign individual disks to HDFS, but as a 
test cluster there are apps that expect data volumes to work on. A 
dedicated Hadoop production cluster would have a disk layout 
specific to the task.



On 28 May 2015, at 12:26, Andries Engelbrecht wrote:

Just check the drillbit.log and drillbit.out files in the log 
directory.
Before adjusting memory, see if that is an issue first. It was for

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
CPU and IO went to near zero on the master and all nodes after about 1
hour. I do not know if the bulk of rows were written within that hour
or after.


Is there any way you can read the table and try to validate if all of 
the data was written?


A simple join will show me where it stopped, and if that was at a 
specific point in scanning the source file top to bottom.


While we certainly want to look into this more to find the issue in 
your case, you might have all of the data you need to start running 
queries against the parquet files.


A simple row count comparison tells me about 5% of the rows are missing in
the destination, but I will be confirming that.
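
The comparison itself is just two counts - a sketch, with the Parquet path standing in for whatever the CTAS target was:

~~~
-- source rows in the tab-delimited file
select count(*) from root.`sample_201501.dat`;

-- rows that made it into the Parquet table
select count(*) from root.`parquet/sample_201501`;
~~~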



On 28 May 2015, at 13:24, Jason Altekruse wrote:


He mentioned in his original post that he saw CPU and IO on all of the
nodes for a while when the query was active, but it suddenly dropped 
down
to low CPU usage and stopped producing files. It seems like we are 
failing

to detect an error an cancel the query.

It is possible that the failure happened when we were finalizing the 
query,

cleanup resources/file handles/ etc. Is there any way you can read the
table and try to validate if all of the data was written? You can try 
to
run a few of the same queries against the tab delimited files and 
resulting
parquet files to see if all of the records were written. While we 
certainly
want to look into this more to find the issue in your case, you might 
have
all of the data you need to start running queries against the parquet 
files.


On Thu, May 28, 2015 at 10:06 AM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:

The time seems pretty long for that file size. What type of file is 
it?


Is the CTAS running single threaded?

—Andries


On May 28, 2015, at 9:37 AM, Matt  wrote:

How large is the data set you are working with, and your 
cluster/nodes?


Just testing with that single 44GB source file currently, and my 
test
cluster is made from 4 nodes, each with 8 CPU cores, 32GB RAM, a 6TB 
Ext4

volume (RAID-10).


Drill defaults left as come in v1.0. I will be adjusting memory and

retrying the CTAS.


I know I can / should assign individual disks to HDFS, but as a test
cluster there are apps that expect data volumes to work on. A 
dedicated
Hadoop production cluster would have a disk layout specific to the 
task.



On 28 May 2015, at 12:26, Andries Engelbrecht wrote:

Just check the drillbit.log and drillbit.out files in the log 
directory.
Before adjusting memory, see if that is an issue first. It was for 
me,

but as Jason mentioned there can be other causes as well.


You adjust memory allocation in the drill-env.sh files, and have to

restart the drill bits.


How large is the data set you are working with, and your 
cluster/nodes?


—Andries


On May 28, 2015, at 9:17 AM, Matt  wrote:


To make sure I am adjusting the correct config, these are heap
parameters within the Drill configure path, not for Hadoop or 
Zookeeper?




On May 28, 2015, at 12:08 PM, Jason Altekruse <

altekruseja...@gmail.com> wrote:


There should be no upper limit on the size of the tables you can

create

with Drill. Be advised that Drill does currently operate entirely
optimistically in regards to available resources. If a network

connection

between two drillbits fails during a query, we will not currently
re-schedule the work to make use of remaining nodes and network

connections
that are still live. While we have had a good amount of success 
using

Drill
for data conversion, be aware that these conditions could cause 
long

running queries to fail.

That being said, it isn't the only possible cause for such a 
failure.

In

the case of a network failure we would expect to see a message

returned to

you that part of the query was unsuccessful and that it had been

cancelled.
Andries has a good suggestion in regards to checking the heap 
memory,

this
should also be detected and reported back to you at the CLI, but 
we

may be
failing to propagate the error back to the head node for the 
query. I

believe writing parquet may still be the most heap-intensive

operation in
Drill, despite our efforts to refactor the write path to use 
direct

memory
instead of on-heap for large buffers needed in the process of 
creating

parquet files.


On Thu, May 28, 2015 at 8:43 AM, Matt  wrote:

Is 300MM records too much to do in a single CTAS statement?

After almost 23 hours I killed the query (^c) and it returned:

~~~
+---++
| Fragment  | Number of records written  |
+---++
| 1_20  | 13568824   |
| 1_15  | 12411822   |
| 1_7   | 12470329   |
| 1_12  | 13693867   |
| 1_5   | 13292136   |
| 1_18  | 13874321   |
| 1_16  | 13303094   |
| 1_9   | 13639049   |
| 1_10  | 13698380   |
| 1_2

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
The time seems pretty long for that file size. What type of file is 
it?


Tab delimited UTF-8 text.

I left the query to run overnight to see if it would complete, but 24 
hours for an import like this would indeed be too long.



Is the CTAS running single threaded?


In the first hour, with this being the only client connected to the 
cluster, I observed activity on all 4 nodes.


Is multi-threaded query execution the default? I would not have changed 
anything deliberately to force single thread execution.



On 28 May 2015, at 13:06, Andries Engelbrecht wrote:

The time seems pretty long for that file size. What type of file is 
it?


Is the CTAS running single threaded?

—Andries


On May 28, 2015, at 9:37 AM, Matt  wrote:

How large is the data set you are working with, and your 
cluster/nodes?


Just testing with that single 44GB source file currently, and my test 
cluster is made from 4 nodes, each with 8 CPU cores, 32GB RAM, a 6TB 
Ext4 volume (RAID-10).


Drill defaults left as come in v1.0. I will be adjusting memory and 
retrying the CTAS.


I know I can / should assign individual disks to HDFS, but as a test 
cluster there are apps that expect data volumes to work on. A 
dedicated Hadoop production cluster would have a disk layout specific 
to the task.



On 28 May 2015, at 12:26, Andries Engelbrecht wrote:

Just check the drillbit.log and drillbit.out files in the log 
directory.
Before adjusting memory, see if that is an issue first. It was for 
me, but as Jason mentioned there can be other causes as well.


You adjust memory allocation in the drill-env.sh files, and have to 
restart the drill bits.


How large is the data set you are working with, and your 
cluster/nodes?


—Andries


On May 28, 2015, at 9:17 AM, Matt  wrote:

To make sure I am adjusting the correct config, these are heap 
parameters within the Drill configure path, not for Hadoop or 
Zookeeper?



On May 28, 2015, at 12:08 PM, Jason Altekruse 
 wrote:


There should be no upper limit on the size of the tables you can 
create

with Drill. Be advised that Drill does currently operate entirely
optimistically in regards to available resources. If a network 
connection

between two drillbits fails during a query, we will not currently
re-schedule the work to make use of remaining nodes and network 
connections
that are still live. While we have had a good amount of success 
using Drill
for data conversion, be aware that these conditions could cause 
long

running queries to fail.

That being said, it isn't the only possible cause for such a 
failure. In
the case of a network failure we would expect to see a message 
returned to
you that part of the query was unsuccessful and that it had been 
cancelled.
Andries has a good suggestion in regards to checking the heap 
memory, this
should also be detected and reported back to you at the CLI, but 
we may be
failing to propagate the error back to the head node for the 
query. I
believe writing parquet may still be the most heap-intensive 
operation in
Drill, despite our efforts to refactor the write path to use 
direct memory
instead of on-heap for large buffers needed in the process of 
creating

parquet files.


On Thu, May 28, 2015 at 8:43 AM, Matt  wrote:

Is 300MM records too much to do in a single CTAS statement?

After almost 23 hours I killed the query (^c) and it returned:

~~~
+---++
| Fragment  | Number of records written  |
+---++
| 1_20  | 13568824   |
| 1_15  | 12411822   |
| 1_7   | 12470329   |
| 1_12  | 13693867   |
| 1_5   | 13292136   |
| 1_18  | 13874321   |
| 1_16  | 13303094   |
| 1_9   | 13639049   |
| 1_10  | 13698380   |
| 1_22  | 13501073   |
| 1_8   | 13533736   |
| 1_2   | 13549402   |
| 1_21  | 13665183   |
| 1_0   | 13544745   |
| 1_4   | 13532957   |
| 1_19  | 12767473   |
| 1_17  | 13670687   |
| 1_13  | 13469515   |
| 1_23  | 12517632   |
| 1_6   | 13634338   |
| 1_14  | 13611322   |
| 1_3   | 13061900   |
| 1_11  | 12760978   |
+---++
23 rows selected (82294.854 seconds)
~~~

The sum of those record counts is  306,772,763 which is close to 
the

320,843,454 in the source file:

~~~
0: jdbc:drill:zk=es05:2181> select count(*)  FROM 
root.`sample_201501.dat`;

++
|   EXPR$0   |
++
| 320843454  |
++
1 row selected (384.665 seconds)
~~~


It represents one month of data, 4 key columns and 38 numeric 
measure
columns, which could also be partitioned da

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
How large is the data set you are working with, and your 
cluster/nodes?


Just testing with that single 44GB source file currently, and my test 
cluster is made from 4 nodes, each with 8 CPU cores, 32GB RAM, a 6TB 
Ext4 volume (RAID-10).


Drill defaults left as come in v1.0. I will be adjusting memory and 
retrying the CTAS.


I know I can / should assign individual disks to HDFS, but as a test 
cluster there are apps that expect data volumes to work on. A dedicated 
Hadoop production cluster would have a disk layout specific to the task.



On 28 May 2015, at 12:26, Andries Engelbrecht wrote:

Just check the drillbit.log and drillbit.out files in the log 
directory.
Before adjusting memory, see if that is an issue first. It was for me, 
but as Jason mentioned there can be other causes as well.


You adjust memory allocation in the drill-env.sh files, and have to 
restart the drill bits.


How large is the data set you are working with, and your 
cluster/nodes?


—Andries


On May 28, 2015, at 9:17 AM, Matt  wrote:

To make sure I am adjusting the correct config, these are heap 
parameters within the Drill configure path, not for Hadoop or 
Zookeeper?



On May 28, 2015, at 12:08 PM, Jason Altekruse 
 wrote:


There should be no upper limit on the size of the tables you can 
create

with Drill. Be advised that Drill does currently operate entirely
optimistically in regards to available resources. If a network 
connection

between two drillbits fails during a query, we will not currently
re-schedule the work to make use of remaining nodes and network 
connections
that are still live. While we have had a good amount of success 
using Drill

for data conversion, be aware that these conditions could cause long
running queries to fail.

That being said, it isn't the only possible cause for such a 
failure. In
the case of a network failure we would expect to see a message 
returned to
you that part of the query was unsuccessful and that it had been 
cancelled.
Andries has a good suggestion in regards to checking the heap 
memory, this
should also be detected and reported back to you at the CLI, but we 
may be
failing to propagate the error back to the head node for the query. 
I
believe writing parquet may still be the most heap-intensive 
operation in
Drill, despite our efforts to refactor the write path to use direct 
memory
instead of on-heap for large buffers needed in the process of 
creating

parquet files.


On Thu, May 28, 2015 at 8:43 AM, Matt  wrote:

Is 300MM records too much to do in a single CTAS statement?

After almost 23 hours I killed the query (^c) and it returned:

~~~
+---++
| Fragment  | Number of records written  |
+---++
| 1_20  | 13568824   |
| 1_15  | 12411822   |
| 1_7   | 12470329   |
| 1_12  | 13693867   |
| 1_5   | 13292136   |
| 1_18  | 13874321   |
| 1_16  | 13303094   |
| 1_9   | 13639049   |
| 1_10  | 13698380   |
| 1_22  | 13501073   |
| 1_8   | 13533736   |
| 1_2   | 13549402   |
| 1_21  | 13665183   |
| 1_0   | 13544745   |
| 1_4   | 13532957   |
| 1_19  | 12767473   |
| 1_17  | 13670687   |
| 1_13  | 13469515   |
| 1_23  | 12517632   |
| 1_6   | 13634338   |
| 1_14  | 13611322   |
| 1_3   | 13061900   |
| 1_11  | 12760978   |
+---++
23 rows selected (82294.854 seconds)
~~~

The sum of those record counts is  306,772,763 which is close to 
the

320,843,454 in the source file:

~~~
0: jdbc:drill:zk=es05:2181> select count(*)  FROM 
root.`sample_201501.dat`;

++
|   EXPR$0   |
++
| 320843454  |
++
1 row selected (384.665 seconds)
~~~


It represents one month of data, 4 key columns and 38 numeric 
measure
columns, which could also be partitioned daily. The test here was 
to create
monthly Parquet files to see how the min/max stats on Parquet 
chunks help

with range select performance.

Instead of a small number of large monthly RDBMS tables, I am 
attempting
to determine how many Parquet files should be used with Drill / 
HDFS.





On 27 May 2015, at 15:17, Matt wrote:

Attempting to create a Parquet backed table with a CTAS from a 44GB tab
delimited file in HDFS. The process seemed to be running, as CPU and IO was
seen on all 4 nodes in this cluster, and .parquet files being created in
the expected path.

However, in the last two hours or so, all nodes show near zero CPU or
IO, and the Last Modified date on the .parquet files has not changed. The same time
delay shown in 

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
I did not note any memory errors or warnings in a quick scan of the logs, but 
to double check, is there a specific log where I would find such warnings?


> On May 28, 2015, at 12:01 PM, Andries Engelbrecht  
> wrote:
> 
> I have used a single CTAS to create tables using parquet with 1.5B rows.
> 
> It did consume a lot of heap memory on the Drillbits and I had to increase 
> the heap size. Check your logs to see if you are running out of heap memory.
> 
> I used 128MB parquet block size.
> 
> This was with Drill 0.9 , so I’m sure 1.0 will be better in this regard.
> 
> —Andries
> 
> 
> 
>> On May 28, 2015, at 8:43 AM, Matt  wrote:
>> 
>> Is 300MM records too much to do in a single CTAS statement?
>> 
>> After almost 23 hours I killed the query (^c) and it returned:
>> 
>> ~~~
>> +---++
>> | Fragment  | Number of records written  |
>> +---++
>> | 1_20  | 13568824   |
>> | 1_15  | 12411822   |
>> | 1_7   | 12470329   |
>> | 1_12  | 13693867   |
>> | 1_5   | 13292136   |
>> | 1_18  | 13874321   |
>> | 1_16  | 13303094   |
>> | 1_9   | 13639049   |
>> | 1_10  | 13698380   |
>> | 1_22  | 13501073   |
>> | 1_8   | 13533736   |
>> | 1_2   | 13549402   |
>> | 1_21  | 13665183   |
>> | 1_0   | 13544745   |
>> | 1_4   | 13532957   |
>> | 1_19  | 12767473   |
>> | 1_17  | 13670687   |
>> | 1_13  | 13469515   |
>> | 1_23  | 12517632   |
>> | 1_6   | 13634338   |
>> | 1_14  | 13611322   |
>> | 1_3   | 13061900   |
>> | 1_11  | 12760978   |
>> +---++
>> 23 rows selected (82294.854 seconds)
>> ~~~
>> 
>> The sum of those record counts is  306,772,763 which is close to the  
>> 320,843,454 in the source file:
>> 
>> ~~~
>> 0: jdbc:drill:zk=es05:2181> select count(*)  FROM root.`sample_201501.dat`;
>> ++
>> |   EXPR$0   |
>> ++
>> | 320843454  |
>> ++
>> 1 row selected (384.665 seconds)
>> ~~~
>> 
>> 
>> It represents one month of data, 4 key columns and 38 numeric measure 
>> columns, which could also be partitioned daily. The test here was to create 
>> monthly Parquet files to see how the min/max stats on Parquet chunks help 
>> with range select performance.
>> 
>> Instead of a small number of large monthly RDBMS tables, I am attempting to 
>> determine how many Parquet files should be used with Drill / HDFS.
>> 
>> 
>> 
>>> On 27 May 2015, at 15:17, Matt wrote:
>>> 
>>> Attempting to create a Parquet backed table with a CTAS from an 44GB tab 
>>> delimited file in HDFS. The process seemed to be running, as CPU and IO was 
>>> seen on all 4 nodes in this cluster, and .parquet files being created in 
>>> the expected path.
>>> 
>>> However, in the last two hours or so, all nodes show near zero CPU or IO, 
>>> and the Last Modified date on the .parquet has not changed. Same time 
>>> delay shown in the Last Progress column in the active fragment profile.
>>> 
>>> What approach can I take to determine what is happening (or not)?
> 


Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
To make sure I am adjusting the correct config, these are heap parameters 
within the Drill configuration path, not for Hadoop or Zookeeper?
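
A side note on the 128MB Parquet block size mentioned upthread: that is a
Drill option rather than a JVM heap parameter, so it is set from sqlline,
while the heap itself is raised in conf/drill-env.sh on each drillbit. A
minimal sketch:

~~~
-- 134217728 bytes = 128MB. Sketch only; re-run the CTAS after changing it.
ALTER SYSTEM SET `store.parquet.block-size` = 134217728;

-- Confirm the current setting:
SELECT * FROM sys.options WHERE name = 'store.parquet.block-size';
~~~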


> On May 28, 2015, at 12:08 PM, Jason Altekruse  
> wrote:
> 
> There should be no upper limit on the size of the tables you can create
> with Drill. Be advised that Drill does currently operate entirely
> optimistically in regards to available resources. If a network connection
> between two drillbits fails during a query, we will not currently
> re-schedule the work to make use of remaining nodes and network connections
> that are still live. While we have had a good amount of success using Drill
> for data conversion, be aware that these conditions could cause long
> running queries to fail.
> 
> That being said, it isn't the only possible cause for such a failure. In
> the case of a network failure we would expect to see a message returned to
> you that part of the query was unsuccessful and that it had been cancelled.
> Andries has a good suggestion in regards to checking the heap memory, this
> should also be detected and reported back to you at the CLI, but we may be
> failing to propagate the error back to the head node for the query. I
> believe writing parquet may still be the most heap-intensive operation in
> Drill, despite our efforts to refactor the write path to use direct memory
> instead of on-heap for large buffers needed in the process of creating
> parquet files.
> 
>> On Thu, May 28, 2015 at 8:43 AM, Matt  wrote:
>> 
>> Is 300MM records too much to do in a single CTAS statement?
>> 
>> After almost 23 hours I killed the query (^c) and it returned:
>> 
>> ~~~
>> +---++
>> | Fragment  | Number of records written  |
>> +---++
>> | 1_20  | 13568824   |
>> | 1_15  | 12411822   |
>> | 1_7   | 12470329   |
>> | 1_12  | 13693867   |
>> | 1_5   | 13292136   |
>> | 1_18  | 13874321   |
>> | 1_16  | 13303094   |
>> | 1_9   | 13639049   |
>> | 1_10  | 13698380   |
>> | 1_22  | 13501073   |
>> | 1_8   | 13533736   |
>> | 1_2   | 13549402   |
>> | 1_21  | 13665183   |
>> | 1_0   | 13544745   |
>> | 1_4   | 13532957   |
>> | 1_19  | 12767473   |
>> | 1_17  | 13670687   |
>> | 1_13  | 13469515   |
>> | 1_23  | 12517632   |
>> | 1_6   | 13634338   |
>> | 1_14  | 13611322   |
>> | 1_3   | 13061900   |
>> | 1_11  | 12760978   |
>> +---++
>> 23 rows selected (82294.854 seconds)
>> ~~~
>> 
>> The sum of those record counts is  306,772,763 which is close to the
>> 320,843,454 in the source file:
>> 
>> ~~~
>> 0: jdbc:drill:zk=es05:2181> select count(*)  FROM root.`sample_201501.dat`;
>> ++
>> |   EXPR$0   |
>> ++
>> | 320843454  |
>> ++
>> 1 row selected (384.665 seconds)
>> ~~~
>> 
>> 
>> It represents one month of data, 4 key columns and 38 numeric measure
>> columns, which could also be partitioned daily. The test here was to create
>> monthly Parquet files to see how the min/max stats on Parquet chunks help
>> with range select performance.
>> 
>> Instead of a small number of large monthly RDBMS tables, I am attempting
>> to determine how many Parquet files should be used with Drill / HDFS.
>> 
>> 
>> 
>> 
>> On 27 May 2015, at 15:17, Matt wrote:
>> 
>> Attempting to create a Parquet backed table with a CTAS from a 44GB tab
>>> delimited file in HDFS. The process seemed to be running, as CPU and IO was
>>> seen on all 4 nodes in this cluster, and .parquet files being created in
>>> the expected path.
>>> 
>>> However, in the last two hours or so, all nodes show near zero CPU or
>>> IO, and the Last Modified date on the .parquet has not changed. Same time
>>> delay shown in the Last Progress column in the active fragment profile.
>>> 
>>> What approach can I take to determine what is happening (or not)?
>> 


Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt

Is 300MM records too much to do in a single CTAS statement?

After almost 23 hours I killed the query (^c) and it returned:

~~~
+---++
| Fragment  | Number of records written  |
+---++
| 1_20  | 13568824   |
| 1_15  | 12411822   |
| 1_7   | 12470329   |
| 1_12  | 13693867   |
| 1_5   | 13292136   |
| 1_18  | 13874321   |
| 1_16  | 13303094   |
| 1_9   | 13639049   |
| 1_10  | 13698380   |
| 1_22  | 13501073   |
| 1_8   | 13533736   |
| 1_2   | 13549402   |
| 1_21  | 13665183   |
| 1_0   | 13544745   |
| 1_4   | 13532957   |
| 1_19  | 12767473   |
| 1_17  | 13670687   |
| 1_13  | 13469515   |
| 1_23  | 12517632   |
| 1_6   | 13634338   |
| 1_14  | 13611322   |
| 1_3   | 13061900   |
| 1_11  | 12760978   |
+---++
23 rows selected (82294.854 seconds)
~~~

The sum of those record counts is  306,772,763 which is close to the  
320,843,454 in the source file:


~~~
0: jdbc:drill:zk=es05:2181> select count(*)  FROM root.`sample_201501.dat`;
++
|   EXPR$0   |
++
| 320843454  |
++
1 row selected (384.665 seconds)
~~~


It represents one month of data, 4 key columns and 38 numeric measure 
columns, which could also be partitioned daily. The test here was to 
create monthly Parquet files to see how the min/max stats on Parquet 
chunks help with range select performance.


Instead of a small number of large monthly RDBMS tables, I am attempting 
to determine how many Parquet files should be used with Drill / HDFS.




On 27 May 2015, at 15:17, Matt wrote:

Attempting to create a Parquet backed table with a CTAS from a 44GB 
tab delimited file in HDFS. The process seemed to be running, as CPU 
and IO was seen on all 4 nodes in this cluster, and .parquet files 
being created in the expected path.


However, in the last two hours or so, all nodes show near zero CPU 
or IO, and the Last Modified date on the .parquet has not changed. 
Same time delay shown in the Last Progress column in the active 
fragment profile.


What approach can I take to determine what is happening (or not)?


Monitoring long / stuck CTAS

2015-05-27 Thread Matt
Attempting to create a Parquet backed table with a CTAS from a 44GB tab 
delimited file in HDFS. The process seemed to be running, as CPU and IO 
was seen on all 4 nodes in this cluster, and .parquet files being 
created in the expected path.


However, in the last two hours or so, all nodes show near zero CPU or 
IO, and the Last Modified date on the .parquet has not changed. Same 
time delay shown in the Last Progress column in the active fragment 
profile.


What approach can I take to determine what is happening (or not)?



Re: Query local files on cluster? [Beginner]

2015-05-27 Thread Matt
Drill can process a lot of data quickly, and for best performance and 
consistency you will likely find that the sooner you get the data to 
the DFS the better.


Already most of the way there. Initial confusion came from the features 
to query the local / native filesystem, and how that does not fit a 
distributed Drill cluster well. In other words it's really an embedded / 
single-node Drill feature.


Currently using the approach of doing a put from the local filesystem into 
HDFS, then CTAS into Parquet, if only for simplicity in testing (not 
performance).
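
A minimal sketch of that two-step flow, with hypothetical paths and table
names (the initial copy is the `hadoop fs -put` suggested below, shown here
only as a comment):

~~~
-- Sketch only: stage the delimited file in HDFS first, e.g.
--   hadoop fs -put customer_reviews_1998.csv /localdata/hadoop/stage/
-- then convert it to Parquet with CTAS into a writable workspace
-- (dfs.tmp is an assumption; the root workspace above is writable: false).
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`customer_reviews_parquet` AS
SELECT columns[0] AS col0,
       columns[1] AS col1
FROM dfs.root.`customer_reviews_1998.csv`;
~~~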


Thanks!

On 27 May 2015, at 11:29, Andries Engelbrecht wrote:

You will be better off to use the Drill cluster as a whole vs trying 
to play with local vs DFS storage.


A couple of ideas:
As previously mentioned you can use the robust NFS on MapR to easily 
place the CSV/files on the DFS, and then use Drill with CTAS to 
convert the files to Parquet on the DFS.


You can set up a remote NFS server and map the local FS on each node 
to the same NFS mount point to the remote NFS server, this will the 
files will be consistently available to the Drillbits in the cluster 
and you can do CTAS to create parquet file son the DFS. This however 
will likely be a lot slower than the first option, as the NFS server 
BW will become a bottleneck if you have a number of drillbits in the 
cluster.


Just copy the files to one node in the cluster and then use hadoop fs 
to put the files in the DFS, and then do the CTAS from DFS to parquet 
DFS.


You can even place the data on S3 and then query and CTAS from there, 
however security and bandwidth may be a concern for large data 
volumes, pending the use case.


I really think you will find the first option the most robust and 
fastest in the long run. You can point Drill at any FS source as long 
as it is consistent to all nodes in the cluster, but keep in mind that 
Drill can process a lot of data quickly, and for best performance and 
consistency you will likely find that the sooner you get the data to 
the DFS the better.






On May 26, 2015, at 5:58 PM, Matt  wrote:

Thanks, I am incorrectly conflating the file system with data 
storage.


Looking to experiment with the Parquet format, and was looking at 
CTAS queries as an import approach.


Are direct queries over local files meant for an embedded drill, 
where on a cluster files should be moved into HDFS first?


That would make sense as files on one node would be query bound to 
that local filesystem.


On May 26, 2015, at 8:28 PM, Andries Engelbrecht 
 wrote:


You can use the HDFS shell
hadoop fs -put

To copy from local file system to HDFS


For more robust mechanisms from remote systems you can look at using 
NFS, MapR has a really robust NFS integration and you can use it 
with the community edition.






On May 26, 2015, at 5:11 PM, Matt  wrote:


That might be the end goal, but currently I don't have an HDFS 
ingest mechanism.


We are not currently a Hadoop shop - can you suggest simple 
approaches for bulk loading data from delimited files into HDFS?





On May 26, 2015, at 8:04 PM, Andries Engelbrecht 
 wrote:


Perhaps I’m missing something here.

Why not create a DFS plug in for HDFS and put the file in HDFS?




On May 26, 2015, at 4:54 PM, Matt  wrote:

New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it 
appears text files need to be on all nodes in a cluster?


Using the dfs config below, I am only able to query if a csv file 
is on all 4 nodes. If the file is only on the local node and not 
others, I get errors in the form of:


~~~
0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 'root.customer_reviews_1998.csv' not found
~~~

~~~
{
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
"root": {
"location": "/localdata/hadoop/stage",
"writable": false,
"defaultInputFormat": null
},
~~~


On 25 May 2015, at 20:39, Kristine Hahn wrote:

The storage plugin "location" needs to be the full path to the localdata
directory. This partial storage plugin definition works for the user named
mapr:

{
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
"root": {
"location": "/home/mapr/localdata",
"writable": false,
"defaultInputFormat": null
},
. . .

Here's a working query for the data in localdata:

0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
. . . . . . . > COLUMNS[1] AS Publication_Date,
. . . . . . . > COLUMNS[2] AS Frequency
. . . . . . . > FROM dfs.root.`mydata.csv`
. . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
. . . . . . . > AND (columns[2] > 250)) LIMIT 10;

A complete example, not yet published on the Drill site, shows in detail
the steps involved:
http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file

Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
Thanks, I am incorrectly conflating the file system with data storage. 

Looking to experiment with the Parquet format, and was looking at CTAS queries 
as an import approach.

Are direct queries over local files meant for an embedded drill, where on a 
cluster files should be moved into HDFS first?

That would make sense as files on one node would be query bound to that local 
filesystem. 

> On May 26, 2015, at 8:28 PM, Andries Engelbrecht  
> wrote:
> 
> You can use the HDFS shell
> hadoop fs -put
> 
> To copy from local file system to HDFS
> 
> 
> For more robust mechanisms from remote systems you can look at using NFS, 
> MapR has a really robust NFS integration and you can use it with the 
> community edition.
> 
> 
> 
> 
>> On May 26, 2015, at 5:11 PM, Matt  wrote:
>> 
>> 
>> That might be the end goal, but currently I don't have an HDFS ingest 
>> mechanism. 
>> 
>> We are not currently a Hadoop shop - can you suggest simple approaches for 
>> bulk loading data from delimited files into HDFS?
>> 
>> 
>> 
>> 
>>> On May 26, 2015, at 8:04 PM, Andries Engelbrecht 
>>>  wrote:
>>> 
>>> Perhaps I’m missing something here.
>>> 
>>> Why not create a DFS plug in for HDFS and put the file in HDFS?
>>> 
>>> 
>>> 
>>>> On May 26, 2015, at 4:54 PM, Matt  wrote:
>>>> 
>>>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text 
>>>> files need to be on all nodes in a cluster?
>>>> 
>>>> Using the dfs config below, I am only able to query if a csv file is on 
>>>> all 4 nodes. If the file is only on the local node and not others, I get 
>>>> errors in the form of:
>>>> 
>>>> ~~~
>>>> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
>>>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
>>>> 'root.customer_reviews_1998.csv' not found
>>>> ~~~
>>>> 
>>>> ~~~
>>>> {
>>>> "type": "file",
>>>> "enabled": true,
>>>> "connection": "file:///",
>>>> "workspaces": {
>>>> "root": {
>>>>   "location": "/localdata/hadoop/stage",
>>>>   "writable": false,
>>>>   "defaultInputFormat": null
>>>> },
>>>> ~~~
>>>> 
>>>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>>>> 
>>>>> The storage plugin "location" needs to be the full path to the localdata
>>>>> directory. This partial storage plugin definition works for the user named
>>>>> mapr:
>>>>> 
>>>>> {
>>>>> "type": "file",
>>>>> "enabled": true,
>>>>> "connection": "file:///",
>>>>> "workspaces": {
>>>>> "root": {
>>>>> "location": "/home/mapr/localdata",
>>>>> "writable": false,
>>>>> "defaultInputFormat": null
>>>>> },
>>>>> . . .
>>>>> 
>>>>> Here's a working query for the data in localdata:
>>>>> 
>>>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>>>> . . . . . . . > COLUMNS[2] AS Frequency
>>>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>>>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>>>> 
>>>>> A complete example, not yet published on the Drill site, shows in detail
>>>>> the steps involved:
>>>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>>>> 
>>>>> 
>>>>> Kristine Hahn
>>>>> Sr. Technical Writer
>>>>> 415-497-8107 @krishahn
>>>>> 
>>>>> 
>>>>>> On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:
>>>>>> 
>>>>>> I have used a single node install (unzip and run) to query local text /
>>>>>> csv files, but on a 3 node cluster (installed via MapR CE), a query with
>>>>>> local files results in:
>>>>>> 
>>>>>> ~~~
>>>>>> sqlline version 1.1.6
>>>>>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>>>> Table 'dfs./localdata/testdata.csv' not found
>>>>>> 
>>>>>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>>>> Table 'dfs./localdata/testdata.csv' not found
>>>>>> ~~~
>>>>>> 
>>>>>> Is there a special config for local file querying? An initial doc search
>>>>>> did not point me to a solution, but I may simply not have found the
>>>>>> relevant sections.
>>>>>> 
>>>>>> I have tried modifying the default dfs config to no avail:
>>>>>> 
>>>>>> ~~~
>>>>>> "type": "file",
>>>>>> "enabled": true,
>>>>>> "connection": "file:///",
>>>>>> "workspaces": {
>>>>>> "root": {
>>>>>> "location": "/localdata",
>>>>>> "writable": false,
>>>>>> "defaultInputFormat": null
>>>>>> }
>>>>>> ~~~
> 


Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
A better explanation from my last: Looking to use CTAS queries as HDFS ingest. 


> On May 26, 2015, at 8:04 PM, Andries Engelbrecht  
> wrote:
> 
> Perhaps I’m missing something here.
> 
> Why not create a DFS plug in for HDFS and put the file in HDFS?
> 
> 
> 
>> On May 26, 2015, at 4:54 PM, Matt  wrote:
>> 
>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text 
>> files need to be on all nodes in a cluster?
>> 
>> Using the dfs config below, I am only able to query if a csv file is on all 
>> 4 nodes. If the file is only on the local node and not others, I get errors 
>> in the form of:
>> 
>> ~~~
>> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
>> 'root.customer_reviews_1998.csv' not found
>> ~~~
>> 
>> ~~~
>> {
>> "type": "file",
>> "enabled": true,
>> "connection": "file:///",
>> "workspaces": {
>>   "root": {
>> "location": "/localdata/hadoop/stage",
>> "writable": false,
>> "defaultInputFormat": null
>>   },
>> ~~~
>> 
>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>> 
>>> The storage plugin "location" needs to be the full path to the localdata
>>> directory. This partial storage plugin definition works for the user named
>>> mapr:
>>> 
>>> {
>>> "type": "file",
>>> "enabled": true,
>>> "connection": "file:///",
>>> "workspaces": {
>>> "root": {
>>>  "location": "/home/mapr/localdata",
>>>  "writable": false,
>>>  "defaultInputFormat": null
>>> },
>>> . . .
>>> 
>>> Here's a working query for the data in localdata:
>>> 
>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>> . . . . . . . > COLUMNS[2] AS Frequency
>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>> 
>>> A complete example, not yet published on the Drill site, shows in detail
>>> the steps involved:
>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>> 
>>> 
>>> Kristine Hahn
>>> Sr. Technical Writer
>>> 415-497-8107 @krishahn
>>> 
>>> 
>>>> On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:
>>>> 
>>>> I have used a single node install (unzip and run) to query local text /
>>>> csv files, but on a 3 node cluster (installed via MapR CE), a query with
>>>> local files results in:
>>>> 
>>>> ~~~
>>>> sqlline version 1.1.6
>>>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>> Table 'dfs./localdata/testdata.csv' not found
>>>> 
>>>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>> Table 'dfs./localdata/testdata.csv' not found
>>>> ~~~
>>>> 
>>>> Is there a special config for local file querying? An initial doc search
>>>> did not point me to a solution, but I may simply not have found the
>>>> relevant sections.
>>>> 
>>>> I have tried modifying the default dfs config to no avail:
>>>> 
>>>> ~~~
>>>> "type": "file",
>>>> "enabled": true,
>>>> "connection": "file:///",
>>>> "workspaces": {
>>>> "root": {
>>>>  "location": "/localdata",
>>>>  "writable": false,
>>>>  "defaultInputFormat": null
>>>> }
>>>> ~~~
> 


Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt

That might be the end goal, but currently I don't have an HDFS ingest 
mechanism. 

We are not currently a Hadoop shop - can you suggest simple approaches for bulk 
loading data from delimited files into HDFS?

 


> On May 26, 2015, at 8:04 PM, Andries Engelbrecht  
> wrote:
> 
> Perhaps I’m missing something here.
> 
> Why not create a DFS plug in for HDFS and put the file in HDFS?
> 
> 
> 
>> On May 26, 2015, at 4:54 PM, Matt  wrote:
>> 
>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text 
>> files need to be on all nodes in a cluster?
>> 
>> Using the dfs config below, I am only able to query if a csv file is on all 
>> 4 nodes. If the file is only on the local node and not others, I get errors 
>> in the form of:
>> 
>> ~~~
>> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
>> 'root.customer_reviews_1998.csv' not found
>> ~~~
>> 
>> ~~~
>> {
>> "type": "file",
>> "enabled": true,
>> "connection": "file:///",
>> "workspaces": {
>>   "root": {
>> "location": "/localdata/hadoop/stage",
>> "writable": false,
>> "defaultInputFormat": null
>>   },
>> ~~~
>> 
>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>> 
>>> The storage plugin "location" needs to be the full path to the localdata
>>> directory. This partial storage plugin definition works for the user named
>>> mapr:
>>> 
>>> {
>>> "type": "file",
>>> "enabled": true,
>>> "connection": "file:///",
>>> "workspaces": {
>>> "root": {
>>>  "location": "/home/mapr/localdata",
>>>  "writable": false,
>>>  "defaultInputFormat": null
>>> },
>>> . . .
>>> 
>>> Here's a working query for the data in localdata:
>>> 
>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>> . . . . . . . > COLUMNS[2] AS Frequency
>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>> 
>>> A complete example, not yet published on the Drill site, shows in detail
>>> the steps involved:
>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>> 
>>> 
>>> Kristine Hahn
>>> Sr. Technical Writer
>>> 415-497-8107 @krishahn
>>> 
>>> 
>>>> On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:
>>>> 
>>>> I have used a single node install (unzip and run) to query local text /
>>>> csv files, but on a 3 node cluster (installed via MapR CE), a query with
>>>> local files results in:
>>>> 
>>>> ~~~
>>>> sqlline version 1.1.6
>>>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>> Table 'dfs./localdata/testdata.csv' not found
>>>> 
>>>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>> Table 'dfs./localdata/testdata.csv' not found
>>>> ~~~
>>>> 
>>>> Is there a special config for local file querying? An initial doc search
>>>> did not point me to a solution, but I may simply not have found the
>>>> relevant sections.
>>>> 
>>>> I have tried modifying the default dfs config to no avail:
>>>> 
>>>> ~~~
>>>> "type": "file",
>>>> "enabled": true,
>>>> "connection": "file:///",
>>>> "workspaces": {
>>>> "root": {
>>>>  "location": "/localdata",
>>>>  "writable": false,
>>>>  "defaultInputFormat": null
>>>> }
>>>> ~~~
> 


Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears 
text files need to be on all nodes in a cluster?


Using the dfs config below, I am only able to query if a csv file is on 
all 4 nodes. If the file is only on the local node and not others, I get 
errors in the form of:


~~~
0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 'root.customer_reviews_1998.csv' not found
~~~

~~~
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
"root": {
  "location": "/localdata/hadoop/stage",
  "writable": false,
  "defaultInputFormat": null
},
~~~

On 25 May 2015, at 20:39, Kristine Hahn wrote:

The storage plugin "location" needs to be the full path to the localdata
directory. This partial storage plugin definition works for the user named
mapr:

{
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
 "root": {
   "location": "/home/mapr/localdata",
   "writable": false,
   "defaultInputFormat": null
 },
. . .

Here's a working query for the data in localdata:

0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
. . . . . . . > COLUMNS[1] AS Publication_Date,
. . . . . . . > COLUMNS[2] AS Frequency
. . . . . . . > FROM dfs.root.`mydata.csv`
. . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
. . . . . . . > AND (columns[2] > 250)) LIMIT 10;

A complete example, not yet published on the Drill site, shows in detail
the steps involved:
http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file


Kristine Hahn
Sr. Technical Writer
415-497-8107 @krishahn


On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:

I have used a single node install (unzip and run) to query local text /
csv files, but on a 3 node cluster (installed via MapR CE), a query with
local files results in:

~~~
sqlline version 1.1.6
0: jdbc:drill:> select * from dfs.`testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
Table 'dfs./localdata/testdata.csv' not found

0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
Table 'dfs./localdata/testdata.csv' not found
~~~

Is there a special config for local file querying? An initial doc search
did not point me to a solution, but I may simply not have found the
relevant sections.

I have tried modifying the default dfs config to no avail:

~~~
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
 "root": {
   "location": "/localdata",
   "writable": false,
   "defaultInputFormat": null
 }
~~~



Re: Query local files on cluster? [Beginner]

2015-05-25 Thread Matt
That does represent something I have not tried yet. Will test as soon as 
I can.


Thanks!

On 25 May 2015, at 20:39, Kristine Hahn wrote:

The storage plugin "location" needs to be the full path to the localdata
directory. This partial storage plugin definition works for the user named
mapr:

{
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
 "root": {
   "location": "/home/mapr/localdata",
   "writable": false,
   "defaultInputFormat": null
 },
. . .

Here's a working query for the data in localdata:

0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
. . . . . . . > COLUMNS[1] AS Publication_Date,
. . . . . . . > COLUMNS[2] AS Frequency
. . . . . . . > FROM dfs.root.`mydata.csv`
. . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
. . . . . . . > AND (columns[2] > 250)) LIMIT 10;

A complete example, not yet published on the Drill site, shows in detail
the steps involved:
http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file


Kristine Hahn
Sr. Technical Writer
415-497-8107 @krishahn


On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:

I have used a single node install (unzip and run) to query local text /
csv files, but on a 3 node cluster (installed via MapR CE), a query with
local files results in:

~~~
sqlline version 1.1.6
0: jdbc:drill:> select * from dfs.`testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 
17:

Table 'dfs./localdata/testdata.csv' not found

0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 
17:

Table 'dfs./localdata/testdata.csv' not found
~~~

Is there a special config for local file querying? An initial doc search
did not point me to a solution, but I may simply not have found the
relevant sections.

I have tried modifying the default dfs config to no avail:

~~~
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
 "root": {
   "location": "/localdata",
   "writable": false,
   "defaultInputFormat": null
 }
~~~



Query local files on cluster? [Beginner]

2015-05-24 Thread Matt
I have used a single node install (unzip and run) to query local text / 
csv files, but on a 3 node cluster (installed via MapR CE), a query with 
local files results in:


~~~
sqlline version 1.1.6
0: jdbc:drill:> select * from dfs.`testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17: 
Table 'dfs./localdata/testdata.csv' not found


0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17: 
Table 'dfs./localdata/testdata.csv' not found

~~~

Is there a special config for local file querying? An initial doc search 
did not point me to a solution, but I may simply not have found the 
relevant sections.


I have tried modifying the default dfs config to no avail:

~~~
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
"root": {
  "location": "/localdata",
  "writable": false,
  "defaultInputFormat": null
}
~~~


Re: How do I make json files less painful

2015-03-19 Thread Matt

Is each file a single json array object?

If so, would converting the files to a format with one line per record a 
potential solution?


Example using jq (http://stedolan.github.io/jq/): jq -c '.[]'
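
Where a record does still fit in memory, another option (a sketch only, with
a hypothetical path and the field name from the quoted example below) is to
let Drill unnest the array directly with FLATTEN; for very large
single-record files the jq pre-split above remains necessary:

~~~
-- Sketch only: unnest the array and promote its members to columns. The
-- whole record must still fit in a value vector, so this does not avoid
-- the "Record was too large" error on huge single-object files.
SELECT f.obj.a AS a, f.obj.b AS b, f.obj.c AS c
FROM (
  SELECT FLATTEN(t.MyArrayInTheFile) AS obj
  FROM dfs.root.`myfile.json` t
) f
LIMIT 10;
~~~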


On 19 Mar 2015, at 12:22, Jim Bates wrote:


I constantly, constantly, constantly hit this.

I have json files that are just a huge collection of an array of json
objects

example
"MyArrayInTheFile":
[{"a":"1","b":"2","c":"3"},{"a":"1","b":"2","c":"3"},...]

My issue is in exploring the data, I hit this.

Query failed: Query stopped., Record was too large to copy into 
vector. [

39186288-2e01-408c-b886-dcee0a2c25c5 on maprdemo:31010 ]

I can explore csv, tab, maprdb, hive at fairly large data sets and 
limit
the response to what fits in my system limitations but not json in 
this

format.

The two options I have come up with to move forward are..

1. I strip out 90% of the array values in a file and explore that to 
get
to my view. then go to a larger system and see if I have enough to get 
the

job done.
2. Move to the larger system and explore there taking resources that
don't need to be spent on a science project.

Hoping the smart people have a different option for me,

Jim


Choosing storage for OLAP

2015-03-05 Thread Matt
The default Drill 0.7 install contains plugins for HBase and Hive. As 
someone not overly familiar with either, is there consensus or 
documentation on which is preferable for OLAP use with time-series data?


An example case would be data with a key, timestamp, and multiple 
numeric fact columns, where common ad-hoc queries select based on the 
key and a time range, and aggregate the fact columns.
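
For example, a typical ad-hoc query of that shape (table and column names
below are hypothetical) might look like:

~~~
-- Hypothetical example of the pattern described above: filter on a key and
-- a time range, then aggregate the numeric fact columns.
SELECT series_key,
       SUM(fact_1) AS total_fact_1,
       AVG(fact_2) AS avg_fact_2
FROM dfs.root.`timeseries_facts`
WHERE series_key = 'sensor-42'
  AND event_ts >= TIMESTAMP '2015-01-01 00:00:00'
  AND event_ts <  TIMESTAMP '2015-02-01 00:00:00'
GROUP BY series_key;
~~~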


The source data in this case are csv files, but it would seem that 
storage with indexes would be preferable.