Standalone drillbit without Zookeeper?

2020-09-29 Thread Matt Keranen
Is it possible to run a single node drillbit without Zookeeper, as a "service" without the need for coordination across multiple nodes? `zk.connect: "local"` is not accepted as the equivalent of "zk=local" with drill-embedded.
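A minimal sketch of the drill-override.conf entry being attempted here (the cluster-id and the "local" value are illustrative, not a confirmed working setup):

~~~
// conf/drill-override.conf -- sketch of the setting in question
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "local"   // drill-embedded accepts "zk=local"; whether a standalone drillbit honors this is the open question
}
~~~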

Re: [DISCUSS] Drill Storage Plugins

2019-11-05 Thread Matt
Perhaps an "awesome-drill" repo on GitHub would be a place to back fill the book, and serve as a central location for thins like the list you supplied: https://github.com/topics/awesome On Tue, Nov 5, 2019 at 9:14 AM Charles Givre wrote: > One more thing: I've found code for storage plugins

Re: Problem creating jt400 jdbc connection

2019-07-25 Thread Matt Rabbitt
It works using the 9.4 java8 version. Thanks! On Thu, Jul 25, 2019 at 12:07 PM wrote: > Hi Matt, I tried with 9.4 jt400.rar and it works for me > With these parameters > > { > "type": "jdbc", > "driver": "com.ibm.as400.access.AS400JDBCDriver
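A fuller sketch of the kind of JDBC storage plugin config quoted above, with host and credentials as placeholders:

~~~
{
  "type": "jdbc",
  "driver": "com.ibm.as400.access.AS400JDBCDriver",
  "url": "jdbc:as400://my-ibmi-host",
  "username": "myuser",
  "password": "mypassword",
  "enabled": true
}
~~~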

Problem creating jt400 jdbc connection

2019-07-25 Thread Matt Rabbitt
Is anyone successfully using the jt400 jdbc driver with Drill? I am trying to add a storage plugin but when I go to create it in the web gui I'm getting an error: Please retry: Error while creating / updating storage : java.sql.SQLException: Cannot create PoolableConnectionFactory (The

File does not exist errors across cluster

2018-11-27 Thread Matt Keranen
Have 4 nodes running drillbits version 1.14 for queries over JSON files in the regular filesystem (not HDFS). Each node has an identical directory structure, but not all file names exist on all nodes, and any query in the form of "SELECT ... FROM dfs.logs.`logs*.json.gz`" fails with: Error:

File "does not exist" error on non-distributed filesystem cluster

2018-11-27 Thread Matt Keranen
Have 4 nodes running drillbits version 1.14 for queries over JSON files in the regular filesystem (not HDFS). Each node has an identical directory structure, but not all file names exist on all nodes, and any query in the form of "SELECT ... FROM dfs.logs.`logs*.json.gz`" fails with: Error:
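A hedged sketch of the dfs-style plugin implied by "regular filesystem (not HDFS)"; the workspace path is a placeholder. With a file:/// connection each drillbit reads only its own local disk, so a scan fragment scheduled on one node cannot open a file that exists only on another, which matches the error pattern described:

~~~
{
  "type": "file",
  "connection": "file:///",
  "workspaces": {
    "logs": {
      "location": "/data/logs",
      "writable": false,
      "defaultInputFormat": null
    }
  }
}
~~~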

Re: Failure while reading messages from kafka

2018-09-04 Thread Matt
https://issues.apache.org/jira/browse/DRILL-6723 On Mon, Aug 27, 2018 at 12:27 PM Matt wrote: > I have a Kafka topic with some non-JSON test messages in it, resulting in > errors "Error: DATA_READ ERROR: Failure while reading messages from kafka. > Recordreader was at record:

Failure while reading messages from kafka

2018-08-27 Thread Matt
I have a Kafka topic with some non-JSON test messages in it, resulting in errors "Error: DATA_READ ERROR: Failure while reading messages from kafka. Recordreader was at record: 1" I don't seem to be able to bypass these topic messages with "store.json.reader.skip_invalid_records" or even an
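The option named above is set per session like this; note it governs the JSON record reader, so it may not help when the Kafka reader itself fails on a non-JSON message:

~~~
ALTER SESSION SET `store.json.reader.skip_invalid_records` = true;
~~~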

Cassandra storage plugin

2018-01-16 Thread Matt
I note there are some old Jira issues about Cassandra storage, and have a concept of why it could be very valuable for Drill. Can anyone support or refute the idea? Cassandra is an excellent engine for high-volume ingest, but support for aggregations and scans is very limited. Would a Drill

Re: Drill Summit/Conference Proposal

2017-06-16 Thread Matt K
A counterpoint: I would be concerned that Drill would be overshadowed by more “popular” or more entrenched platforms. Drill is an excellent and somewhat unique tech that needs more exposure to grow. An event that focuses purely on Drill may have better success at that. The caveat may be that a

Inequality join error with date range and calendar table

2017-03-20 Thread Matt
Using a calendar table with monthly start and end dates, I am attempting to count records in another table that has cycle start and end dates. In PostgreSQL I would either use a date range type, or in standard SQL do something like: ``` SELECT m.startdate as monthdate, COUNT(distinct
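A sketch of the overlap-count pattern being described (table and column names are assumed); this is the query shape that was triggering the inequality-join error:

~~~
SELECT m.startdate AS monthdate,
       COUNT(DISTINCT c.id) AS cycles
FROM months m
JOIN cycles c
  ON c.cycle_start <= m.enddate
 AND c.cycle_end   >= m.startdate
GROUP BY m.startdate;
~~~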

Re: How to avoid case sensitivity in group by

2017-02-08 Thread Matt
Drill is not SQL Server, and not expected to work identically. Using the upper() and lower() functions is a common approach, unless you find options to set the collation sort order in the Drill docs. > On Feb 8, 2017, at 1:13 PM, Dechang Gu wrote: > > Sanjiv, > > Can you
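A minimal example of the suggested approach (column and file names are placeholders):

~~~
SELECT UPPER(city) AS city, COUNT(*) AS cnt
FROM dfs.`/data/customers.csv`
GROUP BY UPPER(city);
~~~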

Performance with multiple FLATTENs

2016-07-15 Thread Matt
I have JSON data with a nested list and am using FLATTEN to extract two of three list elements as: ~~~ SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ... ~~~ This works, but each FLATTEN seems to slow the query down dramatically, 3x slower with the second flatten.
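One possible rework, sketched here and untested: flatten once in a subquery and index the result, so FLATTEN is evaluated a single time (the file path is a placeholder):

~~~
SELECT t.id, t.d[0] AS dttm, t.d[1] AS result
FROM (SELECT id, FLATTEN(data) AS d FROM dfs.`/data/file.json`) t;
~~~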

Re: Drill 1.6 on MapR cluster not using extractHeader ?

2016-04-18 Thread Matt
r, similar to what you reported. I had to set the "skipFirstLine" option to true, for it to work. Strangely, for subsequent queries, it works even after removing / disabling the "skipFirstLine" option. This could be a bug, but I'm not able to reproduce it right now. Will file a JIRA once

Drill 1.6 on MapR cluster not using extractHeader ?

2016-04-15 Thread Matt
With files in the local filesystem, and an embedded drill bit from the download on drill.apache.org, I can successfully query csv data by column name with the extractHeader option on, as in SELECT customer_if FROM `file`; But in a MapR cluster (v. 5.1.0.37549.GA) with the data in MapR-FS, the
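For reference, the kind of text-format block on the storage plugin that extractHeader refers to (a sketch; whether the MapR-FS plugin in this cluster picks it up is the open question):

~~~
"formats": {
  "csv": {
    "type": "text",
    "extensions": ["csv"],
    "delimiter": ",",
    "extractHeader": true
  }
}
~~~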

Re: NumberFormatException with cast to double?

2016-03-13 Thread Matt
ion solved the problem for me: CAST(COALESCE(t_total, 0.0) AS double) On Fri, Mar 11, 2016 at 12:45 AM, Matt <bsg...@gmail.com> wrote: ~~~ 00-01 Project(date_tm=[CAST($23):TIMESTAMP(0)], id_1=[CAST($11):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$prima

Re: NumberFormatException with cast to double?

2016-03-10 Thread Matt
}, { "ref" : "`b_1250`", "expr" : "cast( ( ( if (isnotnull(`b_1250`) ) then (`b_1250` ) else (0 ) end ) ) as BIGINT )" }, { "ref" : "`t_1250`", "expr" : "cast( ( ( if (isnotnull(`t_1250`) )

Re: NumberFormatException with cast to double?

2016-03-10 Thread Matt
(which may be empty string) out of a CSV file. You should instead write out a full case statement that checks for empty string and provides your default value of 0 in that case. - Jason Jason Altekruse Software Engineer at Dremio Apache Drill Committer On Thu, Mar 10, 2016 at 2:32 PM, Matt <
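A sketch of the full case-statement approach suggested above (column and file names are assumed):

~~~
SELECT CASE WHEN t_total IS NULL OR t_total = '' THEN 0.0
            ELSE CAST(t_total AS DOUBLE)
       END AS t_total
FROM dfs.`/csv/data.csv`;
~~~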

Re: CTAS error with CSV data

2016-01-27 Thread Matt
PM, Matt <bsg...@gmail.com> wrote: The CTAS fails with: ~~~ Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: >= 0) Fragment 1:2 [Error Id: 1807615e-4385-4f85-8402-5900aaa568e9 on es07:31010] (java.lang.IllegalArgumentException) length: -260 (expec

CTAS error with CSV data

2016-01-26 Thread Matt
Getting some errors when attempting to create Parquet files from CSV data, and trying to determine if it is due to the format of the source data. It's a fairly simple format of "datetime,key,key,key,numeric,numeric,numeric, ..." with 32 of those numeric columns in total. The source data
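One way to narrow down whether the source format is at fault is to cast each field explicitly in the CTAS rather than using select *; this is a sketch with assumed names and types (headerless CSV is exposed as the columns array):

~~~
CREATE TABLE dfs.tmp.`customer_parquet` AS
SELECT CAST(columns[0] AS TIMESTAMP) AS date_tm,
       columns[1] AS key_1,
       CAST(columns[4] AS DOUBLE) AS metric_1
FROM dfs.`/csv/customer/customer_20151017.csv`;
~~~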

SELECT * via sqlline -q dumps filenames

2016-01-26 Thread Matt
sqlline -u ... -q 'SELECT * FROM dfs.`/path/to/files/file.csv` LIMIT 10' seems to emit a list of files in the local path (pwd), along with a parsing error. Putting the query in a file and passing that file name to sqlline or using an explicit column list runs the query as expected. Is this
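The file-based workaround described above might look like this (the connection string is a placeholder; --run is the sqlline option for executing a script file):

~~~
echo "SELECT * FROM dfs.\`/path/to/files/file.csv\` LIMIT 10;" > query.sql
sqlline -u "jdbc:drill:zk=local" --run=query.sql
~~~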

Re: CTAS error with CSV data

2016-01-26 Thread Matt
On 26 Jan 2016, at 12:55, Abdel Hakim Deneche wrote: Does a select * on the same data also fail ? On Tue, Jan 26, 2016 at 9:44 AM, Matt <bsg...@gmail.com> wrote: Getting some errors when attempting to create Parquet files from CSV data, and trying to determine if it is due to the format of

Re: CTAS error with CSV data

2016-01-26 Thread Matt
Can you try enabling verbose errors and running the query again? This should provide us with more details about the error. You can enable verbose errors by running the following before the select *: alter session set `exec.errors.verbose`=true; thanks On Tue, Jan 26, 2016 at 11:01 AM, Matt <bsg...@

CTAS plan showing single node?

2016-01-21 Thread Matt
Running a CTAS from csv files in a 4 node HDFS cluster into a Parquet file, and I note the physical plan in the Drill UI references scans of all the csv sources on a single node. collectl implies read and write IO on all 4 nodes - does this imply that the full cluster is used for scanning the

File size limit for CTAS?

2016-01-21 Thread Matt
Converting CSV files to Parquet with CTAS, and getting errors on some larger files: With a source file of 16.34GB (as reported in the HDFS explorer): ~~~ create table `/parquet/customer_20151017` partition by (date_tm) AS select * from `/csv/customer/customer_20151017.csv`; Error: SYSTEM

Re: Tableau / filter issues

2015-08-17 Thread Matt
I think Tableau also uses the first query to fetch the structure / metadata of the expected result set. We have often eliminated Tableau performance issues by hiding the structure of queries behind database views. Could that be a possible solution here? On 17 Aug 2015, at
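A sketch of the views approach (workspace, view name, and query are placeholders); Tableau then queries the view instead of the raw query text:

~~~
CREATE VIEW dfs.tmp.`orders_v` AS
SELECT o.order_id, o.order_date, o.amount
FROM dfs.`/data/orders.parquet` o;
~~~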

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Matt
On 23 Jul 2015, at 10:53, Abdel Hakim Deneche wrote: When you try to read schema-less data, Drill will first investigate the first 1000 rows to figure out a schema for your data, then it will use this schema for the remainder of the query. To clarify, if the JSON schema changes on the 1001st or 1MMth

Sorting and partitioning for range scans?

2015-06-01 Thread Matt
I have seen some discussions on the Parquet storage format suggesting that sorting time series data on the time key prior to converting to the Parquet format will improve range query efficiency via min/max values on column chunks - perhaps analogous to skip indexes? Is this a recommended
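A sketch of the sort-before-write idea with assumed names: partition by day and order by the time key so each Parquet row group carries a tight min/max range on the sort column (PARTITION BY in CTAS requires a sufficiently recent Drill release):

~~~
CREATE TABLE dfs.tmp.`events_parquet`
PARTITION BY (event_date) AS
SELECT event_date, event_time, sensor_id, reading
FROM dfs.`/csv/events`
ORDER BY event_time;
~~~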

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
10:57 AM, Matt wrote: Did you check the log files for any errors? No messages related to this query contain errors or warnings, nor anything mentioning memory or heap. Querying now to determine what is missing in the parquet destination. drillbit.out on the master shows no error messages

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
memory per node. DRILL_HEAP is for the heap size per node. More info here http://drill.apache.org/docs/configuring-drill-memory/ —Andries On May 28, 2015, at 11:09 AM, Matt bsg...@gmail.com wrote: Referencing http://drill.apache.org/docs/configuring-drill-memory/ Is DRILL_MAX_DIRECT_MEMORY
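For reference, the two per-node settings referenced above live in conf/drill-env.sh; the values here are only examples:

~~~
DRILL_MAX_DIRECT_MEMORY="16G"   # direct (off-heap) memory per drillbit
DRILL_HEAP="8G"                 # JVM heap per drillbit
~~~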

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
, at 13:42, Andries Engelbrecht wrote: It should execute multi threaded, need to check on text file. Did you check the log files for any errors? On May 28, 2015, at 10:36 AM, Matt bsg...@gmail.com wrote: The time seems pretty long for that file size. What type of file is it? Tab delimited UTF-8

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
for the query. I believe writing parquet may still be the most heap-intensive operation in Drill, despite our efforts to refactor the write path to use direct memory instead of on-heap for large buffers needed in the process of creating parquet files. On Thu, May 28, 2015 at 8:43 AM, Matt bsg

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
On May 28, 2015, at 8:43 AM, Matt bsg...@gmail.com wrote: Is 300MM records too much to do in a single CTAS statement? After almost 23 hours I killed the query (^c) and it returned: ~~~ +---++ | Fragment | Number of records written

Re: Monitoring long / stuck CTAS

2015-05-28 Thread Matt
bits. How large is the data set you are working with, and your cluster/nodes? —Andries On May 28, 2015, at 9:17 AM, Matt bsg...@gmail.com wrote: To make sure I am adjusting the correct config, these are heap parameters within the Drill configure path, not for Hadoop or Zookeeper? On May

Re: Query local files on cluster? [Beginner]

2015-05-27 Thread Matt
FS source as long as it is consistent to all nodes in the cluster, but keep in mind that Drill can process a lot of data quickly, and for best performance and consistency you will likely find that the sooner you get the data to the DFS the better. On May 26, 2015, at 5:58 PM, Matt bsg

Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
wrote: Perhaps I’m missing something here. Why not create a DFS plug in for HDFS and put the file in HDFS? On May 26, 2015, at 4:54 PM, Matt bsg...@gmail.com wrote: New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text files need to be on all nodes in a cluster
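A sketch of a dfs plugin pointed at HDFS, as suggested; the namenode host and port are placeholders:

~~~
{
  "type": "file",
  "connection": "hdfs://namenode-host:8020",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    }
  }
}
~~~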

Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
mechanisms from remote systems you can look at using NFS, MapR has a really robust NFS integration and you can use it with the community edition. On May 26, 2015, at 5:11 PM, Matt bsg...@gmail.com wrote: That might be the end goal, but currently I don't have an HDFS ingest mechanism

Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
involved: http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file Kristine Hahn Sr. Technical Writer 415-497-8107 @krishahn On Sun, May 24, 2015 at 1:56 PM, Matt bsg...@gmail.com wrote: I have used a single node install (unzip and run) to query local text

Re: Query local files on cluster? [Beginner]

2015-05-25 Thread Matt
involved: http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file Kristine Hahn Sr. Technical Writer 415-497-8107 @krishahn On Sun, May 24, 2015 at 1:56 PM, Matt bsg...@gmail.com wrote: I have used a single node install (unzip and run) to query local text

Re: How do I make json files les painful

2015-03-19 Thread Matt
Is each file a single json array object? If so, would converting the files to a format with one line per record be a potential solution? Example using jq (http://stedolan.github.io/jq/): jq -c '.[]' On 19 Mar 2015, at 12:22, Jim Bates wrote: I constantly, constantly, constantly hit this. I
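A rough, end-to-end form of the jq conversion suggested above (file names are placeholders):

~~~
jq -c '.[]' input.json > records.jsonl
~~~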