Re: JAVA API for Drill

2015-05-26 Thread MOIS Martin (MORPHO)
Hi,

you can use the JDBC driver shipped with the distribution to execute SQL
queries against an HBase database (the snippet assumes java.sql.Connection and
java.sql.Statement are imported):

String className = "org.apache.drill.jdbc.Driver";
try {
    Class.forName(className);
} catch (ClassNotFoundException e) {
    System.err.println("Failed to load JDBC driver '" + className + "': "
        + e.getMessage());
}
// Note: put your ZooKeeper host(s) in front of :2181
String jdbcUrl = "jdbc:drill:zk=:2181/drill/drillbits1;schema=hbase";
try (Connection con = java.sql.DriverManager.getConnection(jdbcUrl);
     Statement stmt = con.createStatement()) {
    // ... execute your queries here ...
}
...
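For completeness, a minimal self-contained sketch of what the elided query part can look like. The HBase table name `students` and the query are hypothetical; CONVERT_FROM is used because the HBase storage plugin returns raw bytes:

~~~
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillHBaseQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper host; adjust zk= to your own quorum
        String jdbcUrl = "jdbc:drill:zk=localhost:2181/drill/drillbits1;schema=hbase";
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             Statement stmt = con.createStatement();
             // `students` is a hypothetical HBase table; decode the row key bytes to a string
             ResultSet rs = stmt.executeQuery(
                 "SELECT CONVERT_FROM(row_key, 'UTF8') AS id "
               + "FROM hbase.`students` LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("id"));
            }
        }
    }
}
~~~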

Best Regards,
Martin Mois

-----Original Message-----
From: Nishith Maheshwari [mailto:nsh...@gmail.com]
Sent: Wednesday, 27 May 2015 08:39
To: user@drill.apache.org
Subject: JAVA API for Drill

Hi,
I wanted to create a Java application to connect to and query an HBase
database using Drill, but was unable to find any documentation regarding this.
Is there a Java API through which Drill can be accessed? I did see a small
mention of C++ and Java APIs in the documentation, but there was no other link
or information regarding the same.

Regards,
Nishith Maheshwari


JAVA API for Drill

2015-05-26 Thread Nishith Maheshwari
Hi,
I wanted to create a Java application to connect to and query an HBase
database using Drill, but was unable to find any documentation regarding
this.
Is there a Java API through which Drill can be accessed? I did see a small
mention of C++ and Java APIs in the documentation, but there was no other
link or information regarding the same.

Regards,
Nishith Maheshwari


Re: Custom UDFS slow

2015-05-26 Thread Ted Dunning
On Tue, May 26, 2015 at 7:26 PM, Adam Gilmore  wrote:

> The code for the WEEK() function is not far from the code from the source
> for the EXTRACT(DAY) function.  Furthermore, even if I copy the exact code
> for the EXTRACT(DAY) function into that, it has the same performance
> detriments.
>
> My question is, why would a UDF be so much slower?  Is this by design or is
> there something I'm missing?
>
> Happy to attach the source code of the function if that helps.
>

Well, you might want to try exactly copying the source of the extract
function.  I would expect that you would get just the same performance,
since UDFs use the same mechanism as physical operators.

Two possibilities are:

1) the Java optimizer has seen something subtle about your code or the
built-in code that allows for a more economical implementation

2) the Drill optimizer has some kind of special trick that it has figured
out

3) your code has forced the Drill optimizer to insert some sort of data type
conversion

(the third option is a bonus, just for you)


The fourth option that I don't know about is also quite a likely
possibility.

Seeing your code (put it in a gist, don't attach it) would help a lot.
Seeing queries and query plans would help as well.


Custom UDFS slow

2015-05-26 Thread Adam Gilmore
Hi guys,

I have written a couple of custom UDFs (specifically WEEK() and WEEKYEAR(),
to get that date information out of timestamps).

I sampled two queries (on approx. 11 million records in Parquet files):

select count(*) from `table` group by extract(day from `timestamp`)

750ms

select count(*) from `table` group by week(`timestamp`)

2100ms

The code for the WEEK() function is not far from the code from the source
for the EXTRACT(DAY) function.  Furthermore, even if I copy the exact code
for the EXTRACT(DAY) function into that, it has the same performance
detriments.

My question is, why would a UDF be so much slower?  Is this by design or is
there something I'm missing?

Happy to attach the source code of the function if that helps.
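For reference, here is a minimal sketch of what such a UDF can look like; the class name is hypothetical, and computing the week via Joda-Time (which ships with Drill) is just one possible implementation:

~~~
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.BigIntHolder;
import org.apache.drill.exec.expr.holders.TimeStampHolder;

// Hypothetical WEEK() implementation. Drill generates runtime code from this
// source, so helper classes inside eval() are referenced by fully qualified name.
@FunctionTemplate(name = "week",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
public class WeekFunction implements DrillSimpleFunc {

  @Param TimeStampHolder input;   // milliseconds since the epoch
  @Output BigIntHolder out;

  public void setup() { }

  public void eval() {
    // ISO week number of the timestamp, evaluated in UTC
    out.value = new org.joda.time.DateTime(input.value,
        org.joda.time.DateTimeZone.UTC).getWeekOfWeekyear();
  }
}
~~~

(The jar typically also needs a drill-module.conf entry so Drill's classpath scanning picks up the function's package.)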


Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
Thanks, I was incorrectly conflating the file system with data storage.

I am looking to experiment with the Parquet format and was considering CTAS
queries as an import approach.

Are direct queries over local files meant for embedded Drill, whereas on a
cluster files should first be moved into HDFS?

That would make sense, as queries over files on one node would be bound to that
node's local filesystem.
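As a concrete illustration of the CTAS import approach mentioned above, here is a hedged sketch issued over Drill's JDBC driver. The ZooKeeper address, source CSV path, and target table name are placeholders, and dfs.tmp is assumed to be a writable workspace:

~~~
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvToParquet {
    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:drill:zk=localhost:2181";  // placeholder ZooKeeper quorum
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             Statement stmt = con.createStatement()) {
            // CTAS writes in the session's store.format; parquet is the default
            stmt.execute("ALTER SESSION SET `store.format` = 'parquet'");
            // Source file and target table are placeholders; columns[] is how
            // Drill exposes the fields of a headerless CSV
            stmt.execute("CREATE TABLE dfs.tmp.`reviews_parquet` AS "
                       + "SELECT columns[0] AS ngram, columns[1] AS publication_date, "
                       + "columns[2] AS frequency "
                       + "FROM dfs.root.`mydata.csv`");
        }
    }
}
~~~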

> On May 26, 2015, at 8:28 PM, Andries Engelbrecht  
> wrote:
> 
> You can use the HDFS shell
> hadoop fs -put
> 
> To copy from local file system to HDFS
> 
> 
> For more robust mechanisms from remote systems you can look at using NFS, 
> MapR has a really robust NFS integration and you can use it with the 
> community edition.
> 
> 
> 
> 
>> On May 26, 2015, at 5:11 PM, Matt  wrote:
>> 
>> 
>> That might be the end goal, but currently I don't have an HDFS ingest 
>> mechanism. 
>> 
>> We are not currently a Hadoop shop - can you suggest simple approaches for 
>> bulk loading data from delimited files into HDFS?
>> 
>> 
>> 
>> 
>>> On May 26, 2015, at 8:04 PM, Andries Engelbrecht 
>>>  wrote:
>>> 
>>> Perhaps I’m missing something here.
>>> 
>>> Why not create a DFS plug-in for HDFS and put the file in HDFS?
>>> 
>>> 
>>> 
 On May 26, 2015, at 4:54 PM, Matt  wrote:
 
 New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text 
 files need to be on all nodes in a cluster?
 
 Using the dfs config below, I am only able to query if a csv file is on 
 all 4 nodes. If the file is only on the local node and not others, I get 
 errors in the form of:
 
 ~~~
 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
 Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
 'root.customer_reviews_1998.csv' not found
 ~~~
 
 ~~~
 {
 "type": "file",
 "enabled": true,
 "connection": "file:///",
 "workspaces": {
 "root": {
   "location": "/localdata/hadoop/stage",
   "writable": false,
   "defaultInputFormat": null
 },
 ~~~
 
> On 25 May 2015, at 20:39, Kristine Hahn wrote:
> 
> The storage plugin "location" needs to be the full path to the localdata
> directory. This partial storage plugin definition works for the user named
> mapr:
> 
> {
> "type": "file",
> "enabled": true,
> "connection": "file:///",
> "workspaces": {
> "root": {
> "location": "/home/mapr/localdata",
> "writable": false,
> "defaultInputFormat": null
> },
> . . .
> 
> Here's a working query for the data in localdata:
> 
> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
> . . . . . . . > COLUMNS[1] AS Publication_Date,
> . . . . . . . > COLUMNS[2] AS Frequency
> . . . . . . . > FROM dfs.root.`mydata.csv`
> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
> 
> A complete example, not yet published on the Drill site, shows in detail
> the steps involved:
> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
> 
> 
> Kristine Hahn
> Sr. Technical Writer
> 415-497-8107 @krishahn
> 
> 
>> On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:
>> 
>> I have used a single node install (unzip and run) to query local text /
>> csv files, but on a 3 node cluster (installed via MapR CE), a query with
>> local files results in:
>> 
>> ~~~
>> sqlline version 1.1.6
>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>> Table 'dfs./localdata/testdata.csv' not found
>> 
>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>> Table 'dfs./localdata/testdata.csv' not found
>> ~~~
>> 
>> Is there a special config for local file querying? An initial doc search
>> did not point me to a solution, but I may simply not have found the
>> relevant sections.
>> 
>> I have tried modifying the default dfs config to no avail:
>> 
>> ~~~
>> "type": "file",
>> "enabled": true,
>> "connection": "file:///",
>> "workspaces": {
>> "root": {
>> "location": "/localdata",
>> "writable": false,
>> "defaultInputFormat": null
>> }
>> ~~~
> 


Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
A better explanation than my last: I'm looking to use CTAS queries for HDFS ingest.


> On May 26, 2015, at 8:04 PM, Andries Engelbrecht  
> wrote:
> 
> Perhaps I’m missing something here.
> 
> Why not create a DFS plug-in for HDFS and put the file in HDFS?
> 
> 
> 
>> On May 26, 2015, at 4:54 PM, Matt  wrote:
>> 
>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text 
>> files need to be on all nodes in a cluster?
>> 
>> Using the dfs config below, I am only able to query if a csv file is on all 
>> 4 nodes. If the file is only on the local node and not others, I get errors 
>> in the form of:
>> 
>> ~~~
>> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
>> 'root.customer_reviews_1998.csv' not found
>> ~~~
>> 
>> ~~~
>> {
>> "type": "file",
>> "enabled": true,
>> "connection": "file:///",
>> "workspaces": {
>>   "root": {
>> "location": "/localdata/hadoop/stage",
>> "writable": false,
>> "defaultInputFormat": null
>>   },
>> ~~~
>> 
>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>> 
>>> The storage plugin "location" needs to be the full path to the localdata
>>> directory. This partial storage plugin definition works for the user named
>>> mapr:
>>> 
>>> {
>>> "type": "file",
>>> "enabled": true,
>>> "connection": "file:///",
>>> "workspaces": {
>>> "root": {
>>>  "location": "/home/mapr/localdata",
>>>  "writable": false,
>>>  "defaultInputFormat": null
>>> },
>>> . . .
>>> 
>>> Here's a working query for the data in localdata:
>>> 
>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>> . . . . . . . > COLUMNS[2] AS Frequency
>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>> 
>>> A complete example, not yet published on the Drill site, shows in detail
>>> the steps involved:
>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>> 
>>> 
>>> Kristine Hahn
>>> Sr. Technical Writer
>>> 415-497-8107 @krishahn
>>> 
>>> 
 On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:
 
 I have used a single node install (unzip and run) to query local text /
 csv files, but on a 3 node cluster (installed via MapR CE), a query with
 local files results in:
 
 ~~~
 sqlline version 1.1.6
 0: jdbc:drill:> select * from dfs.`testdata.csv`;
 Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
 Table 'dfs./localdata/testdata.csv' not found
 
 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
 Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
 Table 'dfs./localdata/testdata.csv' not found
 ~~~
 
 Is there a special config for local file querying? An initial doc search
 did not point me to a solution, but I may simply not have found the
 relevant sections.
 
 I have tried modifying the default dfs config to no avail:
 
 ~~~
 "type": "file",
 "enabled": true,
 "connection": "file:///",
 "workspaces": {
 "root": {
  "location": "/localdata",
  "writable": false,
  "defaultInputFormat": null
 }
 ~~~
> 


Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Andries Engelbrecht
You can use the HDFS shell to copy from the local file system to HDFS:

hadoop fs -put


For more robust mechanisms from remote systems you can look at using NFS; MapR
has a really robust NFS integration, and you can use it with the community
edition.
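If you need to do the copy programmatically rather than from the shell, here is a minimal sketch using the Hadoop FileSystem API; the namenode address and both paths are placeholders:

~~~
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; normally picked up from core-site.xml on the classpath
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy a local delimited file into an HDFS staging directory
            fs.copyFromLocalFile(new Path("/localdata/testdata.csv"),
                                 new Path("/stage/testdata.csv"));
        }
    }
}
~~~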




On May 26, 2015, at 5:11 PM, Matt  wrote:

> 
> That might be the end goal, but currently I don't have an HDFS ingest 
> mechanism. 
> 
> We are not currently a Hadoop shop - can you suggest simple approaches for 
> bulk loading data from delimited files into HDFS?
> 
> 
> 
> 
>> On May 26, 2015, at 8:04 PM, Andries Engelbrecht  
>> wrote:
>> 
>> Perhaps I’m missing something here.
>> 
>> Why not create a DFS plug-in for HDFS and put the file in HDFS?
>> 
>> 
>> 
>>> On May 26, 2015, at 4:54 PM, Matt  wrote:
>>> 
>>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text 
>>> files need to be on all nodes in a cluster?
>>> 
>>> Using the dfs config below, I am only able to query if a csv file is on all 
>>> 4 nodes. If the file is only on the local node and not others, I get errors 
>>> in the form of:
>>> 
>>> ~~~
>>> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
>>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
>>> 'root.customer_reviews_1998.csv' not found
>>> ~~~
>>> 
>>> ~~~
>>> {
>>> "type": "file",
>>> "enabled": true,
>>> "connection": "file:///",
>>> "workspaces": {
>>>  "root": {
>>>"location": "/localdata/hadoop/stage",
>>>"writable": false,
>>>"defaultInputFormat": null
>>>  },
>>> ~~~
>>> 
 On 25 May 2015, at 20:39, Kristine Hahn wrote:
 
 The storage plugin "location" needs to be the full path to the localdata
 directory. This partial storage plugin definition works for the user named
 mapr:
 
 {
 "type": "file",
 "enabled": true,
 "connection": "file:///",
 "workspaces": {
 "root": {
 "location": "/home/mapr/localdata",
 "writable": false,
 "defaultInputFormat": null
 },
 . . .
 
 Here's a working query for the data in localdata:
 
 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
 . . . . . . . > COLUMNS[1] AS Publication_Date,
 . . . . . . . > COLUMNS[2] AS Frequency
 . . . . . . . > FROM dfs.root.`mydata.csv`
 . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
 . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
 
 A complete example, not yet published on the Drill site, shows in detail
 the steps involved:
 http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
 
 
 Kristine Hahn
 Sr. Technical Writer
 415-497-8107 @krishahn
 
 
> On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:
> 
> I have used a single node install (unzip and run) to query local text /
> csv files, but on a 3 node cluster (installed via MapR CE), a query with
> local files results in:
> 
> ~~~
> sqlline version 1.1.6
> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
> Table 'dfs./localdata/testdata.csv' not found
> 
> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
> Table 'dfs./localdata/testdata.csv' not found
> ~~~
> 
> Is there a special config for local file querying? An initial doc search
> did not point me to a solution, but I may simply not have found the
> relevant sections.
> 
> I have tried modifying the default dfs config to no avail:
> 
> ~~~
> "type": "file",
> "enabled": true,
> "connection": "file:///",
> "workspaces": {
> "root": {
> "location": "/localdata",
> "writable": false,
> "defaultInputFormat": null
> }
> ~~~
>> 



Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt

That might be the end goal, but currently I don't have an HDFS ingest 
mechanism. 

We are not currently a Hadoop shop - can you suggest simple approaches for bulk 
loading data from delimited files into HDFS?

 


> On May 26, 2015, at 8:04 PM, Andries Engelbrecht  
> wrote:
> 
> Perhaps I’m missing something here.
> 
> Why not create a DFS plug-in for HDFS and put the file in HDFS?
> 
> 
> 
>> On May 26, 2015, at 4:54 PM, Matt  wrote:
>> 
>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text 
>> files need to be on all nodes in a cluster?
>> 
>> Using the dfs config below, I am only able to query if a csv file is on all 
>> 4 nodes. If the file is only on the local node and not others, I get errors 
>> in the form of:
>> 
>> ~~~
>> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
>> 'root.customer_reviews_1998.csv' not found
>> ~~~
>> 
>> ~~~
>> {
>> "type": "file",
>> "enabled": true,
>> "connection": "file:///",
>> "workspaces": {
>>   "root": {
>> "location": "/localdata/hadoop/stage",
>> "writable": false,
>> "defaultInputFormat": null
>>   },
>> ~~~
>> 
>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>> 
>>> The storage plugin "location" needs to be the full path to the localdata
>>> directory. This partial storage plugin definition works for the user named
>>> mapr:
>>> 
>>> {
>>> "type": "file",
>>> "enabled": true,
>>> "connection": "file:///",
>>> "workspaces": {
>>> "root": {
>>>  "location": "/home/mapr/localdata",
>>>  "writable": false,
>>>  "defaultInputFormat": null
>>> },
>>> . . .
>>> 
>>> Here's a working query for the data in localdata:
>>> 
>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>> . . . . . . . > COLUMNS[2] AS Frequency
>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>> 
>>> A complete example, not yet published on the Drill site, shows in detail
>>> the steps involved:
>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>> 
>>> 
>>> Kristine Hahn
>>> Sr. Technical Writer
>>> 415-497-8107 @krishahn
>>> 
>>> 
 On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:
 
 I have used a single node install (unzip and run) to query local text /
 csv files, but on a 3 node cluster (installed via MapR CE), a query with
 local files results in:
 
 ~~~
 sqlline version 1.1.6
 0: jdbc:drill:> select * from dfs.`testdata.csv`;
 Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
 Table 'dfs./localdata/testdata.csv' not found
 
 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
 Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
 Table 'dfs./localdata/testdata.csv' not found
 ~~~
 
 Is there a special config for local file querying? An initial doc search
 did not point me to a solution, but I may simply not have found the
 relevant sections.
 
 I have tried modifying the default dfs config to no avail:
 
 ~~~
 "type": "file",
 "enabled": true,
 "connection": "file:///",
 "workspaces": {
 "root": {
  "location": "/localdata",
  "writable": false,
  "defaultInputFormat": null
 }
 ~~~
> 


Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Andries Engelbrecht
Perhaps I’m missing something here.

Why not create a DFS plug-in for HDFS and put the file in HDFS?



On May 26, 2015, at 4:54 PM, Matt  wrote:

> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text 
> files need to be on all nodes in a cluster?
> 
> Using the dfs config below, I am only able to query if a csv file is on all 4 
> nodes. If the file is only on the local node and not others, I get errors in 
> the form of:
> 
> ~~~
> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
> 'root.customer_reviews_1998.csv' not found
> ~~~
> 
> ~~~
> {
>  "type": "file",
>  "enabled": true,
>  "connection": "file:///",
>  "workspaces": {
>"root": {
>  "location": "/localdata/hadoop/stage",
>  "writable": false,
>  "defaultInputFormat": null
>},
> ~~~
> 
> On 25 May 2015, at 20:39, Kristine Hahn wrote:
> 
>> The storage plugin "location" needs to be the full path to the localdata
>> directory. This partial storage plugin definition works for the user named
>> mapr:
>> 
>> {
>> "type": "file",
>> "enabled": true,
>> "connection": "file:///",
>> "workspaces": {
>> "root": {
>>   "location": "/home/mapr/localdata",
>>   "writable": false,
>>   "defaultInputFormat": null
>> },
>> . . .
>> 
>> Here's a working query for the data in localdata:
>> 
>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>> . . . . . . . > COLUMNS[2] AS Frequency
>> . . . . . . . > FROM dfs.root.`mydata.csv`
>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>> 
>> A complete example, not yet published on the Drill site, shows in detail
>> the steps involved:
>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>> 
>> 
>> Kristine Hahn
>> Sr. Technical Writer
>> 415-497-8107 @krishahn
>> 
>> 
>> On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:
>> 
>>> I have used a single node install (unzip and run) to query local text /
>>> csv files, but on a 3 node cluster (installed via MapR CE), a query with
>>> local files results in:
>>> 
>>> ~~~
>>> sqlline version 1.1.6
>>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>> Table 'dfs./localdata/testdata.csv' not found
>>> 
>>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>> Table 'dfs./localdata/testdata.csv' not found
>>> ~~~
>>> 
>>> Is there a special config for local file querying? An initial doc search
>>> did not point me to a solution, but I may simply not have found the
>>> relevant sections.
>>> 
>>> I have tried modifying the default dfs config to no avail:
>>> 
>>> ~~~
>>> "type": "file",
>>> "enabled": true,
>>> "connection": "file:///",
>>> "workspaces": {
>>> "root": {
>>>   "location": "/localdata",
>>>   "writable": false,
>>>   "defaultInputFormat": null
>>> }
>>> ~~~
>>> 



Re: Query local files on cluster? [Beginner]

2015-05-26 Thread Matt
New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears 
text files need to be on all nodes in a cluster?


Using the dfs config below, I am only able to query if a csv file is on 
all 4 nodes. If the file is only on the local node and not others, I get 
errors in the form of:


~~~
0: jdbc:drill:zk=es05:2181> select * from 
root.`customer_reviews_1998.csv`;
Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
'root.customer_reviews_1998.csv' not found

~~~

~~~
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
"root": {
  "location": "/localdata/hadoop/stage",
  "writable": false,
  "defaultInputFormat": null
},
~~~

On 25 May 2015, at 20:39, Kristine Hahn wrote:

The storage plugin "location" needs to be the full path to the localdata
directory. This partial storage plugin definition works for the user named
mapr:

{
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
 "root": {
   "location": "/home/mapr/localdata",
   "writable": false,
   "defaultInputFormat": null
 },
. . .

Here's a working query for the data in localdata:

0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
. . . . . . . > COLUMNS[1] AS Publication_Date,
. . . . . . . > COLUMNS[2] AS Frequency
. . . . . . . > FROM dfs.root.`mydata.csv`
. . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
. . . . . . . > AND (columns[2] > 250)) LIMIT 10;

A complete example, not yet published on the Drill site, shows in detail
the steps involved:
http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file


Kristine Hahn
Sr. Technical Writer
415-497-8107 @krishahn


On Sun, May 24, 2015 at 1:56 PM, Matt  wrote:

I have used a single node install (unzip and run) to query local text /
csv files, but on a 3 node cluster (installed via MapR CE), a query with
local files results in:

~~~
sqlline version 1.1.6
0: jdbc:drill:> select * from dfs.`testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
Table 'dfs./localdata/testdata.csv' not found

0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
Table 'dfs./localdata/testdata.csv' not found
~~~

Is there a special config for local file querying? An initial doc search
did not point me to a solution, but I may simply not have found the
relevant sections.

I have tried modifying the default dfs config to no avail:

~~~
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
 "root": {
   "location": "/localdata",
   "writable": false,
   "defaultInputFormat": null
 }
~~~



Re: Drill and Spark integration

2015-05-26 Thread Hanifi Gunes
Sounds cool.

On Tue, May 26, 2015 at 12:40 PM, Jacques Nadeau  wrote:

> Let's get the current code into the public so more people can help get it
> fully integrated and tested.
>
> On Tue, May 26, 2015 at 12:16 PM, Hanifi Gunes 
> wrote:
>
> > We have a fully functional Spark integration that is not yet pushed to
> > Apache master as it lacks proper testing. We had plans to check it in prior
> > to 1.0 however we got higher priority items in the agenda. Now that 1.0 is
> > out I expect to see this in the master soon. I am not aware of a specific
> > plan or timeframe though.
> >
> > Filed DRILL-3184  to
> > track this.
> >
> > -Hanifi
> >
> > On Tue, May 26, 2015 at 11:56 AM, Christopher Matta 
> > wrote:
> >
> > > Spark integration with Drill is mentioned in this blog post,
> > > however I can’t find a JIRA for this feature on either the Drill or Spark
> > > trackers. What’s the status on this? Is there a timeframe? Is anyone
> > > working on it?
> > >
> > > --Chris
> > > ​
> > >
> >
>


Re: Connection timeout 1.0.0

2015-05-26 Thread David Tucker
For many Linux services, this can be an unstable configuration. Better to use
`ifconfig eth0` to identify the configured IP address and add that entry to
/etc/hosts.

Some DHCP client packages will do this automatically, since the IP can change 
with every reboot.

— David


On May 26, 2015, at 1:21 PM, Davide Giannella  wrote:

> On 22/05/2015 10:22, Davide Giannella wrote:
>> ...
>> 
>> I looked up the ports drill should be using that I know of: 31010 and
>> 2181 but both are free to be taken: `sudo lsof -i TCP:${PORT}`.
>> 
> 
> Solved it. Adding the hostname in question to /etc/hosts as 127.0.0.1 did
> the trick.
> 
> Cheers
> Davide
> 
> 



Re: SQL query : Question

2015-05-26 Thread Andries Engelbrecht
The query will typically fail. What source data are you looking at that may 
cause this issue?

One way of working around this is to use a predicate to filter out rows that
may cause such issues. But depending on the use case, there can be other ways of
dealing with this.
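A hedged sketch of that kind of predicate, issued over the JDBC driver against a CSV file; the file path and column index are placeholders, and the SIMILAR TO pattern simply keeps rows whose value is all digits before casting and summing:

~~~
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SumSkippingJunk {
    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:drill:zk=localhost:2181";  // placeholder ZooKeeper quorum
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             Statement stmt = con.createStatement();
             // Only rows whose third column is numeric are cast and summed
             ResultSet rs = stmt.executeQuery(
                 "SELECT SUM(CAST(columns[2] AS BIGINT)) AS total "
               + "FROM dfs.`/localdata/testdata.csv` "
               + "WHERE columns[2] SIMILAR TO '[0-9]+'")) {
            while (rs.next()) {
                System.out.println("total = " + rs.getLong("total"));
            }
        }
    }
}
~~~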

—Andries


On May 26, 2015, at 11:56 AM, Alok Tanna  wrote:

> In the case of Drill, let's say we are doing a sum over 20 rows via SQL, and
> there is a value that is not a number but some junk - what happens to the
> sum? Does the whole SQL statement fail, or is that row ignored?
> 
> If it fails, is there a way to process the 19 good rows and just ignore the
> junk row?



Re: Connection timeout 1.0.0

2015-05-26 Thread Davide Giannella
On 22/05/2015 10:22, Davide Giannella wrote:
> ...
>
> I looked up the ports drill should be using that I know of: 31010 and
> 2181 but both are free to be taken: `sudo lsof -i TCP:${PORT}`.
>

Solved it. Adding the hostname in question to /etc/hosts as 127.0.0.1 did
the trick.

Cheers
Davide




Re: Drill and Spark integration

2015-05-26 Thread Jacques Nadeau
Let's get the current code into the public so more people can help get it
fully integrated and tested.

On Tue, May 26, 2015 at 12:16 PM, Hanifi Gunes  wrote:

> We have a fully functional Spark integration that is not yet pushed to
> Apache master as it lacks proper testing. We had plans to check it in prior
> to 1.0 however we got higher priority items in the agenda. Now that 1.0 is
> out I expect to see this in the master soon. I am not aware of a specific
> plan or timeframe though.
>
> Filed DRILL-3184  to
> track this.
>
> -Hanifi
>
> On Tue, May 26, 2015 at 11:56 AM, Christopher Matta 
> wrote:
>
> > Spark integration with Drill is mentioned in this blog post,
> > however I can’t find a JIRA for this feature on either the Drill or Spark
> > trackers. What’s the status on this? Is there a timeframe? Is anyone
> > working on it?
> >
> > --Chris
> > ​
> >
>


Re: Drill and Spark integration

2015-05-26 Thread Hanifi Gunes
We have a fully functional Spark integration that is not yet pushed to
Apache master as it lacks proper testing. We had plans to check it in prior
to 1.0, but higher-priority items came up on the agenda. Now that 1.0 is
out I expect to see this in the master soon. I am not aware of a specific
plan or timeframe though.

Filed DRILL-3184  to
track this.

-Hanifi

On Tue, May 26, 2015 at 11:56 AM, Christopher Matta  wrote:

> Spark integration with Drill is mentioned in this
>  blog post,
> however I can’t find a JIRA for this feature on either the Drill or Spark
> trackers. What’s the status on this? Is there a timeframe? Is anyone
> working on it?
>
> --Chris
> ​
>


Drill and Spark integration

2015-05-26 Thread Christopher Matta
Spark integration with Drill is mentioned in this
 blog post,
however I can’t find a JIRA for this feature on either the Drill or Spark
trackers. What’s the status on this? Is there a timeframe? Is anyone
working on it?

--Chris
​


SQL query : Question

2015-05-26 Thread Alok Tanna
In the case of Drill, let's say we are doing a sum over 20 rows via SQL, and
there is a value that is not a number but some junk - what happens to the
sum? Does the whole SQL statement fail, or is that row ignored?

If it fails, is there a way to process the 19 good rows and just ignore the
junk row?


Re: To EMRFS or not to EMRFS?

2015-05-26 Thread Paul Mogren
Thank you. This kind of summary advice is helpful for getting started.




On 5/22/15, 6:37 PM, "Ted Dunning"  wrote:

>The variation will have less to do with Drill (which can read all these
>options such as EMR resident MapR FS or HDFS or persistent MapR FS or HDFS
>or S3).
>
>The biggest differences will have to do with whether your clusters
>providing storage are permanent or ephemeral.  If they are ephemeral, you
>can host the distributed file system on EBS based volumes so that you will
>have an ephemeral but restartable cluster.
>
>So the costs in run time will have to do with startup or restart times and
>the time it takes to pour the data into any new distributed file system.
>If you host permanently in S3 and have Drill read directly from there, you
>have no permanent storage cost for the input data, but will probably have
>slower reads.  With a permanent cluster hosting the data, you will have
>higher costs, but likely also higher performance.  Copying data from S3 to
>a distributed file system is probably not a great idea since you pay
>roughly the same cost during copy as you would have paid just querying
>directly from S3.
>
>Exactly how these trade-offs pan out requires some careful thought and
>considerable knowledge of your workload.
>
>
>
>On Fri, May 22, 2015 at 3:22 PM, Paul Mogren 
>wrote:
>
>> > When running Drill in AWS EMR, can anyone advise as to the advantages
>> >and disadvantages of having Drill access S3 via EMRFS vs. directly?
>>
>> Also, a third option: an actual HDFS not backed by S3
>>
>>



Hangout happening now

2015-05-26 Thread Jason Altekruse
Come join the Drill community as we discuss what has been happening lately
and what is in the pipeline. All are welcome, whether you already know about Drill,
want to know more, or just want to listen in.

https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc