Thanks, I was incorrectly conflating the file system with data storage.

I'm looking to experiment with the Parquet format, and was considering CTAS 
queries as an import approach.
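
For context, a minimal sketch of that CTAS import path (assuming a writable 
workspace such as the default dfs.tmp; the column positions and names here are 
illustrative):

~~~
-- parquet is Drill's default CTAS output format; set explicitly for clarity
ALTER SESSION SET `store.format` = 'parquet';

-- the target workspace must be writable (dfs.tmp is by default)
CREATE TABLE dfs.tmp.`reviews_parquet` AS
SELECT COLUMNS[0] AS customer_id,
       COLUMNS[1] AS review_date
FROM dfs.root.`customer_reviews_1998.csv`;
~~~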

Are direct queries over local files meant for embedded Drill, whereas on a 
cluster files should first be moved into HDFS?

That would make sense, as files present on only one node would restrict queries 
to that node's local filesystem.
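
If so, a dfs-type storage plugin pointed at HDFS rather than the local 
filesystem might look something like this (the namenode host, port, and path 
are illustrative):

~~~
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://namenode:8020",
  "workspaces": {
    "root": {
      "location": "/data/stage",
      "writable": true,
      "defaultInputFormat": null
    }
  }
}
~~~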

> On May 26, 2015, at 8:28 PM, Andries Engelbrecht <aengelbre...@maprtech.com> 
> wrote:
> 
> You can use the HDFS shell
> hadoop fs -put
> 
> To copy from local file system to HDFS
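> 
> For example (paths and filenames are illustrative):
> 
> ~~~
> hadoop fs -mkdir -p /data/stage
> hadoop fs -put /localdata/stage/customer_reviews_1998.csv /data/stage/
> ~~~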
> 
> 
> For more robust ingest from remote systems you can look at using NFS. MapR 
> has a particularly robust NFS integration, and you can use it with the 
> community edition.
> 
> 
> 
> 
>> On May 26, 2015, at 5:11 PM, Matt <bsg...@gmail.com> wrote:
>> 
>> 
>> That might be the end goal, but currently I don't have an HDFS ingest 
>> mechanism. 
>> 
>> We are not currently a Hadoop shop. Can you suggest simple approaches for 
>> bulk loading data from delimited files into HDFS?
>> 
>> 
>> 
>> 
>>> On May 26, 2015, at 8:04 PM, Andries Engelbrecht 
>>> <aengelbre...@maprtech.com> wrote:
>>> 
>>> Perhaps I’m missing something here.
>>> 
>>> Why not create a DFS plug-in for HDFS and put the file in HDFS?
>>> 
>>> 
>>> 
>>>> On May 26, 2015, at 4:54 PM, Matt <bsg...@gmail.com> wrote:
>>>> 
>>>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes. It appears text 
>>>> files need to be on all nodes in the cluster?
>>>> 
>>>> Using the dfs config below, I am only able to query if a csv file is on 
>>>> all 4 nodes. If the file is only on the local node and not others, I get 
>>>> errors in the form of:
>>>> 
>>>> ~~~
>>>> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
>>>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
>>>> 'root.customer_reviews_1998.csv' not found
>>>> ~~~
>>>> 
>>>> ~~~
>>>> {
>>>>   "type": "file",
>>>>   "enabled": true,
>>>>   "connection": "file:///",
>>>>   "workspaces": {
>>>>     "root": {
>>>>       "location": "/localdata/hadoop/stage",
>>>>       "writable": false,
>>>>       "defaultInputFormat": null
>>>>     },
>>>> ~~~
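>>>> 
>>>> As a stopgap (hostnames are illustrative), copying the file to the same 
>>>> path on every node lets the file:/// workspace resolve it cluster-wide:
>>>> 
>>>> ~~~
>>>> for h in es02 es03 es04; do
>>>>   scp /localdata/hadoop/stage/customer_reviews_1998.csv \
>>>>       $h:/localdata/hadoop/stage/
>>>> done
>>>> ~~~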
>>>> 
>>>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>>>> 
>>>>> The storage plugin "location" needs to be the full path to the localdata
>>>>> directory. This partial storage plugin definition works for the user named
>>>>> mapr:
>>>>> 
>>>>> {
>>>>>   "type": "file",
>>>>>   "enabled": true,
>>>>>   "connection": "file:///",
>>>>>   "workspaces": {
>>>>>     "root": {
>>>>>       "location": "/home/mapr/localdata",
>>>>>       "writable": false,
>>>>>       "defaultInputFormat": null
>>>>>     },
>>>>> . . .
>>>>> 
>>>>> Here's a working query for the data in localdata:
>>>>> 
>>>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>>>> . . . . . . . > COLUMNS[2] AS Frequency
>>>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>>>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>>>> 
>>>>> A complete example, not yet published on the Drill site, shows the steps 
>>>>> involved in detail:
>>>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>>>> 
>>>>> 
>>>>> Kristine Hahn
>>>>> Sr. Technical Writer
>>>>> 415-497-8107 @krishahn
>>>>> 
>>>>> 
>>>>>> On Sun, May 24, 2015 at 1:56 PM, Matt <bsg...@gmail.com> wrote:
>>>>>> 
>>>>>> I have used a single node install (unzip and run) to query local text /
>>>>>> csv files, but on a 3 node cluster (installed via MapR CE), a query with
>>>>>> local files results in:
>>>>>> 
>>>>>> ~~~
>>>>>> sqlline version 1.1.6
>>>>>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>>>> Table 'dfs./localdata/testdata.csv' not found
>>>>>> 
>>>>>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>>>> Table 'dfs./localdata/testdata.csv' not found
>>>>>> ~~~
>>>>>> 
>>>>>> Is there a special config for local file querying? An initial doc search
>>>>>> did not point me to a solution, but I may simply not have found the
>>>>>> relevant sections.
>>>>>> 
>>>>>> I have tried modifying the default dfs config to no avail:
>>>>>> 
>>>>>> ~~~
>>>>>> "type": "file",
>>>>>> "enabled": true,
>>>>>> "connection": "file:///",
>>>>>> "workspaces": {
>>>>>> "root": {
>>>>>> "location": "/localdata",
>>>>>> "writable": false,
>>>>>> "defaultInputFormat": null
>>>>>> }
>>>>>> ~~~
> 
