My suggestion would be that the local file system plugin be disabled in
distributed mode. With multiple drillbits and a centralized plugin pointing at
the local file system, consistent behavior cannot be expected.

It should either be disabled when distributed mode is detected, or we should
add support for multiple namespaces backed by local file systems (which still
might not fix all issues).
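
To make the difference concrete, here is a rough sketch of the two dfs plugin
configurations in question (the connection strings, host name, and paths are
only illustrative, and the "formats" section is trimmed for brevity). With the
local file system, every drillbit resolves the workspace path against its own
disk:

  {
    "type": "file",
    "enabled": true,
    "connection": "file:///",
    "workspaces": {
      "tmp": { "location": "/tmp", "writable": true }
    }
  }

With a distributed file system, all drillbits share a single namespace, so
CTAS output and subsequent reads agree on the same set of files:

  {
    "type": "file",
    "enabled": true,
    "connection": "hdfs://namenode:8020",
    "workspaces": {
      "tmp": { "location": "/tmp", "writable": true }
    }
  }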

On Monday, June 1, 2015, Jason Altekruse <altekruseja...@gmail.com> wrote:

> It sounds like we should not have written to the filesystem if we were not
> connected to a single host or a distributed filesystem. The problem is that
> the files we wrote will not be associated together the way they would be in
> a single filesystem (even a distributed one with a common namespace for the
> files across nodes). Reading these files would not have guaranteed behavior.
> I think it is possible we would choke, as we expect to be able to enumerate
> the files during planning so that we can assign the read operations to the
> nodes that hold each file or file chunk. However, we retrieve this
> information through the HDFS API (or a storage-plugin-specific API for other
> storage engines).
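>
> Roughly, that enumeration boils down to something like the following (a
> simplified, illustrative sketch against the Hadoop FileSystem API, not the
> actual Drill planning code; the table path is made up):
>
>   import java.io.IOException;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class ListTableFiles {
>     public static void main(String[] args) throws IOException {
>       // Enumerate the table directory through the filesystem API.
>       // Against file:/// this listing only reflects the disk of the node
>       // it runs on; against hdfs:// every node sees the same listing.
>       Path table = new Path("file:///tmp/T1");
>       FileSystem fs = FileSystem.get(table.toUri(), new Configuration());
>       for (FileStatus status : fs.listStatus(table)) {
>         // each entry would become a read assignment during planning
>         System.out.println(status.getPath() + "  " + status.getLen());
>       }
>     }
>   }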
>
> On Mon, Jun 1, 2015 at 3:39 AM, George Lu <luwenbin...@gmail.com> wrote:
>
> > Hi Abhishek and all,
> >
> > Thanks for your answer.
> >
> > I may try the HDFS way later. But there are cases where restrictions are
> > set for the cluster that do not allow users to use a distributed file
> > system.
> >
> > In that case, Drill might need a way to query all nodes to get the result,
> > not just the local one, because with CTAS, Drill creates the result on
> > several nodes and thus should read from those nodes as well.
> >
> > Does my suggestion make sense?
> >
> > Thanks!
> >
> > George
> >
> > On Mon, Jun 1, 2015 at 9:51 AM, Abhishek Girish <abhishek.gir...@gmail.com>
> > wrote:
> >
> > > Hey George,
> > >
> > > Can I ask why you aren't using a distributed file system? You would see the
> > > behavior you expect when you use the dfs plug-in configured with a
> > > distributed file system (HDFS / MapR-FS).
> > >
> > > In your case, the parquet files from CTAS will be written to a specific
> > > node's local file system, depending on which drillbit the client connects
> > > to. And if the table is moderate to large in size, Drill may process it in
> > > a distributed manner and write data onto more than one node - hence the
> > > behavior you see.
> > >
> > > -Abhishek
> > >
> > > On Sun, May 31, 2015 at 6:34 PM, George Lu <luwenbin...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I use dfs.tmp as my schema, and when I use CTAS to create tables with
> > > > over 10000 rows, the resulting parquet files are created on a couple of
> > > > nodes in the cluster. However, when I query the table, I only get the
> > > > portion on that node. So I get 700 rows on one node when I run
> > > > "select * from T1" and 10000 rows on another.
> > > >
> > > > May I ask whether that behavior is correct? How can I make Drill get all
> > > > tuples when I create or query from one node using the local dfs.tmp?
> > > >
> > > > Otherwise, the EXISTS query doesn't work.
> > > >
> > > > Thanks!
> > > >
> > > > George
> > > >
> > >
> >
>


-- 
Abhishek Girish
Senior Software Engineer
(408) 476-9209
<http://www.mapr.com/>
