It sounds like we should not have written to the filesystem unless we were
connected to a single host or a distributed filesystem. The problem is that
the files we wrote will not be associated together the way they would be in
a single filesystem (even a distributed one, which would give the files a
common namespace across nodes). Reading these files back has no guaranteed
behavior. I think it is possible we would choke, since during planning we
expect to enumerate the files so that we can assign the read operations to
the nodes holding each file or file chunk. However, the way we retrieve
this information is through the HDFS API (or a storage-plugin-specific API
for other storage engines).
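
For reference, pointing the dfs storage plugin at HDFS gives every Drillbit
the same namespace, so planning can enumerate the CTAS output files. A
minimal sketch of such a plugin config (the namenode host/port and the
workspace path are placeholders, and the formats section is omitted for
brevity):

    {
      "type": "file",
      "enabled": true,
      "connection": "hdfs://namenode:8020",
      "workspaces": {
        "tmp": {
          "location": "/tmp",
          "writable": true,
          "defaultInputFormat": null
        }
      }
    }

With a shared connection like this, files written by any Drillbit are
visible during planning, and reads can be assigned to nodes as usual.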

On Mon, Jun 1, 2015 at 3:39 AM, George Lu <luwenbin...@gmail.com> wrote:

> Hi Abhishek and all,
>
> Thanks for your answer.
>
> I may try the HDFS way later. But there are cases where restrictions are
> set on the cluster that do not allow users to use a distributed file
> system.
>
> In that case, Drill might need a way to query all nodes to get the result,
> not just the local one, because CTAS created the result on several nodes
> and the query should read from those nodes as well.
>
> Does my suggestion make sense?
>
> Thanks!
>
> George
>
> On Mon, Jun 1, 2015 at 9:51 AM, Abhishek Girish
> <abhishek.gir...@gmail.com> wrote:
>
> > Hey George,
> >
> > Can I ask why you aren't using a distributed file system? You would see
> > the behavior you expect when you use the dfs plugin configured with a
> > distributed file system (HDFS / MapR-FS).
> >
> > In your case, the parquet files from CTAS will be written to a specific
> > node's local file system, depending on which Drillbit the client connects
> > to. And if the table is moderate to large in size, Drill may process it
> > in a distributed manner and write data onto more than one node - hence
> > the behavior you see.
> >
> > -Abhishek
> >
> > On Sun, May 31, 2015 at 6:34 PM, George Lu <luwenbin...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I use dfs.tmp as my schema, and when I use CTAS to create tables of
> > > over 10000 rows, the resulting parquet files are created on two nodes
> > > in the cluster. However, when I query the table, I only get the portion
> > > on that node. So I get 700 rows on one node when I use "select * from
> > > T1" and 10000 rows on another.
> > >
> > > May I ask whether that behavior is correct? How can I create a table,
> > > or have Drill return all tuples, when I create or query from one node
> > > using the local dfs.tmp?
> > >
> > > Otherwise, the EXISTS query doesn't work.
> > >
> > > Thanks!
> > >
> > > George
> > >
> >
>
