Re: [Question] HiveStoragePlugin and NativeParquetRowGroupScan

Timothy Farkas Mon, 27 Aug 2018 13:45:47 -0700

Hi Paul,

As you said, each reader uses a different file system and config. As far as
I know this happens correctly in all cases, except for one corner
case reported by a user a year ago: if you set
fs.defaultFS to the local file system in the HiveStoragePlugin, then
restart a Drillbit and run a CTAS statement, the command would fail
because an operator was using the wrong FileSystem. This corner case is no
longer reproducible in house. I've been trying to narrow down possible
root causes by understanding the theory of how Drill handles
FileSystems. Since the problem is not reproducible and the candidate root
causes have been ruled out, I am going to abandon the issue
and mark it as not reproducible.
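
To make the "each reader uses its own file system config" idea concrete, here
is a minimal Java sketch. It is an illustration only, not Drill's code: the
class PluginFsConfSketch, the PluginConfig type, the fsConfFor helper, and the
hdfs://cluster-a:8020 address are all hypothetical, and plain maps stand in
for Hadoop's Configuration. The assumption is that each storage plugin's
properties are layered over shared defaults when a reader is created.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class PluginFsConfSketch {

    // Hypothetical stand-in for a storage plugin config: a name plus
    // the file system properties the user set on that plugin.
    static final class PluginConfig {
        final String name;
        final Map<String, String> overrides;

        PluginConfig(String name, Map<String, String> overrides) {
            this.name = name;
            // Defensive copy so later changes to the caller's map are not seen.
            this.overrides = Collections.unmodifiableMap(new HashMap<>(overrides));
        }
    }

    // Builds the file system properties for one reader: shared defaults
    // first, then the overrides of the plugin that owns the table.
    static Map<String, String> fsConfFor(Map<String, String> defaults, PluginConfig plugin) {
        Map<String, String> conf = new HashMap<>(defaults);
        conf.putAll(plugin.overrides);
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> defaults =
            Collections.singletonMap("fs.defaultFS", "hdfs://cluster-a:8020");

        // The Hive plugin overrides fs.defaultFS; the DFS plugin does not.
        PluginConfig hive =
            new PluginConfig("hive", Collections.singletonMap("fs.defaultFS", "file:///"));
        PluginConfig dfs =
            new PluginConfig("dfs", Collections.<String, String>emptyMap());

        System.out.println("hive reader: " + fsConfFor(defaults, hive).get("fs.defaultFS"));
        System.out.println("dfs reader:  " + fsConfFor(defaults, dfs).get("fs.defaultFS"));
    }
}
```

In this sketch the Hive reader resolves fs.defaultFS from the Hive plugin's
override while the DFS reader keeps the shared default, and neither call
changes the defaults map that the other reader sees.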


One bit of learning that came out of the exercise was that the
DrillFileSystem should be immutable after it is created. This was not
previously enforced or documented, so a programmer could accidentally
mutate a DrillFileSystem incorrectly. I have a PR open that documents and
enforces this contract now.
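
As a rough illustration of what "immutable after it is created" can mean in
code, here is a small Java sketch. It does not show how the PR or
DrillFileSystem actually enforces the contract; the FrozenFsConf class and the
rejectsMutation helper are hypothetical names. The assumption is that the
settings are copied at construction time and exposed only through a read-only
view.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ImmutableFsSketch {

    // Hypothetical immutable holder for file system settings.
    static final class FrozenFsConf {
        private final Map<String, String> props;

        FrozenFsConf(Map<String, String> source) {
            // Copy first, then wrap: changes to 'source' after construction
            // are not visible, and the wrapper rejects writes.
            this.props = Collections.unmodifiableMap(new HashMap<>(source));
        }

        String get(String key) {
            return props.get(key);
        }

        Map<String, String> asMap() {
            return props; // read-only view
        }
    }

    // Returns true if an attempt to mutate the settings is rejected.
    static boolean rejectsMutation(FrozenFsConf conf) {
        try {
            conf.asMap().put("fs.defaultFS", "file:///");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> source = new HashMap<>();
        source.put("fs.defaultFS", "hdfs://cluster-a:8020");

        FrozenFsConf conf = new FrozenFsConf(source);
        source.put("fs.defaultFS", "file:///"); // does not affect 'conf'

        System.out.println("value kept: " + conf.get("fs.defaultFS"));
        System.out.println("mutation rejected: " + rejectsMutation(conf));
    }
}
```

The point of the copy-then-wrap step is that an accidental mutation fails
immediately at the call site instead of silently changing which file system a
later operator uses.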

Thanks,
Tim

On Fri, Aug 24, 2018 at 5:11 PM Paul Rogers <[email protected]>
wrote:

> Hi Tim,
>
> Can't recall the details on this. The phrase "the filesystem
> configuration" might be misleading. When executing, Drill must support
> multiple filesystems. I can have two different DFS configs, pointing to two
> different HDFS clusters (say) in a single query:
>
> SELECT ... FROM dfs1.`aFile.csv`, dfs2.`anotherFile.csv`
>
> We'd create separate readers for each file. Each reader should have a
> different filesystem conf: the one appropriate for the storage plugin
> config used for that file.
>
> Using that as a reference, it would seem that Hive plugin queries use the
> hive fs, while any DFS tables in the same query use the DFS config.
>
> I wonder, based on your comment, is this not happening? Are the configs
> getting muddled somehow?
>
> Thanks,
> - Paul
>
>
>
>     On Friday, August 24, 2018, 3:45:08 PM PDT, Timothy Farkas <
> [email protected]> wrote:
>
>  Hi Paul / Vitalii
>
> Thanks for the info. I was asking about this because of
>
> https://issues.apache.org/jira/browse/DRILL-6609
> in which some strange behavior was observed if the user defined
> fs.default.name in the HivePlugin config. I also saw that the filesystem
> specified in the HivePlugin config
> influences the FileSystem used for native scans. This happens because in
> HiveDrillNativeParquetRowGroupScan.getFsConf we use the HiveStoragePlugin
> to create the filesystem configuration, which is then used by
> DrillFileSystem.
>
> However, based on your feedback it looks like this is desirable behavior,
> since the user may want to define a different filesystem for the HivePlugin
> along with different format plugins. That means the root cause of
>
> https://issues.apache.org/jira/browse/DRILL-6609
> is something else.
> I'll probably abandon that issue at this point since it's not reproducible
> and I have no further leads as to what could cause it.
>
> Thanks,
> Tim
>
> On Thu, Aug 23, 2018 at 2:46 AM, Vitalii Diravka <
> [email protected]>
> wrote:
>
> > Hi Tim,
> >
> > Some comments from me.
> >
> > *HiveStoragePlugin*
> > *fs.defaultFS* is a Hive-specific property. This is the URI the Hive
> > Metastore uses to locate tables. There is no need to specify
> > this property if the default value from *core-site.xml* is acceptable; see
> > more:
> > https://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-common/core-default.xml
> >
> > *Hive Native readers*
> > Currently Drill has two Hive native readers: Parquet and MapR JSON. Both of
> > them use the appropriate default file format plugins. This is a limitation:
> > there is currently no way to change the FormatPlugins config for them.
> > There is a Jira ticket for it:
> > https://issues.apache.org/jira/browse/DRILL-6621
> >
> >
> > Kind regards
> > Vitalii
> >
> >
> > On Thu, Aug 23, 2018 at 3:02 AM Paul Rogers <[email protected]>
> > wrote:
> >
> > > Hi Tim,
> > >
> > > I don't have an answer. But, I can point out some factors to consider.
> > >
> > > Hive describes a set of data in a specific file system. It would make
> > > sense to associate that file system with the Hive configuration.
> > > Otherwise, I could use a Hive metastore for FS A, with a DFS configured
> > > for FS B, and have nothing work for reasons that would be hard to
> > > figure out.
> > >
> > > Further, isn't Hive its own storage plugin, and thus would be referenced
> > > as, say, "myHive.customers"? What would be the implied relationship
> > > between the Hive plugin config and the DFS plugin config?
> > >
> > > Suppose I had two Hive plugin configs, Hive1 and Hive2, and two DFS
> > > configs, DFS1 and DFS2. What is the implied relationship (if any)
> > > between Hive1 and either DFS1 or DFS2? Between Hive2 and DFS1 or DFS2?
> > >
> > > These ambiguities would seem to explain why Hive's HDFS URL is
> > > configured with Hive and is distinct from a similar HDFS URL defined
> > > for DFS.
> > >
> > > Can you suggest a way to avoid duplication and link the two? Perhaps,
> > > in the Hive config, name a DFS config rather than duplicating the HDFS
> > > config for Hive?
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > >    On Wednesday, August 22, 2018, 4:41:37 PM PDT, Timothy Farkas <
> > > [email protected]> wrote:
> > >
> > >  Hi All,
> > >
> > > I'm a bit confused and I was hoping to get some clarification about how
> > > the HiveStoragePlugin interacts with the FileSystem plugin. Currently the
> > > HiveStoragePlugin allows the user to configure their own value for
> > > fs.defaultFS in the plugin properties, which overrides the defaultFS
> > > used when doing a native parquet scan for Hive. Is this intentional?
> > > Also, what is the high-level theory about how the Hive and FileSystem
> > > plugins interact? Specifically, does Drill support querying Hive when
> > > Hive is using a different FileSystem than the one specified in the
> > > FileSystem plugin? Or does Drill assume that Hive is using the same
> > > FileSystem as the one defined in the Drill FileSystem plugin?
> > >
> > > Thanks,
> > > Tim
> > >
> >
>
