Paul Rogers created DRILL-6667:
----------------------------------

             Summary: Include internal data sets in Documentation Sample Datasets
                 Key: DRILL-6667
                 URL: https://issues.apache.org/jira/browse/DRILL-6667
             Project: Apache Drill
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 1.13.0
            Reporter: Paul Rogers
            Assignee: Bridget Bevens


The Drill documentation provides the "Sample Datasets" section, which is very 
handy. However, this section does not discuss the two datasets provided with 
Drill itself.

* Julian Hyde's [FoodMart data 
set|https://github.com/julianhyde/foodmart-data-hsqldb], available on the class 
path.
* TPC-H data set.

The "FoodMart" data set is available directly under {{cp}}. In fact, the Drill 
sample query (see below) references a FoodMart table. To see the list of tables 
(at development time), find the {{foodmart-data-json-0.4.jar}} file in the 
Maven dependencies for {{drill-java-exec}}. The table names here are simplified 
relative to those in the ER diagram in the above link. Perhaps include a simple 
table of names with the mapping to the original names, and link to (or embed) 
the FoodMart ER image. The data is available in JSON format.
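
As a rough illustration of what the documentation could show, the query below reads a FoodMart table through {{cp}}. Only the {{employee.json}} file name comes from Drill's own sample query; the column names are assumptions and should be checked against the data first.

{code:sql}
-- Query one FoodMart table from the class path data source.
-- Column names are assumed here; run SELECT * first to confirm them.
SELECT full_name, position_title
FROM cp.`employee.json`
LIMIT 5
{code}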

TPC-H data is available in `cp`.`tpch/*.parquet`, in Parquet format. The schema 
is described in the [TPC-H 
specification|http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp].
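
A sketch of a matching TPC-H example follows. It assumes that {{nation.parquet}} is one of the files under {{tpch/}} and that its columns follow the TPC-H NATION table naming; neither detail is stated above.

{code:sql}
-- Read one TPC-H table stored as Parquet on the class path.
-- The file name nation.parquet and the n_* column names are assumptions.
SELECT n_nationkey, n_name, n_regionkey
FROM cp.`tpch/nation.parquet`
LIMIT 5
{code}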

Further, the "Tutorials" section, under "Analyzing the Yelp Academic Dataset", 
mentions the Yelp data set, but the "Sample Datasets" section does not. It 
should, both for consistency and to save the reader time when going back and 
thinking, "Hey, didn't Drill provide some kind of Yelp data? Let me look in 
Sample Datasets. Wait... no Yelp?"

These data sets are very handy, but hard to find: I must keep searching the 
source code to remember file names and directory paths. End users won't have 
this luxury.

Suggestion: Describe the files available in the class path data source.

Along these same lines, in "Connect a Data Source", there is no mention of the 
class path data source. Yet, we reference that data source in the Web Console 
where we suggest a sample query to run:

{code}
Sample SQL query: SELECT * FROM cp.`employee.json` LIMIT 20
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
