paul-rogers commented on a change in pull request #1749: DRILL-7177: Format 
Plugin for Excel Files
URL: https://github.com/apache/drill/pull/1749#discussion_r336788973
 
 

 ##########
 File path: contrib/format-excel/README.md
 ##########
 @@ -0,0 +1,54 @@
+# Excel Format Plugin
+This plugin enables Drill to read Microsoft Excel files. This format is best 
used with Excel files that do not have extensive formatting, however it will 
work with formatted files, by allowing you to define a region within the file 
where the data is.  
+
+The plugin will automatically evaluate cells which contain formulae.  
+
+## Plugin Configuration 
+This plugin has several configuration variables which must be set in order to 
read Excel files effectively. Since Excel files often contain other elements 
besides data, you can use the configuration variables to define a region within 
your spreadsheet in which Drill should extract data. This is potentially useful 
if your spreadsheet contains a lot of formatting or other complications. 
+
+### Configuration Options:
+The most basic configuration is simply to add the following to a file based 
storage plugin:
+```
+"excel": {
+      "type": "excel"
+    }
+```
+The plugin has many other configuration options listed below:
+
+* `headerRow`:  Set to `-1` if there are no column headers.  Defaults to `0`. 
If the data does not have a header row, Drill will assign column names of 
`field_n` for each column. 
+* `lastRow`:  This defines the last row of data. The default is 1048576 which 
is the theoretical row limit for Excel files. It is only necessary to set this 
if you want Drill to
+ stop reading at a specific location.
+* `sheetName`:  This is the name of the sheet which Drill will query. This 
will default to the first sheet in the file if left undefined.  Drill will 
throw an exception if the
+ sheet is not found.
+* `firstColumn`:  In order to define a region within a spreadsheet, this is 
the left-most column index. This is indexed from one. If set to `0` Drill will 
start at the left most
+ column.
+* `lastColumn`:  To define a region within a spreadsheet, this is the 
right-most column index. This is indexed from one. If set to `0` Drill will 
read all available columns. This
+ is not inclusive, so if you ask for columns 2-5 you will get columns 2,3 and 
4. 
+* `readAllColumnsAsVarChar`:  When set to `true`, Drill will not attempt to 
infer column data types and will read everything as `VARCHAR`.  Defaults to 
`false`;
+
+## Usage
+You can specify the configuration at runtime via the `table()` function or in 
the storage plugin configuration.  For instance, if you just want to query an 
Excel file, you could
+ execute the query as follows:
+
+```
+SELECT <fields> 
+FROM dfs.`somefile.xlsx`
+```
+To query a different sheet other than the default, use the `table()` function 
as shown below:
+```
+SELECT <fields> 
+FROM table( dfs.`test_data.xlsx` (type => 'excel', sheetName => 'secondSheet'))
+```
+To join data together from different sheets in one file, use the query below:
+```
+SELECT <fields> 
+FROM table( dfs.`test_data.xlsx` (type => 'excel', sheetName => 
'secondSheet')) AS t1
+INNER JOIN table( dfs.`test_data.xlsx` (type => 'excel', sheetName => 
'thirdSheet')) AS t2 
+ON t1.id = t2.id
+```
+
+### Known Limitations:
+At present, Drill requires that all columns be of the same data type. If they 
are not, Drill will throw an exception upon trying to read a column of mixed 
data type. If you are
+ trying to query data with heterogenoeus columns, it will be necessary to set 
`readAllColumnsAsVarChar` to `true`.  
+
 
 Review comment:
   Are column headers required? Must be in the first line (only)? Must be text?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to