[ 
https://issues.apache.org/jira/browse/DRILL-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085987#comment-17085987
 ] 

ASF GitHub Bot commented on DRILL-6168:
---------------------------------------

paul-rogers commented on issue #2054: DRILL-6168: Revise format plugin table 
functions
URL: https://github.com/apache/drill/pull/2054#issuecomment-615408581
 
 
   @arina-ielchiieva, thanks much for running the tests! Looks like these 
failures occur because the tests verify the incorrect prior behavior, in which 
the default field delimiter was newline (same as the line delimiter), not comma. 
This PR changes the default to comma.
   
   Let's consider an example.  For this query:
   
   ```
   select * from table(`table_function/cr_lf.csv`(type=>'text', 
lineDelimiter=>'\r\n'));
   ```
   
   With this input:
   
   ```
   1,aaa,bbb
   2,ccc,ddd
   3,eee,
   4,fff,ggg
   ```
   
   We currently expect this output because the old default field delimiter is a 
newline (same as the line delimiter):
   
   ```
   ["1,aaa,bbb"]
   ["2,ccc,ddd"]
   ["3,eee,"]
   ["4,fff,ggg"]
   ```
   
   The correct expected result, with a default field delimiter of comma, is:
   
   ```
   ["1","aaa","bbb"]
   ["2","ccc","ddd"]
   ["3","eee",""]
   ["4","fff","ggg"]
   ```
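   
   To make the difference concrete, here is a minimal sketch of the two 
behaviors. It is not Drill's text reader: `DelimiterSketch` and `parse` are 
hypothetical names, and the code uses plain string splitting with no quoting 
or escaping rules. It only assumes that lines are split first and fields second.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;
   import java.util.regex.Pattern;

   // Hypothetical illustration only; not part of Drill.
   final class DelimiterSketch {

       // Splits text into rows on lineDelimiter, then each row into fields
       // on fieldDelimiter. Both delimiters are treated as literal strings.
       static List<List<String>> parse(String text, String lineDelimiter,
                                       String fieldDelimiter) {
           List<List<String>> rows = new ArrayList<>();
           // The default split drops the empty string after the final line delimiter.
           for (String line : text.split(Pattern.quote(lineDelimiter))) {
               // Limit -1 keeps trailing empty fields, such as the last one in "3,eee,".
               rows.add(Arrays.asList(line.split(Pattern.quote(fieldDelimiter), -1)));
           }
           return rows;
       }

       public static void main(String[] args) {
           String input = "1,aaa,bbb\r\n2,ccc,ddd\r\n3,eee,\r\n4,fff,ggg\r\n";
           // Old default: field delimiter is newline, so each row is one field.
           System.out.println(parse(input, "\r\n", "\n"));
           // New default: field delimiter is comma, so each row has three fields.
           System.out.println(parse(input, "\r\n", ","));
       }
   }
   ```
   
   With a newline field delimiter, no row ever contains the delimiter once the 
lines have been split, which is why every row comes back as a single column.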
   
   Will investigate how to fix the tests.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Table functions do not "inherit" default configuration
> ------------------------------------------------------
>
>                 Key: DRILL-6168
>                 URL: https://issues.apache.org/jira/browse/DRILL-6168
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>             Fix For: 1.18.0
>
>
> See DRILL-6167 that describes an attempt to use a table function with a regex 
> format plugin.
> Consider the plugin configuration:
> {code}
>     RegexFormatConfig sampleConfig = new RegexFormatConfig();
>     sampleConfig.extension = "log1";
>     sampleConfig.regex = DATE_ONLY_PATTERN;
>     sampleConfig.fields = Lists.newArrayList("year", "month", "day");
> {code}
> (This plugin is defined in code in a test rather than the usual JSON in the 
> Web console.)
> Run a test with the above. Things work fine.
> Now, try the plugin config with a table function as described in DRILL-6167:
> {code}
>       String sql = "SELECT * FROM table(cp.`regex/simple.log2`\n" +
>           "(type => 'regex', regex => '(\\\\d\\\\d\\\\d\\\\d)-(\\\\d\\\\d)-(\\\\d\\\\d) .*'))";
>       client.queryBuilder().sql(sql).printCsv();
> {code}
> Because we are using a file with suffix "log2", the query will match the 
> format plugin config defined above. A query without the table function does, 
> in fact, work using the defined config. But, with a table function, we get 
> this warning from our regex code:
> {noformat}
> 13307 WARN [257590e1-e846-9d82-61d4-e246a4925ac3:frag:0:0] 
> [org.apache.drill.exec.store.easy.regex.RegexRecordReader] - Column list has 
> fewer
>   names than the pattern has groups, filling extras with Column$n.
> {noformat}
> (The warning is in the custom plugin, not Drill.) This is the plugin saying, 
> "hey! you didn't provide column names!". But, in the format definition, we 
> did provide names. If we run the query without a table function, we do see 
> those names used.
> Result:
> {noformat}
> 3 row(s):
> Column$0<VARCHAR(OPTIONAL)>,Column$1<VARCHAR(OPTIONAL)>,Column$2<VARCHAR(OPTIONAL)>
> 2017,12,17
> 2017,12,18
> 2017,12,19
> Total rows returned : 3.  Returned in 9072ms.
> {noformat}
> Yes, indeed, the table function discarded the defined format config values, 
> filling in blanks, including for the column names.
> The expected behavior is that all properties defined in the config should 
> remain unchanged _except_ for those in the table function. Why? In order to 
> know which format plugin to use, the code has to map from the suffix (".log2" 
> here) to a format plugin _config_. (The config is the only thing that 
> specifies a suffix.) Since we mapped to a config (not the unconfigured 
> plugin), we'd expect the config properties to be used.
> It is highly surprising that only the suffix is used while all other 
> attributes are ignored. This seems very much in the "bug" category and not at 
> all in the "feature" category.
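
The expected "inherit" behavior described in the issue can be sketched as a
simple overlay: start from the stored format config and replace only the
properties named in the table function. This is an illustration under stated
assumptions, not Drill's implementation. `TableFunctionMergeSketch` and
`effectiveConfig` are hypothetical names, and the config is modeled as a plain
map rather than a real format config class.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Drill.
final class TableFunctionMergeSketch {

    // Returns a copy of the stored config in which every property named in
    // the table function replaces the stored value; all others are kept.
    static Map<String, Object> effectiveConfig(Map<String, Object> storedConfig,
                                               Map<String, Object> tableFunctionArgs) {
        Map<String, Object> merged = new LinkedHashMap<>(storedConfig);
        merged.putAll(tableFunctionArgs);
        return merged;
    }

    public static void main(String[] args) {
        // Stored config, modeled on the sample config in the issue.
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("regex", "DATE_ONLY_PATTERN"); // placeholder for the real pattern
        stored.put("fields", Arrays.asList("year", "month", "day"));

        // Only the regex is supplied in the table function.
        Map<String, Object> fnArgs = new LinkedHashMap<>();
        fnArgs.put("regex", "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*");

        // The regex is overridden; the field names survive.
        System.out.println(effectiveConfig(stored, fnArgs));
    }
}
```

Under this model the column names from the config remain in effect, so the
reader would not fall back to the generated Column$n names.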



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
