[jira] [Commented] (DRILL-8439) Getting col__ prefix for columns that are not special when extractHeader is enabled

2023-06-07 Thread Diksha Chaturvedi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730083#comment-17730083
 ] 

Diksha Chaturvedi commented on DRILL-8439:
--

I've noticed different outcomes depending on where the BOM character is placed in the CSV 
file.

 

*Case 1:*
{code:java}
"PRODUCTID"^"PRODUCTNAME"^"SUPPLIERID"^"CATEGORYID"^"UNIT"^"PRICE"{code}
In this case the output column is {{col_PRODUCTID}} (working as expected).

 

*Case 2:*
{code:java}
"1"^"Chais"^"1"^"1"^"10 boxes x 20 bags"^"18"{code}
!bomInColDataInBeginning.PNG|width=273,height=243!

 

*Case 3:*
{code:java}
"1"^"Chais"^"1"^"1"^"10 boxes x 20 bags"^"18"{code}
!bomInColData.PNG|width=271,height=244!

 

*Case 4:*
{code:java}
"1"^"Chais"^"1"^"1"^"10 boxes x 20 bags"^"18"{code}
!bomInEnd.PNG|width=157,height=211!

 

*Case 5:*
{code:java}
"1"^"Chais"^"1"^"1"^"10 boxes x 20 bags"^"18"{code}
!bomInMiddle.PNG|width=199,height=217!

Could someone please explain this behaviour?
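
Since the cases above involve a BOM in places other than the very start of the file, one possible workaround is to scrub every U+FEFF character from the text before Drill reads it. The following is only a minimal sketch under that assumption; the class and method names ({{BomScrubber}}, {{removeAllBoms}}, {{countBoms}}) are hypothetical and are not part of Drill.

```java
// Hypothetical helper, not part of Apache Drill: removes U+FEFF wherever it occurs.
final class BomScrubber {

    // U+FEFF is the byte order mark (also known as zero width no-break space).
    private static final char BOM = '\uFEFF';

    // Returns the text with every U+FEFF character removed.
    static String removeAllBoms(String text) {
        return text.replace(String.valueOf(BOM), "");
    }

    // Counts how many U+FEFF characters the text contains.
    static int countBoms(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == BOM) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample row with a BOM placed in the middle of the data (assumed input).
        String row = "\"1\"^\uFEFF\"Chais\"^\"1\"";
        System.out.println("BOMs found: " + countBoms(row));
        System.out.println("Cleaned row: " + removeAllBoms(row));
    }
}
```

This treats any U+FEFF inside the data as unwanted, which is an assumption about the files in question.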

 

> Getting col__ prefix for columns that are not special when extractHeader is 
> enabled
> ---
>
> Key: DRILL-8439
> URL: https://issues.apache.org/jira/browse/DRILL-8439
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata, SQL Parser
>Affects Versions: 1.21.0
> Environment: Enabled {{extractHeader}} in the csv config of dfs 
> plugin.
> No. of drillbits: Single
> OS: Windows
>Reporter: Diksha Chaturvedi
>Priority: Major
>  Labels: drill, extractHeader
> Attachments: bomInColData.PNG, bomInColDataInBeginning.PNG, 
> bomInEnd.PNG, bomInMiddle.PNG, bomInsideColumnName-1.PNG, 
> bomInsideColumnName.PNG, image-2023-06-05-18-05-25-417.png, 
> image-2023-06-05-18-16-47-293.png
>
>
> As per the documentation, Drill prepends col_ to column names that start with a 
> number or a special character.
> {code:java}
> /**
>  * Prefix used to replace non-alphabetic characters at the start of
>  * a column name. For example, $foo becomes col_foo. Used
>  * because SQL does not allow _foo.
>  */
> public static final String COLUMN_PREFIX = "col_";
> {code}
> But in my case I'm getting the prefix even for a purely alphabetic column name.
> 
> I have the following data in the CSV file,
> ||PRODUCTID||PRODUCTNAME||SUPPLIERID||CATEGORYID||UNIT||PRICE||
> |1|Chais|1|1|10 boxes x 20 bags|18|
> |2|Chang|1|1|24 - 12 oz bottles|19|
> |3|Aniseed Syrup|1|2|12 - 550 ml bottles|10|
> |4|Chef Anton's Cajun Seasoning|2|2|48 - 6 oz jars|22|
> |5|Chef Anton's Gumbo Mix|2|2|36 boxes|21.35|
>  
> When querying the CSV file with the following query:
> {code:sql}
> SELECT * FROM dfs.`/var/lib/PRODUCT.csv`{code}
> The output is 
> [!https://i.stack.imgur.com/FBNmn.png|width=611,height=130!|https://i.stack.imgur.com/FBNmn.png]
> 
> I know about the other renaming rules, such as
> {{#UNITS}} is changed to {{col_UNITS}}
> {{FINANCIAL$RECORD}} is changed to {{FINANCIAL_RECORD}}
> But what about {{PRODUCTID}}? Why is it changed to 
> {{col___PRODUCTID__}}? In this case extra underscores have been added as well. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8439) Getting col__ prefix for columns that are not special when extractHeader is enabled

2023-06-05 Thread Diksha Chaturvedi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729312#comment-17729312
 ] 

Diksha Chaturvedi commented on DRILL-8439:
--

Hi [~cgivre], thanks for the hint. I've found the invisible Unicode character 
in the CSV file: it is the byte order mark (BOM, U+FEFF).

*!image-2023-06-05-18-05-25-417.png!*

 

As per my findings, this character is added when the file is saved as UTF-8 with 
BOM. FYI, this CSV file is created with Apache MetaModel using [this data 
context|https://metamodel.apache.org/apidocs/4.4.0/org/apache/metamodel/DataContextFactory.html#createCsvDataContext(java.io.File,%20char,%20char)]
, in which the default encoding (UTF-8) is used. Any idea how we can fix this?
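
One possible workaround, sketched here as an assumption rather than a confirmed fix on the Drill or MetaModel side, is to strip the leading UTF-8 BOM (bytes EF BB BF) from the file before querying it. The class and method names ({{BomStripper}}, {{stripLeadingBom}}, {{rewriteWithoutBom}}) are hypothetical, and the sketch reads the whole file into memory, so it is only suitable for small files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Apache Drill or Apache MetaModel.
final class BomStripper {

    // Returns the data without a leading UTF-8 BOM (EF BB BF).
    // If no BOM is present, the same array is returned unchanged.
    static byte[] stripLeadingBom(byte[] data) {
        if (data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }

    // Copies source to target, dropping a leading BOM if there is one.
    // Reads the whole file into memory, so intended for small files only.
    static void rewriteWithoutBom(Path source, Path target) throws IOException {
        Files.write(target, stripLeadingBom(Files.readAllBytes(source)));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java BomStripper <input.csv> <output.csv>
        rewriteWithoutBom(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The cleaned copy can then be placed where the {{dfs}} plugin reads it, so the header row starts directly with the first column name.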



[jira] [Commented] (DRILL-8439) Getting col__ prefix for columns that are not special when extractHeader is enabled

2023-05-31 Thread Charles Givre (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728079#comment-17728079
 ] 

Charles Givre commented on DRILL-8439:
--

Can you please verify in the CSV file that the affected column doesn't have any 
other leading characters?  Please check for carriage returns and other 
invisible Unicode characters.  The fact that Drill is inserting an extra 
underscore leads me to believe there could be some extra garbage in that field.

In any event, can't you just query this by giving it an alias?

For example:

{{SELECT `col__PRODUCTID_` AS product_id ...}}
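
The check for invisible characters suggested above can be sketched roughly as follows. This is only an illustration: the class and method names ({{HeaderInspector}}, {{findSuspectCharacters}}) are hypothetical, and the rule for what counts as "suspect" (control characters, format characters such as the BOM, and anything beyond printable ASCII) is an assumption chosen for plain ASCII headers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Apache Drill.
final class HeaderInspector {

    // Lists every character in the header that is a control character,
    // a format character (for example U+FEFF), or above printable ASCII.
    // Each entry gives the char index in the string and the code point.
    static List<String> findSuspectCharacters(String header) {
        List<String> findings = new ArrayList<>();
        for (int i = 0; i < header.length(); ) {
            int codePoint = header.codePointAt(i);
            int type = Character.getType(codePoint);
            boolean invisible = type == Character.FORMAT || type == Character.CONTROL;
            if (invisible || codePoint > 0x7E) {
                findings.add(String.format(Locale.ROOT, "index %d: U+%04X", i, codePoint));
            }
            i += Character.charCount(codePoint);
        }
        return findings;
    }

    public static void main(String[] args) {
        // Pass the first line of the CSV file as the only argument.
        for (String finding : findSuspectCharacters(args[0])) {
            System.out.println(finding);
        }
    }
}
```

An empty result for the header line would suggest the column name itself is clean and the cause lies elsewhere.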
