[ 
https://issues.apache.org/jira/browse/HIVE-27590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lvhu updated HIVE-27590:
------------------------
    Environment: Any  (was: {code:java}
// code placeholder
{code})

> Make LINES TERMINATED BY work when creating table
> -------------------------------------------------
>
>                 Key: HIVE-27590
>                 URL: https://issues.apache.org/jira/browse/HIVE-27590
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive, SQL
>    Affects Versions: 3.1.3
>         Environment: Any
>            Reporter: lvhu
>            Assignee: lvhu
>            Priority: Major
>
> *Currently, the only way to use a custom line delimiter for a Hive table is 
> to write a custom InputFormat and reference it when creating the table:*
> {code:java}
> package abc.hive;
> public class MyFstTextInputFormat extends FileInputFormat<LongWritable, Text> 
> implements JobConfigurable {
>  ...
> }
> create table test  (  
>     id string,  
>     name string  
> )  
> INPUTFORMAT 'abc.hive.MyFstTextInputFormat'   {code}
> If tables use several different record delimiters, a separate TextInputFormat 
> has to be written for each one.
> Unfortunately, the ideal syntax is not supported yet:
> {code:java}
> create table test  (  
>     id string,  
>     name string  
> )  
> row format delimited fields terminated by '\t'  -- supported
> LINES TERMINATED BY '|@|' ;   -- not supported  {code}
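To make the intent of a custom record delimiter concrete, here is a minimal sketch in plain Java (no Hadoop or Hive dependencies). It only illustrates how data would be cut into records on '|@|' instead of on newlines. The class and method names are hypothetical and are not part of Hive.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordSplitDemo {
    // Splits raw text into records on a literal (non-regex) delimiter.
    public static List<String> splitRecords(String data, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = data.indexOf(delimiter, start)) >= 0) {
            records.add(data.substring(start, idx));
            start = idx + delimiter.length();
        }
        if (start < data.length()) {
            // Trailing record that is not followed by a delimiter
            records.add(data.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        // The second record contains a newline, which stays inside the record
        String data = "1\tAlice|@|2\tBob\nSmith|@|";
        for (String record : splitRecords(data, "|@|")) {
            System.out.println(record.split("\t").length + " fields");
        }
    }
}
```

With a multi-character delimiter, a newline inside a field no longer ends the record, which is the main reason to want this option.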
> I have a solution that supports setting the line delimiter when creating a 
> table, as shown above.
> *1. Create a new HiveTextInputFormat class to replace the TextInputFormat class.*
> The HiveTextInputFormat class reads the <pathToDelimiter> file and chooses the 
> record delimiter for each input file based on the prefix of its file path.
> {code:java}
> public class HiveTextInputFormat extends FileInputFormat<LongWritable, Text>
>   implements JobConfigurable {
>   ....
>   public RecordReader<LongWritable, Text> getRecordReader(
>       InputSplit genericSplit, JobConf job, Reporter reporter)
>     throws IOException {
>     
>     reporter.setStatus(genericSplit.toString());
>     // Default delimiter
>     String delimiter = job.get("textinputformat.record.delimiter");
>     // Obtain the path of the file (only FileSplit exposes getPath())
>     FileSplit fileSplit = (FileSplit) genericSplit;
>     String filePath = fileSplit.getPath().toUri().getPath();
>     // Obtain the mapping from path prefixes to delimiters by parsing the
>     // <pathToDelimiter> file
>     Map<String, String> pathToDelimiterMap = parsePathToDelimiter();
>     for (Map.Entry<String, String> entry : pathToDelimiterMap.entrySet()) {
>       // Configured path
>       String configPath = entry.getKey();
>       // If configPath is a prefix of filePath, use the delimiter
>       // configured for that path
>       if (filePath.startsWith(configPath)) {
>         delimiter = entry.getValue();
>       }
>     }
>     byte[] recordDelimiterBytes = null;
>     if (null != delimiter) {
>       recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
>     }
>     return new LineRecordReader(job, fileSplit, recordDelimiterBytes);
>   }
> } {code}
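One detail the loop above leaves open is which entry wins when several configured paths are prefixes of the same file. The following is a minimal sketch of one possible rule, written in plain Java without Hadoop types: the longest matching prefix wins, and a prefix only matches at a directory boundary. Both rules are assumptions that go beyond the plain startsWith check in the proposal, and the class and method names are hypothetical.

```java
import java.util.Map;

public class DelimiterResolver {
    // Returns the delimiter of the longest configured path that is a
    // prefix of filePath, or defaultDelimiter if no entry matches.
    public static String resolve(String filePath,
                                 Map<String, String> pathToDelimiter,
                                 String defaultDelimiter) {
        String best = defaultDelimiter;
        int bestLength = -1;
        for (Map.Entry<String, String> entry : pathToDelimiter.entrySet()) {
            String configPath = entry.getKey();
            if (isPathPrefix(configPath, filePath)
                    && configPath.length() > bestLength) {
                best = entry.getValue();
                bestLength = configPath.length();
            }
        }
        return best;
    }

    // True if prefix matches path at a directory boundary, so that
    // "/warehouse/t1" does not match "/warehouse/t10/000000_0".
    public static boolean isPathPrefix(String prefix, String path) {
        if (!path.startsWith(prefix)) {
            return false;
        }
        return path.length() == prefix.length()
                || prefix.endsWith("/")
                || path.charAt(prefix.length()) == '/';
    }
}
```

Choosing the longest prefix makes the result independent of the map's iteration order, which matters if a table and one of its subdirectories are registered with different delimiters.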
> *2. Modify Hive's CREATE TABLE handling to support <LINES TERMINATED BY>*
> {code:java}
> create table test  (  
>     id string,  
>     name string  
> )  
> LINES TERMINATED BY '|@|'
> LOCATION  hdfs_path; {code}
> When a user executes the SQL above, Hive will add the entry (hdfs_path, '|@|') 
> to the <pathToDelimiter> file.
> HiveTextInputFormat would then be set as the default INPUTFORMAT.
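The proposal does not specify a format for the <pathToDelimiter> file. As a minimal sketch only, the code below assumes one entry per line in the form path, TAB, delimiter, and splits each line on its first tab. The format, the class name, and the method names are all hypothetical. Delimiters containing a newline would need escaping, which is omitted here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PathToDelimiterFile {
    // Serializes one (path, delimiter) entry as "path<TAB>delimiter".
    public static String formatEntry(String path, String delimiter) {
        return path + "\t" + delimiter;
    }

    // Parses lines of the assumed format into an ordered map.
    // Blank lines and lines without a tab-separated path are skipped.
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue;
            }
            // Everything after the first tab is the delimiter
            map.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return map;
    }
}
```

Under this assumption, CREATE TABLE would append the line produced by formatEntry(hdfs_path, "|@|"), and the input format would call parse on the file's lines before resolving the delimiter for a split.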
> Looking forward to receiving your suggestions and feedback!
> *If you accept my idea, I hope you can assign the task to me. My GitHub 
> account is: _lvhu-goodluck_*
> I really hope to contribute code to the community.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
