lvhu created HIVE-27590:
---------------------------

             Summary: Make LINES TERMINATED BY work when creating table
                 Key: HIVE-27590
                 URL: https://issues.apache.org/jira/browse/HIVE-27590
             Project: Hive
          Issue Type: Improvement
          Components: Hive, SQL
    Affects Versions: 3.1.3
         Environment: {code:java}
// code placeholder
{code}
            Reporter: lvhu
            Assignee: lvhu


*In current Hive, the only way to set a line delimiter when creating a table 
is to write a custom InputFormat and reference it in the DDL:*
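{code:java}
// Java: a custom InputFormat with the record delimiter built in
package abc.hive;
public class MyFstTextInputFormat extends FileInputFormat<LongWritable, Text> 
implements JobConfigurable {
 ...
}

-- HiveQL: a table that reads its files through the custom InputFormat
create table test  (  
    id string,  
    name string  
)  
STORED AS INPUTFORMAT 'abc.hive.MyFstTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'; {code}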
Each distinct record delimiter therefore requires its own custom 
TextInputFormat implementation.

Unfortunately, the ideal approach is not supported yet:
{code:java}
create table test  (  
    id string,  
    name string  
)  
row format delimited fields terminated by '\t'  -- supported
LINES TERMINATED BY '|@|' ;   -- not supported  {code}
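To make the intended semantics concrete, here is a minimal plain-Java sketch of how raw data would map to rows if records ended with '|@|' and fields with a tab. The class name `DelimitedRows` and the method `parse` are hypothetical and exist only for illustration. This is not Hive code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class DelimitedRows {
    /**
     * Splits raw text into records on lineDelim, then each record into
     * fields on fieldDelim. Both delimiters are treated as literal strings.
     */
    public static List<List<String>> parse(String data, String lineDelim, String fieldDelim) {
        List<List<String>> rows = new ArrayList<>();
        for (String record : data.split(Pattern.quote(lineDelim))) {
            if (record.isEmpty()) {
                continue; // skip empty records, e.g. between two adjacent delimiters
            }
            // Limit -1 keeps trailing empty fields instead of dropping them.
            rows.add(Arrays.asList(record.split(Pattern.quote(fieldDelim), -1)));
        }
        return rows;
    }
}
```

For example, `DelimitedRows.parse("1\talice|@|2\tbob|@|", "|@|", "\t")` yields two rows, each with an id and a name.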
I have a solution that supports setting the line delimiter at table creation 
time, exactly as in the statement above.

*1. Create a new HiveTextInputFormat class to replace the TextInputFormat class.* 
HiveTextInputFormat reads a <pathToDelimiter> file and picks the record 
delimiter for each input file based on the prefix of its file path.
{code:java}
public class HiveTextInputFormat extends FileInputFormat<LongWritable, Text>
  implements JobConfigurable {
  ....
  public RecordReader<LongWritable, Text> getRecordReader(
                                          InputSplit genericSplit, JobConf job,
                                          Reporter reporter)
    throws IOException {
    
    reporter.setStatus(genericSplit.toString());
    // InputSplit itself has no getPath(); cast to FileSplit first.
    FileSplit fileSplit = (FileSplit) genericSplit;
    // Default delimiter from the job configuration (may be null).
    String delimiter = job.get("textinputformat.record.delimiter");
    // Path of the file backing this split.
    String filePath = fileSplit.getPath().toUri().getPath();
    // Path-prefix-to-delimiter mapping, obtained by parsing the
    // <pathToDelimiter> file.
    Map<String, String> pathToDelimiterMap = parsePathToDelimiter();
    for (Map.Entry<String, String> entry : pathToDelimiterMap.entrySet()) {
      // Configured path prefix.
      String configPath = entry.getKey();
      // If configPath is a prefix of filePath, use the delimiter
      // configured for that path.
      if (filePath.startsWith(configPath)) {
        delimiter = entry.getValue();
      }
    }
    byte[] recordDelimiterBytes = null;
    if (null != delimiter) {
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    }
    return new LineRecordReader(job, fileSplit, recordDelimiterBytes);
  }
} {code}
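The prefix lookup at the heart of step 1 can be sketched in plain Java without any Hadoop dependency. The class `DelimiterResolver` and its methods are hypothetical names. The sketch also assumes two refinements that are not part of the proposal as written: the longest matching prefix wins, and a prefix only matches on a directory boundary, so that /warehouse/test does not capture files under /warehouse/test2.

```java
import java.util.Map;

public class DelimiterResolver {
    /**
     * Returns the delimiter configured for the longest path prefix that
     * matches filePath, or defaultDelimiter when no prefix matches.
     */
    public static String resolve(Map<String, String> pathToDelimiter,
                                 String filePath, String defaultDelimiter) {
        String best = defaultDelimiter;
        int bestLength = -1;
        for (Map.Entry<String, String> entry : pathToDelimiter.entrySet()) {
            String prefix = entry.getKey();
            if (matchesPrefix(filePath, prefix) && prefix.length() > bestLength) {
                best = entry.getValue();
                bestLength = prefix.length();
            }
        }
        return best;
    }

    /** True if prefix equals filePath or names one of its parent directories. */
    static boolean matchesPrefix(String filePath, String prefix) {
        if (!filePath.startsWith(prefix)) {
            return false;
        }
        return filePath.length() == prefix.length()
            || prefix.endsWith("/")
            || filePath.charAt(prefix.length()) == '/';
    }
}
```

With entries for /warehouse/test and /warehouse/test/sub, a file under /warehouse/test/sub resolves to the more specific entry regardless of map iteration order.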
*2. Modify Hive's CREATE TABLE handling to support <LINES TERMINATED BY>.*
{code:java}
create table test  (  
    id string,  
    name string  
)  
ROW FORMAT DELIMITED
LINES TERMINATED BY '|@|'
LOCATION 'hdfs_path'; {code}
When a user executes the SQL above, Hive will append the entry 
(hdfs_path, '|@|') to the <pathToDelimiter> file.
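The proposal does not specify the layout of the <pathToDelimiter> file. As a sketch only, assume one entry per line in the form path, a tab, then the delimiter, with '#' starting a comment line. The class `PathToDelimiterFile` and its methods are hypothetical. Delimiters that themselves contain a tab or a newline would need an escaping scheme that this sketch does not cover.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PathToDelimiterFile {
    /** Formats one entry the way CREATE TABLE would append it (assumed layout). */
    public static String formatEntry(String path, String delimiter) {
        return path + "\t" + delimiter;
    }

    /** Parses the file's lines into a path-to-delimiter map, keeping file order. */
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            int tab = line.indexOf('\t');
            if (tab <= 0 || tab == line.length() - 1) {
                continue; // malformed: missing path or missing delimiter
            }
            // A later entry for the same path overrides an earlier one.
            result.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return result;
    }
}
```

Here `formatEntry("hdfs_path", "|@|")` produces the line that the CREATE TABLE step would append, and `parse` is the counterpart the input format would call before looking up a split's path.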

Looking forward to receiving your suggestions and feedback!

*If you accept my idea, I hope you can assign the task to me. My GitHub account 
is: _lvhu-goodluck_* 

I really hope to contribute code to the community.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
