[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lvhu updated MAPREDUCE-7450:
----------------------------
    Fix Version/s:     (was: MR-3902)

> Set the record delimiter for the input file based on its path
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-7450
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7450
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 3.3.6
>         Environment: Any
>            Reporter: lvhu
>            Priority: Critical
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> In a MapReduce program, the record delimiter used when reading files can 
> easily be set via the textinputformat.record.delimiter parameter.
> The same parameter is also easy to set from other frameworks such as Spark, 
> for example:
> {code:java}
> spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "|@|")
> val rdd = spark.sparkContext.newAPIHadoopFile(...) {code}
> *But once textinputformat.record.delimiter is modified, it takes effect for 
> all files. In real scenarios, different files often use different 
> delimiters.*
> In Hive, since there is no programmatic API, we cannot set the record 
> delimiter as above. If it is set through a configuration file, it takes 
> effect for all Hive tables.
> The only way to change the record delimiter for a single Hive table is to 
> write a custom TextInputFormat class.
> The current Hive workaround looks like this:
> {code:java}
> // Custom input format that hard-codes one record delimiter
> package abc.hive;
> 
> public class MyFstTextInputFormat extends FileInputFormat<LongWritable, Text>
>     implements JobConfigurable {
>   ...
> }
> 
> -- Hive DDL using the custom input format
> CREATE TABLE test (
>     id STRING,
>     name STRING
> ) STORED AS
> INPUTFORMAT 'abc.hive.MyFstTextInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' {code}
> If there are multiple different record delimiters, multiple TextInputFormat 
> classes need to be written.
> *My idea is to modify TextInputFormat so that the record delimiter for each 
> input file can be chosen based on the prefix of its file path.*
> The specific idea is to modify TextInputFormat roughly as follows:
> {code:java}
> public class TextInputFormat extends FileInputFormat<LongWritable, Text>
>   implements JobConfigurable {
>   ...
>   public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
>                                                           JobConf job,
>                                                           Reporter reporter)
>     throws IOException {
>     reporter.setStatus(genericSplit.toString());
>     // Default delimiter
>     String delimiter = job.get("textinputformat.record.delimiter");
>     // Path of the file backing this split (InputSplit itself has no
>     // getPath(), so the split must be cast to FileSplit)
>     String filePath = ((FileSplit) genericSplit).getPath().toUri().getPath();
>     // Map from path prefix to delimiter, obtained by parsing a
>     // configuration parameter
>     Map<String, String> pathToDelimiterMap = ...; // obtain by parsing the configuration file
>     for (Map.Entry<String, String> entry : pathToDelimiterMap.entrySet()) {
>       String configPath = entry.getKey();
>       // If configPath is a prefix of filePath, use its delimiter
>       if (filePath.startsWith(configPath)) {
>         delimiter = entry.getValue();
>         break;
>       }
>     }
>     byte[] recordDelimiterBytes = null;
>     if (null != delimiter) {
>       recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
>     }
>     return new LineRecordReader(job, (FileSplit) genericSplit,
>         recordDelimiterBytes);
>   }
> }{code}
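> The prefix-to-delimiter map above has to be parsed from somewhere. One 
> possible shape, assuming a single new (hypothetical, not yet existing) 
> parameter such as textinputformat.record.delimiter.per.path holding 
> comma-separated prefix=delimiter pairs, is sketched below; a real syntax 
> would also need a way to escape ',' and '=' inside delimiters:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PerPathDelimiterConfig {

  // Parses a value such as "/data/pipe=|@|,/data/hash=#" into an
  // ordered map of path prefix -> record delimiter. The parameter
  // name and value syntax are hypothetical sketches, not existing
  // Hadoop configuration.
  public static Map<String, String> parse(String value) {
    Map<String, String> map = new LinkedHashMap<>();
    if (value == null || value.isEmpty()) {
      return map;
    }
    for (String pair : value.split(",")) {
      int idx = pair.indexOf('=');
      if (idx > 0) {
        map.put(pair.substring(0, idx), pair.substring(idx + 1));
      }
    }
    return map;
  }

  // Returns the delimiter of the first matching prefix, or the
  // job-wide default when no prefix matches.
  public static String delimiterFor(String filePath,
                                    Map<String, String> prefixToDelimiter,
                                    String defaultDelimiter) {
    for (Map.Entry<String, String> e : prefixToDelimiter.entrySet()) {
      if (filePath.startsWith(e.getKey())) {
        return e.getValue();
      }
    }
    return defaultDelimiter;
  }
}
```

> Using a LinkedHashMap keeps the configured order, so more specific prefixes 
> can simply be listed first.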
> With per-path record delimiters in place, there is no longer any need to 
> write custom input-format code just to change the delimiter, and it is also 
> very convenient for Hadoop and Spark users, who no longer need to change 
> parameter configuration between jobs.
> Looking forward to your suggestions and feedback!
> *If you accept this idea, I hope you can assign the task to me. My GitHub 
> account is: _lvhu-goodluck_*
> I really hope to contribute code to the community.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org