[ https://issues.apache.org/jira/browse/SPARK-6316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
yunzhi.lyz updated SPARK-6316:
------------------------------
    Description:
Add an encoding parameter to the SparkContext(conf).textFile() method to support HDFS files in encodings other than UTF-8 (e.g. multi-language files):

    val file = new SparkContext(conf).textFile(args(0), 10, "gbk")

Proposed change in org.apache.spark.SparkContext:

    + def defaultEncoding: String = "utf-8"

    - def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
    -   hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    -     minPartitions).map(pair => pair._2.toString).setName(path)
    - }

    + def textFile(path: String, minPartitions: Int = defaultMinPartitions,
    +              encoding: String = defaultEncoding): RDD[String] = {
    +   hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    +     minPartitions).map(pair => new String(pair._2.getBytes(), 0,
    +       pair._2.getLength(), encoding)).setName(path)
    + }
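The heart of the proposed change is decoding each record's raw bytes with an explicit charset instead of relying on Text.toString, which always decodes as UTF-8. A minimal, Spark-free sketch of just that decoding step (the byte array stands in for what a Hadoop Text record's getBytes/getLength would yield; the sample string is illustrative):

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    public static void main(String[] args) {
        // Stand-in for the bytes a Hadoop Text record would carry for a
        // GBK-encoded line ("\u4e2d\u6587" is "Chinese" written in Chinese).
        byte[] gbkBytes = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));

        // What the current textFile() effectively does (Text.toString):
        // decode as UTF-8, which garbles GBK input into replacement chars.
        String asUtf8 = new String(gbkBytes, Charset.forName("UTF-8"));

        // What the proposed textFile(path, minPartitions, encoding) does:
        // decode the same byte range with the caller-supplied charset.
        String asGbk = new String(gbkBytes, 0, gbkBytes.length, "GBK");

        System.out.println(asGbk.equals("\u4e2d\u6587"));  // true
        System.out.println(asUtf8.equals("\u4e2d\u6587")); // false
    }
}
```

Decoding the range (0, getLength()) rather than the whole array matters in the real patch because Text.getBytes returns a backing array that may be longer than the valid data.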
> add a parameter for SparkContext(conf).textFile() method, support for
> multi-language hdfs file, e.g. "gbk"
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-6316
>                 URL: https://issues.apache.org/jira/browse/SPARK-6316
>             Project: Spark
>          Issue Type: New Feature
>         Environment: linux
>                      LANG=en_US.UTF-8
>            Reporter: yunzhi.lyz
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org