Yes. For example, I have two data sets:

data set A: /data/test1/dt=20100101
data set B: /data/test2/dt=20100202
All data has the same JSON format, such as:

{"key1" : "value1", "key2" : "value2"}

My expected output:

{"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
{"key1" : "value1", "key2" : "value2", "source" : "test2", "date" : "20100202"}

> On Sep 25, 2015, at 11:52, Anchit Choudhry <anchit.choud...@gmail.com> wrote:
>
> Sure. May I ask for a sample input (could be just a few lines) and the output
> you are expecting, to bring clarity to my thoughts?
>
> On Thu, Sep 24, 2015, 23:44 Fengdong Yu <fengdo...@everstring.com> wrote:
> Hi Anchit,
>
> Thanks for the quick answer.
>
> My exact question is: I want to add the HDFS location to each line of my JSON
> data.
>
>> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choud...@gmail.com> wrote:
>>
>> Hi Fengdong,
>>
>> Thanks for your question.
>>
>> Spark already has a function called wholeTextFiles within sparkContext which
>> can help you with that:
>>
>> Python
>>
>> If you have the following files:
>>
>> hdfs://a-hdfs-path/part-00000
>> hdfs://a-hdfs-path/part-00001
>> ...
>> hdfs://a-hdfs-path/part-nnnnn
>>
>> then rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path") contains:
>>
>> (a-hdfs-path/part-00000, its content)
>> (a-hdfs-path/part-00001, its content)
>> ...
>> (a-hdfs-path/part-nnnnn, its content)
>>
>> More info:
>> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>>
>> ------------
>>
>> Scala
>>
>> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>
>> More info:
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>>
>> Let us know if this helps or you need more help.
>>
>> Thanks,
>> Anchit Choudhry
>>
>> On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com> wrote:
>> Hi,
>>
>> I have multiple files in JSON format, such as:
>>
>> /data/test1_data/sub100/test.data
>> /data/test2_data/sub200/test.data
>>
>> I can sc.textFile("/data/*/*")
>>
>> but I want to add {"source" : "HDFS_LOCATION"} to each line, then save
>> it to one target HDFS location.
>>
>> How can I do it? Thanks.
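
For reference, here is a minimal sketch of the per-line transformation asked about at the top of this thread. It assumes paths of the form .../<source>/dt=<date>/<file>, as in /data/test1/dt=20100101. The helper names parse_location and tag_lines are hypothetical, not part of Spark. The Spark wiring is given only as a comment and has not been run; the output path in it is a placeholder.

```python
import json
import re

# Assumed layout: .../<source>/dt=<date>[/<file>], e.g. /data/test1/dt=20100101
_LOCATION = re.compile(r"/(?P<source>[^/]+)/dt=(?P<date>[^/]+)(?:/|$)")


def parse_location(path):
    """Extract (source, date) from a path (hypothetical helper)."""
    match = _LOCATION.search(path)
    if match is None:
        raise ValueError("path does not match .../<source>/dt=<date>: %r" % path)
    return match.group("source"), match.group("date")


def tag_lines(path, content):
    """Yield each non-empty JSON line of content with source and date added."""
    source, date = parse_location(path)
    for line in content.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        record["source"] = source
        record["date"] = date
        yield json.dumps(record)


# Possible wiring in PySpark (untested sketch; sc is an existing SparkContext
# and /data/output is a placeholder target directory):
#   sc.wholeTextFiles("/data/*/*") \
#     .flatMap(lambda pair: tag_lines(pair[0], pair[1])) \
#     .saveAsTextFile("/data/output")
```

Note that wholeTextFiles reads each file fully into memory as one (path, content) pair, so this approach fits best when the individual files are small.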