I have CSV data stored in gzip format on HDFS. *With Pig:*
    a = load '/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00003.gz' using PigStorage();
    b = limit a 10;

this produces rows such as:

    (2015-07-27,12459,,31243,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,,,,,203,4810370.0,1.4090459061723766,1.017458,-0.03,-0.11,0.05,0.468666,)
    (2015-07-27,12459,,31241,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,0,isGeo,,,203,7937613.0,1.1624841995932425,1.11562,-0.06,-0.15,0.03,0.233283,)

However, with Spark:

    val rowStructText = sc.parallelize("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz")
    val x = rowStructText.map(s => {
      println(s)
      s
    })
    x.count

Questions:

1) x.count always returns 67, no matter which path I pass to sc.parallelize.
2) x is inferred as RDD[Char] instead of RDD[String].
3) println() never prints any rows.

Any suggestions?

-Deepak
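For context, all three symptoms likely follow from one fact: a Scala String is implicitly an IndexedSeq[Char], so sc.parallelize(path) distributes the *characters of the path string*, not the file's contents. The path above is exactly 67 characters long, which would explain the constant count of 67 and the RDD[Char] type; println inside map runs on the executors, so its output lands in executor logs rather than the driver console. A minimal sketch of the likely fix, assuming a SparkContext `sc` is in scope (e.g. in spark-shell):

```scala
// sc.parallelize over a String yields an RDD[Char] of the string's
// characters -- hence count == 67 (the path's length) for any file.
val chars = sc.parallelize("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz")

// To read the file itself, use textFile; Hadoop's codec support
// decompresses the .gz transparently, yielding one String per line.
val rows = sc.textFile("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz")

// Bring a sample back to the driver before printing, so the output
// appears on the driver console instead of in executor logs.
rows.take(10).foreach(println)
```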