The parallelize method does not read the contents of a file. It simply takes a collection you already have and distributes it across the cluster. In this case, the String itself is the collection: 67 characters, which is why x.count is always 67 and why x is an RDD[Char] rather than an RDD[String].

Use sc.textFile instead of sc.parallelize, and it should work as you want.
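For example, something like this (a minimal sketch, assuming the sc from your Zeppelin / spark-shell session and the path from your mail; textFile decompresses .gz files transparently through the Hadoop input format):

// textFile reads the file contents and yields an RDD[String], one line each
val rowStructText =
  sc.textFile("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz")

// each element is now one CSV row; split on commas, keeping empty fields
val x = rowStructText.map(_.split(",", -1))

x.count  // number of lines in the file, not the length of the path string

Also note that the println inside your map runs on the executors, so its output lands in the executor logs, not the driver console, which is why you never see the rows.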
On Wed, Aug 5, 2015 at 8:12 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
> I have csv data that is embedded in gzip format on HDFS.
>
> *With Pig*
>
> a = load
> '/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00003.gz' using
> PigStorage();
>
> b = limit a 10
>
> (2015-07-27,12459,,31243,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,,,,,203,4810370.0,1.4090459061723766,1.017458,-0.03,-0.11,0.05,0.468666,)
>
> (2015-07-27,12459,,31241,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,0,isGeo,,,203,7937613.0,1.1624841995932425,1.11562,-0.06,-0.15,0.03,0.233283,)
>
> However with Spark
>
> val rowStructText =
> sc.parallelize("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz")
>
> val x = rowStructText.map(s => {
>   println(s)
>   s
> })
>
> x.count
>
> Questions
>
> 1) x.count always shows 67 irrespective of the path i change in
> sc.parallelize
>
> 2) It shows x as RDD[Char] instead of String
>
> 3) println() never emits the rows.
>
> Any suggestions
>
> -Deepak
>
> --
> Deepak