I have CSV data stored in gzip format on HDFS.

*With Pig*

a = load '/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00003.gz' using PigStorage();

b = limit a 10;

(2015-07-27,12459,,31243,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,,,,,203,4810370.0,1.4090459061723766,1.017458,-0.03,-0.11,0.05,0.468666,)

(2015-07-27,12459,,31241,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,0,isGeo,,,203,7937613.0,1.1624841995932425,1.11562,-0.06,-0.15,0.03,0.233283,)


However, with Spark:

val rowStructText = sc.parallelize("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz")

val x = rowStructText.map(s => {
    println(s)
    s
})

x.count

Questions

1) x.count always returns 67, regardless of the path I pass to
sc.parallelize.

2) x is shown as RDD[Char] instead of RDD[String].

3) println() never emits the rows.
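
A possible explanation for the three observations above, offered as a sketch rather than a confirmed diagnosis: sc.parallelize distributes a local collection, and a Scala String passed to it is treated as a sequence of characters. That would give an RDD[Char] whose count is the length of the path string, which is 67 characters for the path above. sc.textFile is the SparkContext method that reads a file into an RDD[String], and it decompresses .gz input through Hadoop's compression codecs. The snippet assumes the `sc` provided by the Zeppelin or spark-shell session; `path` and `rows` are illustrative names.

```scala
// Assumes an existing SparkContext named `sc` (as provided by Zeppelin or spark-shell).
val path = "/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz"

// sc.parallelize(path) would treat the String as a sequence of Char,
// so its count equals the number of characters in the path.
println(path.length)

// sc.textFile reads the file itself and yields one String per line.
// Gzip-compressed input is decompressed based on the .gz extension.
val rows = sc.textFile(path)

// take(10) returns the first rows to the driver, so println runs locally.
rows.take(10).foreach(println)

// Number of lines in the file, not the length of the path.
println(rows.count())
```

A println inside map runs on the executors, so on a cluster its output usually ends up in the executor stdout logs rather than in the notebook. Note also that a gzip file is not splittable, so each .gz file is read as a single partition.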

Any suggestions?

-Deepak
