The parallelize method does not read the contents of a file. It simply
takes a collection and distributes it across the cluster. In this case, the
path String is itself a collection of 67 characters, which is why the count
is always 67 and why x shows up as RDD[Char].
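
For example, this minimal sketch (assuming a SparkContext named sc, as in
your snippet) reproduces the behavior you are seeing:

  val path = "/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz"
  // A String implicitly converts to a Seq[Char], so parallelize
  // produces an RDD[Char] with one element per character of the path.
  val chars = sc.parallelize(path)
  chars.count  // 67 (the length of the path), no matter what the file contains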

Use sc.textFile instead of sc.parallelize, and it should work as you want.
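
A minimal sketch (using the same path; sc.textFile decompresses .gz files
transparently through Hadoop's compression codecs):

  // RDD[String], one element per line of the decompressed file
  val rows = sc.textFile("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz")
  rows.count                           // number of rows in the file
  rows.take(10).foreach(println)       // print the first 10 rows on the driver
  val fields = rows.map(_.split(","))  // RDD[Array[String]], one array of columns per row

Also note that a println inside map runs on the executors, so its output
lands in the executor logs rather than in your driver console; collect a
small sample with take and print it on the driver instead, as above.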

On Wed, Aug 5, 2015 at 8:12 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:

> I have csv data that is embedded in gzip format on HDFS.
>
> *With Pig*
>
> a = load '/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00003.gz'
>     using PigStorage();
>
> b = limit a 10;
>
>
> (2015-07-27,12459,,31243,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,,,,,203,4810370.0,1.4090459061723766,1.017458,-0.03,-0.11,0.05,0.468666,)
>
>
> (2015-07-27,12459,,31241,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,0,isGeo,,,203,7937613.0,1.1624841995932425,1.11562,-0.06,-0.15,0.03,0.233283,)
>
>
> However, with Spark:
>
> val rowStructText =
>   sc.parallelize("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz")
>
> val x = rowStructText.map(s => {
>   println(s)
>   s
> })
>
> x.count
>
> Questions
>
> 1) x.count always shows 67, irrespective of the path I pass to
> sc.parallelize.
>
> 2) It shows x as RDD[Char] instead of RDD[String].
>
> 3) println() never emits the rows.
>
> Any suggestions?
>
> -Deepak
>
>
>
> --
> Deepak
