I’ve found the trigger of my issue: if I start spark-shell, or submit via spark-submit, with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer, the DataFrame content comes out wrong, as I described earlier.
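For reference, here is a minimal sketch of what I'm running with the serializer set programmatically instead of via --conf (setting spark.serializer in SparkConf is equivalent to passing the flag at launch; the app name is my own placeholder, and the read options are the same ones from the thread below):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Equivalent to launching with
// --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
val conf = new SparkConf()
  .setAppName("kryo-csv-repro") // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Same spark-csv read as in the thread below: fine with the default
// JavaSerializer, garbled once Kryo is enabled.
sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", " ")
  .load("hdfs:///tmp/1.csv")
  .show()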
On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu <sliznmail...@gmail.com> wrote:

> Thanks Luciano, now it looks like I'm the only one who has this issue. My
> options are narrowed down to upgrading my Spark to 1.6.0, to see if this
> issue goes away.
>
> —
> Cheers,
> Todd Leo
>
> On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende <luckbr1...@gmail.com>
> wrote:
>
>> I tried 1.5.0, 1.6.0, and 2.0.0 trunk with
>> com.databricks:spark-csv_2.10:1.3.0, all with the expected results: the
>> columns are read properly.
>>
>> +----------+-----------------------+
>> |C0        |C1                     |
>> +----------+-----------------------+
>> |1446566430| 2015-11-04<SP>00:00:30|
>> |1446566430| 2015-11-04<SP>00:00:30|
>> |1446566430| 2015-11-04<SP>00:00:30|
>> |1446566430| 2015-11-04<SP>00:00:30|
>> |1446566430| 2015-11-04<SP>00:00:30|
>> |1446566431| 2015-11-04<SP>00:00:31|
>> |1446566431| 2015-11-04<SP>00:00:31|
>> |1446566431| 2015-11-04<SP>00:00:31|
>> |1446566431| 2015-11-04<SP>00:00:31|
>> |1446566431| 2015-11-04<SP>00:00:31|
>> +----------+-----------------------+
>>
>> On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu <sliznmail...@gmail.com>
>> wrote:
>>
>>> Hi Spark Users Group,
>>>
>>> I have a CSV file to analyze with Spark, but I'm having trouble
>>> importing it as a DataFrame.
>>>
>>> Here's a minimal reproducible example. Suppose I have a 10(rows) x
>>> 2(cols) space-delimited CSV file, shown below:
>>>
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>>
>>> The <SP> in column 2 represents a sub-delimiter within that column.
>>> The file is stored on HDFS; let's say the path is hdfs:///tmp/1.csv.
>>>
>>> I'm using spark-csv to import this file as a Spark DataFrame:
>>>
>>> sqlContext.read.format("com.databricks.spark.csv")
>>>   .option("header", "false")      // no header row
>>>   .option("inferSchema", "false") // don't infer data types
>>>   .option("delimiter", " ")
>>>   .load("hdfs:///tmp/1.csv")
>>>   .show
>>>
>>> Oddly, the output shows only part of each column:
>>>
>>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>>
>>> Even the table borders weren't rendered correctly. I also tried reading
>>> the file another way, via sc.textFile(...).map(_.split(" ")) and
>>> sqlContext.createDataFrame, and the result is the same. Can someone
>>> point out where I went wrong?
>>>
>>> —
>>> BR,
>>> Todd Leo
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
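P.S. For completeness, here is a rough sketch of the sc.textFile alternative I mentioned in my first mail quoted above; the column names c0/c1 and the explicit string schema are illustrative choices of mine, not from the original run:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Split each line on the single top-level space delimiter; the <SP>
// sub-delimiter inside column 2 is a different character, so it survives.
val rows = sc.textFile("hdfs:///tmp/1.csv")
  .map(_.split(" "))
  .map(a => Row(a(0), a(1)))

// Keep both columns as plain strings (illustrative schema).
val schema = StructType(Seq(
  StructField("c0", StringType, nullable = false),
  StructField("c1", StringType, nullable = false)))

val df = sqlContext.createDataFrame(rows, schema)
df.show(false) // truncate = false, so long values are printed in full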