Hi Igor,

In my case, it’s not a matter of *truncate*. As the show() function in the
Spark API doc reads,

truncate: Whether truncate long strings. If true, strings more than 20
characters will be truncated and all cells will be aligned right…

whereas in my output it is the *leading* characters of my two columns that
are missing, not the trailing ones.

Good to know how to show the whole content of a cell, though.
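To illustrate the distinction, here is a minimal plain-Python sketch (not Spark code) of what show(truncate = true) does to a long cell, based on my reading of the Spark 1.5 source: cells longer than 20 characters keep their *first* 17 characters plus "...", and every cell is right-aligned. Truncation therefore drops trailing characters, never leading ones.

```python
def truncate_cell(s, width=20):
    """Mimic Spark's show(truncate=true): keep the first (width-3) chars
    of an over-long string, append "...", and right-align to the width."""
    if len(s) > width:
        s = s[:width - 3] + "..."
    return s.rjust(width)

# A long cell loses its tail, not its head:
print(truncate_cell("2015-11-04 00:00:30.123456"))  # -> "2015-11-04 00:00:..."
# A short cell is only padded on the left:
print(truncate_cell("1446566430"))                  # -> "          1446566430"
```

If leading characters are missing, the cause is upstream of show() — in how the file was parsed.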

—
BR,
Todd Leo




On Sun, Feb 7, 2016 at 5:42 PM Igor Berman <igor.ber...@gmail.com> wrote:

> show has a truncate argument;
> pass false so it won't truncate your results
>
> On 7 February 2016 at 11:01, SLiZn Liu <sliznmail...@gmail.com> wrote:
>
>> Plus, I’m using *Spark 1.5.2*, with *spark-csv 1.3.0*. Also tried
>> HiveContext, but the result is exactly the same.
>>
>>
>> On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu <sliznmail...@gmail.com> wrote:
>>
>>> Hi Spark Users Group,
>>>
>>> I have a csv file to analyze with Spark, but I'm having trouble
>>> importing it as a DataFrame.
>>>
>>> Here’s the minimal reproducible example. Suppose I’m having a
>>> *10(rows)x2(cols)* *space-delimited csv* file, shown as below:
>>>
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>>
>>> the <SP> in column 2 represents a sub-delimiter within that column. The
>>> file is stored on HDFS; let's say the path is hdfs:///tmp/1.csv
>>>
>>> I’m using *spark-csv* to import this file as Spark *DataFrame*:
>>>
>>> sqlContext.read.format("com.databricks.spark.csv")
>>>         .option("header", "false") // No header line in the files
>>>         .option("inferSchema", "false") // Don't infer types; read all columns as strings
>>>         .option("delimiter", " ")
>>>         .load("hdfs:///tmp/1.csv")
>>>         .show
>>>
>>> Oddly, the output shows only a part of each column:
>>>
>>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>>
>>> and even the boundary of the table isn't drawn correctly. I also tried
>>> the other way of reading a csv file, via sc.textFile(...).map(_.split(" "))
>>> and sqlContext.createDataFrame, and the result is exactly the same. Can
>>> someone point out where I went wrong?
>>>
>>> —
>>> BR,
>>> Todd Leo
>>>
>>>
>>
>
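For what it's worth, the sample rows from the thread split cleanly outside Spark, which suggests the file contents themselves are intact. A quick plain-Python sanity check (the "<SP>" sub-delimiter is written literally, as in the sample above):

```python
# Split each sample row on the single-space column delimiter only once,
# so the <SP> inside column 2 is left untouched.
rows = [
    "1446566430 2015-11-04<SP>00:00:30",
    "1446566431 2015-11-04<SP>00:00:31",
]
parsed = [line.split(" ", 1) for line in rows]
for ts, dt in parsed:
    print(ts, dt)  # leading characters of both columns are present
```

If this prints both columns in full, the truncated output must come from the Spark-side display or parsing path rather than from the data on HDFS.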
