Re: Imported CSV file content isn't identical to the original file
This error message no longer appears now that I have upgraded to 1.6.0.

—
Cheers,
Todd Leo

On Tue, Feb 9, 2016 at 9:07 AM SLiZn Liu wrote:
> At least it works for me, with the Kryo serializer temporarily disabled
> until I upgrade to 1.6.0. Thanks for your update. :)
Re: Imported CSV file content isn't identical to the original file
Thanks Luciano, now it looks like I’m the only one who has this issue. My options have narrowed down to upgrading my Spark to 1.6.0, to see if the issue goes away.

—
Cheers,
Todd Leo

On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende wrote:
> I tried 1.5.0, 1.6.0 and the 2.0.0 trunk with
> com.databricks:spark-csv_2.10:1.3.0, with expected results, where the
> columns seem to be read properly.
Re: Imported CSV file content isn't identical to the original file
I’ve found the trigger of my issue: if I start my spark-shell, or submit via spark-submit, with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer, the DataFrame content goes wrong, as I described earlier.

On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu wrote:
> Thanks Luciano, now it looks like I’m the only one who has this issue. My
> options have narrowed down to upgrading my Spark to 1.6.0, to see if the
> issue goes away.
Re: Imported CSV file content isn't identical to the original file
Sorry, same expected results with trunk and the Kryo serializer.

On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu wrote:
> I’ve found the trigger of my issue: if I start my spark-shell, or submit
> via spark-submit, with --conf
> spark.serializer=org.apache.spark.serializer.KryoSerializer, the
> DataFrame content goes wrong, as I described earlier.

--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Re: Imported CSV file content isn't identical to the original file
At least it works for me, with the Kryo serializer temporarily disabled until I upgrade to 1.6.0. Thanks for your update. :)

On Tuesday, Feb 9, 2016 at 02:37, Luciano Resende wrote:
> Sorry, same expected results with trunk and the Kryo serializer.
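For anyone hitting the same thing before upgrading, a launch line along these lines reverts to Spark’s default serializer instead of Kryo (a hypothetical sketch; JavaSerializer is the documented default, so simply omitting the Kryo override has the same effect):

```shell
# Workaround sketch: drop the Kryo override entirely, or set the
# serializer back to Spark's default explicitly:
spark-shell \
  --conf spark.serializer=org.apache.spark.serializer.JavaSerializer
```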
Re: Imported CSV file content isn't identical to the original file
Plus, I’m using *Spark 1.5.2* with *spark-csv 1.3.0*. I also tried HiveContext, but the result is exactly the same.

On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu wrote:
> Hi Spark Users Group,
>
> I have a CSV file to analyze with Spark, but I’m having trouble importing
> it as a DataFrame.
Re: Imported CSV file content isn't identical to the original file
show has a truncate argument; pass false so it won’t truncate your results.

On 7 February 2016 at 11:01, SLiZn Liu wrote:
> Plus, I’m using *Spark 1.5.2* with *spark-csv 1.3.0*. I also tried
> HiveContext, but the result is exactly the same.
Re: Imported CSV file content isn't identical to the original file
Hi Igor,

In my case, it’s not a matter of *truncate*. As the show() function in the Spark API doc reads,

> truncate: Whether truncate long strings. If true, strings more than 20
> characters will be truncated and all cells will be aligned right…

whereas the leading characters of my two columns are missing. Still, good to know how to show the whole content of a cell.

—
BR,
Todd Leo

On Sun, Feb 7, 2016 at 5:42 PM Igor Berman wrote:
> show has a truncate argument; pass false so it won’t truncate your
> results.
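To illustrate the distinction being drawn here — truncation in show cuts the *tail* of long strings, it never drops leading characters — a small Python sketch of the documented 20-character rule (illustrative only, not Spark’s actual code):

```python
def show_cell(value, truncate=True):
    """Mimic (approximately) how DataFrame.show renders one cell:
    strings longer than 20 chars are cut at the tail, and cells
    are right-aligned. Illustrative sketch, not Spark code."""
    s = str(value)
    if truncate and len(s) > 20:
        s = s[:17] + "..."  # the tail is dropped, never the head
    return s.rjust(20)

# A value under 20 chars is shown whole, so truncation cannot
# explain missing *leading* characters:
print(show_cell("2015-11-0400:00:30"))                # shown in full
print(show_cell("a-very-long-string-over-20-chars"))  # tail replaced by "..."
```

So if leading characters go missing, the cause has to be upstream of show (here, the serializer), not the display truncation.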
Re: Imported CSV file content isn't identical to the original file
*Update*: in local mode (spark-shell --master local[2], no matter whether it reads from the local file system or HDFS), it works well. But that doesn’t solve my problem, since my data scale requires hundreds of CPU cores and hundreds of GB of RAM.

BTW, it’s Chinese Lunar New Year now; I wish you all a happy year and great fortune in the Year of the Monkey!

—
BR,
Todd Leo

On Sun, Feb 7, 2016 at 6:09 PM SLiZn Liu wrote:
> Hi Igor,
>
> In my case, it’s not a matter of *truncate*; the leading characters of my
> two columns are missing.
Re: Imported CSV file content isn't identical to the original file
I tried 1.5.0, 1.6.0 and the 2.0.0 trunk with com.databricks:spark-csv_2.10:1.3.0, with expected results, where the columns seem to be read properly:

+--+--+
|C0|C1|
+--+--+
|1446566430 | 2015-11-0400:00:30|
|1446566430 | 2015-11-0400:00:30|
|1446566430 | 2015-11-0400:00:30|
|1446566430 | 2015-11-0400:00:30|
|1446566430 | 2015-11-0400:00:30|
|1446566431 | 2015-11-0400:00:31|
|1446566431 | 2015-11-0400:00:31|
|1446566431 | 2015-11-0400:00:31|
|1446566431 | 2015-11-0400:00:31|
|1446566431 | 2015-11-0400:00:31|
+--+--+

On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu wrote:
> Hi Spark Users Group,
>
> I have a CSV file to analyze with Spark, but I’m having trouble importing
> it as a DataFrame.

--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Imported CSV file content isn't identical to the original file
Hi Spark Users Group,

I have a CSV file to analyze with Spark, but I’m having trouble importing it as a DataFrame.

Here’s a minimal reproducible example. Suppose I have a *10(rows) x 2(cols)* *space-delimited csv* file, shown below:

1446566430 2015-11-0400:00:30
1446566430 2015-11-0400:00:30
1446566430 2015-11-0400:00:30
1446566430 2015-11-0400:00:30
1446566430 2015-11-0400:00:30
1446566431 2015-11-0400:00:31
1446566431 2015-11-0400:00:31
1446566431 2015-11-0400:00:31
1446566431 2015-11-0400:00:31
1446566431 2015-11-0400:00:31

The character in column 2 (not rendered here) represents a sub-delimiter within that column. This file is stored on HDFS; let’s say the path is hdfs:///tmp/1.csv.

I’m using *spark-csv* to import this file as a Spark *DataFrame*:

sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")      // no header line in the file
  .option("inferSchema", "false") // do not infer data types
  .option("delimiter", " ")
  .load("hdfs:///tmp/1.csv")
  .show

Oddly, the output shows only part of each column:

[image: Screenshot from 2016-02-07 15-27-51.png]

and even the boundary of the table isn’t shown correctly. I also tried the other way to read a csv file, via sc.textFile(...).map(_.split(" ")) and sqlContext.createDataFrame, and the result is the same. Can someone point out where I went wrong?

—
BR,
Todd Leo
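The alternative path mentioned above (sc.textFile(...).map(_.split(" ")) followed by createDataFrame) boils down to splitting each line on a single space; a minimal Python stand-in (hypothetical inline data, no Spark involved) shows what the parsed rows should look like:

```python
# Stand-in for sc.textFile("hdfs:///tmp/1.csv").map(_.split(" ")):
# each line of the space-delimited file should split into exactly
# two fields, the epoch seconds and the timestamp string.
lines = [
    "1446566430 2015-11-0400:00:30",
    "1446566431 2015-11-0400:00:31",
]
rows = [line.split(" ") for line in lines]
for ts, dt in rows:
    print(ts, dt)
```

If the DataFrame built from such rows still shows mangled columns, the parsing itself is not at fault, which points at whatever transforms the rows afterwards.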