Re: Imported CSV file content isn't identical to the original file
This error message no longer appears now that I have upgraded to 1.6.0.

—
Cheers,
Todd Leo

On Tue, Feb 9, 2016 at 9:07 AM SLiZn Liu wrote:
> At least it works for me, with the Kryo serializer temporarily disabled
> until I upgrade to 1.6.0. Thanks for your update. :)
Re: Imported CSV file content isn't identical to the original file
Thanks Luciano, now it looks like I’m the only one who has this issue. My options have narrowed down to upgrading my Spark to 1.6.0, to see if the issue goes away.

—
Cheers,
Todd Leo

On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende wrote:
> I tried 1.5.0, 1.6.0 and the 2.0.0 trunk with
> com.databricks:spark-csv_2.10:1.3.0, with expected results, where the
> columns seem to be read properly.
Re: Imported CSV file content isn't identical to the original file
I’ve found the trigger of my issue: if I start my spark-shell, or submit via spark-submit, with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer, the DataFrame content goes wrong, as I described earlier.

On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu wrote:
> Thanks Luciano, now it looks like I’m the only one who has this issue. My
> options have narrowed down to upgrading my Spark to 1.6.0, to see if the
> issue goes away.
Re: Imported CSV file content isn't identical to the original file
Sorry, same expected results with trunk and the Kryo serializer.

On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu wrote:
> I’ve found the trigger of my issue: if I start my spark-shell, or submit
> via spark-submit, with --conf
> spark.serializer=org.apache.spark.serializer.KryoSerializer, the
> DataFrame content goes wrong, as I described earlier.

--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Re: Imported CSV file content isn't identical to the original file
At least it works for me, with the Kryo serializer temporarily disabled until I upgrade to 1.6.0. Thanks for your update. :)

On Tuesday, Feb 9, 2016 at 02:37, Luciano Resende wrote:
> Sorry, same expected results with trunk and the Kryo serializer.
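For anyone hitting the same thing before upgrading, a launch line along these lines reverts to Spark’s default serializer instead of Kryo (a hypothetical sketch; JavaSerializer is the documented default, so simply omitting the Kryo override has the same effect):

```shell
# Workaround sketch: drop the Kryo override entirely, or set the
# serializer back to Spark's default explicitly:
spark-shell \
  --conf spark.serializer=org.apache.spark.serializer.JavaSerializer
```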
Re: Imported CSV file content isn't identical to the original file
Plus, I’m using *Spark 1.5.2* with *spark-csv 1.3.0*. I also tried HiveContext, but the result is exactly the same.

On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu wrote:
> Hi Spark Users Group,
>
> I have a CSV file to analyze with Spark, but I’m having trouble importing
> it as a DataFrame.
Re: Imported CSV file content isn't identical to the original file
show has a truncate argument; pass false so it won’t truncate your results.

On 7 February 2016 at 11:01, SLiZn Liu wrote:
> Plus, I’m using *Spark 1.5.2* with *spark-csv 1.3.0*. I also tried
> HiveContext, but the result is exactly the same.
Re: Imported CSV file content isn't identical to the original file
Hi Igor,

In my case, it’s not a matter of *truncate*. As the show() function in the Spark API doc reads,

> truncate: Whether truncate long strings. If true, strings more than 20
> characters will be truncated and all cells will be aligned right…

whereas the leading characters of my two columns are missing. Still, good to know how to show the whole content of a cell.

—
BR,
Todd Leo

On Sun, Feb 7, 2016 at 5:42 PM Igor Berman wrote:
> show has a truncate argument; pass false so it won’t truncate your
> results.
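To illustrate the distinction being drawn here — truncation in show cuts the *tail* of long strings, it never drops leading characters — a small Python sketch of the documented 20-character rule (illustrative only, not Spark’s actual code):

```python
def show_cell(value, truncate=True):
    """Mimic (approximately) how DataFrame.show renders one cell:
    strings longer than 20 chars are cut at the tail, and cells
    are right-aligned. Illustrative sketch, not Spark code."""
    s = str(value)
    if truncate and len(s) > 20:
        s = s[:17] + "..."  # the tail is dropped, never the head
    return s.rjust(20)

# A value under 20 chars is shown whole, so truncation cannot
# explain missing *leading* characters:
print(show_cell("2015-11-0400:00:30"))                # shown in full
print(show_cell("a-very-long-string-over-20-chars"))  # tail replaced by "..."
```

So if leading characters go missing, the cause has to be upstream of show (here, the serializer), not the display truncation.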
Re: Imported CSV file content isn't identical to the original file
*Update*: in local mode (spark-shell --master local[2], no matter whether it reads from the local file system or HDFS), it works well. But that doesn’t solve my problem, since my data scale requires hundreds of CPU cores and hundreds of GB of RAM.

BTW, it’s Chinese Lunar New Year now; I wish you all a happy year and great fortune in the Year of the Monkey!

—
BR,
Todd Leo

On Sun, Feb 7, 2016 at 6:09 PM SLiZn Liu wrote:
> Hi Igor,
>
> In my case, it’s not a matter of *truncate*; the leading characters of my
> two columns are missing.
Re: Imported CSV file content isn't identical to the original file
I tried 1.5.0, 1.6.0 and the 2.0.0 trunk with com.databricks:spark-csv_2.10:1.3.0, with expected results, where the columns seem to be read properly:

+--+--+
|C0|C1|
+--+--+
|1446566430 | 2015-11-0400:00:30|
|1446566430 | 2015-11-0400:00:30|
|1446566430 | 2015-11-0400:00:30|
|1446566430 | 2015-11-0400:00:30|
|1446566430 | 2015-11-0400:00:30|
|1446566431 | 2015-11-0400:00:31|
|1446566431 | 2015-11-0400:00:31|
|1446566431 | 2015-11-0400:00:31|
|1446566431 | 2015-11-0400:00:31|
|1446566431 | 2015-11-0400:00:31|
+--+--+

On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu wrote:
> Hi Spark Users Group,
>
> I have a CSV file to analyze with Spark, but I’m having trouble importing
> it as a DataFrame.

--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Imported CSV file content isn't identical to the original file
Hi Spark Users Group,

I have a CSV file to analyze with Spark, but I’m having trouble importing it as a DataFrame.

Here’s a minimal reproducible example. Suppose I have a *10(rows) x 2(cols)* *space-delimited csv* file, shown below:

1446566430 2015-11-0400:00:30
1446566430 2015-11-0400:00:30
1446566430 2015-11-0400:00:30
1446566430 2015-11-0400:00:30
1446566430 2015-11-0400:00:30
1446566431 2015-11-0400:00:31
1446566431 2015-11-0400:00:31
1446566431 2015-11-0400:00:31
1446566431 2015-11-0400:00:31
1446566431 2015-11-0400:00:31

The character in column 2 (not rendered here) represents a sub-delimiter within that column. This file is stored on HDFS; let’s say the path is hdfs:///tmp/1.csv.

I’m using *spark-csv* to import this file as a Spark *DataFrame*:

sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")      // no header line in the file
  .option("inferSchema", "false") // do not infer data types
  .option("delimiter", " ")
  .load("hdfs:///tmp/1.csv")
  .show

Oddly, the output shows only part of each column:

[image: Screenshot from 2016-02-07 15-27-51.png]

and even the boundary of the table isn’t shown correctly. I also tried the other way to read a csv file, via sc.textFile(...).map(_.split(" ")) and sqlContext.createDataFrame, and the result is the same. Can someone point out where I went wrong?

—
BR,
Todd Leo
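The alternative path mentioned above (sc.textFile(...).map(_.split(" ")) followed by createDataFrame) boils down to splitting each line on a single space; a minimal Python stand-in (hypothetical inline data, no Spark involved) shows what the parsed rows should look like:

```python
# Stand-in for sc.textFile("hdfs:///tmp/1.csv").map(_.split(" ")):
# each line of the space-delimited file should split into exactly
# two fields, the epoch seconds and the timestamp string.
lines = [
    "1446566430 2015-11-0400:00:30",
    "1446566431 2015-11-0400:00:31",
]
rows = [line.split(" ") for line in lines]
for ts, dt in rows:
    print(ts, dt)
```

If the DataFrame built from such rows still shows mangled columns, the parsing itself is not at fault, which points at whatever transforms the rows afterwards.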