[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783440#comment-16783440 ]
Sean Owen commented on SPARK-26972:
-----------------------------------

One explanation for your comment about "multiline" support is that the parameter is "multiLine". I think these keys may be case-insensitive now but were not earlier. That's kind of secondary though.

I think the line separator is the real factor here, per earlier comments. Without that set, I think you are retaining the '\n' when reading this in OS X. That's what I would expect. Per above, this is better now with some automatic line separator detection. But you may wish to set it explicitly.

I wasn't clear whether you're saying setting the schema changes the output; I don't imagine it would. What's the output in that case?

Yes, really best to test vs master given recent changes. I think this is effectively a duplicate of the issues reported there.

> Issue with CSV import and inferSchema set to true
> -------------------------------------------------
>
>                 Key: SPARK-26972
>                 URL: https://issues.apache.org/jira/browse/SPARK-26972
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.3, 2.3.3, 2.4.0
>         Environment: Java 8/Scala 2.11/MacOs
>            Reporter: Jean Georges Perrin
>            Priority: Major
>         Attachments: ComplexCsvToDataframeApp.java,
> ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
>
> I found a few discrepancies while working with inferSchema set to true in CSV
> ingestion.
> Given the following CSV in the attached books.csv:
> {noformat}
> id;authorId;title;releaseDate;link
> 1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P
> 2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
> 3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr
> 4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
> 5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
> 6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
> An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1
> 7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
> 8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
> 10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
> 11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I
> 12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
> 13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
> 14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
> 15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn
> 16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
> 17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL
> 18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
> 19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
> 20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W
> 21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc
> 22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
> 23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo
> {noformat}
> And this Java code:
> {code:java}
> Dataset<Row> df = spark.read().format("csv")
>     .option("header", "true")
>     .option("multiline", true)
>     .option("sep", ";")
>     .option("quote", "*")
>     .option("dateFormat", "M/d/y")
>     .option("inferSchema", true)
>     .load("data/books.csv");
> df.show(7);
> df.printSchema();
> {code}
> h1. In Spark v2.0.1
> Output:
> {noformat}
> +---+--------+--------------------+-----------+--------------------+
> | id|authorId|               title|releaseDate|                link|
> +---+--------+--------------------+-----------+--------------------+
> |  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
> |  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
> |  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
> |  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
> |  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
> |  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
> |  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
> +---+--------+--------------------+-----------+--------------------+
> only showing top 7 rows
> Dataframe's schema:
> root
>  |-- id: integer (nullable = true)
>  |-- authorId: integer (nullable = true)
>  |-- title: string (nullable = true)
>  |-- releaseDate: string (nullable = true)
>  |-- link: string (nullable = true)
> {noformat}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content:
> {noformat}
> +--------------------+--------+--------------------+-----------+--------------------+
> |                  id|authorId|               title|releaseDate|                link|
> +--------------------+--------+--------------------+-----------+--------------------+
> |                   1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
> |                   2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
> |                   3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
> |                   4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
> |                   5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
> |                   6|       2|Development Tools...|       null|                null|
> |An independent st...|12/28/16|http://amzn.to/2v...|       null|                null|
> +--------------------+--------+--------------------+-----------+--------------------+
> only showing top 7 rows
> Dataframe's schema:
> root
>  |-- id: string (nullable = true)
>  |-- authorId: string (nullable = true)
>  |-- title: string (nullable = true)
>  |-- releaseDate: string (nullable = true)
>  |-- link: string (nullable = true)
> {noformat}
> The *multiline* option is *not recognized*. And, of course, the schema is
> wrong.
> h1. Using Apache Spark v2.2.3
> Excerpt of the dataframe content:
> {noformat}
> +---+--------+--------------------+-----------+--------------------+
> | id|authorId|               title|releaseDate|               link
> |
> +---+--------+--------------------+-----------+--------------------+
> |  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
> |  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
> |  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
> |  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
> |  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
> |  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
> |  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
> +---+--------+--------------------+-----------+--------------------+
> only showing top 7 rows
> Dataframe's schema:
> root
>  |-- id: integer (nullable = true)
>  |-- authorId: integer (nullable = true)
>  |-- title: string (nullable = true)
>  |-- releaseDate: string (nullable = true)
>  |-- link
> : string (nullable = true)
> {noformat}
> The *link* column *has a carriage return* at the end of its name. If I use:
> {code:java}
> df.show(7, 90);
> {code}
> I get:
> {noformat}
> +---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
> | id|authorId|                                                                                     title|releaseDate|                  link
> |
> +---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
> |  1|       1|                          Fantastic Beasts and Where to Find Them: The Original Screenplay|   11/18/16|http://amzn.to/2kup94P
> |
> |  2|       1|     Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)|    10/6/15|http://amzn.to/2l2lSwP
> |
> |  3|       1|                             The Tales of Beedle the Bard, Standard Edition (Harry Potter)|    12/4/08|http://amzn.to/2kYezqr
> |
> |  4|       1|   Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)|    10/4/16|http://amzn.to/2kYhL5n
> |
> |  5|       2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; a...|    4/23/17|http://amzn.to/2i3mthT
> |
> |  6|       2|Development Tools in 2006: any Room for a 4GL-style Language?
> An independent study by...|   12/28/16|http://amzn.to/2vBxOe1
> |
> |  7|       3|                                                            Adventures of Huckleberry Finn|    5/26/94|http://amzn.to/2wOeOav
> |
> +---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
> {noformat}
> The *carriage return is added to the last cell* of each row.
> Same behavior in v2.3.3 and v2.4.0.
> If I add the schema, like in:
> {code:java}
> // Creates the schema
> StructType schema = DataTypes.createStructType(new StructField[] {
>     DataTypes.createStructField(
>         "id",
>         DataTypes.IntegerType,
>         false),
>     DataTypes.createStructField(
>         "authordId",
>         DataTypes.IntegerType,
>         true),
>     DataTypes.createStructField(
>         "bookTitle",
>         DataTypes.StringType,
>         false),
>     DataTypes.createStructField(
>         "releaseDate",
>         DataTypes.DateType,
>         true), // nullable, but this will be ignored
>     DataTypes.createStructField(
>         "url",
>         DataTypes.StringType,
>         false) });
> // Reads a CSV file with header, called books.csv, stores it in a dataframe
> Dataset<Row> df = spark.read().format("csv")
>     .option("header", "true")
>     .option("multiline", true)
>     .option("sep", ";")
>     .option("dateFormat", "M/d/y")
>     .option("quote", "*")
>     .schema(schema)
>     .load("data/books.csv");
> {code}
> The output matches what is expected in every version *except version 2.1.3,
> where Spark simply crashes*.
> All the code can be downloaded from GitHub at:
> [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch07.]
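
To illustrate the trailing carriage return described in the report, here is a minimal plain-Java sketch of how a line terminator can end up in the last column name. It assumes that books.csv uses CRLF line endings and that the reader splits records on '\n' only; neither is confirmed in this thread. The class and method names (CarriageReturnDemo, splitHeader, stripLineTerminators) are hypothetical, and this is not Spark's parsing code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CarriageReturnDemo {

    /** Splits a header line on the separator, leaving any line terminator in place. */
    static List<String> splitHeader(String headerLine, String sep) {
        List<String> names = new ArrayList<>();
        // The -1 limit keeps trailing empty fields instead of dropping them.
        for (String name : headerLine.split(Pattern.quote(sep), -1)) {
            names.add(name);
        }
        return names;
    }

    /** Removes trailing '\r' and '\n' characters from a value. */
    static String stripLineTerminators(String value) {
        int end = value.length();
        while (end > 0
                && (value.charAt(end - 1) == '\r' || value.charAt(end - 1) == '\n')) {
            end--;
        }
        return value.substring(0, end);
    }

    public static void main(String[] args) {
        // Assumption: the file content is CRLF-terminated.
        String content = "id;authorId;title;releaseDate;link\r\n"
                + "7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav\r\n";

        // Splitting on '\n' only leaves the '\r' attached to the last field.
        String firstLine = content.split("\n")[0];
        List<String> names = splitHeader(firstLine, ";");
        String last = names.get(names.size() - 1);

        System.out.println("raw length:     " + last.length());
        System.out.println("cleaned length: " + stripLineTerminators(last).length());
    }
}
```

Under those assumptions the last header name is "link" followed by a carriage return, which is consistent with the broken header row and the split schema line shown for v2.2.3. Whether this is the actual cause is for the Spark maintainers to confirm.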