[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775218#comment-16775218 ]
Jean Georges Perrin commented on SPARK-26972:
---------------------------------------------

I added the code as attachments; Jira is breaking my formatting :(

> Issue with CSV import and inferSchema set to true
> -------------------------------------------------
>
>                 Key: SPARK-26972
>                 URL: https://issues.apache.org/jira/browse/SPARK-26972
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.3, 2.3.3, 2.4.0
>         Environment: Java 8/Scala 2.11/macOS
>            Reporter: Jean Georges Perrin
>            Priority: Major
>         Attachments: ComplexCsvToDataframeApp.java, ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
>
> Issue with CSV import and inferSchema set to true.
> I found a few discrepancies while working with inferSchema set to true in CSV ingestion.
> Given the following CSV:
> {{id;authorId;title;releaseDate;link}}
> {{1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P}}
> {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}}
> {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr}}
> {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}}
> {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}}
> {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? }}
> {{An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1}}
> {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}}
> {{8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD}}
> {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}}
> {{11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I}}
> {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}}
> {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}}
> {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}}
> {{15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn}}
> {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}}
> {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL}}
> {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}}
> {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}}
> {{20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W}}
> {{21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc}}
> {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}}
> {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}}
> And this code:
> {{Dataset<Row> df = spark.read().format("csv")}}
> {{    .option("header", "true")}}
> {{    .option("multiline", true)}}
> {{    .option("sep", ";")}}
> {{    .option("quote", "*")}}
> {{    .option("dateFormat", "M/d/y")}}
> {{    .option("inferSchema", true)}}
> {{    .load("data/books.csv");}}
> {{df.show(7);}}
> {{df.printSchema();}}
>
> h1. In Spark v2.0.1
> {{Excerpt of the dataframe content:}}
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{| id|authorId|               title|releaseDate|                link|}}
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|}}
> {{|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|}}
> {{|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|}}
> {{|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|}}
> {{|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|}}
> {{|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|}}
> {{|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|}}
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{only showing top 7 rows}}
> {{Dataframe's schema:}}
> {{root}}
> {{ |-- id: integer (nullable = true)}}
> {{ |-- authorId: integer (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link: string (nullable = true)}}
> *This is fine and the expected output*.
>
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content:
> {{+--------------------+--------+--------------------+-----------+--------------------+}}
> {{|                  id|authorId|               title|releaseDate|                link|}}
> {{+--------------------+--------+--------------------+-----------+--------------------+}}
> {{|                   1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|}}
> {{|                   2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|}}
> {{|                   3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|}}
> {{|                   4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|}}
> {{|                   5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|}}
> {{|                   6|       2|Development Tools...|       null|                null|}}
> {{|An independent st...|12/28/16|http://amzn.to/2v...|       null|                null|}}
> {{+--------------------+--------+--------------------+-----------+--------------------+}}
> {{only showing top 7 rows}}
> {{Dataframe's schema:}}
> {{root}}
> {{ |-- id: string (nullable = true)}}
> {{ |-- authorId: string (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link: string (nullable = true)}}
> The *multiline* option is *not recognized*. And, of course, the schema is wrong.
>
> h1. Using Apache Spark v2.2.3
> Excerpt of the dataframe content:
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{| id|authorId|               title|releaseDate|               link}}
> {{|}}
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|}}
> {{|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|}}
> {{|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|}}
> {{|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|}}
> {{|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|}}
> {{|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|}}
> {{|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|}}
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{only showing top 7 rows}}
> {{Dataframe's schema:}}
> {{root}}
> {{ |-- id: integer (nullable = true)}}
> {{ |-- authorId: integer (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link}}
> {{: string (nullable = true)}}
> The *link* column *has a carriage return* at the end of its name.
> If I run the same code and use:
> {{df.show(7, 90);}}
> I get:
> {{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}
> {{| id|authorId|                                                                                     title|releaseDate|                  link}}
> {{|}}
> {{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}
> {{|  1|       1|                          Fantastic Beasts and Where to Find Them: The Original Screenplay|   11/18/16|http://amzn.to/2kup94P}}
> {{|}}
> {{|  2|       1|     Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)|    10/6/15|http://amzn.to/2l2lSwP}}
> {{|}}
> {{|  3|       1|                             The Tales of Beedle the Bard, Standard Edition (Harry Potter)|    12/4/08|http://amzn.to/2kYezqr}}
> {{|}}
> {{|  4|       1|   Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)|    10/4/16|http://amzn.to/2kYhL5n}}
> {{|}}
> {{|  5|       2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; a...|    4/23/17|http://amzn.to/2i3mthT}}
> {{|}}
> {{|  6|       2|Development Tools in 2006: any Room for a 4GL-style Language? }}
> {{An independent study by...|   12/28/16|http://amzn.to/2vBxOe1}}
> {{|}}
> {{|  7|       3|                                                            Adventures of Huckleberry Finn|    5/26/94|http://amzn.to/2wOeOav}}
> {{|}}
> {{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}
> The *carriage return is added to the last cell*.
> Same behavior in v2.3.3 and v2.4.0.
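One possible explanation for the stray carriage return, offered as an assumption rather than a confirmed diagnosis: if books.csv uses Windows-style CRLF line endings and records are split on '\n' only, the '\r' stays attached to the last field of every record, header included. The plain-Java sketch below does not use Spark at all; the class name, method names, and sample content are hypothetical and only illustrate the effect and one way to strip it.

```java
import java.util.regex.Pattern;

class CarriageReturnSketch {

    // Splits one record on the separator only. A '\r' left over from a
    // CRLF line ending is not removed and stays in the last field.
    static String[] splitFields(String record, String sep) {
        return record.split(Pattern.quote(sep), -1);
    }

    // Removes a single trailing '\r' if there is one; otherwise returns
    // the input unchanged.
    static String stripTrailingCr(String s) {
        return s.endsWith("\r") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        // Hypothetical excerpt of the file, assuming CRLF line endings.
        String content = "id;authorId;title;releaseDate;link\r\n"
            + "7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav\r\n";

        // Splitting on '\n' alone leaves the '\r' on each record.
        for (String line : content.split("\n")) {
            String[] fields = splitFields(line, ";");
            String last = fields[fields.length - 1];
            System.out.println("last field ends with CR: " + last.endsWith("\r")
                + ", cleaned: " + stripTrailingCr(last));
        }
    }
}
```

If that assumption holds, normalizing the file's line endings before ingestion would be one thing to try; whether it changes the behavior reported here is untested.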
> If I add the schema, like in:
> {{StructType schema = DataTypes.createStructType(new StructField[] {}}
> {{    DataTypes.createStructField(}}
> {{        "id",}}
> {{        DataTypes.IntegerType,}}
> {{        false),}}
> {{    DataTypes.createStructField(}}
> {{        "authordId",}}
> {{        DataTypes.IntegerType,}}
> {{        true),}}
> {{    DataTypes.createStructField(}}
> {{        "bookTitle",}}
> {{        DataTypes.StringType,}}
> {{        false),}}
> {{    DataTypes.createStructField(}}
> {{        "releaseDate",}}
> {{        DataTypes.DateType,}}
> {{        true), // nullable, but this will be ignored}}
> {{    DataTypes.createStructField(}}
> {{        "url",}}
> {{        DataTypes.StringType,}}
> {{        false) });}}
> {{// Reads a CSV file with header, called books.csv, stores it in a dataframe}}
> {{Dataset<Row> df = spark.read().format("csv")}}
> {{    .option("header", "true")}}
> {{    .option("multiline", true)}}
> {{    .option("sep", ";")}}
> {{    .option("dateFormat", "M/d/y")}}
> {{    .option("quote", "*")}}
> {{    .schema(schema)}}
> {{    .load("data/books.csv");}}
> The output matches what is expected in every version *except version 2.1.3, where Spark simply crashes*.
> All the code can be downloaded from GitHub at:
> [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch07.]