Reading SystemML frames from CSV files, i.e., splitting strings while honoring quotes, separators, and escaping rules, follows the RFC 4180 specification (https://tools.ietf.org/html/rfc4180#page-2). Populating SystemML frames from CSV files is one path, but we can also bind and pass Spark DataFrames with string columns to SystemML frames. Today, we take the Spark DataFrame strings *as is*, without any checking whether these string values contain, e.g., quotes or separator symbols, and whether they are escaped accordingly. Our transform capabilities can deal with this situation, but I am a little uneasy about the fact that, depending on where the data strings in our frames come from, they comply with different rules: in the case of CSV files, the fields comply with RFC 4180; in the case of Spark DataFrames, the strings are arbitrary Java/Scala strings.
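To make the mismatch concrete, here is a small sketch (illustration only; the class and method names are made up, and this is not SystemML code) of the RFC 4180 quoting and parsing rules. It shows that a raw Java/Scala string written out as-is can mis-split on read, while the properly quoted form round-trips:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only, not SystemML code: the RFC 4180 rules under discussion,
// and why a raw Java/Scala string written as-is may not round-trip.
public class Rfc4180Sketch {

    // RFC 4180 quoting: a field containing the separator, a double quote, or
    // a line break is enclosed in double quotes; embedded quotes are doubled.
    static String quoteField(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Minimal RFC 4180 line parser (single-line fields only, for brevity).
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"');  // doubled quote -> literal quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    cur.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) {
        // A string column value as it might arrive from a Spark DataFrame:
        String raw = "say \"hi\", please";

        // Written as-is next to a second field, an RFC 4180 reader
        // mis-splits the line into three fields instead of two:
        String naive = raw + ",second";
        System.out.println(parseLine(naive).size());   // 3

        // Quoted per RFC 4180, the line round-trips correctly:
        String escaped = quoteField(raw) + ",second";
        System.out.println(parseLine(escaped).size()); // 2
        System.out.println(parseLine(escaped).get(0)); // say "hi", please
    }
}
```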
This may or may not be an issue, but I wanted to collect some thoughts on this topic. Things to consider are:

- Reading and writing a CSV file, with and without transformencode/transformdecode: should the result be identical to the input file?
- Through MLContext we receive a Spark DataFrame with strings; in SystemML we write out a CSV file, and a subsequent DML script wants to read that CSV file. Would you expect the CSV file to be readable by SystemML? Keep in mind that the original Scala/Java strings may not be properly escaped.

Thoughts?

Regards,

Berthold Reinwald
IBM Almaden Research Center
office: (408) 927 2208; T/L: 457 2208
e-mail: [email protected]
