OK, let me clarify a couple of things and propose a simple solution that resolves this issue altogether.

1) Escaping: transformencode, transformdecode, and transformapply do not remove quotes, in order to provide easy-to-understand semantics. If users want to match strings with different escaping policies to the same entry, it is the user's responsibility to handle the unquoting. The nice side effect is that transformencode/transformapply and transformdecode are truly inverse operations, at least for reversible transformations such as recoding and dummy coding.
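To make (1) concrete, here is a toy Java sketch (a hypothetical recode map, not SystemML's actual implementation) of why a quoted and an unquoted variant of the same logical value end up as distinct entries unless the user unquotes first:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class RecodeNoUnquote {
        public static void main(String[] args) {
            // Toy recode map built the way transformencode conceptually
            // builds one: tokens are taken as-is, so "\"ABC\"" and "ABC"
            // are distinct tokens and receive distinct codes.
            Map<String, Integer> recodeMap = new LinkedHashMap<>();
            String[] tokens = { "\"ABC\"", "ABC", "\"ABC\"" };
            for (String token : tokens)
                recodeMap.putIfAbsent(token, recodeMap.size() + 1);
            System.out.println(recodeMap); // two distinct entries
            // To match both variants to one entry, unquote before encoding.
        }
    }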

2) Metadata frames: The schema of metadata frames is one string column per original column, where each transformation type has its own serialization format. For example, for recoding, we serialize the distinct entries as <token><delim><code> (one entry per row). The reason why we use quote-aware splitting when parsing this metadata is a best-effort attempt to handle cases where <delim> occurs inside a quoted token. Simply splitting on <delim> (as done in the "fix" of PR 274) would fail in this situation.
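For illustration, a small self-contained Java sketch of the difference, using ',' as <delim> purely for readability (this is not SystemML's actual parsing code):

    import java.util.Arrays;

    public class QuoteAwareSplit {
        // Split a <token><delim><code> entry at the last delimiter that
        // is not inside a quoted region (sketch only).
        static String[] splitQuoteAware(String entry, char delim) {
            int pos = -1;
            boolean inQuotes = false;
            for (int i = 0; i < entry.length(); i++) {
                char c = entry.charAt(i);
                if (c == '"')
                    inQuotes = !inQuotes;
                else if (c == delim && !inQuotes)
                    pos = i; // remember last unquoted delimiter
            }
            return new String[] { entry.substring(0, pos),
                                  entry.substring(pos + 1) };
        }

        public static void main(String[] args) {
            // Quoted token that itself contains the delimiter:
            String entry = "\"New York, NY\",7";
            // Naive split yields 3 parts, breaking inside the token:
            System.out.println(Arrays.toString(entry.split(",")));
            // Quote-aware split yields 2 parts: still-quoted token, code:
            System.out.println(Arrays.toString(splitQuoteAware(entry, ',')));
        }
    }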

3) Solution: We could, however, simply flip the serialization format to <code><delim><token>, which allows splitting on the first occurrence of <delim> because <code> is guaranteed not to contain <delim>. Note that this would lose binary backward compatibility with existing metadata frames, though.
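A sketch of the parsing under the flipped format (again ',' as <delim> for illustration only):

    public class FlippedFormatSplit {
        // Parse a <code><delim><token> entry by splitting at the first
        // delimiter; safe because <code> never contains <delim>.
        static String[] splitFlipped(String entry, char delim) {
            int pos = entry.indexOf(delim);
            return new String[] {
                entry.substring(0, pos),     // code
                entry.substring(pos + 1) };  // token, taken as-is
        }

        public static void main(String[] args) {
            String[] parts = splitFlipped("7,\"New York, NY\"", ',');
            System.out.println(parts[0]); // 7
            System.out.println(parts[1]); // "New York, NY"
        }
    }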

Regards,
Matthias


On 10/22/2016 11:14 AM, Berthold Reinwald wrote:
Reading SystemML frames from CSV files, and splitting strings while
honoring quotes, separators, and escaping rules, follows the RFC 4180
specification (https://tools.ietf.org/html/rfc4180#page-2). Populating
SystemML frames from CSV files is one way, but we can also bind and
pass Spark DataFrames with string columns to SystemML frames. Today,
we take the Spark DataFrame strings *as is*, without checking whether
these string values contain, e.g., quotes or separator symbols, and
whether they are escaped accordingly. Our transform capabilities can
deal with this situation, but I am a little uneasy about the fact
that, depending on where the data strings in our frames come from,
they comply with different rules: in the case of CSV files, the fields
comply with RFC 4180, while in the case of Spark DataFrames, the
strings can be arbitrary Java/Scala strings.
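For illustration, a minimal RFC 4180-style quoting helper (a sketch,
not SystemML code) shows what "escaped accordingly" would mean for an
arbitrary Java/Scala string:

    public class Rfc4180Quote {
        // Quote a raw string per RFC 4180: wrap fields containing the
        // separator, a quote, or a newline in quotes, and double any
        // embedded quotes. A sketch for illustration only.
        static String quote(String field, char sep) {
            boolean needsQuoting = field.indexOf(sep) >= 0
                || field.indexOf('"') >= 0 || field.indexOf('\n') >= 0;
            return needsQuoting
                ? '"' + field.replace("\"", "\"\"") + '"'
                : field;
        }

        public static void main(String[] args) {
            String raw = "say \"hi\", bye";     // arbitrary DataFrame string
            System.out.println(raw);             // ambiguous as a CSV field
            System.out.println(quote(raw, ',')); // "say ""hi"", bye"
        }
    }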

This may or may not be an issue, but I wanted to collect some thoughts
on this topic. Things to consider are:

- reading and writing a CSV file, with and without
  transformencode/transformdecode ... should the output be identical
  to the input file?

- through MLContext we receive a Spark DataFrame with strings,
  SystemML writes out a CSV file, and a subsequent DML script wants
  to read that CSV file. Would you expect the CSV file to be readable
  by SystemML? Keep in mind that the original Scala/Java strings may
  not be properly escaped.

Thoughts?

Regards,
Berthold Reinwald
IBM Almaden Research Center
office: (408) 927 2208; T/L: 457 2208
e-mail: [email protected]

