I'm working with Spark 1.5.0, and I'm using the Scala API to construct DataFrames and perform operations on them. My application requires that I synthesize column names for intermediate results under some circumstances, and I don't know what the rules are for legal column names. In particular, I'm running into some interesting behavior involving the ability (or lack thereof) to resolve column references. Is there documentation anywhere that describes which column names are considered "safe"?
To see what I mean by "safe", consider the following examples. Let df be a DataFrame with schema [id: bigint]:

    scala> val df = ... // Details don't matter
    df: org.apache.spark.sql.DataFrame = [id: bigint]

    scala> df.select($"id".as("x")).select($"x")
    res32: org.apache.spark.sql.DataFrame = [x: bigint]

Great; that works just as I'd expect it to. Column resolution doesn't seem to be case-sensitive, though:

    scala> df.select($"id", $"id".as("ID")).select($"id")
    org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id#0L, id#163L.;
    ... and a big stack trace ...

Ok, so make sure we don't create DataFrames with two columns whose names differ only by case; fair enough. Certain characters in column names also cause problems:

    scala> df.select($"id".as("a.b"))
    res34: org.apache.spark.sql.DataFrame = [a.b: bigint]

Good; but can we use the column?

    scala> df.select($"id".as("a.b")).select($"a.b")
    org.apache.spark.sql.AnalysisException: cannot resolve 'a.b' given input columns a.b;
    ... and another big stack trace ...

Apparently not. Ok, I think I remember reading somewhere that Spark SQL limits column names to alphanumerics and underscores; does that apply here too?

    scala> df.select($"id".as("x%y")).select($"x%y")
    res35: org.apache.spark.sql.DataFrame = [x%y: bigint]

Apparently it doesn't; % is legal as well. (I've done a variety of other experiments, not repeated here, that suggest alphanumerics and underscores are safe. Oddly enough, so are internal spaces.)

Is there a specification for legal column names that won't cause resolution problems? I've looked through the Scala API docs for DataFrame, Column, and ColumnName without finding one.

Thanks,
Richard
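P.S. One partial workaround I'm aware of for the "a.b" case, in case it's relevant to the answer: backticks appear to act as a quoted-identifier escape in Spark SQL, so wrapping the whole name in backticks should keep the dot from being interpreted as struct-field access. Something along these lines (a sketch, not something I'm asserting is the sanctioned approach):

    scala> df.select($"id".as("a.b")).select($"`a.b`")

If that is indeed the intended mechanism, then perhaps the rule is simply "any name is legal, but anything beyond alphanumerics and underscores must be backtick-quoted when referenced" — confirmation either way would be appreciated.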