GitHub user rick-ibm opened a pull request: https://github.com/apache/spark/pull/9202
[SPARK-10857] [SQL] Block SQL injection vulnerabilities under DataFrameWriter.jdbc(). @marmbrus @rxin @joshrosen @srowen This patch attempts to address both https://issues.apache.org/jira/browse/SPARK-10857 and https://issues.apache.org/jira/browse/SPARK-10977. This patch is my original work which I license to the ASF. My ICLA is already on file as a Derby committer: http://people.apache.org/committer-index.html I have abandoned the approach discussed on those issues, viz., attempting to replace the user-supplied table name with a properly escaped and quoted chain of dot-separated delimited identifiers. I believe that JDBC does not supply enough information to attempt that substitution, particularly in the case of a Microsoft SQL Server running on a case-insensitive file system. An overview of the issues involved can be found here: https://github.com/ontop/ontop/wiki/Case-sensitivity-for-SQL-identifiers Instead, this patch simply verifies that the table name is a valid SQL identifier before inserting it into a CREATE/DROP/INSERT statement or into a statement which tests whether the table exists. That is, a SQL injection attempt will fail because the injected text does not parse as a valid SQL identifier. The validating code may not catch some bad names. This can happen for databases which don't allow embedded, escaped quotes inside quoted identifiers. Microsoft SQL Server and Oracle don't allow embedded quotes. For those databases, the bad table name will appear as an ungrammatical sequence of quoted identifiers. E.g.: "foo""bar" will appear as "foo" "bar" Instead of a "bad identifier" error, those databases will raise some other exception. I leave further polishing of this behavior to people who are experts on the affected databases. This patch makes the following changes: 1) Introduces a new Java class, SqlIdentifierUtil. The code was cribbed from Derby, where it has undergone almost 2 decades of commodity testing in production applications. I thought that it was safer to re-use this code rather than translate it into Scala and run the risk of introducing translation errors. However, I can try my hand at translating this class into Scala if reviewers think that that is a better approach. The SqlIdentifierUtil class has 2 entry points: a) quoteString() - This method escapes embedded quotes when converting a string to a quoted identifier. JdbcDialect.quoteIdentifier() did not account for this subtlety. But the patch makes JdbcDialect.quoteIdentifier() call SqlIdentifier.quoteString() now. b) parseMultiPartSQLIdentifier() - This method takes a dot separated sequence of SQL identifiers like databaseName.schemaName.tableName and returns an array of normalized identifiers like [ DATABASENAME, SCHEMANAME, TABLENAME ]. 2) Introduces some new JdbcDialect methods: a) vetSqlIdentifier() - This method takes a table name and raises an exception if the name isn't a legal SQL identifier as determined by SqlIdentifierUtil.parseMultiPartSQLIdentifier(). b) quoteChar() - This method returns the dialect-specific character which is used to frame quoted identifiers. This information is needed by SqlIdentifierUtil.parseMultiPartSQLIdentifier(). 3) Calls JdbcDialect.vetSqlIdentifier() everywhere that we plug a table name into a SQL statement. I have placed the calls to JdbcDialect.vetSqlIdentifier() as close to the statement text as possible in order to minimize the chance that a new code path will be introduced in the future which circumvents the SQL injection check. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rick-ibm/spark b_10857_sqlInjection Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9202.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9202 ---- commit 4e167a4e9ec8385063c4f46b6b75ae369d2e6204 Author: Rick Hillegas <rhil...@us.ibm.com> Date: 2015-10-19T20:00:32Z SPARK-10857: Vet table names which are inserted into SQL strings to verify that they are legal SQL identifiers in the underlying database. commit 4d9d8859294f83bff172e7e1146a2024ac3b61ec Author: Rick Hillegas <rhil...@us.ibm.com> Date: 2015-10-21T18:46:26Z SPARK-10857: Add and remove some comments. commit a5c4c698705c5ed7417b405e6a0518bd7b997336 Author: Rick Hillegas <rhil...@us.ibm.com> Date: 2015-10-21T18:46:56Z SPARK-10857: Add and remove some comments. commit b3b845de9960e003637d26fecaf8dccaa0206f59 Author: Rick Hillegas <rhil...@us.ibm.com> Date: 2015-10-21T18:50:56Z DERBY-10857: Fix a scalastyle problem. ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org