Hi Charles,
Currently I use a function like:
public static string EscapeString(string input) {return input.Replace("'",
"''");}
On all user's inputs.
So later on, if I build an SQL like:
select x from y where name='{EscapeString(userInput)}'
they can't add other SQL parts.
I'm still concerned if there are other characters, encoding, escaping,
Unicode special character, buffer sizes, etc, which may enable SQL
injection by concatenating user inputs.
If such "safe" method would be provided by Drill it could be great help
(assuming there is more to it beside the apostrophe replacement).
What do you think?
Thanks,
Avner
On Mon, May 11, 2020 at 11:00 AM Charles Givre <[email protected]> wrote:
> Hi Avner,
> Can you explain more about what you'd need with the utility escaping
> function? That seems more practical than creating new storage plugins for
> each user.
> --C
>
> > On May 11, 2020, at 10:58 AM, Avner Levy <[email protected]> wrote:
> >
> > Thanks Paul,
> > It seems I wasn't clear enough in my previous email.
> > I have a server in between the end users and drill (exactly as you
> > suggested), my concern is SQL attacks on that server by bad parameters.
> > I've tried Athena and I think they too don't support prepared statements.
> > In addition, they were slower than Drill in my scenarios.
> > If Drill had a utility escaping function that clients could use on their
> > input and know they are safe it would help reducing the risk.
> > For now, I guess it leaves me with the option of creating a data source
> per
> > each client (hoping few thousands of those won't hurt performance) or
> > making sure I properly escape all users input.
> > Appreciate your help.
> > Thanks,
> > Avner
> >
> > On Sun, May 10, 2020 at 9:09 PM Paul Rogers <[email protected]>
> > wrote:
> >
> >> Hi Avner,
> >>
> >> Drill does not support prepared statements. Nor does Drill support
> >> statements with parameters. This is true with all interfaces. These
> would
> >> be great features; but they've never been implemented.
> >>
> >>
> >> Drill was designed to operate in a Hadoop-like environment with
> >> semi-trusted users. (Meaning that, if any user did something malicious,
> you
> >> could sue or fire them.) As noted, the file system enforced security.
> There
> >> was no notion of the public using Drill to access secure data, with
> Drill
> >> acting as the secure gateway. Again, code could be added, but it is not
> >> there today. FWIW, in its present state, I would not trust Drill on a
> >> public web site with sensitive data.
> >>
> >>
> >> Given how Drill acts today, I'd wager your best bet is to insert your
> own
> >> server between your user and Drill. Allow the user to specify queries in
> >> some simple, non-SQL way. Then, your server can build the SQL and
> forward
> >> it to Drill.
> >>
> >> Public Internet (Web Browser --> Secure Gateway) --> Private network
> (App
> >> server --> Drill)
> >>
> >>
> >> For example, if I want to know about "Orders", I can specify a date
> range.
> >> Your server fills in the storage plugin, table name, WHERE clause, etc.
> to
> >> create the SQL. Such an approach allows the Drill REST API is on a
> private
> >> IP address within your data center. Only your outward-facing user
> service
> >> is on a public IP. With this approach, there is no SQL injection risk
> >> because you do not directly use the web-provided info in a SQL
> statement.
> >> Of course, you have to build your SQL statement correctly, as you are
> >> doing. Don't just append web text to a SQL statement.
> >>
> >>
> >> I don't think this is unique to Drill. I'd be surprised if most people
> >> allow, say, public access to their HBase, Cassandra or MySQL DBs.
> >>
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >>
> >> On Sunday, May 10, 2020, 5:29:13 PM PDT, Avner Levy <
> >> [email protected]> wrote:
> >>
> >> Hi Charles, Paul,
> >> Thanks for your answers.
> >> I'm interested in a case, where there is a Rest service which
> >> authenticate the service's users, get a request with the user parameters
> >> and build from it the SQL sent to Drill.
> >> The customers are identified by some account ID and they send for
> example
> >> the name of an entity they are looking for as a parameter in the
> service's
> >> REST request.
> >> Then the service can build the select (just an example):
> >> SELECT x FROM S3.db.`data/[CUSTOMER_ID]/data.parquet` where
> >> name='CUSTOMER_USER_INPUT]'
> >>
> >> In such case, they can still send in CUSTOMER_USER_INPUT the following:
> "x'
> >> union SELECT x FROM S3.db.`data/[OTHER_CUSTOMER_ID]/data.parquet`".
> >>
> >> Usually such stuff are solved with prepared statements, but I believe
> this
> >> isn't supported over REST.
> >> I would prefer not having to authenticate my end users to Drill since
> this
> >> creates more work and complexity.
> >> Is there a way to have prepared statements in Drill?
> >> Is it supported in other protocols? (JDBC/ODBC)
> >> Limiting the query folder outside the SQL would do the job as well.
> >> Any feedback is appreciated,
> >> Thanks,
> >> Avner
> >>
> >>
> >> On Sun, May 10, 2020 at 5:57 PM Paul Rogers <[email protected]>
> >> wrote:
> >>
> >>> Hi Charles,
> >>>
> >>> One of the changes I was looking at was allowing multiple SQL
> statements
> >>> per REST request to get around the lack of session. The idea would be
> to
> >>> issue a number of ALTER SESSION, CTTAS, USE and similar statements
> >> followed
> >>> by a single query that returns data.
> >>>
> >>>
> >>> A better solution is to enable session support for the REST API. We
> >>> discussed the challenges involved due the disconnected nature of HTTP
> >>> requests.
> >>>
> >>> Another good improvement would be a SQL command way to create configs,
> >> not
> >>> just JSON editing. That way it would be easier to automate creation of
> a
> >>> config. Also, it would be handy to be able to externalize configs so
> they
> >>> can be stored in locations other than ZK (or local disk, in embedded
> >> mode.)
> >>> For this use case, a query for user "X" would work against the "s3-X"
> >>> config would could be retrieved from an external system that knows the
> >>> mapping from user X to the S3 files visible to X, and the security
> tokens
> >>> to use for that user.
> >>>
> >>> The question for now, however, is how to do this with the code that
> >> exists
> >>> in Drill 1.17. I'm hoping someone has worked out a solution.
> >>>
> >>>
> >>> Thanks,
> >>> - Paul
> >>>
> >>>
> >>>
> >>> On Sunday, May 10, 2020, 1:05:50 PM PDT, Charles Givre <
> >>> [email protected]> wrote:
> >>>
> >>> Hi Avner, Paul,
> >>> I was reading this and wondering:
> >>>
> >>> 1. Is it in fact true (I think it is) that Drill does not allow
> multiple
> >>> queries to be submitted in one REST request? I seem to remember
> running
> >>> into that issue when I was trying to do some of the Superset work.
> >>> 2. If a user is required to be authenticated to execute a query, would
> >>> that not prevent the possibility of a non-authenticated user executing
> >>> arbitrary queries against someone else's data?
> >>> 3. I would definitely create separate data sources for each tenant,
> but
> >> I
> >>> don't know that it is necessary (or helpful) to create one for each
> >> query.
> >>>
> >>> I'd agree with Paul, that Drill's access model needs improvement and
> that
> >>> would be a good addition to the project. We might be able to assist
> with
> >>> that if there's interest.
> >>> Best,
> >>> -- C
> >>>
> >>>
> >>>> On May 10, 2020, at 3:55 PM, Paul Rogers <[email protected]>
> >>> wrote:
> >>>>
> >>>> Hi Avner,
> >>>>
> >>>> Drill was designed for a system in which the user name maps to a
> >>> certificate on the underlying file system, and the file system provides
> >>> complete security. This model has not been extended to the cloud world.
> >>>>
> >>>> What you want is a way to authenticate your user, map the user to a
> >>> storage plugin config for only that client's files, then restrict that
> >> user
> >>> to only that config. Further, you'd want the config to obtain S3 keys
> >> from
> >>> a vault of some sort. If you have that, you'd not have to worry about
> SQL
> >>> injection since only an authorized user could muck with the SQL, and
> they
> >>> could only access their own data -- which they can presumably access
> >> anyway.
> >>>>
> >>>>
> >>>> At present, Drill has no out-of-the-box security model for this use
> >>> case; there is no mechanism to associate users with configs, or to
> >>> externalize S3 security keys. Such a system would be a worthwhile
> >> addition
> >>> to the project.
> >>>>
> >>>> I wonder, has anyone else found a workaround for this use case? Maybe
> >>> via Kerberos or some such?
> >>>>
> >>>>
> >>>> Thanks,
> >>>> - Paul
> >>>>
> >>>>
> >>>>
> >>>> On Sunday, May 10, 2020, 12:04:16 PM PDT, Avner Levy <
> >>> [email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>> I'm trying to use Apache Drill as a database for providing SQL over S3
> >>>> parquet files.
> >>>> Drill is used for serving multi-tenant data for multiple customers.
> >>>> Since I need to build the SQL string using the REST API I'm vulnerable
> >> to
> >>>> SQL injection attacks.
> >>>> I do test all user input and close it between apostrophes and
> >>>> escape apostrophe in the user input by doubling it but I'm still
> >>> concerned
> >>>> about optional SQL attacks.
> >>>> Will adding a different data source (which points to a different
> folder
> >>> on
> >>>> S3) per tenant is something that will have impact on performance? (I
> >>> might
> >>>> have thousands of those)
> >>>> Does it make sense to create the data source on the fly before query?
> >>>> Is there another way to limit the sent SQL to a specific folder?
> >>>> Thanks,
> >>>> Avner
> >>>
> >>
>
>