Hey Ravion, yes, you can specify a column other than the primary key. Be aware, though, that if the values are not spread evenly across the key range (for example, in your code, if there is a "gap" in primary keys and no row has an id between 0 and 17220), some of the executors may not assist in loading data, because "SELECT * FROM orders WHERE order_id BETWEEN 0 AND 17220" will return an empty result. I think you may want to repartition afterwards to ensure the df is evenly distributed (<--- could somebody confirm my last sentence? I don't want to mislead and I am not sure).
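To see why a gap leads to idle executors, here is a rough sketch (in plain Python, so you can check it without a cluster) of how Spark turns partitionColumn/lowerBound/upperBound/numPartitions into per-partition WHERE clauses. This is a simplification of Spark's JDBCRelation.columnPartition logic, not the exact implementation; the function name is mine:

```python
# Simplified sketch of how Spark's JDBC source splits a read into
# numPartitions WHERE clauses (modeled on JDBCRelation.columnPartition;
# jdbc_predicates is a made-up name for illustration).
def jdbc_predicates(column, lower, upper, num_partitions):
    # Stride between partition boundaries, using integer division
    # the way Spark computes it.
    stride = upper // num_partitions - lower // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        # First partition has no lower bound (and also picks up NULLs);
        # last partition has no upper bound.
        lb = None if i == 0 else f"{column} >= {current}"
        current += stride
        ub = None if i == num_partitions - 1 else f"{column} < {current}"
        if lb and ub:
            predicates.append(f"{lb} AND {ub}")
        elif lb:
            predicates.append(lb)
        else:
            predicates.append(f"{ub} OR {column} IS NULL")
    return predicates

# With the options from the original question:
preds = jdbc_predicates("order_id", 1, 68883, 4)
for p in preds:
    print(p)
# order_id < 17221 OR order_id IS NULL
# order_id >= 17221 AND order_id < 34441
# order_id >= 34441 AND order_id < 51661
# order_id >= 51661
```

If no rows fall in one of those ranges, the task for that partition simply reads nothing, which is exactly the skew I described above.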
The first question - could you just check and provide us the answer? :)

Cheers,
Tomasz

2017-12-03 7:39 GMT+01:00 ☼ R Nair (रविशंकर नायर) <ravishankar.n...@gmail.com>:

> Hi all,
>
> I am using a query to fetch data from MySQL as follows:
>
> var df = spark.read.
>   format("jdbc").
>   option("url", "jdbc:mysql://10.0.0.192:3306/retail_db").
>   option("driver", "com.mysql.jdbc.Driver").
>   option("user", "retail_dba").
>   option("password", "cloudera").
>   option("dbtable", "orders").
>   option("partitionColumn", "order_id").
>   option("lowerBound", "1").
>   option("upperBound", "68883").
>   option("numPartitions", "4").
>   load()
>
> Question is, can I use a pseudo column (like ROWNUM in Oracle or
> RRN(employeeno) in DB2) in the option where I specify the "partitionColumn"?
> If not, can we specify a partition column which is not a primary key?
>
> Best,
> Ravion