Thanks Michael. My bad regarding Hive table primary keys. I have one big 140GB HDFS file with an external Hive table defined on it. The table is not partitioned. When I read the external Hive table using sqlContext.sql, how does Spark decide the number of partitions to create for that data frame?
The Spark UI tells me that 1000+ tasks are created to read the table mentioned above; I guess one task per HDFS block. Does that mean 1000+ partitions are created for the DF? Is there a way to (hash-)partition the data frame on specific key column[s] when I read/load the Hive table in Spark?

Thanks,
Vijay

> On Aug 20, 2015, at 3:05 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
> There is no such thing as primary keys in the Hive metastore, but Spark SQL
> does support partitioned Hive tables:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
>
> DataFrameWriter also has a partitionBy method.
>
> On Thu, Aug 20, 2015 at 7:29 AM, VIJAYAKUMAR JAWAHARLAL <sparkh...@data2o.io> wrote:
> Hi
>
> I have a question regarding data frame partitioning. I read a Hive table from
> Spark, and the following Spark API converts it to a DF:
>
> test_df = sqlContext.sql("select * from hivetable1")
>
> How does Spark decide the partitioning of test_df? Is there a way to partition
> test_df based on some column while reading the Hive table? Second question: if
> that Hive table has a primary key declared, does Spark honor the PK and
> partition based on it?
>
> Thanks
> Vijay
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
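[To make the two questions at the top of this message concrete, here is a minimal sketch in plain Python, not Spark code. The 128MB block size and the "cust_*" keys are assumptions for illustration; your cluster's dfs.blocksize may differ. It shows (a) why a 140GB unpartitioned file plausibly yields 1000+ read tasks if Spark schedules roughly one task per HDFS block, and (b) the idea behind hash partitioning: each row is assigned to hash(key) mod numPartitions, so identical keys land in the same partition.]

```python
# (a) Rough partition count for a 140 GB file read block-by-block.
FILE_SIZE_GB = 140
BLOCK_SIZE_MB = 128          # common HDFS default; an assumption here

num_blocks = (FILE_SIZE_GB * 1024) // BLOCK_SIZE_MB
print(num_blocks)            # ~1120 blocks -> roughly one read task each

# (b) Hash partitioning on a key column: row goes to hash(key) % n.
def assign_partition(key, num_partitions):
    return hash(key) % num_partitions

keys = ["cust_1", "cust_2", "cust_1", "cust_3"]   # hypothetical key values
parts = [assign_partition(k, 8) for k in keys]

# Identical keys always map to the same partition.
assert parts[0] == parts[2]
```

[In Spark itself, after the read you can reshuffle with DataFrame.repartition; depending on your Spark version it may accept only a partition count, while later versions also accept column expressions for hash partitioning on those columns.]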