Thanks Michael. My bad regarding Hive table primary keys.

I have one big 140GB HDFS file with an external Hive table defined on it. The 
table is not partitioned. When I read the external Hive table using 
sqlContext.sql, how does Spark decide the number of partitions to create for 
that DataFrame?

The Spark UI tells me that 1000+ tasks are created to read the above-mentioned 
table; I guess one task per HDFS block. Does that mean 1000+ partitions are 
created for the DataFrame? Is there a way to hash-partition the data frame on 
specific key column(s) when I read/load the Hive table in Spark? The sketch 
below shows what I have in mind.
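For example (a minimal sketch; "cust_id" is a hypothetical key column, and 
repartitioning by column requires Spark 1.6+):

    # Read the external Hive table; the initial partition count follows the input splits
    test_df = sqlContext.sql("select * from hivetable1")

    # Check how many partitions the DataFrame actually got
    print(test_df.rdd.getNumPartitions())

    # Hash-partition into 200 partitions by a key column (Spark 1.6+)
    keyed_df = test_df.repartition(200, "cust_id")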

Thanks,
Vijay


> On Aug 20, 2015, at 3:05 PM, Michael Armbrust <mich...@databricks.com> wrote:
> 
> There is no such thing as primary keys in the Hive metastore, but Spark SQL 
> does support partitioned hive tables: 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
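> For example, a partitioned table can be declared like this (a minimal 
> sketch with hypothetical names, assuming sqlContext is a HiveContext):
> 
>     sqlContext.sql("CREATE TABLE events (id INT, payload STRING) PARTITIONED BY (dt STRING)")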
> 
> DataFrameWriter also has a partitionBy method.
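> For example (a minimal sketch; the column "dt" and the target table name 
> are hypothetical):
> 
>     test_df.write.partitionBy("dt").saveAsTable("hivetable1_by_dt")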
> 
> On Thu, Aug 20, 2015 at 7:29 AM, VIJAYAKUMAR JAWAHARLAL <sparkh...@data2o.io> wrote:
> Hi
> 
> I have a question regarding DataFrame partitioning. I read a Hive table from 
> Spark, and the following Spark API call converts it to a DataFrame.
> 
> test_df = sqlContext.sql("select * from hivetable1")
> 
> How does Spark decide the partitioning of test_df? Is there a way to 
> partition test_df based on some column while reading the Hive table? Second 
> question: if that Hive table has a primary key declared, does Spark honor 
> the PK and partition based on it?
> 
> Thanks
> Vijay
