Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-28 Thread Andrew Otto
> val sqlContext = new HiveContext(sc) > val schemaRdd = sqlContext.sql("some complex SQL") It mostly works, but I have been having issues with tables that contain a large amount of data: https://issues.apache.org/jira/browse/SPARK-6910 > On
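For context, a minimal sketch of the pattern quoted above, assuming a Spark 1.x deployment where hive-site.xml is on the classpath so the HiveContext can reach the existing metastore; the database and table names are hypothetical placeholders, not from the thread:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Assumes hive-site.xml is available so HiveContext can find the existing
// Hive metastore; some_db.some_table is a placeholder name.
val sc = new SparkContext(new SparkConf().setAppName("hive-metadata-example"))
val sqlContext = new HiveContext(sc)

// Any table already registered in the Hive metastore, with its data files in
// HDFS, becomes queryable through SparkSQL this way.
val schemaRdd = sqlContext.sql("SELECT * FROM some_db.some_table LIMIT 10")
schemaRdd.collect().foreach(println)
```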

Re: HiveContext fails when querying large external Parquet tables

2015-05-22 Thread Andrew Otto
et with many partitions but doing anything with it is very > slow... But I am surprised Spark 1.2 worked for you: it has this problem... > > Original message ---- > From: Andrew Otto > Date: 05/22/2015 3:51 PM (GMT-05:00) > To: user@spark.apache.org > Cc: Joseph A

HiveContext fails when querying large external Parquet tables

2015-05-22 Thread Andrew Otto
ead.java:107) ``` I've tested this both in local mode and in YARN client mode, and both have similar behaviors. What's worrisome is that the behavior is different after adding more data to the table, even though I am querying the same very small partition. The whole point of Hive partitions is to allow jobs to work with only the data that is needed. I'm not sure what Spark HiveContext is doing here, but it seems to couple the full size of a Hive table to the performance of a query that only needs a very small amount of data. I poked around the Spark source, and for a minute thought this might be related: https://github.com/apache/spark/commit/42389b17, but that was included in Spark 1.2.0, and this was working for us fine. Is HiveContext somehow trying to scan the whole table in the driver? Has anyone else had this problem? Thanks! -Andrew Otto
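Not part of the original thread, but a minimal sketch of the kind of partition-scoped query being described, assuming a Hive table partitioned by a "dt" column; the table and partition names here are hypothetical:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("partition-query-example"))
val hiveContext = new HiveContext(sc)

// The WHERE clause names a single Hive partition (a hypothetical "dt"
// partition column), so in principle only that partition's files in HDFS
// should need to be read, regardless of how large the rest of the table is.
val result = hiveContext.sql(
  "SELECT COUNT(*) FROM some_db.some_table WHERE dt = '2015-05-01'")
result.collect().foreach(println)
```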