This looks interesting, thanks Ruslan. But compaction with Hive is as
simple as an INSERT OVERWRITE statement, since Hive
supports CombineFileInputFormat. Is it possible to do the same with Spark?
On Thu, Nov 26, 2015 at 9:47 AM, Ruslan Dautkhanov
wrote:
> An interesting
You could use the number of input files to determine the number of output
partitions. This assumes your input file sizes are deterministic.
Otherwise, you could persist the RDD and then determine its size using the
APIs.
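
A rough sketch of the first suggestion, in PySpark. The helper names, the
128 MB target and the caller-supplied average file size are assumptions
for illustration, not something from this thread. It also assumes a
PySpark version where DataFrame.inputFiles() is available, and it writes
to a separate destination path rather than overwriting the source.

```python
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024  # assumed target size per output file


def output_partitions(num_input_files, avg_file_bytes,
                      target_bytes=DEFAULT_TARGET_BYTES):
    """Estimate how many output partitions to write.

    Relies on the input file sizes being roughly uniform, so that
    num_input_files * avg_file_bytes approximates the total input size.
    """
    total_bytes = num_input_files * avg_file_bytes
    # Ceiling division in integer arithmetic; always keep at least one partition.
    return max(1, -(-total_bytes // target_bytes))


def compact(spark, src_path, dst_path, avg_file_bytes):
    """Read many small Parquet files and rewrite them as fewer, larger ones.

    `spark` is an existing SparkSession supplied by the caller.
    """
    df = spark.read.parquet(src_path)
    # Number of files backing the DataFrame (best-effort snapshot).
    num_files = len(df.inputFiles())
    n = output_partitions(num_files, avg_file_bytes)
    # coalesce() merges partitions without a full shuffle.
    df.coalesce(n).write.mode("overwrite").parquet(dst_path)
    return n
```

For example, 1000 input files of about 1 MB each would be coalesced into 8
output partitions with the 128 MB target. Note that coalesce() only
controls the partition count at write time; the job may still schedule one
task per input split when reading.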
Regards
Sab
On 26-Nov-2015 11:13 pm, "Nezih Yigitbasi"
Hi Spark people,
I have a Hive table that has a lot of small Parquet files, and I am
creating a DataFrame out of it to do some processing. But since I have a
large number of splits/files, my job creates a lot of tasks, which I don't
want. Basically what I want is the same functionality that Hive
An interesting approach to compacting small files was discussed recently:
http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
AFAIK Spark supports views too.
--
Ruslan Dautkhanov
On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi <