I have the following problem with the Spark Streaming API. I am currently
streaming input data from Kafka into Spark Streaming, where I plan to do
some preprocessing of the data. Then, I'd like to save the data as Parquet
files and query them with Impala.

However, Spark is writing the data files to *separate directories*, and a
new directory is generated for every RDD.
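
For reference, this is roughly what my write loop looks like. It is a
simplified sketch assuming the Spark SQL DataFrame/Parquet writer; the
Kafka settings, class names and paths below are placeholders, not my
actual ones:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

case class Event(key: String, value: String)

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-parquet")
    val ssc = new StreamingContext(conf, Seconds(10))

    // placeholder ZooKeeper quorum, consumer group and topic
    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "spark-group", Map("mytopic" -> 1))

    stream.foreachRDD { (rdd, time) =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      // a new output directory is created for every batch,
      // which is exactly what Impala cannot follow
      rdd.map { case (k, v) => Event(k, v) }
        .toDF()
        .write
        .parquet(s"hdfs:///data/events/batch-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}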

This is a problem for two reasons. First, *the external tables in Impala
cannot detect subdirectories* inside the directory they point to, only
files, unless the table is partitioned. Second, Spark adds new directories
so quickly that it would be very bad for performance to create a new
partition in Impala for every generated directory.

On the other hand, if I increase the roll interval of the writes in Spark
so that directories are generated less frequently, there will be an added
delay before Impala can read the incoming data. This is not acceptable,
since my system has to support real-time applications. In Hive, I could
configure the external tables to also detect subdirectories without
partitioning by using these settings:

set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;

But to my understanding, Impala does not have a feature like this.

   - Is there any way to make the external tables in Impala detect
   subdirectories?
   - If not, is there any way to make Spark write its output files into
   a single directory, or otherwise in a form that is instantly readable
   by Impala? (A sketch of what I have in mind is below.)
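
To illustrate the second question: would something along these lines work,
i.e. an append-mode write into one fixed directory? Again this is only a
sketch, assuming SaveMode.Append is supported by the Parquet writer in my
Spark version; it reuses the stream and Event class from the sketch above,
and the path is a placeholder. I assume I would then still need a periodic
REFRESH on the Impala table to pick up the new files.

import org.apache.spark.sql.{SQLContext, SaveMode}

stream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.map { case (k, v) => Event(k, v) }
    .toDF()
    .write
    .mode(SaveMode.Append)   // append new part files into the same directory
    .parquet("hdfs:///data/events")
}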



Regards,

Rafeeq S
*(“What you do is what matters, not what you think or say or plan.” )*
