On 4/9/15 3:09 AM, Michael Armbrust wrote:
Back to the user list so everyone can see the result of the discussion...
Ah. It all makes sense now. The issue is that when I created the
parquet files, I included an unnecessary directory name
(data.parquet) below the partition directories. It’s just a
leftover from when I started with Michael’s sample code and it
only made sense before I added the partition directories. I
probably thought it was some magic name that was required when
Spark scanned for Parquet files. The structure looks something
like this:
drwxr-xr-x - user supergroup 0 2015-04-02 13:17
hdfs://host/tablename/date=20150302/sym=A/data.parquet/...
If I just move all the files up a level (there goes a day of
work), the existing code should work fine. Whether it’s useful to
handle intermediate non-partition directories or whether that just
creates some extra risk I can’t say, since I’m new to all the
technology in this whole stack.
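Moving everything up a level out of the leftover subdirectory could be scripted. The sketch below works on a local copy of the tree (on HDFS the analogue would be `hdfs dfs -mv`); the function name and the `data.parquet` default are just taken from the layout described above, not from any Spark API:

```python
import shutil
from pathlib import Path

def flatten_extra_dir(table_root: str, extra_dir: str = "data.parquet") -> None:
    """Move the files inside a leftover subdirectory
    (e.g. .../sym=A/data.parquet/part-0) up into the partition
    directory itself, then remove the now-empty subdirectory."""
    # Materialize the list first so moves/removals don't disturb the walk.
    for sub in list(Path(table_root).rglob(extra_dir)):
        if not sub.is_dir():
            continue
        for item in sub.iterdir():
            shutil.move(str(item), str(sub.parent / item.name))
        sub.rmdir()
```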
I'm mixed here. There is always a tradeoff between "silently"
ignoring structure that people might not be aware of (and thus might
be a bug) and "just working". Having this as an option certainly
seems reasonable. I'd be curious if anyone has other thoughts?
Take the following directory name as an example:
/path/to/partition/a=1/random/b=foo
One possible approach would be to grab both "a=1" and "b=foo", then
either report "random" by throwing an exception or ignore it with a
WARN log.
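The proposed behavior could look roughly like the following. This is a standalone sketch of the idea, not Spark's actual partition-discovery code; it assumes the path passed in is already relative to the table root, and the `strict` flag stands in for the exception-vs-WARN choice discussed above:

```python
import logging
import re

log = logging.getLogger("partition-discovery")

# A partition segment is a single "key=value" path component.
_PART = re.compile(r"^([^=/]+)=([^/]*)$")

def parse_partition_path(path: str, strict: bool = False) -> dict:
    """Collect key=value partition segments from a path like
    'a=1/random/b=foo'; intermediate non-partition directories either
    raise (strict=True) or are skipped with a WARN log."""
    spec = {}
    for segment in path.strip("/").split("/"):
        m = _PART.match(segment)
        if m:
            spec[m.group(1)] = m.group(2)
        elif strict:
            raise ValueError(f"non-partition directory in path: {segment!r}")
        else:
            log.warning("ignoring non-partition directory %r", segment)
    return spec
```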
Unfortunately, it takes many minutes (even with mergeSchema=false)
to create the RDD. It appears that the whole data store will still
be recursively traversed (even with mergeSchema=false, a manually
specified schema, and a partition spec [which I can’t pass in
through a public API]) so that all of the metadata FileStatuses
can be cached. In my case I’m going to have years of data, so
there’s no way that will be feasible.
Should I just explicitly load the partitions I want instead of
using partition discovery? Is there any plan to have a less
aggressive version of support for partitions, where metadata is
only cached for partitions that are used in queries?
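The "less aggressive" behavior asked about here would amount to something like the sketch below: enumerate only the partition directories whose key=value spec satisfies the query's predicate, rather than recursively listing the whole table up front. All the names are hypothetical, and the two-level `date=.../sym=.../` layout is just the one from this thread:

```python
from pathlib import Path

def list_matching_partitions(table_root: str, predicate) -> list:
    """List leaf files only for partitions whose spec passes `predicate`.
    Assumes a clean two-level key=value layout (date=.../sym=.../files);
    on a real cluster each listing would be a namenode call we avoid."""
    files = []
    for date_dir in Path(table_root).iterdir():
        for sym_dir in date_dir.iterdir():
            # e.g. {"date": "20150302", "sym": "A"}
            spec = dict(seg.split("=", 1) for seg in (date_dir.name, sym_dir.name))
            if predicate(spec):
                files.extend(p for p in sym_dir.iterdir() if p.is_file())
    return files
```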
We improved the speed here in 1.3.1, so I'd be curious if that helps.
We definitely need to continue to speed things up here, though. We
have to enumerate all the partitions so we know what data to read when
a query comes in, but I do think we can parallelize it or something.
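The parallelization idea mentioned here could be sketched as listing the partition directories concurrently instead of one at a time. This is only an illustration with a thread pool on a local tree; in the real system each task would be a filesystem listStatus call, and none of these names come from Spark:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def list_partitions_parallel(table_root: str, max_workers: int = 16) -> list:
    """Enumerate leaf files of every two-level partition directory,
    issuing the per-directory listings in parallel."""
    partition_dirs = [d for d in Path(table_root).glob("*/*") if d.is_dir()]

    def list_leaf(d: Path) -> list:
        return [p for p in d.iterdir() if p.is_file()]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(list_leaf, partition_dirs)
    return [f for leaf in results for f in leaf]
```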