Ok, that helped a lot - and I understand the feature/change better now.
Thank you!
On Fri, Mar 25, 2016 at 4:32 PM, Michael Armbrust wrote:
Oh, I'm sorry I didn't fully understand what you were trying to do. If you
don't need partitioning, you can set
"spark.sql.sources.partitionDiscovery.enabled=false". Otherwise, I think
you need to use the unioning approach.
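To make the two options concrete, here is a minimal sketch in Spark 1.6-era Scala. It assumes an existing `sqlContext`; the application directory names in Option 2 are hypothetical placeholders, since the real list of roots isn't shown in this thread:

```scala
// Option 1: turn off partition discovery before reading, so the wildcard
// paths are treated as plain directories rather than partition columns.
sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")
val events = sqlContext.read.json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")

// Option 2: read each application root separately and union the results.
// The two roots below are illustrative; substitute the actual directories.
val roots = Seq(
  "hdfs://user/hdfs/analytics/app1/PAGEVIEW/*/*",
  "hdfs://user/hdfs/analytics/app2/PAGEVIEW/*/*"
)
val unioned = roots
  .map(path => sqlContext.read.json(path))
  .reduce(_ unionAll _) // DataFrame union was named unionAll in Spark 1.x
```

Option 2 keeps partition discovery enabled per root, at the cost of one read per application directory.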
On Fri, Mar 25, 2016 at 1:35 PM, Spencer Uresk wrote:
Thanks for the suggestion - I didn't try it at first because it seems like
I have multiple roots and not necessarily partitioned data. Is this the
correct way to do that?
sqlContext.read.option("basePath", "hdfs://user/hdfs/analytics/")
  .json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")
If so, it
Have you tried setting a base path for partition discovery?
> Starting from Spark 1.6.0, partition discovery only finds partitions under
> the given paths by default. For the above example, if users pass
> path/to/table/gender=male to either SQLContext.read.parquet or
> SQLContext.read.load, gender
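The basePath option suggested earlier in the thread addresses exactly this case. A sketch of the contrast, assuming a partitioned Parquet table rooted at path/to/table as in the quoted example:

```scala
// Spark 1.6+ only discovers partitions *under* the path you pass, so
// reading a single partition directory does not surface "gender" as a
// column in the resulting DataFrame.
val single = sqlContext.read.parquet("path/to/table/gender=male")

// With basePath pointing at the table root, gender=male is recognized as
// a partition directory and "gender" comes back as a column.
val withBase = sqlContext.read
  .option("basePath", "path/to/table")
  .parquet("path/to/table/gender=male")
```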
This is the original subject of the JIRA:
Partition discovery fail if there is a _SUCCESS file in the table's root dir
If I remember correctly, there were discussions on how (traditional)
partition discovery slowed down Spark jobs.
Cheers
On Fri, Mar 25, 2016 at 10:15 AM, suresk wrote:
In previous versions of Spark, this would work:
val events =
sqlContext.jsonFile("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")
Where the first wildcard corresponds to an application directory, the second
to a partition directory, and the third matches all the files in the
partition directory. The