Thanks for the suggestion - I didn't try it at first because it seems like I have multiple roots and not necessarily partitioned data. Is this the correct way to do that?
sqlContext.read.option("basePath", "hdfs://user/hdfs/analytics/").json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")

If so, it returns the same error:

java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
hdfs://user/hdfs/analytics/app1/PAGEVIEW
hdfs://user/hdfs/analytics/app2/PAGEVIEW

On Fri, Mar 25, 2016 at 2:00 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Have you tried setting a base path for partition discovery?
>
>> Starting from Spark 1.6.0, partition discovery only finds partitions
>> under the given paths by default. For the above example, if users pass
>> path/to/table/gender=male to either SQLContext.read.parquet or
>> SQLContext.read.load, gender will not be considered as a partitioning
>> column. If users need to specify the base path that partition discovery
>> should start with, they can set basePath in the data source options. For
>> example, when path/to/table/gender=male is the path of the data and
>> users set basePath to path/to/table/, gender will be a partitioning
>> column.
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>
> On Fri, Mar 25, 2016 at 10:34 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> This is the original subject of the JIRA:
>> Partition discovery fail if there is a _SUCCESS file in the table's
>> root dir
>>
>> If I remember correctly, there were discussions on how (traditional)
>> partition discovery slowed down Spark jobs.
>>
>> Cheers
>>
>> On Fri, Mar 25, 2016 at 10:15 AM, suresk <sur...@gmail.com> wrote:
>>
>>> In previous versions of Spark, this would work:
>>>
>>> val events = sqlContext.jsonFile("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")
>>>
>>> where the first wildcard corresponds to an application directory, the
>>> second to a partition directory, and the third matches all the files
>>> in the partition directory. The records all have exactly the same
>>> format; they are just broken out by application first, then event
>>> type. This functionality was really useful.
>>>
>>> In 1.6, the same call results in the following error:
>>>
>>> Conflicting directory structures detected. Suspicious paths:
>>> (list of paths)
>>>
>>> It then recommends reading in each root directory separately and
>>> unioning the results together. It looks like the change happened here:
>>>
>>> https://github.com/apache/spark/pull/9651
>>>
>>> 1) Simply out of curiosity, since I'm still fairly new to Spark - what
>>> is the benefit of no longer allowing multiple roots?
>>>
>>> 2) Is there a better way to do what I'm trying to do? Discovering all
>>> of the paths (I won't know them ahead of time), creating tables for
>>> each of them, and then doing all of the unions seems inefficient and a
>>> lot of extra work compared to what I had before.
>>>
>>> Thanks.
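
[Editor's note: for anyone landing on this thread later, below is a rough sketch of the read-each-root-and-union workaround that the error message suggests, written against Spark 1.6-era APIs. It assumes the application directories sit directly under hdfs://user/hdfs/analytics/, that all PAGEVIEW records share one schema, and that sc and sqlContext are already in scope; the Hadoop FileSystem listing is just one way to discover paths that are not known ahead of time. Untested against this cluster.]

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Discover the per-application directories at runtime, since they
    // are not known ahead of time.
    val root = new Path("hdfs://user/hdfs/analytics/")
    val fs = FileSystem.get(root.toUri, sc.hadoopConfiguration)
    val appDirs = fs.listStatus(root).filter(_.isDirectory).map(_.getPath)

    // Read each app's PAGEVIEW tree as its own DataFrame, then union
    // them. (unionAll is the Spark 1.6 name; it was renamed to union
    // in Spark 2.0.)
    val events = appDirs
      .map(dir => sqlContext.read.json(s"$dir/PAGEVIEW/*/*"))
      .reduce(_ unionAll _)

Each per-application read now has a single root, so partition discovery no longer sees conflicting directory structures; the cost is one union per application in the resulting query plan.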