Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Spencer Uresk
Ok, that helped a lot - and I understand the feature/change better now. Thank you!

On Fri, Mar 25, 2016 at 4:32 PM, Michael Armbrust wrote:
> Oh, I'm sorry I didn't fully understand what you were trying to do. If
> you don't need partitioning, you can set
>

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Oh, I'm sorry I didn't fully understand what you were trying to do. If you don't need partitioning, you can set "spark.sql.sources.partitionDiscovery.enabled=false". Otherwise, I think you need to use the unioning approach.

On Fri, Mar 25, 2016 at 1:35 PM, Spencer Uresk
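The two options mentioned in this reply can be sketched roughly as follows, assuming the Spark 1.6 `SQLContext` API and JSON input as in the original question. The object name, the helper names, and the application directory names `app1` and `app2` are hypothetical and only there for illustration; the configuration key and the path layout are the ones quoted in the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object MultipleRootsSketch {
  // Hypothetical helper: builds one glob per application directory,
  // following the <base>/<app>/PAGEVIEW/<partition>/<file> layout from the thread.
  def appGlobs(base: String, apps: Seq[String]): Seq[String] =
    apps.map(app => s"$base/$app/PAGEVIEW/*/*")

  // Option 1: switch partition discovery off, then read everything with one glob.
  def readWithoutDiscovery(sqlContext: SQLContext, glob: String): DataFrame = {
    sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")
    sqlContext.read.json(glob)
  }

  // Option 2: read each root separately and union the results.
  // unionAll matches columns by position, so this assumes every root
  // yields the same schema; `globs` must be non-empty.
  def readAndUnion(sqlContext: SQLContext, globs: Seq[String]): DataFrame =
    globs.map(g => sqlContext.read.json(g)).reduce(_ unionAll _)

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("multiple-roots-sketch"))
    val sqlContext = new SQLContext(sc)

    val base = "hdfs://user/hdfs/analytics"
    val apps = Seq("app1", "app2") // hypothetical application directories

    val events = readAndUnion(sqlContext, appGlobs(base, apps))
    events.printSchema()

    sc.stop()
  }
}
```

Option 1 is the simpler change when none of the directory names need to become columns. Option 2 keeps partition discovery on for each root, at the cost of enumerating the roots up front.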

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Spencer Uresk
Thanks for the suggestion - I didn't try it at first because it seems like I have multiple roots and not necessarily partitioned data. Is this the correct way to do that?

sqlContext.read.option("basePath", "hdfs://user/hdfs/analytics/").json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")

If so, it

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Have you tried setting a base path for partition discovery?

> Starting from Spark 1.6.0, partition discovery only finds partitions under
> the given paths by default. For the above example, if users pass
> path/to/table/gender=male to either SQLContext.read.parquet or
> SQLContext.read.load, gender
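As a rough illustration of the behaviour quoted from the Spark SQL guide, the sketch below reads the same partition directory twice, once without and once with the `basePath` option. It assumes Spark 1.6 and that a partitioned Parquet table actually exists at `path/to/table`; the object name is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BasePathSketch {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("base-path-sketch"))
    val sqlContext = new SQLContext(sc)

    // Discovery starts at the given path, so the gender=male directory
    // itself is treated as the root of the data.
    val withoutBase = sqlContext.read.parquet("path/to/table/gender=male")

    // With basePath set, discovery starts at path/to/table, so the
    // gender=male directory name can be interpreted as a partition.
    val withBase = sqlContext.read
      .option("basePath", "path/to/table")
      .parquet("path/to/table/gender=male")

    // Compare the two schemas to see whether a gender column appears.
    withoutBase.printSchema()
    withBase.printSchema()

    sc.stop()
  }
}
```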

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Ted Yu
This is the original subject of the JIRA:
Partition discovery fail if there is a _SUCCESS file in the table's root dir

If I remember correctly, there were discussions on how (traditional) partition discovery slowed down Spark jobs.

Cheers

On Fri, Mar 25, 2016 at 10:15 AM, suresk

SparkSQL and multiple roots in 1.6

2016-03-25 Thread suresk
In previous versions of Spark, this would work:

val events = sqlContext.jsonFile("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")

Where the first wildcard corresponds to an application directory, the second to a partition directory, and the third matched all the files in the partition directory. The