Eyal,

The Parquet Pig loader is fine if all the data is present in the files themselves, but if I've written out from Spark using `df.write.partitionBy('colA', 'colB').parquet('s3://path/to/output')`, the values of those two columns are encoded into the output path and removed from the data files, e.g. s3://path/to/output/colA=valA/colB=valB/part-0001.parquet. There are hacky workarounds, such as duplicating the columns in Spark before writing; that fixes loading into Pig, but it means the duplicated columns re-appear when you read the data back into Spark.
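For concreteness, the duplication workaround looks roughly like this (the `_copy` names are placeholders of mine, and `df` is the DataFrame from the write above):

    # Duplicate each partition column so its values also survive inside the
    # Parquet files that Pig will read.
    df2 = (df
           .withColumn('colA_copy', df['colA'])
           .withColumn('colB_copy', df['colB']))

    # Partition by the originals as before; the copies stay in the data files.
    df2.write.partitionBy('colA', 'colB').parquet('s3://path/to/output')

    # Downside: when Spark reads this back, colA/colB are reconstructed from
    # the directory names, so the *_copy columns become redundant duplicates.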
Best,
Michael

On 8/30/18, 10:15 AM, "Adam Szita" <sz...@cloudera.com.INVALID> wrote:

Hi Eyal,

For just loading Parquet files the Parquet Pig loader is okay, although I
don't think it lets you use partition values in the dataset later. I know
the plain old PigStorage has a trick with the -tagFile option, but I'm not
sure that would be enough in Michael's case, nor whether it's something the
Parquet loader supports.

Thanks

On Thu, 30 Aug 2018 at 16:10, Eyal Allweil <eyal_allw...@yahoo.com.invalid>
wrote:

> Hi Michael,
>
> You can also use the Parquet Pig loader (especially if you're not working
> with Hive). Here's a link to the Maven repository for it:
>
> https://mvnrepository.com/artifact/org.apache.parquet/parquet-pig/1.10.0
>
> Regards,
> Eyal
>
>
> On Tuesday, August 28, 2018, 2:40:36 PM GMT+3, Adam Szita
> <sz...@cloudera.com.INVALID> wrote:
>
> Hi Michael,
>
> Yes, you can use HCatLoader to do this.
> The requirement is that you have a Hive table defined on top of your data
> (probably pointing to s3://path/to/files), with the Hive MetaStore holding
> all the relevant meta/schema information.
> If you do not have a Hive table yet, you can define one in Hive by
> manually specifying the schema, and after that partitions can be added
> automatically via Hive's 'msck repair' command.
>
> Hope this helps,
> Adam
>
>
> On Mon, 27 Aug 2018 at 19:18, Michael Doo <michael....@verve.com> wrote:
>
> > Hello,
> >
> > I'm trying to read Parquet data into Pig that is partitioned (so it's
> > stored in S3 like
> > s3://path/to/files/some_flag=true/part-00095-a2a6230b-9750-48e4-9cd0-b553ffc220de.c000.gz.parquet).
> > I'd like to load it into Pig and add the partitions as columns. I've
> > read some resources suggesting HCatLoader, but so far haven't had
> > success.
> >
> > Any advice would be welcome.
> >
> > ~ Michael
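A minimal Pig sketch of the ParquetLoader route Eyal suggests above (the alias names are invented; the parquet-pig-bundle jar is a companion artifact that packages the loader with its dependencies, version matching Eyal's Maven link):

    -- Make the loader available on Pig's classpath.
    REGISTER parquet-pig-bundle-1.10.0.jar;

    -- Load the Parquet files directly; the schema is read from the files.
    raw = LOAD 's3://path/to/files' USING org.apache.parquet.pig.ParquetLoader();
    DESCRIBE raw;

Caveat, per Adam's note above: the partition directories (e.g. some_flag=true) are not surfaced as columns this way.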
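And a sketch of Adam's HCatLoader route; the table name and the non-partition schema here are invented for illustration, only some_flag comes from the paths above:

    -- Step 1, in Hive: define an external table over the existing S3 layout,
    -- then let 'msck repair' discover the some_flag=... directories:
    --
    --   CREATE EXTERNAL TABLE my_db.my_table (id STRING, value DOUBLE)
    --     PARTITIONED BY (some_flag STRING)
    --     STORED AS PARQUET
    --     LOCATION 's3://path/to/files';
    --   MSCK REPAIR TABLE my_db.my_table;

    -- Step 2, in Pig (started with `pig -useHCatalog`): the partition column
    -- some_flag now shows up as an ordinary field.
    events = LOAD 'my_db.my_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
    flagged = FILTER events BY some_flag == 'true';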