Here is one of the `parquet-tools meta` results. It is from the run with the config I expected to perform best based on my sampled data.
Compression: LZO; page size and dictionary size: 16 KB; block size: 32 MB. There are 32 parts totaling 911 MB on S3 (so a single file is in fact less than 32 MB). I am not sure the block size actually matters much, since the data is on S3 and not HDFS... :(

When I read all the fields it is much worse than with raw Thrift. If I select one nested field (foo/**, where foo has only 2 leaves) plus a few top-level leaves, performance is similar to reading everything without any projection. When selecting only ~5 leaves, performance is similar to raw Thrift. Thanks!
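For reference, this is roughly how I push the projection down -- a sketch only, assuming the parquet-thrift glob filter property (parquet.thrift.column.filter) and parquet-mr 1.6-era package names; the column paths and the S3 path are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import parquet.hadoop.thrift.ParquetThriftInputFormat;

    Configuration conf = new Configuration();
    // Semicolon-separated glob patterns; "foo/**" keeps every leaf under foo.
    conf.set("parquet.thrift.column.filter", "a;b;foo/**");
    Job job = Job.getInstance(conf);
    job.setInputFormatClass(ParquetThriftInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("s3n://bucket/parquet-data"));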
Reading the dump below: SZ is compressed size/uncompressed size/ratio, VC the value count, DO and FPO the dictionary-page and first-data-page offsets.

row group 1: RC:283052 TS:107919094 OFFSET:4
--------------------------------------------------------------------------------
a:         INT64 LZO DO:0 FPO:4 SZ:365710/2213388/6,05 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED
b:         INT64 LZO DO:0 FPO:365714 SZ:505835/2228766/4,41 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED
c:         BINARY LZO DO:0 FPO:871549 SZ:10376384/11393987/1,10 VC:283052 ENC:PLAIN,BIT_PACKED
d:         BINARY LZO DO:0 FPO:11247933 SZ:70986/78575/1,11 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED
e:         BINARY LZO DO:0 FPO:11318919 SZ:2159/2603/1,21 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
f:         BINARY LZO DO:0 FPO:11321078 SZ:41917/47856/1,14 VC:283052 ENC:PLAIN,BIT_PACKED,RLE
g:
.g1:       BINARY LZO DO:0 FPO:11362995 SZ:38549/37372/0,97 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
.g2:
..g21:     INT64 LZO DO:0 FPO:11401544 SZ:61882/388906/6,28 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
..g22:     BINARY LZO DO:0 FPO:11463426 SZ:1144390/7158351/6,26 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
h:
.h1:       BINARY LZO DO:0 FPO:12607816 SZ:63896/68688/1,07 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
.h2:
..h21:     INT64 LZO DO:0 FPO:12671712 SZ:1169087/2207025/1,89 VC:283052 ENC:PLAIN,BIT_PACKED,RLE
..h22:     BINARY LZO DO:0 FPO:13840799 SZ:29116/40513/1,39 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
i:
.i1:       BINARY LZO DO:0 FPO:13869915 SZ:10933/13648/1,25 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
.i2:
..i21:     INT64 LZO DO:0 FPO:13880848 SZ:11523/17795/1,54 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
..i22:     BINARY LZO DO:0 FPO:13892371 SZ:135510/248827/1,84 VC:283052 ENC:PLAIN,BIT_PACKED,RLE
j:
.j1:       BINARY LZO DO:0 FPO:14027881 SZ:37025/35497/0,96 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
.j2:
..j21:     INT64 LZO DO:0 FPO:14064906 SZ:28196/37242/1,32 VC:283052 ENC:PLAIN,BIT_PACKED,RLE
..j22:     BINARY LZO DO:0 FPO:14093102 SZ:945481/6491450/6,87 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
k:         BINARY LZO DO:0 FPO:15038583 SZ:39147/36673/0,94 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED
l:         BINARY LZO DO:0 FPO:15077730 SZ:58233/60236/1,03 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
m:         BINARY LZO DO:0 FPO:15135963 SZ:28326/30663/1,08 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED
n:         BINARY LZO DO:0 FPO:15164289 SZ:2223225/26327896/11,84 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
o:         BINARY LZO DO:0 FPO:17387514 SZ:690400/4470368/6,48 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
p:         BINARY LZO DO:0 FPO:18077914 SZ:39/27/0,69 VC:283052 ENC:PLAIN,BIT_PACKED,RLE
q:         BINARY LZO DO:0 FPO:18077953 SZ:1099508/7582263/6,90 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
r:         BINARY LZO DO:0 FPO:19177461 SZ:1372666/8752125/6,38 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
s:         BINARY LZO DO:0 FPO:20550127 SZ:52878/51840/0,98 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
t:         BINARY LZO DO:0 FPO:20603005 SZ:51548/49339/0,96 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
u:
.map:
..key:     BINARY LZO DO:0 FPO:20654553 SZ:75794/85569/1,13 VC:291795 ENC:PLAIN_DICTIONARY,RLE
..value:   BINARY LZO DO:0 FPO:20730347 SZ:58334/62448/1,07 VC:291795 ENC:PLAIN_DICTIONARY,RLE
v:
.map:
..key:     BINARY LZO DO:0 FPO:20788681 SZ:1072311/2977966/2,78 VC:2674014 ENC:PLAIN_DICTIONARY,RLE
..value:   BINARY LZO DO:0 FPO:21860992 SZ:6997331/24721192/3,53 VC:2674014 ENC:PLAIN_DICTIONARY,PLAIN,RLE
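The same column-chunk sizes can also be read programmatically from the footer; a small sketch against the parquet-mr footer API (1.6-era package names assumed, file name is a placeholder for one of the 32 part files):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import parquet.hadoop.ParquetFileReader;
    import parquet.hadoop.metadata.BlockMetaData;
    import parquet.hadoop.metadata.ColumnChunkMetaData;
    import parquet.hadoop.metadata.ParquetMetadata;

    ParquetMetadata footer = ParquetFileReader.readFooter(
        new Configuration(), new Path("part-00000.parquet"));
    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData col : block.getColumns()) {
        // compressed/uncompressed bytes per column chunk, the SZ pair above
        System.out.println(col.getPath() + " " + col.getTotalSize()
            + "/" + col.getTotalUncompressedSize());
      }
    }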
2015-04-03 18:22 GMT+02:00 Eugen Cepoi <[email protected]>:

> Hey Ryan,
>
> 2015-04-03 18:00 GMT+02:00 Ryan Blue <[email protected]>:
>
>> On 04/02/2015 07:38 AM, Eugen Cepoi wrote:
>>
>>> Hi there,
>>>
>>> I was testing Parquet with Thrift to see if there would be an
>>> interesting performance gain compared to using just Thrift. But in my
>>> test I found that just using plain Thrift with LZO compression was
>>> faster.
>>
>> This doesn't surprise me too much because of how the Thrift object model
>> works. (At least, assuming I understand it right. Feel free to correct me.)
>>
>> Thrift wants to read and write using the TProtocol, which provides a
>> layer like Parquet's Converters that is an intermediary between the
>> object model and the underlying encodings. Parquet implements TProtocol
>> by building a list of the method calls a record will make to read or
>> write itself, then allowing the record to read that list. I think this
>> has the potential to slow down reading and writing.
>>
>> It's on my todo list to try to get this working using avro-thrift, which
>> sets the fields directly.
>
> Yes, the double "ser/de" overhead makes sense to me, but I was not
> expecting such a big difference.
> I didn't read the code doing the conversion, but with Thrift we can set
> the fields directly, at least if what you mean is setting them without
> reflection.
> Basically, one can just create an "empty" instance via the default
> constructor (or reflection) and then use the setFieldValue method with
> the corresponding _Fields value (an enum) and the value itself. We can
> even reuse those instances.
> I think this would perform better than using avro-thrift, which adds
> another layer. If you can point me to the code of interest, I can maybe
> be of some help :)
>
> Does the impl based on avro perform much better?
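> A minimal sketch of what I mean (MyStruct and its fields are invented
> here; setFieldValue and the _Fields enum are generated for every Thrift
> struct via TBase):
>
>     MyStruct record = new MyStruct();                  // plain default ctor
>     record.setFieldValue(MyStruct._Fields.ID, 42L);    // boxed value matching the field type
>     record.setFieldValue(MyStruct._Fields.NAME, "x");
>     record.clear();                                    // TBase.clear() resets it for reuse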
>> That's just to see if it might be faster constructing the records
>> directly, since we rely on TProtocol to make both thrift and scrooge
>> objects work.
>>
>>> I used a small EMR cluster with 2 m3.xlarge cores.
>>> The sampled input has 9 million records, about 1 GB (on S3), with ~20
>>> fields and some nested structures and maps. I just do a count on it.
>>> I tried playing with different tuning options but none seemed to really
>>> improve things (the pic shows some global metrics for the different
>>> options).
>>>
>>> I also tried with a larger sample of about a couple of gigs (output
>>> once compressed), but I had similar results.
>>
>> Could you post the results of `parquet-tools meta`? I'd like to see what
>> your column layout looks like (the final column chunk sizes).
>>
>> If your data ends up with only a column or two dominating the row group
>> and you always select those columns, then you probably wouldn't see an
>> improvement. You need at least one "big" column chunk that you're ignoring.
>
> I'll provide those shortly. BTW I had some warnings indicating that it
> couldn't skip row groups due to predicates or something like this. I'll
> try to provide it too.
>
>> Also, what compression did you use for the Parquet files?
>
> LZO, it is also the one I am using for the raw Thrift data.
>
> Thank you!
> Eugen
>
>>> In the end the only situation I can see where it can perform
>>> significantly better is when reading few columns from a dataset that
>>> has a large number of columns. But as the schemas are hand-written, I
>>> don't imagine having data structures with hundreds of columns.
>>
>> I think we'll know more from taking a look at the row groups and column
>> chunk sizes.
>>
>>> I am wondering if I am doing something wrong (esp. due to the large
>>> difference between plain Thrift and Parquet+Thrift) or if the dataset
>>> I used isn't a good fit for Parquet?
>>>
>>> Thanks!
>>>
>>> Cheers,
>>> Eugen
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.