2015-04-04 20:01 GMT+02:00 Ryan Blue <[email protected]>:

> Did you also set the row group size? It looks like this row group is
> ~103MB, which doesn't make sense with your block size (unless I'm reading
> the output wrong). I'm not really sure how much block size would matter
> either. The row group will only get processed by a single task even if
> there are multiple "HDFS" blocks covering it.
>
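A note on the row group question above: in parquet-mr the row group size is what the "block size" setting controls, so it is the same knob that was already being tuned. Below is a minimal sketch of how these writer settings are typically passed through a Hadoop Configuration; the keys are the standard parquet-mr ones, but how they reach the writer depends on the job setup, so treat this as an illustration rather than the exact configuration used in this thread.

    import org.apache.hadoop.conf.Configuration

    // Sketch: standard parquet-mr writer keys with the values discussed in
    // this thread. "Block size" is the row group size, i.e. how much column
    // data is buffered before a row group is flushed to the file.
    val conf = new Configuration()
    conf.setInt("parquet.block.size", 32 * 1024 * 1024)     // 32 MB row groups
    conf.setInt("parquet.page.size", 16 * 1024)             // 16 KB data pages
    conf.setInt("parquet.dictionary.page.size", 16 * 1024)  // 16 KB dictionary pages
    conf.setBoolean("parquet.enable.dictionary", true)      // keep dictionary encoding on
    conf.set("parquet.compression", "LZO")                  // page compression codec

A row group that comes out at ~103 MB despite a 32 MB setting would suggest the block size key never reached the writer, which matches the puzzle raised above and is worth double-checking in the job setup.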
I didn't know we could configure the row group size; which option is it? I
only configured compression, block size, page size and dictionary page size.

> How did you arrive at 16KB for page size?

Mainly by following the configuration guidelines here:
http://parquet.incubator.apache.org/documentation/latest/. The default is
around 1 MB, but I don't have the impression that changing any of those made
a big difference in my situation.

For the configuration that performed best, I coalesced the output into a
small number of parts of around 32 MB each, so that each part matches the
block size configured for Parquet.

> rb
>
> On 04/03/2015 09:52 AM, Eugen Cepoi wrote:
>
>> Here is one of the results. It is for the execution with the config I was
>> expecting to perform the best based on my sampled data.
>>
>> Compression: LZO, page size and dictionary page size: 16KB, block size
>> 32 MB; there are 32 parts for a total of 911M on S3 (so a single file is
>> in fact less than 32 MB). I am not sure the block size actually matters
>> that much, since the data is on S3 and not HDFS... :(
>>
>> When I just get all the fields it is much worse than with raw Thrift. If I
>> select one nested field (foo/** where foo has only 2 leaves) and a few
>> direct leaves, then performance is similar to getting all fields without
>> any filter.
>> When selecting only ~5 leaves, performance is similar to raw Thrift.
>>
>> Thanks!
>>
>> row group 1: RC:283052 TS:107919094 OFFSET:4
>> --------------------------------------------------------------------------------
>> a:        INT64 LZO DO:0 FPO:4 SZ:365710/2213388/6,05 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED
>> b:        INT64 LZO DO:0 FPO:365714 SZ:505835/2228766/4,41 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED
>> c:        BINARY LZO DO:0 FPO:871549 SZ:10376384/11393987/1,10 VC:283052 ENC:PLAIN,BIT_PACKED
>> d:        BINARY LZO DO:0 FPO:11247933 SZ:70986/78575/1,11 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED
>> e:        BINARY LZO DO:0 FPO:11318919 SZ:2159/2603/1,21 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
>> f:        BINARY LZO DO:0 FPO:11321078 SZ:41917/47856/1,14 VC:283052 ENC:PLAIN,BIT_PACKED,RLE
>> g:
>> .g1:      BINARY LZO DO:0 FPO:11362995 SZ:38549/37372/0,97 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
>> .g2:
>> ..g21:    INT64 LZO DO:0 FPO:11401544 SZ:61882/388906/6,28 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
>> ..g22:    BINARY LZO DO:0 FPO:11463426 SZ:1144390/7158351/6,26 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
>> h:
>> .h1:      BINARY LZO DO:0 FPO:12607816 SZ:63896/68688/1,07 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
>> .h2:
>> ..h21:    INT64 LZO DO:0 FPO:12671712 SZ:1169087/2207025/1,89 VC:283052 ENC:PLAIN,BIT_PACKED,RLE
>> ..h22:    BINARY LZO DO:0 FPO:13840799 SZ:29116/40513/1,39 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
>> i:
>> .i1:      BINARY LZO DO:0 FPO:13869915 SZ:10933/13648/1,25 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
>> .i2:
>> ..i21:    INT64 LZO DO:0 FPO:13880848 SZ:11523/17795/1,54 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
>> ..i22:    BINARY LZO DO:0 FPO:13892371 SZ:135510/248827/1,84 VC:283052 ENC:PLAIN,BIT_PACKED,RLE
>> j:
>> .j1:      BINARY LZO DO:0 FPO:14027881 SZ:37025/35497/0,96 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
>> .j2:
>> ..j21:    INT64 LZO DO:0 FPO:14064906 SZ:28196/37242/1,32 VC:283052 ENC:PLAIN,BIT_PACKED,RLE
>> ..j22:    BINARY LZO DO:0 FPO:14093102 SZ:945481/6491450/6,87 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
>> k:        BINARY LZO DO:0 FPO:15038583 SZ:39147/36673/0,94 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED
>> l:        BINARY LZO DO:0 FPO:15077730 SZ:58233/60236/1,03 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
>> m:        BINARY LZO DO:0 FPO:15135963 SZ:28326/30663/1,08 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED
>> n:        BINARY LZO DO:0 FPO:15164289 SZ:2223225/26327896/11,84 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
>> o:        BINARY LZO DO:0 FPO:17387514 SZ:690400/4470368/6,48 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
>> p:        BINARY LZO DO:0 FPO:18077914 SZ:39/27/0,69 VC:283052 ENC:PLAIN,BIT_PACKED,RLE
>> q:        BINARY LZO DO:0 FPO:18077953 SZ:1099508/7582263/6,90 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
>> r:        BINARY LZO DO:0 FPO:19177461 SZ:1372666/8752125/6,38 VC:283052 ENC:PLAIN_DICTIONARY,PLAIN,BIT_PACKED,RLE
>> s:        BINARY LZO DO:0 FPO:20550127 SZ:52878/51840/0,98 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
>> t:        BINARY LZO DO:0 FPO:20603005 SZ:51548/49339/0,96 VC:283052 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
>> u:
>> .map:
>> ..key:    BINARY LZO DO:0 FPO:20654553 SZ:75794/85569/1,13 VC:291795 ENC:PLAIN_DICTIONARY,RLE
>> ..value:  BINARY LZO DO:0 FPO:20730347 SZ:58334/62448/1,07 VC:291795 ENC:PLAIN_DICTIONARY,RLE
>> v:
>> .map:
>> ..key:    BINARY LZO DO:0 FPO:20788681 SZ:1072311/2977966/2,78 VC:2674014 ENC:PLAIN_DICTIONARY,RLE
>> ..value:  BINARY LZO DO:0 FPO:21860992 SZ:6997331/24721192/3,53 VC:2674014 ENC:PLAIN_DICTIONARY,PLAIN,RLE
>>
>>
>> 2015-04-03 18:22 GMT+02:00 Eugen Cepoi <[email protected]>:
>>
>>> Hey Ryan,
>>>
>>> 2015-04-03 18:00 GMT+02:00 Ryan Blue <[email protected]>:
>>>
>>>> On 04/02/2015 07:38 AM, Eugen Cepoi wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I was testing Parquet with Thrift to see if there would be an
>>>>> interesting performance gain compared to using just Thrift. But in my
>>>>> test I found that just using plain Thrift with LZO compression was
>>>>> faster.
>>>>>
>>>> This doesn't surprise me too much because of how the Thrift object model
>>>> works. (At least, assuming I understand it right. Feel free to correct
>>>> me.)
>>>>
>>>> Thrift wants to read and write using the TProtocol, which provides a
>>>> layer like Parquet's Converters that is an intermediary between the
>>>> object model and underlying encodings. Parquet implements TProtocol by
>>>> building a list of the method calls a record will make to read or write
>>>> itself, then allowing the record to read that list. I think this has the
>>>> potential to slow down reading and writing.
>>>>
>>>> It's on my todo list to try to get this working using avro-thrift, which
>>>> sets the fields directly.
>>>>
>>> Yes, the double "ser/de" overhead makes sense to me, but I was not
>>> expecting such a big difference.
>>> I didn't read the code doing the conversion, but with Thrift we can set
>>> the fields directly, at least if what you mean is setting them without
>>> reflection.
>>> So basically one can just create an "empty" instance via the default
>>> constructor (through reflection) and then use the setFieldValue method
>>> with the corresponding _Fields value (an enum) and the value to set. We
>>> can even reuse those instances.
>>> I think this would perform better than using avro-thrift, which adds
>>> another layer. If you can point me to the code of interest I can maybe
>>> be of some help :)
>>>
>>> Does the Avro-based implementation perform much better?
>>>
>>>> That's just to see if it might be faster constructing the records
>>>> directly, since we rely on TProtocol to make both thrift and scrooge
>>>> objects work.
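A minimal sketch of the "set the fields directly" idea discussed above: every Thrift-generated Java struct implements org.apache.thrift.TBase and exposes a generated _Fields enum together with setFieldValue and clear. MyRecord and its fields below are hypothetical stand-ins for a real generated class, so this is only an illustration of the technique, not code from parquet-thrift.

    // MyRecord, NAME and COUNT are hypothetical; substitute a real
    // Thrift-generated struct and its _Fields constants.
    def fill(record: MyRecord, name: String, count: Long): MyRecord = {
      record.clear() // reset all fields so the same instance can be reused per row
      // setFieldValue takes the generated _Fields constant and the value, so the
      // record is populated directly instead of replaying TProtocol events.
      record.setFieldValue(MyRecord._Fields.NAME, name)
      record.setFieldValue(MyRecord._Fields.COUNT, java.lang.Long.valueOf(count))
      record
    }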
>>>>> I used a small EMR cluster with 2 m3.xlarge core nodes.
>>>>> The sampled input has 9 million records, about 1 GB (on S3), with ~20
>>>>> fields and some nested structures and maps. I just do a count on it.
>>>>> I tried playing with different tuning options but none seemed to really
>>>>> improve things (the pic shows some global metrics for the different
>>>>> options).
>>>>>
>>>>> I also tried with a larger sample of about a couple of gigabytes (output
>>>>> once compressed), but I had similar results.
>>>>>
>>>> Could you post the results of `parquet-tools meta`? I'd like to see what
>>>> your column layout looks like (the final column chunk sizes).
>>>>
>>>> If your data ends up with only a column or two dominating the row group
>>>> and you always select those columns, then you probably wouldn't see an
>>>> improvement. You need at least one "big" column chunk that you're
>>>> ignoring.
>>>>
>>> I'll provide those shortly. BTW I had some warnings indicating that it
>>> couldn't skip row groups due to predicates or something like that; I'll
>>> try to provide those too.
>>>
>>>> Also, what compression did you use for the Parquet files?
>>>>
>>> LZO; it is also the one I am using for the raw Thrift data.
>>>
>>> Thank you!
>>> Eugen
>>>
>>>>> In the end the only situation I can see where it can perform
>>>>> significantly better is when reading a few columns from a dataset that
>>>>> has a large number of columns. But as the schemas are hand-written, I
>>>>> don't imagine having data structures with hundreds of columns.
>>>>>
>>>> I think we'll know more from taking a look at the row groups and column
>>>> chunk sizes.
>>>>
>>>>> I am wondering if I am doing something wrong (especially given the large
>>>>> difference between plain Thrift and Parquet+Thrift) or if the dataset I
>>>>> used just isn't a good fit for Parquet?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Cheers,
>>>>> Eugen
>>>>>
>>>> rb
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
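To illustrate the projection case discussed above (the foo/** example and "reading a few columns"): with parquet-thrift the columns to materialize can be requested as a glob on the read configuration. The key below is assumed to be parquet-thrift's glob-based column filter; key names have varied across Parquet versions, so treat this as a sketch rather than a guaranteed recipe.

    import org.apache.hadoop.conf.Configuration

    // Assumed key: parquet-thrift's glob-based column filter. Only the listed
    // subtree and leaves are assembled; other column chunks can be skipped.
    val readConf = new Configuration()
    readConf.set("parquet.thrift.column.filter", "foo/**;a;b")

Whether this pays off depends on what gets dropped: in the dump above most of the row group sits in a few chunks (e.g. c, n, q, r and v.map.value), so a projection only helps when it excludes some of those, which matches the point about needing at least one "big" column chunk that you're ignoring.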
