Hi,

I have narrowed my problem down to a very simple case. I'm attaching a 27 KB Parquet file (it is read from file:///data/dump/test2 in the example below). Could you please take a look at it? Why is there a performance drop once the aggregation has more than 57 sum columns?

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>> import timeit
>>> for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x)).collect, number=1)
...
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
0 1.05591607094
1 0.200426101685
2 0.203800916672
3 0.176458120346
4 0.184863805771
5 0.232321023941
6 0.216032981873
7 0.201778173447
8 0.292424917221
9 0.228524923325
10 0.190534114838
11 0.197028160095
12 0.270443916321
13 0.429781913757
14 0.270851135254
15 0.776989936829
16 0.233337879181
17 0.227638959885
18 0.212944030762
19 0.2144780159
20 0.22200012207
21 0.262261152267
22 0.254227876663
23 0.275084018707
24 0.292124032974
25 0.280488014221
16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
26 0.290093898773
27 0.238478899002
28 0.246420860291
29 0.241401195526
30 0.255286931992
31 0.42702794075
32 0.327946186066
33 0.434395074844
34 0.314198970795
35 0.34576010704
36 0.278323888779
37 0.289474964142
38 0.290827989578
39 0.376291036606
40 0.347742080688
41 0.363158941269
42 0.318687915802
43 0.376327991486
44 0.374994039536
45 0.362971067429
46 0.425967931747
47 0.370860099792
48 0.443903923035
49 0.374128103256
50 0.378985881805
51 0.476850986481
52 0.451028823853
53 0.432540893555
54 0.514838933945
55 0.53990483284
56 0.449142932892
57 0.465240001678
// 5x slower after 57 columns
58 2.40412116051
59 2.41632795334
60 2.41812801361
61 2.55726218224
62 2.55484509468
63 2.56128406525
64 2.54642391205
65 2.56381797791
66 2.56871509552
67 2.66187620163
68 2.63496208191
69 2.81545996666

Sergei Romanov
part-r-00000-d06531d2-d435-4ff0-8804-0dea49953be4.snappy.parquet
Description: Binary data