Hi,
I have narrowed my problem down to a very simple case. I'm attaching a 27 KB
Parquet file (file:///data/dump/test2 in the example below).
Could you please take a look at it? Why is there a performance drop with more
than 57 sum columns?
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>> import timeit
>>> for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x)).collect, number=1)
... 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
0 1.05591607094
1 0.200426101685
2 0.203800916672
3 0.176458120346
4 0.184863805771
5 0.232321023941
6 0.216032981873
7 0.201778173447
8 0.292424917221
9 0.228524923325
10 0.190534114838
11 0.197028160095
12 0.270443916321
13 0.429781913757
14 0.270851135254
15 0.776989936829
16 0.233337879181
17 0.227638959885
18 0.212944030762
19 0.2144780159
20 0.22200012207
21 0.262261152267
22 0.254227876663
23 0.275084018707
24 0.292124032974
25 0.280488014221
16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.debug.maxToStringFields' in SparkEnv.conf.
26 0.290093898773
27 0.238478899002
28 0.246420860291
29 0.241401195526
30 0.255286931992
31 0.42702794075
32 0.327946186066
33 0.434395074844
34 0.314198970795
35 0.34576010704
36 0.278323888779
37 0.289474964142
38 0.290827989578
39 0.376291036606
40 0.347742080688
41 0.363158941269
42 0.318687915802
43 0.376327991486
44 0.374994039536
45 0.362971067429
46 0.425967931747
47 0.370860099792
48 0.443903923035
49 0.374128103256
50 0.378985881805
51 0.476850986481
52 0.451028823853
53 0.432540893555
54 0.514838933945
55 0.53990483284
56 0.449142932892
57 0.465240001678 // last fast run; about 5x slower from 58 columns on
58 2.40412116051
59 2.41632795334
60 2.41812801361
61 2.55726218224
62 2.55484509468
63 2.56128406525
64 2.54642391205
65 2.56381797791
66 2.56871509552
67 2.66187620163
68 2.63496208191
69 2.81545996666
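
To pin down where the jump happens without reading the numbers by eye, a small sketch like the one below could be used. It is plain Python and not part of the session above; the helper name first_jump and the 3x threshold are illustrative assumptions, and the sample values are copied from the output above.

```python
# Timings copied from the run above: (number of sum columns, seconds).
timings = [
    (55, 0.53990483284),
    (56, 0.449142932892),
    (57, 0.465240001678),
    (58, 2.40412116051),
    (59, 2.41632795334),
]


def first_jump(samples, factor=3.0):
    """Return (n, ratio) for the first sample that is at least `factor`
    times slower than the sample before it, or None if there is none.

    `factor` is an arbitrary illustrative threshold, not a Spark setting.
    """
    for (_, prev), (n, cur) in zip(samples, samples[1:]):
        if prev > 0 and cur / prev >= factor:
            return n, cur / prev
    return None


result = first_jump(timings)
print(result)
```

On the numbers above this reports the step between 57 and 58 columns, where the run time grows by a bit more than a factor of five.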
        

Sergei Romanov

Attachment: part-r-00000-d06531d2-d435-4ff0-8804-0dea49953be4.snappy.parquet
Description: Binary data
