An even simpler case:

>>> df = sc.parallelize([1] for x in xrange(760857)).toDF()
>>> for x in range(50, 70): print x, timeit.timeit(df.groupBy().sum(*(['_1'] * x)).collect, number=1)
50 1.91226291656
51 1.50933384895
52 1.582903862
53 1.90537405014
54 1.84442877769
55 1.91788887978
56 1.50977802277
57 1.5907189846   // after 57 columns it's 2x slower
58 3.22199988365
59 2.96345090866
60 2.8297970295
61 2.87895679474
62 2.92077898979
63 2.95195293427
64 4.10550689697
65 4.14798402786
66 3.13437199593
67 3.11248207092
68 3.18963003159
69 3.18774986267

>Saturday, September 3, 2016, 15:50 +03:00, from Sergei Romanov <romano...@inbox.ru.INVALID>:
>
>The same problem happens with a CSV data file, so it's not parquet-related either.
>
>Welcome to
>      ____              __
>     / __/__  ___ _____/ /__
>    _\ \/ _ \/ _ `/ __/ '_/
>   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>      /_/
>
>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>SparkSession available as 'spark'.
>>>> import timeit
>>>> from pyspark.sql.types import *
>>>> schema = StructType([StructField('dd_convs', FloatType(), True)])
>>>> for x in range(50, 70): print x, timeit.timeit(spark.read.csv('file:///data/dump/test_csv', schema=schema).groupBy().sum(*(['dd_convs'] * x)).collect, number=1)
>50 0.372850894928
>51 0.376906871796
>52 0.381325960159
>53 0.385444164276
>54 0.386877775192
>55 0.388918161392
>56 0.397624969482
>57 0.391713142395
>58 2.62714004517
>59 2.68421196938
>60 2.74627685547
>61 2.81081581116
>62 3.43532109261
>63 3.07742786407
>64 3.03904604912
>65 3.01616096497
>66 3.06293702126
>67 3.09386610985
>68 3.27610206604
>69 3.2041969299
>
>>Saturday, September 3, 2016, 15:40 +03:00, from Sergei Romanov <romano...@inbox.ru.INVALID>:
>>
>>Hi,
>>I have narrowed my problem down to a very simple case. I'm sending the 27 KB parquet file as an attachment (file:///data/dump/test2 in the example).
>>Please, can you take a look at it? Why is there a performance drop after 57 sum columns?
>>
>>Welcome to
>>      ____              __
>>     / __/__  ___ _____/ /__
>>    _\ \/ _ \/ _ `/ __/ '_/
>>   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>>      /_/
>>
>>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>>SparkSession available as 'spark'.
>>>>> import timeit
>>>>> for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x)).collect, number=1)
>>...
>>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>SLF4J: Defaulting to no-operation (NOP) logger implementation
>>SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>0 1.05591607094
>>1 0.200426101685
>>2 0.203800916672
>>3 0.176458120346
>>4 0.184863805771
>>5 0.232321023941
>>6 0.216032981873
>>7 0.201778173447
>>8 0.292424917221
>>9 0.228524923325
>>10 0.190534114838
>>11 0.197028160095
>>12 0.270443916321
>>13 0.429781913757
>>14 0.270851135254
>>15 0.776989936829
>>16 0.233337879181
>>17 0.227638959885
>>18 0.212944030762
>>19 0.2144780159
>>20 0.22200012207
>>21 0.262261152267
>>22 0.254227876663
>>23 0.275084018707
>>24 0.292124032974
>>25 0.280488014221
>>16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
>>26 0.290093898773
>>27 0.238478899002
>>28 0.246420860291
>>29 0.241401195526
>>30 0.255286931992
>>31 0.42702794075
>>32 0.327946186066
>>33 0.434395074844
>>34 0.314198970795
>>35 0.34576010704
>>36 0.278323888779
>>37 0.289474964142
>>38 0.290827989578
>>39 0.376291036606
>>40 0.347742080688
>>41 0.363158941269
>>42 0.318687915802
>>43 0.376327991486
>>44 0.374994039536
>>45 0.362971067429
>>46 0.425967931747
>>47 0.370860099792
>>48 0.443903923035
>>49 0.374128103256
>>50 0.378985881805
>>51 0.476850986481
>>52 0.451028823853
>>53 0.432540893555
>>54 0.514838933945
>>55 0.53990483284
>>56 0.449142932892
>>57 0.465240001678   // 5x slower after 57 columns
>>58 2.40412116051
>>59 2.41632795334
>>60 2.41812801361
>>61 2.55726218224
>>62 2.55484509468
>>63 2.56128406525
>>64 2.54642391205
>>65 2.56381797791
>>66 2.56871509552
>>67 2.66187620163
>>68 2.63496208191
>>69 2.81545996666
>>
>>Sergei Romanov
>>
>>---------------------------------------------------------------------
>>To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>Sergei Romanov
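For anyone who wants to reproduce the measurement pattern from this thread without a Spark cluster at hand, the timing loop boils down to the following self-contained sketch. Note this is pure Python, not Spark API: `make_query` is a hypothetical stand-in for building the `df.groupBy().sum(*(['dd_convs'] * x))` job, and the placeholder workload will not reproduce the 57-column cliff — it only illustrates the `timeit` methodology used above.

```python
import timeit

def make_query(n_cols):
    """Hypothetical stand-in for the Spark job: returns a zero-argument
    callable, mirroring how the unbound .collect is handed to timeit
    in the thread (so only execution time is measured, not setup)."""
    data = list(range(10000))
    def run():
        # Placeholder workload: one sum per requested "column".
        return [sum(data) for _ in range(n_cols)]
    return run

# One execution per column count (number=1), exactly as in the thread.
for x in range(50, 70):
    print(x, timeit.timeit(make_query(x), number=1))
```

With `number=1` each measurement is a single run, so individual timings are noisy; the thread compensates by scanning a range of column counts and looking for a step change rather than trusting any one number.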