The same problem happens with a CSV data file, so it's not Parquet-related either.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>> import timeit
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('dd_convs', FloatType(), True)])
>>> for x in range(50, 70): print x, timeit.timeit(spark.read.csv('file:///data/dump/test_csv', schema=schema).groupBy().sum(*(['dd_convs'] * x)).collect, number=1)
...
50 0.372850894928
51 0.376906871796
52 0.381325960159
53 0.385444164276
54 0.386877775192
55 0.388918161392
56 0.397624969482
57 0.391713142395
58 2.62714004517
59 2.68421196938
60 2.74627685547
61 2.81081581116
62 3.43532109261
63 3.07742786407
64 3.03904604912
65 3.01616096497
66 3.06293702126
67 3.09386610985
68 3.27610206604
69 3.2041969299

Saturday, 3 September 2016, 15:40 +03:00, from Sergei Romanov <romano...@inbox.ru.INVALID>:

> Hi,
>
> I have narrowed my problem down to a very simple case. I'm sending the 27kb parquet file as an attachment (file:///data/dump/test2 in the example below).
>
> Please, can you take a look at it? Why is there a performance drop after 57 sum columns?
>
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
>
> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
> SparkSession available as 'spark'.
> >>> import timeit
> >>> for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x)).collect, number=1)
> ...
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> 0 1.05591607094
> 1 0.200426101685
> 2 0.203800916672
> 3 0.176458120346
> 4 0.184863805771
> 5 0.232321023941
> 6 0.216032981873
> 7 0.201778173447
> 8 0.292424917221
> 9 0.228524923325
> 10 0.190534114838
> 11 0.197028160095
> 12 0.270443916321
> 13 0.429781913757
> 14 0.270851135254
> 15 0.776989936829
> 16 0.233337879181
> 17 0.227638959885
> 18 0.212944030762
> 19 0.2144780159
> 20 0.22200012207
> 21 0.262261152267
> 22 0.254227876663
> 23 0.275084018707
> 24 0.292124032974
> 25 0.280488014221
> 16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
> 26 0.290093898773
> 27 0.238478899002
> 28 0.246420860291
> 29 0.241401195526
> 30 0.255286931992
> 31 0.42702794075
> 32 0.327946186066
> 33 0.434395074844
> 34 0.314198970795
> 35 0.34576010704
> 36 0.278323888779
> 37 0.289474964142
> 38 0.290827989578
> 39 0.376291036606
> 40 0.347742080688
> 41 0.363158941269
> 42 0.318687915802
> 43 0.376327991486
> 44 0.374994039536
> 45 0.362971067429
> 46 0.425967931747
> 47 0.370860099792
> 48 0.443903923035
> 49 0.374128103256
> 50 0.378985881805
> 51 0.476850986481
> 52 0.451028823853
> 53 0.432540893555
> 54 0.514838933945
> 55 0.53990483284
> 56 0.449142932892
> 57 0.465240001678  // 5x slower after 57 columns
> 58 2.40412116051
> 59 2.41632795334
> 60 2.41812801361
> 61 2.55726218224
> 62 2.55484509468
> 63 2.56128406525
> 64 2.54642391205
> 65 2.56381797791
> 66 2.56871509552
> 67 2.66187620163
> 68 2.63496208191
> 69 2.81545996666
>
> Sergei Romanov
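In case it helps to reproduce this without the attachments, here is the CSV loop from my session as a standalone script. The path and the 'dd_convs' column name are just the values from my setup; any small CSV with a single float column should behave the same, and your timings will of course differ:

# Standalone repro sketch. Assumptions: 'file:///data/dump/test_csv' points
# at any small CSV with a single float column, and 'dd_convs' is its name;
# both are simply the values from my session above.
from __future__ import print_function
import timeit

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, FloatType

spark = SparkSession.builder.appName("sum-columns-slowdown").getOrCreate()

schema = StructType([StructField('dd_convs', FloatType(), True)])
df = spark.read.csv('file:///data/dump/test_csv', schema=schema)

for x in range(50, 70):
    # Sum the same column x times in one aggregation and time a single
    # collect(); in my runs the wall-clock time jumps ~5x once x passes 57.
    agg = df.groupBy().sum(*(['dd_convs'] * x))
    print(x, timeit.timeit(agg.collect, number=1))

Re-running the loop after spark.conf.set('spark.sql.codegen.wholeStage', 'false') might also be informative, to see whether the cliff is tied to whole-stage code generation; that is only a guess on my part, not a diagnosis.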
Sergei Romanov

[Attachment: bad.csv.tgz, binary data]