Hi,

I am very new to PySpark. I have a PySpark app that processes text files of widely varying sizes (100 MB to 100 GB). Each task handles the same size of input split, yet workers spend far longer on some splits than on others, especially when the split belongs to a big file. See the logs of two such splits below (check the python.PythonRunner: Times: total = ... lines).

15/12/08 07:37:15 INFO rdd.NewHadoopRDD: Input split: hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/budisansblog.blogspot.com.html:39728447488+134217728
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4335010, boot = -140, init = 282, finish = 4334868
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(125163) called with curMem=227636200, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_3_1772 stored as bytes in memory (estimated size 122.2 KB, free 3.8 GB)
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4, boot = 1, init = 0, finish = 3
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(126595) called with curMem=227761363, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_9_1772 stored as bytes in memory (estimated size 123.6 KB, free 3.8 GB)
15/12/08 08:49:30 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/12/08 08:49:30 INFO datasources.DynamicPartitionWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/08 08:49:30 INFO output.FileOutputCommitter: Saved output of task 'attempt_201512080849_0002_m_001772_0' to hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512080849_0002_m_001772
15/12/08 08:49:30 INFO mapred.SparkHadoopMapRedUtil: attempt_201512080849_0002_m_001772_0: Committed
15/12/08 08:49:30 INFO executor.Executor: Finished task 1772.0 in stage 2.0 (TID 1770). 16216 bytes result sent to driver


15/12/07 20:52:24 INFO rdd.NewHadoopRDD: Input split: hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/bcnn1wp.wordpress.com.html:1476395008+134217728
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 41776, boot = -425, init = 432, finish = 41769
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1434614) called with curMem=167647961, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_3_994 stored as bytes in memory (estimated size 1401.0 KB, free 3.9 GB)
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 40, boot = -20, init = 21, finish = 39
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1463477) called with curMem=169082575, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_9_994 stored as bytes in memory (estimated size 1429.2 KB, free 3.9 GB)
15/12/07 20:53:06 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/12/07 20:53:06 INFO datasources.DynamicPartitionWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/07 20:53:06 INFO output.FileOutputCommitter: Saved output of task 'attempt_201512072053_0002_m_000994_0' to hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512072053_0002_m_000994
15/12/07 20:53:06 INFO mapred.SparkHadoopMapRedUtil: attempt_201512072053_0002_m_000994_0: Committed
15/12/07 20:53:06 INFO executor.Executor: Finished task 994.0 in stage 2.0 (TID 990). 9386 bytes result sent to driver
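
For context, the job is roughly shaped like the sketch below (simplified; parse_blog_page and the "site" partition column are placeholders standing in for my real parsing code):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="NTAPBlogInfo")
sqlContext = SQLContext(sc)

def parse_blog_page(text):
    # Placeholder for my real per-record parsing logic; this is where
    # essentially all of the PythonRunner time is spent.
    return Row(site="placeholder", length=len(text))

# Read the raw HTML dumps; each 134217728-byte (128 MB) HDFS block
# becomes one input split, i.e. one task.
lines = sc.newAPIHadoopFile(
    "hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")

# Parse each line in Python and cache the result for reuse.
rows = lines.map(lambda kv: parse_blog_page(kv[1])).cache()

# Write the parsed records out, dynamically partitioned by one column.
df = sqlContext.createDataFrame(rows)
df.write.partitionBy("site").parquet(
    "hdfs://helmhdfs/user/patcharee/NTAPBlogInfo")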

Both splits are the same size (134217728 bytes), but the Python worker time differs by two orders of magnitude: total = 4335010 ms (about 72 minutes) for the first versus total = 41776 ms (about 42 seconds) for the second. Any suggestions, please?

Thanks,
Patcharee



