Hi,

I am very new to PySpark. I have a PySpark app that processes text files of widely varying sizes (100 MB to 100 GB). Each task handles the same size of input split, yet workers spend far longer on some splits than on others, especially when the split belongs to a big file. See the logs of two such splits below (check the python.PythonRunner: Times: total = ... lines).

15/12/08 07:37:15 INFO rdd.NewHadoopRDD: Input split: hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/budisansblog.blogspot.com.html:39728447488+134217728
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4335010, boot = -140, init = 282, finish = 4334868
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(125163) called with curMem=227636200, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_3_1772 stored as bytes in memory (estimated size 122.2 KB, free 3.8 GB)
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4, boot = 1, init = 0, finish = 3
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(126595) called with curMem=227761363, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_9_1772 stored as bytes in memory (estimated size 123.6 KB, free 3.8 GB)
15/12/08 08:49:30 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/12/08 08:49:30 INFO datasources.DynamicPartitionWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/08 08:49:30 INFO output.FileOutputCommitter: Saved output of task 'attempt_201512080849_0002_m_001772_0' to hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512080849_0002_m_001772
15/12/08 08:49:30 INFO mapred.SparkHadoopMapRedUtil: attempt_201512080849_0002_m_001772_0: Committed
15/12/08 08:49:30 INFO executor.Executor: Finished task 1772.0 in stage 2.0 (TID 1770). 16216 bytes result sent to driver


15/12/07 20:52:24 INFO rdd.NewHadoopRDD: Input split: hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/bcnn1wp.wordpress.com.html:1476395008+134217728
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 41776, boot = -425, init = 432, finish = 41769
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1434614) called with curMem=167647961, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_3_994 stored as bytes in memory (estimated size 1401.0 KB, free 3.9 GB)
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 40, boot = -20, init = 21, finish = 39
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1463477) called with curMem=169082575, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_9_994 stored as bytes in memory (estimated size 1429.2 KB, free 3.9 GB)
15/12/07 20:53:06 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/12/07 20:53:06 INFO datasources.DynamicPartitionWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/07 20:53:06 INFO output.FileOutputCommitter: Saved output of task 'attempt_201512072053_0002_m_000994_0' to hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512072053_0002_m_000994
15/12/07 20:53:06 INFO mapred.SparkHadoopMapRedUtil: attempt_201512072053_0002_m_000994_0: Committed
15/12/07 20:53:06 INFO executor.Executor: Finished task 994.0 in stage 2.0 (TID 990). 9386 bytes result sent to driver
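
For context, the job is roughly shaped like the sketch below (simplified; parse_blog_page and the "site" partition column are placeholders standing in for my real parsing code):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="NTAPBlogInfo")
sqlContext = SQLContext(sc)

def parse_blog_page(text):
    # Placeholder for my real per-record parsing logic; this is where
    # essentially all of the PythonRunner time is spent.
    return Row(site="placeholder", length=len(text))

# Read the raw HTML dumps; each 134217728-byte (128 MB) HDFS block
# becomes one input split, i.e. one task.
lines = sc.newAPIHadoopFile(
    "hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")

# Parse each line in Python and cache the result for reuse.
rows = lines.map(lambda kv: parse_blog_page(kv[1])).cache()

# Write the parsed records out, dynamically partitioned by one column.
df = sqlContext.createDataFrame(rows)
df.write.partitionBy("site").parquet(
    "hdfs://helmhdfs/user/patcharee/NTAPBlogInfo")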

Both splits are the same size (134217728 bytes), but the Python worker time differs by two orders of magnitude: total = 4335010 ms (about 72 minutes) for the first versus total = 41776 ms (about 42 seconds) for the second. Any suggestions, please?

Thanks,
Patcharee



