Hi,
I am very new to PySpark. I have a PySpark app working on text files of
different sizes (100 MB - 100 GB). Each task handles an input split of
the same size (134217728 bytes = 128 MB), yet the workers spend far
longer on some input splits than on others, especially when the split
belongs to a big file. Compare the logs of these two input splits below
(see the python.PythonRunner "Times: total = ..." lines): the first
split took about 72 minutes, the second about 42 seconds.
15/12/08 07:37:15 INFO rdd.NewHadoopRDD: Input split:
hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/budisansblog.blogspot.com.html:39728447488+134217728
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4335010, boot
= -140, init = 282, finish = 4334868
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(125163)
called with curMem=227636200, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_3_1772 stored as
bytes in memory (estimated size 122.2 KB, free 3.8 GB)
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4, boot = 1,
init = 0, finish = 3
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(126595)
called with curMem=227761363, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_9_1772 stored as
bytes in memory (estimated size 123.6 KB, free 3.8 GB)
15/12/08 08:49:30 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
15/12/08 08:49:30 INFO datasources.DynamicPartitionWriterContainer:
Using output committer class
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/08 08:49:30 INFO output.FileOutputCommitter: Saved output of task
'attempt_201512080849_0002_m_001772_0' to
hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512080849_0002_m_001772
15/12/08 08:49:30 INFO mapred.SparkHadoopMapRedUtil:
attempt_201512080849_0002_m_001772_0: Committed
15/12/08 08:49:30 INFO executor.Executor: Finished task 1772.0 in stage
2.0 (TID 1770). 16216 bytes result sent to driver

15/12/07 20:52:24 INFO rdd.NewHadoopRDD: Input split:
hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/bcnn1wp.wordpress.com.html:1476395008+134217728
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 41776, boot =
-425, init = 432, finish = 41769
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1434614)
called with curMem=167647961, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_3_994 stored as
bytes in memory (estimated size 1401.0 KB, free 3.9 GB)
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 40, boot =
-20, init = 21, finish = 39
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1463477)
called with curMem=169082575, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_9_994 stored as
bytes in memory (estimated size 1429.2 KB, free 3.9 GB)
15/12/07 20:53:06 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
15/12/07 20:53:06 INFO datasources.DynamicPartitionWriterContainer:
Using output committer class
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/07 20:53:06 INFO output.FileOutputCommitter: Saved output of task
'attempt_201512072053_0002_m_000994_0' to
hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512072053_0002_m_000994
15/12/07 20:53:06 INFO mapred.SparkHadoopMapRedUtil:
attempt_201512072053_0002_m_000994_0: Committed
15/12/07 20:53:06 INFO executor.Executor: Finished task 994.0 in stage
2.0 (TID 990). 9386 bytes result sent to driver
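To narrow down where the time goes, one option is to time each partition from inside the job itself. Below is a minimal sketch of that idea, shown with plain Python iterators so it runs standalone; all names here are illustrative and not from my actual job. In a real PySpark job the same function shape could be passed to rdd.mapPartitionsWithIndex:

```python
import time

def timed_partition(index, iterator, process):
    """Process one partition's records and report how long it took.

    This has the (index, iterator) shape that PySpark's
    rdd.mapPartitionsWithIndex expects (with `process` bound via a
    lambda or functools.partial); here it is demonstrated standalone.
    """
    start = time.time()
    results = [process(record) for record in iterator]
    elapsed = time.time() - start
    # Emit the timing alongside the results so skewed partitions stand out.
    yield (index, elapsed, results)

# Illustrative use with plain lists standing in for input splits:
partitions = [["<html>a</html>"] * 10, ["<html>b</html>"] * 10000]
for idx, part in enumerate(partitions):
    for index, elapsed, results in timed_partition(idx, iter(part), len):
        print(f"partition {index}: {len(results)} records in {elapsed:.3f}s")
```

Logging per-partition times like this would show whether the slow splits contain far more records, or far heavier records, than the fast ones, even though the byte size is identical.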
Any suggestions, please?
Thanks,
Patcharee
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org