[ https://issues.apache.org/jira/browse/SPARK-20244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964422#comment-15964422 ]
Saisai Shao commented on SPARK-20244:
-------------------------------------
This is actually not a UI problem; it is a FileSystem thread-local statistics
problem. PythonRDD creates a separate thread to read the data, so the
bytes-read statistic observed from the task thread is wrong. There is no
problem with spark-shell, since everything is processed in a single thread.
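To make the thread-local behaviour concrete, here is a minimal standalone sketch (assuming Hadoop 2.x's per-thread FileSystem.Statistics API; the file path, buffer size, and object name are illustrative only): bytes read on a spawned thread are charged to that thread's counters, so the thread that reports task metrics sees almost none of them.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.JavaConverters._

object ThreadLocalStatsDemo {
  // Hadoop 2.x keeps FileSystem read statistics in thread-local counters;
  // this sums the bytes read as seen by the *calling* thread only.
  def bytesReadOnThisThread(): Long =
    FileSystem.getAllStatistics.asScala
      .map(_.getThreadStatistics.getBytesRead)
      .sum

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val path = new Path("/tmp/yes.txt") // the file from the reproduce steps
    val before = bytesReadOnThisThread()

    // Drain the file on a separate thread, the way PythonRDD's writer
    // thread consumes the parent HadoopRDD's iterator.
    val reader = new Thread(new Runnable {
      override def run(): Unit = {
        val in = fs.open(path)
        val buf = new Array[Byte](64 * 1024)
        while (in.read(buf) != -1) {}
        in.close()
      }
    })
    reader.start()
    reader.join()

    // The bytes were charged to the reader thread's counters, so the main
    // thread's delta stays near zero -- this is what the UI ends up showing.
    println(s"bytes-read delta on main thread: ${bytesReadOnThisThread() - before}")
  }
}
{code}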
This is a general problem whenever a child RDD's computation creates another
thread to consume the parent RDD's (HadoopRDD's) iterator. I tried several
different ways to handle it, but each still has some small issues; the
multi-threaded processing inside the RDD makes the fix quite complex.
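For reference, a simplified model of the callback pattern involved (Spark 2.x wires up something along these lines via SparkHadoopUtil; the function name below is hypothetical). The baseline is captured on the thread that creates the iterator, so the delta is only meaningful if that same thread performs the reads:
{code}
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// Hypothetical sketch of a HadoopRDD-style bytes-read callback. If another
// thread drains the iterator, the reads are charged to that thread's
// counters and this delta stays near zero.
def makeBytesReadCallback(): () => Long = {
  def onThisThread(): Long =
    FileSystem.getAllStatistics.asScala
      .map(_.getThreadStatistics.getBytesRead).sum
  val baseline = onThisThread()        // captured on the task thread at setup
  () => onThisThread() - baseline      // ~0 when another thread did the reads
}
{code}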
> Incorrect input size in UI with pyspark
> -------------------------------------
>
> Key: SPARK-20244
> URL: https://issues.apache.org/jira/browse/SPARK-20244
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0, 2.1.0
> Reporter: Artur Sukhenko
>Priority: Minor
> Attachments: pyspark_incorrect_inputsize.png,
> sparkshell_correct_inputsize.png
>
>
> In the Spark UI (Details for Stage), the Input Size shows 64.0 KB when running in
> PySparkShell.
> It is also incorrect in the Tasks table (Input Size / Records):
> 64.0 KB / 132120575 in pyspark
> 252.0 MB / 132120575 in spark-shell
> I will attach screenshots.
> Steps to reproduce:
> Run this to generate a big file (press Ctrl+C after 5-6 seconds):
> $ yes > /tmp/yes.txt
> $ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> $ ./bin/pyspark
> {code}
> Python 2.7.5 (default, Nov 6 2016, 00:28:07)
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
> SparkSession available as 'spark'.
> >>> a = sc.textFile("/tmp/yes.txt")
> >>> a.count()
> {code}
> Open Spark UI and check Stage 0.
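> For comparison, the same count in spark-shell reports the expected ~252 MB, since the read stays on the task thread:
> {code}
> scala> val a = sc.textFile("/tmp/yes.txt")
> scala> a.count()
> {code}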