[ https://issues.apache.org/jira/browse/SPARK-20244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964422#comment-15964422 ]

Saisai Shao commented on SPARK-20244:
-------------------------------------

This is actually not a UI problem; it is a problem with the FileSystem 
thread-local statistics. PythonRDD creates a separate thread to read the data, 
so the bytes-read value obtained from another thread is wrong. There is no such 
problem with spark-shell, since everything is processed in a single thread.

This is a general problem whenever a child RDD's computation creates another 
thread to consume the parent RDD's (HadoopRDD's) iterator. I tried several 
different ways to handle this, but each still has some small issues; the 
multi-threaded processing inside the RDD makes the fix quite complex.

> Incorrect input size in UI with pyspark
> ---------------------------------------
>
>                 Key: SPARK-20244
>                 URL: https://issues.apache.org/jira/browse/SPARK-20244
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Artur Sukhenko
>            Priority: Minor
>         Attachments: pyspark_incorrect_inputsize.png, 
> sparkshell_correct_inputsize.png
>
>
> In the Spark UI (Details for Stage), the Input Size is shown as 64.0 KB when 
> running in PySparkShell.
> It is also incorrect in the Tasks table (Input Size / Records):
> 64.0 KB / 132120575 in pyspark
> 252.0 MB / 132120575 in spark-shell
> I will attach screenshots.
> Steps to reproduce:
> Run this to generate a big file (press Ctrl+C after 5-6 seconds):
> $ yes > /tmp/yes.txt
> $ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> $ ./bin/pyspark
> {code}
> Python 2.7.5 (default, Nov  6 2016, 00:28:07) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 2.7.5 (default, Nov  6 2016 00:28:07)
> SparkSession available as 'spark'.
> >>> a = sc.textFile("/tmp/yes.txt")
> >>> a.count()
> {code}
> Open the Spark UI and check Stage 0.


