[ https://issues.apache.org/jira/browse/SPARK-20244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964422#comment-15964422 ]
Saisai Shao commented on SPARK-20244:
-------------------------------------
This actually is not a UI problem; it is a FileSystem thread-local statistics problem. PythonRDD creates another thread to read the data, so the bytes-read value retrieved from a different thread is wrong. There is no such problem with spark-shell, since everything is processed in a single thread. This is a general problem whenever a child RDD's computation spawns another thread to consume the parent RDD's (HadoopRDD's) iterator. I tried several different ways to handle this problem, but each still has some small issues; the multi-threaded processing inside the RDD makes the fix quite complex.

> Incorrect input size in UI with pyspark
> ---------------------------------------
>
>                 Key: SPARK-20244
>                 URL: https://issues.apache.org/jira/browse/SPARK-20244
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Artur Sukhenko
>            Priority: Minor
>        Attachments: pyspark_incorrect_inputsize.png, sparkshell_correct_inputsize.png
>
> In Spark UI (Details for Stage) Input Size is 64.0 KB when running in PySparkShell.
> Also it is incorrect in the Tasks table:
> 64.0 KB / 132120575 in pyspark
> 252.0 MB / 132120575 in spark-shell
> I will attach screenshots.
> Reproduce steps:
> Run this to generate a big file (press Ctrl+C after 5-6 seconds)
> $ yes > /tmp/yes.txt
> $ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> $ ./bin/pyspark
> {code}
> Python 2.7.5 (default, Nov  6 2016, 00:28:07)
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
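The thread-local statistics pitfall the comment describes can be sketched in plain Python. This is a hypothetical illustration, not Spark's or Hadoop's actual code: `local_stats`, `record_read`, and `read_all` are made-up names standing in for Hadoop's per-thread FileSystem Statistics and the RDD iterator consumer.

```python
import threading

# Hypothetical per-thread byte counter, mimicking Hadoop's
# thread-local FileSystem Statistics (illustration only).
local_stats = threading.local()

def record_read(nbytes):
    # Each thread accumulates into its own thread-local counter.
    local_stats.bytes_read = getattr(local_stats, "bytes_read", 0) + nbytes

def read_all(chunks):
    # Stands in for the iterator consumer: records every chunk it reads.
    for chunk in chunks:
        record_read(len(chunk))

data = [b"x" * 1024] * 10  # 10 KB of input

# spark-shell style: reading happens in the task thread,
# so the task thread's counter sees all the bytes.
local_stats.bytes_read = 0
read_all(data)
same_thread_total = local_stats.bytes_read  # 10240

# PythonRDD style: a separate thread consumes the iterator, so the
# bytes accumulate in *that* thread's local storage and the task
# thread's counter stays at zero.
local_stats.bytes_read = 0
t = threading.Thread(target=read_all, args=(data,))
t.start()
t.join()
other_thread_total = local_stats.bytes_read  # still 0

print(same_thread_total, other_thread_total)  # 10240 0
```

The task thread reports its own counter to the UI, which is why the spark-shell run shows the real input size while the pyspark run shows only a small residual amount.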
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
>
> Using Python version 2.7.5 (default, Nov  6 2016 00:28:07)
> SparkSession available as 'spark'.{code}
> >>> a = sc.textFile("/tmp/yes.txt")
> >>> a.count()
> Open Spark UI and check Stage 0.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org