[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212019#comment-14212019 ]
Davies Liu commented on SPARK-4395:
-----------------------------------

[~marmbrus] After removing the cache(), this script finished quickly, so I think there is something wrong with the caching. It also hung when I moved the .cache() before parsing. While hanging, most of the CPU time was spent in the JVM; the Python process (only one) was idle.

> Running a Spark SQL SELECT command from PySpark causes a hang for ~1 hour
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4395
>                 URL: https://issues.apache.org/jira/browse/SPARK-4395
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>        Environment: version 1.2.0-SNAPSHOT
>            Reporter: Sameer Farooqui
>
> When I run this command, it hangs for one to many hours and then finally
> returns with successful results:
>
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
>
> Note, the lab environment below is still active, so let me know if you'd like
> to just access it directly.
>
> +++ My Environment +++
> - 1-node cluster in Amazon
> - RedHat 6.5 64-bit
> - java version "1.7.0_67"
> - SBT version: sbt-0.13.5
> - Scala version: scala-2.11.2
>
> Ran:
> sudo yum -y update
> git clone https://github.com/apache/spark
> sudo sbt assembly
>
> +++ Data file used +++
> http://blueplastic.com/databricks/movielens/ratings.dat
>
> +++ Code ran +++
> >>> import re
> >>> import string
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)'
> >>>
> >>> def parse_ratings_line(line):
> ...     match = re.search(RATINGS_PATTERN, line)
> ...     if match is None:
> ...         # Optionally, just ignore the line instead, if each line of data
> ...         # is not critical.
> ...         raise ValueError("Invalid logline: %s" % line)
> ...     return Row(
> ...         UserID = int(match.group(1)),
> ...         MovieID = int(match.group(2)),
> ...         Rating = int(match.group(3)),
> ...         Timestamp = int(match.group(4)))
> ...
> >>> ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
> ...     # Call the parse_ratings_line function on each line.
> ...     .map(parse_ratings_line)
> ...     # Cache the objects in memory since they will be queried multiple times.
> ...     .cache())
> >>> ratings_base_RDD.count()
> 1000209
> >>> ratings_base_RDD.first()
> Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
> >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
> >>> schemaRatings.registerTempTable("RatingsTable")
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> (Now the Python shell hangs...)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
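[Editor's note] The parsing step in the reproduction above can be checked in isolation, without Spark. This is a minimal sketch of that logic: the undefined `Error`/`logline` names from the report are replaced with `ValueError`/`line`, and `pyspark.sql.Row` is swapped for a plain dict so it runs standalone; it is not part of the original report.

```python
import re

# Same pattern as in the bug report: four ::-delimited integer fields.
RATINGS_PATTERN = r'^(\d+)::(\d+)::(\d+)::(\d+)'

def parse_ratings_line(line):
    """Parse one MovieLens ratings.dat line into a dict of int fields."""
    match = re.search(RATINGS_PATTERN, line)
    if match is None:
        # The report raised an undefined `Error` on an undefined `logline`;
        # raising ValueError on `line` is the evident intent.
        raise ValueError("Invalid logline: %s" % line)
    return {
        'UserID': int(match.group(1)),
        'MovieID': int(match.group(2)),
        'Rating': int(match.group(3)),
        'Timestamp': int(match.group(4)),
    }

# Matches the first row shown in the report.
row = parse_ratings_line("1::1193::5::978300760")
```

With Spark available, the same function can be passed to `sc.textFile(...).map(parse_ratings_line)` as in the report, returning `Row` objects instead of dicts.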