[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212019#comment-14212019 ]
Davies Liu commented on SPARK-4395:
-----------------------------------

[~marmbrus] After removing the cache(), this script finished quickly, so I think there is something wrong with the caching. It also hung when I moved the .cache() before parsing. While hanging, most of the CPU time was spent in the JVM; the Python process (only one) was idle.

> Running a Spark SQL SELECT command from PySpark causes a hang for ~1 hour
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4395
>                 URL: https://issues.apache.org/jira/browse/SPARK-4395
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>        Environment: version 1.2.0-SNAPSHOT
>            Reporter: Sameer Farooqui
>
> When I run this command, it hangs for one to many hours and then finally
> returns with successful results:
>
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
>
> Note, the lab environment below is still active, so let me know if you'd like
> to just access it directly.
>
> +++ My Environment +++
> - 1-node cluster in Amazon
> - RedHat 6.5 64-bit
> - java version "1.7.0_67"
> - SBT version: sbt-0.13.5
> - Scala version: scala-2.11.2
>
> Ran:
> sudo yum -y update
> git clone https://github.com/apache/spark
> sudo sbt assembly
>
> +++ Data file used +++
> http://blueplastic.com/databricks/movielens/ratings.dat
>
> +++ Code ran +++
> >>> import re
> >>> import string
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)'
> >>>
> >>> def parse_ratings_line(line):
> ...     match = re.search(RATINGS_PATTERN, line)
> ...     if match is None:
> ...         # Optionally, just ignore the line instead, if each line of data
> ...         # is not critical.
> ...         raise ValueError("Invalid logline: %s" % line)
> ...     return Row(
> ...         UserID = int(match.group(1)),
> ...         MovieID = int(match.group(2)),
> ...         Rating = int(match.group(3)),
> ...         Timestamp = int(match.group(4)))
> ...
> >>> ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
> ...     # Call the parse_ratings_line function on each line.
> ...     .map(parse_ratings_line)
> ...     # Cache the objects in memory since they will be queried multiple times.
> ...     .cache())
> >>> ratings_base_RDD.count()
> 1000209
> >>> ratings_base_RDD.first()
> Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
> >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
> >>> schemaRatings.registerTempTable("RatingsTable")
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> (Now the Python shell hangs...)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
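[Editor's note] The parsing step in the reproduction above can be checked in isolation, without Spark. This is a minimal sketch of that logic: the undefined `Error`/`logline` names from the report are replaced with `ValueError`/`line`, and `pyspark.sql.Row` is swapped for a plain dict so it runs standalone; it is not part of the original report.

```python
import re

# Same pattern as in the bug report: four ::-delimited integer fields.
RATINGS_PATTERN = r'^(\d+)::(\d+)::(\d+)::(\d+)'

def parse_ratings_line(line):
    """Parse one MovieLens ratings.dat line into a dict of int fields."""
    match = re.search(RATINGS_PATTERN, line)
    if match is None:
        # The report raised an undefined `Error` on an undefined `logline`;
        # raising ValueError on `line` is the evident intent.
        raise ValueError("Invalid logline: %s" % line)
    return {
        'UserID': int(match.group(1)),
        'MovieID': int(match.group(2)),
        'Rating': int(match.group(3)),
        'Timestamp': int(match.group(4)),
    }

# Matches the first row shown in the report.
row = parse_ratings_line("1::1193::5::978300760")
```

With Spark available, the same function can be passed to `sc.textFile(...).map(parse_ratings_line)` as in the report, returning `Row` objects instead of dicts.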