[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour

Sameer Farooqui (JIRA) Mon, 24 Nov 2014 16:45:00 -0800

    [ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223851#comment-14223851 ]


Sameer Farooqui commented on SPARK-4395:
----------------------------------------

Hi Davies and Michael,

I can confirm that this works if I move the .cache() to AFTER the inferSchema 
as Davies suggested. But if the cache is first, then the hang occurs.

The workaround is fine for me for now, although other people could still run 
into this if they're not aware of this JIRA.
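
For anyone landing here later, this is roughly the ordering that works for me, 
as a minimal sketch rather than a verified fix. It assumes the Spark 1.2-era 
PySpark API from the report below (inferSchema, registerTempTable) and that 
"cache after inferSchema" means calling .cache() on the SchemaRDD that 
inferSchema returns. build_ratings_table is a hypothetical helper name, and 
the parser returns a plain dict only so it can be checked without a cluster.

```python
import re

RATINGS_PATTERN = re.compile(r'^(\d+)::(\d+)::(\d+)::(\d+)')


def parse_ratings_line(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line into a dict of ints."""
    match = RATINGS_PATTERN.search(line)
    if match is None:
        raise ValueError("Invalid ratings line: %s" % line)
    user_id, movie_id, rating, timestamp = (int(g) for g in match.groups())
    return {"UserID": user_id, "MovieID": movie_id,
            "Rating": rating, "Timestamp": timestamp}


def build_ratings_table(sc, sqlContext, path):
    """Register RatingsTable, caching only after inferSchema (hypothetical helper)."""
    # Imported lazily so the parser above can be used without PySpark installed.
    from pyspark.sql import Row

    rows = sc.textFile(path).map(lambda line: Row(**parse_ratings_line(line)))
    # No .cache() on the plain RDD here: caching before inferSchema is the
    # ordering that triggered the hang in this report.
    schema_ratings = sqlContext.inferSchema(rows)
    schema_ratings.cache()  # assumption: cache the SchemaRDD instead
    schema_ratings.registerTempTable("RatingsTable")
    return schema_ratings
```

In the shell I would call build_ratings_table(sc, sqlContext, 
"file:///home/ec2-user/movielens/ratings.dat") and then run the same 
sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() as before.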

> Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4395
>                 URL: https://issues.apache.org/jira/browse/SPARK-4395
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: version 1.2.0-SNAPSHOT
>            Reporter: Sameer Farooqui
>
> When I run this command, it hangs for one or more hours and then finally 
> returns with successful results:
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> Note, the lab environment below is still active, so let me know if you'd like 
> to just access it directly.
> +++ My Environment +++
> - 1-node cluster in Amazon
> - RedHat 6.5 64-bit
> - java version "1.7.0_67"
> - SBT version: sbt-0.13.5
> - Scala version: scala-2.11.2
> Ran: 
> sudo yum -y update
> git clone https://github.com/apache/spark
> sudo sbt assembly
> +++ Data file used +++
> http://blueplastic.com/databricks/movielens/ratings.dat
> {code}
> >>> import re
> >>> import string
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> RATINGS_PATTERN = r'^(\d+)::(\d+)::(\d+)::(\d+)'
> >>>
> >>> def parse_ratings_line(line):
> ...     match = re.search(RATINGS_PATTERN, line)
> ...     if match is None:
> ...         # Optionally, change this to skip the line if individual records are not critical.
> ...         raise ValueError("Invalid ratings line: %s" % line)
> ...     return Row(
> ...         UserID    = int(match.group(1)),
> ...         MovieID   = int(match.group(2)),
> ...         Rating    = int(match.group(3)),
> ...         Timestamp = int(match.group(4)))
> ...
> >>> ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
> ...                # Call the parse_ratings_line function on each line.
> ...                .map(parse_ratings_line)
> ...                # Cache the objects in memory since they will be queried multiple times.
> ...                .cache())
> >>> ratings_base_RDD.count()
> 1000209
> >>> ratings_base_RDD.first()
> Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
> >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
> >>> schemaRatings.registerTempTable("RatingsTable")
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> {code}
> (Now the Python shell hangs...)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
