[ https://issues.apache.org/jira/browse/SPARK-4778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272587#comment-14272587 ]
Nicholas Chammas commented on SPARK-4778:
-----------------------------------------

Also, though the environment is on EC2, this does not appear to be an issue with the EC2 scripts, so I'm going to remove that component for now.

> PySpark Json and groupByKey broken
> ----------------------------------
>
>                 Key: SPARK-4778
>                 URL: https://issues.apache.org/jira/browse/SPARK-4778
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.1.1
>         Environment: EC2 cluster launched from the ec2 script
>                      PySpark
>                      c3.2xlarge, 6 nodes
>                      Hadoop major version 1
>            Reporter: Brad Willard
>
> When I run a groupByKey, it seems to create a single task after the
> groupByKey that never stops executing. I'm loading a smallish JSON dataset
> of 4 million records. This is the code I'm running:
>
>     rdd = sql_context.jsonFile(hdfs_uri)
>     rdd = rdd.cache()
>     grouped = rdd.map(lambda row: (row.id, row)).groupByKey(160)
>     grouped.take(1)
>
> The groupByKey stage takes a few minutes, which I'd expect. However, the
> take operation never completes; it just hangs indefinitely.
> This is what it looks like in the UI: http://cl.ly/image/2k1t3I253T0x
> The only workaround I have at the moment is to run a map operation after
> loading from JSON to convert all the Row objects to Python dictionaries;
> then things work, although the map operation is expensive.
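For reference, a minimal sketch of the workaround Brad describes (converting the Row objects to plain Python dicts before grouping), written against the Spark 1.1-era PySpark API shown in the report. The asDict() call on Row is an assumption here; if it is not available on this build, the dict would need to be constructed from the record's known fields instead.

    # Sketch of the reported workaround, not a confirmed fix.
    # Assumes: sql_context is a SQLContext, hdfs_uri points at the
    # JSON data, and each record has an 'id' field.
    rdd = sql_context.jsonFile(hdfs_uri)

    # Convert Row objects to plain dicts before the shuffle.
    # Row.asDict() is assumed available; otherwise use something like
    # lambda row: {'id': row.id, ...} over the known fields.
    dicts = rdd.map(lambda row: row.asDict()).cache()

    grouped = dicts.map(lambda d: (d['id'], d)).groupByKey(160)
    grouped.take(1)  # per the report, this completes once Rows are gone

That the hang disappears once Row objects are replaced by plain dicts suggests the problem involves serializing Rows through the groupByKey shuffle, though the report does not establish a root cause.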