I've run into an error when trying to create a DataFrame. Here's the code:

--
from pyspark import StorageLevel
from pyspark.sql import HiveContext, Row

table = 'blah'
ssc = HiveContext(sc)

data = sc.textFile('s3://bucket/some.tsv')

# Parse one tab-separated line into a Row; the last field is the numeric score.
def deserialize(s):
    p = s.strip().split('\t')
    p[-1] = float(p[-1])
    return Row(normalized_page_sha1=p[0], name=p[1], phrase=p[2],
               created_at=p[3], layer_id=p[4], score=p[5])

blah = data.map(deserialize)
df = sqlContext.inferSchema(blah)
---

I've also tried the s3n:// scheme and using createDataFrame instead of inferSchema.

We're running on EMR instances, set up with the script Amazon provides. After a lot of debugging, I suspect the problem is with this setup. What's weird is that if I run this in the pyspark shell and then re-run the last line (inferSchema/createDataFrame), it actually works.

We're getting warnings like this: http://pastebin.ca/3016476

Here's the actual error: http://www.pastebin.ca/3016473

Any help would be greatly appreciated.

Thanks,
Ignacio