Hi Andy, The SQL parser is pretty basic (we plan to improve this for the 1.2 release). In this case I think part of the problem is that one of your variables is "count", which is a reserved word. Unfortunately, we don't have the ability to escape identifiers at this point.
However, I did manage to get your query to parse using the HiveQL parser, provided by HiveContext. hiveCtx.hql(""" select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count) from Frequency freq JOIN (select term, docid, count from Frequency) freqTranspose where freq.term = freqTranspose.term group by freq.docid, freqTranspose.docid""") Michael On Sun, Jul 13, 2014 at 12:43 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > Hi > > I am running into trouble with a nested query using python. To try and > debug it, I first wrote the query I want using sqlite3 > > select freq.docid, freqTranspose.docid, sum(freq.count * > freqTranspose.count) from > Frequency as freq, > (select term, docid, count from Frequency) as freqTranspose > where freq.term = freqTranspose.term > group by freq.docid, freqTranspose.docid > ; > > > Sparksql has trouble parsing the "(select ) as freqTranspose “ line > > Here is what my input data looks like > $ head -n 3 reuters.db.csv > docid,term,count > 10000_txt_earn,net,1 > 10000_txt_earn,rogers,4 > > The output from sqlite3 is > $ head -n 6 3hSimilarityMatrix.slow.sql.out > freq.docid freqTranspose.docid sum(freq.count * freqTranspose.count) > -------------- ------------------- ------------------------------------- > 10000_txt_earn 10000_txt_earn 127 > 10000_txt_earn 10054_txt_earn 33 > 10000_txt_earn 10080_txt_crude 146 > 10000_txt_earn 10088_txt_acq 11 > $ > > > My code example pretty much follows > http://spark.apache.org/docs/latest/sql-programming-guide.html > > dataFile = sc.textFile("reuters.db.csv”) > lines = dataFile.map(lambda l: l.split(",”)) > def mapLines(line) : > ret = {} > ret['docid'] = line[0] > ret['term'] = line[1] > ret['count'] = line[2] > return ret > frequency = lines.map(mapLines) > schemaFrequency = sqlContext.inferSchema(frequency) > schemaFrequency.registerAsTable("frequency”) > > Okay here is where I run into trouble > > sqlCmd = "select \ > freq.docid, \ > freqTranspose.docid \ > from \ > frequency as freq, \ > (select term, docid, count from frequency) \ > " > similarities = sqlContext.sql(sqlCmd) > > > /Users/andy/workSpace/dataBricksIntroToApacheSpark/USBStick/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) 298 > raise Py4JJavaError( 299 'An error occurred > while calling {0}{1}{2}.\n'.--> 300 format(target_id, > '.', name), value) 301 else: 302 raise > Py4JError( > Py4JJavaError: An error occurred while calling o40.sql. > : java.lang.RuntimeException: [1.153] failure: ``('' expected but `from' found > > select freq.docid, freqTranspose.docid from > frequency as freq, (select term, docid, count > from frequency) > > > > Simple sql seems to “parse” I.e. Select freq.docid from frequency as freq > > > Any suggestions would be greatly appreciated. > > > Andy > > > P.s. I should note, I think I am using version 1.0 ? > > > >